
Co-occurrence Matrices and their Applications in Information Science: Extending ACA to the Web Environment

Journal of the American Society for Information Science and Technology (JASIST)


 Loet Leydesdorff [1] and Liwen Vaughan [2]


Abstract

Co-occurrence matrices, such as co-citation, co-word, and co-link matrices, have been used widely in the information sciences. However, confusion and controversy have hindered the proper statistical analysis of this data. The underlying problem, in our opinion, involved understanding the nature of various types of matrices. This paper discusses the difference between a symmetrical co-citation matrix and an asymmetrical citation matrix as well as the appropriate statistical techniques that can be applied to each of these matrices, respectively. Similarity measures (like the Pearson correlation coefficient or the cosine) should not be applied to the symmetrical co-citation matrix, but can be applied to the asymmetrical citation matrix to derive the proximity matrix. The argument is illustrated with examples. The study then extends the application of co-occurrence matrices to the Web environment where the nature of the available data and thus data collection methods are different from those of traditional databases such as the Science Citation Index. A set of data collected with the Google Scholar search engine is analyzed using both the traditional methods of multivariate analysis and the new visualization software Pajek that is based on social network analysis and graph theory.

 


1. Introduction

 

Co-occurrence matrices, such as co-citation, co-word, and co-link matrices, provide us with useful data for mapping and understanding the structures in the underlying document sets. Various types of analysis have been carried out on this data and a significant body of literature has been built up, making it an important area of information science (e.g., White & McCain, 1998). However, confusion persists about the nature of these matrices and the kinds of analysis that are appropriate. For example, the debate between Ahlgren, Jarneving, & Rousseau (2003, 2004a, 2004b), White (2003, 2004), and Bensman (2004) on the use of the Pearson correlation coefficient or the cosine in the case of author co-citation analysis (ACA) shows some of these problems. In our opinion, co-occurrence matrices like the ones used in ACA are proximity data which do not require conversion before mapping. We shall argue that it is advisable to use, if possible, the asymmetrical matrices of documents versus attributes from which the co-occurrence matrices can be derived for mapping purposes.

 

This problem of how to process and understand co-occurrence matrices has entered a new dimension because co-occurrence matrices can be used extensively in Internet research. In this environment, one often can no longer retrieve the entire document set that is needed to construct the co-occurrence matrix, but one can construct these matrices directly, for example, by searching in a domain with Boolean ANDs. We will discuss the nature of the various matrices and the issues surrounding their analysis in the hope of clarifying some confusion and thus contributing to the further development of this area of information science. Our argument is methodological, but we shall illustrate the argument by using the example of an author co-citation analysis (ACA) in information science based on the ISI database which has been the subject of a previous debate in this journal (Ahlgren et al., 2003, 2004a, 2004b; White, 2003, 2004; Bensman, 2004; Leydesdorff, 2005). In the next step, we extend the data collection and analysis to the Web environment by using the Google Scholar search engine for the same set of scholars.

 

2. Symmetrical Co-citation Matrix vs. Asymmetrical Citation Matrix

 

2.1      The symmetrical co-citation matrix

 

Small (1973) pioneered co-citation analysis (cf. Marshakova, 1973). He constructed co-citation matrices in the form shown in Figure 1. The number in each cell of the matrix is the number of times two papers are co-cited. For example, Paper 1 and Paper 2 are co-cited 10 times while Paper 1 and Paper 3 are co-cited 20 times. At that time (early 1970s) Small had to use the ISI data as lists instead of matrices because of computational constraints. Using single linkage clustering Small could extract co-citation maps from this data without generating the matrices (Leydesdorff, 1987).

 

 

             Paper 1   Paper 2   Paper 3   Paper 4
Paper 1                     10        20        25
Paper 2           10                  30        15
Paper 3           20        30                  12
Paper 4           25        15        12

Figure 1: Co-citation matrix (symmetrical matrix)

 

White & Griffith (1981) extended the co-citation analysis concept to author co-citation analysis (ACA), making significant contributions to the development of the field. They used first authors, rather than papers, as the units of analysis: in contrast to document co-citation analysis, the cited authors, not the cited documents, were the units. Their matrix is therefore essentially the same as the one shown in Figure 1, except that Paper 1, Paper 2, etc. are replaced with Author 1, Author 2, etc. Both Small (1973) and White & Griffith (1981) used multidimensional scaling (MDS) and cluster analysis to analyze their data. White & McCain (1998) also used factor analysis. The difference is that Small normalized the original data (the raw co-citation data collected from the ISI databases) with the Jaccard Index while White & Griffith used the Pearson correlation coefficient for this purpose. Small & Sweeney (1985) began to use the cosine as an alternative similarity measure (Salton & McGill, 1983).

 

A matrix in the form of Figure 1 is a proximity matrix. As Kruskal (1978, p. 7) formulated: “A proximity is a number which indicates how similar or how different two objects are, or are perceived to be, or any measure of this kind.” Proximity matrices can be either similarity or dissimilarity matrices (Cox & Cox, 2001, p. 9). Co-citation or co-author matrices are similarity (not dissimilarity) matrices. The higher the number in the cell, the more similar two papers (or two authors) appear to be. A proximity matrix can be input into multidimensional scaling software directly to generate a map which shows the relative positions of the papers or authors. The mapping principle is that the higher the proximity (the more similar two units are), the closer the two papers or authors will be located in the map.

 

2.2      The asymmetrical citation matrix

 

An alternative way of using citation data is to construct a matrix in the form shown in Figure 2. We will show an example of using this matrix for author co-citation analysis later. In this matrix, the rows are the citing papers and the columns represent cited papers. So Paper A is cited by Paper 1, 4, and 5 while C is cited by Paper 2 and 3.

 

 

                  Cited Paper A  Cited Paper B  Cited Paper C  Cited Paper D
Citing Paper 1                1              1              0              1
Citing Paper 2                0              0              1              1
Citing Paper 3                0              0              1              1
Citing Paper 4                1              1              0              0
Citing Paper 5                1              1              0              1

Figure 2: Citation Matrix (asymmetrical matrix)

 

This matrix is very different from that shown in Figure 1. The matrix in Figure 1 is a symmetrical matrix in that (1) rows and columns are the same objects; (2) the number of rows is the same as the number of columns; and (3) data in the matrix is symmetrical about the diagonal, so that only half of the matrix is enough to contain all the data. The matrix in Figure 2 has none of these three features and is therefore asymmetrical. Furthermore, data in the Figure 2 matrix are not proximity measures, so this matrix cannot be input directly into MDS. However, one can convert this attribute matrix into a proximity matrix. “A very common way to get proximities from data that are not proximities and hence inappropriate for MDS in their original form is to compute some measure of profile similarity or dissimilarity between rows (or columns) of a table. [...] The most common ways to derive a profile proximity measure are to compute correlations between variables or squared (Euclidean) distances between the stimuli” (Kruskal, 1978, p. 10). The Euclidean distance matrix can be considered as a dissimilarity matrix, while the Pearson correlation matrix can be considered as a similarity matrix. However, Ahlgren et al. (2003) argued that Pearson’s correlation coefficient is formally not a similarity measure, but a measure of linear dependence. (See the discussion on similarity vs. dissimilarity matrices in the next section.)

 

We focus here on the Pearson correlation coefficient, but a similar reasoning could be applied to the cosine as a similarity measure, or to Euclidean distances as dissimilarity measures (Ahlgren et al., 2003, at p. 551). The problem of the potentially negative values of Pearson’s r as a proximity measure can be overcome by the linear transformation (r + 1) / 2, which results in values between 0 and 1. By applying the Pearson correlation to the Figure 2 data (pairwise correlation of the columns) and then using the conversion (r + 1) / 2, one obtains the proximity matrix shown in Figure 3. This proximity matrix has all three symmetry features of Figure 1. Looking at the data in Figure 2, we see that Paper A and Paper B are cited identically within this set, namely by Papers 1, 4, and 5. The coefficient of 1 in Figure 3 reflects this fact. In contrast, Paper A and Paper C are cited completely dissimilarly in this set: each is cited by exactly the papers that do not cite the other, as shown by the coefficient of 0 in Figure 3.

 

 

             Paper A   Paper B   Paper C   Paper D
Paper A            1         1         0     0.295
Paper B            1         1         0     0.295
Paper C            0         0         1     0.705
Paper D        0.295     0.295     0.705         1


Figure 3: Proximity matrix derived from data in Figure 2

 

In the case of the asymmetrical matrix (Figure 2), the cited papers are considered as attributes of the citing papers because they are contained in the reference lists of the latter. Paper A shares two out of three of its citing papers with Paper D so their coefficient is a number between 0 and 1, i.e. 0.295.
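The conversion from Figure 2 to Figure 3 can be retraced in a few lines. The following is a minimal sketch in Python with NumPy, which is our choice for illustration and not the software used in this study; the variable names are hypothetical. It correlates the columns of the citation matrix and rescales r to the range between 0 and 1. The values it prints should agree with Figure 3 to within about 0.001, the difference being a matter of rounding.

```python
import numpy as np

# Figure 2: rows are citing papers 1-5, columns are cited papers A-D.
citation = np.array([
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
])

# Pearson correlations between the columns, i.e. between the cited papers.
r = np.corrcoef(citation, rowvar=False)

# Linear rescaling of r from [-1, 1] to [0, 1], as described in the text.
proximity = (r + 1) / 2

print(np.round(proximity, 3))
```

Setting rowvar=False matters here: without it, NumPy would correlate the rows, that is, the citing papers instead of the cited ones.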

 

In summary, the Pearson correlation coefficient can be used in co-citation analysis when it is applied to an asymmetrical citation matrix. However, applying Pearson’s r to a symmetrical proximity matrix is problematic. White (2003, p. 1251) noted that on the first page of Davison’s (1983) textbook on multidimensional scaling, the correlation coefficient is mentioned as one of the two basic proximity measures in MDS. However, in this book, Pearson correlation coefficients were always used to construct proximity matrices from data that were not already proximity measures. A co-citation matrix is a proximity matrix, so there is no need to apply a similarity measure to construct a proximity matrix. On the contrary, doing so may distort the data, as we will now show with an empirical example.

 

2.3      An example

 

The data for our example (see Table 1) were copied from SPSS (1993). This table provides flying mileages among ten American cities. The map from which these distances are generated is two-dimensional and thus one can evaluate straightforwardly the quality of the reconstruction of the geographical map using this data.

 

Table 1. Flying mileages between 10 American Cities

 

                     Atlanta       Chicago        Denver       Houston   Los Angeles         Miami      New York San Francisco       Seattle Washington DC
Atlanta                    0             .             .             .             .             .             .             .             .             .
Chicago                  587             0             .             .             .             .             .             .             .             .
Denver                  1212           920             0             .             .             .             .             .             .             .
Houston                  701           940           879             0             .             .             .             .             .             .
Los Angeles             1936          1745           831          1374             0             .             .             .             .             .
Miami                    604          1188          1726           968          2339             0             .             .             .             .
New York                 748           713          1631          1420          2451          1092             0             .             .             .
San Francisco           2139          1858           949          1645           347          2594          2571             0             .             .
Seattle                 2182          1737          1021          1891           959          2734          2408           678             0             .
Washington DC            543           597          1494          1220          2300           923           205          2442          2329             0

 

Obviously, this is a symmetrical proximity matrix. The data measures dissimilarity: the larger the numbers, the further apart the cities are, i.e. the more “dissimilar” they are in location. By inputting this matrix into SPSS and choosing PROXSCAL as an option of MDS, we obtain Figure 4, which is an almost perfect mapping of the relative positions of these cities. The positions are relative and the map is reversed in terms of west and east; because the positions are only relative, the results of MDS can be rotated freely for interpretation.

 

Figure 4: MDS mapping (PROXSCAL) of ten American cities using the original distance matrix (normalized raw stress = 0.0001)

 

After applying Pearson’s r to the data of Table 1 and then mapping this new matrix with MDS, we obtain a distorted map of the ten cities; the normalized raw stress of this picture is very high (0.11341).

 

Figure 5: MDS mapping of ten American cities using the Pearson correlation matrix of the distances (normalized raw stress = 0.11341)

 

Apparently, Figure 5 does not improve on Figure 4 (the stress has become very high). By using the Pearson correlations instead of the distances, the representation is distorted. For example, Los Angeles is positioned closer to Seattle than San Francisco while New York is closer to Chicago than to Washington, D.C. The Pearson correlation normalizes the data with reference to the mean, and the pattern of co-occurrences as variables, as indicated by the Pearson correlation, is in some cases different from the proximities in the network.
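The comparison between Figures 4 and 5 can be approximated outside SPSS. The sketch below is an illustration under several assumptions and not a reproduction of our analysis: it uses classical (Torgerson) MDS in Python with NumPy instead of PROXSCAL, it fills Table 1 out to a full symmetrical matrix before computing Pearson's r, it turns the correlations into dissimilarities as 1 - r, and it measures the fit as the correlation between map distances and flying mileages rather than as normalized raw stress. All function names are hypothetical.

```python
import numpy as np

CITIES = ["Atlanta", "Chicago", "Denver", "Houston", "Los Angeles",
          "Miami", "New York", "San Francisco", "Seattle", "Washington DC"]

# Lower triangle of Table 1 (flying mileages), rows in the order of CITIES.
LOWER = [
    [0],
    [587, 0],
    [1212, 920, 0],
    [701, 940, 879, 0],
    [1936, 1745, 831, 1374, 0],
    [604, 1188, 1726, 968, 2339, 0],
    [748, 713, 1631, 1420, 2451, 1092, 0],
    [2139, 1858, 949, 1645, 347, 2594, 2571, 0],
    [2182, 1737, 1021, 1891, 959, 2734, 2408, 678, 0],
    [543, 597, 1494, 1220, 2300, 923, 205, 2442, 2329, 0],
]


def full_symmetric(lower):
    """Expand a lower-triangular list of rows into a full symmetric matrix."""
    n = len(lower)
    d = np.zeros((n, n))
    for i, row in enumerate(lower):
        for j, value in enumerate(row):
            d[i, j] = d[j, i] = value
    return d


def classical_mds(dissim, k=2):
    """Classical (Torgerson) MDS: coordinates in k dimensions."""
    n = dissim.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (dissim ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    top = np.argsort(eigvals)[::-1][:k]
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0, None))


def map_fit(coords, reference):
    """Correlation between distances on the map and reference distances."""
    iu = np.triu_indices(reference.shape[0], k=1)
    diff = coords[:, None, :] - coords[None, :, :]
    mapped = np.sqrt((diff ** 2).sum(axis=-1))
    return np.corrcoef(mapped[iu], reference[iu])[0, 1]


distances = full_symmetric(LOWER)

# Map 1: the distances are used directly as dissimilarities.
coords_direct = classical_mds(distances)
fit_direct = map_fit(coords_direct, distances)

# Map 2: Pearson's r between the columns, turned into a dissimilarity.
r = np.corrcoef(distances, rowvar=False)
fit_pearson = map_fit(classical_mds(1.0 - r), distances)

for city, (x, y) in zip(CITIES, coords_direct):
    print(f"{city:15s} {x:9.1f} {y:9.1f}")
print("fit using distances directly:", round(fit_direct, 4))
print("fit using Pearson correlations:", round(fit_pearson, 4))
```

Because the algorithm and the fit measure differ from PROXSCAL, the numbers are not comparable to the stress values reported with Figures 4 and 5; only the direction of the comparison is of interest.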

 

Unlike this geographical data—which is two-dimensional and therefore can be mapped unambiguously—the intellectual structure as measured, for example, by using co-authorship or co-citation data is usually multi-dimensional. Multi-dimensional scaling (or factor analysis) searches for a projection of the n-dimensional data in a space with lower dimensionality. MDS uses the stress measure as an indicator for the fit, but this can only be considered as a heuristic. Eventually, the analyst also has to appreciate the representation of the represented structure on qualitative grounds. In other words, the multi-dimensional representation of intellectual structure in terms of co-authorship data can be very good, while this representation cannot easily be projected in a two- or three-dimensional visualization. Factor analysis allows us to study the quality of data reduction in more dimensions in precise numbers (algorithmically) and may thus be helpful in understanding the quality of the geometrical visualization as a projection.

 

3. Similarity vs. dissimilarity measures

 

As stated above, there are two kinds of proximity measures: similarity or dissimilarity. Obviously the two are opposite, so they should be treated differently in MDS. In recent versions of SPSS, there are two options for MDS: ALSCAL and PROXSCAL. ALSCAL assumes that the input is a dissimilarity matrix, while PROXSCAL allows one to specify whether the proximities are similarity or dissimilarity measures. There is no doubt that co-citation is a similarity measure (the more co-citations two papers or two authors have, the more similar they are), so the similarity option of PROXSCAL should be used. If one confuses the two types of proximity measures, the mapping results will be wrong. For example, the American city mileage data in Table 1 provides a dissimilarity measure and if we specify it as a similarity measure in MDS, the result is a very distorted map (the resulting map is omitted here due to space limitations).

 

In early versions of SPSS, only the ALSCAL option was available (the dissimilarity measure only). In this case, a co-citation matrix should be converted into a dissimilarity matrix before it is input into SPSS. Kruskal & Wish (1978, p. 77) clearly state that “If the proximities are similarities, they must be ‘turned upside down’ into dissimilarities, for example by forming dissimilarity = (constant – similarity) where the value of the constant is judiciously chosen.” If the similarity measure is between 0 and 1 (e.g. the above example of using Pearson’s r to obtain the proximity matrix of Figure 3), then the constant can be 1, i.e. dissimilarity = (1 – similarity). One of us conducted extensive testing of the formulae and found that the mapping results from using dissimilarity measures after the correct conversion from similarity to dissimilarity, and from using the similarity measures directly, are always the same.
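The conversion quoted from Kruskal & Wish amounts to a single subtraction. The sketch below (Python with NumPy, hypothetical names) applies it to the proximities of Figure 3 with a constant of 1, and to the co-citation counts of Figure 1. For the counts, the constant is set to the largest count plus one; this is only one possible choice, made here so that two different papers never receive a dissimilarity of zero.

```python
import numpy as np


def to_dissimilarity(similarity, constant):
    """Turn similarities 'upside down': dissimilarity = constant - similarity."""
    return constant - similarity


# Figure 3: similarities between 0 and 1, so the constant can be 1.
figure3 = np.array([
    [1.000, 1.000, 0.000, 0.295],
    [1.000, 1.000, 0.000, 0.295],
    [0.000, 0.000, 1.000, 0.705],
    [0.295, 0.295, 0.705, 1.000],
])
dissim_figure3 = to_dissimilarity(figure3, constant=1.0)

# Figure 1: raw co-citation counts; the empty diagonal is coded as 0 here.
figure1 = np.array([
    [0, 10, 20, 25],
    [10, 0, 30, 15],
    [20, 30, 0, 12],
    [25, 15, 12, 0],
])
off_diagonal = ~np.eye(4, dtype=bool)
constant = figure1[off_diagonal].max() + 1  # assumption: largest count + 1
dissim_figure1 = np.where(off_diagonal, to_dissimilarity(figure1, constant), 0)

print(dissim_figure3)
print(dissim_figure1)
```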

 

A widely used form of MDS is on asymmetrical attribute matrices as in Figure 2 above. MDS is then primarily a visualization technique within a class of multivariate instruments like factor analysis, cluster analysis, etc. In this case, the data is analyzed as dissimilar variables, and thus both ALSCAL and PROXSCAL can be used. Euclidean distances are the default measure of dissimilarity. For input data that are not proximity measures, PROXSCAL can construct the proximity matrix. Because we study both types of matrices in the various sections below, we will use PROXSCAL throughout this study. Note that a visualization such as MDS remains a representation of the data in two or three dimensions, while factor analysis, for example, adds the possibility of rotating the data in order to obtain a higher-dimensional and quantitative understanding of the structures underlying these geometrical representations (Schiffman et al., 1981).
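As an illustration of what constructing a proximity matrix from an attribute matrix involves, the following sketch computes Euclidean distances between the citation profiles of the four cited papers in Figure 2. It is written in Python with NumPy and is not meant to describe the internals of PROXSCAL or ALSCAL.

```python
import numpy as np

# Figure 2: rows are citing papers 1-5, columns are cited papers A-D.
citation = np.array([
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
])

# One profile (row) per cited paper: its pattern over the citing papers.
profiles = citation.T

# Pairwise Euclidean distances between the profiles (a dissimilarity matrix).
diff = profiles[:, None, :] - profiles[None, :, :]
euclidean = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(euclidean, 3))
```

Papers A and B have identical profiles and thus a distance of zero, while A and C differ on all five citing papers and are the furthest apart.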

 

4. An example of Author Co-citation Analysis (ACA)

 

Let us return to the example of an author co-citation analysis that was discussed previously in several contributions to this journal (Ahlgren et al., 2003; White, 2003; Bensman, 2004; Leydesdorff, 2005) and discuss in considerable detail the consequences of not using the symmetrical co-occurrence matrix but rather the asymmetrical matrix of documents versus references.

 

Ahlgren et al. (2003: 554) downloaded from the Web of Science 430 bibliographic descriptions of articles published in Scientometrics and 483 such descriptions published in the Journal of the American Society for Information Science and Technology (JASIST) in the period 1996-2000. From the 913 bibliographic references in these articles they composed a co-occurrence matrix for twelve authors in the field of information retrieval and twelve authors doing bibliometric-scientometric research. They provide both the co-occurrence matrix and the Pearson correlation table in their paper (at pp. 555 and 556, respectively).

 

We repeated the analysis in order to obtain the original (asymmetrical) data matrix. Using precisely the same searches we found 469 articles in Scientometrics and 494 in JASIST on 18 November 2004. The somewhat higher numbers are consistent with the practice of the ISI to reallocate papers sometimes at a later date to a previous year. Thus, we disregarded these differences.

 

4.1      Descriptive statistics

 

Of the (469 + 494 =) 963 documents thus retrieved, 902 contain 21,813 references. 279 records contain at least one co-citation to two or more authors of the list of 24 authors under study.

There are no citing records which contain a reference to only a single author in this set of 279 citing documents. Thus, this can with good reason be considered as a set of highly co-cited authors. Figure 6 shows that one citing paper even co-cited ten of the authors included in the analysis.

Figure 6: Distribution of 279 co-citations in terms of the number of authors co-cited in a single citing document

 

Figure 7 exhibits the total citations of these authors within the set of citing documents. Note that the scientometrics authors have on average a citation rate of 44.6 (± 14.8), while the information retrieval researchers have a lower average of 26.1 (± 6.5). Citation rates are field-specific, indeed.

Figure 7. Number of times each of the 24 authors is cited in the 279 citing documents
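Both descriptive statistics follow from the margins of the asymmetrical matrix: row sums give the number of listed authors cited in each citing document (the basis of Figure 6), and column sums give the number of citing documents per author (the basis of Figure 7). The sketch below uses the toy matrix of Figure 2 as a stand-in, since the 279 x 24 matrix itself is not reproduced here; it is written in Python with NumPy and the names are hypothetical.

```python
from collections import Counter

import numpy as np

# Toy stand-in (Figure 2): rows are citing documents, columns are cited units.
citation = np.array([
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
])

# Number of listed units cited in each citing document (row sums).
cited_per_document = citation.sum(axis=1)

# Distribution of citing documents over that number (compare Figure 6).
distribution = Counter(int(n) for n in cited_per_document)

# Number of citing documents per cited unit (column sums; compare Figure 7).
citations_per_unit = citation.sum(axis=0)

# Only documents citing at least two listed units contain a co-citation.
cocitation_set = citation[cited_per_document >= 2]

print(dict(distribution))
print(citations_per_unit, citations_per_unit.mean(), citations_per_unit.std(ddof=1))
```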

 

Let us now move on from these descriptive statistics to an analysis of the data.

 

4.2      Data analysis of the asymmetrical matrix

 

The data can be imported into SPSS and the asymmetrical matrix can then be subjected to various forms of multivariate analysis. For example, one can ask for a Pearson correlation matrix. Table 2 provides this matrix for our 24 authors. These Pearson correlations are very different from the ones provided for this set of authors by Ahlgren et al. (2003, at p. 556), since they applied Pearson’s r to the symmetrical co-citation matrix. For example, the correlation coefficient between the co-citation pattern of Van Raan and Schubert in the latters’ Table 9 is 0.74, while we found a negative correlation between their citation patterns (r = -0.131; p < 0.05). The Pearson correlations derived from the symmetrical co-citation matrix are all high and significant because this matrix is symmetrical, so all values and relations occur twice.
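That the two procedures need not agree can be checked on the small matrix of Figure 2. In the sketch below (Python with NumPy, hypothetical names), the co-citation matrix is derived from the citation matrix as its transpose multiplied by itself, and Pearson's r is then computed once on the asymmetrical matrix and once on the derived symmetrical one. Setting the diagonal of the co-citation matrix to zero is an assumption made for this illustration; other treatments of the diagonal would lead to other values.

```python
import numpy as np

# Figure 2: rows are citing papers 1-5, columns are cited papers A-D.
citation = np.array([
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
])

# Pearson's r between the citation patterns (asymmetrical matrix).
r_asymmetrical = np.corrcoef(citation, rowvar=False)

# Symmetrical co-citation matrix: cell (i, j) counts the citing papers
# that cite both i and j.
cocitation = citation.T @ citation
np.fill_diagonal(cocitation, 0)  # assumption: diagonal set to zero

# Pearson's r between the columns of the co-citation matrix.
r_symmetrical = np.corrcoef(cocitation, rowvar=False)

print("A vs B, asymmetrical:", round(r_asymmetrical[0, 1], 3))
print("A vs B, symmetrical: ", round(r_symmetrical[0, 1], 3))
```

In this toy case Papers A and B are cited by exactly the same papers, yet the correlation computed on the symmetrical matrix is negative. The direction of the discrepancy depends on the data and on the diagonal; the point is only that the two coefficients measure different things.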

 

Table 2. Pearson correlations among the 24 cited authors on the basis of 279 citing documents



Figure 8 shows the results of inputting the asymmetrical matrix into PROXSCAL for the MDS. The visualization suggests that the information retrieval researchers are more organized along a single (almost horizontal) axis than the scientometricians along a vertical one. Factor analysis of the matrix confirms this observation and makes it possible to inform this picture with a quantitative interpretation.

 

Figure 8: PROXSCAL MDS on the basis of the asymmetrical matrix (normalized raw stress = 0.044)

 

Choosing four factors enables us to understand the relations between the two groups of authors and the fine structures within each of these groups (Table 3). The first two factors exhibit factor loadings exclusively for information retrieval researchers. These two factors explain 26.8% of the common variance in the matrix, as against 14.2% for the two factors with high loadings for the scientometric authors. This means that the information retrieval researchers are co-cited much more consistently than the scientometric authors: their co-citation patterns are more highly correlated than those of the scientometricians. The subdivisions between factors 1 and 2 and factors 3 and 4, respectively, are of a different nature. Braun, Schubert, and Glänzel are a separate group; they are mainly co-cited because of their (until recently) common address in Budapest and their many coauthored articles. The position of Cronin is special and correlates highly with that of Derek de Solla Price as a cited author. His and Price’s citation patterns do not correlate specifically with any of the four factors.

 

                       Rotated Component Matrix(a)

 

 

                Component
                        1        2        3        4
VANRIJSBERGEN        .679    -.160    -.191
HARMAN               .652             -.145
ROBERTSON            .648     .207
CROFT                .559     .153     .139
COOPER               .546    -.168    -.260    -.137
BLAIR                .522
MOED                -.270    -.229     .216     .231
FIDEL                         .717
KUHLTHAU            -.189     .710    -.238
MARCHIONINI                   .648    -.110
SPINK                .267     .568     .164
DERVIN              -.159     .566    -.236    -.124
BELKIN               .218     .564    -.165
TIJSSEN                                .695
VANRAAN                      -.123     .615
CALLON              -.105              .478    -.352
NEDERHOF            -.111              .426     .278
LEYDESDORFF         -.201    -.179     .426    -.416

NARIN

-.271