
On the Normalization and Visualization of Author Co-Citation Data:

Salton’s Cosine versus the Jaccard Index

Journal of the American Society of Information Science & Technology (forthcoming)


Loet Leydesdorff

Amsterdam School of Communications Research (ASCoR)

Kloveniersburgwal 48, 1012 CX  Amsterdam, The Netherlands

loet@leydesdorff.net ; http://www.leydesdorff.net

 

Abstract

 

The debate about which similarity measure one should use for the normalization in the case of Author Co-citation Analysis (ACA) is further complicated when one distinguishes between the symmetrical co-citation (or, more generally, co-occurrence) matrix and the underlying asymmetrical citation (occurrence) matrix. In the Web environment, the approach of retrieving original citation data is often not feasible. In that case, one should use the Jaccard index, but preferably after adding the number of total citations (occurrences) on the main diagonal. Unlike Salton’s cosine and the Pearson correlation, the Jaccard index abstracts from the shape of the distributions and focuses only on the intersection and the sum of the two sets. Since the correlations in the co-occurrence matrix may partially be spurious, this property of the Jaccard index can be considered an advantage in this case.

 

Keywords: cosine, correlation, co-occurrence, co-citation, Jaccard, similarity

 

1. Introduction

 

Ahlgren et al. (2003) argued that one should consider using Salton’s cosine instead of the Pearson correlation coefficient as a similarity measure in author co-citation analysis, and showed the effects of this change on the basis of a dataset provided in Table 7 (at p. 555) of their paper. This has led to discussions in previous issues of this journal about the pros and cons of using the Pearson correlation or other measures (Ahlgren et al., 2004; Bensman, 2004; White, 2003, 2004; Leydesdorff, 2005). Leydesdorff and Vaughan (2006) used the same dataset to show why one should use the (asymmetrical) citation matrix instead of the (symmetrical) co-citation matrix as the basis for the normalization. They argued that not only the value but also the sign of the Pearson correlation between two cited authors may change between the symmetrical and the asymmetrical case. For example, in the dataset under study, Ahlgren et al. (2003, at p. 556) found a correlation of r = +0.74 between “Schubert” and “Van Raan,” while Leydesdorff & Vaughan (2006, at p. 1620) report r = −0.131 (p < 0.05) using the underlying citation matrix.

 

In the library environment, one can download a set of documents in which the authors under investigation are potentially (co-)cited. However, this approach of retrieving the original citation data and then using Pearson’s r or Salton’s cosine to construct a similarity matrix is often not feasible in the web environment. In this environment, the researcher may have only the index available and searches the database with a Boolean AND in order to construct a co-citation or, more generally, a co-occurrence matrix without first generating an occurrence matrix. Should one in such cases also normalize using the cosine or the Pearson correlation coefficient or, perhaps, use yet another measure?
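The relation between the two types of matrices can be illustrated with a minimal sketch. It assumes a hypothetical binary occurrence matrix in which rows are citing documents and columns are cited authors; the data and the function name `cooccurrence` are illustrative and are not taken from the dataset under study. Multiplying the transposed occurrence matrix by the matrix itself yields the symmetrical co-occurrence matrix, with the total number of occurrences of each author on the main diagonal.

```python
# Hypothetical occurrence matrix: rows are citing documents,
# columns are cited authors (1 = the document cites the author).
occurrence = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
]


def cooccurrence(matrix):
    """Return the symmetrical co-occurrence matrix (transpose times matrix)."""
    n_cols = len(matrix[0])
    return [
        [sum(row[i] * row[j] for row in matrix) for j in range(n_cols)]
        for i in range(n_cols)
    ]


cooc = cooccurrence(occurrence)
for row in cooc:
    print(row)
# [3, 2, 1]
# [2, 2, 1]
# [1, 1, 2]
```

In a web search, each off-diagonal cell would be obtained directly from one Boolean AND query, so that the occurrence matrix in the first step is never available; the sketch only shows which information is lost in that case.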

 

I shall argue that in this case, one may prefer to use the Jaccard index (Jaccard, 1901). The Jaccard index was elaborated by Tanimoto (1957) for the non-binary case. Thus, one can distinguish between using the Jaccard index for the normalization of the binary citation matrix and the Tanimoto index in the case of the non-binary co-citation matrix. The results will be compared with using Salton’s cosine (Salton & McGill, 1983), the Pearson correlation, and the probabilistic activity index (Zitt et al., 2000) in the case of both the symmetrical co-citation and the asymmetrical citation matrix.

 

The argument is illustrated with an analysis using the same data as Ahlgren et al. (2003). This dataset (provided in Table 1) is extremely structured: it contains exclusively positive correlations within both groups and negative correlations between the two groups. The two groups are thus completely separated in terms of the Pearson correlation coefficients. However, there are relations between individual authors in the two groups. An optimal representation should reflect both this complete separation in terms of correlations at the level of the set and the weak overlap generated by individual relations (Waltman & Van Eck, forthcoming; Leydesdorff, 2005). (A visualization of the co-citation matrix before normalization is provided as Figure 13 by Leydesdorff & Vaughan (2006, at p. 1625).)

 

In summary, two problems have to be distinguished: the problem of normalization and the type of matrix to be normalized. In principle, one can normalize both symmetrical and asymmetrical matrices with the various measures. Ahlgren et al. (2003) provided arguments for using the cosine instead of the Pearson correlation coefficient, particularly if one aims at visualization of the structure, as in the case of social network analysis or MDS. Bensman (2004) provided arguments why one might nevertheless prefer the Pearson correlation coefficient when the purpose of the study is a statistical (e.g., multivariate) analysis. The advantage of the cosine, namely that it is not a statistic but a similarity measure, then disappears. Formally, these two measures are equivalent with the exception that Pearson normalizes for the arithmetic mean, while the cosine does not use this mean as a parameter (Jones & Furnas, 1997). The cosine normalizes for the geometrical mean. The question remains which normalization one should use when one has only co-occurrence data available.
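The formal relation between the two measures can be sketched in a few lines. The example below assumes two hypothetical binary citation vectors; the function names are illustrative. Pearson’s r is computed here as the cosine of the vectors after subtracting their arithmetic means, which is the only step in which the two measures differ. The sketch does not handle constant or all-zero vectors, for which the denominator would be zero.

```python
import math


def cosine(x, y):
    """Salton's cosine: inner product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson's r, computed as the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical citation vectors of two authors over four citing documents.
x = [1, 1, 0, 0]
y = [0, 1, 1, 1]

print(round(cosine(x, y), 3))   # 0.408
print(round(pearson(x, y), 3))  # -0.577
```

Because the cosine does not subtract the mean, it cannot become negative for non-negative data, whereas the Pearson correlation for the same two vectors is negative in this example.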

 

2. The Jaccard index

 

In his original paper introducing co-citation analysis, Small (1973, at p. 269) suggested the following solution to the normalization problem in footnote 6:

 

We can also give a more formal definition of co-citation in terms of set theory notation. If A is the set of papers which cites document a and B is the set which cites b, then A∩B, that is n(A∩B), is the co-citation frequency. The relative co-citation frequency could be defined as n(A∩B) ÷ n(A∪B).

 

This proposal for the normalization corresponds with using the Jaccard index or its extension (for the non-binary case) into the Tanimoto index. The index is defined for a pair of vectors, $X_m$ and $X_n$, as the size of the intersection divided by the size of the union of the sample sets or, in numerical terms:

$$S_{mn} = \frac{X_{mn}}{X_{mm} + X_{nn} - X_{mn}}$$

where $X_{ij} = X_i X_j$ is the inner product of the two vectors. The value of $S_{mn}$ ranges from 0 to 1 (Lipkes, 1999; cf. Salton & McGill, 1983, at pp. 203f.).
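As a minimal sketch of this definition, the following example assumes the usual Tanimoto form, in which the inner product of the two vectors is divided by the sum of their squared lengths minus that inner product. The vectors and the function names `tanimoto` and `jaccard` are hypothetical. For binary vectors the result coincides with the set-based Jaccard index (intersection divided by union); for non-binary vectors only the Tanimoto form applies.

```python
def tanimoto(x, y):
    """Tanimoto index: X_mn / (X_mm + X_nn - X_mn), with X_ij an inner product."""
    x_mn = sum(a * b for a, b in zip(x, y))
    x_mm = sum(a * a for a in x)
    x_nn = sum(b * b for b in y)
    return x_mn / (x_mm + x_nn - x_mn)


def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection divided by the size of the union."""
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical binary citation vectors over four citing documents.
author_m = [1, 1, 1, 0]
author_n = [1, 0, 1, 0]

# The same information as sets of citing documents.
citing_m = {i for i, v in enumerate(author_m) if v}
citing_n = {i for i, v in enumerate(author_n) if v}

binary_value = tanimoto(author_m, author_n)      # 2 / (3 + 2 - 2)
set_value = jaccard(citing_m, citing_n)          # 2 / 3
weighted_value = tanimoto([2, 0, 1], [1, 1, 1])  # 3 / (5 + 3 - 3)

print(round(binary_value, 3), round(set_value, 3), round(weighted_value, 3))
# 0.667 0.667 0.6
```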

 

In a number of studies (e.g., Egghe & Rousseau, 1990; Glänzel, 2001; Hamers et al., 1989; Leydesdorff & Zaal, 1988; Luukkonen et al., 1993; Michelet, 1988; Wagner & Leydesdorff, 2005), the Jaccard index and the cosine have systematically been compared for co-occurrence data, but this debate has remained inconclusive. Using co-authorship data, for example, Luukkonen et al. (1993, at p. 23) argued that “the Jaccard measure is preferable to Salton’s measure since the latter underestimates the collaboration of smaller countries with larger countries; […].” Wagner & Leydesdorff (2005, at p. 208) argued that “whereas the Jaccard index focuses on strong links in segments of the database the Salton Index organizes the relations geometrically so that they can be visualized as structural patterns of relations.”

 

In many cases, one can expect the Jaccard and the cosine measures to be monotonic to each other (Schneider & Borlund, forthcoming). However, the cosine measures the similarity between two vectors (by using the angle between them), whereas the Jaccard index focuses only on the size of the intersection between the two sets relative to their union. Furthermore, one can normalize differently using the margin totals in the asymmetrical occurrence or the symmetrical co-occurrence matrix. Luukkonen et al. (1993, at p. 18), for example, summed the co-occurrences in their set (of 30 countries) to obtain the denominator, while Small’s (1973) definition of a relative co-citation frequency suggests using the sum of the total number of occurrences as the denominator. White & Griffith (1981, at p. 165) also proposed using “total citations” as values for the main diagonal, but these authors decided not to use this normalization for empirical reasons.
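The difference between the two denominators can be illustrated with a short calculation. The values are read from the Braun and Schubert rows of the co-citation matrix that follows; the function name and the interpretation of the second variant are assumptions. The first variant uses the total citations on the main diagonal, in line with Small’s definition. The second variant is one possible reading of a normalization based on margin totals, in which the row sums of the co-citations (without the diagonal) take the place of the totals.

```python
def jaccard_from_counts(c_ij, c_i, c_j):
    """Jaccard index from a co-occurrence count and two occurrence totals."""
    return c_ij / (c_i + c_j - c_ij)


cocitations = 29                          # co-citations of Braun and Schubert
total_braun, total_schubert = 50, 60      # main diagonal: total citations
margin_braun, margin_schubert = 118, 139  # row sums of co-citations, diagonal excluded

# Denominator based on total citations: 29 / (50 + 60 - 29)
jaccard_totals = jaccard_from_counts(cocitations, total_braun, total_schubert)

# Denominator based on summed co-citations: 29 / (118 + 139 - 29)
jaccard_margins = jaccard_from_counts(cocitations, margin_braun, margin_schubert)

print(round(jaccard_totals, 3))   # 0.358
print(round(jaccard_margins, 3))  # 0.127
```

The summed co-citations count a citing document once for every co-cited partner, so this denominator is larger than the number of citing documents and the resulting index is correspondingly lower for this pair.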

 

(Columns 1 to 24 follow the row order of the authors; the last column is the row sum without the value on the main diagonal.)

 #  Author         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24  Total
 1  Braun         50 29 19 19  8 13  5  9  7  7  2  0  0  0  0  0  0  0  0  0  0  0  0  0    118
 2  Schubert      29 60 30 18 10 20  5  5  5 14  2  1  0  0  0  0  0  0  0  0  0  0  0  0    139
 3  Glanzel       19 30 53 16 10 22  9 14  9 11  5  3  0  0  0  0  0  0  0  0  0  0  0  0    148
 4  Moed          19 18 16 55 11 20  5 17 14 12  6  4  0  0  0  0  0  0  0  0  0  0  0  0    142
 5  Nederhof       8 10 10 11 31 12  8 13  7  4  4  2  0  0  0  0  0  0  0  0  0  0  0  0     89
 6  Narin         13 20 22 20 12 64 11 20 21 20 11  9  0  0  1  1  0  0  1  1  0  0  0  0    183
 7  Tijssen        5  5  9  5  8 11 22 13 10  5  6  1  0  1  2  1  0  0  0  1  0  0  0  0     83
 8  VanRaan        9  5 14 17 13 20 13 50 13 12 11  6  0  1  2  1  0  0  0  1  0  0  0  0    138
 9  Leydesdorff    7  5  9 14  7 21 10 13 46 18 14  9  1  0  1  1  0  0  0  2  0  0  0  0    132
10  Price          7 14 11 12  4 20  5 12 18 54 10  9  1  1  1  1  0  0  2  0  1  0  1  2    132
11  Callon         2  2  5  6  4 11  6 12 14 10 26  4  0  0  1  1  0  0  0  1  0  0  0  0     79
12  Cronin         0  1  3  4  2  9  1  6  9  9  4 24  1  0  0  0  0  0  0  1  0  1  1  1     53
13  Cooper         0  0  0  0  0  0  0  0  1  1  0  1 30 14  5 11  5  8  6  2  0  0  1  1     56
14  Vanrijsbergen  0  0  0  0  0  0  1  1  0  1  0  0 14 30  7 15  5 13  5  3  1  0  1  1     68
15  Croft          0  0  0  0  0  1  2  2  1  1  1  0  5  7 18  9  6  7  8  6  2  1  2  2     63
16  Robertson      0  0  0  0  0  1  1  1  1  1  1  1 11 15  9 36  7 12 11 10  8  5  4  4    103
17  Blair          0  0  0  0  0  0  0  0  0  0  0  0  5  5  6  7 18  9  4  2  2  2  0  0     42
18  Harman         0  0  0  0  0  0  0  0  0  0  0  0  8 13  7 12  9 31  9  5  5  3  1  1     73

Belkin

0

0

0

0

0

1

0

0

0

2

0

0

6

5

8

11

4

9

36

9

9