Classification and Powerlaws: The logarithmic transformation
Journal of the American Society for Information Science and Technology (forthcoming)
Loet Leydesdorff [1] & Stephen Bensman [2]
Abstract
Logarithmic transformation of the data has been recommended by the literature in the case of highly skewed distributions such as those commonly found in information science. The purpose of the transformation is to make the data conform to the lognormal law of error for inferential purposes. How does this transformation affect the analysis? We factor analyze and visualize the citation environment of the Journal of the American Chemical Society (JACS) before and after a logarithmic transformation. The transformation strongly reduces the variance necessary for classificatory purposes and therefore is counterproductive to the purposes of the descriptive statistics. We recommend against the logarithmic transformation when sets cannot be defined unambiguously. The intellectual organization of the sciences is reflected in the curvilinear parts of the citation distributions, while negative powerlaws fit excellently to the tails of the distributions.
Keywords: classification, citation, journal, logarithmic transformation, powerlaw
1. Introduction
The problem under analysis in this paper has its genesis in a controversy that erupted on the pages of JASIST over the use of the Pearson correlation coefficient as a similarity measure in author cocitation analysis (ACA). Ahlgren et al. (2003) challenged basing ACA on the Pearson r with the argument that this measure is sensitive to zeros in the sense that the relationships among the authors change when authors not citing any of them are added to the set. These authors proposed alternative measures such as the cosine. White (2003) defended the method of the Drexel school (White & Griffith, 1981, 1982; McCain, 1990) by showing that the Pearson r and the cosine lead to similar classification and mapping results when using Ahlgren et al.’s own data.[3]
The Pearson r is a measure of the closeness of the fit of observation points to a regression line and is therefore a linear statistical model. Linear statistical models rely upon a number of basic assumptions. Without these assumptions, the data for them must be mathematically transformed so that the condition of linearity is satisfied. In a review of the key literature on such transformations, Hoyle (1973) summarized the assumptions conditional to the use of linear models as follows:
(a) additivity—that is, the main effects combine linearly to “explain” the observations;
(b) constant variance—that is, the observations are assumed to have a constant variance about their varying means. Explicitly this means that the variance is independent of both the expected value of the observations and the sample size;
(c) normality—that is, the observations are assumed to have a normal distribution. (p. 203)
For their part, Box and Cox (1964, 211) further qualified the assumptions underlying linear statistical models by adding to them simplicity of model structure and independence of observations.
Information science data rarely allow for the satisfaction of these assumptions. This is particularly true of scientific journal citation data, due to the structure of scientific journal sets even after an initial classification process, and the stochastic processes underlying the distributions resulting from this structure. If the data are heavily skewed, as is often the case in information science, one should consider performing a logarithmic transformation. Logarithmically transformed data may exhibit log-normality, and thus allow for using the Pearson correlation coefficient.
In this study, we lognormalize journal-journal citation data before using the Pearson correlation (as an initial step in factor analysis). Might this transformation provide an option for testing different possible classifications of journals for their significance (Leydesdorff, forthcoming)? We found that the logarithmic transformation did not add clarity to the classificatory process. This accords with White’s (2004, p. 844) expectation that “if mapping the correlation data is the goal, one merely wants the r’s to reflect degrees of similarity among the authors, and so significance tests from inferential statistics are not (I would think) of primary interest.”
We shall show below that the logarithmic transformation even worsens the quality of the classification. These results raise the question of the role of inferential statistics and the logarithmic transformation in the mathematical and statistical classification of observations into sets. We explore this question in this study by combining the theoretical background with empirical tests. In short, we will explain why the logarithmic transformation is counterproductive to the objective of classification in the case of bibliometric data (which typically exhibit heavily skewed distributions). This conclusion has implications for the interpretation and use of powerlaws in bibliometric data (Katz, 2000).
In terms of their underlying subject structure, scientific journal sets are governed by two bibliometric laws: Bradford’s Law of Scattering and Garfield’s Law of Concentration. The first was posited by Bradford (1934, at p. 86), the director of the Science Museum Library in London, as a result of bibliographic studies done at this library. The second law was formulated by Garfield (1971) in the context of the selection of journals for inclusion in the Science Citation Index (SCI). The implications of these insights for information science were elaborated by Brookes (1977, 1979, 1980a, 1980b, 1984; Brookes & Griffith, 1979).
2.1 Scattering and concentration of journal sets
Bradford (1934) analyzed the distribution of articles in two subject areas: Applied Geophysics, 1928-1931, and Lubrication, 1931-1933. In neither area was he able to determine the number of journals that had no articles on the topics but potentially could, stating:
…the number of journals which contain articles on the subjects in question is of the order of a thousand. But the periodicals themselves could not be specified without scrutinizing a much larger number of periodicals during a long period. And even when the actual producers during a period of years had been ascertained, new sources would certainly appear during a further period. It follows that the only way to glean all the articles on these subjects would be to scrutinize continually thousands of journals, the bulk of which would only yield occasional references or none at all. (p. 86)
In other words, Bradford’s Law states that the distribution of articles on a given scientific topic over a set of journals is such that a large proportion of these articles appear in a relatively small core set of journals, while the remaining articles are spread over zones of journals that must increase exponentially in numbers of titles to obtain the same number of articles on the topic as in the core. Due to Bradford’s Law, unambiguously delineated (“crisp”) subject sets of scientific journals cannot be expected, and the purpose of the initial classification process is merely to approximate such subject sets as closely as practicable (Bensman, 2000; 2001; Zadeh, 1965).
The composite and multidisciplinary nature of science underlies also Garfield’s Law of Concentration, which Garfield (1971) considered as the citation corollary of Bradford’s Law of Scattering. Garfield (1971; 1972; 1983, 21-23 and 158-163) developed his law as a result of an analysis of references published during the last quarter of 1969 in the 2,200 journals then covered by the SCI. He found a distribution similar to the one discovered by Bradford because citations in an individual discipline like chemistry concentrate on a small core of journals. The ubiquity of such disciplinary cores caused Garfield to reformulate Bradford’s Law by transposing it from the level of individual disciplines to the level of science as a whole. Likening Bradford’s Law to a comet with the core journals of a discipline representing the nucleus and the zones acting as the tail, Garfield posited that the tail of the literature of any given scientific discipline consists in large part of the nuclei or cores of the literatures of other disciplines. Thus, a multidimensional space is spanned in terms of a variety of core sets, but each core includes a large part of the others in the tail of the distribution. According to Garfield, this phenomenon causes citations to concentrate on a small multidisciplinary core of some 500 to 1,000 journals representing all of science.
On the basis of these two laws, one cannot expect that scientific journal sets will be homogeneous in terms of their subject matter. A journal set defined by a given scientific discipline can be comprised of subsets of journals which can be classed in the sub-disciplines of this discipline as well as subsets of journals from other disciplines that contain materials of interest to the defining discipline. This latter subset can be considered a partial subset because it also contains materials not pertinent to the defining discipline. Moreover, a scientific journal set can also be broken down into subsets by criteria other than subject ones such as nationality, language, type of publisher, or purpose, e.g., research, review, informational, and instructional.
The composite structure of scientific journal sets dictates that their data distributions are for the most part compound ones. A compound distribution can be defined as a type of probability distribution arising when a parameter of the distribution such as the arithmetic mean is itself a random variable with its own probability distribution (Everitt, 1998, 71). Scientific journal distributions result from the Poisson process, which is the random occurrence of events such as citations over continuums of time and space. For these distributions space is defined in terms of the subsets comprising the set. Each of the subsets of a scientific journal set has different underlying probabilities and therefore a different expected value or arithmetic mean.
Two stochastic processes govern these scientific journal distributions. The first is heterogeneity. The variances around the arithmetic means tend to vary in proportion to the size of the arithmetic means, thereby violating one of the basic assumptions of linear statistical models. The second stochastic process is contagion. A term first suggested by the study of the probability distributions of epidemics, contagion became more broadly used to designate situations where trials are not independent, because the occurrence of an event affects the probability of its further occurrence. Citations act in such a manner, since each citing of a journal increases its probability of being cited again. This has been discussed in science studies as the Matthew effect (Merton, 1968) and more recently as the mechanism of preferential attachment which is well-known for generating negative powerlaws (Barabási, 2002; Barabási et al., 2002; Katz, 1999, 2000; Wagner & Leydesdorff, forthcoming). The linear fit of a log-log distributional chart can be used as a test for this preferential attachment mechanism.
Both heterogeneity and contagion act multiplicatively instead of additively, creating exponential and curvilinear relationships instead of the assumed additive, linear ones. Feller (1943) proved that heterogeneity and contagion serve as the basis for two different models of the negative binomial distribution (NBD). Therefore, the NBD could serve as a probabilistic model of the causal processes in scientific journal distributions. The NBD can be normalized by the arc-sinh transformation (Anscombe, 1948). However, these precise mathematical probability models require crisp sets, which cannot be expected to exist in scientific journal data given Bradford’s and Garfield’s Laws.
2.2 The logarithmic transformation
As a result of their structure, scientific journal subject sets contain data unrelated to the subject, causing extreme statistical outliers that distort parameter estimates and prevent precise mathematical fits to theoretical curves. However, these outliers are meaningful because they span the structure in the data. They are indicated by the variance, but much less so by the arithmetic mean. Consequently, the latter is not an accurate measure under these circumstances. The vast majority of science journal distributions have a variance much greater than their arithmetic mean.
In a landmark article Bartlett (1947, 43) specified the significance of this phenomenon in terms of the dynamics of a biological system. According to him, the natural explanation of a variance greater than the mean is that the mean level itself fluctuates. He noted that for biological populations, increases in numbers are often proportional to the numbers already present, giving rise to variations in the mean from place to place themselves proportional to the local mean. In the case of a variance greater than the mean, the literature advises considering a logarithmic transformation of the data (Bartlett, 1947; Quenouille, 1950). For his part, Elliott (1977, 33) considered the variance being greater than the mean as a sign of the negative binomial distribution, and he made the following recommendations: 1) with no zero counts, simple logarithmic transformation of the data; 2) with some zero counts, add one to the observations before performing the logarithmic transformation. Quenouille (1950, 165) stated that the logarithmic transformation tends to restore normality in the distribution and equalize the variances simultaneously, whereas Hoyle (1973, 207) cites a number of studies empirically showing the logarithmic transformation as a way of making the data conform to the three linear-model assumptions of additivity, constant variance, and normality.[4]
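The diagnostic and the transformation rule summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in this study: the counts are invented, the function names are hypothetical, and the choice of base-10 logarithms is ours.

```python
import math
from statistics import mean, variance

def variance_to_mean_ratio(counts):
    """Sample variance divided by the arithmetic mean; a value above 1 suggests overdispersion."""
    return variance(counts) / mean(counts)

def log_transform(counts):
    """Elliott's (1977) rule: log(x) when there are no zero counts, log(x + 1) when some counts are zero."""
    shift = 1 if any(c == 0 for c in counts) else 0
    return [math.log10(c + shift) for c in counts]

# Invented, heavily skewed citation counts (illustration only, not data from this study).
counts = [0, 3, 5, 8, 12, 20, 45, 130, 900, 6000]

raw_ratio = variance_to_mean_ratio(counts)
log_ratio = variance_to_mean_ratio(log_transform(counts))
print(f"variance-to-mean ratio, raw counts: {raw_ratio:.2f}")
print(f"variance-to-mean ratio, log-transformed: {log_ratio:.2f}")
```

With skewed counts of this kind the ratio is far above 1 before the transformation and falls below 1 afterwards, which is the pattern the recommendations above are designed to produce.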
In summary, the logarithmic transformation of data enables the analyst to switch the law of error for tests of significance in linear models from the normal distribution to the lognormal distribution. In their book on the latter distribution Aitchison and Brown (1957) defined the lognormal distribution as “the distribution of a variate whose logarithm obeys the normal law of probability” (p. 1). According to them, many of the properties of the lognormal may be immediately derived from those of the normal distribution.
Aitchison and Brown believed that the lognormal distribution was as fundamental a distribution in statistics as the normal distribution: “It arises from a theory of elementary errors combined by a multiplicative process, just as the normal distribution arises from a theory of elementary errors combined by addition” (pp. 1-2). Keynes (1921, 198-200) regarded as the main advantage of the lognormal distribution the possibility it offered of adapting without much trouble to asymmetrical phenomena numerous expressions which had already been calculated for the normal law of error. In contrast to the normal distribution, which is centered on the arithmetic mean, the lognormal distribution is centered on the geometric mean, which can be calculated by first calculating the arithmetic mean of the logarithmically transformed data and then taking this mean’s antilogarithm. Thus, we can see that the purpose of the logarithmic transformation is to create a model that conforms to the requirements of the normal law of error for inferential purposes. It does this by artificially reducing the amount of variance to that of the normal distribution.
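The calculation of the geometric mean described above can be made concrete with a short Python sketch. The values are invented for illustration and the function name is hypothetical; natural logarithms are used, but any base gives the same result as long as the matching antilogarithm is taken.

```python
import math

def geometric_mean(values):
    """Antilogarithm of the arithmetic mean of the log-transformed values."""
    logs = [math.log(v) for v in values]      # step 1: logarithmic transformation
    mean_of_logs = sum(logs) / len(logs)      # step 2: arithmetic mean of the logarithms
    return math.exp(mean_of_logs)             # step 3: antilogarithm of that mean

# Invented positive values spanning several orders of magnitude (illustration only).
values = [1, 10, 100, 1000]

g_mean = geometric_mean(values)
a_mean = sum(values) / len(values)
print(f"geometric mean: {g_mean:.2f}, arithmetic mean: {a_mean:.2f}")
```

For skewed data the geometric mean lies well below the arithmetic mean, because the logarithm compresses the large values that dominate the arithmetic mean.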
2.3 The implications for information science and technology
In a series of papers B. C. Brookes worked out the deeper implications of the logarithmic transformation for information science. In the first of this series, Brookes (1977) came to the conclusion that Bradford had succeeded in formulating an empirical regularity, which has pure and hybrid forms, but that all the variants can be subsumed under a simple logarithmic law which escapes exact expression in conventional frequency terms. In this analysis he closely linked Bradford’s Law with set definition, insisting upon the need for homogeneity of the data. Brookes (pp. 194-197) stated that most Bradford anomalies are due to inhomogeneous data, and he characterized SCI citation data specifically as inhomogeneous.
Utilizing the logarithmic Law of Anomalous Numbers advanced by Benford (1938), Brookes developed Bradford’s Law into a linear model of social reality with the type of deviations from linearity indicating the nature of the stochastic process that is occurring. On the basis of this model he developed a new theory of frequency-rank statistics especially applicable to social analysis. Brookes and Griffiths (1979) noted that in many social contexts, when a homogeneous ensemble of sources has been engaged in some discrete homogeneous activity, ranking the sources in descending order by frequency counts results in a distribution that is logarithmic. Brookes (1979) came thus to regard Bradford’s Law as a new calculus for the social sciences.
3. Methods and materials
3.1 Data
The role of inferential statistics and logarithmic transformation in numerical classification and mapping will be analyzed in terms of the allocation of scientific journals into different subject sets. Our data were collected from the CD-ROM version of the Journal Citation Reports 2003 of the Science Citation Index. We included all journals which provide more than one percent of the citations to articles in the Journal of the American Chemical Society during this year (Leydesdorff & Cozzens, 1993). This leads to the demarcation of the set of 21 journals listed in Table 1.
Table 1. Library of Congress Subject Headings and Class Groups for the 21 Journals Citing the Journal of the American Chemical Society

| Titles | Publishers | Subject Headings | Call Number | Class Group | Class Group Hierarchy |
|---|---|---|---|---|---|
| Science | American Association for the Advancement of Science | 1. Science. | Q1 | Science (General) | Science (General) |
| Angewandte Chemie-International Edition | Wiley-VCH (1) | 1. Chemistry. | QD1 | Chemistry | Chemistry |
| Chemical Communications (2) | Royal Society of Chemistry | 1. Chemistry. | QD1 | Chemistry | Chemistry |
| Chemistry-A European Journal | VCH Verlagsgesellschaft (1) | 1. Chemistry. | QD1 | Chemistry | Chemistry |
| Chemical Reviews | American Chemical Society | 1. Chemistry. | QD1 | Chemistry | Chemistry |
| Journal of the American Chemical Society | American Chemical Society | 1. Chemistry. | QD1 | Chemistry | Chemistry |
| Dalton Transactions (3) | Royal Society of Chemistry | 1. Chemistry, Inorganic. 2. Chemistry, Physical and theoretical. | QD146 | Inorganic chemistry | Chemistry--Inorganic chemistry |
| Inorganic Chemistry | American Chemical Society | 1. Chemistry, Inorganic. 2. Bioinorganic chemistry. | QD146 | Inorganic chemistry | Chemistry--Inorganic chemistry |
| Journal of Organic Chemistry | American Chemical Society | 1. Chemistry, Organic. | QD241 | Organic chemistry | Chemistry--Organic chemistry |
| Organic and Biomolecular Chemistry (4) | Royal Society of Chemistry | 1. Chemistry, Organic. 2. Bioorganic chemistry. 3. Chemistry, Physical organic. | QD241 | Organic chemistry | Chemistry--Organic chemistry |
| Tetrahedron | Pergamon Press | 1. Chemistry, Organic. | QD241 | Organic chemistry | Chemistry--Organic chemistry |
| Tetrahedron Letters | Pergamon Press | 1. Chemistry, Organic. | QD241 | Organic chemistry | Chemistry--Organic chemistry |
| Organic Letters | American Chemical Society | 1. Chemistry, Organic. | QD241 | Organic chemistry | Chemistry--Organic chemistry |
| Macromolecules | American Chemical Society | 1. Macromolecules. 2. Polymers. 3. Polymerization. | QD380 | Polymers. Macromolecules | Chemistry--Organic chemistry--Polymers. Macromolecules |
| Journal of Organometallic Chemistry | Elsevier Sequoia | 1. Organometallic compounds. | QD410 | Organometallic chemistry and compounds | Chemistry--Organic chemistry--Organometallic chemistry and compounds |
| Organometallics | American Chemical Society | 1. Organometallic compounds. | QD410 | Organometallic chemistry and compounds | Chemistry--Organic chemistry--Organometallic chemistry and compounds |
| Journal of Chemical Physics | American Institute of Physics | 1. Chemistry. 2. Physics. 3. Chemistry, Physical and theoretical. | QD450 | Physical and theoretical chemistry | Chemistry--Physical and theoretical chemistry |
| Journal of Physical Chemistry A (5) | American Chemical Society | 1. Chemistry, Physical and theoretical. | QD450 | Physical and theoretical chemistry | Chemistry--Physical and theoretical chemistry |
| Journal of Physical Chemistry B (5) | American Chemical Society | 1. Chemistry, Physical and theoretical. | QD450 | Physical and theoretical chemistry | Chemistry--Physical and theoretical chemistry |
| Langmuir | American Chemical Society | 1. Surface chemistry. 2. Colloids. 3. Surfaces (Physics). | QD506 | Surface chemistry | Chemistry--Physical and theoretical chemistry--Surface chemistry |
| Biochemistry-US | American Chemical Society | 1. Biochemistry. | QP501 | Animal biochemistry | Physiology--Animal biochemistry |

(1) A journal of the Gesellschaft Deutscher Chemiker.
(2) Title changed in 1996 from: Journal of the Chemical Society. Chemical Communications.
(3) Formed by the union in 2000 of Journal of the Chemical Society, Dalton Transactions, and Acta Chemica Scandinavica to become Dalton, which in 2003 became Dalton Transactions.
(4) Formed in 2003 by the union of Perkin 1 and Perkin 2. Perkin 1 was formed in 2000 by the merger of Journal of the Chemical Society. Perkin Transactions 1, and part of Acta Chemica Scandinavica. Perkin 2 was formed in 2000 by the merger of Journal of the Chemical Society. Perkin Transactions II, and part of Acta Chemica Scandinavica.
(5) Continues in part as Journal of Physical Chemistry since 1997.
One interesting feature of these journals is their publisher structure. Most of these journals are either published by scientific societies or associated with scientific societies. Thus, eleven are published by the American Chemical Society; three are published by the Royal Society of Chemistry; one by the American Association for the Advancement of Science; one by the American Institute of Physics; and two are journals of the Gesellschaft Deutscher Chemiker even though issued by commercial publishers. Society journals are the ones most highly rated by chemists and used in chemistry libraries (Bensman, 1996). Citations concentrate on both journals of scientific societies and elite research programs, showing that scientists from these programs publish in society journals (Bensman & Wilder, 1998). Thus, the publisher structure of the 21 journals is evidence that these journals rank high in the social structure of chemistry and are a manifestation of the intercommunication pattern of the chemistry scientific elite.
The set structure of the database will first be analyzed by the logical method of induction and analogy set forth by Keynes (1921). This can be done by showing what subject headings and class numbers are assigned to these 21 journals by the United States Library of Congress (LC). Table 1 gives these subject headings and class numbers. The subject headings should be self-evident, but the class numbers may require some explanation. In the standard work on LC Classification, Chan (1999, p. 12-16) states that the LC scheme is based on “literary warrant.” A classification scheme based on literary warrant is not logically deduced from some abstract philosophical system for classifying knowledge but inductively developed in reference to the holdings of a particular library or to what is or has been published. In other words, it is based on what the actual literature of the time warrants. Each of the individual schedules was initially drafted by LC subject specialists, who consulted bibliographies, treatises, comprehensive histories, and existing classification schemes to determine the scope and content of an individual class and its subclasses. The LC has a policy of continuous revision to take current literary warrant into account, so that new areas are developed and obsolete elements are removed or revised.
Analysis of the class numbers shows that the 21 journals have been classified logically into three basic subclasses or sets. Thus, the journal Science is classed in Q1 or Science (General). It is followed by 19 journals that are classed within QD or Chemistry and its hierarchical subclasses. The last journal, Biochemistry-US, has been classed within the subclass Animal Biochemistry within the subclass QP or Physiology. Thus, the conclusion from the logical LC classification of this citation environment of the JACS is that we are dealing with a core of 19 journals fully within the chemistry set and two journals, Science and Biochemistry-US, only partially within the chemistry set. However, given Bradford’s and Garfield’s Laws, even the 19 journals of the chemistry core can be expected to be only partially within the chemistry set and to have facets outside this set.
3.2 Methods
A matrix of 21 x 21 cells can be constructed from the list of journals provided in Table 1 (Appendix I). This matrix is asymmetrical: the cases (rows) are cited by the same set of journals in the columns. The descriptive analysis of the subject relationships among the 21 journals of the database will first be done in terms of the frequency with which each of the journals was cited in 2003 by the journals of the database. After two sections with descriptive statistics, we shall proceed to the (Q‑)factor analysis of the aggregated citation matrix among the 21 journals in order to find communalities in their being-cited patterns. Varimax rotation with Kaiser normalization on the basis of the Pearson correlation matrix will be used. The results are visualized using the Pearson correlation matrix as input to the algorithm of Kamada & Kawai (1989)[5] as available in Pajek.[6] The data matrix is thereafter transformed by taking the logarithm of the values in the cells, and the analysis is then repeated. Because the citation matrix contains some zeros and log(0) = -∞, 1 was added to all values in this pass (Elliott, 1977, 33).
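The two passes described above (Pearson correlations on the raw matrix, then on the matrix after adding 1 and taking logarithms) can be sketched as follows. This is a minimal illustration, not the software used in this study: the 4 x 4 matrix is invented, the function names are hypothetical, base-10 logarithms are assumed, and correlating the rows (the being-cited patterns of the cases) is an assumption about the orientation of the matrix. The factor analysis and the visualization are not reproduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def log_plus_one(matrix):
    """Add 1 to every cell before taking the logarithm, so that zero cells map to 0."""
    return [[math.log10(v + 1) for v in row] for row in matrix]

def correlation_matrix(matrix):
    """Pearson correlations between all pairs of rows."""
    return [[pearson(r1, r2) for r2 in matrix] for r1 in matrix]

# Invented citation matrix with two loose groups of journals (illustration only).
toy = [
    [500, 120, 3, 0],
    [110, 400, 5, 2],
    [4, 6, 300, 90],
    [0, 1, 80, 250],
]

raw_r = correlation_matrix(toy)                # first pass: raw counts
log_r = correlation_matrix(log_plus_one(toy))  # second pass: log(x + 1)
for raw_row, log_row in zip(raw_r, log_r):
    print([round(v, 2) for v in raw_row], [round(v, 2) for v in log_row])
```

Either correlation matrix could then serve as input to a factor analysis or a layout algorithm; comparing the two shows how the transformation changes the similarity structure.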
The vector-space model based on the cosine (Salton & McGill, 1983) is more suitable for the visualization since the cosine runs from 0 to 1, while the Pearson correlations can vary from -1 to +1.[7] The two similarity measures are otherwise equivalent (Jones & Furnas, 1987). Since the matrix under study did not contain many zeros (cf. Ahlgren et al., 2003), and given our research focus on the effects of the logarithmic transformation on the normality and/or lognormality of the distribution, we shall use the Pearson correlation exclusively as the basis of both the statistics and the visualizations.
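The relation between the two measures can be illustrated with a short Python sketch: the Pearson r is the cosine computed after subtracting the mean from each vector, which is why it can become negative while the cosine of non-negative citation counts cannot. The vectors are invented and the function names are hypothetical.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def centered(x):
    """Subtract the arithmetic mean from every element."""
    m = sum(x) / len(x)
    return [a - m for a in x]

def pearson(x, y):
    """Pearson r expressed as the cosine of the mean-centered vectors."""
    return cosine(centered(x), centered(y))

# Invented citation patterns of two journals from different specialties (illustration only).
x = [500, 120, 3, 0]
y = [4, 6, 300, 90]

cos_xy = cosine(x, y)
r_xy = pearson(x, y)
print(f"cosine: {cos_xy:.3f}, Pearson r: {r_xy:.3f}")
```

For these two dissimilar patterns the cosine is small but positive, whereas the Pearson r is negative, because the centering step turns low counts into negative deviations from the mean.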
4. Results
4.1 The effects of the logarithmic transformation on the distributions
To begin the analysis, the shape of the frequency distributions of the citing journals and the effect of the logarithmic transformation on this shape will be shown in detail for two of the journals, the Journal of the American Chemical Society (JACS) and Science. The first is the linchpin of the database’s chemistry set; the second has been logically classified above as being outside this chemistry set. Figures 1 and 2 graph the shapes of the distributions for these journals in both the raw-count and logarithmic form. These histograms were constructed by dividing the range of the citations into deciles and then grouping the citing journals by these deciles.
Figure 1. Journal of the American Chemical Society Distributions
Raw citation counts:
Arithmetic Mean = 4,744.95
Variance = 16,272,608.55
Variance-to-Mean Ratio = 3,429.46
Type of Distribution: Compound Poisson, Contagious
Logarithmically transformed counts:
Arithmetic Mean = 3.57
Variance = 0.11
Variance-to-Mean Ratio = 0.03
Type of Distribution: Lognormal
Figure 2. Science Distributions
Raw citation counts:
Arithmetic Mean = 863.57
Variance = 836,520.76
Variance-to-Mean Ratio = 968.68
Type of Distribution: Compound Poisson, Contagious
Logarithmically transformed counts:
Arithmetic Mean = 2.72
Variance = 0.21
Variance-to-Mean Ratio = 0.08
Type of Distribution: Lognormal
In both cases it is clear that the top journals citing these two journals were themselves, with JACS having 20,469 self-citations and Science having 3,397 self-citations. It can be deduced that the bulk of the Science self-citations were not to chemistry articles. This can be seen in the imbalance with which these two journals cited each other. Thus, Science was the lowest of the journals citing JACS, with a count of only 304, whereas JACS was the second-highest of the journals citing Science, with a count of 2,776. In the raw-count form both journals’ distributions manifest the typical shape of a compound Poisson, contagious distribution with the majority of the journals concentrated below the arithmetic mean, the long tail to the right causing huge variance, and an extremely high variance-to-mean ratio (3,429.46 for JACS and 968.68 for Science). These shapes and high variance-to-mean ratios are natural products of the probabilistic heterogeneity of the journals and their subsets acting in conjunction with a contagious process.
The effect of the logarithmic transformation is similar for both JACS and Science. First, the location of the distributions as measured by the arithmetic mean shifts from near the bottom of the range to near the top of the range, indicating an increase in relative probability. Second, the variance falls drastically below the arithmetic mean, resulting in extremely low variance-to-mean ratios (0.03 for JACS and 0.08 for Science). Third, instead of being skewed asymmetrically, the observations tend to distribute themselves symmetrically around the arithmetic mean within the constricted variance. This is the shape that results from random measurement error around the mean. From this demonstration it is easy to see that logarithmic transformation for purposes of inferential statistics does not result in a more accurate description of reality but in a mental model of reality artificially structured to conform to a law of error. It is interesting to note that the logarithmic transformation of the JACS distribution reveals Science as a possible outlier.
4.2 Negative powerlaws at the level of the database
While the previous analysis showed the lognormality of the distribution in a local citation environment, one can wonder whether this lognormality also exists in the larger dataset, that is, including the tails of the distributions. Is the JCR data loglinear? Does the logarithmic transformation provide us with a more adequate description of the citation distribution of these journals at the level of the database? Let us inspect the fit with a negative powerlaw by plotting the citation distributions of these 21 journals log-log using the full set of the 5907 journals included in this database.

Figure 3: Citation distribution of 21 selected journals over the full journal set of 5907 journals included in the JCR 2003.
Figure 3 shows that the citation distributions of the journals exhibit the powerlaw-type distributions for the largest part of the curve (Barabási, 2002; Katz, 2000). The journals are related with citations to between 10² and 10³ journals in their respective environments. (The number of journals in the JCR 2003 database was 5907.) The fits of the negative log-log curves are all high (r² > 0.96; see Table 2).
| Journal name | Number of journals in the citation environment | Citation distribution | Fit of log-log line |
|---|---|---|---|
| Angew Chem Int Edit | 686 | log(y) = -1.43 log(x) + 4.37 | r² > 0.98 |
| Biochemistry-US | 952 | log(y) = -1.53 log(x) + 4.89 | r² > 0.97 |
| Chem Commun | 500 | log(y) = -1.48 log(x) + 4.30 | r² > 0.98 |
| Chem Rev | 703 | log(y) = -1.49 log(x) + 4.51 | r² > 0.97 |
| Chem-Eur J | 530 | log(y) = -1.51 log(x) + 4.43 | r² > 0.98 |
| Dalton T | 394 | log(y) = -1.56 log(x) + 4.38 | r² > 0.98 |
| Inorg Chem | 558 | log(y) = -1.58 log(x) + 4.67 | r² > 0.98 |
| J Am Chem Soc | 981 | log(y) = -1.65 log(x) + 5.39 | r² > 0.97 |
| J Chem Phys | 728 | log(y) = -1.65 log(x) + 5.08 | r² > 0.97 |
| J Org Chem | 580 | log(y) = -1.64 log(x) + 4.80 | r² > 0.98 |
| J Organomet Chem | 315 | log(y) = -1.62 log(x) + 4.36 | r² > 0.98 |
| J Phys Chem A | 633 | log(y) = -1.56 log(x) + 4.71 | r² > 0.97 |
| J Phys Chem B | 869 | log(y) = -1.58 log(x) + 5.02 | r² > 0.96 |
| Langmuir | 892 | log(y) = -1.46 log(x) + 4.64 | r² > 0.97 |
| Macromolecules | 561 | log(y) = -1.58 log(x) + 4.65 | r² > 0.97 |
| Org Biomol Chem | 543 | log(y) = -1.39 log(x) + 4.08 | r² > 0.99 |
| Org Lett | 416 | log(y) = -1.58 log(x) + 4.39 | r² > 0.97 |
| Organometallics | 246 | log(y) = -1.78 log(x) + 4.65 | r² > 0.98 |
| Science | 1,113 | log(y) = -1.19 log(x) + 3.91 | r² > 0.98 |
| Tetrahedron | 518 | log(y) = -1.55 log(x) + 4.48 | r² > 0.98 |
| Tetrahedron Lett | 516 | log(y) = -1.59 log(x) + 4.55 | r² > 0.99 |

Table 2: Characterization of the powerlaw distributions of the 21 selected journals
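A fit of the form log(y) = a log(x) + b, as reported in Table 2, can be obtained by ordinary least squares on the logarithms of both variables. The sketch below is a minimal illustration under stated assumptions rather than the procedure used for Table 2: the data points are synthetic, generated from an exact powerlaw with the slope and intercept of the first row of the table, x is treated as a rank running from 1 to 200, base-10 logarithms are assumed, and the function name is hypothetical.

```python
import math

def fit_loglog(xs, ys):
    """Least-squares line through (log10 x, log10 y); returns slope, intercept and r squared."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Synthetic points on an exact powerlaw y = 10^4.37 * x^(-1.43) (illustration only).
xs = list(range(1, 201))
ys = [10 ** 4.37 * x ** -1.43 for x in xs]

slope, intercept, r_squared = fit_loglog(xs, ys)
print(f"log(y) = {slope:.2f} log(x) + {intercept:.2f}, r squared = {r_squared:.4f}")
```

On empirical data the fit is usually restricted to the part of the distribution that looks linear in the log-log plot, since the first ranks can deviate from the line.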
As has been noted before (Barabási et al., 2002; Pennock et al., 2002; Price & Thelwall, 2005), the initial parts of the distributions are typically ‘hooked’ off from the respective curves in the loglinear plots. Thus, there is a first environment of 20-50 journals which form a set with different relations with the journal under study than the larger set that fits the curve. This accords with the typical structure of specialties (20-50 journals) in which intellectually related journals cite each other more systematically than the larger set. The negative powerlaw fits to the scatter in the large tails of the distributions, but not to the core sets. The core sets follow a curvilinear distribution instead of a loglinear one.
In other words, nearby journals in the overall set experience another attraction to one another which is absent in their relations with more distanced journals. The latter pattern exhibits scattering, while the former pattern indicates the intellectual organization of these journals in specialties and fields. The deviatio