KNOWLEDGE REPRESENTATIONS, BAYESIAN INFERENCES,
AND EMPIRICAL SCIENCE STUDIES
Social Science Information 31 (1992), 213-237
(preprint version)
Loet Leydesdorff
Department of Science Dynamics
Nieuwe Achtergracht 166
1018 WV AMSTERDAM
The Netherlands
Abstract
The use of probability theory in philosophy of science, artificial intelligence, and empirical science studies enables us to compare and combine insights from these three hitherto only weakly connected traditions. The importation of Bayes' formula in Shannon's information theory provides us with a model for action/structure contingencies and their dynamic interactions. Bayesian philosophy is then the special case in which the network of hypotheses provides the structure, and the evidence acts upon it. In general, the differences among the three bodies of theory can be understood in terms of whether the information content of the messages about events is evaluated in terms of the a posteriori or in terms of the a priori information content of the system, and whether correspondingly it adds to the information content of the system or to its redundancy. The research programme of empirical science studies can then be delineated as containing a specific philosophy of science.
The issue of how to reconstruct scientific developments is crucial in modern philosophy of science. In artificial intelligence, the focus is on knowledge engineering and knowledge representations. In the empirical sociology of scientific knowledge, one studies knowledge in terms of variance. Although these traditions focus on the same object, i.e., the knowledge content of science, the three corresponding bodies of theory have been hitherto only weakly related.
In this study, I will take probability theory as an entry point to distinguish these three bodies of theory in terms of their underlying models. I will therefore address the issue of variance (in empirical studies) from the angle of Shannon's (1948) information theory (see, e.g., Theil 1972), philosophy of science from the angle of Bayesian philosophy (see, e.g., Howson and Urbach 1989), and artificial intelligence from the perspective of Pearl's (1988) elaboration of the probabilistic approach. This kinship in methods provides me with leverage to study more precisely the differences in theoretical premises. However, my conclusions will be extendable to similar theories which do not use this particular mathematical, i.e., probabilistic, format.
The combination of these three approaches will be fruitful in terms of each of them. In both the Bayesian approach and the information theoretical one relations between a priori and a posteriori probabilities have been formulated in order to assess the value of new evidence or information. The importation of the Bayesian formula for the a posteriori probability (which will be derived in the next section) into an information theoretical framework will provide us with an unexpectedly general model for the study of interaction over time ("coupling") of self-referential systems. The model is applicable, for example, to structure/action contingency relations in sociology, the relation between the development of cognition and of language, and the updating of hypotheses in relation to available evidence. Bayesian philosophy of science can thus be understood as the analysis of a special case of how two dynamic principles (i.e., hypotheses and evidences) interact.
In the various sections of this study, I shall discuss the implications of this finding in the noted contexts. For example, the application of the model to the dynamic development of the relation between cognition and language is of relevance to important debates in both the philosophy of science and discourse analysis.
Artificial intelligence, however, is in the first place a form of engineering, and not a science. I shall specify the gradual shifts in perspective between the two by taking the example of the use of Bayes' Rule. However, the focus on engineering challenges us to specify what the empirical perspective can contribute in practical terms.
More fundamentally, I shall specify the methodological and epistemological differences among the three approaches with respect to the question of whether the a priori distribution or the a posteriori one is taken as the frame of reference for the measurement of the events.
1. An information theoretical evaluation of Bayes' Formula
The probabilistic turn in the philosophy of science is mainly based on the use of Bayes' formula for the evaluation of belief updates: according to Bayesian philosophers, empirical evidence can be assessed in terms of changes in the probabilities of the various hypotheses which may be invoked for its explanation (see, e.g., Howson and Urbach 1989; Phillips 1973). Whether these probabilities are empirically measurable or more subjective is a matter of considerable controversy among Bayesians, and of some relevance to empirical science studies (see, e.g., Giere 1988). I will return to these substantive issues in a later section.
The problem of belief updates is also highly relevant for intelligent knowledge based systems. However, in this context Bayes' formula is used pragmatically without its philosophical connotations. The major advantage of its usage is then that it allows for "local" and incremental updates. This may considerably reduce the amount of computation, and allows for the elaborations of models of parallel distributed computing (Pearl 1988; Leydesdorff 1992d). I will explain these advantages in more detail in a later section, but first I will now introduce the model in more abstract terms.
The derivation of Bayes' theorem follows straightforwardly from the third law of the probability calculus:
p(A and B) = p(A) * p(B|A) (1)
or equivalently:
p(A and B) = p(B) * p(A|B) (1')
Therefore:
p(A) * p(B|A) = p(B) * p(A|B) (2)
and thus:
p(A|B) = p(A) * p(B|A) / p(B) (3)
However, while p(A|B) is the a posteriori probability and p(A) is the a priori one, one can evaluate the expected information content (I) of the message that A is conditioned by B, using Shannon (1948), as follows:[i]
I_{(A|B : A)} = Σ p(A|B) * log{ p(A|B) / p(A)} (4)
Combining (3) and (4) leads to:
I_{(A|B : A)} = Σ p(A|B) * log{ p(A) * p(B|A) / [p(B) * p(A)] }
              = Σ p(A|B) * log{ p(B|A) / p(B) } (5)
This formula can also be written as a difference between two logarithms:
I_{(A|B : A)} = Σ p(A|B) * log{ p(A|B) / p(B) }
              - Σ p(A|B) * log{ p(A|B) / p(B|A) }
              = I_{(A|B : B)} - I_{(A|B : B|A)} (6)
In words: the expected information of the message that A is conditioned by B is equal to an improvement of the prediction of the a posteriori distribution (Σ p(A|B)) if we add to our knowledge of the a priori distribution (Σ p(B)) how it is conditioned by the other distribution (Σ p(B|A)). Note the shift in the respective system of reference: if we know how B is conditioned by A, this improves our prediction of how A is conditioned by B.
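The equivalence of formulas 4, 5 and 6 can be checked numerically. The following is a minimal sketch in Python, not part of the original derivation. It assumes a small hypothetical joint distribution over two states of A and two states of B (the numbers are illustrative only), and it assumes that the Σ runs over the states of A for a given state of B. Logarithms are taken to base 2, so that the results are expressed in bits.

```python
from math import log2

# Hypothetical joint distribution p(A and B); rows are states of A,
# columns are states of B. The numbers are illustrative only.
joint = [[0.30, 0.10],
         [0.20, 0.40]]

p_A = [sum(row) for row in joint]          # a priori distribution p(A)
p_B = [sum(col) for col in zip(*joint)]    # distribution p(B)
states_A = range(len(p_A))

def p_B_given_A(b, a):
    return joint[a][b] / p_A[a]

def p_A_given_B(a, b):
    # Formula 3: p(A|B) = p(A) * p(B|A) / p(B)
    return p_A[a] * p_B_given_A(b, a) / p_B[b]

def info_formula_4(b):
    # Formula 4: sum over A of p(A|B) * log{ p(A|B) / p(A) }
    return sum(p_A_given_B(a, b) * log2(p_A_given_B(a, b) / p_A[a])
               for a in states_A)

def info_formula_5(b):
    # Formula 5: sum over A of p(A|B) * log{ p(B|A) / p(B) }
    return sum(p_A_given_B(a, b) * log2(p_B_given_A(b, a) / p_B[b])
               for a in states_A)

def info_formula_6(b):
    # Formula 6: the same quantity written as a difference of two terms
    first = sum(p_A_given_B(a, b) * log2(p_A_given_B(a, b) / p_B[b])
                for a in states_A)
    second = sum(p_A_given_B(a, b)
                 * log2(p_A_given_B(a, b) / p_B_given_A(b, a))
                 for a in states_A)
    return first - second

for b in range(len(p_B)):
    print(b, info_formula_4(b), info_formula_5(b), info_formula_6(b))
```

For each state of B the three functions return the same value up to rounding, which is what the analytical derivation leads one to expect.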
In itself the derivation of these formulas is still completely analytical. (The circularity of this reasoning has always been the major objection against the Bayesian perspective.) However, if we make the perspective dynamic, and we consider A and B as coupled systems, the formulas are not trivial. On the contrary, I will show that formula 5 provides us with a probabilistic model for the design of relations between two (or more) structurally coupled self-referential developments.
Application to the problem of structure/action relations in sociology
If we assume a structure A and a distribution of actors B, at each moment in time (t = t) the actors will take action given the structure; thus, B is conditioned by A when it operates. Action may have an impact on structure A at the next moment (t = t + 1). This effect can be expressed in terms of the amount of information expected from the message that A has become conditioned by B. (See Figure 1 for a visual representation.)
If more actors operate, A can be considered as a network of which these actors are the nodes (Burt 1982). At each moment in time, the network conditions all the actors; if action takes place at any node(s) this conditions the network at the next moment. This model can easily be generalized to the model of parallel distributed computers: each actor is comparable to a local processor which acts on the basis of its own programme given the conditions set by the network; while the operation of the network, which operates according to its own programme, is conditioned by the sum total of previous actions.
If we initially (at t = t) had knowledge only of the distribution of the nodes (Σ p(B)), and then became informed of how action at the nodes is distributed given the network (Σ p(B|A)) at this moment, formulas 4 and 5 from the previous section teach us that this would provide us with the same expected information (I) about the network distribution, given the nodes at t = t + 1 (i.e., the a posteriori distribution), as would the message that the network distribution was conditioned by the nodes. Therefore, the formula explicates how action at the nodes and the network are conditioned by each other dynamically: the right-hand factor ( p(B|A) / p(B) ) describes the instantaneous conditioning of any action by structure at t = t, while the left-hand factor contains the description of the network after action, i.e., p(A|B) at t = t + 1. The I then expresses the improvement in the prediction of the later (a posteriori) state of the network, when we know how the action was conditioned by the network at the previous moment (a priori). We may also write equation 5 (above) as follows:
I_{(A|B : A)} = Σ p(A|B)_{posterior} * [log{p(B|A)/p(B)}]_{prior} (5')
The right-hand factor has B as its system of reference; it describes how the actors are conditioned by the network at t = t. The left-hand factor indicates the impact on the network at the next moment.
Analytically, the resulting improvement of the prediction of A at the later moment is due to the mutual information between A and B at the earlier moment. It is an evaluation of how much our knowledge of this mutual information ("transmission") at the earlier moment informs us about the conditioning at the later moment. However, mutual information (T) between distributions is defined in a static model in terms of the uncertainties (H) in the distributions[ii] as follows (see also Figure 2):
H_{AB} = H_{A} + H_{B|A} = H_{B} + H_{A|B} (7)
T_{AB} = H_{A} - H_{A|B} = H_{B} - H_{B|A} (8)
Therefore, the relation between two terms in the right hand factor of equation 5' (p_{B|A} and p_{B}) informs us only about the static transmission at t = t, i.e., about the impact of the vertical arrow in Figure 1, which indicates the instantaneous conditioning of action by structure (cf. Giddens 1979). At any moment, knowledge of the uncertainty in the network improves our prediction of the uncertainty in the action systems, but only for the part of the transmission, i.e., H_{B} - H_{B|A} ( = T ), and not for the other part of the uncertainty (H_{B|A}). In other words: at each moment, how B is conditioned by A does not inform us about how A is conditioned by B, but only about how the conditioning reduces the remaining uncertainty. The remaining uncertainty itself in A (or B) remains "free": A represents a self-referential system which determines its own total uncertainty. Its uncertainty is minimally equal to its (a priori) expected information content, and maximally limited only by the logarithm of A's elements (log n_{A}). However, if there is conditionality in the (static) probabilistic relations between A and B, there is necessarily also dynamic coupling. After the cycle (a posteriori), the fact that a specific interaction with B has occurred belongs to the history of A (i.e., "is a given for A"). Note that dynamic coupling is thus technically a consequence of the (static) mutual information between A and B.
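The static relations in equations 7 and 8 can be illustrated with a minimal sketch in Python. It assumes a hypothetical joint distribution of A and B (illustrative numbers only) and computes the conditional uncertainties directly from the conditional probabilities, so that both equations can be checked rather than presupposed.

```python
from math import log2

# Hypothetical joint distribution p(A and B); rows are states of A,
# columns are states of B. The numbers are illustrative only.
joint = [[0.30, 0.10],
         [0.20, 0.40]]

n_A, n_B = len(joint), len(joint[0])
p_A = [sum(row) for row in joint]
p_B = [sum(joint[a][b] for a in range(n_A)) for b in range(n_B)]

def entropy(dist):
    # Shannon's H = - sum p * log2(p), in bits; empty cells contribute nothing.
    return -sum(p * log2(p) for p in dist if p > 0)

H_A = entropy(p_A)
H_B = entropy(p_B)
H_AB = entropy([p for row in joint for p in row])

# Conditional uncertainties, computed from the conditional probabilities.
H_B_given_A = -sum(joint[a][b] * log2(joint[a][b] / p_A[a])
                   for a in range(n_A) for b in range(n_B) if joint[a][b] > 0)
H_A_given_B = -sum(joint[a][b] * log2(joint[a][b] / p_B[b])
                   for a in range(n_A) for b in range(n_B) if joint[a][b] > 0)

# Equation 8: the transmission, computed from either side.
T_from_A = H_A - H_A_given_B
T_from_B = H_B - H_B_given_A

print(H_AB, H_A + H_B_given_A, H_B + H_A_given_B)   # equation 7
print(T_from_A, T_from_B)                            # equation 8
print(H_A, log2(n_A))                                # H_A and its maximum
```

In this sketch the transmission is the same from either side, and the uncertainty in A does not exceed the logarithm of the number of its elements.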
A communication network A is structurally coupled to the nodes (B) of this network (cf. Luhmann 1984). However, if the nodes pursue their own respective operations, the effects of the boundedness of A and B to each other are not necessarily the same in either case. Since I can also be used as a measure of the quality of the prediction, the formulas allow us to develop a measure of whether (and to what extent) the one type of data in empirical research on structure/action contingencies represents structure, while the other represents action, or vice versa. One can easily imagine designs in which what is action at one moment in time will operate as structure at a later moment. Methodologically the two perspectives are symmetrical tests, just as they are conceptually symmetrical in the idea of a "mutual shaping" of structure and action by each other. If the systems A and B are completely coupled in one operation, the I_{(A|B : A)} is equal to I_{(B|A : B)}. (See the example, below.)
Furthermore, the improvement in the prediction cannot be negative, since formula 5 is equivalent to formula 4; and the latter can be shown to be necessarily non-negative (Theil 1972: 59f.). Therefore, the prediction of "structure given action" at the later state is never worsened, and as a rule improved, if we know how structure conditioned action at the previous stage. In each cycle, there is an increase of expected information content, since the new value H_{A|B} will be the initial value (H_{A}) for the next cycle. Therefore, a structure/action contingency produces Shannon-type information, i.e., entropy in the probabilistic interpretation of this concept (cf. Bailey 1990), and thus has a history. However, having a history does not imply that this history is always important for its further development. Historical information can sometimes lose its relevance (for example, if the system has the Markov property or goes through path-dependent transitions (cf. Leydesdorff 1992a)).
In summary: in coupled systems like structure/action relations, action pre-sorts information for structure, while it is itself conditioned by structure. However, the total information content of structure is independent of action: each system remains also "free" at each moment in time. The improvement in the prediction is based only on the local interaction at the a priori moment. Furthermore, because of the Σ in the formula the improvements are additive, and can therefore also be decomposed into single actions or subgroups of actions. This allows us to study the impact of specific actions on structure in detail.
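The additivity noted in this summary can be illustrated with a minimal sketch in Python. It rests on an assumption that is only one possible reading of the Σ in formula 5: the sum is taken over all cells of a joint distribution of structure (A) and action (B), so that each state of B contributes p(B) times its own expected information. The joint distribution is hypothetical and the numbers are illustrative only.

```python
from math import log2

# Hypothetical joint distribution; rows are states of structure (A),
# columns are three possible actions (B). Illustrative numbers only.
joint = [[0.20, 0.10, 0.10],
         [0.05, 0.25, 0.30]]

n_A, n_B = len(joint), len(joint[0])
p_A = [sum(row) for row in joint]
p_B = [sum(joint[a][b] for a in range(n_A)) for b in range(n_B)]

def contribution(b):
    # p(b) * sum over A of p(A|b) * log2{ p(b|A) / p(b) }, which equals
    # the sum over A of p(A and b) * log2{ p(b|A) / p(b) }.
    return sum(joint[a][b] * log2((joint[a][b] / p_A[a]) / p_B[b])
               for a in range(n_A) if joint[a][b] > 0)

per_action = [contribution(b) for b in range(n_B)]
total = sum(per_action)

def entropy(dist):
    return -sum(p * log2(p) for p in dist if p > 0)

# Transmission from the uncertainties, for comparison with the total.
T = entropy(p_A) + entropy(p_B) - entropy([p for row in joint for p in row])

# A subgroup of actions is evaluated by adding its members' contributions.
subgroup = per_action[0] + per_action[1]

print(per_action)
print(total, T)
print(subgroup, total - per_action[2])
```

Under this reading the contributions of the single actions add up to the transmission, and any subgroup of actions can be evaluated by summing the corresponding terms.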
An empirical example
For an empirical illustration, let us use the transaction matrix of the aggregated citation data of 13 major chemistry journals among each other. These matrices can be easily compiled using the Journal Citation Reports of the Science Citation Index.[iii] I will use the 1984 matrix for the journals listed in Table 1 as the a priori distribution; the 1985 one as the a posteriori distribution. The "being cited patterns" are taken as structure, and "citing" as action. The two systems are completely coupled, since there is only one operation, viz. citation. Since the citation matrix contains information with respect to both the cited and the citing dimension, it should provide us with an opportunity to make predictions about the impact of citation behaviour on citation structure in the next year.
In 1984, a static interpretation of formulas 4 or 5 gives an I (which is necessarily equal to the mutual information T between "citing" and "cited" as obtained by using formula 8) of 964.2 mbits. For 1985, this is 972.4 mbits, i.e., 8.2 mbits more. However, in the dynamic model (i.e., by using formula 5'), we find an improvement in the prediction for 1985 to 969.7 mbits on the basis of 1984 data. This means that 5.5 mbits of the 8.2 mbits change in the transmission (67.1%) can be attributed to the previous transmission. In other words: the increase in the coupling is above expectation. (One reason for this may be that there is more structure in the data--e.g., feedback among these journals--than assumed above. For an elaboration of the example, see Leydesdorff 1992c.)
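The arithmetic behind these figures can be retraced in a few lines of Python. The three values are those reported in the paragraph above; the variable names are introduced here for illustration only.

```python
# Values in millibits, as reported in the text above.
t_1984 = 964.2       # static transmission in the 1984 matrix
t_1985 = 972.4       # static transmission in the 1985 matrix
predicted = 969.7    # prediction for 1985 on the basis of 1984 (formula 5')

change = t_1985 - t_1984          # observed change in the transmission
explained = predicted - t_1984    # part attributable to the previous transmission
share = 100 * explained / change  # as a percentage of the observed change

print(f"change in transmission: {change:.1f} mbits")
print(f"attributable to 1984:   {explained:.1f} mbits ({share:.1f}%)")
```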
Since the operations of "cited" and "citing" are mutual in this universe of 13 journals, this result would remain the same if the matrix were transposed. However, in the parallel computer model developed above, one may also assume a communication system between the cited and the citing journals, in which case the operations are mediated, and thus in principle asymmetrical. In this case, one needs an independent operationalization (e.g., in terms of the eigenstructure of the matrix)[iv] of what the communication system does when articles in these journals cite each other.
2. Bayesian philosophy of science
In Bayesian philosophy of science, this argument for structure/action contingencies is repeated, but now not with respect to social structure, but to cognitive structure operationalized in terms of hypotheses. Structure is in this case the set of hypotheses H, and action is the evidence e. Thus, Bayesian philosophers use Bayes' theorem (formula 3 above) in the following format:
p(H|e) = p(e|H) * p(H) / p(e) (6)
and the information content of the message is correspondingly:
I_{(posterior:prior)} = Σ p(H|e) * log { p(e|H) / p(e) } (7)
This I, the expected information value of the evidence for the a posteriori distribution of the hypotheses, is equal to the information improvement for predicting the latter from the evidence, given the prior belief distribution, over the prediction without the prior belief distribution. Since p(e) is essentially a normalization term, we may more simply state that the expected information value of the evidence for changing the a priori belief distribution into the a posteriori distribution is equal to the improvement of the prediction from the evidence brought about by accepting the a priori hypotheses. This result illustrates how intrinsically hypotheses and evidence shape each other in Bayesian reasoning: one can look at the belief update either in terms of data becoming more informative by accepting an a priori belief structure, or in terms of a belief structure being informed by new data. However, from the Bayesian perspective, the latter is the correct approach.
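The two ways of looking at a belief update can be compared in a minimal sketch in Python. The prior belief distribution over three hypotheses and the likelihoods of the evidence are hypothetical, chosen only for illustration; logarithms are taken to base 2.

```python
from math import log2

# Hypothetical a priori belief distribution p(H) over three hypotheses,
# and hypothetical likelihoods p(e|H) of one piece of evidence e.
prior = [0.5, 0.3, 0.2]
likelihood = [0.8, 0.4, 0.1]

# p(e) acts as the normalization term in Bayes' theorem.
p_e = sum(p * l for p, l in zip(prior, likelihood))

# A posteriori belief distribution: p(H|e) = p(e|H) * p(H) / p(e)
posterior = [p * l / p_e for p, l in zip(prior, likelihood)]

# The data becoming more informative by accepting the prior hypotheses:
# sum over H of p(H|e) * log{ p(e|H) / p(e) }
info_evidence = sum(post * log2(l / p_e)
                    for post, l in zip(posterior, likelihood))

# The belief structure being informed by the new data:
# sum over H of p(H|e) * log{ p(H|e) / p(H) }
info_update = sum(post * log2(post / pr)
                  for post, pr in zip(posterior, prior))

print(posterior)
print(info_evidence, info_update)
```

Both expressions yield the same number of bits in this sketch, in line with the equivalence of formulas 4 and 5 in the previous section.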
Obviously, we need to distinguish between Bayes' theorem and Bayesian philosophy of science. Bayes' theorem in itself is not controversial, since it is an analytical consequence of the fundamental laws of the probability calculus. It is rather the interpretation which Bayesians give to the theorem which is controversial. As Phillips (1973, p. 63) expressed it:
"For a Bayesian, prior probabilities and likelihoods are degrees of belief; both are the results of human judgement. And since all inferential procedures in statistics are variations on the general theme of revision of opinion in the light of new information, Bayes' theorem plays a central role. But for the statistician who takes a relative frequency view of probability it is rare that he can give a relative frequency interpretation to a prior probability, so he makes little use of Bayes' theorem."
However, the information theoretical approach extends the dynamic approach with a multi-variate perspective (cf. Leydesdorff 1992b), and therefore the relative frequency view can be integrated with the Bayesian approach. Bayes' theorem, as we will see in a later section, contributes to the information theoretical approach by making it possible to assess the impact of local events locally, so that we do not have to recompute all the probability distributions in reaction to each event.
Whether the a priori distribution of beliefs is in itself an empirical matter, as subjectivist Bayesians maintain, or a matter for idealization from empirical data, as some objectivist Bayesians claim, is a separate debate among Bayesians. However, whatever their positions in this debate, Bayesians share the belief that a scientist should quantify his opinions as probabilities before (in the logical sense of this word) performing an experiment, and then use Bayes' theorem formally to revise these prior probabilities to yield new, posterior ones. These posterior probabilities are taken as the scientists' revised opinions in the light of the information provided by the data.[v] In a Bayesian framework, neither events nor beliefs have significance in themselves: they constitute one another, necessarily, in cycles of belief updating.
However fruitful and fascinating this enterprise may be in itself, in relation to the empirical study of the sciences, the Bayesian perspective raises several questions. First, with respect to the status of its results, it has become obvious from many studies that scientists in general do not behave as Bayesian agents;[vi] so that Bayesian philosophy is not popular as a paradigm for a sociology of science. Bayesian philosophy offers a normative perspective: it claims to be able to explain why some scientists were successful, while others were not, and to offer a scheme intended to maximize the likelihood of success for scientists. Therefore, it is at best a general methodological tool which operates within the sciences at the level of knowledge production and control, regardless of whether or not the scientists involved are aware of it. (Whether, and to what extent scientists behave as Bayesian agents is again an empirical question.)
Secondly, in the multi-dimensional space of heterogeneous objects which constitute the sciences as complex phenomena (scientists, knowledge contents, texts, etc.; cf. Callon et al. 1986; Leydesdorff 1990c), Bayesian philosophy takes the beliefs and opinions which are attributed to (concrete or ideal) scientists and relates them to the knowledge content of science. Therefore, in terms of this space it addresses the same sectional plane (of scientists and knowledge contents) as the sociology of scientific knowledge. However, it has a much less sophisticated understanding of how beliefs and opinions are generated in science, and of why they may vary (see, e.g., Gilbert and Mulkay 1984). In the sociology of scientific knowledge, it has also been noted that evidence is not just evidence, but is brought to bear by various forms of action (Pinch 1985).
Indeed, in Bayesian philosophy "evidence" is the motor of further development. It relates to the belief-structure of the scientists as action relates to structure in the action/structure contingency model which we discussed above. Evidence can be aggregated, and the total impact on the belief-structure can be decomposed in terms of pieces of evidence. Therefore, the sociological structure/action contingency relation can be seen as a general problem, and Bayesian philosophy as a specific elaboration of it with respect to a certain domain (i.e., "cognitive structure" and "cognitive action") and with the help of certain assumptions and idealizations (e.g., concerning measurement).
Although Bayesian philosophy recognizes belief and evidence as two dynamic principles, it misses the point that in a dynamic model one has to operationalize not only "what the hypotheses do when they hypothesize," but also what "the evidence does when it evidences." Only thereafter will one be able to explain the resulting dynamics (the explanandum) in empirical terms. Without an explanatory framework, the description of the development of the resulting variable (i.e., the belief-structure of hypotheses) can only be evaluated normatively.
A related point can be made with respect to the position of language in Bayesian philosophy: scientists do not communicate about probabilities but about substantive theorizing, which may, of course, also be of a probabilistic nature. The focus on belief updates in terms of probabilities sometimes obscures the essence of the communication process and the use of language to maintain communication. In my opinion, language (text, discourse) is a separate third structure, which may constrain and/or promote the development of belief-structures and the possibilities for testing them against evidence. The dynamic relations between language and scientific development are the object of a whole new set of research questions which have been developed in the realm of artificial intelligence, drawing on computational linguistics, information retrieval, etc. However, Bayesian philosophers either reconstruct using their own meta-language, or too easily equate language with cognition.
This latter position is held by the British philosopher Mary Hesse (1974; 1980). She maintains that science gets its shape only in and through language: her plea is against a meta-language of "correspondence rules," etc., as proposed by positivism. The fact that Hesse combines this view with a Bayesian perspective on scientific inference, while others combine it with more positivist or critical rationalist traditions, illustrates yet another problem in Bayesian thinking: if we cannot derive the prior probabilities empirically from an opinion poll among scientists, can we take them from scientific discourse, and if so, how (and why?). The relation between belief, evidence and language has not yet been elaborated in empirical terms by these authors.
The crucial point is to consider the different categories no longer as hierarchical but as heterarchical dynamic systems which influence one another in specific interactions. Knowledge is not buried in language, from which it can then be reconstructed, or with which it can be equated; nor is evidence. Language and cognition are dynamic systems which operate analytically independently, but they are structurally coupled (by human actors), and therefore one can specify when one has to update one's belief structures in light of certain discursive evidence, or why and when one has to change one's vocabulary given new concepts and insights. The parallel computer model adds to this mechanism of structural coupling the locality and incrementality of change: each system has to decide according to its own rules how far it has to backtrack in order to update; while how revolutionary or normal the update has to be remains an empirical question!
However, once this relativization of Bayesian philosophy is made, the challenge is then to import the results of Bayesian studies in order to provide us with the probability distributions at different points in time for developments in the cognitive dimension. If possible, we may subsequently ask how these probabilities change with distributions of other indicators which can either be measured in scientific texts directly or be related to the opinions of practicing scientists given in interviews and assembled in surveys (see, e.g., Leydesdorff 1990a; 1990b).
Bayesian philosophy provides us mainly with a descriptive tool for exploring one of the relevant dimensions in empirical science studies. At least at this moment, the Bayesian programme seems more vigorous than the alternative of measuring the expected information value of scientific evidence. Among others, Mitroff (1972) once tried to evaluate the value of expeditions to the moon by the various Apollo flights in terms of the expected information content of the evidence they brought to earth for the competing theories about the origins of the moon. Mitroff (1972, pp. 158ff.) reported that although the shift in judgement was significant in terms of the difference between two belief distributions (a priori and a posteriori), the amounts of transferred information were extremely small (10^{-2} - 10^{-1} bits). Part of the reason for this seems to be that this study reveals that under uncertainty some scientists will use evidence to upgrade their belief in theory T_{1} and others to upgrade their belief in T_{2}, and that therefore the effects in terms of distributions may cancel each other out. Thus, the assumption of Bayesian philosophy that information will help us to reach consensus in the long run, seems sociologically debatable. The paradox is that the uncertainty, as noted, increases with each cycle of the process.[vii]
In summary, I have argued in this section that the Bayesian approach in the philosophy of science is the special case of a contingency relation between dynamic structure (i.e., sets of hypotheses) and dynamic action (i.e., evidence). The domain (i.e., "cognitions") is specific, as are the operationalizations which various branches of Bayesian philosophers apply in order to "measure" beliefs. The information theoretical reformulation of the Bayesian formula strongly suggests a model of mutual shaping between hypotheses and evidence (or between cognition and language) which replaces the internal relations between those dimensions with external, empirical categories.
The Use of the Bayesian Theorem in Artificial Intelligence
In artificial intelligence a much more pragmatic use is made of Bayes' theorem. The use of Bayesian statistics localizes the relevant environment for new knowledge, and therefore limits the amount of computation needed for updating in response to evidence (Pearl 1988). Such updating had been seen as one of the major disadvantages of a probabilistic approach to artificial intelligence, which is otherwise attractive since it allows for handling (local) context-dependencies as conditional probabilities. As Pearl (1988, at p. 35) put it:
"the power of Bayesian techniques comes primarily from the fact that in causal reasoning the relationship P(e|H) is fairly local, namely, given that H is true, the probability of e can be estimated naturally and is not dependent on many other propositions in the knowledge base. For example, once we establish that a patient suffers from a given disease H, it is natural to estimate the probability that he will develop a certain symptom e. The organization of medical knowledge rests on the paradigm that a symptom is a stable characteristic of the disease and should therefore be fairly independent of other factors, such as epidemic conditions, previous diseases, and faulty diagnostic equipment. For this reason the conditional probabilities P(e|H), as opposed to P(H|e), are the atomic relationships in Bayesian analysis. The former possess modularity features similar to logical production rules. They convey a degree of confidence in rules such as "If H then e," a confidence that persists regardless of what other rules or facts reside in the knowledge base."
In addition to this potential to localize the impact of new information, Bayesian probability can also be shown to be recursive (Pearl 1988, p. 37): if e_{n} denotes a sequence of data observed in the past, and e denotes a new fact, we do not have to include all the data of the sequence from the past in order to compute the posterior probability P(H|e_{n},e), but we can instead use the old belief P(H|e_{n}) as the prior probability in the computation of new impact; it completely summarizes the past experience, and for updating need only be multiplied by the likelihood function P(e|e_{n},H), which measures the probability of the new datum e, given the hypothesis and the past observations.[viii]
Thirdly, by using the information theoretical reformulation above, problems of aggregation and disaggregation become fully tractable. These properties of Bayesian and probabilistic reasoning make it possible to limit the effects of new evidence, i.e., the "propagation" of belief update, which otherwise would make it necessary to recalculate all the probabilities in the light of new data. (As noted, this has been an argument against using probabilistic models.)
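The recursive property (Pearl 1988, p. 37) can be illustrated with a minimal sketch in Python. It makes a simplifying assumption which is not required by the general formula: each new datum is taken to be independent of the earlier data given the hypothesis, so that P(e|e_{n},H) reduces to P(e|H). The prior and the likelihoods are hypothetical.

```python
# Hypothetical a priori belief distribution over three hypotheses.
prior = [0.5, 0.3, 0.2]

# likelihoods[k][h] is the hypothetical probability of datum k given
# hypothesis h. Assumption: the data are independent given the hypothesis.
likelihoods = [[0.8, 0.4, 0.1],
               [0.6, 0.5, 0.3],
               [0.7, 0.2, 0.9]]

def update(belief, likelihood):
    # One application of Bayes' theorem: multiply by the likelihood
    # and normalize.
    unnormalized = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Recursive updating: the old posterior serves as the new prior, so the
# past data need not be stored or revisited.
belief = prior
for likelihood in likelihoods:
    belief = update(belief, likelihood)

# Updating on the whole sequence at once, for comparison.
combined = [1.0] * len(prior)
for likelihood in likelihoods:
    combined = [c * l for c, l in zip(combined, likelihood)]
batch = update(prior, combined)

print(belief)
print(batch)
```

Both routes lead to the same posterior distribution up to rounding, which is why the old belief can be said to summarize the past experience in this setting.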
Should we count this potential advantage as an argument in favour of Bayesian philosophy? In my opinion, that would be a category mistake: the argument for using Bayes' theorem in artificial intelligence (e.g., Pearl 1988) is a pragmatic one, concerning the role which Bayes' Rule can play in offering shortcuts in computation in connection with an otherwise technically daunting approach to certain problems in constructing expert systems. Whether the knowledge contained in the expert systems should also be expressed in terms of Bayesian belief distributions depends on the type of expert system which we wish to develop. If we indeed want an expert system which is able to express different beliefs about states of affairs, and to update those beliefs in the light of new information-- this is almost the definition of an expert system-- we may use a Bayesian framework, since it also provides us with an elegant inference engine. However, we may wish to use other inference engines; and, if we develop intelligent knowledge based systems for other purposes, for example, for classification and retrieval, we may still use the properties of Bayesian formulas for updates without necessarily requiring the expression of beliefs only.
One argument in favour of choosing a Bayesian framework at all relevant levels of analysis may be that of elegance, i.e., of using the same method in knowledge engineering, inferencing, updating, etc., with probable advantages in computing and interfacing. However, there may be pragmatic reasons for choosing other solutions. Moreover, we have as yet no convincing models of how to get the "Bayesianized" knowledge into the knowledge-based system, since most promising work has hitherto remained reconstructive at the philosophical level. (See, e.g., Dorling 1972; Rosenkrantz 1977 and 1980; Howson and Franklin 1985; Franklin 1986. See also: Giere 1988; Howson and Urbach 1989.) We do not know whether scientists are in practice able to deconstruct complex "facts" and "insights" into events, to which they can then attach probabilities. This ability would be a necessary prerequisite at least for updating by a user-scientist; and of course we cannot leave the update of a running expert system only to Bayesian philosophers.
The decomposition of the a posteriori state in terms of the a priori one
As we mentioned, the recursive formulation of Bayes' Rule given above is only true on the assumption that "given that H is true, the probability of e can be estimated naturally and is not dependent on many other propositions in the knowledge base." The negation of the possibility of assessing e independently of other propositions in the knowledge base is also known in the philosophy of science as the Duhem thesis: new evidence e does not necessarily update H, but can also be evaluated in relation to other (auxiliary; e.g., instrumental) propositions in the knowledge base. However, Dorling (1979; 1982) has argued that this problem can be completely solved within Bayesian philosophy of science by showing that the effects of e on the hypothesis (H) and on other propositions (e_{n}) are asymmetrical.
Let us generalize the problem for action/structure contingency relations, or more generally the interaction of two dynamic principles, and again use the information theoretical perspective. (I return to the previous notation, with A indicating structure and B action.) After a given event B, the total uncertainty in the structure A can be written as follows:
H_{(A|B)} = - Σ p_{(A|B)} * log(p_{(A|B)})
By using Bayes' formula, we can evaluate this a posteriori result into its a priori components, as follows:
H_{(A|B)} = - Σ [{p_{(A)} * p_{(B|A)}} / p_{(B)}] * log[{p_{(A)} * p_{(B|A)}} / p_{(B)}]
= - Σ [p_{(A)} * {p_{(B|A)}/ p_{(B)}}] * [ log{p_{(A)}} + log{p_{(B|A)}/ p_{(B)}}]
(I postpone the issue of the interpretation of {p_{(B|A)}/ p_{(B)}} as an a priori system to the next section, and proceed with the decomposition.)
H_{(A|B)} = - Σ p_{(A)} * log{p_{(A)}} - Σ {p_{(B|A)}/ p_{(B)}} * log{p_{(B|A)}/ p_{(B)}} +
- Σ p_{(A)} * log{p_{(B|A)}/ p_{(B)}} - Σ {p_{(B|A)}/ p_{(B)}} * log{p_{(A)}}
= H_{(A)} + H_{(B|A)/(B)} +
- Σ p_{(A)} * log{p_{(B|A)}/ p_{(B)}} - Σ {p_{(B|A)}/ p_{(B)}} * log{p_{(A)}}
= H_{(A)} + H_{(B|A)/(B)} +
- Σ p_{(A)} * log{p_{(A|B)}/ p_{(A)}}
- Σ {p_{(B|A)}/ p_{(B)}} * log[p_{(A|B)}/ {p_{(B|A)}/ p_{(B)}}]
= H_{(A)} + H_{(B|A)/(B)} +
+ Σ p_{(A)} * log{p_{(A)}/ p_{(A|B)}}
+ Σ {p_{(B|A)}/ p_{(B)}} * log[{p_{(B|A)}/ p_{(B)}} / p_{(A|B)}]
Thus, the total uncertainty of the system a posteriori is equal to the sum of the uncertainties of two a priori systems (A and (B|A)/B) plus the sum of the information values of the messages that these systems have merged into one a posteriori structure. The sum of the two additional terms[ix] is equivalent to what may also be called the "in-between group" uncertainty (H_{0}) upon decomposition of the total uncertainty into H_{(A)} and H_{(B|A)/(B)}. Note the analogy between "later" and "more aggregated": both contain more information.
However, the "in-between group" uncertainty is composed of two terms, i.e., the difference which it makes for the one a priori subset in relation to the a posteriori set, and the difference it makes for the other. Indeed, this result is consistent with Dorling's (1979) thesis: an update cycle affects two (or more) a priori systems asymmetrically.
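The opening steps of the decomposition can be traced numerically in a minimal sketch. The distributions are hypothetical, and p_{(B)} is obtained here by the usual marginalization over the states of A, which is an assumption of this sketch. It covers only the Bayes substitution, the expansion of the logarithm, and the identity p_{(B|A)}/p_{(B)} = p_{(A|B)}/p_{(A)} used when rewriting the additional terms; it does not cover the later regrouping into separate sums.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Hypothetical a priori distribution over three states of the structure A.
p_A = [0.5, 0.3, 0.2]
# Hypothetical probability of the event B under each state of A.
p_B_given_A = [0.8, 0.4, 0.1]

# Marginal probability of B (standard marginalization; an assumption here).
p_B = sum(pa * pb for pa, pb in zip(p_A, p_B_given_A))

# The ratio p(B|A)/p(B) and the a posteriori distribution p(A|B) via Bayes.
ratio = [pb / p_B for pb in p_B_given_A]
p_A_given_B = [pa * r for pa, r in zip(p_A, ratio)]

# A posteriori uncertainty, computed directly ...
H_post = entropy(p_A_given_B)
# ... and from the a priori components after expanding the logarithm.
H_post_expanded = -sum(
    pa * r * (math.log2(pa) + math.log2(r)) for pa, r in zip(p_A, ratio)
)

# The term sum p(A) * log{p(A)/p(A|B)} that appears among the additional terms.
message_term = sum(pa * math.log2(pa / pab) for pa, pab in zip(p_A, p_A_given_B))

print(H_post, H_post_expanded, message_term)
```

With these numbers the two routes to H_{(A|B)} coincide, and the term Σ p_{(A)} * log{p_{(A)}/ p_{(A|B)}} comes out non-negative.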
In the above formula, I decomposed the a posteriori expected information content H_{(A|B)} into various parts which are given a meaning (on the right-hand side of the equation) in terms of the a priori states of the respective systems. Paradoxically, this is precisely what Bayesians always do, although they use a different rhetoric. The Bayesian frame of reference is not the a posteriori situation, but the a priori one. For example, the philosopher asks what it means for the prior hypothesis that a piece of evidence becomes available. That the hypothesis (or, analogously, his or her belief) itself may have changed, and thus no longer be the same hypothesis, is usually of little concern to him or her. The Bayesian, in other words, is not interested in the further development of the a priori stage into the a posteriori one thanks to the new evidence, but only in the corroboration or falsification of the a priori hypothesis.
However, from a social science perspective one is interested in what happened empirically, and not only in what this means in terms of the previous stage. Explication of the latter only adds to the redundancy. Of course, this will in itself have a positive function for the reflexive understanding (Luhmann 1990, pp. 469ff.). However, in this respect, Bayesian philosophy of science is normatively oriented. As I will show in a later section, the fundamental shift of perspective in empirical science studies towards giving priority to the study of what happens in terms of its information content also has implications for assessing artificial intelligence.
The evidencing of the evidence
Let us first turn to the interpretation of the a priori system which can be described with Σ p_{(B|A)}/p_{(B)}. Obviously, this is the ratio between the uncertainty in the action system which is left free by structure and the total uncertainty in the action system. Since p_{(B)} = Σ_{A} p_{(B|A)} if the actions are independent, in this case:
p_{(B|A)}/p_{(B)} = p_{(B|A)} / Σ_{A} p_{(B|A)}
and the denominator can be considered as a normalization term for the size of the action system.[x]
Thus, a self-referential structure does not merge with the uncertainty in events within its environment, but only with the uncertainty which it conditions in these events, and after normalization for the size of the action system. However, if the actions (events) are not independent (e.g., one action triggers other actions before there is an impact on structure), the above equality does not hold, and there can be a size effect on structure. In the sociology of scientific knowledge, for example, attention has been given to how evidence is brought to bear (Pinch 1985), and thus the relation between various actions is a prerequisite for the evidencing of the evidence. Action may then begin to behave as another system.
3. Expert systems in science
We know that expert systems function rather well in certain environments (diagnosis, etc.), particularly if the underlying knowledge base is rather codified. In an impressive study, Langley et al. (1987) showed that they could develop expert systems (BACON I, BACON II, etc.) which by using very elementary assumptions were able, for example, to infer Boyle's law when provided with Boyle's laboratory notes, etc. The crucial assumptions concern the psychology of discovery (the heuristics), by which scientists usually first try to find a linear fit for the data, etc. In this case, the inference rules are not Bayesian, but are informed by psychological theory.
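A toy sketch may indicate how such elementary heuristics can recover a law like Boyle's from tabulated data. It is loosely modelled on the product/ratio heuristics usually attributed to BACON and is not a reproduction of the programs of Langley et al.; the function names, the tolerance, and the synthetic readings (which are not Boyle's laboratory notes) are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def is_constant(values, tol=0.01):
    """True if the values vary by less than tol relative to their mean."""
    return pstdev(values) <= tol * abs(mean(values))

def trend(xs, ys):
    """+1 if ys rise strictly with xs, -1 if they fall strictly, 0 otherwise."""
    pairs = sorted(zip(xs, ys))
    diffs = [b[1] - a[1] for a, b in zip(pairs, pairs[1:])]
    if all(d > 0 for d in diffs):
        return 1
    if all(d < 0 for d in diffs):
        return -1
    return 0

def find_invariant(xs, ys, name_x="x", name_y="y"):
    """Apply three simple heuristics and return a candidate law, or None.

    Assumes non-zero x values, since ratios are formed.
    """
    if is_constant(ys):
        return f"{name_y} = const"
    direction = trend(xs, ys)
    if direction > 0:
        # Terms rise together: consider their ratio.
        if is_constant([y / x for x, y in zip(xs, ys)]):
            return f"{name_y}/{name_x} = const"
    if direction < 0:
        # One term rises as the other falls: consider their product.
        if is_constant([x * y for x, y in zip(xs, ys)]):
            return f"{name_x}*{name_y} = const"
    return None

# Synthetic pressure/volume readings, generated for illustration only.
pressure = [1.0, 1.5, 2.0, 3.0, 4.0]
volume = [24.0 / p for p in pressure]

law = find_invariant(pressure, volume, "P", "V")
print(law)
```

On these synthetic readings the sketch proposes that the product of pressure and volume is constant; a system of the BACON type would go on to define new terms and apply the heuristics to them recursively, which is omitted here.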
The sometimes astonishing successes of expert systems based on such simple assumptions suggest that the task of building more informed models should not be too complex. However, we should keep in mind that the purpose of (knowledge) engineering is pragmatic, e.g., to produce a user interface which is sufficiently reliable according to some criteria of use; while the purpose of model formulation is primarily theoretical, i.e., aiming at a better understanding of structure in the data. Knowledge engineers often skip the corroboration of the model by assuming simple relations among decision criteria (see, e.g., Bakker 1987; De Vries 1989).
However, on the theoretical side, we also have to make assumptions and idealizations if we wish to operationalize and to measure. The differences are therefore only gradual: while an intelligent knowledge-based system (IKBS) is an explicitly structured representation of the underlying rules of some area of human expertise which organizes the data (see, e.g., Black 1986, at p. x), in theoretical research one tests the representation against the data. While in the former case one is quite satisfied "when it works," the purpose of the latter is to investigate "why" it works.
An IKBS that is based on rules from psychology concerning human reasoning, or on mathematical assumptions concerning decision-making, invokes models of how scientists (should) reason when doing research or making decisions. They inform us about the subjective side of science, but not about its knowledge content.
The crucial question for science studies in relation to the challenge of these knowledge representations can now be formulated as follows: can science studies provide us with theoretical insights which are (additionally) useful for the specification of inference rules in relation to scientific knowledge? In other words: is there specific knowledge about science in comparison with other forms of organized knowledge that is of relevance for the construction of artificial intelligence in a scientific environment?
Obviously, the contribution of science studies should make a difference not on the formal side, but on the substantive side. If in a scientific environment the data are intrinsically related to theories, then this relationship should have consequences precisely for one of the most problematic issues in the construction of artificial intelligence: the so-called frame problem.
The frame problem
In the empirical sciences, theories are never purely formal; they are essentially embedded in meaning. Consequently, the data are not independent but theory-laden, and over time the data change both independently and in their relations to the inference rules. While in other artificial intelligence systems we may be able to separate the two, and keep one of the two dimensions constant over a relevant period of time, in science the two dimensions are mutually contingent: the data change with their interpretation. In artificial intelligence, this problem is also known as the frame problem (see, e.g., Chabris 1989, pp. 61-6).
At each moment in time, we can feed the knowledge base with the best of our knowledge after appropriate knowledge engineering and/or through reconstruction on the basis of scientific texts. As I specified elsewhere (Leydesdorff 1990b, at pp. 294f.), given the data and given the decision criteria, the database may then begin to learn when confronted with new data if we are able to specify inference rules.
However, in relation to these later events the expert system is necessarily a priori; it can only assess the meaning of what happens next in relation to what it already contains, even if, in the longer term, we may be able to provide the systems with algorithms for dealing with new patterns. For example, we can instruct the system to disregard, for the prediction of the system's future behaviour, data which describe stages before a path dependent transition has happened (cf. Leydesdorff 1992a). In practice, this may prolong the life-cycle of the system, but it does not solve the fundamental problem of how the system operates in empirical reality.
Every knowledge representation necessarily has a time index attached to it; it is stamped by the time when it was engineered and framed. In this perspective, an IKBS is not essentially different from a textbook, although the former may be interactive and also have some capacity to learn. As in literary research, creative combinations within it may nevertheless lead to new applications. Like the library, an IKBS may have a service role in research. In a sense, one might even say that in an IKBS what was previously a structural condition of the system (e.g., the library) might begin to act as another system (see, e.g., Swanson 1990). Whether this is actually the case, and in relation to which structures, remain empirical questions for science studies.
In summary, the choice of an a posteriori perspective also makes it possible to raise evaluative and empirical questions with respect to the functions and impacts of an IKBS in the science system, and to relate its development systematically to developments in other dimensions.
4. Genesis and Validity
We have seen that the Bayesian philosophy of science and artificial intelligence share an emphasis on learning with reference to the a priori situation. The two traditions differ, however, with respect to the normative aspect. Assessing the relevance of events a posteriori (i.e., the empirical question of "what happened, and why?") brings us to another important issue which may easily lead to a confusion of normative and empirical perspectives. This is the issue of the analysis of the a posteriori state in terms of its genesis, and in terms of its validity.
In systems theory one finds the notion that the new state is in a sense contained in the older one, and that it is therefore important to follow the development of the a priori system as a process in order to investigate how it shapes the a posteriori state. For example, Luhmann (1984, pp. 148ff.) discussed the auto-catalysis which is supposedly contained in the double contingency of interpersonal relations, and which then "produces" the communication system.
However, one has to distinguish between information and redundancy. In terms of empirical theory, the communicative function in the double contingency is a consequence of the existence (i.e., a posteriori) of communication. The evaluation of this a posteriori state in terms of the a priori states leads to a surplus of uncertainty which cannot be reduced to the a priori states, and which was indicated above as the "in-between group" uncertainty. A posteriori, the process through which the new state has come about may be only one of the possible pathways which might have led to this state. Other decompositions are possible; decompositions do not add to the information. The contingent pathway of emergence is therefore not in itself an indicator of the validity of the description of the a posteriori system. Its specification creates redundancy; this may be useful for the understanding, but it adds nothing to the information.
As I showed above, there is an analogy between the dynamic problem and the multilevel problem. At the higher level of aggregation, each case or each subgroup of cases contributes only a part of the overall uncertainty. Additionally, there may be "in between group" variance. Analogously, the a posteriori situation is a result of various a priori ones and the processes attached to them if there has been a dynamic interaction.[xi] Once the a posteriori situation is a given, the pathway of its genesis becomes one contingent route, and thus explains only part of the overall uncertainty. As noted, other decompositions remain possible.
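The multilevel side of this analogy can be made concrete with the standard entropy decomposition used in statistical decomposition analysis (cf. Theil 1972): the total uncertainty equals the "in between group" uncertainty plus the weighted sum of the uncertainties within the groups. The following is a minimal sketch; the distribution over six cases and both groupings are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def decompose(p, groups):
    """Return (between-group entropy, {group: within-group entropy}, recomposed total)."""
    share = {g: sum(p[i] for i in idx) for g, idx in groups.items()}
    between = entropy(share.values())
    within = {
        g: entropy([p[i] / share[g] for i in idx]) for g, idx in groups.items()
    }
    recomposed = between + sum(share[g] * within[g] for g in groups)
    return between, within, recomposed

# Hypothetical distribution over six cases (made-up numbers summing to 1).
p = [0.10, 0.20, 0.10, 0.25, 0.05, 0.30]

# Two different, equally admissible groupings of the same cases.
grouping_1 = {"g1": [0, 1, 2], "g2": [3, 4, 5]}
grouping_2 = {"h1": [0, 3], "h2": [1, 4], "h3": [2, 5]}

H_total = entropy(p)
between_1, within_1, recomposed_1 = decompose(p, grouping_1)
between_2, within_2, recomposed_2 = decompose(p, grouping_2)

print(H_total, recomposed_1, recomposed_2)
```

Both groupings recompose to the same total uncertainty, although they attribute it differently to the "in between group" and within-group parts; in this sense a particular decomposition redistributes the uncertainty without adding to it.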
In other words, the result of the development cannot structure the development ex ante, but only with hindsight: there is simply no causa finalis in an empirical model. Prediction in empirical (science) studies should never be confused with speculative forecasting: it is strictly a methodological concept, and not an epistemological one. The specification of redundancy may, with hindsight, inform us about the entities to which the information has been added.
Note that the priority of the a posteriori stage in empirical science studies makes its philosophy of science necessarily progressive; but "progressive" in an empirical sense. The contingency of science and its progress does not imply that "anything goes," but only that it goes when it goes (cf. Luhmann 1990, at p. 177). Science develops with time, like the society (to which it belongs as a part) and all other autopoietic systems.
In this research programme, the rationality question can also be reformulated as an empirical question: what was the rationale of what happened? Analogously, the universality question then becomes an empirical question: everything in science is defined with reference to a universe, and all questions about things defined with reference to a universe are also necessarily empirical.
References
Bailey, K. D. (1990) "Why H does not measure information: the role of the "special case" legerdemain solution in the maintenance of anomalies in normal science," Quality and Quantity, 24: 159-71.
Bakker, R. R. (1987) Knowledge Graphs: representation and structuring of scientific knowledge. Ph. D. Thesis, University of Twente.
Black, W. J. (1986) Intelligent Knowledge Based Systems. Wokingham: Van Nostrand Reinhold.
Burt, R. S. (1982) Toward a Structuralist Theory of Action. New York, etc.: Academic Press.
Callon, M., J. Law, and A. Rip (eds.) (1986) Mapping the Dynamics of Science and Technology. London: Macmillan.
Chabris, C. F. (1989) Artificial Intelligence and Turbo C. Homewood, Ill.: Dow Jones-Irwin.
De Vries, P. H. (1989) Representation of Scientific Texts in Knowledge Graphs. Ph. D. Thesis, State University of Groningen.
Dorling, J. (1972) "Bayesianism and the rationality of scientific inference," British Journal for the Philosophy of Science, 23: 181-90.
Dorling, J. (1979) "Bayesian Personalism, The Methodology of Research Programmes, and Duhem's Problem," Studies in History and Philosophy of Science 10: 177-87.
Dorling, J. (1982) "Further Illustrations of the Bayesian Solution of Duhem's Problem," (unpublished).
Franklin, A. (1986) The Neglect of Experiment. Cambridge: Cambridge University Press.
Giddens, A. (1979) Central Problems in Social Theory. London, etc.: Macmillan.
Giere, R. (1988) Explaining Science. A Cognitive Approach. Chicago/ London: Chicago University Press.
Gilbert, G. N., and M. Mulkay (1984) Opening Pandora's Box. A Sociological Analysis of Scientists' Discourse. Cambridge: Cambridge University Press.
Hesse, M. (1974) The Structure of Scientific Inference. London: Macmillan.
Hesse, M. (1980) Revolutions and Reconstructions in the Philosophy of Science. London: Harvester Books.
Howson, C., and A. Franklin (1985) "Newton and Kepler," Studies in History and Philosophy of Science, 16: 379-86.
Howson, C., and P. Urbach (1989) Scientific Reasoning. The Bayesian Approach. La Salle, Ill.: Open Court.
Langley, P., H. A. Simon, G. L. Bradshaw, and J. M. Zytkow (1987) Scientific Discovery. Computational Explorations of the Creative Processes. Cambridge, Mass./ London: MIT Press.
Leydesdorff, L. (1990a) "Relations Among Science Indicators I. The Static Model," Scientometrics 18: 281-307.
Leydesdorff, L. (1990b) "Relations Among Science Indicators II. The Dynamics of Science," Scientometrics 19: 271-96.
Leydesdorff, L. (1990c) "The Scientometrics Challenge to Science Studies," EASST Newsletter 9 (Nr. 1): 5-11.
Leydesdorff, L. (1992a) "Irreversibilities in Science and Technology Networks: An Empirical and Analytical Approach," Scientometrics (forthcoming).
Leydesdorff, L. (1992b) "The Static and Dynamic Analysis of Network Data Using Information Theory," Social Networks (forthcoming).
Leydesdorff, L. (1992c) "The Impact of Citation Behaviour on Citation Structure," Proceedings of the Joint EC/Leiden Conference on Science and Technology Indicators 1991, (forthcoming).
Leydesdorff, L. (1992d) "Structure/Action Contingencies and the Model of Parallel Computing," (forthcoming).
Luhmann, N. (1984) Soziale Systeme. Grundriß einer allgemeinen Theorie. Frankfurt a.M.: Suhrkamp.
Luhmann, N. (1990) Die Wissenschaft der Gesellschaft. Frankfurt a.M.: Suhrkamp.
Mitroff, I.I. (1972) The Subjective Side of Science. Amsterdam: North Holland.
Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, California: Morgan Kaufmann.
Phillips, L. D. (1973) Bayesian Statistics for Social Scientists. London, etc.: Nelson.
Pinch, T. (1985) "Towards an Analysis of Scientific Observation: The Externality and Significance of Observational Reports in Physics," Social Studies of Science 15: 3-26.
Rosenkrantz, R. D. (1977) Inference, Method and Decision. Towards a Bayesian Philosophy of Science. Dordrecht, etc.: Reidel.
Rosenkrantz, R. D. (1980) "Induction as information acquisition," in: Applications of inductive logic, ed. L. J. Cohen and M. B. Hesse, pp. 68-89. Oxford: Oxford University Press.
Shannon, C. E. (1948) "A Mathematical Theory of Communication," Bell System Technical Journal 27: 379-423 and 623-56.
Swanson, D. R. (1990) "Medical literature as a potential source of new knowledge," Bull. Med. Libr. Assoc. 78: 29-37.
Theil, H. (1972) Statistical Decomposition Analysis. Amsterdam, etc.: North Holland.
[i]. See also: Theil 1972; Leydesdorff 1990b.
[ii]. H = Σ p_{i} * log(1/p_{i})
[iii]. See for a further discussion of this matrix, Leydesdorff 1992b.
[iv]. Note that the eigenstructure of an asymmetrical matrix is asymmetrical indeed. See also: Leydesdorff (1992b and 1992c).
[v]. "That is the key idea behind all Bayesian methods (...)" (Phillips 1973, at p. 9).
[vi]. See for a review of empirical studies: Giere 1988, pp. 149-57.
[vii]. A Bayesian might counter-argue that information is strictly defined by the prior distribution in such a way that its becoming available would change only this a priori distribution, and that therefore the entropy is bound to a maximum. However, the problem is then how to account for the additional uncertainty generated by the confrontation with the empirical evidence. I will return to this in a later section.
[viii]. The incremental nature of Bayesian updates can also be seen from the so-called log-odds log-likelihood formulation of Bayes' theorem, since each additional datum adds an independent piece of evidence to a summation.
Ω_{(posterior)} = Ω_{(prior)} * Π L_{i}
log Ω_{(posterior)} = log Ω_{(prior)} + Σ log L_{i}
in which Ω stands for the odds, and L_{i} for the likelihood of evidence_{i}. (Pearl 1988, at pp. 38f.)
[ix]. Since these are information contents of messages about change, they can be shown to be necessarily positive (cf. Theil 1972, pp. 59f.).
[x]. In Bayesian philosophy, this term is even a normalization constant because of the logical complementarity of the hypothesis and its negation. See also: Pearl 1988, at p. 32.
[xi]. The analogy is a consequence of the Second Law of Thermodynamics: the later state of the system contains the entropy of the previous states plus the entropy generated by the process. Analogously, the aggregate contains the entropy of the previous subgroups plus the "in between group" entropy. If the system is additionally able to increase its redundancy (e.g., by self-organization) the relative information may nevertheless decrease.