These two programs are specialized derivatives of Ti.Exe and FullText.Exe.
They construct the file words.dbf without producing the cosine-normalized
matrix for analysis in Pajek. The file words.dbf contains:
- A variable named “Chi_Sq”
which provides the chi-square contribution of each of the variables; for
word i this is defined as
$\chi^2_i = \sum_j (\mathrm{Observed}_{ij} - \mathrm{Expected}_{ij})^2 / \mathrm{Expected}_{ij}$,
in other words, the sum of the contributions over the column for the
variable in each row (Mogoutov et al., 2008);
- A variable named “ObsExp”
which provides |Observed – Expected| for the word as a variable, summed
over the column;
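- A variable named “TfIdf”
which uses Salton & McGill’s (1983: 63)
TermFrequency-InverseDocumentFrequency measure, defined as follows:
$\mathrm{WEIGHT}_{ik} = \mathrm{FREQ}_{ik} \cdot [\log_2(n) - \log_2(\mathrm{DOCFREQ}_k)]$,
where n is the number of documents in the collection and DOCFREQ_k is the
number of documents in which term k occurs.
This function assigns a high degree of importance to terms occurring in
only a few documents in the collection;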
- The word frequency within
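The three measures in the list above can be sketched in a few lines of Python. This is only an illustration and not the code of the programs themselves. It assumes a small matrix of raw counts with documents as rows and words as columns, and it assumes that expected values are computed in the usual contingency-table way (row total times column total, divided by the grand total), which the text above does not state. The function names are hypothetical. How the per-document TfIdf weights are condensed into a single value per word in words.dbf is not specified here, so the sketch returns the full weight matrix.

```python
import math


def expected_counts(counts):
    """Expected cell values under independence (an assumption of this sketch):
    row total * column total / grand total."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_sq(counts):
    """Chi-square contribution per word: (O - E)^2 / E summed over documents."""
    expected = expected_counts(counts)
    n_words = len(counts[0])
    return [
        sum((obs[k] - exp[k]) ** 2 / exp[k]
            for obs, exp in zip(counts, expected) if exp[k] > 0)
        for k in range(n_words)
    ]


def obs_exp(counts):
    """|O - E| per word, summed over documents."""
    expected = expected_counts(counts)
    n_words = len(counts[0])
    return [
        sum(abs(obs[k] - exp[k]) for obs, exp in zip(counts, expected))
        for k in range(n_words)
    ]


def tfidf_weights(counts):
    """WEIGHT_ik = FREQ_ik * [log2(n) - log2(DOCFREQ_k)] for document i, word k."""
    n_docs = len(counts)
    n_words = len(counts[0])
    # DOCFREQ_k: number of documents in which word k occurs at least once.
    doc_freq = [sum(1 for row in counts if row[k] > 0) for k in range(n_words)]
    return [
        [
            row[k] * (math.log2(n_docs) - math.log2(doc_freq[k]))
            if doc_freq[k] > 0 else 0.0
            for k in range(n_words)
        ]
        for row in counts
    ]


# Two documents (rows) and three words (columns); the numbers are made up.
counts = [[2, 0, 1],
          [0, 3, 1]]
print(chi_sq(counts))
print(obs_exp(counts))
print(tfidf_weights(counts))
```

In this toy example the third word occurs in both documents, so its TfIdf weight is zero everywhere, while the first two words each occur in only one document and receive positive weights.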
- Magerman, T., Van Looy, B., & Song, X. (2007). Exploring the feasibility and accuracy of Latent Semantic Analysis based text mining techniques to detect similarity between patent documents and scientific publications. Paper presented at the 6th Triple Helix Conference, 16-19 May 2007, Singapore.
- Mogoutov, A., Cambrosio, A., Keating, P., & Mustar, P. (2008). Biomedical innovation at the laboratory, clinical and commercial interface: A new method for mapping research projects, publications and patents in the field of microarrays. Journal of Informetrics (In print); doi:10.1016/j.joi.2008.06.005.
- Salton, G., & McGill, M. J. (1983). Introduction to Modern Information Retrieval. Auckland, etc.: McGraw-Hill.
Links to programs for (Porter’s) stemming:
Links to programs for parsing:
php-versions of Porter’s stemmer: