

TiNoCos.exe for Word Selection

 FtNoCos.exe for Word Selection


These two programs are specialized derivatives of Ti.Exe and FullText.Exe. They construct the file words.dbf without producing the cosine-normalized matrix for analysis in Pajek. The file words.dbf contains:


  1. A variable named “Chi_Sq” which provides the Chi-square contribution for each of the variables; for word $i$ this is defined as $\chi^2_i = \sum_j (\mathrm{Observed}_{ij} - \mathrm{Expected}_{ij})^2 / \mathrm{Expected}_{ij}$. In other words, it is the sum of the contributions over the column for the variable in each row (Mogoutov et al., 2008);
  2. A variable named “ObsExp” which provides the sum of |Observed – Expected| for the word as a variable summed over the column;
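  3. A variable named “TfIdf” which uses Salton & McGill’s (1983: 63) Term Frequency-Inverse Document Frequency measure, defined as follows: $\mathrm{WEIGHT}_{ik} = \mathrm{FREQ}_{ik} \cdot [\log_2(n) - \log_2(\mathrm{DOCFREQ}_k)]$, where $\mathrm{FREQ}_{ik}$ is the frequency of term $k$ in document $i$, $n$ is the number of documents in the collection, and $\mathrm{DOCFREQ}_k$ is the number of documents in which term $k$ occurs. This function assigns a high degree of importance to terms occurring in only a few documents in the collection;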
  4. The word frequency within the set.
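
The four variables listed above can be illustrated with a minimal Python sketch. It is not the code of TiNoCos.exe or FtNoCos.exe, and it rests on assumptions that this page does not state: the input is a documents-by-words matrix of counts, the expected values are those of the usual independence model (row total times column total divided by the grand total), and the TfIdf weights of a word are summed over the documents to obtain a single value per word. The function name `word_statistics`, the example words, and the example counts are hypothetical.

```python
import math


def word_statistics(matrix, words):
    """Return one record per word from a documents-by-words count matrix."""
    n_docs = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[k] for row in matrix) for k in range(len(words))]
    grand_total = sum(row_totals)

    records = []
    for k, word in enumerate(words):
        chi_sq = 0.0
        obs_exp = 0.0
        for i in range(n_docs):
            observed = matrix[i][k]
            # Assumption: expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[k] / grand_total
            if expected > 0:  # skip empty documents to avoid division by zero
                chi_sq += (observed - expected) ** 2 / expected
            obs_exp += abs(observed - expected)

        # Number of documents in which the word occurs at least once.
        doc_freq = sum(1 for row in matrix if row[k] > 0)
        idf = math.log2(n_docs) - math.log2(doc_freq) if doc_freq else 0.0
        # Assumption: FREQ_ik * idf summed over all documents i,
        # which equals the total frequency of the word times idf.
        tf_idf = col_totals[k] * idf

        records.append({
            "Word": word,
            "Chi_Sq": chi_sq,
            "ObsExp": obs_exp,
            "TfIdf": tf_idf,
            "Freq": col_totals[k],
        })
    return records


# Hypothetical example: four documents (rows) and three words (columns).
words = ["patent", "gene", "the"]
matrix = [
    [2, 0, 1],
    [1, 0, 1],
    [0, 3, 1],
    [0, 0, 1],
]

records = word_statistics(matrix, words)
for record in records:
    print(record)
```

In this example the word "the" occurs in all four documents, so its TfIdf weight is zero because log2(4) - log2(4) = 0, whereas "gene" occurs in a single document and receives the highest weight.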



- Magerman, T., Van Looy, B., & Song, X. (2007). Exploring the feasibility and accuracy of Latent Semantic Analysis based text mining techniques to detect similarity between patent documents and scientific publications. Paper presented at the 6th Triple Helix Conference, 16-19 May 2007, Singapore.

- Mogoutov, A., Cambrosio, A., Keating, P., & Mustar, P. (2008). Biomedical innovation at the laboratory, clinical and commercial interface: A new method for mapping research projects, publications and patents in the field of microarrays. Journal of Informetrics (in press); doi:10.1016/j.joi.2008.06.005.

- Salton, G., & McGill, M. J. (1983). Introduction to Modern Information Retrieval. Auckland, etc.: McGraw-Hill.



Links to programs for (Porter’s) stemming:

Links to programs for parsing:

php-versions of Porter’s stemmer:


