

click here to download program

 

ChiTi.Exe for Co-Word Analysis in Chinese

 

ChiTi.Exe is freely available for academic use. The program generates a word-occurrence matrix, a word co-occurrence matrix, and a normalized co-occurrence matrix from a set of lines (e.g., titles) and a word list. The output files can be read into standard software (like SPSS, Ucinet/Pajek, etc.) for statistical analysis and visualization. (A version adapted for the Korean character set is available at http://www.leydesdorff.net/krkwic for 32-bit and at http://www.leydesdorff.net/software/korean for 64-bit operating systems.) This Chinese version assumes that the texts are preprocessed with spaces as separators between the words. This can be done, for example, at http://www.hylanda.com/product/fenci/tiyan/ .

 

1.    Use ChiWords.exe for breaking “text.txt” into words;

2.    Make a selection from wrdfrq.txt and save as “words.txt” (in ASCII/ANSI);

3.    Run ChiTi.exe;

4.    Read cosine.net or coocc.dat into Notepad++ and transcode into UTF-8 + BOM; save;

5.    The resulting files can be used in Pajek and then brought into VOSviewer;

6.    VOSviewer can also directly read cosine.net; use Pajek for rewriting .dat into .net.

 

input files

 

The program needs two inputs: (a) the file <words.txt>, which contains the words (as variables) to be analyzed, in ASCII format, and (b) a file <text.txt> in which each line provides a textual unit (e.g., a title). The number of lines is unlimited, but each line can contain at most 999 characters. Each line has to end with a hard return (CR + LF). The number of words is limited to 1024, but keep in mind that most programs (e.g., Excel) will not allow you to handle more than 256 variables in the follow-up. The words have to be on separate lines, each ended with a carriage return and line feed. (Save in Word as plain text with CR/LF or use a DOS utility (e.g., CRLF.EXE, available on the Internet) for saving the file.)
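The constraints above can be checked before running the program. The following Python sketch is not part of the ChiTi distribution; the function name check_inputs is hypothetical, and it assumes that the 999-character limit is counted in bytes, which the text above does not specify.

```python
MAX_LINE_CHARS = 999   # per-line limit stated above (assumed here to be counted in bytes)
MAX_WORDS = 1024       # word-list limit stated above


def check_inputs(text_bytes: bytes, words_bytes: bytes) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for name, data in (("text.txt", text_bytes), ("words.txt", words_bytes)):
        # After removing every CR+LF pair, any leftover CR or LF is a wrong line ending.
        leftover = data.replace(b"\r\n", b"")
        if b"\n" in leftover or b"\r" in leftover:
            problems.append(f"{name}: found line endings other than CR+LF")
        if data and not data.endswith(b"\r\n"):
            problems.append(f"{name}: last line is not terminated with CR+LF")
    for number, line in enumerate(text_bytes.split(b"\r\n"), start=1):
        if len(line) > MAX_LINE_CHARS:
            problems.append(f"text.txt: line {number} exceeds {MAX_LINE_CHARS} characters")
    words = [w for w in words_bytes.split(b"\r\n") if w]
    if len(words) > MAX_WORDS:
        problems.append(f"words.txt: {len(words)} words, limit is {MAX_WORDS}")
    return problems
```

To use it, read both files in binary mode, e.g. check_inputs(open("text.txt", "rb").read(), open("words.txt", "rb").read()), and inspect the returned list.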

 

       If some texts are longer than 999 characters, you can use ChiText.exe instead. ChiText.exe can handle an unlimited number of text files of up to 64 k each.

       One can build a word frequency list with ChiWords.Exe. This DOS-program reads <text.txt>; the results are provided in <wrdfrq.txt>. If a file <stopword.txt> is available in the same folder, these words will be excluded from the word frequency analysis. Take care that words.txt and text.txt are both ANSI (that is, extended ASCII) files.
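For readers who want to see what such a frequency count involves, here is a minimal Python sketch. It is not the code of ChiWords.Exe, and it does not reproduce the layout of <wrdfrq.txt>; the function name word_frequencies is hypothetical, and the sketch assumes space-separated words and case-sensitive matching.

```python
from collections import Counter


def word_frequencies(lines, stopwords=()):
    """Count space-separated words over all lines, skipping any stopwords."""
    excluded = set(stopwords)
    counts = Counter()
    for line in lines:
        counts.update(w for w in line.split() if w not in excluded)
    # Most frequent first; ties are ordered by the word itself for reproducibility.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The sorted list can then be inspected to select the words that go into words.txt.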

 

program file

 

The program is based on DOS-legacy software from the 1980s (Leydesdorff, 1995). It runs in an MS-DOS command box under Windows. The programs and the input files have to be contained in the same folder. The output files are written into this folder as well. Please note that existing files from a previous run are overwritten by the program. Save output elsewhere if you wish to continue with the materials.

 

output files

 

If the characters are not readable as Chinese, the encoding is wrong. Please use the freeware program Notepad++ (at https://notepad-plus-plus.org/ ), which can transcode the output files into UTF-8. Choose: UTF-8 + BOM. This encoding can be read by VOSviewer and Pajek.
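The same transcoding can be scripted when many files are involved. The Python sketch below is an alternative to the Notepad++ route, not part of the program. It assumes that the output files are in the Chinese ANSI code page (GBK/cp936); if your files use another code page, pass its name as source_encoding. The function names are hypothetical.

```python
def to_utf8_bom(data: bytes, source_encoding: str = "gbk") -> bytes:
    """Re-encode bytes from a legacy code page into UTF-8 with a leading BOM."""
    text = data.decode(source_encoding)
    # Python's "utf-8-sig" codec prepends the BOM (EF BB BF) when encoding.
    return text.encode("utf-8-sig")


def transcode_file(src: str, dst: str, source_encoding: str = "gbk") -> None:
    """Read src in the legacy encoding and write dst as UTF-8 with BOM."""
    with open(src, "rb") as f:
        raw = f.read()
    with open(dst, "wb") as f:
        f.write(to_utf8_bom(raw, source_encoding))
```

Writing to a separate destination file keeps the original output intact in case the assumed source encoding turns out to be wrong.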

 

One can use the following output files:

 

a. cosine.net for a cosine normalized map;

b. coocc.dat for a co-occurrence-based map;

c. cos_vosm.txt and cos_vosn.txt, which are the map and network files, respectively, for VOSviewer. (Transcode both files into UTF-8 + BOM!)

 

The program produces three output files in dBase IV format. These files can be read into Excel and/or SPSS for further processing. Two files with the extension “.dat” are in DL-format (ASCII) and can be read into Pajek for the visualization (freely available at http://vlado.fmf.uni-lj.si/pub/networks/pajek/ ). Similarly, files with the extension .net are in the format of Pajek.

 

a. matrix.dbf contains an occurrence matrix of the words in the texts. This matrix is asymmetrical: it contains the words as the variables and the texts as the cases. In other words, each row represents a text in the sequential order of the text numbering, and each column represents a word in the sequential order of the word list. The cells contain the frequencies of the words in the texts. This file can be imported into SPSS for further analysis. The file words.txt can be used for the variable labels. The programs also generate varlist.sps using the SPSS syntax for this purpose.

 

b. coocc.dbf contains a co-occurrence matrix of the words from this same data. This matrix is symmetrical and it contains the words both as variables and as labels in the first field. The main diagonal is set to zero. The number of co-occurrences of two words is the product of their occurrences within a text, summed over the texts. (The procedure is similar to using the file matrix.dbf as input to the routine “affiliations” in UCINET, but here the main diagonal is set to zero.) The file coocc.dat contains this information in the DL-format.

 

c. cosine.dbf contains a normalized co-occurrence matrix of the words from the same data. Normalization is based on the cosine between the variables conceptualized as vectors (Salton & McGill, 1983). (The procedure is similar to using the file matrix.dbf as input to the corresponding routine in SPSS.) The file cosine.dat contains this information in the Pajek-format. The size of the nodes is equal to the logarithm of the occurrences of the respective word; this feature can be turned on in Pajek.
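The three matrices described under a, b, and c can be illustrated with a short Python sketch. This is not the code of ChiTi.Exe. It assumes that co-occurrences are the products of occurrences summed over the texts (as in an affiliations routine), and it sets the diagonal of the cosine matrix to zero as well, which is an assumption made here for use as a network. All function names are hypothetical.

```python
import math


def occurrence_matrix(lines, words):
    """Rows are texts, columns are words; cells count the word in the text."""
    return [[line.split().count(w) for w in words] for line in lines]


def cooccurrence_matrix(occ):
    """Word-by-word matrix: products of occurrences summed over texts, diagonal zero."""
    n = len(occ[0]) if occ else 0
    co = [[0] * n for _ in range(n)]
    for row in occ:
        for i in range(n):
            for j in range(n):
                if i != j:
                    co[i][j] += row[i] * row[j]
    return co


def cosine_matrix(occ):
    """Cosine between word columns; the diagonal is left at zero (assumption)."""
    n = len(occ[0]) if occ else 0
    norms = [math.sqrt(sum(row[i] ** 2 for row in occ)) for i in range(n)]
    cos = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Words that never occur have a zero norm and keep a cosine of zero.
            if i != j and norms[i] and norms[j]:
                dot = sum(row[i] * row[j] for row in occ)
                cos[i][j] = dot / (norms[i] * norms[j])
    return cos
```

For example, occurrence_matrix(["a b a", "b c", "a c c"], ["a", "b", "c"]) yields one row per text, and passing that result to cooccurrence_matrix or cosine_matrix yields the symmetric word-by-word matrices.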

 

Examples of using these programs can be found in:

 

•     Loet Leydesdorff & Ping Zhou, Co-Word Analysis using the Chinese Character Set, Journal of the American Society for Information Science and Technology 59(9), 1528-1530, 2008; <pdf-version>; <software>

•     Loet Leydesdorff & Iina Hellsten, Metaphors and Diaphors in Science Communication: Mapping the Case of ‘Stem-Cell Research’, Science Communication 27(1), 64-99, 2005; <pdf-version>

  

References

 

Leydesdorff, L. (1995). The Challenge of Scientometrics: The development, measurement, and self-organization of scientific communications. Leiden: DSWO Press, Leiden University; at http://www.upublish.com/books/leydesdorff-sci.htm .

Salton, G. & M. J. McGill (1983). Introduction to Modern Information Retrieval. Auckland, etc.: McGraw-Hill.


(revised, March 2019). 
