@author Andreas Thor
@date 05/12/2015

RefMatchCluster = References Matching and Clustering

This program automatically (yet not 100% perfectly) identifies equivalent cited references.
It takes as input a DBF file with a "CR" attribute (cited references).

It reads all objects (i.e., all entries in the file) and extracts author and journal names and the publication year.
Here, it performs some pre-processing:
- author names consisting of multiple names are merged, e.g., "von neumann" is merged to "vonneumann"
- only the first letter of the first first name is extracted
- attribute journal_short is the concatenation of the first letters of each word (e.g., "american phys." is converted to "ap");
  attribute journal_short equals journal if the journal is just one word
- further pre-processing steps can be implemented, e.g., extraction of pages or volume

The subsequent matching compares each object with all other objects that share the same year and the same first letter of the last name.
Matching is controlled by matchers that are given as parameters. Each matcher consists of
- attribute = the attribute that is used for similarity computation (e.g., lastname)
- similarity = the similarity function that determines the similarity value in a 0 to 1 range (Levenshtein or Trigram)
- threshold = real value between 0 and 1; only entry pairs with similarity >= threshold are considered matches

All matchers are executed independently.
All match results (=mappings) are then merged.
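The pre-processing and a single Levenshtein matcher described above can be sketched as follows. This is a minimal illustration, not the tool's actual source: the class name PreprocessSketch and all method names are hypothetical, lower-casing the input is an assumption, and normalising the Levenshtein distance as 1 - distance / (length of the longer string) is one common way to obtain a value in the 0 to 1 range that the original text does not confirm.

```java
import java.util.Locale;

/** Illustrative sketch only; names and normalisation are assumptions, not RefMatchCluster code. */
class PreprocessSketch {

    /** Merges a multi-part last name, e.g. "von neumann" becomes "vonneumann". */
    static String mergeLastName(String lastName) {
        return lastName.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", "");
    }

    /** Keeps only the first letter of the first first name. */
    static String firstInitial(String firstNames) {
        String t = firstNames.trim().toLowerCase(Locale.ROOT);
        return t.isEmpty() ? "" : t.substring(0, 1);
    }

    /** Concatenates the first letter of each word; a one-word journal is kept as-is. */
    static String journalShort(String journal) {
        String[] words = journal.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (words.length == 1) return words[0];
        StringBuilder sb = new StringBuilder();
        for (String w : words) sb.append(w.charAt(0));
        return sb.toString();
    }

    /** Only objects with the same key (year + first letter of last name) are compared. */
    static String blockingKey(int year, String mergedLastName) {
        return year + "|" + (mergedLastName.isEmpty() ? "" : mergedLastName.substring(0, 1));
    }

    /** Classic dynamic-programming edit distance using two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0,1]; the normalisation by the longer length is an assumption. */
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    /** A matcher accepts a pair only if similarity >= threshold. */
    static boolean matches(String a, String b, double threshold) {
        return similarity(a, b) >= threshold;
    }

    public static void main(String[] args) {
        String n1 = mergeLastName("von neumann");
        String n2 = mergeLastName("von neuman");
        System.out.println(n1 + " / " + journalShort("american phys."));
        System.out.println("same block: " + blockingKey(1950, n1).equals(blockingKey(1950, n2)));
        System.out.println("match at 0.75: " + matches(n1, n2, 0.75));
    }
}
```

Restricting comparisons to objects with the same blocking key keeps the number of pairwise comparisons manageable, at the cost of missing references whose year or first letter was recorded incorrectly.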
A matching pair appears in the merged mapping iff
- it appears in *all* individual match results (i.e., attribute similarity >= matcher threshold for each mapping) *AND*
- the merged similarity (average of all mapping similarities) is >= the overall threshold (if specified)

The merged match result is then used to cluster equivalent objects.
(In mathematics: the match result is considered as a binary relation R, and two objects are in the same cluster if and only if the pair is in the transitive closure of R.)
Two objects o1 and o2 are assigned the same cluster id if
- the pair (o1,o2) appears in the merged match result or
- there is a list of other objects t1, ..., tn so that (o1,t1), (t1,t2), (t2,t3), ..., (tn-1,tn), and (tn,o2) are in the match result

The clustering can be exported to a CSV or DBF file (if specified).
It has the exact same data as the input file but adds a column "clusterid".

The clustering is used for aggregation, i.e., the values of N_CR, PERC_YR, and PERC_ALL are summarized per cluster.
For all other attributes (CR, RPY) one cluster representative is chosen (the one with the highest N_CR in the cluster).
In the (unlikely) case of multiple objects with the maximal N_CR, one is chosen at random.
The output has the exact same structure as the input file but adds a column "clusterid".
There is only one entry per cluster id.
The aggregation file can be either in CSV or DBF format.

For analysis purposes (finding good matchers, i.e., a meaningful combination of attribute, similarity function and threshold) the match result can be exported to a CSV file.
Each line represents a matching pair and gives all individual similarity values as well as the (global) average similarity.

Command line arguments
-input= ... dbf file
-match= ... csv file
-cluster= ... csv or dbf file
-aggregate= ... csv or dbf file
-matcher=attribute,similarity,threshold ... similarity=Trigram or Levenshtein, threshold is a real number between 0.0 and 1.0
-threshold= ... global threshold for merged match results

Notes:
- -input is mandatory
- there must be at least one -matcher argument but there can be multiple
- -match, -cluster and -aggregate are optional but one should specify at least one of them to have some output
- -threshold is optional (if not given: threshold=0, i.e., a pair is in the merged result if it is in all individual match results)

Example:
java -Dfile.encoding=ISO8859-1 -Duser.country=US -Duser.language=en -jar RefMatchCluster.jar -input=lutz/yearcr.dbf -matcher=journal_short,Levenshtein,0.75 -matcher=lastname,Levenshtein,0.75 -match=lutz/yearcr_match.csv -cluster=lutz/yearcr_cluster.dbf -aggregate=lutz/yearcr_aggregate.dbf
(assuming all data files are in a sub-folder named "lutz")

=== VM configuration ===
-Dfile.encoding=ISO8859-1 ... standard for DBF
-Duser.country=US ... to avoid problems with decimal separator
-Duser.language=en ... ditto
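The merge rule and the transitive-closure clustering described earlier can be sketched as follows. This is only an illustration under stated assumptions: the class ClusterSketch and its methods are hypothetical, objects are identified by their index 0..n-1, each mapping is represented as a map from an index pair to its similarity, and a union-find structure is swapped in as one convenient way to compute the clusters. The original text does not say how the tool implements this step.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; not taken from RefMatchCluster's source. */
class ClusterSketch {

    /** Order-independent key for an object pair (hypothetical helper). */
    static List<Integer> pair(int a, int b) {
        return Arrays.asList(Math.min(a, b), Math.max(a, b));
    }

    /**
     * Keeps a pair iff it occurs in *all* mappings and the average of its
     * similarities is >= overallThreshold (0 means "no overall threshold").
     */
    static Map<List<Integer>, Double> merge(List<Map<List<Integer>, Double>> mappings,
                                            double overallThreshold) {
        Map<List<Integer>, Double> merged = new LinkedHashMap<>();
        if (mappings.isEmpty()) return merged;
        for (List<Integer> p : mappings.get(0).keySet()) {
            double sum = 0.0;
            boolean inAll = true;
            for (Map<List<Integer>, Double> m : mappings) {
                Double s = m.get(p);
                if (s == null) { inAll = false; break; }
                sum += s;
            }
            double avg = sum / mappings.size();
            if (inAll && avg >= overallThreshold) merged.put(p, avg);
        }
        return merged;
    }

    /** Union-find lookup with path halving. */
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    /**
     * Assigns each of the n objects a cluster id such that two objects share an id
     * iff they are connected by a chain of matching pairs. The id used here is the
     * smallest object index in the cluster (an arbitrary choice for this sketch).
     */
    static int[] cluster(int n, Set<List<Integer>> pairs) {
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        for (List<Integer> p : pairs) {
            int ra = find(parent, p.get(0));
            int rb = find(parent, p.get(1));
            if (ra != rb) parent[Math.max(ra, rb)] = Math.min(ra, rb);
        }
        int[] ids = new int[n];
        for (int i = 0; i < n; i++) ids[i] = find(parent, i);
        return ids;
    }

    public static void main(String[] args) {
        Map<List<Integer>, Double> lastname = new HashMap<>();
        lastname.put(pair(0, 1), 0.9);
        lastname.put(pair(1, 2), 0.8);
        lastname.put(pair(3, 4), 0.76);
        Map<List<Integer>, Double> journal = new HashMap<>();
        journal.put(pair(0, 1), 0.8);
        journal.put(pair(1, 2), 0.9);
        Map<List<Integer>, Double> merged = merge(Arrays.asList(lastname, journal), 0.0);
        // Pair (3,4) is missing from the journal mapping, so objects 3 and 4 stay separate.
        System.out.println(Arrays.toString(cluster(5, merged.keySet())));
    }
}
```

In this example objects 0 and 2 never match directly, yet they end up in the same cluster because both match object 1. This chaining effect is why a threshold that is too low can merge unrelated references into one large cluster.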