Mapping (USPTO) Patent Data using Overlays of Google Maps
Journal of the American Society for Information Science and Technology 63(7) (2012) 1442-1458.
available at http://arxiv.org/ftp/arxiv/papers/1110/1110.5712.pdf
Figure 1: Map of 2,946 USPTO-patents recalled using the search string “ttl/nano$ and (isd/2008$$ or isd/2009$$ or isd/2010$$)”; with 78 inventors participating in the top-quartile for “Seoul (KR);” above expectation at the level p< 0.05. (The interactive map can be found at http://www.leydesdorff.net/patentmaps/nano_a.htm .)
The mapping of overlays using Google Maps requires downloading of the following computer programs:
1. uspto1.exe for the initial download at the interface of USPTO (advanced searching at http://patft.uspto.gov/netahtml/PTO/search-adv.htm);
2. uspto2.exe for the organization of the databases from the retrieval under 1;
3. patref3.exe for downloading the numbers of citations for each patent downloaded under 1; uspto2.exe generates an input file for this routine;
4. patref5.exe for the generation of the overlays for Google Maps.
(5. usappl.exe is similar to uspto2.exe, but organizes downloads from the database of patent applications at http://appft1.uspto.gov/netahtml/PTO/search-adv.html. Patent applications, however, do not contain forward citations.)
All programs have to be saved in the same folder; all of them are needed, although the user does not have to start each one individually (but one can). Files from previous runs may be overwritten; it is advised to empty the folder before downloading and running these routines in order to prevent confusion with previous runs. (The routines access the Internet using the Microsoft Internet Transfer Control in the file MSINet.OCX. This file is not part of the original installation of Windows; if it was not yet installed together with another program, Windows may generate an error message. This error can be solved by following the instructions at http://www.leydesdorff.net/software/patentmaps/ocx.htm.)
The first three routines are integrated into uspto2.exe. This routine calls both uspto1.exe and patref3.exe (which access the Internet). The user is first prompted for an input string. This search string can be generated and tested at http://patft.uspto.gov/netahtml/PTO/search-adv.htm. One needs a search string which results in more than fifty patents.
For example, the search string “ttl/nano and isd/2010$$” provides 171 patents. The resulting screen looks as follows:
In this screen, click on the button “Next 50 Hits” (because the format of the first 50 hits is deviant). Then open one of the patents (by clicking on the name) and copy the search string at the top of the screen into uspto1.exe when prompted for a search string. It is not important which patent is used, as long as the sequence number is larger than 50; the search string will be parsed by the program. Also enter the number 171 when prompted for the number of patents. (A lower number is also ok, but a higher number may lead to an error message disturbing the flow of the programs.)
Uspto2.exe will thereupon download the (in this case, 171) patents (as p1.htm, p2.htm, …, etc.) and organize the information in these files into relational databases (as explained in more detail at http://www.leydesdorff.net/indicators/lesson5.htm). These relational databases (in .dbf format) can also be opened in Excel or related to one another using MS Access. The program additionally produces a file “cit_inv.txt” (or “cit_ass.txt” if one has chosen Assignees to be mapped), which we will use below for the geo-coding.
Thereafter, the program uspto2.exe will call patref3.exe and patref4.exe. Patref3.exe prompts you with a button for confirming the internet connection. (If you already downloaded the data in a previous run, you can answer “n” for “no,” but not the first time.) This routine downloads the numbers of citations in the USPTO database for each of the (171) downloaded patents at the present date (in files q1.htm, q2.htm, …, etc.). This number is stored in the file ti.dbf under the field “tc” (analogously to the abbreviation of “times cited” in the Web-of-Science). This completes the information retrieval for the purpose of this set of routines. (If you wish to continue with the analysis of the cited and citing patterns of USPTO patents and applications, see the section on the analysis of cited patents below.)
Note that the USPTO provides a warning (at http://www.uspto.gov/patft/help/notices.htm) stating that downloads above 1,000 may lead to banning your IP address for further downloading. I disclaim any responsibility for this.
The file cit_inv.txt (or cit_ass.txt) can be used for the geo-coding, for example, at http://www.gpsvisualizer.com/geocoder/ . (The Sci2 Tool contains an option for geo-coding large numbers (up to 50k) automatically, but one has to reformat the result thereafter into comma-separated values (CSV) in accordance with the output format of the gpsvisualizer because the next routine assumes this format. Geo-coding is done with Bing Maps since Oct. 1, 2013; one can ask for a free API key at http://www.bingmapsportal.com/ .) Cut and paste the information in cit_inv.txt into the input window of the gpsvisualizer, and thereafter cut and paste the output into a plain text file named “geo.txt”. Don’t change these files!
Thereafter one can run patref5.exe. The program asks a number of questions (about thresholds, etc.) and then generates two output files: ztest.txt and patents.txt. These files are in the format that can be used at http://www.gpsvisualizer.com/map_input?form=data for the generation of Google Maps. Patents.txt will generate a map with different colors for six percentile ranks (top-1%, top-5%, top-10%, top-25%, top-50%, and bottom-50%); the nodes are sized according to the logarithm of the number of patents (+1, in order to prevent log(1) = 0). Ztest.txt is based on a z-test of the observed versus the expected numbers of citations of these patents. If the numbers are higher than expected, the color is green; if lower, the color is red. Dark-green and dark-red indicate significance, while lighter colors are used for non-significant cities and for expected values smaller than five (because testing is then not reliable). By clicking on the city-nodes one obtains this quantitative information in a pop-up window of Google Maps.
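The color and size coding described above can be sketched as follows. This is a minimal illustration, not the code of patref5.exe: the exact test statistic is not specified here, so the sketch assumes a one-sample z-test for a proportion (for example, the share of a city's patents in the top-25% against an expected share of 0.25), a critical value of 1.96 (two-sided, p < 0.05), and the natural logarithm for the node size. The function names and the color labels are hypothetical.

```python
import math

Z_CRIT = 1.96  # assumption: two-sided test at p < 0.05


def node_size(n_patents):
    """Size a node by log(n + 1); the base of the logarithm is an assumption."""
    return math.log(n_patents + 1)


def z_score(observed, n, share):
    """Return (z, expected) for `observed` hits among `n` patents.

    Assumption: one-sample z-test for a proportion, where `share` is the
    expected proportion (e.g. 0.25 for the top quartile).
    """
    expected = n * share
    z = (observed - expected) / math.sqrt(n * share * (1.0 - share))
    return z, expected


def node_color(observed, n, share):
    """Green above expectation, red below; dark only if significant
    and the expected value is at least five."""
    z, expected = z_score(observed, n, share)
    significant = abs(z) > Z_CRIT and expected >= 5
    if observed >= expected:  # a tie is (arbitrarily) shown as light green
        return "dark-green" if significant else "light-green"
    return "dark-red" if significant else "light-red"


# Hypothetical city: 40 of its 100 patents are in the top-25%.
print(node_color(40, 100, 0.25), round(node_size(100), 2))
```

With these invented numbers the expected value is 25, so the city would be drawn in the darker green; a city with only 12 patents has an expected value of 3 and would always receive a light color.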
The Google Map can be downloaded and used locally (e.g., in a PowerPoint presentation). One can ask Google for an API key, which has to be placed in the file at an indicated place; the map can then also be made available on a website. One can also edit the html file.
Note that the default attribution of patents to city addresses is based on “fractional counting” and uses inventors. One has the option to ask for assignees instead of inventors. If a patent lists n inventor (or assignee) addresses, each address adds a proportionate point (1/n) to the number of patents at that address. The files iztest.txt and ipatents.txt provide the same output using “integer counting,” that is, counting the number of inventors or assignees. In the future, I plan to extend this routine with the geographical network among co-inventors as for the mapping of scientific publications at http://www.leydesdorff.net/maps.
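The difference between the two counting rules can be illustrated with a short sketch. The input structure (one list of city addresses per patent) and the function name are assumptions made for this example; the routines themselves work on the .dbf and .txt files described above.

```python
from collections import defaultdict


def count_addresses(patents, fractional=True):
    """Attribute patents to city addresses.

    `patents` is a list with one list of city addresses per patent
    (hypothetical structure). With fractional counting each of the n
    addresses on a patent receives 1/n; with integer counting each
    address receives a full point.
    """
    totals = defaultdict(float)
    for addresses in patents:
        if not addresses:
            continue  # a patent without addresses cannot be mapped
        weight = 1.0 / len(addresses) if fractional else 1.0
        for city in addresses:
            totals[city] += weight
    return dict(totals)


# Invented example: one patent with three inventors, one with a single inventor.
patents = [
    ["Seoul (KR)", "Seoul (KR)", "Tokyo (JP)"],
    ["Tokyo (JP)"],
]
print(count_addresses(patents, fractional=True))
print(count_addresses(patents, fractional=False))
```

In this example the fractional counts add up to the number of patents (two), whereas the integer counts add up to the number of inventor addresses (four).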
Note that these programs are not able to control for misspellings in the USPTO database itself. In principle, one can edit the data at in-between steps using Excel for the databases (save back as .dbf!) and a text editor for the .txt files. Another routine usappl.exe is available at http://www.leydesdorff.net/software/patentmaps/usappl.exe for searching among applications (instead of granted patents) at the USPTO (at http://appft1.uspto.gov/netahtml/PTO/search-adv.html ). The program operates strictly analogous to uspto2.exe from which it was derived, but with the necessary adaptations to the format. Applications cannot be cited (hence: tc=0 for all applications). The files cit_inv.txt and cit_ass.txt are less reliable for geo-coding than the ones for granted patents. For example, the “CA” for Canada is often coded as “California, US” and similarly “IL” provides ambiguity between “Illinois” and “Israel.” One may wish to inspect these files in detail before using them.
See the full paper (v. 25 Oct. 2011) at http://arxiv.org/ftp/arxiv/papers/1110/1110.5712.pdf.
Networks of Co-inventors and co-applicants (extension in March 2013):
An additional routine usp_netw.exe enables the user to add inventor and applicant networks as overlays. The content of the output file “network.txt” can be cut and pasted behind the other files, and GPSVisualizer can then use the combination of the two files for a single Google Map (e.g., ztest.txt + network.txt).
The routine first evaluates whether the files were made by uspto2.exe using inventor or assignee addresses; one cannot combine the two because they are potentially different in terms of nodes. An asymmetrical matrix of patents versus city addresses (matrix.txt) is first generated in the Pajek format. Thereafter, the output file coocc.dat is generated and, if so wished, additionally cosine.dat, which contains the cosines between the address-vectors across the asymmetrical matrix. However, the procedure can be time-consuming since it is not based on matrix algebra. (One may wish to run this routine during the night.) Alternatively, the program will indicate after a while that one can interrupt using Ctrl-Break. The user is then prompted with the option to discontinue the operation.
At that moment, the file “matrix.txt” has already been generated and can be read into Pajek as a network file (File > Read > Network). The co-occurrence matrix can be made in Pajek (v.3) by choosing: Net > 2-Mode Network > Transform > 2-Mode to 1-Mode > Columns. Save the resulting network as a valued matrix with the .mat extension (File > Network > Save). This file should be named “pajek.mat”, and can be read by Paj2Cooc.Exe. This program generates the file coocc.dbf which can be used by patref5.exe. Note that previous files with the same name are overwritten both by Pajek and by these programs. Patref5.exe writes network.txt (as input for GPSVisualizer) if coocc.dbf is available in the same folder.
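As an illustration of what the cosines between address-vectors mean, the following sketch computes them for a small, invented patents-by-cities matrix. It is not the code of usp_netw.exe, and it does not read the Pajek format of matrix.txt; the matrix is simply given as a list of rows, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine between two vectors; 0.0 if either vector is empty."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_matrix(matrix):
    """Cosines between all pairs of columns (city addresses) of a
    rows-by-columns (patents-by-cities) matrix."""
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    return [[cosine(columns[i], columns[j]) for j in range(n_cols)]
            for i in range(n_cols)]


# Invented example: 3 patents (rows) by 3 city addresses (columns).
matrix = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
for row in cosine_matrix(matrix):
    print([round(value, 3) for value in row])
```

The nested loops over all pairs of columns show why the procedure becomes slow for large sets: the number of comparisons grows with the square of the number of addresses.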
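The transformation from the 2-mode matrix (patents versus city addresses) to a 1-mode co-occurrence matrix of addresses can be sketched as follows. This is only an illustration of the principle for a small, invented matrix; it does not read or write the Pajek formats, and how Pajek treats the diagonal (loops) and multiple lines depends on its options, which are not reproduced here. The function name is hypothetical.

```python
def cooccurrence(matrix):
    """Project a rows-by-columns (patents-by-cities) matrix onto its columns.

    Cell (i, j) of the result is the sum over all patents of the product of
    the entries for city i and city j; for a binary matrix this is the number
    of patents in which both cities occur. The diagonal holds the number of
    patents per city.
    """
    n_cols = len(matrix[0])
    result = [[0] * n_cols for _ in range(n_cols)]
    for row in matrix:
        for i in range(n_cols):
            if row[i] == 0:
                continue  # skip cities that do not occur in this patent
            for j in range(n_cols):
                result[i][j] += row[i] * row[j]
    return result


# Invented example: 3 patents (rows) by 3 city addresses (columns).
matrix = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
for row in cooccurrence(matrix):
    print(row)
```

In this example the first and second city co-occur in one patent, and the first and third city never co-occur.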
Examples are provided at http://www.leydesdorff.net/patentmaps/sirna.htm from the original article; at http://www.leydesdorff.net/patentmaps/rna_inv.htm with the co-inventor relations; and at http://www.leydesdorff.net/patentmaps/rna_ass.htm with the co-applicant relations for the same set.
The static maps made for US patents in terms of geographical addresses (at http://www.leydesdorff.net/patentmaps) or IPC categories (at http://www.leydesdorff.net/ipcmaps) can be converted into dynamic animations using the filing years and the corresponding routines usptoyr.exe and ipcyr.exe, respectively. The two programs run on the downloaded sets first parsed at the aggregated level by uspto2.exe and ipc.exe, respectively. Both routines use the filing dates for organizing the files into years. The user gets the option to specify a time-window in terms of number of years. For example, if one specifies 3 years and the first patents are from 1984, the respective output file for 1984 will also contain 1985 and 1986; and the next one for 1985 includes patents of 1986 and 1987. This enables users to dampen the variation.
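The effect of the time-window can be sketched as follows. The input structure (pairs of a patent identifier and a filing year), the function name, and the truncation of the window at the last available year are assumptions made for this illustration; they are not taken from usptoyr.exe or ipcyr.exe.

```python
def pool_by_window(patents, window):
    """Group patents into overlapping windows of `window` filing years.

    `patents` is a list of (patent_id, filing_year) pairs (hypothetical
    structure). The result maps each start year to the patents filed in
    that year and the following `window - 1` years.
    """
    years = [year for _, year in patents]
    first, last = min(years), max(years)
    pooled = {start: [] for start in range(first, last + 1)}
    for patent_id, year in patents:
        # A patent filed in `year` belongs to every window that starts
        # between year - window + 1 and year (but not before `first`).
        for start in range(max(first, year - window + 1), year + 1):
            pooled[start].append(patent_id)
    return pooled


# Invented example with one patent per filing year and a window of 3 years.
patents = [("A", 1984), ("B", 1985), ("C", 1986), ("D", 1987)]
for start, ids in pool_by_window(patents, 3).items():
    print(start, ids)
```

As in the example in the text, the set for 1984 also contains the patents of 1985 and 1986, and the set for 1985 those of 1986 and 1987; in this sketch the last windows simply contain fewer years.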
For more information, see http://www.leydesdorff.net/software/patentmaps/dynamic . An example can be found at http://www.leydesdorff.net/photovoltaic/cuinse2/index.htm (for the geographic diffusion) and at http://www.leydesdorff.net/photovoltaic/cuinse2/cuinse2.ppsx (for the diffusion in terms of IPC categories).
The analysis of cited patents (August 2013)
Since Aug. 18, 2013, uspto2.exe also writes a file “cited.txt”. This file is derived from the file “patref.dbf”, which contains the patent references from the initial download after parsing; unlike patref.dbf, however, cited.txt contains only the patent references to USPTO patents after 1975 (because older patents cannot be automatically retrieved and parsed). Both granted patents and patent applications are included. If placed in the same folder, the program uscited1.exe reads “cited.txt” and writes the cited patents and patent applications into this same folder. The files are numbered r1.htm, r2.htm, …, etc. for granted patents, and a1.htm, a2.htm, …, etc. for patent applications. See also http://www.leydesdorff.net/software/uspatents .
If so wished, uscited2.exe can parse the files r1.htm, r2.htm, …, etc. for cited patents; uscited3.exe parses the patent applications. The resulting .dbf files can be used for relational database management in MS Access or read into spreadsheet-based programs (such as Excel or SPSS) for statistical purposes. Using the various fields one can also relate the citing and cited documents in MS Access or similar programs.