Originally, patent databases consisted of bibliographic data only, but in the past decade there has been a steady increase in the number of full-text databases coming online.  Bibliographic patent data (essentially what you’d find on the first page of the patent document) is published by many of the world’s patent-issuing authorities, but therein lies a problem.  The World Intellectual Property Organization (WIPO) has issued numerous guidelines, recommendations and standards for bibliographic data.  With close to 190 member states publishing IP, however, there are inevitable variations in the quality and, of course, the language of that data.

The breadth, depth and timeliness of the data offered by database vendors are highly dependent upon the patent offices, and in particular on the European Patent Office’s (EPO) master documentation database, DOCDB.  DOCDB includes bibliographic data as well as abstracts, citations and the DOCDB simple patent family from over 90 countries worldwide.  The data goes back as far as the 1830s for some patent authorities and is updated weekly.  However, DOCDB does not contain any full text or images.

While DOCDB is the backbone of many commercial products and services, it covers fewer than half of the 190 WIPO Member States.

The best patent databases are those that standardize and normalize specific bibliographic data elements such as the Publication, Application and Priority numbers and dates, as well as the Applicant(s) and ideally the Inventor(s) and Law Firm names.  The ability to search patents by Classification code(s) is also enhanced if they have been standardized to a common format.

Let’s look at the reasons for normalizing data:

The dates on which an application is filed, published or granted are all critical search elements, and they are also essential when it comes to analytics.  Hence it is important that users know whether the dates have been consistently normalized.  If so, which format should be used and is any punctuation, such as periods, commas, dashes, slashes or spaces, required?  The most commonly used date formats are MMDDYYYY (month, day and year) and DDMMYYYY (day, month and year), and the same string of digits can mean two different dates depending on which one applies.  Application and/or priority dates (and numbers) are also used by database vendors to link inventions by family, so any inconsistencies in these formats will lead to incorrect patent families.  Ensuring that your patent research tool normalizes all dates to a common format and that it supports search and retrieval by year only, as well as by combinations of month, day and year, can help avoid pitfalls.
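To make the ambiguity concrete, here is a minimal Python sketch of date normalization.  It is an illustration only, not how any particular vendor does it: the names `SOURCE_FORMATS`, `normalize_date` and `matches` are hypothetical, the choice of YYYY-MM-DD as the common format is an assumption, and the input is assumed to be zero-padded to eight digits with the source's layout already known.

```python
from datetime import date, datetime

# Hypothetical mapping from the layout a data source declares to a strptime pattern.
SOURCE_FORMATS = {
    "MMDDYYYY": "%m%d%Y",
    "DDMMYYYY": "%d%m%Y",
    "YYYYMMDD": "%Y%m%d",
}

# Punctuation that may appear between the date parts.
SEPARATORS = ".,-/ "


def normalize_date(raw: str, source_format: str) -> str:
    """Return the date as YYYY-MM-DD, given the layout the source uses."""
    digits = "".join(ch for ch in raw if ch not in SEPARATORS)
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"expected 8 zero-padded digits, got {raw!r}")
    parsed = datetime.strptime(digits, SOURCE_FORMATS[source_format])
    return parsed.date().isoformat()


def matches(normalized: str, year=None, month=None, day=None) -> bool:
    """Check a normalized date against any combination of year, month and day."""
    d = date.fromisoformat(normalized)
    pairs = ((year, d.year), (month, d.month), (day, d.day))
    return all(wanted is None or wanted == actual for wanted, actual in pairs)


# The same digits give two different dates depending on the declared layout.
print(normalize_date("03/07/2016", "MMDDYYYY"))  # 2016-03-07
print(normalize_date("03.07.2016", "DDMMYYYY"))  # 2016-07-03
print(matches("2016-03-07", year=2016))          # True
```

Once every date is stored in one format, a year-only search and a full month, day and year search run against the same field, which is the behavior worth checking for in a research tool.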

Publication, Application and Priority Number formats vary enormously.  Some contain a mixture of Country Codes, Serial Numbers (separated by a slash) and Kind Codes, all separated by spaces, such as this published U.S. Application: US 2016/0000001 A1.  Others, like the International Application Number PCT/SE2007/050194, do not contain a Kind Code but do contain a Country Code, here SE for Sweden, indicating that the application was filed with the Swedish patent office as the receiving office.  Numbers can also vary in length but are sometimes “padded” with zeros (see U.S. example above) to a minimum number of characters.  Often it is not clear to users whether the number they are searching for is a publication, application or priority number, so tools that allow users to enter any number, regardless of type and format, are very useful.
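The following Python sketch shows one way a tool might accept the two example numbers above regardless of spacing, letter case or zero padding.  It is a simplified illustration under stated assumptions: `ParsedNumber` and `parse_number` are hypothetical names, the two patterns cover only the layouts shown in this article, and real-world number formats vary far more widely than this.

```python
import re
from typing import NamedTuple, Optional


class ParsedNumber(NamedTuple):
    is_pct: bool          # True for international (PCT) application numbers
    country: str          # two-letter Country Code
    year: Optional[int]   # present only when the number embeds a year
    serial: int           # leading zeros dropped so padded and unpadded forms compare equal
    kind: Optional[str]   # Kind Code such as A1 or B2, if present


# Assumed layouts, based only on the examples in the text.
_PCT = re.compile(r"PCT/([A-Z]{2})(\d{4})/(\d+)")
_NATIONAL = re.compile(r"([A-Z]{2})(?:(\d{4})/)?(\d+)([A-Z]\d?)?")


def parse_number(raw: str) -> ParsedNumber:
    """Parse a patent number, ignoring spaces, letter case and zero padding."""
    compact = re.sub(r"\s+", "", raw).upper()
    m = _PCT.fullmatch(compact)
    if m:
        return ParsedNumber(True, m.group(1), int(m.group(2)), int(m.group(3)), None)
    m = _NATIONAL.fullmatch(compact)
    if m:
        country, year, serial, kind = m.groups()
        return ParsedNumber(False, country, int(year) if year else None, int(serial), kind)
    raise ValueError(f"unrecognized number format: {raw!r}")


print(parse_number("US 2016/0000001 A1"))
print(parse_number("PCT/SE2007/050194"))
# Padded and unpadded inputs resolve to the same record.
print(parse_number("us2016/1a1") == parse_number("US 2016/0000001 A1"))  # True
```

Note that this sketch does not tell a publication number from an application or priority number; it only shows that differently typed inputs can be reduced to one comparable form before a lookup.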