Thursday, August 10, 2017

Notes on India's website repository for Patents 8-10-2017

Some notes on the differences between the websites that house the patents for India and the US.

A simple title search, which is possible on the US Patent and Trademark Office website, generates an error in India’s online system. I therefore have to search other fields of the patent records on the Indian Patent Office site to get results for patents related to ‘open source’ and ‘python’. Also, the Indian Patent Office’s results are not directly linkable the way they are on the US Patent and Trademark Office’s website. Therefore, India’s patent search results cannot be scraped in the same way as the US site’s.

Additionally, the results do not seem to be related to software development. Most of them are about data science, machine learning and analytics. An initial search with the terms “open-source” + “python” in the Description section of the Indian Patent Office records returns many more records.

Question: What is a better way to retrieve granted patents in India’s patent office website?  
Answer: Newer scraping methods in Python can probably retrieve and scrape India’s patent office website.

Question: How can we bypass the CAPTCHA system that the Indian patent website has?
Answer: We need to look at the different scraping methods available in Python, how each can be used, and their pros and cons. This is one of the technical obstacles we face in transferring our existing PHP technology, which scrapes the US Patent and Trademark Office’s website.

The Indian patent site is not friendly to scraping the way we have done it for the US Patent and Trademark Office’s site. The Indian Patent Office website uses JavaScript to produce the results. We need to explore more ways to rebuild our algorithm specifically for the Indian Patent Office’s website. I will be conferring with my PHP developer to get his suggestions on how to approach the site.

BeautifulSoup is the Python library that Joshi has found to filter patents from the US Patent and Trademark Office. BeautifulSoup is not able to scrape the Indian Patent Office’s site because the data is not in the HTML page itself but is produced by JavaScript. We may need to research asynchronous web processes to communicate with the JavaScript that is used on the Indian Patent Office’s website.
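To make the problem concrete, here is a minimal sketch of why a static HTML parser comes back empty on a page whose results are filled in by JavaScript. It uses Python's built-in html.parser as a stand-in for BeautifulSoup, and both pages are made-up examples, not the real markup of either patent office's site. The names RowCounter, count_rows and loadResults are hypothetical.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts <tr> tags, the way a static scraper would look for result rows."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_rows(html):
    """Return the number of table rows present in the raw HTML text."""
    parser = RowCounter()
    parser.feed(html)
    return parser.rows


# Made-up page where the results are already written into the HTML.
STATIC_PAGE = """
<html><body>
<table id="results">
  <tr><td>Example patent A</td></tr>
  <tr><td>Example patent B</td></tr>
</table>
</body></html>
"""

# Made-up page where a script (hypothetical loadResults) adds the rows
# in the browser after the page has loaded.
SCRIPTED_PAGE = """
<html><body>
<table id="results"></table>
<script>loadResults();</script>
</body></html>
"""

print(count_rows(STATIC_PAGE))
print(count_rows(SCRIPTED_PAGE))
```

The parser only ever sees the text the server sends. On the second page the table is still empty at that point, so there is nothing for a static scraper to pick up unless the script is actually run or the data source it calls is queried directly.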

Question: Where do we start?
Answer: Joshi has located the email address of the officer at the Indian Patent Office. Diane will draft a letter to the officer requesting access to the patents website and asking how we can access the patent records published in the database. Joshi will review it, then we will send it off and see what response we receive. In the interim we will manually download the Abstract and the Claims of the patents granted in India.
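Once some Abstracts and Claims have been downloaded by hand, a small script could narrow them down to the records that mention both of our search terms. This is only a sketch: the record layout (id, abstract, claims) and the sample records are placeholders invented for illustration, not real patents, and the function names are hypothetical.

```python
import re

# Matches "open source", "open-source" or "opensource" in any letter case.
OPEN_SOURCE = re.compile(r"open[\s-]?source", re.IGNORECASE)
# Matches "python" as a whole word in any letter case.
PYTHON = re.compile(r"\bpython\b", re.IGNORECASE)


def mentions_both(text):
    """True when the text mentions both open source and Python."""
    return bool(OPEN_SOURCE.search(text)) and bool(PYTHON.search(text))


def filter_records(records):
    """Keep records whose abstract and claims together mention both terms."""
    return [r for r in records
            if mentions_both(r["abstract"] + " " + r["claims"])]


# Placeholder records standing in for manually downloaded patents.
SAMPLE = [
    {"id": "placeholder-1",
     "abstract": "An open-source toolkit written in Python.",
     "claims": "A method ..."},
    {"id": "placeholder-2",
     "abstract": "A machine learning pipeline.",
     "claims": "A system using Python."},
    {"id": "placeholder-3",
     "abstract": "Analytics dashboard.",
     "claims": "Built on Open Source libraries and python scripts."},
]

kept = [r["id"] for r in filter_records(SAMPLE)]
print(kept)
```

Checking the abstract and the claims together is a deliberate choice here, since a term may appear in one section and not the other.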

Right now, Google Patents does not include patents from the Indian patents website. Additionally, a CAPTCHA is installed on the Indian patents website to deter scraping.

Announcing our GitHub account link

Here is the GitHub account that we are using to share Django and Python files in progress. The address is: