If you are a web developer or publisher, you have probably heard of robots.txt. For those who are unaware, robots.txt is the file websites use to implement the Robots Exclusion Protocol (REP), the convention millions of sites rely on to guide crawlers. This simple text file tells a crawler such as Googlebot which areas of the site it may crawl and which ones it should avoid, using plain directives like Allow and Disallow grouped under the user agent they apply to.
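For illustration, a minimal robots.txt for a hypothetical site might look like the following; the /private/ path and the agents shown are placeholders, not rules from any real site:

```
# Hypothetical example: let Googlebot crawl everything except /private/,
# and keep all other crawlers out entirely.
User-agent: Googlebot
Disallow: /private/
Allow: /

User-agent: *
Disallow: /
```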
Now, we know that Google's crawler consults robots.txt across millions of websites, but the file has never been backed by a formal, universal standard for crawling. Google wants to change that and make the Robots Exclusion Protocol the web standard. The company has therefore decided to open-source its robots.txt parser, making the library available to everyone, and it hopes this will help REP become the official standard for crawling the web.
This announcement comes after nearly 20 years in which the parser was kept closed source. To understand and make use of the protocol and its library, you can find the parser on GitHub.
It also means that if Googlebot does not find a robots.txt file, it will crawl the entire website, treating the absence of rules as permission to visit everything. This matters because developers have reportedly interpreted the protocol "somewhat differently over the years", which has caused difficulty in "writing the rules correctly".
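To make the rule matching concrete, here is a short sketch using Python's built-in urllib.robotparser module (not Google's newly open-sourced C++ library); the site, paths, and user agents are made up for illustration:

```python
# Sketch of how a crawler evaluates robots.txt rules, using Python's
# standard-library parser. The rules and URLs below are hypothetical.
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Googlebot
Disallow: /private/
Allow: /

User-agent: *
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(RULES)

# Googlebot may fetch public pages but nothing under /private/.
print(parser.can_fetch("Googlebot", "https://example.com/index.html"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/private/p"))   # False

# No rules at all behaves like a missing robots.txt: everything is allowed.
empty = RobotFileParser()
empty.parse([])
print(empty.can_fetch("AnyBot", "https://example.com/anything"))        # True
```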
It is still not known whether REP will be accepted as an official standard, but Google argues that it defines one consistent set of rules for everyone, which is why it should be adopted. Google has also documented how REP should be used and has submitted its proposal to the Internet Engineering Task Force. REP's adoption could be a great thing for the internet, though those who are wary of the search giant's influence may not welcome it.