Official Google Webmaster Central Blog: A note on unsupported rules in robots.txt

Webmaster Central Blog

Official news on crawling and indexing sites for the Google index

A note on unsupported rules in robots.txt

Tuesday, July 02, 2019

Yesterday we announced that we're open-sourcing Google's production robots.txt parser. It was an exciting moment that paves the way for potential Search open sourcing projects in the future! Feedback is helpful, and we're eagerly collecting questions from developers and webmasters alike. One question stood out, which we'll address in this post:
Why isn't a code handler for other rules like crawl-delay included in the code?
The internet draft we published yesterday provides an extensible architecture for rules that are not part of the standard. This means that if a crawler wanted to support their own line like "unicorns: allowed", they could. To demonstrate how this would look in a parser, we included a very common line, sitemap, in our open-source robots.txt parser.
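To make the idea of an extensible line handler concrete, here is a minimal Python sketch. It is not Google's open-source parser, and every name in it (`STANDARD_KEYS`, `parse_robots_lines`, `dispatch`) is hypothetical. It assumes a crawler registers one callback per non-standard key, such as sitemap, and ignores keys it has no handler for, such as "unicorns".

```python
from typing import Callable, Dict, List, Tuple

# Keys this sketch treats as standard (an assumption made for illustration).
STANDARD_KEYS = {"user-agent", "allow", "disallow"}


def parse_robots_lines(text: str) -> List[Tuple[str, str]]:
    """Split robots.txt text into (key, value) pairs, skipping comments and blanks."""
    pairs = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)  # split on the first colon only
        pairs.append((key.strip().lower(), value.strip()))
    return pairs


def dispatch(text: str, custom_handlers: Dict[str, Callable[[str], None]]):
    """Collect standard rules, route known custom keys to handlers, list the rest."""
    standard, ignored = [], []
    for key, value in parse_robots_lines(text):
        if key in STANDARD_KEYS:
            standard.append((key, value))
        elif key in custom_handlers:
            custom_handlers[key](value)
        else:
            ignored.append((key, value))
    return standard, ignored


# Usage: a crawler that supports "sitemap" but has no handler for "unicorns".
sitemaps: List[str] = []
sample = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
unicorns: allowed
"""
standard, ignored = dispatch(sample, {"sitemap": sitemaps.append})
```

In this sketch, a crawler that wanted to honor "unicorns" would only need to add one more entry to the handler dictionary; the standard rules are untouched.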
While open-sourcing our parser library, we analyzed the usage of robots.txt rules. In particular, we focused on rules unsupported by the internet draft, such as crawl-delay, nofollow, and noindex. Since these rules were never documented by Google, naturally, their usage in relation to Googlebot is very low. Digging further, we saw their usage was contradicted by other rules in all but 0.001% of all robots.txt files on the internet. These mistakes hurt websites' presence in Google's search results in ways we don’t think webmasters intended.
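If you want to check whether your own robots.txt uses any of these rules, a short script along the following lines could help. This is a rough sketch rather than the analysis we ran: the function name `find_unsupported_rules` and the sample file are made up for illustration, and the set of keys simply mirrors the three rules named above.

```python
from typing import List, Tuple

# The three unsupported rules named in this post.
UNSUPPORTED = {"crawl-delay", "nofollow", "noindex"}


def find_unsupported_rules(text: str) -> List[Tuple[int, str, str]]:
    """Return (line number, key, line) for each line that uses an unsupported rule."""
    findings = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if ":" not in line:
            continue
        key = line.split(":", 1)[0].strip().lower()
        if key in UNSUPPORTED:
            findings.append((number, key, line))
    return findings


# Hypothetical robots.txt content used only for this example.
sample = "User-agent: *\nNoindex: /drafts/\nCrawl-delay: 10\nDisallow: /tmp/\n"
findings = find_unsupported_rules(sample)
for number, key, line in findings:
    print(f"line {number}: unsupported rule '{key}' in: {line}")
```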
In the interest of maintaining a healthy ecosystem and preparing for potential future open source releases, we're retiring all code that handles unsupported and unpublished rules (such as noindex) on September 1, 2019. For those of you who relied on the noindex indexing directive in the robots.txt file, which controls crawling, there are a number of alternative options:

Noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, the noindex directive is the most effective way to remove URLs from the index when crawling is allowed.
404 and 410 HTTP status codes: Both status codes mean that the page does not exist, which will drop such URLs from Google's index once they're crawled and processed.
Password protection: Unless markup is used to indicate subscription or paywalled content, hiding a page behind a login will generally remove it from Google's index.
Disallow in robots.txt: Search engines can only index pages that they know about, so blocking the page from being crawled usually means its content won’t be indexed. While the search engine may also index a URL based on links from other pages, without seeing the content itself, we aim to make such pages less visible in the future.
Search Console Remove URL tool: The tool is a quick and easy method to remove a URL temporarily from Google's search results.
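
As a rough illustration of the first, second, and fourth options above, the following Python sketch shows what each could look like on the server side. The helper `response_for`, the paths, and the example.com URLs are hypothetical; the `X-Robots-Tag` header, the robots meta tag, and the standard library's `urllib.robotparser` are real, but how you wire them into your own server will differ.

```python
from typing import Dict, Set, Tuple
from urllib.robotparser import RobotFileParser

# HTML alternative: a robots meta tag placed in the page's <head>.
NOINDEX_META = '<meta name="robots" content="noindex">'


def response_for(path: str, removed: Set[str], noindex: Set[str]) -> Tuple[int, Dict[str, str]]:
    """Return a (status, headers) pair for a request path (hypothetical helper)."""
    if path in removed:
        return 410, {}  # the page is gone; 404 would work similarly
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if path in noindex:
        # HTTP header alternative to the meta tag; crawling must be allowed
        # for the crawler to see it.
        headers["X-Robots-Tag"] = "noindex"
    return 200, headers


# Disallow in robots.txt, checked with the standard library parser.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /private/"])

status_gone, _ = response_for("/old-page", {"/old-page"}, {"/drafts/post"})
status_ok, headers_ok = response_for("/drafts/post", {"/old-page"}, {"/drafts/post"})
blocked = not parser.can_fetch("Googlebot", "https://example.com/private/report.html")
allowed = parser.can_fetch("Googlebot", "https://example.com/public/index.html")
```

Note that a page blocked by Disallow cannot also rely on noindex, because the crawler never fetches the page to see the header or the meta tag.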

For more guidance about how to remove information from Google's search results, visit our Help Center. If you have questions, you can find us on Twitter and in our Webmaster Community, both offline and online.

Posted by Gary
