Webmaster Central Blog
Official news on crawling and indexing sites for the Google index
New User Agent for News
Wednesday, December 02, 2009
Webmaster Level: Intermediate
Today we are announcing a new user agent for robots.txt called Googlebot-News that gives publishers even more control over their content. In case you haven't heard of robots.txt, it's a web-wide standard that has been in use since 1994 and is supported by all major search engines and well-behaved "robots" that process the web. When a search engine checks whether it has permission to crawl and index a web page, robots.txt is the mechanism it consults.
Until now, publishers could easily contact us via a form if they didn't want to be included in Google News but did want to be in Google's web search index. Now, publishers can manage their content in Google News in an even more automated way: site owners can just add Googlebot-News specific directives to their robots.txt file. As with the Googlebot and Googlebot-Image user agents, the new Googlebot-News user agent can be used to specify which pages of a website should be crawled and ultimately appear in Google News.
Here are a few examples for publishers:
Include pages in both Google web search and News:
User-agent: Googlebot
Disallow:
This is the easiest case. In fact, a robots.txt file is not even required for this case.
Include pages in Google web search, but not in News:
User-agent: Googlebot
Disallow:
User-agent: Googlebot-News
Disallow: /
This robots.txt file says that no files are disallowed from Google's general web crawler, called Googlebot, but the user agent "Googlebot-News" is blocked from all files on the website.
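To make the structure of such a file concrete, here is a minimal Python sketch that splits robots.txt text into groups, one per User-agent line, each with its Disallow values. The function name parse_groups and the dictionary layout are illustrative assumptions, not Google's parser, and the sketch ignores other fields such as Allow or Sitemap.

```python
def parse_groups(robots_txt):
    """Map each lower-cased User-agent token to its list of Disallow values.

    Hypothetical helper for illustration only; it handles just the
    User-agent and Disallow fields used in the examples in this post.
    """
    groups = {}
    current = []      # user agents named by the group currently being read
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            for agent in current:
                groups[agent].append(value)
    return groups


example = """\
User-agent: Googlebot
Disallow:
User-agent: Googlebot-News
Disallow: /
"""

groups = parse_groups(example)
print(groups)
```

In the parsed result, the Googlebot group holds a single empty Disallow value (nothing is blocked), while the Googlebot-News group holds "/" (everything is blocked).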
Include pages in Google News, but not Google web search:
User-agent: Googlebot
Disallow: /
User-agent: Googlebot-News
Disallow:
When parsing a robots.txt file, Google obeys the most specific directive. The first two lines tell us that Googlebot (the user agent for Google's web index) is blocked from crawling any pages from the site. The next directive, which applies to the more specific user agent for Google News, overrides the blocking of Googlebot and gives permission for Google News to crawl pages from the website.
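The "most specific directive" rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Google's implementation: select_group is a hypothetical helper, the groups are given as a plain dictionary of Disallow values, matching by name prefix is a shortcut chosen for brevity, and the wildcard group (*) is left out.

```python
def select_group(groups, crawler):
    """Return the Disallow values of the most specific matching group.

    `groups` maps a user agent token to its Disallow values. A group
    matches when the crawler name starts with the token (an assumption
    made for this sketch); the longest matching token wins. Returns
    None when no group matches.
    """
    name = crawler.lower()
    matches = [agent for agent in groups if name.startswith(agent.lower())]
    if not matches:
        return None
    return groups[max(matches, key=len)]


# The "News but not web search" example from this post.
groups = {
    "googlebot": ["/"],       # web search: everything disallowed
    "googlebot-news": [""],   # News: nothing disallowed
}

news_rules = select_group(groups, "Googlebot-News")
web_rules = select_group(groups, "Googlebot")
print("Googlebot-News uses:", news_rules)
print("Googlebot uses:", web_rules)

# With only a Googlebot group present, Googlebot-News has no group of
# its own and ends up with the Googlebot rules.
fallback_rules = select_group({"googlebot": ["/"]}, "Googlebot-News")
print("Googlebot-News without its own group uses:", fallback_rules)
```

Because "googlebot-news" is the longer matching token, its empty Disallow takes precedence for the News crawler even though the Googlebot group blocks the whole site.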
Block different sets of pages from Google web search and Google News:
User-agent: Googlebot
Disallow: /latest_news
User-agent: Googlebot-News
Disallow: /archives
The pages blocked from Google web search and Google News can be controlled independently. This robots.txt file blocks recent news articles (URLs in the /latest_news folder) from Google web search, but allows them to appear on Google News. Conversely, it blocks premium content (URLs in the /archives folder) from Google News, but allows them to appear in Google web search.
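As a rough illustration of how the two rule sets act independently, the following Python sketch checks a few paths against each crawler's Disallow values using simple prefix matching. The helper is_blocked and the sample paths are hypothetical, and real crawlers apply more detailed matching rules than this.

```python
def is_blocked(path, disallow_values):
    """Return True if `path` starts with any non-empty Disallow value.

    An empty Disallow value blocks nothing, so it is skipped.
    """
    return any(value and path.startswith(value) for value in disallow_values)


web_search_rules = ["/latest_news"]  # Disallow values from the Googlebot group
news_rules = ["/archives"]           # Disallow values from the Googlebot-News group

# Hypothetical sample paths, chosen only to exercise each rule.
sample_paths = [
    "/latest_news/story.html",
    "/archives/2008/story.html",
    "/about.html",
]

for path in sample_paths:
    print(
        path,
        "| blocked for web search:", is_blocked(path, web_search_rules),
        "| blocked for News:", is_blocked(path, news_rules),
    )
```

Each path is evaluated separately against each list, so a page under /latest_news is blocked only for web search and a page under /archives only for News.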
Stop Google web search and Google News from crawling pages:
User-agent: Googlebot
Disallow: /
This robots.txt file tells Google that Googlebot, the user agent for our web search crawler, should not crawl any pages from the site. Because no specific directive for Googlebot-News is given, our News search will abide by the general guidance for Googlebot and will not crawl pages for Google News.
For some queries, we display results from Google News in a discrete box or section on the web search results page, along with our regular web search results. We sometimes do this for Images, Videos, Maps, and Products, too. This is known as Universal search results. Since Google News powers Universal "News" search results, if you block the Googlebot-News user agent then your site's news stories won't be included in Universal search results.
We are currently testing our support for the new user agent. If you see any problems, please let us know. Note that it is possible for Google to return a link to a page in some situations even when we didn't crawl that page. If you'd like to read more about robots.txt, we provide additional documentation on our website. We hope webmasters will enjoy the flexibility and easier management that the Googlebot-News user agent provides.
Written by Jonathan Simon, Webmaster Trends Analyst