Webmaster Central Blog
Official news on crawling and indexing sites for the Google index
Optimize your crawling & indexing
Sunday, August 09, 2009
Webmaster Level: Intermediate to Advanced
Many questions about website architecture, crawling and indexing, and even ranking issues can be boiled down to one central issue:
How easy is it for search engines to crawl your site?
We've spoken on this topic at a number of recent events, and below you'll find our presentation and some key takeaways.
The Internet is a big place; new content is being created all the time. Google has a finite number of resources, so when faced with the nearly-infinite quantity of content that's available online, Googlebot is only able to find and crawl a percentage of that content. Then, of the content we've crawled, we're only able to index a portion.
URLs are like the bridges between your website and a search engine's crawler: crawlers need to be able to find and cross those bridges (i.e., find and crawl your URLs) in order to get to your site's content. If your URLs are complicated or redundant, crawlers are going to spend time tracing and retracing their steps; if your URLs are organized and lead directly to distinct content, crawlers can spend their time accessing your content rather than crawling through empty pages, or crawling the same content over and over via different URLs.
In the slides above you can see some examples of what not to do: real-life examples (though names have been changed to protect the innocent) of homegrown URL hacks and encodings, parameters masquerading as part of the URL path, infinite crawl spaces, and more. You'll also find some recommendations for straightening out that labyrinth of URLs and helping crawlers find more of your content faster, including:
Remove user-specific details from URLs.
URL parameters that don't change the content of the page—like session IDs or sort order—can be removed from the URL and put into a cookie. By putting this information in a cookie and 301 redirecting to a "clean" URL, you retain the information and reduce the number of URLs pointing to that same content.
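As a rough illustration of the cookie-plus-redirect idea, here is a minimal Python sketch. The parameter names (sessionid, sort) and the helper names clean_url and redirect_for are assumptions made for this example, not part of any particular framework; a real site would wire the result into its own response handling.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; substitute whatever your site actually uses.
USER_SPECIFIC_PARAMS = {"sessionid", "sort"}


def clean_url(url):
    """Return (clean URL, dict of user-specific values removed from it)."""
    parts = urlsplit(url)
    kept, moved = [], {}
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key.lower() in USER_SPECIFIC_PARAMS:
            moved[key] = value          # goes into a cookie instead
        else:
            kept.append((key, value))   # stays in the URL
    cleaned = urlunsplit(parts._replace(query=urlencode(kept)))
    return cleaned, moved


def redirect_for(url):
    """Return (301, headers) pointing at the clean URL, or None if already clean."""
    cleaned, moved = clean_url(url)
    if not moved:
        return None
    headers = [("Location", cleaned)]
    for key, value in moved.items():
        # Percent-encode the value so it is safe inside a cookie header.
        headers.append(("Set-Cookie", f"{key}={quote(value, safe='')}; Path=/"))
    return 301, headers
```

For instance, a request for https://example.com/shoes?color=red&sessionid=abc123 would be redirected to https://example.com/shoes?color=red, with the session ID carried in a cookie rather than in the URL.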
Rein in infinite spaces.
Do you have a calendar that links to an infinite number of past or future dates (each with its own unique URL)? Do you have paginated data that returns a status code of 200 when you add &page=3563 to the URL, even if there aren't that many pages of data? If so, you have an infinite crawl space on your website, and crawlers could be wasting their (and your!) bandwidth trying to crawl it all. Consider these tips for reining in infinite spaces.
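One possible way to close off the pagination case is to answer out-of-range page numbers with a 404 instead of a 200. The sketch below assumes a listing with a known item count; the function name page_status and the default page size of 20 are made up for illustration.

```python
import math


def page_status(page, total_items, per_page=20):
    """Return the HTTP status a paginated listing could send for a page number."""
    # An empty listing still has one (empty) first page.
    last_page = max(1, math.ceil(total_items / per_page))
    if 1 <= page <= last_page:
        return 200
    # Pages past the end (or before the start) do not exist.
    return 404
```

With 95 items and 20 per page there are five pages, so page_status(5, 95) gives 200 while page_status(3563, 95) gives 404, and a crawler has no reason to keep following ever-higher page numbers.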
Disallow actions Googlebot can't perform.
Using your robots.txt file, you can disallow crawling of login pages, contact forms, shopping carts, and other pages whose sole functionality is something that a crawler can't perform. (Crawlers are notoriously cheap and shy, so they don't usually "Add to cart" or "Contact us.") This lets crawlers spend more of their time crawling content that they can actually do something with.
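To check that a set of rules blocks what you intend, you can run them through Python's standard urllib.robotparser module before deploying. The paths below (/login, /contact, /cart) and the example.com URLs are hypothetical; your own site's paths will differ, and this parser's matching is only an approximation of how any given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for pages a crawler cannot do anything useful with.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /contact
Disallow: /cart
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="Googlebot"):
    """Return True if these rules let the given user agent fetch the URL."""
    return parser.can_fetch(agent, url)
```

Here allowed("https://example.com/cart") is False, while allowed("https://example.com/products/blue-widget") is True. Disallow rules match by path prefix, so it is worth testing a few real content URLs to make sure none of them are caught by accident.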
One man, one vote. One URL, one set of content.
In an ideal world, there's a one-to-one pairing between URL and content: each URL leads to a unique piece of content, and each piece of content can only be accessed via one URL. The closer you can get to this ideal, the more streamlined your site will be for crawling and indexing. If your CMS or current site setup makes this difficult, you can use the rel=canonical element to indicate the preferred URL for a particular piece of content.
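As a small sketch of what that might look like in a page template, the helper below builds the link element for a page's head section. The function name canonical_link and the example URL are assumptions for illustration; every duplicate URL for the same content would emit the same tag pointing at the one preferred URL.

```python
from html import escape


def canonical_link(preferred_url):
    """Build a <link rel="canonical"> element for the preferred URL."""
    # Escape &, <, > and quotes so the URL is safe inside an HTML attribute.
    href = escape(preferred_url, quote=True)
    return f'<link rel="canonical" href="{href}">'


# The same tag would be placed on every variant of the page.
tag = canonical_link("https://example.com/shoes?color=red&size=9")
```

The resulting tag is `<link rel="canonical" href="https://example.com/shoes?color=red&amp;size=9">`, with the ampersand escaped as HTML requires.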
If you have further questions about optimizing your site for crawling and indexing, check out some of our previous writing on the subject, or stop by our Help Forum.
Posted by Susan Moskwa, Webmaster Trends Analyst