Webmaster Central Blog
Official news on crawling and indexing sites for the Google index
Crawling through HTML forms
pátek, dubna 11, 2008
Written by Jayant Madhavan and Alon Halevy, Crawling and Indexing Team
Google is constantly trying new ideas to improve our coverage of the web. We already do some pretty smart things like scanning JavaScript and Flash to discover links to new web pages, and today, we would like to talk about another new technology we've started experimenting with recently.
In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.
Needless to say, this experiment follows good Internet citizenry practices. Only a small number of particularly useful sites receive this treatment, and our crawl agent, the
ever-friendly Googlebot
, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won't crawl any of the URLs that a form would generate. Similarly, we only retrieve GET forms and avoid forms that require any kind of user information. For example, we omit any forms that have a password input or that use terms commonly associated with personal information such as logins, userids, contacts, etc. We are also mindful of the impact we can have on web sites and limit ourselves to a very small number of fetches for a given site.
The web pages we discover in our enhanced crawl do not come at the expense of regular web pages that are already part of the crawl, so this change doesn't reduce PageRank for your other pages. As such it should only increase the exposure of your site in Google. This change also does not affect the crawling, ranking, or selection of other web pages in any significant way.
This experiment is part of Google's broader effort to increase its coverage of the web. In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web, or
Invisible Web
have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead search engine users to documents that would otherwise not be easily found in search engines, and provide webmasters and users alike with a better and more comprehensive search experience.
Hey!
Check here if your site is mobile-friendly.
Štítky
accessibility
10
advanced
195
AMP
13
Android
2
API
7
apps
7
autocomplete
2
beginner
173
CAPTCHA
1
cms
1
crawling and indexing
158
encryption
3
events
51
feedback and communication
83
forums
5
general tips
90
geotargeting
1
Google Assistant
3
Google I/O
3
Google Images
3
Google News
2
hacked sites
12
hangout
2
hreflang
3
https
5
Chrome
2
images
12
intermediate
205
interstitials
1
javascript
8
job search
2
localization
21
malware
6
mobile
63
mobile-friendly
14
nohacked
1
performance
17
product expert
1
product experts
2
products and services
63
questions
3
ranking
1
recipes
1
rendering
2
Responsive Web Design
3
rich cards
7
rich results
10
search console
35
search for beginners
1
search queries
7
search results
140
security
12
seo
3
sitemaps
46
speed
6
structured data
33
summit
1
TLDs
1
url removals
1
UX
3
verification
8
video
6
webmaster community
24
webmaster forum
1
webmaster guidelines
57
webmaster tools
177
webmasters
3
youtube channel
6
Archive
2020
lis
říj
zář
srp
čvc
čvn
kvě
dub
bře
úno
led
2019
pro
lis
říj
zář
srp
čvc
čvn
kvě
dub
bře
úno
led
2018
pro
lis
říj
zář
srp
čvc
čvn
kvě
dub
bře
úno
led
2017
pro
lis
říj
zář
srp
čvn
kvě
dub
bře
úno
led
2016
pro
lis
říj
zář
srp
čvn
kvě
dub
bře
led
2015
pro
lis
říj
zář
srp
čvc
kvě
dub
bře
úno
led
2014
pro
lis
říj
zář
srp
čvc
čvn
kvě
dub
bře
úno
led
2013
pro
lis
říj
zář
srp
čvc
čvn
kvě
dub
bře
úno
led
2012
pro
lis
říj
zář
srp
čvc
čvn
kvě
dub
bře
úno
led
2011
pro
lis
říj
zář
srp
čvc
čvn
kvě
dub
bře
úno
led
2010
pro
lis
říj
zář
srp
čvc
čvn
kvě
dub
bře
úno
led
2009
pro
lis
říj
zář
srp
čvc
čvn
kvě
dub
bře
úno
led
2008
pro
lis
říj
zář
srp
čvc
čvn
kvě
dub
bře
úno
led
2007
pro
lis
říj
zář
srp
čvc
čvn
kvě
dub
bře
úno
led
2006
pro
lis
říj
zář
srp
Feed
Follow @googlewmc
Give us feedback in our
Product Forums
.
Subscribe via email
Enter your email address:
Delivered by
FeedBurner