Webmaster Central Blog
Official news on crawling and indexing sites for the Google index
Duplicate content due to scrapers
Monday, June 09, 2008
Written by Sven Naumann, Search Quality Team
Since duplicate content is a hot topic among webmasters, we thought it might be a good time to address common questions we get asked regularly at conferences and on the Google Webmaster Help Group.
Before diving in, I'd like to briefly touch on a concern webmasters often voice: in most cases a webmaster has no influence on third parties that scrape and redistribute content without the webmaster's consent. We realize that this is not the fault of the affected webmaster, which in turn means that identical content showing up on several sites is not in itself regarded as a violation of our webmaster guidelines. It simply triggers further processing aimed at determining the original source of the content, something Google is quite good at: in most cases the original content is correctly identified, with no negative effects for the site that originated it.
Generally, we can differentiate between two major scenarios for issues related to duplicate content:
Within-your-domain-duplicate-content, i.e. identical content which (often unintentionally) appears in more than one place on your site
Cross-domain-duplicate-content, i.e. identical content of your site which appears (again, often unintentionally) on different external sites
With the first scenario, you can take matters into your own hands to avoid Google indexing duplicate content on your site. Check out Adam Lasnik's post Deftly dealing with duplicate content and Vanessa Fox's Duplicate content summit at SMX Advanced, both of which give you some great tips on how to resolve duplicate content issues within your site. Here's one additional tip to help avoid content on your site being crawled as duplicate: include the preferred version of your URLs in your Sitemap file. When we encounter different pages with the same content, this may raise the likelihood of us serving the version you prefer. Some additional information on duplicate content can also be found in our comprehensive Help Center article discussing this topic.
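As a rough illustration of the Sitemap tip, the following sketch writes a minimal Sitemap that lists only the preferred URLs. The `build_sitemap` helper and the example.com URLs are hypothetical and not part of any Google tool; the only thing taken from the Sitemap protocol is the `urlset`/`url`/`loc` structure and its namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(preferred_urls):
    """Return a Sitemap XML string listing each preferred URL once."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # dict.fromkeys drops repeated URLs while keeping the original order.
    for url in dict.fromkeys(preferred_urls):
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical preferred URLs: one canonical address per piece of content,
# leaving out alternates such as session-ID or print-view variants.
preferred = [
    "https://www.example.com/",
    "https://www.example.com/articles/duplicate-content",
    "https://www.example.com/",  # accidental repeat, dropped by the helper
]

sitemap_xml = build_sitemap(preferred)
print(sitemap_xml)
```

Listing a URL here is a hint about which version you prefer, not a guarantee of which version is served.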
In the second scenario, you might have the case of someone scraping your content to put it on a different site, often to try to monetize it. It's also common for many web proxies to index parts of sites which have been accessed through the proxy. When encountering such duplicate content on different sites, we look at various signals to determine which site is the original one, which usually works very well. This also means that you shouldn't be very concerned about seeing negative effects on your site's presence on Google if you notice someone scraping your content.
In cases when you are syndicating your content but also want to make sure your site is identified as the original source, it's useful to ask your syndication partners to include a link back to your original content. You can find some additional tips on dealing with syndicated content in a recent post by Vanessa Fox, Ranking as the original source for content you syndicate.
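If you want to verify that a syndication partner's copy really links back to your original, a small script can do the check. This is only a sketch: the `links_back` helper, the HTML snippet, and the URLs are made up for illustration, and it looks for an exact `href` match rather than handling redirects or URL variants.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_back(html, original_url):
    """Return True if the HTML contains a link whose href is original_url."""
    collector = LinkCollector()
    collector.feed(html)
    return original_url in collector.hrefs


# Hypothetical syndicated copy and original URL.
original = "https://www.example.com/articles/duplicate-content"
syndicated_html = """
<article>
  <p>Article text republished by a partner site.</p>
  <p>Originally published at
     <a href="https://www.example.com/articles/duplicate-content">example.com</a>.</p>
</article>
"""

print(links_back(syndicated_html, original))
print(links_back("<p>No attribution here.</p>", original))
```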
Some webmasters have asked what could cause scraped content to rank higher than the original source. That should be a rare case, but if you do find yourself in this situation:
Check if your content is still accessible to our crawlers. You might unintentionally have blocked access to parts of your content in your robots.txt file.
You can look in your Sitemap file to see if you made changes for the particular content which has been scraped.
Check if your site is in line with our webmaster guidelines.
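The first check above can be approximated locally with Python's standard `urllib.robotparser` module. This is a sketch under assumptions: the robots.txt content and URLs are invented, `is_crawlable` is a hypothetical helper, and the standard-library parser does not necessarily match every detail of how Google's crawler interprets robots.txt, so treat the result as a first pass only.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks one directory for all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

print(is_crawlable(robots_txt, "https://www.example.com/articles/post.html"))
print(is_crawlable(robots_txt, "https://www.example.com/private/draft.html"))
```

If a page you expect to rank turns out to be disallowed, an overly broad `Disallow` rule is a likely culprit.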
To conclude, I'd like to point out that in the majority of cases, having duplicate content does not have negative effects on your site's presence in the Google index. It simply gets filtered out. If you check out some of the tips mentioned in the resources above, you'll learn how to gain greater control over what exactly we're crawling and indexing and which versions are more likely to appear in the index. Only when signals point to deliberate and malicious intent might occurrences of duplicate content be considered a violation of the webmaster guidelines.
If you would like to further discuss this topic, feel free to visit our Webmaster Help Group.
For the German version of this post, go to "Duplicate Content aufgrund von Scraper-Sites".