December 22, 2006

Duplicate Content Issues

A few days ago, Adam Lasnik advised webmasters to “not worry too much about duplicate content.” Essentially, Google has ways of identifying similar pages and will, in most cases, index only one of them.

I think this statement is exactly why webmasters worry “a lot” about duplicate content. With all the scrapers and content hijackers out there, they don’t want the thief’s version getting indexed while their own version is left out in the dark.

Some common issues are:

  • You can have two identical pages on your site, which is common in blogs where the pages for a particular date are more or less identical to the regular post pages.
  • You can have content that is grabbed from another site (with or without the original author’s knowledge).
  • You can have “printer friendly” or “mobile friendly” pages with the same content as your web-based content.
  • Different URLs may point to the same page (www.seorevolution.com, www.seorevolution.com/index.html, seorevolution.com, etc.)
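
The last item is the easiest one to audit yourself. Here is a minimal Python sketch that maps the URL variants above onto a single form. It assumes the “www” host is the preferred one and that the listed index file names all serve the home page; both are assumptions for illustration, so adjust them to match your own setup.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the "www" host is the preferred one. Either choice works,
# as long as every variant maps to the same host.
PREFERRED_HOST = "www.seorevolution.com"
HOST_ALIASES = {"seorevolution.com", "www.seorevolution.com"}
# Assumption: these file names serve the same content as the bare directory.
INDEX_FILES = ("index.html", "index.htm", "index.php")


def canonicalize(url: str) -> str:
    """Map common variants of a URL onto one canonical form.

    This sketch ignores ports and drops any #fragment.
    """
    if "://" not in url:
        url = "http://" + url  # a bare "seorevolution.com" has no scheme
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in HOST_ALIASES:
        host = PREFERRED_HOST
    path = parts.path or "/"
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]  # strip the file name, keep the slash
            break
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


if __name__ == "__main__":
    for variant in (
        "www.seorevolution.com",
        "www.seorevolution.com/index.html",
        "seorevolution.com",
    ):
        print(variant, "->", canonicalize(variant))
```

If you run a list of your internal links through something like this and two different raw URLs come out the same, those are candidates for a 301 redirect or a linking cleanup.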

The article also states that if you use duplicate content in hopes of influencing the rankings to your benefit, Google will make adjustments. “In the rare cases in which we perceive that duplicate content … we’ll also make adjustment in the indexing and ranking of the sites involved.”

Nowhere does it state what that “perception” may be or what the resulting “adjustment” would be.

Let’s go over the main points of the article that you need to know:

What is duplicate content? “It generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar.”
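
To get a feel for what “appreciably similar” could mean in practice, here is a rough Python sketch that compares two texts by their overlapping three-word sequences (often called shingles) and reports a score between 0 and 1. This is a common textbook technique and only an illustration. Google has not said how it measures similarity, and the function names and the three-word window are my own assumptions.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty texts are trivially identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    page = "the quick brown fox jumps"
    copy = "the quick brown fox sleeps"
    print(similarity(page, page))  # identical text scores 1.0
    print(similarity(page, copy))  # partial overlap scores in between
```

A score near 1 between two of your own pages is a hint that they may be competing with each other for the same listing.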

A good tip is that Google will NOT treat the same article written in English and Spanish as duplicate content. “Snippets” of text and “quotes” aren’t seen as duplicate content either.

Here is the issue: When Google serves results, they want to serve unique results, meaning results that have distinct information. “This filtering means, for instance, that if your site has articles in ‘regular’ and ‘printer’ versions and neither set is blocked in robots.txt or via noindex meta tag, we’ll choose one version to list.” Believe me, you don’t want to rely on a bot to make the right choice. Adam does state that they prefer to filter rather than make ranking adjustments.

What can you do?

  • Block Appropriately: Instead of relying on Google to make the best choice, make that choice yourself. If you have two versions of a document, such as a regular version and a printer version, place the versions you don’t want listed in their own folder and disallow that folder in the robots.txt file.
  • Use 301s: If you have “dead ends” in your site, use 301 Redirects in your .htaccess file to smartly reroute users and bots. You might also consider using a custom 404 page with the main categories listed in a “site map” format.
  • Show Consistency in Linking: Keep things consistent so you aren’t linking to /page/, /page, and page.htm.
  • Use TLDs: If you use TLDs (Top Level Domains) to handle country-specific content, Google will recognize that .de indicates Germany-focused content more easily than it would if it were de.domain.com.
  • Syndicate Carefully: If you syndicate your content on other sites, make sure each syndicated article contains a link back to the original article. This will help Google recognize which one is the original and index that version.
  • Minimize “Repetition”: If you have lengthy (emphasis on lengthy) copyright text at the bottom of every page, include a brief summary and then a link for more details. The lengthy text could trigger duplicate filters.
  • Avoid Publishing “Stubs”: No one likes seeing “empty” pages and that includes GoogleBot, so avoid having “placeholders” whenever possible. If you have a review site, block pages that have zero reviews. If you have a real estate site, block pages that have zero listings. Nothing is more annoying than a page that states, “Below you’ll find a superb list of all the great rental opportunities in [insert city name]” - but the page contains no listings.
  • Understand Your CMS: Make sure you are familiar, or you hire someone who is familiar, with how the content of your site is displayed. This includes your blog, your forums, and related systems that often show the same content in multiple formats.
  • Understand the “Big Picture”: Don’t fret too much about sites that scrape your content. Though annoying, it’s highly unlikely that such sites can negatively impact your presence in Google. If there is a major issue, let Google know through a DMCA request.

Now that you understand duplicate content better, you can fix the current problems on your site and spend more time focusing on revenue generation.
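
If you follow the “Block Appropriately” advice, it is worth confirming that your robots.txt really does what you intended. The sketch below uses Python’s standard urllib.robotparser module against a sample robots.txt. The /print/ folder and the article.html file names are hypothetical; substitute your own folder and pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: printer-friendly copies live under /print/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_blocked(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if robots_txt disallows `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.seorevolution.com/print/article.html",  # printer copy
        "http://www.seorevolution.com/article.html",        # regular copy
    ):
        state = "blocked" if is_blocked(ROBOTS_TXT, url) else "allowed"
        print(url, "->", state)
```

The printer copy should come back as blocked and the regular copy as allowed. If both come back the same way, the folder rule needs another look.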

Filed under Google by Jerry West
