Google’s New Anti-Scraper Formula - No More Algorithmic Boomerang


Google’s new Report Scraper Page campaign is an important step toward a more effective algorithm against web scrapers. The first Panda update proved that working with the brightest programmers alone is not enough to prevent a very simple yet costly blunder. Google learned this painful lesson, and is now developing and testing metrics with input from the entire online community.

A series of upcoming Panda overhauls will be a big blow to greedy web scrapers. Google is now taking its battle against online plagiarism, spamming and content piracy to the next level, this time involving community-aided changes to its algorithm. Last month, Google launched the Report Scraper Page, an online form through which the victims of scrapers can report the URLs of copyright violators and spammers. This page is intended only to help Google test and improve the algorithmic changes aimed at scraper sites. Spam reports that require corrective action should be sent to the other designated Google channels as usual.

Countless bloggers have complained about ranking poorly relative to the sites that copied their original content. Sometimes even authorized republishers, such as customers of content syndication services like ezine.com, rank higher than the original sources. Google is now finally addressing the loopholes in its algorithm by defining original content more accurately.

Manual Reviews versus Algorithmic Results

According to Google's anti-spam head Matt Cutts, the search engine is looking for a pattern that will distinguish scraper sites from the original content sources. The goal is to develop a formula that can automate scraper-site detection even though millions of new web pages are published every day. But Google wants the webmaster community itself to be involved in pre-qualifying which sites are scrapers and which are the original sources. This is yet another instance of crowdsourcing in action, which will not only reduce error rates but also uncover possible areas of improvement.
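Google has never published the formula itself, but the general class of technique is well understood. As a purely illustrative sketch (not Google's actual method), near-duplicate pages can be flagged by comparing their sets of word "shingles" (overlapping n-grams): two pages that share a large fraction of their shingles are likely copies of each other.

```python
# Illustrative sketch of near-duplicate detection via word shingles.
# This is NOT Google's actual formula, which is unpublished; it only
# shows the general idea behind automated scraper detection.

def shingles(text, n=4):
    """Return the set of n-word shingles (overlapping n-grams) in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

original  = "the quick brown fox jumps over the lazy dog near the river bank"
scraped   = "the quick brown fox jumps over the lazy dog near the old mill"
unrelated = "completely different words about cooking pasta at home tonight"

print(jaccard(shingles(original), shingles(scraped)))    # high overlap
print(jaccard(shingles(original), shingles(unrelated)))  # no overlap
```

The hard part, as the article notes, is not measuring similarity but deciding which of two near-identical pages is the original - which is exactly where community reports come in.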

What could be more accurate than manual reviews of scraper sites? Webmasters are in the best position to identify web scrapers. When Panda absurdly favored web scrapers over original sources earlier this year, Google came to realize the hazards of over-reliance on formulas built with limited end-user input and feedback.

Tiffany Oberoi of Google’s Search Quality team said that the algorithmic method is the preferred approach to tackling web spam. Of course, manually reporting scraper sites, although more accurate, cannot match the speed and massive scale of automated detection. But accuracy and throughput are both important.

Dim Prospects for Web Scrapers

Continuous improvements to Panda reinforce Google’s long-standing generic definition of a quality site: original, useful and user-friendly. Ironically, what encouraged web scraping was Google’s paying little attention to the nature of RSS and content syndication. We learned from the past that link popularity can overshadow originality, as shown by the countless scrapers that stuff their sites with RSS feeds instead of original content, yet rank higher than the original sources. The current dilemma is how to reliably label content as plagiarized given the interconnections among web publishers, writers and content syndicators.

Google is heading in the right direction. It took the lead in supporting the rel=author HTML attribute in June. Large-scale adoption of this attribute will make it easier for Google to accurately assess the originality of a web page. The rel=author attribute on a link acts as an authorship seal for the page bearing it. This still has loopholes, but having an HTML ownership indicator is better than having none.
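In practice, the markup is a link (in the page head or on a byline) whose rel attribute is set to author. A minimal sketch, with placeholder URLs and names that are not real addresses:

```html
<!-- Hypothetical example; the URLs and author name are placeholders. -->
<head>
  <link rel="author" href="https://example.com/authors/jane-doe" />
</head>

<!-- Or inline, on a byline link within the article body: -->
<a rel="author" href="https://example.com/authors/jane-doe">Jane Doe</a>
```

Either form tells crawlers which author profile the page belongs to, giving the search engine one more signal for deciding who published the content first.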

The Report Scraper Page campaign is perhaps the most effective and cost-efficient way to learn how to demote scraper sites and reward originality. For an algorithmic method to succeed, extensive practical testing must be run throughout the development stage. Identifying web scrapers is only one side of the picture; the other is identifying the original source. For many years Google has failed miserably at the second task, so the importance of human reviews and community participation in algorithm design and testing cannot be overemphasized.

6 COMMENTS

  1. Everyone in the SEO field wants the latest information about Google's activities. It is not easy for Google to read all the content of the millions of web pages added daily, so this is a great post for me. Thanks for sharing.

  2. Very nice post.

    I have done some research on this Google loophole. Anyone familiar with landing pages will know that a good idea is to restructure the category of each landing page, using a unique meta tag and a post title separated with "|".

    This is my research, and it works perfectly for me; my blogs rank higher than before :)

  3. Yeah, it is really not so easy to automatically identify content scraper sites. I don't think Google will take the risk of implementing a wrong algorithm in this direction, so it may involve a bit more study and research into easily identifiable patterns for finding content scraper sites.

  4. Last month Google launched the Report Scraper Page campaign from an algorithmic point of view. Google is now finally addressing the loopholes in its algorithm by defining original content more accurately. The main motive for launching this service is to make detection more accurate than manual reviews of scraper sites.

  5. Google is really trying its best to keep its search results as relevant as possible, and I like that it is taking action against spammers. It looks like more updates will be coming from Google soon.

  6. Nice, sir. I also faced the same situation some days ago but never found such a useful post. Thank you for supporting your readers.

Comments are closed.