Google’s New Anti-Scrapers Formula – No More Algorithmic Boomerang


Google’s new Report Scraper Page campaign is an important step toward a more effective algorithm against web scrapers. The first Panda update proved that working with the brightest programmers alone is not enough to prevent a very simple yet costly blunder. Google learned this painful lesson, and is now developing and testing metrics with input from the entire online community.

A series of upcoming Panda overhauls will be a big blow to greedy web scrapers. Google is now taking its battle against online plagiarism, spamming and content piracy to the next level, this time involving community-aided changes to its algorithm. Last month, Google launched the Report Scraper Page, an online form through which victims of scrapers can report the URLs of copyright violators and spammers. The page is intended only to help Google test and improve the algorithmic changes aimed at scraper sites. Spam reports that require corrective action should still be sent through Google’s other designated channels as usual.

Countless bloggers have complained that their pages rank below the sites that copied their original content. Sometimes even authorized republishers, such as customers of content syndication services like ezine.com, rank higher than the original sources. Google is now finally addressing these loopholes in its algorithm by defining more accurately what counts as original content.

Manual Reviews versus Algorithmic Results

According to Google’s anti-spam head Matt Cutts, the search engine is looking for a pattern that will distinguish scraper sites from the original content sources. The goal is to develop a formula that automates scraper site detection even though millions of new web pages are published every day. But Google wants the webmaster community itself to be involved in pre-qualifying which sites are scrapers and which are the original sources. This is yet another example of crowdsourcing in action, which should not only reduce error rates but also uncover possible areas for improvement.

What could be more accurate than manual reviews of scraper sites? Webmasters are in the best position to identify web scrapers. When Panda ridiculously favored web scrapers over original sources early this year, Google came to realize the hazards of over-reliance on formulas built with limited end-user input and feedback.

Tiffany Oberoi from Google’s Search Quality team said that the algorithmic method is the preferred approach to tackling web spam. Manual reporting of scraper sites, although more accurate, cannot match the speed and massive scale of automatic detection. But accuracy and output are both important.

Dim Prospects for Web Scrapers

Continuous improvements to Panda emphasize Google’s long-standing generic definition of a quality site: original, useful and user-friendly. Ironically, what encouraged web scraping was Google’s paying little attention to the nature of RSS and content syndication. We learned from the past that link popularity can overshadow originality, as shown by the countless scrapers that stuff their sites with RSS feeds instead of original content, yet rank higher than the original sources. The current dilemma is how to reliably brand a piece of content as plagiarism given the interconnections among various web publishers, writers and content syndicators.

Google is heading in the right direction. Google took the lead in supporting the rel=author HTML attribute in June. Large-scale adoption of this attribute will make it easier for Google to accurately assess the originality of a web page. A link carrying rel=author acts as a kind of copyright seal for the web page that bears it. This still has loopholes, but an HTML ownership indicator is better than none.
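To make the idea concrete, here is a minimal sketch of how a crawler might read rel=author markup from a page. It is only an illustration built on Python's standard html.parser module, not a description of how Google processes the attribute. The class name, the helper function, the sample page and the example.com URL are all hypothetical.

```python
from html.parser import HTMLParser


class RelAuthorParser(HTMLParser):
    """Collects the href of every <a> or <link> tag whose rel includes "author"."""

    def __init__(self):
        super().__init__()
        self.author_links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="author nofollow"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "author" in rel_tokens and href:
            self.author_links.append(href)


def find_author_links(html_text):
    """Return the author URLs declared in html_text (hypothetical helper)."""
    parser = RelAuthorParser()
    parser.feed(html_text)
    parser.close()
    return parser.author_links


# Hypothetical page: the byline links to the writer's profile with rel="author".
SAMPLE_PAGE = """
<html>
  <body>
    <h1>An original article</h1>
    <p>By <a rel="author" href="https://example.com/authors/jane-doe">Jane Doe</a></p>
    <p>See also <a href="https://example.com/other">another post</a>.</p>
  </body>
</html>
"""

print(find_author_links(SAMPLE_PAGE))
```

A scraper can copy this markup along with everything else, which is one of the loopholes mentioned above. The attribute only becomes a useful signal when the linked author profile can be tied back to the original publisher.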

The Report Scraper Page campaign is perhaps the most effective and cost-efficient way to learn how to demote scraper sites and reward originality. For an algorithmic method to succeed, extensive practical tests must be run throughout the development stage. Identifying web scrapers is just one side of the picture. The other is identifying the original source. For many years, Google has failed miserably at this second task, so the importance of human reviews and community participation in algorithmic design and testing cannot be overemphasized.
