- Category: Search Analytics - September 2015
Web scraping (also called web harvesting or web data extraction) is the use of software that processes the HTML of a web page to extract data for manipulation, such as converting the page to another format (e.g. HTML to WML). It is closely related to web automation, which simulates human browsing using computer software, and to web indexing, in which a bot or web crawler indexes information on the web, a technique adopted by most search engines.
However, web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Common uses of web scraping include online price comparison, contact scraping, weather data monitoring, website change detection, research, web mashups and web data integration.
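The transformation from HTML into structured records can be illustrated with a minimal sketch. It uses only Python's standard library (`html.parser` and `csv`). The sample page, the `TableScraper` class, and the `scrape_table` and `to_csv` helpers are hypothetical names invented for this illustration, and a real scraper would also need to fetch pages over HTTP and cope with far messier markup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1>Price list</h1>
  <table>
    <tr><th>Product</th><th>Price</th></tr>
    <tr><td>Widget</td><td>9.99</td></tr>
    <tr><td>Gadget</td><td>24.50</td></tr>
  </table>
</body></html>
"""


class TableScraper(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a table cell is kept; headings etc. are ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def to_csv(records):
    """Serialize the records as CSV text, ready for a spreadsheet."""
    if not records:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


records = scrape_table(SAMPLE_HTML)
print(to_csv(records))
```

Each table row becomes one record keyed by the header cells, which is the form a price comparison or monitoring tool would load into a database or spreadsheet.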
Web scraping is a field under active development. It shares a common goal with the semantic web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence, and human-computer interaction. Current web scraping solutions range from ad hoc approaches that require human effort to fully automated systems that can convert entire websites into structured information, with limitations.
To be clear, a search engine is not itself a scraper site. Sites such as Google or Yahoo gather content from other websites and index it so that the index can be searched by keyword, then display snippets of the original site content in response to a user's search. In the last few years, however, scraper sites (websites that copy content from other websites using web scraping) have proliferated rapidly as a means of spamming search engines. Open content in particular is a common source of material for scraper sites.
The purpose of creating a scraper site is usually to collect advertising revenue or to manipulate search engine rankings by linking to other sites. Some scraper sites are created solely to make money through advertising programs. In that case they are called Made for AdSense (MFA) sites. This derogatory term refers to websites that have no redeeming value and exist only to lure visitors into clicking on advertisements. Made for AdSense sites are regarded as search engine spam because they dilute the search results with pages that leave users unsatisfied.