Legislation

Since we have chosen web scraping as the buzzword this time, let's take a look at it from a legal point of view as well: taking content from an open content site can amount to a copyright violation if it is done in a way that does not respect the license. For instance, the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-ShareAlike (CC-BY-SA) license still require that a re-publisher credit the original author or inform readers of the license conditions.

How a scraper targets websites naturally depends on its objective. For example, sites with massive amounts of content, such as those of airlines, consumer-electronics retailers, and department stores, may be routinely targeted by their competitors, often to stay abreast of pricing information.

Some scrapers pull snippets and text from websites that rank highly for the keywords they have targeted, hoping to rank well in the search engine results pages (SERPs) themselves. RSS feeds are especially vulnerable to this kind of scraping.
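
To make the RSS point concrete, the following is a minimal sketch, using only the Python standard library and a placeholder feed URL, of how easily a scraper can lift titles and snippets from a feed:

    # Minimal sketch of lifting titles and snippets from an RSS 2.0 feed.
    # The feed URL is a placeholder; any public RSS feed has the same structure.
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://example.com/feed.rss"  # placeholder URL

    with urllib.request.urlopen(FEED_URL, timeout=10) as response:
        tree = ET.parse(response)

    # RSS 2.0 wraps each entry in an <item> element under <channel>.
    for item in tree.getroot().iter("item"):
        title = item.findtext("title", default="")
        snippet = item.findtext("description", default="")
        print(title, "-", snippet[:80])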

Some scraper sites consist of advertisements and paragraphs of words randomly selected from a dictionary. Visitors often click on a pay-per-click advertisement because it is the only comprehensible text on the page.

Sophisticated scraping activity can be camouflaged by using multiple IP addresses and by timing requests so that they do not arrive at robot-like speeds but follow a more human-like pattern.
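
As a rough illustration of the timing aspect, a scraper may space its requests with randomized pauses instead of a fixed, robot-like cadence. The sketch below assumes a small list of placeholder URLs and uses only the standard library:

    # Sketch of "human-like" request timing: randomized pauses between fetches.
    import random
    import time
    import urllib.request

    urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

    for url in urls:
        with urllib.request.urlopen(url, timeout=10) as response:
            html = response.read()
        # Wait a randomized interval so the access pattern looks irregular
        # rather than arriving at a constant, machine-like rate.
        time.sleep(random.uniform(2.0, 8.0))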

Operators of these scraper sites gain financially from these clicks. Advertising networks claim to be constantly working to remove such sites from their programs, although this is contested, since the networks benefit directly from the clicks these sites generate. From the advertisers' point of view, at least, the networks do not seem to be making enough of an effort to stop the problem.

Scrapers tend to be associated with link farms and are sometimes perceived as the same thing when multiple scrapers link to the same target site. A frequently targeted site may even be accused of link-farm participation because of the artificial pattern of incoming links pointing to it from multiple scraper sites.

Besides that, some spammers who create scraper sites hijack recently expired domain names, which lets them exploit the search rankings and incoming links the domain has already established. Some even try to match the topic of the expired site in order to keep its rankings for those keywords; for example, an expired website belonging to a photographer might be hijacked by a spammer who generates a scraper site about photography tips.

The administrator of a website can use various measures to stop or slow such activities by applying, for instance, the following techniques:

• Blocking an IP address, either manually or based on criteria such as geolocation and DNSRBL. This will also block all browsing from that address (a minimal sketch of this kind of check follows the list).

• Disabling any web service API that the website's system might expose.

• Bots sometimes declare who they are (via their user agent string) and can be blocked on that basis, for example through robots.txt; 'googlebot' is an example of a bot that identifies itself. Other bots, however, make no distinction between themselves and a human using a browser (a sample robots.txt follows the list).

• Bots can be blocked by monitoring for excess traffic (a rate-limiting sketch follows the list).

• Bots can sometimes be blocked with tools that verify a real person is accessing the site, such as a CAPTCHA; unfortunately, bots are sometimes coded to explicitly break specific CAPTCHA patterns.

• Commercial anti-bot and anti-scraping services can be a solution; some web application firewalls also offer limited bot-detection capabilities.

• Locating bots with a honeypot or another method of identifying the IP addresses of automated crawlers can also work (as sketched below).
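
For the IP-based blocking item, a minimal server-side sketch might combine a static denylist with a DNSBL lookup; zen.spamhaus.org is used here purely as a well-known example list, and the addresses come from the documentation ranges:

    # Sketch of IP-based blocking: a static denylist plus an optional DNSBL lookup.
    import socket

    STATIC_BLOCKLIST = {"203.0.113.7", "198.51.100.23"}  # example addresses

    def is_listed_in_dnsbl(ip, dnsbl="zen.spamhaus.org"):
        # DNSBLs are queried by reversing the IP's octets and appending the zone.
        query = ".".join(reversed(ip.split("."))) + "." + dnsbl
        try:
            socket.gethostbyname(query)
            return True       # an answer means the address is listed
        except socket.gaierror:
            return False      # NXDOMAIN (or lookup failure): not listed

    def should_block(ip):
        return ip in STATIC_BLOCKLIST or is_listed_in_dnsbl(ip)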
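
The robots.txt mentioned above only affects bots that declare themselves and choose to honour the file; a scraper that lies about its user agent simply ignores it. A minimal file that shuts out one declared crawler while leaving the rest of the site open could look like this (the bot name is made up):

    # Example robots.txt: ask one declared crawler to stay away entirely.
    User-agent: BadBot
    Disallow: /

    # All other (well-behaved) crawlers may access everything.
    User-agent: *
    Disallow: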
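
The excess-traffic item usually boils down to per-client rate limiting. A minimal sliding-window sketch, with the window length and threshold chosen arbitrarily for illustration, could look like this:

    # Sketch of excess-traffic detection: a sliding-window request counter per IP.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60            # arbitrary illustration values
    MAX_REQUESTS_PER_WINDOW = 120

    _request_log = defaultdict(deque)   # ip -> timestamps of recent requests

    def is_rate_limited(ip):
        now = time.monotonic()
        timestamps = _request_log[ip]
        timestamps.append(now)
        # Drop timestamps that have fallen out of the window.
        while timestamps and now - timestamps[0] > WINDOW_SECONDS:
            timestamps.popleft()
        return len(timestamps) > MAX_REQUESTS_PER_WINDOW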
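
Finally, the honeypot idea can be as simple as a link that human visitors never see but that is present in the HTML: any client requesting it is almost certainly an automated crawler. The trap path and hidden-link markup below are made up for illustration:

    # Honeypot sketch: a hidden trap link that only automated crawlers will follow.
    TRAP_PATH = "/do-not-follow"   # made-up path, hidden from humans via CSS
    TRAP_LINK_HTML = '<a href="/do-not-follow" style="display:none">ignore</a>'

    _flagged_ips = set()

    def record_hit(path, ip):
        # Real users never see the hidden link, so a request for the trap path
        # marks the client as a bot worth blocking or throttling.
        if path == TRAP_PATH:
            _flagged_ips.add(ip)

    def is_flagged(ip):
        return ip in _flagged_ips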

Source: Wikipedia