Web scraping

Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Mozilla Firefox.

Web scraping is closely related to web indexing, which indexes information on the web using a bot or web crawler and is a universal technique adopted by most search engines. In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to web automation, which simulates human browsing using computer software. Uses of web scraping include online price comparison, contact scraping, weather data monitoring, website change detection, research, web mashup and web data integration.
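
As a minimal illustration of the idea (the URL and the title-extraction pattern below are placeholders, not part of any particular tool), a scraper fetches a page over HTTP and turns a fragment of unstructured HTML into a structured record:

    import re
    from urllib.request import urlopen

    # Placeholder URL used purely for illustration.
    url = 'http://example.com/'
    html = urlopen(url).read().decode('utf-8')

    # Pull the page title out of the raw HTML and store it as structured data.
    match = re.search(r'<title>(.*?)</title>', html, re.DOTALL | re.IGNORECASE)
    record = {'url': url, 'title': match.group(1).strip() if match else None}
    print(record)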

Legal issues

Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear.

Technical measures to stop bots

  • Blocking an IP address either manually or based on criteria such as Geolocation and DNSRBL. This will also block all browsing from that address.
  • Disabling any web service API that the website's system might expose.
  • Bots sometimes declare who they are (using user agent strings) and can be blocked on that basis (using robots.txt); 'googlebot' is an example. Other bots make no distinction between themselves and a human using a browser.
  • Bots can be blocked by excess traffic monitoring.
  • Bots can sometimes be blocked with tools to verify that it is a real person accessing the site, like a CAPTCHA. Bots are sometimes coded to explicitly break specific CAPTCHA patterns or may employ third-party services that utilize human labor to read and respond in real-time to CAPTCHA challenges.
  • Commercial anti-bot services: Companies offer anti-bot and anti-scraping services for websites. A few web application firewalls have limited bot detection capabilities as well.
  • Locating bots with a honeypot or other method to identify the IP addresses of automated crawlers.
  • Obfuscation using CSS sprites to display such data as phone numbers or email addresses, at the cost of accessibility to screen reader users.
  • Because bots rely on consistency in the front-end code of a target website, adding small variations to the HTML/CSS surrounding important data and navigation elements requires more human involvement in the initial set-up of a bot. If done effectively, this can make the target website too difficult to scrape, because the scraping process can no longer be fully automated.

What information can help?

Checking robots.txt

We need to interpret robots.txt so that we avoid downloading URLs that the site has blocked for crawlers.
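
A minimal sketch using Python's standard urllib.robotparser module; the site URL and user agent string below are placeholder assumptions:

    from urllib import robotparser

    # Placeholder site and user agent for illustration.
    rp = robotparser.RobotFileParser()
    rp.set_url('http://example.com/robots.txt')
    rp.read()

    user_agent = 'MyScraper'
    url = 'http://example.com/some/page.html'
    if rp.can_fetch(user_agent, url):
        print('Allowed by robots.txt:', url)
    else:
        print('Blocked by robots.txt:', url)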

Examining the Sitemap
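
A website's Sitemap file, usually listed in robots.txt, enumerates the URLs the site wants crawlers to find and is a convenient way to discover pages to download. A minimal sketch, assuming a standard XML sitemap at a placeholder URL:

    import re
    from urllib.request import urlopen

    # Placeholder sitemap URL; in practice, take it from the Sitemap line in robots.txt.
    sitemap_url = 'http://example.com/sitemap.xml'
    sitemap = urlopen(sitemap_url).read().decode('utf-8')

    # Each page URL in a sitemap sits between <loc> tags.
    links = re.findall(r'<loc>(.*?)</loc>', sitemap)
    print(links)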

Throttling downloads


If you crawl a website too fast, you risk being blocked by the website. To minimize this risk, we need to throttle our crawl by waiting for a delay between downloads.
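
One common sketch is a small throttle class that remembers when each domain was last requested and sleeps until the chosen delay has elapsed; the class name and the two-second delay below are illustrative choices, not a fixed API:

    import time
    from urllib.parse import urlparse

    class Throttle:
        """Wait so that requests to the same domain are at least `delay` seconds apart."""

        def __init__(self, delay):
            self.delay = delay            # minimum seconds between requests per domain
            self.last_accessed = {}       # domain -> time of the last request

        def wait(self, url):
            domain = urlparse(url).netloc
            last = self.last_accessed.get(domain)
            if self.delay > 0 and last is not None:
                sleep_secs = self.delay - (time.time() - last)
                if sleep_secs > 0:
                    time.sleep(sleep_secs)
            self.last_accessed[domain] = time.time()

    # Illustrative usage: call wait() before every download.
    throttle = Throttle(delay=2)
    for url in ['http://example.com/page1', 'http://example.com/page2']:
        throttle.wait(url)
        # download(url) would go here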

Spider traps

A spider trap is a set of pages, often generated dynamically (for example, a calendar with an endless "next month" link), that can keep a crawler following new URLs forever. Avoiding spider traps is important when developing your web scraping code; a common safeguard is to limit how many links deep the crawler will follow from the seed URL.
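
A minimal sketch of such a depth-limited crawl; download and get_links are simplified helpers written here for illustration, not part of any library:

    import re
    from urllib.request import urlopen

    def download(url):
        # Fetch a page and return its HTML as text.
        with urlopen(url) as response:
            return response.read().decode('utf-8', errors='replace')

    def get_links(html):
        # Naive extraction of absolute links in double-quoted href attributes.
        return re.findall(r'href="(http[^"]+)"', html)

    def crawl(seed_url, max_depth=2):
        # Record how many links deep each URL was found and stop following links
        # past max_depth, so endlessly generated links cannot trap the crawler.
        seen = {seed_url: 0}
        queue = [seed_url]
        while queue:
            url = queue.pop(0)
            depth = seen[url]
            if depth >= max_depth:
                continue
            for link in get_links(download(url)):
                if link not in seen:
                    seen[link] = depth + 1
                    queue.append(link)
        return seen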

Solving CAPTCHA


CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. As the acronym suggests, it is a test to determine whether the user is human.
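
Only the simplest image CAPTCHAs can be attacked directly with OCR; the sketch below assumes such an image at a placeholder URL and that the Pillow and pytesseract packages are installed. Harder CAPTCHAs are usually routed to the third-party human-solving services mentioned earlier.

    from io import BytesIO
    from urllib.request import urlopen

    from PIL import Image
    import pytesseract

    # Placeholder CAPTCHA image URL for illustration.
    captcha_url = 'http://example.com/captcha.png'
    image = Image.open(BytesIO(urlopen(captcha_url).read()))

    # Convert to black and white so the distorted text stands out for the OCR engine.
    image = image.convert('L').point(lambda p: 255 if p > 128 else 0)

    text = pytesseract.image_to_string(image).strip()
    print('Recognized text:', text)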