Want to Extract a Large Amount of Data from the Web? Use Web Scraping

Need to extract a large amount of data from the web? Doing it manually is not practical because it is a very time-consuming process. We need techniques that do it quickly and easily.

The solution is web scraping!

Web scraping is the process of extracting large amounts of data from websites. It is also called screen scraping, web data extraction, or web harvesting.

Common web scraping methods include:

  • Text grepping & Regular Expression matching
  • HTTP Programming
  • HTML Parsers 
  • DOM Parser
  • Web Scraping Software
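To make two of the methods above concrete, here is a minimal sketch that pulls a page title out of an HTML string, first with a regular expression and then with a DOM parser. The sample markup and the helper names title_by_regex() and title_by_dom() are assumptions made for illustration; the sketch also assumes PHP's standard DOM extension is enabled.

```php
<?php
// Sample markup standing in for a downloaded page (hypothetical content).
$html = '<html><head><title>Example Page</title></head>'
      . '<body><h1>Hello</h1><p class="price">19.99</p></body></html>';

// 1. Text grepping with a regular expression: quick to write,
//    but it breaks easily when the markup changes.
function title_by_regex(string $html): ?string
{
    if (preg_match('/<title>(.*?)<\/title>/is', $html, $matches)) {
        return trim($matches[1]);
    }
    return null;
}

// 2. DOM parsing: builds a document tree and queries it by tag name.
function title_by_dom(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $nodes = $doc->getElementsByTagName('title');
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

echo title_by_regex($html), PHP_EOL;
echo title_by_dom($html), PHP_EOL;
```

Both calls return the same title here. On real pages the DOM approach is usually the more robust of the two, because it does not depend on the exact spelling of the tags in the source.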

We can use PHP, Java, .NET, ASP, Ajax, Python, and many other languages and technologies for web scraping.

Let’s take an example of web scraping using PHP:

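$url = 'http://www.gurutechnolabs.com';
// file_get_contents() returns the page body as a string, or false on failure
$output = file_get_contents($url);
if ($output !== false) {
    echo $output;
}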

This is a small script that gets the content of the web page “gurutechnolabs.com” using the file_get_contents() function. We can also use cURL for web scraping:


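$url = 'http://www.gurutechnolabs.com';
$ch = curl_init($url);
// Return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
if ($curl_scraped_page === false) {
    echo 'cURL error: ' . curl_error($ch);
} else {
    echo $curl_scraped_page;
}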

So, web scraping is very useful for getting data from web pages. In principle, we can scrape any web page that can be viewed in a web browser.
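The two scripts above only print the raw HTML. As a rough sketch of the next step, the function below collects every link on a page using PHP's DOMDocument and DOMXPath classes. The name extract_links() and the sample markup are hypothetical and chosen only for illustration.

```php
<?php
// Collect the href value of every <a> element in an HTML string.
// extract_links() is a hypothetical helper name, not a PHP built-in.
function extract_links(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string, so handle it up front.
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; keep parser warnings quiet.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];
    // Select only anchors that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Inline sample so the sketch runs without network access.
$sample = '<html><body>'
        . '<a href="/about">About</a> '
        . '<a href="http://example.com/">Example</a> '
        . '<a name="top">No href here</a>'
        . '</body></html>';

print_r(extract_links($sample));
```

In a real script, the string returned by file_get_contents() or curl_exec() would be passed in place of $sample.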

But there is one open question about web scraping: is it legal?

Scraping may be against the terms of use of some websites, although the enforceability of these terms is unclear.

Justin Abrahms has written a nice article on the ethics of web scraping.

Ready-made web scraping tools are also available, so you can scrape without writing code. Two well-known ones are webscraper.io and import.io.

For more, read the article on web scraping tools by Dianna Labrien: "4 web scraping tools to save data extraction time".

Why Use Web Scraping?

Web scraping is inexpensive compared with collecting data manually, and it provides fast, accurate results.
