
Data Scraping With PHP and Python


Web scraping opens up many kinds of data analysis, which makes it a highly valuable tool. Learn how to do it with PHP and Python!


According to the latest estimates, the total number of websites is above one billion, with sites being added and removed all the time. Just imagine the amount of data floating around the internet: far more than any human can digest in a lifetime. To harness that data, you need not only access to it but also a scalable way to collect it so that you can organize and analyze it. That’s why you need web data scraping.

Web scraping, also known as web harvesting, web data extraction, or screen scraping, is a technique in which a computer program extracts large amounts of data from a website and saves it to a local file, database, or spreadsheet in a format that you can analyze. Web scraping saves a great deal of time because it automates the copying and pasting of selected information from a page or even an entire website.

Mastering data scraping can open up a new world of great possibilities for content analysis. Content and news are crucial for increasing website traffic, so monitoring news and popular web publications on a daily basis using web scraping techniques can be very helpful.

Web Scraping With Python and BeautifulSoup

If you need to get content from a large number of internet sources, you will likely need to develop your own data scraping tools. Here, we are going to show you how to build a web scraper using Python and the simple and powerful BeautifulSoup library.

The first step is to import the two libraries we are going to use, requests and BeautifulSoup:

# Import libraries
import requests
from bs4 import BeautifulSoup

Next, let’s store the page address in a URL variable and use the requests.get() method to fetch the HTML content of that page:

import requests
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
print(r.content)
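
The call above assumes that the request succeeds. A more defensive fetch might look like the sketch below, which adds a timeout and fails loudly on HTTP error statuses. The helper name fetch_html, the 10-second default, and the User-Agent string are illustrative assumptions, not part of the original example:

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Return the decoded body of `url`, raising on HTTP errors.

    `fetch_html` is a hypothetical helper. `session` may be any object
    with a requests-style `get` method; a new Session is used by default.
    """
    if session is None:
        session = requests.Session()
    # Without a timeout, requests may wait indefinitely on a stalled server.
    response = session.get(
        url,
        timeout=timeout,
        headers={"User-Agent": "quote-scraper-example/0.1"},  # assumed value
    )
    # Turn 4xx/5xx answers into exceptions instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("http://www.values.com/inspirational-quotes") would then return the page text, which can be handed to BeautifulSoup in the same way as r.content.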

Our next step is parsing a web page, so we need to create a BeautifulSoup object:

import requests
from bs4 import BeautifulSoup

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

# Create a BeautifulSoup object; the html5lib parser is a separate
# package and must be installed alongside beautifulsoup4
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

Now we can move on to extracting some useful data from the HTML content. For our example, we took a page that consists of quotes, and our program will save them. First, look through the HTML that was printed by the soup.prettify() method to identify a pattern for navigating the quotes. In our example, all the quotes are inside a div container with the ID ‘container’. We can find this div element using the find() method.

table = soup.find('div', attrs = {'id':'container'})

Each quote is inside its own div container that belongs to the class ‘quote’, so we have to repeat the extraction for every such div. To do that, we use the findAll() method (named find_all() in current BeautifulSoup releases) and loop over the results with a variable called row.

For each row, we create a dictionary that holds all the data about one quote and append it to a list called ‘quotes’.

quotes = []  # a list to store quotes
table = soup.find('div', attrs = {'id':'container'})
# find_all is the current name for findAll
for row in table.find_all('div', attrs = {'class':'quote'}):
    quote = {}  # one dictionary per quote
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.h6.text
    quote['author'] = row.p.text
    quotes.append(quote)
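
The loop above raises an exception as soon as one quote lacks a tag, and the extracted links may be relative paths. The following sketch shows one way to handle both cases. It runs on made-up sample markup that only mimics the structure described above, and it uses Python's built-in html.parser instead of html5lib. The names SAMPLE_HTML and extract_quotes are hypothetical:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described above.
SAMPLE_HTML = """
<div id="container">
  <div class="quote">
    <h5>Courage</h5>
    <a href="/quotes/1"><img src="/img/1.jpg"></a>
    <h6>Be brave.</h6>
    <p>Anonymous</p>
  </div>
  <div class="quote">
    <h5>Patience</h5>
    <h6>Wait for it.</h6>
  </div>
</div>
"""


def extract_quotes(html, base_url):
    """Collect one dict per quote, using None for missing pieces."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no html5lib
    container = soup.find("div", attrs={"id": "container"})
    if container is None:  # layout changed, or a different page came back
        return []
    quotes = []
    for row in container.find_all("div", attrs={"class": "quote"}):
        link = row.find("a")
        image = row.find("img")
        author = row.find("p")
        quotes.append({
            "theme": row.h5.get_text(strip=True) if row.h5 is not None else None,
            # urljoin turns relative paths into absolute URLs.
            "url": urljoin(base_url, link["href"]) if link is not None else None,
            "img": urljoin(base_url, image["src"]) if image is not None else None,
            "lines": row.h6.get_text(strip=True) if row.h6 is not None else None,
            "author": author.get_text(strip=True) if author is not None else None,
        })
    return quotes


quotes = extract_quotes(SAMPLE_HTML, "http://www.values.com/")
print(quotes[0]["url"])
```

The second sample quote has no link, image, or author, so those fields come back as None instead of stopping the whole run.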

Our final step is writing the data to a CSV file, which is a common format for databases and spreadsheets.

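import csv

filename = 'inspirational_quotes.csv'
# Python 3: open in text mode with newline='' so csv controls line endings
with open(filename, 'w', newline='', encoding='utf-8') as f:
    w = csv.DictWriter(f, ['theme','url','img','lines','author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)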
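
In Python 3, the csv module expects a text-mode file opened with newline=''. To check that quote text containing commas or quotation marks survives the trip, the sketch below writes to an in-memory buffer and reads the result back. The sample rows are made up, and quotes_to_csv is a hypothetical helper name:

```python
import csv
import io

FIELDS = ["theme", "url", "img", "lines", "author"]

# Made-up rows standing in for the scraped `quotes` list.
sample_quotes = [
    {"theme": "Courage", "url": "/quotes/1", "img": "/img/1.jpg",
     "lines": 'He said "go", so we went.', "author": "Anonymous"},
    {"theme": "Patience", "url": "/quotes/2", "img": "/img/2.jpg",
     "lines": "Wait, then act.", "author": "Unknown"},
]


def quotes_to_csv(rows, fieldnames=FIELDS):
    """Serialize dict rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = quotes_to_csv(sample_quotes)
# Reading the text back shows that embedded quotes and commas are preserved.
round_trip = list(csv.DictReader(io.StringIO(csv_text, newline="")))
```

DictWriter quotes any field that contains a comma or a quotation mark, so the scraped text needs no manual escaping.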

This is a simple example of how to perform web scraping with Python and the BeautifulSoup library, which is great for small-scale web scraping. If you want to scrape data at a large scale, you should consider using alternatives.

Scraping Websites With PHP and cURL

If you want to download graphics, pictures, and videos from a number of websites, a good option is to use PHP with the cURL library, which allows connections to a variety of servers and protocols. cURL can transfer files using an extensive list of protocols, including not only HTTP but also FTP, which can be useful for creating a web spider that automatically downloads virtually anything from the web to a server. The following function fetches a page and returns its contents as a string:

<?php

// Fetch a URL with cURL and return the response body as a string.
function curl_download($Url){

    // Make sure the cURL extension is available.
    if (!function_exists('curl_init')){
        die('cURL is not installed. Install and try again.');
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    // Return the response instead of printing it directly.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    curl_close($ch);

    return $output;
}

print curl_download('http://www.gutenberg.org/browse/scores/top');

?>
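
For readers following the Python half of this article, the same download idea can be sketched with the requests library. This is an illustration rather than a port of the PHP function: it streams the response to disk in chunks, which suits larger files such as pictures and videos. The helper name download_file, the chunk size, and the timeout are assumptions:

```python
import requests


def download_file(url, destination, session=None, chunk_size=8192, timeout=10.0):
    """Stream `url` to the file path `destination` and return that path.

    `download_file` is a hypothetical helper. `session` may be any object
    with a requests-style `get` method; a new Session is used by default.
    """
    if session is None:
        session = requests.Session()
    # stream=True defers the body download so it can be written in chunks
    # instead of being held in memory all at once.
    with session.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(destination, "wb") as out:
            for chunk in response.iter_content(chunk_size=chunk_size):
                out.write(chunk)
    return destination
```

Unlike cURL, requests handles HTTP and HTTPS only, so FTP transfers would need a different tool.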

Content providers should set specific goals that they want to accomplish, such as scraping the latest content or the content that Google indexed on certain dates, capturing page titles and the date and time when posts were published, or gathering follower counts from social networks. The possibilities for using web scraping to analyze content and apply the results to your content marketing strategy are virtually endless, which makes it a highly valuable tool for any content provider.


Topics:
web scraping, big data, data scraping, php, python, tutorial

Opinions expressed by DZone contributors are their own.
