DZone

Data Scraping With PHP and Python

Web scraping opens up many types of data analysis, making it a highly valuable tool. Learn how to do it with PHP and Python!

By Joyce Cantu · Sep. 01, 17 · Tutorial

According to the latest estimates, the total number of websites is above one billion, with new sites being added and removed all the time. Just imagine the amount of data that's floating around the internet. It's much more than any human can digest in a lifetime. To harness that data, you need more than access to it: you also need a scalable way to collect it so that you can organize and analyze it. That's why you need web data scraping.

Web scraping, also known as web harvesting, web data extraction, or screen scraping, is a technique in which a computer program extracts large amounts of data from a website and saves it to a local file, database, or spreadsheet in a format that you can work with for analysis. It is sometimes loosely called data mining, although that term more properly describes analyzing data after it has been collected. Web scraping saves a great deal of time because it automates the process of copying and pasting selected information from a page or even an entire website.

Mastering data scraping can open up a new world of great possibilities for content analysis. Content and news are crucial for increasing website traffic, so monitoring news and popular web publications on a daily basis using web scraping techniques can be very helpful.

Web Scraping With Python and BeautifulSoup

If you need to get content from a large number of internet sources, you will likely need to develop your own data scraping tools. Here, we are going to show you how to build a web scraper using Python and the simple and powerful BeautifulSoup library.

The first step is importing the libraries we are going to use: requests and BeautifulSoup:

# Import libraries
import requests
from bs4 import BeautifulSoup

Next, let's store the page address in a URL variable and use the requests.get method to fetch the HTML content of that page:

import requests
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
print(r.content)
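The call above assumes the request succeeds. A slightly more defensive fetch might look like the sketch below. The helper name fetch_html, the 10-second default timeout, and the User-Agent string are illustrative assumptions and not part of the original example.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the raw HTML bytes for url, or raise if the request fails.

    fetch_html is a hypothetical helper; the timeout default and the
    User-Agent string are assumptions chosen for illustration.
    """
    # A requests.Session exposes the same .get() call as the requests module.
    http = session if session is not None else requests
    response = http.get(
        url,
        timeout=timeout,  # avoid waiting forever on a slow server
        headers={"User-Agent": "example-scraper/0.1"},
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.content
```

Calling fetch_html("http://www.values.com/inspirational-quotes") would then return the same bytes as r.content, but fail loudly on an error status instead of returning an error page.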

Our next step is parsing a web page, so we need to create a BeautifulSoup object:

import requests 
from bs4 import BeautifulSoup
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

# Create a BeautifulSoup object (the 'html5lib' parser is a separate package and must be installed)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

Now, we can move on to extracting some useful data from the HTML content. For our example, we chose a page that consists of quotes, and we will write a program that saves them. But first, you should look through the HTML content of the web page that was printed using the soup.prettify() method to identify a pattern for navigating to the quotes. In our example, all the quotes are inside a div container with ID ‘container’. We can find this div element using the find() method.

table = soup.find('div', attrs = {'id':'container'})
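To see how find() and findAll() behave without touching the network, the sketch below parses a small hand-written HTML snippet. The snippet only imitates the structure described above (a div with ID ‘container’ holding div elements of class ‘quote’); it is an assumption, not the real page markup. It uses Python's built-in 'html.parser' so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page; the structure is assumed.
SAMPLE_HTML = """
<div id="container">
  <div class="quote"><h5>Courage</h5></div>
  <div class="quote"><h5>Kindness</h5></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching element, or None when nothing matches.
container = soup.find("div", attrs={"id": "container"})
missing = soup.find("div", attrs={"id": "does-not-exist"})

# find_all() is the current name for findAll(); it returns every match.
themes = [
    row.h5.text
    for row in container.find_all("div", attrs={"class": "quote"})
]

print(themes)   # ['Courage', 'Kindness']
print(missing)  # None
```

Because find() returns None for a missing element, it is worth checking the result before calling further methods on it.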

Each quote is inside a div container that belongs to the class ‘quote’, so we repeat the same extraction for every such container. To do that, we use the findAll() method (named find_all() in current BeautifulSoup releases) and loop over the results with a variable named row.

For each quote, we create a dictionary that holds its data and append that dictionary to a list called ‘quotes’.

quotes = []  # a list to store quotes
table = soup.find('div', attrs = {'id':'container'})
for row in table.findAll('div', attrs = {'class':'quote'}):
    quote = {}  # one dictionary per quote
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.h6.text
    quote['author'] = row.p.text
    quotes.append(quote)
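The href and src values collected above may be relative paths rather than full URLs; that depends on the page and is an assumption here. If they are relative, the standard library's urllib.parse.urljoin can resolve them against the page URL. The helper name absolutize and the sample record below are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.values.com/inspirational-quotes"


def absolutize(quote, base_url=BASE_URL):
    """Return a copy of quote with 'url' and 'img' resolved against base_url."""
    resolved = dict(quote)  # leave the caller's dictionary untouched
    for key in ("url", "img"):
        if resolved.get(key):
            # urljoin leaves values that are already absolute unchanged
            resolved[key] = urljoin(base_url, resolved[key])
    return resolved


# Made-up record in the same shape as the dictionaries built above.
sample = {
    "theme": "Courage",
    "url": "/inspirational-quotes/1234",
    "img": "http://cdn.example.com/a.jpg",
}
print(absolutize(sample)["url"])
```

Applied to the scraped data, this would be a one-liner such as quotes = [absolutize(q) for q in quotes].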

Our final step is writing the data to a CSV file, which is a common format for databases and spreadsheets.

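import csv

filename = 'inspirational_quotes.csv'
# On Python 3, open the file in text mode with newline='' for the csv module
with open(filename, 'w', newline='', encoding='utf-8') as f:
    w = csv.DictWriter(f, ['theme','url','img','lines','author'])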
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
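A quick way to check the export is to read the file back with csv.DictReader. The sketch below is self-contained: it first writes a single made-up row to a temporary file, so the sample data and the temporary path are assumptions rather than scraped results.

```python
import csv
import os
import tempfile

FIELDS = ["theme", "url", "img", "lines", "author"]

# Made-up row standing in for scraped data.
sample_quotes = [
    {
        "theme": "Courage",
        "url": "/q/1",
        "img": "/img/1.jpg",
        "lines": "Be brave, always.",
        "author": "Unknown",
    },
]

path = os.path.join(tempfile.mkdtemp(), "inspirational_quotes.csv")

# newline='' lets the csv module control line endings itself.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, FIELDS)
    writer.writeheader()
    writer.writerows(sample_quotes)

# Read the file back; each row is a mapping keyed by the header names.
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(rows[0]["lines"])
```

The comma inside the sample quote is there on purpose: the csv module quotes such fields when writing and restores them when reading, which hand-rolled string joining would not do.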

This is a simple example of how to perform web scraping with Python and the BeautifulSoup library, which is great for small-scale web scraping. If you want to scrape data at a large scale, you should consider a dedicated crawling framework such as Scrapy.

Scraping Websites With PHP and cURL

If you want to download graphics, pictures, and videos from a number of websites, a good option is to use PHP with the cURL library, which allows connections to a variety of servers and protocols. cURL can transfer files using an extensive list of protocols, including not only HTTP but also FTP, which can be useful for creating a web spider that automatically downloads virtually anything from the web to a server.

<?php

function curl_download($Url){

    if (!function_exists('curl_init')){
        die('cURL is not installed. Install and try again.');
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    // Return the response as a string instead of printing it directly
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    curl_close($ch);

    return $output;
}

print curl_download('http://www.gutenberg.org/browse/scores/top');

?>

Content providers should set specific goals for what they want to accomplish: scraping the latest content or the content that Google indexed on certain dates, capturing page titles and the date and time when posts were published, gathering follower counts from social networks, and so on. The possibilities for using web scraping to analyze content and apply the findings to your content marketing strategy are virtually endless. With the many types of data analysis it opens up, web scraping can be a highly valuable tool for any content provider.
