
Data Scraping With PHP and Python

Limitless types of data analysis can be opened by web scraping, making it a highly valuable tool. Learn how to do it with PHP and Python!

By Joyce Cantu · Sep. 01, 17 · Tutorial · 37.2K Views


According to the latest estimates, the total number of websites is above one billion, with new sites being added and removed all the time. Just imagine the amount of data that's floating around the internet. It's far more than any human can digest in a lifetime. To harness that data, you need not only access to that information but also a scalable way to collect it so that you can organize and analyze it. That's why you need web data scraping.

Web scraping, also known as web harvesting, web data extraction, or screen scraping, is a technique in which a computer program extracts large amounts of data from a website and saves it to a local file, database, or spreadsheet in a format you can work with for your analysis. Web scraping saves tons of time because it automates the process of copying and pasting selected information from a page or even an entire website.

Mastering data scraping can open up a new world of great possibilities for content analysis. Content and news are crucial for increasing website traffic, so monitoring news and popular web publications on a daily basis using web scraping techniques can be very helpful.

Web Scraping With Python and BeautifulSoup

If you need to get content from a large number of internet sources, you will likely need to develop your own data scraping tools. Here, we are going to show you how to build a web scraper using Python and the simple and powerful BeautifulSoup library.

The first step is importing the libraries we are going to use: requests and BeautifulSoup:

# Import libraries
import requests
from bs4 import BeautifulSoup

Next, let’s specify a variable for the URL and use the requests.get method to fetch the HTML content of the page:

import requests
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
print(r.content)

Our next step is parsing a web page, so we need to create a BeautifulSoup object:

import requests
from bs4 import BeautifulSoup

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

# Create a BeautifulSoup object (the 'html5lib' parser requires
# 'pip install html5lib'; the built-in 'html.parser' also works)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

Now, we can move on to extracting useful data from the HTML content. For our example, we chose a page of inspirational quotes and will write a program to save them. First, look through the HTML content printed with the soup.prettify() method to identify a pattern for navigating to the quotes. In our example, all the quotes sit inside a div container with the ID ‘container’. We can find this div element using the find() method.

table = soup.find('div', attrs = {'id':'container'})

Each quote is inside a div container belonging to the class ‘quote’, so we locate them all with the findAll() method and iterate over the results, binding each one to the variable row.

For each quote, we build a dictionary of its fields and append it to a list called ‘quotes’.

quotes = []  # a list to store quotes
table = soup.find('div', attrs = {'id':'container'})
for row in table.findAll('div', attrs = {'class':'quote'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.h6.text
    quote['author'] = row.p.text
    quotes.append(quote)

Our final step is writing the data to a CSV file, which is a common format for databases and spreadsheets.

import csv

filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
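To sanity-check the export, we can read the file back with the standard library's csv.DictReader. A minimal, self-contained sketch using a single made-up sample row in place of the scraped quotes:

```python
import csv

# A sample row standing in for the scraped quotes (hypothetical data)
quotes = [{'theme': 'Hope', 'url': '/quotes/1', 'img': '/img/1.jpg',
           'lines': 'Keep going.', 'author': 'Anonymous'}]

filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)

# Read the file back to confirm the rows survived the round trip
with open(filename, newline='') as f:
    rows = list(csv.DictReader(f))

print(rows[0]['author'])  # Anonymous
```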

This is a simple example of how to perform web scraping with Python and the BeautifulSoup library, which is great for small-scale web scraping. If you want to scrape data at a large scale, you should consider using alternatives.
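The dividing line between small and large scale is mostly about politeness and robustness: once you fetch many pages, you want request pacing and session reuse at a minimum. A minimal sketch of paced fetching, with the actual HTTP call injected as a callable so the pacing logic stays testable (the function and parameter names here are illustrative, not from any particular library):

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Fetch each URL in turn, pausing between requests.

    'fetch' is any callable taking a URL and returning its content,
    e.g. lambda u: requests.get(u).content in real use.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # be polite: pace requests to the server
        results[url] = fetch(url)
    return results

# Exercise the pacing logic with a stub instead of real HTTP
pages = fetch_all(['http://a.example', 'http://b.example'],
                  fetch=lambda u: '<html>%s</html>' % u,
                  delay=0.01)
print(len(pages))  # 2
```

In real use you would also want to honor robots.txt and handle failed requests, but separating the fetching from the pacing keeps both concerns easy to swap out.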

Scraping Websites With PHP and cURL

If you want to download graphics, pictures, and videos from a number of websites, a good option is to use PHP with the cURL library, which allows connections to a variety of servers and protocols. cURL can transfer files using an extensive list of protocols, including not only HTTP but also FTP, which can be useful for creating a web spider that automatically downloads virtually anything off of the web to a server.

<?php

function curl_download($Url){

    if (!function_exists('curl_init')){
        die('cURL is not installed. Install and try again.');
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    curl_close($ch);

    return $output;
}

print curl_download('http://www.gutenberg.org/browse/scores/top');

?>

Content providers should set specific goals that they want to accomplish: scraping the latest content, the content that was indexed by Google on certain dates, capturing page titles, the date/time when posts were published, gathering follower counts from social networks, etc. The possibilities of using web scraping to analyze content and apply it to your content marketing strategies are virtually endless. With the limitless types of data analysis it opens up, web scraping can be a highly valuable tool for any content provider.
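Capturing page titles, for instance, takes very little machinery. A minimal sketch using only Python's standard library html.parser, fed a made-up HTML snippet for illustration:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Sample HTML standing in for a fetched page
html = '<html><head><title>Top 100 EBooks</title></head><body></body></html>'
parser = TitleParser()
parser.feed(html)
print(parser.title)  # Top 100 EBooks
```

For anything more involved than titles, a full parser like BeautifulSoup is the better tool, but for one well-defined field the standard library is often enough.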


Opinions expressed by DZone contributors are their own.
