
Web Scraping Using Python (Part 2)


Check out how you can pull information from webpages, like these eBay items.


In this article, I will outline how to efficiently automate web scraping to return any data you may desire and then store it in a structured format for analysis.

This is the second part of the series. In part one, I went through just the basics of web scraping; if you'd like a quick overview, you can check it out at this link.

Alright, with that said, let's get to the exciting stuff!

Scraping All the Data from One Product

Note that I will continue with the eBay example we did in part one.

Just like before, I will start by importing our libraries.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import re


I have decided I will scrape data for cell phones from eBay starting from this link.

First, just like we did in part one we will:

  1. Perform a get request for the link we're interested in after visually inspecting the web page.
  2. Parse it with Beautiful Soup.
  3. Make things more concise by only getting the portion we want into the items variable.


source = requests.get('https://www.ebay.com/b/Cell-Phones-Smartphones/9355/bn_320094?rt=nc&_pgn=1').text    
soup = BeautifulSoup(source, 'lxml')
items = soup.find('li', class_='s-item')


Here I am interested in 13 attributes, and those are the ones I will be getting for all products:

  1. The title of the product, which is the first thing written.
  2. The description of the product (written under the title).
  3. The brand of the product.
  4. The model.
  5. Any miscellaneous features (for some products we have style, color, connectivity... etc.).
  6. The origin of the product.
  7. Its price.
  8. Shipping information.
  9. Whether it comes from a top seller or not.
  10. Its rating (how many stars?).
  11. The number of reviewers who gave a rating.
  12. Quantity sold (written for some products in red below the shipping information).
  13. Finally, I will also retrieve the link to the product in case I want to get back to it later.

I need to highlight a few things to keep in mind here:

  • Some products have missing attributes (for example, a product might simply not have a model, or the country of origin may not be stated). To deal with this, we will just return "None" for any attribute the script does not find as it goes along the web page.
  • The quantity sold element for some products sometimes contains "Watching" instead, which tells you how many people are watching the product. Since I am only interested in the quantity sold, we will have to work around this so that we only return the value when it is actually the quantity sold and not the number of people watching the product.

Alright, let's do this for each attribute, one by one.

Title

try:  
    item_title = items.find('h3', class_='s-item__title').text
except Exception as e:
    item_title = 'None'

print(item_title)

Here I simply used the find method just like we did in part one, specifying 'h3' as the tag and 's-item__title' as the class, with .text at the end to return only the text we need.

The only difference this time is that I used try and except to ask Python to return "None" into the variable if an error is raised, which will come in handy if the item does not have that attribute (a title in this case).

Printing the result gives exactly what we want: the title of the first product on the webpage.

New *UNOPENDED* Apple iPhone SE - 16/64GB 4.0" Unlocked Smartphone
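Since this try/except pattern will repeat for every attribute, it could also be wrapped in a small helper. This is just an optional sketch of mine (safe_text is a hypothetical name, not part of the original code), and the demo markup below is a stand-in, not real eBay HTML:

```python
from bs4 import BeautifulSoup

def safe_text(container, tag, class_name):
    # Return the matching tag's text, or 'None' if the tag is missing.
    try:
        return container.find(tag, class_=class_name).text
    except Exception:
        return 'None'

# Tiny demo on stand-in markup:
demo = BeautifulSoup('<li><h3 class="s-item__title">Demo phone</h3></li>', 'html.parser')
print(safe_text(demo, 'h3', 's-item__title'))     # Demo phone
print(safe_text(demo, 'div', 's-item__subtitle')) # None (tag not present)
```

With a helper like this, each attribute below would become a one-liner, but I will keep the explicit try/except blocks so each step stays visible.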


Description

try:  
    item_desc = items.find('div', class_='s-item__subtitle').text
except Exception as e:
    item_desc = 'None'

print(item_desc)

In the same way as the title, here I used .find with the relevant tag 'div', the relevant class 's-item__subtitle', and .text at the end.

Again, printing the result gives us the description we want.

NO-RUSH 14 DAYS SHIPPING ONLY!  US LOCATION!


Brand

try:
    item_brand = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes1').text
except Exception as e:
    item_brand = 'None'

print(item_brand)

Ok perfect, everything is the same as before.

Let's print the result:

Brand: Apple


Hmm... looks ok, but we do not want "Brand:" followed by the actual brand written for every single product. This will look a bit messy if we want to have this in an Excel sheet later.

Let's try again with a minor modification at the end of the second line of code:

try:
    item_brand = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes1').text.split(' ')[1]
except Exception as e:
    item_brand = 'None'

print(item_brand)


Let's print:

Apple

Great, we got just the brand. What I did here is very simple: I added .split(' ') at the end, which splits any text we give it based on whatever we specify between its brackets. Here I specified splits based on the spaces between words, so the result is the array ["Brand:", "Apple"]. Next, I simply added [1] to specify that I want the second element of the array returned, since I am not interested in the "Brand:" part.
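The split-and-index step can be seen in isolation with the same example string:

```python
text = 'Brand: Apple'
parts = text.split(' ')  # split on spaces
print(parts)     # ['Brand:', 'Apple']
print(parts[1])  # Apple
```

Note that [1] only works when what follows the label is a single word, which is why the model below needs a slightly different approach.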

Model

try:
    item_model = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes2').text.split(' ')[1:]
    item_model = ' '.join(item_model)
except Exception as e:
    item_model = 'None'

print(item_model)

For the model, I did exactly what we did before, with another minor modification: [1:] at the end of the first line. This is because I want to return everything after "Model:", since the model will quite likely be more than one word. This way, I am telling Python I want everything from index 1 of the array onward.

In the third line, we used .join, which is the exact opposite of .split. It joins all elements of the returned array using whatever separator I specify before .join; here I specified a space to return all the words in the array with spaces between them.

Let's print the result:

Apple iPhone SE
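The slice-and-join round trip, in isolation:

```python
text = 'Model: Apple iPhone SE'
parts = text.split(' ')[1:]  # drop the leading 'Model:' label
print(parts)            # ['Apple', 'iPhone', 'SE']
print(' '.join(parts))  # Apple iPhone SE
```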


Features

try:
    item_features = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes3').text.split(' ')[1]
except Exception as e:
    item_features = 'None'

print(item_features) 

Same as before, nothing new here.

Result:

Bar


Origin

try:  
    item_origin = items.find('span', class_='s-item__location s-item__itemLocation').text
    item_origin = re.sub('From ', '', item_origin)
except Exception as e:
    item_origin = 'None'

print(item_origin)

Here we are in the same situation as we were with "Model." We could handle the text as we did before, but I thought I would show you a different method using the re module, which is great for regular expressions and worth checking out. With re.sub, you give it a sequence of characters to look for (here "From "), then whatever you want to replace that sequence with (here '', meaning replace it with nothing), and finally the variable that holds your text.

Result:

None

Which is exactly what we expect since this first item indeed does not have any origin specified.
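For reference, here is re.sub in isolation (the location string below is just an illustration):

```python
import re

location = 'From United States'
# Replace the leading 'From ' label with nothing, keeping only the country.
print(re.sub('From ', '', location))  # United States
```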

Price

try:
    item_price = items.find('span', class_='s-item__price').text
except Exception as e:
    item_price = 'None'

print(item_price)

Result:

$187.99


Shipping

try:
    item_shipping = items.find('span', class_='s-item__shipping s-item__logisticsCost').text
except Exception as e:
    item_shipping = 'None'

print(item_shipping)

Result:

$19.99 shipping


Top Seller

try:
    item_top_seller = items.find('span', class_='s-item__etrs-text').text
except Exception as e:
    item_top_seller = 'None'

print(item_top_seller)

Result:

None

Indeed it is not from a top seller.

Rating

try:
    item_stars = items.find('span', class_='clipped').text.split(' ')[0]
except Exception as e:
    item_stars = 'None'

print(item_stars)

Result:

None

The product has no rating.

Number of Reviews

try:
    item_nreviews = items.find('a', class_='s-item__reviews-count').text.split(' ')[0]
except Exception as e:
    item_nreviews = 'None' 

print(item_nreviews)

Result:

None

There are no reviews.

Quantity Sold

try:
    item_qty_sold = items.find('span', class_='s-item__hotness s-item__itemHotness').text.split(' ')
    if item_qty_sold[1] == 'sold':
        item_qty_sold = item_qty_sold[0]
    else:
        item_qty_sold = 0
except Exception as e:
    item_qty_sold = 'None' 


print(item_qty_sold) 

Ok, here is the second issue we highlighted previously. This element on the webpage sometimes denotes the quantity sold and sometimes how many people are watching. Since the pattern normally goes "some number + sold", I added an if statement to check whether the second element of the returned array equals "sold". If it does, I return the first element, which is just the number.

Otherwise, I return zero.


Result:

0


Here it works as expected as we do not have a quantity sold for this item.
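The same check can be pulled out into a small sketch (the sample strings below are made up for illustration, not scraped):

```python
def parse_hotness(text):
    # Return the number sold, or 0 when the element
    # reports watchers instead of sales.
    parts = text.split(' ')
    if parts[1] == 'sold':
        return parts[0]
    return 0

print(parse_hotness('67 sold'))      # 67
print(parse_hotness('12 watching'))  # 0
```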

Item Link

try:
    item_link = items.find('a', class_='s-item__link')['href']
except Exception as e:
    item_link = 'None'

print(item_link)

Getting links is something we did not address before, but it is nothing too complicated.

We follow the same sequence as always, but this time, instead of using .text at the end, we add ['href']. By right-clicking the title of the item and inspecting the HTML code, we see that right next to the class we have href = our link.



And the result is indeed the link.

https://www.ebay.com/itm/New-UNOPENDED-Apple-iPhone-SE-16-64GB-4-0-Unlocked-Smartphone/254064108562?epid=224938264&hash=item3b2766a412:m:mO70b8PJ4lgqKv5E555u-dg&var=553411573675
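Attribute access on a Beautiful Soup tag works like a dictionary lookup. A minimal stand-alone example (the markup is a stand-in, not real eBay HTML):

```python
from bs4 import BeautifulSoup

html = '<a class="s-item__link" href="https://example.com/item/123">Item</a>'
tag = BeautifulSoup(html, 'html.parser').find('a', class_='s-item__link')
print(tag['href'])  # https://example.com/item/123
```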

Scraping All the Data for All Products

Ok, now what if we want this data returned for all the products within the page? How would we do that?

Very simple: we make a very minor modification to our original three lines of code below:

source = requests.get('https://www.ebay.com/b/Cell-Phones-Smartphones/9355/bn_320094?rt=nc&_pgn=1').text    
soup = BeautifulSoup(source, 'lxml')
items = soup.find('li', class_='s-item')


Instead of assigning soup.find('li', class_='s-item') to the variable items, which only returns the first element with the 'li' tag and class 's-item', we want to ask Python to look for all the products within the full page's parsed HTML code stored in the soup variable.

We do this by simply using a for loop to do everything we did above for each element that meets those conditions for the tag and class.
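The difference between find and find_all is easy to see on a toy snippet (stand-in markup again):

```python
from bs4 import BeautifulSoup

html = '<ul><li class="s-item">A</li><li class="s-item">B</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find returns only the first match; find_all returns every match.
print(soup.find('li', class_='s-item').text)                      # A
print([li.text for li in soup.find_all('li', class_='s-item')])   # ['A', 'B']
```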

The full code will be as follows:

for items in soup.find_all('li', class_='s-item'):

    try:  
        item_title = items.find('h3', class_='s-item__title').text
    except Exception as e:
        item_title = 'None'

    print(item_title)

    try:  
        item_desc = items.find('div', class_='s-item__subtitle').text
    except Exception as e:
        item_desc = 'None'

    print(item_desc)

    try:
        item_brand = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes1').text.split(' ')[1]
    except Exception as e:
        item_brand = 'None'

    print(item_brand)

    try:
        item_model = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes2').text.split(' ')[1:]
        item_model = ' '.join(item_model)
    except Exception as e:
        item_model = 'None'

    print(item_model)

    try:
        item_features = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes3').text.split(' ')[1]
    except Exception as e:
        item_features = 'None'

    print(item_features)    

    try:  
        item_origin = items.find('span', class_='s-item__location s-item__itemLocation').text
        item_origin = re.sub('From ', '', item_origin)
    except Exception as e:
        item_origin = 'None'

    print(item_origin)

    try:
        item_price = items.find('span', class_='s-item__price').text
    except Exception as e:
        item_price = 'None'

    print(item_price)

    try:
        item_shipping = items.find('span', class_='s-item__shipping s-item__logisticsCost').text
    except Exception as e:
        item_shipping = 'None'

    print(item_shipping)

    try:
        item_top_seller = items.find('span', class_='s-item__etrs-text').text
    except Exception as e:
        item_top_seller = 'None'

    print(item_top_seller)    

    try:
        item_stars = items.find('span', class_='clipped').text.split(' ')[0]
    except Exception as e:
        item_stars = 'None'

    print(item_stars)

    try:
        item_nreviews = items.find('a', class_='s-item__reviews-count').text.split(' ')[0]
    except Exception as e:
        item_nreviews = 'None' 

    print(item_nreviews)

    try:
        item_qty_sold = items.find('span', class_='s-item__hotness s-item__itemHotness').text.split(' ')
        if item_qty_sold[1] == 'sold':
            item_qty_sold = item_qty_sold[0]
        else:
            item_qty_sold = 0
    except Exception as e:
        item_qty_sold = 'None'

    print(item_qty_sold)

    try:
        item_link = items.find('a', class_='s-item__link')['href']
    except Exception as e:
        item_link = 'None'

    print(item_link)
    print()


I will not show the result here, as it would print all the data we previously returned, but now for all the products within our page.

Putting Our Data in a Structured Format

Now we only have one step left. We want to put all this data in a structured format.

We do this using a pandas dataframe to hold all this data.

First, I start by creating the dataframe, assigning the columns we'll need, and putting all of this into a variable called df.

df = pd.DataFrame(columns = ['Title', 'description',
                             'Brand', 'Model', 'Features', 'Origin', 
                             'Price', 'Shipping',
                             'Top Seller','Stars', 'No. Of Reviews',
                             'Qty Sold',  'Link'])


Next, we can simply use the .loc method of pandas dataframes to put our values into the dataframe every time we loop through a product.

The .loc method takes the index of the row (starting from zero) and the column name.

So, for example, df.loc[0, 'Title'] = 'My product' will put this value into row zero, our first row, under the Title column.
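As a minimal illustration of row-by-row .loc assignment (with toy values, not scraped data):

```python
import pandas as pd

# Start with an empty dataframe; .loc assignment to a new index adds a row.
df_demo = pd.DataFrame(columns=['Title', 'Price'])
df_demo.loc[0, 'Title'] = 'My product'
df_demo.loc[0, 'Price'] = '$187.99'
df_demo.loc[1, 'Title'] = 'Another product'  # Price left unset becomes NaN
print(df_demo)
```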

To do this efficiently, I assign a variable n at the very beginning, before our loop block, setting n = 0 so it acts as a counter within our loop, starting from zero.

After that, I add this block of code at the end of the loop, finishing by adding 1 to n each time we go through the loop.

    df.loc[n, 'Title'] = item_title
    df.loc[n, 'description'] = item_desc
    df.loc[n, 'Brand'] = item_brand
    df.loc[n, 'Model'] = item_model
    df.loc[n, 'Features'] = item_features
    df.loc[n, 'Origin'] = item_origin
    df.loc[n, 'Price'] = item_price
    df.loc[n, 'Shipping'] = item_shipping
    df.loc[n, 'Top Seller'] = item_top_seller
    df.loc[n, 'Stars'] = item_stars
    df.loc[n, 'No. Of Reviews'] = item_nreviews
    df.loc[n, 'Qty Sold'] = item_qty_sold
    df.loc[n, 'Link'] = item_link    

    n+=1


Finally, we can check if this worked with a quick df.head(), which returns the first five rows of the dataframe.

df.head()


Result:


Perfect, we got all this data in a very structured format now. One more step is to simply save it to an Excel file using df.to_excel:

df.to_excel('ebay_phones.xlsx')


This will save an Excel sheet with the data in your working directory.

I hope you find this useful. In the next part, I will discuss how to get this data from multiple pages and do some exploratory analysis.

To be continued!


Originally posted on https://oaref.blogspot.com/.

Topics:
data science, data analysis, web scraping, web scraping prices, python, web scraping python, data science blog, data visualization python

Opinions expressed by DZone contributors are their own.
