Getting Started With Scrapy

This article provides a basic view of how to use the Python Scrapy framework to extract data and other information from websites.

By Jay Sridhar · Jun. 20, 17 · Tutorial

Scrapy is a Python-based web crawler that can be used to extract information from websites. It is fast and simple, and can navigate pages just like a browser can.

However, note that it is not suitable for websites and apps that use JavaScript to manipulate the user interface. Scrapy loads just the HTML. It has no facilities to execute JavaScript that might be used by the website to tailor the user’s experience.

Installation

We use Virtualenv to install scrapy. This allows us to install scrapy without affecting other system-installed modules.

Create a working directory and initialize a virtual environment in that directory.

mkdir working
cd working
virtualenv venv
. venv/bin/activate

Install scrapy now.

pip install scrapy 

Check that it is working. The following display shows the version of scrapy as 1.4.0.

scrapy
# prints
Scrapy 1.4.0 - no active project
 
Usage:
  scrapy <command> [options] [args]
 
Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
...

Writing a Spider

Scrapy works by loading a spider, a Python class that inherits from scrapy.Spider.

Let's write a simple spider class to load the top posts from Reddit.

To begin with, create a file called redditspider.py and add the following to it. This is a complete spider class, though one which does not do anything useful for us. A spider class requires, at a minimum, the following:

  • A name identifying the spider.
  • A start_urls list variable containing the URLs from which to begin crawling.
  • A parse() method, which can be a no-op as shown.
import scrapy
 
class redditspider(scrapy.Spider):
    name = 'reddit'
    start_urls = ['https://www.reddit.com/']
 
    def parse(self, response):
        pass

This class can now be executed as follows:

scrapy runspider redditspider.py
 
# prints
...
2017-06-16 10:42:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-16 10:42:34 [scrapy.core.engine] INFO: Spider opened
2017-06-16 10:42:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...

Turn Off Logging

As you can see, this spider runs and prints a bunch of log messages, which can be useful for debugging. However, since they obscure the output of our program, let's turn them off for now.

Add these lines to the beginning of the file:

import logging
logging.getLogger('scrapy').setLevel(logging.WARNING) 

Now, when we run the spider, we should not see these log messages cluttering the output.

Parsing the Response

Let's now parse the scraped page. This is done in the parse() method. In this method, we use response.css() to perform CSS-style selections on the HTML and extract the required elements.

To identify the CSS selectors for the required elements, we use Chrome's DOM inspector to examine the page. On Reddit's front page, we see that each post is wrapped in a <div class="thing">...</div>.

So we select all div.thing elements on the page and work with them further.

def parse(self, response):
    for element in response.css('div.thing'):
        pass

We also implement the following helper methods within the spider class to extract the required text.

The following method extracts all text from an element as a list, joins the elements with a space, and strips away the leading and trailing whitespace from the result.

def a(self, response, cssSel):
    return ' '.join(response.css(cssSel).extract()).strip()

And this method extracts text from the first element and returns it.

def f(self, response, cssSel):
    return response.css(cssSel).extract_first()

Extracting Required Elements

Once these helper methods are in place, let's extract the title from each Reddit post. Within div.thing, the title is available at div.entry>p.title>a.title::text. As mentioned before, the CSS selector for the required elements can be determined using any browser's DOM inspector.

def parse(self, resp):
    for e in resp.css('div.thing'):
        yield {
            'title': self.a(e,'div.entry>p.title>a.title::text'),
        }

The results are returned to the caller using Python's yield statement. The way yield works is as follows: calling a function that contains a yield statement returns a generator to the caller. The caller repeatedly iterates over this generator and receives results until the generator terminates.

In our case, parse() yields a dictionary containing the key title to the caller on each iteration, until the list of div.thing elements is exhausted.
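
To make the mechanics concrete, here is a minimal stand-alone sketch of the same pattern; the function make_items and its data are hypothetical and exist only for illustration.

# Hypothetical example illustrating yield; not part of the spider above.
def make_items():
    for title in ['First post', 'Second post']:
        yield {'title': title}   # produce one dictionary per iteration

gen = make_items()   # calling the function returns a generator
for item in gen:     # the caller iterates, receiving one result at a time
    print(item)

# prints
# {'title': 'First post'}
# {'title': 'Second post'}

Scrapy's engine consumes the generator returned by parse() in the same way, collecting each yielded item.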

Running the Spider and Collecting Output

Let us now run the spider again. A part of the copious output is shown below (after re-enabling logging).

scrapy runspider redditspider.py
# prints
...
2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/>
{'title': u'The Plight of a Politician'}
2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/>
{'title': u'Elephants foot compared to humans foot'}
...

It is hard to pick out the real output from the log messages. Let us instead write the scraped items to a file (posts.json).

scrapy runspider redditspider.py -o posts.json 

And here is a part of posts.json.

...
{"title": "They got fit together"},
{"title": "Not all heroes wear capes"},
{"title": "This sub"},
{"title": "So I picked this up at a flea market.."},
...

Extract All Required Information

Let's also extract the subreddit name and the number of votes for each post. To do that, we just update the result returned from the yield statement.

def parse(self, response):
    for e in response.css('div.thing'):
        yield {
            'title': self.a(e, 'div.entry>p.title>a.title::text'),
            'votes': self.f(e, 'div.score.likes::attr(title)'),
            'subreddit': self.a(e, 'div.entry>p.tagline>a.subreddit::text'),
        }

The resulting posts.json:

...
{"votes": "28962", "title": "They got fit together", "subreddit": "r/pics"},
{"votes": "6904", "title": "My puppy finally caught his Stub", "subreddit": "r/funny"},
{"votes": "3925", "title": "Reddit, please find this woman who went missing during E3!", "subreddit": "r/NintendoSwitch"},
{"votes": "30079", "title": "Yo-Yo Skills", "subreddit": "r/gifs"},
{"votes": "2379", "title": "For every upvote I won't smoke for a day", "subreddit": "r/stopsmoking"},
...

Conclusion

This article provided a basic view of how to extract information from websites using Scrapy. To use Scrapy, we write a spider that instructs Scrapy to crawl a website and extract structured information from it. This information can then be exported in JSON format for consumption by downstream software.
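
As a rough illustration of that downstream consumption, the following sketch reads the posts.json file produced above using Python's standard json module; it assumes the file was exported with scrapy runspider redditspider.py -o posts.json, which writes the items as a JSON array.

# A minimal sketch of consuming the exported feed; assumes posts.json
# was produced by the -o option as shown earlier.
import json

with open('posts.json') as f:
    posts = json.load(f)   # the .json feed export is a JSON array of items

for post in posts:
    print(post.get('subreddit'), post.get('votes'), post.get('title'))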
