
Getting Started With Scrapy


This article provides a basic view of how to use the Python Scrapy library to extract data and other information from websites.


Scrapy is a Python-based web crawler that can be used to extract information from websites. It is fast and simple, and can navigate pages just like a browser can.

However, note that it is not suitable for websites and apps that use JavaScript to manipulate the user interface. Scrapy loads just the HTML. It has no facilities to execute JavaScript that might be used by the website to tailor the user’s experience.

Installation

We use Virtualenv to install Scrapy. This allows us to install Scrapy without affecting other system-installed modules.

Create a working directory and initialize a virtual environment in that directory.

mkdir working
cd working
virtualenv venv
. venv/bin/activate

Now install Scrapy.

pip install scrapy 

Check that it is working. The output below shows the version of Scrapy as 1.4.0.

scrapy
# prints
Scrapy 1.4.0 - no active project
 
Usage:
  scrapy <command> [options] [args]
 
Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
...

Writing a Spider

Scrapy works by loading a Python module called a spider, which is a class inheriting from scrapy.Spider.

Let's write a simple spider class to load the top posts from Reddit.

To begin with, create a file called redditspider.py and add the following to it. This is a complete spider class, though one which does not do anything useful for us. A spider class requires, at a minimum, the following:

  • name identifying the spider.
  • start_urls list variable containing the URLs from which to begin crawling.
  • parse() method, which can be a no-op as shown.
import scrapy
 
class redditspider(scrapy.Spider):
    name = 'reddit'
    start_urls = ['https://www.reddit.com/']
 
    def parse(self, response):
        pass

This class can now be executed as follows:

scrapy runspider redditspider.py
 
# prints
...
2017-06-16 10:42:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-16 10:42:34 [scrapy.core.engine] INFO: Spider opened
2017-06-16 10:42:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...

Turn Off Logging

As you can see, this spider runs and prints a bunch of messages, which can be useful for debugging. However, since it obscures the output of our program, let's turn it off for now.

Add these lines to the beginning of the file:

import logging
logging.getLogger('scrapy').setLevel(logging.WARNING) 

Now, when we run the spider, we should no longer see the obscuring messages.

Parsing the Response

Let's now parse the response from the scraper. This is done in the method parse(). In this method, we use the method response.css() to perform CSS-style selections on the HTML and extract the required elements.

To identify the CSS selections to extract, we use Chrome’s DOM Inspector tool to pick the elements. From reddit’s front page, we see that each post is wrapped in a <div class="thing">...</div>.

So we select all div.thing elements from the page and work further with those matches.

def parse(self, response):
    for element in response.css('div.thing'):
        pass

We also implement the following helper methods within the spider class to extract the required text.

The following method extracts all text from an element as a list, joins the elements with a space, and strips away the leading and trailing whitespace from the result.

def a(self, response, cssSel):
    return ' '.join(response.css(cssSel).extract()).strip()
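The join-and-strip step can be seen in plain Python, independent of Scrapy. extract() returns a list of strings, so a hypothetical list stands in for its output here:

```python
# Sketch of the join-and-strip logic: extract() returns a list of strings;
# we join them with spaces and trim the outer whitespace.
parts = ['  Hello', 'world  ']   # hypothetical extract() output
combined = ' '.join(parts).strip()
print(combined)  # → Hello world
```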

And this method extracts text from the first element and returns it.

def f(self, response, cssSel):
    return response.css(cssSel).extract_first()

Extracting Required Elements

Once these helper methods are in place, let's extract the title from each Reddit post. Within div.thing, the title is available at div.entry>p.title>a.title::text. As mentioned before, this CSS selection for the required elements can be determined from any browser’s DOM Inspector.

def parse(self, resp):
    for e in resp.css('div.thing'):
        yield {
            'title': self.a(e,'div.entry>p.title>a.title::text'),
        }

The results are returned to the caller using Python's yield statement. yield works as follows: calling a function that contains a yield statement returns a generator to the caller. The caller iterates over this generator and receives results until the generator terminates.

In our case, the parse() method yields a dictionary object with a single key (title) to the caller on each iteration until the list of div.thing elements is exhausted.
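The yield mechanics can be shown in plain Python, without Scrapy. The function and input below are made up for illustration:

```python
# A plain-Python sketch of the generator behavior described above: calling
# the function returns a generator; each iteration yields one dictionary.
def parse_things(things):
    for t in things:
        yield {'title': t.upper()}

gen = parse_things(['a', 'b'])
print(next(gen))  # → {'title': 'A'}
print(next(gen))  # → {'title': 'B'}
```

Scrapy drives the spider's parse() generator in exactly this way, collecting each yielded dictionary as a scraped item.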

Running the Spider and Collecting Output

Let us now run the spider again. Part of the copious output is shown below (after reinstating the log statements).

scrapy runspider redditspider.py
# prints
...
2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from 
{'title': u'The Plight of a Politician'}
2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from 
{'title': u'Elephants foot compared to humans foot'}
...

It is hard to see the real output. Let us redirect the output to a file (posts.json).

scrapy runspider redditspider.py -o posts.json 

And here is a part of posts.json.

...
{"title": "They got fit together"},
{"title": "Not all heroes wear capes"},
{"title": "This sub"},
{"title": "So I picked this up at a flea market.."},
...
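Downstream software can load this exported JSON with the standard library. The snippet below is a hedged sketch: it inlines a small JSON string mimicking the file's contents rather than reading a real posts.json:

```python
# Sketch of consuming the exported feed; the JSON text is inlined here
# instead of being read from posts.json.
import json

data = json.loads('[{"title": "This sub"}, {"title": "Not all heroes wear capes"}]')
titles = [post['title'] for post in data]
print(titles)  # → ['This sub', 'Not all heroes wear capes']
```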

Extract All Required Information

Let's also extract the subreddit name and the number of votes for each post. To do that, we just update the result returned from the yield statement.

def parse(self, response):
    for e in response.css('div.thing'):
        yield {
            'title': self.a(e, 'div.entry>p.title>a.title::text'),
            'votes': self.f(e, 'div.score.likes::attr(title)'),
            'subreddit': self.a(e, 'div.entry>p.tagline>a.subreddit::text'),
        }

The resulting posts.json:

...
{"votes": "28962", "title": "They got fit together", "subreddit": "r/pics"},
{"votes": "6904", "title": "My puppy finally caught his Stub", "subreddit": "r/funny"},
{"votes": "3925", "title": "Reddit, please find this woman who went missing during E3!", "subreddit": "r/NintendoSwitch"},
{"votes": "30079", "title": "Yo-Yo Skills", "subreddit": "r/gifs"},
{"votes": "2379", "title": "For every upvote I won't smoke for a day", "subreddit": "r/stopsmoking"},
...

Conclusion

This article provided a basic view of how to extract information from websites using Scrapy. To use Scrapy, we write a spider module which instructs Scrapy to crawl a website and extract structured information from it. This information can then be returned in JSON format for consumption by downstream software.


Topics:
big data, tutorial, scrapy, web scraping, python

Published at DZone with permission of Jay Sridhar, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
