Getting Started With Scrapy
This article provides a basic overview of how to use the Python Scrapy framework to extract data and other information from websites.
Scrapy is a Python-based web crawler that can be used to extract information from websites. It is fast and simple, and can navigate pages just like a browser can.
However, note that it is not suitable for websites and apps that use JavaScript to manipulate the user interface. Scrapy loads just the HTML. It has no facilities to execute JavaScript that might be used by the website to tailor the user’s experience.
Installation
We use Virtualenv to install scrapy. This allows us to install scrapy without affecting other system-installed modules.
Create a working directory and initialize a virtual environment in that directory.
mkdir working
cd working
virtualenv venv
. venv/bin/activate
Install scrapy now.
pip install scrapy
Check that it is working. The following output shows the version of Scrapy as 1.4.0.
scrapy
# prints
Scrapy 1.4.0 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
...
Writing a Spider
Scrapy works by loading a Python module called a spider, which is a class inheriting from scrapy.Spider.
Let's write a simple spider class to load the top posts from Reddit.
To begin with, create a file called redditspider.py
and add the following to it. This is a complete spider class, though one which does not do anything useful for us. A spider
class requires, at a minimum, the following:
- A
name
identifying the spider. - A
start_urls
list variable containing the URLs from which to begin crawling. - A
parse()
method, which can be a no-op as shown.
import scrapy

class redditspider(scrapy.Spider):
    name = 'reddit'
    start_urls = ['https://www.reddit.com/']

    def parse(self, response):
        pass
This class can now be executed as follows:
scrapy runspider redditspider.py
# prints
...
2017-06-16 10:42:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-16 10:42:34 [scrapy.core.engine] INFO: Spider opened
2017-06-16 10:42:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...
Turn Off Logging
As you can see, this spider runs and prints a bunch of messages, which can be useful for debugging. However, since it obscures the output of our program, let's turn it off for now.
Add these lines to the beginning of the file:
import logging
logging.getLogger('scrapy').setLevel(logging.WARNING)
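Alternatively, if you prefer to leave the spider code untouched, the log level can be lowered from the command line with Scrapy's --loglevel (-L) option, which should have the same effect:
scrapy runspider redditspider.py -L WARNING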
Now, when we run the spider, we should no longer see the extraneous log messages.
Parsing the Response
Let's now parse the response from the scraper. This is done in the method parse(). In this method, we use the method response.css() to perform CSS-style selections on the HTML and extract the required elements.
To identify the CSS selections to extract, we use Chrome's DOM Inspector tool to pick the elements. From Reddit's front page, we see that each post is wrapped in a <div class="thing">...</div>.
So we select all div.thing elements from the page and work with them further.
def parse(self, response):
    for element in response.css('div.thing'):
        pass
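If you want to experiment with selectors before putting them into the spider, Scrapy's interactive shell is handy. The sketch below fetches the page and drops you into a Python prompt where response is already defined:
scrapy shell 'https://www.reddit.com/'
# then, at the prompt that opens:
#   response.css('div.thing')   # returns a list of Selector objects, one per post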
We also implement the following helper methods within the spider class to extract the required text.
The following method extracts all text from an element as a list, joins the elements with a space, and strips away the leading and trailing whitespace from the result.
def a(self, response, cssSel):
    return ' '.join(response.css(cssSel).extract()).strip()
And this method extracts text from the first element and returns it.
def f(self, response, cssSel):
    return response.css(cssSel).extract_first()
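To see how the two helpers differ, here is a small self-contained sketch that runs Scrapy's Selector on an inline HTML fragment (the fragment and class names are made up purely for illustration):
from scrapy.selector import Selector

# A tiny stand-in for a crawled page (illustrative HTML only).
sel = Selector(text='<div><p class="x">first</p><p class="x">second</p></div>')

# extract() returns every matching text node as a list of strings.
print(sel.css('p.x::text').extract())        # ['first', 'second']

# extract_first() returns only the first match (or None if nothing matches).
print(sel.css('p.x::text').extract_first())  # 'first'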
Extracting Required Elements
Once these helper methods are in place, let's extract the title from each Reddit post. Within div.thing, the title is available at div.entry>p.title>a.title::text. As mentioned before, this CSS selection for the required elements can be determined from any browser's DOM Inspector.
def parse(self, resp):
    for e in resp.css('div.thing'):
        yield {
            'title': self.a(e, 'div.entry>p.title>a.title::text'),
        }
The results are returned to the caller using Python's yield statement. The way yield works is as follows: calling a function that contains a yield statement returns a generator to the caller. The caller repeatedly iterates over this generator and receives the results of each execution until the generator terminates.
In our case, the parse() method returns a dictionary object containing a key (title) to the caller on each invocation until the div.thing list ends.
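As a quick standalone illustration of this behavior (ordinary Python, nothing Scrapy-specific), consider:
def numbers():
    # each yield hands one value back to the caller and pauses the function here
    yield 1
    yield 2

gen = numbers()   # calling the function builds a generator; no body code runs yet
for n in gen:     # each iteration resumes the function up to the next yield
    print(n)      # prints 1, then 2, then stops when the generator terminates
Scrapy drives parse() in the same way, collecting each yielded dictionary as one scraped item.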
Running the Spider and Collecting Output
Let us now run the spider again. A part of the copious output is shown below (after reinstating the log messages).
scrapy runspider redditspider.py
# prints
...
2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/>
{'title': u'The Plight of a Politician'}
2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/>
{'title': u'Elephants foot compared to humans foot'}
...
It is hard to see the real output. Let us redirect the output to a file (posts.json).
scrapy runspider redditspider.py -o posts.json
And here is a part of posts.json.
...
{"title": "They got fit together"},
{"title": "Not all heroes wear capes"},
{"title": "This sub"},
{"title": "So I picked this up at a flea market.."},
...
Extract All Required Information
Let's also extract the subreddit name and the number of votes for each post. To do that, we just update the result returned from the yield statement.
def parse(self, resp):
    for e in resp.css('div.thing'):
        yield {
            'title': self.a(e, 'div.entry>p.title>a.title::text'),
            'votes': self.f(e, 'div.score.likes::attr(title)'),
            'subreddit': self.a(e, 'div.entry>p.tagline>a.subreddit::text'),
        }
The resulting posts.json:
...
{"votes": "28962", "title": "They got fit together", "subreddit": "r/pics"},
{"votes": "6904", "title": "My puppy finally caught his Stub", "subreddit": "r/funny"},
{"votes": "3925", "title": "Reddit, please find this woman who went missing during E3!", "subreddit": "r/NintendoSwitch"},
{"votes": "30079", "title": "Yo-Yo Skills", "subreddit": "r/gifs"},
{"votes": "2379", "title": "For every upvote I won't smoke for a day", "subreddit": "r/stopsmoking"},
...
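To give an idea of how downstream software might consume this output, here is a minimal sketch that reads posts.json with Python's standard json module (the field names match the ones our spider yields):
import json

# Load the list of items written by: scrapy runspider redditspider.py -o posts.json
with open('posts.json') as f:
    posts = json.load(f)

# Work with the structured data, e.g. print each post's subreddit and title.
for post in posts:
    print(post['subreddit'], '-', post['title'])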
Conclusion
This article provided a basic overview of how to extract information from websites using Scrapy. To use Scrapy, we need to write a spider module which instructs Scrapy to crawl a website and extract structured information from it. This information can then be returned in JSON format for consumption by downstream software.