
Web Scraping With Python: Would Alliances Have Mattered?


In this post, we take a look at how to use Python to scrape data from the web in order to perform analyses. Read on to start scraping!


The Indian state of Gujarat went to the polls in December 2017. The ruling Bharatiya Janata Party (BJP) set itself a target of winning 150 seats but ended up winning 99. Some politicians commented that if only the opposition Indian National Congress (INC) had allied with other opposition parties, the BJP would not have won.

Now, in which seats was the victory margin narrow enough that an alliance between the INC runner-up and the second runner-up candidate would have made the BJP candidate lose? That is, in which constituencies was the BJP candidate's victory margin less than the votes polled by the candidate who came in third?

The website IndiaVotes provides a large amount of data on Indian elections. To answer our question, we can use Python to scrape the relevant web pages, then process and extract the data we require. We can also write the data to an Excel file so that we can use convenient features like sorting and filtering to get more perspectives.

The Python libraries I have used are "requests" for fetching webpage content, "BeautifulSoup" for navigating and parsing the DOM tree-structured data, and "openpyxl" for creating the Excel file. There are numerous tutorials that explain the basics and usage of BeautifulSoup. For a gentle and smooth introduction, you can check out "Web Scraping Tutorial with Python: Tips and Tricks" by Jekaterina Kokatjuhha.
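
If you want to run the examples below, all three libraries can be installed with pip: pip install requests beautifulsoup4 openpyxl (note that BeautifulSoup's package name on PyPI is beautifulsoup4).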

The page that displays the Gujarat 2017 results can be found here. Let's call this the master page. On this page, we see the results in a table.

But the rub is that the page makes an AJAX call to fetch the data and load it. Scraping a page that retrieves its data via AJAX involves some additional steps, which are well explained in Todd Hayton's article, "Scraping AJAX Pages with Python."

The key step is to figure out the data format required by the endpoint to which the AJAX request is made. Similarly, we need to know the format in which the data is returned. Understanding the request and response formats is easily done with the browser's Developer Tools, by checking the Network tab. Once we know the formats, our program just needs to build the request data, send it as input to the AJAX call, and, after receiving the response, transform it into suitable data structures. The second important thing is to send the header 'X-Requested-With': 'XMLHttpRequest'. With this header set, we send a POST request.
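
Here is a minimal sketch of such a request using the requests library. The endpoint URL and form fields below are placeholders; the real values can be read off the Network tab:

import requests

# Placeholder endpoint and form fields; copy the real values from the
# browser's Network tab entry for the AJAX request.
AJAX_URL = 'http://www.indiavotes.com/ac/ajax-endpoint'  # hypothetical URL
payload = {'election': '257'}                            # hypothetical fields

# The X-Requested-With header marks the request as an AJAX call, so the
# server returns the table data rather than a full HTML page.
headers = {'X-Requested-With': 'XMLHttpRequest'}

response = requests.post(AJAX_URL, data=payload, headers=headers)
response.raise_for_status()
content = response.text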

Are we done with our background work? Not quite. The data we receive on the master page is not complete enough for our analysis. Though the table lists the election results per constituency, we can only see the winning candidate's name, party, and margin.

Master Page table

The second column is a hyperlink to each constituency's detailed result. For example, the URL for Abdasa is http://www.indiavotes.com/ac/details/29/37954/257/. Let's call it a detail page. Browsing to this page, we see the result of Abdasa displayed in a table.

Abdasa constituency detailed result

Therefore, we have to do the web scraping in two phases. In the first phase, we fetch the master page and retrieve the constituency names and their respective URLs. In the second phase, we fetch the detail pages of all constituencies and retrieve the electoral data.

In order to understand how to extract the data, we inspect the tables in the two pages. For the master page, the DOM element hierarchy is as follows:

div : id = "m1", class = "mapTabData"
div : id = "DataTables_Tables_Table_0_wrapper", class = "dataTables_wrapper", role = "grid"
table : id = "DataTables_Table_0", class = "grid sortable dataTable"
tbody : role = "alert"
tr : class = "odd" or "even"
td : class="ta1" (second column)

Navigation turns out to be quite easy. We take the element with id "m1" and fetch all "td"s of class "ta1".
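
With BeautifulSoup, that navigation can be sketched like this (assuming the master page HTML returned by the AJAX call is in content):

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')

m1 = soup.find(id='m1')                        # the mapTabData div
place_cells = m1.find_all('td', class_='ta1')  # second column of each row

places = []
for td in place_cells:
    anchor = td.find('a')
    if anchor is not None:
        # store (constituency name, URL of its detail page)
        places.append((anchor.get_text(strip=True), anchor['href']))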

For the detailed page, the element hierarchy is:

div : id = "m1", class = "mapTabData"
div : id = "DataTables_Table_0_wrapper", class="dataTables_wrapper" role = "grid"
table : id = "DataTables_Table_0", class = "grid sortable dataTable"
tbody : role = "alert"
tr : class = "bgR1 odd" or "even" or "odd"
td : class = "numberTable", "tar sorting_1", "ta1", "tar", "tar", "ta1" (six columns)

Here, too, the navigation is simple. We take the element with id "m1" and fetch all the "td"s under it. Each candidate's details span one row of six cells; dropping the serial-number cell leaves the five values we store per candidate.
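
A sketch of that extraction, under the assumption that the first cell of each row is a serial number we discard, leaving five values per candidate:

import requests
from bs4 import BeautifulSoup

# One of the URLs collected from the master page; the Abdasa URL
# from above is used here as an example.
detail_url = 'http://www.indiavotes.com/ac/details/29/37954/257/'
detail_soup = BeautifulSoup(requests.get(detail_url).text, 'html.parser')

candidates = []
row = []
for td in detail_soup.find(id='m1').find_all('td'):
    row.append(td.get_text(strip=True))
    if len(row) == 6:                # six cells complete one candidate's row
        candidates.append(row[1:])   # drop the serial number, keep five values
        row = []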

The data structure I've used to hold the election results is called "results." It is a dictionary of constituency-candidates: the key is the place (constituency), and the value is a list of candidates, each candidate itself a list of five values. Its visual depiction is:

results data structure
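
In code, the structure might look like this; the names, numbers, and field order below are purely illustrative, not actual election data:

# 'results' maps a constituency name to its list of candidate rows,
# each row a list of five values (illustrative values and field order).
results = {
    'Abdasa': [
        ['Candidate A', 'INC', '54970', '46.2', '1'],
        ['Candidate B', 'BJP', '52257', '43.9', '2'],
        ['Candidate C', 'IND', '7500', '6.3', '3'],
    ],
}

# During the scrape, it is filled one constituency at a time:
# results[place] = candidates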

Now that we have all the building blocks of our program, we design the program flow:

  • Fetch Master Page.
  • In the content, find table 'm1'.
  • Get all the rows in the table.
  • Grab the anchor elements and place (constituency) names.
  • Extract the URLs from the anchor elements.
  • Store the place and URL as a tuple in an array.
  • Iterate through the array; that is, take one place (an electoral constituency) at a time and fetch its detail page.
  • In the page content, navigate to table 'm1' and grab all the cells.
  • Each candidate's details span six cells; after reading them (and dropping the serial number), append the candidate's five values to our main data structure.
  • After the loop ends, that is, once the data for all candidates in all constituencies has been fetched, process the main data structure.
  • For each constituency, we calculate the winner's margin. If the margin is smaller than the votes polled by the candidate who came third, we have a match; we are interested in that constituency.
  • Take the party of the winning candidate and store the constituency names in separate arrays for INC and BJP.
  • Print each such constituency's candidate information along with the victory margin.
  • Finally, write the entire data (the full election results) out to an Excel file; a condensed code sketch of these last steps follows this list.
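
Here is that condensed sketch of the final steps. The list names, field indices, and output filename are my assumptions for illustration; the complete program linked below is the reference:

from openpyxl import Workbook

bjp_close, inc_close = [], []

for place, candidates in results.items():
    if len(candidates) < 3:
        continue
    # Assumed field order (see the illustrative structure above):
    # party at index 1, vote count at index 2, rows sorted by votes.
    votes = [int(c[2].replace(',', '')) for c in candidates[:3]]
    margin = votes[0] - votes[1]
    if margin < votes[2]:  # third-place votes exceed the victory margin
        winner_party = candidates[0][1]
        print('%s: %s won by %d; third place polled %d'
              % (place, winner_party, margin, votes[2]))
        if winner_party == 'BJP':
            bjp_close.append(place)
        elif winner_party == 'INC':
            inc_close.append(place)

# Write the full election results out to an Excel file.
wb = Workbook()
ws = wb.active
for place, candidates in results.items():
    for cand in candidates:
        ws.append([place] + list(cand))
wb.save('gujarat_2017_results.xlsx')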

So, that's it. I coded in Python 2.7 and ran the program. You can check out the complete program on GitHub.

You can see the output of a sample run on GitHub, here.

What does the program say about our analysis question?

There are 28 constituencies that had close results, and the BJP won 17 of them. In 15 of those 17, the INC came second. Among the 15, the Bahujan Samaj Party finished third in two and the Nationalist Congress Party in three. A three-party alliance between the INC, BSP, and NCP would thus have brought the BJP's tally down to 94. With 92 seats needed for a majority in the 182-seat assembly, that still leaves the BJP with a simple majority, so the alliance would not have defeated it.

You can also see the complete results Excel file that the program generates here.
