Over a million developers have joined DZone.

Python: Refactoring to Iterator

· Big Data Zone

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

Over the last week I’ve been building a set of scripts to scrape the events from the Bayern Munich/Barcelona game and I’ve ended up with a few hundred lines of nested for statements, if statements and mutated lists. I thought it was about time I did a bit of refactoring.


The following is a function which takes in a match file and spits out a collection of maps containing times & events.

import bs4
import re
from bs4 import BeautifulSoup
from soupselect import select

def extract_events(file):
    match = open(file, 'r')
    soup = BeautifulSoup(match.read())

    all_events = []
    for event in select(soup, 'div#live-text-commentary-wrapper div#live-text'):
        for child in event.children:
            if type(child) is bs4.element.Tag:
                all_events.append(child.getText().strip())

    for event in select(soup, 'div#live-text-commentary-wrapper div#more-live-text'):
        for child in event.children:
            if type(child) is bs4.element.Tag:
                all_events.append(child.getText().strip())

    timed_events = []
    for i in range(0, len(all_events)):
        event = all_events[i]
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            timed_events.append({'time': formatted_time, 'event': all_events[i+1]})
    return timed_events

We call it like this:

match_id = "32683310"
for event in extract_events("data/%s" % (match_id))[:10]:
    print event

The file we’re loading is the Bayern Munich vs Barcelona match HTML file which I have saved locally. After we’ve got that read into beautiful soup we locate the two divs on the page which contain the match events.

We then iterate over that list and create a new list containing (time, event) pairs which we return.

I think we should be able to get to our resulting collection without persisting an intermediate list, but first things first – let’s remove the duplicated for loops:

def extract_events(file):
    match = open(file, 'r')
    soup = BeautifulSoup(match.read())

    all_events = []
    events = select(soup, 'div#live-text-commentary-wrapper div#live-text')
    more_events = select(soup, 'div#live-text-commentary-wrapper div#more-live-text')

    for event in events + more_events:
        for child in event.children:
            if type(child) is bs4.element.Tag:
                all_events.append(child.getText().strip())

    timed_events = []
    for i in range(0, len(all_events)):
        event = all_events[i]
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            timed_events.append({'time': formatted_time, 'event': all_events[i+1]})
    return timed_events

The next step is to refactor towards using an iterator. After a bit of reading I realised a generator would make life even easier.

I created a function which returned an iterator of the raw events and plugged that into the original function:

def raw_events(file):
    match = open(file, 'r')
    soup = BeautifulSoup(match.read())
    events = select(soup, 'div#live-text-commentary-wrapper div#live-text')
    more_events = select(soup, 'div#live-text-commentary-wrapper div#more-live-text')
    for event in events + more_events:
        for child in event.children:
            if type(child) is bs4.element.Tag:
                yield child.getText().strip()

def extract_events(file):
    all_events = list(raw_events(file))

    timed_events = []
    for i in range(0, len(all_events)):
        event = all_events[i]
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            timed_events.append({'time': formatted_time, 'event': all_events[i+1]})
    return timed_events

If we run that function we still get the same output as before which is good. Now we need to work out how to clean up the second bit of the code which groups the appropriate rows together.

The goal is that ‘extract_events’ returns an iterator rather than a list – we need to figure out how to iterate over the output of ‘raw_events’ in such a way that when we find a ‘time row’ we can yield that and the row immediately after.

Luckily I found a Stack Overflow post explaining that you can use the ‘next’ function inside an iterator to achieve this:

def extract_events(file):
    events = raw_events(file)
    for event in events:
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            yield {'time': formatted_time, 'event': next(events)}

It’s not that much less code than the original function but I think it’s an improvement. Any thoughts/tips to simplify it further are always welcome.

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

Topics:
python ,bigdata ,refactoring ,data analysis ,big data ,iterator ,python big data

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}