Clojure/Enlive: Screen Scraping a HTML File from Disk

I wanted to play around with some Champions League data and I came across the Rec Sport Soccer Statistics Foundation which has collected results of all matches since the tournament started in 1955.

I wanted to get a list of all the matches for a specific season so I started out by downloading the file:

$ pwd
/tmp/football
$ wget http://www.rsssf.com/ec/ec200203det.html

The next step was to load that page and then run a CSS selector over it to extract the matches. In Ruby land I usually use nokogiri or Web Driver to do this but I’d heard that Clojure’s enlive is good for this type of work so I thought I’d give it a try.

I found a couple of examples showing how to get started, but they both seemed to rely on the web page being at an HTTP URI rather than on disk.

I eventually spotted an example which passed in HTML as a string to html-resource and decided to load the contents of my file as a string and then pass that in:

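(ns ranking-algorithms.parse
  (:use [net.cgrand.enlive-html]))

;; Read the file from disk and parse it into Enlive's node representation.
;; The StringReader makes html-resource parse the string as markup.
(defn fetch-page [file-path]
  (html-resource (java.io.StringReader. (slurp file-path))))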
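Reading the whole file into a string first may not be strictly necessary. The following is a minimal sketch, assuming Enlive's html-resource also accepts a java.io.File as its source (the Enlive README lists File among the supported inputs). The name fetch-page-from-file is hypothetical and not part of the original code:

```clojure
(require '[net.cgrand.enlive-html :as html]
         '[clojure.java.io :as io])

;; Hypothetical variant of fetch-page: hands Enlive a java.io.File
;; instead of slurping the file into a string first.
(defn fetch-page-from-file [file-path]
  (html/html-resource (io/file file-path)))
```

Either version should return the same kind of node sequence, so the rest of the code does not need to know which one was used.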

The next step was to take that page representation and extract the matches. Since the page isn’t particularly well laid out for that purpose I ended up writing a regular expression to find the matching parts:

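;; Select every span inside a paragraph inside div.Section1.
(defn extract-rows [page]
  (select page [:div.Section1 :p :span]))

;; Take the first child of a node; for a match row this is its text.
(defn extract-content [row]
  (first (get row :content)))

;; Truthy when the row is a string shaped like "Home-Away 1 - 0".
(defn recognise-match? [row]
  (and (string? row) (re-matches #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]" row)))

(defn matches [file]
  (->> file
       fetch-page
       extract-rows
       (map extract-content)
       (filter recognise-match?)))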

The interesting part is extract-rows, where we apply the CSS selector ‘div.Section1 p span’. The only difference from the CSS form is that the selector is written as a vector of keywords, so each step is prefixed with ‘:’.
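To see the selector in isolation, here is a small sketch that runs it over an inline fragment rather than the downloaded page. The markup in sample-html is made up to mirror the nesting the selector implies; it is not copied from the RSSSF page. It assumes Enlive's html-snippet and text functions:

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Made-up fragment that mirrors the div.Section1 > p > span nesting.
(def sample-html
  "<div class=\"Section1\"><p><span>AC Milan-Real Madrid 1 - 0</span></p></div>")

;; CSS 'div.Section1 p span' becomes a vector of keywords, one per step.
(def spans
  (html/select (html/html-snippet sample-html) [:div.Section1 :p :span]))

;; html/text returns the text content of each selected node.
(def span-texts (map html/text spans))
```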

We then filter everything through recognise-match? to find the matches, since almost every row of the page is returned by our CSS selector. Unfortunately I don’t think there is a more specific selector that I could have used.
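The filtering step can be tried on its own without Enlive. The sketch below reuses the same pattern as recognise-match? and applies it to a few made-up rows; match-pattern and sample-rows are names introduced only for this illustration:

```clojure
;; Same pattern as in recognise-match?, bound to a name for reuse.
(def match-pattern #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]")

(defn recognise-match? [row]
  (and (string? row) (re-matches match-pattern row)))

;; Made-up rows: one match line, one heading, one non-string node.
(def sample-rows
  ["Real\nMadrid-Borussia Dortmund 2 - 1"
   "Group A"
   {:tag :b :content ["bold text"]}])

(filter recognise-match? sample-rows)
;; => ("Real\nMadrid-Borussia Dortmund 2 - 1")
```

Note that the pattern only accepts single-digit scores and team names made of ASCII letters and whitespace, so rows outside that shape are silently dropped.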

When I executed matches I ended up with the following output:

> (matches "/tmp/football/ec200203det.html")
( ... "Lokomotiv\nMoskou-Borussia Dortmund 1 - 2" "Borussia\nDortmund-AC Milan 0 - 1" 
"Real\nMadrid-Lokomotiv Moskou 2 - 2" "Real\nMadrid-Borussia Dortmund 2 - 1" 
"AC Milan-Lokomotiv\nMoskou 1 - 0" "Borussia Dortmund-Real\nMadrid 1 - 1" 
"Lokomotiv\nMoskou-AC Milan 0 - 1" ... )

The next step was to split out the strings into a structure that I can use in a rankings algorithm so I applied another function to each string to pull out the appropriate parts:

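;; clojure.string must be loaded, e.g. via (:require [clojure.string]) in the ns form.
(defn cleanup [word]
  (clojure.string/replace word "\n" " "))

;; Split a matched row into teams and scores using the regex capture groups.
(defn as-match [row]
  (let [match
        (first (re-seq #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" row))]
    {:home (cleanup (nth match 1)) :away (cleanup (nth match 2))
     :home_score (nth match 3) :away_score (nth match 4)}))

(defn matches [file]
  (->> file
       fetch-page
       extract-rows
       (map extract-content)
       (filter recognise-match?)
       (map as-match)))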

If we run the function now we get a much nicer output to play with:

> (matches "/tmp/football/ec200203det.html")
( ...  {:home "AC Milan", :away "Internazionale Milaan", :home_score "0", :away_score "0"} 
       {:home "Juventus Turijn", :away "Real Madrid", :home_score "3", :away_score "1"} 
       {:home "Internazionale Milaan", :away "AC Milan", :home_score "1", :away_score "1"} )
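The scores in these maps are still strings. For a rankings algorithm they will probably need to be numbers, so one possible follow-up is sketched below. The helper with-int-scores is hypothetical and not part of the original code; it assumes the map shape produced by as-match:

```clojure
;; Hypothetical helper: converts the string scores produced by as-match
;; into integers so they can be compared or summed.
(defn with-int-scores [match]
  (-> match
      (update-in [:home_score] #(Integer/parseInt %))
      (update-in [:away_score] #(Integer/parseInt %))))

(with-int-scores
  {:home "Juventus Turijn" :away "Real Madrid" :home_score "3" :away_score "1"})
;; => {:home "Juventus Turijn", :away "Real Madrid", :home_score 3, :away_score 1}
```

It could be added as a final (map with-int-scores) step in the matches pipeline.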


Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.
