Clojure/Enlive: Screen Scraping a HTML File from Disk


I wanted to play around with some Champions League data and came across the Rec.Sport.Soccer Statistics Foundation (RSSSF), which has collected the results of all matches since the tournament started in 1955.

I wanted to get a list of all the matches for a specific season so I started out by downloading the file:

$ pwd
/tmp/football
$ wget http://www.rsssf.com/ec/ec200203det.html

The next step was to load that page and run a CSS selector over it to extract the matches. In Ruby land I usually use nokogiri or WebDriver for this, but I'd heard that Clojure's Enlive is good for this type of work, so I thought I'd give it a try.

I found a couple of examples showing how to get started, but they both seemed to rely on the web page being at an HTTP URI rather than on disk.

I eventually spotted an example which passed HTML to html-resource as a string, so I decided to read the contents of my file into a string, wrap it in a StringReader, and pass that in:

(ns ranking-algorithms.parse
  (:use [net.cgrand.enlive-html]))

;; Read the whole file into a string and hand it to Enlive via a StringReader.
(defn fetch-page [file-path]
  (html-resource (java.io.StringReader. (slurp file-path))))

The next step was to take that page representation and extract the matches. Since the page isn’t particularly well laid out for that purpose I ended up writing a regular expression to find the matching parts:

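;; Select every span inside a paragraph of the main content div.
(defn extract-rows [page]
  (select page [:div.Section1 :p :span]))

;; Take the first child of a node; for plain-text rows this is a string.
(defn extract-content [row]
  (first (get row :content)))

;; Truthy only for strings shaped like "Home-Away 1 - 2".
(defn recognise-match? [row]
  (and (string? row) (re-matches #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]" row)))

(defn matches [file]
  (->> file
       fetch-page
       extract-rows
       (map extract-content)
       (filter recognise-match?)))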

The interesting part is extract-rows, where we apply the CSS selector ‘div.Section1 p span’. The only difference is that Enlive takes the selector as a vector of keywords, so each step is prefixed with ‘:’.
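To see the selector on its own, here is a minimal sketch that runs it over a made-up HTML fragment rather than the real page. It assumes Enlive is on the classpath; the names sample-page, spans and span-texts are only for illustration, and the markup is invented to mimic the nesting of the real file.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Made-up fragment with the same div > p > span nesting as the real file.
(def sample-page
  (html/html-resource
   (java.io.StringReader.
    (str "<html><body>"
         "<div class=\"Section1\"><p><span>Ajax-Inter 1 - 0</span></p></div>"
         "<div class=\"Other\"><p><span>ignored</span></p></div>"
         "</body></html>"))))

;; The CSS selector "div.Section1 p span" written as a vector of keywords.
(def spans (html/select sample-page [:div.Section1 :p :span]))

;; Each selected node is a map; :content holds its children.
(def span-texts (map (comp first :content) spans))
```

Only the span under div.Section1 should be selected, so span-texts holds the text of that one span.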

We then filter everything through recognise-match? to find the matches, since almost every row of the page is returned by our CSS selector. Unfortunately I don't think there is a more specific selector that I could have used.
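The filter is easier to reason about when tried on a few hand-written values. The sketch below reuses the pattern from recognise-match?; match-pattern and match-line? are names made up for the example, and the inputs are invented rather than taken from the page. Note that the pattern only allows ASCII letters in team names and single-digit scores, and that the unescaped `.` accepts any character between the two scores, not just a dash.

```clojure
;; Same pattern as in recognise-match?, pulled out so it can be tried in isolation.
(def match-pattern #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]")

;; Like recognise-match?, but coerced to a plain true/false for easy checking.
(defn match-line? [s]
  (boolean (and (string? s) (re-matches match-pattern s))))

(match-line? "Lokomotiv\nMoskou-Borussia Dortmund 1 - 2") ; a fixture line
(match-line? "Group A")                                   ; a heading, no teams or score
(match-line? {:tag :b})                                   ; a nested node, not a string
```

The string? check matters because extract-content can return a nested node instead of text, and re-matches would throw on a map.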

When I executed that function I ended up with the following output:

> (matches "/tmp/football/ec200203det.html")
( ... "Lokomotiv\nMoskou-Borussia Dortmund 1 - 2" "Borussia\nDortmund-AC Milan 0 - 1" 
"Real\nMadrid-Lokomotiv Moskou 2 - 2" "Real\nMadrid-Borussia Dortmund 2 - 1" 
"AC Milan-Lokomotiv\nMoskou 1 - 0" "Borussia Dortmund-Real\nMadrid 1 - 1" 
"Lokomotiv\nMoskou-AC Milan 0 - 1" ... )

The next step was to split out the strings into a structure that I can use in a rankings algorithm so I applied another function to each string to pull out the appropriate parts:

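;; clojure.string is not loaded by the ns form above.
(require 'clojure.string)

;; Collapse the line breaks that the HTML leaves inside team names.
(defn cleanup [word]
  (clojure.string/replace word "\n" " "))

;; Split a matched row into teams and scores using the regex capture groups.
(defn as-match [row]
  (let [match
        (first (re-seq #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" row))]
    {:home (cleanup (nth match 1)) :away (cleanup (nth match 2))
     :home_score (nth match 3) :away_score (nth match 4)}))

(defn matches [file]
  (->> file
       fetch-page
       extract-rows
       (map extract-content)
       (filter recognise-match?)
       (map as-match)))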

If we run the function now we get a much nicer output to play with:

> (matches "/tmp/football/ec200203det.html")
( ...  {:home "AC Milan", :away "Internazionale Milaan", :home_score "0", :away_score "0"} 
       {:home "Juventus Turijn", :away "Real Madrid", :home_score "3", :away_score "1"} 
       {:home "Internazionale Milaan", :away "AC Milan", :home_score "1", :away_score "1"} )
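The scores above are still strings, and a rankings algorithm will probably want numbers. As a rough sketch, assuming that is the case, the variant below destructures the capture groups directly and parses the scores with Long/parseLong. The names match-groups and parse-match are hypothetical and not part of the code above.

```clojure
(require '[clojure.string :as str])

;; Same pattern as in as-match, with groups for both teams and both scores.
(def match-groups #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])")

(defn parse-match
  "Hypothetical variant of as-match: returns numeric scores,
  or nil when the row does not look like a match."
  [row]
  (when-let [[_ home away home-score away-score] (re-matches match-groups row)]
    {:home (str/replace home "\n" " ")
     :away (str/replace away "\n" " ")
     :home_score (Long/parseLong home-score)
     :away_score (Long/parseLong away-score)}))

(parse-match "Real\nMadrid-Borussia Dortmund 2 - 1")
```

Returning nil for rows that do not match means the function could also be used with keep, which would fold the filter and map steps into one.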

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.
