Clojure/Enlive: Screen Scraping a HTML File from Disk
Join the DZone community and get the full member experience.
Join For FreeI wanted to play around with some Champions League data and I came across the Rec Sport Soccer Statistics Foundation which has collected results of all matches since the tournament started in 1955.
I wanted to get a list of all the matches for a specific season so I started out by downloading the file:
$ pwd /tmp/football $ wget http://www.rsssf.com/ec/ec200203det.html
The next step was to load that page and then run a CSS selector over it to extract the matches. In Ruby land I usually use nokogiri or Web Driver to do this but I’d heard that Clojure’s enlive is good for this type of work so I thought I’d give it a try.
I found a couple of examples showing how to get started but they both seemed to rely on the web page being at a HTTP URI rather than on disk.
I eventually spotted an example which passed in HTML as a string to html-resource and decided to load the contents of my file as a string and then pass that in:
(ns ranking-algorithms.parse (:use [net.cgrand.enlive-html])) (defn fetch-page [file-path] (html-resource (java.io.StringReader. (slurp file-path))))
The next step was to take that page representation and extract the matches. Since the page isn’t particularly well laid out for that purpose I ended up writing a regular expression to find the matching parts:
(defn matches [file] (->> file fetch-page extract-rows (map extract-content) (filter recognise-match?))) (defn extract-rows [page] (select page [:div.Section1 :p :span])) (defn extract-content [row] (first (get row :content))) (defn recognise-match? [row] (and (string? row) (re-matches #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]" row)))
The interesting part is extract-rows where we apply the CSS selector ‘div.Section1 p span’, the only difference being that we prefix the selector with ‘:’.
We then filter everything through recgonise-match? to find the matches since almost every row of the page is returned by our CSS selector. Unfortunately I don’t think there is a more specific selector that I could have used.
When I execute that function I ended up with the following output:
> (matches "/tmp/football/ec200203det.html") ( ... "Lokomotiv\nMoskou-Borussia Dortmund 1 - 2" "Borussia\nDortmund-AC Milan 0 - 1" "Real\nMadrid-Lokomotiv Moskou 2 - 2" "Real\nMadrid-Borussia Dortmund 2 - 1" "AC Milan-Lokomotiv\nMoskou 1 - 0" "Borussia Dortmund-Real\nMadrid 1 - 1" "Lokomotiv\nMoskou-AC Milan 0 - 1" ... )
The next step was to split out the strings into a structure that I can use in a rankings algorithm so I applied another function to each string to pull out the appropriate parts:
(defn matches [file] (->> file fetch-page extract-rows (map extract-content) (filter recognise-match?) (map as-match))) (defn cleanup [word] (clojure.string/replace word "\n" " ")) (defn as-match [row] (let [match (first (re-seq #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" row))] {:home (cleanup (nth match 1)) :away (cleanup (nth match 2)) :home_score (nth match 3) :away_score (nth match 4)}))
If we run the function now we get a much nicer output to play with:
> (matches "/tmp/football/ec200203det.html") ( ... {:home "AC Milan", :away "Internazionale Milaan", :home_score "0", :away_score "0"} {:home "Juventus Turijn", :away "Real Madrid", :home_score "3", :away_score "1"} {:home "Internazionale Milaan", :away "AC Milan", :home_score "1", :away_score "1"} )
Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Trending
-
How To Use the Node Docker Official Image
-
MLOps: Definition, Importance, and Implementation
-
Building and Deploying Microservices With Spring Boot and Docker
-
Hibernate Get vs. Load
Comments