Scraping DZone Syndication Stats
DZone syndicates some of my blog entries. I used to keep a list by hand, but it is easy to get out of step that way. That’s a manual process too, so why not automate it?
The DZone page is a little tricky to scrape in that it does not show all syndicated articles until a “More” button has been pressed exhaustively. That means we’re looking at Selenium rather than wget or curl. Here’s some Groovy:
    @Grapes(
        @Grab(group='org.seleniumhq.selenium', module='selenium-java', version='2.44.0')
    )
    import org.openqa.selenium.*
    import org.openqa.selenium.support.ui.*
    import org.openqa.selenium.firefox.FirefoxDriver
    import groovy.json.*

    WebDriver driver = new FirefoxDriver()

    driver.with {
        navigate().to("http://dzone.com/users/paulhammant")

        // Keep clicking the "More" button until it disappears
        def articleElements = []
        while (true) {
            def moreButton = By.xpath("//button[@class='more-button']")
            try {
                // Button may arrive in the page slowly. Time out after 3 secs
                new WebDriverWait(driver, 3).until(ExpectedConditions.visibilityOfElementLocated(moreButton))
                findElement(moreButton).click()
            } catch (TimeoutException e) {
                // No "More" button appeared in time, so the list is complete
                break
            } finally {
                articleElements = findElements(By.xpath("//div[@class='activity-stream']/ul/li"))
            }
        }

        def results = new Expando()
        results.from = null
        results.to = null
        List articles = new ArrayList()

        // Turn each article element (and child elements) into a POJO
        articleElements.each { link ->
            def divs = link.findElements(By.tagName("div"))
            def article = new Expando()
            divs.each { div ->
                // Article and stats for it are sibling elements, not parent/child.
                if (div.getAttribute("class").contains("stream-article")) {
                    def a = div.findElement(By.tagName("a"))
                    article.title = a.getText()
                    article.url = a.getAttribute("href")
                }
                if (div.getAttribute("class").contains("activity-stats-group")) {
                    def divs2 = div.findElements(By.tagName("div"))
                    divs2.each { div2 ->
                        def text = div2.getText()
                        if (text.endsWith("views")) {
                            article.views = Integer.parseInt(text.replace("views", "").replace(",", "").trim())
                        }
                        if (text.endsWith("comments")) {
                            article.comments = Integer.parseInt(text.replace("comments", "").replace(",", "").trim())
                        }
                        if (text.startsWith("on ")) {
                            // The date sits between the "|" and the first "."
                            article.date = Date.parse("MMM dd, yyyy",
                                    text.substring(text.indexOf("|") + 1, text.indexOf(".")).trim())
                        }
                    }
                }
            }
            articles.add(article)
            // Track the earliest and latest article dates seen
            if (results.from == null || article.date < results.from) {
                results.from = article.date
            }
            if (results.to == null || article.date > results.to) {
                results.to = article.date
            }
        }

        results.articles = articles.reverse() // JSON diffs look better.
        results.when = new Date()

        new File("dzone.json").withWriter { out ->
            out.write(new JsonBuilder(results).toPrettyString())
        }

        quit()
    }
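The text parsing buried inside the nested closures can be tried out without a browser. Here is a minimal sketch that pulls it into two helpers. The names `parseCount` and `parseDate` are hypothetical and not part of the script above, the sample strings only assume what the page text looks like, and `SimpleDateFormat` with an explicit English locale is swapped in for the date parsing call so that month names parse the same way on any machine.

```groovy
import java.text.SimpleDateFormat

// Hypothetical helper: turn text such as "1,234 views" into the integer 1234.
int parseCount(String text, String suffix) {
    Integer.parseInt(text.replace(suffix, "").replace(",", "").trim())
}

// Hypothetical helper: pull the date out of text assumed to look like
// "on <something> | Nov 12, 2014." (between the "|" and the first ".").
Date parseDate(String text) {
    def raw = text.substring(text.indexOf("|") + 1, text.indexOf(".")).trim()
    new SimpleDateFormat("MMM dd, yyyy", Locale.ENGLISH).parse(raw)
}

// Sample inputs are made up for illustration.
assert parseCount("1,234 views", "views") == 1234
assert parseCount("7 comments", "comments") == 7
println parseDate("on example | Nov 12, 2014.")
```

Keeping these apart from the Selenium calls means a change in DZone's wording shows up as a failed assertion rather than as a half-populated JSON file.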
The JSON is consumed by AngularJS, and you can see it at http://paulhammant.com/dzone.html. Given that I have used AngularJS, the page won’t be indexed by search crawlers at present. That’s not a problem to me, really, as I don’t want DZone’s rankings for my articles to be higher than the originals. DZone sometimes changes the titles, I note.
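The same file can be read back from Groovy for a quick summary. This is only a sketch: it assumes the file has a top-level `articles` list whose entries carry the `title`, `url`, `views` and `comments` fields set in the script, and it parses an inline sample with made-up values so that it runs on its own.

```groovy
import groovy.json.JsonSlurper

// Illustrative sample only: field names follow the scraping script,
// the values are made up.
def sample = '''
{
  "articles": [
    {"title": "First post",  "url": "http://example.com/1", "views": 1200, "comments": 3},
    {"title": "Second post", "url": "http://example.com/2", "views": 800,  "comments": 0}
  ]
}
'''

// Against the real file this would be:
//   new JsonSlurper().parse(new File("dzone.json"))
def results = new JsonSlurper().parseText(sample)

def totalViews = results.articles.sum { it.views }
def mostViewed = results.articles.max { it.views }

println "Articles: ${results.articles.size()}, total views: ${totalViews}"
println "Most viewed: ${mostViewed.title}"
```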
The JSON is committed to GitHub, and I can watch changes over time from the comfort of an armchair.
Perhaps this is interesting only to people whose blogs are aggregated into DZone, and only until the DZone people make a proper feed. Of course, they may have that already and I missed it.
Published at DZone with permission of Paul Hammant, DZone MVB. See the original article here.