Over a million developers have joined DZone.

Scraping DZone Syndication Stats

DZone 's Guide to

Scraping DZone Syndication Stats

· DevOps Zone ·
Free Resource

DZone syndicate some of my blog entries. I used to keep a list by hand, but it is easy to get out of step that way. That’s a manual process too, so why not automate it?

The DZone page is a little tricky to scrape in that it does not show all syndicated articles until a “More” button has been pressed exhaustively. That means we’re looking at Selenium rather than wget or curl. Here’s some Groovy:

    @Grab(group='org.seleniumhq.selenium', module='selenium-java', version='2.44.0')
import org.openqa.selenium.*
import org.openqa.selenium.support.ui.*
import org.openqa.WebDriver.*
import org.openqa.selenium.firefox.FirefoxDriver
import groovy.json.*

WebDriver driver = new FirefoxDriver()
driver.with {

  // keep clicking "More" button until it disappears

  def articleElements = []
  while (true) {
    def moreButton = By.xpath("//button[@class='more-button']")
    try {
      // button may arrive in page slowly. timeout after 3 secs
      new WebDriverWait(driver, 3).until(ExpectedConditions.visibilityOfElementLocated(moreButton))
    } catch(TimeoutException) {
    } finally {
      articleElements = findElements(By.xpath("//div[@class='activity-stream']/ul/li"))

  def results = new Expando()
  results.from = null
  results.to = null
  List articles = new ArrayList()
  // Turn each article element (and child elements) into POJO
  articleElements.each() { link ->
    def divs = link.findElements(By.tagName("div"))
    def article = new Expando()

    divs.each() { div ->
      // article and stats for it are sibling elements, not parent/child.
      if (div.getAttribute("class").contains("stream-article")) {
        def a = div.findElement(By.tagName("a"))
        article.title = a.getText()        
        article.url = a.getAttribute("href")
      if (div.getAttribute("class").contains("activity-stats-group")) {
        def divs2 = div.findElements(By.tagName("div"))
        divs2.each() { div2 ->
          def text = div2.getText()
          if (text.endsWith("VIEWS")) {
            article.views = Integer.parseInt(text.replace("VIEWS","").replace(",","").trim())    
          if (text.endsWith("COMMENTS")) {
            article.comments = Integer.parseInt(text.replace("COMMENTS","").replace(",","").trim())
          if (text.startsWith("on ")) {
            article.date = new Date().parse("MMM dd, yyyy", text.substring(text.indexOf("|")+1, text.indexOf(".")).trim())
    if (results.from == null || article.date < results.from) {
       results.from = article.date
    if (results.to == null || article.date > results.to) {
       results.to = article.date
  results.articles = articles.reverse() // JSON diffs look better.
  results.when = new Date()
  new File("DZone.json").withWriter { out ->
    out.write(new JsonBuilder(results).toPrettyString())

The JSON is consumed by AngularJS, and you can see it at http://paulhammant.com/dzone.html. Given I have used AngularJS, the page won’t be indexed by search crawlers presently. That’s not a problem to me, really, as I don’t want DZone’s rankings for my articles to be higher than the originals. DZone sometimes change the titles, I note.

The JSON is committed to Github, and I can watch changes over time from the comfort of an armchair:

Perhaps this is interesting only to people who’s blogs are aggregated into DZone, and until the DZone people make a proper feed. Of course they may have that already, and I missed it.


Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}