Scraping DZone Syndication Stats

DZone syndicate some of my blog entries. I used to keep a list of them by hand, but a hand-kept list easily gets out of step, so why not automate it?

The DZone page is a little tricky to scrape: it does not show all syndicated articles until a “More” button has been pressed repeatedly, until nothing more remains to load. That rules out wget or curl and means driving a real browser with Selenium. Here’s some Groovy:

    @Grab(group='org.seleniumhq.selenium', module='selenium-java', version='2.44.0')
    import org.openqa.selenium.*
    import org.openqa.selenium.support.ui.*
    import org.openqa.selenium.firefox.FirefoxDriver
    import groovy.json.*

    // URL of the DZone page to scrape, passed as the first script argument
    // (the URL is not shown in this listing).
    String pageUrl = args[0]

    WebDriver driver = new FirefoxDriver()
    driver.with {
      get(pageUrl)

      // keep clicking "More" button until it disappears
      def articleElements = []
      while (true) {
        def moreButton = By.xpath("//button[@class='more-button']")
        try {
          // button may arrive in page slowly. timeout after 3 secs
          new WebDriverWait(driver, 3).until(ExpectedConditions.visibilityOfElementLocated(moreButton))
          findElement(moreButton).click()
        } catch (TimeoutException e) {
          break // no "More" button appeared: everything is loaded
        } finally {
          articleElements = findElements(By.xpath("//div[@class='activity-stream']/ul/li"))
        }
      }

      def results = new Expando()
      results.from = null
      results.to = null
      List articles = new ArrayList()
      // Turn each article element (and child elements) into POJO
      articleElements.each { link ->
        def divs = link.findElements(By.tagName("div"))
        def article = new Expando()

        divs.each { div ->
          // article and stats for it are sibling elements, not parent/child.
          def cssClass = div.getAttribute("class") ?: ""
          if (cssClass.contains("stream-article")) {
            def a = div.findElement(By.tagName("a"))
            article.title = a.getText()
            article.url = a.getAttribute("href")
          }
          if (cssClass.contains("activity-stats-group")) {
            div.findElements(By.tagName("div")).each { div2 ->
              def text = div2.getText()
              if (text.endsWith("VIEWS")) {
                article.views = Integer.parseInt(text.replace("VIEWS","").replace(",","").trim())
              }
              if (text.endsWith("COMMENTS")) {
                article.comments = Integer.parseInt(text.replace("COMMENTS","").replace(",","").trim())
              }
              if (text.startsWith("on ")) {
                // the date sits between the "|" and the "."
                article.date = Date.parse("MMM dd, yyyy", text.substring(text.indexOf("|")+1, text.indexOf(".")).trim())
              }
            }
          }
        }
        articles << article
        // track the earliest and latest article dates
        if (results.from == null || article.date < results.from) {
          results.from = article.date
        }
        if (results.to == null || article.date > results.to) {
          results.to = article.date
        }
      }
      results.articles = articles.reverse() // JSON diffs look better.
      results.when = new Date()
      new File("DZone.json").withWriter { out ->
        out.write(new JsonBuilder(results).toPrettyString())
      }
      quit() // close the browser
    }
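
The text handling in the inner loop can be tried without a browser. What follows is a minimal sketch, assuming the stat strings look like `1,234 VIEWS`, `5 COMMENTS` and `on <zone> | Jan 05, 2015.`; those shapes are inferred from the parsing code above, not checked against the live page. The helper names `parseCount` and `parseDate` are hypothetical, and the sample strings are made up.

```groovy
import java.text.SimpleDateFormat

// Hypothetical helpers mirroring the parsing done in the scraper script.

// "1,234 VIEWS" -> 1234: strip the label and the thousands separators
int parseCount(String text, String label) {
    Integer.parseInt(text.replace(label, "").replace(",", "").trim())
}

// Parse the date found between "|" and the first "." that follows it.
// Searching for "." only after the "|" is a small hardening over the script.
Date parseDate(String text) {
    int start = text.indexOf("|") + 1
    int end = text.indexOf(".", start)
    new SimpleDateFormat("MMM dd, yyyy", Locale.ENGLISH).parse(text.substring(start, end).trim())
}

// Made-up sample strings in the assumed shapes
println parseCount("1,234 VIEWS", "VIEWS")
println parseCount("5 COMMENTS", "COMMENTS")
println new SimpleDateFormat("yyyy-MM-dd").format(parseDate("on Some Zone | Jan 05, 2015."))
```

Keeping the string handling in small functions like these makes it easy to re-check the parsing when DZone changes its markup, which is the most likely way the scraper breaks.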

The JSON is consumed by AngularJS, and you can see it at http://paulhammant.com/dzone.html. Because the page is rendered by AngularJS, search crawlers won’t index it at present. That’s not a problem for me, really, as I don’t want DZone’s copies of my articles to rank higher than the originals. DZone sometimes change the titles, I note.
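
The same file can be read by any JSON consumer, not only an AngularJS page. Here is a minimal Groovy sketch, assuming the field names the scraper script writes (`articles`, `title`, `views`, `comments`). The sample data is invented for illustration and is not taken from the real file.

```groovy
import groovy.json.JsonSlurper

// Invented sample in the shape the scraper writes; the real data lives in DZone.json
def sample = '''
{
  "articles": [
    {"title": "Example A", "url": "http://example.com/a", "views": 1200, "comments": 3},
    {"title": "Example B", "url": "http://example.com/b", "views": 800, "comments": 0}
  ]
}
'''

// For the real file, use: new JsonSlurper().parse(new File("DZone.json"))
def results = new JsonSlurper().parseText(sample)

// Aggregate a couple of simple stats across all articles
int totalViews = results.articles.sum { it.views }
def mostViewed = results.articles.max { it.views }

println "Total views: ${totalViews}"
println "Most viewed: ${mostViewed.title} (${mostViewed.views})"
```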

The JSON is committed to GitHub, so I can watch changes over time from the comfort of an armchair.

Perhaps this is interesting only to people whose blogs are aggregated into DZone, and only until the DZone people make a proper feed. Of course, they may have one already and I missed it.

Published at DZone with permission of
