Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Scraping DZone Syndication Stats

DZone's Guide to

Scraping DZone Syndication Stats

· DevOps Zone ·
Free Resource

Learn more about how CareerBuilder was able to resolve customer issues 5x faster by using Scalyr, the fastest log management tool on the market. 

DZone syndicate some of my blog entries. I used to keep a list by hand, but it is easy to get out of step that way. That’s a manual process too, so why not automate it?

The DZone page is a little tricky to scrape in that it does not show all syndicated articles until a “More” button has been pressed exhaustively. That means we’re looking at Selenium rather than wget or curl. Here’s some Groovy:

@Grapes(
    @Grab(group='org.seleniumhq.selenium', module='selenium-java', version='2.44.0')
)
 
import org.openqa.selenium.*
import org.openqa.selenium.support.ui.*
import org.openqa.WebDriver.*
import org.openqa.selenium.firefox.FirefoxDriver
import groovy.json.*

WebDriver driver = new FirefoxDriver()
driver.with {
  navigate().to("http://dzone.com/users/paulhammant")

  // keep clicking "More" button until it disappears

  def articleElements = []
  while (true) {
    def moreButton = By.xpath("//button[@class='more-button']")
    try {
      // button may arrive in page slowly. timeout after 3 secs
      new WebDriverWait(driver, 3).until(ExpectedConditions.visibilityOfElementLocated(moreButton))
      findElement(moreButton).click()
    } catch(TimeoutException) {
      break
    } finally {
      articleElements = findElements(By.xpath("//div[@class='activity-stream']/ul/li"))
    }
  }

  def results = new Expando()
  results.from = null
  results.to = null
  List articles = new ArrayList()
  
  // Turn each article element (and child elements) into POJO
  
  articleElements.each() { link ->
    def divs = link.findElements(By.tagName("div"))
    def article = new Expando()

    divs.each() { div ->
      // article and stats for it are sibling elements, not parent/child.
      if (div.getAttribute("class").contains("stream-article")) {
        def a = div.findElement(By.tagName("a"))
        article.title = a.getText()        
        article.url = a.getAttribute("href")
      }
      if (div.getAttribute("class").contains("activity-stats-group")) {
        def divs2 = div.findElements(By.tagName("div"))
        divs2.each() { div2 ->
          def text = div2.getText()
          if (text.endsWith("VIEWS")) {
            article.views = Integer.parseInt(text.replace("VIEWS","").replace(",","").trim())    
          }
          if (text.endsWith("COMMENTS")) {
            article.comments = Integer.parseInt(text.replace("COMMENTS","").replace(",","").trim())
          }
          if (text.startsWith("on ")) {
            article.date = new Date().parse("MMM dd, yyyy", text.substring(text.indexOf("|")+1, text.indexOf(".")).trim())
          }
        }
      }
    }
    articles.add(article)
    if (results.from == null || article.date < results.from) {
       results.from = article.date
    }
    if (results.to == null || article.date > results.to) {
       results.to = article.date
    }
  }
  results.articles = articles.reverse() // JSON diffs look better.
  results.when = new Date()
  new File("DZone.json").withWriter { out ->
    out.write(new JsonBuilder(results).toPrettyString())
  }
  quit()
}

The JSON is consumed by AngularJS, and you can see it at http://paulhammant.com/dzone.html. Given I have used AngularJS, the page won’t be indexed by search crawlers presently. That’s not a problem to me, really, as I don’t want DZone’s rankings for my articles to be higher than the originals. DZone sometimes change the titles, I note.

The JSON is committed to Github, and I can watch changes over time from the comfort of an armchair:

Perhaps this is interesting only to people who’s blogs are aggregated into DZone, and until the DZone people make a proper feed. Of course they may have that already, and I missed it.


Find out more about how Scalyr built a proprietary database that does not use text indexing for their log management tool.

Topics:

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}