
Scraping DZone Syndication Stats

by Paul Hammant · Dec. 16, 14 · DevOps Zone · Interview

DZone syndicates some of my blog entries. I used to keep a list by hand, but it is easy to get out of step that way. That's a manual process too, so why not automate it?

The DZone page is a little tricky to scrape, in that it does not show all syndicated articles until a "more" button has been pressed exhaustively. That means we're looking at Selenium rather than wget or curl. Here's some Groovy:

@Grapes(
    @Grab(group='org.seleniumhq.selenium', module='selenium-java', version='2.44.0')
)
import org.openqa.selenium.*
import org.openqa.selenium.support.ui.*
import org.openqa.selenium.firefox.FirefoxDriver
import groovy.json.*

WebDriver driver = new FirefoxDriver()
driver.with {
  navigate().to("http://dzone.com/users/paulhammant")

  // Keep clicking the "more" button until it disappears

  def articleElements = []
  while (true) {
    def moreButton = By.xpath("//button[@class='more-button']")
    try {
      // Button may arrive in the page slowly. Time out after 3 secs
      new WebDriverWait(driver, 3).until(ExpectedConditions.visibilityOfElementLocated(moreButton))
      findElement(moreButton).click()
    } catch (TimeoutException e) {
      // Name the variable: a bare catch(TimeoutException) would catch every Exception in Groovy
      break
    } finally {
      articleElements = findElements(By.xpath("//div[@class='activity-stream']/ul/li"))
    }
  }

  def results = new Expando()
  results.from = null
  results.to = null
  List articles = new ArrayList()

  // Turn each article element (and child elements) into a POJO

  articleElements.each() { link ->
    def divs = link.findElements(By.tagName("div"))
    def article = new Expando()

    divs.each() { div ->
      // Article and stats for it are sibling elements, not parent/child.
      if (div.getAttribute("class").contains("stream-article")) {
        def a = div.findElement(By.tagName("a"))
        article.title = a.getText()
        article.url = a.getAttribute("href")
      }
      if (div.getAttribute("class").contains("activity-stats-group")) {
        def divs2 = div.findElements(By.tagName("div"))
        divs2.each() { div2 ->
          def text = div2.getText()
          if (text.endsWith("views")) {
            article.views = Integer.parseInt(text.replace("views","").replace(",","").trim())
          }
          if (text.endsWith("comments")) {
            article.comments = Integer.parseInt(text.replace("comments","").replace(",","").trim())
          }
          if (text.startsWith("on ")) {
            // Pattern letters are case-sensitive: MMM is the month name, mm would be minutes
            article.date = Date.parse("MMM dd, yyyy", text.substring(text.indexOf("|")+1, text.indexOf(".")).trim())
          }
        }
      }
    }
    articles.add(article)
    // Track the earliest and latest article dates seen
    if (results.from == null || article.date < results.from) {
       results.from = article.date
    }
    if (results.to == null || article.date > results.to) {
       results.to = article.date
    }
  }
  results.articles = articles.reverse() // JSON diffs look better.
  results.when = new Date()
  new File("dzone.json").withWriter { out ->
    out.write(new JsonBuilder(results).toPrettyString())
  }
  quit()
}
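
The stat parsing in the inner loop could be pulled out into a small function that is testable without a browser. This is only a sketch: `parseStat` is a hypothetical name, and the sample strings are my guess at what the page text looks like rather than captured output.

```groovy
// Hypothetical helper: extracts the leading count from strings such as "1,234 views".
// Returns null when the text does not end with the given suffix or has no usable number.
Integer parseStat(String text, String suffix) {
    if (text == null || !text.trim().endsWith(suffix)) {
        return null
    }
    def digits = text.trim()
    // Drop the suffix, then the thousands separators and surrounding whitespace
    digits = digits.substring(0, digits.length() - suffix.length()).replace(",", "").trim()
    return digits.isInteger() ? digits.toInteger() : null
}

// Assumed sample strings, for illustration only
assert parseStat("1,234 views", "views") == 1234
assert parseStat("0 comments", "comments") == 0
assert parseStat("on Dec 16, 2014", "views") == null
```

Unlike the bare `Integer.parseInt` calls, this returns null instead of throwing when the text is not a clean number, which may or may not be what you want for a scraper that should fail loudly when the page changes.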

The JSON is consumed by AngularJS, and you can see it at http://paulhammant.com/dzone.html . Because I used AngularJS, the page won't be indexed by search crawlers at present. That's not really a problem for me, as I don't want DZone's rankings for my articles to be higher than the originals. DZone sometimes changes the titles, I note.
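
To give a feel for what a consumer of dzone.json has to work with, here is a minimal sketch that builds one record the same way the script does and reads it back. Only the field names come from the script; the title, URL and numbers are invented for illustration.

```groovy
import groovy.json.JsonBuilder
import groovy.json.JsonSlurper
import java.text.SimpleDateFormat

// Invented sample values; only the field names come from the scraping script.
def day = new SimpleDateFormat("yyyy-MM-dd").parse("2014-12-16")

def article = new Expando()
article.title = "An example title"
article.url = "http://example.com/an-example"
article.views = 1234
article.comments = 2
article.date = day

def results = new Expando()
results.from = day
results.to = day
results.articles = [article]
results.when = new Date()

String json = new JsonBuilder(results).toPrettyString()
println json

// Reading it back shows the structure a consumer would see.
def parsed = new JsonSlurper().parseText(json)
assert parsed.articles[0].views == 1234
assert parsed.articles[0].title == "An example title"
```

The dates are written out as strings, so a page consuming the file would need to parse them itself if it wants to sort or format them.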

The JSON is committed to GitHub, and I can watch changes over time from the comfort of an armchair.
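
If reading raw diffs gets tedious, two committed snapshots could also be compared in code. The following is a rough sketch rather than anything I actually run: `viewDeltas` is a hypothetical helper, and the two JSON strings are made-up stand-ins for an older and a newer dzone.json.

```groovy
import groovy.json.JsonSlurper

// Hypothetical helper: maps each article URL to the change in views between two snapshots.
Map viewDeltas(String oldJson, String newJson) {
    def slurper = new JsonSlurper()
    def before = slurper.parseText(oldJson).articles.collectEntries { [(it.url): it.views ?: 0] }
    def after = slurper.parseText(newJson).articles.collectEntries { [(it.url): it.views ?: 0] }
    // Articles missing from the older snapshot count from zero.
    def deltas = after.collectEntries { url, views -> [(url): views - (before[url] ?: 0)] }
    // Keep only the articles whose view count actually changed.
    return deltas.findAll { it.value != 0 }
}

// Made-up snapshots, for illustration only
def older = '{"articles":[{"url":"http://example.com/a","views":100}]}'
def newer = '{"articles":[{"url":"http://example.com/a","views":130},{"url":"http://example.com/b","views":5}]}'

assert viewDeltas(older, newer) == ["http://example.com/a": 30, "http://example.com/b": 5]
```

Keying on the URL rather than the title matters here, since the titles are the part DZone sometimes changes.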

Perhaps this is interesting only to people whose blogs are aggregated into DZone, and only until the DZone people make a proper feed. Of course, they may have that already, and I missed it.


Published at DZone with permission of Paul Hammant, DZone MVB. See the original article here.
