DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
View Events Video Library
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Integrating PostgreSQL Databases with ANF: Join this workshop to learn how to create a PostgreSQL server using Instaclustr’s managed service

Mobile Database Essentials: Assess data needs, storage requirements, and more when leveraging databases for cloud and edge applications.

Monitoring and Observability for LLMs: Datadog and Google Cloud discuss how to achieve optimal AI model performance.

Automated Testing: The latest on architecture, TDD, and the benefits of AI and low-code tools.

Related

  • Best GitHub-Like Alternatives for Machine Learning Projects
  • Deploy MuleSoft App to CloudHub2 Using GitHub Actions CI/CD Pipeline
  • Continuous Integration for iOS and macOS
  • Tips and Tricks for Efficient Coding in R

Trending

  • Essential Complexity Is the Developer's Unique Selling Point
  • Freedom to Code on Low-Code Platforms
  • What Technical Skills Can You Expect To Gain From a DevOps Course Syllabus?
  • Using Open Source for Data Integration and Automated Synchronizations
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. Deployment
  4. R: Scraping the Release Dates of Github Projects

R: Scraping the Release Dates of Github Projects

Mark Needham user avatar by
Mark Needham
·
Jun. 29, 15 · Interview
Like (0)
Save
Tweet
Share
1.06K Views

Join the DZone community and get the full member experience.

Join For Free

Continuing on from my blog post about scraping Neo4j’s release dates I thought it’d be even more interesting to chart the release dates of some github projects.

In theory the release dates should be accessible through the github API but the few that I looked at weren’t returning any data so I scraped the data together.

We’ll be using rvest again and I first wrote the following function to extract the release versions and dates from a single page:

library(dplyr)
library(rvest)
 
process_page = function(releases, session) {
  rows = session %>% html_nodes("ul.release-timeline-tags li")
 
  for(row in rows) {
    date = row %>% html_node("span.date")
    version = row %>% html_node("div.tag-info a")
 
    if(!is.null(version) && !is.null(date)) {
      date = date %>% html_text() %>% str_trim()
      version = version %>% html_text() %>% str_trim()
      releases = rbind(releases, data.frame(date = date, version = version))
    }  
  }
  return(releases)
}

Let’s try it out on the Cassandra release page and see what it comes back with:

> r = process_page(data.frame(), html_session("https://github.com/apache/cassandra/releases"))
> r
           date               version
1  Jun 22, 2015       cassandra-2.1.7
2  Jun 22, 2015      cassandra-2.0.16
3   Jun 8, 2015       cassandra-2.1.6
4   Jun 8, 2015   cassandra-2.2.0-rc1
5  May 19, 2015 cassandra-2.2.0-beta1
6  May 18, 2015      cassandra-2.0.15
7  Apr 29, 2015       cassandra-2.1.5
8   Apr 1, 2015      cassandra-2.0.14
9   Apr 1, 2015       cassandra-2.1.4
10 Mar 16, 2015      cassandra-2.0.13

That works pretty well but it’s only one page! To get all the pages we can use the follow_link function to follow the ‘Next’ link until there aren’t anymore pages to process.

We end up with the following function to do this:

find_all_releases = function(starting_page) {
  s = html_session(starting_page)
  releases = data.frame()
 
  next_page = TRUE
  while(next_page) {
    possibleError = tryCatch({  
      releases = process_page(releases, s)
      s = s %>% follow_link("Next") 
    }, error = function(e) { e })
 
    if(inherits(possibleError, "error")){
      next_page = FALSE
    }
  }
  return(releases)
}

Let’s try it out starting from the Cassandra page:

> cassandra = find_all_releases("https://github.com/apache/cassandra/releases")
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.13
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.10
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.8
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.13
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.0-rc1
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.0-beta2
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.10
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.6
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.0-rc2
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.7
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.4
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.0-rc3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.6.4
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.5.0-rc3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.4.0-final
 
> cassandra %>% sample_n(10)
            date               version
151 Mar 13, 2010   cassandra-0.5.0-rc2
25   Jul 3, 2014      cassandra-1.2.18
51  Jul 27, 2013       cassandra-1.2.8
21  Aug 19, 2014   cassandra-2.1.0-rc6
73  Sep 24, 2012 cassandra-1.2.0-beta1
158 Mar 13, 2010   cassandra-0.4.0-rc2
113 May 20, 2011     cassandra-0.7.6-2
15  Oct 24, 2014       cassandra-2.1.1
103 Sep 15, 2011 cassandra-1.0.0-beta1
93  Nov 29, 2011       cassandra-1.0.4

I want to plot when the different releases happened in time and in order to do that we need to create an extra column containing the ‘release series’ which we can do with the following transformation: 

series = function(version) {
  parts = strsplit(as.character(version), "\\.")  
  return(unlist(lapply(parts, function(p) paste(p %>% unlist %>% head(2), collapse = "."))))  
}
 
bySeries = cassandra %>%
  mutate(date2 = mdy(date), series = series(version),
         short_version = gsub("cassandra-", "", version),
         short_series = series(short_version))
 
> bySeries %>% sample_n(10)
            date               version      date2        series short_version short_series
3    Jun 8, 2015       cassandra-2.1.6 2015-06-08 cassandra-2.1         2.1.6          2.1
161 Mar 13, 2010 cassandra-0.4.0-beta1 2010-03-13 cassandra-0.4   0.4.0-beta1          0.4
62  Feb 15, 2013      cassandra-1.1.10 2013-02-15 cassandra-1.1        1.1.10          1.1
153 Mar 13, 2010 cassandra-0.5.0-beta2 2010-03-13 cassandra-0.5   0.5.0-beta2          0.5
37   Feb 7, 2014       cassandra-2.0.5 2014-02-07 cassandra-2.0         2.0.5          2.0
36   Feb 7, 2014      cassandra-1.2.15 2014-02-07 cassandra-1.2        1.2.15          1.2
29   Jun 2, 2014   cassandra-2.1.0-rc1 2014-06-02 cassandra-2.1     2.1.0-rc1          2.1
21  Aug 19, 2014   cassandra-2.1.0-rc6 2014-08-19 cassandra-2.1     2.1.0-rc6          2.1
123 Feb 16, 2011       cassandra-0.7.2 2011-02-16 cassandra-0.7         0.7.2          0.7
135  Nov 1, 2010 cassandra-0.7.0-beta3 2010-11-01 cassandra-0.7   0.7.0-beta3          0.7

Now let’s plot those releases and see what we get:

ggplot(aes(x = date2, y = short_series), 
       data = bySeries %>% filter(!grepl("beta|rc", short_version))) +     
  geom_text(aes(label=short_version),hjust=0.5, vjust=0.5, size = 4, angle = 90) + 
  theme_bw()

2015 06 23 22 59 19

An interesting thing we can see from this visualisation is what overlap the various series of versions have. Most of the time there are only two series of versions overlapping but the 1.2, 2.0 and 2.1 series all overlap which is unusual.

In this chart we excluded all beta and RC versions. Let’s bring those back in and just show the last 3 versions:

ggplot(aes(x = date2, y = short_series), 
       data = bySeries %>% filter(grepl("2\\.[012]\\.|1\\.2\\.", short_version))) +     
  geom_text(aes(label=short_version),hjust=0.5, vjust=0.5, size = 4, angle = 90) + 
  theme_bw()

2015 06 23 23 08 04

From this chart it’s clearer that the 2.0 and 2.1 series have recent releases so there will probably be three overlapping versions when the 2.2 series is released as well.

The chart is still a bit cluttered although less than before. I’m not sure of a better way of visualising this type of data so if you have any ideas do let me know!


Release (agency) R (programming language) GitHub

Opinions expressed by DZone contributors are their own.

Related

  • Best GitHub-Like Alternatives for Machine Learning Projects
  • Deploy MuleSoft App to CloudHub2 Using GitHub Actions CI/CD Pipeline
  • Continuous Integration for iOS and macOS
  • Tips and Tricks for Efficient Coding in R

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: