Open Data Backed by Source Control
Open Data Backed by Source Control
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
Source-control backing is a decade-long obsession of mine. Now I'm thinking about “open data.” If something can be represented by a textual document, is structured or regular, and tends towards completeness over time, then source-control is a viable alternative to a relational schema (or a document store). JSON, provided it adheres to a standard/tidy/pretty format is particularly good for source control, although semantic merge would make that even better.
I’ll go into a couple of examples...
Food Nutrition Tables
Cupcake has Raw Data like so (again click it to see the real data):
It is easy to see how you could collect nutrition info in a Github repo. Not just generic foodstuffs, but also brands (assuming litigation did not try to stop that). It’d be nice to see per 100g quantities as well, as there’s some rounding down allowed in the US for sub one gram quantities in a ‘serving’. The application I made (I cribbed code from Jonathon Cihlar’s 2007 blog entry on nutrition tables in XHTML and CSS) is output only; You would have to push data into GitHub manually.
US Hotel Data
This one is much the same idea. Colleague Andrew Dean led the development of it (at a ThoughtWorks hosted ‘GeekNight’ in Dallas some weeks ago), and I’ve forked it on GitHub to push it a little further for this blog entry. This one is not a GitHub Pages thing. It requires a custom Sinatra server to allow data live editing — see the ruby script in the source root. That script uses a unix utility Jshon to tidy JSON source, as you want to minimize diffs when you’re committing to the source repo. Lastly, the commits don’t happen in the app we’ve made, although we know how as we’ve previously made a Sinatra app that did commit to github. Instead, you’ll have to do the commit yourself (for now), after editing data over http://localhost:4567. There are two pages for this (so far) -
An index page that asks for key fields for an entry, for direct access:
The resulting data page for that (note the URL is as the data in the repo — without the .json suffix):
With a small shim in GitHub’s infrastructure that could process PUTS and mess around with URLS for the sake of a GET, GitHub pages could accept writes of JSON content for authenticated and authorized users. Pull Requests could be a way of donating them back to the master repo. The GitHub folks would have to work out how to allow developers to program that server side handler (hopefully it would look like Sinatra).
Then again, there could be forks of the repos that maintain divergence (also a blog entry) in that they add data that the master does not really want. For the Hotels example, one could have fork that adds “costs of wifi” (detailed in that blog entry). Nutrution Info could have forks that add “good/bad for Paleo diets”, which could also have a fork “good/bad for Primal Blueprint diets”. You get the picture hopefully.
Published at DZone with permission of Paul Hammant , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.