For those who are impatient or already aware of the background, feel free to skip straight to the results. For the rest of you, let me begin with a little background and an explicit description of my methodology.
Open Data is, increasingly, recognised as being A Good Thing. Governments are releasing data, making themselves more accountable, (possibly) saving themselves money by avoiding the need to endlessly answer Freedom of Information requests, and providing the foundation upon which a whole new generation of websites and mobile apps is being built. Museums and Libraries are releasing data, increasing visibility of their collections and freeing these institutional collections from their decades-long self-imposed exile in the ghetto of their own web sites. Scientists are beginning to release their data, making it far easier for their peers to engage in that fundamental principle of science: the reproduction of published results.
Open Data is good, and useful, and valuable, and increasingly visible. But without a license telling people what they can and cannot do, how much use is it? Former colleague Leigh Dodds did some work a few years ago, to look at the extent to which (notionally open) data was being explicitly licensed. He was concerned with a very specific set of data; contributions to the Linked Data Cloud. At the time, Leigh found only a third of the data sets carried an explicit license, and several of those license choices were dubious.
I was interested to see how the situation had changed. I’m running a short survey, inviting people to describe their own licensing choices. I’ve also taken a look at the Data Hub, which “contains 4004 datasets that you can browse, learn about and download.” This is a far larger set of data than the one Leigh studied back in 2009, and should hopefully therefore provide a richer picture of licensing choices. It’s worth remembering, though, that data owners must actively choose to contribute their data to the Data Hub. The Hub is run by the Open Knowledge Foundation, and it therefore seems likely that submissions will skew in favour of those who are more than normally enthusiastic about their data and more than normally predisposed toward open. For more, listen to my podcast with the Open Knowledge Foundation’s Rufus Pollock and Irina Bolychevsky.
- Not Specified [notspecified]
- Open Data Commons Public Domain Dedication & License [odc-pddl]
- Open Data Commons Open Database License [odc-odbl]
- Open Data Commons Attribution License [odc-by]
- Creative Commons CC0 [cc-zero]
- Creative Commons Attribution [cc-by]
- Creative Commons Attribution Share Alike [cc-by-sa]
- GNU Free Documentation License [gfdl]
- Other Open Licenses [other-open]
- Other Public Domain Licenses [other-pd]
- Other Attribution Licenses [other-at]
- UK Open Government License [uk-ogl]
- Creative Commons Non-Commercial [cc-nc]
- Other Non-Commercial Licenses [other-nc]
- Other closed licenses [other-closed]
I then downloaded the JSON dump from the Data Hub, but found that it was far older (and smaller) than the set of data available to the API. The JSON dump was last updated on 30 August 2011, and contained just over 2,000 entries. At the time of writing, the API offers access to 4,004 entries. With the help of Adrià Mercader, I learned how to submit the correct query to the API itself, giving me access to all 4,004 records. Results included 44 different values for the license_id attribute: the 15 above, 12 numeric values that were presumably errors of some kind, assorted ways of either saying nothing or specifying that the data had no license, and then a small number of records associated with specific licenses such as a Canadian Crown Copyright and the MIT License. Of 4,012 records, 874 appear to say nothing whatsoever about their license conditions; not even the null value used by 523.
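For anyone wanting to repeat this kind of tally, here is a minimal sketch of the counting step. It assumes the records have already been fetched and parsed into Python dictionaries; the sample records, the bucket labels and the `classify_license` helper are all hypothetical, and this is not the query I actually ran against the API.

```python
from collections import Counter


def classify_license(record):
    """Bucket one record by what its license_id says (or fails to say)."""
    if "license_id" not in record:
        return "missing attribute"
    value = record["license_id"]
    if value is None or str(value).strip() == "":
        return "null value"
    text = str(value).strip().lower()
    if text == "notspecified":
        return "not specified"
    if text.isdigit():
        # Purely numeric values such as '34' are treated as probable errors.
        return "numeric (probable error)"
    return text


# Hypothetical sample records, not real Data Hub entries.
records = [
    {"name": "a", "license_id": "cc-by"},
    {"name": "b", "license_id": None},
    {"name": "c"},
    {"name": "d", "license_id": "notspecified"},
    {"name": "e", "license_id": "34"},
    {"name": "f", "license_id": "cc-by"},
]

tally = Counter(classify_license(r) for r in records)
for label, count in tally.most_common():
    print(f"{label}: {count}")
```

Keeping 'missing attribute', 'null value' and 'not specified' as separate buckets matters here, because they are counted separately in the figures that follow.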
Looking at the raw numbers, the first impression must be a depressing one. Fully 50% of the records either explicitly state that there is no license (14), explicitly state that the license is ‘not specified’ (604), explicitly record a null value (523), or fail to include the license_id attribute at all (874). Given all of the effort that has gone into evangelising the importance of data licensing, and all the effort that Data Hub contributors have gone to in collecting, maintaining and submitting their data in the first place, that really isn’t very good at all. But at least it’s an improvement on what Leigh observed back in 2009.
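The four groups can be added up as a quick sanity check. This sketch uses only the counts quoted in this post (including the 31 errors and 1,966 licensed records discussed below); the variable names are illustrative.

```python
# Counts as quoted in this post.
unlicensed = {
    "explicitly no license": 14,
    "not specified": 604,
    "null value": 523,
    "license_id attribute absent": 874,
}
errors = 31      # numeric license_id values
licensed = 1966  # records carrying a recognisable license

unlicensed_total = sum(unlicensed.values())
all_records = unlicensed_total + errors + licensed
unlicensed_share = 100 * unlicensed_total / all_records

print(unlicensed_total, all_records, round(unlicensed_share, 1))
```

The four groups sum to 2,015 out of 4,012 records, which is where the 50% figure comes from.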
If we remove the 2,015 unlicensed records and the 31 errors (those well-known data licenses, including ‘1’, ‘34’, ‘73’, etc.), the picture becomes somewhat clearer.
The licenses that many have worked so hard to promote for open data (CC0, the Open Data Commons family and – in some circumstances – CC-BY) are far less prevalent than I’d expected them to be. 125 resources are licensed CC0, 273 CC-BY, 119 ODC-PDDL, 61 ODC-ODBL, and 36 ODC-BY. That’s a total of 614 out of 1,966 licensed resources, or just 31%. 44% of the 614 are licensed CC-BY; an attribution license based upon copyright rather than database rights. At least some of those may therefore be wrongly licensed. The two core data licenses are almost tied, (125 for CC0, 119 for ODC-PDDL), but together account for a tiny 12% of all the licensed resources in the Data Hub.
The picture’s not all bad, as there is clearly a move toward the principle of ‘open’ and ‘public domain’ licenses. CC0 (125) and ODC-PDDL (119) are joined by 167 data sets licensed with some other public domain license. And with 444 data sets, ‘other open license’ is the single most popular choice; almost one quarter of the licensed data sets use an open license that is not one of the mainstream ones.
In total, the Creative Commons family of licenses (including the odd ‘sharealike’ variant and the hugely annoying ‘noncommercial’ anachronism) account for 602 data sets, or 31%. The Open Data Commons family account for 216, or 11%.
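The percentages quoted above can be reproduced from the raw counts. The sketch below is only a check on the arithmetic, using the figures given in this post; the `share` helper is a hypothetical convenience function that rounds to the nearest whole percent.

```python
licensed_total = 1966

# Counts for the licenses most actively promoted for open data.
promoted = {"cc-zero": 125, "cc-by": 273, "odc-pddl": 119, "odc-odbl": 61, "odc-by": 36}
promoted_total = sum(promoted.values())

# Family and category totals as quoted in the text.
creative_commons_family = 602
open_data_commons_family = promoted["odc-pddl"] + promoted["odc-odbl"] + promoted["odc-by"]
other_open = 444


def share(part, whole):
    """Return part as a whole-number percentage of whole."""
    return round(100 * part / whole)


print("promoted licenses:", promoted_total, f"{share(promoted_total, licensed_total)}%")
print("CC-BY within promoted:", f"{share(promoted['cc-by'], promoted_total)}%")
print("CC0 + ODC-PDDL:", f"{share(promoted['cc-zero'] + promoted['odc-pddl'], licensed_total)}%")
print("other open:", f"{share(other_open, licensed_total)}%")
print("Creative Commons family:", f"{share(creative_commons_family, licensed_total)}%")
print("Open Data Commons family:", open_data_commons_family, f"{share(open_data_commons_family, licensed_total)}%")
```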
By most measures, we should probably welcome the use of any open or public domain license. But the more choices there are, the more scope there is for confusion, contradiction, and a lack of interoperability. Every time I want to take an ‘open’ dataset licensed with Open License A, and combine it with an ‘open’ dataset licensed with Open License B, there’s the nagging doubt that some wording in one of the licenses introduces a problem. Do I need to check with a lawyer? Do I need to check with one or both of the data providers? Is this all too much bother, and should I just go and do something else? License proliferation is friction.
So those are the results. What do they say to you?
It will be interesting to check back over time, and see how the proportions shift. Let’s work to eradicate the ‘None / Not Specified’ category altogether, and then see what we can do to shrink all of the ‘Other’ categories.