From Communications Data to #midata
From Communications Data to #midata
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
The content of this article was originally written by Tony Hirst over at his blog, OUseful.Info.
A BIS Press Release (Next steps making midata a reality) seems to have resulted in folk tweeting today about the #midata consultation that was announced last month. If you haven’t been keeping up, #midata is the policy initiative around getting companies to make “[consumer data] that may be actionable and useful in making a decision or in the course of a specific activity” (whatever that means) available to users in a machine readable form. To try to help clarify matters, several vignettes are described in this July 2012 report – Example applications of the midata programme – which plays the role of a ‘draft for discussion’ at the September midata Strategy Board [link?]. Here’s a quick summary of some of them:
- form filling: a personal datastore will help you pre-populate forms and provide certified evidence of things like: proof of her citizenship, qualified to drive, passed certain exams and achieved certain qualifications, passed a CRB check, and so on. (Note: I’ve previously tried to argue the case for the OU starting to develop a service (OU Qualification Verification Service) around delivering verified tokens relating to the award of OU degrees, and degrees awarded by the polytechnics, as was (courtesy of the OU’s CNAA Aftercare Service), but after an initial flurry of interest, it was passed on. midata could bring it back maybe?
- home moving admin: change your details in a personal “mydata” data store, and let everyone pick up the changes from there. Just think what fun you could have with an attack on this;-)
- contracts and warranties dashboard: did my crApple computer die the week before or after the guarantee ran out?
- keeping track of the housekeeping: bank and financial statement data management and reporting tools. I thought there already was software for doing this? do we use it though? I’d rather my bank improved the tools it provided me with?
- keeping up with the Jones’s: how does my house’s energy consumption compare with that of my neighbours?
- which phone? Pick a tariff automatically based on your actual phone usage. From going through this recently, the problem is not with knowing how I use my phone (easy enough to find out), it’s with navigating the mobile phone sites trying to understand their offers. (And why can’t Vodafone send me an SMS to say I’m 10 minutes away from using up this month’s minutes, rather than letting me go over? The midata answer might be an agent that looks at my usage info and tells me when I’m getting close to my limit, which requires me having access to my contract details in a machine readable form, I guess?
And here’s a BIS blog post summarising them: A midata future: 10 ways it could shape your choices.
(The #midata policy seems based on a belief that users want better access to data so they can do things with it. I’m not convinced – why should I have to export my bank data to another service (increasing the number of services I must trust) rather than my bank providing me with useful tools directly? I guess one way this might play out is that any data that does dribble out may get built around by developers who then sell the tools back to the data providers so they can offer them directly? In this context, I guess I should read the BIS commissioned Jigsaw Research report: Potential consumer demand for midata.)
Today has also seen a minor flurry of chat around the call for evidence on the Communications Data Bill, presumably because the closing date for responses is tomorrow (draft Communications Data Bill). (Related reading: latest Annual Report of the Interception of Communications Commissioner.) Again, if you haven’t been keeping up, the draft Communications Data Bill describes communications data in the following terms:
- Communications data is information about a communication; it can include the details of the time, duration, originator and recipient of a communication; but not the content of the communication itself
- Communications data falls into three categories: subscriber data; use data; and traffic data.
The categories are further defined in an annex:
- Subscriber Data – Subscriber data is information held or obtained by a provider in relation to persons to whom the service is provided by that provider. Those persons will include people who are subscribers to a communications service without necessarily using that service and persons who use a communications service without necessarily subscribing to it. Examples of subscriber information include:
– ‘Subscriber checks’ (also known as ‘reverse look ups’) such as “who is the subscriber of phone number 012 345 6789?”, “who is the account holder of e-mail account email@example.com?” or “who is entitled to post to web space www.xyz.anyisp.co.uk?”;
– Subscribers’ or account holders’ account information, including names and addresses for installation, and billing including payment method(s), details of payments;
– information about the connection, disconnection and reconnection of services which the subscriber or account holder is allocated or has subscribed to (or may have subscribed to) including conference calling, call messaging, call waiting and call barring telecommunications services;
– information about the provision to a subscriber or account holder of forwarding/redirection services;
– information about apparatus used by, or made available to, the subscriber or account holder, including the manufacturer, model, serial numbers and apparatus codes.
– information provided by a subscriber or account holder to a provider, such as demographic information or sign-up data (to the extent that information, such as a password, giving access to the content of any stored communications is not disclosed).
- Use data – Use data is information about the use made by any person of a postal or telecommunications service. Examples of use data may include:
– itemised telephone call records (numbers called);
– itemised records of connections to internet services;
– itemised timing and duration of service usage (calls and/or connections);
– information about amounts of data downloaded and/or uploaded;
– information about the use made of services which the user is allocated or has subscribed to (or may have subscribed to) including conference calling, call messaging, call waiting and call barring telecommunications services;
– information about the use of forwarding/redirection services;
– information about selection of preferential numbers or discount calls;
- Traffic Data – Traffic data is data that is comprised in or attached to a communication for the purpose of transmitting the communication. Examples of traffic data may include:
– information tracing the origin or destination of a communication that is in transmission;
– information identifying the location of equipment when a communication is or has been made or received (such as the location of a mobile phone);
– information identifying the sender and recipient (including copy recipients) of a communication from data comprised in or attached to the communication;
– routing information identifying equipment through which a communication is or has been transmitted (for example, dynamic IP address allocation, file transfer logs and e-mail headers – to the extent that content of a communication, such as the subject line of an e-mail, is not disclosed);
– anything, such as addresses or markings, written on the outside of a postal item (such as a letter, packet or parcel) that is in transmission;
– online tracking of communications (including postal items and parcels).
To put the communications data thing into context, here’s something you could try for yourself if you have a smartphone. Using something like the SMS to Text app (if you trust it!), grab your txt data from your phone and try charting it: SMS analysis (coming from an Android smartphone or an IPhone). And now ask yourself: what if I also mapped my location data, as collected by my phone? And will this sort of thing be available as midata, or will I have to collect it myself using a location tracking app if I want access to it? (There’s an asymmetry here: the company potentially collecting the data, or me collecting the data…)
It’s also worth bearing in mind that even if access to your data is locked down, access to the data of people associated with you might reveal quite a lot of information about you, including your location, as Adam Sadilek et al. describe: Finding Your Friends and Following Them to Where You Are (see also Far Out: Predicting Long-Term Human Mobility). My own tinkerings with emergent social positioning (looking at who the followers of particular twitter users also follow en masse) also suggest we can generate indicators about potential interests of a user by looking at the interests of their followers… Even if you’re careful about who your friends are, your followers might still reveal something about you you have tried not to disclose yourself (such as your birthday…). (That’s one of the problems with asymmetric trust models! Hmmm… could be interesting to start trying to model some of this… )
Both of these consultations provide a context for reflecting on the extent to which companies use data for their own processing purposes (for a recent review, see What happens to my data? A novel approach to informing users of data processing practices), the extent to which they share this data in raw and processed form with other companies or law enforcement agencies, the extent to which they may use it to underwrite value-added/data-powered services to users directly or when combined with data from other sources, the extent to which they may be willing to share it in raw or processed form back with users, and the extent to which users may then be willing (or licensed) to share that data with other providers, and/or combine it with data from other providers.
One of the biggest risks from a “what might they learn about me” point of view – as well as some of the biggest potential benefits – comes from the reconciliation of data from multiple different sources. Mosaic theory is an idea taken from the intelligence community that captures the idea that when data from multiple sources is combined, the value of the whole view may be greater than the sum of the parts. When privacy concerns are idly raised as a reason against the release of data, it is often suspicion and fears around what a data mosaic picture might reveal that act as drivers of these concerns. (Similar fears are also used as a reason against the release of data, for example under Freedom of Information requests, in case a mosaic results in a picture that can be used against national interests: eg D.E. Pozen, The Mosaic Theory, National Security, and the Freedom of Information Act and MP Goodwin, A National Security Puzzle: Mosaic Theory and the First Amendment Right of Access in the Federal Courts).
Note that within a particular dataset, we might also appeal to mosaic theory thinking; for example, might we learn different things when we observe individual data records as singletons, as opposed to a set of data (and the structures and patterns it contains) as a single thing: GPS Tracking and a ‘Mosaic Theory’ of Government Searches. And as a consequence, might we want to treat individual data records, and complete datasets, differently?
PS via this ORG post – Consulympics: opportunities to have your say on tech policies – which details a whole raft of currently open ICT related consultations in the UK, I am reminded of this ICO Consultation on the draft Anonymisation code of practice along with a draft of the anaoymisation code itself.
Published at DZone with permission of Eric Genesky . See the original article here.
Opinions expressed by DZone contributors are their own.