
Government and Big Data: Friend or Foe?


There's a potential for governments to use big data to improve the quality of life in a country — but this is achieved at the cost of compromising individual privacy.



Not everything we do online is accessible to companies and governments without a warrant, but most of it is. The accumulated record of our online footprint could be called “big data,” but how big is big data in reality?

Note: The relevance of the information is based on the country of interest, as legal standards for data collection differ between countries. For the purpose of this discussion, our primary focus will be on the United States and Great Britain.

What We Are Aware Of

First, let’s talk about the data that we know is collected within legal norms. Most of it is data that we give away consensually. For example, we know that loyalty cards are used to collect data about our shopping patterns, but we do not consider this an issue because we know that the data is used to optimize our shopping experience, tailor ads to us, and give us bonus points.

We also realize that our online browsing information is regularly collected so that advertisements we see and services we use are adjusted to our interests.

We are also aware that the government keeps records of the demographic and legal characteristics of the population for security and public organization reasons. But what legal data collection methods are in use around us that we don't pay attention to?


Anyone who has ever thought about their online browsing privacy has heard of the NSA and GCHQ (its British counterpart). When these names are mentioned, we feel very insecure about our privacy. But let's look at the publicly available information about the types of data these organizations can legally collect on a person who is not involved in criminal activity that warrants special attention.

Wiretapping is a 20th-century intelligence collection tool, but it has been adapted to the modern environment. The Snowden leaks provide information about the bulk data collection that GCHQ and the NSA are involved in. At the most basic level, GCHQ taps into the fiber-optic networks that carry phone and internet traffic (an operation called Tempora), meaning that it has access not just to metadata (such as who is communicating with whom and from which devices) but to the actual content, too. Content may include emails, internet browsing information, Facebook messages, and recorded phone calls. According to The Guardian's reporting on the leaked documents, content is stored for up to three days and metadata for up to 30 days for analysis and, if needed, shared with the NSA as part of the two agencies' cooperation. If you thought that the NSA was the biggest player in data collection, it may come as a surprise that GCHQ collects more metadata than the NSA. As The Guardian reported after reviewing the leaked documents:

“The documents reveal that by last year GCHQ was handling 600m 'telephone events' each day, had tapped more than 200 fiber-optic cables and was able to process data from at least 46 of them at a time.”

Once again, it is important to highlight that this is not only the data of people suspected of a crime, but of anyone. In response to this surveillance program, a case has been filed with the European Court of Human Rights, but it is important to understand that GCHQ and the NSA operate within legislative loopholes. For example, GCHQ can only collect data that enters from outside Britain, but because of the way modern internet infrastructure routes traffic, even data sent and received by individuals within the country often leaves it and then reenters.

Now let's talk more about the NSA. The NSA's XKEYSCORE program collected data about internet users' emails, messages, images, passwords, and even VoIP calls. As with GCHQ, this is achieved by intercepting traffic on fiber-optic cables. The content is stored for a short period, while metadata remains for months. The interception sites are located not just in the U.S. but all around the globe. If you have a sweet tooth, this will be hard to accept: the cookies that are constantly created as we browse effectively form our online identity, which helps the NSA identify a user within the vast amount of collected data. Cookies are created for our own browsing convenience as well as for company use (ads, etc.). See the technical representation of NSA's XKEYSCORE. In summary, the main tool used for bulk data collection is the interception of transatlantic fiber-optic cables.
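To make the cookie point concrete, here is a minimal sketch of how a persistent cookie value can tie together traffic seen from different IP addresses. The record layout, the cookie values, and the `group_by_cookie` helper are all hypothetical and only illustrate the general idea; they do not describe how XKEYSCORE actually works.

```python
from collections import defaultdict

# Hypothetical traffic records: (source IP, cookie value, site visited).
# The IPs come from documentation-only address ranges.
records = [
    ("203.0.113.5", "uid=abc123", "news.example"),
    ("198.51.100.7", "uid=abc123", "mail.example"),
    ("203.0.113.5", "uid=zzz999", "shop.example"),
    ("192.0.2.44", "uid=abc123", "forum.example"),
]

def group_by_cookie(records):
    """Group (IP, site) pairs by cookie value, regardless of source IP."""
    profiles = defaultdict(list)
    for ip, cookie, site in records:
        profiles[cookie].append((ip, site))
    return dict(profiles)

profiles = group_by_cookie(records)
for cookie, visits in profiles.items():
    print(cookie, visits)
```

In this toy data, the same cookie value shows up from three different IP addresses, so the cookie, rather than the address, is what links the visits into one profile.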


Now that we know how the NSA and GCHQ collect bulk data at the most basic level, let's talk about the implications. There are two primary concepts: privacy and security. Governments say that cyber counterintelligence is necessary to monitor illegal activities and prevent terrorism. Former NSA director Keith Brian Alexander stated in 2013 that the NSA's bulk data collection and phone interception programs had helped prevent more than 50 terrorist events. The details of these events have not been published, as doing so might compromise the NSA's strategies for dealing with terrorism.

The data collection also helps address online crime such as piracy and the sale of illegal substances. As the methods of obtaining pirated and non-pirated material, through torrenting or streaming on a Fire TV using external applications, become simpler, companies and governments put more effort into addressing the problem.

Although abuse of the data is possible in theory, such cases are not very frequent. The actual content is stored only for a short period because of its sheer size, and metadata for months, so we could say that our privacy is compromised to that extent. However, it is important to consider that this data goes through machine analysis and, if no suspicious activity is detected, is later discarded. The actual identification of a user does not happen until suspicious patterns in the online ID are detected. So, in reality, we are trading short-term privacy for security, and it is not as if our whole online footprint is accessible at the click of a button because of bulk data collection (at least for now).
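The retention logic described above can be sketched roughly as follows. The specific windows (3 days for content, 90 days for metadata), the item layout, and the `purge` function are assumptions made for illustration; the text only says that content is kept for a short period, metadata for months, and that unflagged data is later discarded.

```python
from datetime import datetime, timedelta

# Assumed retention windows; the article only says "a short period" and "months".
RETENTION = {
    "content": timedelta(days=3),
    "metadata": timedelta(days=90),
}

def purge(items, now, flagged_ids=frozenset()):
    """Keep items still inside their retention window, plus any flagged ones."""
    kept = []
    for item in items:
        age = now - item["captured"]
        if age <= RETENTION[item["kind"]] or item["id"] in flagged_ids:
            kept.append(item)
    return kept

# Hypothetical captured items.
now = datetime(2024, 1, 31)
items = [
    {"id": 1, "kind": "content", "captured": datetime(2024, 1, 1)},
    {"id": 2, "kind": "metadata", "captured": datetime(2024, 1, 1)},
    {"id": 3, "kind": "content", "captured": datetime(2024, 1, 30)},
]

unflagged = [item["id"] for item in purge(items, now)]
with_flag = [item["id"] for item in purge(items, now, flagged_ids={1})]
```

Under these assumed windows, the 30-day-old content item is dropped unless analysis has flagged it, while metadata of the same age is still retained.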

Alternative Big Data Use

In collaboration with Peter Thiel’s Palantir, the U.S. government is using big data to do something that would have been considered impossible. They are using crime data to predict future crime events and catch repeat offenders.

Governmental support for agricultural research and data collection also helps create higher crop yields. In general, governments could and should use big data analysis for every public service where it is cost-efficient: education (understanding what works best for students), public healthcare (analyzing the effects of different medications and diseases), transportation, city planning, and the analysis of different public policies. This can improve the quality of medication and diagnosis, improve public satisfaction with governmental policies, create better-planned cities, and increase the return on investment in education. In reality, the positive effects of big data usage by governments can be tremendous and widespread across multiple industries.


When discussing governments and big data, we often focus on surveillance data, which matters for a country’s security and global cyber stance. It helps address terrorism before it happens and helps track suspects in illegal activities, but this is achieved at the cost of compromising individual privacy to a certain extent. Opinions on whether or not this practice is acceptable can vary, but one thing is clear: there is large potential for governments to use big data to improve the quality of life in a country. Directing more effort toward this goal would be more acceptable, as the data collection methods in these cases are less intrusive. Cases of data abuse by the NSA are not frequent, and one could argue that they do not exist. Still, the question remains: who watches the watchers?
