How Big Is a Terabyte of Data?
A terabyte is enormous in size. It’s difficult to put this into perspective, so let's try to understand it from two points of view: spatially and based on time.
One mile isn’t that long, and a cubic mile isn’t that big if compared to the size of the whole earth. So, you may be surprised to hear that the entire world’s population could all fit within a cubic mile of space. Hendrik Willem van Loon, a Dutch-American writer, once wrote this in one of his books.
Teradata is a well-known provider of data warehousing products. The brand name was chosen to convey an impressive ability to handle massive amounts of data. That was 20 years ago. Today, users and vendors alike routinely talk about data in terms of terabytes. Scales of dozens or even nearly 100 terabytes, sometimes a petabyte, are now so common that one terabyte seems unremarkable and a few dozen more terabytes may not seem intimidating at all.
Actually, a terabyte, as well as a cubic mile, is enormous in size. It’s difficult to put this into perspective, so let's try to understand it from two points of view.
First, let’s look at it spatially.
Most data analyses and computations are performed on structured data, of which ever-growing transaction data takes up the most space. Each transaction record is small, ranging from dozens of bytes to about 100 bytes. For instance, a banking transaction includes the account, date, and amount, and a telecom call record includes the phone number, time, and duration. Suppose each record occupies 100 bytes, or 0.1 KB. A terabyte of space can then hold 10G rows, or ten billion records.
What does this mean? There are a little more than 30 million seconds in a year, so accumulating 1 TB of data in a year requires generating over 300 records per second around the clock.
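The arithmetic above can be checked with a short Python sketch. It assumes a decimal terabyte (10^12 bytes), a 365-day year, and the 100-byte record size used above; the function names are illustrative only.

```python
# Back-of-the-envelope check of the record-rate estimate.
# Assumptions: decimal terabyte, 365-day year, 100-byte records.
TB_BYTES = 10**12
RECORD_BYTES = 100
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000


def records_per_tb(record_bytes=RECORD_BYTES):
    """Number of fixed-size records that fit in one terabyte."""
    return TB_BYTES // record_bytes


def records_per_second_for_tb_per_year(record_bytes=RECORD_BYTES):
    """Sustained record rate needed to accumulate 1 TB in one year."""
    return records_per_tb(record_bytes) / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(records_per_tb())                             # 10000000000
    print(round(records_per_second_for_tb_per_year()))  # 317
```

Passing a smaller record size, such as 50 bytes, doubles both figures, so the estimate is sensitive to how compact the records really are.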
This isn’t a ridiculously gigantic scale. In a large country like the U.S., businesses such as national telecom operators, national banks, and internet giants can easily reach it. For citywide and even some statewide institutions, however, it is difficult to accumulate 1 TB of data. It is unlikely that the tax information collected by a local tax bureau, the purchase data of a local chain store, or the transaction data of a city commercial bank grows by hundreds of records every second. Besides, many organizations generate data only during the day or on weekdays. To have dozens or even 100 terabytes of data, the volume of business would have to be one or two orders of magnitude larger.
A terabyte of data may be too abstract to make sense of, but translating it into a volume of business gives a clearer idea. The amount of data is closely tied to the technologies a big data analytics and computing product adopts, so an organization needs a realistic assessment of its data volume in order to build a big data platform well.
One terabyte of space becomes small if it is filled with unstructured data such as audio and video, or if it is used to back up original data. But generally, we only perform storage management or searching on those kinds of data. As there is no need for direct analysis and computation, a big data platform is unnecessary. A network file system is sufficient for those operations, which reduces cost considerably.
The second way we'll look at it is based on time.
How long does it take to process 1 TB of data? Some vendors claim that their products can process it within a few seconds. That’s what users expect. But is it possible?
Reading data from an HDD under an operating system runs at about 150 MB per second (the figures that hard disk manufacturers quote aren’t fully achievable in practice). An SSD is faster, at roughly double that speed, or 300 MB per second. Even at SSD speed, reading 1 TB of data takes over 3,000 seconds, which is nearly an hour, without performing any other operations; at HDD speed it takes nearly two hours. How, then, can 1 TB of data be processed in seconds? Simply by adding more disks. With 1,000 disks read in parallel at 300 MB per second each, 1 TB of data can be retrieved in about three seconds.
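The read-time estimate can be sketched in Python as follows. This is an idealized model that assumes purely sequential reads, perfectly even striping across disks, and a decimal terabyte (1,000,000 MB); the throughput figures are the ones quoted above, and the function name is illustrative.

```python
# Idealized sequential-scan time: no seeks, no network, no contention.
# Assumption: decimal units, so 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def scan_seconds(data_tb, mb_per_second_per_disk, disks=1):
    """Seconds to read data_tb terabytes striped evenly across disks."""
    total_throughput = mb_per_second_per_disk * disks
    return data_tb * MB_PER_TB / total_throughput


if __name__ == "__main__":
    print(f"1 HDD at 150 MB/s:     {scan_seconds(1, 150):8.0f} s")
    print(f"1 SSD at 300 MB/s:     {scan_seconds(1, 300):8.0f} s")
    print(f"1,000 disks, 300 MB/s: {scan_seconds(1, 300, disks=1000):8.1f} s")
```

Under these assumptions a single disk needs roughly one to two hours, and only about a thousand disks working in parallel bring the scan down to a few seconds.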
That is an ideal estimate. In reality, data usually isn’t stored neatly (performance becomes terrible when discontinuous data is read from a hard disk). In a cluster, which is unavoidable because 1,000 hard disks obviously cannot be installed in one machine, there is also network latency. Some computations, such as grouping with large result sets and sorting, may need to write intermediate results back to disk. And a system expected to respond within a few seconds usually has to serve concurrent requests as well. Considering all these factors, it is not surprising that data retrieval can become several times slower.
Now we realize that a terabyte of data means several hours of data retrieval, or 1,000 hard disks. You can imagine what dozens or even 100 terabytes of data will bring.
You may think that since the hard disk is too slow, we should use memory instead.
Indeed, memory is much faster than the hard disk and is well suited to parallel processing. But a machine with a large memory is also expensive (the cost doesn’t increase linearly). To make matters worse, memory utilization is usually low. For commonly used Java-based computing platforms, the JVM’s memory usage ratio is only about 20% if no data compression technique is employed. This means that 5 TB of memory is required to load 1 TB of data from the hard disk. That is very expensive, because a lot of machines are needed.
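A minimal Python sketch of this memory estimate follows. The 20% usage ratio is the figure given above; the 256 GB of RAM per machine is a purely hypothetical value chosen for illustration, and the function names are not from any real platform.

```python
import math

# Assumptions: the 20% usage ratio comes from the text above;
# 256 GB per machine is a hypothetical figure; 1 TB = 1,000 GB.
GB_PER_TB = 1000


def memory_needed_tb(data_tb, usage_ratio=0.20):
    """Memory required to hold data_tb of raw data at a given usage ratio."""
    return data_tb / usage_ratio


def machines_needed(data_tb, ram_per_machine_gb, usage_ratio=0.20):
    """Whole machines needed to provide that much memory."""
    needed_gb = memory_needed_tb(data_tb, usage_ratio) * GB_PER_TB
    return math.ceil(needed_gb / ram_per_machine_gb)


if __name__ == "__main__":
    print(memory_needed_tb(1))       # 5.0
    print(machines_needed(1, 256))   # 20
```

The same calculation shows why compression matters: any technique that raises the effective usage ratio reduces the machine count proportionally.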
With some knowledge of what 1 TB of data looks like, we can quickly form a good idea of the type of transaction, the number of nodes, and the deployment cost whenever we encounter multi-terabyte data, and we can make informed decisions when planning a computing platform or choosing a product. Even today, the word "teradata" still carries a lively meaning.
Published at DZone with permission of Buxing Jiang.