What You Possibly Don’t Know About Column Storage
The benefits of the columnar storage technique can't be denied. However, it also has undeniable weaknesses. This article explores those weaknesses.
The columnar storage technique has proven effective in many computing scenarios and is widely adopted by data warehousing products. In the industry, it is often treated as a synonym for high performance.
Yet the strategy has its own weaknesses. A Google search shows that criticism of it centers mainly on data modification. There is little discussion of its weaknesses in read-only data analysis and computing, which is what this article examines.
Columnar Storage Has Little Significance for In-Memory Computing
The idea behind columnar storage is simple. When data is stored row-wise on disk, a query retrieves whole records, so a query involving only a few columns reads a lot of irrelevant data, and disk response time suffers as the head jumps between tracks. Columnar storage retrieves only the useful columns, significantly reducing the amount of data accessed on most occasions. For big data computing in particular, disk scanning is time-consuming, and reducing the amount of data accessed greatly speeds up the overall process. Moreover, a single column often contains many duplicate values, so columnar data compresses to a smaller size, further enhancing performance.
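The two effects above can be sketched with toy arithmetic. This is an illustration only, not any real engine's storage format; the row/column counts, the fixed 8-byte value size, and the run-length compression of a column are all assumptions made for the example.

```python
from itertools import groupby

# Toy illustration: a query touching 2 of 10 columns reads ~20% of the
# bytes under columnar storage (assuming fixed-width 8-byte values).
NUM_ROWS, NUM_COLS = 1_000_000, 10
BYTES_PER_VALUE = 8

# Row-wise: every scan drags along all columns of each record.
row_store_bytes = NUM_ROWS * NUM_COLS * BYTES_PER_VALUE

# Column-wise: only the columns the query references are read.
cols_in_query = 2
column_store_bytes = NUM_ROWS * cols_in_query * BYTES_PER_VALUE
print(row_store_bytes // column_store_bytes)  # row store reads 5x the data

# Duplicates within a column also compress well, e.g. run-length encoding:
col = ["US"] * 4 + ["CN"] * 3
rle = [(v, len(list(g))) for v, g in groupby(col)]
print(rle)  # [('US', 4), ('CN', 3)] -- 7 values stored as 2 runs
```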
According to this working principle, columnar storage increases performance by reducing disk access, but it cannot reduce the number of computations. If data is already in memory, the technique is unnecessary. Structured data processing is row-based, and storing in-memory data column-wise complicates the construction of records and hinders performance. Except for specialized vector computing (which is column-based and commonly used in data mining), columnar storage is not the right choice for relational-style in-memory computing (including in-memory databases).
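The record-construction overhead can be made concrete with a minimal sketch. The layouts and field names below are hypothetical, chosen only to show that with columns in memory, producing one record requires one indexed access per column, whereas a row layout already holds the record as a unit.

```python
# Hypothetical in-memory column layout: one Python list per column.
names = ["Ann", "Bob", "Cal"]
ages = [34, 29, 41]

def record(i):
    # Reconstructing record i touches every column array separately.
    return {"name": names[i], "age": ages[i]}

# Hypothetical row layout: the record already exists as one object,
# so fetching it is a single access with no assembly work.
rows = [("Ann", 34), ("Bob", 29), ("Cal", 41)]

assert record(1) == {"name": "Bob", "age": 29}
assert rows[1] == ("Bob", 29)
```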
Columnar Storage Exacerbates Discontinuous Disk Access
Columnar storage stores data column by column, so accessing multiple columns simultaneously means random, discontinuous disk access; the more columns accessed, the more random it becomes. On an HDD, random access seriously hurts performance because of time-consuming head seeks. When many columns are accessed, or the table has few columns in total, columnar storage can even perform worse than row-based storage, which reads the columns of each record continuously. Concurrent tasks (and parallel processing) exacerbate the problem further. Row-based storage also suffers head seeks under concurrency, but to a much smaller degree: one concurrent task generates only one concurrent access request, whereas with columnar storage one task generates as many concurrent requests as there are columns involved.
One solution is to enlarge the buffer for retrieved data to reduce the proportion of time spent seeking. But allocating a buffer per column consumes a large amount of memory when many columns are involved. The other solution is to add more hard disks and spread columns across them. But since columnar storage is generally applied to tables with a large number of columns, the number of columns usually far exceeds the number of disks a machine can hold, and disk access conflicts still occur.
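Back-of-envelope arithmetic shows how quickly per-column buffers add up. The buffer size, column count, and concurrency level below are illustrative assumptions, not measurements from any product.

```python
# Illustrative cost of per-column read buffers under concurrency.
buffer_mb = 8            # assumed buffer per column, to amortize seeks
cols_touched = 50        # columns a query reads
concurrent_tasks = 20    # queries running at once

# Each concurrent task needs its own buffer per touched column.
total_mb = buffer_mb * cols_touched * concurrent_tasks
print(total_mb)  # 8000 MB of memory spent on read buffers alone
```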
Columnar storage is friendlier to SSDs, however, since they do not have the seek-time problem.
Columnar Storage Causes Inefficient Indexing
The commonly used indexing technique locates desired records in a large dataset by key values. The nature of indexing is sorting: an index records both the ordered key values and the addresses of their corresponding records in the original table. With row-based storage, the position of a record can be represented by a single number. But with columnar storage, each column of a record has its own position and, in principle, each should be recorded, making the index table almost as large as the original table. Access becomes more complicated and space consumption is high; it is no better than copying the original table and sorting the copy.
You might suggest storing only the address of one column per record and calculating the addresses of the rest. But field values of certain data types, like strings, have variable sizes, and values of other types, like integers and dates, which normally have fixed sizes, may become variable because of the compression techniques often used with columnar storage. If all field values are stored at fixed sizes, the index becomes simple and access speeds up, but the data volume grows, making traversal less cost-effective, and traversal is the very operation columnar storage is mainly used for in the first place.
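The address calculation that fixed-width storage enables can be sketched as follows. The row count and per-column widths are hypothetical; the point is that an offset is pure arithmetic, so a record index needs only one number per record, and this breaks down as soon as any column's values have variable size.

```python
# Hypothetical fixed-width columnar file: each column is a contiguous
# array of NUM_ROWS values of a known byte width.
NUM_ROWS = 1_000_000
widths = [8, 4, 16, 8]  # assumed bytes per value for columns 0..3

# Start offset of each column's array within the file.
base, off = [], 0
for w in widths:
    base.append(off)
    off += NUM_ROWS * w

def value_offset(i, k):
    # Offset of record i's value in column k -- no stored address needed.
    return base[k] + i * widths[k]

assert value_offset(0, 1) == 8_000_000        # column 1 starts after column 0
assert value_offset(10, 2) == 12_000_000 + 160
```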
A commonly used real-life method is to divide the original table into a number of data blocks; within each block, data is stored column-wise. The index is created on blocks so the block holding the target data can be located quickly, after which only the data inside that block needs to be scanned.
This block-based index performs worse than a row-based index because of the extra in-block scanning. If the original table is ordered by the index key (usually the primary key), it is easy to locate the blocks (typically just one) holding the target data, with bearable performance loss. This index type suits scenarios where records are located by unique key values. If the table is unordered by the index key, the block-based index is useless: the target data may fall in nearly every block, yielding performance similar to a full table scan.
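A minimal sketch of such a block-level index, assuming a zone-map-style structure that keeps the key's min/max per block (the block contents and ranges are made up for the example). When the data is ordered by key, the ranges are disjoint and lookup prunes to one block; if it were unordered, the ranges would overlap and nearly every block would need scanning.

```python
# Each entry: (min_key, max_key, rows_in_block). Data is ordered by key,
# so the min/max ranges of the blocks do not overlap.
blocks = [
    (1, 100, list(range(1, 101))),
    (101, 200, list(range(101, 201))),
    (201, 300, list(range(201, 301))),
]

def lookup(key):
    for lo, hi, rows in blocks:
        if lo <= key <= hi:       # the index locates one candidate block...
            return key in rows    # ...then an in-block scan finishes the job
    return False

assert lookup(150) is True
assert lookup(999) is False
```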
Columnar Storage Complicates Segment-Based Parallel Processing
Multithreaded parallel processing is a must for making full use of multi-core CPUs, and segmenting the data is a prerequisite for parallelism.
There are two basic requirements for data segmentation:
Almost equal data amount in each segment (making balanced task allocation among threads).
Dynamic segmentation (because the number of threads can’t be predicted).
Segmenting data is easier with row-based storage. We can place a marker at the end of each record (or every N records) and divide the data into K nearly equal segments by byte count, snapping each division point to the nearest marker; the ending marker of one segment is the starting point of the next. This is unfeasible with columnar storage because field values may be of variable size: the division points within different columns may not fall on the same record, producing mismatched data.
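The row-store scheme above can be sketched in a few lines. The record format and delimiter (`'\n'`) are assumptions for the example; the key property is that each split point is chosen by byte count and then snapped forward to the next record boundary, something that cannot be done independently per column file without the splits landing on different records.

```python
# A toy row-format buffer: 1000 fixed-text records terminated by '\n'.
data = b"\n".join(b"rec%03d" % i for i in range(1000)) + b"\n"

def segments(buf, k):
    """Split buf into k byte ranges, each ending on a record boundary."""
    bounds, start = [], 0
    for s in range(1, k):
        # Target an even byte split, then snap forward to the record end.
        cut = buf.index(b"\n", len(buf) * s // k) + 1
        bounds.append((start, cut))
        start = cut
    bounds.append((start, len(buf)))
    return bounds

segs = segments(data, 4)
assert segs[0][0] == 0 and segs[-1][1] == len(data)
# The segments tile the buffer with no gaps or overlaps.
assert all(a[1] == b[0] for a, b in zip(segs, segs[1:]))
```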
The block-based segmentation strategy is a common solution to the columnar segmentation problem: the unit of segmentation is a block, and data within a block is never further divided for parallel processing. Here lies a dilemma. On one hand, the number of blocks must be large enough for dynamic segmentation (you cannot get ten segments from only five blocks). Since contemporary computers generally have many CPU cores, nearly 100 blocks are needed to achieve a flexible, balanced segmentation. On the other hand, too many blocks means the columnar data is physically split into many discontinuous pieces, complicating the traversal code and causing useless data to be retrieved between blocks.
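The "nearly 100 blocks" figure can be motivated with simple worst-case arithmetic. The thread counts below are assumptions; the function just computes the skew between the busiest and lightest thread when whole blocks are dealt out to threads.

```python
import math

def max_imbalance(blocks, threads):
    """Worst-case load ratio (busiest / lightest thread) when whole
    blocks are distributed across threads."""
    if blocks < threads:
        return math.inf  # some threads get no work at all
    return math.ceil(blocks / threads) / (blocks // threads)

assert max_imbalance(5, 10) == math.inf  # 10 segments from 5 blocks: impossible
assert max_imbalance(24, 16) == 2.0      # some threads do twice the work
assert max_imbalance(100, 16) == 7 / 6   # ~17% worst-case skew with ~100 blocks
```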
For an HDD, there is also the seek-time problem, which worsens as the data is divided into more blocks. Only when the space occupied by a block's columnar data is much larger than the read buffer does the proportion of time spent on useless reads and seeks become relatively small. That requires a sufficiently large number of records per block; in other words, the total data volume is crucial for processing columnar data in parallel. With an HDD (including a disk array containing a large group of HDDs), a table on a single machine normally needs at least one billion records, with the data volume exceeding 100 GB; otherwise, parallel processing brings no noticeable speedup. This is exactly the problem faced by multidimensional analysis, the scenario for which columnar storage is particularly suitable. Besides, the size of each block is pre-determined, and neighboring blocks cannot be physically merged as data is continuously appended. The number of blocks therefore keeps growing, creating management challenges that demand a special scalable space for storing the block index.
These problems do not negate the great advantages of columnar storage in read-only computing scenarios, but indiscriminate use of the technique should be avoided. For data warehousing products, it is appropriate to let the system administrator or user decide whether columnar storage should be used, how columnar data should be divided into blocks, which part of the data should be stored column-wise, and which part should be kept in both row-based and columnar formats to achieve higher performance through data redundancy.
Published at DZone with permission of Buxing Jiang . See the original article here.