File Systems <> Database: Full Circle

Computer storage began with file-based systems, which evolved into databases; however, advances in data have made file systems relevant again.

BHUSHAN FADNIS

Sep. 03, 25 · Analysis


File-based systems were the original data storage systems, predating database management systems (DBMS). Back in the 1970s, organizations manually stored data across servers in numerous files, such as flat files. These files had a fixed, rigid format, and each department kept its own copies of the data, resulting in redundancy. This led to various challenges, especially around data consistency, sharing, security, and retrieval. Analysis was also difficult when multiple files had to be joined to build one end-to-end record. As a result, file-based systems could not keep up with changing data and new requirements.

With the invention of the DBMS, data transactions could comply with the ACID properties (atomicity, consistency, isolation, durability), which provide data consistency, integrity, recovery, and concurrency control. In addition, today's advanced DBMSs provide disaster recovery, backup and restore, data searching, and data encryption and security. Even though the DBMS has evolved, the rise of big data, cloud technologies, the Internet, social media, and new data formats has made file storage a hot topic again.

All leading cloud services, such as AWS, GCP, and Azure, provide object-level storage through buckets or containers, which support file-based data storage. Choosing the right file layout, with all of the aspects below in mind, lays the foundation for future machine learning and artificial intelligence use cases. Companies should pay attention to their file system architecture, as it will be crucial for future growth.

File-based storage offers the following advantages.

Dynamic Schema

File or object-level storage does not need a fixed schema like tables do; columns can easily be added or removed with each new file or record. Every record can have a different schema, and a table built on top comprises the union of all columns. This gives flexibility to extract, transform, and load (ETL) jobs, as we don't need to alter the table every time the schema changes. Changing production tables is often not easy, and it can require an approved change management process along with an impact analysis of all downstream processes that use the schema.
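As a minimal sketch of the idea above, the snippet below builds a "table" whose columns are the union of the keys found across records with different schemas. The sample records and field names are hypothetical, and real query engines handle schema merging with far more care (types, nesting, conflicts).

```python
import json

# Hypothetical records, as if read from three separate JSON files.
raw_files = [
    '{"account_id": 1, "name": "Ann"}',
    '{"account_id": 2, "name": "Raj", "email": "raj@example.com"}',
    '{"account_id": 3, "country": "IN"}',
]

records = [json.loads(text) for text in raw_files]

# The table schema is the union of all keys, kept in first-seen order.
columns = []
for record in records:
    for key in record:
        if key not in columns:
            columns.append(key)

# Columns that a record does not have are filled with None.
rows = [[record.get(col) for col in columns] for record in records]

print(columns)
for row in rows:
    print(row)
```

No table had to be altered when the second and third records introduced new fields; the extra columns simply appear, with missing values left empty.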

File Formats

Different data formats can be used for file storage based on the data use case. The file formats can be different for different cases, unlike in DBMS, where the same database/schema is used. Below are a few of the common ones.

JSON is a text-based, easy-to-read format that stores data as key-value pairs.
For real-time streaming, formats like Avro are used, which keep the data and its schema together.
For analytical purposes, Parquet is effective: its columnar layout lets queries read only the columns they need, so aggregations run quickly.

Additionally, if daily ETL jobs create too many small files, we can merge them into fewer, larger files to improve query read time.
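The merge step can be sketched as follows, assuming the small files are newline-delimited JSON. The function name `merge_json_lines` and the `part-NNN.jsonl` naming are hypothetical; columnar formats such as Parquet would need a format-aware compaction job instead of plain concatenation.

```python
import json
import tempfile
from pathlib import Path


def merge_json_lines(source_dir: Path, target_file: Path) -> int:
    """Concatenate every *.jsonl file in source_dir into target_file.

    Returns the number of records written.
    """
    count = 0
    with target_file.open("w", encoding="utf-8") as out:
        for part in sorted(source_dir.glob("*.jsonl")):
            with part.open(encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if line:  # skip blank lines
                        out.write(line + "\n")
                        count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "daily"
    source.mkdir()
    # Simulate a job that wrote five tiny files.
    for i in range(5):
        part = source / f"part-{i:03d}.jsonl"
        part.write_text(json.dumps({"id": i}) + "\n", encoding="utf-8")

    merged = Path(tmp) / "merged.jsonl"
    merged_count = merge_json_lines(source, merged)
    merged_text = merged.read_text(encoding="utf-8")
    merged_ids = [json.loads(line)["id"] for line in merged_text.splitlines()]

print(merged_count, merged_ids)
```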

Data Compression

Compression reduces the size of large files, which not only saves space but can also reduce data retrieval time, since fewer bytes are read and transferred. Both pandas and Spark DataFrame writers support compression options, which saves storage space and can shorten job run time. The most common types of compression are as follows:

Bzip2 – Uses the Burrows-Wheeler transform and Huffman coding; achieves a good compression ratio but takes more time and resources.
Gzip – Uses the DEFLATE algorithm; offers a well-balanced speed and compression ratio and is compatible with most systems.
Xz – Uses the LZMA2 algorithm; typically achieves the highest compression ratio of the three but takes the most time.
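To get a feel for the three codecs, the sketch below compresses the same payload with Python's standard `gzip`, `bz2`, and `lzma` modules and prints the resulting sizes. The payload is a made-up, highly repetitive CSV, so the ratios it shows are much better than real data would give; treat it as an illustration, not a benchmark.

```python
import bz2
import gzip
import lzma

# Hypothetical, highly repetitive CSV payload.
payload = b"account_id,balance,currency\n" + b"1001,250.00,USD\n" * 20000

compressed = {
    "gzip": gzip.compress(payload),
    "bzip2": bz2.compress(payload),
    "xz": lzma.compress(payload),  # lzma defaults to the .xz container
}
decoders = {
    "gzip": gzip.decompress,
    "bzip2": bz2.decompress,
    "xz": lzma.decompress,
}

for name, blob in compressed.items():
    # Each codec is lossless: decompressing returns the original bytes.
    restored = decoders[name](blob)
    print(f"{name}: {len(payload)} -> {len(blob)} bytes, lossless={restored == payload}")
```

On real workloads, measure both the size and the time taken on a representative sample before picking a codec.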

Data Encryption

Advanced encryption options, such as server-side or client-side encryption with self-managed keys, are available from all leading cloud providers. This is essential to protect sensitive personally identifiable information (PII) and to build customer trust in data handling practices. The most common techniques are as follows:

AES (Advanced Encryption Standard) is a commonly used symmetric encryption algorithm: the same secret key encrypts and decrypts the data. It supports 128-, 192-, and 256-bit keys and is used to protect data at rest as well as in transit.
Base64 encoding is not encryption; it converts binary data into readable ASCII text so that it can be transferred safely, but anyone can decode it.
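The point about Base64 is easy to verify with the standard library: decoding needs no key at all. The sample value below is fabricated. AES is not shown here because Python's standard library does not include it; a third-party library or the cloud provider's key management service would be needed.

```python
import base64

# Fabricated sample value; not real personal data.
secret = b"account=1001;ssn=000-00-0000"

encoded = base64.b64encode(secret)   # bytes containing only ASCII characters
decoded = base64.b64decode(encoded)  # no key required to reverse it

print(encoded.decode("ascii"))
print(decoded == secret)
```

Because anyone who sees the encoded text can recover the original, Base64 should only be used to make data transport-safe, never to protect it.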

Data Partition

Partitioning divides the data into smaller chunks of storage based on a specific column. It is best to choose the partition column based on the query pattern. For example, if queries filter by account number, a partition folder can be created at the account level, so every account has a dedicated folder. This works well when the number of accounts is in the thousands; in other cases, the date can serve as the partition column. Choosing the right partition makes queries run faster and avoids the overhead of reading and writing many file folders.
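A minimal sketch of account-level partitioning, assuming a `column=value` folder naming style and newline-delimited JSON files. The helper names `write_partitioned` and `read_partition`, the file name `data.jsonl`, and the sample records are all hypothetical.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(records, root: Path, column: str) -> None:
    """Write records into one folder per distinct value of `column`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[column]].append(record)
    for value, group in groups.items():
        folder = root / f"{column}={value}"
        folder.mkdir(parents=True, exist_ok=True)
        with (folder / "data.jsonl").open("w", encoding="utf-8") as f:
            for record in group:
                f.write(json.dumps(record) + "\n")


def read_partition(root: Path, column: str, value: str):
    """Read only the folder for one partition value, skipping all others."""
    path = root / f"{column}={value}" / "data.jsonl"
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]


records = [
    {"account": "A100", "amount": 25},
    {"account": "B200", "amount": 40},
    {"account": "A100", "amount": 15},
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_partitioned(records, root, "account")
    partitions = sorted(p.name for p in root.iterdir())
    account_a = read_partition(root, "account", "A100")

print(partitions)
print(account_a)
```

A query for one account touches a single folder, which is where the speedup comes from; the same layout with millions of distinct values would create the many-small-folders overhead described above.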

Limitations

However, the above object-level file storage does have some challenges.

Data retrieval is slow if files are queried directly rather than loaded into a table.
When many data partitions are created, file read/write time increases.
Updates to the existing files are time-consuming and may cause deadlocks.
Storage can increase significantly over time when the file system and operations grow.

Conclusion

Overall, the advantages of object-level file storage outweigh the disadvantages, making it easier to opt for a file system and then use data load jobs to ingest data into your data warehouse or data lake. It also helps to separate compute from storage and reduces database administration work.


Opinions expressed by DZone contributors are their own.
