This post via Esen Sagynov is by Jun Sung Won.
OwFS, HDFS, NFS... What kind of distributed file systems do we need to provide a more competitive internet service?
In this article I will explain the features of selected distributed file systems which I have experience with working at NHN, and suggest which one is suitable to what case. I will also discuss about recent open source distributed file systems as well asl Google's distributed file system which are not yet used by NHN. This article will also cover the development trends of distributed file systems in general.
What Kind of Distributed File Systems do We Need to Use?
Currently NHN uses various distributed file systems such as NFS, CIFS, HDFS, and OwFS. A distributed file system is a file system that allows access to files via a network. The advantage of a distributed file system is that you can share data among multiple hosts. Before 2008, NHN used only NFS and CIFS by using NAS, but since 2008, it has developed and begun to use its own OwFS, and the usage of OwFs has since increased. As is widely known, HDFS is an open source file system that was influenced by Google's GFS.
For NFS or CIFS, Network Attached Storage (NAS), which is an expensive storage system, is typically used. Whenever you increase its capacity, therefore, there is a high infrastructure cost. On the other hand, as OwFS and HDFS allow you to use relatively economical hardware (commodity server), you can establish high-capacity storage at much lower cost.
However, OwFS or HDFS are not better than NFS or CIFS, which use NAS in all cases. For some purposes, you should use NAS. The same is true of OwFS and HDFS. As they are built for different purposes, you need to select one of these file systems according to the purpose of the internet server you are implementing.
NFS and CIFS
Generally, Network File System (NFS) is used by Linux/Unix, while CIFS (Common Internet File System) is used by MS Windows. NFS is a distributed file system developed in 1984 by Sun Microsystems. It allows you to share and use files from other hosts via a network. NFS has many versions, and it began to be widely used from version NFSv2. Currently, the most-used version is NFSv3. In 2003, NFSv4 was released, and offered better performance and security. In 2010, NFSv4.1, which supports cluster-based expansion, was released. NFS is a distributed file system with a long history, and it continues to evolve.
CIFS is a distributed file system Microsoft applied to MS Windows, which was developed based on Server Message Block (SMB) but with enhanced functionalities, including security.
NFS and CIFS comply with the POSIX standard. Thus, applications by using NFS or CIFS can use a distributed file system like a local file system. This means, when you implement or run an application, you don't need to prepare a local file system and distributed file system separately.
When NFS or CIFS is used, Network Attached Storage (NAS) is used for better performance and availability. NAS has its unique file system, and processes remote requests for access to files by having NAS gateway support NFS or CIFS protocols.
Generally, NAS has the structure shown in Figure 1:
Figure 1: A General Structure of NAS.
The features of NFS or CIFS are as follows:
- Provides the same functionality as a local file system.
- As they mostly use NAS, there is a high purchase cost.
- The scalability of NAS is very low compared to OwFS and HDFS.
OwFS (Owner-based File System)
OwFS is a distributed file system developed by NHN. Currently, it is the distributed file system that NHN uses the most. OwFS features the concept of a container called owner. An owner is a basic unit of metadata managed in OwFS. OwFS manages the replicas of data on the basis of this owner. Each owner has an independent file storage space, in which files and directories can be stored. OwFS forms a single huge file system by gathering these owners together. To access a file, users need to get the information of the owner first. The overall structure of OwFS is as follows:
Figure 2: The OwFS Structure.
The Meta-Data Server (MDS) of OwFS manages the status of data servers (DS). More specifically, MDS manages the capacity of each DS, and when a failure occurs, it performs restoration work by using the replicas of owners.
Compared to HDFS, the advantage of OwFS is that it can process a large number of files efficiently (mainly when the size of a file is within dozens of MB). This is because OwFS has been designed to not increase the burden of MDS even when the number of files increases. DS manages the information on the files and directories stored inowners (i.e., metadata on files and directories), while MDS has only the information of the owner and the location of the owner's replica. For this reason, even when the number of files increases, the metadata, which MDS should process, does not increase, and consequently the burden of MDS does not increase much.
The following shows the features of OwFS:
- Owner, a kind of container, is a single file system, and owners make up an entire file system.
- Owner information (file data and metadata) is stored in data server (DS).
- Multiple owners can be saved in a single DS, and owners are distributed and stored (replicated) in different DSs.
- The information on the location of owner including the location of its replica is stored in Metadata Data Server (MDS).
- It is suitable for processing a large number of files whose size is within dozens of MB.
- As all of its components are duplicated/triplicated, it works stably even when a failure occurs.
- A Map-Reduce Framework for OwFS also exists.
- Apache module to displays files from OwFS.
- OwFs_Fuse module to mount OwFS like NAS.
OwFS provides its own unique interface (API). However, as it also provides the OwFs_Fuse module, you can mount OwFS like NAS and use it as a local disk. In addition, it also provides Apache module so that Apache web server can access the files stored in OwFS.
Hadoop Distributed File System (HDFS)
Google developed Google File System (GFS), its unique distribution system, which stores information about webpages crawled by Google. Google published a paper on the GFS in 2003. HDFS (Hadoop) is an open source system developed using GFS as a model.
For this reason, HDFS has the same characteristics as GFS. HDFS separates a large file into chunks, and stores three of them into each datanode. In other words, one file is stored in multiple distributed data nodes. This also means that one file has three replicas. The typical size of a chunk is 64 MB.
The metadata about which data node stores the chunk is stored in the namenode. This allows you to read data from distributed files and perform operations by using MapReduce.
Figure 2 below shows the configuration of HDFS:
Figure 2: HDFS Structure (http://hadoop.apache.org/docs/hdfs/current/hdfs_design.html).
The namenode of HDFS manages the name space and metadata of all files and the information on file chunks. Chunks are stored in data nodes and these data nodes process file operation requests from the clients.
As explained above, in HDFS, large files can be distributed and stored effectively. Moreover, you can also perform distributed processing of operations by using the MapReduce framework based on the chunk location information.
Compared to OwFS, the weakness of HDFS is that it is not suitable for processing a large number of files. This is because a bottleneck can occur at the namenode. If the number of files increases, OOM (Out of Memory) occurs at the service daemon of the namenode, and consequently the daemon process is terminated.
The features of HDFS are as follows:
- A large file is divided into chunks and distributed and stored into multiple data nodes.
- The size of a chunk is usually 64 MB, each chunk has three replicas, and chunks are stored in different data nodes.
- The information on these chunks is stored in the namenode.
- It is advantageous for storing large files, though if the number of files is large, the burden of the namenode increases.
- The namenode is a SPOF, and if a failure occurs at the namenode, HDFS will stop and must be restored manually.
As HDFS is written in Java, its interface (API) is also a Java API. However, you can also use C API by using JNI. The Hadoop community does not provide a mount for using FUSE officially. But a third party provides the functionality of FUSE mount for HDFS.
What you should know when using HDFS is the measure against a failure in the namenode. When a failure occurs at the namenode, you can't use HDFS itself, and thus you need to take the time of failure occurrence into account. In addition, as NHN does not have an exclusive HDFS team, the relevant service department has to run and operate their own HDFS.
Table 1: OwFS vs. HDFS.
Redundant Metadata Server
Not supported yet.
Metadata to be Managed
Owner allocation information.
File name space, file information (stat. info.), chunk information and chunk allocation information.
File Storage Method
Without any change.
Divided into chunks.
Suitable for storing multiple files.
Suitable for storing a large file.
Choosing the Right Distributed Storage Type For Each Case
The size of files is not big (usually within dozens of MB) and the number of files is likely to increase significantly. A space of approximately 10-100 TB is expected to be required. Once files are stored, they will hardly be modified.
⇒ OwFS is suitable. As the size of files is small and the number of files is likely to increase, OwFS is more suitable than HDFS.
As the content of files is not changed and the total size required is dozens of TB, it is more advantageous to use OwFS than NAS in terms of costs.
You need to store log files created from the web server and analyze their content periodically. The size of a log file is 1 TB on average, approximately 10 log files are created, and they are maintained for a month.
⇒ HDFS is suitable. This is because the size of a file is big but the number of files is small. You had better use HDFS when analyzing such large files. Once a file is stored in HDFS, though, you cannot change the stored file. For this reason, you need to store it in HDFS after a log file is completely created.
You need to store log files created from the web server and analyze their content periodically. One log file per day is created for each server, and the file size is 100 KB on average. Currently the number of servers that should collect logs is around 10,000, and the file retention period is 100 days.
⇒ You had better use OwFS. This is because the number of files will increase due to the maintenance period of 100 days and the size of log files is not that big.
The analytics work can be conducted through the MapReduce framework. If you use MFO (Map-Reduce Framework for OwFS), you can also use MapReduce in OwFS.
Each user has multiple files, you want to establish an index of file information so that users can search their files quickly, and you need storage to store the index file. When a file is added, deleted or changed, this index file should be updated.
If the number of files is small, the size of the index is small. However, if the number of files is large, the index file could be more than hundreds of MB. The total number of users is estimated to be 100,000.
⇒ NAS is suitable. In general, an index file should be updated frequently, but in OwFS or HDFS, stored files cannot be modified.
With OwFS, if you use OwFS_FUSE in full mode, you can change files. But this method reads the entire file, and changes and re-writes it. This is true even if you only need to change 1 byte in the entire file.
In addition, if the entire size is dozens of TB, using NAS will not result in high cost, either.
There are over 1,000,000 files of approximately 1 KB. The total size required is approximately 3 GB. It is difficult to change the existing directory structure.
⇒ NAS is suitable. When you use OwFS_FUSE, you can use a directory structure as flexible as a local file system. If there are many small files, however, it often provides performance lower than that of NAS. HDFS can also be mounted using FUSE, but it is not suitable in an environment in which a large number of files are used. In addition, as the total size required is very small, using NAS will not cause high costs.
File size ranges from a few MB to a few GB, and the total size required is approximately 500 TB. The directory structure should not be changed, and the content of files may be changed from time to time.
⇒ OwFS is suitable. If a large size of around 500 TB is required, NAS is not appropriate due to cost. If the size is 500 TB, that means the number of files required is also large. Therefore, OwFS is more suitable than HDFS. Although it is possible to use the existing directory structure without any change by using OwFS-FUSE, it is better to change it to a structure that is more suitable to the owner. When you change the content of a file, you can prevent the entire system load from increasing by reading and updating the file on the application server and overwriting it from the application server to OwFS.
Table 2: Criteria for Selection of OwFS, HDFS, NFS/CIFS.
Suitable when there are many small files of dozens of MB. Mid (< dozens of GB*)
Large (10 GB* >>) Suitable when there are many large files of over 10 GB
Small and mid-sized files
Number of files
Noteworthy Distributed File Systems
There are some other noteworthy distributed file systems that are not currently used by NHN. These are GFS2, Swift, Ceph and pNFS. Nelow I will briefly introduce them and explain their functionalities.
Google's GFS is to distributed file systems what the Beatles were to the music industry, in that many distributed file systems, including HDFS, were inspired by GFS.
However, GFS also has a huge structural weakness. It is vulnerable to namenode failure. Unlike HDFS, GFS has a slave namenode. This is why GFS is less susceptible to failures than HDFS. Despite its slave namenode, however, when a failure occurs at the master namenode, the transfer time is not short.
If the number of files increases, the amount of metadata also increases, and consequently the processing speed is deteriorated, and the total number of files available is also limited due to the limit of the memory size of the master server.
Usually the size of a chunk is 64 MB, and GFS is too inefficient to store data smaller than this size. Of course, you can reduce the size of a chunk, but if you reduce the size, the amount of metadata will increase. For this reason, even when there are many files smaller than 64 MB, it is still difficult to reduce the size of a chunk.
However, GFS2 overcomes this weakness of GFS. GFS2 uses a much more advanced metadata management method than GFS. The namenode of GFS2 has a distributed structure rather than a single master. In addition, it stores metadata in a correctable database, such as BigTable. Through this, GFS2 addresses the limit of the number of files and the vulnerability to a namenode failure.
As you can easily increase the amount of metadata to be processed, you can reduce the size of a chunk to 1 MB. The structure of GFS2 is expected to have a huge influence on approaches to improving the structure of most other distributed file systems.
Swift is an object storage system used in OpenStack, which is used by Rackspace Cloud and others. Swift uses a structure in which there is no separate master server, as Amazon S3 does. It uses a 3-level object structure (Account, Container and Object) to manage files. The Account object is a kind of account used to manage containers. The Container object is an object used to manage the Object object like a directory. It is like a bucket in Amazon S3. The Object is an object corresponding to a file. To access this Object, you should access the Account object and Container object, in that order. Swift provides REST API, and has a proxy server to provide the REST API. It uses a static table with the predefined location to which an object has been allocated, and this is called Ring. All servers in Swift share the Ring information and find the location of a desired object.
As the use of OpenStack has been growing rapidly with the participation of more and more large companies, Swift has recently been getting more attention. In Korea, KT is participating in OpenStack, and provides its KT uCloud server by using Swift.
Ceph is a distributed file system with a unique metadata management method. Like other distributed file systems, it also manages the namespace and metadata of the entire file system by using the metadata server. But it features the operation of metadata servers in clusters and the dynamic adjustment of the namespace area by metadata according to the degree of load. This allows you to easily respond when load is concentrated on some parts, and easily expand metadata servers. Moreover, unlike other distributed file systems, it is compatible with POSIX. This means that you can access a file stored in the distributed file system as in the local file system.
It also supports REST API, and is compatible with the REST API of Swift or Amazon S3.
The most noticeable thing about Ceph is that it is included in Linux Kernel source. The released version is that high, but as Linux develops, Ceph may eventually become the main file system of Linux. It is also attractive as it is compatible with POSIX and supports kernel mount.
Parallel Network File System (pNFS)
As mentioned above, NFS has multiple versions. To resolve the scalability issue of the versions up to NFSv4, NFSv4.1 has introduced pNFS. This version enables you to process the content of a file and its metadata separately, and to store a single file in multiple distributed places. If the client brings the metadata of a certain file and learns the location of the file, it will be connected to servers that contain the content of the file when it accesses the same file later. At this time, the client can read or write the content of the file in multiple servers in parallel. You can also easily expand the metadata server, which manages metadata, to prevent the occurrence of a bottleneck phenomenon.
pNFS is an advanced version of NFS, and reflects the recent trends of distributed file systems. Therefore, pNFS has the advantages of NFS, as well as the advantages of the latest distributed file systems. Currently, there are some, if not many, products that support pNFS being released.
One of the reasons we should pay attention to pNFS is that currently NFS is not managed by Oracle (Sun) but by the Internet Engineering Task Force (IETF). NFS is used as a standard in many Linux/Unix environments, and thus pNFS may be popularized if many vendors release products that support pNFS.
So far, you have learned that many distributed file systems have their own unique characteristics, and that you need to apply a suitable distributed file system depending on your business needs. This article also introduced some distributed file systems to which people have begun to pay attention, and explained new trends and things you can refer to.