This is a two part series that explores a powerful feature in MongoDB for managing large files called GridFS. In the first part we discuss use cases appropriate for GridFS, and in part 2 we discuss how GridFS works and how to use it in your apps.
In my position at MongoDB, I speak with many teams that are building applications that manage large files (videos, images, PDFs, etc.) along with supporting information that fits naturally into JSON documents. Content management systems are an obvious example of this pattern, where is it necessary to both manage binary artifacts as well as all the supporting information about those artifacts (e.g., author, creation date, workflow state, version information, classification tags, etc.).
MongoDB manages data as documents, with a maximum size of 16MB. So what happens when your image, video, or other file exceeds 16MB? GridFS is a specification implemented by all of the MongoDB drivers that manages large files and their associated metadata as a group of small files. When you query for a large file, GridFS automatically reassembles the smaller files into the original large file.
Content management is just one of the many uses of GridFS. For example, McAfee optimizes delivery of analytics and incremental updates in MongoDB as binary packages for efficiently delivery to customers. Pearson stores student data in GridFS and leverages MongoDB’s replication to distribute data and keep it synchronized across facilities.
There are also many other systems with this requirement, many of which are in healthcare. Hospitals and other care-providing organizations want to centralize patient record information to provide a single view of the patient making this information more accessible to doctors, care providers, and patients themselves. The central system improves the quality of the patient care (care providers get a more complete and global view of the patient’s health status) as well as the efficiency of providing care.
A typical patient health record repository includes general information about the patient (name, address, insurance provider, etc.) along with all various types of medical records (office visits, blood tests and labs, medical procedures, etc.). Much of this information fits nicely into JSON documents and the flexibility of MongoDB makes it easy to accommodate variability in content and structure. Healthcare applications also manage large binary files such as radiology tests, x-ray and MRI images, CAT scans, or even legacy medical records created by scanning paper documents. It is not uncommon for a large hospital or insurance provider to have 50-100 TB of patient data and as much as 90% is large binary files.
When building an application like this, organizations usually have to decide whether to store the binary data in a separate repository, or along with the metadata. With MongoDB, there are a number of compelling reasons for storing the binary data in the same system as the rest of the information. They include:
- The resulting application will have a simpler architecture: one system for all types of data;
- Document metadata can be expressed using the rich flexible document structure, and documents can be retrieved using all the flexibility of MongoDB’s query language;
- MongoDB’s high availability (replica sets) and scalability (sharding) infrastructure can be leveraged for binary data as well as the structured data;
- One consistent security model for authenticating and authorizing access to the metadata and files;
- GridFS doesn’t have the limitations of some filesystems, like number of documents per directory, or file naming rules.
Fortunately, GridFS makes working with large binary files in MongoDB easy. In part two, I will walk through in detail how you use GridFS to store and manage binary content in MongoDB. In the meantime if you’re looking to learn more, download our architecture guide: