A Taste of GridFS, MongoDB's Filesystems
In some previous posts on MongoDB, Python, and pymongo, I introduced the NoSQL database MongoDB and how you can use it from Python. This post goes beyond the basics of MongoDB and pymongo to give you a taste of MongoDB's take on filesystems, GridFS.
Why a filesystem?
If you've been doing MongoDB for a while, you may have heard about the 16 MB document size limit. When I started using MongoDB (around version 0.8), the limit was actually 4 MB. What this means is that everything is working just fine, your system is screaming fast, until you try to create a document that's 4.001 MB and boom, nothing works any more. For us at SourceForge, that meant we had to restructure our schema and use less embedding.
But what if it's not something that can be restructured? Maybe your site allows users to upload large attachments of unknown size. In such cases you can probably get away with using a Binary field type and crossing your fingers, but a better solution, in my opinion, is to store the contents of your upload in a series of documents (let's call them "chunks") of limited size. Then you can tie them all together with another document that specifies all the file metadata.
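The chunking idea can be sketched in a few lines of plain Python, with no database involved. The helper name split_into_chunks, the field names, and the in-memory dictionaries are assumptions made for illustration; they mirror the concept only, not any driver's actual implementation.

```python
import uuid


def split_into_chunks(data, chunk_size=256 * 1024):
    """Split bytes into size-limited chunk documents plus one metadata document.

    Hypothetical helper for illustration; not part of any MongoDB driver.
    """
    file_id = uuid.uuid4().hex  # stand-in for a database-generated id
    chunks = [
        {"file_id": file_id, "index": i, "data": data[start:start + chunk_size]}
        for i, start in enumerate(range(0, len(data), chunk_size))
    ]
    # The metadata document ties the chunks together.
    metadata = {"_id": file_id, "length": len(data), "chunk_size": chunk_size}
    return metadata, chunks


metadata, chunks = split_into_chunks(b"x" * 25, chunk_size=10)
print(metadata["length"], [len(c["data"]) for c in chunks])  # 25 [10, 10, 5]
```

No single chunk ever exceeds chunk_size, so each one stays safely under the document size limit regardless of how large the upload is.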
GridFS to the rescue
Well, that's exactly what GridFS does, but it does it with a nice API with a few more bells and whistles than you'd probably build on your own. It's important to note that GridFS, implemented in all the MongoDB language drivers, is a convention and an API, not something that's provided natively by the server. As far as the server is concerned, it's all just collections and documents.
The GridFS schema
GridFS actually stores your files in two collections, named by default fs.files and fs.chunks, although you can change the fs to something else if you'd like. The fs.files collection is where reading or writing a file begins. A typical fs.files document looks like the following (credit):
{
  // unique ID for this file
  "_id" : <unspecified>,
  // size of the file in bytes
  "length" : data_number,
  // size of each of the chunks. Default is 256k
  "chunkSize" : data_number,
  // date when object first stored
  "uploadDate" : data_date,
  // result of running the "filemd5" command on this file's chunks
  "md5" : data_string
}
The fs.chunks collection contains all the data for your files:
{
  // object id of the chunk in the _chunks collection
  "_id" : <unspecified>,
  // _id of the corresponding files collection entry
  "files_id" : <unspecified>,
  // chunks are numbered in order, starting with 0
  "n" : chunk_number,
  // the chunk's payload as a BSON binary type
  "data" : data_binary
}
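To make the relationship between the two document shapes concrete, here is a minimal sketch of the read path using plain dictionaries in place of real collections. The function read_file is a hypothetical helper, not a driver API; a real driver would query fs.chunks by files_id and sort by n on the server.

```python
def read_file(files_doc, chunk_docs):
    """Reassemble a file's bytes from fs.files- and fs.chunks-style documents.

    Hypothetical in-memory helper; a real driver queries MongoDB instead.
    """
    # Keep only the chunks that belong to this file, then order them by "n".
    own = [c for c in chunk_docs if c["files_id"] == files_doc["_id"]]
    own.sort(key=lambda c: c["n"])
    data = b"".join(c["data"] for c in own)
    # The recorded length gives a cheap sanity check for missing chunks.
    if len(data) != files_doc["length"]:
        raise ValueError("chunks do not add up to the recorded length")
    return data


files_doc = {"_id": 1, "length": 11, "chunkSize": 4}
chunk_docs = [
    {"_id": 12, "files_id": 1, "n": 2, "data": b"rld"},
    {"_id": 10, "files_id": 1, "n": 0, "data": b"hell"},
    {"_id": 11, "files_id": 1, "n": 1, "data": b"o wo"},
]
print(read_file(files_doc, chunk_docs))  # b'hello world'
```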
In the Python gridfs package (included with the pymongo driver), several other fields are inserted as well:
- filename: This is the 'human' name for the file, which may be path-delimited to simulate directories.
- contentType: This is the MIME type of the file.
- encoding: This is the Unicode encoding used for text files.
You can also add in your own attributes to files. At SourceForge, we used things like project_id or forum_id to allow the same filename to be uploaded to multiple places on the site without worrying about namespace collisions. To keep your code future-proof, you should put any custom attributes inside an embedded metadata document, just in case the gridfs spec expands to incorporate more fields.
Using GridFS
So with all that out of the way, how do you actually use GridFS? It's pretty straightforward. The first thing you need is a reference to a GridFS filesystem:
>>> import pymongo
>>> import gridfs
>>> conn = pymongo.Connection()
>>> db = conn.gridfs_test
>>> fs = gridfs.GridFS(db)
Basic reading and writing
Once you have the filesystem, you can start putting stuff in it:
>>> with fs.new_file() as fp:
...     fp.write('This is my new file. It is teh awezum!')
Let's examine the underlying collections to see what actually happened:
>>> list(db.fs.files.find())
[{u'length': 38,
  u'_id': ObjectId('4fbfa7b9fb72f096bd000000'),
  u'uploadDate': datetime.datetime(2012, 5, 25, 15, 39, 37, 55000),
  u'md5': u'332de5ca08b73218a8777da69293576a',
  u'chunkSize': 262144}]
>>> list(db.fs.chunks.find())
[{u'files_id': ObjectId('4fbfa7b9fb72f096bd000000'),
  u'_id': ObjectId('4fbfa7b9fb72f096bd000001'),
  u'data': Binary('This is my new file. It is teh awezum!', 0),
  u'n': 0}]
You can see that there's really nothing surprising or mysterious happening there; it's just mapping the filesystem metaphor onto MongoDB documents. In this case, our file was small enough that it fit into a single chunk. We can force it to be split by specifying a small chunkSize when creating the file:
>>> with fs.new_file(chunkSize=10) as fp:
...     fp.write('This is file number 2. It should be split into several chunks')
...
>>> fp
<gridfs.grid_file.GridIn object at 0x1010f5950>
>>> fp._id
ObjectId('4fbfa8ddfb72f0971c000000')
>>> list(db.fs.chunks.find(dict(files_id=fp._id)))
[{... u'data': Binary('This is fi', 0), u'n': 0},
 {... u'data': Binary('le number ', 0), u'n': 1},
 {... u'data': Binary('2. It shou', 0), u'n': 2},
 {... u'data': Binary('ld be spli', 0), u'n': 3},
 {... u'data': Binary('t into sev', 0), u'n': 4},
 {... u'data': Binary('eral chunk', 0), u'n': 5},
 {... u'data': Binary('s', 0), u'n': 6}]
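The number of chunks follows directly from the file length and the chunk size: it is the length divided by chunkSize, rounded up. A small sketch of that arithmetic, where expected_chunk_count is a hypothetical helper name and not part of the gridfs package:

```python
def expected_chunk_count(length, chunk_size):
    """Number of chunks needed for `length` bytes (ceiling division)."""
    return -(-length // chunk_size)


text = 'This is file number 2. It should be split into several chunks'
print(expected_chunk_count(len(text), 10))  # 7
```

The 61-character string above needs six full chunks of 10 bytes plus one final chunk holding the trailing 's', which matches the chunks numbered 0 through 6 in the query result.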
Now, if we actually want to read the file as a file, we'll need to use the gridfs API:
>>> with fs.get(fp._id) as fp_read:
...     print fp_read.read()
...
This is file number 2. It should be split into several chunks
Treating it more like a filesystem
There are several other convenience methods bundled into the GridFS object to give more filesystem-like behavior. For instance, new_file() takes any number of keyword arguments that will get added onto the fs.files document being created:
>>> with fs.new_file(
...     filename='file.txt',
...     content_type='text/plain',
...     my_other_attribute=42) as fp:
...     fp.write('New file')
...
>>> fp
<gridfs.grid_file.GridIn object at 0x1010f59d0>
>>> db.fs.files.find_one(dict(_id=fp._id))
{u'contentType': u'text/plain',
 u'chunkSize': 262144,
 u'my_other_attribute': 42,
 u'filename': u'file.txt',
 u'length': 8,
 u'uploadDate': datetime.datetime(2012, 5, 25, 15, 53, 1, 973000),
 u'_id': ObjectId('4fbfaaddfb72f0971c000008'),
 u'md5': u'681e10aecbafd7dd385fa51798ca0fd6'}
Better would be to encapsulate my_other_attribute into the metadata property:
>>> with fs.new_file(
...     filename='file2.txt',
...     content_type='text/plain',
...     metadata=dict(my_other_attribute=42)) as fp:
...     fp.write('New file 2')
...
>>> db.fs.files.find_one(dict(_id=fp._id))
{u'contentType': u'text/plain',
 u'chunkSize': 262144,
 u'metadata': {u'my_other_attribute': 42},
 u'filename': u'file2.txt',
 u'length': 10,
 u'uploadDate': datetime.datetime(2012, 5, 25, 15, 54, 5, 67000),
 u'_id': ObjectId('4fbfab1dfb72f0971c00000a'),
 u'md5': u'9e4eea3dec28d8346b52f18086437ac7'}
We can also "overwrite" files by filename, but since GridFS actually indexes files by _id, it doesn't get rid of the old file; it just stores the new one as another version:
>>> with fs.new_file(filename='file.txt', content_type='text/plain') as fp:
...     fp.write('Overwrite the so-called "New file"')
...
Now, if we want to retrieve the file by filename, we can use get_version or get_last_version:
>>> fs.get_last_version('file.txt').read()
'Overwrite the so-called "New file"'
>>> fs.get_version('file.txt', 0).read()
'New file'
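The version numbering can be modeled with a few lines of plain Python. This is a sketch under the assumption that versions of a filename are ordered by uploadDate, with non-negative numbers counting from the oldest upload and negative numbers counting from the newest; pick_version and the sample documents are hypothetical and only imitate the behavior shown above.

```python
from datetime import datetime


def pick_version(file_docs, filename, version=-1):
    """Pick one stored version of `filename` from fs.files-style documents.

    Hypothetical in-memory model: 0 is the first upload, -1 is the latest.
    """
    matches = sorted(
        (d for d in file_docs if d.get("filename") == filename),
        key=lambda d: d["uploadDate"],
    )
    return matches[version]


docs = [
    {"_id": "a", "filename": "file.txt", "uploadDate": datetime(2012, 5, 25, 15, 53)},
    {"_id": "b", "filename": "file2.txt", "uploadDate": datetime(2012, 5, 25, 15, 54)},
    {"_id": "c", "filename": "file.txt", "uploadDate": datetime(2012, 5, 25, 15, 55)},
]
print(pick_version(docs, "file.txt")["_id"])     # c (latest upload)
print(pick_version(docs, "file.txt", 0)["_id"])  # a (first upload)
```

Because every version is a separate document with its own _id, removing the latest one simply makes the previous upload the newest match for that filename.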
Since we've been uploading files with a filename property, we can also list the files in gridfs:
>>> fs.list()
[u'file.txt', u'file2.txt']
We can also remove files, of course:
>>> fp = fs.get_last_version('file.txt')
>>> fs.delete(fp._id)
>>> fs.list()
[u'file.txt', u'file2.txt']
>>> fs.get_last_version('file.txt').read()
'New file'
Note that since only one version of "file.txt" was removed, we still have a file named "file.txt" in the filesystem.
Finally, gridfs also provides convenience methods for determining if a file exists and for quickly writing a short file into GridFS:
>>> fs.exists(fp._id)
False
>>> fs.exists(filename='file.txt')
True
>>> fs.exists({'filename': 'file.txt'}) # equivalent to above
True
>>> fs.put('The quick brown fox', filename='typingtest.txt')
ObjectId('4fbfad74fb72f0971c00000e')
>>> fs.get_last_version('typingtest.txt').read()
'The quick brown fox'
So that's the whirlwind tour of GridFS. I'd love to hear how you're using GridFS in your project, or if you think it might be a good fit, so please drop me a line in the comments.
Published at DZone with permission of Rick Copeland. See the original article here.