Moving Along with PyMongo
In a previous post, I introduced MongoDB and its Python driver pymongo. Here, I continue in the same vein, describing in more detail how you can become productive using pymongo.
Connecting to the Database
In the last post, I gave a brief overview of how to connect to a database hosted on the MongoLab service using pymongo, so I won't go into detail on that here. Instead, I will mention a few options you'll want to be aware of.
Some of the connection options set the default policy for the safety of your
data:
- safe: MongoDB by default operates in a "fire-and-forget" mode where all data-modifying operations are optimistically assumed to succeed. Turning on safe mode changes this, waiting for a response from the database server indicating that an operation has succeeded or failed before returning from a data-modifying operation.
- journal: In version 1.8, MongoDB introduced journaling to provide single-server durability. MongoDB's flexible approach to balancing safety and performance, however, means that if your application wants to make sure its data has really been saved, you need to wait for the journal to be written.
- fsync: This is the really, really safe option. With or without a journal, this will wait until your data is on the physical disk before it returns from update operations.
- w: Before journaling, MongoDB used (and still uses) replication to ensure the durability of your data. The w parameter allows you to control how many servers (or which set of servers) your update has been replicated to before returning from a data-modifying operation. Be aware that this parameter can significantly slow down your updates, particularly if you are requiring them to be replicated to different datacenters.
- read_preference: This option allows you to specify how you'd like to handle queries. By default, even in a replica set configuration, all your queries will be routed to the primary server in the replica set to ensure strong consistency. Using read_preference you can change this policy, allowing your queries to be distributed across the secondaries of your replica set for increased performance at the cost of moving to "eventual consistency."
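Several of these policies can also be written into a MongoDB connection string. The following is a minimal sketch, assuming the connection-string option names w, journal, and readPreference (check the documentation for your driver version, since the accepted names have changed over time). build_mongo_uri is a hypothetical helper, and nothing here contacts a server; it only formats text:

```python
from urllib.parse import urlencode


def build_mongo_uri(host="localhost", port=27017, database="tutorial", **options):
    """Format a mongodb:// URI string from keyword options (hypothetical helper)."""
    base = "mongodb://%s:%d/%s" % (host, port, database)
    if not options:
        return base
    # Sort the options so the resulting string is deterministic.
    return base + "?" + urlencode(sorted(options.items()))


# Wait for two replica set members and the journal; prefer secondaries for reads.
uri = build_mongo_uri(w=2, journal="true", readPreference="secondaryPreferred")
print(uri)
```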
One thing that's nice about the pymongo connection is that it's automatically
pooled. What this means is that pymongo maintains a pool of connections to the
mongodb server that it reuses over the lifetime of your application. This is good
for performance since it means pymongo doesn't need to go through the overhead of
establishing a connection each time it does an operation. Mostly, this happens
automatically. You do need to be aware of the connection pooling, however,
since you need to manually notify pymongo that you're "done" with a
connection in the pool so it can be reused.
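To make the idea of pooling concrete, here is a toy pool in plain Python. It is only an illustration of the check-out and check-in cycle described above, not how pymongo implements it; ToyPool and fake_connect are made-up names:

```python
import queue


class ToyPool:
    """Toy connection pool: hands out reusable objects instead of creating new ones."""

    def __init__(self, factory, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    def checkout(self):
        # Blocks if every connection is currently in use.
        return self._idle.get()

    def checkin(self, conn):
        # The caller says it is "done"; the connection becomes reusable.
        self._idle.put(conn)


created = []


def fake_connect():
    # Stand-in for an expensive network connection.
    conn = object()
    created.append(conn)
    return conn


pool = ToyPool(fake_connect, size=2)
first = pool.checkout()
pool.checkin(first)       # without this, the pool would run dry after two checkouts
second = pool.checkout()
third = pool.checkout()   # reuses the connection that was checked back in
```

Only two connections are ever created, no matter how many operations run, which is where the performance benefit comes from.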
Now that all that is out of the way, the easiest way to connect to a MongoDB
database from Python is below (assuming you are running a MongoDB server
locally, and that you have installed ipython and pymongo):
In [1]: import pymongo
In [2]: conn = pymongo.Connection()
Inserting Documents
Inserting documents begins by selecting a database. To create a database, you
do... well, nothing, actually. The first time you refer to a database, the
MongoDB server creates it for you automatically. So once you have your database,
you need to decide which "collection" to store your documents in. To create
a collection, you do... right - nothing. So here we go:
In [3]: db = conn.tutorial
In [4]: db.test
Out[4]: Collection(Database(Connection('localhost', 27017), u'tutorial'), u'test')
In [5]: db.test.insert({'name': 'My Document', 'ids': [1,2,3], 'subdocument': {'a':2}})
Out[5]: ObjectId('4f25bcffeb033049af000000')
One thing to note here is that the insert command returned us an ObjectId
value. This is the value that pymongo generated for the _id property, the
"primary key" of a MongoDB document. We can also manually specify the _id if we
want (and we don't have to use ObjectIds):
In [6]: db.test.insert({'_id': 42, 'name': 'My Document', 'ids': [1,2,3], 'subdocument': {'a':2}})
Out[6]: 42
A note on document structure
A note is in order here about the kind of documents you can insert into a
collection. Technically, the type of document is described by the BSON
spec, but practically you can think of it as JSON plus a few extra types. Some of
the types you should be aware of are:
- primitive types: Python ints, strings, and floats are automatically handled appropriately by pymongo.
- object: Objects are represented by pymongo as regular Python dicts. Technically, in BSON, objects are ordered dictionaries, so if you need or want to rely on the ordering you can use the bson module, included with pymongo, to create such objects. Objects have strings as their keys and can have any valid BSON type as their values.
- array: Arrays are represented by pymongo as Python lists. Again, any valid BSON type can be used as the values, and the values in an array do not need to be of the same type.
- ObjectId: ObjectIds can be thought of as globally unique identifiers. They are often used to generate "safe" primary keys for collections without the overhead of using a sequence generator as is often done in relational databases.
- Binary: Strings in BSON are stored as UTF-8-encoded Unicode. To store non-Unicode data you must use the bson.Binary wrapper around a Python string.
Documents in MongoDB also have a size limit (16MB in the latest version), which
is enough for many use-cases, but is something you'll need to be aware of when
designing your schemas.
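The structural rules above (string keys, nested objects and arrays, a handful of value types) can be sketched as a small checker in plain Python. This is an illustration only, not part of pymongo; check_document is a hypothetical helper, it assumes Python 3 with bytes standing in for binary data, and it ignores ObjectId, dates, and the size limit:

```python
def check_document(value, top_level=True):
    """Return True if value looks storable under the rules described above."""
    if isinstance(value, dict):
        # Objects: string keys, any supported type as values.
        return all(isinstance(k, str) and check_document(v, False)
                   for k, v in value.items())
    if top_level:
        return False  # a document itself must be an object
    if isinstance(value, list):
        # Arrays may mix element types freely.
        return all(check_document(v, False) for v in value)
    # Primitives, plus bytes as a stand-in for binary data.
    return isinstance(value, (int, float, str, bytes, bool, type(None)))


ok = check_document({'name': 'My Document', 'ids': [1, 2, 3], 'subdocument': {'a': 2}})
bad_key = check_document({1: 'keys must be strings'})
```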
Batch inserts
MongoDB and pymongo also allow you to insert multiple documents with a single API
call (and a single trip to the server). This can significantly speed up your
inserts, and is useful for things like batch loads. To perform a multi-insert,
you simply pass a list of documents to the insert() method rather than a single
document:
In [7]: db.test.insert([{'a':1}, {'a':2}])
Out[7]: [ObjectId('4f25c0aceb033049af000001'), ObjectId('4f25c0aceb03
You may have noticed that the structure of the documents inserted in this snippet
differs (significantly!) from the documents inserted before. MongoDB does not make
any requirements that documents share structure with one another. Analogous to
Python's dynamic typing, MongoDB is a dynamically typed database, where the
structure of the document is stored along with the document itself. Practically,
I've found it useful to group similarly structured documents into collections,
but it's definitely not a hard-and-fast rule.
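For a large batch load you usually don't want to build one enormous list, so a common pattern is to split the input into fixed-size chunks and hand each chunk to a multi-insert. A minimal sketch of the chunking part in plain Python follows; batches is a hypothetical helper, and the chunk size is something you would tune yourself:

```python
from itertools import islice


def batches(documents, size):
    """Yield lists of at most `size` documents from any iterable."""
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


docs = ({'a': i} for i in range(5))
chunks = list(batches(docs, 2))
# Each chunk could then be passed to a multi-insert call like the one shown earlier.
```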
Querying
OK, now that you've got your data into the database, how do you get it back out
again? That's the function of the find() method on collection objects. With no
parameters, it will return all the documents in a collection as a Python
iterator:
In [8]: db.test.find()
Out[8]: <pymongo.cursor.Cursor at 0x7f2910068b90>
In [9]: list(db.test.find())
Out[9]:
[{u'_id': ObjectId('4f25bce9eb033049a0000000'), u'ids': [1, 2, 3], u'name': u'My Document', u'subdocument': {u'a': 2}},
 {u'_id': ObjectId('4f25bcffeb033049af000000'), u'ids': [1, 2, 3], u'name': u'My Document', u'subdocument': {u'a': 2}},
 {u'_id': 42, u'ids': [1, 2, 3], u'name': u'My Document', u'subdocument': {u'a': 2}},
 {u'_id': ObjectId('4f25c0aceb033049af000001'), u'a': 1},
 {u'_id': ObjectId('4f25c0aceb033049af000002'), u'a': 2}]
Most of the time, however, you'll want to select particular documents to return.
You do this by providing a query as the first parameter to find. Queries are
represented as BSON objects as well, and are similar to query-by-example that you
may have seen in other database technologies. For instance, to retrieve all
documents that have the name 'My Document', you would use the following query:
In [13]: list(db.test.find({'name':'My Document'}))
Out[13]:
[{u'_id': ObjectId('4f25bce9eb033049a0000000'), u'ids': [1, 2, 3], u'name': u'My Document', u'subdocument': {u'a': 2}},
 {u'_id': ObjectId('4f25bcffeb033049af000000'), u'ids': [1, 2, 3], u'name': u'My Document', u'subdocument': {u'a': 2}},
 {u'_id': 42, u'ids': [1, 2, 3], u'name': u'My Document', u'subdocument': {u'a': 2}}]
MongoDB also allows you to 'reach inside' embedded documents using the dot
notation. Here are some examples of how you can use that:
In [22]: db.testq.insert([
   ....:     { 'a': { 'b': 1 }, 'c': [{'d': 1}, {'d':2}, {'d':3}]},
   ....:     { 'a': { 'b': 2 }, 'c': [{'d': 2}, {'d':3}, {'d':4}]},
   ....:     { 'a': { 'b': 3 }, 'c': [{'d': 3}, {'d':4}, {'d':5}]}
   ....: ])
Out[22]:
[ObjectId('4f25c89beb033049af000009'),
 ObjectId('4f25c89beb033049af00000a'),
 ObjectId('4f25c89beb033049af00000b')]
In [23]: # reach inside objects
In [24]: list(db.testq.find({'a.b': 1}))
Out[24]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
In [25]: list(db.testq.find({'a.b': 2}))
Out[25]:
[{u'_id': ObjectId('4f25c89beb033049af00000a'), u'a': {u'b': 2}, u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]}]
In [26]: # find objects where *any* value in an array matches
In [27]: list(db.testq.find({'c': {'d':1}}))
Out[27]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
In [28]: # reach into an array
In [29]: list(db.testq.find({'c.d': 2}))
Out[29]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]},
 {u'_id': ObjectId('4f25c89beb033049af00000a'), u'a': {u'b': 2}, u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]}]
In [30]: list(db.testq.find({'c.1.d': 2}))
Out[30]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
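If the matching rules in that session feel opaque, the following pure-Python sketch mimics them for equality queries. It is a simplified model written for illustration, not MongoDB's actual matching code; values_at and matches are hypothetical helpers, and operators such as $gt are not handled:

```python
def values_at(value, parts):
    """Yield every value reachable from `value` by following the path parts."""
    if not parts:
        yield value
        return
    head, rest = parts[0], parts[1:]
    if isinstance(value, dict):
        if head in value:
            yield from values_at(value[head], rest)
    elif isinstance(value, list):
        if head.isdigit() and int(head) < len(value):
            # A numeric part selects one array position.
            yield from values_at(value[int(head)], rest)
        for item in value:
            # The path is also tried against each embedded document in the array.
            if isinstance(item, dict):
                yield from values_at(item, parts)


def matches(doc, query):
    """True if every dotted path in `query` reaches the expected value."""
    for path, expected in query.items():
        found = False
        for candidate in values_at(doc, path.split('.')):
            if candidate == expected:
                found = True
            elif isinstance(candidate, list) and expected in candidate:
                # An array matches if any of its elements equals the value.
                found = True
        if not found:
            return False
    return True


docs = [
    {'a': {'b': 1}, 'c': [{'d': 1}, {'d': 2}, {'d': 3}]},
    {'a': {'b': 2}, 'c': [{'d': 2}, {'d': 3}, {'d': 4}]},
    {'a': {'b': 3}, 'c': [{'d': 3}, {'d': 4}, {'d': 5}]},
]


def find(query):
    # Report matches by their a.b value to keep the output short.
    return [doc['a']['b'] for doc in docs if matches(doc, query)]
```

With the same three documents as in the session, find({'c.d': 2}) selects the first two documents and find({'c.1.d': 2}) selects only the first, in line with Out[29] and Out[30].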
One other thing that's important to be aware of is that you can return only a
subset of the fields in a query. (By default, _id is always returned unless
you explicitly suppress it.) Here is an example:
In [31]: list(db.testq.find({'c.1.d':2}, {'c':1}))
Out[31]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
In [32]: # we can also reach inside when specifying which fields to return
In [33]: list(db.testq.find({'c.1.d':2}, {'a.b':1}))
Out[33]: [{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b
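The field selection shown in that session can be modelled in a few lines of plain Python. This is a rough sketch of inclusion-style selection only; project is a hypothetical helper, and it does not reach into arrays of embedded documents the way the server does:

```python
def project(doc, fields):
    """Return a copy of `doc` holding only the fields marked 1 in `fields`."""
    result = {}
    # _id is included unless it is explicitly switched off with 0.
    if fields.get('_id', 1) and '_id' in doc:
        result['_id'] = doc['_id']
    for path, include in fields.items():
        if path == '_id' or not include:
            continue
        source, target = doc, result
        parts = path.split('.')
        for part in parts[:-1]:
            if not isinstance(source, dict) or part not in source:
                break
            source = source[part]
            target = target.setdefault(part, {})
        else:
            # Reached the parent of the last path part without a gap.
            if isinstance(source, dict) and parts[-1] in source:
                target[parts[-1]] = source[parts[-1]]
    return result


doc = {'_id': 9, 'a': {'b': 1}, 'c': [{'d': 1}, {'d': 2}, {'d': 3}]}
only_c = project(doc, {'c': 1})
only_ab = project(doc, {'a.b': 1})
no_id = project(doc, {'a.b': 1, '_id': 0})
```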
We can also restrict the number of results returned from a query by skipping
some documents and limiting the total number returned:
In [66]: list(db.testq.find())
Out[66]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]},
 {u'_id': ObjectId('4f25c89beb033049af00000a'), u'a': {u'b': 2}, u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]},
 {u'_id': ObjectId('4f25c89beb033049af00000b'), u'a': {u'b': 3}, u'c': [{u'd': 3}, {u'd': 4}, {u'd': 5}]}]
In [67]: list(db.testq.find().skip(1).limit(1))
Out[67]:
[{u'_id': ObjectId('4f25c89beb033049af00000a'), u'a': {u'b': 2}, u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]}]
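In plain Python terms, skipping and limiting behave like slicing an iterator. A small sketch for intuition (page is a hypothetical helper; the server does this work itself rather than shipping every document to the client):

```python
from itertools import islice


def page(results, skip=0, limit=None):
    """Drop the first `skip` items, then return at most `limit` of the rest."""
    stop = None if limit is None else skip + limit
    return list(islice(results, skip, stop))


rows = [{'n': 1}, {'n': 2}, {'n': 3}]
second_only = page(rows, skip=1, limit=1)
rest = page(rows, skip=2)
```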
The query and advanced query pages in
the MongoDB docs provide the full query syntax, which includes inequalities, size
and type operations, and more.
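To give a flavour of how operator documents such as {'$gt': 1} are read, here is a plain-Python sketch that evaluates a handful of comparison operators against a single value. It covers only the operators listed in the table, and satisfies is a hypothetical helper, not a pymongo function:

```python
import operator

# Only a few operators are modelled here; the real query language has many more.
OPERATORS = {
    '$gt': operator.gt,
    '$gte': operator.ge,
    '$lt': operator.lt,
    '$lte': operator.le,
    '$ne': operator.ne,
    '$in': lambda value, choices: value in choices,
}


def satisfies(value, condition):
    """Test one field value against a plain value or an operator document."""
    is_operator_doc = (isinstance(condition, dict) and condition
                       and all(key in OPERATORS for key in condition))
    if is_operator_doc:
        # Every operator in the document has to hold.
        return all(OPERATORS[op](value, arg) for op, arg in condition.items())
    # Anything else is treated as an equality test.
    return value == condition


in_range = satisfies(3, {'$gte': 2, '$lt': 5})
too_small = satisfies(1, {'$gt': 1})
```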
Indexing
Like any database, MongoDB can only perform so well by scanning collections for
matches. To provide better performance, MongoDB can use indexes on its
collections. The normal method of specifying an index in MongoDB is by "ensuring"
its existence, in keeping with the pattern of having things spring into existence
when they're needed. To create an index on our 'test' collection, for instance,
we would do the following:
In [13]: db.test.drop()
In [14]: db.test.ensure_index('a')
Out[14]: u'a_1'
In [15]: db.test.insert([
   ....:     {'a': 1, 'b':2}, {'a':2, 'b':3}, {'a':3, 'b':4}])
Out[15]:
[ObjectId('4f28261deb033053bc000000'),
 ObjectId('4f28261deb033053bc000001'),
 ObjectId('4f28261deb033053bc000002')]
In [16]: db.test.find({'a':2})
Out[16]: <pymongo.cursor.Cursor at 0x24f7b90>
In [17]: db.test.find_one({'a':2})
Out[17]: {u'_id': ObjectId('4f28261deb033053bc000001'), u'a': 2, u'
Ok, well that's not actually all that exciting. However, MongoDB provides an
explain() method that lets us see whether our index is getting used:
In [18]: db.test.find({'a':2}).explain()
Out[18]:
{u'allPlans': [{u'cursor': u'BtreeCursor a_1', u'indexBounds': {u'a': [[2, 2]]}}],
 u'cursor': u'BtreeCursor a_1',
 u'indexBounds': {u'a': [[2, 2]]},
 u'indexOnly': False,
 u'isMultiKey': False,
 u'millis': 0,
 u'n': 1,
 u'nChunkSkips': 0,
 u'nYields': 0,
 u'nscanned': 1,
 u'nscannedObjects': 1}
There are several important things to note about the result here. The most
important is the cursor type, BtreeCursor a_1. This means that it's using an
index, and in particular the index named a_1 that we created above. If the
field is not indexed, we get a different query plan from MongoDB:
In [19]: db.test.find({'b':2}).explain()
Out[19]:
{u'allPlans': [{u'cursor': u'BasicCursor', u'indexBounds': {}}],
 u'cursor': u'BasicCursor',
 u'indexBounds': {},
 u'indexOnly': False,
 u'isMultiKey': False,
 u'millis': 0,
 u'n': 1,
 u'nChunkSkips': 0,
 u'nYields': 0,
 u'nscanned': 3,
 u'nscannedObjects': 3}
Here, MongoDB is using a BasicCursor. For you SQL experts out there, this is
equivalent to a full table scan, and is very inefficient. Note also that when we
queried the indexed field, nscanned and nscannedObjects were both one,
meaning that it only had to check one object to satisfy the query, whereas in the
case of our unindexed field, we had to check every document in the collection.
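The difference between the two plans can be imitated in plain Python. In this sketch a sorted list searched with bisect stands in for the B-tree (an assumption made for brevity; MongoDB's index structure is more elaborate), and each function also reports how many entries it had to look at:

```python
import bisect

docs = [{'a': 1, 'b': 2}, {'a': 2, 'b': 3}, {'a': 3, 'b': 4}]

# A toy "index" on a: sorted (value, position) pairs, standing in for a B-tree.
index_on_a = sorted((doc['a'], pos) for pos, doc in enumerate(docs))
index_keys = [key for key, _ in index_on_a]


def find_with_index(value):
    """Binary-search the index; only matching entries are touched."""
    lo = bisect.bisect_left(index_keys, value)
    hi = bisect.bisect_right(index_keys, value)
    hits = [docs[pos] for _, pos in index_on_a[lo:hi]]
    return hits, hi - lo  # (results, entries scanned)


def find_with_scan(field, value):
    """No index: every document has to be examined."""
    hits = [doc for doc in docs if doc.get(field) == value]
    return hits, len(docs)  # (results, documents scanned)


indexed = find_with_index(2)
scanned = find_with_scan('b', 2)
```

The indexed lookup touches one entry and the scan touches all three documents, the same 1 versus 3 that nscanned reports in the two explain() results.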
MongoDB has an extremely fast query that it can use in some cases where it
doesn't have to scan any objects, only the index entries. This happens when the
only data you're returning from a query is part of the index:
In [36]: db.test.find({'a':2}, {'a':1, '_id':0}).explain()
Out[36]:
...
 u'indexBounds': {u'a': [[2, 2]]},
 u'indexOnly': True,
 u'isMultiKey': False,
...
Note here the indexOnly field is true, specifying that MongoDB only had to
inspect the index (and not the actual collection data) to satisfy the query.
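As a rule of thumb, a query can be index-only when every field it filters on and every field it returns lives in the index. The sketch below encodes just that rule in plain Python; is_index_only is a hypothetical helper, and the real planner applies further conditions that are not modelled here:

```python
def is_index_only(index_fields, query, projection):
    """Rough check: can the query be answered from index entries alone?"""
    # Fields explicitly requested in the field selection.
    wanted = {field for field, include in projection.items()
              if include and field != '_id'}
    if not wanted:
        # No explicit inclusions means whole documents are returned.
        return False
    # _id comes back unless it is explicitly set to 0.
    if projection.get('_id', 1):
        wanted.add('_id')
    needed = wanted | set(query)
    return needed <= set(index_fields)


covered = is_index_only(['a'], {'a': 2}, {'a': 1, '_id': 0})
not_covered = is_index_only(['a'], {'a': 2}, {'a': 1})
```

The second call is False because _id is still returned by default and is not part of the a_1 index, which is why the session above suppresses it.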
Another thing that's nice about the MongoDB index system is that it can use
compound indexes (indexes that include more than one field) to satisfy some
queries. In this case, you should specify the direction of each field since
MongoDB stores indexes as B-trees. An illustration is probably best. First, we'll
drop our a_1 index and ensure a new index on a and b, both ascending:
In [44]: db.test.drop_index('a_1')
In [45]: db.test.ensure_index([('a', 1), ('b', 1)])
Out[45]: u'a_1_b_1'
Now, let's see what happens when we query by just a:
In [55]: db.test.find({'a': 2}).explain()
Out[55]:
...
 u'cursor': u'BtreeCursor a_1_b_1',
 u'indexBounds': {u'a': [[2, 2]], u'b': [[{u'$minElement': 1}, {u'$maxElement': 1}]]},
...
MongoDB's optimizer here is "smart" enough to use the compound (a,b) index to
query just the a value. What if we query just the b value?
In [56]: db.test.find({'b': 2}).explain()
Out[56]:
{u'allPlans': [{u'cursor': u'BasicCursor', u'indexBounds': {}}],
 u'cursor': u'BasicCursor',
...
Oops! That doesn't work because the index is sorted with keys (a,b). Key order
also becomes important when we want to sort our results:
In [64]: db.test.find().sort([ ('a', 1), ('b', 1)]).explain()
Out[64]:
...
 u'cursor': u'BtreeCursor a_1_b_1',
 u'indexBounds': {u'a': [[{u'$minElement': 1}, {u'$maxElement': 1}]], u'b': [[{u'$minElement': 1}, {u'$maxElement': 1}]]},
...
In [65]: db.test.find().sort([ ('a', 1), ('b', -1)]).explain()
Out[65]:
{u'allPlans': [{u'cursor': u'BasicCursor', u'indexBounds': {}}],
 u'cursor': u'BasicCursor',
...
So if we sort in the same order as our index, we can use the B-tree index to sort
the results. If we sort in a different order, we can't, so MongoDB has to
actually load the entire result set into RAM and sort it there. (MongoDB can
actually use an index for the exact reverse of our sort order as well, so
[(a, -1), (b, -1)] would have worked just fine.) In a collection
of 3 documents, this isn't important, but as your data grows, this can become
quite slow, in some cases actually returning an error because the result set is
too large to sort in RAM.
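The two rules of thumb from this section (a query needs the leading key of a compound index, and a sort must follow the index order or its exact reverse) can be written down as a short plain-Python sketch. Both functions are hypothetical helpers, and allowing a sort on just a leading prefix of the index is an assumption that goes slightly beyond what is shown above:

```python
def usable_for_query(index, query_fields):
    """Rule of thumb from above: the query must involve the index's first key."""
    return index[0][0] in query_fields


def usable_for_sort(index, sort):
    """True if `sort` follows a prefix of the index, as-is or exactly reversed."""
    prefix = list(index[:len(sort)])
    if len(prefix) < len(sort):
        return False  # sorting on more keys than the index has
    reverse = [(field, -direction) for field, direction in prefix]
    return list(sort) == prefix or list(sort) == reverse


index = [('a', 1), ('b', 1)]

same_order = usable_for_sort(index, [('a', 1), ('b', 1)])
mixed_order = usable_for_sort(index, [('a', 1), ('b', -1)])
reversed_order = usable_for_sort(index, [('a', -1), ('b', -1)])
```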
Deleting data
Deleting data in MongoDB is fairly straightforward. All you need to do is to pass
a query to the remove() method on the collection, and MongoDB will delete all
documents from the collection that match the query. (Note that deletes can still
be slow if you specify the query inefficiently, as they use indexes just like
queries do.)
In [72]: list(db.testq.find())
Out[72]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]},
 {u'_id': ObjectId('4f25c89beb033049af00000a'), u'a': {u'b': 2}, u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]},
 {u'_id': ObjectId('4f25c89beb033049af00000b'), u'a': {u'b': 3}, u'c': [{u'd': 3}, {u'd': 4}, {u'd': 5}]}]
In [73]: db.testq.remove({'a.b': {'$gt': 1}})
In [74]: list(db.testq.find())
Out[74]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
Updating data
In many cases, an update in MongoDB is as straightforward as calling save() on
a Python dict:
In [76]: doc = db.testq.find_one({'a.b': 1})
In [77]: doc['a']['b'] = 4
In [78]: db.testq.save(doc)
Out[78]: ObjectId('4f25c89beb033049af000009')
In [79]: list(db.testq.find())
Out[79]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 4}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
save() can also be used to insert documents if they don't exist yet (this check
is done by checking the dict to be saved for the _id key).
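The insert-or-replace behaviour can be modelled with an ordinary dict keyed by _id. This is a sketch for intuition only: the save function below is a stand-in written for this example, and a simple counter replaces the ObjectId a real driver would generate:

```python
import itertools

_next_id = itertools.count(1)


def save(collection, doc):
    """Insert or replace `doc` in a dict keyed by _id, mimicking save()."""
    if '_id' not in doc:
        # No _id yet: treat as a new document and assign one.
        # A counter stands in for the ObjectId a real driver would generate.
        doc['_id'] = 'generated-%d' % next(_next_id)
    # Same _id: the stored document is replaced wholesale.
    collection[doc['_id']] = dict(doc)
    return doc['_id']


store = {}
new_id = save(store, {'a': {'b': 1}})   # no _id, so this inserts

doc = dict(store[new_id])
doc['a'] = {'b': 4}
save(store, doc)                        # _id present, so this replaces
```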
Unlike some other NoSQL solutions, MongoDB allows you to do quick,
in-place updates of documents using special modifier operations. For
instance, to set a field to a particular value, you can do the following:
In [81]: db.testq.update({'a.b': 4}, {'$set': {'a.b': 7}})
In [82]: list(db.testq.find())
Out[82]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 7}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
MongoDB provides several different modifiers you can use to update documents in
place, including the following (for more details, see the updates page in the
MongoDB docs):
- $inc: Increment a numeric field (generalized; can increment by any number)
- $set: Set certain fields to new values
- $unset: Remove a field from the document
- $push: Append a value onto an array in the document
- $pushAll: Append several values onto an array
- $addToSet: Add a value to an array if and only if it does not already exist
- $pop: Remove the last (or first) value of an array
- $pull: Remove all occurrences of a value from an array
- $pullAll: Remove all occurrences of any of a set of values from an array
- $rename: Rename a field
- $bit: Bitwise updates
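To see what several of these modifiers do to a document, here is a plain-Python sketch that applies them to a dict. It handles top-level fields only (no dot notation) and leaves out $pushAll, $pullAll, $rename, and $bit; apply_update is a hypothetical helper, not something pymongo provides, since the server performs these updates itself:

```python
def apply_update(doc, update):
    """Apply a few modifier operations to top-level fields of `doc` in place."""
    for field, amount in update.get('$inc', {}).items():
        doc[field] = doc.get(field, 0) + amount
    for field, value in update.get('$set', {}).items():
        doc[field] = value
    for field in update.get('$unset', {}):
        doc.pop(field, None)
    for field, value in update.get('$push', {}).items():
        doc.setdefault(field, []).append(value)
    for field, value in update.get('$addToSet', {}).items():
        if value not in doc.setdefault(field, []):
            doc[field].append(value)
    for field, end in update.get('$pop', {}).items():
        if doc.get(field):
            # 1 removes the last element, -1 the first.
            doc[field].pop(-1 if end == 1 else 0)
    for field, value in update.get('$pull', {}).items():
        if field in doc:
            doc[field] = [item for item in doc[field] if item != value]
    return doc


doc = {'n': 1, 'tags': ['x', 'y', 'x'], 'old': True}
apply_update(doc, {'$inc': {'n': 2}, '$set': {'name': 'z'}, '$unset': {'old': 1}})
apply_update(doc, {'$push': {'tags': 'w'}})
apply_update(doc, {'$addToSet': {'tags': 'w'}})   # already present, no change
apply_update(doc, {'$pull': {'tags': 'x'}})
```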
There's a lot more that I could cover, but hopefully that whets your appetite to
learn more about MongoDB. In future posts, I'll discuss how to use GridFS (a
"filesystem" on top of MongoDB) and MongoDB's various aggregation options, as
well as how you can use Ming to simplify certain operations.
Source: http://blog.pythonisito.com/2012/01/moving-along-with-pymongo.html