Requests in Python and MongoDB
If you use PyMongo, 10gen’s official MongoDB driver for Python, I want to ensure you understand how it manages sockets and threads, and I want to brag about performance improvements in PyMongo 2.2, which we plan to release next week.
The Problem: Threads and Sockets
Each PyMongo Connection object includes a connection pool (a pool of sockets) to minimize the cost of reconnecting. If you do two operations (e.g., two find()s) on a Connection, it creates a socket for the first find(), then reuses that socket for the second.
When sockets are returned to the pool, the pool checks if it has more than max_pool_size spare sockets, and if so, it closes the extra sockets. By default max_pool_size is 10.
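The trimming behavior can be sketched in a few lines. This is a toy model, not PyMongo's actual implementation: the Pool class and its methods are assumed names, and plain objects stand in for sockets.

```python
class Pool(object):
    def __init__(self, max_pool_size=10):
        self.max_pool_size = max_pool_size
        self.sockets = []  # idle sockets available for reuse

    def get_socket(self):
        # Reuse a pooled socket if one is idle, else "connect" a new one.
        return self.sockets.pop() if self.sockets else object()

    def return_socket(self, sock):
        # Keep the socket only if we're under max_pool_size;
        # otherwise it would be closed and discarded.
        if len(self.sockets) < self.max_pool_size:
            self.sockets.append(sock)

pool = Pool(max_pool_size=2)
socks = [pool.get_socket() for _ in range(3)]
for s in socks:
    pool.return_socket(s)
print(len(pool.sockets))  # 2: the third spare socket was discarded
```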
What if multiple Python threads share a Connection? A possible implementation would be for each thread to get a random socket from the pool when needed, and return it when done. But consider the following code. It updates a count of visitors to a web page, then displays the number of visitors on that web page including this visit:
connection = pymongo.Connection()
counts = connection.my_database.counts

counts.update(
    {'_id': this_page_url()},
    {'$inc': {'n': 1}},
    upsert=True)

n = counts.find_one({'_id': this_page_url()})['n']
print 'You are visitor number %s' % n
Since PyMongo defaults to unsafe writes—that is, it does not ask the server to acknowledge its inserts and updates—it sends the update message to the server, then immediately sends the find_one and awaits the result. If PyMongo gave out sockets to threads at random, then the following sequence could occur:
- This thread gets a socket, which I’ll call socket 1, from the pool.
- The thread sends the update message to MongoDB on socket 1. The thread does not ask for nor await a response.
- The thread returns socket 1 to the pool.
- The thread asks for a socket again, and gets a different one: socket 2.
- The thread sends the find_one message to MongoDB on socket 2.
- MongoDB happens to read from socket 2 first, and executes the find_one.
- Finally, MongoDB reads the update message from socket 1 and executes it.
In this case, the count displayed to the visitor wouldn’t include this visit.
I know what you’re thinking: just do the find_one first, add one to it, and display it to the user. Then send the update to MongoDB to increment the counter. Or use findAndModify to update the counter and get its new value in one round trip. Those are great solutions, but then I would have no excuse to explain requests to you.
Maybe you’re thinking of a different fix: use update(safe=True). That would work as well, with the added advantage that you’d know if the update failed, for example because MongoDB’s disk is full, or because you violated a unique index. But a safe update comes with a latency cost: you must send the update, wait for the acknowledgement, then send the find_one and wait for its response. In a tight loop the extra latency is significant.
The Fix: One Socket Per Thread
PyMongo solves this problem by automatically assigning a socket to each thread, when the thread first requests one. The socket is stored in a thread-local variable within the connection pool. Since MongoDB processes messages on any single socket in order, using a single socket per thread guarantees that in our example code, update is processed before find_one, so find_one’s result includes the current visit.
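The one-socket-per-thread idea can be sketched with a toy pool built on Python's threading.local. The class and attribute names here are assumptions for illustration, and strings stand in for sockets; PyMongo's real pool is more involved.

```python
import threading

class ThreadLocalPool(object):
    def __init__(self):
        self._local = threading.local()  # per-thread storage
        self._counter = 0

    def get_socket(self):
        # On a thread's first request, assign it its own socket;
        # every later request from that thread gets the same one.
        if not hasattr(self._local, 'sock'):
            self._counter += 1
            self._local.sock = 'socket-%d' % self._counter
        return self._local.sock

pool = ThreadLocalPool()
main_sock = pool.get_socket()
assert pool.get_socket() is main_sock  # same thread, same socket

results = []
t = threading.Thread(target=lambda: results.append(pool.get_socket()))
t.start()
t.join()
print(results[0] != main_sock)  # True: another thread gets its own socket
```

Because each thread funnels all its messages through one socket, the server reads that thread's operations in the order they were sent.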
More Awesome Connection Pooling
While PyMongo’s socket-per-thread behavior nicely resolves the inconsistency problem, there are some nasty performance costs that are fixed in the forthcoming PyMongo 2.2. (I did most of this work, at the direction of PyMongo’s maintainer Bernie Hackett and with co-brainstorming by my colleague Dan Crosta.)
Connection Churn
PyMongo 2.1 stores each thread’s socket in a thread-local variable. Alas, when the thread dies, its thread locals are garbage-collected and the socket is closed. This means that if you regularly create and destroy threads that access MongoDB, then you are regularly creating and destroying connections rather than reusing them.
You could call Connection.end_request() before the thread dies. end_request() returns the socket to the pool so it can be used by a future thread when it first needs a socket. But, just as most people don’t recycle their plastic bottles, most developers don’t use end_request(), so good sockets are wasted.
In PyMongo 2.2, I wrote a “socket reclamation” feature that notices when a thread has died without calling end_request, and reclaims its socket for the pool. Under the hood, I wrap each socket in a SocketInfo object, whose __del__ method returns the socket to the pool. For your application, this means that once you’ve created as many sockets as you need, those sockets can be reused as threads are created and destroyed over the lifetime of the application, saving you the latency cost of creating a new connection for each thread.
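The reclamation trick can be sketched as follows. The names here are toy stand-ins (the real wrapper is PyMongo 2.2's SocketInfo), and a string stands in for the socket; the point is that the wrapper's __del__ runs when the dead thread's locals are garbage-collected, returning the resource to a shared pool instead of losing it.

```python
import gc
import threading
import time

class SocketWrapper(object):
    """Toy stand-in for a wrapper like PyMongo 2.2's SocketInfo."""
    def __init__(self, sock, pool):
        self.sock, self.pool = sock, pool

    def __del__(self):
        # Runs when the owning thread's locals are collected:
        # hand the socket back to the pool rather than closing it.
        self.pool.reclaimed.append(self.sock)

class Pool(object):
    def __init__(self):
        self.reclaimed = []
        self._local = threading.local()

    def get_socket(self):
        if not hasattr(self._local, 'info'):
            self._local.info = SocketWrapper('a-socket', self)
        return self._local.info.sock

pool = Pool()
t = threading.Thread(target=pool.get_socket)
t.start()
t.join()

# The dead thread's locals are collected shortly after it exits; wait
# briefly for the wrapper's __del__ to return the socket to the pool.
for _ in range(100):
    gc.collect()
    if pool.reclaimed:
        break
    time.sleep(0.01)
print(pool.reclaimed)  # ['a-socket']
```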
Total Number of Connections
Consider a web crawler that launches hundreds of threads. Each thread downloads pages from the Internet, analyzes them, and stores the results of that analysis in MongoDB. Only a couple threads access MongoDB at once, since they spend most of their time downloading pages, but PyMongo 2.1 must use a separate socket for each. In a big deployment, this could result in thousands of connections and a lot of overhead for the MongoDB server.
In PyMongo 2.2 we’ve added an auto_start_request option to the Connection constructor. It defaults to True, in which case PyMongo 2.2's Connection acts the same as 2.1's, except it reclaims sockets from dead threads. If you set auto_start_request to False, however, threads can freely and safely share sockets. The Connection will only create as many sockets as are actually used simultaneously. In our web crawler example, if you have a hundred threads but only a few of them are simultaneously accessing MongoDB, then only a few sockets are ever created.
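The sizing effect can be sketched with a toy shared pool (assumed names, dummy string sockets): each operation checks a socket out and returns it as soon as it's done, so the pool only grows to the number of simultaneous operations, not the number of threads.

```python
import threading

class SharedPool(object):
    def __init__(self):
        self.lock = threading.Lock()
        self.idle = []
        self.created = 0  # total sockets ever "connected"

    def get_socket(self):
        with self.lock:
            if self.idle:
                return self.idle.pop()
            self.created += 1
            return 'socket-%d' % self.created

    def return_socket(self, sock):
        with self.lock:
            self.idle.append(sock)

pool = SharedPool()

def do_query():
    sock = pool.get_socket()   # borrow a socket for one operation...
    pool.return_socket(sock)   # ...and return it right away

threads = [threading.Thread(target=do_query) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 100 threads ran, but only as many sockets were created as were
# ever checked out at the same moment.
print(1 <= pool.created <= len(threads))  # True
```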
start_request and end_request
If you create a Connection with auto_start_request=False you might still want to do some series of operations on a single socket for read-your-own-writes consistency. For that case I’ve provided an API that can be used three ways, in ascending order of convenience.
You can call start/end_request on the Connection object directly:
connection = pymongo.Connection(auto_start_request=False)
counts = connection.my_database.counts

connection.start_request()
try:
    counts.update(
        {'_id': this_page_url()},
        {'$inc': {'n': 1}},
        upsert=True)

    n = counts.find_one({'_id': this_page_url()})['n']
finally:
    connection.end_request()
The Request object
start_request() returns a Request object, so why not use it?
connection = pymongo.Connection(auto_start_request=False)
counts = connection.my_database.counts

request = connection.start_request()
try:
    counts.update(
        {'_id': this_page_url()},
        {'$inc': {'n': 1}},
        upsert=True)

    n = counts.find_one({'_id': this_page_url()})['n']
finally:
    request.end()
Using the Request object as a context manager
Request objects can be used as context managers in Python 2.5 and later, so the previous example can be terser:
connection = pymongo.Connection(auto_start_request=False)
counts = connection.my_database.counts

with connection.start_request() as request:
    counts.update(
        {'_id': this_page_url()},
        {'$inc': {'n': 1}},
        upsert=True)

    n = counts.find_one({'_id': this_page_url()})['n']
Proof
I wrote a very messy test script to verify the effect of my changes on the number of open sockets, and the total number of sockets created.
The script queries MongoDB for 60 seconds. It starts a thread each second for 40 seconds, with each thread living for 20 seconds and doing 10 queries per second. So there’s a 20-second ramp-up until there are 20 worker threads, then 20 seconds of steady state with 20 concurrent workers (one dying and one created per second), then a 20-second cooldown until the last thread completes. The script then parses the MongoDB log to see when sockets were opened and closed.
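The schedule works out as described; a few lines of arithmetic (not the actual test script) confirm the peak of 20 concurrent workers:

```python
def workers_alive(t, starts=range(40), lifetime=20):
    # A worker started at second s is alive during [s, s + lifetime).
    return sum(1 for s in starts if s <= t < s + lifetime)

# One worker starts each second for 40 seconds, each living 20 seconds.
peak = max(workers_alive(t) for t in range(60))
print(peak)  # 20
```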
I tested the script with the current PyMongo 2.1, and also with PyMongo 2.2 with auto_start_request=True and with auto_start_request=False.
PyMongo 2.1 has one socket per thread throughout the test. Each new thread opens a new socket, because old threads’ sockets are lost when those threads die. It opens 41 total sockets (one for each worker thread plus one for the main thread) and tops out at 21 concurrent sockets, because there are at most 21 concurrent threads (counting the main thread).
PyMongo 2.2 with auto_start_request=True acts rather differently (and much better). It ramps up to 21 sockets and keeps them open throughout the test, reusing them for new threads as old threads die.
And finally, with auto_start_request=False, PyMongo 2.2 only needs as many sockets as there are threads concurrently awaiting responses from MongoDB. In my test this tops out at 7 sockets, which stay open until the whole pool is deleted, because max_pool_size is 10.
Conclusion
Applications that create and destroy a lot of threads without calling end_request() should run significantly faster with PyMongo 2.2 because threads’ sockets are automatically reused after the threads die.
Although we had to default the new auto_start_request option to True for backwards compatibility, virtually all applications should set it to False. Heavily multithreaded apps will need far fewer sockets this way, meaning they’ll spend less time establishing connections to MongoDB, and put less load on the server.
Published at DZone with permission of A. Jesse Jiryu Davis, DZone MVB. See the original article here.