
Asynchronous Work With Tarantool Using Python


Very few Tarantool recipes have been published for Pythoneers. Learn how to cook Tarantool with Python sauce in your projects, as well as the pros and cons of doing so.


You may have seen recent articles about Tarantool, the NoSQL DBMS, and how it’s used at its alma mater, Mail.Ru Group, and other companies. Yet very few Tarantool recipes have been published for Pythoneers. In this article, I’ll describe how we cook “Tarantool with Python sauce” in our projects and will share its pros and cons, as well as pitfalls.

So, first things first.

Tarantool features a Lua application server designed to work with significant data flows. For example, one of my projects generates more than 80,000 requests per second (SELECT, INSERT, UPDATE, DELETE), and the workload is evenly distributed among four servers with 12 Tarantool instances. However, waiting for each request to complete is very expensive when processing such a large amount of data, which is why applications need to be able to quickly switch from one task to another. Thus, to ensure an efficient and balanced workload for all CPU cores, you’ll need Tarantool as well as asynchronous programming techniques.

How Does the Tarantool-Python Connector Work?

Before we start talking about asynchronous code in Python, let’s make sure that we have a clear idea of how regular synchronous code interacts with Tarantool. I’m going to use Tarantool 1.6 for CentOS. It’s very easy to install; you can find detailed instructions on the website as well as in the user guide.

So now that Tarantool is up and running, in order to start working with Python 2.7, we need to install the tarantool-python connector from PyPI:

$ pip install tarantool\>0.4

This will be enough for our tasks. I will start with a real task from one of our previous projects. For this task, we needed to “put” our data flow into Tarantool for further processing; the size of one data batch was approximately 1.5 kB.

We did some fieldwork to test the synchronous vs. asynchronous approaches, with the goal of choosing the right one. We started with a performance test script. It took just a couple of minutes to write:

import tarantool
import string

mod_len = len(string.printable)
data = [string.printable[it] * 1536
    for it in range(mod_len)]

tnt = tarantool.connect("127.0.0.1", 3301)

for it in range(100000):
    r = tnt.insert("tester", (it, data[it % mod_len]))

The test algorithm is simple. It performs 100,000 successive insertions into Tarantool. My virtual machine runs this piece of code in 32 seconds on average, which is about 3,000 insertions per second. Had we been happy with this performance, there would’ve been nothing else to do. As you know, premature optimization is the root of all evil. However, this was not enough for our project, and Tarantool is capable of much higher performance.

Code Profiling

It’s good to know how your code works under the hood before you run a code profiler. After all, the best profiling tool is the developer’s head. The script itself is simple and there's not too much to think about, so let’s dig a little deeper...

If we scrutinize the implementation of the tarantool-python connector, we’ll see that requests are packed using the msgpack library, sent to a socket with sendall, and then the response length and the response itself are read from the socket. That sounds more interesting. How many operations with the Tarantool socket will we have as we run our code? Basically, for one tnt.insert request, there will be one socket.sendall call (sending data) and two socket.recv calls (getting the response length and the response itself).
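To make the call pattern concrete, here is a minimal sketch of one write and two reads per request. It uses only the Python 3 standard library and a toy echo server over a socket pair. The 4-byte length prefix and the names CountingSocket, request, and run are assumptions made for illustration; the real connector packs requests with msgpack and follows Tarantool's own framing.

```python
import socket
import struct
import threading

# Hypothetical framing for this sketch: a 4-byte big-endian length prefix.
HEADER = struct.Struct(">I")


def read_exact(sock, size):
    """Read exactly `size` bytes, looping over short reads."""
    buf = b""
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


class CountingSocket:
    """Wraps a socket and counts logical write and read operations."""

    def __init__(self, sock):
        self.sock = sock
        self.sends = 0
        self.recvs = 0

    def sendall(self, data):
        self.sends += 1
        self.sock.sendall(data)

    def recv_exact(self, size):
        self.recvs += 1  # one logical read, even if it needs several recv calls
        return read_exact(self.sock, size)


def echo_server(sock, n_requests):
    """Toy server: reads one framed request and echoes it back, n times."""
    for _ in range(n_requests):
        (length,) = HEADER.unpack(read_exact(sock, HEADER.size))
        body = read_exact(sock, length)
        sock.sendall(HEADER.pack(len(body)) + body)


def request(conn, payload):
    conn.sendall(HEADER.pack(len(payload)) + payload)         # write: the request
    (length,) = HEADER.unpack(conn.recv_exact(HEADER.size))   # read 1: response length
    return conn.recv_exact(length)                            # read 2: response body


def run(n_requests=1000):
    client, server = socket.socketpair()
    worker = threading.Thread(target=echo_server, args=(server, n_requests))
    worker.start()
    conn = CountingSocket(client)
    payload = b"x" * 1536  # roughly the batch size used in the article
    for _ in range(n_requests):
        assert request(conn, payload) == payload
    worker.join()
    client.close()
    server.close()
    return conn.sends, conn.recvs
```

Calling run(n) returns the pair (n, 2 * n), which is the same ratio as in the synchronous script.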

A closer look reveals that inserting 100k records takes 100k socket.sendall calls plus 200k socket.recv calls; that is 300k read/write system calls. The code profiler (I used cProfile and kcachegrind) confirms our calculations:

[Figure: cProfile results viewed in kcachegrind]

What can be improved in our algorithm? First of all, of course, we’d like to reduce the number of system calls, i.e. operations with the Tarantool socket. This can be done by grouping tnt.insert requests into a “batch” and calling socket.sendall to send all of them at once. Likewise, we can read Tarantool’s response “batch” with a single socket.recv. This is not that easily done using classic-style programming: we would need a data buffer and some time to wait while data collects in the buffer, yet the request results would still need to be returned without delays. And what would we do if we got many requests at first, but suddenly very few? Again, there would be delays, which we are trying to avoid. All in all, we need a completely different approach, and most importantly, we need to keep our code as simple as before. This is where asynchronous frameworks come in handy.
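The batching idea can be sketched with the Python 3 standard library alone. In this illustration, requests are appended to a buffer and written with a single sendall, and the receiving side splits whatever bytes arrive back into individual messages. The length-prefixed framing and the names BatchingWriter, split_frames, and demo_batch are hypothetical and do not come from any Tarantool connector.

```python
import socket
import struct

# Hypothetical framing for this sketch: a 4-byte big-endian length prefix.
HEADER = struct.Struct(">I")


def frame(payload):
    return HEADER.pack(len(payload)) + payload


def split_frames(buffer):
    """Return (complete payloads, leftover bytes) from a receive buffer."""
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # the last frame is incomplete; wait for more bytes
        frames.append(bytes(buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]


class BatchingWriter:
    """Collects framed requests and sends them with one sendall per flush."""

    def __init__(self, sock):
        self.sock = sock
        self.pending = bytearray()
        self.send_calls = 0

    def queue(self, payload):
        self.pending += frame(payload)

    def flush(self):
        if self.pending:
            self.sock.sendall(bytes(self.pending))
            self.send_calls += 1
            self.pending.clear()


def demo_batch(n_requests=10):
    client, server = socket.socketpair()
    try:
        writer = BatchingWriter(client)
        for i in range(n_requests):
            writer.queue(b"insert %d" % i)
        writer.flush()  # one write for the whole batch

        received = bytearray()
        frames = []
        while len(frames) < n_requests:
            chunk = server.recv(65536)
            if not chunk:
                break
            received += chunk
            new_frames, received = split_frames(received)
            frames.extend(new_frames)
        return writer.send_calls, frames
    finally:
        client.close()
        server.close()
```

The hard part that this sketch leaves out is deciding when to call flush without delaying anyone's response, which is exactly the problem the asynchronous frameworks below take care of.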

gevent and Python 2.7

I have worked in the past with several asynchronous frameworks: twisted, tornado, gevent, and others. My favorite one is gevent, primarily due to its efficient handling of I/O operations and its ease of use. There is a good tutorial for the library, which offers the classic example of a fast crawler:

import time
import gevent.monkey
gevent.monkey.patch_socket()

import gevent
import urllib2
import json

def fetch(pid):
    url = 'http://json-time.appspot.com/time.json'
    response = urllib2.urlopen(url)
    result = response.read()
    json_result = json.loads(result)
    return json_result['datetime']

def synchronous():
    for i in range(1,10):
        fetch(i)

def asynchronous():
    threads = []
    for i in range(1,10):
        threads.append(gevent.spawn(fetch, i))
    gevent.joinall(threads)

t1 = time.time()
synchronous()
t2 = time.time()
print('Sync:', t2 - t1)

t1 = time.time()
asynchronous()
t2 = time.time()
print('Async:', t2 - t1)

I received the following test results on my virtual machine:

Sync: 1.529
Async: 0.238

Nice performance improvement, isn’t it? To make my synchronous piece of code work asynchronously with gevent, I had to wrap the fetch function call into gevent.spawn, basically parallelizing the URL download process. I also had to run monkey.patch_socket(), which makes all of the socket calls cooperative. That way, while one URL is being downloaded and the program is waiting for a response from a remote server, gevent switches to the other tasks and tries to download other available documents instead of staying idle. All gevent greenlets run sequentially within a single Python thread, but since no time is wasted waiting (there are no blocking system calls), the work gets done faster.

The whole idea looks good, and most importantly, this approach suits our needs perfectly. However, the tarantool-python connector doesn’t work with gevent out of the box, so I had to implement my own gtarantool connector on top of it.

gevent and Tarantool

The gtarantool connector works with gevent and Tarantool 1.6, and it’s available to install from PyPI:

$ pip install gtarantool

Meanwhile, the new solution for our task looks like this:

import gevent
import gtarantool

import string

mod_len = len(string.printable)
data = [string.printable[it] * 1536
    for it in range(mod_len)]
cnt = 0

def insert_job(tnt):
    global cnt

    for it in range(10000):
        cnt += 1
        tnt.insert("tester", (cnt, data[it % mod_len]))


tnt = gtarantool.connect("127.0.0.1", 3301)

jobs = [gevent.spawn(insert_job, tnt)
    for _ in range(10)]

gevent.joinall(jobs)

What’s different here from the initial synchronous piece of code? We distributed the insertion of 100k records among ten asynchronous greenlets, each making 10k tnt.insert calls in a loop, and we need to connect to Tarantool just once! The program runtime dropped to 12 seconds, which is almost three times faster than the synchronous version, and the insertion rate increased to about 8,000 per second. Why does this scheme work faster? How does the magic work?

The gtarantool connector uses a buffer of requests to send to a Tarantool socket, and separate read/write greenlets for the socket. Let’s take a look at the results in a code profiler (this time I used Greenlet Profiler, a version of the yappi profiler adapted for greenlets).

[Figure: Greenlet Profiler results viewed in kcachegrind]

Analyzing the results in kcachegrind, we can see that the number of socket.recv calls was reduced from 200k to 10k, and the number of socket.send calls decreased from 100k to 2.5k. That’s what makes the work with Tarantool more effective: fewer heavy system calls, thanks to the lighter and “cheaper” greenlets. And best of all, our piece of code remained basically “synchronous.” It has no ugly twisted callbacks.

We successfully used this approach in our project. Here are some other pros:

  1. We don’t use forks. We can use several Python processes, with one gtarantool connection for each process (or connection pooling).
  2. Switching between greenlets is quicker and cheaper than switching between Unix processes.
  3. The reduced number of processes allows for significantly reduced memory consumption.
  4. The reduced number of operations with the Tarantool socket boosts the efficiency of Tarantool itself as it doesn’t load the CPU as much.

Moving to asyncio in Python 3

The asyncio package is the “out-of-the-box” framework for coroutines in Python 3. It comes with documentation and examples, and there is a growing set of libraries built for asyncio. At first glance, asyncio looks complicated and confusing (at least in comparison with gevent); however, after a while, everything falls into place. I have created a Tarantool connector version for asyncio, aiotarantool.

aiotarantool is available to install from PyPI:

$ pip install aiotarantool

With asyncio, our piece of code becomes a bit more complex. We get yield from constructs and @asyncio.coroutine decorators, but I like it overall, and it’s not too different from gevent:

import asyncio
import aiotarantool
import string

mod_len = len(string.printable)
data = [string.printable[it] * 1536
    for it in range(mod_len)]
cnt = 0

@asyncio.coroutine
def insert_job(tnt):
    global cnt

    for it in range(10000):
        cnt += 1
        args = (cnt, data[it % mod_len])
        yield from tnt.insert("tester", args)


loop = asyncio.get_event_loop()
tnt = aiotarantool.connect("127.0.0.1", 3301)

# asyncio.async() was renamed to ensure_future() in Python 3.4.4
tasks = [asyncio.ensure_future(insert_job(tnt))
    for _ in range(10)]

loop.run_until_complete(asyncio.wait(tasks))
loop.close()

This solves the task in 13 seconds on average (which is about 7.5K insertions per second), a bit slower than Python 2.7 and gevent, but much better than all synchronous versions. The aiotarantool connector has a minor but very important difference from the other available asyncio libraries: the aiotarantool.connect call is made outside of the asyncio event loop. In fact, this call doesn’t create any real connection: it will be created later, inside one of the coroutines, at the first yield from tnt.insert call. This approach seemed easier and more convenient for me when coding with asyncio.
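The lazy-connection pattern described above can be sketched as follows. This is not aiotarantool's code: LazyConnection, its call method, and the line-based echo protocol are hypothetical stand-ins, and the sketch uses the async/await syntax of newer Python 3 releases instead of yield from. The constructor only stores the address, and the first coroutine that issues a request opens the socket.

```python
import asyncio
import socket


class LazyConnection:
    """Hypothetical client whose constructor stores the address only."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.reader = None
        self.writer = None
        self._lock = None  # created on first use, inside the running loop

    async def call(self, line):
        if self._lock is None:
            self._lock = asyncio.Lock()
        # For simplicity the lock serializes whole request/response cycles;
        # a real connector would multiplex requests instead.
        async with self._lock:
            if self.writer is None:
                # The real connection is opened here, on the first request.
                self.reader, self.writer = await asyncio.open_connection(
                    self.host, self.port)
            self.writer.write(line + b"\n")
            await self.writer.drain()
            reply = await self.reader.readline()
        return reply.rstrip(b"\n")

    def close(self):
        if self.writer is not None:
            self.writer.close()
            self.writer = None


async def _echo(reader, writer):
    """Toy server handler: echoes every line back to the client."""
    while True:
        line = await reader.readline()
        if not line:
            break
        writer.write(line)
        await writer.drain()
    writer.close()


def run_demo():
    # Bind a listening socket first so the port is known before any loop runs.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    port = listener.getsockname()[1]

    conn = LazyConnection("127.0.0.1", port)  # no event loop is running yet
    connected_before = conn.writer is not None

    async def main():
        server = await asyncio.start_server(_echo, sock=listener)
        try:
            return await asyncio.gather(
                *(conn.call(b"req %d" % i) for i in range(3)))
        finally:
            conn.close()
            server.close()

    return connected_before, asyncio.run(main())
```

One design consequence is that connection errors surface at the first request rather than at construction time, so callers need to handle them there.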

Again, here are the profiling results (I used the yappi profiler, but I have a concern as to whether it counts the number of function calls correctly when working with asyncio):

[Figure: yappi profiler results for the asyncio version]

Here, we have 5k calls of StreamReader.feed_data and StreamWriter.write. No doubt that is better than the 200k socket.recv calls and 100k socket.sendall calls we had in synchronous mode.
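Why do many coroutines end up sharing so few writes? A rough sketch of the mechanism, under assumptions of my own: each request gets an identifier, requests queued during one pass of the event loop are written together, and a single reader task routes every response to the coroutine waiting for it. MultiplexedClient, the 8-byte header, and the upper-casing toy server are all hypothetical; this is not the actual aiotarantool implementation.

```python
import asyncio
import struct

# Hypothetical frame header for this sketch: request id, body length.
HEADER = struct.Struct(">II")


class MultiplexedClient:
    """Shares one connection among many coroutines (must be created in a coroutine)."""

    def __init__(self, reader, writer):
        self.reader = reader
        self.writer = writer
        self.next_id = 0
        self.waiters = {}       # request id -> Future awaiting the response
        self.out = bytearray()  # requests queued since the last flush
        self.flushes = 0
        self.flush_scheduled = False
        self.reader_task = asyncio.get_running_loop().create_task(self._read_loop())

    def request(self, body):
        loop = asyncio.get_running_loop()
        self.next_id += 1
        future = loop.create_future()
        self.waiters[self.next_id] = future
        self.out += HEADER.pack(self.next_id, len(body)) + body
        if not self.flush_scheduled:
            # Flush once, after every coroutine that is ready has queued its request.
            self.flush_scheduled = True
            loop.call_soon(self._flush)
        return future

    def _flush(self):
        self.flush_scheduled = False
        self.writer.write(bytes(self.out))
        self.flushes += 1
        self.out.clear()

    async def _read_loop(self):
        while True:
            req_id, length = HEADER.unpack(await self.reader.readexactly(HEADER.size))
            body = await self.reader.readexactly(length)
            self.waiters.pop(req_id).set_result(body)

    def close(self):
        self.reader_task.cancel()
        self.writer.close()


async def handle(reader, writer):
    """Toy server: replies with the same request id and an upper-cased body."""
    try:
        while True:
            req_id, length = HEADER.unpack(await reader.readexactly(HEADER.size))
            body = await reader.readexactly(length)
            writer.write(HEADER.pack(req_id, length) + body.upper())
    except asyncio.IncompleteReadError:
        pass
    finally:
        writer.close()


async def worker(client, name, count):
    replies = []
    for i in range(count):
        replies.append(await client.request(("%s-%d" % (name, i)).encode()))
    return replies


async def demo(workers, per_worker):
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    client = MultiplexedClient(reader, writer)
    results = await asyncio.gather(
        *(worker(client, "w%d" % k, per_worker) for k in range(workers)))
    client.close()
    server.close()
    return results, client.flushes


def run_demo(workers=10, per_worker=5):
    return asyncio.run(demo(workers, per_worker))
```

With ten workers sending five requests each, the number of writes stays below the 50 requests issued, because requests that become ready in the same loop pass travel together.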

Comparing the Approaches

Now let’s compare the efficiency of the synchronous and asynchronous Tarantool connectors. You can find our benchmark source code in the tests directory of the gtarantool and aiotarantool libraries. The test scenario is to insert, select, update, and delete 100k records (1.5 kB each). We ran each test ten times, and the rounded averages are shown in the charts below. The precise values (which depend on the hardware) are not important; what matters is how they relate to each other.

We compared:

  • Synchronous tarantool-python connector for Python 2.7
  • Synchronous tarantool-python connector for Python 3.4
  • Asynchronous gtarantool connector for Python 2.7
  • Asynchronous aiotarantool connector for Python 3.4
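The measurement procedure (repeat a run several times, average the runtime, derive operations per second) can be sketched like this. It is a minimal illustration, not the benchmark from the connectors' repositories: benchmark and fake_insert are hypothetical names, and an in-memory dict stands in for the database so the snippet runs without Tarantool.

```python
import statistics
import time


def benchmark(operation, n_ops, repeats=10):
    """Call operation(i) n_ops times per run; return (mean seconds, ops/sec)."""
    runtimes = []
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(n_ops):
            operation(i)
        runtimes.append(time.perf_counter() - start)
    mean_seconds = statistics.mean(runtimes)
    return mean_seconds, n_ops / mean_seconds


# Stand-in for a real tnt.insert call so the sketch is self-contained.
store = {}


def fake_insert(i):
    store[i] = "x" * 1536


mean_seconds, ops_per_second = benchmark(fake_insert, n_ops=1000, repeats=3)
print("runtime: %.6f s, rate: %.0f ops/s" % (mean_seconds, ops_per_second))
```

To compare connectors, the same harness would be pointed at each connector's insert, select, update, and delete calls in turn.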

[Figure: Test runtime in seconds (the fewer, the better)]

[Figure: Operations per second (the more, the better)]

gtarantool is slightly faster than aiotarantool. We’ve been using gtarantool in our project for a while; this solution does really well with heavy workloads. On the downside, gevent is a third-party library that requires compilation during installation. The asyncio package, in turn, is fast and comes “out-of-the-box” in Python 3; it needs no “duct tape” like gevent’s monkey patching. But as of now, aiotarantool hasn’t been used under a real workload in our project. Well, the night is still young!

Getting Every Last Drop of Performance Out of the CPU

Let’s make our benchmark a tad more complex to get the most out of our CPU. We’ll be simultaneously deleting, inserting, updating, and selecting data (a fairly common type of workload) in one Python process, and we’ll run many such processes in parallel, say 22. Here is the reasoning behind that number: on a machine with 24 cores, we leave one core for the system (just in case), another core for Tarantool (it’ll be fine with just one), and the remaining 22 cores for the 22 Python processes. Let’s run the test for both gevent and asyncio. You can find our benchmark source code for gtarantool and for aiotarantool on GitHub.
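The core arithmetic and the process fan-out can be sketched with the standard multiprocessing module. The helper names worker_count, run_workers, and load_job are hypothetical, and load_job is only a placeholder for the per-process benchmark loop.

```python
import multiprocessing


def worker_count(total_cores, reserved_for_system=1, reserved_for_db=1):
    """Cores left for Python workers after reserving some for the OS and Tarantool."""
    return max(1, total_cores - reserved_for_system - reserved_for_db)


def load_job(worker_id):
    """Placeholder: a real job would connect to Tarantool and run its requests."""
    pass


def run_workers(target, count):
    """Start `count` processes running target(worker_id) and wait for all of them."""
    processes = [multiprocessing.Process(target=target, args=(n,))
                 for n in range(count)]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    return [process.exitcode for process in processes]
```

On the 24-core machine from the article, worker_count(24) gives 22, and run_workers(load_job, 22) would start one process per remaining core. On platforms that spawn rather than fork, that call belongs under an if __name__ == "__main__": guard.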

For easier comparison, it’d be great to present our test results graphically. This is where we can appreciate another capability of Tarantool: it embeds a Lua interpreter, so we can write practically any Lua code right in the database. Let’s write a simple program that makes Tarantool send its statistics to Graphite. We add the following piece of code to our Tarantool startup script (in a real project, of course, we’d put it into a separate module):

fiber = require('fiber')
socket = require('socket')
log = require('log')

local host = '127.0.0.1'
local port = 2003

fstat = function()
    -- UDP socket for sending metrics to Graphite
    local sock = socket('AF_INET', 'SOCK_DGRAM', 'udp')
    while true do
        local ts = tostring(math.floor(fiber.time()))
        -- requests per second for each operation type
        local info = {
            insert = box.stat.INSERT.rps,
            select = box.stat.SELECT.rps,
            update = box.stat.UPDATE.rps,
            delete = box.stat.DELETE.rps
        }

        for k, v in pairs(info) do
            -- one metric per datagram: "<path> <value> <timestamp>"
            local metric = 'tnt.' .. k .. ' ' .. tostring(v) .. ' ' .. ts
            sock:sendto(host, port, metric)
        end

        fiber.sleep(1)
        log.info('send stat to graphite ' .. ts)
    end
end

fiber.create(fstat)

Now, we start up Tarantool and automatically receive our statistical graphs.
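Each datagram sent by the fiber is one line of Graphite's plaintext format: a metric path, a value, and a Unix timestamp, separated by spaces. For checking such lines without a Graphite installation, here is a small Python 3 sketch that builds a line, sends it over UDP on the loopback interface, and parses it back. The function names are hypothetical, and unlike the Lua code above, the sketch terminates the line with a newline.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric as '<path> <value> <timestamp>' plus a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def parse_graphite_line(line):
    path, value, timestamp = line.split()
    return path, float(value), int(timestamp)


def send_and_receive(line):
    """Send one datagram to a local UDP listener and return what arrived."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # any free port on the loopback interface
        receiver.settimeout(2.0)
        sender.sendto(line.encode("ascii"), receiver.getsockname())
        data, _ = receiver.recvfrom(4096)
        return data.decode("ascii")
    finally:
        sender.close()
        receiver.close()
```

Binding the receiver to port 2003 instead of a free port would let it catch the datagrams from the Tarantool fiber on a machine where Graphite is not listening.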

Next, let’s run two benchmark tests. In the first one, we’ll be deleting, inserting, updating, and selecting data. In the second one, we’ll only be selecting data. In each of the graphs, the X-axis shows time, and the Y-axis shows the number of operations per second.

[Figure: gtarantool (insert, select, update, delete)]

[Figure: aiotarantool (insert, select, update, delete)]

[Figure: gtarantool (select only)]

[Figure: aiotarantool (select only)]

Let me remind you that Tarantool was using only one core. In the first test, Tarantool used 100% of the core; in the second test, Tarantool used only 60%.

Conclusion

The examples in this article are somewhat artificial, of course. Real-life tasks are more complex and diverse, but the solution is mostly the same as shown above: Tarantool and asynchronous programming. This tandem works great when you need to process a huge number of requests per second. Coroutines are effective when the program spends much of its time waiting for events (blocking system calls); a classic example of such a task is a crawler.

Coding for asyncio or gevent is not as hard as it looks, but you still have to pay a lot of attention to code profiling; asynchronous code often doesn’t work the way it is expected to.

Tarantool and its protocol are well suited for asynchronous programming, and Python code can work effectively with Tarantool. Once you immerse yourself in the world of Tarantool and Lua, you’ll be endlessly surprised at the powerful capabilities to which you have access.

Links Used in This Article

And, of course, the Tarantool connectors:

  • github.com/shveenkov/aiotarantool
  • github.com/shveenkov/gtarantool

    Topics:
    database ,tarantool ,python ,asynchronous ,tutorial ,nosql ,dbms

    Published at DZone with permission of evan bates, DZone MVB. See the original article here.
