
Profiling Python Using cProfile: A Concrete Case



Writing programs is fun, but making them fast can be a pain. Python programs are no exception to that, but the basic profiling toolchain is actually not that complicated to use. Here, I would like to show you how you can quickly profile and analyze your Python code to find what part of the code you should optimize.

What's Profiling?

Profiling a Python program is doing a dynamic analysis that measures the execution time of the program and everything that composes it. That means measuring the time spent in each of its functions. This will give you data about where your program is spending time, and what area might be worth optimizing.

It's a very interesting exercise. Many people focus on local optimizations, such as determining which of the Python functions range or xrange is going to be faster. It turns out that knowing which one is faster may never be an issue in your program, and that the time gained by one of them might not be worth the time you spend researching the question, or arguing about it with your colleagues.
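
If you ever do want to settle such a micro-question, the standard library's timeit module is the right measuring tool. A minimal sketch, under Python 2 where both range and xrange exist:

import timeit

# Time 10,000 iterations of each loop; lower is faster.
print(timeit.timeit("for i in range(1000): pass", number=10000))
print(timeit.timeit("for i in xrange(1000): pass", number=10000))

Even then, the profiling below should come first: a micro-benchmark only matters if the code it measures turns out to be a hotspot.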

Trying to blindly optimize a program without measuring where it is actually spending its time is a useless exercise. Following your gut alone is not always sufficient.

There are many types of profiling, as there are many things you can measure. In this exercise, we'll focus on CPU utilization profiling, meaning the time spent by each function executing instructions. Obviously, we could do many other kinds of profiling and optimization, such as memory profiling, which would measure the memory used by each piece of code—something I talk about in The Hacker's Guide to Python.
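
As a side note, if you want to try memory profiling, a hypothetical sketch using the third-party memory_profiler package (not covered further in this article) looks like this:

from memory_profiler import profile

# Running this script normally prints a line-by-line report of
# memory usage for the decorated function.
@profile
def build_list():
    return [x * 2 for x in range(1000000)]

if __name__ == "__main__":
    build_list()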


Profiling With cProfile

Since Python 2.5, Python provides a C module called cProfile which has a reasonable overhead and offers a good enough feature set. Basic usage boils down to:

>>> import cProfile
>>> cProfile.run('2 + 2')
  2 function calls in 0.000 seconds

    Ordered by: standard name    

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)        
         1    0.000    0.000    0.000    0.000 <string>:1(<module>)        
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}  

You can also run a whole script with it, which turns out to be handy:

$ python -m cProfile -s cumtime lwn2pocket.py
72270 function calls (70640 primitive calls) in 4.481 seconds

  Ordered by: cumulative time    

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)        
       1    0.004    0.004    4.481    4.481 lwn2pocket.py:2(<module>)        
       1    0.001    0.001    4.296    4.296 lwn2pocket.py:51(main)        
       3    0.000    0.000    4.286    1.429 api.py:17(request)        
       3    0.000    0.000    4.268    1.423 sessions.py:386(request)      
     4/3    0.000    0.000    3.816    1.272 sessions.py:539(send)        
       4    0.000    0.000    2.965    0.741 adapters.py:323(send)        
       4    0.000    0.000    2.962    0.740 connectionpool.py:421(urlopen)        
       4    0.000    0.000    2.961    0.740 connectionpool.py:317(_make_request)        
       2    0.000    0.000    2.675    1.338 api.py:98(post)       
      30    0.000    0.000    1.621    0.054 ssl.py:727(recv)       
      30    0.000    0.000    1.621    0.054 ssl.py:610(read)       
      30    1.621    0.054    1.621    0.054 {method 'read' of '_ssl._SSLSocket' objects}        
       1    0.000    0.000    1.611    1.611 api.py:58(get)        
       4    0.000    0.000    1.572    0.393 httplib.py:1095(getresponse)        
       4    0.000    0.000    1.572    0.393 httplib.py:446(begin)       
      60    0.000    0.000    1.571    0.026 socket.py:410(readline)        
       4    0.000    0.000    1.571    0.393 httplib.py:407(_read_status)        
       1    0.000    0.000    1.462    1.462 pocket.py:44(wrapped)        
       1    0.000    0.000    1.462    1.462 pocket.py:152(make_request)        
       1    0.000    0.000    1.462    1.462 pocket.py:139(_make_request)        
       1    0.000    0.000    1.459    1.459 pocket.py:134(_post_request)[…]  

This prints out all the functions called, with the time spent in each (tottime for the function's own code, cumtime including the functions it calls) and the number of times they have been called.
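
If you want to dig through these numbers programmatically instead of reading a wall of text, the standard library's pstats module can load and sort the same data. A minimal sketch:

import cProfile
import pstats

# Profile an expression and write the raw stats to a file instead of stdout.
cProfile.run("sum(x * x for x in range(100000))", "example.cprof")

# Load the stats, strip directory paths, sort by cumulative time,
# and print the 10 most expensive entries.
stats = pstats.Stats("example.cprof")
stats.strip_dirs().sort_stats("cumulative").print_stats(10)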

Advanced Visualization With KCacheGrind

While useful, this output format is very basic and does not make it easy to grasp what is going on in a complete program. For more advanced visualization, I leverage KCacheGrind. If you did any C programming and profiling in the last few years, you may have used it, as it is primarily designed as a front-end for Valgrind-generated call graphs.

In order to use it, you need to generate a cProfile result file, then convert it to the KCacheGrind format. To do that, I use pyprof2calltree.

$ python -m cProfile -o myscript.cprof myscript.py
$ pyprof2calltree -k -i myscript.cprof  

And, the KCacheGrind window magically appears!
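
If you prefer to stay inside Python rather than shelling out, pyprof2calltree also exposes a small Python API; based on its documentation, the equivalent of the two commands above is roughly:

from pyprof2calltree import convert, visualize

# Convert a cProfile dump to a callgrind-format file on disk...
convert('myscript.cprof', 'callgrind.out.myscript')

# ...or convert it in memory and launch KCacheGrind directly.
visualize('myscript.cprof')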

Concrete Case: Carbonara Optimization

I was curious about the performance of Carbonara, the small timeseries library I wrote for Gnocchi. I decided to do some basic profiling to see whether there was any obvious optimization to do.

In order to profile a program, you need to run it. But running the whole program in profiling mode can generate a lot of data that you don't care about, and it adds noise to what you're trying to understand. Since Gnocchi has thousands of unit tests, and a few just for Carbonara, I decided to profile the code used by those unit tests, as it's a good reflection of the library's basic features.

Note that this is a good strategy for a curious and naive first-pass profiling. There's no way that you can make sure that the hotspots you will see in the unit tests are the actual hotspots you will encounter in production. Therefore, profiling in conditions and with a scenario that mimics what's seen in production is often a necessity if you need to push your program optimization further and want to achieve perceivable and valuable gain.

I activated cProfile using the method described above, creating a cProfile.Profile object around my tests (I actually started to implement that in testtools). I then ran KCacheGrind as described above. Using KCacheGrind, I generated the following figures.
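
For reference, a minimal sketch of that wrapping approach, with a placeholder standing in for the real test suite:

import cProfile

def run_tests():
    # Placeholder for the actual unit test invocation.
    sum(x * x for x in range(100000))

# Profile only the code under test, then dump the raw stats to a file
# that pyprof2calltree and KCacheGrind can consume.
profiler = cProfile.Profile()
profiler.enable()
run_tests()
profiler.disable()
profiler.dump_stats("tests.cprof")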

The test I profiled here is called test_fetch and is pretty easy to understand: it puts data in a timeserie object and then fetches the aggregated result. The above list shows that 88% of the ticks are spent in set_values (44 ticks out of 50). This function is used to insert values into the timeserie, not to fetch them. That means that inserting data is really slow, while retrieving it is pretty fast.

Reading the rest of the list indicates that several functions share the rest of the ticks: update, _first_block_timestamp, _truncate, _resample, etc. Some of the functions in the list are not part of Carbonara, so there's no point in trying to optimize them. The only thing that can sometimes be optimized is the number of times they're called.

The call graph gives me a bit more insight into what's going on here. Using my knowledge of how Carbonara works, I don't think the whole stack on the left for _first_block_timestamp makes much sense. This function is supposed to find the first timestamp for an aggregate; e.g., with a timestamp of 13:34:45 and a period of 5 minutes, the function should return 13:30:00. The way it currently works is by calling the resample function from Pandas on a timeserie with only one element, but that seems to be very slow. Indeed, this function currently represents 25% of the time spent by set_values (11 ticks out of 44).
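
To make the rounding concrete, here is a hypothetical Python 3 sketch of that kind of timestamp rounding with plain arithmetic, not Carbonara's actual implementation:

import datetime

def round_timestamp(ts, period):
    # Round ts down to the nearest multiple of `period` since the epoch.
    period_s = period.total_seconds()
    rounded = (ts.timestamp() // period_s) * period_s
    return datetime.datetime.fromtimestamp(rounded, tz=ts.tzinfo)

ts = datetime.datetime(2016, 1, 1, 13, 34, 45, tzinfo=datetime.timezone.utc)
print(round_timestamp(ts, datetime.timedelta(minutes=5)))
# 2016-01-01 13:30:00+00:00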

Fortunately, I recently added a small function called _round_timestamp that does exactly what _first_block_timestamp needs, without calling any Pandas function, so no resample. So I ended up rewriting that function this way:

 def _first_block_timestamp(self):
-    ts = self.ts[-1:].resample(self.block_size)
-    return (ts.index[-1] - (self.block_size * self.back_window))
+    rounded = self._round_timestamp(self.ts.index[-1], self.block_size)
+    return rounded - (self.block_size * self.back_window)

I then re-ran the exact same test to compare the output of cProfile.

The list of functions looks quite different this time: the proportion of time spent in set_values dropped from 88% to 71%.

The call stack for set_values shows that pretty well: we can't even see the _first_block_timestamp function as it is so fast that it totally disappeared from the display. It's now being considered insignificant by the profiler.

So, we just sped up the whole value insertion process in Carbonara by a nice 25% in a few minutes. Not bad for a first naive pass, right?



