Python and Performance
Does Python need optimization for performance? When is it necessary? Read on for a brief comparison and some insight from the author.
One of the standard problems that keeps coming up over and over is the parsing of URLs. A sub-problem is the parsing of domain and sub-domains and getting a count.
It would be nice to parse the received file and get counts like:
.com had 15,323 count
.google.com had 62 count
.theatlantic.com had 33 count
The first code snippet would be in Python and the other code snippet would be in C/C++ to optimize for performance.
Yes. They did not even try to look in the standard library for urllib.parse. The general problem has already been solved; it can be exploited in a single line of code.
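As a small illustration of what the standard library already provides, the sketch below parses a few made-up URLs with urllib.parse.urlparse() and counts both the top-level suffix and the last two levels of each host name. The sample URLs and the helper name suffix_counts are assumptions for this example; they are not part of the original problem statement.

```python
from collections import Counter
from urllib.parse import urlparse

def suffix_counts(urls, levels=(1, 2)):
    """Count the trailing 1 and 2 labels of each URL's host name."""
    counts = Counter()
    for url in urls:
        # .hostname is lower-cased and has any port stripped; None if absent
        host = urlparse(url).hostname
        if not host:
            continue
        labels = host.split(".")
        for n in levels:
            if len(labels) >= n:
                counts[".".join(labels[-n:])] += 1
    return counts

# Hypothetical sample data, only for illustration
sample = [
    "https://www.google.com/search?q=python",
    "https://mail.google.com/",
    "https://www.theatlantic.com/technology/",
]
counts = suffix_counts(sample)
for name, n in counts.most_common():
    print(f"{name} had {n} count")
```

Using .hostname rather than .netloc is a choice made here so that ports and letter case do not split the counts; the article's own code works on .netloc.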
The line can be long-ish, so it can help to use a lambda to make it a little easier to read. The code is below.
The C/C++ point about "optimize for performance" bothers me to no end. Python isn't very slow. Optimization isn't required.
I made 16,000 URLs. These were not utterly random strings; they were random URLs built from a pool of 100 distinct names. This provides some lumpiness to the data. Not real lumpiness, where there's a long tail of one-time-only names, but enough to exercise Counter and urllib.parse.urlparse().
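The generator for this test data is not shown, so here is a minimal sketch of how such a list might be built. Everything in it is an assumption: the name pattern, the three top-level domains, the fixed seed, and the helper name make_urls are all hypothetical.

```python
import random

def make_urls(count, pool_size=100, seed=42):
    """Build `count` pseudo-random URLs from a small pool of names."""
    rng = random.Random(seed)                      # fixed seed: repeatable runs
    pool = [f"name{i}" for i in range(pool_size)]  # hypothetical name pool
    tlds = ["com", "org", "net"]                   # assumed set of suffixes
    urls = []
    for _ in range(count):
        sub = rng.choice(pool)
        domain = rng.choice(pool)
        tld = rng.choice(tlds)
        path = rng.choice(pool)
        urls.append(f"https://{sub}.{domain}.{tld}/{path}")
    return urls

random_urls_16k = make_urls(16_000)
```

Because every part is drawn from the same small pool, the two-level names repeat often, which gives Counter something to do.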
Here's what I found. Time to parse 16,000 URLs and pluck out the last two levels of the name.
CPU times: user 154 ms, sys: 2.18 ms, total: 156 ms.
Wall time: 157 ms.
CPU times: user 295 ms, sys: 6.87 ms, total: 302 ms.
Wall time: 318 ms.
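The figures above have the shape of IPython's %time output. Outside a notebook, a comparable measurement can be sketched with time.perf_counter() and time.process_time(). The stand-in URL list and the helper name timed_count are assumptions for this example, and the absolute numbers will vary from machine to machine.

```python
import time
from collections import Counter
from urllib.parse import urlparse

def top(netloc):
    """Keep the last two dot-separated labels of a host name."""
    return ".".join(netloc.split(".")[-2:])

def timed_count(urls):
    """Count two-level names and report CPU and wall-clock seconds."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    counts = Counter(top(urlparse(u).netloc) for u in urls)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return counts, cpu, wall

# Hypothetical stand-in data: 16,000 URLs over 100 distinct names
urls = [f"https://www.site{i % 100}.com/page/{i}" for i in range(16_000)]
counts, cpu, wall = timed_count(urls)
print(f"CPU time: {cpu * 1000:.0f} ms, wall time: {wall * 1000:.0f} ms")
```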
At that pace, why use C?
I suppose one could demand more speed just to demand more speed.
Here's some code that can be further optimized.
import urllib.parse
from collections import Counter
top = lambda netloc: '.'.join(netloc.split('.')[-2:])  # last two levels of the name
random_counts = Counter(top(urllib.parse.urlparse(x).netloc) for x in random_urls_32k)
The slow part of this is the top() function. Using rsplit('.', maxsplit=2) might be better than split('.'). A smarter approach might be to find all the "." characters and slice the substring from the next-to-last one. Something like netloc[findall('.', netloc)[-2]:] would do, assuming a findall() function that returns the positions of all '.' in a string.
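The two ideas above can be sketched as follows. The findall() helper mentioned in the text is hypothetical, so this sketch swaps in str.rfind(), called twice, to locate the next-to-last dot. The names top_rsplit and top_rfind are made up for this example, and whether either one is actually faster than split() would need to be measured, for instance with timeit.

```python
def top_split(netloc):
    """Baseline: split on every dot, keep the last two labels."""
    return ".".join(netloc.split(".")[-2:])

def top_rsplit(netloc):
    """Split from the right, at most twice, then keep the last two labels."""
    return ".".join(netloc.rsplit(".", maxsplit=2)[-2:])

def top_rfind(netloc):
    """Slice from just after the next-to-last dot, if there is one."""
    last = netloc.rfind(".")
    if last == -1:
        return netloc                   # no dots at all
    prev = netloc.rfind(".", 0, last)   # -1 when there is only one dot
    return netloc[prev + 1:]            # prev == -1 keeps the whole string

for host in ["www.google.com", "google.com", "a.b.c.theatlantic.com", "localhost"]:
    print(host, "->", top_rsplit(host), top_rfind(host))
```

Unlike the expression in the text, top_rfind() skips the leading dot, so all three functions return the same string for the same input.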
Of course, if there is a problem, using a NumPy structure might speed things up. Or use dask to farm the work out to multiple threads.
Published at DZone with permission of Steven Lott, DZone MVB. See the original article here.