How to Hobble Your Python Web-Scraper With getaddrinfo()
This is the second article in what seems destined to be a four-part series about Python's getaddrinfo on Mac. Here, I discover that contention for the getaddrinfo lock makes connecting to localhost appear to time out.
Network Timeouts From asyncio
A Washington Post data scientist named Al Johri posted to the MongoDB user group list, asking for help with a Python script. His script downloaded feeds from 500 sites concurrently and stored the feeds' links in MongoDB. Since this is the sort of problem async is good for, he used my async driver Motor. He'd chosen to implement his feed-fetcher on asyncio and aiohttp, with Motor's new asyncio support:
Each feed has a variable number of articles (average 10?). So it should launch around 5000+ "concurrent" requests to insert into the database. I put concurrent in quotes because it's sending the insert requests as the downloads come in, so it really shouldn't be that many requests per second. I understand PyMongo should be able to do at least 20k-30k plus?
He's right. And yet, Motor threw connection timeouts every time he ran his script. What was going wrong with Motor?
It was a Saturday afternoon when I saw Al's message to the mailing list; I wanted to leave it until Monday, but I couldn't stand the anxiety. What if my driver was buggy?
In Al's message I saw three clues. The first clue was this: Motor made its initial connection to MongoDB without trouble, but while the script downloaded feeds and inserted links into the database, Motor began throwing timeouts. Since Motor was already connected to MongoDB, and since MongoDB was running on the same machine as his code, it seemed it must be a Motor bug.
I feel like what I'm trying to accomplish really shouldn't be this hard.
Al's code also threw connection errors from aiohttp, but this was less surprising than Motor's errors, since it was fetching from remote servers. Still, I noted this as a possible second clue.
The third clue was this: if Al turned his script's concurrency down from 500 feeds to 150 or fewer, Motor stopped timing out. Why?
On Sunday, I ran Al's script on my Mac and reproduced the Motor errors. This was a relief, of course. A reproducible bug is a tractable one.
With some print statements and PyCharm, I determined that Motor occasionally expands its connection pool in order to increase its "insert" concurrency. That's when the errors happen.
I reviewed my connection-pool tests and verified that Motor can expand its connection pool under normal circumstances. So Al's feed-fetcher must be fighting with Motor somehow.
I tracked down the location of the timeout to this line in the asyncio event loop, where it begins a DNS lookup on its thread pool:

```python
def create_connection(self):
    executor = self.thread_pool_executor
    yield from executor.submit(
        socket.getaddrinfo,
        host, port, family, type, proto, flags)
```
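To watch this handoff in isolation, here's a minimal sketch of my own (not Al's code): the event loop's getaddrinfo coroutine runs socket.getaddrinfo on the default executor, the same path create_connection takes internally:

```python
import asyncio

# The event loop resolves names by running socket.getaddrinfo in its
# default thread-pool executor; create_connection awaits this internally.
loop = asyncio.new_event_loop()
infos = loop.run_until_complete(loop.getaddrinfo("localhost", 80))
loop.close()

# Each entry is (family, type, proto, canonname, sockaddr).
print(infos[0][4][0])  # "127.0.0.1" or "::1"
```

While the lookup runs on a pool thread, the event loop is free to service other coroutines; the trouble starts only when many lookups contend for the same pool.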
The first getaddrinfo call always succeeded, but later calls sometimes timed out.
I wondered what the holdup was in the thread pool, so I printed its queue size before the getaddrinfo call:

```python
# Ensure it's initialized.
if self._default_executor:
    q = self._default_executor._work_queue
    print("unfinished tasks: %d" % q.unfinished_tasks)
```
There were hundreds of unfinished tasks! Why were these lookups getting clogged? I tried increasing the thread pool size from the asyncio default of 5, to 50, to 500... but the timeouts happened just the same.
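The futility of adding threads can be reproduced with a toy simulation of my own (short sleeps standing in for lookups): when every task must hold one shared lock, as getaddrinfo must on Mac, fifty workers finish no faster than one:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()  # stand-in for CPython's getaddrinfo lock on Mac

def fake_lookup():
    # Every "lookup" must hold the one lock, so lookups run serially
    # no matter how many worker threads the pool has.
    with lock:
        time.sleep(0.01)

start = time.monotonic()
with ThreadPoolExecutor(max_workers=50) as pool:
    for future in [pool.submit(fake_lookup) for _ in range(100)]:
        future.result()
elapsed = time.monotonic() - start

# 100 serialized 10 ms lookups take about a second, despite 50 workers.
print("elapsed: %.2f seconds" % elapsed)
```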
I thought about the problem as I made dinner, I thought about it as I fell asleep, I thought about it while I was walking to the subway Monday morning in December's unseasonable warmth.
I recalled a PyMongo investigation where Anna Herlihy and I had explored CPython's getaddrinfo lock: on Mac, Python only allows one getaddrinfo call at a time. I was climbing the stairs out of the Times Square station near the office when I figured it out: Al's script was queueing on that lock.
When Motor opens a new connection to the MongoDB server, it starts a 20-second timer, then calls getaddrinfo with the server address. If hundreds of other getaddrinfo calls are already enqueued, then Motor's call can spend more than 20 seconds waiting in line for the getaddrinfo lock. It doesn't matter that looking up "localhost" is near-instant: we need the lock first. It appears as if Motor can't connect to MongoDB, when in fact it simply couldn't get the getaddrinfo lock in time.
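The failure mode can be sketched with plain threading primitives (a simulation of the diagnosis, with the 20-second timer shortened to fractions of a second): a connection attempt gives up on its deadline while another thread still holds the lock, even though the work it wants to do is trivial:

```python
import threading
import time

lock = threading.Lock()  # stand-in for the getaddrinfo lock

def hold_lock():
    # Stand-in for the hundreds of queued remote lookups ahead of us.
    with lock:
        time.sleep(1.5)

hog = threading.Thread(target=hold_lock)
hog.start()
time.sleep(0.3)  # let the other thread take the lock first

# Like Motor's 20-second timer (shortened here), the new connection
# gives up before the lock frees, though resolving "localhost" is instant.
got_lock = lock.acquire(timeout=0.3)
hog.join()

print("timed out waiting for the lock:", not got_lock)
```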
My theory explains the first clue: that Motor's initial connection succeeds. In the case of Al's script, specifically, Motor opens its first connection before aiohttp begins its hundreds of lookups, so there's no queue on the lock yet. Then aiohttp starts 500 calls to getaddrinfo for the 500 feeds' domains. As feeds are fetched, the script inserts them into MongoDB.
There comes a moment when the script begins an insert while another insert is in progress. When this happens, Motor tries to open a new MongoDB connection to start the second insert concurrently. That's when things go wrong: since aiohttp has hundreds of getaddrinfo calls still in progress, Motor's new connection gets enqueued, waiting for the lock so it can resolve "localhost" again. After 20 seconds it gives up. Meanwhile, dozens of other Motor connections have piled up behind this one, and as they reach their 20-second timeouts they fail too.
Motor's not the only one suffering, of course. The aiohttp coroutines are all waiting in line, too. So my theory explained the second clue: the aiohttp errors were also caused by queueing on the getaddrinfo lock.
What about the third clue? Why does turning concurrency down to 150 fix the problem? My theory explains that, too. The first 150 hostnames in Al's list of feeds can all be resolved in under 20 seconds total. When Motor opens a connection it is certainly slow, but it doesn't time out.
An explanatory theory is good, but experimental evidence is even better. I designed three tests for my hypothesis.
First, I tried Al's script on Linux. The Python interpreter doesn't lock around getaddrinfo calls on Linux, so a large number of in-flight lookups shouldn't slow down Motor very much when it needs to resolve "localhost". Indeed, on Linux the script worked fine, and Motor could expand its connection pool easily.
Second, on my Mac, I tried setting Motor's maximum pool size to 1. This prevented Motor from trying to open more connections after the script began the feed-fetcher, so Motor never got stuck in line behind the fetcher. Capping the pool size at 1 didn't cost any performance in this application, since the script spent so little time writing to MongoDB compared to the time it spent fetching and parsing feeds.
For my third experiment, I patched the asyncio event loop to always resolve "localhost" to "127.0.0.1", skipping the getaddrinfo call. This also worked as I expected.
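A sketch of that third experiment (mine monkey-patches socket.getaddrinfo rather than the event loop itself, but the effect is the same: "localhost" never waits for the lock):

```python
import socket

_real_getaddrinfo = socket.getaddrinfo

def patched_getaddrinfo(host, port, *args, **kwargs):
    # Short-circuit "localhost" so we never wait in line for the
    # getaddrinfo lock; fall through to the real call otherwise.
    if host == "localhost":
        return [(socket.AF_INET, socket.SOCK_STREAM,
                 socket.IPPROTO_TCP, "", ("127.0.0.1", port))]
    return _real_getaddrinfo(host, port, *args, **kwargs)

socket.getaddrinfo = patched_getaddrinfo
print(socket.getaddrinfo("localhost", 27017)[0][4])  # ('127.0.0.1', 27017)
```

The return value mimics the (family, type, proto, canonname, sockaddr) tuples that socket.getaddrinfo produces, so callers such as the event loop's create_connection can't tell the difference.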
I wrote back to Al Johri with my findings. His response made my day:
"holy crap, thank you so much. this is amazing!"
I wish bug investigations always turned out this well.
But still: all I'd done was diagnose the problem. How should I solve it? Motor could cache lookups, or treat "localhost" specially. Or asyncio could make one of those changes instead of Motor. Or perhaps the asyncio event loop should take a connection-timeout argument, since it can tell the difference between a slow call to getaddrinfo and a genuine connection timeout.
Which solution did I choose? Stay tuned for the next installment!
- The original bug report on the MongoDB user group list.
- Python's getaddrinfo lock.
- The full series on getaddrinfo on Mac.
Images: Lunardo Fero, embroidery designs, Italy circa 1559. From Fashion and Virtue: Textile Patterns and the Print Revolution 1520–1620, by Femke Speelberg.
Published at DZone with permission of A. Jesse Jiryu Davis, DZone MVB. See the original article here.