The Saga of Concurrent DNS in Python and the Defeat of the Wicked Mutex Troll
Gather round. I will tell you how I unearthed a lost secret, unbound Python from old shackles, and banished an ancient and horrible Mutex Troll.
Illustration by Terry Marks.
"Tell us about the time you made DNS resolution concurrent in Python on Mac and BSD."
No, no, you do not want to hear that story, my friends. It is nothing but old lore and
"But you made Python more scalable. The saga of Steve Jobs was sung to you by a mysterious wizard with a fanciful nickname! Tell us!"
Gather round, then. I will tell you how I unearthed a lost secret, unbound Python from old shackles, and banished an ancient and horrible Mutex Troll.
Let us begin at the beginning.
A long time ago, in the 1980s, a coven of Berkeley sorcerers crafted an operating system. They named it after themselves: the Berkeley Software Distribution, or BSD. For generations, they nurtured it, growing it and adding features. One night, they conjured a powerful function that could resolve hostnames to IPv4 or IPv6 addresses. It was called getaddrinfo. The function was mighty, but in years to come, it would grow dangerous, for the sorcerers had not made getaddrinfo thread-safe.
As ages passed, BSD spawned many offspring. There were FreeBSD, OpenBSD, NetBSD, and in time, Mac OS X. Each made its copy of getaddrinfo thread-safe, at different times and in different ways. Some operating systems retained scribes who recorded these events in the annals. Some did not.
Because getaddrinfo is ringed round with mystery, the artisans who make cross-platform network libraries have mistrusted it. Is it thread safe or not? Often, they hired a Mutex Troll to stand guard and prevent more than one thread from using getaddrinfo concurrently. The most widespread such library is Python's own socket module, distributed with Python's standard library. On Mac and other BSDs, the Python interpreter hires a Mutex Troll who demands that each Python thread holds a special lock while calling getaddrinfo.
Behold, my friends, the getaddrinfo lock in Python's socketmodule.c:
    /* On systems on which getaddrinfo() is believed to not be thread-safe,
       (this includes the getaddrinfo emulation) protect access with a lock. */
    #if defined(WITH_THREAD) && (defined(__APPLE__) || \
        (defined(__FreeBSD__) && __FreeBSD_version+0 < 503000) || \
        defined(__OpenBSD__) || defined(__NetBSD__) || \
        defined(__VMS) || !defined(HAVE_GETADDRINFO))
    #define USE_GETADDRINFO_LOCK
    #endif

    #ifdef USE_GETADDRINFO_LOCK
    #define ACQUIRE_GETADDRINFO_LOCK PyThread_acquire_lock(netdb_lock, 1);
    #define RELEASE_GETADDRINFO_LOCK PyThread_release_lock(netdb_lock);
    #else
    #define ACQUIRE_GETADDRINFO_LOCK
    #define RELEASE_GETADDRINFO_LOCK
    #endif
This lock was not widely known. Although Python's Global Interpreter Lock certainly is infamous, the getaddrinfo lock was known only to a battle-worn few. The Mutex Troll required this lock in Python interpreters installed on Mac, NetBSD, OpenBSD, or on FreeBSD before 5.3. I first descried it while hunting a deadlock it caused in PyMongo. Since then, the mercenary troll and I had met in combat again and again: deadlocks, errors, and slowdowns in my Python code led me to renewed confrontation with it.
As I met more Python experts, I learned that they had encountered this hired troll, too. For example, multithreaded Python code that crawls thousands of websites and must resolve thousands of hosts ran fine on Linux but came to grief on a Mac. Threads would wait in a long queue to acquire the lock before the troll guard would allow them to call getaddrinfo. One very slow DNS resolution would block all the threads behind it, and they would throw timeouts before they could ever grasp the lock.
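To see the troll's toll in miniature, consider a minimal sketch in Python. It is not the code of any real crawler: `fake_resolve` is a hypothetical stand-in that sleeps instead of querying DNS, the host names are invented, and an ordinary `threading.Lock` plays the part of the getaddrinfo lock. It shows only the shape of the problem: with the lock, the lookups run one after another, however many threads there are.

```python
import threading
import time


def fake_resolve(host, delay):
    """Hypothetical stand-in for a DNS lookup: sleeps, touches no network."""
    time.sleep(delay)
    return host


def resolve_all(hosts, delay, lock=None):
    """Resolve every host on its own thread and return the elapsed seconds.

    If a lock is given, each thread must hold it while "resolving",
    which models the getaddrinfo lock described above.
    """
    def worker(host):
        if lock is not None:
            with lock:  # only one lookup may proceed at a time
                fake_resolve(host, delay)
        else:
            fake_resolve(host, delay)

    threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start


# Invented host names; nothing is actually looked up.
hosts = ["host%d.example" % i for i in range(8)]
guarded = resolve_all(hosts, 0.05, lock=threading.Lock())
free = resolve_all(hosts, 0.05)
print("with lock: %.2fs, without lock: %.2fs" % (guarded, free))
```

With the lock, the elapsed time grows with the number of hosts. Without it, the elapsed time stays close to that of a single lookup. A real lookup that stalls for many seconds would hold up every thread queued behind it in the same way.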
The day that Python's artisans hired the Mutex Troll, it was needed to safeguard getaddrinfo against concurrent threads. However, now, the troll was no longer needed. I knew that getaddrinfo had been made thread-safe on BSD's children, especially the most famous of them: Mac OS X. Many modern programs that call getaddrinfo concurrently suffer no harm. The MongoDB server, for example, runs fine on Mac without a getaddrinfo lock or a troll to enforce it. However, the mercenary's contract was eternal, and in the decades it stood guard over the lock it had grown corrupt and greedy. The time had come to banish the horrid thing. Whenever I read that comment from some past craftsman about "systems on which getaddrinfo() is believed to not be thread-safe," my ire boiled hotter. Why enthrall ourselves to mere belief, not knowing the truth?
One winter morning last year, I stood before my companions in the daily status meeting and asked leave to endeavor on a quest. I told them about the Mutex Troll and how it had held Mac and BSD coders hostage for generations. I made a great boast: I would defeat the Mutex Troll in Python and free the threads. Gladly, my fellows at MongoDB granted me leave to go on the journey. "Banish the troll for the good of all!" they cried. They raised their flagons of Diet Coke and drank to my good fortune.
I donned my war-gear and sallied from MongoDB's castle. However, to dispel the Mutex Troll's power in Python, it is not enough to say "perchance getaddrinfo once was broken, but now it is surely mended." When was getaddrinfo fixed, and how? How could I prove it to the Python core developers? These developers, unlike MongoDB coders, must support all ancient versions of OS X to the dawn of time. To convince them, I would need to know the answer for certain. I decided to ask an Apple engineer to aid my cause.
Apple engineers are not like you and me — they are a shy and secretive folk. They publish only what code they must from Darwin. Their comings and goings are recorded in no bug tracker, their works in no changelog. To learn their secrets, one must delve deep.
Through wild hills, I journeyed to a tower where Apple clerics once gathered. I entered the deserted tower and found carved into the wall a man page for getaddrinfo on OS X 10.4, which warned:
    getaddrinfo(3)    BSD Library Functions Manual

    BUGS:
        The implementation of getaddrinfo() is not thread-safe.

    December 20, 2004
I read the source for 10.4's getaddrinfo. Uncertain what I beheld, I guessed I saw the data race: gai_lookupd, which reads and writes a global static variable gai_proc. It seemed ill-wrought for multithreading.
On OS X 10.5, the warning had vanished from the man page and the getaddrinfo function was largely rewritten. Should I believe that the bug was fixed then, a decade ago? I wept bitterly over the years of needless toil that programmers and processors had suffered at the hands of the troll. I pitied them, but I did not falter. I would prove that they were free of the troll's domination. Yet, diffing one version of getaddrinfo to the next was unprofitable. I did not understand what I saw! I needed an answer from Apple.
To ask a question of the Apple engineers, my friends, you must leave $99 of silver coins in a hollow oak tree. Then, wait. It may take a day, or a season, but an Apple engineer will come and whisper in your ear and bind you to a secret pact that you must never reveal what you have been told. The engineer will give you an Asking Ring. This you must use to ask a second question within a year and a day, or its power is lost.
I returned to MongoDB and asked my companions for some silver coins, which they gave me gladly. Then, on the first night after the first day of the year, I left them in the hollow oak, with my question:
"Has getaddrinfo been fixed? Can you give me a public statement or a link to a resolved bug in a tracker? I need a way, not only to know it was fixed, but to prove it to others."
I did not yet know what my second question would be.
Twelve days and twelve nights I waited, refreshing my email. Is today the day? Or today? The twelfth morning, January 13, I awoke to see an ancient box of rusted hinge and hoared with lichen resting by my bed. The box opened, exhaling the dust of forgotten smithies where the first network code was forged. Slowly, I reached in. I lifted out a scroll marked with assembly codes and unfurled it with a crackle.
My friends, I cannot tell you all I learned from that message. The secrets that were spoken to me, I am bound to keep. But I may relate a part of it, the story of a wizard both brilliant and foolish named Jobs.
…and it came to pass that Jobs was exiled from Apple. His crown and throne were taken from him and he was banished from his company. He wandered deep into the forest where he gathered a coven of witches to conjure a new operating system called NeXT, a child of BSD. A daemon called lookupd with the power to resolve hosts was bound to serve within it. Years passed. Jobs' fellows at Apple, hearing rumors of NeXT's greatness, sent emissaries to beg Jobs to return.
With Jobs restored as their king, the Apple engineers wrought the first versions of OS X. It, too, was an offspring of BSD, and its DNS system was a mix of new OS X features, mDNSResponder, and Open Directory, along with the daemon lookupd from NeXT and libresolv from an old BSD.
"Aha!" I cried. It was these OS X versions whose getaddrinfo was not thread-safe. When Python was first ported to Mac, it rightly hired a Mutex Troll to guard getaddrinfo and only allow one host resolution at a time. Unfurling the scroll more, I read on.
Next to libinfo, the scroll's author had written in the margin, "The presence and name of this library is a remnant from the original NetInfo architecture."
The mdns module uses something called the DNS-SD API, which is well-known to be thread safe. The DNS-SD API is part of the mDNSResponder project. The key function is DNSServiceQueryRecord. As you can see, it does an IPC over to the mDNSResponder process, at which point thread safety is assured.
The scroll was signed in an ornate hand:
Share and Enjoy,
Quinn "The Eskimo!"
It was a message from the loremaster Quinn, the gray-haired, the mighty-fingered hacker, the legendary, The Eskimo, who had named himself from a Bob Dylan lyric, who shouts the Hitchhiker's Guide battle cry, "Share and enjoy!"
In the dusty wooden chest, beneath the place the scroll had been, was an Asking Ring. I left it there for the day when I would need to ask a second question.
The Eskimo's message had spurred my courage. I knew what I had to do: I would prove that getaddrinfo, called concurrently, failed on 10.4 and worked on a modern Mac. Once I had done that, the Mutex Troll's power would be dispelled. Now, I had to get my hands on a 10.4 VM. I went on eBay and acquired an antique DVD.
Arduous days and nights I toiled, Googling by candlelight for the incantations that could breathe the ancient spirit to life in VirtualBox. At last, the creature arose:
Now, I needed advice from BSD witches: How should I test getaddrinfo on this old OS X?
There is a tiny coven of NYC BSD users who meet at the tavern called Stone Creek, near my dwelling. They are aged and fierce, but I made the Sign of the Trident and supplicated them humbly for advice, and they were kindly to me. One NetBSD developer named Christos Zoulas showed me NetBSD's getaddrinfo test, which resolves a hundred host names with ten threads. I plucked the test from NetBSD's code-hoard, which rests in heaps in a CVS repo.
The next task of my quest required a compiler. Happily, Xcode 2 comes with the 10.4 DVD, so I installed it and compiled the NetBSD getaddrinfo test.
I prayed the test would fail, for then I would have reproduced the bug. I'd have shown that getaddrinfo was not thread-safe on 10.4, and so, assuming the test passed on a modern OS X, I could show that the Mutex Troll's reason for being was obsolete. My heart quivered and I prayed to the spirits of ancient code-smiths as I raised my fingers to the keyboard and invoked the program:
Thank the spirits who smiled on my fortune! The test failed.
I compiled the same test on my laptop running OS X 10.10 and it passed. I could even see the evidence of getaddrinfo's concurrency on my Mac: more threads reduced the total time to resolve all hosts.
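A measurement of the same kind can be sketched in Python. This is not the NetBSD test, which is a C program. It is a hedged analogue built on the standard library's `socket.getaddrinfo` and `ThreadPoolExecutor`, and the helper names `resolve` and `resolve_all` are my own. The host list is a placeholder to be replaced with real names.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def resolve(host):
    """Return the sorted addresses for host, or the lookup error raised."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return exc
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def resolve_all(hosts, n_threads):
    """Resolve hosts on a pool of n_threads; return (results, seconds)."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(resolve, hosts))
    return results, time.monotonic() - start


if __name__ == "__main__":
    # Placeholder list: substitute a hundred real host names to see an effect.
    hosts = ["localhost"] * 100
    for n_threads in (1, 10):
        _, seconds = resolve_all(hosts, n_threads)
        print("%2d threads: %.3fs" % (n_threads, seconds))
```

On an interpreter built with the getaddrinfo lock, adding threads should not shorten the total, since every lookup still waits its turn. Where the lock is absent and the names are not already cached, more threads should help, as they did for the C test on 10.10.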
To the green and happy kingdom of the Pythonistas I hastened with my news. "'Tis mended! The getaddrinfo bug on OS X was fixed a decade ago, in 10.5. The reign of the Mutex Troll shall be ended." I related the story of my testing, and of The Eskimo's secret letter to me.
Now was the time to use my second question, for I needed to discover how to #ifdef for Mac OS 10.5. I returned to the lichened chest and took up the Asking Ring. Wearing it on my finger, I spoke: "What preprocessor symbol can I rely on to tell me if OS X is 10.5 or newer?" The ring blazed up with heat and I cast it from me. I listened for an answer, but there was none. Despondent, I lay down and slept.
The next morning, the ring had vanished, and in the chest, there was a new scroll from The Eskimo with my answer:
AvailabilityMacros.h and check for
I had acquired all the knowledge and weapons I needed. I could fulfill the boast that I had made months before, to banish the Mutex Troll and free Mac users from the getaddrinfo lock:
    -#if defined(WITH_THREAD) && (defined(__APPLE__) || \
    +#if defined(WITH_THREAD) && ( \
    +    (defined(__APPLE__) && \
    +        MAC_OS_X_VERSION_MIN_REQUIRED < MAC_OS_X_VERSION_10_5) || \
         (defined(__FreeBSD__) && __FreeBSD_version+0 < 503000) || \
         defined(__OpenBSD__) || defined(__NetBSD__) || \
         defined(__VMS) || !defined(HAVE_GETADDRINFO))
     #define USE_GETADDRINFO_LOCK
     #endif
This patch was approved by Guido van Rossum and merged by a core developer, Ned Deily. And Guido did praise me, saying, "Thanks for the thorough work!"
Now look closely at the code, my friends, and you will see that Python on FreeBSD 5.3 and later was already free from the troll. The knight Maxim Sobolev updated Python in 2005 to allow concurrent hostname resolution there.
OpenBSD and NetBSD yet suffered the demands of the Mutex Troll! OpenBSD's getaddrinfo had been thread safe since 2013; there is no need for the lock on that OS. As for NetBSD, its getaddrinfo was fixed long ago in 2004 by the very same Christos Zoulas who had answered my call for aid when I went to the BSD witches in the tavern. My blood was still hot from my victory in OS X, so I made short work of the lock on the remaining BSDs. Their annals were well-kept and easily found, unlike Apple's, and I had no trouble persuading the Python developers that no guard was needed on those OSes. Without a word, the mercenary troll shouldered its ax and trudged off in search of other patrons on other platforms. Never again would it hold hostage the worthy smiths forging Python code on BSD.
I pondered that VMS was still on the list of non-thread-safe getaddrinfo implementations. Had VMS fixed its getaddrinfo? Could Python do concurrent resolution there too, now?
My sword-arm was weary. I retired, leaving that foe to prove the mettle of some future hero.
Published at DZone with permission of A. Jesse Jiryu Davis, DZone MVB. See the original article here.