Charming Snakes and Shaving Yaks
Join the DZone community and get the full member experience.
Join For FreeOriginally Authored by Barry Warsaw
For the last couple of days I've been debugging a fun problem in the Ubuntu tool called Jockey. Jockey is a tool for managing device drivers on Ubuntu.
It actually contains both a command-line and a graphical front-end,
and a dbus backend service that does all the work (with proper
authentication, since it modifies your system). None of that is
terribly relevant to the problem, although the dbus bit will come back
to haunt us later.
What is important is that Jockey
is a Python application, written using many Python modules interfacing
to low-level tools such as apt and dbus. The original bug report was
mighty confusing. Aside from not being reproducible by myself and
others, the actual exception made no fricken sense! Basically, it was
code like this that was throwing a TypeError:
_actions = [] # _actions gets appended to at various times and later... for item in _actions[:]: # do something
Everyone who reported the problem said the TypeError was getting thrown on the for-statement
line. The exception message indicated that Python was getting some
object that it was trying to convert to an integer, but was failing.
How could you possible get that exception when either making a copy of a
list or iterating over that copy? Was the list corrupted? Was it not
actually a list but some list-like object that was somehow returning non-integers for its min and max indexes?
To
make matters worse, this little code snippet was in Python's standard
library, in the subprocess module. A quick search of Python's bug
database did reveal some recent threads about changes here, made to
ensure that popen objects got properly cleaned up by the garbage
collector if they weren't cleaned up explicitly by the program. Note
that we're using Python 2.7 here, and after some reading of the tracker
issues and the Python subprocess.py code, I just couldn't see any
problem here, or at least none that could possibly lead to the error
being seen.
When I originally looked at the
Launchpad bug report, that was about as far as I got. I couldn't see
any possible way this bug could happen, and I couldn't reproduce it, so I
set the bug to Incomplete. Unfortunately, it kept hitting Ubuntu Natty
beta testers so it wasn't going to go away. Fortunately, Martin Pitt
found a test recipe, with which I could reproduce the bug 100% of the
time. Yay! At least it probably wasn't going to be a race condition.
How
to debug this though? Normally, I'd just attach gdb to the thing and
start tracing, but the problem was that when I set the Jockey dbus
backend to use the debug version of Python, the error went away (or
rather morphed into something that was really unrelated). Here's where I started to sharpen up my favorite yak shaving blade.
The
parable about shaving yaks is so entirely appropriate to problems like
this that I'll take a quick detour. One of my (and my brother's)
favorite shows in the 90's was Ren and Stimpy.
That show was groundbreaking, and you can see elements of it in just
about every NickToons cartoon on cable today. Some of the episodes are
brilliant and others are horrible but as a whole, Ren and Stimpy are an
undeniable classic of American animation. In one particular awesome
episode, Ren and Stimpy celebrate Kilted Yaks-eve
where a Yak comes up from the tub drain and shaves his stubble, leaving
behind a present of the highly desired shaving scum for Stimpy in the
morning. Based on this episode (most likely!) the programming term "yak shaving" typically means having to take detour after meaningless detour before you can actually solve the problem you're having.
In
the context of this bug, the attempt to use the debug build of Python
in the Jockey dbus backend was the first yak to shave. Because I wanted
to get more information out of the process, I tried to attach to a
running backend process, but this proved to be quite difficult. I was
debugging this in a 64bit virtual machine, and gdb+debug-python was just
not cooperating.
Now the question is: do you debug that
problem (or at least get to a place where you can decide to file a bug
or not), or do you punt and try a different approach? The former is
most definitely shaving the yak; it gets you no closer to fixing the
original problem, but certain wastes enough of your time to make you
seem productive. Of course, it's inevitable that if you do follow that
secondary bug, it'll lead to a third one, and a fourth one, and so on
until you're deep in the shaving scum and no closer to a solution to the
original problem. You need to be constantly on guard against these types of yak, er, rabbit holes.
I'll spare you the gory details about hitting bugs in my text editor, bugs in building Python on a multiarch system,
and other paths that all lead to smooth chinned yaks, but not to happy
Pythons. When I punted on trying to attach to the Jockey backend, I
actually decided to try to figure out exactly where the TypeError
was occurring by using a twist on old-fashioned print statements. By
grepping the Python source I found about a dozen occurrences of the
error message "an integer is required". Which one was getting tripped?
I added a little marker text to every such occurrence and re-built a
Python package to aid in debugging.
Here's where I did
detour to yak shaving. The Ubuntu source package for Python itself,
runs the full test suite on every build. That's great for ensuring a
high quality Python package, but it's terrible for turnaround time when
experimentally hacking Python. Building the package with DEB_BUILD_OPTIONS="nocheck nobench" is supposed to work, but for some reason did not with my sbuild
environment. I guess debugging that would have been like shaving the
yak's left check, but instead it was so much easier to shave his right
check. So I spent some time with my razor chopping out huge sections of
the debian/rules file so that Python 2.7 would build as fast as
possible. While a seemingly meaningless task, this did actually help
quite a bit since I was able to try out ideas with a much shorter
turnaround.
Anyway, with many iterations on the
exception marking idea, I finally nailed down the culprit. It was in
the Python C API function PyInt_AsLong(),
but even here I was unsure which arm of the conditional was getting
triggered. Another round of hack-build-scp-reset-test and I found what I
suspected: PyInt_AsLong() was getting passed an object that could not be turned into an integer. But what was that object?
So
now back to the original problem with gdb. To solve this I downloaded
and built a 32bit virtual machine, and was able to reproduce the bug
there. Fortunately, in the 32bit environment I was much more successful
in attaching to a running backend Jockey dbus process, and even though I
didn't have the Python source available (and no, I won't talk about
where that particular yak shaving detour led), I could fairly
easily print the objects in the debugger at the offending code, where I
learned that PyInt_AsLong() was getting called with None as its argument. And yeah, you can't turn None into an integer in Python!
But, how was this function getting called with None? Popping up the call stack led me eventually to PyArg_Parse(),
an older C API function that parses a Python tuple into a set of
objects based on some format flags. This is used when implementing
Python functions in C, to parse an argument list. Popping up the stack
again led me into some python-apt code. python-apt is a C++ library
that exposes the APT system to Python programs. It's fairly mature and
robust, but I wasn't as familiar with all its dark corners as I was with
Python.
Now, my first instinct is never that
I've found a bug in Python. That's not to say they don't exist, but just
that Python has been around so long, is so well tested, and used so
extensively that I'm always suspicious in cases like this (i.e. strange,
inexplicable bugs that make no sense) of extension modules and third
party code. And indeed, my sleuthing has led me to python-apt, exactly
the kind of complex Python extension module that can have odd lurking
bugs in it. Still, the problem I was now faced with was this: the call
stack led me into a code path that had nothing to do with iterating over
a list, or copying that list. So, what gives?
Well,
it's useful to know how Python exceptions work at the C level. In
general terms, when some C code raises an exception, it basically sets
some global state, and then returns error codes up the stack until
either something catches it and deals with it, or it percolates up to
Python's top eval loop. The key thing here is that there are usually two
states that exist: the global exception value currently in effect, and
an error code that gets returned up the C call stack. Usually this
return code is a zero or one, but it can also be NULL or -1 in certain cases. The Python C API documentation is very good at describing these.
So
now that I looked closely at the python-apt code, I could see what what
going on and it all began to make sense! The python-apt package was pulsing
a progress meter, set up as a callback by the client of the python-apt
code. Meaning, python-apt couldn't really control what this callback
would return. python-apt expected the callback to return either Python's True or False, but it could return anything, including nothing! In Python, True can be coerced into the integer 1 and False into the integer 0, and python-apt wanted an integer, so indeed the call stack lead right to a call to PyArg_Parse()
to turn the callback's return object into an integer. What happens if
the callback didn't return something that could be turned into an
integer, or worse, didn't return anything at all?
In Python, a function always returns something, even if there is no explicit return statement. In those cases, None is implicitly returned. Yes, you see it now. And if not, there was this clue in the python-apt code: "most
of the time the user who subclasses the pulse() method forgot to add a
return {True,False} so we just assume he wants a True." Translated: Hey guy! You forgot to add a "return True" or "return False" to your pulse() method, and it probably fell of the end, giving us None, which we dutifully passed to PyArg_Parse().
PyArg_Parse() did its thing when getting None, by correctly setting the global exception state to the TypeError,
and returning a zero code to indicate an error occurred. But looking
at the python-apt code, it recognizes the error code, but forgets that some global exception state was set!
Meaning, even though python-apt was ignoring the exception, Python
still knew about it. But because control wasn't being passed on up to
the Python eval loop, the exception was just lurking there like a angry
unshaven yak, waiting to be discovered. And in fact, the next time
Python itself checked the exception state was in, yep, the for-loop iterating over the perfectly fine list object. Python hits the for-loop, finds this orphaned TypeError lurking there and raises it, in a place that has literally nothing to do with the original exception.
The fix is a one-liner. In python-apt, where its ignoring any exception returned by PyArg_Parse(), it must both swallow the error code (which it was doing), and clear the global exception state (which it was notdoing). By adding a call to PyErr_Clear(),
python-apt was keeping the interpreter's state consistent, and properly
ignoring the argument parsing error, thus fixing the bug.
As I mentioned to my colleague Colin Watson, this did turn out to be a fun one to debug, though not as "fun" as one he recently worked on.
Happy Yak Shaving.
Source: http://www.wefearchange.org/2011/03/charming-snakes-and-shaving-yaks.html
Opinions expressed by DZone contributors are their own.
Comments