Hacking into Python Objects Internals
Originally authored by Christian S. Perone
In CPython 2.x, every object starts with the PyObject structure, declared in Include/object.h:

    typedef struct _object {
        PyObject_HEAD
    } PyObject;

and the PyObject_HEAD macro is defined as:

    #define PyObject_HEAD                   \
        _PyObject_HEAD_EXTRA                \
        Py_ssize_t ob_refcnt;               \
        struct _typeobject *ob_type;

… with two fields (forget _PyObject_HEAD_EXTRA, it’s only used for a tracing debug feature) called ob_refcnt and ob_type, which hold the object’s reference count and a pointer to its type. You can already use sys.getrefcount to get the reference count of an object, but reading the object’s memory with ctypes is far more powerful, since it gives you the contents of any field of the object, including those with no native API. I’ll show more examples later, but let’s focus first on the reference count field.
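The two header fields can be mirrored as a ctypes.Structure. This is only a sketch in Python 3 syntax: the class name PyObjectHeader is made up for illustration, and the layout assumes a default, non-debug CPython build (no _PyObject_HEAD_EXTRA and no free-threading).

```python
import ctypes


class PyObjectHeader(ctypes.Structure):
    """Hypothetical mirror of PyObject_HEAD (the name is illustrative only)."""
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),  # Py_ssize_t ob_refcnt
        ("ob_type", ctypes.c_void_p),     # struct _typeobject *ob_type
    ]


obj = object()
header = PyObjectHeader.from_address(id(obj))

# ob_type should hold the address of the object's type, i.e. id(type(obj)).
print(header.ob_type == id(type(obj)))
```

Checking ob_type against id(type(obj)) is a cheap sanity test that the assumed layout matches the interpreter you are running.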
Getting the reference count (ob_refcnt)
Python has the built-in function id(), which returns the identity of an object. Looking at its definition in the CPython implementation, you’ll notice that id() returns the memory address of the object; see the source in Python/bltinmodule.c:

    static PyObject *
    builtin_id(PyObject *self, PyObject *v)
    {
        return PyLong_FromVoidPtr(v);
    }

… the function PyLong_FromVoidPtr returns a Python long object from a void pointer. So, in CPython, this value is the address of the object in memory, as shown below:

    >>> value = 666
    >>> hex(id(value))
    '0x8998e50' # memory address of the 'value' object

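One way to convince yourself that id() really is an address is to go the other way and turn the integer back into the object. The snippet below is a small sketch in Python 3 syntax; it relies on the CPython-specific behaviour described above and is not guaranteed on other implementations.

```python
import ctypes

value = ["any", "object"]
address = id(value)

# Reinterpret the integer address as a PyObject* and hand it back to Python.
same_object = ctypes.cast(address, ctypes.py_object).value

print(same_object is value)
```

This only works while the original object is still alive; casting a stale address is undefined behaviour and will likely crash the interpreter.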
Now that we have the memory address of the object, we can use the Python ctypes module to read the reference count by accessing the field ob_refcnt, which is the first field of the object. Here is the code needed to do that:

    >>> import ctypes
    >>> value = 666
    >>> value_address = id(value)
    >>>
    >>> ob_refcnt = ctypes.c_long.from_address(value_address)
    >>> ob_refcnt
    c_long(1)

What I’m doing here is reading the integer stored in the ob_refcnt field of the PyObject in memory. The ctypes object is a live view of that memory rather than a copy, so it reflects later changes. Let’s add a new reference to the object ‘value’ we created, and then check the reference count again:

    >>> value_ref = value
    >>> id(value_ref) == id(value)
    True
    >>> ob_refcnt
    c_long(2)

Note that the reference count increased by 1 due to the new reference held by the variable ‘value_ref’.
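The same experiment can be sketched in Python 3 syntax. The helper name refcount_view is hypothetical, and the code assumes a default CPython build where ob_refcnt is the first field of every object. It compares two reads rather than printing an absolute count, because recent CPython versions treat some objects (small ints, for example) specially.

```python
import ctypes


def refcount_view(obj):
    """Return a live ctypes view of obj's ob_refcnt (hypothetical helper)."""
    return ctypes.c_ssize_t.from_address(id(obj))


target = object()            # a fresh object nobody else refers to
view = refcount_view(target)

before = view.value
alias = target               # bind a second name to the same object
after = view.value

print(after - before)        # how much the count grew
```

Because the view reads memory directly, it never adds a temporary reference of its own, unlike passing the object to a function such as sys.getrefcount.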
Interned strings state (ob_sstate)
Now, getting the reference count wasn’t much fun, since we already had the sys.getrefcount API for that. But what about the interned state of strings? In order to avoid separate allocations for the same string (and to speed up comparisons), Python uses a dictionary that works like a “cache” for strings. This dictionary is defined in Objects/stringobject.c:

    /* This dictionary holds all interned strings.  Note that references to
       strings in this dictionary are *not* counted in the string's ob_refcnt.
       When the interned string reaches a refcnt of 0 the string deallocation
       function will delete the reference from this dictionary.

       Another way to look at this is that to say that the actual reference
       count of a string is:  s->ob_refcnt + (s->ob_sstate?2:0)
    */
    static PyObject *interned;

I also copied the comment about the dictionary here, because it is interesting to note that the references held by the dictionary aren’t counted in the string’s ob_refcnt.
So, the interned state of a string object is held in the field ob_sstate of the string object. Let’s see the definition of the Python string object:

    typedef struct {
        PyObject_VAR_HEAD
        long ob_shash;
        int ob_sstate;
        char ob_sval[1];

        /* Invariants:
         *     ob_sval contains space for 'ob_size+1' elements.
         *     ob_sval[ob_size] == 0.
         *     ob_shash is the hash of the string or -1 if not computed yet.
         *     ob_sstate != 0 iff the string object is in stringobject.c's
         *       'interned' dictionary; in this case the two references
         *       from 'interned' to this object are *not counted* in ob_refcnt.
         */
    } PyStringObject;

As you can see, string objects start with the PyObject_VAR_HEAD macro, which adds another header field. Let’s see its definition to get the complete picture of the structure:

    #define PyObject_VAR_HEAD               \
        PyObject_HEAD                       \
        Py_ssize_t ob_size; /* Number of items in variable part */

The PyObject_VAR_HEAD macro adds another field called ob_size, which is the number of items in the variable part of the Python object (e.g. the number of items in a list object). So, before getting to the ob_sstate field, we need to shift our offset to skip the fields ob_refcnt (a Py_ssize_t, treated here as a long) and ob_type (void*) from PyObject_HEAD, the field ob_size (also a Py_ssize_t) from PyObject_VAR_HEAD, and the field ob_shash (long) from the PyStringObject. Concretely, we need to skip this many bytes (3 fields of size long and one field of size void*):

    >>> ob_sstate_offset = ctypes.sizeof(ctypes.c_long)*3 + ctypes.sizeof(ctypes.c_void_p)
    >>> ob_sstate_offset  # 3*4 + 4 on this 32-bit build
    16

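Instead of summing sizes by hand, the offset can also be derived by describing the struct to ctypes and letting it apply the platform’s size and alignment rules. The sketch below uses Python 3 syntax; the class names PyVarObjectHead and Py2StringObject are made up for illustration, and the second class merely mirrors the Python 2 PyStringObject definition quoted above (that type does not exist in Python 3, so it is only used to compute an offset, never mapped onto a real string).

```python
import ctypes


class PyVarObjectHead(ctypes.Structure):
    """Hypothetical mirror of PyObject_VAR_HEAD on a default CPython build."""
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_ssize_t),
    ]


class Py2StringObject(ctypes.Structure):
    """Hypothetical mirror of the Python 2 PyStringObject shown above."""
    _fields_ = [
        ("head", PyVarObjectHead),
        ("ob_shash", ctypes.c_long),
        ("ob_sstate", ctypes.c_int),
        ("ob_sval", ctypes.c_char * 1),
    ]


# ctypes computes field offsets, so no manual summing is needed.
ob_sstate_offset = Py2StringObject.ob_sstate.offset
print(ob_sstate_offset)

# The variable-size header itself can be checked against a live tuple.
items = (10, 20, 30)
head = PyVarObjectHead.from_address(id(items))
print(head.ob_size == len(items))
```

Describing the layout once and asking for `.offset` avoids the assumptions about primitive sizes that the hand-written sum bakes in.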
Now, let’s prepare two cases, one string that we know isn’t interned and another that surely is. Then we’ll ask for the interned version of the non-interned string and check the result in the ob_sstate field:
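
    >>> a = "lero"
    >>> b = "".join(["l", "e", "r", "o"])
    >>> # ob_sstate is a C int, so read it with c_int (where int and long have
    >>> # the same size, ctypes makes c_int an alias of c_long, hence the repr)
    >>> ctypes.c_int.from_address(id(a) + ob_sstate_offset)
    c_long(1)
    >>> ctypes.c_int.from_address(id(b) + ob_sstate_offset)
    c_long(0)
    >>> ctypes.c_int.from_address(id(intern(b)) + ob_sstate_offset)
    c_long(1)
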
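Note that the interned state of the object “a” is 1 and that of the object “b” is 0. The string returned by intern(b) has ob_sstate set to 1. Since an equal string was already interned, intern(b) hands back that existing interned object (the same one “a” refers to) rather than modifying “b” itself.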
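The effect of interning can also be observed without touching memory, by comparing object identities. This is a sketch in Python 3 syntax, where intern() lives in the sys module; it does not read ob_sstate, since the string object layout in Python 3 differs from the PyStringObject shown here.

```python
import sys

a = sys.intern("lero")                  # explicitly interned
b = "".join(["l", "e", "r", "o"])       # built at run time, a separate object

print(a == b, a is b)                   # equal contents, identity compared
print(sys.intern(b) is a)               # interning b yields the cached object
```

Identity is what makes interned comparisons fast: two interned strings with the same contents are the same object, so a pointer comparison is enough.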
Changing internal states (evil mode)
Now, let’s suppose we want to change some internal state of a Python object through the interpreter. Let’s try to change the value of an int object. Int objects are defined in Include/intobject.h:

    typedef struct {
        PyObject_HEAD
        long ob_ival;
    } PyIntObject;

As you can see, the internal value of an int is stored in the field ob_ival. To change it, we just need to skip the ob_refcnt (long) and the ob_type (void*) from PyObject_HEAD:

    >>> value = 666
    >>> ob_ival_offset = ctypes.sizeof(ctypes.c_long) + ctypes.sizeof(ctypes.c_void_p)
    >>> ob_ival = ctypes.c_long.from_address(id(value)+ob_ival_offset)  # ob_ival is a C long
    >>> ob_ival
    c_long(666)
    >>> ob_ival.value = 8
    >>> value
    8

And that’s it: we have changed the value of the int object directly in memory.
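The same trick can be tried on a float, whose layout is just the object header followed by one C double (the field is called ob_fval in CPython’s floatobject.h). The snippet is a sketch in Python 3 syntax under the assumption of a default CPython build; it derives the offset from float.__basicsize__ instead of hard-coding header sizes, and it builds the float at run time so that no shared constant gets overwritten.

```python
import ctypes

value = float("3.25")  # created at run time, so not a shared constant

# PyFloatObject is PyObject_HEAD followed by `double ob_fval`, so the double
# occupies the last bytes of the object (assumption: default CPython layout).
ob_fval_offset = float.__basicsize__ - ctypes.sizeof(ctypes.c_double)
ob_fval = ctypes.c_double.from_address(id(value) + ob_fval_offset)

before = ob_fval.value
ob_fval.value = 8.0    # overwrite the C double in place
print(before, value)
```

Floats are a gentler target than ints for this experiment, because CPython caches and shares small int objects and corrupting one of them affects the whole interpreter.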
I hope you liked it. You can play with lots of other Python objects like lists and dicts. Note that this method is just intended to show how Python objects are laid out in memory and how you can change them using ctypes, but obviously, you’re not supposed to use this to change the value of ints lol.
Update 11/29/11: you’re not supposed to do such things in your production code or anything like that. In this post I’m making lazy assumptions about architecture details like the sizes of primitive types, etc. Be warned.
Source: http://pyevolve.sourceforge.net/wordpress/?p=2171