Note: I’ve written on this topic before, but thought the subject warranted further more detailed discussion and a more comprehensive and up-to-date set of benchmarks. Hence, this post and this presentation. Enjoy.
Update: This post and the benchmarks have been updated for PHP 5.3.4 and Ubuntu 10.10. (12/14/2010)
The SPL, or Standard PHP Library, is an often overlooked extension in the PHP core. It first came on the scene in PHP 5 and a variety of iterators constituted the majority of its initial offerings. Though the iterator offerings were expanded in PHP 5.3, the particularly interesting additions to the SPL were several specialized data structure classes, the foundational concepts for which originate in the field of computer science. In this post, I will provide an overview of these new classes and explain why and when they should be used.
While PHP has several data types, the ones that likely see the most frequent and varied use are strings and arrays. They are the proverbial duct tape and WD-40 of PHP, respectively. Like arrays, SPL data structure classes are used to store composite (i.e. non-scalar) data.
Now, that’s not to say that every instance of an array in existing codebases should be replaced with an SPL container object. There are cases where it’s appropriate to use one over the other. Knowing the difference requires an understanding of how arrays work.
Within the C code that makes up the PHP interpreter, arrays are implemented as a data structure called a hash table or hash map. When a value contained within an array is referenced by its index, PHP uses a hashing function to convert that index into a unique hash representing the location of the corresponding value within the array.
This hash map implementation enables arrays to store an arbitrary number of elements and provide access to all of those elements simultaneously using either numeric or string keys. Arrays are extremely fast for the capabilities they provide and are an excellent general purpose data structure.
In contrast to arrays, SplFixedArray functions more like C arrays or Java arrays than PHP arrays. The maximum number of elements that it may contain is specified upon instantiation. While it is possible to change it later via the setSize() method, this negates the performance advantages of using it: because its size is fixed, it doesn’t need to use a hashing function to resolve the position of elements within the array. It makes sense to use fixed arrays when the number of elements to be stored is known in advance and the elements only need to be accessed by sequential position.
SplFixedArray implements the Iterator, ArrayAccess, and Countable interfaces. Iterator allows it to be iterated using a foreach loop. ArrayAccess provides access to its elements using array syntax where elements are referred to using integer positions beginning at 0 as with enumerated arrays. Countable enables a list to be passed to the count() function like an array.
Aside from the inability to use it in place of arrays with array functions, instances of SplFixedArray function just like arrays for all intents and purposes. It’s even possible to convert them to and from arrays using the toArray() and fromArray() methods respectively. However, it generally makes more sense to use SplFixedArray exclusively for each individual use case.
In computer science, a list is defined as an ordered collection of values. A linked list is a data structure in which each element in the list includes a reference to one or both of the elements on either side of it within the list. The term “doubly-linked list” is used to refer to the latter case. In the SPL, this takes the form of the class SplDoublyLinkedList.
Like SplFixedArray, SplDoublyLinkedList also implements the Iterator, ArrayAccess, and Countable interfaces. In addition to the methods that come with these interface implementations, elements can be added to or removed from the start or end of the list using its push(), pop(), shift() and unshift() methods, which correspond to the array_push(), array_pop(), array_shift(), and array_unshift() functions respectively. Unfortunately, as of PHP 5.3.4, there’s no way to insert an element anywhere in the list other than at the beginning or the end. A feature request has been filed for this. Add a comment or vote to show support for its addition.
The elements at the start and end of the list are accessible via its top() and bottom() methods respectively, which correspond to the reset() and end() functions. Like SplFixedArray, elements can also be accessed arbitrarily by positional index using the array syntax granted by ArrayAccess. It makes sense to use lists when the number of elements to be stored is not known in advance and the elements only need to be accessed by sequential position.
Stacks are similar to lists with two major differences. First, elements can only be added to the top of the stack. Second, an element can only be accessed by taking it off the top of the stack. Because of these differences, the stack is often referred to as a Last-In-First-Out or LIFO data structure. SplStack is the SPL stack implementation.
SplStack is a bit removed from the traditional definition of a stack. It extends SplDoublyLinkedList and inherits its abilities, some of which don’t really apply to stacks. In order to enforce its restriction on how elements are accessed, SplStack overrides the setIteratorMode() method of its parent class and implements its own to prevent modification of the iteration direction. Both methods allow elements to be retained or removed as they are iterated.
Use of stacks makes sense when the number of elements to be stored is not known in advance and the only element that must be accessible is the last one stored. However, as of PHP 5.3.4, the performance of SplStack leaves something to be desired. Benchmarks included later in this provide an objective illustration of this, though the cause of the behavior remains unknown.
Queues are also similar to lists, again with two major differences. First, elements can only be added (or “enqueued”) to the end of the queue. Second, an element can only be accessed by removing (or “dequeueing”) it from the beginning of the queue. For these reasons the queue is referred to as a First-In-First-Out or FIFO data structure. The SplQueue class implements this data structure in the SPL.
SplQueue follows suit with SplStack in extending SplDoublyLinkedList. Just as SplStack resultingly inherits some operations with at least questionable applicability, so too does SplQueue. Likewise, it overrides setIteratorMode() with its own version to restrict how elements are accessed. Use of queues makes sense when the number of elements to be stored is not known in advance and the only element that must be accessible is the remaining element that was stored earliest.
One minor difference between SplQueue and SplStack is that the former contains two method aliases named after conceptual queue operations: dequeue() aliases SplDoublyLinkedList::shift() and enqueue() aliases SplDoublyLinkedList::push(). This makes sense because while push() and pop() share similar applicability to conceptual stack operations, they are already present in its parent class.
Despite their common ancestry, SplQueue appears to have better performance than SplStack as of PHP 5.3.4. Benchmarks included later in this post review this in more detail.
Up to this point, the data structures discussed have resembled lists insofar as they contain elements in the order in which they were added. By contrast, when an element is added to a heap, a comparison function is used to compare the new element to other elements already in the heap and element is placed appropriately within the heap based on that function’s return value. The beauty of heaps is that their underlying algorithm does this with minimal element comparisons, so it’s extremely efficient. Using heaps makes sense when the number of elements to be stored is not known in advance and elements must be accessed in an order based on how they compare to each other.
SplHeap is an abstract class used to create a heap by extending it and providing a comparison function in the form of its compare() method. Only the root element of a heap, the one yielding the highest comparison function return value, may be accessed or removed from the heap at any given time. This is done using the extract() method of SplHeap. SplHeap implements the Iterator and Countable interfaces but, because only the root element can be extracted, it does not implement the ArrayAccess interface like the previously discussed data structure classes.
In addition to the abstract SplHeap class, two concrete implementations are also included in the SPL, namely SplMinHeap and SplMaxHeap. The compare() method of SplMinHeap returns a value such that the smallest element in the heap is the root element. Likewise, the compare() method of SplMaxHeap returns a value such that the largest element in the heap is the root element.
At first glance, using a subclass of SplHeap may seem equivalent to calling sort() or a similar function on an array and accessing the elements in sequence. This is indeed the case if all elements are added to the array prior to it being sorted. However, situations such as elements arriving over time or inadequate memory to store all elements simultaneously may preclude this approach. Use of arrays in such situations would require repeated resorting of the entire array as new elements are added, which is inefficient. This is why using the corresponding heap class makes a lot more sense in that situation than repeated calls to sort(), min() or max(). Additionally, SplHeap can be used to implement the heapsort algorithm, which has better worst case performance than the quicksort algorithm implementation used by arrays.
Priority queues are somewhat similar to heaps. In fact, while it doesn’t extend SplHeap, SplPriorityQueue does make use of a heap structure internally to implement its functionality. The difference is that the insert() method of SplPriorityQueue queue accepts both a value and an associated priority, removing the need to use an array or object to store both of these and define an appropriate comparison function in an SplHeap instance. Elements with the highest priority, like those in SplMaxHeap with the highest value, are the ones that come out first when extract() is called. Note that elements with equal priority are returned in no particular order.
For reasons similar to those of SplHeap, SplPriorityQueue implements both Iterator and Countable interfaces and does not implement the ArrayAccess interface. Because it stores a value and priority per element, SplPriorityQueue includes a setExtractFlags() method that modifies the behavior of extract() to return the stored value, the stored priority, or an array containing both. Priorities are not bound to a particular data type: strings, integers, or even composite data types can be used. SplPriorityQueue can be extended and its compare() method overridden to customize the comparison logic.
It makes sense to use a priority queue when the number of elements to be stored is not known in advance and elements must be accessed in an order based on how a value associated with each element (versus the element value itself) compares to the same associated values of other elements.
Sets and Composite Hash Maps
SplObjectStorage combines some of the properties of two different data structures. First, it provides the same functionality of a hash table that a normal array has, but without its associated inability to use objects as keys unless the spl_object_hash() function is used. In other words, it implements a composite hash map. Second, it can be used as a set to store objects as data without a meaningful corresponding key or concept of sequential order.
Its attach() method accepts an object key and the data to associate with it and its detach() method allows data to be removed using its associated object key. To use the object as a set, simply exclude the $data parameter for attach() as it’s optional. The set operations implemented by SplObjectStorage all have array function counterparts. For example, the addAll() method and array_merge() function both correspond to the union set operation. The difference operation is available using the removeAll() method and array_diff() function and its variants. The contains() method and in_array() function both implement the element_of operation. Sadly, only arrays have an implementation of the intersection operation in the form of array_intersect() and its variants. Tobias Schlitt has a more in-depth analysis of this data structure that includes implementations of the set operations lacking in the SPL itself.
Like some of the other data structures in the SPL, SplObjectStorage implements the Iterator, Countable, and ArrayAccess interfaces. Oddly, it also implements the Traversable interface (which is limited to internally defined classes and negates the need for implementation of the Iterator interface) and the Serializable interface (and it is the only SPL data structure class to do so).
Using this class makes sense when data must be stored using composite keys or the ability to access data using set operations is more important than accessing data in a specific order.
- System: Sony Vaio VGN-NR298E
- CPU: Intel Core2Duo 1.83GHz
- RAM: 4 GB DDR2
- OS: Ubuntu 10.10 Karmic Koala Desktop Edition 64-bit
- PHP: Custom build of 5.3.4 (here’s how to create one) using this configuration: --without-pear --without-sqlite --without-sqlite3 --without-pdo-sqlite
Code used is located in this GitHub repository.
- Modify constant declarations at the top of runner.php as appropriate (50 executions per test were used to get the results below), then execute it from the command line. It will in turn execute each of the scripts in the tests directory, measuring execution time and memory usage. Results will be recorded in results/raw.csv.
- To generate graphs, run graphs.php. This uses the Graph component from the ezComponents library. Resulting images will be written to the results directory in PNG format.
Other Data Structures
If you have an interest in other data structure implementations for PHP outside of SPL offerings, check out the bloomy PECL extension, which is an implementation of a bloom filter created by Andrei Zmievski.