Python Collections: High Performing Containers For Complex Problems
Python has a lot of data capabilities, plus collections for Java and C++. Learn more about these high performing containers for complex problems.
Join the DZone community and get the full member experience.
Join For Free1. Introduction
Python is known for its powerful general purpose built-in data types like list, dict, tuple and set. But Python also has collection objects like Java and C++. These objects are developed on top of the general built-in containers with additional functionalities which can be used in special scenarios. The objective of this article is to introduce Python collection objects and explain them with appropriate code snippets. The collections library contains the collections objects, they are namedtuples (v2.6), deque (v2.4), ChainMap(v3.3), Counter(v2.7 ), OrderedDict(v2.7), and defaultdict(v2.5). Python 3.x also has userDict, userList, and userString to create custom container types (not in the scope of this article), which deserves a separate article. NOTE: Python 2.x user might aware of various releases they are objects got introduced. All these objects are available in Python 3.x from 3.1 onwards, except ChainMap, which got introduced in v3.3. All the code snippets in the articles are executed in Python 3.5 environment.
2. Namedtuple
As the name suggests, namedtuple is a tuple with a name. In standard tuple, we access the elements using the index, whereas namedtuple allows user to define name for elements. This is very handy, especially when processing CSV (comma separated value) files and working with complex and large datasets, where the code becomes messy with the use of indices (not so Pythonic).
2.1 Example 1
Namedtuples are available in the collections library in Python. We have to import the collections library before using any container objects from this library.
>>>from collections import namedtuple
>>>saleRecord = namedtuple('saleRecord','shopId saleDate salesAmout totalCustomers')
>>>
>>>
>>>#Assign values to a named tuple
>>>shop11=saleRecord(11,'2015-01-01',2300,150)
>>>shop12=saleRecord(shopId=22,saleDate="2015-01-01",saleAmout=1512,totalCustomers=125)
In the above code snippet, in the first line we import namedtuple from the collections library. In the second line we create a namedtuple called "saleRecord", which has shopId, saleDate, salesAmount and totalCustomers as fields. Note that namedtuple() takes two string arguments: the first argument is the name of the tuple and second argument is the list of fields names separated by space or comma. In the above example, space is used as delimiter. We have also created two tuples here. They are shop11 and shop12. For shop11, the values are assigned to fields based on the order of the fields and shop12, the values are assigned using the names.
2.2 Example 2
>>>#Reading as a namedtuple
>>>print("Shop Id =",shop12.shopId)
12
>>>print("Sale Date=",shop12.saleDate)
2015-01-01
>>>print("Sales Amount =",shop12.salesAmount)
1512
>>>print("Total Customers =",shop12.totalCustomers)
125
The above code is pretty much clear that a tuple is accessed using the names. It is also possible to access them using indexes of the tuples, which is the usual way.
2.3 Interesting Methods and Members
2.3.1 _make
The _make method is used to convert the given iterable item (list, tuple, dictionary) into a named tuple.
>>>#Convert a list into a namedtuple
>>>aList = [101,"2015-01-02",1250,199]
>>>shop101 = saleRecord._make(aList)
>>>print(shop101)
saleRecord(shopId=101, saleDate='2015-01-02', salesAmount=1250, totalCustomers=199)
>>>#Convert a tuple into a namedtuple
>>>aTup =(108,"2015-02-28",1990,189)
>>>shop108=saleRecord._make(aTup)
>>>print(shop108)
saleRecord(shopId=108, saleDate='2015-02-28', salesAmount=1990, totalCustomers=189)
>>>
2.3.2 _fields
The _fields is a tuple, which contains the names of the tuple.
>>>print(sho108._fields)
>>>('shopId', 'saleDate', 'salesAmount', 'totalCustomers')
2.4 CSV File Processing
As we discussed namedtuple will be very handy while processing a CSV data file, where we can access the data using names instead of indexes, which make the code more meaningful and efficient.
from csv import reader
from collections import namedtuple
saleRecord = namedtuple('saleRecord','shopId saleDate totalSales totalCustomers')
fileHandle = open("salesRecord.csv","r")
csvFieldsList=csv.reader(fileHandle)
for fieldsList in csvFieldsList:
shopRec = saleRecord._make(fieldsList)
overAllSales += shopRec.totalSales;
print("Total Sales of The Retail Chain =",overAllSales)
In the above code snippet, we have salesRecord.csv which contains sales records of shops of a particular retail chain. It contains the values for the fields shopId, saleDate, totalSales, totalCustomers. The fields are delimited by commas and the records are delimited by new lines. The csv.reader() reads the file and provides an iterator. The iterator, "csvFieldsList" provides a list of fields for every single row of the CSV file. As we know, the _make() converts the list into namedtuple and the rest of the code is self-explanatory.
3. Counter
Counter is used for rapid tallies. It is a dictionary, where the elements are stored as keys and their counts are stored as values.
3.1 Creating Counters
The Counter() class takes an iterable object as an argument and computes the count for each element in the object and presents it as a key value pair.
>>>from collections import Counter
>>>listOfInts=[1,2,3,4,1,2,3,1,2,1]
>>>cnt=Counter(listOfInts)
>>>print(cnt)
Counter({1: 4, 2: 3, 3: 2, 4: 1})
In the above code snippet, listOfInts is a list which contains numbers. It is passed to Counter() and we got cnt, which is a container object. The cnt is a dictionary, which contains the unique numbers present in the given list as keys, and their respect counts as the value.
3.2 Accessing Counters
Counter is a subclass of dictionary. So it can be accessed the same as dictionary. The "cnt" can be handled as a regular dictionary object.
>>> cnt.items()
dict_items([(1, 4), (2, 3), (3, 2), (4, 1)])
>>> cnt.keys()
dict_keys([1, 2, 3, 4])
>>> cnt.values()
dict_values([4, 3, 2, 1])
3.3 Interesting Methods & Usecases
3.3.1 most_common
The most_common(n) of Counter class, provides most commonly occurring keys. The n is used as a rank, for example, n = 2 will provide top two keys.
>>>name = "Saravanan Subramanian"
>>>letterCnt=Counter(name)
>>>letterCnt.most_common(1)
[('a', 7)]
>>>letterCnt.most_common(2)
[('a', 7), ('n', 4)]
>>>letterCnt.most_common(3)
[('a', 7), ('n', 4), ('r', 2)]
In the above code, we could see that the string is parsed as independent characters as keys and their respective count is stored as values. So, the letterCnt.most_common(1) provides the top letter which has highest occurrences.
3.3.2 Operations on Counter
The Counter() subclass is also called a Multiset. It supports addition, subtraction, uniting, and intersection operations on the Counter class.
>>> a = Counter(x=1,y=2,z=3)
>>> b = Counter(x=2,y=3,z=4)
>>> a+b
Counter({'z': 7, 'y': 5, 'x': 3})
>>> a-b #This will result in negative values & will be omitted
Counter()
>>> b-a
Counter({'y': 1, 'x': 1, 'z': 1})
>>> a & b #Chooses the minimum values from their respective pair
Counter({'z': 3, 'y': 2, 'x': 1})
>>> a | b #Chooses the maximum values from their respective pair
Counter({'z': 4, 'y': 3, 'x': 2})
4. Default Dictionary
The defaultdict() is available as part of the collections library. It allows the user to specify a function to be called when a key is not present in the dictionary. In a standard dictionary, accessing an element where the key is not present will raise Key Error. So, this is a problem when working with collections (list, set, etc.), especially while creating them. So, when a dictionary is queried for a key, which does not exist, the function passed as an argument to the named argument "default_dictionary" of default_dict() will be called to set a value for a given "key" into the dictionary.
4.1 Creating Default Dictionary
The defaultdict() is available as part of collections library. The default dict takes a function without argument which returns value as an argument.
4.1.1 Example 1
>>>
>>> booksIndex = defaultdict(lambda:'Not Available')
>>> booksIndex['a']='Arts'
>>> booksIndex['b']='Biography'
>>> booksIndex['c']='Computer'
>>> print(booksIndex)
defaultdict(<function at 0x030EB3D8>, {'c': 'Computer', 'b': 'Biography', 'a': 'Arts'})
>>> booksIndex['z']
'Not Available'
>>> print(booksIndex)
defaultdict(<function at 0x030EB3D8>, {'c': 'Computer', 'b': 'Biography', 'z': 'Not Available', 'a': 'Arts'})
>>>
In the above example, the booksIndex is a defaultdict, where it sets Not Available as a value if any non-existent key is accessed. We have added values for keys a, b, and c into the defaultdict. The print(booksIndex) shows that the defaultdict contains values only for these keys. While trying to access the value for key 'z', which we have not set, it returns the value as Not Available' and updates the dictionary.
4.1.2 Example 2
>>> titleIndex = [('a','Arts'),('b','Biography'),('c','Computer'),('a','Army'),('c','Chemistry'),('d','Dogs')]
>>> rackIndices = defaultdict(list)
>>> for id,title in titleIndex:
rackIndices[id].append(title)
>>> rackIndices.items()
dict_items([('d', ['Dogs']), ('b', ['Biography']), ('a', ['Arts', 'Army']), ('c', ['Computer', 'Chemistry'])])
>>>
In the above example, titleIndex contains a list of tuples. We want to aggregate this list of tuples to identify titles for each alphabetical. So, we can have a dictionary where key is the alphabetical and the value is the list of titles. Here we use a defaultdict with "list" as a function to be called for missing elements. So for each new element a list will be called, and it will create an empty list object. The consecutive append() methods on the list will add elements to the list.
5. Ordered Dictionary
The ordered dictionary maintains the order of elements added into the dictionary, where the standard dictionary will not maintain the order of inclusion.
5.1 Ordered Dictionary Creation
Ordered Dictionary is created using OrderedDict() from collections library. It's a subsclass of regular dictionary, so it inherits all other methods and behaviors of a regular dictionary.
>>> from collections import OrderedDict
>>> dOrder=OrderedDict()
>>> dOrder['a']='Alpha'
>>> dOrder['b']='Bravo'
>>> dOrder['c']='Charlie'
>>> dOrder['d']='Delta'
>>> dOrder['e']='Echo'
>>> dOrder
>>> OrderedDict([('a', 'Alpha'), ('b', 'Bravo'), ('c', 'Charlie'), ('d', 'Delta'), ('e', 'Echo')])
>>> >>> dOrder.keys()
odict_keys(['a', 'b', 'c', 'd', 'e'])
>>> dOrder.values()
odict_values(['Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo'])
>>> dOrder.items()
odict_items([('a', 'Alpha'), ('b', 'Bravo'), ('c', 'Charlie'), ('d', 'Delta'), ('e', 'Echo')])
>>>
5.2 Creating from Other Iterable Items
OrderedDict can also be created by passing a dictionary or a list of key, value pair tuples.
>>> from collections import OrderedDict
>>> listKeyVals = [(1,"One"),(2,"Two"),(3,"Three"),(4,"Four"),(5,"Five")]
>>> x = OrderedDict(listKeyVals)
>>> x
OrderedDict([(1, 'One'), (2, 'Two'), (3, 'Three'), (4, 'Four'), (5, 'Five')])
>>>
5.3 Sort and Store
One of the interesting use cases for OrderedDict is a rank problem. For example, consider the problem of a dictionary containing students' names and their marks, now we have to find out the best student and rank them according to their marks. So, OrderedDict is the right choice here. Since OrderedDict will remember the order or addition and sorted() will sort a dictionary we can combine both to created a rank list based on the student marks. Check the example below:
>>> studentMarks={}
>>> studentMarks["Saravanan"]=100
>>> studentMarks["Subhash"]=99
>>> studentMarks["Raju"]=78
>>> studentMarks["Arun"]=85
>>> studentMarks["Hasan"]=67
>>> studentMarks
{'Arun': 85, 'Subhash': 99, 'Raju': 78, 'Hasan': 67, 'Saravanan': 100}
>>> sorted(studentMarks.items(),key=lambda t:t[0])
[('Arun', 85), ('Hasan', 67), ('Raju', 78), ('Saravanan', 100), ('Subhash', 99)]
>>> sorted(studentMarks.items(),key=lambda t:t[1])
[('Hasan', 67), ('Raju', 78), ('Arun', 85), ('Subhash', 99), ('Saravanan', 100)]
>>> sorted(studentMarks.items(), key = lambda t:-t[1])
[('Saravanan', 100), ('Subhash', 99), ('Arun', 85), ('Raju', 78), ('Hasan', 67)]
>>> rankOrder = OrderedDict(sorted(studentMarks.items(), key = lambda t:-t[1]))
>>> rankOrder
OrderedDict([('Saravanan', 100), ('Subhash', 99), ('Arun', 85), ('Raju', 78), ('Hasan', 67)])
In the above example, studentMarks is a dictionary contains the student name as a key and their mark as the value. It got sorted using its value and passed to OrderedDict and got stored in rankOrder. Now rankOrder contains the highest marked student as the first entry, and next highest as the second entry, and so on. This ordered is preserved in this dictionary.
6. Deque
Deque means double-ended queue and it pronounced like "deck." It is an extension to the standard list data structure. The standard list allows the user to append or extend elements only at the end. But deque allows the user to operate on both ends so that the user can implement both stacks and queues.
6.1 Creating & Performing Operations on Deque
The deque() is available in collections library. It takes iterative entity as an argument and an optional maximum length. If maxlen is set, it ensures that deque length does not exceed the size of the maxlen.
>>> from collections import deque
>>> aiao = deque([1,2,3,4,5],maxlen=5)
aiao = deque([1,2,3,4,5])
>>> aiao.append(6)
>>> aiao
deque([2, 3, 4, 5, 6], maxlen=5)
>>> aiao.appendleft(1)
>>> aiao
deque([1, 2, 3, 4, 5], maxlen=5)
In the above example, we have created a deque with maxlen 5, once we appended 6th element on the right, it pushed first element on the left. Similarly, it pushes out the last element on the right when we append element on the left.
6.2 Operations on Right
Operations on the right are common to performing any operations on the list. The methods append(), extend(), and pop() are operate on the right side of the deque().
>>> aiao.append(6)
>>> aiao
deque([2, 3, 4, 5, 6], maxlen=5)
>>> aiao.extend([7,8,9])
>>> aiao
deque([5, 6, 7, 8, 9], maxlen=5)
>>> aiao.pop()
9
6.3 Operation on the Left
The special feature of performing operations on the left is supported by a set of methods like appendleft(), extendleft(), and popleft().
>>> aiao = deque([1,2,3,4,5],maxlen=5)
>>> aiao.appendleft(0)
>>> aiao
deque([0, 1, 2, 3, 4], maxlen=5)
>>> aiao.extendleft([-1,-2,-3])
>>> aiao
deque([-3, -2, -1, 0, 1], maxlen=5)
>>> aiao.popleft()
-3
6.4 Example 2 (Without maxlen)
If the maxlen value is not set, the deque does not perform any trimming operations to maintain the size of the deque.
>>> aiao = deque([1,2,3,4,5])
>>> aiao.appendleft(0)
>>> aiao
deque([0, 1, 2, 3, 4, 5])
>>> aiao.extendleft([-1,-2,-3])
>>> aiao
deque([-3, -2, -1, 0, 1, 2, 3, 4, 5])
>>>
From the above example, the deque aiao continues to grow for the append and extend operations performed on it.
7. ChainMap
ChainMap allows to combine multiple dictionaries into a single dictionary so that operations can be performed on a single logical entity. The ChainMap() does not create any new dictionary, instead it maintains references to the original dictionaries, all operations are performed only on the referred dictionaries.
7.1 Creating ChainMap
>>> from collections import ChainMap
>>> x = {'a':'Alpha','b':'Beta','c':'Cat'}
>>> y = { 'c': "Charlie", 'd':"Delta", 'e':"Echo"}
>>> z = ChainMap(x,y)
>>> z
ChainMap({'c': 'Cat', 'b': 'Beta', 'a': 'Alpha'}, {'d': 'Delta', 'c': 'Charlie', 'e': 'Echo'})
>>> list(z.keys())
['b', 'd', 'c', 'e', 'a']
>>> list(z.values())
['Beta', 'Delta', 'Cat', 'Echo', 'Alpha']
>>> list(z.items())
[('b', 'Beta'), ('d', 'Delta'), ('c', 'Cat'), ('e', 'Echo'), ('a', 'Alpha')]
We have created ChainMap z from other dictionaries x & y. The ChainMap z is referenced to the dictionaries x and y. ChainMap will not maintain duplicate keys, it returns present value 'Cat' for key 'c'. So, basically, it skips the second occurrence of the same key.
>>> x
{'c': 'Cat', 'b': 'Beta', 'a': 'Alpha'}
>>> y
{'d': 'Delta', 'c': 'Charlie', 'e': 'Echo'}
>>> x.pop('c')
'Cat'
>>> x
{'b': 'Beta', 'a': 'Alpha'}
>>> list(z.keys())
['d', 'c', 'b', 'e', 'a']
>>> list(z.values())
['Delta', 'Charlie', 'Beta', 'Echo', 'Alpha']
>>> list(z.items())
[('d', 'Delta'), ('c', 'Charlie'), ('b', 'Beta'), ('e', 'Echo'), ('a', 'Alpha')]
>>>
In the above code, we have removed the key 'c' from dict x. Now the ChainMap points the value for key 'c' to "Charlie", which is present in y.
8. Summary
We have seen various Python collection data types and understand them with example and use cases. The official Python documentation can be referred for further reading.
9. References
[1] - Python Wiki - https://docs.Python.org/3.5/library/collections.html
Published at DZone with permission of Saravanan Subramanian, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments