Speeding Up Sparse Matrix Manipulations
Speeding Up Sparse Matrix Manipulations
Many of the matrices we use in machine learning are very sparse, yet our algorithms still must iterate over all of the data values. Here's a new tool, TACO, that speeds up these manipulations by as much as 100 times.
Join the DZone community and get the full member experience.Join For Free
Most of us have had to deal with matrix arithmetic at some point in our lives. Now with the huge focus on machine learning techniques applied to huge data sets (a.k.a. giant matrices) it's become apparent that there are some computational bottlenecks. One recurrent issue is that many of the largest data sets can be classed as very sparse tensors. [Note: Tensor is just a more general term for multidimensional matrices. A matrix has rows and columns and can be laid out as a square. Tensors just have more dimensions and can be a cube or a Tesseract or any higher dimensionality. I will use the term tensor for the remainder of this article.]
Imagine a tensor that represents customer ID versus all of the products sold by the company. At the intersection of each customer ID and each product ID there is a "1" and everywhere else is a "0". A tensor like this is very sparse. But if we considered the customer user rating of the product for every product purchased the tensor becomes even sparser. Now, suppose you are going to add this tensor to another tensor. A vast amount of work would be done accessing all the zeros and adding them to the other tensor: the net result is nothing happens. All that work was just a no-op. Those of you that have dealt with this problem in code most likely created a specialized algorithm to scan for these time wasting no-op's and skip over them. But that usually takes a significant amount of work and presents risks of errors (e.g. edge cases), even so, we often do the work and take the risks because the computational inefficiencies would otherwise not be acceptable.
Recently some researchers from MIT, the French Alternative Energies and Atomic Energy Commission, and Adobe Research described a programmatic system that automatically produces specialized code to deal with sparse data. It essentially compiles a new version of the tensor arithmetic algorithm to deal with specific sparse input data. This research was presented at the ACM conference on Systems, Programming, Languages and Applications: Software for Humanity (SPLASH). They have named this tool the Tensor Algebra COmpiler (TACO). [Do these two acronyms conspire to whet your appetite?]
The paper is here.
TACO generates a template or specific "kernel" that addresses the dimensionality and sparseness of the input tensors. Additionally, it addresses another computational inefficiency that often arises with tensor arithmetic. When doing simple arithmetic in a simple program it's often not apparent that intermediate results are computed and operated on over the scope of the equation. (A + B)*C-D generates intermediate results to be used by the following operators. This isn't much of an issue for single values, but if A, B, C, D are huge and potentially sparse matrices then how one manages these intermediate values is not only a matter of time but also of storage space for those values. And this is where the second key feature of TACO comes to bear: where possible the follow-on operations in the computation can be done on individual data values thus saving a great deal on maintaining intermediate complete tensor versions.
Senior author on this paper Saman Amarasinghe, an MIT professor of electrical engineering and computer science (EECS), pointed out:
“Sparse representations have been around for more than 60 years...” Amarasinghe went on to explain that “... nobody knew how to generate code for them automatically. People figured out a few very specific operations e.g. sparse matrix-vector multiply, sparse matrix-vector multiply plus a vector, sparse matrix-matrix multiply, sparse matrix-matrix-matrix multiply. The biggest contribution we make is the ability to generate code for any tensor-algebra expression when the matrices are sparse.”
Other co-authors on the paper are Fredrik Kjolstad, an MIT graduate student in EECS; Stephen Chou, a graduate student in EECS; David Lugato of the French Alternative Energies and Atomic Energy Commission; and Shoaib Kamil of Adobe Research.
This paper points out that for efficient operation on massive data sets each sequence of operations on the tensor demands a specific computational template. This template they refer to as a "kernel" and it is specific for the type of input tensors as well as the operations performed (and the order in which they are performed). Kjolstad said:
“If you do it in one kernel, you can do it all at once, and you can make it go faster, instead of having to put the output in memory and then read it back in so that you can add it to something else,” Kjolstad says. “You can just do it in the same loop.”
Because tensors can take any size and shape, and because they can be strung together in an arbitrary string of operations it becomes obvious that there are an infinite number of "kernels." And so, the only solution to this problem was to "compile" the small sections of code that the human programmer would insert to skip over adding zeros to things (because it makes no difference) or multiplying by zero (because it's always zero) or multiplying by one (because it makes no difference). So, a specific section of program code is generated to process specific matrices through a specific set of operations. Saday Sadayappan (not an author) a professor of computer science and engineering at Ohio State University said, “Many research groups over the last two decades have attempted to solve the compiler-optimization and code-generation problem for sparse-matrix computations but made little progress.”
The authors are willing to admit that computer science researchers have developed "kernels" for some of the more commonly used tensor operations that are embedded in machine learning algorithms, but for anyone tweaking the tensor operations for new methodologies they were on their own. They had to hand code their own kernels. TACO frees up the developer to focus on the high-level algorithms while not forgoing the efficiencies gained by managing the degenerate cases (adding zeros and multiplying by one).
Saday Sadayappan went on to say:
“The recent developments from Fred and Saman represent a fundamental breakthrough on this long-standing open problem ... their compiler now enables application developers to specify very complex sparse matrix or tensor computations in a very easy and convenient high-level notation, from which the compiler automatically generates very efficient code.” he continues. “For several sparse computations, the generated code from the compiler has been shown to be comparable or better than painstakingly developed manual implementations. This has the potential to be a real game-changer. It is one of the most exciting advances in recent times in the area of compiler optimization.”
We all like good tools. This one needs to be in the fast lane for immediate adoption.
Opinions expressed by DZone contributors are their own.