
Getting the Most Out of Your Training Data With Support Vector Machines

Learn how to create a machine learning model that can automatically classify your data and how to efficiently use your training data.


In many cases, the acquisition of well-labeled training data is a huge hurdle for developing accurate prediction systems with supervised learning. At Love the Sales, we aggregate sales products from over 700 multinational retailers, which results in over two million products a day that need classification. It could take a traditional merchandising team four years to complete this task manually. Our requirement was to classify the textual metadata of these two million products (mostly fashion and homeware) into 1,000+ different categories arranged in a hierarchy.

Support Vector Machines

For our classification task, we decided to use support vector machines (SVMs). SVMs are a class of supervised machine learning algorithms that are proficient at classifying linearly separable data. Essentially, given a large enough set of labeled training data, an SVM attempts to learn the best discriminative plane between the examples, trying to draw a multidimensional line in the sand.

For example, here are some possible ways to separate this dataset:


The SVM will attempt to learn the optimal hyperplane:


Images from opencv.org

While there are many machine learning algorithms for classification (e.g. neural networks, random forests, naive Bayes), SVMs really shine for data with many features. In our case, that means document classification, wherein each "word" is treated as a discrete feature.

While SVMs can classify into multiple classes directly, we opted to chain simple two-class SVMs together in a hierarchy.

The main reasons for this were that, when we tried it, it yielded better results and, importantly, used a lot less memory on our machine learning platform, because each SVM only had to know about two classes of data. High memory utilization for large datasets (300k+ examples) and large input vectors (1m different known words) was a definite obstacle for us.

Some simple well-known techniques for pre-processing our documents also helped a great deal in reducing the feature space, such as lowercasing, stemming, stripping odd characters, and removing "noisy" words and numbers.

Stemming is a common language-specific technique, useful when dealing with large corpora of textual data, whose goal is to take different words with a similar meaning and root and "crunch" them down to similar tokens. For example, the words "clothing" and "clothes" have very similar meanings; when each is stemmed through the common Porter stemming algorithm, the result is "cloth." In doing this, we halve the number of words we have to worry about. Using stemming in conjunction with noisy-word removal (the removal of common words without any domain meaning, e.g. "the," "is," "and," "with"), we can hope to see a large reduction in the number of words we have to deal with.
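The pipeline above can be sketched in a few lines of Python. Note that this is a minimal illustration: the stop-word list is a small stand-in for a real noise list, and the crude suffix-stripper stands in for the actual Porter stemming algorithm.

```python
# Toy pre-processing pipeline: lowercase, strip odd characters,
# drop noisy words and numbers, then stem.
# The stop-word list and suffix rules are illustrative stand-ins,
# not the real noise list or the Porter algorithm.

STOP_WORDS = {"the", "is", "and", "with", "in", "of", "this", "you", "a"}
SUFFIXES = ["ing", "es", "s"]  # checked longest-first

def toy_stem(word: str) -> str:
    # Strip the first matching suffix, but keep at least a 3-letter root.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list:
    # Replace non-alphanumeric characters with spaces, then tokenize.
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in text.lower())
    tokens = cleaned.split()
    return [
        toy_stem(t)
        for t in tokens
        if len(t) > 2 and t not in STOP_WORDS and not t.isdigit()
    ]

print(toy_stem("clothing"))  # cloth
print(toy_stem("clothes"))   # cloth
print(preprocess("Men, you'll look fantastic in this great pair of men's skinny jeans."))
# ['men', 'look', 'fantastic', 'great', 'pair', 'men', 'skinny', 'jean']
```

With a fuller noise list, generic words like "look" would be removed as well, giving the condensed output used in the example below.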

Creating SVMs

Once you have pre-processed your textual data, the next step is to train your model. In order to do this, you must first transform your textual data into a format the SVM can understand. This is known as vectorization. Take the sentence:

"Men, you’ll look fantastic in this great pair of men's skinny jeans."

After pre-processing it as described above, stemming and removing the words we don't care about, we get:

"men fantastic great pair men skinny jean"

To take just the words from the above example, we can see we have one repeated word, so we could encode the data like so:

fantastic: 1
great: 1
jean: 1
men: 2
pair: 1
skinny: 1

This can be represented in a vector:

[1, 1, 1, 2, 1, 1]
This is fine for a small set of terms (only one short sample in the example above). However, as we add more samples and more terms, our vocabulary increases. For example, if we add another training sample that isn't men's skinny jeans:

"women bootcut acid wash jean"

...we need to increase the overall vocabulary the algorithm must know about, i.e.:

acid, bootcut, fantastic, great, jean, men, pair, skinny, wash, women
This means that our new term vector for the initial men's skinny jeans example has to change to:

[0, 0, 1, 1, 1, 2, 1, 1, 0, 0]
When dealing with thousands of samples, your vocabulary can become large and can become quite cumbersome, as the encoded training samples become mostly empty and very long:

[0,0,0,0,0,0,0,0,..... 2,0,0,0,0,0,.....1,0,0,0,0 …]

Thankfully, many machine learning libraries allow you to encode your term vectors as sparse vectors — which means you only need to supply non-zero cases, and the library (in our case, LibSVM) will magically figure it out for you and fill in the gaps.

In such a case, you provide the term vectors and the classes they represent as term indices relative to the entire vocabulary for all the training samples you wish to use. For example:

Term index #0: acid
Term index #1: bootcut
Term index #2: fantastic
Term index #3: great
Term index #4: jean
Term index #5: men
Term index #6: pair
Term index #7: skinny
Term index #8: wash
Term index #9: women
So, you would describe these words:

"men fantastic great pair men skinny jean"


Term index #2: 1 occurrence
Term index #3: 1 occurrence
Term index #4: 1 occurrence
Term index #5: 2 occurrences
Term index #6: 1 occurrence
Term index #7: 1 occurrence

Which can then be encoded succinctly as:

2:1 3:1 4:1 5:2 6:1 7:1
Alexandre Kowalczyk has a great description of vocabulary preparation here, along with other great SVM tutorials.
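The whole encoding step can be sketched as a minimal bag-of-words vectorizer. Assigning vocabulary indices alphabetically and zero-based is an assumption, but it reproduces the sparse term indices shown in the worked example above.

```python
# Minimal bag-of-words encoder: build a vocabulary over all training
# documents, then emit dense and LibSVM-style sparse term vectors.
from collections import Counter

def build_vocab(docs):
    # term -> index over the combined vocabulary, alphabetical and zero-based
    return {term: i for i, term in enumerate(sorted({t for d in docs for t in d}))}

def to_dense(doc, vocab):
    counts = Counter(doc)
    vector = [0] * len(vocab)
    for term, n in counts.items():
        vector[vocab[term]] = n
    return vector

def to_sparse(doc, vocab):
    # "index:count" pairs for non-zero entries only, sorted by index
    counts = Counter(doc)
    return " ".join(f"{i}:{n}" for i, n in sorted((vocab[t], n) for t, n in counts.items()))

d1 = "men fantastic great pair men skinny jean".split()
d2 = "women bootcut acid wash jean".split()
vocab = build_vocab([d1, d2])

print(to_dense(d1, vocab))   # [0, 0, 1, 1, 1, 2, 1, 1, 0, 0]
print(to_sparse(d1, vocab))  # 2:1 3:1 4:1 5:2 6:1 7:1
```

The sparse form stays compact no matter how large the vocabulary grows, which is exactly why libraries like LibSVM accept it.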

Hierarchy and Data Structures

A key learning for us is that the way these SVMs are structured can actually have a significant impact on how much training data has to be applied. For example, a naive approach would have been as follows:


This approach requires that for every additional sub-category, two new SVMs be trained. For example, the addition of a new class for swimwear would require an additional SVM under men's and women's — not to mention the potential complexity of adding a unisex class at the top level. Overall, deep hierarchical structures can be too rigid to work with.

We were able to avoid a great deal of labeling and training work by flattening our data structures into many sub-trees, like so:


By decoupling our classification structure from the final hierarchy, it is possible to generate the final classification by traversing the SVM hierarchy with each document and interrogating the results with simple set-based logic such as:

Men's slim-fit jeans = (men's and jeans and slim fit) and not women's

This approach vastly reduces the number of SVMs required to classify documents, as the resultant sets can be intersected to represent the final classification.
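The set-based logic above maps directly onto ordinary set operations: each SVM contributes the set of products it classified positively, and final categories are intersections and differences of those sets. A sketch, with made-up product IDs:

```python
# Each set holds the IDs of products a two-class SVM accepted.
# The IDs here are purely illustrative.
mens     = {101, 102, 103}
womens   = {104, 105}
jeans    = {101, 102, 104}
slim_fit = {101, 104}

# Men's slim-fit jeans = (men's AND jeans AND slim fit) AND NOT women's
mens_slim_jeans = (mens & jeans & slim_fit) - womens
print(mens_slim_jeans)  # {101}
```

No new SVM is needed for "men's slim-fit jeans"; the category falls out of the existing result sets.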


It should also now be evident that adding new classes multiplies the number of final categories. For example, adding a top-level children's class would immediately allow the creation of an entire dimension of new children's categories (children's jeans, shirts, bathing suits, etc.) with minimal additional training data (only one additional SVM):


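To make the multiplication concrete, here is a toy count. The class lists are hypothetical; the point is that final categories are combinations of one class per independent sub-tree, so their number is a product:

```python
# Final categories are combinations of one class per sub-tree,
# so adding one class to a sub-tree multiplies the total.
from itertools import product

audience = ["men's", "women's"]
garment  = ["jeans", "shirts", "swimwear"]

before = len(list(product(audience, garment)))                     # 2 x 3 = 6
after  = len(list(product(audience + ["children's"], garment)))    # 3 x 3 = 9
print(before, after)  # 6 9
```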
Data Reuse

Because of the structure we chose, one key insight we were able to leverage was the re-use of training data via linked data relationships. Linking data enabled us to re-use our training data by an overall factor of 9x, massively reducing the cost of labeling and increasing the accuracy of predictions.

For each individual class, we obviously want as many training data examples as possible, covering both possible outcomes. Even though we built some excellent internal tooling — primarily, a nice fast user interface for searching, sorting, and labeling training data examples in large batches — labeling thousands of examples of each kind of product can still be laborious, costly, and error-prone. We determined that the best way to circumvent these issues was to attempt to reuse as much training data as we could across classes.

For example, given some basic domain knowledge of the categories, we know that washing machines can never be carpet cleaners:


By adding the ability to link excluded data, we can heavily bolster the number of negative training examples for the washing machines SVM by adding to it the positive training data from the carpet cleaners SVM. Put more simply, given that we know carpet cleaners can never be washing machines, we may as well reuse that training data.

This approach has a nice side effect: whenever the need arises to add additional training data to improve the carpet cleaners SVM, it also improves the washing machines class via the linked negative data.

Finally, another chance for reuse that is apparent when considering a hierarchy is that the positive training data for any child nodes is also always positive training data for its parent.

For example, "jeans are always clothing" looks like:


This means that for every positive example of training data added to the jeans SVM, an additional positive example is also added to the clothing SVM via the link.

Adding linked data is far more efficient than manually labeling thousands of examples.
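Both kinds of link can be sketched with plain dictionaries. The class names, sample IDs, and link tables below are illustrative, not our actual schema:

```python
# Toy model of training-data reuse via links:
#  - positives for a child class also count as positives for its parent
#  - positives for an excluded class count as negatives
positives = {
    "jeans":            ["sample_j1", "sample_j2"],
    "carpet cleaners":  ["sample_c1"],
    "washing machines": ["sample_w1"],
}
parent_links  = {"jeans": "clothing"}                      # child -> parent
exclude_links = {"washing machines": ["carpet cleaners"]}  # class -> excluded classes

def training_set(cls):
    pos = list(positives.get(cls, []))
    # child positives propagate up to the parent class
    pos += [s for child, parent in parent_links.items() if parent == cls
            for s in positives[child]]
    # positives of excluded classes become extra negatives
    neg = [s for ex in exclude_links.get(cls, []) for s in positives[ex]]
    return pos, neg

print(training_set("clothing"))          # (['sample_j1', 'sample_j2'], [])
print(training_set("washing machines"))  # (['sample_w1'], ['sample_c1'])
```

Every sample labeled once can thus feed several SVMs, which is where the overall 9x reuse factor comes from.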



Support vector machines have helped us reach a quality and speed of classification that we could never have achieved with a non-machine-learning approach. We have come to see support vector machines as an excellent addition to any developer's toolbox, and investigating them serves as a nice introduction to some key machine learning concepts.

Additionally, when it comes to the specifics of hierarchical classification systems, decoupling the classification component from the resulting hierarchy, flattening the data structure, and enabling the reuse of training data will all be beneficial in gaining as much efficiency as possible. The approaches outlined above have not only helped reduce the amount of training data we needed to label but have also given us greater flexibility overall.

machine learning, ai, tutorial, classification, svm

Opinions expressed by DZone contributors are their own.
