Distinguish Pop Music from Heavy Metal Using Apache Spark MLlib

In this introductory post, the author uses Apache Spark MLlib to distinguish pop music from heavy metal, so you can learn a basic NLP pipeline and have some fun.

Machine Learning for Java Engineers?

Machine learning is overhyped nowadays. There is a strong belief that this area is exclusively for data scientists with a deep mathematical background who leverage the Python (scikit-learn, Theano, TensorFlow, etc.) or R ecosystems and use specialized tools like RStudio, MATLAB, Octave, or similar. Obviously, there is a big grain of truth in this belief, but Java engineers can also get the best of the machine learning world, from an applied perspective, by using our native language and familiar frameworks like Apache Spark.

Goal

In this introductory post, I will use Apache Spark MLlib to distinguish pop music from heavy metal, so that you can learn a basic NLP pipeline and, I hope, simply have fun.

Hypothesis

My initial hypothesis was that genres differ not only in the music itself but in the lyrics too. So my simple NLP task can be formulated as follows: recognize the genre of a given verse.

For instance, these eight lines of lyrics:

“I'm a rolling thunder, a pouring rain

I'm comin' on like a hurricane

My lightning's flashing across the sky

You're only young but you're gonna die

I won't take no prisoners, won't spare no lives

Nobody's putting up a fight

I got my bell, I'm gonna take you to hell

I'm gonna get you, Satan get you”

should be recognized as metal, because these lines come from AC/DC's famous "Hells Bells".

For simplicity, I decided to use only two genres, pop and metal, but the approach described below can easily be extended to support more musical styles, e.g. blues, rap, etc.

Approach 

The roadmap for implementation was pretty straightforward:

  • Collect the raw data set of the lyrics (~65k sentences in total):
    • Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
    • Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
  • Create a training set, i.e. label (0 for metal | 1 for pop) + features (represented as vectors of doubles)
  • Train a logistic regression model, an obvious choice for binary classification

So, with even a small labeled dataset, my initial hypothesis turned into a simple supervised machine learning pipeline.

Supervised ML Pipeline

[Figure: supervised ML pipeline with (a) training and (b) prediction phases]

  • (a) During training, a feature extractor is used to convert each input value to a feature set. The main difficulty here is extracting features (which must be numeric vectors) from text. Pairs of feature sets and labels are fed to the machine learning algorithm to generate a model; in our case, a logistic regression model.
  • (b) During prediction, the same feature extractor is used to convert unseen lyrics to feature sets. These feature sets are then fed to the model, which generates predicted labels: 0 for metal, 1 for pop music.

I decided to use Apache Spark MLlib because it has handy pipeline features, which were introduced in version 1.2.0. In addition, MLlib provides a big set of feature extractors and ML algorithms (including the required logistic regression) out of the box. So let's move to the next section.

Apache Spark MLlib

Pipeline

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators). These stages are run in order, and the input Dataset is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the Dataset. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is called on the Dataset. Schematically, a generic MLlib pipeline can be represented like this:

[Figure: generic MLlib pipeline]
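
To make the Transformer/Estimator distinction concrete, here is a minimal plain-Java sketch of the idea. It deliberately does not use Spark: the names ToyPipeline, Stage, lowercase() and knownWordsOnly() are hypothetical, and a "dataset" is assumed to be just a list of words. It only illustrates that fitting a pipeline turns every Estimator into a Transformer and that the fitted pipeline replays all stages in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Toy model of the Transformer/Estimator idea. These are NOT the Spark interfaces. */
final class ToyPipeline {

    /** Marker for anything that can be a pipeline stage. */
    interface Stage {}

    /** A transformer maps a dataset (here simply a list of words) to a new one. */
    interface Transformer extends Stage {
        List<String> transform(List<String> data);
    }

    /** An estimator learns from a dataset and returns a fitted transformer. */
    interface Estimator extends Stage {
        Transformer fit(List<String> data);
    }

    /** Fits the stages in order; each stage sees the output of the previous one. */
    static Transformer fit(List<Stage> stages, List<String> trainingData) {
        List<Transformer> fitted = new ArrayList<>();
        List<String> current = trainingData;
        for (Stage stage : stages) {
            Transformer transformer = (stage instanceof Estimator)
                    ? ((Estimator) stage).fit(current)
                    : (Transformer) stage;
            fitted.add(transformer);
            current = transformer.transform(current);
        }
        // The fitted pipeline is itself a transformer that replays every stage.
        return data -> {
            List<String> result = data;
            for (Transformer transformer : fitted) {
                result = transformer.transform(result);
            }
            return result;
        };
    }

    /** Hypothetical transformer: lowercases every word. */
    static Transformer lowercase() {
        return data -> data.stream().map(String::toLowerCase).collect(Collectors.toList());
    }

    /** Hypothetical estimator: learns the vocabulary, then masks unseen words. */
    static Estimator knownWordsOnly() {
        return trainingData -> {
            Set<String> vocabulary = new HashSet<>(trainingData);
            return data -> data.stream()
                    .map(word -> vocabulary.contains(word) ? word : "<unk>")
                    .collect(Collectors.toList());
        };
    }

    public static void main(String[] args) {
        List<Stage> stages = Arrays.asList(lowercase(), knownWordsOnly());
        Transformer model = fit(stages, Arrays.asList("Hell", "Bell", "hell"));
        System.out.println(model.transform(Arrays.asList("HELL", "Love"))); // prints [hell, <unk>]
    }
}
```

In real MLlib code the same roles are played by Pipeline.fit(), which returns a PipelineModel, and by PipelineModel.transform().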

Feature Extraction

The most interesting part of feature extraction is mapping each word or verse to a unique fixed-size vector.

For that purpose, I used Word2Vec, which computes distributed vector representations of words. The main advantage of distributed representations is that similar words are close to each other in the vector space, which makes generalization to novel patterns easier and model estimation more robust.

Apache Spark MLlib provides Word2Vec out-of-the-box. It is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The Word2VecModel transforms each verse into a vector using the average of all words in the verse. This numeric vector can then be used as features for prediction.
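
The averaging step can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Spark's code: the class name VerseVectors, the tiny two-dimensional word vectors, and the choice to treat unknown words as zero vectors are all assumptions of this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class VerseVectors {

    /**
     * Averages the word vectors of a verse into one fixed-size vector.
     * Assumption of this sketch: unknown words contribute a zero vector.
     */
    static double[] average(List<String> verse, Map<String, double[]> wordVectors, int vectorSize) {
        double[] result = new double[vectorSize];
        if (verse.isEmpty()) {
            return result;
        }
        for (String word : verse) {
            double[] vector = wordVectors.get(word);
            if (vector == null) {
                continue; // word was never seen during training
            }
            for (int i = 0; i < vectorSize; i++) {
                result[i] += vector[i];
            }
        }
        for (int i = 0; i < vectorSize; i++) {
            result[i] /= verse.size();
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical 2-dimensional word vectors; real ones have 100+ dimensions.
        Map<String, double[]> wordVectors = new HashMap<>();
        wordVectors.put("thunder", new double[] {1.0, 3.0});
        wordVectors.put("rain", new double[] {3.0, 5.0});

        double[] features = average(Arrays.asList("thunder", "rain"), wordVectors, 2);
        System.out.println(Arrays.toString(features)); // prints [2.0, 4.0]
    }
}
```

Whatever the length of the verse, the result always has vectorSize entries, which is what allows it to be fed to logistic regression as a feature vector.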

Model Selection and Hyperparameter Tuning

An important task in ML is model selection, i.e. using data to find the best model or parameters for a given task. In our case, the parameters might be the vector size for Word2Vec, the number of sentences combined into a verse (each verse is treated as one training example), the maximum number of iterations for logistic regression, etc. Tuning may be done for individual Estimators such as LogisticRegression, or for entire Pipelines which include multiple algorithms, featurization, and other steps. Users can tune an entire Pipeline at once, rather than tuning each element in the Pipeline separately.

MLlib supports model selection using tools such as CrossValidator and TrainValidationSplit.

CrossValidator begins by splitting the dataset into a set of folds, which are used as separate training and test datasets. E.g., with k=3 folds, CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular Pipeline, CrossValidator computes the average evaluation metric for the 3 Models produced by fitting the Estimator on the 3 different (training, test) dataset pairs.
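
The k=3 example can be reproduced with a small plain-Java sketch. The class KFoldSketch and its methods are hypothetical, and rows are assigned to folds round-robin here purely for readability; this is an illustration of the splitting and averaging idea, not of how Spark assigns rows to folds.

```java
import java.util.ArrayList;
import java.util.List;

final class KFoldSketch {

    /** Row indices held out for testing in the given fold (round-robin assignment). */
    static List<Integer> testRows(int rowCount, int k, int fold) {
        List<Integer> rows = new ArrayList<>();
        for (int row = 0; row < rowCount; row++) {
            if (row % k == fold) {
                rows.add(row);
            }
        }
        return rows;
    }

    /** Row indices used for training in the given fold: everything not held out. */
    static List<Integer> trainingRows(int rowCount, int k, int fold) {
        List<Integer> rows = new ArrayList<>();
        for (int row = 0; row < rowCount; row++) {
            if (row % k != fold) {
                rows.add(row);
            }
        }
        return rows;
    }

    /** The score of one parameter combination is the mean of its per-fold metrics. */
    static double averageMetric(double[] foldMetrics) {
        double sum = 0.0;
        for (double metric : foldMetrics) {
            sum += metric;
        }
        return sum / foldMetrics.length;
    }

    public static void main(String[] args) {
        int rowCount = 9;
        int k = 3;
        for (int fold = 0; fold < k; fold++) {
            System.out.println("fold " + fold
                    + ": train=" + trainingRows(rowCount, k, fold)
                    + " test=" + testRows(rowCount, k, fold));
        }
    }
}
```

With 9 rows and k=3, every fold trains on 6 rows (2/3) and tests on the remaining 3 (1/3), and every row is used for testing exactly once.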

Custom Pipeline

Eventually, I ended up with a custom Pipeline that combines default MLlib components with my own custom Transformers:

[Figure: custom pipeline; MLlib components in black, custom components in orange]

The components marked in black are available in MLlib out of the box, while those in orange were added by me. The Pipeline starts with reading the raw data and adding labels (0|1), and then continues as follows:

  • Cleanser cleans the data by removing empty lines, commas, apostrophes, etc.
  • Numerator numbers the lines, because line numbers are needed to build correct verses at later stages
  • Tokenizer splits sentences into words
  • StopWordsRemover removes stop words such as "he", "is", "at", "which", "and", "on", etc.
  • Exploder generates as many rows in the Dataset as there are words, i.e. it explodes one row into many. This is needed for Stemmer, which processes one word at a time.
  • Stemmer reduces related words to a common base string, so words like "fishing", "fished", and "fisher" will be reduced to the root word, "fish"
  • Uniter joins the stemmed words back together to produce a line of lyrics again
  • Verser forms verses by combining a specific number of lines, defined via a hyperparameter
  • Word2Vec trains a Word2VecModel on all available words and uses that model to transform verses (text) into features (vectors)
  • LogisticRegression trains a LogisticRegressionModel
  • The resulting LogisticRegressionModels are scored by BinaryClassificationEvaluator, and cross-validation selects the model produced by the best-performing set of parameters (returned as a CrossValidatorModel)
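
As an illustration of what a stage like Verser has to do, here is a plain-Java sketch that groups consecutive lines into verses. It is not the author's Transformer (which operates on a Spark Dataset using the row numbers added by Numerator); the class name VerseGrouping and the handling of a shorter trailing verse are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VerseGrouping {

    /**
     * Joins every sentencesInVerse consecutive lines into one verse.
     * Assumption of this sketch: a shorter trailing group is kept as its own verse.
     */
    static List<String> toVerses(List<String> lines, int sentencesInVerse) {
        if (sentencesInVerse <= 0) {
            throw new IllegalArgumentException("sentencesInVerse must be positive");
        }
        List<String> verses = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += sentencesInVerse) {
            int end = Math.min(start + sentencesInVerse, lines.size());
            verses.add(String.join(" ", lines.subList(start, end)));
        }
        return verses;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "im roll thunder pour rain",
                "im comin like hurrican",
                "wont take prison wont spare live",
                "nobodi put fight",
                "got bell im gonna take hell");
        // Five lines with two lines per verse give three verses.
        toVerses(lines, 2).forEach(System.out::println);
    }
}
```

The number of lines per verse matters because it changes how much text is averaged into each feature vector, which is why it is worth tuning as a hyperparameter.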

The thing I like most about Spark MLlib is that our visual understanding of an ML pipeline translates almost directly into code:

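// Get all lyrics.
Dataset<Row> sentences = getPopMusic(lyricsInputDirectory).union(getMetalMusic(lyricsInputDirectory));
// Reduce the number of partitions and cache, since the data is reused for every fold and parameter set.
sentences = sentences.coalesce(sparkSession.sparkContext().defaultMinPartitions()).cache();
// Trigger an action so that the cache is actually populated.
sentences.count();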

// Remove all punctuation symbols.
Cleanser cleanser = new Cleanser();

// Add id and rowNumber based on it.
Numerator numerator = new Numerator();

// Split into words.
Tokenizer tokenizer = new Tokenizer().setInputCol("clean").setOutputCol("words");

// Remove stop words.
StopWordsRemover stopWordsRemover = new StopWordsRemover().setInputCol("words").setOutputCol("filteredWords");

// Create as many rows as words. This is needed for Stemmer.
Exploder exploder = new Exploder();

// Perform stemming.
Stemmer stemmer = new Stemmer();

// Join stemmed words back into lines.
Uniter uniter = new Uniter();

// Combine lines into verses.
Verser verser = new Verser();

// Create Word2VecModel.
Word2Vec word2Vec = new Word2Vec().setInputCol("verses").setOutputCol("features").setMinCount(0);

LogisticRegression logisticRegression = new LogisticRegression();

// Define pipeline.
Pipeline pipeline = new Pipeline().setStages(
  new PipelineStage[]{
    cleanser,
    numerator,
    tokenizer,
    stopWordsRemover,
    exploder,
    stemmer,
    uniter,
    verser,
    word2Vec,
    logisticRegression});

// Use a ParamGridBuilder to construct a grid of parameters to search over.
ParamMap[] paramGrid = new ParamGridBuilder()
  .addGrid(verser.sentencesInVerse(), new int[]{4, 8, 16})
  .addGrid(word2Vec.vectorSize(), new int[] {100, 200, 300})
  .addGrid(logisticRegression.regParam(), new double[] {0.01D, 0.05D})
  .addGrid(logisticRegression.maxIter(), new int[] {100, 150, 200})
  .build();

CrossValidator crossValidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(10);

// Run cross-validation, and choose the best set of parameters.
CrossValidatorModel model = crossValidator.fit(sentences);
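
It is worth estimating how much work this grid implies before running it. The sketch below is plain arithmetic, assuming one pipeline fit per fold and parameter combination plus one final refit of the best combination on the full dataset; the class and method names are hypothetical.

```java
final class GridSearchCost {

    /** Number of parameter combinations: the product of the value counts per parameter. */
    static int combinations(int... valuesPerParameter) {
        int product = 1;
        for (int count : valuesPerParameter) {
            product *= count;
        }
        return product;
    }

    /** Pipeline fits: one per (combination, fold) pair, plus one final refit (assumed). */
    static int pipelineFits(int combinations, int folds) {
        return combinations * folds + 1;
    }

    public static void main(String[] args) {
        // sentencesInVerse: 3 values, vectorSize: 3, regParam: 2, maxIter: 3.
        int combos = combinations(3, 3, 2, 3);
        System.out.println(combos);                   // prints 54
        System.out.println(pipelineFits(combos, 10)); // prints 541
    }
}
```

Since every fit retrains Word2Vec and logistic regression, a run like this can take a long time, so it may be sensible to start with a smaller grid or fewer folds.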

With the CrossValidatorModel in hand, we can easily predict the genre of unknown lyrics:

// Wrap the unknown lyrics in a Dataset; the -1.0D label is only a placeholder.
List<Row> unknownLyricsList = Collections.singletonList(
  RowFactory.create(unknownLyrics, -1.0D)
);

StructType schema = new StructType(new StructField[]{
  DataTypes.createStructField("value", DataTypes.StringType, false),
  DataTypes.createStructField("label", DataTypes.DoubleType, false)
});

Dataset<Row> unknownLyricsDataset = sparkSession.createDataFrame(unknownLyricsList, schema);

// Choosing best model. 
PipelineModel bestModel = (PipelineModel) model.bestModel();

Dataset<Row> predictionsDataset = bestModel.transform(unknownLyricsDataset);
Row predictionRow = predictionsDataset.first();

// 0 for metal, 1 for pop.
final Double prediction = predictionRow.getAs("prediction");

Example in Action

Raw unknown lyrics:

I'm a rolling thunder, a pouring rain

I'm comin' on like a hurricane

My lightning's flashing across the sky

You're only young but you're gonna die

I won't take no prisoners, won't spare no lives

Nobody's putting up a fight

I got my bell, I'm gonna take you to hell

I'm gonna get you, Satan get you

After Cleanser:

Im a rolling thunder a pouring rain

Im comin on like a hurricane

My lightnings flashing across the sky

Youre only young but youre gonna die

I wont take no prisoners wont spare no lives

Nobodys putting up a fight

I got my bell Im gonna take you to hell

Im gonna get you Satan get you

After StopWordsRemover:

im rolling thunder pouring rain

im comin like hurricane

lightnings flashing across sky

youre young youre gonna die

wont take prisoners wont spare lives

nobodys putting fight

got bell im gonna take hell

im gonna get satan get
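
The first two steps shown above can be approximated in plain Java. This is a rough sketch, not the author's Cleanser or Spark's StopWordsRemover: the class name LyricsCleaning, the regular expression, and the tiny stop-word list are assumptions chosen to reproduce the sample lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class LyricsCleaning {

    // Tiny illustrative stop-word list; a real list is much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "i", "on", "my", "the", "you", "but", "only"));

    /** Drops everything except letters, digits and whitespace. */
    static String cleanse(String line) {
        return line.replaceAll("[^A-Za-z0-9\\s]", "").trim();
    }

    /** Lowercases the line and splits it on whitespace. */
    static List<String> tokenize(String line) {
        String normalized = line.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(normalized.split("\\s+"));
    }

    /** Keeps only the words that are not in the stop-word list. */
    static List<String> removeStopWords(List<String> words) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!STOP_WORDS.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String clean = cleanse("I'm a rolling thunder, a pouring rain");
        System.out.println(clean); // prints Im a rolling thunder a pouring rain
        System.out.println(String.join(" ", removeStopWords(tokenize(clean))));
        // prints im rolling thunder pouring rain
    }
}
```

Note that "im" survives: once the apostrophe is removed, it no longer matches any stop word.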

After Stemmer:

im roll thunder pour rain

im comin like hurrican

lightn flash across sky

your young your gonna die

wont take prison wont spare live

nobodi put fight

got bell im gonna take hell

im gonna get satan get

After Word2Vec (vector size is configurable):

[0.036463763926011056,

-0.013076733228398295,

0.044362547532774695,

0.03816963326281462,

.......................................

-0.013962931134021625,

0.049275818325650804,

-0.058982484615766086]

After LogisticRegression:

Probability:

[0.9212126972383768, 0.07878730276162313]

Prediction:

0.0
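
How the probability vector turns into the final prediction can be sketched as follows. This is a simplified plain-Java illustration assuming the usual binary logistic regression rule (a sigmoid over a linear score and a 0.5 threshold on the probability of class 1); the class GenreDecision and its methods are hypothetical.

```java
final class GenreDecision {

    /** Logistic function: maps a linear score to a probability of class 1 (pop). */
    static double sigmoid(double score) {
        return 1.0 / (1.0 + Math.exp(-score));
    }

    /** probability[0] is P(metal), probability[1] is P(pop). */
    static double predict(double[] probability, double threshold) {
        return probability[1] > threshold ? 1.0 : 0.0;
    }

    /** 0 for metal, 1 for pop, as in the rest of the article. */
    static String genre(double prediction) {
        return prediction == 0.0 ? "metal" : "pop";
    }

    public static void main(String[] args) {
        double[] probability = {0.9212126972383768, 0.07878730276162313};
        double prediction = predict(probability, 0.5);
        System.out.println(prediction);        // prints 0.0
        System.out.println(genre(prediction)); // prints metal
    }
}
```

For the "Hells Bells" verse the model assigns about 92% to class 0, so the verse is labeled as metal.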

Model Persistence

Usually, ML work splits cleanly into two phases: first, build a good model (training can take quite a long time), and second, use it in production (where prediction must be efficient, especially in streaming applications). It is therefore useful to save the best model to disk and then reuse it for predictions. Model import/export functionality was added to the Pipeline API in Spark 1.6, so we can easily save a model to a specified output directory:

model.write().overwrite().save(modelOutputDirectory);

and then read and use it when needed:

CrossValidatorModel model = CrossValidatorModel.load(modelDirectory);

Summary

Apache Spark MLlib has a lot of advantages:

  • Getting started is really quick
  • Scalability and performance out of the box
  • Simple ML algorithms work great
  • Easy integration with other Spark components, e.g. Spark Streaming

But there are some drawbacks too, among them:

  • Features are quite limited, e.g. neural network support is minimal (only a multilayer perceptron classifier)
  • The spark.mllib package is more complete than the spark.ml package, at least for now
  • Bugs (for instance, I filed SPARK-17048 while working on this simple demo)

Outcome

Anyway, I was able to express my mental pipeline in code and train a model with an accuracy of 93.32% using just ~65k lines of lyrics.

The source code is on GitHub, and more visual material on this topic is on SlideShare.
