
Using a Deep Neural Network for Automated Call Scoring (Part 2)


In part two, learn how we detected the tone of voice in whole audio files using XGBoost and a combination of LSTM and XGBoost.


In the previous post, we shared our experience in feature extraction and voice recognition. Specifically, we detected the tone of voice in separate phrases using speaker diarization and the LIUM library.

We would now like to tell you how we detected the tone of voice in the whole audio file using XGBoost and a combination of LSTM and XGB.

Detecting the Tone of Voice in the Whole File

We marked files as suspicious if they contained at least one phrase that violated the rules. We used this method to mark 2,500 files.

To extract features, we used the same principle and the same ANN architecture, with a single difference: we scaled the network architecture to fit the new dimensions of the feature space.

With optimal neural network parameters, we achieved a classification accuracy of 85%.

Classification accuracy


Feature Extraction

The XGBoost model requires a fixed number of features for each file. To meet this requirement, we created several signals and computed statistics (parameters) over them.

[Table: list of signals with their specifications and meanings]

The following stats were used:

  1. Mean value of the signal
  2. Mean value of the first 10 seconds of the signal
  3. Mean value of the last 3 seconds of the signal
  4. Mean value of local maximums of the signal
  5. Mean value of local maximums of the first 10 seconds of the signal
  6. Mean value of local maximums of the last 3 seconds of the signal

All six statistics are calculated for each signal, so the total number of features is 36, excluding recording length. All in all, we get 37 numerical features for each recording.
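As a sketch of this feature-extraction step (assuming each signal is a 1-D NumPy array sampled at a known rate; the function and parameter names here are ours, not from the original code):

```python
import numpy as np
from scipy.signal import argrelextrema

def signal_stats(signal, sr, head_s=10, tail_s=3):
    """The six summary statistics for one signal.

    signal: 1-D numpy array; sr: samples per second (an assumption).
    """
    head = signal[: head_s * sr]    # first 10 seconds
    tail = signal[-tail_s * sr:]    # last 3 seconds

    def mean_of_maxima(x):
        idx = argrelextrema(x, np.greater)[0]  # indices of local maxima
        return x[idx].mean() if idx.size else 0.0

    return [
        signal.mean(),
        head.mean(),
        tail.mean(),
        mean_of_maxima(signal),
        mean_of_maxima(head),
        mean_of_maxima(tail),
    ]

def extract_features(signals, sr, duration_s):
    """6 signals x 6 stats = 36 features, plus recording length -> 37."""
    feats = []
    for s in signals:
        feats.extend(signal_stats(s, sr))
    feats.append(duration_s)
    return np.array(feats)
```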

The prediction accuracy of this algorithm is 0.869.

A Combination of LSTM and XGB

To combine the classifiers, we applied blending to these two models, which resulted in an average accuracy increase of 2%.

Prediction accuracy

We managed to increase the prediction quality of this algorithm to a ROC-AUC of 0.9.
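Blending can be as simple as a weighted average of the two models' predicted probabilities. The sketch below assumes equal weights, which is an illustrative choice rather than the weighting tuned in practice:

```python
import numpy as np

def blend(p_lstm, p_xgb, w=0.5):
    """Weighted average of two models' predicted probabilities."""
    return w * np.asarray(p_lstm) + (1 - w) * np.asarray(p_xgb)

# Hypothetical per-file probabilities of the 'suspicious' class.
p_lstm = np.array([0.9, 0.2, 0.6])
p_xgb = np.array([0.7, 0.4, 0.8])

p = blend(p_lstm, p_xgb)          # blended probabilities
labels = (p >= 0.5).astype(int)   # final class decision
```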

The Outcome

We tested our deep neural network classifier on a sample of 205 files: 177 of them were neutral and 28 were suspicious. The DNN had to process every single one of them and guess which group it belonged to. See the results below.

  • 170 neutral files were correctly identified as neutral
  • 7 neutral files were identified as suspicious
  • 13 suspicious files were correctly identified as suspicious
  • 15 suspicious files were identified as neutral

To estimate the percentage ratio of true and false outputs, we used a confusion matrix. For better visual clarity, we present it as a 2x2 table.

The percentage ratio
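The counts reported above can be turned into a confusion matrix with scikit-learn by reconstructing dummy label vectors from the numbers (only the counts matter, not the file order):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Labels reconstructed from the reported counts
# (0 = neutral, 1 = suspicious).
y_true = np.array([0] * 177 + [1] * 28)
y_pred = np.array([0] * 170 + [1] * 7     # the 177 neutral files
                  + [1] * 13 + [0] * 15)  # the 28 suspicious files

# Rows = true class, columns = predicted class:
# [[170, 7], [15, 13]]
cm = confusion_matrix(y_true, y_pred)
```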

Detecting a Particular Phrase in the Speech

We were eager to try this approach for recognizing words and phrases in audio files. The goal was to detect files where call center agents don’t introduce themselves and their organization to a client within the first 10 seconds of a call.

We used 200 phrases with an average length of 1.5 seconds where call center agents introduce themselves and their organizations.

Marking files manually took us a lot of time, as we had to listen through each record to check whether the required phrase was in it. To speed things up, we increased our dataset using augmentation: we transformed each file randomly six times, adding noise, changing frequency, and changing volume. The resulting dataset contained 1,500 samples.
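A simplified sketch of this kind of augmentation in plain NumPy (real pipelines typically use an audio library for pitch and time transforms; the transform parameters here are illustrative assumptions):

```python
import numpy as np

def augment(signal, rng):
    """Apply one random transform: additive noise, volume change,
    or a crude resampling (frequency/speed shift)."""
    choice = rng.integers(3)
    if choice == 0:   # add white noise
        return signal + 0.005 * rng.standard_normal(signal.size)
    if choice == 1:   # random volume change
        return signal * rng.uniform(0.5, 1.5)
    # naive speed/frequency change via linear resampling
    rate = rng.uniform(0.9, 1.1)
    idx = np.arange(0, signal.size, rate)
    return np.interp(idx, np.arange(signal.size), signal)

rng = np.random.default_rng(1)
clip = rng.standard_normal(16000)  # 1 s of fake audio at 16 kHz
augmented = [augment(clip, rng) for _ in range(6)]
```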


We used the first 10 seconds of a rep's speech to train the classifier, because this is the timeframe in which the required phrase is pronounced. Each file was divided into windows (window length: 1.5 seconds, window step: 1 second), and each window was processed by the network as an input. As an output for each file, we got the probability that the required phrase was pronounced in each time window.
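The windowing step can be sketched as follows (the sample rate and array shapes are assumptions; with a 1.5-second window and a 1-second step, a 10-second clip yields nine windows):

```python
import numpy as np

def windows(signal, sr, win_s=1.5, step_s=1.0):
    """Split an audio clip into overlapping fixed-length windows."""
    win, step = int(win_s * sr), int(step_s * sr)
    return [signal[i : i + win]
            for i in range(0, signal.size - win + 1, step)]

sr = 16000
clip = np.zeros(10 * sr)  # first 10 seconds of a call, at 16 kHz
w = windows(clip, sr)     # windows starting at 0, 1, ..., 8 seconds
```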

The output

We marked 300 more files to find out whether the required phrase was pronounced within the first 10 seconds. The prediction accuracy for these files was 87%.

Why Use Voice Recognition Software

Automated call scoring helps define clear KPIs for call center agents, identify best practices and follow them, and increase call center productivity. However, speech recognition software can be applied to a much wider range of tasks.

Below, you can find a few examples of how exactly organizations can benefit from speech recognition software:

  • Collect and analyze data to improve voice UX
  • Analyze call recordings to find connections and trends
  • Recognize people by their voice
  • Detect and identify client’s emotions for a higher customer satisfaction rate
  • Dig deep into big data and increase the first-call resolution rate
  • Increase revenue per call
  • Reduce churn rate
  • and much more!



Opinions expressed by DZone contributors are their own.
