Sound Classification With TensorFlow
This article describes the tools we chose, the challenges we faced, how we trained the model for TensorFlow, and how to run our open-source sound classification project.
There are many projects and services for human speech recognition, such as Pocketsphinx, Google’s Speech API, and many others. Such applications and services convert speech to text with pretty good quality, but none of them can tell apart the different sounds captured by the microphone. What was recorded: human speech, animal sounds, or music playing?
We were faced with this task and decided to investigate and build sample projects that are able to classify different sounds using machine learning algorithms. This article describes which tools we chose, what challenges we faced, how we trained the model for TensorFlow, and how to run our open-source project. Also, we can supply the recognition results to the DeviceHive IoT platform to use them in cloud services for a third-party application.
Choosing Tools and a Classification Model
First, we needed to choose some software to work with neural networks. The first suitable solution that we found was Python Audio Analysis.
The main problem in machine learning is having a good training dataset. There are many datasets for speech recognition and music classification, but not a lot for random sound classification. After some research, we found the urban sound dataset.
After some testing, we were faced with the following problems:
pyAudioAnalysis isn’t flexible enough. It exposes only a small set of parameters, and some values are calculated on the fly and can’t be altered, e.g. the number of training experiments, which is derived from the number of samples.
The dataset only has ten classes and all of them are “urban.”
The next solution that we found was Google AudioSet. It is based on labeled YouTube video segments and can be downloaded in two formats:
CSV files describing, for each segment, the YouTube video ID, start time, end time, and one or more labels.
Extracted audio features that are stored as TensorFlow Record files.
These features are compatible with YouTube-8M models. Also, this solution offers the TensorFlow VGGish model as feature extractor. It covered a big part of our requirements and was the best choice for us.
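To make the CSV format concrete, here is a minimal sketch of a reader for the segment files. It assumes the column order described above, that header lines start with `#`, and that the label list is a quoted, comma-separated field; the video ID and the helper names are hypothetical.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Segment(NamedTuple):
    video_id: str
    start_seconds: float
    end_seconds: float
    labels: list


def read_segments(lines) -> Iterator[Segment]:
    """Yield one Segment per data row, skipping '#' header lines."""
    data_lines = (line for line in lines if not line.startswith("#"))
    # skipinitialspace lets the quoted label field follow ", " correctly.
    for row in csv.reader(data_lines, skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        video_id, start, end, labels = row
        yield Segment(video_id, float(start), float(end), labels.split(","))


# Hypothetical sample row in the assumed layout.
sample = io.StringIO(
    "# YTID, start_seconds, end_seconds, positive_labels\n"
    'abc123XYZ_0, 30.000, 40.000, "/m/09x0r,/t/dd00088"\n'
)
segments = list(read_segments(sample))
print(segments[0])
```

For a real file, pass an open file object instead of the `io.StringIO` sample.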
The next task was to figure out how the YouTube-8M interface works. It’s designed to work with videos, but fortunately, it can work with audio, as well. This library is pretty flexible, but it has a hardcoded number of sample classes. So, we modified this a little bit to pass the number of classes as a parameter.
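The idea of passing the class count as a parameter can be sketched as follows. This is not the actual YouTube-8M code, which has its own flag handling; `argparse`, `build_parser`, and `multi_hot` are stand-ins chosen for illustration, and 527 is the class count used in the training command in this article.

```python
import argparse


def build_parser():
    """Hypothetical option parser with the class count as a flag."""
    parser = argparse.ArgumentParser(description="illustrative training options")
    parser.add_argument("--num_classes", type=int, default=527,
                        help="number of output classes instead of a hardcoded value")
    return parser


def multi_hot(label_indices, num_classes):
    """Build a target vector with 1.0 at every positive label index."""
    vector = [0.0] * num_classes
    for index in label_indices:
        if not 0 <= index < num_classes:
            raise ValueError("label index %d out of range" % index)
        vector[index] = 1.0
    return vector


args = build_parser().parse_args(["--num_classes=527"])
target = multi_hot([0, 137], args.num_classes)
print(len(target), sum(target))
```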
YouTube-8M can work with data of two types: aggregated features and frame features. Google AudioSet can provide data as features, as we noted before. Through a little more research, we discovered that the features are in frame format. We then needed to choose the model to be trained.
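The difference between the two feature types can be illustrated with a small sketch: frame features keep one vector per time step, while an aggregated feature is a single vector for the whole clip. Averaging over frames, as done here, is one common way to aggregate and is an assumption, not a statement about how any particular dataset was built.

```python
def mean_pool(frames):
    """Collapse a list of per-frame vectors into one averaged vector."""
    if not frames:
        raise ValueError("need at least one frame")
    size = len(frames[0])
    if any(len(frame) != size for frame in frames):
        raise ValueError("all frames must have the same length")
    # zip(*frames) iterates over feature dimensions instead of frames.
    return [sum(column) / len(frames) for column in zip(*frames)]


# Two frames with three features each (real embeddings here have 128).
frames = [[0.0, 2.0, 4.0], [2.0, 4.0, 8.0]]
aggregated = mean_pool(frames)
print(aggregated)
```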
Resources, Time, and Accuracy
GPUs are a more suitable choice for machine learning than CPUs. You can find more info about this here, so we will skip this point and go directly to our setup. For our experiments, we had a PC with one NVIDIA GTX 970 with 4 GB of memory.
In our case, the training time didn’t really matter. We should mention that one to two hours of training was enough to make an initial decision about the chosen model and its accuracy.
Of course, we want to get as good accuracy as possible. But to train a more complex model (potentially better accuracy), you need more RAM (video RAM in case of GPU) to fit it in.
Choosing the Model
A full list of YouTube-8M models with descriptions is available here. Because our training data was in frame format, frame-level models had to be used. Google AudioSet provides us with a dataset split into three parts: balanced train, unbalanced train, and evaluation. You can get more info about them here.
A modified version of YouTube-8M was used for training and evaluation. It’s available here.
The training command looks like:
python train.py --train_data_pattern=/path_to_data/audioset_v1_embeddings/bal_train/*.tfrecord --num_epochs=100 --learning_rate_decay_examples=400000 --feature_names=audio_embedding --feature_sizes=128 --frame_features --batch_size=512 --num_classes=527 --train_dir=/path_to_logs --model=ModelName
For LstmModel, we changed the base learning rate to 0.001, as the documentation suggested. Also, we changed the default value of lstm_cells to 256 because we didn’t have enough RAM for more.
Let’s see the training results:
As we can see, we got good results during the training step — but this doesn’t mean we'll necessarily get good results on the full evaluation.
Let’s try the unbalanced train dataset. It has a lot more samples, so we will reduce the number of training epochs to 10 (it should have been reduced further, to 5 or so, because training took significant time).
More About Training
YouTube-8M takes many parameters and a lot of them affect the training process.
For example, you can tune the learning rate and number of epochs that will change the training process a lot. There are also three different functions for loss calculation and many other useful variables that you can tune and change to improve the results.
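To show how the learning rate settings interact, here is a rough sketch of a staircase exponential decay schedule. The staircase form and the 0.95 decay factor are assumptions made for illustration; the base rate of 0.001, the batch size of 512, and the 400000 decay examples are the values used earlier in this article.

```python
def decayed_learning_rate(base_lr, examples_seen, decay_examples, decay_rate=0.95):
    """Multiply the base rate by decay_rate once per decay_examples examples."""
    completed_intervals = examples_seen // decay_examples
    return base_lr * decay_rate ** completed_intervals


BATCH_SIZE = 512
for step in (0, 1000, 2000):
    seen = step * BATCH_SIZE  # examples processed after this many steps
    print(step, decayed_learning_rate(0.001, seen, 400000))
```

With these numbers the rate stays at its base value for the first 400000 examples and drops by the decay factor at each further multiple, so a smaller decay interval makes the rate fall faster.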
Using Trained Model With Audio Capture Devices
Now that we have some trained models, it’s time to add some code to interact with them.
We need to somehow capture audio data from a microphone. We will use PyAudio. It provides a simple interface and can work on most platforms.
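A minimal capture sketch with PyAudio might look like the following. It is not the project's code: the function names, the chunk size, and capturing directly at 16 kHz mono are assumptions, and some devices may not support that rate.

```python
import math
import struct

SAMPLE_RATE = 16000   # assumed capture rate in Hz
CHUNK_FRAMES = 1024   # assumed number of frames read per call


def record_pcm(seconds, rate=SAMPLE_RATE, chunk=CHUNK_FRAMES):
    """Record mono 16-bit audio and return the raw PCM bytes."""
    import pyaudio  # imported here so the helper below works without PyAudio

    audio = pyaudio.PyAudio()
    try:
        stream = audio.open(format=pyaudio.paInt16, channels=1, rate=rate,
                            input=True, frames_per_buffer=chunk)
        try:
            # Rounds up to whole chunks, so slightly more audio may be read.
            num_chunks = math.ceil(rate * seconds / chunk)
            chunks = [stream.read(chunk) for _ in range(num_chunks)]
        finally:
            stream.stop_stream()
            stream.close()
    finally:
        audio.terminate()
    return b"".join(chunks)


def pcm16_to_floats(pcm):
    """Convert native-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    count = len(pcm) // 2
    samples = struct.unpack("%dh" % count, pcm[:count * 2])
    return [sample / 32768.0 for sample in samples]
```

Calling `pcm16_to_floats(record_pcm(5))` would then give five seconds of samples ready for feature extraction, provided a microphone is available.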
As we mentioned before, we will use the TensorFlow VGGish model as the feature extractor. Here is a short explanation of the transformation process.
A “dog bark” example from the UrbanSound dataset was used for visualization. The steps are:
Resample the audio to 16 kHz mono.
Compute a spectrogram using magnitudes of the Short-Time Fourier Transform with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann window.
Compute a mel spectrogram by mapping the spectrogram to 64 mel bins.
Compute a stabilized log mel spectrogram by applying log(mel-spectrum + 0.01), where the offset is used to avoid taking a logarithm of zero.
These features are then framed into non-overlapping examples of 0.96 seconds, where each example covers 64 mel bands and 96 frames of 10 ms each.
These examples are then fed into the VGGish model to extract embeddings.
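The framing arithmetic in the steps above can be checked with a short sketch. It only works out frame counts, the stabilized logarithm, and the split into examples; it does not compute a real spectrogram. Dropping incomplete frames and examples at the end of the clip, rather than padding them, is an assumption, and the function names are hypothetical.

```python
import math

SAMPLE_RATE = 16000        # Hz, after resampling
WINDOW_SECONDS = 0.025     # STFT window size
HOP_SECONDS = 0.010        # STFT window hop
NUM_MEL_BINS = 64
FRAMES_PER_EXAMPLE = 96    # 96 frames of 10 ms = 0.96 s
LOG_OFFSET = 0.01


def count_stft_frames(num_samples):
    """Number of complete STFT windows that fit into the signal."""
    window = int(round(SAMPLE_RATE * WINDOW_SECONDS))  # 400 samples
    hop = int(round(SAMPLE_RATE * HOP_SECONDS))        # 160 samples
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def stabilized_log(mel_frame):
    """Apply log(x + offset) so that zero energy does not give log(0)."""
    return [math.log(value + LOG_OFFSET) for value in mel_frame]


def frame_examples(log_mel_frames):
    """Split frames into non-overlapping examples, dropping the remainder."""
    usable = len(log_mel_frames) // FRAMES_PER_EXAMPLE * FRAMES_PER_EXAMPLE
    return [log_mel_frames[start:start + FRAMES_PER_EXAMPLE]
            for start in range(0, usable, FRAMES_PER_EXAMPLE)]


frames_in_10s = count_stft_frames(10 * SAMPLE_RATE)
# Stand-in for a real mel spectrogram: ten seconds of silence.
silent = [stabilized_log([0.0] * NUM_MEL_BINS) for _ in range(frames_in_10s)]
examples = frame_examples(silent)
print(frames_in_10s, len(examples), len(examples[0]), len(examples[0][0]))
```

Under these assumptions a ten-second clip yields 998 frames and therefore 10 examples of 96 frames by 64 mel bins each.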
And finally, we need an interface to feed the data to the neural network and get the results.
We will use the YouTube-8M interface as an example but will modify it to remove the serialization/deserialization step.
Here you can see the result of our work. Let’s take a closer look.
PyAudio uses libportaudio2 and portaudio19-dev, so you need to install them to make it work.
Some python libraries are required. You can install them using pip:
pip install -r requirements.txt
Also, you need to download and extract to the project root the archive with the saved models. You can find it here.
Our project provides three interfaces to use.
1. Process Prerecorded Audio File
Simply run python parse_file.py path_to_your_file.wav and you will see in the terminal something like:
Speech: 0.75, Music: 0.12, Inside, large room or hall: 0.03
The result depends on the input file. These values are the predictions that the neural network has made. A higher value means a higher chance of the input file belonging to that class.
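As an illustration of how such a line can be produced from raw scores, here is a small sketch that sorts the classes by score and prints the top few. The function name and the extra "Dog" entry are made up for the example; how parse_file.py actually formats its output is not shown here.

```python
def format_top_predictions(scores, top_k=3):
    """Return the top_k classes as 'label: score' pairs, highest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ", ".join("%s: %.2f" % (label, score) for label, score in ranked[:top_k])


# Illustrative scores keyed by class name.
scores = {
    "Music": 0.12,
    "Speech": 0.75,
    "Inside, large room or hall": 0.03,
    "Dog": 0.01,
}
line = format_top_predictions(scores)
print(line)
```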
2. Capture and Process Data From mic
python capture.py starts a process that captures data from your mic indefinitely. It feeds data to the classification interface every five to seven seconds (by default). You will see results like those in the previous example.
You can run it with --save_path=/path_to_samples_dir/. In this case, all captured data will be stored in the provided directory in wav files. This function is useful if you want to try different models with the same example(s). Use the --help parameter to get more info.
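Storing a captured sample as a wav file needs only the standard library. The sketch below assumes 16-bit mono audio at 16 kHz and uses a hypothetical file name; the naming scheme capture.py uses is not shown here.

```python
import os
import tempfile
import wave


def save_wav(path, pcm, rate=16000, channels=1, sample_width=2):
    """Write raw PCM bytes to a wav file with the given format."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample, 2 = 16-bit
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm)


def read_wav(path):
    """Return the sample rate and raw PCM bytes of a wav file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.readframes(wav_file.getnframes())


# Write one second of silence to a temporary directory and read it back.
save_dir = tempfile.mkdtemp()
sample_path = os.path.join(save_dir, "sample_0001.wav")
save_wav(sample_path, b"\x00\x00" * 16000)
rate, pcm = read_wav(sample_path)
print(rate, len(pcm))
```

Files saved this way can be replayed through parse_file.py to compare models on identical input.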
3. Web Interface
python daemon.py implements a simple web interface that is available on http://127.0.0.1:8000 by default. We use the same code as for the previous example. You can see the last ten predictions on the events (http://127.0.0.1:8000/events) page.
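Keeping only the most recent predictions is a natural fit for a bounded queue. The following is a sketch of that idea with a hypothetical class name; how daemon.py stores its events is not shown here.

```python
from collections import deque


class RecentPredictions:
    """Keep only the most recent predictions, newest first when read."""

    def __init__(self, limit=10):
        # A deque with maxlen discards the oldest item automatically.
        self._items = deque(maxlen=limit)

    def add(self, prediction):
        self._items.append(prediction)

    def latest(self):
        return list(reversed(self._items))


history = RecentPredictions()
for index in range(12):
    history.add("prediction %d" % index)
print(history.latest())
```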
IoT Service Integration
Last but not least is integration with the IoT infrastructure. If you run the web interface that we mentioned in the previous section, then you can find the DeviceHive client status and configuration on the index page. As long as the client is connected, predictions will be sent to the specified device as notifications.
TensorFlow is a very flexible tool, as you can see, and can be helpful in many machine learning applications like image and sound recognition. Having such a solution together with an IoT platform allows you to build a smart solution over a very wide area.
Smart cities could use this for security purposes, continuously listening for broken glass, gunfire, and other sounds related to crimes. Even in rainforests, such a solution could be used to track wild animals or birds by analyzing their voices.
The IoT platform can then deliver all such notifications. This solution can be installed on local devices (though it still can be deployed somewhere as a cloud service) to minimize traffic and cloud expenses and be customized to deliver only notifications instead of including the raw audio. Do not forget that this is an open-source project, so please feel free to use it!
Published at DZone with permission of Igor Panteleyev. See the original article here.