Build a Real-Time Phone Conversation Analytics
What Is Real-Time Conversation Analytics?
In contrast to post-call conversation analytics, which provides insights after the fact, real-time conversation analytics surfaces insights while the call is still in progress.
In this blog, I will walk through the essential steps to build a web app that can analyze call conversations in real-time to assist an agent. Once we’re finished, we’ll have an app which will:
- Transcribe phone conversations in real-time and display the text on the screen.
- Translate the transcripts from English to Spanish and display the translated text on the screen.
- Allow an agent to record the conversation as separate customer and agent tracks and save them to audio files.
- Display comprehensive analytics such as sentiment and emotion scores on the screen.
I assume that you already know how to set up a RingCentral sandbox account, including adding user extensions and assigning a phone number to an extension so that you can call that user extension directly. I also assume that you are familiar with getting IBM Watson IAM API keys for accessing Watson AI services.
We need to create a RingCentral application and get the app’s client id and client secret, which will be used in our demo app. We will keep the demo app implementation as simple as possible by choosing the password flow authentication and selecting the following permissions for the demo app:
Call Control, ReadAccounts, WebhookSubscriptions
The associated demo application is built using the Express web application framework, React JS, and Node JS. Thus, for convenience, we will use the RingCentral JS SDK and the IBM Watson Node JS SDK to access their services.
Note: The code snippets shown in this article are shorter and just for illustration of essential parts. They may not work directly with copy/paste. I recommend you download the entire project from here.
Call Monitoring Setup
In order to supervise a call, we need to set up a call monitoring group in our sandbox account. We will log in to our RingCentral sandbox account and browse to the “Call Monitoring” view. Then, click the “+ New Call Monitoring” button to create a new call monitoring group. Give the group a name and follow the onscreen instructions to set it up.
Then, we subscribe to telephony session notifications for the monitored agent as follows:
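A minimal sketch of such a subscription request, assuming the standard RingCentral subscription endpoint and a webhook delivery mode. The helper names, the `expiresIn` value, and the `platform` parameter (expected to be `rcsdk.platform()` from the RingCentral JS SDK) are assumptions for illustration, not the demo project's exact code.

```javascript
// Hypothetical helper: builds the body for POST /restapi/v1.0/subscription.
// One event filter is created per monitored agent extension id.
function buildSubscriptionRequest(agentExtensionIds, webhookUrl) {
  return {
    eventFilters: agentExtensionIds.map(
      (id) => `/restapi/v1.0/account/~/extension/${id}/telephony/sessions`
    ),
    deliveryMode: { transportType: "WebHook", address: webhookUrl },
    expiresIn: 3600, // seconds; an arbitrary value chosen for this sketch
  };
}

// `platform` is assumed to be the object returned by rcsdk.platform().
async function subscribeForNotifications(platform, agentExtensionIds, webhookUrl) {
  const resp = await platform.post(
    "/restapi/v1.0/subscription",
    buildSubscriptionRequest(agentExtensionIds, webhookUrl)
  );
  const json = await resp.json();
  return json.id; // keep the subscription id so it can be renewed or deleted
}
```

Keeping the request body in a separate function makes it easy to inspect or log before anything is sent to the server.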
After subscribing to notifications successfully, our app should receive a notification whenever the telephony status of the monitored agent changes. The notifications will come via the webhook callback URL as shown below:
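A rough, framework-agnostic sketch of the webhook callback logic. It assumes that RingCentral first verifies the URL by sending a `Validation-Token` header that must be echoed back, and that telephony events carry a `body.telephonySessionId` field. The function name and the returned `{ status, headers }` shape are hypothetical; in an Express route you would copy them onto the response object.

```javascript
// Hypothetical handler. `headers` is expected to use lower-case keys,
// as Node's http module and Express provide them.
function handleWebhookRequest(headers, body, onTelephonyEvent) {
  const validationToken = headers["validation-token"];
  if (validationToken) {
    // Subscription handshake: echo the token so the webhook URL is accepted.
    return { status: 200, headers: { "Validation-Token": validationToken } };
  }
  if (body && body.body && body.body.telephonySessionId) {
    // A telephony session event for one of the monitored agents.
    onTelephonyEvent(body.body);
  }
  return { status: 200, headers: {} };
}
```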
When an incoming call is answered, we iterate through the agentsList to identify which agent has accepted the call. Once we find the agent from the list, we will read the call information, as shown in the code below:
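The lookup and parsing steps could be sketched as follows. The field names (`parties`, `status.code`, `extensionId`, `owner.extensionId`) are assumptions about the notification and call session payloads, and both function names are hypothetical.

```javascript
// Finds the monitored agent who answered the call described by a
// telephony session event. Returns null if no monitored agent answered.
function findAnsweringAgent(eventBody, agentsList) {
  for (const party of eventBody.parties || []) {
    const answered = party.status && party.status.code === "Answered";
    if (!answered) continue;
    const agent = agentsList.find((a) => a.id === party.extensionId);
    if (agent) {
      return { agent, telephonySessionId: eventBody.telephonySessionId };
    }
  }
  return null;
}

// Splits the parties of a call session into the agent's and the customer's
// party ids. The extension id is looked up in two places because the exact
// payload shape is an assumption in this sketch.
function splitParties(session, agentExtensionId) {
  const result = { agentPartyId: null, customerPartyId: null };
  for (const party of session.parties || []) {
    const ownerId = party.extensionId || (party.owner && party.owner.extensionId);
    if (ownerId === agentExtensionId) result.agentPartyId = party.id;
    else result.customerPartyId = party.id;
  }
  return result;
}
```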
As you can see, we parse the response to get the telephonySessionId and the partyId of each party participating in the call. It’s worth mentioning that the telephony session id identifies “this” call, and the party id identifies which party (customer or agent) is participating in the call.
After getting each partyId, we call the submitSuperviseRequest() function, where we send a request for a supervision session. We send the request twice, once per partyId, because we want separate audio streams: one from the customer and another from the agent.
Note: If you want to get a mixed audio stream, you can call the supervise API without the parties/[partyId] segment in the URL path.
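A sketch of how submitSuperviseRequest() might look, covering both the per-party and the mixed-stream variants. The request body fields (`mode`, `supervisorDeviceId`, `agentExtensionId`) are written as I understand the supervise API and should be checked against the current API reference; the path helper and the `platform` parameter are assumptions of this example.

```javascript
// Hypothetical helper: builds the supervise endpoint path. When partyId is
// omitted, the session-level endpoint (mixed audio) is used instead.
function buildSupervisePath(telephonySessionId, partyId) {
  const base = `/restapi/v1.0/account/~/telephony/sessions/${telephonySessionId}`;
  return partyId ? `${base}/parties/${partyId}/supervise` : `${base}/supervise`;
}

// `platform` is assumed to be rcsdk.platform(); `deviceId` identifies the
// device that will receive the audio stream.
async function submitSuperviseRequest(platform, telephonySessionId, partyId, deviceId, agentExtensionId) {
  const body = {
    mode: "Listen",
    supervisorDeviceId: deviceId,
    agentExtensionId: agentExtensionId,
  };
  const resp = await platform.post(buildSupervisePath(telephonySessionId, partyId), body);
  return resp.json();
}
```

Calling the function once with the customer's partyId and once with the agent's partyId yields the two separate audio streams.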
By sending the supervise request above, we ask the RingCentral server to call a phone device, identified by the deviceId, and stream the audio of the call to that SIP device so that we can listen to it. But where can we get the device id? Let’s look for it.
If the supervisor is a human who is supposed to listen to the call on a phone device, then the deviceId can be one of the device ids retrieved from the supervisor’s extension device list using the extension device list API. However, in our use case, we want the audio streams to be streamed directly to our app so that we can analyze the conversation using AI solutions. That is why we need to implement a “softphone” which will receive the audio streams.
It is fairly complicated to implement a softphone from scratch using SIP over WebSocket. Thankfully, our engineer Tyler Liu created the RingCentral Softphone SDK, which makes it very easy to implement a softphone engine for our app.
As you can see, with just a few lines of code, we have set up our softphone with the deviceId that we were looking for earlier. Remember that the rcsdk parameter is an instance of the RingCentral SDK we used earlier. We can save the deviceId in a database so that we can retrieve it when we call the supervise API.
Note: We must start the softphone engine before we subscribe to the telephony session notifications to make sure that when there is an incoming call, we already have the deviceId.
Let’s see how we use the softphone object to accept a SIP invite and to answer a SIP call.
Every time we submit a supervise request, we will get a SIP invite message. As discussed earlier, we submitted the supervise request twice, once with the customer’s partyId and once with the agent’s partyId. Thus, we will receive two SIP invite messages. How do we detect which invite is for the customer’s audio channel and which invite is for the agent’s audio channel, so that we can create resources to handle each audio stream separately?
Let’s first define the data model to keep necessary information about a channel.
After we submit a supervise request, we save the partyId of that request and other metadata in a channel object, and we add the channel object to the channels list.
When we receive a SIP invite message, we parse the SIP message’s header and extract the party id. Then, we compare the party id with the party ids we saved in the channels list to identify a channel and create necessary resources for that channel.
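A minimal sketch of the channel data model and the matching step. It assumes the party id arrives in a SIP header whose value is a list of semicolon-separated key=value pairs such as `party-id=...;session-id=...`; that format, the channel fields, and all function names are assumptions made for illustration.

```javascript
// Hypothetical channel object: one per supervised party.
function createChannel(partyId, speakerName) {
  return {
    partyId,
    speakerName, // e.g. "Customer" or "Agent"
    callId: "", // filled in when the SIP invite arrives
    watson: null, // transcription engine created for this channel
  };
}

// Parses "party-id=p-1;session-id=s-1" into { "party-id": "p-1", ... }.
function parseApiIds(headerValue) {
  const ids = {};
  for (const pair of headerValue.split(";")) {
    const [key, value] = pair.split("=");
    if (key && value) ids[key.trim()] = value.trim();
  }
  return ids;
}

// Returns the channel whose partyId matches the invite, or null.
function findChannelForInvite(channels, headerValue) {
  const partyId = parseApiIds(headerValue)["party-id"];
  return channels.find((ch) => ch.partyId === partyId) || null;
}
```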
Each SIP call is identified by a call id, which is included in the SIP headers. We need to extract the call id and save it into the channel data object. This is needed for identifying a channel to reset the resources for that channel when the call hangs up. We also create the Watson engine object (we will look into the Watson engine implementation shortly), then answer the SIP call.
After answering the SIP call, we will receive the audio buffer via the softphone callback ‘track’. Within the callback function, we create an audio sink and start reading the audio data as shown below:
There are a few tricks in the implementation above. Because we want to use IBM Watson real-time transcription service, we need to create a Watson socket and set the sample rate of the audio we receive from the track. For some reason, the first audio data packet we receive does not have the correct audio sample rate.
That’s why we discard a few audio packets before we pass the sample rate data.sampleRate to the Watson engine. Each audio data packet is normally very small (10 ms to 40 ms of audio), and it is not efficient to feed Watson real-time transcription with such small packets. That’s why we need to create a data buffer and concatenate those small packets into a bigger packet (32k to 64k) before we feed it to the Watson socket for transcription.
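The discard-then-accumulate logic described above can be sketched as a small class. The class name, the default of three skipped packets, and the 32 KB threshold are illustrative assumptions; the input is assumed to be a typed array of 16-bit PCM samples.

```javascript
// Hypothetical accumulator: skips the first few packets, then collects
// small audio packets until at least `minBytes` are available.
class PacketAccumulator {
  constructor(minBytes = 32 * 1024, skipPackets = 3) {
    this.minBytes = minBytes;
    this.skipPackets = skipPackets;
    this.seen = 0;
    this.chunks = [];
    this.size = 0;
  }

  // Returns a Buffer once enough audio has accumulated, otherwise null.
  push(samples) {
    this.seen += 1;
    if (this.seen <= this.skipPackets) return null; // unreliable first packets
    // Copy the bytes so later reuse of the source array cannot corrupt them.
    const chunk = Buffer.from(
      new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength)
    );
    this.chunks.push(chunk);
    this.size += chunk.length;
    if (this.size < this.minBytes) return null;
    const out = Buffer.concat(this.chunks, this.size);
    this.chunks = [];
    this.size = 0;
    return out;
  }
}
```

Inside the audio sink callback, each incoming packet would be passed to push(), and any non-null result forwarded to the transcription socket.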
In this demo, we want to transcribe the conversations and translate the text from English to Spanish. We also use Natural Language Processing technology (NLP) to analyze sentiments and emotions of the conversations. Let’s have a look at the Watson engine implementation.
To use Watson real-time speech recognition, we need a Speech-to-Text API key, which we use to get an access token. Once we get the access token, we can create a WebSocket URI as shown in the code above. If you want to transcribe a language other than English, you can check the other language models supported by IBM Watson and replace the language model with your selected one.
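A small sketch of building the WebSocket URI from an access token and a language model. The base URL depends on your Watson service instance, so it is passed in as a parameter; the function name is hypothetical, and the narrowband English model is assumed as the default because telephone audio is typically sampled at 8 kHz.

```javascript
// Hypothetical helper: builds the Speech to Text recognize URI.
// `baseWssUrl` is the wss:// form of your service instance URL.
function buildRecognizeUri(baseWssUrl, accessToken, model = "en-US_NarrowbandModel") {
  const query = new URLSearchParams({ access_token: accessToken, model });
  return `${baseWssUrl}/v1/recognize?${query.toString()}`;
}
```

To transcribe another language, pass a different model name as the third argument.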
Then, we create a WebSocket object with the Watson WebSocket URI created earlier. We define a configs data object, use the sample rate of our audio data to set the ‘content-type’, and set other features accordingly. Finally, we send the ‘configs’ once the WebSocket has opened successfully.
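The configs object might look like the following sketch. It assumes mono 16-bit linear PCM audio; the helper names are hypothetical, and `socket` stands for any object with a send() method, such as a WebSocket client instance that has already opened.

```javascript
// Hypothetical helper: builds the "start" message for the recognize socket.
function buildStartMessage(sampleRate) {
  return {
    action: "start",
    "content-type": `audio/l16;rate=${sampleRate};channels=1`,
    interim_results: true, // receive partial transcripts while speech continues
    timestamps: true,
    inactivity_timeout: -1, // keep the session open during silence
  };
}

// Sends the configs as a JSON text frame. Call this from the socket's
// "open" handler so the connection is ready.
function sendStart(socket, sampleRate) {
  socket.send(JSON.stringify(buildStartMessage(sampleRate)));
}
```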
When the audio buffer is ready, we call the transcribe() function with the audio buffer, which sends it to the Watson service. The transcript text will be returned to the callback function shown below, where we will parse the response to get the transcript.
Because we set the ‘interim_results’ feature to true in the configs, we will receive interim transcripts. It’s worth noting that an interim transcript arrives sooner but may not be accurate, and it might change in the final result. That’s why we check whether the transcript status is final: if it is, we analyze the transcript; otherwise, we just merge the text and display it.
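A sketch of parsing a recognition message and checking whether it is final. It assumes the response carries a `results` array whose entries have a `final` flag and an `alternatives` list; the function name and the returned shape are hypothetical.

```javascript
// Returns { transcript, isFinal } for a recognition message, or null for
// messages that carry no transcript (for example state notifications).
function parseRecognitionMessage(rawMessage) {
  const msg = JSON.parse(rawMessage);
  if (!Array.isArray(msg.results) || msg.results.length === 0) return null;
  const result = msg.results[0];
  const best = result.alternatives && result.alternatives[0];
  if (!best || typeof best.transcript !== "string") return null;
  return {
    transcript: best.transcript.trim(),
    isFinal: result.final === true,
  };
}
```

Only results with isFinal set to true would be passed on for translation and analysis; interim results are merged and displayed.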
To create a dialogue that interleaves interim and final transcripts as well as the customer’s and the agent’s speech, we need to implement an algorithm to join the transcripts properly. I will let you explore the detailed implementation of the mergingChannels(thisClass.speakerId, thisClass.transcript) function in the index.js file by yourself. Let’s move on to discuss using Watson NLP to analyze the transcript.
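To give a feel for the problem, here is a deliberately simplified merging sketch. It is not the project's mergingChannels() implementation; the function name, the turn object shape, and the rule used (an interim update overwrites the speaker's open turn, a final result closes it) are all assumptions of this example.

```javascript
// `dialogue` is an array of turns: { speakerId, text, final }.
// An interim update replaces the speaker's most recent open turn; once a
// turn is final, the next update from that speaker starts a new turn.
function mergeTranscript(dialogue, speakerId, text, isFinal) {
  let open = null;
  for (let i = dialogue.length - 1; i >= 0; i--) {
    if (dialogue[i].speakerId === speakerId && !dialogue[i].final) {
      open = dialogue[i];
      break;
    }
  }
  if (open) {
    open.text = text;
    open.final = isFinal;
  } else {
    dialogue.push({ speakerId, text, final: isFinal });
  }
  return dialogue;
}
```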
To use Watson Language Translator and Natural Language Understanding services, we need to get the API key for each service. Then we use the IBM Watson JS SDKs and use the API keys to access the services as shown below:
To translate a transcript, we specify the translate API parameters with the transcript and the language model. In this demo, we want to translate from English to Spanish, but you can choose other language models if you want to.
To analyze the sentiment and emotion of a transcript, we specify the analyze API parameters with the transcript and the feature set. In this demo, we ask Watson to detect keywords and to score sentiment and emotion for those keywords.
Remember that the data analytics we use in this project is just for demonstration. It may not solve a real-world problem. But once we get the transcript, we can apply any AI solution to analyze it and assist agents, for example with compliance monitoring and with guidance that helps agents answer questions accurately and concisely.
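The parameter objects for the translate and analyze calls described above can be sketched as plain builders. They are meant to be passed to the translate() and analyze() methods of the IBM Watson Node SDK clients; the helper names, the keyword limit, and the summary shape are assumptions for illustration.

```javascript
// Parameters for translating one transcript from English to Spanish.
function buildTranslateParams(transcript, modelId = "en-es") {
  return { text: [transcript], modelId };
}

// Parameters for keyword-based sentiment and emotion analysis.
function buildAnalyzeParams(transcript) {
  return {
    text: transcript,
    features: {
      keywords: { sentiment: true, emotion: true, limit: 3 },
    },
  };
}

// Reduces an analysis result to the values shown on screen.
// `analysis` is assumed to be the `result` field of the SDK response.
function summarizeKeywords(analysis) {
  return (analysis.keywords || []).map((k) => ({
    keyword: k.text,
    sentiment: k.sentiment ? k.sentiment.score : null,
    emotion: k.emotion || null,
  }));
}
```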
To push the transcript and other metadata from the server app to the client app, we fire a transcriptUpdate event with the data object.
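As an illustration of the event shape only, the sketch below uses Node's built-in EventEmitter as a stand-in for whatever transport the server uses to reach the browser. The payload fields and the helper name are assumptions, not the project's actual data object.

```javascript
const { EventEmitter } = require("events");

// Stand-in for the server-to-client channel in this sketch.
const eventEngine = new EventEmitter();

// Hypothetical helper: packages one transcript update and fires the event.
function publishTranscript(emitter, speakerId, transcript, isFinal, analysis) {
  const data = {
    speakerId,
    transcript,
    final: isFinal,
    analysis: analysis || null, // sentiment/emotion scores, when available
    timestamp: Date.now(),
  };
  emitter.emit("transcriptUpdate", data);
  return data;
}
```

A listener registered with eventEngine.on("transcriptUpdate", ...) would then forward each data object to the connected client.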
We use React JS to implement the app UI. You can explore the client-side code to see how the transcripts and analytics scores are rendered if you want to. The source code is stored under the client subfolder.
Those are all the essential steps to build the demo app from my GitHub repo. To learn more, I recommend that you clone the demo project, take a closer look at the source code, and try to run it on your local machine with your own app settings.
Run the Demo on a Local Machine
Follow the instructions in the README file to set up the project’s environment and run the demo.
Congratulations! Now you should be able to build this project and develop it further with more features if you want to. For example, you could improve call quality by monitoring calls for talk speed, talk time, and other variables indicative of tone and mood.
Finally, I hope you enjoyed reading this article and find the information useful.
Published at DZone with permission of Paco Vu. See the original article here.