Can a Bot Learn a Second Language?
Creating a chatbot that can understand and use another language can be a difficult task. Read this article to see if it's possible.
Creating a chatbot that can understand and use a language other than English can be an ambitious task. Chatbots are still in their early days and even though there are many NLP libraries available, most of them support only the English language. From stopwords and POS taggers to pretrained word2vec models, it can be time-consuming to work on an NLP problem in a different language.
We at SmartCat tried a somewhat different approach in creating a bot that uses the Serbian language. Using a large dataset of unlabeled sentences written in Serbian and applying ML methods, we created a chatbot with very decent performance: it returned the expected response nine times out of ten. Interested in how we did it? Keep on reading.
As with any other NLP problem, we started by processing our sentences. The processing steps included converting text to lowercase, removing irrelevant numbers, removing punctuation, and stripping extra white space. All of these steps can be applied to text written in any language. Had we been working with English sentences, we would have continued with stemming, lemmatization, and the removal of sparse terms and stopwords, but this got us thinking. Knowing that stopwords occur very frequently in sentences (English examples include a, an, the, such, and as), we decided to compute tf-idf statistics on our dataset. That way, we created a large vocabulary of Serbian words, each with a tf-idf score assigned to it. Words with lower tf-idf scores were considered irrelevant (either stopwords or sparse terms), while words with higher scores were considered important. We skipped stemming and lemmatization for two reasons: first, Serbian has many suffixes, many rules, and even more exceptions governing when a suffix can be applied to a word stem; second, our dataset contains several dialects and many grammar errors, which could lead to wrong conclusions. We ironed out this issue in later steps.
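The preprocessing and tf-idf-based stopword detection described above can be sketched in plain Python. The example sentences and the way scores are aggregated here are illustrative assumptions, not values from our actual pipeline:

```python
import math
import re
import string

def preprocess(text):
    """Lowercase, drop numbers and punctuation, collapse extra whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())  # strip extra white space

def tfidf_vocabulary(sentences):
    """Return {word: highest tf-idf score across all sentences}."""
    docs = [preprocess(s).split() for s in sentences]
    n_docs = len(docs)
    # document frequency of each word
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    scores = {}
    for doc in docs:
        for word in set(doc):
            tf = doc.count(word) / len(doc)
            idf = math.log(n_docs / df[word])
            scores[word] = max(scores.get(word, 0.0), tf * idf)
    return scores

sentences = [
    "Hello, I have a problem with my account!",
    "Hello, can you check my invoice number 42?",
    "My account shows the wrong invoice total.",
]
scores = tfidf_vocabulary(sentences)
# A word appearing in every sentence (e.g. "my") gets idf = 0,
# i.e. it behaves like a stopword; rarer words score higher.
print(sorted(scores, key=scores.get)[:3])
```

Words at the bottom of this ranking are candidates for removal, which is how a stopword list can be derived without a hand-curated one for the target language.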
Alright, now we had some processed sentences to work with. Usually, we would continue with POS tagging so we could learn something about semantics. Unfortunately, there are no POS taggers available for the Serbian language, and creating one would take too much of our time. That is why we opted for creating a word2vec model instead. Even though our dataset wasn't perfect, we were confident it would give us some insight into word similarity and relations. If you are interested in learning more about how word2vec does this, you can check out our blog post. We trained our model on around 400,000 sentences, created word embeddings, and then clustered the retrieved embeddings using K-Means. It worked like a charm: there was clearly a cluster containing greeting phrases, one containing personal information such as names or ID numbers, one containing gratitude phrases, one with complaints, one with problem reports, and so on. After tuning the algorithm and creating subclusters where needed, we were able to label our large dataset in no time. We now knew the semantics of each message, which was exactly what we needed to generate adequate responses.
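A minimal sketch of the clustering step, assuming word embeddings have already been trained (in practice with something like gensim's Word2Vec on the ~400,000 sentences). The toy vocabulary, vectors, and sentences below are invented for illustration; sentence embeddings are taken as the average of their word vectors, which is one common choice:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy word embeddings standing in for a trained word2vec model
# (illustrative assumptions, not real learned vectors).
word_vectors = {
    "hello":    np.array([1.0, 0.1]),
    "hi":       np.array([0.9, 0.2]),
    "thanks":   np.array([0.1, 1.0]),
    "grateful": np.array([0.2, 0.9]),
}

def sentence_embedding(sentence):
    """Average the vectors of the known words in a sentence."""
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    return np.mean(vecs, axis=0)

sentences = ["hello hi", "hi hello hello", "thanks grateful", "grateful thanks thanks"]
X = np.array([sentence_embedding(s) for s in sentences])

# Cluster sentence embeddings; each cluster then gets a semantic label
# (greetings, gratitude, complaints, ...) by inspection.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print(labels)
```

With real data, inspecting a few sentences per cluster is enough to assign it a meaning, which is how the unlabeled dataset gets labeled quickly.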
Creating a ChatBot and Results
We used the clustering algorithm only as a labeling tool, as we believed that a classification model trained on the resulting labeled dataset would perform better over time. In our tests, a random forest achieved the best accuracy. Here is a sample conversation with our bot:
Bot: Hello! Thank you for contacting us. How can I help you?
User: why do you have frequent issues with your website
Bot: We will check immediately with a competent service so we can give you the right information.
User: ok, thank you
Bot: Thank you for contacting us. If you have any further questions, we will be pleased to answer.
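The classification step behind responses like these can be sketched with scikit-learn. The two-dimensional "embeddings" and the two classes below are synthetic stand-ins for the real sentence embeddings and cluster labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for sentence embeddings labeled by the clustering
# step: class 0 ("greeting") around (1, 0), class 1 ("complaint") around (0, 1).
X_greet = rng.normal(loc=(1.0, 0.0), scale=0.1, size=(50, 2))
X_compl = rng.normal(loc=(0.0, 1.0), scale=0.1, size=(50, 2))
X = np.vstack([X_greet, X_compl])
y = np.array([0] * 50 + [1] * 50)

# Train the random forest on the labeled embeddings.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Classify an unseen "greeting-like" embedding.
pred = clf.predict([[0.95, 0.05]])
print(pred)
```

Unlike the K-Means model, this classifier can be retrained incrementally as new labeled conversations accumulate, which is why it is expected to improve over time.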
Whenever a new message is received, the chatbot processes it, classifies it, and returns the response assigned to the predicted class, provided the prediction confidence exceeds the threshold we set. We tested our bot on 100 previously conducted conversations, which contained 210 user messages in total. Of those 210 messages, 23 could not be handled by the bot, and a further 11 were handled incorrectly. In other words, the bot was able to respond in 89% of cases (187 of 210), and 94% of its responses (176 of 187) were correct.
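The thresholding logic is simple enough to show in full. The threshold value, class names, and fallback text here are illustrative assumptions, not the ones used in the original bot; in practice the probabilities would come from the classifier's predict_proba:

```python
def respond(class_probs, responses, threshold=0.7,
            fallback="Let me forward you to one of our agents."):
    """Return the canned response for the most likely class, or a
    fallback message when the classifier is not confident enough."""
    best = max(class_probs, key=class_probs.get)
    if class_probs[best] >= threshold:
        return responses[best]
    return fallback

responses = {
    "greeting": "Hello! Thank you for contacting us. How can I help you?",
    "gratitude": "Thank you for contacting us. If you have any further "
                 "questions, we will be pleased to answer.",
}

print(respond({"greeting": 0.92, "gratitude": 0.08}, responses))  # confident
print(respond({"greeting": 0.55, "gratitude": 0.45}, responses))  # below threshold
```

Tuning the threshold trades coverage against correctness: a higher threshold sends more messages to the fallback but makes fewer mistakes.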
We achieved even better results using n-grams. We split every incoming message into n-grams, where n ranges from 1 to the number of words in the message. That way, we could detect multiple classes in a single message and give a better response; for example, a sentence could contain both a greeting phrase and a question. This approach does require more time to process each message, but it is applicable to use cases where an immediate response is not critical.
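The n-gram splitting described above can be sketched as follows; the example message is invented, and each resulting n-gram would then be classified separately:

```python
def all_ngrams(message):
    """All word n-grams of a message, for n = 1 .. number of words."""
    words = message.split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, len(words) + 1)
        for i in range(len(words) - n + 1)
    ]

ngrams = all_ngrams("hello i have a question")
print(ngrams[:5])   # the five unigrams
print(ngrams[-1])   # the full sentence as the single 5-gram
```

A message of k words yields k + (k-1) + ... + 1 n-grams, which is why this approach costs more per message: every n-gram goes through the classifier.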
We hope we gave you an idea of how to handle your non-English-speaking bot. Have you tried any other approaches? Let us know in the comments; we would be happy to discuss.
Published at DZone with permission of Nina Marjanovic, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.