Making Results of Credit Scoring Transparent Without Compromising Security
This article describes an algorithm that explains in natural language the strengths and weaknesses of a business that affect a scoring decision.
The most common problem with scoring today is that few people can quickly interpret the results a model produces. A complex algorithm examines dozens of factors and how they change over time, and hardly anyone can immediately see what drives the outcome.
Why does a bank need this? To speed up and improve corporate communication and to hold an effective dialogue with the client. Banks had to implement new solutions after the fact and in a rush, which cost them a share of the market that was taken over by young and daring FinTech companies. To ordinary people, a scoring system is a black box, and this reduces their trust in it. After being turned down, most clients contact the support service to find out what exactly the bank did not like and why the loan conditions are so tough.
However, if the bank discloses the factor space used for scoring, fraudsters immediately get a tool for manipulating the most significant indicators of the scoring model. Scoring models must therefore be taught to communicate directly with the client while still meeting security requirements.
Applying the NLP Pipeline Scheme to Scoring
The NLP pipeline is the scheme on which the most powerful voice assistants, such as Siri or Alexa, work. It can be divided into several key stages.
In the first stage, speech is recognized and sounds are translated into symbols, words, and sentences. This stage is skipped for written text. Deep learning with neural networks is the most commonly used modeling approach here.
Then stemming and lemmatization convert the text into a more convenient machine-readable form. The system strips suffixes and endings that make speech fluent but carry little semantic weight, reducing each word to its stem or dictionary form.
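As a rough illustration of this normalization stage, here is a minimal sketch of suffix stripping. The suffix list and the `stem` and `normalize` helpers are assumptions made for the example; they are far cruder than a real stemmer or lemmatizer and are not the method used by any particular scoring system.

```python
# Toy suffix-stripping stemmer. The suffix list is illustrative only and is
# much simpler than a real algorithm such as Porter's stemmer.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


def normalize(text: str) -> list[str]:
    """Split on whitespace, drop surrounding punctuation, and stem each token."""
    tokens = [token.strip(".,;:!?()\"'") for token in text.split()]
    return [stem(token) for token in tokens if token]


print(normalize("Delayed payments"))
```

A production pipeline would normally rely on an established library with language-specific rules instead of a hand-written suffix list.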
It is commonly believed that this stage depends heavily on the grammatical complexity of the language. This is only partially true: modern processors can handle even highly complex languages and extract facts from texts written in them. Analyzing a Hungarian or Icelandic text takes only a few milliseconds longer than analyzing a comparable English text. The lack of libraries for analyzing texts in complex languages, however, is a serious obstacle.
The next stage transforms text into tables using vectorization methods such as bag-of-words and word2vec. The text is transferred to a database, and only its semantic constructions remain, not its complete grammatical structure. An ontological analysis then turns the text into a set of formal constructions, such as objects and subjects, together with the properties and methods that modify them.
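The text-to-table step can be sketched with a bag-of-words count matrix. This is a minimal example built on the standard library; the `bag_of_words` function and the sample documents are assumptions for illustration, and the ontological analysis described above is not covered.

```python
from collections import Counter


def bag_of_words(documents: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return a sorted vocabulary and one row of term counts per document."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    table = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, table


# Hypothetical input documents.
docs = ["loan repaid on time", "loan payment late late"]
vocabulary, table = bag_of_words(docs)
print(vocabulary)
for row in table:
    print(row)
```

Each row is one document and each column is one term, which is the tabular shape a scoring model can consume. Word order is discarded, which is the main limitation of this representation.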
Finally, the context and the contextual meanings of the facts set out in the text are determined. This stage hinges on how context-dependent the language and the particular text are, so legal and other formal texts are much easier to analyze than fiction. At the end of this stage, the text has been turned into a table that is fed into the scoring model.
Next, the scoring model processes the input data; the model itself is trained, tested, and retrained as needed. The important thing is that once the scores are produced and a decision is made based on them, the most interesting part begins: the stages described above are repeated in reverse order:
- Based on the context, the appropriate dictionary is selected.
- Case, gender, and declension are applied, and a grammatically correct sentence is composed.
- If necessary, natural speech is synthesized to deliver the interpretation of the result obtained by machine learning methods.
The algorithm described above thus automatically explains in natural language which weaknesses or strengths of a business or a person influenced a particular decision. It is much better to learn the main reasons for a rejection than simply to receive one. Moreover, the explanation may reveal an error in the customer data, which can be corrected quickly, increasing customer loyalty and sales.
In addition, with this technology employees no longer have to explain how the scoring model works and why it works correctly.
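The reverse pass can be sketched as template-based text generation. The example below assumes the scoring model exposes a signed contribution per factor (positive helps, negative hurts); the factor names, the `PHRASES` dictionary, and the `explain` function are all hypothetical. Note that the output names the reasons without exposing the numeric weights.

```python
# Hypothetical phrase dictionary:
# factor name -> (wording when the factor helps, wording when it hurts).
PHRASES = {
    "revenue_trend": ("revenue has been growing steadily",
                      "revenue has been declining"),
    "debt_load": ("the debt load is moderate",
                  "the debt load is high relative to income"),
    "payment_history": ("past payments were made on time",
                        "past payments were frequently late"),
}


def explain(contributions: dict[str, float], approved: bool, top_n: int = 2) -> str:
    """Build a short explanation from signed factor contributions to the score."""
    # Rank factors by the size of their influence, strongest first.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    strengths = [PHRASES[f][0] for f, c in ranked if c > 0 and f in PHRASES][:top_n]
    weaknesses = [PHRASES[f][1] for f, c in ranked if c < 0 and f in PHRASES][:top_n]

    verdict = "approved" if approved else "declined"
    parts = [f"The application was {verdict}."]
    if weaknesses:
        parts.append("Main weaknesses: " + "; ".join(weaknesses) + ".")
    if strengths:
        parts.append("Main strengths: " + "; ".join(strengths) + ".")
    return " ".join(parts)


print(explain(
    {"revenue_trend": 0.4, "debt_load": -0.9, "payment_history": -0.2},
    approved=False,
))
```

For inflected languages, the fixed phrases would have to be replaced by a step that applies case, gender, and declension, as the list above describes.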
One question remains open: how do you eliminate the risk of revealing the factor space, and how do you deal with the complexity of explaining the dependencies between factors?
Here, the high adaptability of modern scoring systems helps. Real-time learning technologies make it easy to change the role of the factors that influence the final decision, which makes it pointless for fraudsters to game the system. By the time they construct a company or borrower that meets the criteria whose importance they have learned about, the external environment and the scoring models that describe it will have changed, so their efforts will be in vain.
Nonlinear dependencies, and the way the role of a factor changes depending on the other factors around it, are more difficult to explain. So far, a generated text can only state that such relations exist; it cannot interpret them in natural language. However, the technology is constantly improving.
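The article does not say which real-time learning method is used. As one possible illustration, the sketch below applies a single online gradient step of logistic regression, so each new observed outcome shifts the factor weights slightly. The `sgd_update` function, the factor names, and the learning rate are assumptions for the example.

```python
import math


def sgd_update(weights: dict[str, float], features: dict[str, float],
               defaulted: bool, learning_rate: float = 0.1) -> dict[str, float]:
    """Return new weights after one online logistic-regression step."""
    # Predicted probability of default under the current weights.
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    predicted = 1.0 / (1.0 + math.exp(-z))
    # Gradient step: move each weight toward the observed outcome.
    error = (1.0 if defaulted else 0.0) - predicted
    updated = dict(weights)
    for name, value in features.items():
        updated[name] = weights.get(name, 0.0) + learning_rate * error * value
    return updated


# Hypothetical starting weights and one newly observed borrower outcome.
weights = {"debt_load": 0.0, "revenue_trend": 0.0}
weights = sgd_update(weights, {"debt_load": 1.0, "revenue_trend": -0.5}, defaulted=True)
print(weights)
```

Because the weights move with every observed outcome, a snapshot of factor importance that leaks at one point in time gradually stops matching the live model.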