From Synthetic Data to Ethical AI: A Data Science Wish List for 2021 and Beyond
Improving models means making the most of the data we have and thinking ethically about the data we use.
Leonardo da Vinci once wrote, “Art is never finished, only abandoned.”
It’s a problem we can all relate to in some way. When is a task finished, and when can we stop and step away, satisfied that the energy we have put in will yield the best feedback, results, or returns?
Data scientists understand this problem all too well. We rarely have enough data, and even when we do, we might not be able to use it right away, if at all. Thankfully, data science is a field that advances faster than most, which is why an AI wish list is a perfectly reasonable list to make. Here are my top four:
1. Making the Most of Small Data
Speak to any data scientist sitting atop a mountain of data and they will never refuse more, and rightfully so. More data can mean better models, but the problem of data scarcity is unlikely to ever go away. For many financial services use cases, datasets are inherently limited, which in turn limits their modeling potential. As a rough rule of thumb, a neural network needs around 5,000 labeled samples per class to reach acceptable performance on a standard classification problem, and far more to approach human-level performance. That is nowhere near the efficiency with which humans learn.
An area of research that I believe will, and should, become increasingly important is how we make more of limited datasets. Just as a young child can tell the difference between a cat and a dog after viewing a few pictures of each animal, we want to create models that learn in the same way. This is a sign of true learning.
To make use of smaller data, we can, of course, research further into transfer learning and fine-tuning. If we can take models that already carry prior knowledge and adapt them with our smaller datasets, we might be able to build more robust models with the limited data we have.
Think of it like this: if I want to learn to play tennis, I can read a book, watch tutorials online, or practice by myself, but the best and fastest way for me to improve is to hire a coach who has trained hundreds of others, seen the mistakes they have made and the techniques that have worked for them.
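To make the transfer learning idea concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not a production recipe: `pretrained_features` is a hypothetical stand-in for a frozen model that already carries prior knowledge, and only a small logistic-regression head is trained on 16 labeled points. The task (is a point inside a circle?), the function names, and the hyperparameters are all invented for illustration.

```python
import math

def pretrained_features(point):
    """Hypothetical frozen backbone: stands in for a model trained elsewhere.
    Its weights are never updated here; we only reuse what it 'knows'."""
    x, y = point
    return [x * x + y * y, x, y]  # squared radius plus the raw coordinates

def train_head(points, labels, epochs=500, lr=0.1):
    """Train only a logistic-regression head on top of the frozen features."""
    feats = [pretrained_features(p) for p in points]
    weights = [0.0] * len(feats[0])
    bias = 0.0
    for _ in range(epochs):
        for f, label in zip(feats, labels):
            z = sum(w * v for w, v in zip(weights, f)) + bias
            prob = 1.0 / (1.0 + math.exp(-z))
            grad = prob - label  # gradient of the log loss w.r.t. z
            weights = [w - lr * grad * v for w, v in zip(weights, f)]
            bias -= lr * grad
    return weights, bias

def predict(head, point):
    """Return 1 if the head scores the point as 'inside', else 0."""
    weights, bias = head
    z = sum(w * v for w, v in zip(weights, pretrained_features(point))) + bias
    return 1 if z > 0 else 0

# A deliberately small dataset: 8 points inside the circle (label 1)
# and 8 points outside it (label 0).
train_points, train_labels = [], []
for k in range(8):
    angle = 2 * math.pi * k / 8
    train_points.append((0.5 * math.cos(angle), 0.5 * math.sin(angle)))
    train_labels.append(1)
    train_points.append((1.5 * math.cos(angle), 1.5 * math.sin(angle)))
    train_labels.append(0)

head = train_head(train_points, train_labels)
print(predict(head, (0.1, 0.0)), predict(head, (2.0, 0.0)))
```

The raw coordinates alone cannot separate the two classes with a linear model, so the small dataset only becomes useful because the borrowed features already encode the right structure. In a real project the backbone would be a pretrained network, and the head (or the last few layers) would be fine-tuned on the limited data.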
2. Synthetic Data to Improve Models and Overcome Privacy Challenges
Another way of dealing with data scarcity is to create your own. Sometimes, we simply cannot overcome the problem of needing more data. It could be that data collection is too expensive or the data is not possible to collect in a reasonable time frame. This is where synthetic data can provide real value.
Synthetic data can be created by training a model to understand the available data so well that it can generate new data points that look, act, and feel real, that is, mimic the existing data. An example could be a model that predicts how likely small and medium-sized businesses (SMBs) in the retail sector are to default on loans. Factors such as location, number of employees, and annual turnover might be key features in this scenario. A synthetic data model could learn the typical values of these features and create new data points that fit seamlessly into the real dataset, which can then be expanded and used to train an advanced loan default prediction model.
In this example, the parameters of the features that make up the SMB dataset are well understood, so anomalous features are unlikely. For example, an SMB cannot, by definition, employ more than 250 personnel, and they are unlikely to have a turnover of billions of dollars, so a synthetic model would not create such data points. All data points should fit within the statistics of the original dataset.
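The SMB example can be sketched in a few lines of Python. This is a deliberately simple stand-in for a real synthetic data model: it fits an independent normal distribution to each feature and clips samples to domain limits, rather than learning the joint distribution the way a generative model would. The records, the `BOUNDS` values other than the 250-employee cap mentioned above, and the function names are all hypothetical.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical real records: (number of employees, annual turnover in $ thousands).
real_smbs = [
    (12, 900), (45, 3200), (8, 450), (120, 9800),
    (30, 2100), (75, 5600), (200, 15000), (18, 1300),
]

# Domain limits per feature. The 250-employee cap comes from the SMB
# definition used in the text; the turnover ceiling is an assumption.
BOUNDS = [(1, 250), (0, 50_000)]

def fit(rows):
    """Learn the mean and standard deviation of each feature column."""
    return [(statistics.mean(col), statistics.stdev(col)) for col in zip(*rows)]

def sample(params, bounds, n):
    """Draw n synthetic rows, clipping every value to its domain limits."""
    rows = []
    for _ in range(n):
        row = []
        for (mean, stdev), (low, high) in zip(params, bounds):
            value = random.gauss(mean, stdev)
            row.append(round(min(max(value, low), high)))
        rows.append(tuple(row))
    return rows

synthetic_smbs = sample(fit(real_smbs), BOUNDS, 100)
print(synthetic_smbs[:3])
```

Because the features are sampled independently, this sketch ignores the correlation between headcount and turnover, and clipping piles extra mass at the limits. Those are exactly the gaps that more capable generators are meant to close.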
Another benefit of synthetic data is data privacy. In the financial services industry, much of the data is sensitive and there are many legal barriers to sharing datasets. Leveraging synthetic data is one way we can reduce these barriers, as synthetic data points feel real but do not relate to real accounts and individuals. Increasingly, synthetic data is being leveraged across many industries and through model architectures such as generative adversarial networks. Training these models can be computationally expensive, but I think, and hope, we will see more research this year into creating statistically sound synthetic data with methods that are less computationally expensive.
3. Domain Fusion and The Move Towards General AI
Models like GPT-3 are showing great promise in language-related tasks. We have so far seen deep learning solve amazing vision tasks, and I now expect to see some fusion between different domains, such as language and vision. OpenAI’s DALL·E is showing early promise: it takes a natural-language description of an image as input and generates that image. This is a key step towards artificial general intelligence.
The next big leap toward general AI will be a model that can hear, understand, speak, and see. This is the most exciting possibility for data scientists, as such a model will be applicable in many domains and solve tasks across industries. We are still a number of years away from such a model, but we will certainly see some incredible fusion models along the way. The fusion of language and vision is a step in the right direction.
4. Algorithmic Fairness
Machine learning is increasingly used to enhance products, automate tasks, and make decisions. When those decisions and predictions can profoundly affect people’s lives, we need to keep ethical AI in mind and make sure our models are not biased against any group or individual.
As data scientists and machine learning engineers, we have a very specific pipeline we all know and love. However, one essential practice has traditionally been neglected across that pipeline: considering a model’s fairness. We must ensure that the decisions our models make are fair and accurately reflect the populations we model, which means they must not discriminate against particular groups or individuals.
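One simple way to start checking for this kind of discrimination is to compare approval rates across groups, a measure often called demographic parity. The sketch below is illustrative only: the decisions and group labels are made up, the function names are hypothetical, and demographic parity is just one of several fairness metrics, each with its own trade-offs.

```python
from collections import defaultdict

def approval_rates(decisions, groups):
    """Share of positive decisions (1 = approved) within each group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for decision, group in zip(decisions, groups):
        totals[group] += 1
        approved[group] += decision
    return {group: approved[group] / totals[group] for group in totals}

def demographic_parity_difference(decisions, groups):
    """Gap between the highest and lowest group approval rates (0 = parity)."""
    rates = approval_rates(decisions, groups)
    return max(rates.values()) - min(rates.values())

# Hypothetical loan decisions for ten applicants in two groups.
decisions = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

rates = approval_rates(decisions, groups)
gap = demographic_parity_difference(decisions, groups)
print(rates, round(gap, 2))
```

Here group A is approved 4 times out of 5 and group B once out of 5, so the gap is 0.8 minus 0.2, or 0.6. A gap that large does not prove the model is unfair on its own, but it is a signal that the decisions deserve a closer look.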
Algorithmic fairness is a hugely important topic, especially in financial services, and I would love to see more research around fairness metrics and strategies to improve automated decision-making. We need to remember that algorithms don’t remember incidents of unfair bias. But customers do.