The Hidden Causes of Algorithmic Unfairness
Despite our best efforts, bias can creep into decision-making models in a number of ways, especially when we prioritize high performance above other concerns.
Having covered the importance of defining algorithmic fairness in my last article, I would now like to discuss the factors that can lead to model unfairness in more detail.
Data and Bias
Data collection is a fascinating field of study, and there is a real science to collecting and maintaining high-quality datasets. There are many methods for gathering data, from machines and automated systems to manual collection by humans. As data practitioners, we need to be mindful that data can be biased for any number of reasons.
Integrating data from different geographical regions or data that has been collected through different mediums, such as machines and devices that may be inaccurate for certain groups, can lead to bias creeping into datasets.
The point is that data we think is bias-free could have bias already baked into it. If we train algorithms on biased data, we should expect biased outcomes, so how can bias be eliminated early on? It is not possible to list every way that data can become biased, but some common oversights that leave data unrepresentative of the target population that users want to model are as follows:
1. Missing data: Relying on fields that are not available for particular groups in the population.
2. Sample bias: The samples chosen to train models do not accurately represent the population that users want to model, as in the diagram below.
3. Exclusion bias: Data is deleted or left out because it is deemed unimportant.
4. Measurement bias: The way data is collected or measured for training differs systematically from how it is captured in the real world, or faulty measurements distort the data.
5. Label bias: A common pitfall at the data labeling stage of a project, label bias occurs when similar types of data are labeled inconsistently.
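Two of the oversights above, missing data and sample bias, lend themselves to a quick check before any training starts. The following is a minimal sketch rather than a definitive audit: the field names (`group`, `income`), the toy records, and the reference population shares are all hypothetical and exist only for illustration.

```python
from collections import Counter

def missing_rate_by_group(rows, group_key, field):
    """Share of rows in each group where `field` is missing (None)."""
    totals, missing = Counter(), Counter()
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row.get(field) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

def representation_gap(rows, group_key, population_shares):
    """Sample share minus reference population share, per group.

    A positive value means the group is over-represented in the sample.
    """
    counts = Counter(row[group_key] for row in rows)
    n = len(rows)
    return {
        group: counts.get(group, 0) / n - share
        for group, share in population_shares.items()
    }

# Hypothetical toy sample: group B is both under-sampled and missing "income".
sample = [
    {"group": "A", "income": 52000},
    {"group": "A", "income": 61000},
    {"group": "A", "income": 48000},
    {"group": "B", "income": None},
]
# Assumed reference shares for the population we want to model.
reference_shares = {"A": 0.5, "B": 0.5}

missing = missing_rate_by_group(sample, "group", "income")
gaps = representation_gap(sample, "group", reference_shares)
print(missing)
print(gaps)
```

In this toy sample, the check flags group B twice: its income field is always missing, and it makes up a quarter of the sample against an assumed half of the population.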
These are all key instances where bias can creep in, but data is not the only factor that can bring about unfair models.
Performance vs. People
With machine learning tasks, it is only natural to want to maximize performance and minimize errors. From a business standpoint, high-performing models deliver the greatest ROI and better outcomes for stakeholders. However, a learning objective that focuses purely on aggregate performance can itself introduce bias: because a majority group contributes most of the training examples, minimizing average error tends to favor that group over a minority group.
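A small worked example shows how a strong overall score can hide a weak result for a minority group. The numbers below are invented for illustration and do not come from any real model: 90 majority-group records and 10 minority-group records, each given as a (true label, predicted label) pair.

```python
def accuracy(pairs):
    """Fraction of (true_label, predicted_label) pairs that match."""
    return sum(1 for truth, predicted in pairs if truth == predicted) / len(pairs)

# Hypothetical results: the model is right on 88 of 90 majority records
# but only 4 of 10 minority records.
majority = [(1, 1)] * 88 + [(1, 0)] * 2
minority = [(1, 1)] * 4 + [(1, 0)] * 6

overall_accuracy = accuracy(majority + minority)
majority_accuracy = accuracy(majority)
minority_accuracy = accuracy(minority)

print(f"overall:  {overall_accuracy:.2f}")
print(f"majority: {majority_accuracy:.2f}")
print(f"minority: {minority_accuracy:.2f}")
```

The overall figure is 92 correct out of 100, yet the minority group is served correctly less than half the time. Reporting metrics per group, not only in aggregate, is one simple way to make this kind of gap visible.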
Apart from being ethically unconscionable, a long-term focus on performance over fairness will ultimately negatively affect an organization’s bottom line. Take a loan provider as an example; focusing on high performance over fairness will ultimately lead to customer abandonment from disenfranchised groups, which will be hard to win back. As I mentioned in my first article on this topic, algorithms are unaware of unfair bias, but customers are certainly not.
To eliminate this problem, we need to define appropriate measures, in terms of data and algorithmic objectives, that ensure all groups are treated fairly. Instead of looking at purely performance-related metrics, like accuracy or F1 score, we can optimize for a particular fairness metric that will safeguard against biased outcomes.
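As one illustration of what a fairness metric can look like, the sketch below computes the approval rate per group and the largest gap between groups, often called the demographic parity difference. This is only one of several possible metrics and is an assumption for this example; the group labels and decisions are hypothetical.

```python
def selection_rates(groups, predictions):
    """Share of positive (1) predictions within each group."""
    totals, positives = {}, {}
    for group, prediction in zip(groups, predictions):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if prediction == 1 else 0)
    return {group: positives[group] / totals[group] for group in totals}

def demographic_parity_difference(groups, predictions):
    """Gap between the highest and lowest group selection rates (0 means parity)."""
    rates = selection_rates(groups, predictions)
    return max(rates.values()) - min(rates.values())

# Hypothetical loan decisions: 1 = approved, 0 = declined.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
approved = [1, 1, 1, 0, 1, 0, 0, 0]

rates = selection_rates(groups, approved)
gap = demographic_parity_difference(groups, approved)
print(rates)
print(gap)
```

A value like this can be tracked alongside accuracy or F1 score, or used as a constraint during model selection, so that a gain in performance is not accepted if it widens the gap between groups beyond an agreed threshold.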
Of course, the most likely route for bias into a model is the use of sensitive data that identifies certain groups with a high level of accuracy. In many industries, including financial services, sensitive data cannot be used and must be anonymized. However, leaving sensitive fields out of a model does not mean the model has no link to them, because related fields can serve as proxies.
For example, gender, ethnicity, and religion are not to be used in a lending model designed to predict mortgage approval. However, a postal code or a name in the model could allow things like ethnicity and gender to be inferred, restoring a link to the sensitive data that was originally left out. When constructing features, we might think we are removing bias by dropping the sensitive fields, but we must check for these remaining links and remove them so that bias can’t creep in through the back door.
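One rough way to test whether a field acts as a proxy is to ask how much better the sensitive attribute can be guessed once that field is known. The sketch below assumes the sensitive attribute is available in a separate audit dataset, which will not always be the case. The postal codes and group labels are made up, and `proxy_strength` is a hypothetical helper, not part of any library.

```python
from collections import Counter, defaultdict

def proxy_strength(feature_values, sensitive_values):
    """Compare two naive guesses of the sensitive attribute.

    Returns (baseline, with_feature):
      baseline     - accuracy of always guessing the most common sensitive value
      with_feature - accuracy of guessing the most common sensitive value
                     within each value of the candidate feature
    A large jump from baseline to with_feature suggests the feature is a proxy.
    """
    n = len(sensitive_values)
    baseline = Counter(sensitive_values).most_common(1)[0][1] / n

    by_value = defaultdict(Counter)
    for feature, sensitive in zip(feature_values, sensitive_values):
        by_value[feature][sensitive] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return baseline, correct / n

# Hypothetical audit data: postal code alongside a sensitive group label.
postal_codes = ["111", "111", "111", "222", "222", "222", "333", "333"]
ethnicity = ["X", "X", "X", "Y", "Y", "X", "Y", "Y"]

baseline, with_postal_code = proxy_strength(postal_codes, ethnicity)
print(baseline, with_postal_code)
```

Here, knowing the postal code lifts the guess from 4 of 8 correct to 7 of 8, which marks the field as a candidate proxy worth reviewing. The check looks at one field at a time, so it will miss proxies formed by combinations of fields.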
In my next article, I will discuss key metrics that can be used to eliminate the risk of bias tainting models and delivering unfair outcomes.