Machine learning draws on huge amounts of data to make decisions faster and cheaper, and it promises a formalized process that avoids human prejudice. But are these AI decisions fair? Do bias issues extend beyond the datasets? Should fairness be an integral part of AI thinking?
“But software is not free of human influence. Algorithms are written and maintained by people, and machine learning algorithms adjust what they do based on people’s behavior. As a result … algorithms can reinforce human prejudices.” — Claire Miller, 2015.
In the first part of this series, we studied the general biases in clinical research and data science. In this article, we look into AI bias issues in more detail. We will:
- present the depth of the problems through examples in computer vision, NLP, and machine learning.
- discover some of the root causes.
- discuss bias problems in datasets.
- list the questions data scientists should ask about their datasets.
Biases in Computer Vision
Is deep learning biased on race and gender? In 2018, the Gender Shades project studied the accuracy of facial analysis models. It found that in many ML (machine learning) models, the gender classifier performs worst on darker-skinned females.
An evaluation of four gender classifiers revealed a significant gap exists when comparing gender classification accuracies of females vs males (9–20%) and darker skin vs lighter skin (10–21%). (Quote)
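The kind of per-group accuracy audit behind these numbers is simple to sketch. The counts below are illustrative stand-ins, not the Gender Shades figures, and the subgroup names are just for the example:

```python
# Toy audit: accuracy of a gender classifier broken down by subgroup.
# All counts below are illustrative, not the Gender Shades figures.
counts = {
    # subgroup: (correct predictions, total images)
    "lighter_male":   (297, 300),
    "lighter_female": (282, 300),
    "darker_male":    (264, 300),
    "darker_female":  (198, 300),
}

accuracy = {g: correct / total for g, (correct, total) in counts.items()}
gap = max(accuracy.values()) - min(accuracy.values())

for group, acc in accuracy.items():
    print(f"{group:15s} accuracy = {acc:.1%}")
print(f"max disparity = {gap:.1%}")
```

Reporting only the overall accuracy (here, a respectable average) would hide the 33-point disparity that the per-group breakdown exposes.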
Many computer vision datasets have selection and observer bias (definitions in the last article). They are biased on sensitive characteristics, like race and gender, which leads to disparate performance in the predictions. These bias problems are nothing new. Back in 2009, Zamen and Cryer posted a YouTube video showing an HP webcam failing to track a Black person's face. As demonstrated later, the problems are not only about accuracy disparity. They are common, persistent, and hard to address.
Twitter Image-cropping Algorithm
When images are displayed on social media, they are often cropped to look better.
In 2020, Twitter had to apologize for the racial bias in its cropping algorithm. The algorithm focused on “salient” image regions, those a person is most likely to gaze at when freely viewing the image.
The cropping algorithm (paper) uses three public datasets — one of them is SALICON. SALICON used Amazon Mechanical Turk, a crowdsourcing platform, to collect mouse-tracking data as a cheaper substitute for eye-tracking devices in tracking human gaze. Such a dataset is vulnerable to observer bias, but its full impact was not realized until later.
Colin Madland noticed that the cropping algorithm consistently selected his face, but not his darker-skinned colleague's. Here, we have two long vertical photos, with the positions of the persons flipped in the second picture. The algorithm picks McConnell over Obama for both pictures when generating the cropped thumbnail.
Similar tests were done on Simpsons characters, black and yellow Labradors, and different people. The algorithm picks the lighter subjects over the darker ones.
Seven months later, Twitter wrote an article on what it had learned: the algorithm had a bias favoring women and white individuals.
Here is a chart of which subjects (by gender and race) are selected more frequently. White women are selected the most; Black men are likely selected the least.
Twitter also released a technical review paper regarding the cropping bias.
The paper states a few challenges:
- “Dataset: One key challenge in validating and communicating such problems is the lack of high-quality datasets for fairness analysis available, especially to industry practitioners.”
- “Lack of universal formalized notion of fairness: One key focus of ethical machine learning has been on developing formal notions of fairness and quantifying algorithmic bias. Another challenge is the lack of a universal formalized notion of fairness that can be easily applied to machine learning models; rather, different fairness metrics imply different normative values and have different appropriate use cases and limitations.”
- “Lack of a satisfactory definition of universally appropriate metrics or optimization objectives for machine learning for certain classes of problems.”
We will look into #1 and #2 later. Issue #3 also surfaced in Frances Haugen’s whistleblowing case against Facebook: what objective function does a company emphasize when a model is deployed? How much risk of harm is a company willing to accept for its corporate priorities?
Finally, Twitter decided to let users control how photos appear. In short, the AI algorithm was shelved.
But don’t mistake this for purely a dataset bias problem. Model design also plays a role in the cropping issue. As discovered, using only the highest-“salient” region (an argmax function) creates suboptimal and biased results.
Notably, selecting an output based on a single point with highest predicted scores (the argmax selection) can amplify disparate impact in predictions, not only in automated image cropping but also in machine learning in general. (Quote)
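The amplification effect of argmax selection can be demonstrated in a few lines. This is a hedged sketch with made-up saliency scores (the faces and their scores are hypothetical, not Twitter's model): even a tiny, possibly spurious score difference makes argmax pick the same face every time, while sampling in proportion to saliency roughly preserves the underlying split.

```python
import random

# Toy saliency scores for two faces in the same image. A 0.51 vs 0.49
# difference may be noise, yet argmax turns it into a 100% preference.
saliency = {"face_A": 0.51, "face_B": 0.49}

def crop_argmax(scores):
    # Deterministic: always return the single highest-scoring region.
    return max(scores, key=scores.get)

def crop_sample(scores, rng):
    # Stochastic alternative: sample a region in proportion to its score.
    faces, weights = zip(*scores.items())
    return rng.choices(faces, weights=weights, k=1)[0]

rng = random.Random(0)
argmax_picks = [crop_argmax(saliency) for _ in range(1000)]
sample_picks = [crop_sample(saliency, rng) for _ in range(1000)]

print("argmax picks face_A:  ", argmax_picks.count("face_A") / 1000)
print("sampling picks face_A:", sample_picks.count("face_A") / 1000)
```

The argmax rule selects face_A in every single crop, while proportional sampling selects it only about half the time, which is the amplification the paper describes.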
Other algorithmic changes can be made to promote fairness. Pre-processing can remove biased context from the input features. Post-processing can recalibrate the scores for equal outcomes. Some researchers add fairness constraints when optimizing an objective function. (details in later articles)
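As one concrete illustration of the post-processing idea, the sketch below picks a separate score cutoff per group so that each group is selected at the same rate. The scores, group names, and target rate are all invented for illustration; this is one simple recalibration scheme, not the only one:

```python
# Post-processing sketch: choose a per-group threshold so each group's
# selection rate matches a common target. Scores are synthetic.
def threshold_for_rate(scores, target_rate):
    """Return the score cutoff that selects roughly `target_rate` of the group."""
    ranked = sorted(scores, reverse=True)
    k = max(1, round(target_rate * len(ranked)))
    return ranked[k - 1]

group_scores = {
    "group_a": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05],
    "group_b": [0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15],
}

target = 0.3  # select the top 30% of each group
thresholds = {g: threshold_for_rate(s, target) for g, s in group_scores.items()}
rates = {g: sum(x >= thresholds[g] for x in s) / len(s)
         for g, s in group_scores.items()}
print(thresholds)  # group_b gets a lower cutoff than group_a
print(rates)       # both groups end with the same selection rate
```

A single global threshold of 0.7 would select 30% of group_a but 0% of group_b; the per-group cutoffs equalize the outcome without retraining the model.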
In ML, the lack of model interpretability blinds us to potential biases. As Twitter dug deeper, it formed an idea of what may have gone wrong with the model (still speculative).
One possible explanation of disparate impact is that the model favors high contrast, which is more strongly present in lighter skin on dark background or darker eyes on lighter skin, and in female heads which have higher image variability. (details)
Twitter also tested the “male gaze” issue: whether the cropping is biased towards specific body parts. It concluded that no such bias was found, though sometimes the images are cropped to players’ numbers on a sports jersey.
As indicated here, better model interpretability helps us to identify biases and rebuild public trust.
Facebook Recommendation Engine (Labeling Black Men As ‘Primates’)
In 2021, Facebook put the title “Keep seeing videos about Primates?” on a video featuring black men.
A similar problem happened at Google in 2015: Black persons were misclassified as gorillas by the Google object-recognition algorithm. To overcome the problem, Google censors words like “gorilla”, “chimp”, “chimpanzee”, and “monkey” in its photo-related applications.
For example, even after 6 years, searching for “gorilla” in Google Photos above still returns an empty result.
As these problems keep repeating themselves, always test any computer vision application on individuals with dark skin.
When AI applications are deployed globally, achieving diversity in datasets becomes much harder. For example, computer vision applications do not adapt well to cultural differences. An AI application can caption the word “wedding” for most of the pictures below, but not the one on the right, which has a different cultural context. A diverse team with different backgrounds gives a better line of sight into potential biases.
Bias in NLP
In this section, we will explore biases in many popular NLP technologies.
How often do people misuse or curse at Google Assistant? In 2015, Brahnam estimated that 10% to 50% of interactions with conversational agents are abusive. As Google Assistant keeps learning from real-life conversations, how can we prevent the Assistant from learning the bad things?
In 2016, Microsoft released Tay, a chatbot, on Twitter. It was designed to engage people through tweets or messages. Microsoft trained Tay’s model with anonymized public data along with material provided by professional comedians. By 2016, Microsoft had already deployed XiaoIce successfully in China with 40 million users. To counter the cultural differences, Microsoft implemented additional filters and conducted extensive user research. After release, Tay would continue learning from Twitter users' responses to adapt and improve the model.
Within a day, everything went to hell. As explained by this article:
In a coordinated effort, the trolls exploited a “repeat after me” function that had been built into Tay, whereby the bot repeated anything that was said to it on demand. But more than this, Tay’s in-built capacity to learn meant that she internalized some of the language she was taught by the trolls, and repeated it unprompted. For example, one user innocently asked Tay whether Ricky Gervais was an atheist, to which she responded: “Ricky Gervais learned totalitarianism from Adolf Hitler, the inventor of atheism.”
Zoë Quinn argued that:
If a bot learns how to speak on Twitter — a platform rife with abusive language — then naturally it will learn some abusive language.
In two days, Tay was shut down. When a system is trained with real-time data, bad actors must be filtered out.
Pinterest is more popular with women, while online forums are more popular with men. This type of imbalance should be mitigated when choosing the training datasets. Whom we learn from matters. Many NLP models are trained with Wikipedia datasets. As one of the Wikipedia pages says:
The English Wikipedia currently has 42,510,337 users who have registered a username. Only a minority of users contribute regularly (124,158 have edited in the last 30 days), and only a minority of those contributors participate in community discussions.
As shown below, the editors in Wikipedia do not represent the general population. Wikipedia is vulnerable to volunteer bias.
Many sentiment analysis models use datasets from IMDb. Like Wikipedia, IMDb suffers from volunteer bias. For a product to be successful, this kind of bias must be addressed. For example, IBM NLU has put a heavy emphasis on combating AI biases.
Word embedding is a popular NLP technique for representing language. It captures word relationships: as “woman” relates to “man”, “queen” relates to “king”.
As shown in this study, the word embedding trained on Google News articles is biased. For example, if “man” is associated with “programmer”, why is “woman” associated with “homemaker”?
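The analogy arithmetic behind this finding is easy to sketch. The 3-dimensional vectors below are hand-made toys that mimic the biased geometry the study reported; real embeddings such as the Google News vectors have hundreds of dimensions and are learned, not hand-set:

```python
import math

# Toy 3-d "embeddings", hand-crafted to mimic the biased geometry the
# study found in Google News vectors. Purely illustrative values.
vec = {
    "man":        [1.0, 0.2, 0.1],
    "woman":      [0.1, 1.0, 0.1],
    "programmer": [0.9, 0.1, 0.8],
    "homemaker":  [0.1, 0.9, 0.8],
    "carpenter":  [0.8, 0.15, 0.7],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# "man is to programmer as woman is to ?"  ->  programmer - man + woman
query = add(sub(vec["programmer"], vec["man"]), vec["woman"])
answer = max((w for w in vec if w not in ("man", "woman", "programmer")),
             key=lambda w: cosine(vec[w], query))
print(answer)
```

With these toy vectors the analogy resolves to "homemaker", reproducing in miniature the biased association the study observed in real embeddings.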
The newer NLP technologies are no stranger to biases.
Language models generate coherent text when prompted with a sequence of words as context (a pretext). Here is an example where the grayed words are generated automatically by the GPT-2 model. (A GPT-3 model will generate far more coherent content.)
The figure below contains the sentiments of Wikipedia articles on different topics, compared with the sentiments of texts generated by different text-generation technologies, like BERT. As shown in the bottom half, many AI-generated texts carry different sentiments on Islam topics than Wikipedia does. Very likely, all of them, including Wikipedia, are biased. Since these technologies are the cornerstones of many NLP applications, their impact can be far-reaching.
Google Perspective API
Even the AI applications that detect bias can be biased themselves.
Google Perspective API analyzes the toxicity of a text. During the initial fairness review at Google, texts like “I am gay.” were marked as toxic.
The key reason was that the texts in the dataset related to homosexuality were mostly toxic. This class imbalance taught the model to associate any text about homosexuality with toxicity. Here is a 90-second video explaining the source of the bias in more detail.
One way to address this problem is to replace all “identity” words with a token. So “I am straight” and “I am gay” both become “I am IDENTITY”, and the ML model cannot make judgments based on membership groups. In addition, a diversified pool of content contributors can be recruited to write positive or neutral comments on sensitive topics to break the class imbalance.
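The identity-masking idea can be sketched as a small preprocessing step. The term list and token name below are a tiny illustrative sample, not Perspective's actual implementation:

```python
import re

# Hedged sketch of identity masking: replace group-identity terms with one
# shared token before training/scoring, so the model cannot key on group
# membership. The term list here is a small illustrative sample.
IDENTITY_TERMS = ["gay", "straight", "lesbian", "bisexual", "transgender"]
PATTERN = re.compile(r"\b(" + "|".join(IDENTITY_TERMS) + r")\b", re.IGNORECASE)

def mask_identity(text):
    return PATTERN.sub("IDENTITY", text)

print(mask_identity("I am gay."))       # both sentences now look identical
print(mask_identity("I am straight."))  # to any downstream model
```

After masking, the two sentences are indistinguishable, so a toxicity model can no longer score them differently based on the identity term alone.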
Many machine learning algorithms learn from historical data. In the case study below, the healthcare model under study was badly biased. The system suggested different levels of health treatment based on race: Black patients received less care than white patients.
The U.S. health care system uses commercial algorithms to guide health decisions. Obermeyer et al. find evidence of racial bias in one widely used algorithm, such that Black patients assigned the same level of risk by the algorithm are sicker than White patients.
The authors estimated that this racial bias reduces the number of Black patients identified for extra care by more than half. Bias occurs because the algorithm uses health costs as a proxy for health needs. Less money is spent on Black patients who have the same level of need, and the algorithm thus falsely concludes that Black patients are healthier than equally sick White patients. Reformulating the algorithm so that it no longer uses costs as a proxy for needs eliminates the racial bias in predicting who needs extra care.
As shown in this example, using costs as a proxy for health needs was a bad design choice. The proxy itself is strongly impacted by race, so race becomes a signal the model learns to make decisions.
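The proxy failure can be reproduced with two synthetic patients. The numbers below are invented solely to illustrate the mechanism Obermeyer et al. describe, not estimates from their data:

```python
# Synthetic illustration of the cost-proxy failure: two patients with
# IDENTICAL health need, but historically less is spent per unit of need
# on one group. A model that predicts cost inherits that gap.
patients = [
    {"group": "white", "need": 7, "spend_per_need": 1000},
    {"group": "black", "need": 7, "spend_per_need": 600},
]

for p in patients:
    p["cost"] = p["need"] * p["spend_per_need"]  # the label the model learns

# Even a PERFECT cost predictor mis-ranks need:
ranked_by_cost = sorted(patients, key=lambda p: p["cost"], reverse=True)
print([p["group"] for p in ranked_by_cost])  # white ranked "sicker" first
print([p["need"] for p in patients])         # yet true need is identical
```

Swapping the label from cost to need (when need can be measured) removes the gap entirely, which is essentially the reformulation the authors propose.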
Amazon used a machine learning model to rate candidates for software jobs. In 2015, it realized the model was not gender-neutral. The model was trained with resumes from the previous 10 years. Since male engineers dominated the field, the model learned to reward resumes containing words common in male resumes and to penalize words common in female resumes. In short, the model taught itself that male candidates were preferable. The model was oversimplified and overfitted to discriminatory data that could not be generalized. It was not complex enough to capture information from the minority group. Instead, it took the easy guess and associated “male-like” resumes with success. Amazon scrapped the project later.
As shown in the top right diagram below, we should question whether the training data is discriminatory in the development process.
Bias in Machine Learning
There are plenty of biases in other machine learning models. Bias comes in different shapes: intentional, subconscious, rooted in historical prejudice, and so on.
Here is another interesting bias reported by C3.ai. Fraud in remote areas is marked “likely” in one AI application because of serious reporting bias.
Early in C3 AI’s history, we developed machine learning algorithms to detect customer fraud. In one customer deployment, the algorithms were significantly underperforming in a particular geography, a remote island. Under further examination, we found substantial reporting bias in the data set from the island. Every historical investigation performed on the island was a fraud case, skewing the data distributions from that island.
Because of the island’s remoteness, investigators wanted to be sure that a case would be fraudulent before they would travel there. The algorithm incorrectly maximized performance by marking all customers on the island with a high fraud score. Because the frequency of events, properties, and outcomes in the training set from that island differed from their real-world frequency, the model required adjustment to counteract implicit bias caused by the selective fraud inspections on the island.
COMPAS is a decision-support tool used by some U.S. justice systems. It has become a textbook case study in fairness.
Judges and parole officers use the system to score criminal defendants’ likelihood of reoffending if released. It provides suggestions on sentencing, parole, and bail. Here is the role of COMPAS as stated in one Wisconsin county.
The COMPAS assessment will help provide useful information in connection with arrest, bond, charging, sentencing, programming and probation supervision decisions.
ProPublica, which focuses on investigative journalism, released a report challenging the accuracy of the system:
ProPublica released an article with the title “There’s software used across the country to predict future criminals. And it’s biased against blacks.” Its key argument was that Black defendants face a much higher false positive rate than white defendants.
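The false positive rate at the center of that argument is computed per group, as in this hedged sketch. The counts are illustrative placeholders, not ProPublica's published numbers:

```python
# False-positive-rate check per group, the core of ProPublica's analysis.
# fp = labeled high-risk but did NOT reoffend; tn = labeled low-risk and
# did not reoffend. Counts are illustrative placeholders.
outcomes = {
    "black": {"fp": 805, "tn": 990},
    "white": {"fp": 349, "tn": 1139},
}

def false_positive_rate(fp, tn):
    # Share of people who did not reoffend yet were labeled high-risk.
    return fp / (fp + tn)

fpr = {g: false_positive_rate(**c) for g, c in outcomes.items()}
for group, rate in fpr.items():
    print(f"{group}: FPR = {rate:.1%}")
```

Note that a model can have equal FPRs overall and still show a large gap like this once the matrix is sliced by group, which is why the per-group slice matters.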
The developer of COMPAS, Northpointe, questioned the analysis. Northpointe claimed the COMPAS scores were calibrated by group, a commonly used fairness criterion. So is this fair in the COMPAS context? We will come back to this issue later. One of the key issues at stake here is the transparency of AI decisions.
A legal challenge followed. The petition claimed
Whether it is a violation of a defendant’s constitutional right to due process for a trial court to rely on the risk assessment results provided by a proprietary risk assessment instrument such as the COMPAS at sentencing because the proprietary nature of COMPAS prevents a defendant from challenging the accuracy and scientific validity of the risk assessment.
In 2016, the Wisconsin Supreme Court ruled against the defendant, and the U.S. Supreme Court later declined to hear the case. But to gain public and political support for AI applications, secrecy is counterproductive.
Many ML systems are vulnerable to all sorts of biases, including reporting bias, expectation bias, confirmation bias, attrition bias, performance bias, and more. As shown in the bottom right below, we need a system to safeguard the whole process wherever biases may be introduced, including the design, testing, validation, and deployment phases. Renewed AI governance is needed, as credibility and transparency are critical in gaining public trust. Without them, adoption will be limited.
Here is an overview of Google's ethical review of new AI technologies. It involves external experts when necessary. Such openness can be an effective defense against a public-opinion nightmare.
Recommenders are one of the most serious sources of bias. In much ML training, data collection is independent of the training, so careful data collection and screening can keep the bad actors out. Recommenders, however, collect user engagement and behavior in real life to tune the model continuously. Responses on many social platforms are often opinionated, and content appeals to emotions rather than facts. And there are plenty of bad actors who want to manipulate opinions. These responses magnify the bias and, unfortunately, create a reinforcing loop (more examples later).
But there are cases that are less intentional. In the Facebook ads below, the copywriting and the images of two recruiting ads are gender-neutral.
When researchers used these ads to evaluate gender bias, there is a difference in job ad delivery by gender. As stated by the paper,
We confirm that Facebook’s ad delivery can result in skew of job ad delivery by gender beyond what can be legally justified by possible differences in qualifications, thus strengthening the previously raised arguments that Facebook’s ad delivery algorithms may be in violation of anti-discrimination laws. We do not find such skew on LinkedIn.
People may disagree on the level of harm created here. But the recommender is definitely gender-aware, intentional or not. US anti-discrimination laws do cover job advertisements; otherwise, different genders would face disparate treatment in employment. If women have less access to high-paying jobs they want, the legal impact can be high.
Let’s get into a more serious example. When Latanya Sweeney, a Black Harvard professor, searched her name online (a 2013 paper), she realized the display ads were full of criminal-record-checking ads. Was this retargeting algorithm racially biased?
If AI algorithms turn back the progress on the general beliefs on equality, we will face the firing squad sooner or later.
Bias persists in our society.
We hope that we can eliminate bias by training ML models purely on carefully filtered data. But in reality, it is not easy.
But an algorithm is only as good as the data it works with. Data is frequently imperfect in ways that allow these algorithms to inherit the prejudices of prior decision makers. In other cases, data may simply reflect the widespread biases that persist in society at large. In still others, data mining can discover surprisingly useful regularities that are really just preexisting patterns of exclusion and inequality. Unthinking reliance on data mining can deny historically disadvantaged and vulnerable groups full participation in society. Worse still, because the resulting discrimination is almost always an unintentional emergent property of the algorithm’s use rather than a conscious choice by its programmers, it can be unusually hard to identify the source of the problem or to explain it to a court. (quote)
Predictive policing systems attempt to forecast crime for police departments using algorithmic techniques. But the training dataset can be very biased. Police databases “are not a complete census of all criminal offences, nor do they constitute a representative random sample” (study).
Empirical evidence suggests that police officers — either implicitly or explicitly — consider race and ethnicity in their determination of which persons to detain and search and which neighbourhoods to patrol. If police focus attention on certain ethnic groups and certain neighbourhoods, it is likely that police records will systematically over-represent those groups and neighbourhoods. That is, crimes that occur in locations frequented by police are more likely to appear in the database simply because that is where the police are patrolling. (study)
For example, the instances of crime reported and recorded are not representative of crime across a city. The right diagram below shows illicit drug use estimated from a non-criminal-justice source. It mismatches the crime distribution known to the police (the left diagram). This points to potentially huge selection bias and observer bias in policing training datasets.
Community trust in police can be another major source of reporting bias. Crimes can be underreported in areas where trust is weak. On the other hand, better-off areas may see overreporting.
In addition, the trained models can be self-fulfilling and reinforce the biases.
To make matters worse, the presence of bias in the initial training data can be further compounded as police departments use biased predictions to make tactical policing decisions. Because these predictions are likely to over-represent areas that were already known to police, officers become increasingly likely to patrol these same areas and observe new criminal acts that confirm their prior beliefs regarding the distributions of criminal activity. (same study)
Heavily policed areas will have more crimes reported by police. Less attention will be devoted to other areas, and less contradicting information will be collected. The biases are reinforced and magnified in further ML training.
The newly observed criminal acts that police document as a result of these targeted patrols then feed into the predictive policing algorithm on subsequent days, generating increasingly biased predictions. This creates a feedback loop where the model becomes increasingly confident that the locations most likely to experience further criminal activity are exactly the locations they had previously believed to be high in crime: selection bias meets confirmation bias. (same study)
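This feedback loop can be simulated in a few lines. The sketch below is a deliberately crude toy with invented districts and rates: both districts have identical true crime rates, patrols go wherever the recorded data points, and crimes are only recorded where police patrol.

```python
import random

# Toy feedback loop: two districts with IDENTICAL true crime rates.
# Each day all patrols go to the district with more recorded crime,
# and crimes are recorded only where police patrol, so a small initial
# imbalance is never contradicted and keeps growing.
rng = random.Random(7)
true_rate = 0.3                                   # same in both districts
recorded = {"district_A": 12, "district_B": 8}    # slightly skewed history

for day in range(100):
    target = max(recorded, key=recorded.get)      # patrol where the data points
    # 10 patrols, each observing a crime with the (equal) true probability
    recorded[target] += sum(rng.random() < true_rate for _ in range(10))

print(recorded)  # district_A's count balloons; district_B's never moves
```

Because district_B is never patrolled, no data ever arrives to contradict the model's belief, exactly the "selection bias meets confirmation bias" loop the study describes.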
In healthcare, the level of health care and diagnosis varies across areas with different economic statuses. This disparity causes underreporting of incidents in many minority groups.
Prejudice or observer bias (intentional or not) can also overreport incidents. Black Americans are 2.4 times more likely to be diagnosed with schizophrenia (details).
These observations suggested that among African Americans, psychotic symptoms were excessively weighted by clinicians, skewing diagnoses toward schizophrenia spectrum disorders, even when the patients had ratings of depression and manic symptoms similar to those of white patients. (source)
Even though the medical communities are trained to be professional and scientific, many measurements or assessments remain subjective. This leads to diagnostic and assessment biases in the medical field. The types of treatment, diagnosis, and attention skew the data that ML learns from.
In addition, certain features for the protected groups can be less reliable and less informative because of prejudgment or service accessibility. For example, disadvantaged groups may have less access to accurate diagnostic tests, so features that accurately predict outcomes elsewhere are less reliable for them. Therefore, decisions based on these data are biased or underperform.
Other measurements, like those in the Amazon resume system, depend on judgments made previously by humans. ML will exploit bias if those decisions are skewed towards a specific demographic group.
Here is another real-life example of reporting bias (source) on San Francisco shoplifting crime data.
A closer look at the data shows that the spike in reported shoplifting came almost entirely from one store: the Target at 789 Mission St. in the Metreon mall. In September alone, 154 shoplifting reports were filed from the South of Market intersection where the Target stands, up from 13 in August.
What happened at this particular Target? Did the store see a massive spike in shoplifting in September? No, said store manager Stacy Abbott. The store was simply using a new reporting system implemented by the police that allows retailers to report crime incidents over the phone.
Questions to Ask About the Datasets as a Data Scientist
To scrutinize the data, here is a list of questions data scientists should ask and verify. Plans should be created to identify and mitigate potential issues in the datasets.
Source of data:
- Who created the dataset? What was its original purpose, and what is it intended for?
- Who are the information contributors, or what are the sources? Does the corresponding demographic resemble the population of interest?
- Are data under-reported or under-diagnosed in protected groups or regions?
- Or on the contrary, are protected groups or regions over-reported or severely scrutinized?
- Will some data be easier to collect or detect in one group or region than others? Will there be trust and accessibility issues?
- Does the collection process attract one group over others? Are they highly biased or opinionated?
- Will income, economic well-being, and racial stereotypes influence the level of service and the type of diagnosis?
- Do the datasets cover the same range of scenarios and conditions for disadvantaged groups?
- Over what timeframe are the data collected?
Character of data:
- Are the distributions of labels heavily skewed in some groups or subtopics?
- Is the model trained on decisions and judgments made by past human decision-makers? Did those decisions favor one specific group?
- Did some demographic groups do much better or worse in the past?
- Are the continuously collected data self-fulfilling? Will they reinforce or magnify a preexisting bias?
- Is the reported data susceptible to confirmation bias?
- Will outcome measures be objectively measured without human interpretation or influence?
Content of data:
- Will income, economic well-being, and racial stereotypes influence what is reported, how information is interpreted, and how data is measured in some regions or groups?
- Do we apply the same standards and procedures for how data is collected and measured?
- What noise and inaccuracies exist among different groups?
- Is the data reliable and informative across all regions and groups? Is it more accurate in making predictions for some groups than for others?
- Do disadvantaged groups have more missing data in their records? If missing, how will the information be filled?
- Will disadvantaged groups share the same level of information with the data collector?
- Are labelers properly trained, and do they apply the same standards in labeling data?
- For computer vision, does the data cover all the scenarios needed, including combinations of pose, environmental conditions (like lighting), race, and age, etc.?
Data scientists should examine samples in the dataset proactively to identify potential biases (details later). A verification and mitigation process must be in place if the datasets are highly vulnerable to bias.
As demonstrated with the Google Perspective API, class imbalance in datasets and predictions can cause problems. Evaluate the confusion matrix for each group to check for inclusion.
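A minimal sketch of that per-group evaluation, using a made-up toxicity classifier's outputs (the groups, labels, and counts are all invented):

```python
from collections import Counter

# Never evaluate one aggregate confusion matrix -- slice it by group.
# Toy (group, true_label, predicted_label) records for a toxicity model.
rows = [
    ("identity_mention", "non_toxic", "toxic"),
    ("identity_mention", "non_toxic", "toxic"),
    ("identity_mention", "toxic",     "toxic"),
    ("no_identity",      "non_toxic", "non_toxic"),
    ("no_identity",      "non_toxic", "non_toxic"),
    ("no_identity",      "toxic",     "toxic"),
]

matrices = {}
for group, truth, pred in rows:
    matrices.setdefault(group, Counter())[(truth, pred)] += 1

for group, m in matrices.items():
    fp = m[("non_toxic", "toxic")]
    negatives = fp + m[("non_toxic", "non_toxic")]
    print(f"{group}: false positive rate = {fp / negatives:.0%}")
```

Here the overall accuracy looks acceptable, yet the per-group slice shows every non-toxic identity-mentioning text is misflagged, the same failure mode the Perspective review uncovered.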