AI biases are common, persistent, and hard to address. We wish people would see only what AI can do and not its flaws. But this is like driving a Lamborghini with the check engine light on. It may run fine for the next few weeks, but an accident is waiting to happen. To address the problem, we need to know what fairness is. Can it be judged or evaluated?
In the previous article, we looked at the complexity of AI bias. In this article, we will discuss:
- anti-discrimination doctrines,
- using performance metrics to analyze fairness,
- fairness criteria,
- generalized fairness criteria, and
- tools to identify bias.
Understand US Anti-Discrimination Doctrines
All AI designs need to comply with applicable laws. In this section, we will discuss the US anti-discrimination doctrines that frame these issues.
Sensitive characteristics are bias factors that are practically or morally irrelevant to a decision. For example, under US law, gender and age are irrelevant in employment. Hiring decisions cannot be influenced by these factors; they are unjustifiable grounds for differentiation. Society also believes that we cannot treat a qualified job applicant unfavorably because he/she has a disability. Even though the applicant may need more accommodations or may be less productive, we believe that it is morally irrelevant.
But sensitive characteristics are domain-dependent. For example, while gender is irrelevant in employment, it is important in medical treatment. Also, US anti-discrimination laws apply to specific domains only: credit, employment, housing, education, and public accommodation. Outside these domains, these characteristics are not protected by US anti-discrimination laws.
There are two legal doctrines in the US anti-discrimination laws. A plaintiff can claim discrimination based on “disparate treatment” or “disparate impact”. Under the first doctrine, all protected groups should receive equal treatment in the protected domains. It addresses disparate treatment and inequality of opportunity, with the goal of procedural fairness. But even with equal procedures, a process can impact protected groups more negatively. The second doctrine is about “equal outcome”: it prevents avoidable and unjustified harm to protected groups. It targets disparate impact, with the goal of addressing distributive justice and minimizing inequality of outcome.
But there are tensions between these two principles. In 2003, New Haven officials invalidated promotion test results for firefighters because none of the Black firefighters scored high enough for promotion. Nineteen White and one Hispanic firefighter who passed the test sued the city. The city said that “if they had relied on those test results with such a wide racial disparity that they could have been open to being sued by the minority” (quote). This claim was based on disparate impact. The Supreme Court issued a split 5-4 decision against the city. Justice Kennedy ruled against its claim because declining to certify the examination created disparate treatment: race was used to determine whether the results should be certified. The Supreme Court ruling considered far more factors, and the details are complex. But putting legal opinions aside, the tension between disparate treatment and disparate impact is hard to resolve for many ML problems. Potential ML designs that advance “equal outcome” may violate “equal treatment”. Ironically, without such measures, the outcome could be biased. On the other hand, ML is good at finding subtle signals. As shown before, even when gender information is not explicitly given, ML can learn it from other input features. “Equal treatment as blindness” may not exist in ML.
Model performance metrics
In machine learning, we collect performance data in the form of a confusion matrix with true positives, true negatives, false positives, and false negatives.
And we use them in the metrics below to evaluate model performance.
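As a minimal sketch, the metrics above can be derived directly from the four confusion-matrix counts (the function and variable names here are illustrative, not from any particular library):

```python
def metrics(tp, fp, tn, fn):
    """Derive standard performance metrics from raw confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),                   # true positive rate (recall)
        "fpr": fp / (fp + tn),                   # false positive rate
        "precision": tp / (tp + fp),             # correct among predicted positives
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Toy counts: precision = 40/50 = 0.8, accuracy = 75/100 = 0.75
m = metrics(tp=40, fp=10, tn=35, fn=15)
print(m)
```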
To evaluate the fairness of a model, we scrutinize the model's performance under different sensitive characteristics. For example, to evaluate gender fairness, we check the performance parities between the sexes. In many computer vision applications, the true positive rate for dark-skinned persons is usually lower than for those with fair skin. These models cannot spot positives for dark-skinned persons as effectively as for others.
But not all metrics carry the same weight for a person. District attorneys care about the true positive rate (TPR) the most: high-profile false negatives (FN) can cost them their jobs. On the other hand, human rights activists care about the false positive rate (FPR): no innocent person (a false positive, FP) should be jailed. Achieving a high TPR sacrifices FPR, and vice versa. Therefore, project stakeholders need to discuss the acceptable tradeoffs between false negatives and false positives. Based on this discussion, they can prioritize the evaluation metrics.
Here are two more complex metrics (the ROC curve and the PR curve) that can be used for the analysis.
Other popular metrics are:
- Subgroup AUC: subgroup positives vs. subgroup negatives
- “BPSN” AUC: background positives vs. subgroup negatives
- “BNSP” AUC: background negatives vs. subgroup positives
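One way to compute these subgroup AUC variants is a plain rank-based AUC over the relevant pools of scores. The sketch below assumes binary labels and a boolean subgroup-membership flag; the function names are illustrative:

```python
def auc(pos_scores, neg_scores):
    """Probability a random positive outranks a random negative (ties count 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def bias_aucs(scores, labels, in_subgroup):
    """Subgroup, BPSN, and BNSP AUCs for one subgroup vs. the background."""
    sub_pos = [s for s, y, g in zip(scores, labels, in_subgroup) if g and y]
    sub_neg = [s for s, y, g in zip(scores, labels, in_subgroup) if g and not y]
    bg_pos  = [s for s, y, g in zip(scores, labels, in_subgroup) if not g and y]
    bg_neg  = [s for s, y, g in zip(scores, labels, in_subgroup) if not g and not y]
    return {
        "subgroup": auc(sub_pos, sub_neg),  # subgroup positives vs. subgroup negatives
        "bpsn": auc(bg_pos, sub_neg),       # background positives vs. subgroup negatives
        "bnsp": auc(sub_pos, bg_neg),       # subgroup positives vs. background negatives
    }

r = bias_aucs([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0], [True, True, False, False])
print(r)
```

A low BPSN AUC, for instance, suggests the model confuses subgroup negatives with background positives, i.e. it over-scores the subgroup.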
To discuss the fairness criteria, let's start with an example that uses credit scores for granting loans. The dark blue dots indicate loans that would be paid back and the light blue dots loans that would default. On top is the corresponding credit score for each loan (0 to 100).
We can adjust the credit score threshold to control which loans get approved. The diagram below shows the positive rate, the true positive rate, and the model accuracy for a credit score threshold of 50 (source of the example). The true positive rate indicates the percentage of loans approved among all loans that would be paid back. The positive rate is the percentage of loans that get approved.
In this example, a successful loan makes $300 and an unsuccessful loan costs $700. This dataset has two population groups: the blue and the orange which have different loan default rates. A threshold of 50 will make a profit of $13.6K on this loan portfolio.
To maximize profit, different thresholds can be applied to the two groups. As shown below, setting the thresholds at 61 and 50 respectively maximizes the profit at $32.4K.
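The profit-maximizing threshold for one group can be found by brute force over the observed scores. The sketch below uses the payoff from the example above ($300 per repaid loan, -$700 per default) with made-up scores and outcomes:

```python
def profit(scores, repaid, threshold, gain=300, loss=-700):
    """Total profit if every loan with score >= threshold is approved."""
    return sum(gain if y else loss
               for s, y in zip(scores, repaid) if s >= threshold)

def best_threshold(scores, repaid):
    """Exhaustively try every observed score (plus 'approve nobody')."""
    candidates = sorted(set(scores)) + [max(scores) + 1]
    return max(candidates, key=lambda t: profit(scores, repaid, t))

# Toy data: only the applicant scoring 80 both clears a high bar and repays.
scores = [30, 55, 60, 80]
repaid = [False, True, False, True]
print(best_threshold(scores, repaid))  # 80: approving only the top score avoids both defaults
```

Running this search separately per group is exactly the "different thresholds per group" policy described above.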
Now, we are ready to discuss some of the fairness criteria.
Unawareness means only non-sensitive characteristics are used in making decisions. The decisions should be based purely on merit. In this process, sensitive characteristics like gender and race are excluded from the decisions. We should not know the group identity of an applicant.
With this approach, we cannot treat the groups differently. Therefore, both groups get the same threshold. As shown, profit drops. The number of loans to the orange group (its positive rate) drops even though the group is more creditworthy. Here, creditworthy means an individual will pay back a loan. We consider this the ground truth, as it truly tells whether a loan should be approved. The TPR (true positive rate) for the orange group drops as false negatives increase: loans that should be approved in the orange group are now rejected more often.
So is group unawareness really fair in real life? Unfortunately, some proxy measures or attributes are correlated with sensitive characteristics. Zip code, college, club associations, or an applicant's name may correlate with race. If loans in disadvantaged groups are mostly rejected, the AI model will reject those applications based on the group identities. In ML training, we feed in rich information to find patterns. The training session will eventually pick up subtle signals in the data related to race or gender. Then, it will falsely associate groups with outcomes and create bias. Amazon's resume bias was one good example. Hence, even though unawareness seems to be the ultimate goal, legacy data can take over and make the ML model biased.
Here are two examples demonstrating the issue. Roxbury is one of the poorest neighborhoods in Boston. In 2016, more than 80% of its population was Black or Latino. When Amazon rolled out its same-day delivery service to Boston, it excluded Roxbury, despite offering delivery to the surrounding areas. Similar problems happened in other cities with the same-day service. Amazon said that it did not use race in determining service areas. But the factors used were likely correlated with economic data that correlated with race. So if an ML model were used to draw the service map, the result could be associated with race when people looked at it. Even if race data is not explicitly stored, that is not equivalent to "it is not there". The excuse of "we don't consider it" is not valid in ML, as ML often finds ways to figure it out.
An experiment was conducted on racial bias in hiring. In Germany, job applicants typically include their pictures in their resumes. The researchers created job applications for three fictitious characters with identical qualifications: one applicant had a German name, one had a Turkish name, and one had a Turkish name with a headscarf in the picture. As shown, the applicant with the headscarf received significantly fewer callbacks. If an ML model learns from such data, it will pick up signals carrying the same racial content even when race information is not explicitly provided. Feature filtering is often not enough to safeguard against biased decisions. We need to conduct adversarial tests to verify how a model responds to sensitive contexts.
Some people may argue that if structural problems, like economic disparity, did not exist, gender and race would have no impact on the loan process. The percentage of approvals in each group would be the same. This is the state that we want the model to follow. Even though this is not the reality, such a model will give disadvantaged groups the opportunity to break the vicious cycle. Therefore, we train the model to achieve demographic parity. To do that, we adjust the threshold for each group until the positive rates for both groups are the same.
To achieve demographic parity, lending more to less creditworthy groups is unavoidable. Groups that benefited from biases often argue that they themselves are treated discriminatorily under demographic parity. In the US, this is often ruled illegal in specific protected domains.
Instead, some people may assert that people with the same creditworthiness in these groups should have the same rate (opportunity) of loan approval. This is "equal opportunity". We don't want the creditworthy applicants in any group to have a lower approval rate than those in other groups.
To achieve that, we adjust the thresholds until the true positive rates for both groups are the same. In short, for any group, we set the thresholds such that we are equally good at approving loans that will be paid back (true positives) among all creditworthy applicants (positives).
This demonstrates a common group of post-processing methods for fulfilling a fairness goal. The scores remain unchanged, but we adjust the threshold in each group to achieve the target goal.
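The post-processing idea above can be sketched as a small search: leave the scores alone and, per group, pick the threshold whose rate best matches a target. For demographic parity the rate is the positive (approval) rate; for equal opportunity it is the true positive rate. All names and data below are illustrative:

```python
def positive_rate(scores, threshold):
    """Fraction of applicants approved at this threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def tpr(scores, repaid, threshold):
    """Fraction of creditworthy applicants approved at this threshold."""
    pos = [s for s, y in zip(scores, repaid) if y]
    return sum(s >= threshold for s in pos) / len(pos)

def threshold_for_rate(scores, target, rate_fn):
    """Pick the candidate threshold whose rate is closest to the target."""
    candidates = sorted(set(scores)) + [max(scores) + 1]
    return min(candidates, key=lambda t: abs(rate_fn(t) - target))

# Equal opportunity: match group B's TPR to group A's TPR at A's threshold.
a_scores, a_repaid = [40, 55, 70, 90], [False, True, True, True]
b_scores, b_repaid = [30, 50, 65, 85], [False, False, True, True]
target = tpr(a_scores, a_repaid, threshold=70)   # 2/3 of A's creditworthy approved
t_b = threshold_for_rate(b_scores, target,
                         lambda t: tpr(b_scores, b_repaid, t))
print(t_b)
```

The same `threshold_for_rate` helper with `positive_rate` as the rate function would instead equalize approval rates, i.e. demographic parity.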
Some people may argue that equal opportunity does not narrow the gap caused by unfair structural problems. Say the orange group and the blue group have 36 and 4 creditworthy applicants respectively. If the bank can approve only 10 applicants at an equal true positive rate, nine and one applicants will be approved respectively. The gap between the two groups is not narrowed: no new opportunity is given to the disadvantaged group.
In equal accuracy, we want the same percentage of correct judgments in both groups. A loan is correctly judged when the approval or rejection decision is correct, i.e. the rate of true classifications ((true positives + true negatives) over the total) should be the same for both groups.
We can view this as a true classification. It demonstrates how good the model is in approving and rejecting loans. This is different from equal opportunity which trains the model to approve creditworthy applicants equally. If we look at the confusion matrices, these policies are using different parts of the matrix in the evaluation.
Predictive parity verifies whether a classifier has the same precision (TP/(TP+FP)) for both groups. Among the approved loans, the ratio of loans paid back should be the same in all groups; the loans approved by the classifier have the same level of success in every group.
Calibration by Group
A score r is calibrated by group if, for every score value r and every group a:

P(Y = 1 | R = r, A = a) = r
Y is the outcome; if a loan is repaid, Y = 1. A score of 0.7 means a 0.7 chance of a positive outcome on average over the people who receive a score of 0.7 in the group. It does not mean every individual has a 0.7 chance of a positive outcome. For example, suppose three persons in the group have a score of 0.7. If their P(Y) are 0.8, 0.7, and 0.6 respectively, the requirement is still fulfilled, as their average is 0.7.
If a score is calibrated by group, then regardless of group membership, r is sufficient to predict the outcome Y. To evaluate the fairness of scores among groups, we can plot the score versus the outcome likelihood. In the diagram below, the scores are close to calibrated, as both the Black and the White groups had a similar likelihood of recidivism conditioned on a score. Hence, during a bail hearing, we could simply use the score without considering the racial factor.
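A basic calibration-by-group check can be sketched as follows: within each group, bucket the scores and compare each bucket's mean predicted score to the observed outcome rate. A well-calibrated score keeps those two numbers close in every group. The function name and toy data are illustrative:

```python
from collections import defaultdict

def calibration_curve(scores, outcomes, groups, n_bins=10):
    """Per (group, score-bin): (mean predicted score, observed outcome rate)."""
    buckets = defaultdict(list)
    for s, y, g in zip(scores, outcomes, groups):
        b = min(int(s * n_bins), n_bins - 1)   # clamp s == 1.0 into the top bin
        buckets[(g, b)].append((s, y))
    return {
        key: (sum(s for s, _ in v) / len(v),   # mean predicted score
              sum(y for _, y in v) / len(v))   # observed outcome rate
        for key, v in buckets.items()
    }

# Toy group "a": four people scored 0.75, three of whom repaid — calibrated.
curve = calibration_curve([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 0], ["a"] * 4)
print(curve)
```

Plotting these pairs per group gives exactly the score-versus-outcome-likelihood diagram described above; large gaps in one group signal a calibration failure for that group.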
But in Broward County, Florida, there was a gender disparity. The scores missed calibration by gender, and the risk of female defendants was systematically overestimated.
As shown in this paper, scores from unconstrained supervised learning closely follow calibration. In short, a calibrated score can be close to the optimal solution in which no fairness constraint is considered.
But neither the optimal method nor the calibration method solves the problem of moral irrelevance. Those factors do impact the measured outcomes, but as a society, we decide to ignore them. Optimal predictions, however, do their best to improve predictive accuracy by using all available information, including morally irrelevant factors. The predictions made will be influenced by those factors.
Generalize Fairness Criteria
Research papers often use different names for the same fairness criterion, which is very confusing. Let's define some of the terms mathematically for clarity.
Demographic parity: Regardless of the group membership (A=a or b), all groups will have the same rate of loan approval (prediction: R=+).
Equal opportunity: If a person is creditworthy (i.e. outcome: Y=+), regardless of the group membership (A=a or b), the person should receive the same chance of loan approval (prediction: R=+).
Predictive parity: If a loan is approved (prediction: R=+), regardless of the group membership, the chance of paying back the loan (outcome: Y=+) is the same.
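Written as conditional probabilities in the article's notation (A for group membership, Y for outcome, R for prediction), the three definitions above become:

```latex
\begin{aligned}
\text{Demographic parity:} \quad & P(R = + \mid A = a) = P(R = + \mid A = b) \\
\text{Equal opportunity:} \quad & P(R = + \mid Y = +,\, A = a) = P(R = + \mid Y = +,\, A = b) \\
\text{Predictive parity:} \quad & P(Y = + \mid R = +,\, A = a) = P(Y = + \mid R = +,\, A = b)
\end{aligned}
```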
There are many fairness approaches, but in general, they are related to one of three major fairness criteria: independence, separation, and sufficiency. For example, independence requires the prediction R to be independent of A (group membership). If this requirement is met, it fulfills demographic parity. Separation requires the prediction R to be independent of the group membership A conditioned on the outcome Y. Sufficiency requires Y to be independent of A conditioned on R; in short, R is all we need to predict Y. Here is a table showing some of the relations among the fairness criteria discussed in this section.
Finally, this is a list of fairness criteria with the corresponding relationship with these three major criteria.
Many datasets are not fully analyzed or validated for their intended uses. While these datasets may be adequate for research papers, many cause bias in commercial applications. In recent years, major AI players have been releasing tools to identify model and dataset biases.
Google Fairness Indicators
Google Fairness Indicators (project site) computes many model performance metrics for binary and multiclass classifiers. We can select a metric and analyze the performance for different group memberships sliced by an attribute. For example, we can slice the dataset by sex and analyze the false negative rates for the male and female groups.
slices = [tfma.slicer.SingleSliceSpec(columns=['sex'])]
Here is another example where Fairness Indicators shows the false negative rates for three membership groups (tall, short, medium) under the attribute “height”. We can compare them to a baseline (green below) to locate any “height”-related bias.
Google also provides the What-If Tool (introduction video) to identify bias. For example, we can select a data sample and change one attribute at a time (say, changing the sex from “male” to “female”). If a single change in a sensitive characteristic causes a major decision shift, we should further verify similar samples to identify any potential bias. (tutorial, walkthrough, code).
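The counterfactual probe the What-If Tool automates can be sketched in a few lines: flip one sensitive attribute on a sample and check whether the model's decision changes. The `model` below is a deliberately biased, hypothetical stand-in for any scoring function, not a real API:

```python
def flip_attribute(sample, attr, new_value):
    """Return a copy of the sample with one attribute changed."""
    probe = dict(sample)
    probe[attr] = new_value
    return probe

def decision_shift(model, sample, attr, new_value, threshold=0.5):
    """True if flipping this single attribute alone changes the decision."""
    before = model(sample) >= threshold
    after = model(flip_attribute(sample, attr, new_value)) >= threshold
    return before != after

# Hypothetical model that (improperly) keys on sex — the probe flags it.
model = lambda x: 0.8 if x["sex"] == "male" else 0.3
sample = {"sex": "male", "income": 50000}
print(decision_shift(model, sample, "sex", "female"))  # True: the flip changed the decision
```

Running this probe over many samples, rather than one, gives a rough picture of how often a sensitive flip alone swings the outcome.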
The What-If Tool can also suggest the optimal threshold setting (the classification cutoff between positives and negatives) for different fairness strategies, like demographic parity, equal opportunity, equal accuracy, and group thresholds. Otherwise, it can recommend a threshold that maximizes the model accuracy with the lowest total of false positives and false negatives.
To discover gender bias, we can review the performance parities between the “Male” and “Female” groups below (tutorial).
Alternatively, we can fine-tune the threshold for each group manually to achieve a target performance goal. Different problem domains value objectives differently. With the What-If Tool, we can configure the FP/FN ratio in the threshold optimization to control the relative cost of false positives versus false negatives.
An imbalanced feature distribution can be a sign of selection bias in the dataset, i.e. the samples do not represent the population of interest. With the What-If Tool, we can browse through the features' distributions to identify such imbalance.
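Even without a dedicated tool, a first-pass check for such imbalance is just a per-value count over the feature. The feature name and data below are toy examples:

```python
from collections import Counter

# Toy dataset: count samples per value of a sensitive feature.
samples = [{"sex": "male"}, {"sex": "male"}, {"sex": "male"}, {"sex": "female"}]
counts = Counter(s["sex"] for s in samples)
print(counts)  # a 3:1 skew toward "male" in this toy data
```

A strong skew like this does not prove selection bias, but it tells us which slices deserve the closer metric-parity checks discussed earlier.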
Computer Vision Datasets for Fairness
In the beginning, when people stumbled over bias issues, they switched to another dataset, like CelebA, to troubleshoot the fairness issue further. In recent years, special datasets have been designed for fairness evaluation. For example, FairFace has a more balanced race composition in its samples.
Facebook’s Casual Conversations is another fairness-focused dataset, released to the public in 2021. To demonstrate the idea, Facebook used this dataset to evaluate the top five winners of the DeepFake Detection Challenge (DFDC). The results show that these models are less accurate for specific groups of people.
Google also released the MIAP (More Inclusive Annotations for People) dataset, which focuses on enabling ML fairness research.
These annotation files cover the 600 boxable object classes, and span the 1,743,042 training images where we annotated bounding boxes, object segmentations, visual relationships, and localized narratives; as well as the full validation (41,620 images) and test (125,436 images) sets. (Quote)
These initiatives are still new and we should expect more activities in the future.
Google also released tools like Know Your Data to browse the dataset based on different attributes. This allows data scientists to understand the diversity and distribution of samples under specific attributes. We can use it to locate class imbalance.
In the next article, we will look into how effective these criteria are. We need to understand them carefully; otherwise, we may do more harm than good.