For the past decade, we have been building the Lamborghini of AI. The focus was on the engine, the model, while fairness sat on the back burner. When the public and the media started scrutinizing these systems, issues were exposed and public images were damaged. In the previous article, we discussed how broad AI biases are and how big their impact can be. From one perspective, we are cultivating bias, or even harm, unintentionally. Yet we have adopted a one-off approach to addressing the problems. It does not work. Even the most obvious problems keep repeating themselves.
Stakeholders do not know what should be done and what they are accountable for. In this article, we discuss AI governance for dealing with these persistent problems. We address the roles and tasks of each stakeholder, and what testing and reporting should be done.
Six Dimensions of AI Governance
Trustworthiness is the central theme of an AI deployment. It covers six major dimensions. 1) To be fair and impartial, we treat all people equally, without prejudice based on group membership such as gender or race. The system should perform equally well regardless of such membership. 2) We need to be transparent about how the data is used and how decisions are made. We should always be ready for public scrutiny of the AI decisions we make. The days of black-box AI are over. The system should stand up to human review and incorporate feedback.
3) It should be clear who is responsible and accountable when issues occur, and how users can get help. 4) The platform should provide a safe and secure environment, free from harm, bullying, and abuse. Cybersecurity must be maintained to fend off malicious attacks. 5) Privacy should be respected and clearly communicated. 6) Finally, the service has to be robust and reliable.
Biases can be introduced at different phases of an ML pipeline.
Therefore, at different stages, we address different questions and mitigate the risks.
In addition, understanding the roles of key stakeholders is important. A team with diverse backgrounds and experience will bring a more comprehensive view of both the problems and the solutions.
A business owner must understand the clear value and goal of the system: what it is, what it is not, and who it is for. The vision and code of conduct are well defined and coached. The owner identifies the stakeholders, assembles the right team with the right experts, and communicates clearly how the project is governed, who is accountable for what, and how success is measured. With the other stakeholders, the business owner defines solid goals for fairness and inclusion. Together they identify any potential harm: what bad decisions the system may make, what may be denied, and who may be impacted or excluded. They have to ensure the fairness criteria are adequately disclosed, listen to feedback, and revise the policy appropriately.
As discussed before, there are two major doctrines in anti-discrimination law: equal treatment and equal outcome. Sensitive information, like gender or race, can balance disparate outcomes, but using it creates treatment disparity. The project stakeholders, including the legal and ethical teams, should recognize such tensions. They should debate and finalize the preferred fairness approach: when the two conflict, should it be equal treatment or equal outcome? If the application covers protected domains, all the relevant legal regulations must be clearly communicated.
The team should quantify the cost of false negatives versus false positives. How bad is a false accusation (a false positive) compared with a missed case (a false negative)? Understanding the tradeoff is important.
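This cost asymmetry can be made concrete. Below is a minimal sketch, with hypothetical per-error costs and a toy set of scored examples, of how a team might sweep decision thresholds and pick the one minimizing expected cost:

```python
# Illustrative sketch: choosing a decision threshold when false positives
# and false negatives carry different costs. The cost values and data
# are hypothetical placeholders a team would replace with its own.

def expected_cost(scores_labels, threshold, cost_fp=1.0, cost_fn=5.0):
    """Total cost of misclassifications at a given threshold."""
    cost = 0.0
    for score, label in scores_labels:
        pred = score >= threshold
        if pred and label == 0:
            cost += cost_fp   # false positive: wrongly flagged
        elif not pred and label == 1:
            cost += cost_fn   # false negative: missed case
    return cost

def best_threshold(scores_labels, cost_fp, cost_fn):
    """Pick the candidate threshold that minimizes expected cost."""
    candidates = [i / 10 for i in range(1, 10)]
    return min(candidates,
               key=lambda t: expected_cost(scores_labels, t, cost_fp, cost_fn))

data = [(0.9, 1), (0.8, 1), (0.4, 1), (0.3, 0), (0.2, 0), (0.7, 0)]
# With expensive false negatives, the optimal threshold shifts lower.
print(best_threshold(data, cost_fp=1.0, cost_fn=5.0))  # 0.4
```

Making the costs explicit forces the tradeoff discussion out of the code and into the governance process.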
These are tough and conflicting decisions, but they should be well discussed. Develop a principle that guides design decisions on fairness. Draw a boundary: state what cannot be crossed and what tradeoffs are acceptable under what circumstances. Document which regulations apply.
Below are two project descriptions summarizing these decisions. On the left is a system for removing violent content, and on the right is a job recommendation site.
Here are two other examples summarizing potential harms: one recommends sunglasses based on an image of a face, and the other removes abusive content.
Data scientists need to know how the data is collected and how it is measured, to avoid selection bias and measurement bias respectively. Knowing how the data is generated is important, and they are responsible for plans to mitigate data problems.
Knowing selection bias
Know what the dataset is intended for. The source and the contributors of the data should be scrutinized. The dataset should be a good representation of the population of interest. What is the demographic distribution of those who provide and volunteer the information?
Evaluate any underreporting or overreporting bias for disadvantaged groups. Where are the samples collected from? Will economic status impact the level of reporting? Are economically disadvantaged regions well represented? Are samples from certain groups hard to collect? Do they cover the same range of scenarios and conditions (like posing and lighting in images)? Does the data contain previous decisions favoring specific groups? What are the distributions of model predictions and data labels for different groups? Are these distributions heavily skewed for disadvantaged groups or sensitive topics? Are samples collected from more opinionated persons? Are the points of view balanced? Over what timeframe is the data collected?
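Several of these representation questions can be answered quantitatively. Below is a minimal sketch, using made-up records and a hypothetical census-style reference distribution, of flagging groups whose sample share falls below their population share:

```python
# Sketch: checking group representation in a dataset against a
# reference distribution. Records and reference shares are hypothetical.
from collections import Counter

records = [
    {"gender": "female", "region": "urban"},
    {"gender": "female", "region": "rural"},
    {"gender": "male", "region": "urban"},
    {"gender": "male", "region": "urban"},
    {"gender": "male", "region": "urban"},
]

def group_shares(records, attribute):
    """Fraction of samples per attribute value."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def underrepresented(records, attribute, reference):
    """Groups whose sample share falls below the expected population share."""
    shares = group_shares(records, attribute)
    return [g for g, expected in reference.items()
            if shares.get(g, 0.0) < expected]

print(group_shares(records, "gender"))            # {'female': 0.4, 'male': 0.6}
print(underrepresented(records, "gender",
                       {"female": 0.5, "male": 0.5}))  # ['female']
```

The same breakdown can be repeated for region, age band, or any attribute the risk assessment flags.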
Knowing measurement bias
Is the quality of information for disadvantaged groups as good as for others? Is their information more vulnerable to interpretation, prejudgment, and inaccurate measurement? What information is commonly missing from these groups? How is the missing information filled in? Are different groups willing to share the same level of information? What level of training do the examiners or labelers have? Do they apply the same standard of measurement and interpretation in collecting and labeling the data? Is the quality of the examiners and labelers monitored and audited? How can we verify that the labels are free from prejudice and unconscious bias?
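One of these measurement questions, how much information is missing per group, is easy to quantify. A minimal sketch, with hypothetical records and field names:

```python
# Sketch: auditing per-group missing-field rates as a proxy for
# measurement bias. The records and field names are hypothetical.
def missing_rate(records, group_key, field):
    """Fraction of records per group where `field` is absent or None."""
    totals, missing = {}, {}
    for r in records:
        g = r[group_key]
        totals[g] = totals.get(g, 0) + 1
        if r.get(field) is None:
            missing[g] = missing.get(g, 0) + 1
    return {g: missing.get(g, 0) / n for g, n in totals.items()}

records = [
    {"group": "A", "income": 50000},
    {"group": "A", "income": 62000},
    {"group": "B", "income": None},   # missing measurement
    {"group": "B", "income": 48000},
]
print(missing_rate(records, "group", "income"))  # {'A': 0.0, 'B': 0.5}
```

A markedly higher missing rate for one group is a signal that imputation or collection practices deserve review before modeling.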
Know your question, know your data & know your risks
Once the questions are known, data scientists should answer them quantitatively. Start using tools to analyze the datasets in detail. Break the analysis down by group, attribute, condition, and, for NLP applications, topic. Identify imbalance and biases at every step of the process. Details matter. Often, we need to combine attributes to see how fair the model is. For example, we may need to evaluate fairness for different race-and-gender combinations.
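The intersectional breakdown described above can be sketched as follows; the records, predictions, and labels are hypothetical placeholders:

```python
# Sketch: model accuracy per (race, gender) combination, so that
# subgroups hidden by single-attribute averages become visible.
from collections import defaultdict

def accuracy_by_subgroup(rows, keys):
    """Accuracy for every combination of the given attributes."""
    correct, total = defaultdict(int), defaultdict(int)
    for row in rows:
        subgroup = tuple(row[k] for k in keys)
        total[subgroup] += 1
        if row["prediction"] == row["label"]:
            correct[subgroup] += 1
    return {g: correct[g] / n for g, n in total.items()}

rows = [
    {"race": "A", "gender": "F", "prediction": 1, "label": 1},
    {"race": "A", "gender": "F", "prediction": 0, "label": 1},
    {"race": "A", "gender": "M", "prediction": 1, "label": 1},
    {"race": "B", "gender": "F", "prediction": 1, "label": 1},
]
print(accuracy_by_subgroup(rows, ["race", "gender"]))
# {('A', 'F'): 0.5, ('A', 'M'): 1.0, ('B', 'F'): 1.0}
```

A model with acceptable accuracy for each race and each gender separately can still underperform badly for a specific combination, which is exactly what this slicing exposes.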
Before the 1990s, color film design and processing were optimized for white people. Failure to analyze the combination of races and lighting conditions resulted in the poor photo quality shown on the right. Contrary to the myth, addressing fairness can improve model accuracy, and it widens product offerings.
Here is an example of analyzing the sample distribution for a job recommendation system.
Data scientists should filter and clean the data to avoid bad actors and prejudices. In conversational AI, remove offensive text from the output but not the input; this allows the bot to learn good responses to bad behaviors. For privacy and security, data should be processed to hide identities and private or confidential information.
How data is used should be well thought out to promote fairness. Avoid factors that harm a group disproportionately and unjustifiably. Should sensitive characteristics be used in an ML model? Under US discrimination laws, membership in protected groups cannot impact decisions in the protected domains. That is the equal treatment doctrine. But outside those domains, the answer depends on priorities. Should equal outcomes be more important? For instance, Google removes queries on the word “gorilla” in its Photos app. This is not equal treatment, at least not for gorillas. To guide such decisions, the agreed fairness principle, code of conduct, and approach must be followed.
Risk assessments and mitigations are often missing in AI projects. What groups of people are vulnerable and how? What is the mitigation plan? This should be part of the regular development process. Reports should be generated for legal and ethical reviews.
Analyze the model performance under different combinations of sensitive characteristics and attributes. Identify those that consistently perform worse. Document them so that clear mitigation, validation, and monitoring plans can be created. Data collected in certain domains, like health care and the criminal justice system, is particularly vulnerable to bias and deserves special attention. Augment data to address all identified issues.
Model designers should work with data scientists in designing the model. Based on the agreed fairness approach, they determine the fairness metrics and model performance measures that should be prioritized for monitoring. They should decide what objective function and what measures are used, and evaluate those choices for fairness. As discussed before, proxy measures can hurt disadvantaged groups badly.
Identify and investigate decision unfairness or performance disparity among sensitive characteristics or attributes. Model designers should improve the model’s interpretability to explain those irregularities. Spend time on the root cause; a lack of understanding and monitoring causes the same issues to reappear. Change the pre-processing and post-processing if necessary to promote fairness.
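As one concrete example of a performance-parity check, here is a sketch of measuring a true-positive-rate gap between two groups (an equal-opportunity style metric); the data is hypothetical:

```python
# Sketch: true-positive-rate gap between groups. A persistent nonzero
# gap is the kind of irregularity model designers should root-cause.
def true_positive_rate(rows, group):
    """TPR for one group: fraction of actual positives predicted positive."""
    positives = [r for r in rows if r["group"] == group and r["label"] == 1]
    if not positives:
        return None
    hits = sum(1 for r in positives if r["prediction"] == 1)
    return hits / len(positives)

rows = [
    {"group": "A", "label": 1, "prediction": 1},
    {"group": "A", "label": 1, "prediction": 1},
    {"group": "B", "label": 1, "prediction": 1},
    {"group": "B", "label": 1, "prediction": 0},
]
gap = true_positive_rate(rows, "A") - true_positive_rate(rows, "B")
print(gap)  # 0.5: group B's qualified cases are missed twice as often
```

Which parity metric matters (true positive rate, false positive rate, calibration) follows from the fairness approach the stakeholders agreed on earlier.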
ML learns the patterns in the data. If the data is biased, it reinforces biased content and choices presented to the users. Users pick the biased options, which further reinforces the biased data. Introduce diversity and exploration in generating the user choices and options to break away from the cycle.
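One way to inject that exploration, sketched here with hypothetical item ids, is an epsilon-greedy mix of the model's top-ranked items with random picks from outside them:

```python
# Sketch: epsilon-greedy diversification to break the bias feedback
# loop. Mostly serve the ranked items, but with probability epsilon
# surface an item outside the top picks. All values are illustrative.
import random

def recommend(ranked, candidate_pool, k, epsilon=0.2, seed=0):
    """Return k items, mixing top-ranked picks with random exploration."""
    rng = random.Random(seed)
    explore_pool = [c for c in candidate_pool if c not in ranked[:k]]
    picks = []
    ranked_iter = iter(ranked)
    for _ in range(k):
        if explore_pool and rng.random() < epsilon:
            # exploration: surface something the model would not have shown
            picks.append(explore_pool.pop(rng.randrange(len(explore_pool))))
        else:
            # exploitation: next best item from the model's ranking
            picks.append(next(ranked_iter))
    return picks

ranked = list(range(10))   # model's ranked item ids
pool = list(range(30))     # full candidate pool
recs = recommend(ranked, pool, k=5, epsilon=0.5)
print(recs)
```

Epsilon trades short-term relevance for long-term data diversity; its value is a product decision, not a purely technical one.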
Dataset augmentation addresses the data representation problem. But if bias is inherent in the data, augmentation will not help. This is particularly problematic for datasets carrying past judgments and observations that are biased. For high-impact applications, the fairness discussion will drive many design choices. Accuracy will no longer be the only test in model design and selection. The agreed fairness approach will influence:
- whether a single threshold is used or group-sensitive thresholds are used,
- whether output scores are calibrated,
- whether all groups have the same positive rates,
- whether sensitive characteristics can be used, or how they can be used,
- what performance metrics goals to achieve, and
- what the algorithm choices are.
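The first choice, a single threshold versus group-sensitive thresholds, can be sketched as follows; the threshold values are hypothetical:

```python
# Sketch: a single decision threshold vs. group-sensitive thresholds.
# The threshold values are hypothetical; in practice they follow from
# the agreed fairness approach and its legal review.
def decide(score, group, thresholds, default=0.5):
    """Positive decision if the score clears the group's threshold."""
    return score >= thresholds.get(group, default)

single = {}                              # every group uses the default 0.5
group_sensitive = {"A": 0.5, "B": 0.4}   # lower bar offsets biased scores for B

print(decide(0.45, "B", single))           # False: rejected under one threshold
print(decide(0.45, "B", group_sensitive))  # True: accepted under group thresholds
```

Note the tension from the earlier doctrine discussion: group-sensitive thresholds move toward equal outcomes at the cost of equal treatment, so this choice must be documented and defensible.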
Be prepared for scrutiny on fairness. Understand how decisions are made and what factors impact the results. With highly biased data, the model performance disparities can be huge. Understand the tradeoffs, and be ready to explain the fairness policy and defend the decisions against an industry-recognized standard. Community feedback should be studied.
In mitigating biases, there are limits to what data augmentation or algorithmic changes can do. In those scenarios, we may have to admit there is no perfect solution. Provide alternatives. Supplement with other, less perfect means to achieve similar goals. In some cases, allow users to manually override AI suggestions and give some control back to the users.
Design a solution that can handle failure gracefully, or better, fail soft. Use humor to respond to abuse in conversational AI; abusive behavior drops when there is no fun in it anymore. Allow user feedback and intervention to detect and correct false judgments. Think through the worst-case scenarios for false positives and false negatives, and consider how to handle them gracefully. We may start with a less-desired solution that causes no harm and then iterate and improve on it.
Solutions can be ugly. Historical stereotypes cause female doctors to be mislabeled as nurses; we may change the labels for nurses and doctors to “health care professional” to mitigate the problem. In the Google Photos app, Google decided not to use “gorilla” as a label or a query to avoid sensitive mislabeling.
A model validator is the gatekeeper who verifies the accuracy and fairness of the system. Predictions should not be correlated with sensitive characteristics that are practically or morally irrelevant. In addition, predictions should not impact disadvantaged groups specifically without justification or possible alternatives, and accuracy should not underperform for those groups. The model validator needs to verify the performance metrics across different sensitive characteristics and validate that privacy is protected.
User behaviors can change. Data may be collected from production to fine-tune the model. Bad actors may gain knowledge to manipulate the system. All these factors drift the system away from its intended fairness and optimal state. Hence, continuous monitoring and metrics reporting are required. Operations engineers should continuously monitor the fairness, cybersecurity, safety, privacy, reliability, and performance of the system. Response and escalation plans should be created.
Testing should be done early and constantly. It takes time for these tests to take shape and mature. Early testing catches problems before issues balloon; it costs more when they are caught later. Diverse testers should be involved to bring more angles to the issues. The effort takes resources, so the level of effort should be adjusted according to the scope and impact of the AI decisions. Here are sample tests that can be carried out during the project life cycle:
Data scientists and model designers should be the first tier of people validating fairness. Typical bias scenarios should be verified. For computer vision applications, test thoroughly for gender disparity and for persons with dark skin. For NLP, verify any gender or racial bias. Specific topics will require extra scrutiny. Test both positive and negative cases. Test whether the system can handle abuse gracefully.
Model validators should develop comprehensive tests for different sensitive characteristics, conditions, and configurations. Make sure all tests reflect real-life scenarios. Develop special datasets for fairness testing.
(Quote from Google Clips) We created controlled datasets by sampling subjects from different genders and skin tones in a balanced manner, while keeping variables like content type, duration, and environmental conditions constant. We then used this dataset to test that our algorithms had similar performance when applied to different groups.
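The balanced-sampling idea in the quote can be sketched as follows; the attribute names and candidate pool are hypothetical:

```python
# Sketch of building a controlled fairness test set: draw an equal
# number of subjects per (gender, skin tone) cell while other variables
# are held constant upstream. Pool contents are hypothetical.
import random

def balanced_sample(pool, keys, per_cell, seed=0):
    """Sample `per_cell` items for every attribute combination present."""
    rng = random.Random(seed)
    cells = {}
    for item in pool:
        cells.setdefault(tuple(item[k] for k in keys), []).append(item)
    sample = []
    for members in cells.values():
        sample.extend(rng.sample(members, min(per_cell, len(members))))
    return sample

# 12 hypothetical subjects: 3 in each of the 4 (gender, skin tone) cells.
pool = [{"gender": g, "skin_tone": s, "id": i}
        for i, (g, s) in enumerate([("F", "dark"), ("F", "light"),
                                    ("M", "dark"), ("M", "light")] * 3)]
test_set = balanced_sample(pool, ["gender", "skin_tone"], per_cell=2)
print(len(test_set))  # 8: two samples from each cell
```

Because every cell contributes equally, a performance gap on this set reflects the model, not the test data's composition.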
Conduct adversarial tests for harassment, harm, abuse, attacks, bullying, and manipulation. What are the worst things that can happen to the system? Test how the system reacts to sensitive social topics. How does it handle racist, sexist, and homophobic situations?
Operations engineers should collect business and operations metrics for different groups. These should include dropoff, revenue, error rates, customer satisfaction, abuse, and negative comments across different groups and topics. This verifies fairness from a business angle.
Establish checklists for data risk, mitigation, fairness testing and verification.
Low visibility of fairness metrics pushes fairness initiatives into second priority. Dashboards should be built to report fairness alongside model accuracy. They should work on independent validation datasets, including datasets built specifically for fairness testing across different sensitive characteristics.
Benchmarks should be established for easy comparison.
Throughout the process, companies can use a model card to capture and report the information discussed in this section. This will increase the transparency of ML models and facilitate communication among different stakeholders. Here is a sample model card:
Here is another example.
AI systems should be audited to answer:
- What are the compositions of the datasets? Are they representative? Do they have good coverage of different attribute values and conditions?
- What are the potential harms? Are they handled appropriately?
- What are the risks? What are the mitigations?
- How does the system fulfill the intended uses for its target users?
- How are performance and fairness metrics monitored and reported?
- What is the model performance over different attribute values? Is the system fair?
- Is there any performance or fairness disparity for disadvantaged groups?
- Are there problems in specific fairness metrics? Are they justifiable?
- What is the quality of the examiners or labelers?
- What is the test coverage for fairness? Are sensitive areas properly tested?
- Does it meet regulations?
- Should it be reviewed or certified by external subject experts?
- Are the response and escalation plans appropriate?
- Is the system properly documented?
- Are the checklists completed and verified?
The diagram below shows how an AI project evolves. The key steps include risk assessment/mitigation, bias assessment/mitigation, early feedback (internal or external), and continuous post-deployment monitoring.
By interviewing AI professionals, this paper studied the practical challenges of ML projects. Here are some of our interpretations of those challenges.
- No process exists to collect balanced and representative datasets.
- Need support in identifying groups that are underrepresented in the training. Don’t know how much more data is needed for those groups.
- Certain segments are hard to reach for data collection. For example, it may not be easy to reach high-scoring minority students.
- Need effective communications between data collection decision-makers and model developers.
- Need tools to guide data collection.
- Need tools to add tags to data and analyze data imbalance based on these tags.
- Teams do not discover serious fairness issues until user complaints or bad media reports surface. Need support in detecting them before deployment.
- Need proactive and standard auditing processes. Need to know which metrics and tools to use.
- Need checklists on what to do and what to verify.
- Current processes are not comprehensive or scalable.
- Lack of awareness of fairness. Fairness initiatives are usually not rewarded. Need organization-level prioritization and awareness.
- No metrics are implemented to monitor fairness and progress. Need scalable and automated process.
- Demographic information may be missing in the data because of regulation. It is not easy to evaluate fairness if it is missing.
- Need process and fairness auditing tool when only coarse-grained demographic information is accessible.
- Need tools to flag samples that are toxic or harmful.
- Hard to evaluate whether a complaint is common or just a one-off.
- Need knowledge, tools, and processes specific to an application domain, like machine translation, recommender systems, etc.
- Common practice and knowledge sharing are important but hard to achieve because of resource constraints and fine differences in projects.
- Human inspection is sometimes needed. No automation tools work with human inspection yet.
- What value is created for individuals, and what are their perspectives on fairness? The fairness metrics published by academia do not have the right line of sight in assessing fairness (refer to the paper for details).
- Fairness evaluation may depend on user input or query. How can we simulate it, in particular for highly interactive systems like chatbots?
- Are the fairness interventions themselves fair? Do they cause unexpected side effects? Can we quickly verify that they do not hurt the user experience?
- Need process and tools to anticipate the tradeoff, harms, and user satisfaction in different fairness approaches.
- What are the right strategies for fixing fairness? What should be the right focus for the project (models, data, augmentation, post-processing, or objective functions)?
- Need process and tools to reduce the influence of human biases on the labeling or scoring processes.
Here are some tips:
- Careful test set design helps the detection of fairness issues. Low accuracy in certain sub-groups is a sign of imbalanced training data.
- Get together and imagine everything that could go wrong with the products. And monitor those issues proactively.
- Diverse background and knowledge in the teams improve the line of sight on fairness.
- Hire diverse staff and team-external “experts” for particular tasks.
- Facilitate knowledge sharing and best practices, including test sets and case studies, among groups, projects, and companies with different backgrounds.
- Evaluate and analyze human biases introduced in different phases of the ML pipeline, including crowd workers in labeling.
- Some fairness issues may be most effectively addressed through product design changes.
- Instead of slicing data using standard attributes (gender or race), it may be more effective to analyze data by domain-specific attributes.
AI bias remains common and persistent. Traditionally, we have treated each issue as a one-off problem and underestimated its complexity for decades. Without approaching the problem systematically, we will not solve it properly. So far, many projects adopt a traditional QA process; that is not enough. The governance of AI projects must evolve. Approaching ML problems as a simple model-accuracy optimization is inappropriate.
AI bias is a huge topic. Here are some related topics and resources that we should cover but do not:
Vision and Code of Conduct for Google AI
Responsible AI practices (Google)
Best Practices for ML Engineering
Human-Centered Machine Learning
Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need?