AI Bias

In this article, we will:

  • Present the depth of the problem through examples in computer vision, NLP, and machine learning.
  • Explore some of the root causes.
  • Discuss bias problems in datasets.
  • List the questions data scientists should ask about their datasets.

Biases in Computer Vision

  1. “Dataset: One key challenge in validating and communicating such problems is the lack of high-quality datasets for fairness analysis available, especially to industry practitioners.”
  2. “Lack of universal formalized notion of fairness: One key focus of ethical machine learning has been on developing formal notions of fairness and quantifying algorithmic bias. Another challenge is the lack of a universal formalized notion of fairness that can be easily applied to machine learning models; rather, different fairness metrics imply different normative values and have different appropriate use cases and limitations.”
  3. “Lack of a satisfactory definition of universally appropriate metrics or optimization objectives for machine learning for certain classes of problems.”
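To make the second challenge concrete, here is a minimal sketch (Python with NumPy; the labels and predictions are synthetic and purely illustrative) of how two common fairness metrics, demographic parity and equal opportunity, can disagree on the same set of predictions:

```python
import numpy as np

# Synthetic, purely illustrative labels and predictions for two groups.
y_true = {"A": np.array([1, 1, 0, 0]), "B": np.array([1, 0, 0, 1])}
y_pred = {"A": np.array([1, 0, 1, 0]), "B": np.array([1, 0, 0, 1])}

def selection_rate(pred):
    # Fraction of positive predictions; demographic parity
    # compares this rate across groups.
    return pred.mean()

def true_positive_rate(true, pred):
    # Recall on the positive class; equal opportunity
    # compares this rate across groups.
    return pred[true == 1].mean()

for g in ("A", "B"):
    print(f"group {g}: selection rate = {selection_rate(y_pred[g]):.2f}, "
          f"TPR = {true_positive_rate(y_true[g], y_pred[g]):.2f}")

# group A: selection rate = 0.50, TPR = 0.50
# group B: selection rate = 0.50, TPR = 1.00
# Demographic parity holds (equal selection rates), yet equal opportunity
# is violated (unequal TPRs): the two metrics encode different normative
# values, and in general they cannot all be satisfied at once.
```

This is exactly why no single fairness metric can be declared universally appropriate; choosing one is a normative decision, not a purely technical one.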

Bias in NLP

Figure: “What If Online Movie Ratings Weren’t Based Almost Entirely On What Men Think?” (modified from source)

Historical Bias


Bias in Machine Learning


Recommender Bias


Dataset Biases


Questions a Data Scientist Should Ask About Datasets

  • Who created the dataset? What was its original purpose, and what is it intended for?
  • Who are the information contributors, or what are the sources? Does the corresponding demographic resemble the population of interest?
  • Are data under-reported or under-diagnosed in protected groups or regions?
  • Or, on the contrary, are protected groups or regions over-reported or more heavily scrutinized?
  • Will some data be easier to collect or detect in one group or region than in others? Will there be trust and accessibility issues?
  • Does the collection process attract one group over others? Are the contributors highly biased or opinionated?
  • Will income, economic well-being, and racial stereotypes influence the level of service and the type of diagnosis?
  • Do the datasets cover the same range of scenarios and conditions for disadvantaged groups?
  • Over what timeframe are the data collected?
  • Are the label distributions heavily skewed in some groups or subtopics? (See the audit sketch after this list.)
  • Is the model trained on decisions and judgments made by past human decision-makers? Do those decisions favor one specific group?
  • Did some demographic groups fare much better or worse in the past?
  • Are the continuously collected data self-fulfilling? Will they reinforce or magnify a preexisting bias?
  • Could the reported data be subject to confirmation bias?
  • Are outcome measures recorded objectively, without human interpretation or influence?
  • Will income, economic well-being, and racial stereotypes influence what is reported, how information is interpreted, and how data are measured in some regions or groups?
  • Are the same standards and procedures applied to how data are collected and measured across all groups?
  • What noise and inaccuracies exist among the different groups?
  • Are data equally reliable and informative across all regions and groups? Will predictions be more accurate for some groups than for others?
  • Do disadvantaged groups have more missing data in their records? If missing, how will the information be filled?
  • Will disadvantaged groups share the same level of information with the data collector?
  • Are labelers properly trained, and do they apply the same standards when labeling data?
  • For computer vision, do the data cover all the scenarios needed, including combinations of pose, environmental conditions (such as lighting), race, age, and other group attributes?
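Several of the questions above, label skew and missing data in particular, can be checked programmatically before any model is trained. Here is a minimal audit sketch in Python with pandas; the column names ("group", "label", "income") and the toy rows are hypothetical stand-ins for a real dataset:

```python
import pandas as pd

# Hypothetical toy records; replace with your own dataset.
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B", "B"],
    "label":  [1, 1, 0, 1, 0, 0, 0],
    "income": [50_000, None, 62_000, 48_000, None, None, 55_000],
})

# Label skew: a large gap in positive-label rates across groups can
# reflect historically biased decisions rather than a real difference.
print(df.groupby("group")["label"].mean())

# Missing data: disadvantaged groups often have more gaps in their
# records, and naive imputation can silently amplify that disparity.
print(df.groupby("group")["income"].apply(lambda s: s.isna().mean()))
```

Such an audit does not prove a dataset is fair, but it surfaces the skews and gaps that the questions above are probing for.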
