AI Model Bias Audit Implementation
The model shows an aggregate accuracy of 0.89, which sounds good. But break the metrics down by subgroup and a different picture emerges: for one demographic group, precision drops to 0.71 and recall to 0.58. This isn't just a "fairness" concern; it's an operational risk. The model systematically fails on a specific segment, and if that segment matters to the business or is protected by legislation, the problem is critical.
A bias audit is a structured process for finding such gaps and their sources.
What is Bias Audit Technically
A bias audit measures model metrics across subgroups, compares them, statistically verifies the gaps, and traces their sources to data, features, or the annotation process. It's not a one-time event; it's a process embedded in the ML lifecycle.
Audit standards are built on several questions:
1. Which groups to analyze? Protected characteristics under legislation (gender, age, nationality, religion, etc.) — mandatory minimum. Additionally — business-relevant segments (region, customer type, acquisition channel).
2. Which fairness definition to choose? Demographic parity, equalized odds, calibration within groups — mathematically incompatible. Choice depends on the use case.
3. What gap counts as significant? Statistical significance (p < 0.05 with multiple-comparison correction) plus practical significance (effect size). A 2% difference on a 50k sample is statistically significant but not necessarily operationally meaningful.
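The significance check in point 3 can be sketched as a two-proportion z-test with a Bonferroni correction (stdlib-only sketch; the group sizes, rates, and number of comparisons are illustrative):

```python
from math import sqrt, erf

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a gap between two group-level rates
    (e.g. recall measured as true positives / actual positives)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal CDF via the error function (no scipy dependency)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Recall 0.58 vs 0.71, measured on 1000 positives per group
z, p = two_proportion_z_test(580, 1000, 710, 1000)
n_comparisons = 6               # e.g. all pairwise tests over 4 groups
alpha = 0.05 / n_comparisons    # Bonferroni correction
print(f"z={z:.2f}, p={p:.2e}, significant={p < alpha}")
```

Pair this with an effect-size measure (e.g. the raw gap in percentage points) before declaring a gap operationally relevant.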
Audit Methodology
Stage 1 — Data Audit
Before model training. Analyze the training dataset:
- Distribution across subgroups — underrepresentation of one group will worsen metrics specifically for it
- Feature correlation with protected attributes (proxy features)
- Annotation quality across subgroups (inter-annotator agreement via Cohen's kappa separately by group)
- Temporal bias — data from different time periods may contain different patterns for different groups
Tools: ydata-profiling (the successor to pandas-profiling), plus custom scripts for the feature correlation matrix.
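A minimal proxy-feature screen for Stage 1 might look like this (pandas sketch; the column names and the 0.5 screening threshold are illustrative assumptions, not from the text):

```python
import pandas as pd

# Synthetic frame: "postal_code_region" tracks the protected attribute,
# "tenure_months" does not (column names are illustrative)
df = pd.DataFrame({
    "protected":          [0, 0, 0, 0, 1, 1, 1, 1],
    "postal_code_region": [1, 1, 2, 1, 5, 5, 4, 5],
    "tenure_months":      [12, 30, 7, 22, 14, 28, 9, 25],
})

# Absolute correlation of every feature with the protected attribute;
# anything above the screening threshold is a proxy candidate
corr = df.corr()["protected"].drop("protected").abs()
proxy_candidates = corr[corr > 0.5].index.tolist()
print(corr.round(2).to_dict())
print("proxy candidates:", proxy_candidates)
```

Pearson correlation only catches linear, numeric proxies; for categorical features, measures such as Cramér's V or a small probing classifier are the analogous check.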
Stage 2 — Model Performance Audit
After training. Standard metric set for each subgroup:
```python
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score, precision_score, recall_score

metrics = {
    'accuracy': accuracy_score,
    'precision': precision_score,
    'recall': recall_score,
    'false_positive_rate': lambda y_true, y_pred:
        ((y_pred == 1) & (y_true == 0)).sum() / (y_true == 0).sum(),
}

mf = MetricFrame(
    metrics=metrics,
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=sensitive_features,
)

print(mf.by_group)      # Metrics per subgroup
print(mf.difference())  # Max difference between groups
print(mf.ratio())       # Min/max ratio between groups
```
Target thresholds (common practice for high-risk systems; the EU AI Act requires bias monitoring but does not prescribe numeric limits):
- Demographic parity difference < 0.1
- Equalized odds difference < 0.1
- False positive rate ratio within 0.8–1.25 (the "80% rule", from the US EEOC four-fifths guideline)
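A dependency-free sketch of checking two of these thresholds, the selection-rate difference for demographic parity and the FPR ratio against the 80% rule (function name and toy data are assumptions; it assumes every group contains negatives):

```python
import numpy as np

def fairness_gate(y_true, y_pred, groups):
    """Demographic parity difference and FPR 80%-rule check (manual sketch)."""
    sel, fpr = {}, {}
    for g in np.unique(groups):
        m = groups == g
        sel[g] = y_pred[m].mean()               # selection rate
        neg = y_true[m] == 0
        fpr[g] = (y_pred[m][neg] == 1).mean()   # false positive rate
    dp_diff = max(sel.values()) - min(sel.values())
    # min/max ratio >= 0.8 is equivalent to every pairwise ratio in [0.8, 1.25]
    fpr_ratio = min(fpr.values()) / max(fpr.values())
    return {"dp_difference": dp_diff,
            "fpr_ratio": fpr_ratio,
            "passes": dp_diff < 0.1 and fpr_ratio >= 0.8}

# Toy data: group B's FPR (0.3) vs A's (0.1) violates the 80% rule
y_true = np.array([0]*10 + [1]*10 + [0]*10 + [1]*10)
y_pred = np.array([1] + [0]*9 + [1]*9 + [0] + [1]*3 + [0]*7 + [1]*8 + [0]*2)
groups = np.array(["A"]*20 + ["B"]*20)
report = fairness_gate(y_true, y_pred, groups)
print(report)
```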
Stage 3 — Root Cause Analysis
If a gap is found, trace it to its source. Four main vectors:
Representation bias: subgroup comprises 3% of dataset but 15% of real requests. Model hasn't seen enough examples. Solution: oversampling (SMOTE, ADASYN), class-weighted loss, focal loss.
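The class-weighting fix can be sketched as inverse-frequency sample weights, which would then be passed as e.g. `sample_weight` to a scikit-learn `fit` call (the helper name and group labels are assumptions):

```python
from collections import Counter

def subgroup_weights(groups):
    """Inverse-frequency sample weights: each subgroup contributes the
    same total weight to the loss regardless of its share of the data."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    # Each group's weights sum to n / k, so groups are balanced overall
    return [n / (k * counts[g]) for g in groups]

# A 3%-of-dataset minority: per-sample weight ~16.67 vs ~0.515 for the majority
w = subgroup_weights(["maj"] * 97 + ["min"] * 3)
print(round(w[0], 3), round(w[-1], 3))
```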
Feature bias: proxy feature. Postal code → ethnic group. Transaction frequency → income level → demographics. Correlation analysis of all features with protected attributes. Remove proxies or use adversarial debiasing.
Label bias: annotators labeled differently for different groups. Inter-annotator agreement by subgroup. Re-label problematic segments.
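Per-subgroup inter-annotator agreement can be checked with a hand-rolled Cohen's kappa for binary labels (the annotator data below is synthetic, chosen so one group agrees well and the other does not):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' binary labels (manual sketch)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    p1a, p1b = sum(a) / n, sum(b) / n               # each annotator's rate of label 1
    pe = p1a * p1b + (1 - p1a) * (1 - p1b)          # agreement expected by chance
    return (po - pe) / (1 - pe)

# Synthetic labels: strong agreement on group A, near-random on group B
ann1 = {"A": [1, 1, 0, 0, 1, 0, 1, 0], "B": [1, 0, 1, 0, 1, 0, 1, 0]}
ann2 = {"A": [1, 1, 0, 0, 1, 0, 1, 1], "B": [0, 1, 1, 1, 0, 0, 1, 1]}
for g in ann1:
    print(g, round(cohens_kappa(ann1[g], ann2[g]), 2))
```

A large kappa gap between subgroups is direct evidence that labels, not the model, carry the bias.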
Threshold bias: single classification threshold is unfair at different base rates. Threshold optimization separately by group (Fairlearn ThresholdOptimizer).
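A minimal version of per-group threshold adjustment: for each group, pick the highest score threshold that still reaches a target recall on that group's positives. This is a simplification of what Fairlearn's ThresholdOptimizer does under formal constraints; names and data below are illustrative:

```python
import numpy as np

def per_group_thresholds(scores, y_true, groups, target_recall=0.8):
    """Highest per-group threshold that still achieves the target recall."""
    thresholds = {}
    for g in np.unique(groups):
        pos_scores = scores[(groups == g) & (y_true == 1)]
        # Scores at or above this quantile cover ~target_recall of positives
        thresholds[g] = np.quantile(pos_scores, 1 - target_recall)
    return thresholds

# Group B's positives score systematically lower, so it gets a lower threshold
scores = np.array([0.5, 0.6, 0.7, 0.8, 0.9, 0.2, 0.3, 0.4, 0.5, 0.6, 0.1, 0.15])
y_true = np.array([1] * 10 + [0] * 2)
groups = np.array(["A"] * 5 + ["B"] * 5 + ["A", "B"])
th = per_group_thresholds(scores, y_true, groups)
print(th)
```

Note that group-specific thresholds trade one fairness definition for another (they break demographic parity of treatment), which is why the fairness definition must be chosen up front.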
Practical Case
The client is an HR-tech company with a resume-scoring model (CatBoost, 85 features). An internal audit found that recall for candidates with foreign names was 17 percentage points lower than for other candidates.
Root cause analysis: the "university name" feature had high weight and was encoded via target encoding — universities from certain countries systematically received low encoded values due to historical underrepresentation of hired candidates. Proxy discrimination through educational institution.
Solution:
- Replaced target encoding with neutral frequency encoding for this feature
- Added adversarial head to architecture (additional "foreign/not foreign name" classifier with gradient reversal)
- Threshold optimization via Fairlearn to equalize recall
Recall gap decreased from 17 pp to 4 pp with AUC loss = 0.008.
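The encoding swap from the first mitigation can be illustrated on toy data (university names and outcomes below are invented): target encoding bakes the historical hiring rate into the feature, while frequency encoding only reflects how often the category appears.

```python
import pandas as pd

df = pd.DataFrame({
    "university": ["U1", "U1", "U1", "U2", "U2", "U3"],
    "hired":      [1,    1,    0,    0,    0,    0],   # historical outcome
})

# Target encoding leaks the outcome: U2/U3 are scored 0 purely because
# no one from them was hired historically
target_enc = df.groupby("university")["hired"].mean()
# Frequency encoding carries no outcome information, only prevalence
freq_enc = df["university"].value_counts(normalize=True)

print(target_enc.to_dict())
print(freq_enc.to_dict())
```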
Documentation and Reporting
Audit results are documented in a standardized form. At minimum:
Model Card (Mitchell et al., 2019) — model description, training data, metrics by subgroup, known limitations.
Algorithmic Impact Assessment — analysis of potential harms, mitigations, residual risk.
For EU AI Act (high-risk systems) — mandatory technical documentation per Annex IV.
Timeline and Process
Audit of existing model — 2-3 weeks: collect subgroup data, measure metrics, root cause analysis, report with recommendations.
Mitigation + re-audit — another 3-5 weeks depending on bias source complexity.
Embedded process — bias audit as part of CI/CD: automatic Fairlearn metrics check on each retrain with deployment blocking on threshold violation. Setup takes 1-2 weeks.
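The CI/CD gate can be sketched as a check that fails the pipeline on any threshold violation (metric names, limits, and the measured values are assumptions mirroring the thresholds discussed above):

```python
import sys

# Assumed limits, matching the audit thresholds used earlier
THRESHOLDS = {"dp_difference": 0.1, "eo_difference": 0.1}

def ci_fairness_gate(measured):
    """Return the names of violated thresholds; the CI job exits
    nonzero on any violation to block deployment."""
    return [name for name, limit in THRESHOLDS.items()
            if measured.get(name, float("inf")) >= limit]

# Example: metrics produced by the retrain step (values are illustrative)
violations = ci_fairness_gate({"dp_difference": 0.14, "eo_difference": 0.06})
if violations:
    print("Blocking deploy, thresholds violated:", violations)
    # sys.exit(1)  # uncomment in the actual CI job
```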