Troubleshooting: Control 2.11 - Bias Testing and Fairness Assessment

Last Updated: April 2026

This playbook covers common operational and methodological issues encountered when running the Control 2.11 program. For setup, see portal-walkthrough.md; for scripts, see powershell-setup.md.


Quick Reference

| Issue | Likely Cause | Resolution |
| --- | --- | --- |
| Insufficient test data | Sample sizes below power-calculation floor | Expand synthetic dataset; recompute power |
| Cannot classify outcomes | Free-text responses with no rubric | Define explicit rubric; consider LLM-judge with human spot-check |
| Persistent bias after fix | Knowledge-source contamination or system-prompt leakage | Audit knowledge sources and system prompt; widen test scope |
| "Pass" but feels wrong | Threshold without significance test | Pair threshold with chi-square / Fisher / regression |
| Disparate impact ratio close to 0.80 | Small effect plus small sample | Increase n; report CI for the ratio |
| Manifest hash mismatch | Evidence file edited after emission | Re-run pipeline; do not edit emitted files in place |
| Quarterly cadence missed | No automated reminder; ownership gap | Wire Power Automate reminder; assign in RACI |
| Intersectional bias hidden | Marginal analysis only | Add intersectional cells (race × sex, age × income source) |
| Synthetic data too "clean" | Generator under-represents real variation | Calibrate generator against population statistics; add noise |
| Re-validation skipped after model change | No release gate | Add CI gate that calls Validate-Control-2.11.ps1 |

Detailed Troubleshooting

Issue 1 — Test Dataset Too Small (Inconclusive Results)

Symptoms: Chi-square / Fisher tests return p-values that fluctuate run-to-run; confidence intervals on the disparate-impact ratio span both above and below 0.80.

Diagnosis: Sample sizes per group are below what the power calculation requires for the effect size you want to detect.

Resolution:

  1. Recompute statistical power for the smallest effect you care about (e.g., a 5 pp parity gap at α = 0.05, power = 0.80).
  2. Generate additional synthetic cases for under-powered groups; rebalance the dataset.
  3. Re-run Invoke-FsiBiasTestSuite and Get-FsiFairnessMetrics.
  4. Document the new sample sizes in the methodology memo and replace the old version in the WORM library (the previous version is preserved by the retention label).
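The power recomputation in step 1 can be sketched with the standard two-proportion z-test approximation (a minimal sketch assuming SciPy; `n_per_group` is an illustrative helper, not one of the shipped scripts):

```python
import math
from scipy.stats import norm

def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size to detect a gap between two
    favorable-outcome rates with a two-sided two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the two-sided test
    z_beta = norm.ppf(power)            # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Detect a 5 pp parity gap around a 50% base rate (worst case for variance)
# at alpha = 0.05, power = 0.80.
n = n_per_group(0.50, 0.45)
```

Note that a 5 pp gap near a 50% base rate needs on the order of 1,500+ cases per group, which is why under-powered groups surface so often in practice.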

Issue 2 — Outcome Classification Is Subjective

Symptoms: Two reviewers classify the same agent response differently; pass/fail varies by who runs the analysis.

Diagnosis: Outcome rubric is not specific enough. Free-text agent responses do not map cleanly to "Positive" / "Negative" without explicit criteria.

Resolution:

  1. Define a written rubric with concrete examples for each outcome class.
  2. For Zone 3, use two independent reviewers plus inter-rater agreement (Cohen's kappa ≥ 0.80) before declaring results final.
  3. If using an LLM-judge, document the judge model, prompt, and version; spot-check ≥10% of classifications by a human reviewer.
  4. Capture the rubric and reviewer agreement statistics in the evidence library.
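The inter-rater agreement gate in step 2 reduces to Cohen's kappa, which is small enough to compute directly (an illustrative helper under an assumed two-rater setup; production may use a stats library instead):

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters over the same items: observed
    agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_chance = sum(counts_a[lbl] * counts_b[lbl] for lbl in labels) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Declare results final only when kappa >= 0.80 (step 2's gate).
kappa = cohens_kappa(["Pos", "Pos", "Neg", "Neg"], ["Pos", "Pos", "Neg", "Neg"])
```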

Issue 3 — Persistent Bias After Remediation

Symptoms: A second test run after a fix shows the same disparate impact.

Diagnosis: Common root causes:

  • Knowledge-source contamination — RAG sources contain biased framing.
  • System-prompt leakage — the system prompt steers the model toward certain outcomes.
  • Training data inheritance — the underlying foundation model carries bias the prompt cannot override.
  • Proxy variables — the agent infers protected attributes from non-protected fields (ZIP code, language, name).

Resolution:

  1. Audit knowledge sources for biased phrasing; reword or remove.
  2. Review the system prompt for words / framing that imply outcomes for specific groups.
  3. Add explicit fairness instructions to the system prompt; re-test.
  4. If the underlying model carries the bias, escalate to model selection — a different foundation model may be required for the use case.
  5. Material model changes trigger re-validation under SR 11-7. Capture the change ticket and re-validation memo in the evidence library.
  6. Do not declare the issue closed without a passing re-test cycle.
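A quick screen for the proxy-variable root cause is to measure association between a non-protected field (e.g., ZIP-code band) and a protected attribute in the test data. Cramér's V is one common choice (a sketch assuming SciPy and NumPy; the thresholds and field names are illustrative):

```python
import math
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(contingency) -> float:
    """Cramér's V association between two categorical fields. Values near 1
    suggest the non-protected field acts as a proxy for the protected one."""
    table = np.asarray(contingency, dtype=float)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    return math.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Rows: ZIP-code band; columns: protected-class group (counts).
perfect = [[50, 0], [0, 50]]        # field fully determines group
independent = [[25, 25], [25, 25]]  # field carries no group signal
```

A high V on a field the agent actually uses is a signal to re-test with that field held constant or removed.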

Issue 4 — Threshold "Pass" Without Significance

Symptoms: Demographic parity gap is 4 pp (under the 5 pp threshold) — but with only 50 cases per group.

Diagnosis: The threshold check passed but the result is statistically meaningless.

Resolution:

  1. Always pair a threshold check with a significance test. The PowerShell scripts emit metrics; run a Python (Fairlearn + scipy) or R worker for chi-square / Fisher / regression.
  2. Report both point estimate and 95% confidence interval for parity gap and disparate-impact ratio.
  3. If the CI for the disparate-impact ratio crosses 0.80, treat as inconclusive — increase n and re-run.
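Steps 1-3 can be sketched as follows (assuming SciPy; `di_ratio_ci` is a hypothetical helper, and the Wald log-scale interval is one common construction for a ratio of proportions):

```python
import math
from scipy.stats import fisher_exact, norm

def di_ratio_ci(fav_a, n_a, fav_b, n_b, confidence=0.95):
    """Disparate-impact ratio (group A selection rate over reference
    group B) with a Wald confidence interval built on the log scale."""
    p_a, p_b = fav_a / n_a, fav_b / n_b
    ratio = p_a / p_b
    se = math.sqrt((1 - p_a) / fav_a + (1 - p_b) / fav_b)  # SE of log(ratio)
    z = norm.ppf(1 - (1 - confidence) / 2)
    lo = math.exp(math.log(ratio) - z * se)
    hi = math.exp(math.log(ratio) + z * se)
    return ratio, lo, hi

# 2x2 outcome table: [favorable, unfavorable] per group, 50 cases each.
table = [[38, 12], [45, 5]]
_, p_value = fisher_exact(table)
ratio, lo, hi = di_ratio_ci(38, 50, 45, 50)
inconclusive = lo < 0.80 < hi   # CI crosses the 0.80 floor: inconclusive
```

With only 50 cases per group the point estimate (0.84) looks like a pass, but the interval straddles 0.80, which is exactly the step 3 situation: increase n and re-run.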

Issue 5 — Disparate Impact Ratio Hovering at 0.80

Symptoms: DI ratio reports 0.81 one quarter, 0.78 the next. Hard to declare a stable result.

Diagnosis: Borderline disparate impact with high variance.

Resolution:

  1. Treat 0.80 as a floor, not a target. Borderline values warrant remediation even if technically "passing."
  2. Increase sample size to tighten the CI.
  3. Consider whether the underlying outcome variable is the right one — e.g., recommendation rate vs. recommended product class may give different signals.
  4. Document the borderline status in the attestation; do not paper over it.

Issue 6 — Intersectional Bias Hidden by Marginal Analysis

Symptoms: Each protected class passes individually, but customer complaints suggest bias for specific intersections (e.g., older women, Black men).

Diagnosis: Marginal (one-class-at-a-time) analysis can hide intersectional patterns.

Resolution:

  1. Add intersectional cells to the test dataset for high-risk pairs (race × sex, age × public-assistance, national-origin × sex).
  2. Compute fairness metrics at the cell level, not just the marginal level.
  3. Sample-size requirement applies per cell — e.g., 50 per cell, not 50 per marginal group.
  4. Report intersectional results separately in the Power BI dashboard.
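Cell-level computation in steps 2-3 can be sketched like this (the `(race, sex, favorable)` record shape is an assumption for illustration; the production dataset schema may differ):

```python
from collections import defaultdict

def cell_selection_rates(records):
    """Favorable-outcome rate and sample size per intersectional cell.
    Each record is (race, sex, favorable: bool) -- a hypothetical shape."""
    totals, favorable = defaultdict(int), defaultdict(int)
    for race, sex, fav in records:
        cell = (race, sex)
        totals[cell] += 1
        favorable[cell] += int(fav)
    rates = {cell: favorable[cell] / totals[cell] for cell in totals}
    return rates, dict(totals)

records = [
    ("A", "F", True), ("A", "F", False),
    ("A", "M", True), ("A", "M", True),
    ("B", "F", False), ("B", "F", False),
]
rates, ns = cell_selection_rates(records)
MIN_N = 50  # the per-cell floor from step 3
underpowered = [cell for cell, n in ns.items() if n < MIN_N]
```

Note how the per-cell floor bites: every cell in this toy dataset is under-powered even though the marginal groups might individually clear a 50-case floor.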

Issue 7 — Manifest SHA-256 Mismatch

Symptoms: Validate-Control-2.11.ps1 reports a hash mismatch.

Diagnosis: An evidence file was edited (or re-saved by a viewer) after emission. WORM retention prevents this in production but local working copies can drift.

Resolution:

  1. Never edit emitted JSON files in place — re-run the pipeline.
  2. Confirm Purview WORM retention is applied to the SharePoint library (prevents post-emission edits).
  3. If a legitimate correction is needed, emit a new version, update the manifest, and document the supersession in the methodology memo.
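The hash comparison that Validate-Control-2.11.ps1 performs can be illustrated in Python (the manifest JSON shape shown here is an assumption for the sketch, not the script's actual schema):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def verify_manifest(manifest_path: Path) -> list[str]:
    """Re-hash each listed evidence file and return the paths whose
    current SHA-256 no longer matches the hash recorded at emission."""
    manifest = json.loads(Path(manifest_path).read_text())
    mismatches = []
    for entry in manifest["files"]:   # assumed shape: {"path": ..., "sha256": ...}
        digest = hashlib.sha256(Path(entry["path"]).read_bytes()).hexdigest()
        if digest != entry["sha256"].lower():
            mismatches.append(entry["path"])
    return mismatches

# Demo: emit a file, record its hash, then drift it with an in-place edit.
work = Path(tempfile.mkdtemp())
evidence = work / "fairness-metrics.json"
evidence.write_text('{"di_ratio": 0.84}')
manifest = work / "manifest.json"
manifest.write_text(json.dumps({"files": [
    {"path": str(evidence),
     "sha256": hashlib.sha256(evidence.read_bytes()).hexdigest()},
]}))
clean = verify_manifest(manifest)        # no drift yet
evidence.write_text('{"di_ratio": 0.99}')  # an in-place edit drifts the hash
drifted = verify_manifest(manifest)
```

Even a viewer that re-saves the file with different whitespace changes the digest, which is why the rule is to re-emit rather than correct in place.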

Issue 8 — Quarterly Cadence Missed

Symptoms: Last results file is more than 100 days old; Validate-Control-2.11.ps1 flags a cadence violation.

Diagnosis: No automation, no reminder, or unclear ownership.

Resolution:

  1. Wire the Power Automate orchestrator (Step 5 in portal-walkthrough.md) with a 90-day recurrence trigger.
  2. Add a 30-day prior reminder to the AI Governance Lead and Compliance Officer.
  3. Confirm the RACI for this control names a single accountable owner per quarter.
  4. For the missed quarter, run a catch-up assessment and document the gap and remediation in the next attestation.
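The script's >100-day flag is consistent with a 90-day cadence plus roughly a 10-day grace window; that check can be sketched as follows (hypothetical helper, thresholds assumed):

```python
from datetime import date

def cadence_status(last_run: date, today: date,
                   cadence_days: int = 90, grace_days: int = 10) -> str:
    """'ok' inside the quarterly window, 'due' once 90 days pass,
    'overdue' past the grace period (the >100-day flag)."""
    age = (today - last_run).days
    if age > cadence_days + grace_days:
        return "overdue"
    if age > cadence_days:
        return "due"
    return "ok"
```

The 30-day-prior reminder in step 2 would fire in the "ok" window, before the status ever reaches "due".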

Escalation Path

  1. Agent Owner — initial triage, prompt / knowledge-source remediation
  2. Data Science Team — methodology, statistical analysis, threshold tuning
  3. AI Governance Lead — methodology approval, cross-agent patterns
  4. Compliance Officer — regulatory interpretation (ECOA / Reg B applicability, fair-lending exam posture)
  5. Model Risk Manager — independent challenge, SR 11-7 effective challenge
  6. Legal — fair-lending counsel for material findings or examination prep

Known Limitations

| Limitation | Impact | Workaround |
| --- | --- | --- |
| LLM responses are stochastic | Same input may yield different outputs | Run each prompt ≥3 times; report variance; consider deterministic settings (temperature=0) where supported |
| Synthetic data may not reflect real-world distributions | False sense of fairness | Calibrate generator against population statistics; supplement with anonymized production sampling per privacy review |
| Outcome classification is subjective | Inter-rater drift | Two-reviewer model + Cohen's kappa; document rubric |
| No regulator-blessed FSI fairness toolkit | Methodology open to challenge | Document methodology thoroughly; align with NIST AI RMF MEASURE-2.11 and Fed SR 11-7; use Microsoft Responsible AI Toolbox / Fairlearn as reference |
| Single foundation-model dependency | Model-level bias hard to remediate via prompts | Track at model-selection level; escalate to model risk |
| Disparate-impact ratio sensitive to small denominators | Volatile near the 0.80 floor | Require minimum group n in methodology; report CIs |

Back to Control 2.11 | Portal Walkthrough | PowerShell Setup | Verification Testing