Verification & Testing: Control 3.10 - Hallucination Feedback Loop
Last Updated: April 2026 | Testing Level: Control Validation | Estimated Time: 60-90 minutes
This playbook provides verification and testing procedures for Control 3.10. Use it after the Portal Walkthrough or PowerShell Setup, and re-run quarterly as part of the supervisory cadence required by FINRA Rule 3110.
Test Environment Setup
Before beginning verification testing, prepare:
- At least one Zone 2 and one Zone 3 test agent in Copilot Studio
- Copilot Studio CSAT enabled and `Report Inaccurate Response` topic published on each test agent
- SharePoint `Hallucination Tracking` list provisioned with Purview retention label applied
- Power Automate flows (`HFL-Intake-3.10`, SLA monitor, trend detector) enabled
- Test users with Reporter role (regular user) and Triager role (AI Governance Lead delegate)
- Access to Power BI workspace `AI Governance — Pillar 3 Reporting`
- Test incident response queue (Control 3.4) wired to the Critical escalation path
Avoid running these tests in production agents that handle live customer queries — synthetic load can skew CSAT analytics and inflate hallucination rate metrics.
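Before starting, you can confirm the SharePoint prerequisites from a shell with PnP.PowerShell. This is a minimal sketch, not part of the playbook scripts; the site URL is a placeholder and the field names assume the default schema.

```powershell
# Minimal prerequisite check (PnP.PowerShell). The site URL is a
# placeholder; substitute your governance site.
Connect-PnPOnline -Url "https://contoso.sharepoint.com/sites/AIGovernance" -Interactive

# Confirm the tracking list exists, then spot-check its key fields
$list = Get-PnPList -Identity "Hallucination Tracking" -ErrorAction Stop
Get-PnPField -List $list |
    Where-Object { $_.InternalName -in @("IssueID", "Category", "Severity", "Status", "ReportDate") } |
    Select-Object InternalName, TypeAsString
```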
Test Case 1: End-to-End Feedback Capture (CSAT Path)
Objective: Verify a user thumbs-down + comment lands in Hallucination Tracking within SLA.
Regulatory Anchor: FINRA Rule 3110 (supervisory review), SEC 17a-4 (record creation)
Test Steps
- As the Reporter user, open the published Zone 2 test agent in its deployed channel (Teams, web, or Copilot Studio test pane)
- Send a query that the agent will answer (e.g., `What is the current Reg D limit?`)
- Select thumbs down on the response
- Submit a comment of the form `[Hallucination][Factual Error][High] Stated $X but actual is $Y` (see the parsing sketch after these steps)
- Wait up to 60 seconds for the intake flow to run
- As the Triager, open the SharePoint `Hallucination Tracking` list
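The bracketed comment format is the convention the intake flow parses into `Category` and `Severity`. As an illustration only (the real parsing lives in the `HFL-Intake-3.10` flow), a PowerShell sketch of the pattern match involved:

```powershell
# Illustrative parse of the bracketed comment convention. The real parsing
# lives inside the HFL-Intake-3.10 flow; this only shows the shape of it.
$comment = '[Hallucination][Factual Error][High] Stated $X but actual is $Y'

if ($comment -match '^\[Hallucination\]\[(?<category>[^\]]+)\]\[(?<severity>[^\]]+)\]\s*(?<detail>.+)$') {
    [pscustomobject]@{
        Category = $Matches['category']   # "Factual Error"
        Severity = $Matches['severity']   # "High"
        Detail   = $Matches['detail']
    }
}
```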
Expected Results
- A new list item appears with `IssueID` matching the format `HAL-YYYYMMDD-NNN` (a validation sketch follows this list)
- `AgentName`, `AgentEnvironment`, `Zone`, `ConversationId`, `UserQuery`, `AgentResponse`, `ReportedBy` are all populated
- `Category = Factual Error`, `Severity = High`, `Status = New`
- An acknowledgment email arrives at the Reporter mailbox citing the `IssueID`
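To validate these results programmatically rather than by eyeballing the list, a PnP.PowerShell spot-check like the following can help. It assumes field internal names match the display names above; adjust if your schema differs.

```powershell
# Pull the newest tracking item and validate it. Assumes field internal
# names match the display names; adjust for your schema.
$item = Get-PnPListItem -List "Hallucination Tracking" |
    Sort-Object Id -Descending | Select-Object -First 1

if ($item["IssueID"] -notmatch '^HAL-\d{8}-\d{3}$') {
    Write-Warning "Unexpected IssueID format: $($item["IssueID"])"
}
foreach ($field in "AgentName", "AgentEnvironment", "Zone", "ConversationId", "UserQuery", "AgentResponse", "ReportedBy") {
    if ([string]::IsNullOrEmpty($item[$field])) { Write-Warning "$field is empty" }
}
```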
Evidence Collection
- Screenshot: agent thumbs-down + comment box with submitted text
- Screenshot: SharePoint list item showing all populated fields
- Email export (`.eml`) of the acknowledgment
Test Case 2: Critical Severity Escalation
Objective: Verify Critical reports trigger Control 3.4 incident creation and Compliance notification within 1 hour.
Regulatory Anchor: FINRA Notice 25-07 (real-time supervision), SOX 302 (material misstatement escalation)
Test Steps
- As the Reporter, invoke the `Report Inaccurate Response` topic on the Zone 3 test agent
- Choose `Fabrication` as the category and `Critical` as the suggested severity
- Provide a synthetic correct-answer string clearly marked `[TEST DATA]`
- Submit and start a stopwatch
- Monitor the AI Governance — Critical Teams channel and the incident queue (Control 3.4)
Expected Results
- Teams adaptive card posts in AI Governance — Critical within 5 minutes, tagging AI Governance Lead and Compliance Officer
- Incident record appears in the Control 3.4 queue with `RelatedIssueId` matching the hallucination `IssueID` (a cross-reference check sketch follows this list)
- The hallucination list item shows `RelatedIncidentId` populated
- Email arrives at the agent owner and AI Administrator mailboxes
- Within 1 hour: a Triager updates `Status = Triaged` and `AssignedTo` is populated
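The bidirectional cross-reference (`RelatedIssueId` on the incident, `RelatedIncidentId` on the hallucination item) is the linkage auditors most often sample. A sketch of checking the hallucination side, assuming both lists live on the same site; the `IssueID` shown is a placeholder:

```powershell
# Confirm the hallucination item was back-linked to the incident record.
# Assumes both lists share the site and the field names match your schema.
$hal = Get-PnPListItem -List "Hallucination Tracking" |
    Where-Object { $_["IssueID"] -eq "HAL-20260407-001" }   # substitute your test IssueID

if ([string]::IsNullOrEmpty($hal["RelatedIncidentId"])) {
    Write-Warning "RelatedIncidentId not populated; escalation wiring may have failed"
} else {
    Write-Host "Cross-referenced to incident $($hal["RelatedIncidentId"])"
}
```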
Evidence Collection
- Screenshot: Teams adaptive card with timestamp
- Screenshot: incident record cross-referencing the hallucination IssueID
- Run history export from the intake flow showing the Condition branch executed
Test Case 3: SLA Timer Accuracy
Objective: Verify the SLA monitor correctly flags breaches and only flags genuine breaches.
Test Steps
- Create three test items via the intake flow with `Severity = Medium` (72-hour remediation SLA)
- Backdate `ReportDate` on item A to 73 hours ago, using a System Account or via PowerShell (a sketch follows these steps)
- Leave item B with current `ReportDate`
- On item C, set `Status = Closed` and `ResolutionDate` to 71 hours after `ReportDate`
- Wait for the next hourly SLA monitor flow run (or trigger manually)
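For the backdating step, a PnP.PowerShell update can write `ReportDate` directly. `$itemAId` is the SharePoint item ID of test item A, and the `-UpdateType SystemUpdate` parameter (present in recent PnP.PowerShell releases) is an assumption to verify against your module version:

```powershell
# Backdate item A's ReportDate to 73 hours ago so the SLA monitor flags it.
# -UpdateType SystemUpdate avoids bumping the Modified timestamp; drop the
# parameter if your PnP.PowerShell version does not support it.
Set-PnPListItem -List "Hallucination Tracking" -Identity $itemAId -Values @{
    ReportDate = (Get-Date).AddHours(-73)
} -UpdateType SystemUpdate
```

Run the SLA monitor flow immediately afterward so the backdated item is evaluated within the same test session.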
Expected Results
- Item A: `RemediationSLAMet = No`, escalation card posted
- Item B: `RemediationSLAMet` empty (not yet due)
- Item C: `RemediationSLAMet = Yes` (closed within SLA)
- No false positives on items with `Status = Closed` or `Won't Fix`
Evidence Collection
- Flow run history showing the three items processed
- SharePoint list view filtered to `RemediationSLAMet = No`
Test Case 4: Trend Detection Threshold
Objective: Verify the daily trend detector posts an alert when an agent exceeds the configured rate.
Test Steps
- Submit 10 hallucination reports for the same Zone 3 test agent over a 24-hour synthetic window
- Ensure the test agent's recorded conversation count for the same window is 100 (10% rate, well above any reasonable threshold)
- Wait for the next 06:00 trend detector run (or trigger manually)
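The detector's core arithmetic is just reports divided by conversations over the window, compared against a threshold. A sketch of the check, with 5% as an illustrative threshold rather than the flow's actual configured value:

```powershell
# Illustrative trend check: 10 reports over 100 conversations is a 10% rate.
# The 5% threshold is an example; use your configured value.
$reports       = 10
$conversations = 100
$threshold     = 0.05

$rate = $reports / $conversations
if ($rate -gt $threshold) {
    Write-Host ("Trend alert: rate {0:P1} exceeds threshold {1:P0}" -f $rate, $threshold)
}
```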
Expected Results
- Teams message posts in AI Governance — Trends citing the agent and rate
- A `Trend Alerts` list item is created
- No alert is generated for unrelated agents
Evidence Collection
- Screenshot: Teams trend alert with rate and agent name
- Flow run history
Test Case 5: Power BI Dashboard Refresh and Accuracy
Objective: Confirm dashboard reflects current data within scheduled refresh window.
Test Steps
- Note the current Total Reports KPI value
- Submit 3 new test reports across different categories
- Trigger an on-demand dataset refresh in Power BI
- Reload the report
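If you prefer to script the refresh rather than click through the service, the MicrosoftPowerBIMgmt module wraps the standard Power BI REST refresh endpoint. The workspace and dataset GUIDs below are placeholders:

```powershell
# Trigger and then inspect an on-demand refresh via the Power BI REST API.
# Workspace (group) and dataset GUIDs are placeholders.
Connect-PowerBIServiceAccount

$groupId   = "<workspace-guid>"
$datasetId = "<dataset-guid>"

Invoke-PowerBIRestMethod -Method Post -Url "groups/$groupId/datasets/$datasetId/refreshes"
Invoke-PowerBIRestMethod -Method Get  -Url "groups/$groupId/datasets/$datasetId/refreshes" | ConvertFrom-Json
```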
Expected Results
- Total Reports KPI increases by exactly 3
- Category distribution chart reflects the new categories
- No data quality warnings on the dataset
Evidence Collection
- Before/after screenshots of the KPI tile
- Refresh history export from the dataset settings
Test Case 6: Retention Label Enforcement
Objective: Verify the Purview retention label cannot be removed by a non-records-manager and that disposition requires review.
Regulatory Anchor: SEC Rule 17a-4(f) (non-rewriteable, non-erasable record retention)
Test Steps
- As a regular site member, attempt to delete a closed list item older than 30 days
- As the Purview Records Manager, attempt to remove the retention label
Expected Results
- Regular user cannot delete the item (or the deletion is intercepted by retention)
- Records Manager sees a disposition review prompt rather than immediate deletion
- Audit log captures both attempts with actor, action, and timestamp
Evidence Collection
- Screenshot: deletion attempt error or retention notice
- Purview audit log export filtered to the test items
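The audit log export can be pulled with `Search-UnifiedAuditLog` from Exchange Online PowerShell. A sketch scoped to the last day of activity; the operation name filter is an assumption, so confirm the exact activity names in the Purview portal first:

```powershell
# Pull the test window's deletion activity from the unified audit log
# (Exchange Online PowerShell; requires audit search permissions). The
# operation name is an assumption; confirm exact activity names in Purview.
Connect-ExchangeOnline

Search-UnifiedAuditLog -StartDate (Get-Date).AddDays(-1) -EndDate (Get-Date) `
    -Operations "ListItemDeleted" |
    Select-Object CreationDate, UserIds, Operations |
    Export-Csv -Path .\purview-audit-export.csv -NoTypeInformation
```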
Test Case 7: Reporter Acknowledgment and Closure Notification
Objective: Verify communication loop closes with the original reporter.
Test Steps
- Submit a test report
- Triage and remediate to `Status = Closed` with a populated `RemediationActions` field
Expected Results
- Reporter receives an acknowledgment at intake (Test Case 1)
- Reporter receives a closure notification with the remediation summary
- Closure email contains a link back to the SharePoint item
Evidence Collection
- Email exports of both notifications
Quarterly Compliance Checklist
Run this checklist as part of the supervisory cadence and store the completed copy in your WSP evidence archive.
| Item | Required For | Status |
|---|---|---|
| CSAT enabled on all in-scope published agents | Quality management — FINRA 3110 | |
| Hallucination taxonomy reviewed and approved by Compliance | Consistent categorization — FINRA 25-07 | |
| Tracking list operational with current schema | FINRA 3110 supervisory review | |
| Purview retention label applied (6 years) | SEC 17a-4 record retention | |
| Intake, SLA, and trend flows all enabled (failed-run rate below 5%; see the sketch after this table) | Process integrity | |
| MTTR within firm target | Operational metric | |
| SLA compliance ≥ firm target (e.g., 95%) | Operational metric | |
| Power BI dashboard refreshed in last 7 days | Reporting integrity | |
| At least one tabletop exercise of Critical escalation in the quarter | FINRA 3110 supervisor training | |
| Quarterly trend report delivered to AI Governance Council | Governance cadence |
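For the flow health line item, a rough failed-run-rate check is scriptable from the maker-side Power Apps module. Treat this as a sketch: it assumes `Get-FlowRun` is available in your module version, that `$flowName` holds the flow's name identifier, and that run objects expose a `Status` property.

```powershell
# Rough failed-run-rate check for one flow. Assumes Get-FlowRun
# (Microsoft.PowerApps.PowerShell) is available and that run objects
# expose a Status property; verify against your module version.
$runs   = Get-FlowRun -FlowName $flowName
$failed = ($runs | Where-Object { $_.Status -eq "Failed" }).Count
$rate   = if ($runs.Count -gt 0) { $failed / $runs.Count } else { 0 }
Write-Host ("Failed-run rate: {0:P1}" -f $rate)
```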
Evidence Collection for Audit
For FINRA, SEC, or internal audit reviews, prepare an evidence pack containing:
| Evidence Item | Source | Retention |
|---|---|---|
| CSAT configuration screenshot per agent | Copilot Studio | 6 years (SEC 17a-4) |
| Hallucination Tracking list export (CSV or JSON) | SharePoint | 6 years |
| Conversation transcripts for sampled items | Dataverse / Application Insights | 6 years |
| Intake flow run history (sampled) | Power Automate | Per Power Platform retention + export |
| SLA compliance metrics report | Power BI export | 6 years |
| Quarterly trend reports | Power BI export | 6 years |
| Remediation evidence (knowledge source updates per Control 2.16, prompt changes) | Source control + change tickets | 6 years |
| Purview retention label policy and audit log | Purview portal | Per Purview policy |
Wherever possible, exports should include cryptographic hashes (SHA-256) generated at export time. The PowerShell setup playbook includes helper functions that emit hashes alongside CSV exports.
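If you export evidence outside those helpers, the same effect is easy to approximate with built-in cmdlets: write the CSV, then hash it at export time. A minimal sketch (columns trimmed for brevity):

```powershell
# Export the tracking list and emit a SHA-256 hash alongside the CSV,
# approximating the playbook helpers with built-in cmdlets.
$export = ".\hallucination-tracking-export.csv"

Get-PnPListItem -List "Hallucination Tracking" | ForEach-Object {
    [pscustomobject]@{
        IssueID  = $_["IssueID"]
        Category = $_["Category"]
        Severity = $_["Severity"]
        Status   = $_["Status"]
    }
} | Export-Csv -Path $export -NoTypeInformation

(Get-FileHash -Path $export -Algorithm SHA256).Hash | Set-Content -Path "$export.sha256"
```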
Negative Test Cases
Equally important — verify the system does not misbehave:
| Negative Test | Expected Behavior |
|---|---|
| Submit thumbs-up | No item created, no flow triggered |
| Submit malformed JSON to intake URL | Flow returns 400, no list item created, error captured in run history |
| Disable CSAT on an agent | New conversations on that agent produce no items; existing items remain intact |
| Delete a triaged item as a regular user | Action denied or intercepted by retention label |
| Run intake flow with same `ConversationId` twice | Both items created (deduplication is intentionally manual at triage to avoid losing distinct turns) |
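The malformed-JSON row is worth scripting, since the portal offers no easy way to send a bad payload. A sketch using `Invoke-WebRequest`, where `$intakeUrl` is the HTTP trigger URL of your intake flow:

```powershell
# Negative test: POST malformed JSON to the intake flow's HTTP trigger and
# confirm the 400 rejection. $intakeUrl is your flow's trigger URL.
try {
    Invoke-WebRequest -Uri $intakeUrl -Method Post `
        -ContentType "application/json" -Body '{ "this is not valid json '
    Write-Warning "Request unexpectedly succeeded; the malformed payload was accepted"
}
catch {
    Write-Host "Rejected as expected: $($_.Exception.Response.StatusCode)"
}
```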
Next Steps
- Portal Walkthrough — Initial configuration
- PowerShell Setup — Automation scripts
- Troubleshooting — Common issues
Updated: April 2026 | Version: v1.4.0 | UI Verification Status: Current