Control 2.5: Testing, Validation, and Quality Assurance
Control ID: 2.5
Pillar: Management
Regulatory Reference: SOX Sections 302/404, FINRA Rule 4511, FINRA Rule 3110, FINRA RN 24-09, GLBA 501(b), SEC 17a-4(b)(4), OCC Bulletin 2026-13 (formerly OCC Bulletin 2011-12) / Fed SR 26-2 (formerly SR 11-7), and NIST AI RMF 1.0 + Generative AI Profile (testing, measurement, and ongoing monitoring expectations)
Last UI Verified: May 2026
Governance Levels: Baseline / Recommended / Regulated
Agent 365 Architecture Update
Agent 365 lifecycle management supports promotion gates that can enforce testing and validation requirements before agents move between environments. See Unified Agent Governance for promotion gate configuration.
Objective
Help validate that Microsoft Copilot Studio agents, Microsoft 365 Copilot extensibility scenarios, and any related Microsoft Foundry (formerly Azure AI Foundry) -backed agent workloads operate within zone-defined quality, safety, and performance thresholds before promotion to production through functional testing, regression testing, adversarial testing, and ongoing monitoring. This control also supports the independent-validation and evidence expectations associated with OCC Bulletin 2026-13 (formerly OCC Bulletin 2011-12) / Fed SR 26-2 (formerly SR 11-7) for finance-touching or customer-facing AI use cases.
Why This Matters for FSI
- SOX Sections 302/404: If an AI assistant or agent influences finance, controllership, disclosure drafting, or reconciliations, management should treat testing evidence as part of the internal-control environment and document the review boundary.
- FINRA Rule 4511 + SEC 17a-4(b)(4): Test plans, evaluator outputs, promotion approvals, and supervisory review records can become books-and-records evidence for AI-enabled communications and workflows.
- FINRA RN 24-09 / Rule 3110 (June 2024): Reaffirms that existing supervisory and recordkeeping obligations apply to generative AI; firms should be able to explain how AI outputs were tested, approved, and monitored.
- FINRA Rule 3110: Supervisory testing should cover not only bias but also accuracy, suitability, escalation paths, and review of AI-generated customer communications where applicable.
- OCC Bulletin 2026-13 (formerly OCC Bulletin 2011-12) / Fed SR 26-2 (formerly SR 11-7): Higher-risk or finance-touching AI use cases should be subject to functionally independent validation and ongoing monitoring, not only maker-side testing.
- GLBA 501(b): Security safeguards should be periodically tested for AI systems that handle or expose customer information.
Model Swap and Third-Party Model Revalidation
Treat any foundation-model change, provider change, major prompt-orchestration change, or grounding-source change as a mandatory re-validation event. This includes champion/challenger comparisons such as GPT-4o vs GPT-4.1 or other tenant-approved alternatives where available. Response quality, safety behavior, and latency can shift materially across models and providers. Do not assume third-party-model parity; verify subprocessor implications before relying on the new model in production.
Automation Available
See Conflict of Interest Testing in FSI-AgentGov-Solutions for conflict of interest testing for agent recommendations.
Control Description
This control establishes comprehensive testing requirements for AI agents across the development lifecycle:
| Test Type | Description | FSI Application |
|---|---|---|
| Functional Testing | Verify agent responds correctly to user queries | Customer accuracy |
| Security Testing | Validate data protection and access controls | Data protection |
| Performance Testing | Verify agents respond within 3 seconds (Zone 1-2) or 2 seconds (Zone 3) for standard queries | Customer experience |
| Bias Testing | Detect and mitigate unfair treatment | FINRA 3110 compliance |
| Regression Testing | Confirm changes don't break existing functionality | Change management |
| UAT | Business validation before production deployment | Stakeholder approval |
Evaluation Gates
| Gate | Lifecycle Stage | Required Validations |
|---|---|---|
| Gate 1 | Design > Build | Business justification, risk classification |
| Gate 2 | Build > Evaluate | Prompt injection testing, authentication validation |
| Gate 3 | Evaluate > Deploy | Test pass rate >95%, UAT sign-off |
| Gate 4 | Deploy > Monitor | Compliance approval, rollback plan documented |
Testing and Validation Planes
Testing for AI agents should be treated as five distinct evidence planes, each serving a different governance purpose:
| Plane | Primary Tooling | What It Demonstrates | FSI Caveat |
|---|---|---|---|
| Developer smoke testing | Copilot Studio Test Pane / developer mode | Topic flow, variable behavior, quick issue isolation | Useful during authoring, but not sufficient as independent validation evidence under Fed SR 26-2 (formerly SR 11-7) |
| Repeatable batch evaluation | Copilot Studio Agent Evaluation / test sets | Regression detection, version comparison, repeatable scoring across prompts | Should be exported and retained as part of the approval record |
| Quantitative quality and safety validation | Azure AI Evaluation SDK / Azure AI Foundry | External evaluator runs and reproducible metrics outside the maker surface | Stronger fit for independent validation packs and model-risk review |
| Adversarial testing | PyRIT or equivalent | Jailbreak, prompt-injection, misuse, and safety-resilience testing | Cross-reference Control 1.21 for adversarial evidence handling |
| Post-deployment monitoring | Copilot Studio Analytics / Quality dashboards and production monitoring | Drift, failure, abandonment, escalation, satisfaction, and real-world usage trends | Required for ongoing monitoring after release |
For higher-risk use cases, the validation record should go beyond functional pass/fail and capture quality metrics such as groundedness, relevance, coherence, fluency, similarity / AI-assisted text similarity, and F1, plus applicable safety evaluators such as Hate/Unfairness, Self-Harm, Sexual, Violence, Protected Materials, Indirect Attack, and where supported, Code Vulnerability and Ungrounded Attributes. Some evaluator families are preview, tenant-dependent, or cloud-dependent; verify availability at deploy time.
Independent Validation Boundary (SR 11-7 / OCC Bulletin 2026-13 (formerly OCC 2011-12))
The Test Pane and maker-run regression tests are valuable for development, but they are not a substitute for functionally independent validation when the agent is customer-facing, finance-touching, or otherwise material. For Zone 3 or higher-risk use cases, the validation package should be reviewed or countersigned by a party independent of the agent author, such as a Model Risk Manager, Compliance Officer, or separate validation function.
License Requirements
- Copilot Studio — required for in-product Test Pane and Agent Evaluation capabilities.
- Microsoft 365 Copilot — required for declarative-agent and Microsoft 365 extensibility test scenarios.
- Azure AI Foundry / Azure AI Evaluation SDK — required for external quality and safety evaluator runs; consumption-based billing applies.
- Microsoft 365 Agents Toolkit — recommended for local preview / sideload testing of declarative agents in Visual Studio Code.
- Power Platform Pipelines / Managed Environments — recommended to enforce promotion gates, approvals, and deployment governance.
- Microsoft Purview Audit / eDiscovery — recommended where testing and approval evidence must be retained for supervisory, audit, or examination purposes.
- PyRIT — open source; runs on customer-controlled infrastructure and should be validated for supportability before being treated as a primary control component.
Foundry Observability and Continuous Evaluation
Beyond pre-deployment evaluation runs, Microsoft Foundry observability (learn.microsoft.com/azure/ai-foundry/concepts/observability) provides built-in evaluators and Azure Monitor / Application Insights integration so the same evaluator families used in pre-release validation can run continuously in production against sampled traffic. For Zone 3 agents, configure Foundry tracing to emit evaluator scores into App Insights so trend, drift, and regression evidence flows into the same telemetry plane used by Control 3.3 (Compliance and Regulatory Reporting) for periodic supervisory reporting.
Key Configuration Points
- Maintain separate Dev, Test / UAT, and Production environments with documented promotion controls; for Zone 3, keep the validation evidence package separate from the maker workspace.
- Use the Copilot Studio Test Pane for smoke testing, variable inspection, and conversation debugging during development, but do not treat it as the sole validation record.
- Build and version-control golden datasets and adverse test sets, including expected answers, negative cases, escalation scenarios, hallucination checks, and policy-sensitive prompts.
- Use Copilot Studio Agent Evaluation for structured, repeatable batch testing and cross-version regression analysis:
- create or import test sets
- compare current vs previous versions
- export results for evidence retention
- include real-world scenarios where policy permits
- For higher-risk use cases, supplement in-product evaluation with Azure AI Evaluation SDK / Azure AI Foundry runs that capture quality metrics such as groundedness, relevance, coherence, fluency, similarity / AI-assisted text similarity, and F1, plus applicable safety evaluators.
- Run PyRIT or equivalent adversarial campaigns for Zone 3 and other high-risk agents, especially after material changes to prompts, connectors, knowledge sources, or model endpoints. Cross-reference Control 1.21.
- Use Microsoft 365 Agents Toolkit local preview / sideload testing for declarative agents and other Microsoft 365 Copilot extensibility scenarios before broader rollout.
- Treat model swap, prompt-orchestration change, knowledge-source change, and action/plugin change as mandatory re-validation events. A/B or champion/challenger comparison is recommended for material changes.
- Enforce Power Platform Solution Checker and Pipelines / Managed Environments approvals as promotion gates; do not promote if critical validation checks fail.
- Review Copilot Studio Analytics and other production monitoring signals after deployment for drift, abandonment, escalation, satisfaction, latency, and quality trends.
- Retain test evidence, evaluator outputs, approval records, and monitoring summaries according to the firm’s WSP, books-and-records schedule, and model-risk evidence policy.
February 2026 Pipeline Deadline
Microsoft has signaled tightening of the Managed Environments expectation for pipeline target environments — including those used for testing — through 2026. Organizations should review Control 2.1 for current guidance, licensing implications, and recommended proactive actions.
Zone-Specific Requirements
| Zone | Requirement | Rationale |
|---|---|---|
| Zone 1 (Personal) | Developer-side smoke testing in Test Pane, basic security review, and documented acceptable-use boundaries; no unsupervised regulated-customer communications | Lower-risk use still benefits from basic validation but should avoid higher-risk regulated scenarios |
| Zone 2 (Team) | Repeatable batch testing, version-controlled test sets, named approver, and pipeline-based promotion evidence; monthly post-release analytics review | Shared agents require repeatability and named accountability |
| Zone 3 (Enterprise) | Independent validation or countersignature, Azure AI Evaluation / equivalent external metrics, PyRIT or equivalent adversarial testing, supervisory approval where AI-generated communications are in scope, and documented ongoing monitoring with rollback plan | Customer-facing, finance-touching, or examiner-relevant use cases need stronger evidence and segregation of duties |
Roles & Responsibilities
| Role | Responsibility |
|---|---|
| AI Governance Lead | Defines the testing standard, evidence requirements, exception path, and review cadence |
| Agent Owner | Owns business test cases, UAT completion, and production-readiness sign-off for the use case |
| Copilot Studio Agent Author | Performs maker-side smoke testing, fixes defects, and maintains the version-controlled test suite |
| Model Risk Manager | Performs or approves functionally independent validation for Zone 3 or finance-touching agents |
| Power Platform Admin | Maintains test environments, Solution Checker posture, Pipelines / Managed Environments, and promotion-gate configuration |
| Compliance Officer | Verifies regulatory evidence and reviews higher-risk testing results for FINRA / SOX / OCC alignment |
| Purview Records Manager | Confirms retention and defensible preservation of test evidence, approvals, and monitoring artifacts |
| Designated Supervisor / Registered Principal | Provides supervisory sign-off where AI-generated customer communications or broker-dealer communications workflows are in scope under FINRA Rule 3110 |
Read-only Analytics Access for Validators and Supervisors
Independent validators (Model Risk Manager) and Designated Supervisors frequently need to review post-deployment quality and drift signals on the Copilot Studio Analytics page without being granted edit rights on the agent. The Copilot Studio Analytics Viewer sharing role helps meet this separation-of-duties requirement by granting read-only access to the agent's Analytics page; pair it with the Bot Transcript Viewer role to also expose conversation transcripts used as testing and supervisory evidence. The role is shared by the agent owner via the agent's three-dots menu → Share → Analytics viewer and must be assigned to individual users — security groups are not supported, so maintain a named-individual attestation list to support FINRA 3110 and OCC Bulletin 2026-13 (formerly OCC 2011-12) evidence trails. See Share an agent.
Copilot Studio Analytics Retention Windows (May 2026)
Per Microsoft Learn, analytics data is available for up to 180 days; session details and transcript information is available for the last 28 days. Validation evidence, transcript records, and testing artifacts required beyond these windows must be exported and stored in a retention-bound location (Purview retention, Log Analytics, or a zone-appropriate WORM store) before expiry. For Zone 3 evidence packs, schedule automated export before the 28-day session-data window closes.
Related Controls
| Control | Relationship |
|---|---|
| 1.7 - Comprehensive Audit Logging and Compliance | Audit trail for validation activity, approvals, and post-deployment monitoring evidence |
| 1.19 - eDiscovery for Agent Interactions | Preservation and production of testing and supervisory evidence where required |
| 1.21 - Adversarial Input Logging | PyRIT, prompt-injection evidence, and adversarial monitoring alignment |
| 2.1 - Managed Environments | Promotion-gate enforcement, pipeline governance, and environment controls |
| 2.3 - Change Management | Material-change triggers for re-validation and release approval |
| 2.11 - Bias Testing | Fairness and outcome testing within the broader validation program |
| 2.13 - Documentation and Record-Keeping | Evidence retention, testing records, and version history |
| 2.20 - Adversarial Testing | Security-focused red-team procedures complement functional and supervisory testing |
| 4.7 - Microsoft 365 Copilot Data Governance | Data-governance and third-party-model considerations for Microsoft 365 Copilot scenarios |
Implementation Playbooks
Step-by-Step Implementation
This control has detailed playbooks for implementation, automation, testing, and troubleshooting:
- Portal Walkthrough — Step-by-step portal configuration
- PowerShell Setup — Automation scripts
- Verification & Testing — Test cases and evidence collection
- Troubleshooting — Common issues and resolutions
Verification Criteria
Confirm control effectiveness by verifying:
- A version-controlled test strategy exists with zone-specific acceptance criteria, promotion gates, and material-change triggers for re-validation.
- Separate Dev, Test / UAT, and Production environments exist, and production promotion requires documented approval.
- Maker-side smoke testing in the Copilot Studio Test Pane or equivalent local preview path has been completed and any major defects remediated.
- Repeatable batch evaluation has been run for the current version using exported test sets and documented outcomes.
- For higher-risk use cases, the validation pack includes quantitative quality metrics such as groundedness, relevance, coherence, fluency, similarity / AI-assisted text similarity, F1, or equivalent measures approved by the firm.
- Applicable safety evaluators have been reviewed, including categories such as hate / unfairness, self-harm, sexual, violence, protected material, indirect attack, and other supported risk evaluators.
- PyRIT or equivalent adversarial testing has been completed for Zone 3 or other high-risk agents, and findings are documented with remediation status.
- Power Platform Solution Checker and Pipelines / Managed Environments promotion controls show a successful validation record for the deployed version.
- A functionally independent reviewer has approved or countersigned the validation package for Zone 3 or finance-touching agents.
- Supervisory sign-off is documented when the agent participates in AI-generated customer communications or broker-dealer communications workflows subject to FINRA Rule 3110.
- Post-deployment analytics and quality trends are reviewed on a defined cadence, with thresholds for drift, escalation, or rollback documented.
- Test evidence, approvals, evaluator outputs, and monitoring records are retained according to the firm’s WSP, books-and-records requirements, and model-risk evidence schedule.
Additional Resources
Regulatory Guidance:
- FINRA Regulatory Notice 15-09: Algorithmic Trading Strategies — Precedent for automated system testing and supervision; applies testing principles to AI agents
Microsoft Documentation:
- Microsoft Learn: Test your agent in Copilot Studio
- Microsoft Learn: Generate and import test sets for agent testing
- Microsoft Learn: Analytics overview in Copilot Studio
- Microsoft Learn: Azure AI Evaluation SDK for Python
- Microsoft Learn: Built-in evaluators in Azure AI Foundry
- Microsoft Learn: Test and debug agents in Microsoft 365 Agents Toolkit
- Microsoft Learn: Solution Checker
- Microsoft Learn: Overview of pipelines in Power Platform
FINRA Notice 15-09 Testing Precedent
FINRA Regulatory Notice 15-09 (March 2015) addresses supervision of algorithmic trading strategies and provides a useful precedent for AI agent testing. Key principles include:
- Pre-deployment testing: Test strategies in controlled environments before production
- Ongoing monitoring: Continuously monitor algorithm performance
- Kill switch capability: Ability to halt algorithm operation quickly
- Change testing: Re-test after any modification
These principles apply directly to AI agent governance. Treat AI agents as "algorithms" requiring similar rigor.
Agent 365 SDK Testing (Preview)
Agent 365 GA — May 2026
Microsoft Agent 365 reached general availability on May 1, 2026 (bundled in Microsoft 365 E7 or available as a standalone Microsoft Agent 365 per-user license). Agent Essentials category definitions and SDK feature scope continue to mature post-GA — verify current feature availability against Microsoft Learn before implementing production controls dependent on specific Agent 365 capabilities.
Agent 365 and Copilot Studio provide testing capabilities for Blueprint-registered agents:
- Automated evaluation via Copilot Studio connector or REST API for CI/CD integration
- Evaluation metrics aligned with Blueprint promotion gates
Automated agent evaluation in Copilot Studio:
Microsoft Copilot Studio's agent evaluations feature reached general availability in March 2026 (Copilot Studio What's new) and provides test-set-based evaluation with set-level grading, multi-dimensional graders, and importable/exportable test sets.
Two automation surfaces are available:
- Copilot Studio connector for Power Automate (April 2026 GA) — trigger evaluations from flows; see Automate agent evaluations.
- REST API (Preview as of April 2026) — integrate evaluation into CI/CD pipelines via the Power Platform API; see Run automated agent evaluations from a REST API. Treat as Preview: do not depend on it for production change-gate enforcement until GA is announced.
Agent 365 SDK testing scope clarification:
Agent 365 SDK agents (Teams/Word/Outlook-hosted) are tested locally with Microsoft 365 Agents Playground (winget install agentsplayground or npm install -g @microsoft/m365agentsplayground). The Agent 365 CLI (a365, installed via dotnet tool install --global Microsoft.Agents.A365.DevTools.Cli) covers deployment and management of Agent 365 apps to Azure — not test-suite execution. See Test agents using the Microsoft Agent 365 SDK and Agent 365 CLI reference. Copilot Studio agents are out of scope of a365.
Zone-Specific Test Requirements:
| Zone | Testing Approach | Evidence Retention |
|---|---|---|
| Zone 1 | Manual testing via Agents Playground or Copilot Studio test console acceptable | 90 days |
| Zone 2 | Evaluation via Copilot Studio connector or REST API in deployment pipeline recommended | 3 years |
| Zone 3 | Automated evaluation required; blocking on failure; use Copilot Studio evaluation or PPAPI REST API | 7–10 years |
- Microsoft Learn: Agent 365 SDK Overview - SDK overview including testing capabilities
Updated: June 2026 | Version: v1.6.2 | UI Verification Status: Current