Control 2.5: Testing, Validation, and Quality Assurance
Control ID: 2.5
Pillar: Management
Regulatory Reference: SOX Sections 302/404, FINRA Rule 4511, FINRA Rule 3110, FINRA Regulatory Notice 25-07, GLBA 501(b), SEC 17a-4(b)(4), OCC Bulletin 2011-12 / Fed SR 11-7, and NIST AI RMF 1.0 + Generative AI Profile (testing, measurement, and ongoing monitoring expectations)
Last UI Verified: April 2026
Governance Levels: Baseline / Recommended / Regulated
Agent 365 Architecture Update
Agent 365 lifecycle management supports promotion gates that can enforce testing and validation requirements before agents move between environments. See Unified Agent Governance for promotion gate configuration.
Objective
Help validate, through functional testing, regression testing, adversarial testing, and ongoing monitoring, that Copilot Studio agents, Microsoft 365 Copilot extensibility scenarios, and any related Azure AI Foundry-backed agent workloads operate within zone-defined quality, safety, and performance thresholds before promotion to production. This control also supports the independent-validation and evidence expectations associated with OCC Bulletin 2011-12 / Fed SR 11-7 for finance-touching or customer-facing AI use cases.
Why This Matters for FSI
- SOX Sections 302/404: If an AI assistant or agent influences finance, controllership, disclosure drafting, or reconciliations, management should treat testing evidence as part of the internal-control environment and document the review boundary.
- FINRA Rule 4511 + SEC 17a-4(b)(4): Test plans, evaluator outputs, promotion approvals, and supervisory review records can become books-and-records evidence for AI-enabled communications and workflows.
- FINRA Regulatory Notice 25-07 (March 2025): Reaffirms that existing supervisory and recordkeeping obligations apply to generative AI; firms should be able to explain how AI outputs were tested, approved, and monitored.
- FINRA Rule 3110: Supervisory testing should cover not only bias but also accuracy, suitability, escalation paths, and review of AI-generated customer communications where applicable.
- OCC Bulletin 2011-12 / Fed SR 11-7: Higher-risk or finance-touching AI use cases should be subject to functionally independent validation and ongoing monitoring, not only maker-side testing.
- GLBA 501(b): Security safeguards should be periodically tested for AI systems that handle or expose customer information.
Model Swap and Third-Party Model Revalidation
Treat any foundation-model change, provider change, major prompt-orchestration change, or grounding-source change as a mandatory re-validation event. This includes champion/challenger comparisons such as GPT-4o vs GPT-4.1 or other tenant-approved alternatives where available. Response quality, safety behavior, and latency can shift materially across models and providers. Do not assume third-party-model or sovereign-cloud parity; verify Commercial / GCC / GCC High rollout and any subprocessor implications before relying on the new model in production.
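The champion/challenger comparison above can be sketched as a simple score-delta check over per-prompt evaluator results. The function below is an illustrative assumption, not a product feature; the 5-point regression tolerance is a hypothetical policy choice a firm would set in its own testing standard.

```python
from statistics import mean

def compare_models(champion_scores, challenger_scores, max_regression=0.05):
    """Compare per-prompt evaluator scores (0.0-1.0) for two model versions.

    Flags a material regression when the challenger's mean score drops by
    more than `max_regression` relative to the champion. The threshold is
    an illustrative policy choice, not a platform default.
    """
    champ_mean = mean(champion_scores)
    chall_mean = mean(challenger_scores)
    delta = chall_mean - champ_mean
    return {
        "champion_mean": round(champ_mean, 3),
        "challenger_mean": round(chall_mean, 3),
        "delta": round(delta, 3),
        "revalidation_passed": delta >= -max_regression,
    }

# Example: groundedness scores on the same golden prompts for both models
result = compare_models([0.92, 0.88, 0.95, 0.90], [0.91, 0.85, 0.94, 0.89])
print(result["revalidation_passed"])  # True: mean drop stays inside tolerance
```

A real comparison would also look at per-prompt deltas and safety-evaluator outcomes, since an unchanged mean can hide regressions on individual policy-sensitive prompts.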
Automation Available
See Conflict of Interest Testing in FSI-AgentGov-Solutions for automated conflict-of-interest testing of agent recommendations.
Control Description
This control establishes comprehensive testing requirements for AI agents across the development lifecycle:
| Test Type | Description | FSI Application |
|---|---|---|
| Functional Testing | Verify the agent responds correctly to user queries | Accuracy of customer-facing responses |
| Security Testing | Validate data protection and access controls | Data protection |
| Performance Testing | Verify agents respond within 3 seconds (Zone 1-2) or 2 seconds (Zone 3) for standard queries | Customer experience |
| Bias Testing | Detect and mitigate unfair treatment | FINRA 3110 compliance |
| Regression Testing | Confirm changes don't break existing functionality | Change management |
| UAT | Business validation before production deployment | Stakeholder approval |
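The zone-specific latency thresholds above (3 seconds for Zones 1-2, 2 seconds for Zone 3) can be enforced in a performance test harness with a percentile check. This helper is an illustrative sketch, not part of any Microsoft SDK; using the 95th percentile rather than the mean is an assumption that prevents fast responses from masking slow outliers.

```python
# Illustrative latency gate; thresholds mirror the table above.
ZONE_LATENCY_LIMITS_S = {1: 3.0, 2: 3.0, 3: 2.0}

def latency_gate(zone: int, observed_latencies_s: list[float], percentile: float = 0.95) -> bool:
    """Pass if the chosen percentile of observed latencies is within the zone limit."""
    limit = ZONE_LATENCY_LIMITS_S[zone]
    ordered = sorted(observed_latencies_s)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] <= limit

samples = [1.1, 1.4, 1.8, 1.9, 2.4]
print(latency_gate(3, samples))  # False: the slowest response exceeds 2s
print(latency_gate(1, samples))  # True: all samples are within 3s
```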
Evaluation Gates
| Gate | Lifecycle Stage | Required Validations |
|---|---|---|
| Gate 1 | Design → Build | Business justification, risk classification |
| Gate 2 | Build → Evaluate | Prompt injection testing, authentication validation |
| Gate 3 | Evaluate → Deploy | Test pass rate >95%, UAT sign-off |
| Gate 4 | Deploy → Monitor | Compliance approval, rollback plan documented |
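Gate 3's criteria (test pass rate above 95% and UAT sign-off) lend themselves to an automated promotion check. The function below is a hypothetical sketch of such a gate, not an Agent 365 or Power Platform API; the names and threshold default are assumptions.

```python
def gate3_ready(passed: int, total: int, uat_signed_off: bool, threshold: float = 0.95) -> bool:
    """Return True only when the pass rate exceeds the threshold AND UAT is signed off."""
    if total == 0:
        return False  # no test evidence means no promotion
    return (passed / total) > threshold and uat_signed_off

print(gate3_ready(passed=97, total=100, uat_signed_off=True))   # True
print(gate3_ready(passed=97, total=100, uat_signed_off=False))  # False: missing sign-off
```

Note that the check requires both conditions; a high pass rate alone never substitutes for the business sign-off, matching the gate definition above.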
Testing and Validation Planes
Testing for AI agents should be treated as five distinct evidence planes, each serving a different governance purpose:
| Plane | Primary Tooling | What It Demonstrates | FSI Caveat |
|---|---|---|---|
| Developer smoke testing | Copilot Studio Test Pane / developer mode | Topic flow, variable behavior, quick issue isolation | Useful during authoring, but not sufficient as independent validation evidence under Fed SR 11-7 |
| Repeatable batch evaluation | Copilot Studio Agent Evaluation / test sets | Regression detection, version comparison, repeatable scoring across prompts | Should be exported and retained as part of the approval record |
| Quantitative quality and safety validation | Azure AI Evaluation SDK / Azure AI Foundry | External evaluator runs and reproducible metrics outside the maker surface | Stronger fit for independent validation packs and model-risk review |
| Adversarial testing | PyRIT or equivalent | Jailbreak, prompt-injection, misuse, and safety-resilience testing | Cross-reference Control 1.21 for adversarial evidence handling |
| Post-deployment monitoring | Copilot Studio Analytics / Quality dashboards and production monitoring | Drift, failure, abandonment, escalation, satisfaction, and real-world usage trends | Required for ongoing monitoring after release |
For higher-risk use cases, the validation record should go beyond functional pass/fail and capture quality metrics such as groundedness, relevance, coherence, fluency, similarity / AI-assisted text similarity, and F1, plus applicable safety evaluators such as Hate/Unfairness, Self-Harm, Sexual, Violence, Protected Materials, Indirect Attack, and where supported, Code Vulnerability and Ungrounded Attributes. Some evaluator families are preview, tenant-dependent, or cloud-dependent; verify availability at deploy time.
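Of the quality metrics listed, F1 is the simplest to reproduce outside any SDK. The sketch below implements the common token-overlap variant purely as an illustration of what a quantitative validation record can contain; it is not the Azure AI Foundry evaluator itself, and production validation would rely on the platform's built-in evaluators.

```python
from collections import Counter

def token_f1(expected: str, actual: str) -> float:
    """Token-overlap F1 between a golden answer and the agent's answer.

    This is the widely used bag-of-tokens formulation: precision over the
    agent's tokens, recall over the expected tokens, harmonic mean of both.
    """
    exp_tokens = expected.lower().split()
    act_tokens = actual.lower().split()
    overlap = sum((Counter(exp_tokens) & Counter(act_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(act_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)

score = token_f1("wire transfers settle same day", "wire transfers usually settle the same day")
print(round(score, 2))  # 0.83
```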
Independent Validation Boundary (SR 11-7 / OCC 2011-12)
The Test Pane and maker-run regression tests are valuable for development, but they are not a substitute for functionally independent validation when the agent is customer-facing, finance-touching, or otherwise material. For Zone 3 or higher-risk use cases, the validation package should be reviewed or countersigned by a party independent of the agent author, such as a Model Risk Manager, Compliance Officer, or separate validation function.
License Requirements
- Microsoft Copilot Studio — required for in-product Test Pane and Agent Evaluation capabilities.
- Microsoft 365 Copilot — required for declarative-agent and Microsoft 365 extensibility test scenarios.
- Azure AI Foundry / Azure AI Evaluation SDK — required for external quality and safety evaluator runs; consumption-based billing applies.
- Microsoft 365 Agents Toolkit — recommended for local preview / sideload testing of declarative agents in Visual Studio Code.
- Power Platform Pipelines / Managed Environments — recommended to enforce promotion gates, approvals, and deployment governance.
- Microsoft Purview Audit / eDiscovery — recommended where testing and approval evidence must be retained for supervisory, audit, or examination purposes.
- PyRIT — open source; runs on customer-controlled infrastructure and should be validated for supportability before being treated as a primary control component.
Sovereign Cloud Parity (verify at deploy time)
Copilot Studio is broadly available across Commercial, GCC, and GCC High, but feature-level parity can lag for Azure AI Foundry evaluation, advanced evaluator families, and third-party model availability. Do not assume full parity for every evaluation feature or external model endpoint in GCC / GCC High / DoD. Document compensating controls if a required evaluator or model family is not available in the target cloud.
Key Configuration Points
- Maintain separate Dev, Test / UAT, and Production environments with documented promotion controls; for Zone 3, keep the validation evidence package separate from the maker workspace.
- Use the Copilot Studio Test Pane for smoke testing, variable inspection, and conversation debugging during development, but do not treat it as the sole validation record.
- Build and version-control golden datasets and adverse test sets, including expected answers, negative cases, escalation scenarios, hallucination checks, and policy-sensitive prompts.
- Use Copilot Studio Agent Evaluation for structured, repeatable batch testing and cross-version regression analysis:
- create or import test sets
- compare current vs previous versions
- export results for evidence retention
- include real-world scenarios where policy permits
- For higher-risk use cases, supplement in-product evaluation with Azure AI Evaluation SDK / Azure AI Foundry runs that capture quality metrics such as groundedness, relevance, coherence, fluency, similarity / AI-assisted text similarity, and F1, plus applicable safety evaluators.
- Run PyRIT or equivalent adversarial campaigns for Zone 3 and other high-risk agents, especially after material changes to prompts, connectors, knowledge sources, or model endpoints. Cross-reference Control 1.21.
- Use Microsoft 365 Agents Toolkit local preview / sideload testing for declarative agents and other Microsoft 365 Copilot extensibility scenarios before broader rollout.
- Treat model swap, prompt-orchestration change, knowledge-source change, and action/plugin change as mandatory re-validation events. A/B or champion/challenger comparison is recommended for material changes.
- Enforce Power Platform Solution Checker and Pipelines / Managed Environments approvals as promotion gates; do not promote if critical validation checks fail.
- Review Copilot Studio Analytics and other production monitoring signals after deployment for drift, abandonment, escalation, satisfaction, latency, and quality trends.
- Retain test evidence, evaluator outputs, approval records, and monitoring summaries according to the firm’s WSP, books-and-records schedule, and model-risk evidence policy.
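The cross-version regression analysis described above can be evidenced with a simple diff of per-case outcomes from two evaluation runs. The result shape below is a hypothetical sketch, not an Agent Evaluation export schema; a regression here means specifically that previously passing behavior broke.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return test-case IDs that passed in the previous version but fail now.

    Cases missing from the current run are not counted as regressions here;
    the caller should surface coverage gaps separately.
    """
    return sorted(
        case_id
        for case_id, passed_before in previous.items()
        if passed_before and current.get(case_id) is False
    )

prev = {"esc-001": True, "kyc-014": True, "faq-203": False}
curr = {"esc-001": True, "kyc-014": False, "faq-203": True}
print(find_regressions(prev, curr))  # ['kyc-014']
```

A non-empty regression list would block promotion under the gate model above until each case is remediated or a documented exception is approved.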
February 2026 Pipeline Deadline
Microsoft has signaled tightening of the Managed Environments expectation for pipeline target environments — including those used for testing — through 2026. Organizations should review Control 2.1 for current guidance, licensing implications, and recommended proactive actions.
Zone-Specific Requirements
| Zone | Requirement | Rationale |
|---|---|---|
| Zone 1 (Personal) | Developer-side smoke testing in Test Pane, basic security review, and documented acceptable-use boundaries; no unsupervised regulated-customer communications | Lower-risk use still benefits from basic validation but should avoid higher-risk regulated scenarios |
| Zone 2 (Team) | Repeatable batch testing, version-controlled test sets, named approver, and pipeline-based promotion evidence; monthly post-release analytics review | Shared agents require repeatability and named accountability |
| Zone 3 (Enterprise) | Independent validation or countersignature, Azure AI Evaluation / equivalent external metrics, PyRIT or equivalent adversarial testing, supervisory approval where AI-generated communications are in scope, and documented ongoing monitoring with rollback plan | Customer-facing, finance-touching, or examiner-relevant use cases need stronger evidence and segregation of duties |
Roles & Responsibilities
| Role | Responsibility |
|---|---|
| AI Governance Lead | Defines the testing standard, evidence requirements, exception path, and review cadence |
| Agent Owner | Owns business test cases, UAT completion, and production-readiness sign-off for the use case |
| Copilot Studio Agent Author | Performs maker-side smoke testing, fixes defects, and maintains the version-controlled test suite |
| Model Risk Manager | Performs or approves functionally independent validation for Zone 3 or finance-touching agents |
| Power Platform Admin | Maintains test environments, Solution Checker posture, Pipelines / Managed Environments, and promotion-gate configuration |
| Compliance Officer | Verifies regulatory evidence and reviews higher-risk testing results for FINRA / SOX / OCC alignment |
| Purview Records Manager | Confirms retention and defensible preservation of test evidence, approvals, and monitoring artifacts |
| Designated Supervisor / Registered Principal | Provides supervisory sign-off where AI-generated customer communications or broker-dealer communications workflows are in scope under FINRA Rule 3110 |
Related Controls
| Control | Relationship |
|---|---|
| 1.7 - Comprehensive Audit Logging and Compliance | Audit trail for validation activity, approvals, and post-deployment monitoring evidence |
| 1.19 - eDiscovery for Agent Interactions | Preservation and production of testing and supervisory evidence where required |
| 1.21 - Adversarial Input Logging | PyRIT, prompt-injection evidence, and adversarial monitoring alignment |
| 2.1 - Managed Environments | Promotion-gate enforcement, pipeline governance, and environment controls |
| 2.3 - Change Management | Material-change triggers for re-validation and release approval |
| 2.11 - Bias Testing | Fairness and outcome testing within the broader validation program |
| 2.13 - Documentation and Record-Keeping | Evidence retention, testing records, and version history |
| 2.20 - Adversarial Testing | Security-focused red-team procedures complement functional and supervisory testing |
| 4.7 - Microsoft 365 Copilot Data Governance | Data-governance, third-party-model, and sovereign-cloud considerations for Microsoft 365 Copilot scenarios |
Implementation Playbooks
Step-by-Step Implementation
This control has detailed playbooks for implementation, automation, testing, and troubleshooting:
- Portal Walkthrough — Step-by-step portal configuration
- PowerShell Setup — Automation scripts
- Verification & Testing — Test cases and evidence collection
- Troubleshooting — Common issues and resolutions
Verification Criteria
Confirm control effectiveness by verifying:
- A version-controlled test strategy exists with zone-specific acceptance criteria, promotion gates, and material-change triggers for re-validation.
- Separate Dev, Test / UAT, and Production environments exist, and production promotion requires documented approval.
- Maker-side smoke testing in the Copilot Studio Test Pane or equivalent local preview path has been completed and any major defects remediated.
- Repeatable batch evaluation has been run for the current version using exported test sets and documented outcomes.
- For higher-risk use cases, the validation pack includes quantitative quality metrics such as groundedness, relevance, coherence, fluency, similarity / AI-assisted text similarity, F1, or equivalent measures approved by the firm.
- Applicable safety evaluators have been reviewed, including categories such as hate / unfairness, self-harm, sexual, violence, protected material, indirect attack, and other supported risk evaluators.
- PyRIT or equivalent adversarial testing has been completed for Zone 3 or other high-risk agents, and findings are documented with remediation status.
- Power Platform Solution Checker and Pipelines / Managed Environments promotion controls show a successful validation record for the deployed version.
- A functionally independent reviewer has approved or countersigned the validation package for Zone 3 or finance-touching agents.
- Supervisory sign-off is documented when the agent participates in AI-generated customer communications or broker-dealer communications workflows subject to FINRA Rule 3110.
- Post-deployment analytics and quality trends are reviewed on a defined cadence, with thresholds for drift, escalation, or rollback documented.
- Test evidence, approvals, evaluator outputs, and monitoring records are retained according to the firm’s WSP, books-and-records requirements, and model-risk evidence schedule.
Additional Resources
Regulatory Guidance:
- FINRA Regulatory Notice 15-09: Algorithmic Trading Strategies — Precedent for automated system testing and supervision; applies testing principles to AI agents
Microsoft Documentation:
- Microsoft Learn: Test your agent in Copilot Studio
- Microsoft Learn: Generate and import test sets for agent testing
- Microsoft Learn: Analytics overview in Copilot Studio
- Microsoft Learn: Azure AI Evaluation SDK for Python
- Microsoft Learn: Built-in evaluators in Azure AI Foundry
- Microsoft Learn: Test and debug agents in Microsoft 365 Agents Toolkit
- Microsoft Learn: Solution Checker
- Microsoft Learn: Overview of pipelines in Power Platform
FINRA Notice 15-09 Testing Precedent
FINRA Regulatory Notice 15-09 (March 2015) addresses supervision of algorithmic trading strategies and provides a useful precedent for AI agent testing. Key principles include:
- Pre-deployment testing: Test strategies in controlled environments before production
- Ongoing monitoring: Continuously monitor algorithm performance
- Kill switch capability: Ability to halt algorithm operation quickly
- Change testing: Re-test after any modification
These principles apply directly to AI agent governance. Treat AI agents as "algorithms" requiring similar rigor.
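The kill-switch principle from Notice 15-09 translates to agents as a fast, centrally controlled disable path that every request checks before proceeding. The in-memory flag store below is a deliberately simplified stand-in for whatever control plane the firm actually uses; class and method names are illustrative.

```python
class AgentKillSwitch:
    """Minimal illustrative kill switch: every request checks a central flag first.

    In practice the flag would live in a control plane a supervisor can flip
    without a redeployment (e.g. a managed configuration service).
    """

    def __init__(self) -> None:
        self._halted: dict[str, str] = {}  # agent_id -> halt reason

    def halt(self, agent_id: str, reason: str) -> None:
        self._halted[agent_id] = reason

    def resume(self, agent_id: str) -> None:
        self._halted.pop(agent_id, None)

    def guard(self, agent_id: str) -> None:
        """Raise before any agent work is done if the agent has been halted."""
        if agent_id in self._halted:
            raise RuntimeError(f"Agent {agent_id} halted: {self._halted[agent_id]}")

switch = AgentKillSwitch()
switch.halt("claims-agent", "anomalous output rate")
try:
    switch.guard("claims-agent")
except RuntimeError as err:
    print(err)  # Agent claims-agent halted: anomalous output rate
```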
Agent 365 SDK Testing (Preview)
Preview Notice
Microsoft Agent 365 SDK and Agent Essentials are in limited preview (Frontier program). Verify feature availability and GA timelines before implementing production controls dependent on these capabilities. Expect changes before general availability.
Agent 365 SDK provides additional testing capabilities for Blueprint-registered agents:
- CLI-based test execution for CI/CD integration
- Evaluation metrics aligned with Blueprint promotion gates
CLI-Based Test Execution:
Agent 365 SDK includes a test runner CLI for automated testing in CI/CD pipelines:
```bash
# Run the full agent test suite
agent365 test run --manifest ./agent-manifest.json --dataset ./golden-dataset.json

# Run a specific test category
agent365 test run --category security --manifest ./agent-manifest.json

# Export test results for compliance evidence
agent365 test run --output ./test-results.json --format compliance
```
Test Categories Aligned with Blueprint Gates:
| Category | Gate | Validation |
|---|---|---|
| Functional | Build → Deploy | Response accuracy against golden dataset |
| Security | Build → Deploy | Prompt injection resistance, auth validation |
| Performance | Build → Deploy | Response time <2s (Zone 3), <3s (Zone 2) |
| Compliance | Deploy → Production | Data handling, audit logging verification |
CI/CD Integration Example (Azure DevOps):
```yaml
# azure-pipelines.yml excerpt
- task: Bash@3
  displayName: 'Run Agent 365 Tests'
  inputs:
    targetType: 'inline'  # required so the script input is executed inline
    script: |
      agent365 test run \
        --manifest $(Build.SourcesDirectory)/agent-manifest.json \
        --dataset $(Build.SourcesDirectory)/tests/golden-dataset.json \
        --output $(Build.ArtifactStagingDirectory)/test-results.json \
        --threshold 95
    failOnStderr: true
```
Zone-Specific Test Requirements:
| Zone | CLI Test Requirement | Evidence Retention |
|---|---|---|
| Zone 1 | Manual testing acceptable | 90 days |
| Zone 2 | CLI tests in deployment pipeline | 3 years |
| Zone 3 | CLI tests required; blocking on failure | 7–10 years |
- Microsoft Learn: Agent 365 SDK Overview (Preview) - SDK overview including testing capabilities
Updated: April 2026 | Version: v1.4.0 | UI Verification Status: Current