Control 2.5: Testing, Validation, and Quality Assurance

Control ID: 2.5
Pillar: Management
Regulatory Reference: SOX Sections 302/404, FINRA Rule 4511, FINRA Rule 3110, FINRA RN 24-09, GLBA 501(b), SEC 17a-4(b)(4), OCC Bulletin 2026-13 (formerly OCC Bulletin 2011-12) / Fed SR 26-2 (formerly SR 11-7), and NIST AI RMF 1.0 + Generative AI Profile (testing, measurement, and ongoing monitoring expectations)
Last UI Verified: May 2026
Governance Levels: Baseline / Recommended / Regulated


Agent 365 Architecture Update

Agent 365 lifecycle management supports promotion gates that can enforce testing and validation requirements before agents move between environments. See Unified Agent Governance for promotion gate configuration.

Objective

Help validate that Microsoft Copilot Studio agents, Microsoft 365 Copilot extensibility scenarios, and any related agent workloads backed by Microsoft Foundry (formerly Azure AI Foundry) operate within zone-defined quality, safety, and performance thresholds before promotion to production. Validation combines functional testing, regression testing, adversarial testing, and ongoing monitoring. This control also supports the independent-validation and evidence expectations associated with OCC Bulletin 2026-13 (formerly OCC Bulletin 2011-12) / Fed SR 26-2 (formerly SR 11-7) for finance-touching or customer-facing AI use cases.


Why This Matters for FSI

  • SOX Sections 302/404: If an AI assistant or agent influences finance, controllership, disclosure drafting, or reconciliations, management should treat testing evidence as part of the internal-control environment and document the review boundary.
  • FINRA Rule 4511 + SEC 17a-4(b)(4): Test plans, evaluator outputs, promotion approvals, and supervisory review records can become books-and-records evidence for AI-enabled communications and workflows.
  • FINRA RN 24-09 / Rule 3110 (June 2024): Reaffirms that existing supervisory and recordkeeping obligations apply to generative AI; firms should be able to explain how AI outputs were tested, approved, and monitored.
  • FINRA Rule 3110: Supervisory testing should cover not only bias but also accuracy, suitability, escalation paths, and review of AI-generated customer communications where applicable.
  • OCC Bulletin 2026-13 (formerly OCC Bulletin 2011-12) / Fed SR 26-2 (formerly SR 11-7): Higher-risk or finance-touching AI use cases should be subject to functionally independent validation and ongoing monitoring, not only maker-side testing.
  • GLBA 501(b): Security safeguards should be periodically tested for AI systems that handle or expose customer information.

Model Swap and Third-Party Model Revalidation

Treat any foundation-model change, provider change, major prompt-orchestration change, or grounding-source change as a mandatory re-validation event. This includes champion/challenger comparisons such as GPT-4o vs GPT-4.1 or other tenant-approved alternatives where available. Response quality, safety behavior, and latency can shift materially across models and providers. Do not assume third-party-model parity; verify subprocessor implications before relying on the new model in production.
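
One way to make the re-validation trigger concrete is to fingerprint the material parts of an agent release and report which ones changed since the last validated version. The sketch below is illustrative only: the `AgentConfig` type and its field names are hypothetical and do not come from any Microsoft API, and a real implementation would read these values from the firm's own release metadata.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical snapshot of the material parts of an agent release."""
    model: str
    provider: str
    orchestration_prompt: str
    grounding_sources: tuple[str, ...]


def fingerprint(value) -> str:
    """Return a stable SHA-256 digest of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def revalidation_triggers(validated: AgentConfig, candidate: AgentConfig) -> list[str]:
    """List the fields whose fingerprints differ between two releases."""
    before, after = asdict(validated), asdict(candidate)
    return [name for name in before if fingerprint(before[name]) != fingerprint(after[name])]
```

A non-empty result means the candidate should go back through validation before promotion. The digests can also be stored with the approval record to show exactly which configuration was validated. Note that the comparison is order-sensitive for `grounding_sources`, so sort the tuple first if ordering is not meaningful.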


Automation Available

See Conflict of Interest Testing in FSI-AgentGov-Solutions for automated conflict-of-interest testing of agent recommendations.

Control Description

This control establishes comprehensive testing requirements for AI agents across the development lifecycle:

Test Type | Description | FSI Application
Functional Testing | Verify agent responds correctly to user queries | Customer accuracy
Security Testing | Validate data protection and access controls | Data protection
Performance Testing | Verify agents respond within 3 seconds (Zone 1-2) or 2 seconds (Zone 3) for standard queries | Customer experience
Bias Testing | Detect and mitigate unfair treatment | FINRA 3110 compliance
Regression Testing | Confirm changes don't break existing functionality | Change management
UAT | Business validation before production deployment | Stakeholder approval
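
As a minimal sketch of how the Performance Testing limits could be checked automatically, the function below flags standard queries whose measured response time exceeds the zone limit. The limits come from the Performance Testing row above. The function name and the input shape (a mapping of query label to measured seconds) are assumptions made for illustration.

```python
# Response-time limits in seconds, taken from the Performance Testing row above.
LATENCY_LIMIT_SECONDS = {1: 3.0, 2: 3.0, 3: 2.0}


def slow_responses(latencies_by_query: dict[str, float], zone: int) -> dict[str, float]:
    """Return the standard queries whose measured latency exceeds the zone limit."""
    limit = LATENCY_LIMIT_SECONDS[zone]
    return {query: seconds for query, seconds in latencies_by_query.items() if seconds > limit}
```

An empty result means every sampled query met the limit for that zone. How latency is measured (end to end, or model time only) is left to the firm's test standard.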

Evaluation Gates

Gate | Lifecycle Stage | Required Validations
Gate 1 | Design > Build | Business justification, risk classification
Gate 2 | Build > Evaluate | Prompt injection testing, authentication validation
Gate 3 | Evaluate > Deploy | Test pass rate >95%, UAT sign-off
Gate 4 | Deploy > Monitor | Compliance approval, rollback plan documented
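
The Gate 3 criteria above (pass rate above 95% plus UAT sign-off) lend themselves to a simple automated check. The following is a sketch under stated assumptions: `EvaluationRun` is a hypothetical summary record, not an export format from Copilot Studio, and the gate is read as requiring a pass rate strictly greater than 95%.

```python
from dataclasses import dataclass

MIN_PASS_RATE = 0.95  # Gate 3 is read here as "strictly greater than 95%".


@dataclass
class EvaluationRun:
    """Hypothetical summary of one batch evaluation run."""
    passed: int
    total: int
    uat_signed_off: bool


def gate3_blockers(run: EvaluationRun) -> list[str]:
    """Return the reasons the Evaluate > Deploy gate should stay closed."""
    blockers = []
    if run.total == 0:
        blockers.append("no test cases were executed")
    elif run.passed / run.total <= MIN_PASS_RATE:
        blockers.append(f"pass rate {run.passed / run.total:.1%} is not above 95%")
    if not run.uat_signed_off:
        blockers.append("UAT sign-off is missing")
    return blockers
```

Returning the list of blockers, rather than a single boolean, gives the approver a record of why a promotion was refused.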

Testing and Validation Planes

Testing for AI agents should be treated as five distinct evidence planes, each serving a different governance purpose:

Plane | Primary Tooling | What It Demonstrates | FSI Caveat
Developer smoke testing | Copilot Studio Test Pane / developer mode | Topic flow, variable behavior, quick issue isolation | Useful during authoring, but not sufficient as independent validation evidence under Fed SR 26-2 (formerly SR 11-7)
Repeatable batch evaluation | Copilot Studio Agent Evaluation / test sets | Regression detection, version comparison, repeatable scoring across prompts | Should be exported and retained as part of the approval record
Quantitative quality and safety validation | Azure AI Evaluation SDK / Azure AI Foundry | External evaluator runs and reproducible metrics outside the maker surface | Stronger fit for independent validation packs and model-risk review
Adversarial testing | PyRIT or equivalent | Jailbreak, prompt-injection, misuse, and safety-resilience testing | Cross-reference Control 1.21 for adversarial evidence handling
Post-deployment monitoring | Copilot Studio Analytics / Quality dashboards and production monitoring | Drift, failure, abandonment, escalation, satisfaction, and real-world usage trends | Required for ongoing monitoring after release

For higher-risk use cases, the validation record should go beyond functional pass/fail and capture quality metrics such as groundedness, relevance, coherence, fluency, similarity / AI-assisted text similarity, and F1, plus applicable safety evaluators such as Hate/Unfairness, Self-Harm, Sexual, Violence, Protected Materials, Indirect Attack, and where supported, Code Vulnerability and Ungrounded Attributes. Some evaluator families are preview, tenant-dependent, or cloud-dependent; verify availability at deploy time.
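
To show what one of these metrics measures, the sketch below computes a token-overlap F1 score between an agent response and a reference answer: precision is the share of response tokens found in the reference, recall is the share of reference tokens found in the response, and F1 is their harmonic mean. This is a simplified illustration with whitespace tokenization. It is not the Azure AI Evaluation SDK's implementation and may not match its scores exactly.

```python
from collections import Counter


def token_f1(response: str, ground_truth: str) -> float:
    """Token-overlap F1 between a response and a reference answer."""
    response_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not response_tokens or not truth_tokens:
        return 0.0
    # Multiset intersection: a shared token counts only as often as it
    # appears in both texts.
    shared = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(response_tokens)
    recall = shared / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because the score rewards word overlap rather than meaning, it is best read alongside groundedness and relevance rather than on its own.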

Independent Validation Boundary (OCC Bulletin 2026-13 (formerly OCC Bulletin 2011-12) / Fed SR 26-2 (formerly SR 11-7))

The Test Pane and maker-run regression tests are valuable for development, but they are not a substitute for functionally independent validation when the agent is customer-facing, finance-touching, or otherwise material. For Zone 3 or higher-risk use cases, the validation package should be reviewed or countersigned by a party independent of the agent author, such as a Model Risk Manager, Compliance Officer, or separate validation function.

License Requirements

  • Copilot Studio — required for in-product Test Pane and Agent Evaluation capabilities.
  • Microsoft 365 Copilot — required for declarative-agent and Microsoft 365 extensibility test scenarios.
  • Azure AI Foundry / Azure AI Evaluation SDK — required for external quality and safety evaluator runs; consumption-based billing applies.
  • Microsoft 365 Agents Toolkit — recommended for local preview / sideload testing of declarative agents in Visual Studio Code.
  • Power Platform Pipelines / Managed Environments — recommended to enforce promotion gates, approvals, and deployment governance.
  • Microsoft Purview Audit / eDiscovery — recommended where testing and approval evidence must be retained for supervisory, audit, or examination purposes.
  • PyRIT — open source; runs on customer-controlled infrastructure and should be validated for supportability before being treated as a primary control component.

Foundry Observability and Continuous Evaluation

Beyond pre-deployment evaluation runs, Microsoft Foundry observability (learn.microsoft.com/azure/ai-foundry/concepts/observability) provides built-in evaluators and Azure Monitor / Application Insights integration so the same evaluator families used in pre-release validation can run continuously in production against sampled traffic. For Zone 3 agents, configure Foundry tracing to emit evaluator scores into App Insights so trend, drift, and regression evidence flows into the same telemetry plane used by Control 3.3 (Compliance and Regulatory Reporting) for periodic supervisory reporting.
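
Once evaluator scores are flowing into telemetry, a drift check can be as simple as comparing a recent window against the pre-release baseline. The sketch below assumes scores have already been exported as plain lists of numbers on a common scale. The function name and the default `max_drop` threshold are hypothetical, and the sketch does not call Foundry or Application Insights.

```python
from statistics import mean


def drift_alert(baseline_scores: list[float], recent_scores: list[float], max_drop: float = 0.5) -> bool:
    """Flag drift when the recent mean falls more than max_drop below the baseline mean."""
    if not baseline_scores or not recent_scores:
        raise ValueError("both score windows must contain at least one sample")
    return mean(baseline_scores) - mean(recent_scores) > max_drop
```

A mean comparison is the simplest possible signal. Firms with enough production volume may prefer a statistical test or per-evaluator thresholds, documented in the monitoring plan.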


Key Configuration Points

  • Maintain separate Dev, Test / UAT, and Production environments with documented promotion controls; for Zone 3, keep the validation evidence package separate from the maker workspace.
  • Use the Copilot Studio Test Pane for smoke testing, variable inspection, and conversation debugging during development, but do not treat it as the sole validation record.
  • Build and version-control golden datasets and adverse test sets, including expected answers, negative cases, escalation scenarios, hallucination checks, and policy-sensitive prompts.
  • Use Copilot Studio Agent Evaluation for structured, repeatable batch testing and cross-version regression analysis:
    • create or import test sets
    • compare current vs previous versions
    • export results for evidence retention
    • include real-world scenarios where policy permits
  • For higher-risk use cases, supplement in-product evaluation with Azure AI Evaluation SDK / Azure AI Foundry runs that capture quality metrics such as groundedness, relevance, coherence, fluency, similarity / AI-assisted text similarity, and F1, plus applicable safety evaluators.
  • Run PyRIT or equivalent adversarial campaigns for Zone 3 and other high-risk agents, especially after material changes to prompts, connectors, knowledge sources, or model endpoints. Cross-reference Control 1.21.
  • Use Microsoft 365 Agents Toolkit local preview / sideload testing for declarative agents and other Microsoft 365 Copilot extensibility scenarios before broader rollout.
  • Treat model swap, prompt-orchestration change, knowledge-source change, and action/plugin change as mandatory re-validation events. A/B or champion/challenger comparison is recommended for material changes.
  • Enforce Power Platform Solution Checker and Pipelines / Managed Environments approvals as promotion gates; do not promote if critical validation checks fail.
  • Review Copilot Studio Analytics and other production monitoring signals after deployment for drift, abandonment, escalation, satisfaction, latency, and quality trends.
  • Retain test evidence, evaluator outputs, approval records, and monitoring summaries according to the firm’s WSP, books-and-records schedule, and model-risk evidence policy.
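
The cross-version regression analysis described in the list above can be sketched as a comparison of per-test-case outcomes between the previous and current agent versions. The input shape (test-case ID mapped to pass/fail) is an assumption for illustration. Exported Agent Evaluation results would need to be converted into this form first.

```python
def regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return test-case IDs that passed in the previous version but do not pass now.

    A case missing from the current run is treated as a regression so that
    silently dropped tests are surfaced.
    """
    return sorted(
        case for case, passed in previous.items()
        if passed and not current.get(case, False)
    )
```

Treating missing cases as regressions is a deliberate design choice: it prevents a shrinking test set from hiding a quality drop.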

February 2026 Pipeline Deadline

Microsoft has signaled tightening of the Managed Environments expectation for pipeline target environments — including those used for testing — through 2026. Organizations should review Control 2.1 for current guidance, licensing implications, and recommended proactive actions.


Zone-Specific Requirements

Zone | Requirement | Rationale
Zone 1 (Personal) | Developer-side smoke testing in Test Pane, basic security review, and documented acceptable-use boundaries; no unsupervised regulated-customer communications | Lower-risk use still benefits from basic validation but should avoid higher-risk regulated scenarios
Zone 2 (Team) | Repeatable batch testing, version-controlled test sets, named approver, and pipeline-based promotion evidence; monthly post-release analytics review | Shared agents require repeatability and named accountability
Zone 3 (Enterprise) | Independent validation or countersignature, Azure AI Evaluation / equivalent external metrics, PyRIT or equivalent adversarial testing, supervisory approval where AI-generated communications are in scope, and documented ongoing monitoring with rollback plan | Customer-facing, finance-touching, or examiner-relevant use cases need stronger evidence and segregation of duties
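
The zone requirements above can be expressed as a completeness check on a validation evidence pack. In this sketch the evidence item names are hypothetical labels chosen to mirror the table, and each zone is checked only against its own row. Whether higher zones also inherit lower-zone items is a decision for the firm's testing standard.

```python
# Hypothetical evidence labels mirroring the zone requirements above.
REQUIRED_EVIDENCE = {
    1: {"smoke_test", "security_review", "acceptable_use_boundaries"},
    2: {"batch_evaluation", "versioned_test_sets", "named_approver",
        "pipeline_promotion_record", "monthly_analytics_review"},
    3: {"independent_validation", "external_quality_metrics", "adversarial_testing",
        "monitoring_plan", "rollback_plan"},
}


def missing_evidence(zone: int, provided: set[str], ai_communications_in_scope: bool = False) -> set[str]:
    """Return the required evidence items that the pack does not yet contain."""
    required = set(REQUIRED_EVIDENCE[zone])
    # Supervisory approval applies to Zone 3 only where AI-generated
    # communications are in scope.
    if zone == 3 and ai_communications_in_scope:
        required.add("supervisory_approval")
    return required - provided
```

An empty set indicates the pack is complete for that zone. The check says nothing about the quality of each item, which remains a matter for the independent reviewer.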

Roles & Responsibilities

Role | Responsibility
AI Governance Lead | Defines the testing standard, evidence requirements, exception path, and review cadence
Agent Owner | Owns business test cases, UAT completion, and production-readiness sign-off for the use case
Copilot Studio Agent Author | Performs maker-side smoke testing, fixes defects, and maintains the version-controlled test suite
Model Risk Manager | Performs or approves functionally independent validation for Zone 3 or finance-touching agents
Power Platform Admin | Maintains test environments, Solution Checker posture, Pipelines / Managed Environments, and promotion-gate configuration
Compliance Officer | Verifies regulatory evidence and reviews higher-risk testing results for FINRA / SOX / OCC alignment
Purview Records Manager | Confirms retention and defensible preservation of test evidence, approvals, and monitoring artifacts
Designated Supervisor / Registered Principal | Provides supervisory sign-off where AI-generated customer communications or broker-dealer communications workflows are in scope under FINRA Rule 3110

Read-only Analytics Access for Validators and Supervisors

Independent validators (Model Risk Manager) and Designated Supervisors frequently need to review post-deployment quality and drift signals on the Copilot Studio Analytics page without being granted edit rights on the agent. The Copilot Studio Analytics Viewer sharing role helps meet this separation-of-duties requirement by granting read-only access to the agent's Analytics page; pair it with the Bot Transcript Viewer role to also expose conversation transcripts used as testing and supervisory evidence. The agent owner shares the role via the agent's three-dots menu → Share → Analytics viewer. It must be assigned to individual users because security groups are not supported, so maintain a named-individual attestation list to support FINRA 3110 and OCC Bulletin 2026-13 (formerly OCC Bulletin 2011-12) evidence trails. See Share an agent.

Copilot Studio Analytics Retention Windows (May 2026)

Per Microsoft Learn, analytics data is available for up to 180 days; session details and transcript information are available for the last 28 days. Validation evidence, transcript records, and testing artifacts required beyond these windows must be exported and stored in a retention-bound location (Purview retention, Log Analytics, or a zone-appropriate WORM store) before expiry. For Zone 3 evidence packs, schedule automated export before the 28-day session-data window closes.
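
To schedule exports ahead of these windows, the latest safe export date can be derived from the date a record was created. The sketch below uses the 28-day and 180-day windows stated above. The function name and the default seven-day safety margin are assumptions made for illustration and should be tuned to the firm's export cadence.

```python
from datetime import date, timedelta

SESSION_WINDOW_DAYS = 28     # session details and transcripts, per the text above
ANALYTICS_WINDOW_DAYS = 180  # aggregate analytics, per the text above


def export_deadline(recorded_on: date, window_days: int = SESSION_WINDOW_DAYS,
                    safety_margin_days: int = 7) -> date:
    """Latest date to export a record before it ages out, minus a safety margin."""
    if safety_margin_days >= window_days:
        raise ValueError("safety margin must be shorter than the retention window")
    return recorded_on + timedelta(days=window_days - safety_margin_days)
```

The margin absorbs failed export runs and any ambiguity about whether the window counts the creation day itself.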


Control | Relationship
1.7 - Comprehensive Audit Logging and Compliance | Audit trail for validation activity, approvals, and post-deployment monitoring evidence
1.19 - eDiscovery for Agent Interactions | Preservation and production of testing and supervisory evidence where required
1.21 - Adversarial Input Logging | PyRIT, prompt-injection evidence, and adversarial monitoring alignment
2.1 - Managed Environments | Promotion-gate enforcement, pipeline governance, and environment controls
2.3 - Change Management | Material-change triggers for re-validation and release approval
2.11 - Bias Testing | Fairness and outcome testing within the broader validation program
2.13 - Documentation and Record-Keeping | Evidence retention, testing records, and version history
2.20 - Adversarial Testing | Security-focused red-team procedures complement functional and supervisory testing
4.7 - Microsoft 365 Copilot Data Governance | Data-governance and third-party-model considerations for Microsoft 365 Copilot scenarios

Implementation Playbooks

Step-by-Step Implementation

This control has detailed playbooks for implementation, automation, testing, and troubleshooting.


Verification Criteria

Confirm control effectiveness by verifying:

  1. A version-controlled test strategy exists with zone-specific acceptance criteria, promotion gates, and material-change triggers for re-validation.
  2. Separate Dev, Test / UAT, and Production environments exist, and production promotion requires documented approval.
  3. Maker-side smoke testing in the Copilot Studio Test Pane or equivalent local preview path has been completed and any major defects remediated.
  4. Repeatable batch evaluation has been run for the current version using exported test sets and documented outcomes.
  5. For higher-risk use cases, the validation pack includes quantitative quality metrics such as groundedness, relevance, coherence, fluency, similarity / AI-assisted text similarity, F1, or equivalent measures approved by the firm.
  6. Applicable safety evaluators have been reviewed, including categories such as hate / unfairness, self-harm, sexual, violence, protected material, indirect attack, and other supported risk evaluators.
  7. PyRIT or equivalent adversarial testing has been completed for Zone 3 or other high-risk agents, and findings are documented with remediation status.
  8. Power Platform Solution Checker and Pipelines / Managed Environments promotion controls show a successful validation record for the deployed version.
  9. A functionally independent reviewer has approved or countersigned the validation package for Zone 3 or finance-touching agents.
  10. Supervisory sign-off is documented when the agent participates in AI-generated customer communications or broker-dealer communications workflows subject to FINRA Rule 3110.
  11. Post-deployment analytics and quality trends are reviewed on a defined cadence, with thresholds for drift, escalation, or rollback documented.
  12. Test evidence, approvals, evaluator outputs, and monitoring records are retained according to the firm’s WSP, books-and-records requirements, and model-risk evidence schedule.

Additional Resources

FINRA Notice 15-09 Testing Precedent

FINRA Regulatory Notice 15-09 (March 2015) addresses supervision of algorithmic trading strategies and provides a useful precedent for AI agent testing. Key principles include:

  • Pre-deployment testing: Test strategies in controlled environments before production
  • Ongoing monitoring: Continuously monitor algorithm performance
  • Kill switch capability: Ability to halt algorithm operation quickly
  • Change testing: Re-test after any modification

These principles carry over naturally to AI agent governance: treat AI agents as "algorithms" that warrant similar rigor.

Agent 365 SDK Testing (Preview)

Agent 365 GA — May 2026

Microsoft Agent 365 reached general availability on May 1, 2026 (bundled in Microsoft 365 E7 or available as a standalone Microsoft Agent 365 per-user license). Agent Essentials category definitions and SDK feature scope continue to mature post-GA — verify current feature availability against Microsoft Learn before implementing production controls dependent on specific Agent 365 capabilities.

Agent 365 and Copilot Studio provide testing capabilities for Blueprint-registered agents:

  • Automated evaluation via Copilot Studio connector or REST API for CI/CD integration
  • Evaluation metrics aligned with Blueprint promotion gates

Automated agent evaluation in Copilot Studio:

Microsoft Copilot Studio's agent evaluations feature reached general availability in March 2026 (Copilot Studio What's new) and provides test-set-based evaluation with set-level grading, multi-dimensional graders, and importable/exportable test sets.

Two automation surfaces are available:

  • Copilot Studio connector for Power Automate (April 2026 GA) — trigger evaluations from flows; see Automate agent evaluations.
  • REST API (Preview as of April 2026) — integrate evaluation into CI/CD pipelines via the Power Platform API; see Run automated agent evaluations from a REST API. Treat as Preview: do not depend on it for production change-gate enforcement until GA is announced.

Agent 365 SDK testing scope clarification: Agent 365 SDK agents (Teams/Word/Outlook-hosted) are tested locally with Microsoft 365 Agents Playground, installed with either winget install agentsplayground or npm install -g @microsoft/m365agentsplayground. The Agent 365 CLI (a365, installed via dotnet tool install --global Microsoft.Agents.A365.DevTools.Cli) covers deployment and management of Agent 365 apps to Azure, not test-suite execution, and Copilot Studio agents are out of scope for a365. See Test agents using the Microsoft Agent 365 SDK and Agent 365 CLI reference.

Zone-Specific Test Requirements:

Zone | Testing Approach | Evidence Retention
Zone 1 | Manual testing via Agents Playground or Copilot Studio test console acceptable | 90 days
Zone 2 | Evaluation via Copilot Studio connector or REST API in deployment pipeline recommended | 3 years
Zone 3 | Automated evaluation required; blocking on failure; use Copilot Studio evaluation or the Power Platform API (PPAPI) REST API, noting that the REST API is Preview as of April 2026 | 7–10 years
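
The retention periods above can be turned into an earliest-disposal date for each evidence item. This sketch assumes Zone 3 uses the lower bound of the 7–10 year range. The function names are hypothetical, and the actual period should follow the firm's books-and-records schedule.

```python
from datetime import date, timedelta


def add_years(start: date, years: int) -> date:
    """Add calendar years, mapping 29 February to 28 February in non-leap years."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def earliest_disposal_date(created_on: date, zone: int) -> date:
    """Earliest date evidence may be disposed of under the minimum retention above."""
    if zone == 1:
        return created_on + timedelta(days=90)
    if zone == 2:
        return add_years(created_on, 3)
    if zone == 3:
        return add_years(created_on, 7)  # assumed lower bound of the 7-10 year range
    raise ValueError(f"unknown zone: {zone}")
```

The result is a floor, not a disposal instruction: legal holds or longer regulatory periods override it.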

Updated: June 2026 | Version: v1.6.2 | UI Verification Status: Current