
Control 2.5: Testing, Validation, and Quality Assurance

Control ID: 2.5
Pillar: Management
Regulatory Reference: SOX Sections 302/404, FINRA Rule 4511, FINRA Rule 3110, FINRA Regulatory Notice 25-07, GLBA 501(b), SEC 17a-4(b)(4), OCC Bulletin 2011-12 / Fed SR 11-7, and NIST AI RMF 1.0 + Generative AI Profile (testing, measurement, and ongoing monitoring expectations)
Last UI Verified: April 2026
Governance Levels: Baseline / Recommended / Regulated


Agent 365 Architecture Update

Agent 365 lifecycle management supports promotion gates that can enforce testing and validation requirements before agents move between environments. See Unified Agent Governance for promotion gate configuration.

Objective

Help validate that Copilot Studio agents, Microsoft 365 Copilot extensibility scenarios, and any related Azure AI Foundry-backed agent workloads operate within zone-defined quality, safety, and performance thresholds before promotion to production, using functional testing, regression testing, adversarial testing, and ongoing monitoring. This control also supports the independent-validation and evidence expectations of OCC Bulletin 2011-12 / Fed SR 11-7 for finance-touching or customer-facing AI use cases.


Why This Matters for FSI

  • SOX Sections 302/404: If an AI assistant or agent influences finance, controllership, disclosure drafting, or reconciliations, management should treat testing evidence as part of the internal-control environment and document the review boundary.
  • FINRA Rule 4511 + SEC 17a-4(b)(4): Test plans, evaluator outputs, promotion approvals, and supervisory review records can become books-and-records evidence for AI-enabled communications and workflows.
  • FINRA Regulatory Notice 25-07 (March 2025): Reaffirms that existing supervisory and recordkeeping obligations apply to generative AI; firms should be able to explain how AI outputs were tested, approved, and monitored.
  • FINRA Rule 3110: Supervisory testing should cover not only bias but also accuracy, suitability, escalation paths, and review of AI-generated customer communications where applicable.
  • OCC Bulletin 2011-12 / Fed SR 11-7: Higher-risk or finance-touching AI use cases should be subject to functionally independent validation and ongoing monitoring, not only maker-side testing.
  • GLBA 501(b): Security safeguards should be periodically tested for AI systems that handle or expose customer information.

Model Swap and Third-Party Model Revalidation

Treat any foundation-model change, provider change, major prompt-orchestration change, or grounding-source change as a mandatory re-validation event. This includes champion/challenger comparisons such as GPT-4o vs GPT-4.1 or other tenant-approved alternatives where available. Response quality, safety behavior, and latency can shift materially across models and providers. Do not assume third-party-model or sovereign-cloud parity; verify Commercial / GCC / GCC High rollout and any subprocessor implications before relying on the new model in production.


Automation Available

See Conflict of Interest Testing in FSI-AgentGov-Solutions for automated conflict-of-interest testing of agent recommendations.

Control Description

This control establishes comprehensive testing requirements for AI agents across the development lifecycle:

| Test Type | Description | FSI Application |
| --- | --- | --- |
| Functional Testing | Verify agent responds correctly to user queries | Customer accuracy |
| Security Testing | Validate data protection and access controls | Data protection |
| Performance Testing | Verify agents respond within 3 seconds (Zones 1-2) or 2 seconds (Zone 3) for standard queries | Customer experience |
| Bias Testing | Detect and mitigate unfair treatment | FINRA 3110 compliance |
| Regression Testing | Confirm changes don't break existing functionality | Change management |
| UAT | Business validation before production deployment | Stakeholder approval |

Evaluation Gates

| Gate | Lifecycle Stage | Required Validations |
| --- | --- | --- |
| Gate 1 | Design > Build | Business justification, risk classification |
| Gate 2 | Build > Evaluate | Prompt injection testing, authentication validation |
| Gate 3 | Evaluate > Deploy | Test pass rate >95%, UAT sign-off |
| Gate 4 | Deploy > Monitor | Compliance approval, rollback plan documented |
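The gate logic above can be encoded directly in a promotion script so that a release is blocked mechanically rather than by convention. The following is a minimal sketch; the gate names, evidence keys, and threshold handling are illustrative assumptions, not a Microsoft API:

```python
# Hypothetical encoding of the four evaluation gates; names and evidence
# keys are illustrative, not product terminology.
GATES = {
    1: {"stage": "Design > Build",    "validations": ["business_justification", "risk_classification"]},
    2: {"stage": "Build > Evaluate",  "validations": ["prompt_injection_testing", "authentication_validation"]},
    3: {"stage": "Evaluate > Deploy", "validations": ["uat_signoff"]},
    4: {"stage": "Deploy > Monitor",  "validations": ["compliance_approval", "rollback_plan"]},
}

def gate_passes(gate: int, evidence: dict) -> bool:
    """A gate passes only when every required validation is evidenced.
    Gate 3 additionally requires a test pass rate above 95%."""
    required = GATES[gate]["validations"]
    if any(not evidence.get(item) for item in required):
        return False
    if gate == 3 and evidence.get("pass_rate", 0.0) <= 0.95:
        return False
    return True

# Gate 3 clears with UAT sign-off and a 97% pass rate, but not at 90%.
print(gate_passes(3, {"uat_signoff": True, "pass_rate": 0.97}))   # True
print(gate_passes(3, {"uat_signoff": True, "pass_rate": 0.90}))   # False
```

A real implementation would read the evidence dictionary from the pipeline's approval records rather than hard-coded values.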

Testing and Validation Planes

Testing for AI agents should be treated as five distinct evidence planes, each serving a different governance purpose:

| Plane | Primary Tooling | What It Demonstrates | FSI Caveat |
| --- | --- | --- | --- |
| Developer smoke testing | Copilot Studio Test Pane / developer mode | Topic flow, variable behavior, quick issue isolation | Useful during authoring, but not sufficient as independent validation evidence under Fed SR 11-7 |
| Repeatable batch evaluation | Copilot Studio Agent Evaluation / test sets | Regression detection, version comparison, repeatable scoring across prompts | Should be exported and retained as part of the approval record |
| Quantitative quality and safety validation | Azure AI Evaluation SDK / Azure AI Foundry | External evaluator runs and reproducible metrics outside the maker surface | Stronger fit for independent validation packs and model-risk review |
| Adversarial testing | PyRIT or equivalent | Jailbreak, prompt-injection, misuse, and safety-resilience testing | Cross-reference Control 1.21 for adversarial evidence handling |
| Post-deployment monitoring | Copilot Studio Analytics / Quality dashboards and production monitoring | Drift, failure, abandonment, escalation, satisfaction, and real-world usage trends | Required for ongoing monitoring after release |

For higher-risk use cases, the validation record should go beyond functional pass/fail and capture quality metrics such as groundedness, relevance, coherence, fluency, similarity / AI-assisted text similarity, and F1, plus applicable safety evaluators such as Hate/Unfairness, Self-Harm, Sexual, Violence, Protected Materials, Indirect Attack, and where supported, Code Vulnerability and Ungrounded Attributes. Some evaluator families are preview, tenant-dependent, or cloud-dependent; verify availability at deploy time.
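Of the quality metrics listed above, F1 is typically computed as token overlap between the generated answer and a reference answer, in the style of SQuAD-like QA evaluation. A minimal sketch of that calculation (the evaluator services compute this for you; this is only to make the metric concrete):

```python
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer
    (case-insensitive, whitespace tokenization)."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: one extra token in the generated answer lowers precision only.
print(round(token_f1("the fee is 25 dollars", "fee is 25 dollars"), 2))   # 0.89
```

Groundedness, relevance, coherence, and the safety evaluators are AI-assisted judgments rather than closed-form formulas, which is why their evaluator runs should be exported and retained as evidence rather than recomputed ad hoc.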

Independent Validation Boundary (SR 11-7 / OCC 2011-12)

The Test Pane and maker-run regression tests are valuable for development, but they are not a substitute for functionally independent validation when the agent is customer-facing, finance-touching, or otherwise material. For Zone 3 or higher-risk use cases, the validation package should be reviewed or countersigned by a party independent of the agent author, such as a Model Risk Manager, Compliance Officer, or separate validation function.

License Requirements

  • Microsoft Copilot Studio — required for in-product Test Pane and Agent Evaluation capabilities.
  • Microsoft 365 Copilot — required for declarative-agent and Microsoft 365 extensibility test scenarios.
  • Azure AI Foundry / Azure AI Evaluation SDK — required for external quality and safety evaluator runs; consumption-based billing applies.
  • Microsoft 365 Agents Toolkit — recommended for local preview / sideload testing of declarative agents in Visual Studio Code.
  • Power Platform Pipelines / Managed Environments — recommended to enforce promotion gates, approvals, and deployment governance.
  • Microsoft Purview Audit / eDiscovery — recommended where testing and approval evidence must be retained for supervisory, audit, or examination purposes.
  • PyRIT — open source; runs on customer-controlled infrastructure and should be validated for supportability before being treated as a primary control component.

Sovereign Cloud Parity (verify at deploy time)

Copilot Studio is broadly available across Commercial, GCC, and GCC High, but feature-level parity can lag for Azure AI Foundry evaluation, advanced evaluator families, and third-party model availability. Do not assume full parity for every evaluation feature or external model endpoint in GCC / GCC High / DoD. Document compensating controls if a required evaluator or model family is not available in the target cloud.


Key Configuration Points

  • Maintain separate Dev, Test / UAT, and Production environments with documented promotion controls; for Zone 3, keep the validation evidence package separate from the maker workspace.
  • Use the Copilot Studio Test Pane for smoke testing, variable inspection, and conversation debugging during development, but do not treat it as the sole validation record.
  • Build and version-control golden datasets and adverse test sets, including expected answers, negative cases, escalation scenarios, hallucination checks, and policy-sensitive prompts.
  • Use Copilot Studio Agent Evaluation for structured, repeatable batch testing and cross-version regression analysis:
    • create or import test sets
    • compare current vs previous versions
    • export results for evidence retention
    • include real-world scenarios where policy permits
  • For higher-risk use cases, supplement in-product evaluation with Azure AI Evaluation SDK / Azure AI Foundry runs that capture quality metrics such as groundedness, relevance, coherence, fluency, similarity / AI-assisted text similarity, and F1, plus applicable safety evaluators.
  • Run PyRIT or equivalent adversarial campaigns for Zone 3 and other high-risk agents, especially after material changes to prompts, connectors, knowledge sources, or model endpoints. Cross-reference Control 1.21.
  • Use Microsoft 365 Agents Toolkit local preview / sideload testing for declarative agents and other Microsoft 365 Copilot extensibility scenarios before broader rollout.
  • Treat model swap, prompt-orchestration change, knowledge-source change, and action/plugin change as mandatory re-validation events. A/B or champion/challenger comparison is recommended for material changes.
  • Enforce Power Platform Solution Checker and Pipelines / Managed Environments approvals as promotion gates; do not promote if critical validation checks fail.
  • Review Copilot Studio Analytics and other production monitoring signals after deployment for drift, abandonment, escalation, satisfaction, latency, and quality trends.
  • Retain test evidence, evaluator outputs, approval records, and monitoring summaries according to the firm’s WSP, books-and-records schedule, and model-risk evidence policy.
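The golden-dataset and batch-evaluation points above can be approximated in a small harness when an in-product evaluation run is not available. This is a sketch under stated assumptions: the case schema ({"prompt", "expected"}), the agent callable, and the substring pass criterion are all illustrative, not a documented format:

```python
def run_batch(agent, cases, threshold=0.95):
    """Run an agent callable over golden-dataset cases and compute the
    pass rate. A case passes when the expected answer appears in the
    actual answer (a deliberately simple criterion for illustration)."""
    results = []
    for case in cases:
        answer = agent(case["prompt"])
        results.append({
            "prompt": case["prompt"],
            "expected": case["expected"],
            "actual": answer,
            "passed": case["expected"].lower() in answer.lower(),
        })
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return {"pass_rate": pass_rate, "promote": pass_rate >= threshold, "results": results}

# Demo with a stub agent; in practice the callable would invoke the
# deployed agent endpoint and cases would come from the exported test set.
demo_cases = [
    {"prompt": "What is the wire cutoff time?", "expected": "5 PM ET"},
    {"prompt": "Can I share client SSNs by email?", "expected": "No"},
]
stub = lambda p: "5 PM ET" if "wire" in p else "No"
report = run_batch(stub, demo_cases)
print(report["pass_rate"], report["promote"])   # 1.0 True
```

Exporting `report["results"]` alongside the version identifier gives the per-case record that the approval package should retain.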

February 2026 Pipeline Deadline

Microsoft has signaled tightening of the Managed Environments expectation for pipeline target environments — including those used for testing — through 2026. Organizations should review Control 2.1 for current guidance, licensing implications, and recommended proactive actions.


Zone-Specific Requirements

| Zone | Requirement | Rationale |
| --- | --- | --- |
| Zone 1 (Personal) | Developer-side smoke testing in Test Pane, basic security review, and documented acceptable-use boundaries; no unsupervised regulated-customer communications | Lower-risk use still benefits from basic validation but should avoid higher-risk regulated scenarios |
| Zone 2 (Team) | Repeatable batch testing, version-controlled test sets, named approver, and pipeline-based promotion evidence; monthly post-release analytics review | Shared agents require repeatability and named accountability |
| Zone 3 (Enterprise) | Independent validation or countersignature, Azure AI Evaluation / equivalent external metrics, PyRIT or equivalent adversarial testing, supervisory approval where AI-generated communications are in scope, and documented ongoing monitoring with rollback plan | Customer-facing, finance-touching, or examiner-relevant use cases need stronger evidence and segregation of duties |
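The zone requirements above can also be expressed as a machine-checkable evidence matrix so a promotion script can report exactly which artifacts are still missing. The artifact names below are assumptions chosen for illustration, not product or regulatory terminology:

```python
# Illustrative mapping of zone requirements to evidence artifacts;
# artifact names are hypothetical labels, not product terminology.
ZONE_EVIDENCE = {
    1: {"smoke_test", "security_review"},
    2: {"smoke_test", "security_review", "batch_test", "named_approver", "pipeline_record"},
    3: {"smoke_test", "security_review", "batch_test", "named_approver", "pipeline_record",
        "independent_validation", "external_metrics", "adversarial_test", "rollback_plan"},
}

def missing_evidence(zone: int, collected: set) -> set:
    """Return the artifacts still required before promotion in this zone."""
    return ZONE_EVIDENCE[zone] - collected

# A Zone 3 agent with only maker-side testing still owes independent evidence.
print(sorted(missing_evidence(3, {"smoke_test", "batch_test", "security_review"})))
```

Because Zone 3 is a superset of Zone 2, reclassifying an agent upward automatically surfaces the additional evidence it now owes.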

Roles & Responsibilities

| Role | Responsibility |
| --- | --- |
| AI Governance Lead | Defines the testing standard, evidence requirements, exception path, and review cadence |
| Agent Owner | Owns business test cases, UAT completion, and production-readiness sign-off for the use case |
| Copilot Studio Agent Author | Performs maker-side smoke testing, fixes defects, and maintains the version-controlled test suite |
| Model Risk Manager | Performs or approves functionally independent validation for Zone 3 or finance-touching agents |
| Power Platform Admin | Maintains test environments, Solution Checker posture, Pipelines / Managed Environments, and promotion-gate configuration |
| Compliance Officer | Verifies regulatory evidence and reviews higher-risk testing results for FINRA / SOX / OCC alignment |
| Purview Records Manager | Confirms retention and defensible preservation of test evidence, approvals, and monitoring artifacts |
| Designated Supervisor / Registered Principal | Provides supervisory sign-off where AI-generated customer communications or broker-dealer communications workflows are in scope under FINRA Rule 3110 |

Control Relationship

| Control | Relationship |
| --- | --- |
| 1.7 - Comprehensive Audit Logging and Compliance | Audit trail for validation activity, approvals, and post-deployment monitoring evidence |
| 1.19 - eDiscovery for Agent Interactions | Preservation and production of testing and supervisory evidence where required |
| 1.21 - Adversarial Input Logging | PyRIT, prompt-injection evidence, and adversarial monitoring alignment |
| 2.1 - Managed Environments | Promotion-gate enforcement, pipeline governance, and environment controls |
| 2.3 - Change Management | Material-change triggers for re-validation and release approval |
| 2.11 - Bias Testing | Fairness and outcome testing within the broader validation program |
| 2.13 - Documentation and Record-Keeping | Evidence retention, testing records, and version history |
| 2.20 - Adversarial Testing | Security-focused red-team procedures complement functional and supervisory testing |
| 4.7 - Microsoft 365 Copilot Data Governance | Data-governance, third-party-model, and sovereign-cloud considerations for Microsoft 365 Copilot scenarios |

Implementation Playbooks

Step-by-Step Implementation

This control has detailed playbooks for implementation, automation, testing, and troubleshooting:


Verification Criteria

Confirm control effectiveness by verifying:

  1. A version-controlled test strategy exists with zone-specific acceptance criteria, promotion gates, and material-change triggers for re-validation.
  2. Separate Dev, Test / UAT, and Production environments exist, and production promotion requires documented approval.
  3. Maker-side smoke testing in the Copilot Studio Test Pane or equivalent local preview path has been completed and any major defects remediated.
  4. Repeatable batch evaluation has been run for the current version using exported test sets and documented outcomes.
  5. For higher-risk use cases, the validation pack includes quantitative quality metrics such as groundedness, relevance, coherence, fluency, similarity / AI-assisted text similarity, F1, or equivalent measures approved by the firm.
  6. Applicable safety evaluators have been reviewed, including categories such as hate / unfairness, self-harm, sexual, violence, protected material, indirect attack, and other supported risk evaluators.
  7. PyRIT or equivalent adversarial testing has been completed for Zone 3 or other high-risk agents, and findings are documented with remediation status.
  8. Power Platform Solution Checker and Pipelines / Managed Environments promotion controls show a successful validation record for the deployed version.
  9. A functionally independent reviewer has approved or countersigned the validation package for Zone 3 or finance-touching agents.
  10. Supervisory sign-off is documented when the agent participates in AI-generated customer communications or broker-dealer communications workflows subject to FINRA Rule 3110.
  11. Post-deployment analytics and quality trends are reviewed on a defined cadence, with thresholds for drift, escalation, or rollback documented.
  12. Test evidence, approvals, evaluator outputs, and monitoring records are retained according to the firm’s WSP, books-and-records requirements, and model-risk evidence schedule.

Additional Resources


FINRA Notice 15-09 Testing Precedent

FINRA Regulatory Notice 15-09 (March 2015) addresses supervision of algorithmic trading strategies and provides a useful precedent for AI agent testing. Key principles include:

  • Pre-deployment testing: Test strategies in controlled environments before production
  • Ongoing monitoring: Continuously monitor algorithm performance
  • Kill switch capability: Ability to halt algorithm operation quickly
  • Change testing: Re-test after any modification

These principles apply directly to AI agent governance. Treat AI agents as "algorithms" requiring similar rigor.
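The kill-switch principle in particular translates cleanly to agent operations: a supervisor-settable flag consulted before every invocation. The sketch below illustrates the pattern only; it is not a Copilot Studio feature, and the class and method names are hypothetical:

```python
import threading

class KillSwitch:
    """Minimal kill-switch pattern in the spirit of Notice 15-09: a
    supervisor-settable flag checked before each agent invocation.
    Illustrative sketch, not a product capability."""
    def __init__(self):
        self.reason = ""
        self._halted = threading.Event()   # thread-safe flag

    def halt(self, reason: str):
        self.reason = reason
        self._halted.set()

    def guard(self, agent, prompt: str) -> str:
        if self._halted.is_set():
            return f"Agent halted by supervisor: {self.reason}"
        return agent(prompt)

switch = KillSwitch()
echo = lambda p: f"answered: {p}"
print(switch.guard(echo, "balance inquiry"))   # answered: balance inquiry
switch.halt("anomalous output rate")
print(switch.guard(echo, "balance inquiry"))   # Agent halted by supervisor: anomalous output rate
```

In production the flag would live in shared configuration (not process memory) so that halting takes effect across every instance serving the agent.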

Agent 365 SDK Testing (Preview)

Preview Notice

Microsoft Agent 365 SDK and Agent Essentials are in limited preview (Frontier program). Verify feature availability and GA timelines before implementing production controls dependent on these capabilities. Expect changes before general availability.

Agent 365 SDK provides additional testing capabilities for Blueprint-registered agents:

  • CLI-based test execution for CI/CD integration
  • Evaluation metrics aligned with Blueprint promotion gates

CLI-Based Test Execution:

Agent 365 SDK includes a test runner CLI for automated testing in CI/CD pipelines:

# Run agent test suite
agent365 test run --manifest ./agent-manifest.json --dataset ./golden-dataset.json

# Run specific test category
agent365 test run --category security --manifest ./agent-manifest.json

# Export test results for compliance evidence
agent365 test run --output ./test-results.json --format compliance

Test Categories Aligned with Blueprint Gates:

| Category | Gate | Validation |
| --- | --- | --- |
| Functional | Build → Deploy | Response accuracy against golden dataset |
| Security | Build → Deploy | Prompt injection resistance, auth validation |
| Performance | Build → Deploy | Response time <2s (Zone 3), <3s (Zone 2) |
| Compliance | Deploy → Production | Data handling, audit logging verification |

CI/CD Integration Example (Azure DevOps):

# azure-pipelines.yml excerpt
- task: Bash@3
  displayName: 'Run Agent 365 Tests'
  inputs:
    script: |
      agent365 test run \
        --manifest $(Build.SourcesDirectory)/agent-manifest.json \
        --dataset $(Build.SourcesDirectory)/tests/golden-dataset.json \
        --output $(Build.ArtifactStagingDirectory)/test-results.json \
        --threshold 95
    failOnStderr: true
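The pipeline excerpt passes `--threshold 95` to the test runner; a release stage can also enforce the threshold itself against the exported results artifact. The sketch below assumes a simple `{"passed": n, "total": n}` schema for illustration; the actual agent365 export format is not documented here and should be confirmed before relying on it:

```python
import json  # in the pipeline: results = json.load(open("test-results.json"))

def enforce_threshold(results: dict, threshold: float) -> int:
    """Return a CI exit code: 0 when the pass rate meets the threshold,
    1 otherwise. The {"passed": n, "total": n} schema is an assumption
    about the exported results, not a documented agent365 format."""
    pass_rate = 100.0 * results["passed"] / results["total"]
    print(f"pass rate: {pass_rate:.1f}% (threshold {threshold}%)")
    return 0 if pass_rate >= threshold else 1

print(enforce_threshold({"passed": 97, "total": 100}, 95))   # exit code 0
```

Failing the stage on a nonzero exit code keeps the threshold decision in the pipeline's audit trail, which is the evidence the promotion gates above expect.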

Zone-Specific Test Requirements:

| Zone | CLI Test Requirement | Evidence Retention |
| --- | --- | --- |
| Zone 1 | Manual testing acceptable | 90 days |
| Zone 2 | CLI tests in deployment pipeline | 3 years |
| Zone 3 | CLI tests required; blocking on failure | 7–10 years |

Updated: April 2026 | Version: v1.4.0 | UI Verification Status: Current