
Control 2.5: Testing, Validation, and Quality Assurance

Control ID: 2.5
Pillar: Management
Regulatory Reference: SOX 302/404, FINRA 4511, FINRA Rule 3110, GLBA 501(b), OCC 2011-12
Governance Levels: Baseline / Recommended / Regulated
Last UI Verified: January 2026
Last Verified: 2026-02-03


Agent 365 Architecture Update

Agent 365 lifecycle management supports promotion gates that can enforce testing and validation requirements before agents move between environments. See Unified Agent Governance for promotion gate configuration.

Objective

Ensure Copilot Studio agents function correctly, securely, and fairly before production deployment through comprehensive testing: functional validation, security testing, performance benchmarking, bias detection, and user acceptance testing (UAT).


Why This Matters for FSI

  • SOX 302/404: Internal control testing requires documented test procedures and results
  • FINRA 4511: Testing records support books and records requirements for validated agent behavior
  • OCC 2011-12: Model validation requires independent testing of AI behavior
  • FINRA Rule 3110: AI supervision requires bias testing before deployment
  • GLBA 501(b): Security program requires security testing verification

Control Description

This control establishes comprehensive testing requirements for AI agents across the development lifecycle:

| Test Type | Description | FSI Application |
| --- | --- | --- |
| Functional Testing | Verify agent responds correctly to user queries | Customer accuracy |
| Security Testing | Validate data protection and access controls | Data protection |
| Performance Testing | Verify agents respond within 3 seconds (Zone 1-2) or 2 seconds (Zone 3) for standard queries | Customer experience |
| Bias Testing | Detect and mitigate unfair treatment | FINRA 3110 compliance |
| Regression Testing | Confirm changes don't break existing functionality | Change management |
| UAT | Business validation before production deployment | Stakeholder approval |
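The zone-specific response-time targets in the table above reduce to a simple lookup. A minimal sketch, assuming a per-zone latency budget in seconds (the thresholds come from the table; the function and constant names are illustrative, not a Copilot Studio API):

```python
# Zone 1-2 standard queries must answer within 3 seconds; Zone 3 within 2.
# Values come from the performance testing row above; names are hypothetical.
LATENCY_BUDGET_SECONDS = {1: 3.0, 2: 3.0, 3: 2.0}

def within_latency_budget(zone: int, response_seconds: float) -> bool:
    """Return True if a standard-query response meets its zone's target."""
    return response_seconds <= LATENCY_BUDGET_SECONDS[zone]

print(within_latency_budget(3, 1.8))  # Zone 3 response under 2s -> True
print(within_latency_budget(3, 2.4))  # exceeds the 2s Zone 3 budget -> False
```

A check like this would typically run against measured p95 latency from the performance test suite rather than a single observation.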

Evaluation Gates

| Gate | Lifecycle Stage | Required Validations |
| --- | --- | --- |
| Gate 1 | Design → Build | Business justification, risk classification |
| Gate 2 | Build → Evaluate | Prompt injection testing, authentication validation |
| Gate 3 | Evaluate → Deploy | Test pass rate >95%, UAT sign-off |
| Gate 4 | Deploy → Monitor | Compliance approval, rollback plan documented |
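Gate 3 combines a quantitative bar (pass rate above 95%) with a procedural one (UAT sign-off). A minimal sketch of that gate check, with hypothetical data shapes (a list of per-test pass/fail booleans and a sign-off flag):

```python
# Illustrative Gate 3 (Evaluate -> Deploy) check: pass rate must exceed 95%
# AND UAT sign-off must be recorded. Not an Agent 365 API; names are made up.
def gate3_ready(results: list[bool], uat_signed_off: bool) -> bool:
    if not results or not uat_signed_off:
        return False  # no evidence, or no business sign-off: block promotion
    pass_rate = sum(results) / len(results)
    return pass_rate > 0.95

results = [True] * 97 + [False] * 3   # 97% pass rate
print(gate3_ready(results, uat_signed_off=True))   # True
print(gate3_ready(results, uat_signed_off=False))  # blocked without UAT
```

Note that exactly 95% fails the gate, since the requirement is strictly greater than 95%.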

Key Configuration Points

  • Establish dedicated test environments with production-equivalent configuration
  • Create golden datasets with known-correct question-answer pairs (Zone 2: 50+, Zone 3: 150+ entries)
  • Configure Copilot Studio Agent Evaluation metrics (groundedness, relevance, coherence)
  • Use Agent Evaluation with Test Sets (Preview) for structured multi-version comparison testing:
    • Create test set CSV templates with expected question-answer pairs
    • Run evaluations across agent versions with automated scoring
    • Review activity maps and thumbs-up/down feedback analytics
    • Track evaluation results over time for regression detection
  • Consider the Copilot Studio Kit (GA, October 2025) from Power CAT for additional testing tools including AI content validation and KPI analysis
  • Implement automated regression testing in CI/CD pipeline
  • Track hallucination rate with zone-appropriate thresholds (Zone 2: <5%, Zone 3: <2%)
  • Require UAT sign-off from business owners before production deployment
  • Retain test evidence per regulatory requirements (Zone 2: 3 years, Zone 3: 7–10 years)
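Two of the configuration points above are numeric zone policies: golden-dataset minimum size (Zone 2: 50+, Zone 3: 150+) and the hallucination-rate ceiling (Zone 2: <5%, Zone 3: <2%). A minimal sketch of enforcing them, assuming the dataset is loaded as question-answer pairs (the function names and data shapes are illustrative):

```python
# Zone policy values come from the configuration points above;
# everything else here is a hypothetical sketch, not a product API.
GOLDEN_DATASET_MIN = {2: 50, 3: 150}    # minimum question-answer pairs
HALLUCINATION_MAX = {2: 0.05, 3: 0.02}  # rate must stay strictly below

def dataset_meets_minimum(zone: int, qa_pairs: list[tuple[str, str]]) -> bool:
    """True if the golden dataset is large enough for the zone."""
    return len(qa_pairs) >= GOLDEN_DATASET_MIN[zone]

def hallucination_within_threshold(zone: int, rate: float) -> bool:
    """True if the measured hallucination rate is under the zone ceiling."""
    return rate < HALLUCINATION_MAX[zone]

sample = [("What is the wire cutoff time?", "5 PM ET")] * 60
print(dataset_meets_minimum(2, sample))          # 60 >= 50 -> True
print(dataset_meets_minimum(3, sample))          # 60 < 150 -> False
print(hallucination_within_threshold(3, 0.015))  # 1.5% < 2% -> True
```

Checks like these belong in the CI/CD pipeline alongside the automated regression suite, so an undersized dataset or threshold breach blocks deployment rather than surfacing in an audit.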

February 2026 Pipeline Deadline

Pipeline target environments used for testing will be automatically enabled as Managed Environments starting February 2026. Organizations should review Control 2.1 for deadline details and licensing requirements.


Zone-Specific Requirements

| Zone | Requirement | Rationale |
| --- | --- | --- |
| Zone 1 (Personal) | Basic functional testing; security scan only; informal documentation | Low risk, personal use |
| Zone 2 (Team) | Full test suite required; golden dataset recommended; UAT sign-off required | Shared agents require higher standards |
| Zone 3 (Enterprise) | Comprehensive testing with independent review; golden dataset mandatory; Compliance approval | Customer-facing, regulatory examination focus |

Roles & Responsibilities

| Role | Responsibility |
| --- | --- |
| AI Governance Lead | Define testing standards, approve test plans |
| QA Lead | Execute test suites, manage golden datasets |
| Business Owner | Complete UAT, sign off on production readiness |
| Compliance Officer | Approve Zone 3 deployment, verify regulatory tests |

Control Relationships

| Control | Relationship |
| --- | --- |
| 2.2 - Environment Groups | Test environment classification |
| 2.3 - Change Management | Pre-deployment testing gates |
| 2.11 - Bias Testing | Fairness assessment |
| 2.18 - Conflict of Interest Testing | COI testing for agent recommendations (COI Testing Framework) |
| 2.20 - Adversarial Testing | Security testing |

Implementation Playbooks

Step-by-Step Implementation

This control has detailed playbooks for implementation, automation, testing, and troubleshooting:


Verification Criteria

Confirm control effectiveness by verifying:

  1. Test strategy documented with zone-appropriate requirements
  2. Test environments configured and matching production settings
  3. Golden dataset created with minimum entries for zone level
  4. Performance testing confirms response times meet targets: <3s for Zone 1-2, <2s for Zone 3
  5. Automated regression tests integrated in deployment pipeline
  6. UAT sign-off obtained from business owners
  7. Test evidence retained per regulatory retention policy

Additional Resources

Regulatory Guidance:

Microsoft Documentation:

FINRA Notice 15-09 Testing Precedent

FINRA Regulatory Notice 15-09 (March 2015) addresses supervision of algorithmic trading strategies and provides a useful precedent for AI agent testing. Key principles include:

  • Pre-deployment testing: Test strategies in controlled environments before production
  • Ongoing monitoring: Continuously monitor algorithm performance
  • Kill switch capability: Ability to halt algorithm operation quickly
  • Change testing: Re-test after any modification
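The kill-switch principle above translates naturally to agent governance: a halt flag checked before every response, so operation can be stopped quickly without redeploying. A minimal sketch, with entirely hypothetical names (this is not a Copilot Studio or Agent 365 mechanism):

```python
# Illustrative kill-switch wrapper for the FINRA 15-09 principle above.
# threading.Event makes the halt flag safe to set from a monitoring thread.
import threading

class AgentKillSwitch:
    def __init__(self) -> None:
        self._halted = threading.Event()
        self.reason: str | None = None

    def halt(self, reason: str) -> None:
        """Engage the kill switch; all subsequent responses are refused."""
        self.reason = reason
        self._halted.set()

    def guard(self, respond, query: str) -> str:
        """Route a query through the agent unless the switch is engaged."""
        if self._halted.is_set():
            return "Agent halted pending review."
        return respond(query)

switch = AgentKillSwitch()
print(switch.guard(lambda q: f"Answer to: {q}", "balance inquiry"))
switch.halt("anomalous output detected")
print(switch.guard(lambda q: f"Answer to: {q}", "balance inquiry"))
```

In practice the halt would be wired to the monitoring alerts described under "Ongoing monitoring", and the recorded reason retained as evidence for the incident review.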

These principles apply directly to AI agent governance. Treat AI agents as "algorithms" requiring similar rigor.

Agent 365 SDK Testing (Preview)

Preview Notice

Microsoft Agent 365 SDK and Agent Essentials are in limited preview (Frontier program). Verify feature availability and GA timelines before implementing production controls dependent on these capabilities. Expect changes before general availability.

Agent 365 SDK provides additional testing capabilities for Blueprint-registered agents:

  • CLI-based test execution for CI/CD integration
  • Evaluation metrics aligned with Blueprint promotion gates

CLI-Based Test Execution:

Agent 365 SDK includes a test runner CLI for automated testing in CI/CD pipelines:

# Run agent test suite
agent365 test run --manifest ./agent-manifest.json --dataset ./golden-dataset.json

# Run specific test category
agent365 test run --category security --manifest ./agent-manifest.json

# Export test results for compliance evidence
agent365 test run --output ./test-results.json --format compliance

Test Categories Aligned with Blueprint Gates:

| Category | Gate | Validation |
| --- | --- | --- |
| Functional | Build → Deploy | Response accuracy against golden dataset |
| Security | Build → Deploy | Prompt injection resistance, auth validation |
| Performance | Build → Deploy | Response time <2s (Zone 3), <3s (Zone 2) |
| Compliance | Deploy → Production | Data handling, audit logging verification |

CI/CD Integration Example (Azure DevOps):

# azure-pipelines.yml excerpt
- task: Bash@3
  displayName: 'Run Agent 365 Tests'
  inputs:
    targetType: 'inline'   # required for inline scripts; Bash@3 defaults to filePath
    script: |
      agent365 test run \
        --manifest $(Build.SourcesDirectory)/agent-manifest.json \
        --dataset $(Build.SourcesDirectory)/tests/golden-dataset.json \
        --output $(Build.ArtifactStagingDirectory)/test-results.json \
        --threshold 95
    failOnStderr: true     # failOnStderr is a Bash@3 input, so it belongs under inputs

Zone-Specific Test Requirements:

| Zone | CLI Test Requirement | Evidence Retention |
| --- | --- | --- |
| Zone 1 | Manual testing acceptable | 90 days |
| Zone 2 | CLI tests in deployment pipeline | 3 years |
| Zone 3 | CLI tests required; blocking on failure | 7–10 years |
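The retention periods above can be turned into concrete expiry dates when test evidence is filed. A minimal sketch, assuming the conservative 10-year upper bound for Zone 3 and approximating years as 365 days (exact retention arithmetic should follow your records-management policy):

```python
# Illustrative evidence-retention calculator for the zone table above.
# Zone 1: 90 days, Zone 2: 3 years, Zone 3: 7-10 years (10 used here as
# the conservative choice). Names and shapes are hypothetical.
from datetime import date, timedelta

RETENTION = {
    1: timedelta(days=90),
    2: timedelta(days=3 * 365),
    3: timedelta(days=10 * 365),
}

def retention_expiry(zone: int, test_run_date: date) -> date:
    """Earliest date the test evidence may be disposed of."""
    return test_run_date + RETENTION[zone]

print(retention_expiry(1, date(2026, 1, 15)))  # 90 days later: 2026-04-15
```

Tagging exported test results (for example, the `--format compliance` output above) with a computed expiry date makes retention sweeps auditable.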

Updated: January 2026 | Version: v1.2 | UI Verification Status: Current