Control 1.27: AI Agent Content Moderation Enforcement
Control ID: 1.27 | Pillar: Security | Regulatory Reference: FINRA 3110/25-07, SEC 17a-3/4, GLBA 501(b) | Governance Levels: Baseline / Recommended / Regulated | Last Verified: 2026-02-12 | Last UI Verified: February 2026
Runtime Content Safety at Agent and Topic Level
This control governs the content moderation levels (Low, Medium, High) configurable in Copilot Studio at both the agent level and the topic level within individual agents, and establishes zone-based moderation requirements to help prevent harmful outputs, regulatory violations, and inappropriate customer-facing responses. It complements Control 1.8 (Runtime Protection and External Threat Detection), which addresses broader runtime security threats, and Control 1.21 (Adversarial Input Logging), which captures malicious prompt attempts for analysis.
Objective
Enforce appropriate content moderation levels at both agent and topic levels based on governance zone, usage context, and audience (internal vs. customer-facing) to support compliance with FSI regulatory requirements and reduce the risk of harmful or inappropriate AI-generated responses.
Why This Matters for FSI
- FINRA 3110/25-07: Content moderation helps meet supervisory obligations by filtering harmful outputs and inappropriate communications that could create regulatory exposure or customer harm
- SEC 17a-3/4: Moderation enforcement supports recordkeeping compliance by helping prevent AI agents from generating responses that could trigger disclosure violations or mislead investors
- GLBA 501(b): Content moderation aids in meeting information safeguards requirements by blocking outputs that could inadvertently expose customer data or violate privacy protections
Control Description
Content moderation in Copilot Studio provides three runtime safety levels—Low, Medium, and High—that control how strictly the platform filters AI-generated responses for harmful, inappropriate, or non-compliant content. These moderation levels are configurable at two enforcement points: the agent level (default moderation for all topics) and the topic level (overrides for specific conversation paths). Topic-level settings take precedence at runtime, enabling fine-grained control over content safety for customer-facing workflows, regulated interactions, and sensitive use cases.
This control establishes a risk-based moderation enforcement model that escalates by governance zone:
- Agent-level default moderation — Each agent has a baseline moderation level (Low, Medium, or High) that applies to all topics unless overridden; zone classification determines the minimum permissible default
- Topic-level moderation overrides — Individual topics within an agent may have stricter or more permissive moderation levels based on the topic's role, audience, and sensitivity; overrides to less restrictive levels require approval in Zone 2+
- Custom safety messages — When content is blocked by moderation filters, agents display custom messages to users; Zone 3 requires tailored safety messages that align with brand and regulatory voice
- Audit and logging integration — Content moderation events (prompts blocked, moderation level changes) are logged for compliance review; Zone 3 requires integration with Microsoft Purview for audit trail retention
In practice, Zone 1 environments permit Medium moderation as the minimum default, with agent authors empowered to configure topic-level overrides without approval. Zone 2 environments require High moderation by default, with topic-level overrides to Medium or Low requiring documented approval. Zone 3 environments mandate High moderation at both agent and topic levels, prohibit topic-level downgrades to Low, require custom safety messages, and integrate moderation logs with Purview for regulatory audit trails.
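The zone escalation rules above can be expressed as a simple policy check. The following is a minimal sketch: the level ordering mirrors this document, but the function name, data shapes, and zone policy table are illustrative assumptions, not a Copilot Studio API.

```python
# Hypothetical compliance check for the zone-based moderation policy
# described above; not Copilot Studio code, data shapes are illustrative.

LEVEL_RANK = {"Low": 0, "Medium": 1, "High": 2}

# Minimum agent-level default per zone, whether topic-level Low is allowed,
# and whether downgrades below the agent default need documented approval.
ZONE_POLICY = {
    1: {"min_default": "Medium", "low_allowed": True,  "approval_needed": False},
    2: {"min_default": "High",   "low_allowed": True,  "approval_needed": True},
    3: {"min_default": "High",   "low_allowed": False, "approval_needed": True},
}

def check_agent(zone: int, agent_default: str, topic_overrides: dict[str, str],
                approvals: set[str]) -> list[str]:
    """Return a list of policy violations for one agent."""
    policy = ZONE_POLICY[zone]
    violations = []
    if LEVEL_RANK[agent_default] < LEVEL_RANK[policy["min_default"]]:
        violations.append(f"agent default {agent_default} below zone {zone} minimum")
    for topic, level in topic_overrides.items():
        if level == "Low" and not policy["low_allowed"]:
            violations.append(f"topic '{topic}': Low is prohibited in zone {zone}")
        elif (LEVEL_RANK[level] < LEVEL_RANK[agent_default]
              and policy["approval_needed"] and topic not in approvals):
            violations.append(f"topic '{topic}': downgrade to {level} lacks approval")
    return violations
```

For example, a Zone 3 agent with a Low topic override fails the check, while a Zone 2 agent whose Medium downgrade carries a documented approval passes.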
Capability Comparison by Zone
| Capability | Zone 1 (Personal) | Zone 2 (Team) | Zone 3 (Enterprise) |
|---|---|---|---|
| Agent-level default moderation | Medium minimum | High default | High mandatory |
| Topic-level override to Medium | Allowed | Allowed with approval | Allowed with documented justification |
| Topic-level override to Low | Allowed | Requires documented approval | Prohibited |
| Custom safety messages | Recommended | Recommended | Required |
| Moderation change approval | Not required | Documented approval | Formal review + approval |
| Purview audit integration | Not required | Recommended | Required |
| Moderation testing before deployment | Recommended | Required | Required with adversarial testing |
| Moderation inventory tracking | Recommended | Required | Required |
| Review cadence | Quarterly | Monthly | Weekly |
Key Configuration Points
- Agent-level moderation default — Set the baseline content moderation level (Low, Medium, High) for each agent in Copilot Studio prompt builder settings; default should align with the agent's governance zone and audience
- Topic-level moderation overrides — Configure topic-specific moderation levels for conversation paths requiring stricter or more permissive filtering; topic overrides take precedence at runtime
- Custom safety messages — Define user-facing messages displayed when content is blocked by moderation filters; Zone 3 requires messages aligned with organizational voice and regulatory context
- Moderation approval workflow — Establish an approval process for topic-level overrides that reduce moderation strictness below the agent-level default (Zone 2+)
- Audit log configuration — Enable and retain moderation event logs in Microsoft Purview for compliance review; capture blocked prompts, moderation level changes, and override approvals (Zone 3)
- Adversarial testing integration — Test agent responses at each moderation level using adversarial prompts before deployment to verify moderation effectiveness (Zone 2+)
- Moderation inventory and review — Maintain an inventory of all agents and topics with moderation levels documented, including zone classification, approval status, and last review date
Zone-Specific Requirements
| Zone | Requirement | Rationale |
|---|---|---|
| Zone 1 (Personal) | Medium moderation minimum; agent authors may configure topic overrides without approval; quarterly review of moderation settings | Personal productivity agents present lower compliance risk; Medium moderation balances safety with user autonomy for experimentation |
| Zone 2 (Team) | High moderation default; topic-level overrides to Medium or Low require documented approval; monthly review of moderation settings | Shared team agents may be used in workflows involving customer data or internal communications; High default reduces risk of inappropriate outputs |
| Zone 3 (Enterprise) | High moderation mandatory at agent level; topic-level overrides to Medium allowed with documented justification; topic-level downgrades to Low prohibited; custom safety messages required; Purview audit integration; weekly review | Customer-facing and regulated agents require maximum content safety; blocking downgrades to Low helps prevent inappropriate responses in high-stakes interactions |
Roles & Responsibilities
| Role | Responsibility |
|---|---|
| Power Platform Admin | Configure environment-level moderation policies; review and approve moderation override requests for Zone 2+ agents; monitor moderation compliance across environments |
| Copilot Studio Agent Author | Set agent-level default moderation; configure topic-level moderation overrides per approved settings; define custom safety messages; document justification for moderation changes |
| Purview Compliance Admin | Configure audit log retention for moderation events; integrate moderation logs with Purview for compliance reporting; review blocked content trends for policy updates |
| Security Operations | Monitor moderation alerts and trends; investigate anomalous moderation bypass attempts; triage and escalate confirmed incidents involving harmful outputs |
Related Controls
| Control | Relationship |
|---|---|
| 1.8 - Runtime Protection and External Threat Detection | Complementary runtime security layer addressing broader threat detection; content moderation focuses on output filtering |
| 2.5 - Testing, Validation, and Quality Assurance | Testing framework that includes moderation effectiveness validation before agent deployment |
| 2.20 - Adversarial Testing and Red Team Framework | Red team testing methodology for evaluating moderation bypass attempts and filter effectiveness |
| 1.7 - Audit Logging and Monitoring | Audit log infrastructure for capturing and retaining moderation events in Microsoft Purview (Zone 3 mandatory) |
| 1.21 - Adversarial Input Logging | Logging framework capturing malicious prompts that trigger moderation filters |
| 3.3 - Compliance and Regulatory Reporting | Reporting integration for tracking moderation enforcement status and blocked content trends |
| 2.22 - Inactivity Timeout Enforcement | Agent-level timeout configuration should be reviewed alongside moderation settings during quarterly reviews |
Implementation Playbooks
Step-by-Step Implementation
This control has detailed playbooks for implementation, automation, testing, and troubleshooting:
- Portal Walkthrough — Step-by-step portal configuration
- PowerShell Setup — Automation scripts
- Verification & Testing — Test cases and evidence collection
- Troubleshooting — Common issues and resolutions
Automated Validation: Content Moderation Governance Monitor
The Content Moderation Governance Monitor solution automates per-agent content moderation level validation against zone-specific governance requirements. It provides daily drift detection, Teams adaptive card alerts with severity classification, Dataverse tracking for moderation baselines and violations, and SHA-256 evidence export for audit readiness. Deploy this solution to continuously monitor moderation compliance across all environments.
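The monitor's two core ideas, baseline comparison for drift detection and SHA-256 hashing for evidence integrity, can be illustrated with a short sketch. This is not the solution's actual code; the baseline format and function names are assumptions.

```python
# Illustrative sketch of drift detection against a stored moderation
# baseline, plus a SHA-256 digest over a canonical JSON rendering of the
# evidence payload. Hypothetical data shapes; not the monitor's real code.
import hashlib
import json

def detect_drift(baseline: dict[str, str], current: dict[str, str]) -> dict[str, tuple]:
    """Map agent name -> (expected, actual) for every drifted or missing entry."""
    drift = {}
    for agent, expected in baseline.items():
        actual = current.get(agent)
        if actual != expected:
            drift[agent] = (expected, actual)
    return drift

def evidence_digest(payload: dict) -> str:
    """SHA-256 over canonical JSON, so re-exported evidence hashes identically."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Canonicalizing the JSON (sorted keys, fixed separators) before hashing matters: it keeps the digest stable across export runs, which is what makes the hash usable as audit evidence.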
Repository: content-moderation-monitor | Details: Solutions Index
Verification Criteria
Confirm control effectiveness by verifying:
- Agent-level default moderation is set to the correct level (Medium minimum for Zone 1, High for Zone 2+) for each agent based on its governance zone classification
- Topic-level moderation overrides are documented with appropriate approval (Zone 2+) and no prohibited downgrades to Low exist in Zone 3 agents
- Custom safety messages are configured for all Zone 3 agents and align with organizational voice and regulatory requirements
- Moderation event logs are being captured and retained in Microsoft Purview for Zone 3 agents
- An up-to-date inventory of agents and topics with moderation levels exists, including zone classification, approval status, and last review date
- Adversarial testing has been conducted for Zone 2+ agents to validate moderation effectiveness before deployment
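The criteria above lend themselves to a per-agent roll-up during compliance review. The sketch below uses illustrative shorthand keys for the bullet list; the names and function are assumptions, not an official schema.

```python
# Hypothetical roll-up of the verification criteria above for one agent;
# criterion keys are illustrative shorthand for the bullets, not a schema.
CRITERIA = [
    "default_level_meets_zone_minimum",
    "overrides_approved_no_low_in_zone3",
    "custom_safety_messages_configured",   # Zone 3
    "purview_logging_enabled",             # Zone 3
    "inventory_current",
    "adversarial_testing_completed",       # Zone 2+
]

def evaluate(results: dict[str, bool]) -> list[str]:
    """Return the criteria that failed or were not assessed."""
    return [c for c in CRITERIA if not results.get(c, False)]
```

Treating an unassessed criterion as a failure keeps the review conservative: an agent is only compliant when every criterion was checked and passed.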
Additional Resources
- Microsoft Learn: Harmful content protection settings for M365 Copilot Chat
- Microsoft Learn: Azure AI Content Safety
- Microsoft Learn: Responsible AI for Copilot Studio
- Microsoft Learn: Microsoft Purview audit logging
FSI Scope Note
Agent and Topic Dual Control: This control targets both agent-level and topic-level content moderation settings within Copilot Studio. Organizations should implement content moderation enforcement when:
- Copilot Studio agents are being deployed for customer-facing interactions or regulated workflows
- Agents generate responses that could create regulatory exposure, customer harm, or reputational risk
- Regulatory requirements mandate controls over AI-generated communications (e.g., FINRA supervision, SEC disclosure rules)
For organizations in early-stage agent development, this control should be implemented during the design phase to establish baseline moderation levels before agents are published. The generally available (GA) release of moderation levels (January 31, 2026, MC1217615) provides enhanced configurability compared to prior versions.
Complement with Adversarial Testing
Content moderation provides output filtering, but it is not a substitute for comprehensive security controls. Pair this control with Control 2.20 (Adversarial Testing and Red Team Framework) to proactively test moderation bypass attempts and jailbreak techniques before agents reach production. This layered approach addresses both runtime filtering and pre-deployment validation.
Updated: February 2026 | Version: v1.3 | UI Verification Status: Current