Skip to content

Control 1.27: AI Agent Content Moderation Enforcement

Control ID: 1.27
Pillar: Security
Regulatory Reference: FINRA Rule 3110 / RN 24-09, FINRA 2210, SEC 17a-3/4, SOX 302/404, GLBA 501(b), OCC Bulletin 2026-13 (formerly OCC 2011-12), Fed SR 26-2 (formerly SR 11-7), NIST AI RMF / AI 600-1
Last UI Verified: May 2026
Governance Levels: Baseline / Recommended / Regulated


Runtime Content Safety Configured Per AI Prompt

Microsoft Copilot Studio content moderation is configured per AI prompt in the prompt builder. The per-prompt moderation slider (applicable to managed GPT models) is a 3-position scale: Low / Moderate / High ("Medium" remains widely used in community references and prior UI labels — treat it as synonymous with Moderate on this surface). In practice this surfaces at two enforcement points: (1) the agent's generative answers / "Conversational boosting" system topic, which acts as the agent-effective default for unstructured Q&A, and (2) any custom topic that contains a Generative answers node, where the topic-level setting takes precedence at runtime for that conversation path. This control establishes zone-based moderation requirements to help reduce harmful outputs, regulatory exposure from AI-generated communications, and inappropriate customer-facing responses. It complements Control 1.8 (Runtime Protection and External Threat Detection), Control 1.21 (Adversarial Input Logging), and Control 2.20 (Adversarial Testing and Red Team Framework). Per Microsoft Learn, this capability is generally available as of February 11, 2026, originally announced via Message Center post MC1217615 (announced GA target of January 31, 2026).

Two distinct moderation UI surfaces — do not confuse them

Copilot Studio exposes two separate content-moderation surfaces with different scales:

  • Per-prompt slider (this control's scope) — in the prompt builder on the system Conversational boosting topic and on each custom topic's Generative answers node. 3 positions: Low / Moderate / High. Available on managed GPT models.
  • Agent-level slider (Control 1.8 scope) — at Agent → Settings → Generative AI → Content moderation, applying tenant-wide harm-category classifier strictness to the agent's responses. 5 positions: Lowest / Low / Moderate / High / Highest (default: High).

The two surfaces operate at different layers and are configured independently. References to "Medium" are legacy/community shorthand for Moderate on both scales. Microsoft Learn documents the agent-level slider by its poles (Lowest…Highest) and default (High); the intermediate positions follow the same Low / Moderate / High naming.

Objective

Enforce appropriate content moderation levels at both agent and topic levels based on governance zone, usage context, and audience (internal vs. customer-facing) to support compliance with FSI regulatory requirements and reduce the risk of harmful or inappropriate AI-generated responses.


Why This Matters for FSI

  • FINRA Rule 3110 / RN 24-09 / Rule 2210: Content moderation helps support supervisory obligations and FINRA 2210 communications-with-the-public standards by filtering harmful, misleading, or non-compliant AI-generated outputs that could create regulatory exposure or customer harm
  • SEC 17a-3 / 17a-4: Moderation enforcement aids recordkeeping compliance by helping reduce the likelihood that AI agents generate responses that could trigger disclosure violations or mislead investors; moderation configuration changes themselves are records that should be captured under 17a-4(f) WORM retention
  • SOX 302 / 404: For AI agents that touch financial reporting workflows, moderation contributes to entity-level controls over the integrity of AI-generated content used in disclosures or management certifications
  • GLBA 501(b): Content moderation helps meet information safeguards requirements by reducing the chance that AI outputs inadvertently expose nonpublic personal information (NPI) or violate customer privacy protections
  • OCC Bulletin 2026-13 (formerly OCC 2011-12) / Fed SR 26-2 (formerly SR 11-7) (Model Risk Management): Moderation level selection, change control, and effectiveness testing are model risk controls — they constrain the operational risk surface of generative model outputs and feed into ongoing model performance monitoring
  • NIST AI RMF / AI 600-1: Maps to the MANAGE function (risk treatment) and MEASURE function (validity, safety, bias) for generative AI systems

Automation Available

See Content Moderation Governance Monitor in FSI-AgentGov-Solutions for automated content moderation level validation against zone-specific requirements.

Control Description

The Copilot Studio per-prompt moderation slider provides three runtime safety levels—Low, Moderate, and High—that control how strictly the platform filters AI-generated responses for harmful, inappropriate, or non-compliant content. These per-prompt moderation levels are configurable at two prompt-builder enforcement points: the agent-effective default (the setting on the Conversational boosting system topic, applied to all unstructured Q&A) and the per-topic override (the setting on each custom topic's Generative answers node, taking precedence for that conversation path at runtime). This surface is distinct from the agent-level Settings → Generative AI moderation slider (5 positions: Lowest / Low / Moderate / High / Highest, default High) governed by Control 1.8 — see the disambiguation note above. "Medium" is a community / legacy synonym for Moderate on the per-prompt scale and is treated as equivalent throughout this control.

This control establishes a risk-based moderation enforcement model that escalates by governance zone:

  1. Agent-effective default moderation — The per-prompt moderation level (Low, Moderate, or High) configured on the Conversational boosting system topic applies to all unstructured Q&A unless a more specific topic overrides it; zone classification determines the minimum permissible default
  2. Per-topic moderation overrides — Individual custom topics that contain a Generative answers node may have stricter or more permissive per-prompt moderation levels based on the topic's role, audience, and sensitivity; overrides to less restrictive levels require approval in Zone 2+
  3. Custom safety messages — When content is blocked by moderation filters, agents display custom messages to users; Zone 3 requires tailored safety messages that align with brand and regulatory voice
  4. Audit and logging integration — Content moderation events (prompts blocked, moderation level changes) are logged for compliance review; Zone 3 requires integration with Microsoft Purview for audit trail retention

The enforcement model escalates by governance zone. Zone 1 environments permit Moderate as the minimum agent-effective default, with agent authors empowered to configure topic-level overrides without approval. Zone 2 environments require High by default, with topic-level overrides to Moderate or Low requiring documented approval. Zone 3 environments mandate High at both the agent-effective default and per-topic levels, prohibit topic-level downgrades to Low, require custom safety messages, and integrate moderation logs with Purview for regulatory audit trails.

Capability Comparison by Zone

Capability Zone 1 (Personal) Zone 2 (Team) Zone 3 (Enterprise)
Agent-effective default moderation Moderate minimum High default High mandatory
Topic-level override to Moderate Allowed Allowed with approval Allowed with documented justification
Topic-level override to Low Allowed Requires documented approval Prohibited
Custom safety messages Recommended Recommended Required
Moderation change approval Not required Documented approval Formal review + approval
Purview audit integration Not required Recommended Required
Moderation testing before deployment Recommended Required Required with adversarial testing
Moderation inventory tracking Recommended Required Required
Review cadence Quarterly Monthly Weekly

Key Configuration Points

  • Agent-effective default moderation — Set the per-prompt content moderation level (Low, Moderate, High) on each agent's Conversational boosting system topic in the prompt builder; default should align with the agent's governance zone and audience. (This is distinct from the agent-level Settings → Generative AI → Content moderation slider — 5 positions, Lowest/Low/Moderate/High/Highest — covered by Control 1.8.)
  • Per-topic moderation overrides — Configure per-prompt moderation levels (Low / Moderate / High) on individual custom topics' Generative answers nodes when a conversation path requires stricter or more permissive filtering; topic overrides take precedence at runtime
  • Custom safety messages — Define user-facing messages displayed when content is blocked by moderation filters; Zone 3 requires messages aligned with organizational voice and regulatory context
  • Moderation approval workflow — Establish an approval process for topic-level overrides that reduce moderation strictness below the agent-level default (Zone 2+)
  • Audit log configuration — Enable and retain moderation event logs in Microsoft Purview for compliance review; capture blocked prompts, moderation level changes, and override approvals (Zone 3)
  • Adversarial testing integration — Test agent responses at each moderation level using adversarial prompts before deployment to verify moderation effectiveness (Zone 2+)
  • Moderation inventory and review — Maintain an inventory of all agents and topics with moderation levels documented, including zone classification, approval status, and last review date

Zone-Specific Requirements

Zone Requirement Rationale
Zone 1 (Personal) Moderate per-prompt moderation minimum on the agent-effective default; agent authors may configure topic overrides without approval; quarterly review of moderation settings Personal productivity agents present lower compliance risk; Moderate moderation balances safety with user autonomy for experimentation
Zone 2 (Team) High per-prompt moderation as the agent-effective default; topic-level overrides to Moderate or Low require documented approval; monthly review of moderation settings Shared team agents may be used in workflows involving customer data or internal communications; High default reduces risk of inappropriate outputs
Zone 3 (Enterprise) High per-prompt moderation mandatory as the agent-effective default; topic-level overrides to Moderate allowed with documented justification; topic-level downgrades to Low prohibited; custom safety messages required; Purview audit integration; weekly review Customer-facing and regulated agents require maximum content safety; blocking downgrades to Low helps prevent inappropriate responses in high-stakes interactions

Roles & Responsibilities

Role Responsibility
Power Platform Admin Configure environment-level moderation policies; review and approve moderation override requests for Zone 2+ agents; monitor moderation compliance across environments
Copilot Studio Agent Author Set the agent-effective default per-prompt moderation (Conversational boosting system topic); configure per-topic moderation overrides per approved settings; define custom safety messages; document justification for moderation changes
Purview Compliance Admin Configure audit log retention for moderation events; integrate moderation logs with Purview for compliance reporting; review blocked content trends for policy updates
SOC Analyst / Entra Security Admin Monitor moderation alerts and trends in Sentinel / Defender XDR; investigate anomalous moderation bypass attempts; triage and escalate confirmed incidents involving harmful outputs
Model Risk Manager Review moderation level selection rationale, override approvals, and adversarial test results as part of OCC Bulletin 2026-13 (formerly OCC 2011-12) / Fed SR 26-2 (formerly SR 11-7) model risk oversight

Control Relationship
1.8 - Runtime Protection and External Threat Detection Complementary runtime security layer addressing broader threat detection; content moderation focuses on output filtering
2.5 - Testing, Validation, and Quality Assurance Testing framework that includes moderation effectiveness validation before agent deployment
2.20 - Adversarial Testing and Red Team Framework Red team testing methodology for evaluating moderation bypass attempts and filter effectiveness
1.7 - Audit Logging and Monitoring Audit log infrastructure for capturing and retaining moderation events in Microsoft Purview (Zone 3 mandatory)
1.21 - Adversarial Input Logging Logging framework capturing malicious prompts that trigger moderation filters
3.3 - Compliance and Regulatory Reporting Reporting integration for tracking moderation enforcement status and blocked content trends
2.22 - Inactivity Timeout Enforcement Agent-level timeout configuration should be reviewed alongside moderation settings during quarterly reviews

Implementation Playbooks

Step-by-Step Implementation

This control has detailed playbooks for implementation, automation, testing, and troubleshooting:


Automated Validation: Content Moderation Governance Monitor

The Content Moderation Governance Monitor solution automates per-agent content moderation level validation against zone-specific governance requirements. It provides daily drift detection, Teams adaptive card alerts with severity classification, Dataverse tracking for moderation baselines and violations, and SHA-256 evidence export for audit readiness. Deploy this solution to continuously monitor moderation compliance across all environments.

Repository: content-moderation-monitor | Details: Solutions Index


Verification Criteria

Confirm control effectiveness by verifying:

  1. The agent-effective default per-prompt moderation level is set to the correct value (Moderate minimum for Zone 1, High for Zone 2+) for each agent based on its governance zone classification
  2. Per-topic moderation overrides are documented with appropriate approval (Zone 2+) and no prohibited downgrades to Low exist in Zone 3 agents
  3. Custom safety messages are configured for all Zone 3 agents and align with organizational voice and regulatory requirements
  4. Moderation event logs are being captured and retained in Microsoft Purview for Zone 3 agents
  5. An up-to-date inventory of agents and topics with moderation levels exists, including zone classification, approval status, and last review date
  6. Adversarial testing has been conducted for Zone 2+ agents to validate moderation effectiveness before deployment

Additional Resources

FSI Scope Note

Agent and Topic Dual Control: This control targets both agent-level and topic-level content moderation settings within Copilot Studio. Organizations should implement content moderation enforcement when:

  • Copilot Studio agents are being deployed for customer-facing interactions or regulated workflows
  • Agents generate responses that could create regulatory exposure, customer harm, or reputational risk
  • Regulatory requirements mandate controls over AI-generated communications (e.g., FINRA supervision, SEC disclosure rules)

For organizations in early-stage agent development, this control should be implemented during the design phase to establish baseline moderation levels before agents are published. The generally available (GA) release of per-prompt moderation levels (Microsoft Learn release plan: February 11, 2026; originally announced via MC1217615 for January 31, 2026) provides enhanced configurability compared to prior versions.

Complement with Adversarial Testing

Content moderation provides output filtering, but it is not a substitute for comprehensive security controls. Pair this control with Control 2.20 (Adversarial Testing and Red Team Framework) to proactively test moderation bypass attempts and jailbreak techniques before agents reach production. This layered approach addresses both runtime filtering and pre-deployment validation.


Updated: May 2026 | Version: v1.6.2 | UI Verification Status: Current