Agent Observability Alerting Configuration
Overview
This guide provides zone-based alerting thresholds and escalation configurations for Agent 365 observability. Alerts are designed to surface operational issues, security events, and compliance concerns based on governance zone requirements.
Schema assumptions in the KQL below
The alert queries in this document target Application Insights traces that carry FSI-overlay span and event names (agent.interaction, fallback.triggered, tool.invocation, rai.filter, dlp.block, access.denied, auth.failure, sponsor.attestation.check, blueprint.promotion, blueprint.demotion, config.change) and a custom dimension, fsi_zone. None of these are part of the documented Microsoft Agent 365 SDK schema (see Telemetry schema). The queries depend on (a) an OpenTelemetry Collector or agent host pipeline that emits these synthetic names, and (b) downstream signals from Microsoft Defender and Microsoft Purview being re-projected into Application Insights. Before deploying, rewrite each query against the actual signal source in your environment: the documented InvokeAgentScope, ExecuteToolScope, InferenceScope, and OutputScope spans with gen_ai.* and microsoft.* attributes, or the Defender and Purview tables for security and compliance events.
Alert Categories
| Category | Zone 1 | Zone 2 | Zone 3 |
|---|---|---|---|
| Performance | Optional | ✓ | ✓ |
| Availability | Optional | ✓ | ✓ |
| Security | - | ✓ | ✓ (Priority) |
| Compliance | - | Optional | ✓ (Priority) |
| Governance | - | ✓ | ✓ |
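The matrix above can be encoded as data so that deployment tooling decides which alert categories apply to a given zone. The following is a minimal sketch in Python, not part of the guide's tooling: the names `CATEGORY_MATRIX` and `categories_for_zone` are illustrative, the zone keys follow the `fsi_zone` values used in the queries, and treating "Optional" as opt-in is an assumption.

```python
# Illustrative encoding of the Alert Categories table.
# "required", "optional", and "none" mirror the check mark, Optional, and "-" cells.
CATEGORY_MATRIX = {
    "Performance":  {"Zone1": "optional", "Zone2": "required", "Zone3": "required"},
    "Availability": {"Zone1": "optional", "Zone2": "required", "Zone3": "required"},
    "Security":     {"Zone1": "none",     "Zone2": "required", "Zone3": "required"},
    "Compliance":   {"Zone1": "none",     "Zone2": "optional", "Zone3": "required"},
    "Governance":   {"Zone1": "none",     "Zone2": "required", "Zone3": "required"},
}


def categories_for_zone(zone, include_optional=False):
    """Return the alert categories to deploy for a zone, in table order."""
    wanted = {"required", "optional"} if include_optional else {"required"}
    return [cat for cat, zones in CATEGORY_MATRIX.items() if zones[zone] in wanted]


print(categories_for_zone("Zone2"))
print(categories_for_zone("Zone1", include_optional=True))
```

Keeping the matrix in one place means a zone reclassification changes a single table rather than several rule files.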
Performance Alerts
Alert 1: High Response Latency
Trigger: Agent response time exceeds threshold.
| Zone | Warning | Critical | Evaluation Window |
|---|---|---|---|
| Zone 2 | P95 > 3s | P95 > 5s | 5 minutes |
| Zone 3 | P95 > 2s | P95 > 3s | 5 minutes |
{
"name": "Agent High Response Latency",
"description": "Agent response time exceeds acceptable threshold",
"severity": 2,
"enabled": true,
"scopes": ["/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/components/{ai}"],
"evaluationFrequency": "PT5M",
"windowSize": "PT5M",
"criteria": {
"allOf": [
{
"query": "traces | where name == 'agent.interaction' | extend latency = todouble(customDimensions.agent_response_latency_ms) | summarize P95 = percentile(latency, 95) by bin(timestamp, 5m) | where P95 > 3000",
"timeAggregation": "Count",
"operator": "GreaterThan",
"threshold": 0,
"failingPeriods": {
"numberOfEvaluationPeriods": 1,
"minFailingPeriodsToAlert": 1
}
}
]
},
"actions": [
{ "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-platform-ops" }
]
}
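The rule above hard-codes the Zone 2 warning threshold (3000 ms) and does not filter by zone. To cover every row of the threshold table, one option is to generate a query per zone and level. The sketch below assumes the `fsi_zone` dimension values `Zone2` and `Zone3` used elsewhere in this guide; `LATENCY_P95_MS` and `latency_query` are hypothetical names.

```python
# Thresholds in milliseconds, taken from the High Response Latency table.
LATENCY_P95_MS = {
    "Zone2": {"warning": 3000, "critical": 5000},
    "Zone3": {"warning": 2000, "critical": 3000},
}


def latency_query(zone, level):
    """Build the KQL for one zone and alert level (hypothetical helper)."""
    threshold = LATENCY_P95_MS[zone][level]
    return (
        "traces"
        " | where name == 'agent.interaction'"
        f" | where customDimensions.fsi_zone == '{zone}'"
        " | extend latency = todouble(customDimensions.agent_response_latency_ms)"
        " | summarize P95 = percentile(latency, 95) by bin(timestamp, 5m)"
        f" | where P95 > {threshold}"
    )


for zone, levels in LATENCY_P95_MS.items():
    for level in levels:
        print(f"{zone} {level}: {latency_query(zone, level)}")
```

Each generated string can be placed in the `query` field of a copy of the rule, with `severity` adjusted for the critical variant.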
Alert 2: Low Success Rate
Trigger: Agent success rate drops below threshold.
| Zone | Warning | Critical | Evaluation Window |
|---|---|---|---|
| Zone 2 | < 95% | < 90% | 15 minutes |
| Zone 3 | < 98% | < 95% | 10 minutes |
{
"name": "Agent Low Success Rate",
"description": "Agent success rate below acceptable threshold",
"severity": 1,
"enabled": true,
"criteria": {
"allOf": [
{
"query": "traces | where timestamp > ago(15m) | where name == 'agent.interaction' | extend success = customDimensions.agent_response_success == 'true' | summarize SuccessRate = 100.0 * countif(success) / count() | where SuccessRate < 95",
"timeAggregation": "Count",
"operator": "GreaterThan",
"threshold": 0
}
]
}
}
Alert 3: High Fallback Rate
Trigger: Fallback topic triggered excessively.
| Zone | Warning | Critical | Evaluation Window |
|---|---|---|---|
| Zone 2 | > 20% | > 30% | 1 hour |
| Zone 3 | > 10% | > 20% | 30 minutes |
{
"name": "Agent High Fallback Rate",
"description": "Agent triggering fallback topic too frequently - review topic coverage",
"severity": 3,
"criteria": {
"allOf": [
{
"query": "traces | where timestamp > ago(1h) | extend isFallback = name == 'fallback.triggered', isInteraction = name == 'agent.interaction' | summarize Fallbacks = countif(isFallback), Interactions = countif(isInteraction) | where Interactions > 0 | extend FallbackRate = 100.0 * Fallbacks / Interactions | where FallbackRate > 20"
}
]
}
}
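The fallback rate is a ratio, so a window with no interactions needs explicit handling before the thresholds are applied. The sketch below works through the arithmetic and maps a rate to the warning and critical levels from the table; `FALLBACK_THRESHOLDS` and `fallback_level` are illustrative names, and returning `no_data` for an idle window is an assumption rather than documented behavior.

```python
# Percent thresholds from the High Fallback Rate table.
FALLBACK_THRESHOLDS = {
    "Zone2": {"warning": 20.0, "critical": 30.0},
    "Zone3": {"warning": 10.0, "critical": 20.0},
}


def fallback_level(zone, fallbacks, interactions):
    """Classify one evaluation window as ok, warning, critical, or no_data."""
    if interactions == 0:
        return "no_data"  # avoid dividing by zero on an idle window
    rate = 100.0 * fallbacks / interactions
    limits = FALLBACK_THRESHOLDS[zone]
    if rate > limits["critical"]:
        return "critical"
    if rate > limits["warning"]:
        return "warning"
    return "ok"


# 25 fallbacks in 100 interactions is a 25% rate.
print(fallback_level("Zone2", 25, 100))
print(fallback_level("Zone3", 25, 100))
```

The same 25% rate is a warning in Zone 2 and critical in Zone 3, which is why the thresholds are keyed by zone.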
Availability Alerts
Alert 4: Agent Unavailable
Trigger: No interactions received for extended period.
| Zone | Threshold | Evaluation Window |
|---|---|---|
| Zone 2 | 0 interactions | 30 minutes |
| Zone 3 | 0 interactions | 15 minutes |
{
"name": "Agent Unavailable",
"description": "No agent interactions detected - service may be down",
"severity": 0,
"criteria": {
"allOf": [
{
"query": "traces | where timestamp > ago(15m) | where name == 'agent.interaction' | where customDimensions.fsi_zone == 'Zone3'",
"timeAggregation": "Count",
"operator": "LessThanOrEqual",
"threshold": 0
}
]
},
"actions": [
{ "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-critical-oncall" }
]
}
Alert 5: Connector Failure
Trigger: External connector/tool calls failing.
| Zone | Warning | Critical |
|---|---|---|
| Zone 2 | > 10% fail | > 25% fail |
| Zone 3 | > 5% fail | > 15% fail |
{
"name": "Agent Connector Failures",
"description": "External connector calls experiencing high failure rate",
"severity": 2,
"criteria": {
"allOf": [
{
"query": "traces | where timestamp > ago(15m) | where name == 'tool.invocation' | extend success = customDimensions.tool_success == 'true' | summarize FailRate = 100.0 * countif(not(success)) / count() by tostring(customDimensions.tool_name) | where FailRate > 10"
}
]
}
}
Security Alerts
Alert 6: RAI Filter Spike (Zone 3 Priority)
Trigger: RAI filter activations in the past hour exceed three times the baseline computed from the preceding six days.
{
"name": "RAI Filter Spike - Security",
"description": "Significant increase in RAI content filter activations - review for potential abuse",
"severity": 1,
"criteria": {
"allOf": [
{
"query": "let hourlyBaseline = toscalar(traces | where timestamp between (ago(7d) .. ago(1d)) | where name == 'rai.filter' | summarize BaselineAvg = count() / 144.0); traces | where timestamp > ago(1h) | where name == 'rai.filter' | summarize CurrentCount = count() | where CurrentCount > hourlyBaseline * 3"
}
]
},
"actions": [
{ "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-security-ops" }
]
}
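The baseline window in this rule runs from `ago(7d)` to `ago(1d)`, which is six days, while the current count covers one hour. For the two to be comparable, the baseline has to be expressed per hour: six days contain 6 × 24 = 144 hours. The sketch below works through that arithmetic; `is_rai_spike` and its parameter names are hypothetical, and the 3× multiplier is taken from the query.

```python
def is_rai_spike(baseline_total, current_hour_count, baseline_days=6, multiplier=3.0):
    """Compare the last hour's count with a multiple of the hourly baseline."""
    hourly_baseline = baseline_total / (baseline_days * 24)
    return current_hour_count > hourly_baseline * multiplier


# 1440 activations over six days average 10 per hour, so the alert line is 30.
print(is_rai_spike(baseline_total=1440, current_hour_count=31))
print(is_rai_spike(baseline_total=1440, current_hour_count=30))
```

With an empty baseline, the hourly average is zero and any single activation counts as a spike, so a minimum-count floor may be worth adding for newly deployed agents.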
Alert 7: DLP Block Event (Zone 3 Priority)
Trigger: DLP policy blocks agent content.
{
"name": "Agent DLP Block",
"description": "DLP policy blocked agent content - review for data protection",
"severity": 2,
"criteria": {
"allOf": [
{
"query": "traces | where timestamp > ago(15m) | where name == 'dlp.block' | where customDimensions.fsi_zone == 'Zone3'",
"timeAggregation": "Count",
"operator": "GreaterThan",
"threshold": 0
}
]
},
"actions": [
{ "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-compliance" }
]
}
Alert 8: Access Denial Spike
Trigger: More than 10 access denial events within one hour.
{
"name": "Agent Access Denial Spike",
"description": "Spike in access denial events - potential unauthorized access attempts",
"severity": 1,
"criteria": {
"allOf": [
{
"query": "traces | where timestamp > ago(1h) | where name == 'access.denied' | summarize DenialCount = count() | where DenialCount > 10"
}
]
},
"actions": [
{ "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-security-ops" }
]
}
Alert 9: Authentication Failure Pattern
Trigger: More than 5 authentication failures from the same client IP within 30 minutes.
{
"name": "Agent Auth Failure Pattern",
"description": "Multiple authentication failures detected - potential credential attack",
"severity": 1,
"criteria": {
"allOf": [
{
"query": "traces | where timestamp > ago(30m) | where name == 'auth.failure' | summarize FailCount = count() by tostring(customDimensions.client_ip) | where FailCount > 5"
}
]
}
}
Compliance Alerts
Alert 10: Sponsor Attestation Overdue (Zone 3)
Trigger: Agent sponsor attestation past due date.
{
"name": "Sponsor Attestation Overdue",
"description": "Agent sponsor attestation is overdue - compliance risk",
"severity": 2,
"criteria": {
"allOf": [
{
"query": "traces | where timestamp > ago(24h) | where name == 'sponsor.attestation.check' | extend overdue = customDimensions.attestation_overdue == 'true' | where overdue | summarize OverdueCount = dcount(tostring(customDimensions.agent_id)) | where OverdueCount > 0"
}
]
},
"actions": [
{ "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-compliance" }
]
}
Alert 11: Orphaned Agent Detected
Trigger: Agent without valid sponsor identified.
{
"name": "Orphaned Agent Detected",
"description": "Agent without valid sponsor - immediate attention required",
"severity": 1,
"criteria": {
"allOf": [
{
"query": "traces | where timestamp > ago(1h) | where name == 'agent.interaction' | where customDimensions.fsi_zone in ('Zone2', 'Zone3') | where isempty(customDimensions.sponsor_id) or customDimensions.sponsor_status == 'Invalid' | distinct tostring(customDimensions.service_name)"
}
]
}
}
Governance Alerts
Alert 12: Blueprint Phase Change
Trigger: Agent promoted or demoted in Blueprint lifecycle.
{
"name": "Blueprint Phase Change",
"description": "Agent Blueprint lifecycle phase changed",
"severity": 4,
"criteria": {
"allOf": [
{
"query": "traces | where timestamp > ago(15m) | where name in ('blueprint.promotion', 'blueprint.demotion') | project timestamp, AgentId = customDimensions.agent_id, FromPhase = customDimensions.source_phase, ToPhase = customDimensions.target_phase"
}
]
}
}
Alert 13: Configuration Change (Zone 3)
Trigger: Agent configuration modified.
{
"name": "Zone 3 Agent Configuration Change",
"description": "Zone 3 agent configuration was modified - audit required",
"severity": 3,
"criteria": {
"allOf": [
{
"query": "traces | where timestamp > ago(15m) | where name == 'config.change' | where customDimensions.fsi_zone == 'Zone3' | project timestamp, AgentId = customDimensions.agent_id, ChangeType = customDimensions.change_type, ModifiedBy = customDimensions.modified_by"
}
]
},
"actions": [
{ "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-compliance" }
]
}
Action Groups Configuration
Platform Operations
{
"name": "ag-platform-ops",
"shortName": "PlatformOps",
"emailReceivers": [
{
"name": "Platform-Ops-Email",
"emailAddress": "platform-ops@contoso.com"
}
],
"webhookReceivers": [
{
"name": "Teams-PlatformOps",
"serviceUri": "https://contoso.webhook.office.com/webhookb2/..."
}
]
}
Security Operations
{
"name": "ag-security-ops",
"shortName": "SecOps",
"emailReceivers": [
{
"name": "SecOps-Email",
"emailAddress": "secops@contoso.com"
}
],
"smsReceivers": [
{
"name": "SecOps-Oncall",
"countryCode": "1",
"phoneNumber": "5551234567"
}
]
}
Compliance Team
{
"name": "ag-compliance",
"shortName": "Compliance",
"emailReceivers": [
{
"name": "Compliance-Team",
"emailAddress": "compliance@contoso.com"
}
]
}
Critical On-Call
{
"name": "ag-critical-oncall",
"shortName": "CriticalOC",
"emailReceivers": [
{
"name": "Critical-Escalation",
"emailAddress": "critical@contoso.com"
}
],
"smsReceivers": [
{
"name": "Oncall-Primary",
"countryCode": "1",
"phoneNumber": "5551234567"
},
{
"name": "Oncall-Backup",
"countryCode": "1",
"phoneNumber": "5559876543"
}
],
"voiceReceivers": [
{
"name": "Oncall-Voice",
"countryCode": "1",
"phoneNumber": "5551234567"
}
]
}
Deployment Script
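# Deploy all alert rules. Assumes the Az.Monitor module is installed and Connect-AzAccount has run.
# Each JSON file must define enabled, scopes, evaluationFrequency, windowSize, and actions, and every
# criteria entry must define query, timeAggregation, operator, and threshold.
$alertRules = Get-ChildItem -Path "./alerts/*.json"
foreach ($rule in $alertRules) {
    # -Raw reads the file as a single string before parsing
    $alertConfig = Get-Content $rule.FullName -Raw | ConvertFrom-Json
    # Build one condition object per criteria entry
    $conditions = foreach ($c in $alertConfig.criteria.allOf) {
        New-AzScheduledQueryRuleConditionObject -Query $c.query -TimeAggregation $c.timeAggregation -Operator $c.operator -Threshold $c.threshold
    }
    $params = @{
        ResourceGroupName     = "rg-agent-governance"
        Name                  = $alertConfig.name
        DisplayName           = $alertConfig.name
        Location              = "eastus"
        Description           = $alertConfig.description
        Severity              = $alertConfig.severity
        Enabled               = $alertConfig.enabled
        Scope                 = $alertConfig.scopes
        # Convert ISO 8601 durations such as PT5M to TimeSpan values
        EvaluationFrequency   = [System.Xml.XmlConvert]::ToTimeSpan($alertConfig.evaluationFrequency)
        WindowSize            = [System.Xml.XmlConvert]::ToTimeSpan($alertConfig.windowSize)
        CriterionAllOf        = $conditions
        ActionGroupResourceId = @($alertConfig.actions.actionGroupId)
    }
    New-AzScheduledQueryRule @params
    Write-Host "Created alert: $($alertConfig.name)" -ForegroundColor Green
}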
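Several rule definitions in this guide are abbreviated and omit fields that the deployment script reads, such as scopes, evaluationFrequency, and windowSize. A pre-deployment check can list what is missing from each file. The sketch below is one way to do that: the rule-level field list mirrors what the script reads, the per-condition field list is an assumption based on the complete rules in this guide, and `missing_fields` is a hypothetical helper.

```python
# Fields the deployment script reads from each rule file.
RULE_FIELDS = ("name", "description", "severity", "enabled", "scopes",
               "evaluationFrequency", "windowSize", "criteria", "actions")
# Fields assumed necessary for each entry in criteria.allOf.
CONDITION_FIELDS = ("query", "timeAggregation", "operator", "threshold")


def missing_fields(rule):
    """Return the names of expected fields that are absent from one rule dict."""
    missing = [f for f in RULE_FIELDS if f not in rule]
    conditions = rule.get("criteria", {}).get("allOf", [])
    for i, cond in enumerate(conditions):
        missing += [f"criteria.allOf[{i}].{f}" for f in CONDITION_FIELDS if f not in cond]
    return missing


# An abbreviated rule in the shape of the High Fallback Rate example.
abbreviated = {
    "name": "Agent High Fallback Rate",
    "description": "Agent triggering fallback topic too frequently - review topic coverage",
    "severity": 3,
    "criteria": {"allOf": [{"query": "traces | where name == 'fallback.triggered'"}]},
}
for field in missing_fields(abbreviated):
    print("missing:", field)
```

In practice each file under ./alerts would be loaded with `json.load` and passed through the same check before the deployment loop runs.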
Escalation Matrix
| Severity | Initial Response | Escalation (15 min) | Escalation (1 hr) |
|---|---|---|---|
| Sev 0 (Critical) | On-call + Manager | Director | VP/CIO |
| Sev 1 (High) | On-call | Manager | Director |
| Sev 2 (Medium) | Platform Ops | On-call | Manager |
| Sev 3 (Low) | Platform Ops | - | On-call |
| Sev 4 (Info) | Log only | - | - |
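The matrix can also be read as a lookup: given a severity and how long an alert has been open, which parties should have been engaged by now? The sketch below encodes the table rows in Python; `ESCALATION` and `notified_so_far` are illustrative names, and treating the 15-minute and 1-hour marks as inclusive is an assumption.

```python
# Rows of the Escalation Matrix: (initial response, at 15 minutes, at 1 hour).
# None stands for the "-" cells.
ESCALATION = {
    0: ("On-call + Manager", "Director", "VP/CIO"),
    1: ("On-call", "Manager", "Director"),
    2: ("Platform Ops", "On-call", "Manager"),
    3: ("Platform Ops", None, "On-call"),
    4: ("Log only", None, None),
}


def notified_so_far(severity, minutes_open):
    """List every party engaged for an alert that has been open for minutes_open."""
    initial, at_15, at_60 = ESCALATION[severity]
    tiers = [initial]
    if minutes_open >= 15 and at_15:
        tiers.append(at_15)
    if minutes_open >= 60 and at_60:
        tiers.append(at_60)
    return tiers


print(notified_so_far(0, 20))
print(notified_so_far(3, 90))
```

A Sev 3 alert skips the 15-minute tier, so it reaches On-call only after the one-hour mark.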
Related Resources
- Overview - Observability architecture
- OpenTelemetry Setup - Collector configuration
- Application Insights Workbooks - Dashboard templates
- Microsoft Learn: Azure Monitor Alerts
Updated: June 2026 | Version: v1.6.2 | UI Verification Status: Current