Agent Observability Alerting Configuration

Part of Agent 365 Observability Implementation Guide

Overview

This guide provides zone-based alerting thresholds and escalation configurations for Agent 365 observability. Alerts are designed to surface operational issues, security events, and compliance concerns based on governance zone requirements.

Schema assumptions in the KQL below

The alert queries in this document target Application Insights traces with FSI-overlay span / event names (agent.interaction, tool.invocation, rai.filter, dlp.block, access.denied, auth.failure, sponsor.attestation.check, blueprint.promotion, config.change) and a custom dimension fsi_zone. None of these are part of the documented Microsoft Agent 365 SDK schema — see Telemetry schema. They depend on (a) an OpenTelemetry Collector / agent host pipeline that emits these synthetic names, and (b) downstream signals from Microsoft Defender / Microsoft Purview being re-projected into Application Insights. Before deploying, rewrite each query against the actual signal source in your environment (the documented InvokeAgentScope / ExecuteToolScope / InferenceScope / OutputScope spans with gen_ai.* / microsoft.* attributes, or the Defender / Purview tables for security and compliance events).
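As one illustration of that rewrite, the latency query might be re-targeted roughly as follows. This is a sketch under several assumptions that must be checked against your own telemetry: that agent invocation spans arrive in the Application Insights dependencies table, that they carry the OpenTelemetry GenAI attribute gen_ai.operation.name with the value invoke_agent, and that the built-in duration column (milliseconds) is an acceptable stand-in for agent_response_latency_ms. This guide confirms none of these.

```kusto
// Sketch only: the table, attribute name, and attribute value are assumptions.
dependencies
| where timestamp > ago(5m)
| where tostring(customDimensions['gen_ai.operation.name']) == 'invoke_agent'
| summarize P95 = percentile(duration, 95) by bin(timestamp, 5m)
| where P95 > 3000
```

If your spans land in a different table or use different attribute names, only the first two filters should need to change; the percentile and threshold logic stays the same.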

Alert Categories

Category	Zone 1	Zone 2	Zone 3
Performance	Optional	✓	✓
Availability	Optional	✓	✓
Security	-	✓	✓ (Priority)
Compliance	-	Optional	✓ (Priority)
Governance	-	✓	✓

Performance Alerts

Alert 1: High Response Latency

Trigger: Agent response time exceeds threshold.

Zone	Warning	Critical	Evaluation Window
Zone 2	P95 > 3s	P95 > 5s	5 minutes
Zone 3	P95 > 2s	P95 > 3s	5 minutes

{
  "name": "Agent High Response Latency",
  "description": "Agent response time exceeds acceptable threshold",
  "severity": 2,
  "enabled": true,
  "scopes": ["/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/components/{ai}"],
  "evaluationFrequency": "PT5M",
  "windowSize": "PT5M",
  "criteria": {
    "allOf": [
      {
        "query": "traces | where name == 'agent.interaction' | extend latency = todouble(customDimensions.agent_response_latency_ms) | summarize P95 = percentile(latency, 95) by bin(timestamp, 5m) | where P95 > 3000",
        "timeAggregation": "Count",
        "operator": "GreaterThan",
        "threshold": 0,
        "failingPeriods": {
          "numberOfEvaluationPeriods": 1,
          "minFailingPeriodsToAlert": 1
        }
      }
    ]
  },
  "actions": [
    { "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-platform-ops" }
  ]
}
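The rule above hard-codes the Zone 2 warning threshold (3000 ms) and does not filter by zone. A minimal sketch of a zone-aware variant follows, assuming fsi_zone carries the values 'Zone2' and 'Zone3' as in the other queries in this guide. The thresholds datatable and the level column are illustrative names, not part of any documented schema.

```kusto
// Thresholds copied from the Alert 1 table; the datatable itself is illustrative.
let thresholds = datatable(zone: string, warning_ms: real, critical_ms: real) [
    'Zone2', 3000.0, 5000.0,
    'Zone3', 2000.0, 3000.0
];
traces
| where name == 'agent.interaction'
| extend zone = tostring(customDimensions.fsi_zone),
         latency = todouble(customDimensions.agent_response_latency_ms)
| summarize P95 = percentile(latency, 95) by zone, bin(timestamp, 5m)
| join kind=inner thresholds on zone
| extend level = case(P95 > critical_ms, 'Critical', P95 > warning_ms, 'Warning', 'OK')
| where level != 'OK'
| project timestamp, zone, P95, level
```

Because a scheduled query rule carries a single severity, routing Warning and Critical to different action groups would mean deploying this query twice, each copy filtering on one level value.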

Alert 2: Low Success Rate

Trigger: Agent success rate drops below threshold.

Zone	Warning	Critical	Evaluation Window
Zone 2	< 95%	< 90%	15 minutes
Zone 3	< 98%	< 95%	10 minutes

{
  "name": "Agent Low Success Rate",
  "description": "Agent success rate below acceptable threshold",
  "severity": 1,
  "enabled": true,
  "criteria": {
    "allOf": [
      {
        "query": "traces | where timestamp > ago(15m) | where name == 'agent.interaction' | extend success = customDimensions.agent_response_success == 'true' | summarize SuccessRate = 100.0 * countif(success) / count() | where SuccessRate < 95",
        "timeAggregation": "Count",
        "operator": "GreaterThan",
        "threshold": 0
      }
    ]
  }
}

Alert 3: High Fallback Rate

Trigger: Fallback topic triggered excessively.

Zone	Warning	Critical	Evaluation Window
Zone 2	> 20%	> 30%	1 hour
Zone 3	> 10%	> 20%	30 minutes

{
  "name": "Agent High Fallback Rate",
  "description": "Agent triggering fallback topic too frequently - review topic coverage",
  "severity": 3,
  "criteria": {
    "allOf": [
      {
        "query": "traces | where timestamp > ago(1h) | extend isFallback = name == 'fallback.triggered', isInteraction = name == 'agent.interaction' | summarize FallbackRate = 100.0 * countif(isFallback) / countif(isInteraction) | where FallbackRate > 20"
      }
    ]
  }
}
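The ratio in the query above is undefined when no agent.interaction events fall inside the window. A sketch that pre-filters to the two event names and drops empty windows before dividing is shown here; the column names Fallbacks and Interactions are illustrative.

```kusto
traces
| where timestamp > ago(1h)
| where name in ('fallback.triggered', 'agent.interaction')
| summarize Fallbacks = countif(name == 'fallback.triggered'),
            Interactions = countif(name == 'agent.interaction')
| where Interactions > 0          // skip windows with no traffic
| extend FallbackRate = 100.0 * Fallbacks / Interactions
| where FallbackRate > 20         // Zone 2 warning threshold from the table
```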

Availability Alerts

Alert 4: Agent Unavailable

Trigger: No interactions received for extended period.

Zone	Threshold	Evaluation Window
Zone 2	0 interactions	30 minutes
Zone 3	0 interactions	15 minutes

{
  "name": "Agent Unavailable",
  "description": "No agent interactions detected - service may be down",
  "severity": 0,
  "criteria": {
    "allOf": [
      {
        "query": "traces | where timestamp > ago(15m) | where name == 'agent.interaction' | where customDimensions.fsi_zone == 'Zone3'",
        "timeAggregation": "Count",
        "operator": "LessThanOrEqual",
        "threshold": 0
      }
    ]
  },
  "actions": [
    { "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-critical-oncall" }
  ]
}

Alert 5: Connector Failure

Trigger: External connector/tool calls failing.

Zone	Warning	Critical
Zone 2	> 10% fail	> 25% fail
Zone 3	> 5% fail	> 15% fail

{
  "name": "Agent Connector Failures",
  "description": "External connector calls experiencing high failure rate",
  "severity": 2,
  "criteria": {
    "allOf": [
      {
        "query": "traces | where timestamp > ago(15m) | where name == 'tool.invocation' | extend success = customDimensions.tool_success == 'true' | summarize FailRate = 100.0 * countif(not(success)) / count() by tostring(customDimensions.tool_name) | where FailRate > 10"
      }
    ]
  }
}

Security Alerts

Alert 6: RAI Filter Spike (Zone 3 Priority)

Trigger: Unusual increase in RAI filter activations.

{
  "name": "RAI Filter Spike - Security",
  "description": "Significant increase in RAI content filter activations - review for potential abuse",
  "severity": 1,
  "criteria": {
    "allOf": [
      {
        "query": "let baseline = traces | where timestamp between (ago(7d) .. ago(1d)) | where name == 'rai.filter' | summarize BaselineHourlyAvg = count() / (6.0 * 24); traces | where timestamp > ago(1h) | where name == 'rai.filter' | summarize CurrentCount = count() | extend baseline = toscalar(baseline) | where CurrentCount > baseline * 3"
      }
    ]
  },
  "actions": [
    { "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-security-ops" }
  ]
}
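The spike comparison is easier to review when expanded over several lines. The sketch below assumes the baseline should be an hourly rate, so that it is comparable with the one-hour current window: six days of history is 6 * 24 = 144 hourly buckets. It also adds a hypothetical floor (minEvents, not part of the original rule) so that a near-zero baseline does not alert on a handful of events.

```kusto
// Hourly baseline: events from 7 days ago to 1 day ago, averaged over 144 hours.
let baselinePerHour = toscalar(
    traces
    | where timestamp between (ago(7d) .. ago(1d))
    | where name == 'rai.filter'
    | summarize count() / (6.0 * 24)
);
// Hypothetical floor; tune or remove for your environment.
let minEvents = 5.0;
traces
| where timestamp > ago(1h)
| where name == 'rai.filter'
| summarize CurrentCount = count()
| where CurrentCount > max_of(baselinePerHour * 3, minEvents)
```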

Alert 7: DLP Block Event (Zone 3 Priority)

Trigger: DLP policy blocks agent content.

{
  "name": "Agent DLP Block",
  "description": "DLP policy blocked agent content - review for data protection",
  "severity": 2,
  "criteria": {
    "allOf": [
      {
        "query": "traces | where timestamp > ago(15m) | where name == 'dlp.block' | where customDimensions.fsi_zone == 'Zone3'",
        "timeAggregation": "Count",
        "operator": "GreaterThan",
        "threshold": 0
      }
    ]
  },
  "actions": [
    { "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-compliance" }
  ]
}

Alert 8: Access Denial Spike

Trigger: Unusual increase in access denial events.

{
  "name": "Agent Access Denial Spike",
  "description": "Spike in access denial events - potential unauthorized access attempts",
  "severity": 1,
  "criteria": {
    "allOf": [
      {
        "query": "traces | where timestamp > ago(1h) | where name == 'access.denied' | summarize DenialCount = count() | where DenialCount > 10"
      }
    ]
  },
  "actions": [
    { "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-security-ops" }
  ]
}

Alert 9: Authentication Failure Pattern

Trigger: Multiple authentication failures from same source.

{
  "name": "Agent Auth Failure Pattern",
  "description": "Multiple authentication failures detected - potential credential attack",
  "severity": 1,
  "criteria": {
    "allOf": [
      {
        "query": "traces | where timestamp > ago(30m) | where name == 'auth.failure' | summarize FailCount = count() by tostring(customDimensions.client_ip) | where FailCount > 5"
      }
    ]
  }
}

Compliance Alerts

Alert 10: Sponsor Attestation Overdue

Trigger: Agent sponsor attestation past due date.

{
  "name": "Sponsor Attestation Overdue",
  "description": "Agent sponsor attestation is overdue - compliance risk",
  "severity": 2,
  "criteria": {
    "allOf": [
      {
        "query": "traces | where timestamp > ago(24h) | where name == 'sponsor.attestation.check' | extend overdue = customDimensions.attestation_overdue == 'true' | where overdue | summarize OverdueCount = dcount(tostring(customDimensions.agent_id)) | where OverdueCount > 0"
      }
    ]
  },
  "actions": [
    { "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-compliance" }
  ]
}

Alert 11: Orphaned Agent Detected

Trigger: Agent without valid sponsor identified.

{
  "name": "Orphaned Agent Detected",
  "description": "Agent without valid sponsor - immediate attention required",
  "severity": 1,
  "criteria": {
    "allOf": [
      {
        "query": "traces | where timestamp > ago(1h) | where name == 'agent.interaction' | where customDimensions.fsi_zone in ('Zone2', 'Zone3') | where isempty(customDimensions.sponsor_id) or customDimensions.sponsor_status == 'Invalid' | distinct tostring(customDimensions.service_name)"
      }
    ]
  }
}

Governance Alerts

Alert 12: Blueprint Phase Change

Trigger: Agent promoted or demoted in Blueprint lifecycle.

{
  "name": "Blueprint Phase Change",
  "description": "Agent Blueprint lifecycle phase changed",
  "severity": 4,
  "criteria": {
    "allOf": [
      {
        "query": "traces | where timestamp > ago(15m) | where name in ('blueprint.promotion', 'blueprint.demotion') | project timestamp, AgentId = customDimensions.agent_id, FromPhase = customDimensions.source_phase, ToPhase = customDimensions.target_phase"
      }
    ]
  }
}

Alert 13: Configuration Change (Zone 3)

Trigger: Agent configuration modified.

{
  "name": "Zone 3 Agent Configuration Change",
  "description": "Zone 3 agent configuration was modified - audit required",
  "severity": 3,
  "criteria": {
    "allOf": [
      {
        "query": "traces | where timestamp > ago(15m) | where name == 'config.change' | where customDimensions.fsi_zone == 'Zone3' | project timestamp, AgentId = customDimensions.agent_id, ChangeType = customDimensions.change_type, ModifiedBy = customDimensions.modified_by"
      }
    ]
  },
  "actions": [
    { "actionGroupId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/microsoft.insights/actionGroups/ag-compliance" }
  ]
}

Action Groups Configuration

Platform Operations

{
  "name": "ag-platform-ops",
  "shortName": "PlatformOps",
  "emailReceivers": [
    {
      "name": "Platform-Ops-Email",
      "emailAddress": "platform-ops@contoso.com"
    }
  ],
  "webhookReceivers": [
    {
      "name": "Teams-PlatformOps",
      "serviceUri": "https://contoso.webhook.office.com/webhookb2/..."
    }
  ]
}

Security Operations

{
  "name": "ag-security-ops",
  "shortName": "SecOps",
  "emailReceivers": [
    {
      "name": "SecOps-Email",
      "emailAddress": "secops@contoso.com"
    }
  ],
  "smsReceivers": [
    {
      "name": "SecOps-Oncall",
      "countryCode": "1",
      "phoneNumber": "5551234567"
    }
  ]
}

Compliance Team

{
  "name": "ag-compliance",
  "shortName": "Compliance",
  "emailReceivers": [
    {
      "name": "Compliance-Team",
      "emailAddress": "compliance@contoso.com"
    }
  ]
}

Critical On-Call

{
  "name": "ag-critical-oncall",
  "shortName": "CriticalOC",
  "emailReceivers": [
    {
      "name": "Critical-Escalation",
      "emailAddress": "critical@contoso.com"
    }
  ],
  "smsReceivers": [
    {
      "name": "Oncall-Primary",
      "countryCode": "1",
      "phoneNumber": "5551234567"
    },
    {
      "name": "Oncall-Backup",
      "countryCode": "1",
      "phoneNumber": "5559876543"
    }
  ],
  "voiceReceivers": [
    {
      "name": "Oncall-Voice",
      "countryCode": "1",
      "phoneNumber": "5551234567"
    }
  ]
}

Deployment Script

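# Deploy all alert rules
# Uses the Az.Monitor cmdlets for API version 2021-08-01, where criteria are passed as
# condition objects (-CriterionAllOf) and action groups as -ActionGroupResourceId.
$alertRules = Get-ChildItem -Path "./alerts/*.json"

foreach ($rule in $alertRules) {
    $alertConfig = Get-Content $rule.FullName -Raw | ConvertFrom-Json

    # Skip rule files that omit fields the cmdlet needs (several samples above are abbreviated)
    if (-not ($alertConfig.scopes -and $alertConfig.evaluationFrequency -and $alertConfig.windowSize)) {
        Write-Warning "Skipped $($rule.Name): scopes, evaluationFrequency, and windowSize are required"
        continue
    }

    # Build one condition object per criteria.allOf entry; each entry must define
    # query, timeAggregation, operator, and threshold
    $conditions = foreach ($c in $alertConfig.criteria.allOf) {
        New-AzScheduledQueryRuleConditionObject -Query $c.query `
            -TimeAggregation $c.timeAggregation -Operator $c.operator -Threshold $c.threshold
    }

    $params = @{
        ResourceGroupName   = "rg-agent-governance"
        Name                = $alertConfig.name
        Location            = "eastus"
        Description         = $alertConfig.description
        Severity            = $alertConfig.severity
        Enabled             = ($alertConfig.enabled -ne $false)  # default to enabled when omitted
        Scope               = $alertConfig.scopes
        # Convert ISO 8601 durations such as PT5M to TimeSpan values
        EvaluationFrequency = [System.Xml.XmlConvert]::ToTimeSpan($alertConfig.evaluationFrequency)
        WindowSize          = [System.Xml.XmlConvert]::ToTimeSpan($alertConfig.windowSize)
        CriterionAllOf      = $conditions
    }
    # Attach action groups only when the rule defines them
    if ($alertConfig.actions) {
        $params.ActionGroupResourceId = @($alertConfig.actions.actionGroupId)
    }

    New-AzScheduledQueryRule @params
    Write-Host "Created alert: $($alertConfig.name)" -ForegroundColor Green
}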

Escalation Matrix

Severity	Initial Response	Escalation (15 min)	Escalation (1 hr)
Sev 0 (Critical)	On-call + Manager	Director	VP/CIO
Sev 1 (High)	On-call	Manager	Director
Sev 2 (Medium)	Platform Ops	On-call	Manager
Sev 3 (Low)	Platform Ops	-	On-call
Sev 4 (Info)	Log only	-	-

Related Documentation

Overview - Observability architecture
OpenTelemetry Setup - Collector configuration
Application Insights Workbooks - Dashboard templates
Microsoft Learn: Azure Monitor Alerts

Updated: June 2026 | Version: v1.6.2 | UI Verification Status: Current