
Control 2.4: Business Continuity and Disaster Recovery — Troubleshooting

Common BC/DR issues and resolutions for Control 2.4: Business Continuity and Disaster Recovery.


Audience

M365 administrators in US financial services responding to BC/DR issues during a live event, an exercise, or an examination response. The steps below assume the Portal Walkthrough and PowerShell Setup have already been completed.


Issue Index

| # | Symptom | Likely cause | Severity in event |
|---|---------|--------------|-------------------|
| 1 | DR environment out of sync with production | Sync pipeline failure or stale copy-environment job | Blocks failover |
| 2 | Connection failures in DR after failover | DR connection references not bound, or service-account permissions missing in DR region | Blocks Tier 1 traffic |
| 3 | Customer-managed export pipeline failing | Service principal credential expired, Dataverse permissions changed, or Azure Storage RBAC drift | Blocks RPO compliance |
| 4 | Measured RTO exceeds target | Manual steps in runbook, slow approval chain, or under-provisioned DR environment | Examiner finding |
| 5 | Data integrity issues after restore | Solution export missing components (knowledge sources, custom topics) or hash mismatch | SEC 17a-4 reconstruction risk |
| 6 | Failback data conflicts | Concurrent writes in DR and primary during failover/failback overlap | Records integrity risk |
| 7 | Microsoft system backup not visible in PPAC | Environment recently migrated, region change, or temporary service issue | Operational only |
| 8 | Cross-region restore not offered | Microsoft does not provide native cross-region Dataverse restore | Architectural — confirms need for customer-managed exports |
| 9 | Entra Agent ID / service principal cannot authenticate to DR | Application user not registered in DR environment, or federated credential expired | Blocks Tier 1 traffic |
| 10 | Service Health alerts missed | Subscription routed to single mailbox, on-call rotation not on the DL, or Graph rate-limit on polling job | Detection delay |

Issue 1 — DR Environment Out of Sync

Symptoms: DR agent version, environment variables, or solution components do not match production.

Diagnosis:

  1. PPAC → DR environment → Solutions → compare solution Version and Last installed with production
  2. Review the most recent ALM pipeline run or Copy-PowerAppEnvironment job log
  3. Confirm service principal used by the sync job still has Dataverse System Administrator role in both environments

Resolution:

  1. Run a fresh export from production using the PowerShell Setup §4 script (or pac solution export ... --managed true)
  2. Import to DR using pac solution import or PPAC → Import solution
  3. Re-apply DR-specific environment variable values (regional API endpoints, storage account names)
  4. Re-bind connection references using DR-region service accounts
  5. Add monitoring on the sync pipeline so failures alert the DR distribution list, not a single mailbox
  6. If sync drift recurs, increase sync frequency (for Zone 3, the staleness budget is typically ≤ 24 hours)

Issue 2 — Connection Failures in DR After Failover

Symptoms: Agents in the DR environment return errors when calling APIs, Dataverse tables, or premium connectors.

Diagnosis:

  • Power Apps maker portal in DR → Connections → check connection status
  • DR environment → Connection references → verify each reference is bound to a working connection
  • Test the underlying connector with a one-step Power Automate flow

Resolution:

  1. Re-bind connection references: DR environment → Solutions → open solution → Connection references → for each, Edit and select the DR-region connection
  2. Verify service-account presence in the DR region: PPAC → DR environment → Settings → Users + permissions → Application users → confirm each agent identity is present and assigned the same Dataverse security role
  3. Validate network reachability:
     • For VNet-integrated connectors, confirm the DR-region VNet integration gateway is provisioned
     • For ExpressRoute / private endpoint dependencies, verify routing to the DR region
  4. Refresh OAuth tokens: some connectors require interactive re-consent in the DR region; capture this in the runbook so it does not block the next failover
  5. Smoke-test each connector with a flow before declaring the failover complete
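The final smoke-test step benefits from an explicit pass/fail gate rather than an eyeball check. A minimal sketch — connector names and probe callables are illustrative placeholders, not a Power Platform API:

```python
# Minimal sketch: run one smoke test per connector and only declare
# failover complete when every probe passes. The probes themselves would
# wrap whatever one-step flow or API call exercises each connection.
def run_smoke_tests(tests: dict) -> dict:
    """Map each connector name to True (pass) or False (fail)."""
    results = {}
    for name, probe in tests.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False  # a raising probe is a failed connector
    return results

def failover_complete(results: dict) -> bool:
    """Gate: every connector passed, and at least one was tested."""
    return bool(results) and all(results.values())
```

Logging the per-connector results also gives the exercise report concrete evidence for each failover declaration.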

Issue 3 — Customer-Managed Export Pipeline Failing

Symptoms: Scheduled solution or data export jobs are not landing in the immutable Azure Blob container; RPO at risk.

Diagnosis:

  • Azure DevOps / GitHub Actions / Azure Automation run history — identify the failing step
  • Confirm the service principal still has:
  • Dataverse System Administrator (or equivalent) on the source environment
  • Storage Blob Data Contributor on the target Storage Account / container
  • Check Azure Blob immutability policy — once locked, version-level immutability cannot be reduced; ensure the policy hasn't drifted from the documented baseline

Resolution:

  1. Renew or rotate service principal credentials; prefer federated credentials (workload identity federation) to avoid manual secret rotation
  2. Reapply RBAC on both Dataverse (application user role) and Storage (Storage Blob Data Contributor on the container)
  3. Validate immutability policy is enabled and the retention period meets the documented requirement (typically ≥ 6 years for SEC 17a-4)
  4. Test manual export via PowerShell Setup §4; compare SHA-256 against a prior known-good export
  5. Open a CAB ticket documenting the failure window, the gap against RPO, and remediation; this becomes part of the next exercise evidence
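Step 4's hash comparison can be scripted so it runs identically in exercises and real events. A sketch assuming each export `.zip` has a sidecar file containing its SHA-256 hex digest (the file layout is hypothetical):

```python
import hashlib
from pathlib import Path

# Sketch of step 4: compare a fresh export's SHA-256 against the digest
# recorded in a known-good sidecar file. File names are hypothetical.
def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Stream in 1 MiB chunks so large solution exports don't load whole.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def export_matches_sidecar(export_zip: Path, sidecar: Path) -> bool:
    expected = sidecar.read_text().split()[0].lower()
    return sha256_of(export_zip) == expected
```

The same check doubles as the restore-time integrity verification described under Issue 5.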

Issue 4 — Measured RTO Exceeds Target

Symptoms: Exercise documentation shows tier RTO not met; examiner finding likely.

Diagnosis:

  • Review exercise timestamps phase-by-phase
  • Identify steps that took longer than the runbook target
  • Check whether delays were process (waiting for approval), technical (slow re-binding), or organizational (on-call manager unavailable)

Resolution:

  1. Streamline the runbook: pre-authorize tier-specific declaration so the on-call manager does not wait for executive approval inside the RTO window
  2. Pre-stage in DR: keep the DR environment within the staleness budget so import time at failover is zero
  3. Automate the cutover: script DNS / Front Door / channel publish changes; manual portal clicks are the most common time sink
  4. Pre-bind DR connection references during steady-state with a "shadow" identity that is dormant until failover; activate at failover rather than re-bind
  5. Up-tier DR capacity (capacity add-ons) if cold start of the DR environment causes user-visible latency
  6. Increase exercise frequency so the team is practiced; FFIEC examiners view this favorably
  7. Document the RTO miss in the BIA as a known constraint and adjust either the RTO target (with BCP committee approval) or the architecture
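The phase-by-phase timestamp review in the diagnosis is easier when exercise timestamps are reduced to durations automatically. A sketch with illustrative phase names and targets (minutes):

```python
from datetime import datetime

# Sketch: turn ordered exercise timestamps into per-phase durations and
# flag the phases that exceeded their runbook target. Phase names and
# targets are illustrative, not taken from any specific runbook.
def phase_durations(timestamps: list[tuple[str, datetime]]) -> dict:
    """timestamps: ordered (phase_name, phase_start); the last entry marks the end."""
    out = {}
    for (name, start), (_, end) in zip(timestamps, timestamps[1:]):
        out[name] = (end - start).total_seconds() / 60  # minutes
    return out

def over_target(durations: dict, targets: dict) -> list[str]:
    """Phases whose measured duration exceeded the runbook target."""
    return [p for p, mins in durations.items()
            if mins > targets.get(p, float("inf"))]
```

Attaching the computed durations to the exercise report makes the process-vs-technical-vs-organizational breakdown above defensible to an examiner.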

Issue 5 — Data Integrity Issues After Restore

Symptoms: Restored agent has missing knowledge sources, missing topics, or returns "configuration not found" errors.

Diagnosis:

  • Compare the restored solution component count with the source export manifest
  • Verify the SHA-256 hash of the restored .zip against the hash sidecar from immutable storage
  • Confirm whether the missing component is a solution-aware component (covered by export) or a non-solution-aware component (knowledge source, certain custom topics)

Resolution:

  1. Hash mismatch → restore from a different known-good export and re-verify; investigate the storage account integrity
  2. Solution component missing → the source export was incomplete; review the export script and its solution component selection
  3. Knowledge source missing → invoke the documented secondary export procedure for non-solution-aware components; this is a known limitation of solution export and must be addressed by a separate backup procedure
  4. Environment variable values missing → restore environment variable values from the documented DR variable set, not from the production export (the values typically differ per region)
  5. Document any reconstruction effort for the regulatory record — this matters for SEC 17a-4 reconstruction obligations
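The manifest comparison and the solution-aware / non-solution-aware split above reduce to a small set diff. Component names here are illustrative:

```python
# Sketch of the first diagnosis step: diff the restored component list
# against the export manifest, then split the gap by whether the missing
# component is covered by solution export or needs the secondary
# (non-solution-aware) backup procedure. All names are illustrative.
def missing_components(manifest: set[str], restored: set[str]) -> set[str]:
    """Components present in the export manifest but absent after restore."""
    return manifest - restored

def classify_missing(missing: set[str], non_solution_aware: set[str]) -> dict:
    return {
        "solution_aware": missing - non_solution_aware,      # export was incomplete
        "needs_secondary_export": missing & non_solution_aware,  # known limitation
    }
```

The `solution_aware` bucket points at a broken export script (resolution step 2); the `needs_secondary_export` bucket points at the secondary procedure (resolution step 3).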

Issue 6 — Failback Data Conflicts

Symptoms: After failback to primary, conflicting records exist in primary and DR; users see stale or duplicate data.

Diagnosis:

  • Identify the DR operating window (failover declaration → failback declaration)
  • Inventory tables that received writes during the window
  • For each conflicting row, determine source-of-truth (DR is typically authoritative for the operating window)

Resolution:

  1. Document the DR operating window with start and end timestamps
  2. Export DR state before failback (managed solution + relevant Dataverse table snapshots)
  3. Reconcile in primary:
     • For low-volume tables, manual merge by a Dataverse System Administrator
     • For high-volume tables, use a Power Automate or Azure Synapse Link reverse-sync flow
  4. Consider "DR becomes primary" if the operating window was long or the merge cost exceeds the value of returning to the original primary; in this case, update DNS / endpoint configuration permanently and rebuild the old primary as the new DR
  5. Capture the reconciliation in the post-event report under Control 3.4 for examiner traceability
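The source-of-truth rule from the diagnosis ("DR is typically authoritative for the operating window") can be encoded so per-row reconciliation decisions are consistent and auditable. A sketch, assuming each row exposes a last-modified timestamp in both copies:

```python
from datetime import datetime

# Sketch of the source-of-truth rule: for a row modified in both copies,
# prefer the DR copy when its last write falls inside the DR operating
# window and is not older than the primary's last write.
def pick_source_of_truth(primary_modified: datetime,
                         dr_modified: datetime,
                         window_start: datetime,
                         window_end: datetime) -> str:
    dr_in_window = window_start <= dr_modified <= window_end
    if dr_in_window and dr_modified >= primary_modified:
        return "dr"
    return "primary"  # includes writes landed in primary after failback
```

Recording each decision alongside the two timestamps gives the post-event report (step 5) a row-level audit trail.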

Issue 7 — Microsoft System Backup Not Visible in PPAC

Symptoms: PPAC → Environment → Backups shows no system backups for the past 24 hours.

Diagnosis:

  • Check Microsoft 365 Service Health for active Dataverse advisories
  • Confirm the environment was not recently migrated, restored, or had its region changed (these can briefly suppress visible system backups)

Resolution:

  1. Wait 24 hours and re-check; system backups are continuous and gaps usually self-heal
  2. Trigger a manual backup in the interim (PowerShell Setup §2) so the change window is not blocked
  3. If the gap persists more than 48 hours, open a Microsoft support case (severity B for production)
  4. Document the gap in the BIA risk log so the next exercise references it

Issue 8 — Cross-Region Restore Not Offered

Symptoms: PPAC restore dialog only offers in-region restore; DR-region restore is not an option.

Cause: This is the documented Microsoft platform behavior — Microsoft does not provide native cross-region Dataverse restore. System backups are tied to the source region.

Resolution (architectural, not a defect):

  1. Confirm the customer-managed export-to-Blob pipeline is the recovery path for cross-region scenarios
  2. Confirm a secondary-region environment is pre-provisioned per Portal Walkthrough §2
  3. Confirm solution import to the DR environment is part of the runbook, not a system-backup restore
  4. Document this Microsoft platform behavior in the BIA so examiners understand the customer-side controls that compensate

Issue 9 — Entra Agent ID / Service Principal Cannot Authenticate to DR

Symptoms: Token acquisition succeeds against Entra, but Dataverse calls in the DR environment return 401 or Forbidden.

Diagnosis:

  • Use PowerShell Setup §6 WhoAmI smoke test
  • PPAC → DR environment → Settings → Users + permissions → Application users → search for the app's appId
  • Microsoft Entra admin center → confirm federated credential or client secret is valid

Resolution:

  1. App user missing in DR → register the application user in the DR environment with the same Dataverse security role assigned in production
  2. Credential expired → rotate credential per the documented rotation procedure; prefer federated credentials for unattended workloads
  3. API permission scope mismatch → confirm the application has the same Graph / Dataverse delegated and application permissions; admin-consent in DR if a tenant-scoped consent record is required
  4. Conditional Access blocking → confirm the DR environment URL is included in the Conditional Access named-locations and policies that target the agent's identity
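When token acquisition succeeds but Dataverse still returns 401, it can help to decode the token Entra issued and confirm its audience matches the DR environment URL. This sketch performs no signature validation (it is a troubleshooting aid only), and the environment URL in the test is hypothetical:

```python
import base64
import json

# Diagnostic sketch: decode a JWT payload (middle segment) and read its
# audience claim. No signature check — for troubleshooting only.
def token_audience(access_token: str) -> str:
    payload_b64 = access_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64)).get("aud", "")

def audience_matches(access_token: str, dr_env_url: str) -> bool:
    """True when the token was issued for the DR environment's resource."""
    return token_audience(access_token).rstrip("/") == dr_env_url.rstrip("/")
```

A token whose audience is the production environment URL will authenticate at Entra yet be rejected by the DR Dataverse endpoint, which matches the 401-after-successful-token symptom above.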

Issue 10 — Service Health Alerts Missed

Symptoms: A Microsoft Power Platform / Dataverse advisory was published but the DR team did not receive it in time.

Diagnosis:

  • Microsoft 365 admin center → Service health → confirm the advisory was published
  • Verify the alert subscription audience (must be a distribution list, not a single mailbox)
  • Confirm the on-call rotation includes membership in the DL

Resolution:

  1. Re-subscribe to a distribution list that the on-call rotation belongs to
  2. Add Teams channel notifications by routing the Graph Get-MgServiceAnnouncementMessage output (PowerShell Setup §7) to a Teams DR channel
  3. Add a daily digest job so missed real-time notifications are caught within 24 hours
  4. Document detection time as a separate SLA in the BIA — detection delay is part of the total RTO
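The daily digest job in step 3 reduces to filtering advisories by their last-modified time. A sketch assuming the messages arrive as dicts with parsed timestamps (the real Graph service-announcement payload uses ISO-8601 strings, so a parsing step would precede this):

```python
from datetime import datetime

# Sketch of the daily digest job: keep only advisories modified since the
# last digest run, oldest first. The message shape is illustrative.
def digest(messages: list[dict], since: datetime) -> list[dict]:
    return sorted(
        (m for m in messages if m["lastModifiedDateTime"] > since),
        key=lambda m: m["lastModifiedDateTime"],
    )
```

Persisting the `since` watermark between runs guarantees that an advisory missed in real time is surfaced by the next digest, bounding detection delay at roughly one digest interval.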

Escalation Path

If issues cannot be resolved using this guide:

| Level | Role | When to engage |
|-------|------|----------------|
| 1 | IT Operations | Pipeline failures, automation errors, runbook execution support |
| 2 | Power Platform Admin | Environment configuration, Managed Environment policy, capacity issues |
| 3 | Entra Agent ID Admin | Identity / federated credential issues affecting DR authentication |
| 4 | AI Governance Lead | Process, scope, runbook accuracy, exercise findings |
| 5 | Compliance Officer | Examiner inquiries, regulatory recordkeeping concerns |
| 6 | Microsoft Support | Confirmed platform-level incidents (open Sev B for production) |


Updated: April 2026 | Version: v1.4.0