# Control 4.10: Business Continuity and Disaster Recovery — Troubleshooting

Common issues and resolution steps for Copilot business continuity and disaster recovery.
## Common Issues
### Issue 1: No Fallback Procedures When Copilot Is Unavailable

- Symptoms: Business operations halt or significantly degrade during a Copilot outage because users have no alternative workflows.
- Root Cause: Over-reliance on Copilot without documented fallback procedures, or fallback procedures that exist but that users have not been trained on.
- Resolution:
  - Document manual fallback procedures for each Copilot-dependent business process.
  - Conduct training sessions so users know how to operate without Copilot.
  - Keep traditional tools and templates accessible as backups.
  - Run periodic "Copilot-free" exercises to maintain manual skills.
### Issue 2: Service Health Notifications Not Reaching the Right People

- Symptoms: The IT operations team is unaware of a Copilot service degradation until users report problems.
- Root Cause: Service health notification settings are misconfigured, or the notification recipient list is outdated.
- Resolution:
  - Review service health notification settings in the Microsoft 365 admin center.
  - Update notification recipients to include current IT operations and compliance contacts.
  - Use a distribution group rather than individual email addresses so notifications survive staff turnover.
  - Implement API-based monitoring as a backup to email notifications.
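API-based monitoring can be sketched against the Microsoft Graph service health API. The endpoint below is the documented v1.0 `serviceAnnouncement/healthOverviews` resource (requires an access token with `ServiceHealth.Read.All`); the sample payload and service names are illustrative assumptions, not live data.

```python
# Sketch: poll Microsoft Graph service health as a backup to email
# notifications. Requires an access token with ServiceHealth.Read.All.
import json
import urllib.request

GRAPH_URL = "https://graph.microsoft.com/v1.0/admin/serviceAnnouncement/healthOverviews"

def fetch_health_overviews(token: str) -> dict:
    """Live call to Graph; returns the healthOverviews collection as a dict."""
    req = urllib.request.Request(GRAPH_URL, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def degraded_services(payload: dict) -> list[str]:
    """Return the services whose status is anything other than operational."""
    return [s["service"] for s in payload.get("value", [])
            if s.get("status") != "serviceOperational"]

# Illustrative payload shaped like the Graph response (not live data):
sample = {"value": [
    {"service": "Microsoft Copilot (Microsoft 365)", "status": "serviceDegradation"},
    {"service": "Exchange Online", "status": "serviceOperational"},
]}
print(degraded_services(sample))  # ['Microsoft Copilot (Microsoft 365)']
```

Run on a schedule (for example every five minutes), the `degraded_services` output can feed an alerting channel that does not depend on email delivery.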
### Issue 3: BCP Plan Does Not Address Copilot Dependencies

- Symptoms: During a BCP/DR test, the Copilot dependency is not addressed, leaving a gap in the continuity plan.
- Root Cause: The BCP plan was created before Copilot deployment and has not been updated since.
- Resolution:
  - Update the BCP plan to include a Copilot-specific appendix.
  - Map Copilot dependencies to business process impact assessments.
  - Define the recovery time objective (RTO) and recovery point objective (RPO) for Copilot services.
  - Include Copilot outage scenarios in the next BCP test cycle.
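The dependency mapping in the steps above can be captured as a simple structure in the Copilot appendix. A minimal sketch, in which the process names, impact ratings, RTO targets, and fallback descriptions are all illustrative assumptions (an RPO field could be added the same way):

```python
# Sketch: a Copilot dependency map for the BCP appendix. All entries are
# illustrative; real values come from the business process impact assessment.
from dataclasses import dataclass

@dataclass
class CopilotDependency:
    process: str       # Copilot-dependent business process
    impact: str        # rating from the impact assessment
    rto_hours: int     # maximum tolerable time to restore the workflow
    fallback: str      # documented manual fallback procedure

DEPENDENCIES = [
    CopilotDependency("Client proposal drafting", "High", 4, "Manual templates in SharePoint"),
    CopilotDependency("Meeting summarization", "Low", 24, "Manual minutes"),
]

def gaps(deps: list[CopilotDependency]) -> list[str]:
    """Flag high-impact processes whose fallback procedure is missing."""
    return [d.process for d in deps if d.impact == "High" and not d.fallback]

print(gaps(DEPENDENCIES))  # []
```

Reviewing the `gaps` output at each BCP update cycle surfaces high-impact processes that still lack a documented fallback.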
### Issue 4: Extended Outage Exceeding Acceptable Business Impact

- Symptoms: A prolonged Copilot outage causes business impact beyond the defined RTO tolerance.
- Root Cause: Microsoft service recovery is beyond organizational control, and fallback procedures may not sustain operations for extended periods.
- Resolution:
  - Activate the extended outage communication plan.
  - Allocate additional manual resources to maintain business operations.
  - Monitor Microsoft service health for restoration timeline updates.
  - Document the business impact for post-incident review and BCP plan improvements.
  - Evaluate whether alternative AI services should be part of the DR strategy.
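The trigger for the extended outage steps above can be stated precisely: the communication plan activates once elapsed downtime exceeds the defined RTO. A minimal sketch, where the 4-hour tolerance is an illustrative value, not a prescribed target:

```python
# Sketch: decide when an outage has exceeded the RTO tolerance and the
# extended outage communication plan should be activated. The 4-hour
# threshold is illustrative; use the value defined in the BCP plan.
from datetime import datetime, timedelta

RTO = timedelta(hours=4)  # example tolerance from the BCP plan

def exceeds_rto(outage_start: datetime, now: datetime, rto: timedelta = RTO) -> bool:
    """True once elapsed downtime passes the recovery time objective."""
    return (now - outage_start) > rto

start = datetime(2024, 3, 1, 9, 0)
print(exceeds_rto(start, datetime(2024, 3, 1, 14, 30)))  # True (5.5 h > 4 h)
print(exceeds_rto(start, datetime(2024, 3, 1, 10, 0)))   # False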
## Diagnostic Steps
- Check service status: Run the service health script or check the Microsoft 365 admin center.
- Review notification delivery: Verify that the operations team received recent service health notifications.
- Test fallback procedures: Ask a business unit to demonstrate its Copilot fallback process.
- Review BCP plan currency: Confirm the BCP plan includes Copilot and was updated within the last 12 months.
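The plan-currency diagnostic reduces to a date comparison. A minimal sketch, where the 365-day window is an assumption standing in for "within the last 12 months":

```python
# Sketch: the "BCP plan currency" diagnostic as a date check. The 365-day
# window is an assumed reading of "within the last 12 months".
from datetime import date, timedelta

def plan_is_current(last_updated: date, today: date, max_age_days: int = 365) -> bool:
    """True when the plan was updated within the allowed window."""
    return (today - last_updated) <= timedelta(days=max_age_days)

print(plan_is_current(date(2023, 9, 1), date(2024, 6, 1)))  # True (about 9 months)
print(plan_is_current(date(2022, 1, 1), date(2024, 6, 1)))  # False
```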
## Escalation
| Severity | Condition | Escalation Path |
|---|---|---|
| Critical | Extended outage impacting client-facing operations | CIO/CTO + Business leadership + Microsoft account team |
| High | BCP plan gaps discovered during actual incident | BCP coordinator + IT leadership |
| Medium | Service health monitoring failures | IT Operations (restore monitoring) |
| Low | BCP plan update needed | BCP coordinator (schedule update) |
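The escalation table can also be encoded as a lookup so monitoring alerts are routed automatically. A minimal sketch whose paths mirror the table; the default route for unknown severities is an assumption:

```python
# Sketch: the escalation table as a lookup for automated alert routing.
# Paths mirror the table above; the unknown-severity default is assumed.
ESCALATION = {
    "Critical": ["CIO/CTO", "Business leadership", "Microsoft account team"],
    "High": ["BCP coordinator", "IT leadership"],
    "Medium": ["IT Operations"],
    "Low": ["BCP coordinator"],
}

def escalate(severity: str) -> list[str]:
    """Return the escalation path for a severity, defaulting to IT Operations."""
    return ESCALATION.get(severity, ["IT Operations"])

print(escalate("High"))  # ['BCP coordinator', 'IT leadership']
```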