Synthetic Users Disaster Recovery Plan
Document Version
- Version: 1.0
- Date: 12/12/2023
- Prepared by: Artur Ventura, CTO & CISO
Plan Overview
- Purpose: To ensure rapid and efficient recovery of Synthetic Users' operations in the event of a disaster, focusing on failures of the critical dependencies Heroku and OpenAI.
- Scope: This DRP covers processes and protocols for switching operations from Heroku to AWS in the event of a Heroku failure and from OpenAI to Anthropic in the event of an OpenAI failure.
Critical Dependencies
- Primary Dependencies: Heroku for application hosting, OpenAI for AI model integrations.
- Secondary Options: AWS for application hosting, Anthropic for AI model integrations.
Recovery Objectives
- Recovery Time Objective (RTO): The maximum acceptable time to restore critical functions after a disaster.
  - Heroku to AWS Migration: 4 hours
  - OpenAI to Anthropic Switch: 2 hours
- Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time.
  - Data Backup: 1 hour
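As an illustration, the objectives above can be written down as data and checked against incident timestamps. This is a minimal sketch only: the dictionary keys and the function names `meets_rpo` and `meets_rto` are hypothetical and not part of any existing tooling.

```python
from datetime import datetime, timedelta, timezone

# Targets taken from the Recovery Objectives above.
RTO = {
    "heroku_to_aws": timedelta(hours=4),
    "openai_to_anthropic": timedelta(hours=2),
}
RPO = timedelta(hours=1)


def meets_rpo(last_backup_at: datetime, incident_at: datetime) -> bool:
    """Return True if the data written since the last backup is within the RPO."""
    return incident_at - last_backup_at <= RPO


def meets_rto(scenario: str, incident_at: datetime, restored_at: datetime) -> bool:
    """Return True if service was restored within the scenario's RTO."""
    return restored_at - incident_at <= RTO[scenario]
```

A check of this kind could be used during the annual DRP test to record whether each simulated recovery stayed within its targets.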
Disaster Recovery Protocols
Heroku Failure - Migration to AWS
- Detection and Assessment: Monitor and quickly identify service disruption on Heroku. Confirm the outage's scope and expected duration.
- Activation of AWS Environment:
  - Pre-configured AWS environments should be maintained, mirroring the Heroku setup.
  - Initiate the AWS environment, ensuring all services and databases are operational.
- Data Migration:
  - Restore the latest data backup from Heroku (or copy directly from the database if it is accessible) to AWS.
  - Verify data integrity after migration and confirm that the restored data is no more than 1 hour old, in line with the RPO.
- DNS Update:
  - Update DNS records to point to the AWS environment, minimizing switch-over time so that the full migration stays within the RTO of 4 hours.
- Verification and Monitoring:
  - Conduct thorough testing to confirm operational functionality on AWS.
  - Monitor performance and stability closely following the switch.
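The ordering of the Heroku failover steps, and the 4-hour time budget they share, can be sketched as a small runner. This is an assumption-laden illustration: the step names and the `run_failover` function are hypothetical, and each step is a placeholder callable standing in for the real activation, restore, DNS, and verification work.

```python
import time
from typing import Callable, List, Tuple

RTO_SECONDS = 4 * 60 * 60  # Heroku to AWS RTO from the Recovery Objectives

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step], started_at: float,
                 clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Run steps in order and return the names of the completed steps.

    Raises TimeoutError if the RTO budget is already spent before a step
    starts, and RuntimeError if a step reports failure.
    """
    completed = []
    for name, action in steps:
        if clock() - started_at > RTO_SECONDS:
            raise TimeoutError(f"RTO exceeded before step: {name}")
        if not action():
            raise RuntimeError(f"Failover step failed: {name}")
        completed.append(name)
    return completed
```

Injecting the clock makes the runner easy to exercise in a simulated disaster test without waiting in real time.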
OpenAI Failure - Switch to Anthropic
- Detection and Assessment: Identify failure in OpenAI services impacting operations. Assess the impact on service offerings.
- Switch to Anthropic:
  - Anthropic models should be pre-configured in advance to closely match the functionality provided by the OpenAI models.
  - Redirect API calls from OpenAI to Anthropic, ensuring minimal changes to the integration layer.
- Verification and Adjustment:
  - Test the integration thoroughly to ensure that Anthropic models perform as expected.
  - Adjust configurations as needed to optimize performance and accuracy.
- Communication:
  - Inform internal teams about the switch to manage expectations and provide updated documentation if necessary.
  - Notify key clients of the change, emphasizing the continuity of service and quality.
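One way to keep changes to the integration layer minimal is to route every model call through a single function that tries providers in order. The sketch below assumes each provider is wrapped in a plain callable taking a prompt and returning text; `complete_with_fallback` and the provider names are hypothetical, and no real OpenAI or Anthropic SDK calls are shown.

```python
from typing import Callable, Sequence, Tuple

# A provider is a (name, call) pair; call takes a prompt and returns text.
Provider = Tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str,
                           providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, response).

    Raises RuntimeError listing every failure if no provider succeeds.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real integration would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the response lets the caller log which provider served each request, which helps when informing internal teams and clients about a switch.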
Post-Recovery Actions
- Review and Analysis: After the recovery, conduct a detailed review to analyze the response's effectiveness, documenting lessons learned.
- Plan Update: Update the DRP based on feedback and any changes in the technological landscape or business requirements.
Testing and Maintenance
- Annual DRP Testing: Simulate disaster scenarios annually to test the effectiveness of the DRP, focusing on the switch from Heroku to AWS and OpenAI to Anthropic.
- DRP Updates: Review and update the DRP semi-annually or following significant changes in technology or business operations.
Documentation and Training
- DRP Document: Maintain a comprehensive, accessible DRP document detailing all protocols, procedures, and recovery objectives.
- Training: Regularly train relevant staff on DRP protocols, ensuring clear understanding and readiness to act in the event of a disaster.