4.2 Reliability & Resilience
Purpose
Reliability & Resilience focuses on the solution’s ability to withstand failures, recover from disruptions, and scale to meet demand. This quality attribute is closely tied to the Physical View (3.3) and Data View (3.4) for infrastructure and data backup details. Evaluate this quality attribute across all architectural views documented in Section 3.
4.2.1 Geographic Footprint & Disaster Recovery
| Question | Response |
|---|---|
| Is the application deployed across multiple hosting venues for continuity? | Yes / No - [details] |
| What is the DR strategy? | Active-Active / Active-Passive / Pilot Light / Backup & Restore |
| Are there data sovereignty requirements affecting geographic choices? | Yes / No - [details] |
[Insert geographic deployment diagram if applicable]
4.2.2 Scalability
Application Scalability
| Attribute | Response |
|---|---|
| Scaling capability | No dynamic scaling (pre-sized) / Manual scaling / Partial auto-scaling / Full auto-scaling |
| Scaling details | [describe how scaling works, triggers, limits] |
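When filling in the scaling details row, it helps to state the trigger, the scaling rule, and the hard limits explicitly. The sketch below illustrates one common rule, target-utilisation scaling; the function name, thresholds, and limits are illustrative assumptions, not values prescribed by this template.

```python
# Minimal sketch of a target-utilisation scaling rule. All names and
# thresholds here are illustrative assumptions, not template values.

def desired_replicas(current: int, utilisation: float,
                     target: float = 0.60,
                     min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Grow or shrink the fleet proportionally to load, within hard limits."""
    if utilisation <= 0:
        return min_replicas
    proposed = round(current * utilisation / target)
    # Clamp to the documented scaling limits.
    return max(min_replicas, min(max_replicas, proposed))

# Example: 4 replicas running at 90% CPU against a 60% target -> 6.
print(desired_replicas(current=4, utilisation=0.90))
```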
Dependency Scalability
| Attribute | Response |
|---|---|
| Dependencies adequately sized? | Yes (confirmed) / Unconfirmed / Known insufficient |
| Dependency details | [describe scaling posture of dependencies] |
4.2.3 Fault Tolerance
Has the application been designed to tolerate unexpected disruptions such as failure or degradation of internal components or external dependencies?
- Yes - [describe fault tolerance design, including:]
  - How the application handles component failures
  - Graceful degradation strategies
  - Circuit breaker or retry patterns (see the sketch after this list)
  - Health check and self-healing mechanisms
  - Testing practices (chaos engineering, game days)
- No - [describe why not]
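For the circuit breaker or retry bullet above, a minimal retry-with-backoff wrapper is sketched below. It assumes the dependency signals failure by raising an exception; the function names and delays are illustrative, not a prescribed implementation.

```python
import random
import time

# Hedged sketch of retry-with-exponential-backoff, one of the fault
# tolerance patterns listed above. Parameters are illustrative.

def call_with_retries(operation, attempts: int = 3,
                      base_delay_s: float = 0.2, max_delay_s: float = 5.0):
    """Retry a flaky call; on exhaustion, re-raise so the caller can degrade."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # retries exhausted: surface the failure
            # Exponential backoff with jitter to avoid thundering herds.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.0))
```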
4.2.4 Failure Modes & Recovery Behaviour
Document how the solution behaves when individual components or dependencies fail:
| Component / Dependency | Failure Mode | Detection Method | Recovery Behaviour | User Impact |
|---|---|---|---|---|
| [component] | [how it fails] | [how detected: health check, alert, timeout] | [auto-restart, failover, graceful degradation, manual intervention] | [full outage, degraded service, transparent] |
Guidance
For each critical component, consider:
- What happens when it becomes unavailable? Does the solution fail entirely, degrade gracefully, or continue with reduced functionality?
- How is the failure detected? Health checks, heartbeats, error thresholds, timeouts?
- How is it recovered? Automatic restart, failover to secondary, circuit breaker, manual intervention?
- What do users experience? Full outage, degraded experience, increased latency, or transparent failover?
This section is frequently missing from architecture documents but is one of the most valuable for operations teams and SREs.
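As a concrete illustration of the detection question above, the sketch below shows a minimal HTTP liveness probe in which a timeout doubles as the detection mechanism. The endpoint URL and thresholds are assumptions for illustration.

```python
import urllib.request

# Hedged sketch of a liveness probe: a slow or erroring response is
# treated as a failure. Endpoint and timeout are illustrative.

def is_healthy(url: str = "http://localhost:8080/healthz",
               timeout_s: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception:
        return False  # a timeout or connection error counts as detection
```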
4.2.5 Backup & Recovery
Backup Design
| Attribute | Detail |
|---|---|
| Backup strategy | [what is backed up and how] |
| Backup product/service | [tool used] |
| Backup type | Full / Incremental / Differential |
| Backup frequency | [schedule] |
| Backup retention | [period] |
Backup Protection
| Control | Detail |
|---|---|
| Immutability | [how backups are protected against modification/deletion] |
| Encryption | [how backup data is encrypted] |
| Access control | [who can access backups] |
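Where backups land in object storage, one way to make the immutability row concrete is a write-once retention lock. The sketch below uses S3 Object Lock via boto3 as one possible control; the bucket name and retention period are assumptions, and Object Lock must have been enabled when the bucket was created.

```python
import boto3

# Hedged sketch: enforce a default write-once retention period on a
# backup bucket. Bucket name and retention days are illustrative.
s3 = boto3.client("s3")
s3.put_object_lock_configuration(
    Bucket="example-backup-bucket",  # hypothetical bucket name
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            # COMPLIANCE mode: object versions cannot be overwritten or
            # deleted by any user until the retention period expires.
            "DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30},
        },
    },
)
```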
4.2.6 Recovery Scenarios
Document how the solution recovers under different failure scenarios:
| # | Scenario | Recovery Approach | RTO | RPO |
|---|---|---|---|---|
| 1 | Primary hosting venue / AZ / region failure | [approach] | [time] | [time] |
| 2 | Critical software component failure | [approach] | [time] | [time] |
| 3 | Key infrastructure failure (hardware, storage, network) | [approach] | [time] | [time] |
| 4 | Network connectivity failure between venues | [approach] | [time] | [time] |
| 5 | External connectivity failure (customer-facing) | [approach] | [time] | [time] |
| 6 | Ransomware / cyber-attack | [approach] | [time] | [time] |
| 7 | Accidental or malicious data corruption / deletion | [approach] | [time] | [time] |
Guidance
For each scenario, describe:
- How the failure is detected
- Automatic vs. manual recovery steps
- Expected Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
- Any dependencies on other teams or systems for recovery
- Whether recovery has been tested and when
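The RPO figures entered above should be sanity-checked against the backup schedule documented in 4.2.5. A hedged worked example of that arithmetic (all figures are illustrative, not targets from this template):

```python
# Worst-case RPO for a backup-and-restore recovery: corruption strikes
# just before the next backup completes, so the last usable copy is a
# full interval (plus the backup run time) old. Figures are illustrative.
backup_interval_h = 4.0   # incremental backup every 4 hours
backup_duration_h = 0.5   # each backup run takes 30 minutes

worst_case_rpo_h = backup_interval_h + backup_duration_h
print(f"Worst-case RPO ≈ {worst_case_rpo_h} h")  # 4.5 h
```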
Scoring Guidance
| Score | What This Looks Like |
|---|---|
| 1 | DR strategy identified but RTO/RPO not defined |
| 3 | DR strategy documented with RTO/RPO targets, backup configured, scalability approach defined |
| 5 | All of the above plus fault tolerance designed, chaos testing practised, backup immutability and encryption confirmed, DR tested |