4.1 Operational Excellence

Recommended

See also: AWS Ops Excellence, Azure Ops Excellence, GCP Ops Excellence

Operational Excellence focuses on how the solution is monitored, operated, and improved over time. It covers observability (logging, metrics, tracing), alerting, capacity management, and operational procedures. Evaluate this quality attribute across all architectural views documented in Section 3.

Recommended
| Log Type | Events Logged | Local Storage | Retention Period | Remote Services |
| --- | --- | --- | --- | --- |
| Application logs | [what is logged] | [file system, database] | [period] | [e.g., Datadog, CloudWatch] |
| Data store logs | [what is logged] | [location] | [period] | [remote service] |
| Infrastructure logs | [what is logged] | [location] | [period] | [remote service] |
| Security event logs | [what is logged] | [location] | [period] | [SIEM service] |

Guidance

For each log type, document:

  • What events are captured (application errors, access logs, audit events, etc.)
  • Where logs are stored locally within the application
  • How long logs are retained before rotation or deletion
  • Whether logs are forwarded to centralised logging or SIEM services
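
As an illustration of how the local storage and retention entries might map onto an application, the following is a minimal sketch using Python's standard `logging` module. The logger name `app`, the file name `app.log`, and the 30-day default are assumptions made for the example, not values prescribed by this template. Forwarding to a remote service is not shown; it is often handled by a separate agent that reads the local file.

```python
import logging
import logging.handlers
from pathlib import Path

# Hypothetical default; use the retention period recorded in the table.
RETENTION_DAYS = 30


def build_app_logger(log_dir: Path, retention_days: int = RETENTION_DAYS) -> logging.Logger:
    """Return a logger that writes to log_dir/app.log and rotates daily.

    backupCount bounds how many rotated files are kept, so with daily
    rotation it acts as the local retention period in days.
    """
    logger = logging.getLogger("app")  # assumed logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    handler = logging.handlers.TimedRotatingFileHandler(
        log_dir / "app.log",
        when="midnight",             # rotate once per day
        backupCount=retention_days,  # older rotated files are deleted
        encoding="utf-8",
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With this arrangement the retention period lives in one place (`retention_days`), which makes it straightforward to keep the documented value and the configured value in agreement.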

4.1.2 Observability - Monitoring & Alerting

Recommended

Describe how operational alerts are implemented:

| Alert Category | Trigger Condition | Notification Method | Recipient |
| --- | --- | --- | --- |
| [e.g., Application error rate] | [threshold] | [email, PagerDuty, Slack] | [team/role] |

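
To make the relationship between the alert columns concrete, here is a small sketch of how one alert row could be evaluated. The `AlertRule` type, the `evaluate` function, and the 5% threshold in the usage note are hypothetical names and values chosen for the example; real monitoring tools express the same idea in their own rule syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    """One alert row: category, trigger condition, method, recipient."""
    category: str
    threshold: float   # fire when the observed ratio exceeds this value
    notify_via: str
    recipient: str


def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def evaluate(rule: AlertRule, observed: float) -> Optional[str]:
    """Return a notification message if the rule fires, else None."""
    if observed > rule.threshold:
        return (
            f"[{rule.notify_via}] {rule.recipient}: {rule.category} at "
            f"{observed:.1%} exceeds {rule.threshold:.1%}"
        )
    return None
```

For example, a rule with `threshold=0.05` evaluated against 8 errors in 100 requests returns a message, while 2 errors in 100 requests returns `None`.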

| Capability | Tool | Coverage |
| --- | --- | --- |
| Application Performance Monitoring | [e.g., Datadog, New Relic] | [which components] |
| Infrastructure Monitoring | [e.g., CloudWatch, Prometheus] | [which resources] |
| Log Aggregation | [e.g., ELK, Splunk, Datadog Logs] | [which log sources] |
| Distributed Tracing | [e.g., Jaeger, X-Ray, Datadog APM] | [which services] |
Comprehensive
| Question | Response |
| --- | --- |
| What metrics are collected for capacity monitoring? | [CPU, memory, storage, network, queue depth] |
| How are capacity trends analysed? | [tools, dashboards, reports] |
| Are capacity thresholds and alerts configured? | [threshold details] |
| Is there a capacity planning process? | [process description] |
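
One simple way to analyse a capacity trend is to fit a straight line to recent usage samples and estimate when the line crosses a configured threshold. The sketch below assumes one sample per day expressed as a percentage; the function name `days_until_threshold` is hypothetical, and a linear fit is only one possible model, so treat the result as a rough indicator rather than a forecast.

```python
from typing import Optional, Sequence


def days_until_threshold(
    daily_usage_pct: Sequence[float], threshold_pct: float
) -> Optional[float]:
    """Estimate days from the latest sample until usage reaches the threshold.

    Uses an ordinary least-squares line through (day index, usage).
    Returns None when there are too few samples or usage is not growing.
    """
    n = len(daily_usage_pct)
    if n < 2:
        return None

    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage_pct))

    slope = sxy / sxx  # percentage points per day
    if slope <= 0:
        return None    # flat or shrinking usage never reaches the threshold

    intercept = mean_y - slope * mean_x
    day_at_threshold = (threshold_pct - intercept) / slope
    # Measure from the most recent sample; 0.0 means already at or past it.
    return max(0.0, day_at_threshold - (n - 1))
```

With samples `[50, 52, 54, 56, 58]` and a threshold of 80, usage grows by 2 points per day and the estimate is 11 days from the last sample.
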
Comprehensive

Document key operational procedures and runbooks:

| Procedure | Description | Owner | Documentation |
| --- | --- | --- | --- |
| Incident response | [how incidents are detected and resolved] | [team] | [link] |
| Change management | [how changes are approved and deployed] | [team] | [link] |
| Escalation paths | [escalation procedures] | [team] | [link] |
| On-call rotation | [on-call structure] | [team] | [link] |

Scoring Guidance

| Score | What This Looks Like |
| --- | --- |
| 1 | Monitoring tool identified but not configured |
| 3 | Centralised logging, monitoring, and alerting in place; runbooks documented |
| 5 | All of the above, plus distributed tracing enabled, dashboards defined, and incident response procedures tested |