
Example: Stellar Platform (Internal Developer Platform)

About This Example

This is a fictional but realistic Solution Architecture Document for Stellar Platform, an Internal Developer Platform (IDP) at Stellar Engineering Ltd — a 400-engineer B2B SaaS company. It demonstrates the ADS standard at Recommended documentation depth, appropriate for a Tier 3 internal productivity platform with no direct customer impact.

The example is written in the language of modern platform engineering: Team Topologies, cognitive load, golden paths, paved roads, platform-as-a-product, and DevEx. Use it as a reference when writing your own SAD for an internal platform or developer experience initiative.


| Field | Value |
| --- | --- |
| Document Title | Solution Architecture Document — Stellar Platform (Internal Developer Platform) |
| Application / Solution Name | Stellar Platform |
| Application ID | APP-1042 |
| Author(s) | Tom Bloggs, Principal Platform Engineer |
| Owner | Tom Bloggs, Principal Platform Engineer |
| Version | 1.0 |
| Status | Approved |
| Created Date | 2026-01-14 |
| Last Updated | 2026-04-18 |
| Classification | Internal |
| Version | Date | Author / Editor | Description of Change |
| --- | --- | --- | --- |
| 0.1 | 2026-01-14 | Tom Bloggs | Initial draft following platform strategy workshop |
| 0.2 | 2026-02-05 | Claire Doe | Added developer journey scenarios and DevEx metrics |
| 0.3 | 2026-02-27 | Amir Bloggs | Added SRE-facing sections: observability, reliability, on-call model |
| 0.4 | 2026-03-20 | Tom Bloggs | Incorporated feedback from Platform Advisory Group; added ADR-003 (multi-cloud) |
| 1.0 | 2026-04-18 | Tom Bloggs | Approved by Architecture Review Board |
| Name | Role | Contribution Type |
| --- | --- | --- |
| Tom Bloggs | Principal Platform Engineer (Platform Lead) | Author |
| Claire Doe | Developer Experience Lead | Author |
| Amir Bloggs | Site Reliability Engineering Lead | Author |
| Jane Doe | Product Manager (Stellar Platform) | Reviewer |
| Priya Bloggs | Head of Engineering | Reviewer |
| Joe Bloggs | Security Architect | Reviewer |
| Sam Doe | FinOps Lead | Reviewer |
| Architecture Review Board | Governance | Approver |

This SAD describes the architecture of Stellar Platform — a self-service Internal Developer Platform (IDP) that provides Stellar Engineering Ltd’s 60 stream-aligned product teams with golden paths for service creation, deployment, observability, and day-2 operations.

  • Scope boundary: The Backstage developer portal, the platform control plane (Crossplane, Terraform), the delivery plane (ArgoCD, Tekton), the observability stack (Prometheus, Grafana, Datadog), and the golden-path templates they expose. Includes the GKE (primary) and EKS (secondary) Kubernetes fleets that host both the platform itself and the tenant workloads of its internal customers (the product teams).
  • Out of scope: The individual product-team services that run on the platform (documented by their owning teams), the corporate identity provider (Okta, documented under APP-0008), and the customer-facing Stellar SaaS product (documented under APP-0100).
  • Related documents: Stellar Engineering Platform Strategy 2026-2028 (STRAT-0004), Platform-as-a-Product Operating Model (POL-0031), Stellar Cloud Landing Zone Standards (STD-0012), Information Security Policy (POL-0001).

Stellar Platform is an Internal Developer Platform (IDP) built on Backstage that offers Stellar’s 400 engineers a curated, self-service experience for the entire software delivery lifecycle. It exposes a small number of well-paved golden paths — opinionated templates and automation — that reduce the cognitive load on stream-aligned product teams and let them ship independently without having to reason about Kubernetes manifests, Terraform modules, IAM boundaries, or observability wiring.

The platform is architected as three loosely-coupled planes:

  • Portal plane: A Backstage instance acting as the single pane of glass for discovery, self-service actions, software catalogue, TechDocs, and scorecards.
  • Control plane: Crossplane-managed infrastructure abstractions, Terraform for everything Crossplane cannot model yet, GitHub as the source of truth, and Dagger for reusable CI pipelines.
  • Runtime plane: A federated fleet of Kubernetes clusters (GKE as primary, EKS as secondary), delivered via ArgoCD (GitOps) and Tekton (for build and security pipelines), with observability provided by a Prometheus + Grafana stack and Datadog for cross-cloud APM and incident workflow.
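
For illustration, this is roughly what a service's entry in the portal plane's software catalogue looks like. The file name, `apiVersion`, and `github.com/project-slug` annotation follow standard Backstage conventions; the service, system, and the `stellar.io/tier` annotation are hypothetical examples, not taken from the production catalogue.

```yaml
# catalog-info.yaml — lives in the service's repository root (illustrative sketch)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-gateway            # hypothetical service name
  description: Handles card payment orchestration
  annotations:
    github.com/project-slug: stellar-eng/payments-gateway   # drives source and CI links in the portal
    stellar.io/tier: tier-2                                  # hypothetical platform-specific annotation
spec:
  type: service
  lifecycle: production
  owner: team-payments              # Okta-synced group, resolved via the catalogue
  system: checkout                  # hypothetical system grouping
```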

The platform is treated as a product. It has a product manager, a roadmap, user research cadence, and opt-in adoption — teams can route around it, but we design the paved road to be the path of least resistance.

| Driver | Description | Priority |
| --- | --- | --- |
| Developer productivity | Lead time for changes has stretched from 2 days to 9 days as the estate has grown; new service bootstrapping takes 3-6 weeks of coordination across SRE, Security, and Platform | Critical |
| Cognitive load | Product teams are carrying too many accidental responsibilities (clusters, pipelines, IAM, alerting) instead of focusing on customer value | High |
| Fragmentation | 14 different CI patterns, 6 Terraform module styles, 4 Kubernetes deployment approaches, and 3 competing observability stacks across teams | High |
| Reliability | Production incidents increasingly rooted in configuration drift, unclear ownership, and inconsistent runbooks; change failure rate at 18% (DORA high-performer threshold is 15%) | High |
| Security | Inconsistent supply-chain controls and secret handling across teams; audit findings in SOC 2 Type II report | High |
| Cost | Cloud spend grew 42% YoY against 18% revenue growth; no unified FinOps view across teams | Medium |
| Question | Response |
| --- | --- |
| Which organisational strategy or initiative does this solution support? | Stellar Engineering Platform Strategy 2026-2028: pillar 2 (“reduce cognitive load on stream-aligned teams”) and pillar 4 (“engineer productivity and DORA elite performance”) |
| Has this solution been reviewed against the organisation’s capability model? | Yes — reviewed by the Enterprise Architecture Council 2026-02-12 |
| Does this solution duplicate any existing capability? | No — it explicitly consolidates and retires fragmented capabilities (see Current State) |
| Capability | Shared Service / Platform | Reused? | Justification / Notes |
| --- | --- | --- | --- |
| Source control | GitHub Enterprise (corporate) | Yes | |
| Identity & Access | Okta (corporate IdP) | Yes | SCIM-provisioned groups drive Backstage and Kubernetes RBAC |
| APM & Incident Management | Datadog (existing enterprise contract) | Yes | Retained for APM, synthetics, and on-call workflow; avoids re-tooling cost |
| Metrics & Dashboards | Prometheus + Grafana | Yes (new standard) | Self-hosted; integrates with Datadog for unified dashboards |
| Secret Management | HashiCorp Vault (existing) | Yes | Workload Identity federated into Vault for short-lived credentials |
| CI/CD | GitHub Actions (corporate) | Yes (partial) | Retained for source-repo-level checks; Tekton used for heavier build + signing pipelines |
| Artefact Registry | GitHub Packages + Artifact Registry | Yes | Hybrid reflects multi-cloud choice |
| Data & Analytics | Snowflake (corporate) | Yes | Backstage and DORA telemetry land in Snowflake via Fivetran |
  • Backstage developer portal and all first-party plugins (catalogue, TechDocs, Scaffolder, scorecards, cost insights)
  • Platform control plane: Crossplane, Terraform modules, Dagger pipeline libraries
  • Delivery plane: ArgoCD control plane, Tekton pipelines, supply-chain tooling (Sigstore, SLSA attestations)
  • Runtime plane: GKE (primary) and EKS (secondary) fleet, including platform workloads and the multi-tenant application namespaces for product teams
  • Observability plane: Prometheus, Grafana, OpenTelemetry collectors, Datadog integration
  • Golden-path templates for: new Go service, new TypeScript service, new Python batch job, new frontend app, new ephemeral preview environment
  • Developer-facing CLI (stellar) wrapping portal and API actions
  • Documentation, enablement, and paved-road migration tooling
  • Individual product services that run on the platform (owned by stream-aligned teams)
  • The customer-facing Stellar SaaS product (APP-0100)
  • Corporate identity (Okta) and networking (ExpressRoute / Interconnect) — platform consumes these
  • Data warehouse workloads (Snowflake; documented under APP-0070)
  • Third-party SaaS integrations not consumed directly by the platform

Stellar Engineering reached its current scale (400 engineers, 60 teams, ~850 services) without a deliberate platform strategy. The result is a high-cognitive-load environment for stream-aligned teams:

  • Manual service bootstrapping: New services take 3-6 weeks. The process spans 9 Jira tickets across SRE, Security, Networking, Platform, and Finance. Engineers cite this as their top frustration in the 2025 DevEx survey (Net DevEx Score: -18).
  • Jenkins monolith: A single 12-year-old Jenkins instance runs 2,400 jobs; >60% of incidents in the CI/CD domain originate here. The maintainer left in 2024 and no one fully understands the Groovy shared library.
  • Terraform sprawl: Each team maintains its own Terraform modules. Six competing approaches to VPC, IAM, and Kubernetes namespace provisioning exist.
  • Kubernetes fragmentation: Some teams deploy via Helm charts manually, some via ad hoc kubectl apply, a few via Flux. No consistent RBAC, no consistent resource-quota policy.
  • Observability silos: Three teams run their own Prometheus; others export straight to Datadog; some still use CloudWatch. Cross-service traces are unusable.
  • Documentation decay: Team wikis in Confluence are frequently out of date; new joiners spend their first 3-4 weeks “finding the right page”.

The DORA baseline (measured via manual sampling Q4 2025) sits in the medium performer band: deployment frequency weekly, lead time 9 days, change failure rate 18%, MTTR 8 hours.

| Decision / Constraint | Rationale | Impact |
| --- | --- | --- |
| Backstage as the portal foundation | Industry standard for IDPs; active CNCF project; large plugin ecosystem; hiring signal | Commits to a Node.js/React stack and the ongoing cost of tracking upstream |
| Multi-cloud from day one (GKE primary, EKS secondary) | Commercial risk mitigation; two of our largest customers require regional presence in GCP and AWS respectively | Higher platform engineering cost; requires cloud-agnostic abstractions (Crossplane) |
| GitOps via ArgoCD | Declarative, auditable, and the dominant pattern for Kubernetes at our scale | Commits teams to writing manifests or using our Scaffolder to generate them |
| Platform-as-a-product operating model | The platform only succeeds if adoption is voluntary; we measure ourselves on adoption, DORA, and DevEx survey scores | Requires a dedicated PM (Jane Doe) and ongoing user research |
| Opinionated golden paths; opt-out allowed | The paved road should be the shortest path, but we do not forbid teams from leaving it | Slightly higher support burden; accepts some long-tail variance |
| Field | Value |
| --- | --- |
| Project Name | Stellar Platform Programme |
| Project Code / ID | PRJ-2026-004 |
| Project Manager | Jane Doe (Product Manager, Platform) |
| Estimated Solution Cost (Capex) | GBP 1,200,000 (build phase, 9 months, including cross-functional team of 12) |
| Estimated Solution Cost (Opex) | GBP 350,000/year (run cost: cloud hosting, Datadog, Backstage maintenance, on-call) |
| Target Go-Live Date | 2026-07-01 (MVP — first 5 golden paths) |

Selected criticality: Tier 3: Medium Impact

The platform is an internal productivity tool with no direct customer-facing revenue impact. If the platform is unavailable:

  • In-flight product deployments are delayed (not blocked — teams can deploy via emergency path using kubectl directly).
  • Customer-facing services continue to run; the platform is not in their request path.
  • Developer productivity is reduced; an all-day outage costs approximately 400 engineer-days of lost self-service capability.

The impact of platform unavailability is internal productivity loss, not customer or regulatory harm. Tier 3 is appropriate.


| Stakeholder | Role / Group | Key Concerns | Relevant Views |
| --- | --- | --- | --- |
| Priya Bloggs | Head of Engineering (Sponsor) | Engineer productivity, DORA metrics, cost, predictable delivery | Executive Summary, Scenarios |
| Jane Doe | Product Manager (Stellar Platform) | Adoption, DevEx survey scores, paved-road-first narrative | All views |
| Tom Bloggs | Principal Platform Engineer (Platform Lead) | Design integrity, platform reliability, long-term maintainability | All views |
| Claire Doe | Developer Experience Lead | Onboarding time, cognitive load, documentation quality | Logical, Scenarios, Lifecycle |
| Amir Bloggs | SRE Lead | Reliability of the platform itself, on-call burden, observability | Physical, Operational Excellence, Reliability |
| Joe Bloggs | Security Architect | Supply chain, secrets, Kubernetes RBAC, audit | Security View, Data View |
| Sam Doe | FinOps Lead | Multi-cloud cost attribution, showback, waste reduction | Cost Optimisation |
| Product Team Tech Leads (c.60) | Stream-aligned teams (internal customers) | Autonomy, not being blocked, escape hatches when golden paths do not fit | Logical, Scenarios |
| Engineering Directors (c.6) | Capability-aligned leaders | Team performance, morale, hiring signal | Executive Summary, Scenarios |
| Enabling Teams (4 teams, c.18 engineers) | Data, ML, Frontend, Mobile enabling teams | Shared libraries integrate with golden paths without imposing extra team-specific context on product teams | Logical, Integration |
| Concern | Stakeholder(s) | Addressed In |
| --- | --- | --- |
| Lead time for changes falls below 2 days | Head of Engineering, Product Teams | 1.2 Drivers, 3.6 Scenarios, 4.3 Performance |
| Cognitive load on product teams reduces | Product Teams, DevEx Lead | 3.1 Logical View (abstractions), 3.6 Scenarios |
| Platform does not become a bottleneck | Product Teams, Head of Engineering | 6.3 Risks (R-001), 4.2 Reliability |
| Golden paths do not become cages | Product Teams, Tech Leads | 6.3 Risks (R-002), 3.1 Design patterns |
| Supply-chain integrity and SBOM generation | Security Architect | 3.5 Security View, 5.1 CI/CD |
| Secrets never present on developer machines | Security Architect | 3.5 Security View |
| Cross-cloud cost is attributable per team | FinOps Lead | 4.4 Cost Optimisation |
| Platform SLIs/SLOs are visible and honoured | SRE Lead | 4.1 Operational Excellence, 4.2 Reliability |
| Onboarding of a new team takes less than a day | DevEx Lead | 3.6 Scenarios |
| Regulation / Standard | Applicability | Impact on Design |
| --- | --- | --- |
| UK GDPR & Data Protection Act 2018 | Platform processes engineer identity data (Okta sync) and may touch customer data indirectly via logs from product services | Access controls, audit logging, engineer consent for DevEx telemetry |
| SOC 2 Type II | Stellar Engineering is SOC 2 Type II certified; the platform materially affects the control environment (change management, access, monitoring) | Platform controls are in scope; evidence automation required |
  • No — the platform itself does not process customer financial, health, or payment data. Product services running on the platform may, but they remain individually accountable for their regulatory posture.
| Standard | Version | Applicability |
| --- | --- | --- |
| Stellar Information Security Policy (POL-0001) | 4.2 | All platform controls |
| Stellar Cloud Landing Zone Standards (STD-0012) | 3.1 | GKE and EKS account/project layout |
| SLSA Supply-chain Levels | v1.0 (target L3) | CI/CD supply-chain controls |
| CIS Kubernetes Benchmark | v1.9 | Cluster hardening baseline |

graph TB
  subgraph Portal[Portal Plane]
      BS[Backstage Portal]
      CLI[stellar CLI]
      TD[TechDocs]
  end
  subgraph Control[Control Plane]
      CP[Crossplane]
      TF[Terraform Modules]
      DG[Dagger Pipelines]
      GH[GitHub - Source of Truth]
  end
  subgraph Delivery[Delivery Plane]
      ARGO[ArgoCD]
      TKN[Tekton]
      SIG[Sigstore + SLSA]
  end
  subgraph Runtime[Runtime Plane]
      GKE[GKE Fleet - Primary]
      EKS[EKS Fleet - Secondary]
  end
  subgraph Obs[Observability Plane]
      PROM[Prometheus]
      GRAF[Grafana]
      OTEL[OpenTelemetry]
      DD[Datadog]
  end
  BS --> GH
  CLI --> BS
  GH --> CP
  GH --> ARGO
  CP --> GKE
  CP --> EKS
  TF --> GKE
  TF --> EKS
  DG --> TKN
  TKN --> SIG
  ARGO --> GKE
  ARGO --> EKS
  GKE --> OTEL
  EKS --> OTEL
  OTEL --> PROM
  OTEL --> DD
  PROM --> GRAF
Stellar Platform logical architecture. The Portal plane (Backstage, CLI, docs) sits above the Control plane (Crossplane, Terraform, Dagger, GitHub as source of truth), which sits above the Runtime plane (ArgoCD, Tekton, GKE and EKS clusters) and the Observability plane (Prometheus, Grafana, OpenTelemetry, Datadog).
| Component | Type | Description | Technology | Owner |
| --- | --- | --- | --- | --- |
| Backstage Portal | Web Application | Single pane of glass: catalogue, Scaffolder, TechDocs, scorecards, cost insights | Backstage (Node.js, React, TypeScript) | Platform Team (Portal squad) |
| stellar CLI | Application | Thin CLI wrapping Backstage APIs for terminal-first engineers | Go; distributed via Homebrew and go install | Platform Team (DevEx squad) |
| Scaffolder Templates | Application Asset | Golden-path templates for new services, jobs, frontends, preview envs | Backstage Scaffolder, YAML, Cookiecutter | Platform Team (Portal squad) |
| Software Catalogue | Service | Authoritative registry of services, APIs, resources, teams, and ownership | Backstage catalog-backend, PostgreSQL | Platform Team (Portal squad) |
| Crossplane Control Plane | Service | Kubernetes-native API for cloud resources (buckets, databases, IAM) | Crossplane v1.15, provider-gcp, provider-aws | Platform Team (Control squad) |
| Terraform Module Library | Application Asset | Audited modules for resources Crossplane does not yet model | Terraform 1.7, Terragrunt, Atlantis | Platform Team (Control squad) |
| Dagger Pipeline Library | Application Asset | Reusable typed CI pipelines (build, test, SBOM, sign, publish) | Dagger (Go SDK) | Platform Team (Delivery squad) |
| Tekton Pipelines | Service | Runs heavy, privileged pipeline work (signing, image promotion) | Tekton v0.56 on GKE | Platform Team (Delivery squad) |
| ArgoCD Control Plane | Service | GitOps engine; reconciles target state for all tenant namespaces | ArgoCD v2.11 in HA mode | Platform Team (Runtime squad) |
| GKE Fleet | Runtime | Primary Kubernetes fleet (3 regions: europe-west2, us-east4, asia-southeast1) | GKE Autopilot | Platform Team (Runtime squad) |
| EKS Fleet | Runtime | Secondary Kubernetes fleet (eu-west-2, us-east-1) | EKS, Karpenter for node autoscaling | Platform Team (Runtime squad) |
| Prometheus + Grafana | Service | Platform and tenant metrics; self-hosted, multi-tenant | Prometheus (Thanos for long-term), Grafana | Platform Team (Obs squad) |
| Datadog | External SaaS | APM, RUM, synthetics, on-call paging; integrated via OpenTelemetry Collector | Datadog (enterprise contract) | Platform Team (Obs squad) |
| DORA Telemetry Pipeline | Batch Job | Extracts deployment frequency, lead time, CFR, MTTR per team into Snowflake | Dagger + Snowflake | Platform Team (DevEx squad) |
| Pattern | Where Applied | Rationale |
| --- | --- | --- |
| Platform-as-a-Product | Overall operating model | Platform only succeeds through voluntary adoption; treat internal customers as customers |
| Golden Paths (Paved Road) | Scaffolder templates, CI libraries, runtime conventions | Make the right thing the easy thing; avoid hard guardrails where possible |
| GitOps | ArgoCD, Crossplane | Declarative, auditable, self-healing; Git is the source of truth |
| Control-Plane / Data-Plane separation | Portal/Control vs. Runtime/Observability | Allows independent scaling and failure domains |
| Sidecar | OpenTelemetry Collector, Istio envoy (phase 2) | Non-invasive telemetry and policy enforcement |
| API Gateway | Backstage’s backend-for-frontend | Single authenticated entry point for portal clients |
| Strangler Fig | Jenkins to Tekton migration | Gradual retirement of Jenkins without a big-bang cutover |
| Service ID | Service Name | Capability ID | Capability Name |
| --- | --- | --- | --- |
| SVC-1042-01 | Developer Portal | CAP-ENG-010 | Developer Self-Service |
| SVC-1042-02 | Platform Control Plane | CAP-ENG-011 | Infrastructure Provisioning |
| SVC-1042-03 | Delivery Pipelines | CAP-ENG-012 | Build, Test, Deploy |
| SVC-1042-04 | Kubernetes Runtime | CAP-ENG-013 | Application Runtime |
| SVC-1042-05 | Observability | CAP-ENG-014 | Monitoring & Incident Response |
| Application Name | Application ID | Impact Type | Change Details | Comments |
| --- | --- | --- | --- | --- |
| Jenkins (legacy CI) | APP-0205 | Retire | Retire over 18 months via strangler-fig migration to Tekton | 2,400 jobs rehosted or refactored |
| Confluence team spaces | N/A | Use (reduced) | TechDocs becomes primary engineering documentation surface | Confluence retained for non-technical content |
| Okta | APP-0008 | Use | SCIM sync of groups drives Backstage and cluster RBAC | No change to Okta configuration |
| HashiCorp Vault | APP-0015 | Use | Workload Identity federation; Vault Agent sidecar for non-Kubernetes workloads | Existing Vault retained |
| Datadog | N/A (SaaS) | Use (expanded) | Expanded to multi-cloud APM and unified on-call | Existing enterprise contract |
| Snowflake | APP-0070 | Use | DORA and DevEx telemetry land in Snowflake | Read-only access pattern |

3.1.6 Technology & Vendor Lock-in Assessment

| Component / Service | Vendor / Technology | Lock-in Level | Mitigation | Portability Notes |
| --- | --- | --- | --- | --- |
| Backstage | CNCF (Spotify-origin) | Moderate | Open-source, heavily extended internally; catalogue data portable | Plugin ecosystem is the main switching cost |
| Crossplane | CNCF | Low | Kubernetes-native; Compositions are portable YAML | Compositions use Upbound providers (alternative providers exist) |
| ArgoCD | CNCF | Low | GitOps manifests are portable; Flux is a drop-in alternative | |
| Tekton | CNCF | Low | Pipelines are YAML; Dagger abstraction shields most pipeline logic | |
| GKE | Google Cloud | Moderate | Autopilot is GKE-specific; workloads themselves are standard Kubernetes | Migrated workloads would require re-platforming the cluster layer |
| EKS | AWS | Moderate | Similar considerations to GKE; intentional redundancy reduces single-cloud lock-in | |
| Datadog | Datadog Inc. | High | OpenTelemetry Collector shields application code; dashboards and monitors are Datadog-specific | Dashboards-as-code (Terraform provider) eases partial migration |
| Backstage plugins (bespoke) | Stellar-internal | N/A (internal) | Built on stable Backstage APIs; versioned | |

Primary developer journey — “Create a new service”:

sequenceDiagram
  participant Dev as Engineer
  participant BS as Backstage
  participant GH as GitHub
  participant TKN as Tekton
  participant CP as Crossplane
  participant ARGO as ArgoCD
  participant GKE as GKE Cluster
  participant DD as Datadog
  Dev->>BS: Choose golden-path template
  BS->>GH: Create repo (code + IaC)
  GH->>TKN: Trigger pipeline (push)
  TKN->>TKN: Build, SBOM, sign image
  TKN->>GH: Publish manifests to infra repo
  GH->>CP: Apply Crossplane claim
  CP->>GKE: Provision namespace + secrets
  GH->>ARGO: Sync new Application
  ARGO->>GKE: Deploy workload
  GKE->>DD: Emit metrics + traces
  BS->>Dev: "Service ready - see scorecard"
Developer journey for creating a new service. An engineer opens Backstage, picks a golden-path template, Backstage Scaffolder creates a GitHub repository with CI and infra as code. Crossplane provisions cloud resources. ArgoCD deploys the resulting container image. Datadog observability is auto-configured.
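
The golden-path templates the journey starts from are declarative Scaffolder definitions. The sketch below is illustrative only: `fetch:template`, `publish:github`, and `catalog:register` are standard Backstage Scaffolder actions, but the parameter set, repository owner, and skeleton layout are assumptions rather than the production template.

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: go-service                    # "Create new Go service" golden path (illustrative)
  title: Create new Go service
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name, team, tier]
      properties:
        name: { type: string }
        team: { type: string }
        tier: { type: string, enum: [tier-1, tier-2, tier-3] }
  steps:
    - id: fetch
      action: fetch:template          # render the skeleton repo from the template content
      input:
        url: ./skeleton
        values: ${{ parameters }}
    - id: publish
      action: publish:github          # create the new GitHub repository
      input:
        repoUrl: github.com?owner=stellar-eng&repo=${{ parameters.name }}
    - id: register
      action: catalog:register        # add the new component to the software catalogue
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
```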

Secondary data flow — DORA telemetry:

  1. Each Tekton pipeline run emits a CloudEvents-formatted event to a Pub/Sub topic.
  2. A Dagger batch job (runs every 15 minutes) aggregates events into deployment, lead time, and CFR metrics per team.
  3. Metrics land in Snowflake (PLATFORM.DORA schema) and are surfaced back into Backstage scorecards.
  4. Weekly exec digest is generated from Snowflake via scheduled query.
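
The event in step 1 follows the CloudEvents envelope; the shape below (shown as YAML for readability, although the payload on the Pub/Sub topic would typically be JSON) is an illustrative sketch — the `type` name and the fields under `data` are assumptions about what the DORA pipeline consumes, not a published schema.

```yaml
# Illustrative deployment event published to Pub/Sub by a Tekton pipeline run.
# The CloudEvents attributes (specversion, type, source, id, time, data) are standard;
# everything under data is a hypothetical example.
specversion: "1.0"
type: io.stellar.platform.deployment.finished
source: //tekton/pipelines/payments-gateway
id: 8f6c1a2e-0d4b-4c19-9a57-2e1f6c0b7d31
time: "2026-04-18T10:42:17Z"
data:
  service: payments-gateway
  team: team-payments
  environment: production
  commit_sha: 3b9d2f1
  pipeline_duration_seconds: 412
  outcome: success                    # success | failure — feeds change failure rate
```
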
Source ComponentDestination ComponentProtocol / EncryptionAuthentication MethodPurpose
Engineer browserBackstage PortalHTTPS / TLS 1.3OIDC (Okta)Portal access
stellar CLIBackstage backendHTTPS / TLS 1.3OIDC device code flowCLI self-service
BackstageGitHub EnterpriseHTTPS / TLS 1.3GitHub App (short-lived tokens)Scaffolder, catalogue sync
BackstagePostgreSQL (catalogue)TCP-TLSmTLS + Workload IdentityCatalogue persistence
TektonGitHub EnterpriseHTTPS / TLS 1.3GitHub AppWebhook-driven pipeline triggers
TektonArtifact Registry / GHCRHTTPS / TLS 1.3Workload IdentityPush container images
ArgoCDGKE / EKS API serversHTTPS / TLS 1.3ServiceAccount + cluster RBACReconcile desired state
CrossplaneGCP / AWS APIsHTTPS / TLS 1.3Workload Identity federationProvision cloud resources
OpenTelemetry CollectorPrometheus (remote write)HTTPS / TLS 1.3mTLSMetrics ingestion
OpenTelemetry CollectorDatadog intakeHTTPS / TLS 1.3API key (from Vault)APM and trace ingestion
Platform workloadsHashiCorp VaultHTTPS / TLS 1.3Workload Identity (JWT)Short-lived dynamic secrets
Source ApplicationDestination ApplicationProtocol / EncryptionAuthenticationSecurity ProxyPurpose
Stellar PlatformOktaHTTPS / TLS 1.3OIDC (server-to-server), SCIMN/AAuthentication, group sync
Stellar PlatformGitHub Enterprise CloudHTTPS / TLS 1.3GitHub App (private key in Vault)N/ASource of truth
Stellar PlatformDatadogHTTPS / TLS 1.3API keyN/AAPM, paging
Stellar PlatformSnowflakeHTTPS / TLS 1.3Key-pair auth (rotated)Private LinkDORA telemetry landing
User TypeAccess MethodAuthenticationProtocol
Engineers (400)Web browser + stellar CLIOkta SSO (OIDC) + MFAHTTPS
Platform admins (12)Web + kubectl via IAP/SSM bastionOkta SSO + Hardware key + PIMHTTPS / SSH
Break-glass / SREEmergency cluster-admin role via PIMOkta SSO + Hardware key + manager approval + 2h TTLHTTPS
NameTypeDirectionData FormatVersionAuthenticatedRate Limited
Backstage Backend APIRESTExposed (internal)JSONv1Yes (OIDC)Yes
Scaffolder Templates CatalogueRESTExposedJSONv1Yes (OIDC)Yes
DORA Metrics APIRESTExposedJSONv1Yes (OIDC + team scope)Yes
Crossplane API (Kubernetes CRDs)Kubernetes APIExposed (internal)JSON/YAMLCrossplane v1Yes (ServiceAccount)Yes (API priority & fairness)
graph TB
  subgraph GKE[GKE - Primary - 3 regions]
      BSCluster[Portal + Backstage]
      ArgoMain[ArgoCD HA]
      TknMain[Tekton]
      CPMain[Crossplane]
      ObsMain[Prometheus + Grafana]
      VaultMain[Vault]
      TenantsG[Tenant Workloads]
  end
  subgraph EKS[EKS - Secondary - 2 regions]
      ArgoSat[ArgoCD Satellite]
      TenantsE[Tenant Workloads]
  end
  subgraph SaaS[External SaaS]
      GH[GitHub Enterprise]
      OK[Okta]
      DD[Datadog]
      SF[Snowflake]
  end
  BSCluster --> GH
  BSCluster --> OK
  ArgoMain --> GKE
  ArgoMain --> EKS
  ObsMain --> DD
  TknMain --> DD
  BSCluster --> SF
Stellar Platform deployment. GKE is primary across three regions for the portal, control plane, ArgoCD, Tekton and observability. EKS is secondary across two regions, running ArgoCD satellites and tenant workloads. GitHub, Datadog, Okta, and Snowflake are SaaS. HashiCorp Vault is self-hosted on GKE.
AttributeSelection
Hosting Venue TypePublic Cloud (multi-cloud)
Hosting Region(s)GCP: europe-west2 (London), us-east4, asia-southeast1. AWS: eu-west-2 (London), us-east-1.
Service ModelPaaS + CaaS (GKE Autopilot, EKS + Karpenter)
Cloud Provider(s)GCP (primary), AWS (secondary)
Account / Subscription TypeStellar corporate landing zones (stellar-platform-prod, stellar-platform-nonprod, plus per-region tenant folders)
Compute TypeTechnologyDetails
Container platform (primary)GKE AutopilotMulti-regional; platform + tenant workloads
Container platform (secondary)EKS + KarpenterRegional; failover and multi-cloud tenant workloads
ServerlessCloud Run (occasional, for platform utility services)Used for infrequent batch utilities

Platform control-plane footprint (steady state, production):

WorkloadClusterQuantityNotes
Backstage PortalGKE (europe-west2)6 pods (HA)2 CPU / 4 GiB each
PostgreSQL (Backstage catalogue)Cloud SQL (regional)1 primary + 1 replicadb-custom-4-16
Crossplane controllersGKE (europe-west2)3 pods
ArgoCDGKE (europe-west2)HA mode, 3 replicasApplication controller sharded by cluster
Tekton pipelinesGKE (europe-west2)Up to 200 concurrent podsAutopilot-managed
PrometheusGKE (each region)2 replicas per region + Thanos14d hot, 1y cold in GCS/S3
AgentCoverageJustification
GKE Security Posture / GuardDutyAll clustersRuntime threat detection
FalcoGKE, EKSeBPF-based runtime anomaly detection on platform clusters
Trivy OperatorAll clustersContinuous image & config scanning
QuestionResponse
Is this an Internet-facing application?Backstage portal is Internet-facing (behind corporate IdP); runtime planes are not directly Internet-facing
Outbound Internet connectivity required?Yes — GitHub, Okta, Datadog, Snowflake, container registries
Cloud-to-on-premises connectivity required?Yes — ExpressRoute to the London colo for Vault HSM root of trust and Okta connector
Wireless networking required?No
Third-party / co-location connectivity required?Yes — Datadog (over PrivateLink / PSC where available), Snowflake (PrivateLink)
Cloud network peering required?Yes — GCP and AWS VPCs peered to a central transit hub; multi-cloud connectivity via Megaport
AttributeSelection
User access methodWeb (HTTPS) + CLI
User locationsGlobal (UK, US, APAC offices; remote workforce)
Administrator access methodIAP-tunnelled kubectl; no public Kubernetes API endpoints
VPN requiredNo (IAP + Okta context-aware access)
Direct Connect / ExpressRoute / InterconnectYes
ProtocolUsed?Purpose
HTTPS (TLS 1.3)YesAll portal, API, and inter-service traffic
gRPC (mTLS)YesService-to-service on the runtime plane (Istio-enforced)
TCP-TLSYesDatabase and Vault traffic
SFTPNo
KafkaNo (yet; planned Phase 2)
EnvironmentDescriptionCount & VenueCompute Solution
Development (per engineer)Ephemeral preview environments on mergeUp to 200 concurrent, GKE (europe-west2)GKE Autopilot
Integration TestContinuous integration testing of the platform itself1x GKE (europe-west2)GKE Autopilot
StagingPre-production validation; mirrors production topology at reduced scale1x GKE + 1x EKSGKE Autopilot + EKS
ProductionLive platform3x GKE regions + 2x EKS regionsGKE Autopilot + EKS

Dev and integration-test environments automatically scale to zero outside business hours.

QuestionResponse
Hosting regions chosen for low carbon intensityeurope-west2 (London), us-east4, asia-southeast1 chosen for customer proximity. Each region operates under its respective cloud provider’s carbon-neutral / 100% renewable matching commitments; europe-west2 published carbon intensity tracks with the UK grid.
Non-production environments auto-shutdownYes — dev and integration-test GKE Autopilot clusters scale to zero outside business hours; non-prod databases (Cloud SQL) auto-paused; ~£18k/year saving on non-prod compute (referenced in 4.4 FinOps).
Compute family chosen for performance-per-wattGKE Autopilot schedules onto Google’s latest-generation efficient node families (Tau T2D, with Arm-based Tau T2A on supported workloads); EKS uses Graviton3 (c7g/m7g) where tenant workloads tolerate ARM. Graviton3’s up-to-~60% performance-per-watt advantage is captured for backend services.
Auto-scaling configured to release capacity when idleYes — GKE Autopilot scales pods on resource demand; Karpenter on EKS consolidates within 5 minutes; Backstage portal scales to two replicas overnight (down from peak of eight).
DR strategy proportionateMulti-region active-active for the data plane (delivery / artefact services), warm standby for the portal control plane. Hot active-active rejected for the portal: not justified by the SLO (99.5%), would have ~30% additional always-on compute and PostgreSQL replication carbon cost.
Data NameStore TechnologyAuthoritative?Retention PeriodData SizeClassificationPersonal Data?Encryption LevelKey Management
Software catalogueCloud SQL (PostgreSQL)YesIndefinite< 10 GBInternalYes (engineer email, GitHub handle)Storage + column-level for PIICustomer-managed KMS (GCP)
TechDocs (built)GCS / S3No (source is Git)Indefinite< 100 GBInternalNoStorage (CMEK)Customer-managed KMS
Metrics (hot)Prometheus / ThanosYes14 days (hot), 1 year (cold)~2 TB hot; ~15 TB coldInternalNoStorageCustomer-managed KMS
LogsDatadogNo30 daysVariable; projected 8 TB/monthInternalNo (engineers redact)In-transit + at-rest (Datadog-managed)Datadog-managed
DORA metricsSnowflakeYes7 years< 50 GBInternalYes (linked to team, not individual)StorageCustomer-managed (Snowflake)
Tekton pipeline artefactsGCS / S3Yes90 days (SBOMs retained 2 years)~500 GB rollingInternalNoStorageCustomer-managed KMS
SecretsVault + CSI providerYesN/A (zero persistence on workload)< 1 GBRestrictedNoHSM-backedHSM (FIPS 140-2 L3)
Platform configurationGitHub EnterpriseYesIndefinite< 20 GBInternalNoGitHub-managedGitHub-managed
Classification LevelData TypesHandling Requirements
InternalService metadata, metrics, logs, TechDocs, DORA metricsTLS in transit, CMEK at rest, access via Okta-authenticated portal
RestrictedSecrets, signing keysNever present on engineer machines; HSM-backed; short-lived delivery only
StageDescriptionControls
Creation / IngestionEngineers emit events via pipelines, scaffolder, portal interactions; metrics scraped from workloadsSchema validation at ingest (OpenTelemetry, CloudEvents)
ProcessingAggregation of DORA metrics; catalogue reconciliationRuns on platform clusters with Workload Identity
StorageRegional PostgreSQL, Prometheus/Thanos, GCS/S3, Datadog SaaS, SnowflakeCMEK encryption; regional pinning where feasible
Sharing / TransferDatadog and Snowflake SaaS boundary (see 3.4.5)TLS 1.3, PrivateLink where available
ArchivalMetrics tiered to GCS/S3 via Thanos; pipeline artefacts tiered to archival storage classLifecycle policies
Deletion / PurgingCatalog soft-deleted on service retirement; hard-delete after 30 days; DORA metrics retained 7 years then purgedAutomated lifecycle jobs
Assessment TypeIDStatusLink
Data Protection Impact Assessment (DPIA)DPIA-2026-007CompleteStellar SharePoint / Legal / DPIAs

The DPIA concluded that engineer telemetry (DORA, DevEx) is legitimate-interest processing of employee data. Engineers are informed via the engineering handbook; team-level aggregation is preferred over individual attribution.

ApproachSelected
Production data is not used for testing[x]

The platform does not process customer data. Platform-generated data (metrics, logs) in non-production is generated synthetically via load tests.

  • Yes — Sigstore cosign signatures on every container image; SLSA provenance attestations stored alongside each build; Git commit signing enforced on infra repositories; Crossplane compositions reconciled continuously.
  • No — no secrets, certificates, or customer data land on engineer workstations. The stellar CLI uses OIDC device-code flow with tokens in OS keychain (30-minute TTL).
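
The SAD specifies cosign verification at image admission but not the enforcement mechanism. One plausible wiring, sketched here using the Sigstore policy-controller ClusterImagePolicy CRD, would only admit images from the platform registry that carry a valid keyless signature; the registry glob and identity values are assumptions.

```yaml
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: require-signed-platform-images        # illustrative; admission mechanism is an assumption
spec:
  images:
    - glob: "europe-west2-docker.pkg.dev/stellar-platform-prod/**"   # assumed registry path
  authorities:
    - keyless:
        identities:
          - issuer: https://oidc.ci.stellar.internal      # hypothetical OIDC issuer used by Tekton signing
            subjectRegExp: ".*stellar-eng/.*"             # hypothetical identity constraint
```
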
DestinationTypeDataMethodEncrypted
DatadogThird-party SaaSMetrics, traces, logs (scrubbed)API (TLS 1.3)Yes
SnowflakeThird-party SaaS (enterprise-contracted)DORA metricsAPI (PrivateLink)Yes
GitHub Enterprise CloudThird-party SaaSSource, IaC, manifestsAPI (TLS 1.3)Yes
  • Yes — UK customer-facing tenants’ metadata remains in europe-west2 / eu-west-2. Datadog data is routed to the EU site. Snowflake uses an EU deployment.
QuestionResponse
Retention periods minimisedBuild artefacts retained 30 days (latest 5 successful per repo retained indefinitely); container images expire on tag age (90 days for non-stable tags); audit logs 7 years (per Stellar audit policy); telemetry rolled up after 30 days. Lifecycle policies enforce automatic expiry.
Older data tiered to cold/archive storageYes — Cloud Storage / S3 lifecycle: artefacts transition Standard → Nearline → Coldline (90 days) → Archive (1 year). Datadog rolls metrics from raw to aggregated tiers automatically.
Unused or duplicate replicasSingle Cloud SQL primary + 1 read replica (justified by Backstage read-heavy load); Snowflake reserves no idle warehouses (auto-suspend after 10 min). Quarterly orphan-bucket review via gcloud + AWS Trusted Advisor.
Compression appliedBrotli on Backstage HTTPS responses; gzip on artefact uploads to Cloud Storage; Parquet+Zstandard for DORA metric exports to Snowflake.
Cross-region replication justifiedYes — multi-region active-active for the data plane is required by the platform SLO (99.9%). Portal control-plane uses regional Cloud SQL replication only. No cross-cloud data replication beyond explicit pipelines.
Large data transfers off-peakNightly DORA metric ingest to Snowflake 03:00 UTC; weekly Backstage analytics export Sunday 02:00 UTC. Aligned with low UK / EU grid carbon intensity.
QuestionResponse
Does the solution support regulated activities?No directly; platform controls are in scope of SOC 2
Is the solution SaaS or third-party hosted?Hybrid — self-hosted Kubernetes + several SaaS dependencies (Datadog, Okta, Snowflake, GitHub)
Has a third-party risk assessment been completed?Yes — all SaaS vendors have current TPRA records

A lightweight STRIDE threat model has been produced (THREAT-1042-01). Top threats: (1) compromised Backstage instance as a super-power surface, (2) supply-chain injection at Tekton, (3) Crossplane as blast-radius amplifier across clouds.

Impact CategoryBusiness Impact if Compromised
ConfidentialityHigh — platform telemetry includes engineer identity and deployment patterns; secrets for all internal systems pass through Vault
IntegrityHigh — a platform compromise could push malicious manifests to any tenant cluster
AvailabilityMedium — platform outage halts self-service but does not stop customer-facing services
Non-RepudiationMedium — all platform actions signed and audit-logged; break-glass tracked with dual approval
Access TypeRole(s)Destination(s)Authentication MethodCredential Protection
EngineerDeveloperBackstage, CLIOkta SSO (OIDC) + WebAuthnManaged by Okta; hardware keys for privileged groups
Platform AdminPlatform EngineerBackstage admin, kubectl via IAPOkta SSO + Hardware key + PIMJIT elevation, 2h TTL
SRE on-callSREKubectl (break-glass)Okta SSO + Hardware key + manager approval + PIMJIT elevation, 1h TTL, dual-approval
Service AccountPlatform workloadsCloud APIs, VaultWorkload Identity FederationNo long-lived credentials
CI runnerTekton pipelinesRegistries, KubernetesWorkload Identity + signed SPIFFE SVIDsShort-lived (< 15 min)
Access TypeRole / ScopeEntitlement StoreProvisioning Process
Engineer (all)Self-service on own team’s servicesOkta groups -> Backstage + Kubernetes RBACSCIM (automated)
Engineering DirectorView across their directorateOkta groupSCIM
Platform EngineerPlatform maintenance (non-production)Okta group + JIT to production via PIMSCIM + PIM
Break-glass adminFull cluster-adminOkta group (empty steady-state) + PIMManual activation with dual approval
  • RBAC model with ABAC attributes for team ownership
  • Quarterly access recertification enforced via Okta Lifecycle
  • Segregation of duties: no engineer has write-access to both code and signing keys for the same service
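
As a concrete illustration of the Okta-group-to-RBAC mapping, a SCIM-synced team group might be bound to namespace-scoped edit rights as below. The group and namespace names, and the `okta:` prefix format, are hypothetical; only the RBAC schema itself is standard Kubernetes.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-payments-editors            # hypothetical team
  namespace: team-payments               # per-tenant namespace created by Crossplane
subjects:
  - kind: Group
    name: okta:eng-team-payments         # group as presented in the OIDC token; prefix format is an assumption
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                             # built-in aggregated role; no cluster-scoped rights granted
  apiGroup: rbac.authorization.k8s.io
```
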
Account TypeManagement Approach
Production cluster-adminOkta PIM; JIT 1h; hardware key; session recording via IAP; dual-approval for break-glass
Crossplane provider credentialsWorkload Identity only; no static credentials exist
Vault root tokenSealed, sharded among 5 officers; never unsealed in steady-state

3.5.3 Network Security & Perimeter Protection

ControlImplementation
Network segmentationPer-tenant Kubernetes namespaces; NetworkPolicies enforced; Istio planned for mTLS east-west (Phase 2)
Ingress filteringGCP Cloud Armor + AWS WAF on internet-facing portal; IAP context-aware access
Egress filteringPer-namespace egress policies via Cilium; default-deny
Private cluster endpointsYes — Kubernetes API servers are private-only; access via IAP
Encryption in transitTLS 1.3 enforced by Cloud Armor / ALB policies
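
A minimal sketch of the default-deny egress posture described above, expressed as a standard NetworkPolicy (Cilium enforces these; FQDN-based allow rules via CiliumNetworkPolicy can be layered on top). The namespace name is hypothetical; the DNS exception keeps name resolution working for explicitly allowed destinations.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: team-payments              # applied per tenant namespace by the golden path (hypothetical name)
spec:
  podSelector: {}                       # all pods in the namespace
  policyTypes: [Egress]
  egress:
    - to:                               # allow DNS so workloads can resolve approved endpoints
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
```
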
AttributeDetail
Encryption deployment levelStorage (platform default) + logical-container (KMS key per tenant)
Key typeSymmetric
Algorithm / cipher / key lengthAES-256-GCM
Key generation methodHSM (Cloud KMS, Cloud HSM where FIPS 140-2 L3 required)
Key storageCloud KMS / HSM
Key rotation scheduleAutomatic, every 90 days
AttributeDetail
Secret storeHashiCorp Vault (self-hosted on GKE, HA)
Secret distributionCSI Secrets Store driver -> tmpfs volume in workload pod; never written to disk
Secret protection on hostShort-lived (< 1 hour) dynamic secrets; no static credentials
Secret rotationAutomatic (dynamic secrets have TTL-driven rotation)
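
A sketch of the CSI delivery path, assuming the Vault provider for the Secrets Store CSI driver; the Vault address, role, secret path, and key names are hypothetical stand-ins.

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: payments-db-creds               # hypothetical
  namespace: team-payments
spec:
  provider: vault
  parameters:
    vaultAddress: https://vault.platform.internal:8200   # assumed internal address
    roleName: payments-gateway                           # Vault role bound to the workload's service account
    objects: |
      - objectName: db-password
        secretPath: database/creds/payments              # dynamic DB credentials with a short TTL
        secretKey: password
```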

3.5.5 Security Monitoring & Threat Detection

CapabilityImplementation
Security event loggingFalco + Kubernetes audit logs shipped to SIEM
SIEM integrationYes — Splunk Enterprise (corporate SIEM); 1-year hot retention
Infrastructure event detectionGuardDuty (AWS) + Security Command Center (GCP)
Security alertingCritical alerts page SRE + Security on-call; Sev-2 go to SOC queue
Supply chainSigstore cosign verification on image admission; SLSA L3 targeted; SBOM generated per build and stored

UC-01: Engineer bootstraps a new service from a golden-path template

AttributeDetail
Actor(s)Engineer on a stream-aligned product team
TriggerNew service needed to deliver a product increment
Pre-conditionsEngineer is authenticated; has membership of the owning team’s Okta group
Main Flow1. Open Backstage, choose “Create new Go service” template. 2. Fill 6 fields (name, team, description, tier, region, data classification). 3. Scaffolder creates GitHub repo + infra repo with sensible defaults. 4. Tekton pipeline runs on first commit — builds, tests, generates SBOM, signs with cosign. 5. Crossplane provisions namespace, bucket, and service account. 6. ArgoCD deploys to staging automatically. 7. Datadog dashboard and SLO are auto-created. 8. Backstage scorecard shows green.
Post-conditionsService is in staging, discoverable in catalogue, observable; total elapsed time target < 30 minutes
Views InvolvedLogical, Integration & Data Flow, Physical, Security
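
Step 5 of the flow is driven by a Crossplane claim committed to the infra repo. The composite resource type and fields below are hypothetical — they stand in for whatever XRDs the Control squad publishes — and only the claim structure (namespaced resource with a composition selector) reflects standard Crossplane usage.

```yaml
apiVersion: platform.stellar.io/v1alpha1      # hypothetical API group for Stellar's XRDs
kind: ServiceInfrastructureClaim              # hypothetical claim kind
metadata:
  name: payments-gateway
  namespace: team-payments
spec:
  parameters:
    tier: tier-2
    region: europe-west2
    bucket: true                              # provision an object-storage bucket alongside the namespace
    serviceAccount: true                      # Workload Identity-enabled service account
  compositionSelector:
    matchLabels:
      cloud: gcp                              # Compositions exist per cloud (GKE primary, EKS secondary)
```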

UC-02: Engineer deploys to production via GitOps

AttributeDetail
Actor(s)Engineer (with write on the service repo)
TriggerFeature or fix ready for production
Pre-conditionsPR passed CI (tests, SAST, SCA, image sign); peer review approved
Main Flow1. PR merged to main. 2. Tekton builds new image and pushes signed artefact. 3. A bot PR is raised against the infra repo bumping the image tag in the prod overlay. 4. Once approved and merged, ArgoCD detects drift and syncs to the target cluster. 5. Progressive delivery (Argo Rollouts, canary) shifts traffic 10% -> 50% -> 100% with SLO-based gating. 6. If the SLO burn rate exceeds threshold, automatic rollback.
Post-conditionsChange is live; DORA pipeline emits deployment event; scorecard updates
Views InvolvedLogical, Integration, Physical, Security
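
The progressive-delivery step (10% → 50% → 100% with SLO gating) maps onto an Argo Rollouts canary strategy roughly as follows. The service name, image tag, analysis template name, and pause duration are assumptions; with no traffic router configured, Rollouts approximates the weights via replica counts.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-gateway                 # hypothetical tenant service
  namespace: team-payments
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payments-gateway
  template:
    metadata:
      labels:
        app: payments-gateway
    spec:
      containers:
        - name: app
          image: europe-west2-docker.pkg.dev/stellar-platform-prod/payments-gateway:3b9d2f1  # tag bumped by the bot PR
  strategy:
    canary:
      steps:
        - setWeight: 10                  # shift 10% of traffic to the new version
        - analysis:
            templates:
              - templateName: slo-burn-rate   # hypothetical AnalysisTemplate gating on SLO burn rate
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
```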

UC-03: SRE responds to a platform incident (break-glass)

AttributeDetail
Actor(s)SRE on-call
TriggerDatadog paging event: ArgoCD sync failing cluster-wide
Pre-conditionsSRE is enrolled in break-glass PIM role
Main Flow1. Datadog pages via PagerDuty. 2. SRE acknowledges; opens incident bridge. 3. Requests PIM elevation (dual-approval by secondary on-call). 4. kubectl via IAP tunnel; session recording active. 5. Diagnoses repo sync misconfiguration; reverts offending commit. 6. ArgoCD recovers. 7. Post-incident: role automatically expires at T+1h; full audit trail exported to SIEM.
Post-conditionsPlatform restored; incident report and timeline logged
Views InvolvedPhysical, Security

3.6.2 Architecture Decision Records (ADRs)


ADR-001: Adopt Backstage rather than build an in-house portal

FieldContent
StatusAccepted
Date2026-01-22
ContextThe platform needs a unified front-door. We considered three directions: build a bespoke portal, adopt Backstage, or buy a commercial IDP (Port.io, Cortex, OpsLevel). Our ambition is a deeply integrated, opinionated IDP and we expect to run it for 5+ years.
DecisionAdopt Backstage as the foundation of the portal plane.
Alternatives ConsideredBuild bespoke: Full control and perfect fit, but requires 4-6 engineer-years to reach catalogue parity; hiring and retention signal is weaker. Port.io / commercial IDP: Fast to stand up, strong out-of-the-box experience, but ongoing per-user SaaS cost at 400 engineers is material (~GBP 200k/year) and customisation of core data model is limited. Backstage: CNCF incubating, large ecosystem (>300 plugins), portable catalogue model, healthy community, used by organisations at comparable scale (Spotify, American Airlines, Expedia).
ConsequencesPositive: strong hiring signal; community velocity; deep extension points; OSS means no per-seat cost. Negative: TypeScript/Node.js operational stack introduced; upstream velocity is high, we must track releases; initial plugin quality is variable.
Quality Attribute TradeoffsOperational excellence and cost (positive) vs. initial delivery speed (slightly negative — steeper initial curve than a SaaS IDP).

ADR-002: ArgoCD for GitOps rather than Flux

FieldContent
StatusAccepted
Date2026-02-09
ContextWe need a GitOps engine to reconcile Kubernetes state across GKE and EKS. The two mature CNCF options are ArgoCD and Flux.
DecisionUse ArgoCD in HA mode as the primary delivery-plane engine.
Alternatives ConsideredFlux: Lightweight, GitOps-toolkit-based, composable, lower resource footprint. Excellent for small deployments but the UX for 850+ applications across 5 regions is weaker. ArgoCD: Rich UI suited to a developer-facing portal experience, Argo Rollouts integration for progressive delivery, Application sets for template-driven fan-out, mature multi-cluster model.
ConsequencesPositive: excellent developer UX; first-class progressive delivery; strong RBAC model. Negative: heavier resource footprint; in-cluster UI is another attack surface (mitigated via IAP + OIDC).
Quality Attribute TradeoffsOperational excellence (positive) over small efficiency gains from Flux (minor negative).

ADR-003: Multi-cloud (GKE primary, EKS secondary) from day one

FieldContent
StatusAccepted
Date2026-03-11
ContextTwo of our five largest customers contractually require workloads to run in AWS regions they already operate in. A third (regulated) requires GCP. Consolidating onto a single cloud would force a painful customer-facing negotiation. The platform is the leverage point: if the platform is cloud-agnostic, product teams inherit multi-cloud capability without new cognitive load.
DecisionDesign Stellar Platform as multi-cloud from inception. GKE is the primary cloud for platform-plane workloads (lower operational cost for control plane at our scale, Autopilot maturity). EKS is a peer runtime for tenant workloads requiring AWS presence. Crossplane provides a uniform abstraction over cloud resources.
Alternatives ConsideredSingle-cloud (GCP only): Simpler, cheaper to run, faster to deliver. Rejected because it forces commercial negotiation with AWS-bound customers. Single-cloud (AWS only): Similar trade-off in reverse. Cloud-agnostic from day one, deploy later: Architecturally tempting but creates a “second day” surprise; abstractions untested under load.
ConsequencesPositive: strategic flexibility, customer alignment, vendor-lock-in reduced. Negative: roughly 25% higher platform engineering cost; requires disciplined use of abstractions (no reaching directly for cloud-specific primitives outside agreed extension points).
Quality Attribute TradeoffsReliability and strategic flexibility (positive) over cost optimisation (negative in the short term).

Log TypeEvents LoggedLocal StorageRetention PeriodRemote Services
Application logsBackstage, ArgoCD, Tekton, CrossplaneStdout (ephemeral)30 days hot (Datadog), 1 year cold (S3/GCS)Datadog
Audit logsKubernetes audit, Backstage audit, Vault auditStdout1 year hot in SplunkSplunk SIEM
Pipeline logsTekton run logs, Dagger logsGCS90 daysDatadog (metadata only)
Platform metricsPrometheus remote-writeLocal TSDB 14 days1 year in Thanos (GCS/S3)Datadog (selected series)

4.1.2 Observability — Monitoring & Alerting

| SLI | Objective | Measurement |
| --- | --- | --- |
| Portal availability | 99.5% monthly | Datadog synthetic |
| stellar new service end-to-end success | 99% | Scaffolder telemetry |
| ArgoCD sync success rate | 99.5% per cluster | Prometheus |
| Median deployment latency (merge-to-prod) | < 15 minutes | DORA telemetry |
| p99 Backstage API latency | < 800 ms | Prometheus |
| Alert Category | Trigger Condition | Notification Method | Recipient |
| --- | --- | --- | --- |
| Platform SLO burn | Fast-burn (1h) or slow-burn (6h) on any platform SLO | PagerDuty | Platform on-call |
| Security event (Falco) | Priority >= critical | PagerDuty | Security on-call |
| Cost anomaly | > 20% daily variance vs 28-day baseline | Slack + email | FinOps Lead |
| ArgoCD sync failure (per tenant) | Any sync failure > 15 min | Slack (team-owned channel) | Tenant team |
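
The fast-burn condition in the first row corresponds to a multi-window burn-rate rule along these lines. The recording-rule names are assumptions about how the availability SLI is exposed; the 14.4x factor and 0.5% error budget follow from the 99.5% monthly portal SLO.

```yaml
groups:
  - name: platform-slo-burn
    rules:
      - alert: PortalAvailabilityFastBurn
        # A 14.4x burn rate sustained over 1h (confirmed on a 5m window to cut flapping)
        # would exhaust the 99.5% monthly error budget in roughly two days — page on-call.
        expr: |
          (1 - sli:portal_availability:ratio_rate1h) > (14.4 * 0.005)
          and
          (1 - sli:portal_availability:ratio_rate5m) > (14.4 * 0.005)
        for: 2m
        labels:
          severity: page
          team: platform
        annotations:
          summary: "Backstage portal SLO fast burn"
          runbook: https://backstage.internal/docs/platform/runbooks/portal-slo   # hypothetical TechDocs link
```
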
CapabilityToolCoverage
MetricsPrometheus / ThanosPlatform + tenants (self-service scraping)
DashboardsGrafanaPlatform-owned + team-owned dashboards
APM & tracesDatadogAll tenant services (via OTel)
Logs (aggregation)DatadogAll workloads
SIEMSplunkSecurity-relevant events
Incident managementDatadog + PagerDutyOn-call rotation, post-incident
RunbooksTechDocs (Backstage)Every platform SLO has a linked runbook

4.2.1 Geographic Footprint & Disaster Recovery

QuestionResponse
Is the application deployed across multiple hosting venues for continuity?Yes — multi-region within GCP; EKS fleet adds cross-cloud capability for tenant workloads
What is the DR strategy?Warm-standby for the portal plane (europe-west2 primary, us-east4 warm); backup-restore for GitHub (self-hosted backup via GitHub Enterprise Importer)
Are there data sovereignty requirements affecting geographic choices?Yes — UK data residency for some tenants; UK regions used for their metadata
AttributeResponse
Scaling capabilityFull auto-scaling
Scaling detailsGKE Autopilot handles platform pods; Karpenter handles EKS; ArgoCD application controller sharded by cluster; Backstage horizontal pod autoscaling on CPU and request latency
AttributeResponse
Dependencies adequately sized?Yes
Dependency detailsGitHub Enterprise Cloud scales with enterprise contract; Datadog contract sized for 3x current ingest; Okta has room for 2x workforce; Vault HA cluster sized for 10x current QPS
  • Yes — platform-plane components run in HA mode (>= 3 replicas across zones); ArgoCD and Crossplane reconcile continuously; circuit breakers on third-party calls (Datadog, GitHub); Backstage degrades gracefully if catalogue DB is read-only (serves cached data, self-service creation paused).
| Component / Dependency | Failure Mode | Detection Method | Recovery Behaviour | User Impact |
| --- | --- | --- | --- | --- |
| Backstage | Pod crashloop | Datadog APM + Prometheus | Pod rescheduled; HPA scales | Partial — some requests retry |
| PostgreSQL (catalogue) | Primary failure | Cloud SQL HA | Auto-failover to replica (< 60 s) | Brief read-only window |
| ArgoCD | Application controller failure | Prometheus | Sharded replica continues; failed shard restarts | Deployment delays |
| Crossplane | Provider crash | Prometheus | Provider restarts; state in etcd | Provisioning delayed |
| GitHub | GitHub outage | External status + synthetic | Local mirror allows read; writes queue | Scaffolder paused |
| Datadog | Datadog outage | Datadog multi-region + our synthetic | Metrics continue to Prometheus; paging falls back to backup PagerDuty route | Reduced observability |
| GCP region outage | Regional failure | GCP status + Prometheus | Traffic shifts to secondary region (warm-standby) | Elevated latency, 15-20 min recovery |
| Vault | Seal / outage | Prometheus | Standby unseal via Shamir; workload cached tokens valid for TTL | Secret refresh blocked; workloads run until token expiry |
AttributeDetail
Backup strategyPer-component: Cloud SQL automated + exported; Vault Raft snapshots; GitHub Enterprise Importer for off-site mirror; ArgoCD state reconstructable from Git
Backup product/serviceCloud SQL automated backups; Velero for Kubernetes resources; GCS/S3 for artefact snapshots
Backup typeMix: snapshot (Cloud SQL, Vault), continuous (Git)
Backup frequencyContinuous (Git), daily snapshots (PostgreSQL, Vault)
Backup retention35 days hot, 1 year cold
ControlDetail
ImmutabilityGCS / S3 Object Lock on DR backups
EncryptionCMEK, AES-256
Access controlDedicated restoration role, PIM-gated
| # | Scenario | Recovery Approach | RTO | RPO |
| --- | --- | --- | --- | --- |
| 1 | GCP primary region failure | Cut over portal to warm-standby in us-east4; ArgoCD satellites continue | 30 min | 5 min |
| 2 | PostgreSQL corruption | PITR from Cloud SQL backup | 1 h | 5 min |
| 3 | ArgoCD misconfiguration | Revert Git commit; ArgoCD self-heals | 15 min | 0 |
| 4 | Supply-chain compromise (signed image tampered) | Sigstore verification blocks admission; quarantine namespace; re-sign from source | 4 h | N/A |
| 5 | Vault unseal loss (catastrophic) | Restore from Raft snapshot + Shamir key officers | 4 h | 24 h |
| Metric | Target | Measurement Method |
| --- | --- | --- |
| Backstage page load (p95) | < 2 s | Datadog RUM |
| Backstage API (p99) | < 800 ms | Prometheus |
| Scaffolder “new service” end-to-end | < 30 min (target), < 10 min (stretch) | Scaffolder telemetry |
| stellar CLI cold-start | < 300 ms | CLI self-telemetry |
| ArgoCD sync propagation (merge to pod ready, staging) | < 8 min (p90) | DORA pipeline |
| DORA lead time (platform-using teams) | < 2 days (down from the 9-day baseline) | DORA telemetry |
| DORA change failure rate | < 10% | DORA telemetry |
| DORA deployment frequency | Daily per team (up from weekly) | DORA telemetry |
| DORA MTTR | < 1 h | Incident telemetry |

Performance testing is continuous: k6 synthetic load against the portal nightly; chaos experiments monthly (Litmus) against the control plane.

| Metric | Current | 1 Year | 3 Years | 5 Years |
| --- | --- | --- | --- | --- |
| Engineers (users) | 400 | 550 | 800 | 1,000 |
| Teams | 60 | 80 | 120 | 150 |
| Services in catalogue | 850 | 1,100 | 1,600 | 2,200 |
| Concurrent pipeline runs (peak) | 80 | 120 | 180 | 250 |
| Metrics ingest | 2M series | 3M | 5M | 8M |
QuestionResponse
Will the current design scale to accommodate projected growth?Yes — tested to 3-year projection; revisit Thanos retention and Datadog contract at year 3
Are there known seasonal or cyclical demand patterns?Yes — quarterly OKR planning drives deployment spikes in weeks 2-4 of each quarter
PostureSelectedDetail
Cost deliberately balanced against strategic value[x]GKE Autopilot premium accepted in exchange for reduced SRE toil; Datadog retained (vs. full self-host) to avoid re-tooling cost; multi-cloud accepted as a strategic cost; spot/preemptible nodes for non-production; scale-to-zero in non-prod
  • Yes — modelled in FinOps tooling (Cloudability). Run cost of approximately GBP 350k/year (hosting + Datadog + Okta increments + incidental) versus estimated opportunity cost of 15 engineer-years/year lost to platform-adjacent toil in the current state. Payback estimated at 11 months.
  • Per-tenant cost attribution via labels propagated by Crossplane and the Scaffolder (team, service, tier, environment)
  • Showback dashboards rendered in Backstage per team
  • Monthly FinOps review with top-5 spending teams
  • Partial — multi-cloud (ADR-003) adds an estimated GBP 75k/year versus single-cloud. Accepted explicitly as a strategic cost.
QuestionResponse
Has the hosting location been chosen to reduce environmental impact?Partially — europe-west2 (London), us-east4, and asia-southeast1 are all chosen for customer proximity; each region is on a carbon-neutral / renewable power commitment from its respective cloud provider
What is the expected workload demand pattern?Variable predictable — heavier during engineering working hours across regions
QuestionResponse
Must the application be available continuously?Portal yes (engineers across time zones); ephemeral preview environments scale to zero
Can the solution be shut down or scaled down during off-peak hours?Non-production clusters scale to minimal nodes outside working hours; ephemeral previews auto-expire after 48 h idle
Are non-production environments configured to downscale or shut down when not in use?Yes — enforced via Crossplane-managed schedule
Attributes InvolvedDescriptionChosen PriorityRationale
Reliability vs. CostMulti-cloud (GKE + EKS) increases platform engineering costReliabilityStrategic customer commitments and reduced cloud-provider lock-in outweigh ~25% cost premium
Performance vs. Operational ExcellenceGKE Autopilot has slightly higher per-pod cost than standard mode but lower operational burdenOperational ExcellencePlatform team of 12 is the binding constraint; SRE toil reduction compounds
Flexibility vs. Cognitive LoadGolden paths reduce flexibility but lower cognitive loadOperational ExcellencePaved road with opt-out preserves autonomy while making the right path easy

The platform is built internally (open-source-first where appropriate).

AttributeDetail
Source control platformGitHub Enterprise Cloud
CI/CD platformGitHub Actions for repo-level checks; Dagger for typed pipeline logic; Tekton for privileged tasks (image signing, promotion)
Build automationEvery PR: lint, unit tests, SAST, SCA, SBOM, image build, cosign sign (Sigstore)
Deployment automationGitOps via ArgoCD; progressive delivery via Argo Rollouts with SLO gating
Test automation80%+ unit coverage enforced; integration tests via kind clusters in CI; nightly k6 load; monthly chaos
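
As a rough sketch of the repo-level gate that runs on every PR (the job layout is illustrative; SAST, SCA, SBOM generation, image build, and cosign signing run in the shared Dagger and Tekton pipelines rather than in this workflow):

```yaml
# .github/workflows/pr-checks.yaml — illustrative sketch of the repo-level gate only
name: pr-checks
on: [pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.22"
      - name: Lint
        run: go vet ./...
      - name: Unit tests (coverage gate enforced by the shared pipeline)
        run: go test ./... -race -cover
```
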
ControlImplementation
Security requirements identificationThreat model per subsystem; reviewed by Security Architect
SASTSemgrep + GitHub CodeQL
DASTOWASP ZAP against staging portal weekly
SCASnyk + Dependabot
Container image scanningTrivy in pipeline + Trivy Operator at runtime
Secure coding practicesMandatory code review, two approvers for platform core
Patch managementSnyk alerts triaged daily; critical within 24h
Supply chainSLSA L3 target; Sigstore signing; in-toto provenance attached
| Classification | Applies to | Description |
| --- | --- | --- |
| Replace | Manual bootstrapping workflows, Jenkins Groovy shared libraries, team-specific Terraform modules | Replaced with golden-path templates, Dagger pipelines, and the audited Terraform Module Library |
| Rehost | Jenkins jobs (~1,400 of the 2,400) | Rehost straightforward shell-script jobs onto Tekton with minimal changes |
| Replatform | Jenkins jobs (~500) | Jobs moved to Dagger with light refactoring to idiomatic pipeline-as-code |
| Refactor | Jenkins jobs (~300) | Complex Groovy logic rewritten as typed Dagger pipelines |
| Retire | Remaining Jenkins jobs after audit (~200 found redundant) | Confirmed redundant with product team owners |
| Attribute | Detail |
|---|---|
| Deployment strategy | Strangler Fig — the platform stands up alongside the existing estate; teams migrate in waves |
| Migration waves | Wave 0: platform team dogfoods (months 0-3). Wave 1: 5 volunteer teams (months 4-6). Wave 2: remaining teams opted in by directorate (months 7-18). |
| Data migration mode | Not applicable (no customer data in the platform); catalogue populated via GitHub scan |
| End-user cutover | Phased by team; no forced cutover |
| External system cutover | Phased — Jenkins retired per directorate once its last job migrates |
| Maximum acceptable downtime | Hours during migration windows; zero in steady state |
| Rollback plan | Teams can revert to their prior CI or deployment pattern at any time during Wave 2; the platform team monitors adoption and DORA metrics and escalates if a rollback trend emerges (see the sketch below) |
| Acceptance criteria (Wave 1) | 1. Five teams onboarded. 2. New-service lead time < 1 day. 3. Net DevEx score positive. 4. SLOs met. |
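
A minimal sketch of the rollback-trend check referenced above, assuming deployment events per team are already available from the DORA telemetry pipeline; the event shape, the 15% change-failure-rate threshold, and the function names are illustrative:

```typescript
// Illustrative shape of a deployment event from the DORA telemetry pipeline.
interface DeploymentEvent {
  team: string
  deployedAt: Date
  succeeded: boolean
}

// Change failure rate over a set of deployments (0 when there are none).
function changeFailureRate(events: DeploymentEvent[]): number {
  if (events.length === 0) return 0
  return events.filter((e) => !e.succeeded).length / events.length
}

// Teams whose recent change failure rate exceeds the threshold are surfaced
// to the migration squad as a possible rollback trend.
function teamsNeedingEscalation(
  eventsByTeam: Map<string, DeploymentEvent[]>,
  threshold = 0.15
): string[] {
  return [...eventsByTeam.entries()]
    .filter(([, events]) => changeFailureRate(events) > threshold)
    .map(([team]) => team)
}
```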
| Test Type | Scope | Approach | Environment | Automated? |
|---|---|---|---|---|
| Unit | Every component | Go / TypeScript standard | CI | Yes |
| Integration | Control plane, portal plugins | kind clusters + testcontainers | CI | Yes |
| End-to-end | Scaffolder -> running service | Staging cluster; nightly | Staging | Yes |
| Performance | Portal, Scaffolder throughput | k6 (see the sketch below) | Staging | Yes (nightly) |
| Chaos | Control plane resilience | Litmus | Staging | Yes (monthly) |
| Security | Penetration testing | Annual + on major changes | Staging | No |
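
For the nightly performance run, a minimal k6 script sketch is shown below. The staging URL, virtual-user count, and thresholds are placeholders rather than the real test profile:

```typescript
import http from "k6/http"
import { check, sleep } from "k6"

// Illustrative nightly profile: 50 virtual users for 10 minutes against the
// staging portal, with latency and error-rate thresholds.
export const options = {
  vus: 50,
  duration: "10m",
  thresholds: {
    http_req_duration: ["p(95)<500"], // 95th-percentile latency under 500 ms
    http_req_failed: ["rate<0.01"],   // fewer than 1% failed requests
  },
}

export default function () {
  // Placeholder endpoint: a cheap read against the portal's catalogue API.
  const res = http.get("https://portal.staging.example.internal/api/catalog/entities")
  check(res, { "status is 200": (r) => r.status === 200 })
  sleep(1)
}
```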
| Attribute | Detail |
|---|---|
| Release frequency | Continuous (the platform itself deploys multiple times a day) |
| Release process | Trunk-based development; PR -> CI -> merge -> ArgoCD -> canary -> full rollout |
| Release validation | Automated smoke tests + synthetic checks after each deploy |
| Feature flags | LaunchDarkly (shared service) for portal feature toggles (see the sketch below) |
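
A minimal sketch of how a portal feature toggle might be evaluated through the shared LaunchDarkly service, using the LaunchDarkly Node server SDK; the flag key, context kind, and environment variable are illustrative:

```typescript
import { init } from "@launchdarkly/node-server-sdk"

// Illustrative portal-side toggle: gate a new Scaffolder UI behind a flag so
// it can be rolled out team by team before becoming the default.
async function shouldShowNewScaffolderUi(teamName: string): Promise<boolean> {
  const client = init(process.env.LAUNCHDARKLY_SDK_KEY ?? "")
  await client.waitForInitialization({ timeout: 10 })

  const context = { kind: "team", key: teamName }
  const enabled = await client.variation("new-scaffolder-ui", context, false)

  await client.close()
  return Boolean(enabled)
}
```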
| Attribute | Detail |
|---|---|
| Support model | Platform-as-a-product: #stellar-platform Slack channel for support; weekly office hours; consulting sessions for adopting teams |
| Support hours | Business hours primary; 24x7 on-call for SLO-violating platform incidents |
| SLAs | Portal 99.5% monthly; delivery plane 99.9% monthly |
| Escalation paths | Slack -> Platform on-call -> Platform Lead -> Head of Engineering |
| Team Topologies role | The platform team operates as a Platform Team (per Skelton & Pais); stream-aligned teams are its customers; enabling teams coach adoption |
| Question | Response |
|---|---|
| Non-prod auto-shutdown schedule and enforcement | GKE Autopilot non-prod clusters scale to zero out of hours; Cloud SQL non-prod auto-paused; AWS Config + GCP Org Policy alert FinOps if non-prod resources run > 24 h without an exception tag. |
| Right-sizing review cadence | Quarterly via Cloudability + GCP Recommender + AWS Compute Optimizer. Last review (2026-Q1) downsized 4 EKS node groups and one Cloud SQL instance, recovering ~GBP 42k/year. |
| Unused / orphaned resource reclamation | Weekly automation tags resources idle > 14 days; FinOps confirms before deletion (see the sketch below). Scope: snapshots, persistent disks, unused service accounts, idle Datadog integrations. |
| Carbon footprint reported alongside cost | Yes — monthly multi-cloud FinOps + Sustainability review combines the AWS Customer Carbon Footprint Tool and GCP Carbon Footprint reports; tracked against a 2026 platform-wide baseline. |
| Environment retirement actually deletes (vs stops) | Yes — the decommissioning runbook requires Terraform destroy + bucket emptying + key destruction; CMDB Retired status is applied only after both AWS Cost Explorer and GCP Billing confirm zero spend for 30 days. |
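
To make the weekly reclamation step concrete, a minimal sketch of the idle-resource filter is shown below. The inventory record shape and field names are assumptions; the real automation feeds its output to FinOps for confirmation before anything is deleted.

```typescript
// Assumed shape of an inventory record produced by the weekly scan.
interface InventoryItem {
  id: string
  kind: "snapshot" | "disk" | "service-account" | "datadog-integration"
  lastUsedAt: Date
}

const IDLE_DAYS = 14

// Resources idle for longer than the window become reclamation candidates;
// FinOps reviews the list before any deletion happens.
function reclamationCandidates(items: InventoryItem[], now = new Date()): InventoryItem[] {
  const cutoff = new Date(now.getTime() - IDLE_DAYS * 24 * 60 * 60 * 1000)
  return items.filter((item) => item.lastUsedAt < cutoff)
}
```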
| Skill Area | Current Level | Action Required |
|---|---|---|
| Cloud platform (GCP) | High | Continued |
| Cloud platform (AWS) | Medium | Cross-training plan; hire 1 AWS-fluent SRE |
| Kubernetes | High | |
| Infrastructure as Code (Terraform, Crossplane) | Medium | Crossplane training rolled out Q2 |
| CI/CD pipeline management | High | |
| Backstage (TypeScript, React) | Medium | New hire completed; mentoring in progress |
| Security & compliance | Medium | Embed security engineer in platform team (50% allocation) |
| Product management for platforms | Medium | Jane Doe attends Platform Engineering conferences; internal PaaP community of practice |
| Question | Response |
|---|---|
| Can the team fully operate and support this solution in production? | B: Partially capable — core runtime is in-hand; AWS depth and Backstage plugin velocity are the known gaps, with mitigations in place |
| Concern | Approach |
|---|---|
| Keeping software versions current | Renovate for automated dependency PRs; Backstage version bumps on a monthly cadence |
| Hardware lifecycle | N/A (fully cloud) |
| Certificate management | cert-manager (Let’s Encrypt for external; private CA for mTLS) |
| Dependency management | Renovate + Snyk |
| Platform deprecation policy | Breaking changes to templates announced N+2 minor versions in advance |
| Attribute | Detail |
|---|---|
| Exit strategy | Core platform components are CNCF / OSS; catalogue data is portable YAML; customer teams’ services run on standard Kubernetes and so are portable |
| Data portability | Backstage catalogue exportable via its API (see the sketch below); DORA metrics in Snowflake exportable; manifests live in Git |
| Vendor lock-in assessment | Moderate overall (see 3.1.6); Datadog is the highest-lock-in component |
| Exit timeline estimate | 12-18 months to rehost on an alternative portal / IDP |
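
Catalogue portability rests on the Backstage catalogue being readable over its REST API. A minimal export sketch is below; the base URL, token handling, and the assumption that limit/offset paging is available on the entities endpoint are illustrative rather than platform specifics.

```typescript
import { writeFileSync } from "node:fs"

// Sketch: page through the Backstage catalogue API and dump all entities to
// a JSON file for backup or migration to another portal.
async function exportCatalogue(baseUrl: string, token: string): Promise<void> {
  const entities: unknown[] = []
  const limit = 500
  let offset = 0

  for (;;) {
    const res = await fetch(
      `${baseUrl}/api/catalog/entities?limit=${limit}&offset=${offset}`,
      { headers: { Authorization: `Bearer ${token}` } }
    )
    if (!res.ok) throw new Error(`Catalogue export failed: HTTP ${res.status}`)

    const page = (await res.json()) as unknown[]
    if (page.length === 0) break

    entities.push(...page)
    offset += limit
  }

  writeFileSync("catalogue-export.json", JSON.stringify(entities, null, 2))
}
```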

| ID | Constraint | Category | Impact on Design | Last Assessed |
|---|---|---|---|---|
| C-001 | Must integrate with existing Okta, GitHub Enterprise, Datadog, Snowflake | Organisational | Reuse mandated; no parallel IdP or APM | 2026-01-14 |
| C-002 | Multi-cloud required (GCP + AWS) | Commercial | Adds ~25% platform engineering cost | 2026-03-11 |
| C-003 | SOC 2 Type II controls must not regress | Regulatory | Change management, access control, monitoring all in scope | 2026-02-05 |
| C-004 | Platform team headcount capped at 12 for FY26 | Organisational | Forces ruthless prioritisation; reinforces platform-as-a-product discipline | 2026-01-14 |
| C-005 | Budget cap GBP 1.2M capex + GBP 350k/yr opex | Financial | Commercial IDPs (Port.io, Cortex) are out of scope due to per-seat pricing at 400 engineers | 2026-01-14 |
| ID | Assumption | Impact if False | Certainty | Status | Owner | Evidence |
|---|---|---|---|---|---|---|
| A-001 | Adoption will grow organically given a high-quality paved road | Platform becomes a white elephant; adoption stalls | Medium | Open | Jane Doe | Evidenced by 2025 DevEx survey demand; tracked via quarterly adoption KPI |
| A-002 | Stream-aligned teams can absorb the learning curve of GitOps and Kubernetes manifests with Scaffolder support | Higher-than-expected support burden | High | Closed | Claire Doe | Wave 0 + Wave 1 learning feedback positive |
| A-003 | Datadog contract can scale to 3x current ingest without renegotiation | Cost surprise mid-year | High | Closed | Sam Doe | Confirmed with Datadog account team; signed addendum |
| A-004 | GKE Autopilot pricing remains stable for 3 years | Run cost surprise | Medium | Open | Sam Doe | GCP price-hold provisions in enterprise agreement |

Risk identification:

| ID | Risk Event | Category | Severity | Likelihood | Owner |
|---|---|---|---|---|---|
| R-001 | Platform team becomes a bottleneck for feature requests from 60 teams | Operational | High | High | Jane Doe |
| R-002 | Golden paths become too restrictive and teams lose autonomy (“paved road fatigue”) | Operational | High | Medium | Claire Doe |
| R-003 | Shadow platforms emerge — teams route around Stellar Platform, rebuilding parallel stacks | Operational | High | Medium | Tom Bloggs |
| R-004 | Backstage upstream velocity outpaces our ability to track; plugins break on version bumps | Technical | Medium | High | Tom Bloggs |
| R-005 | Multi-cloud abstractions leak, producing unpredictable behaviour between GKE and EKS | Technical | High | Medium | Tom Bloggs |
| R-006 | Compromise of the platform (ArgoCD, Crossplane) amplifies blast radius across all tenant workloads | Security | Critical | Low | Joe Bloggs |
| R-007 | Jenkins migration drags beyond 18 months; carrying cost of two systems becomes unsustainable | Delivery | Medium | Medium | Tom Bloggs |
| R-008 | Datadog vendor lock-in hardens as custom monitors proliferate | Commercial | Medium | Medium | Amir Bloggs |
| R-009 | DORA metrics misinterpreted as individual performance rather than system health | Operational | Medium | Medium | Jane Doe |

Risk response:

| ID | Mitigation Strategy | Mitigation Plan | Residual Risk | Last Assessed |
|---|---|---|---|---|
| R-001 | Mitigate | Platform-as-a-product model with PM-owned roadmap; quarterly prioritisation with top-20 product teams; explicit “escape hatch” guidance so teams can self-serve outside the paved road; community-of-practice model for common contributions back into the platform | Medium | 2026-04-10 |
| R-002 | Mitigate | Paved-road-with-opt-out philosophy baked in; quarterly DevEx surveys specifically ask about fit; template versioning so teams can pin and diverge if needed | Medium | 2026-04-10 |
| R-003 | Mitigate | Visibility through the catalogue (anything in GitHub appears); Engineering Director engagement model to sponsor platform adoption; quarterly adoption review at senior leadership level | Medium | 2026-04-10 |
| R-004 | Mitigate | Track Backstage upstream actively; contribute upstream where we depend on behaviour; plugin acceptance tests in CI; monthly Backstage upgrade cadence | Medium | 2026-04-10 |
| R-005 | Mitigate | Clear composition contract per Crossplane resource; contract tests run on both clouds; ADR required before a new cloud-specific primitive is exposed; deliberately small exposure surface | Medium | 2026-04-10 |
| R-006 | Mitigate | Defence in depth: Sigstore admission, Falco runtime, signed Git, no shared credentials, Crossplane workload identity, annual red-team engagement, zero-standing-privilege model | Low | 2026-04-10 |
| R-007 | Mitigate | Migration wave plan with quarterly go/no-go; published Jenkins EOL date; clear “rehost first, refactor later” policy; dedicated migration squad | Medium | 2026-04-10 |
| R-008 | Mitigate | OpenTelemetry Collector as abstraction layer; dashboards-as-code via Terraform provider (portable); quarterly review of Datadog-specific usage | Medium | 2026-04-10 |
| R-009 | Mitigate | DORA only shown at team level; engineering handbook explicitly describes DORA as system-health signals; director-level coaching on psychologically safe use | Low | 2026-04-10 |
| ID | Dependency | Direction | Status | Owner | Evidence | Last Assessed |
|---|---|---|---|---|---|---|
| D-001 | Okta SCIM connectors stable | Inbound | Committed | Identity team | Existing | 2026-02-15 |
| D-002 | GitHub Enterprise Cloud API rate limits adequate | Inbound | Committed | GitHub vendor | Enterprise contract | 2026-02-15 |
| D-003 | Datadog multi-cloud private connectivity | Inbound | Committed | Datadog | PrivateLink enabled | 2026-03-01 |
| D-004 | Megaport interconnect between GCP and AWS | Inbound | Resolved | Network team | Live since 2026-02 | 2026-02-20 |
| D-005 | Product teams adopt golden paths (Wave 1 commitments) | Inbound | Committed | Engineering Directors | MoU signed 2026-03 | 2026-03-20 |
| ID | Issue | Category | Impact | Owner | Resolution Plan | Status | Last Assessed |
|---|---|---|---|---|---|---|---|
| I-001 | Backstage software-templates plugin has a known memory leak at > 2,000 catalogue entities | Technical | Medium | Tom Bloggs | Upstream fix in v1.26; pinned our instance to v1.25 with workaround | In progress | 2026-04-05 |
| Question | Response |
|---|---|
| Does this design create any exception to current policies and standards? | No |

| Question | Response |
|---|---|
| Does this design create an issue against the process library? | No |

| Question | Response |
|---|---|
| Does the design materially change the organisation’s technology risk profile? | Yes — the platform concentrates supply-chain risk but also concentrates supply-chain controls; net reduction in organisational risk |
| ADR # | Title | Status | Date | Impact |
|---|---|---|---|---|
| ADR-001 | Adopt Backstage rather than build an in-house portal | Accepted | 2026-01-22 | Foundational portal choice |
| ADR-002 | ArgoCD for GitOps rather than Flux | Accepted | 2026-02-09 | Delivery plane foundation |
| ADR-003 | Multi-cloud (GKE primary, EKS secondary) from day one | Accepted | 2026-03-11 | Strategic cost + capability |

| Term | Definition |
|---|---|
| Backstage | CNCF-incubating developer portal framework originated by Spotify |
| Cognitive Load | The total mental effort required of a team to do its work; a core Team Topologies concept |
| Crossplane | Kubernetes-native control plane for provisioning cloud resources via Compositions |
| Dagger | Programmable, portable CI engine with typed SDKs |
| DevEx | Developer Experience — the quality of an engineer’s end-to-end experience using internal tooling |
| DORA | DevOps Research and Assessment metrics: deployment frequency, lead time for changes, change failure rate (CFR), and time to restore service (MTTR) |
| Enabling Team | A Team Topologies team that coaches stream-aligned teams without taking on delivery itself |
| Golden Path | A pre-baked, opinionated route through the software lifecycle that most teams should take by default |
| IDP | Internal Developer Platform |
| Paved Road | Synonym for golden path; emphasises that teams can leave the road but it is the path of least resistance |
| Platform-as-a-Product | Operating model where the platform is treated with product-management discipline |
| PIM | Privileged Identity Management — just-in-time elevation of access |
| Scaffolder | Backstage plugin that turns templates into working repositories |
| SLSA | Supply-chain Levels for Software Artifacts — a supply-chain integrity framework |
| Stream-aligned Team | A product team that delivers value to customers (Team Topologies) |
| TechDocs | Backstage plugin for docs-as-code engineering documentation |
| Workload Identity | Kubernetes-to-cloud identity federation that avoids long-lived credentials |
| Document | Version | Description | Location |
|---|---|---|---|
| Stellar Engineering Platform Strategy 2026-2028 | 1.0 | Strategic context for the platform | Confluence / Strategy / STRAT-0004 |
| Platform-as-a-Product Operating Model | 1.0 | How the platform is run | Confluence / Standards / POL-0031 |
| Stellar Cloud Landing Zone Standards | 3.1 | Account/project layout | Confluence / Standards / STD-0012 |
| Information Security Policy | 4.2 | Security baseline | SharePoint / Policies / POL-0001 |
| DPIA — Engineer Telemetry | 1.0 | DPIA for DevEx telemetry | SharePoint / Legal / DPIA-2026-007 |
| STRIDE Threat Model | 1.0 | Platform threat model | Confluence / Security / THREAT-1042-01 |
| Team Topologies (Skelton & Pais) | | External reference | O’Reilly |
| Role | Name | Date | Signature / Approval Reference |
|---|---|---|---|
| Principal Platform Engineer | Tom Bloggs | 2026-04-15 | ARB-2026-004-PPE |
| Head of Engineering | Priya Bloggs | 2026-04-16 | ARB-2026-004-HOE |
| Security Architect | Joe Bloggs | 2026-04-17 | ARB-2026-004-SEC |
| Architecture Review Board | ARB Panel | 2026-04-18 | ARB-2026-004-APPROVED |

| Section | Score (0-5) | Assessor | Date | Notes |
|---|---|---|---|---|
| 1. Executive Summary | 4 | ARB Panel | 2026-04-18 | Strong business context; drivers, DORA baseline, and platform-as-a-product framing clear; strategic alignment to platform strategy is explicit |
| 3.1 Logical View | 4 | ARB Panel | 2026-04-18 | Three-plane decomposition, component ownership, design patterns, and lock-in assessment all documented |
| 3.2 Integration & Data Flow | 3 | ARB Panel | 2026-04-18 | All interfaces described with protocols and auth; developer-journey sequence diagram present; formal API contracts for DORA endpoint not yet published (tracked) |
| 3.3 Physical View | 3 | ARB Panel | 2026-04-18 | Multi-cloud topology and environment list complete; cross-cloud failover drill scheduled but not yet executed end-to-end |
| 3.4 Data View | 3 | ARB Panel | 2026-04-18 | Data stores classified, retention and encryption defined, DPIA complete; sovereignty addressed. Data-contract-style schemas between planes not formalised |
| 3.5 Security View | 4 | ARB Panel | 2026-04-18 | Zero-standing-privilege model, workload identity, Sigstore, Vault all covered; threat model produced; annual red-team committed |
| 3.6 Scenarios | 4 | ARB Panel | 2026-04-18 | Three strong use cases (bootstrap, deploy, break-glass); three ADRs with genuine alternatives and trade-offs |
| 4.1 Operational Excellence | 4 | ARB Panel | 2026-04-18 | SLIs/SLOs, centralised logging, alert runbooks, DORA telemetry pipeline; mature observability posture |
| 4.2 Reliability | 3 | ARB Panel | 2026-04-18 | HA, multi-region warm standby, chaos monthly; cross-cloud DR rehearsal outstanding |
| 4.3 Performance | 3 | ARB Panel | 2026-04-18 | Targets explicit including DORA deltas; growth modelled to year 5; continuous synthetic load testing |
| 4.4 Cost Optimisation | 3 | ARB Panel | 2026-04-18 | Showback per team, FinOps review cadence; multi-cloud premium explicitly accepted and tracked |
| 4.5 Sustainability | 3 | ARB Panel | 2026-04-18 | Non-prod scale-to-zero; renewable-commitment regions; carbon dashboard planned for Phase 2 |
| 5. Lifecycle | 4 | ARB Panel | 2026-04-18 | Mature CI/CD and supply-chain posture; migration plan with 6 Rs applied to Jenkins estate; skill gaps named and mitigated |
| 6. Decision Making | 4 | ARB Panel | 2026-04-18 | Constraints, assumptions, and especially risks are well grounded in platform-engineering reality (bottleneck, paved-road fatigue, shadow IT, vendor lock-in) |
| Overall | 3 | ARB Panel | 2026-04-18 | Solid Tier 3 platform SAD at Recommended depth. Genuine platform-engineering thinking throughout. Lowest-scoring sections (3) are all known gaps with owners and plans: cross-cloud DR rehearsal, data contracts between planes, Phase-2 carbon dashboard. |