[{"content":"There are two kinds of learning a team can do. The first fixes the error: the build broke, so you fix the build. The second fixes the thing that produced the error: the build broke because of how we work, so we change how we work. Chris Argyris named these single-loop and double-loop learning[1] — adjusting your actions within the existing rules, versus questioning the rules themselves.\nMost of a team\u0026rsquo;s week is single-loop by design. Tickets, PRs, deploys — execute the process, close the loop, repeat. That\u0026rsquo;s correct; you can\u0026rsquo;t double-loop everything or you\u0026rsquo;d never ship. But a team that only single-loops has no mechanism to improve the system it runs on. It gets very good at moving tickets through a pipeline it never steps back to question.\nThe retrospective is that mechanism. It\u0026rsquo;s the one ritual whose entire job is double-loop: not \u0026ldquo;did we finish the work\u0026rdquo; but \u0026ldquo;is the way we work the way we should work?\u0026rdquo; Strip away the cadence and the sticky notes and that\u0026rsquo;s what a retro is — a team\u0026rsquo;s standing forum for examining its own rules.\nOther rituals do double-loop work in their own lanes — a postmortem interrogates an incident, an architecture review or RFC interrogates a design. The retro is the general one: the team\u0026rsquo;s standing place to question how it works week to week, not only when something breaks or a big decision is on the table. On a small team it\u0026rsquo;s often the only such forum; in a larger org it\u0026rsquo;s one of several, and the others don\u0026rsquo;t make it redundant.\nWhat happens when there isn\u0026rsquo;t one Skip it, and three things happen.\nFirst, the team can only single-loop. Problems with the process — the painful release, the meeting that\u0026rsquo;s secretly status, the handoff that keeps dropping things — have nowhere to go. 
They get rediscovered, individually, over and over, and might never be properly addressed — because doing so requires a forum that questions the frame, and there isn\u0026rsquo;t one.\nSecond — and teams underrate this — the double-loop observations don\u0026rsquo;t disappear. Engineers vary a lot in how much they naturally run this kind of analysis; some are constantly noticing \u0026ldquo;the system that produced this is off.\u0026rdquo; For those people, a team with no forum to surface and act on process concerns is quietly corrosive. The observations accumulate with nowhere to put them. That unsettled feeling isn\u0026rsquo;t a personality quirk to be managed — it\u0026rsquo;s information the team is failing to capture. And often the people most attuned to process problems are the ones who disengage or leave first, which means the org loses exactly the signal it most needed.\nThird, the negativity has to land somewhere. Acknowledging that a process is broken is unpleasant work, and with no designated forum to do it in, the observations leak out — blurted at standup, dropped into a PR thread, raised in the wrong meeting at the wrong moment. Everyone in the room may privately agree, but the timing and framing land badly, and there\u0026rsquo;s no shared container to process the discomfort together. So the cost falls on the messenger: the people most sensitive to the problems become associated with the problems. A team without a retro doesn\u0026rsquo;t just fail to fix its process — it quietly converts its most perceptive members into \u0026ldquo;the negative ones,\u0026rdquo; and learns to discount exactly the signal it should be amplifying.\nIt\u0026rsquo;s not only about fixing things Frame a retro purely as problem-intake and you miss half its value. 
A good retrospective is also where a team builds energy and cohesion — and those aren\u0026rsquo;t soft extras, they\u0026rsquo;re what makes the work sustainable.\nA team that only moves tickets experiences work as an endless, undifferentiated churn: finish one, the next appears, forever. Nothing is ever marked. A retro punctuates that churn. We name the problems and commit to solutions — so friction feels like something the team acts on, not weather it endures. We name the wins and actually celebrate them — so progress is visible and shared, not silently absorbed into the backlog. That rhythm of see clearly, decide together, mark the moment is most of what turns a group of people executing tickets into a team.\nWhy most retros fail anyway The failure mode isn\u0026rsquo;t not having retros — it\u0026rsquo;s having ones that produce nothing. A retrospective that surfaces concerns and then changes nothing can be worse than none: it actively trains the team that naming problems is pointless, which is how you get the silent, going-through-the-motions version. What separates the real thing from the theater:\nActioned outcomes. The output is a small number of concrete changes with owners — not a list of grievances that evaporate by Monday. Psychological safety. People only surface real problems where it\u0026rsquo;s safe to. Blame kills the signal. Celebrate, don\u0026rsquo;t just fix. The wins half is load-bearing. Skip it and the ritual becomes a complaint session no one wants to attend. Cadence and follow-through. Regular, and visibly connected to actual change — or it decays into ceremony. A retrospective is, in the end, a feedback loop pointed at the process itself — the meta-loop above all the others. A team that ships constantly but never retrospects is optimizing hard inside a frame it has decided never to examine. 
Sometimes the most valuable thing a team can do is stop moving tickets long enough to ask whether the tickets are the right shape.\n[1] Chris Argyris originated the distinction between single- and double-loop learning. See \u0026ldquo;Teaching Smart People How to Learn,\u0026rdquo; Harvard Business Review (1991), and Argyris \u0026amp; Schön, Organizational Learning (1978).\n","permalink":"https://allenz.net/writing/retrospectives-are-double-loop-learning/","summary":"Most of a team\u0026rsquo;s week is single-loop — execute the process, close the ticket. The retrospective is the one ritual that questions the process itself, and what quietly breaks when a team skips it.","tags":["SDLC"],"title":"Retrospectives: an opportunity for double-loop learning"},{"content":" I am so afraid of the reviewer\u0026rsquo;s word. — the opening of a Rilke pastiche I wrote; the whole poem is at the end.\nPull on that feeling and it\u0026rsquo;s worth asking what the practice is actually for. Code review is one of the few engineering rituals we treat as non-negotiable — every change, through a gate, before it ships. That near-universality is recent, and the gap between why we say we do it and what it actually delivers is wider than the ritual\u0026rsquo;s status suggests.\nThe form is younger than it looks Code review as a concept is old: Michael Fagan formalized software inspection at IBM in 1976.[1] But Fagan inspections were heavyweight and selective — scheduled meetings, defined roles, reserved for code that warranted the cost. They measurably found defects, precisely because they were expensive enough to be done seriously and rare enough to be done only where it paid.\nEarly in my career, at a government contractor, \u0026ldquo;code review\u0026rdquo; meant something else again: once a year the engineers gathered in a room with code projected on the wall and worked through samples together, calibrating a shared sense of what good looked like. 
No per-change gate — a periodic, collective quality calibration that built shared judgment rather than policing individual diffs.\nThe thing we now treat as mandatory — a lightweight, asynchronous, blocking pull-request gate on every change — is a product of the GitHub-PR era, barely fifteen years old. It\u0026rsquo;s one form among several, and we adopted it near-universally in a fraction of the time it took to understand it.\nWhat the Research Actually Says The most-cited empirical study of modern code review — Bacchelli and Bird\u0026rsquo;s Expectations, Outcomes, and Challenges of Modern Code Review (Microsoft Research, 2013)[2] — found a telling gap: the thing developers most expect from review (finding defects) is not its main outcome. The dominant realized value is knowledge transfer, shared awareness of the codebase, and incremental improvement. Review still finds defects — just fewer, and on lower-severity issues, than the bug-hunting framing assumes — and under time pressure much of it stays shallow.\nThat\u0026rsquo;s a sharper claim than \u0026ldquo;review doesn\u0026rsquo;t work.\u0026rdquo; It\u0026rsquo;s that review\u0026rsquo;s value diverges from its justification. We sell it as a quality gate; it pays out mostly as a communication mechanism. Knowledge transfer is genuinely valuable — but it means we\u0026rsquo;re often optimizing the ritual for the wrong thing.\nForm Without Function When review is run as a gate, the rational incentive for everyone involved is to clear the gate — and the cheapest, most legible way to demonstrate that review happened is to comment on what\u0026rsquo;s easiest to see: formatting, naming, brace placement, sort order. Style.\nThis is form without function: the ceremony of review without the substantive engagement that creates its actual value. 
And it\u0026rsquo;s one of the worst-shaped feedback loops in the pipeline — a human, on their own schedule, applying rules that sometimes didn\u0026rsquo;t exist until they were applied to you. Slow, unpredictable, and frequently about the wrong things. Tests give you feedback in seconds; a deterministic pipeline converges in a minute; a style-gate review can sit for two days and then block on a comma.\nThere\u0026rsquo;s a quieter loss in the same vein. Review often surfaces genuinely useful context — why a choice was made, what was tried, what to watch out for — but that context lives in PR comment threads, which are notoriously hard to find again. The understanding the conversation generated ends up buried in a UI nobody greps six months later, detached from the code it explains. The discussion happens; the knowledge evaporates.\nReview isn\u0026rsquo;t risk-free either We discuss review as pure upside, but it has its own failure mode: the correction that introduces the bug. A reviewer suggests a \u0026ldquo;cleaner\u0026rdquo; rename or a small refactor; it looks harmless, the tests stay green, and it ships a regression the original author would never have written — because the author\u0026rsquo;s domain context is strongest at the moment of writing and weakest by the time they\u0026rsquo;re triaging comments days later. A small style change is not automatically a safe change.\nMeanwhile the defects that actually matter tend to surface from production observability — real signals from the field — far more reliably than from a reviewer speculating about code paths they didn\u0026rsquo;t write. Review trades on intuition; the field trades on facts.\nBe honest about what the PR even is A surprising amount of review friction comes from an unstated question: what is this PR claiming to be? Finished, tested, production-ready code — or a design sketch floated for early comment? 
Those want completely different responses, and GitHub\u0026rsquo;s signals for the distinction (draft PRs, labels, conventions) are inconsistent and frequently unknown to the people reviewing.\nDiscovery compounds it: how does a teammate even know which PRs need their eyes, and when? And every answer carries a cost we rarely price — pulling someone out of flow to review a change is an interrupt, and interrupts are expensive. \u0026ldquo;Who reviews what, and at what stage\u0026rdquo; deserves to be a deliberate decision, not an ambient expectation that everyone reviews everything, always.\nAI moves the goalposts Two of review\u0026rsquo;s load-bearing benefits are shifting under it. The first is \u0026ldquo;you learn the codebase by reading other people\u0026rsquo;s changes\u0026rdquo; — a real benefit that erodes when comprehension is cheap on demand and a model can explain any file in seconds. The second is bigger: as more code is machine-generated, review\u0026rsquo;s job quietly changes from onboarding a human author to verifying machine output — a different activity, with different failure modes, that the human-to-human PR gate was never designed for.\nThis second shift is a projection, not a finding. It\u0026rsquo;s early, the evidence isn\u0026rsquo;t in, and I could be wrong about the pace — treat it as a hypothesis to watch, not a settled claim.\nMatch the mechanism to the purpose None of this argues for shipping unreviewed code. It argues for being honest about what you want from review and routing each goal to the mechanism that actually delivers it:\nWant to catch the mechanical defects? Invest where those are actually caught: tests, types, static analysis, property-based checks, design review before code is written, and production observability once it\u0026rsquo;s running — which reports what\u0026rsquo;s actually broken instead of what a reviewer imagines might be. Want knowledge transfer? 
Optimize for understanding, not gatekeeping: pairing, walkthroughs, review-for-comprehension. Don\u0026rsquo;t dress communication up as a quality control it isn\u0026rsquo;t. Want human judgment? Reserve human review for what only a human catches: does this solve the right problem, does it fit the system, is the design sound — plus the security holes, broken assumptions, and logic errors no test was written to check. That\u0026rsquo;s the highest-value thing review can do — and it\u0026rsquo;s exactly what gets crowded out when the same review is also expected to police whitespace. And notice how little is left for the gate. The jobs review is most often used for — consistency and alignment — are exactly the ones other mechanisms do better. Style belongs to linters, run on save or in CI: instant, deterministic, never political. Team conventions belong in shared, distributed tooling — a common formatter config, and increasingly a shared set of AI rules and skills every engineer runs locally — so alignment is baked into everyone\u0026rsquo;s environment before code is written, not policed after the fact by whoever happens to review. Knowledge transfer is pairing\u0026rsquo;s native job: the same understanding plus a real review, in real time, with none of the gate\u0026rsquo;s async lag. Subtract all of that, and what\u0026rsquo;s left for a human reviewer is small and genuinely valuable — the design judgment only a person brings — which is precisely what the all-purpose gate crowds out.\nIt\u0026rsquo;s no coincidence that heavy review gates tend to travel with heavy, infrequent, painful releases — both are the same ceremony-first instinct. DORA — the DevOps Research and Assessment program[3] — found the high performers go the other way: lightweight process and frequent, small, reversible deploys correlate with better stability, not worse. 
The gate feels like safety; the data says fast and reversible is safer.\nLow-ceremony, substantive review beats high-ceremony style-gating on most of the axes that matter. Be intentional about what review is for and who is involved. The dread in the poem below isn\u0026rsquo;t of feedback — good feedback is a gift. It\u0026rsquo;s of the gate: slow, unpredictable, aimed at the wrong target. Fix the loop, and the reviewer\u0026rsquo;s word stops being something to fear.\nCoda — a Rilke pastiche A short pastiche of Rainer Maria Rilke\u0026rsquo;s Ich fürchte mich so vor der Menschen Wort (from Mir zur Feier, 1899), reimagined for the modern code review. The original is about how categorical naming flattens the world — \u0026ldquo;die Dinge singen hör ich so gern. Ihr rührt sie an: sie sind starr und stumm\u0026rdquo; (I love so much to hear the things sing. You touch them: they become rigid and mute). The pastiche keeps the three-quatrain structure and the closing-line punch, swapping source-formatting concerns for Rilke\u0026rsquo;s Hund / Haus / Berg / Garten.\nOriginal — Rainer Maria Rilke, Mir zur Feier (1899) Ich fürchte mich so vor der Menschen Wort.\nSie sprechen alles so deutlich aus:\nUnd dieses heißt Hund und jenes heißt Haus,\nund hier ist Beginn und das Ende ist dort.\nMich bangt auch ihr Sinn, ihr Spiel mit dem Spott,\nsie wissen alles, was wird und war;\nkein Berg ist ihnen mehr wunderbar;\nihr Garten und Gut grenzt grade an Gott.\nIch will immer warnen und wehren: Bleibt fern.\nDie Dinge singen hör ich so gern.\nIhr rührt sie an: sie sind starr und stumm.\nIhr bringt mir alle die Dinge um.\n— Rainer Maria Rilke, Mir zur Feier (1899). 
Public domain.\nPastiche — German Ich fürchte mich so vor des Reviewers Wort.\nEr ordnet die Token so deutlich aus:\nein jedes Komma gehört in sein Haus,\nund hier ist der Tab und das Leerzeichen dort.\nMich bangt auch ihr Werk, ihr Spiel mit dem Spott,\nsie wissen alles, wie\u0026rsquo;s gewesen war;\nkein Code ist ihnen mehr wunderbar;\nihr Format und Stil grenzt grade an Gott.\nIch will immer warnen und wehren: bleibt fern.\nDen Code, der läuft, seh ich so gern.\nIhr rührt ihn an: er wird starr und stumm.\nIhr bringt mir alle Funktionen um.\nPastiche — English I am so afraid of the reviewer\u0026rsquo;s word.\nHe formats the tokens so clearly and with haste.\nEvery comma belongs only in its place.\nAnd this is the tab and this is the space.\nTheir work alarms me too, their play with scorn;\nthey know each line and where each clause must go;\nno running system for them is wonderful to know;\ntheir format and style so pious and concerned.\nI want to warn them: stay your hand.\nThe code in production — I love to see it run.\nYou touch it: rigid, mute, a frozen thing.\nYou kill each optimization that I\u0026rsquo;d planned.\n[1] Michael E. Fagan, \u0026ldquo;Design and Code Inspections to Reduce Errors in Program Development,\u0026rdquo; IBM Systems Journal 15, no. 3 (1976) — the origin of formal software inspection.\n[2] Alberto Bacchelli and Christian Bird, \u0026ldquo;Expectations, Outcomes, and Challenges of Modern Code Review,\u0026rdquo; Proceedings of ICSE 2013 (Microsoft Research).\n[3] DORA (DevOps Research and Assessment) is a multi-year research program into software-delivery performance; its findings are synthesized in Accelerate (Nicole Forsgren, Jez Humble, Gene Kim, 2018) and the annual State of DevOps reports. 
dora.dev\n","permalink":"https://allenz.net/writing/what-code-review-is-actually-for/","summary":"The mandatory, blocking PR gate is barely fifteen years old — and the research says review\u0026rsquo;s value diverges from its justification. What code review is actually for, and why most of its jobs are better served elsewhere.","tags":["SDLC"],"title":"What is code review actually for?"},{"content":"Context A common Kubernetes workload pattern: an application is operationally stateless (caches, work directories, JSP compile output — nothing that needs to survive a pod restart) but its database upgrade procedure is fragile. Specifically: the application\u0026rsquo;s schema migration assumes only one instance of the application can be connected to the database while a migration runs. Run two, the upgrade can corrupt the schema.\nThis pattern is common for older applications that were not originally designed for cloud-native deployment — a large Java portal platform is the example throughout, but the same shape applies to many JVM applications with embedded schema-migration logic.\nThe question: should this workload run on a Deployment or a StatefulSet?\nAt chart inception, an informal decision picked StatefulSet on the reasoning that StatefulSet\u0026rsquo;s ordered rolling-update semantics combined with manual scale-to-zero-first can guarantee the single-instance-during-upgrade invariant. The choice was not documented, and live operational evidence has surfaced costs that warrant a structured re-evaluation.\nForces and constraints Upgrade fragility is real. Only one instance can be connected to the database while a schema migration runs. This is the strongest argument for any non-default deployment shape. State is externalized. Database, document library, search, secrets all live in managed services. The pod itself only holds transient cache/work-directory state. Multi-AZ deployment is in scope. 
GKE Persistent Disks are zonal by default. PVC zone affinity matters operationally. Horizontal Pod Autoscaler is in scope for traffic-based scaling. Argo CD delivers the chart via helm-rendered manifests. Immutable-field constraints and sync-wave annotations matter to operational behavior. This is a greenfield stack, so the decision can still be revisited at low cost. Options A. Keep StatefulSet (status quo) StatefulSet with volumeClaimTemplates producing one ReadWriteOnce PVC per ordinal. Pods get stable DNS identities; ordered rolling-update semantics.\nB. Move to Deployment Deployment with default RollingUpdate for steady-state (maxSurge=1, maxUnavailable=0 for zero-downtime config changes at replicas=1). For upgrade: strategy.type: Recreate + scale-to-1 + an application-level startup lock. Persistent state goes only to externalized services.\nC. Custom workload-aware controller (CRD + operator) An application-shaped CRD plus a controller that encapsulates the application\u0026rsquo;s specific upgrade semantics (single-instance-during-upgrade, migration orchestration, rollback), using Deployment underneath.\nD. StatefulSet without volumeClaimTemplates Keep kind: StatefulSet, replace per-pod RWO PVCs with emptyDir. Strict subset of Option B\u0026rsquo;s work.\nConsequences Option A — StatefulSet Observed cons in live operations:\nDowntime on every spec change at replicas=1. StatefulSet\u0026rsquo;s rolling update kills the lone pod before the replacement starts. With a 2–3 minute boot time, every env-var change, config tweak, or annotation update produces a 2–3 min outage. Per-pod cache divergence. Each replica gets its own PVC. OSGi caches, work directories, and compiled-JSP state diverge across replicas — pods can serve different bundle versions or configuration states. PVC accumulation on scale-down. PVCs remain bound to the deleted ordinals\u0026rsquo; identities. Autoscaling amplifies this. 
Since Kubernetes 1.27, persistentVolumeClaimRetentionPolicy.whenScaled: Delete[1] is available by default (beta in 1.27, stable in 1.32) to address this, but it must be explicitly set. Resize requires downtime. Resizing volumeClaimTemplates requires deleting and recreating the StatefulSet (immutable field). Broader immutable-field surface than Deployment. serviceName, selector, volumeClaimTemplates, podManagementPolicy — all immutable. With Argo CD\u0026rsquo;s apply path, hitting any of these requires either a destructive Replace=true sync (loses ordinal identity and data) or manual kubectl delete --cascade=orphan outside GitOps. Zone-affinity PVC conflicts on multi-AZ clusters. PVCs bind to a specific zone the first time they\u0026rsquo;re provisioned. When the autoscaler later places a node in a different zone, the pod sits Pending indefinitely with volume node affinity conflict. Regional PDs avoid this but cost more. HPA composition is awkward. Stable identity is wasted for HTTP serving. OrderedReady serialization makes scale-up slow (ordinal N waits for ordinal N-1 to be Ready). KEDA scale-to-zero leaves PVCs stranded — the \u0026ldquo;zero cost when idle\u0026rdquo; promise degrades to \u0026ldquo;zero CPU when idle, still paying for stranded disks.\u0026rdquo; Option B — Deployment Pros:\nZero downtime on steady-state spec changes via RollingUpdate with maxSurge=1, maxUnavailable=0, even at replicas=1. Upgrade safety achievable via strategy.type: Recreate + scale-to-1 + application startup lock. This is a documented pattern the vendor\u0026rsquo;s own managed cloud has used in production for years. No PVC accumulation, no per-pod cache divergence, no zone-affinity trap. Smaller immutable-field surface. Natural HPA composition. KEDA scale-to-zero works. Cons:\nMigration cost: chart changes, validation of the upgrade-safety path, PVC migration. Slightly weaker peer-identity guarantees (no stable DNS names). 
Application\u0026rsquo;s current clustering uses application-layer discovery, so this is not an active constraint — but worth flagging if a future clustering protocol depends on stable DNS. Upgrade orchestration is more configurable than declarative — requires deliberate use of Recreate + lock + scale-to-1. Option C — Custom controller Pros:\nEncapsulates application-specific upgrade knowledge in an application-specific abstraction. The upgrade procedure lives in the controller\u0026rsquo;s reconciliation loop, not in chart annotations. Hides the Deployment-vs-StatefulSet question from chart consumers entirely. Foundation for future application-specific features (graceful drain during upgrade, traffic shifting between data planes). Cons:\nMulti-week engineering investment. Building a production-grade operator is a project, not a sprint task. Yet another in-cluster controller to operate. Risks over-fitting to current constraints. Doesn\u0026rsquo;t address the immediate operational pain from Option A in any near timeframe. Option D — StatefulSet without volumeClaimTemplates Pros:\nStrictly better than Option A. Resolves cache divergence, PVC accumulation, resize downtime, zone-affinity failures, KEDA\u0026rsquo;s stranded-disk problem — without changing the controller kind. Implementation cost is exactly the PVC-removal work. No semantic change for chart consumers; no Argo CD reconfiguration; no pod-identity migration. Doesn\u0026rsquo;t preclude moving to Option B later — same work plus a future controller-kind swap. Cons:\nPreserves the downtime symptom that drove this re-evaluation. StatefulSet\u0026rsquo;s RollingUpdate still kills the lone pod before the replacement starts, regardless of attached storage. Doesn\u0026rsquo;t address the conceptual mismatch (using a stable-identity primitive for stateless HTTP serving). Doesn\u0026rsquo;t align with the vendor\u0026rsquo;s published engineering position. 
Risks becoming a \u0026ldquo;stopping point\u0026rdquo; — once visible PVC symptoms are gone under Option D, political pressure to keep going to Option B diminishes. Recommendation Adopt Option B (Deployment) as the target end state. Treat Option C (custom controller) as a separate future initiative. Treat Option D as a defensible intermediate step if Option B faces near-term resistance — but only with a written commitment to follow through to B.\nReasoning:\nOption B reverses the full set of operational pain points (downtime, PVC accumulation, cache divergence, immutable-field traps, zone-affinity failures, HPA incompatibility) without multi-week engineering investment. It aligns with the application vendor\u0026rsquo;s own published engineering position. It does not preclude a future Option C: a custom controller can wrap Deployment underneath. The upgrade-safety argument is preserved via Recreate + startup lock + scale-to-1. The cost of being wrong is low: the stack is greenfield, the migration is straightforward, and a future custom controller can supersede this decision. The cost of not deciding is higher than the cost of any of the four options. Leaving the choice undocumented means every future operator who sees a StatefulSet outage will rediscover the same arguments from scratch.\nReceipts What the recommendation rests on, including negative evidence:\nVendor engineering position (published, written, attributed). The vendor\u0026rsquo;s managed-cloud documentation: \u0026ldquo;The application and Backup services use the Deployment type, so that they can share access to the document library.\u0026rdquo; StatefulSet is reserved for CI services. This is the only public, written, vendor-authored defense of a K8s controller choice for the workload I could find — and it chooses Deployment. Vendor-employee-authored tutorial (2020). 
A published tutorial deploying the platform on Kubernetes demonstrates the workload-on-K8s with Deployment, multi-replica, application-layer clustering, shared persistence. Production-deployed config from the vendor\u0026rsquo;s own production cloud. strategy.type: Recreate + application-level startup lock (CONTAINER_STARTUP_LOCK_ENABLED=true) + scale-to-1 — the documented upgrade-safety pattern in the vendor\u0026rsquo;s own production cloud. Live operational observations. 2–3 minute downtime observed on env-var change applied to the StatefulSet at replicas=1; multiple immutable-field traps hit during Argo CD synchronization on routine chart updates. Community ecosystem signal. All publicly available community Helm charts for this workload (multiple independent implementations) use Deployment. These are mostly hobby projects, so the signal is weak individually — but it indicates that practitioners reaching for the workload-on-K8s default to Deployment without feeling the need to justify it. Negative evidence (acknowledged). No public, vendor-engineering-authored artifact defends StatefulSet for this workload. The vendor\u0026rsquo;s K8s deployment videos either use StatefulSet without justifying it or don\u0026rsquo;t address the choice at all. The strongest argument for StatefulSet — the senior IC\u0026rsquo;s upgrade-safety reasoning — was not written down anywhere and was not validated against the vendor-cloud Deployment counter-example. Open questions If we adopt Option B: what is the migration path for any existing StatefulSet PVCs? Likely \u0026ldquo;discard, since they only hold ephemeral cache state\u0026rdquo; — but worth confirming. If we adopt Option B: do we revisit HPA and KEDA configuration to take advantage of the cleaner scaling model? If we adopt Option D as an intermediate: do we set a target date for the follow-on Option B migration so the intermediate doesn\u0026rsquo;t become permanent? If we adopt Option C: who owns the operator? 
What\u0026rsquo;s the timeline? Does it block Option B in the interim? Is there a stronger upgrade-safety argument for StatefulSet that hasn\u0026rsquo;t been written down anywhere — i.e., is the original argument complete, or is there additional context that wasn\u0026rsquo;t captured? This documents an architectural reconsideration I led for the GKE deployment of a Java application with fragile database-upgrade semantics. The recommendation has not been formally accepted at time of writing — the document exists to put the alternatives, the evidence, and the consequences on the record so that whichever direction is chosen is chosen with the analysis, not around it.\n[1] Kubernetes persistentVolumeClaimRetentionPolicy (beta and enabled by default as of v1.27, stable as of v1.32) lets a StatefulSet delete its PVCs on scale-down or deletion instead of retaining them.\n","permalink":"https://allenz.net/writing/statefulset-vs-deployment-for-stateless-with-fragile-upgrade-workloads/","summary":"A decision record for a workload that\u0026rsquo;s operationally stateless but has a fragile single-instance upgrade: StatefulSet vs Deployment, with live operational evidence and the case for Deployment plus a startup lock.","tags":["Kubernetes"],"title":"StatefulSet vs Deployment for stateless-with-fragile-upgrade workloads"},{"content":"Crossplane lets you define cloud infrastructure as Kubernetes composite resources, with the actual resource emission handled by a composition pipeline. The default composition language is Go templates rendered by function-go-templating. For small compositions this works fine.\nAs the surface grows — more resource types, more shared logic, more conditional emission — three problems start recurring:\nHard to test. Go templates produce strings. Verifying behavior means rendering YAML and grepping for fields. There is no native unit-test framework. No static typing. A typo in a CRD field name (spec.forProvider.manifest, managementPolicies, etc.) 
renders fine and only fails at apply time inside the cluster — where the failure attribution is \u0026ldquo;your composition failed, here\u0026rsquo;s the rejected manifest\u0026rdquo; rather than \u0026ldquo;line 47 referenced a field that doesn\u0026rsquo;t exist.\u0026rdquo; Hard to know what a change will affect. Templates are stringly-typed and globally scoped via _helpers.tpl. Any refactor requires reading every consumer to be safe. Once your composition crosses some complexity threshold — call it \u0026ldquo;two engineers can no longer hold the full template surface in their heads\u0026rdquo; — these become real costs.\nKCL as the alternative KCL (Kusion Configuration Language) is an open-source, statically-typed configuration language for generating structured data. CNCF Sandbox project, Python-flavored syntax, schema-based type system. Hermetic and deterministic. Crossplane v2 supports it as a first-class composition function via crossplane-function-kcl.\nFor a Crossplane composition that has outgrown Go templates, KCL gives you:\nStatic typing of CRD references. Provider models can be generated from your CRDs (kcl-openapi); field typos fail at kcl run, not at kubectl apply. The KCL LSP gives autocomplete and jump-to-definition on CRD fields. Native unit tests. kcl test auto-discovers tests. You write per-layer assertions instead of rendering YAML and grepping. Modules and schemas for composability. Composition logic can be broken into per-layer files (e.g., init.k, k8s_resources.k, sql.k, storage.k) with proper import semantics. Shared utilities live in modules, not in a _helpers.tpl swamp. The architecture: multi-step pipeline, not bundled There are two ways to wire KCL into a Crossplane composition:\nSingle bundled function-kcl step. Concatenate all layers into one input. 
Requires a bundler (often Python) to assemble inputs, a schema {layer}_layer: indirection per layer so the bundled input stays addressable, and an indent step that can leak into string literals. Multi-step pipeline. Each composition layer is its own function-kcl step. The shared context is concatenated into each step\u0026rsquo;s input at template time (e.g., by a small Helm helper). The multi-step shape is materially better:\nNo Python bundler. No schema {layer}_layer: wrapper. No indent step that can corrupt string literals. Per-step error attribution in the cluster. When a step fails, the XR Synced condition names it (\u0026ldquo;pipeline step \u0026lsquo;init-status\u0026rsquo; returned a fatal result\u0026rdquo;). The bundled architecture fails everything identically inside one opaque step. Per-step logs in the function-kcl pod. Each invocation produces its own log line. Step ordering is declarative. The pipeline YAML lists the order; reviewers see it directly. Per-step lifecycle hooks become available — conditional / skip behavior at step granularity. The cost: more pipeline steps to declare, and a small impedance bridge between how KCL is authored on disk (modular, qualified imports, IDE-friendly docstrings) and what function-kcl accepts inline (one flat string with no filesystem). The bridging can be done at install time with a Helm helper (~30 lines), preserving the on-disk dev experience.\nWhat end-to-end validation surfaces (and unit tests don\u0026rsquo;t) Validating against a real cluster — not just kcl test — surfaced a class of bugs that local builds, type-checks, and unit tests all missed:\nobserved.resources keying mismatch. Go templates key by composition-resource-name annotation; function-kcl\u0026rsquo;s params.ocds keys by metadata.name. A direct port of an existing Go-template composition carries over the wrong keys. The dependent guards stay false; the corresponding resources are never emitted. Hit four separate times during validation. 
Forward references in lambda bodies are accepted, not rejected. KCL evaluates lambda bodies top-down; an identifier referenced before its assignment resolves to an empty value at the use site, which then propagates silently through the safe-navigation chain. A Go or Rust compiler would reject this as use-before-declare. Fail-soft idioms make silent-empty indistinguishable from not-ready. KCL\u0026rsquo;s ?. safe-navigation plus or default causes the entire chain to collapse to [] or \u0026quot;\u0026quot; with no error. Then if _is_ready: evaluates false and the layer emits nothing. The cluster reports Synced=True Ready=True because the composition successfully decided to emit nothing. From outside, this is identical to \u0026ldquo;not yet ready, will retry next reconcile\u0026rdquo; — which it isn\u0026rsquo;t. Items envelope. Every layer file ended with items = {\u0026quot;items\u0026quot;: get_items(...)} — a bundler-era convention. When run as its own step under the multi-step architecture, function-kcl received items: {dict} and rejected it (\u0026ldquo;wrong node kind: expected SequenceNode but got MappingNode\u0026rdquo;). The multi-step pipeline named the failing step in the XR Synced condition — under the bundled architecture this would have failed inside the single opaque step. Defenses that catch this bug class:\nPer-state golden-file tests — assert the exact emitted set for fixtures including \u0026ldquo;all-ready\u0026rdquo;, \u0026ldquo;partially-ready\u0026rdquo;, and \u0026ldquo;nothing-ready\u0026rdquo; states. Parity tests against captured output won\u0026rsquo;t catch the keying mismatch because parity fixtures don\u0026rsquo;t exercise post-readiness emission paths with a populated ocds. Replace ?. with [] on lookups required for correctness. Use safe-navigation only for genuinely-optional fields; let required-but-missing data fail loudly during testing. Render-side coverage. 
Render the chart (helm template) in CI and run kcl run on each extracted step source. The bytes function-kcl actually runs are the Helm-rendered output, not the on-disk .k file — if your Helm helper regex is wrong, local tests pass but the cluster breaks. Distribution: where this gets harder The inline-multi-step approach embeds the KCL source for each step directly in the Composition CR. This is deliberate: no publish lane, no new OCI image, no new function CRD. The composition layer changes; the surrounding stack does not.\nIf you eventually outgrow inline distribution (composition exceeds Kubernetes\u0026rsquo; ~1 MiB CR limit, or cross-chart sharing becomes load-bearing), three lanes are worth comparing:\nCriterion | Inline KCL | KCL via OCI module | Custom Go composition function\nPublish lane required | No | Yes (one repo, versioned) | Yes (one image, versioned)\nPer-release maintenance | Push code | Push module | Push image + track CVEs + base-image upgrades + SBOM\nLanguage familiarity on most teams | Narrow | Narrow | Wide\nCRD type safety | Yes | Yes | Yes (via function-sdk-go)\nStack-trace quality | Source line numbers | Source line numbers | Native Go stack traces\nEcosystem maturity | crossplane-function-kcl is younger | Same | function-sdk-go more mature\nThe honest tradeoff: if KCL succeeds as an A/B against Go templates, that argues against Go templates — not automatically for KCL. Once you accept a publish lane, custom Go composition functions become a real contender. Go is what most teams read and write today; function-sdk-go is more mature than crossplane-function-kcl; you get real Go stack traces.\nThe counter-argument: OCI distribution for KCL is operationally lighter than OCI distribution for Go. KCL OCI is config files only; Go OCI is a binary with a base image, CVE tracking, SBOM management, and image-version coordination. Same publish cost, very different maintenance cost.\nThe inline-multi-step lane is the one that defers this question indefinitely. 
The bundler issues that pushed people toward \u0026ldquo;we\u0026rsquo;ll need OCI eventually\u0026rdquo; are gone, and the multi-step architecture can keep running without one. If the A/B succeeds and the team eventually wants a publish lane, the KCL-vs-Go decision becomes hands-on rather than theoretical — they\u0026rsquo;ve already lived with KCL idioms on internal work.\nReferences KCL Language crossplane-function-kcl Crossplane Compositions ","permalink":"https://allenz.net/writing/when-go-templates-outgrow-you-a-typed-language-alternative-for-crossplane-compositions/","summary":"When Crossplane\u0026rsquo;s Go-template compositions outgrow you — no types, no tests, global scope — KCL offers a typed, testable alternative. The multi-step pipeline architecture, and the bugs only end-to-end validation catches.","tags":["Infrastructure as Code","Kubernetes"],"title":"When Go templates outgrow you: a typed-language alternative for Crossplane compositions"},{"content":"Cloud SQL for PostgreSQL with IAM database authentication is an attractive pattern: each workload connects as a Google service account, there are no database passwords to rotate or leak, and access is governed by IAM. But IAM auth interacts badly with a routine platform operation — restoring a database from one environment into another — in a way that\u0026rsquo;s easy to misdiagnose. The root cause is object ownership, and the fix is to make ownership environment-independent. 
This is a worked example from running a large Java platform on GKE, but it applies to any platform that uses per-environment database identities and promotes data between environments.\nThe setup Each environment (dev, stg, prd) runs the application as its own per-environment IAM service account, and that SA is the PostgreSQL role the app connects as:\napp-stg-infra-sa@\u0026lt;project\u0026gt;.iam → the DB user in staging app-prd-infra-sa@\u0026lt;project\u0026gt;.iam → the DB user in production When the application creates a table, that table is owned by the role that created it — the per-environment SA. So in staging, every table is owned by ...-stg-infra-sa; in production, by ...-prd-infra-sa. Ownership is baked into the data, keyed by a role name that is different in every environment.\nThe failure Now restore staging\u0026rsquo;s database into a fresh production instance (or refresh a lower environment from a higher one). PostgreSQL preserves object ownership by role name. After the restore, production\u0026rsquo;s tables are owned by ...-stg-infra-sa — a role that may not even exist on the production instance, and certainly isn\u0026rsquo;t the role production\u0026rsquo;s app connects as.\nTwo things break:\nOwnership-gated DDL fails. The production app, connected as ...-prd-infra-sa, can\u0026rsquo;t ALTER/DROP/re-own tables it doesn\u0026rsquo;t own. Schema upgrades and any owner-only operation fail. The restore itself needs a privileged actor. To re-own objects or run the grant fix-ups, you need a role with authority over all of them. On Cloud SQL that pulls you toward the postgres superuser-equivalent — which is why teams end up resetting the postgres password after every cross-environment restore (gcloud sql users set-password postgres ...) and maintaining a dedicated DB-admin service account just to perform the re-owning. That\u0026rsquo;s a recurring manual step and a standing credential, both of which exist only to paper over the ownership mismatch. 
The symptom looks like a permissions or auth problem. The cause is that ownership is a per-environment identity persisted in the data, and restore moves the data without translating the identity.\nThe fix: own everything as cloudsqlsuperuser Cloud SQL provides a built-in role, cloudsqlsuperuser, that exists identically on every Cloud SQL instance (it\u0026rsquo;s the closest thing Cloud SQL offers to a real superuser, since it withholds true SUPERUSER).1 If every table is owned by cloudsqlsuperuser instead of by the per-environment SA, ownership becomes environment-independent:\nA staging dump restored into production arrives with all objects owned by cloudsqlsuperuser — a role that already exists in production and means the same thing there. The per-environment app SA no longer needs to own tables; it only needs the privileges to use them (SELECT/INSERT/UPDATE/DELETE, USAGE on sequences, etc.), granted via role membership or default privileges. No post-restore re-owning step. No postgres password reset. Nothing keyed to the source environment to translate. The subtlety that makes this real rather than aspirational is dynamically created tables. It\u0026rsquo;s not enough to set ownership on the tables that exist at provisioning time — a large application platform creates tables at runtime (per feature, per deployment). Those must also land as cloudsqlsuperuser-owned, or the next cross-env restore reintroduces exactly the mismatch you just eliminated. The durable fix sets the owning role such that every table, including ones created later, is owned by cloudsqlsuperuser at creation time.\nThe next step: delete the root password and the DB-admin SA entirely Once ownership is environment-stable, the two pieces of machinery that existed only to manage it become removable. 
The proposal: grant the application\u0026rsquo;s IAM service account cloudsqlsuperuser membership at instance-creation time:\ngcloud sql users insert \u0026lt;app-iam-sa\u0026gt; \\ --instance=\u0026lt;instance\u0026gt; \\ --type=CLOUD_IAM_SERVICE_ACCOUNT \\ --database-roles=cloudsqlsuperuser With the app SA already a member of cloudsqlsuperuser:\nThe dedicated DB-admin service account goes away — there\u0026rsquo;s no longer a separate privileged identity needed to perform owner-level operations. The postgres root password goes away — nothing in the normal lifecycle (provision, deploy, restore, upgrade) needs to authenticate as postgres, so there\u0026rsquo;s no password to reset after restores and no standing root credential to secure. The cross-environment restore flow collapses to \u0026ldquo;restore the data\u0026rdquo; — no identity translation, no privileged fix-up step, no credential reset.\nThe general principle This is the same failure family as stable-vs-rewritten identity across environments: a value persisted in the database that also encodes which environment it belongs to is a cross-environment portability hazard. There the value was a tenant\u0026rsquo;s web.id; here it\u0026rsquo;s object ownership. In both cases the moment you move data between environments, the persisted copy disagrees with the target environment, and the system has no way to reconcile it except to fail or to bolt on a translation step.\nThe robust pattern is to separate identity-for-access from identity-for-ownership. Let the per-environment IAM SA carry access (it\u0026rsquo;s the right place for least-privilege, per-environment credentials), but anchor ownership to a role that is constant across environments. Ownership baked into data should be environment-invariant; anything environment-specific belongs in the grant layer, not the ownership layer. 
Get that separation right and cross-environment restore stops being a special operation with its own fix-up choreography — it becomes a plain data copy.\ncloudsqlsuperuser is the default superuser role Cloud SQL grants to the user accounts you create; the managed service withholds the true PostgreSQL SUPERUSER attribute. See Cloud SQL for PostgreSQL users and IAM database authentication.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://allenz.net/writing/environment-stable-table-ownership-surviving-cross-environment-restore-with-iam-database-auth/","summary":"Cloud SQL IAM database auth breaks cross-environment restores because table ownership encodes a per-environment service account. Make ownership environment-independent by owning every table as cloudsqlsuperuser.","tags":["Databases","Security","GCP"],"title":"Environment-stable table ownership: surviving cross-environment restore with IAM database auth"},{"content":"When a stateful application bakes an environment-derived identity into its database on first boot, cross-environment database restore breaks in a way that\u0026rsquo;s easy to misdiagnose and instructive to fix. This is a worked example from running a large Java portal platform on Kubernetes, but the shape of the problem and the decision behind the fix generalize to any platform that promotes data between environments.\nThe failure The platform seeds a default company (its top-level tenant) on first boot, keyed by a configured web.id property. The value is written into the company row in the database. At runtime, the portal also uses that same configured value to look up the default company — an internal lookup resolves the default tenant by matching the configured web.id against the persisted row.\nOn Kubernetes, the platform was rendering web.id per environment, derived from the environment\u0026rsquo;s hostname ({project}-{env}.example.com). 
That\u0026rsquo;s fine until you restore across environments — clone a dev database into UAT, or refresh a lower environment from a higher one:\nDev boots first, writes company with webId = \u0026lt;dev value\u0026gt;. A volume-level restore of dev\u0026rsquo;s database into UAT brings that row over verbatim. UAT boots configured with \u0026lt;uat value\u0026gt;, looks up company WHERE webId = \u0026lt;uat value\u0026gt;, and finds nothing. getDefaultCompanyId() throws IllegalStateException: Unable to get default company ID; initialization aborts; the container exits; the orchestrator restarts it; CrashLoopBackOff. The symptom (a crash loop) points at the runtime. The cause is a configuration value that is required to match a value persisted in the database — and the platform was deriving that value differently in every environment.\nThe root cause is older than it looks Per-environment derivation wasn\u0026rsquo;t a considered design choice; it was a workaround for an unrelated symptom (\u0026ldquo;the company ID kept changing\u0026rdquo;) that traced back to a first-boot race when the workload ran with multiple replicas and no startup lock. Once the workload settled to a single replica behind a startup lock, the race could no longer happen — but the per-environment web.id lived on, no longer solving any real problem and now actively breaking restore. A classic case of a workaround outliving the bug it patched, and only surfacing when a different capability (cross-env restore) exercised it.\nThe decision space Three mechanisms, with materially different cost and reach:\nStabilize the identity at seed time. Default web.id to a project-stable value (independent of environment). Cheap, ships immediately, and removes the failure outright for the default tenant — the environments are already isolated by namespace, project, database instance, and credentials, so nothing of value is lost. 
This is also what the mature, longer-lived version of the same product platform had effectively converged on (a single stable value frozen across environments).\nRewrite the identity after restore. Run a post-restore step that updates the persisted identity (and its dependent rows — virtual hosts, and anything else keyed to the tenant) to match the target environment. More machinery to build and maintain, but it composes with two things you likely need anyway: data sanitization (you cannot land production PII in a lower environment unscrubbed) and multi-tenant iteration (if the platform hosts more than one tenant, the rewrite has to loop over all of them). If you\u0026rsquo;re building a post-restore SQL hook for sanitization regardless, identity rewriting rides the same mechanism for free.\nDiscover the identity from the database. Have the runtime read the persisted value and align to it, rather than requiring config to match. The most robust in principle, the most invasive to retrofit.\nThe decision is downstream of a tenancy question The right durable mechanism is selected by a product question, not an engineering one: is this single-tenant per deployment, or do we support multiple tenants / database partitioning?\nSingle-tenant, forever: stabilizing at seed time (option 1) is sufficient permanently. There\u0026rsquo;s nothing to iterate over. Multi-tenant or partitioned, or production→lower-environment restores that require PII sanitization: the post-restore hook (option 2) becomes the durable home, because it\u0026rsquo;s the only one of the three that scales to N tenants and subsumes sanitization in a single mechanism. So the engineering recommendation is to ship option 1 now — it removes the crash, deletes the dead workaround, and is correct under every tenancy outcome — and gate option 2 on the tenancy/sanitization answer, rather than build the heavier, harder-to-reverse machinery before the product direction is settled. 
Fix the outage cheaply today; put the expensive, irreversible decision where it belongs.\nThe deeper lesson The real fragility isn\u0026rsquo;t the per-environment value — it\u0026rsquo;s the pattern of identity that is both persisted in the database and required to match external configuration. Any value with that dual role is a cross-environment portability hazard: the moment you move data between environments, the persisted copy and the configured copy disagree, and the system has no way to reconcile them except to fail.\nThe robust alternative is database-as-source-of-truth for identity: resolve the default tenant from the data itself (e.g. a flag on the row) and cache it, rather than requiring a configuration value to match what\u0026rsquo;s already persisted. Configuration should seed identity on a greenfield install and then get out of the way — not remain a permanent second source of truth that every restore has to keep in sync. It\u0026rsquo;s a twenty-year-old assumption (configured identity that the runtime trusts over the database) that predates cloud deployment patterns where promoting data between environments is routine.\nThat principle — don\u0026rsquo;t make a persisted identity depend on matching external config — is the portable takeaway, and it applies well beyond this one platform.\n","permalink":"https://allenz.net/writing/stable-vs.-rewritten-identity-cross-environment-database-restore-in-a-stateful-platform/","summary":"When a platform bakes an environment-derived identity into its database on first boot, restoring across environments crash-loops on a mismatch. The fix is a product question: stabilize the identity, or rewrite it after restore?","tags":["Databases","Kubernetes"],"title":"Stable vs. 
rewritten identity: cross-environment database restore in a stateful platform"},{"content":"Policy-as-code is now standard in IaC pipelines: an OPA/Rego policy inspects each plan and denies the run if it violates a rule — a failed Checkov check, a CRITICAL Trivy CVE, a cost increase over budget, a verified secret. The policy is the guardrail. But a guardrail is only as good as your confidence that it fires when it should — and stays quiet when it shouldn\u0026rsquo;t. An untested policy that silently passes everything is worse than no policy: it manufactures the appearance of enforcement while enforcing nothing.\nSo the discipline I\u0026rsquo;d argue for: every policy ships with a unit-test suite, run in CI like any other code. This is from a Spacelift setup where each of the plan/push policies — Checkov, Trivy, TFLint, Kubeconform, Infracost, TruffleHog — had a sibling *_test.rego.\nA policy and the ways it silently breaks Here\u0026rsquo;s the Checkov gate. It reads Checkov\u0026rsquo;s findings out of the run metadata, denies the run if any check failed, and emits a warn per finding:\npackage spacelift import rego.v1 count_checkov_failures(payload) := c if { c := payload.summary.total_failed } else := c if { c := sum([f | f := payload.results[_].summary.failed]) } else := 0 get_metadata(name) := val if { val := input.third_party_metadata.custom[name] } else := val if { val := input.third_party_metadata[name] } deny contains msg if { payload := get_metadata(\u0026#34;checkov\u0026#34;) failed := count_checkov_failures(payload) failed \u0026gt; 0 msg := sprintf(\u0026#34;🛡️ found %d failed checkov security checks.\u0026#34;, [failed]) } Look at how many ways this passes silently if you get it slightly wrong:\nThe metadata path. Findings live at input.third_party_metadata.custom.checkov in one tool version and input.third_party_metadata.checkov in another. 
Hard-code the wrong one and get_metadata returns undefined, count_checkov_failures falls through to 0, and every run passes — no error, just a green check that means nothing. The summary shape. Some emitters report summary.total_failed; others only a per-result results[_].summary.failed. Handle one format and the other reads as zero failures. The threshold. For the Infracost gate, is the budget \u0026gt; $300 or \u0026gt;= $300? Off by one comparison and a run at exactly the limit goes the wrong way. None of these throw. They all fail open. You\u0026rsquo;d never notice until an actual violation sailed through.\nThe tests pin exactly those failure modes package spacelift import rego.v1 test_count_failures_standard_format if { count_checkov_failures({\u0026#34;summary\u0026#34;: {\u0026#34;total_failed\u0026#34;: 5}}) == 5 } test_count_failures_fallback_format if { count_checkov_failures({\u0026#34;results\u0026#34;: [{\u0026#34;summary\u0026#34;: {\u0026#34;failed\u0026#34;: 2}}, {\u0026#34;summary\u0026#34;: {\u0026#34;failed\u0026#34;: 1}}]}) == 3 } test_deny_on_failures if { mock := {\u0026#34;third_party_metadata\u0026#34;: {\u0026#34;custom\u0026#34;: {\u0026#34;checkov\u0026#34;: {\u0026#34;summary\u0026#34;: {\u0026#34;total_failed\u0026#34;: 1}, \u0026#34;failed_checks\u0026#34;: []}}}} count(deny) \u0026gt; 0 with input as mock } test_no_deny_when_passing if { mock := {\u0026#34;third_party_metadata\u0026#34;: {\u0026#34;custom\u0026#34;: {\u0026#34;checkov\u0026#34;: {\u0026#34;summary\u0026#34;: {\u0026#34;total_failed\u0026#34;: 0}, \u0026#34;failed_checks\u0026#34;: []}}}} deny == set() with input as mock } The shape of a good guardrail test is both directions and the boundary:\nIt denies when it must (test_deny_on_failures). It passes when it must (test_no_deny_when_passing) — this is the one people skip, and it\u0026rsquo;s the one that catches a policy that denies everything (also useless: it gets disabled within a week). 
It pins the boundary — for Infracost, an explicit case that $300 exactly does not deny because the rule is \u0026gt;, not \u0026gt;=. For Trivy, that CRITICAL and HIGH deny while MEDIUM and LOW only warn. It covers every input format the upstream tool can emit, using with input as mock to feed in synthetic metadata. opa test runs them. (One wrinkle worth documenting: because all the policies share package spacelift and define the same helper names — get_metadata, count_checkov_failures — loading every file together collides. Each suite runs against only its own policy file: opa test policies/checkov.rego policies/checkov_test.rego.)\nThe general principle A guardrail is control-plane code that runs on every change and is supposed to say \u0026ldquo;no.\u0026rdquo; That makes it exactly the code you most need to test, because its failure mode is invisible: a broken guardrail doesn\u0026rsquo;t crash, it just stops guarding, and the dashboard stays green. The cost of being wrong is paid later, by whatever the policy was supposed to catch — a secret in state, a public bucket, a 10× cost regression.\nTreat policies like the security-critical code they are: assert they fire on the bad input, assert they stay quiet on the good input, pin the threshold, and cover every metadata shape the source tool emits. The test suite is what converts \u0026ldquo;we have a Checkov policy\u0026rdquo; into \u0026ldquo;we know our Checkov policy works.\u0026rdquo; Those are very different claims, and only one of them survives an audit.\n","permalink":"https://allenz.net/writing/test-your-guardrails-policy-as-code-that-you-actually-verify/","summary":"Policy-as-code that\u0026rsquo;s never tested usually fails open — it waves violations through and no one notices. 
How to test guardrails so they both deny when they must and pass when they must.","tags":["Infrastructure as Code","Security"],"title":"Test your guardrails: policy-as-code that you actually verify"},{"content":"If you manage your infrastructure as code, you eventually face a recursive question: what manages the thing that manages your infrastructure? When the IaC orchestrator (Spacelift, in this case) is itself configured through Terraform — stacks, contexts, policies, integrations all declared as resources — the cleanest answer is an admin stack: a stack that provisions every other stack, including itself. It\u0026rsquo;s elegant, and it has two irreducible bootstrap problems that no amount of declarative code can paper over. This is how that pattern actually goes.\nThe shape One repo declares the whole organization via the Spacelift Terraform provider. A root module composes child modules (./aws, ./gcp, ./plugins, …), and a single resource declares the stack that runs the repo:\nresource \u0026#34;spacelift_stack\u0026#34; \u0026#34;admin\u0026#34; { name = \u0026#34;platform-automation-spacelift\u0026#34; repository = \u0026#34;spacelift\u0026#34; branch = \u0026#34;master\u0026#34; project_root = \u0026#34;.\u0026#34; space_id = var.cloudnative_space_id # the \u0026#34;root\u0026#34; space terraform_workflow_tool = \u0026#34;OPEN_TOFU\u0026#34; } # The admin stack gates the delivery pipeline: resource \u0026#34;spacelift_stack_dependency\u0026#34; \u0026#34;admin_to_gcp_gke\u0026#34; { depends_on_stack_id = spacelift_stack.admin.id stack_id = module.gcp.gke_stack_id } After this exists, the admin stack is self-hosting: edit a policy or add a stack, open a PR, and the admin stack plans and applies the change to the org — including changes to itself. That\u0026rsquo;s the goal. 
Getting there is where it\u0026rsquo;s interesting.\nBootstrap problem 1: the stack can\u0026rsquo;t create itself The first apply cannot run in the admin stack, because the admin stack doesn\u0026rsquo;t exist yet. Something has to create it from outside. So the bootstrap is a local apply, then a state handoff:\nAuthenticate locally (spacectl profile login, bridged to Terraform via SPACELIFT_API_TOKEN). tofu apply from your laptop — this creates the admin stack and all child stacks/contexts/policies, recording everything in a local terraform.tfstate. Now the admin stack exists in Spacelift, but its managed state is empty — the real state is on your laptop. Hand it off: lock the stack, import the local tfstate through the UI, unlock, and trigger a fresh run. That run must plan 0 to add, 0 to change, 0 to destroy. That clean plan is the proof the handoff worked — the self-managed state now matches reality, and the stack has taken over from your laptop. One related trap: do not add a backend \u0026quot;...\u0026quot; block to the repo. Spacelift injects state configuration into every run; a static backend block fights that injection. The state lives where the platform puts it, and the bootstrap import is how you seed it.\nBootstrap problem 2: you can\u0026rsquo;t grant yourself permissions you don\u0026rsquo;t have The admin stack manages resources across multiple spaces, so it needs an org-wide admin role. But it cannot grant itself that role — to create a role binding that powerful, you\u0026rsquo;d already need to be that powerful. This is a genuine chicken-and-egg, not a tooling gap, and the practical answer is a manual seed in the UI:\nConfirm the admin stack lives in the root space (Spacelift only allows a stack a role scoped to a space it sits in or above). Manually bind the Space admin role on root to the admin stack as principal. A single binding on root cascades to child spaces. 
If the provider later grows a role_attachment resource, you add it to the code as a record of state already set by hand — never as the source of truth, because the source of truth had to exist before the code could run.\nThe general principle Every self-managing or self-hosting system has an irreducible manual seed, and it\u0026rsquo;s always in the same two places: the trust/permission root and the initial state. A system cannot authorize itself (that\u0026rsquo;s circular), and it cannot create the state that records its own existence before it exists. No quantity of \u0026ldquo;everything as code\u0026rdquo; removes this; it only relocates it.\nSo the mature move isn\u0026rsquo;t to pretend the bootstrap is fully declarative — it\u0026rsquo;s to make the seed explicit, documented, and tiny: one local apply, one state import, one manual role binding, each written down with the exact UI steps and the \u0026ldquo;you\u0026rsquo;ll know it worked when the plan is 0/0/0\u0026rdquo; checkpoint. After that single human-in-the-loop moment, the system is genuinely self-managing. The art is shrinking the manual seed to the smallest thing that logically cannot be automated — and being honest that it exists, rather than discovering it the hard way during a disaster-recovery rebuild.\n","permalink":"https://allenz.net/writing/the-admin-stack-that-manages-itself-bootstrapping-a-self-hosted-iac-control-plane/","summary":"If your IaC orchestrator is itself configured as code, you need an admin stack that provisions every stack — including itself. The elegant pattern, and the two bootstrap problems no amount of declarative code removes.","tags":["Infrastructure as Code"],"title":"The admin stack that manages itself: bootstrapping a self-hosted IaC control plane"},{"content":"Terraform\u0026rsquo;s mental model is that a state file owns the resources it declares: apply makes them exist, destroy makes them go away. 
That model quietly breaks when a resource is actually shared across many deployments but gets declared inside a per-deployment module. The symptom shows up far from the cause — a different deployment\u0026rsquo;s teardown silently breaks the one you\u0026rsquo;re looking at.\nThe setup A per-cluster Terraform module (cloud/terraform/gcp/gke) needs Cloud Storage Transfer Service to run database backups. So it grants the project\u0026rsquo;s Storage Transfer service agent the roles it needs:\n# apis.tf — inside the PER-CLUSTER module resource \u0026#34;google_project_iam_member\u0026#34; \u0026#34;sts_storage_admin\u0026#34; { project = var.project_id role = \u0026#34;roles/storage.admin\u0026#34; member = \u0026#34;serviceAccount:project-${var.project_number}@storage-transfer-service.iam.gserviceaccount.com\u0026#34; } resource \u0026#34;google_project_iam_member\u0026#34; \u0026#34;sts_agent\u0026#34; { project = var.project_id role = \u0026#34;roles/storagetransfer.serviceAgent\u0026#34; member = \u0026#34;serviceAccount:project-${var.project_number}@storage-transfer-service.iam.gserviceaccount.com\u0026#34; } This works when you stand up the first cluster. It keeps working when you stand up the second and third — google_project_iam_member is additive and idempotent, so each cluster\u0026rsquo;s apply just re-asserts the same binding.\nThe failure The Storage Transfer service agent is one identity per GCP project. There is exactly one project-\u0026lt;N\u0026gt;@storage-transfer-service.iam.gserviceaccount.com, and exactly one project-level binding granting it storage.admin. But now N per-cluster Terraform states each declare that single binding as if they owned it.\nApply is forgiving of that. Destroy is not. Tear down any one cluster:\nterraform destroy # cluster A …and google_project_iam_member dutifully removes the project-level binding — the one every other cluster is still relying on. 
The next backup run on clusters B, C, D fails with:\nstorage: does not have storage.buckets.get access to the Google Cloud Storage object Nothing changed in B, C, or D. Their Terraform still shows the binding in state. But the binding is gone from the project, because a sibling\u0026rsquo;s destroy took it. The blast radius of a teardown reached sideways into unrelated, still-running deployments.\nWhy this class of bug hides Three properties make it hard to catch:\nApply masks it. As long as any cluster has applied recently, the binding exists. The fleet looks healthy. The damage is delayed and remote. The break surfaces on a different deployment, on its next backup, not at destroy time. The causal link is easy to miss. State lies. Every surviving cluster\u0026rsquo;s state still lists the binding as present and owned. terraform plan shows no drift until it tries to use it. The fix: separate ownership layers by lifecycle, not by convenience The binding is project-scoped and singleton; the cluster is per-deployment and disposable. Resources with different lifecycles must not live in the same state.\nFactor project-global, shared resources into a one-shot bootstrap module that runs once per project and is never part of a per-cluster destroy:\nterraform/ bootstrap/ # one state per PROJECT — service-agent bindings, shared APIs, org policy gke-cluster/ # one state per CLUSTER — VPC, cluster, node pools (safe to destroy) The per-cluster module then depends on the bootstrap layer\u0026rsquo;s outputs but never declares the shared bindings itself. A cluster teardown can\u0026rsquo;t touch anything another cluster needs, because it no longer owns it.\nThe general principle Before Terraform manages a resource, ask: is this resource per-instance, or is it shared across instances? Anything project-, account-, or org-scoped — service-agent IAM bindings, enabled APIs, org policies, shared DNS zones, a default network — is a singleton. 
A singleton declared inside a per-instance module is owned N times and destroyed by the first teardown that runs.\nThe diagnostic test is simple and worth applying in review: \u0026ldquo;If I terraform destroy this one module, does anything outside it break?\u0026rdquo; If yes, the resource is in the wrong layer. Shared resources belong in a bootstrap state whose lifecycle matches the thing they\u0026rsquo;re actually scoped to — the project — not the thing that happened to need them first.\n","permalink":"https://allenz.net/writing/when-terraform-owns-a-shared-resource-as-if-it-were-dedicated/","summary":"When a per-cluster Terraform module owns a project-global, shared resource, tearing down one cluster quietly breaks the others. Why resources with different lifecycles can\u0026rsquo;t share state — and the bootstrap-module fix.","tags":["Infrastructure as Code","GCP"],"title":"When Terraform owns a shared resource as if it were dedicated"},{"content":"Standing up a GKE or EKS deployment is the easy direction. Tearing it down cleanly is where the sharp edges live — because a \u0026ldquo;deployment\u0026rdquo; is rarely a single layer that one terraform destroy fully owns. The cluster is provisioned one way; the things that run inside the cluster provision more cloud resources a second way; and the two layers disagree about who is responsible for cleanup.\nThis is a field guide to deleting one without leaving orphans, drawn from tearing down both a GKE deployment (Argo CD + Crossplane GitOps) and a pair of Terraform-provisioned EKS clusters.\nStrategy 0: if you can, delete the whole project / account The cleanest teardown is the one you don\u0026rsquo;t have to enumerate. 
If the deployment lives in its own GCP project or its own AWS account, deleting that container is the single most reliable move — it takes every resource, IAM binding, and orphan with it, and you never have to reason about deletion order.\nThis only works if the project/account is dedicated to the deployment. The moment a project is shared across many deployments (a common *-development sandbox pattern), you can\u0026rsquo;t nuke it, and you\u0026rsquo;re back to surgical deletion — which is the rest of this guide.\nShared-project fragility to watch for: some Terraform modules grant project-level IAM bindings (e.g. to the Storage Transfer Service agent) as if they were per-cluster. When any sister deployment\u0026rsquo;s module runs terraform destroy, those shared bindings vanish from the whole project and silently break every other deployment that still expects them. Factor project-scoped bindings into a separate one-shot bootstrap module so per-deployment destroys can\u0026rsquo;t take them down.\nStrategy 1: enumerate by naming convention, not by module\u0026rsquo;s resource list The instinct is to walk the Terraform module\u0026rsquo;s resource types and delete those. This misses everything the module didn\u0026rsquo;t create. A GKE deployment built on a GitOps/Crossplane stack also has, beyond the module\u0026rsquo;s VPC/subnet/router/NAT/firewall/node-SA:\nHigh-privilege control-plane service accounts (serviceAccountAdmin, projectIamAdmin). Per-provider service accounts (KMS, SQL, Secret Manager, Storage admins). Crossplane-provisioned downstream resources — Cloud SQL instances, GCS buckets, KMS keyrings, secrets — that survive after the cluster is gone. Dozens of project IAM bindings, which are policy entries, not resources, and won\u0026rsquo;t show up in any resource listing. 
Enumerate with Cloud Asset Inventory (GCP) or the Resource Groups Tagging API (AWS), keyed on the deployment\u0026rsquo;s name prefix:\ngcloud asset search-all-resources \\ --scope=projects/\u0026lt;project\u0026gt; \\ --query=\u0026#34;name:\u0026lt;deployment-prefix\u0026gt;\u0026#34; \\ --format=\u0026#34;value(assetType,name)\u0026#34; This catches the secrets, buckets, and controller-created SAs that a module-walk misses. Then separately sweep project IAM, because bindings aren\u0026rsquo;t assets:\ngcloud projects get-iam-policy \u0026lt;project\u0026gt; \\ --flatten=\u0026#34;bindings[].members\u0026#34; \\ --filter=\u0026#34;bindings.members~\u0026lt;prefix\u0026gt;-\u0026#34; \\ --format=\u0026#34;value(bindings.role,bindings.members)\u0026#34; Labels help less than you\u0026rsquo;d hope The tempting axis is --filter labels.deployment_name=.... But most networking and IAM primitives — VPC, subnet, router, NAT, route, firewall, peering, service account, IAM binding — have no label field at the API level. A provider can only set arguments that exist. In practice only a couple of resource types (e.g. google_compute_global_address, the cluster itself) carry the label. So the real discovery axis is the \u0026lt;deployment_name\u0026gt;- naming convention, not labels. Tag what you can, but build your enumeration on names.\nStrategy 2: know what Terraform never tracked terraform destroy only deletes what\u0026rsquo;s in its state. Two large classes of resource routinely aren\u0026rsquo;t:\nResources provisioned by in-cluster controllers. Crossplane managed-resources, the EBS/PD CSI driver\u0026rsquo;s dynamically-provisioned volumes, cloud load balancers created by a Service: LoadBalancer. These are created by software running inside the cluster, not by Terraform — so Terraform has no idea they exist. State drift / empty state. If the local state was reset, migrated, or never captured the import, terraform destroy is a no-op and quietly leaves everything running. 
Always verify terraform state list is non-empty before trusting destroy. Concrete leftovers seen after the clusters were deleted:\nGKE: Crossplane-created Cloud SQL instances and GCS buckets, plus cp-iam/provider SAs and ~17 project IAM bindings. EKS: two available (detached) 10 GiB EBS volumes named \u0026lt;cluster\u0026gt;-dynamic-pvc-\u0026lt;uuid\u0026gt; — orphaned PersistentVolumes the CSI driver created and Terraform never owned. Sweep for these explicitly after the cluster is gone:\n# Orphaned EBS PVs from a deleted EKS cluster aws ec2 describe-volumes --region \u0026lt;r\u0026gt; \\ --filters \u0026#34;Name=tag:Name,Values=*\u0026lt;cluster\u0026gt;*\u0026#34; \\ --query \u0026#34;Volumes[?State==\u0026#39;available\u0026#39;].VolumeId\u0026#34; --output text Strategy 3: deletion protection and the right order Check deletion protection first. Cloud SQL (settings.deletionProtectionEnabled), GKE clusters, RDS instances, and load balancers can all carry a protection flag that makes delete calls fail outright. Confirm it\u0026rsquo;s off before you script a batch delete, or your loop fails halfway and leaves a partial teardown.\nDependencies dictate order. Networking especially is a dependency graph, not a flat list. A workable GCP order:\nService-networking (PSA) peering — usually the stuck one (see below). NAT → router (NAT lives inside the router). Firewalls. Non-default routes. Subnet. PSA global address (only deletable once its peering is gone). VPC. Project IAM bindings — one remove-iam-policy-binding per (role, member). Service accounts. (Deleting an SA does not remove its project bindings — they become deleted:...?uid= orphan members. Remove the bindings too, or they linger forever.) Controller-created resources (SQL, buckets, keyrings). A useful shortcut when the controllers are GitOps-managed: delete the cluster first. 
That removes Argo CD, Crossplane, and every finalizer in one shot, so nothing fights you — then clean the now-orphaned cloud resources directly. (This is only safe because the controllers couldn\u0026rsquo;t deprovision anyway; see below.)\nStrategy 4: retries, stuck finalizers, and things that hang Cloud deletes are eventually-consistent and frequently need retries or a workaround:\nService-networking peering often fails with FLOW_SN_DC_RESOURCE_PREVENTING_DELETE_CONNECTION even after Cloud SQL / Memorystore / Filestore are gone — the tenant-project side holds stuck state. When the VPC is going away regardless, delete the peering from the consumer side instead: gcloud compute networks peerings delete servicenetworking-googleapis-com \\ --network=\u0026lt;deployment\u0026gt;-vpc --project=\u0026lt;project\u0026gt; Crossplane / Argo finalizers. If the controller\u0026rsquo;s cloud credentials have lapsed (e.g. the control-plane SA lost authorization — a 403 notAuthorized on observe), Crossplane cannot deprovision its managed resources, and deleting the claim/XR will hang on finalizers forever. Don\u0026rsquo;t wait on it: delete the cloud resources by hand via the provider CLI, then either strip the finalizers or — cleaner — delete the whole cluster so the entire control plane (and its finalizers) disappears at once. Auto-sync will fight you. A GitOps controller set to automated: { prune: true, selfHeal: true } recreates anything you delete out from under it. Remove/disable the Applications first, or delete the cluster they run in, before deleting their managed resources. Asset Inventory lags ~1 hour. After deletion, stale entries for Networks, service-networking Connections, and SecretVersions linger in Asset Inventory. Don\u0026rsquo;t trust it for \u0026ldquo;is this really gone\u0026rdquo; — verify against the direct API (gcloud compute networks describe, aws eks list-clusters, etc.). A teardown checklist Is this its own project/account? 
If yes — delete it and stop. terraform state list non-empty? If empty, destroy is a lie; enumerate manually. Enumerate via Asset Inventory / Tagging API on the name prefix; sweep IAM separately. Identify controller-created resources Terraform never tracked (CSI volumes, Crossplane MRs, LB-from-Service). Check and clear deletion protection. Delete in dependency order (or delete the cluster first to kill controllers/finalizers, then clean orphans). Remove IAM bindings explicitly — deleting an SA leaves orphan bindings. Handle the known-stuck cases (PSA peering consumer-side, hung finalizers). Verify against the live API, not Asset Inventory. ","permalink":"https://allenz.net/writing/tearing-down-a-managed-kubernetes-deployment-without-leaving-a-tail/","summary":"A field guide to deleting a GKE or EKS deployment cleanly when the cluster, the in-cluster GitOps/Crossplane layer, and Terraform all disagree about who owns cleanup — orphans, deletion order, and the stuck cases.","tags":["Infrastructure as Code","Kubernetes","GCP"],"title":"Tearing down a managed-Kubernetes deployment without leaving a tail"},{"content":"GitHub Actions is excellent at what it is: a stateless task runner that spins up a clean box, runs your steps, and tears down. That model is a near-perfect fit for build-test-lint. It is a poor fit for applying Terraform — and the gap doesn\u0026rsquo;t show up in a demo. It shows up the third week, when two PRs merge close together, a state lock collides, and someone is manually reconciling a half-applied change at 11pm. Applying Terraform is a stateful, collaborative, approval-gated workflow, and forcing it onto a stateless executor means rebuilding state, locking, approvals, audit, and drift detection by hand — usually badly. A purpose-built IaC platform (Spacelift here; the category is \u0026ldquo;TACOS\u0026rdquo; — Terraform Automation and Collaboration Software) gives you those as first-class features. 
This is the practical case, from having run Terraform both ways.\nWhat raw CI makes you rebuild Applying Terraform safely requires a set of properties that a CI runner doesn\u0026rsquo;t have and can\u0026rsquo;t easily fake:\nState and locking. Terraform state is shared, mutable, and must be serialized — exactly one apply at a time per state. CI runners are concurrent and stateless by design. So you bolt on a remote backend and hand-roll lock handling, and you still hit Error acquiring the state lock races whenever two workflows overlap. The platform owns the state and serializes runs as a built-in invariant. A plan that equals the apply. The whole value of review is approving a specific plan, then applying that plan — not re-planning at apply time against possibly-changed state. In CI you pass the plan as an artifact between jobs and pray nothing drifted in between; the plan→approve→apply handoff is bespoke and fragile. In a TACOS it\u0026rsquo;s the native run lifecycle: plan, a human approves the exact plan, that plan applies. Audit and run history. \u0026ldquo;Who applied what, when, against which commit, and what changed?\u0026rdquo; is a first-class, durable, linkable record. In CI it\u0026rsquo;s log scrollback in an ephemeral job that ages out. Cloud auth, once. OIDC/Workload-Identity federation, per-stack least-privilege identities, secret injection — configured once on the platform, not re-plumbed into every workflow YAML. Drift detection and policy gates. Scheduled drift detection and tested policy-as-code gates are features you turn on, not pipelines you maintain. None of this is impossible in GitHub Actions. The point is that you end up reimplementing an IaC control plane inside a CI tool, and the reimplementation is the part that breaks.\nCollaboration is the underrated half Beyond safety, the platform changes how a team works on infrastructure. 
Runs are PR-driven; the plan is rendered and reviewable in the platform, not buried in CI logs; approvals and role bindings make \u0026ldquo;who can apply to production\u0026rdquo; an explicit, auditable control rather than a branch-protection rule plus tribal knowledge. Infrastructure changes start to feel like code review — a shared, visible artifact people actually discuss — instead of a privileged operator running apply from a laptop and announcing it in Slack.\nThe signal I didn\u0026rsquo;t know I was missing: \u0026ldquo;0 to change\u0026rdquo; Here\u0026rsquo;s the benefit that surprised me most, and it\u0026rsquo;s specific. Refactoring Terraform is nerve-wracking precisely because the scariest refactors are supposed to change nothing real: restructure modules, rename resources, introduce for_each, add moved blocks. The whole intent is that the configuration changes while the infrastructure stays byte-for-byte identical. But how do you know you didn\u0026rsquo;t accidentally force a replacement of a database or a VPC?\nAgainst a stable, persistent test environment, the platform answers this immediately. You open the refactor PR, the stack plans against the real deployed state, and the plan reads:\nNo changes. Your infrastructure matches the configuration. Plan: 0 to add, 0 to change, 0 to destroy. That 0/0/0 is a precise, visible, reviewable proof that the refactor is purely structural — it touched the code and nothing else. A non-zero plan is an instant, loud signal that you moved a resource you didn\u0026rsquo;t mean to. Getting that feedback loop from raw CI means standing up a durable env, wiring state, and reading plan output out of job logs — all the things the platform already does. With a stable test env behind the stack, \u0026ldquo;did my refactor change anything?\u0026rdquo; goes from a leap of faith to a line in the plan.\nThe general principle Match the tool to the shape of the work, not just the verb. 
\u0026ldquo;It runs commands in CI\u0026rdquo; is true of terraform apply, but apply is a stateful, serialized, approval-gated, audited operation, and CI runners are stateless, concurrent, ephemeral executors. Every safety property you need for Terraform is something you\u0026rsquo;d have to add back on top of CI — and the homegrown version is exactly where the incidents come from.\nGitHub Actions remains a great fit for building and testing. For applying Terraform across a team, a platform that treats state, locking, plan-approval, audit, drift, and policy as first-class isn\u0026rsquo;t a luxury — it\u0026rsquo;s the difference between infrastructure changes that are reviewable and reversible and ones that are a held breath. And as a bonus, against a stable environment it hands you the single most reassuring sentence in infrastructure work: no changes.\n","permalink":"https://allenz.net/writing/applying-terraform-from-ci-is-a-stateful-problem-wearing-a-stateless-tool/","summary":"GitHub Actions is a near-perfect stateless task runner — and a poor fit for applying Terraform, which is stateful, collaborative, and approval-gated. The practical case, from running it both ways.","tags":["Infrastructure as Code"],"title":"Applying Terraform from CI is a stateful problem wearing a stateless tool"},{"content":"Cloud log aggregators (Cloud Logging, Elasticsearch, Loki) want one thing from a workload: structured JSON, one object per line, with consistent fields. A legacy Java application typically gives you the opposite — two different streams of unstructured text. 
This is how I retrofitted machine-parseable JSON logging onto a large Java portal platform running on Kubernetes, as an opt-in, cleanly revertible layer that never patches the shipped image and never blocks pod startup.\nThe problem: two logging subsystems, both unstructured The container emits logs from two independent subsystems:\nlog4j2 — the application\u0026rsquo;s own logs (the interesting ones: business logic, stack traces). java.util.logging (JUL) — Tomcat\u0026rsquo;s container logs (startup, lifecycle, access). Out of the box both write human-readable text in different formats. A log aggregator can\u0026rsquo;t reliably parse either, and it certainly can\u0026rsquo;t correlate them. To get clean ingestion you have to make both emit JSON, with a shared field schema, and you have to do it from outside the application — forking and rebuilding the vendor image to change logging config is exactly the kind of maintenance burden a platform team should refuse to take on.\nConstraints that shaped the design No patching the shipped artifacts. The application JARs are vendor-built; modifying a core application JAR to change logging would have to be redone on every upgrade. Don\u0026rsquo;t mutate persistent config. Tomcat\u0026rsquo;s logging.properties lives on a persistent volume. Rewriting a PV-backed file in place makes the change stateful and hard to revert. Opt-in and reversible. Teams that don\u0026rsquo;t want JSON logs should see no behavior change, and turning it off must revert cleanly. Fail open. A logging-configuration problem must never prevent the pod from starting. Degraded logs are acceptable; a crash loop is not. The approach: inject at init time, through seams the platform already provides A small init script runs at container startup (sourced by the existing entrypoint), before the JVM launches. 
It configures each subsystem through an extension point that already exists, rather than by modifying anything shipped.\nlog4j2 (application logs) — via the classpath, not a patch. The platform\u0026rsquo;s logging bootstrap discovers log4j2 configuration by walking the classloader resource chain (getResources()), the same mechanism it uses to find extension configs. So instead of editing the application JAR, the script drops a log4j2 extension config and a JSON layout template into WEB-INF/classes/META-INF on the classpath, where the resource walk finds it through the webapp classloader\u0026rsquo;s parent. The JSON layout itself comes from the log4j-layout-template-json plugin, fetched at init from a public artifact repository (with checksum verification and bounded timeouts/retries) and dropped alongside log4j-core in the shielded-container lib directory. No rebuild, no fork — just using the discovery seam the framework already exposes.\nJUL (Tomcat logs) — ephemeral override, never touch the PV. The script reads the PV-backed logging.properties, strips the existing console-handler formatter line, appends one that selects a JSON formatter, and writes the result to an ephemeral path in the container\u0026rsquo;s writable layer (wiped on every restart). It never writes back to the PV. It then points the JVM at the ephemeral file with -Djava.util.logging.config.file=..., relying on Java\u0026rsquo;s last-definition-wins rule to override the default the launcher would otherwise apply. Disable the feature and the override simply isn\u0026rsquo;t set; JUL falls back to the untouched PV config. The revert is automatic because nothing persistent ever changed.\nOpt-in + change detection. The whole layer is gated behind a single flag (default off). 
A checksum annotation over the injected config is stamped onto the pod template, so any change to the layout, the extension config, or the script triggers a rolling restart automatically rather than silently drifting.\nFail open. The remote fetch of the layout plugin is the only step that can fail for external reasons, so it soft-fails: if the artifact repository is unreachable, JUL still emits JSON (its formatter ships with the runtime), log4j2 falls back to its default text layout, and the pod starts normally. You lose JSON on one stream, not availability.\nSchema normalization at the source The two streams are normalized to a shared field set — timestamp, level, logger class, method, thread, message, throwable — so the aggregator sees one schema regardless of origin. Level names are mapped into the aggregator\u0026rsquo;s severity vocabulary (e.g. WARN → WARNING, FATAL → EMERGENCY) at emission time, and a cloud-specific layout variant adds the aggregator\u0026rsquo;s native correlation fields (insert ID, source location) so the platform\u0026rsquo;s log UI lights up without a downstream transform. Normalizing at the source is cheaper and more reliable than teaching every consumer to reconcile two formats.\nA concrete payoff: Tomcat logs stop being errors There\u0026rsquo;s a specific operational misery this fixes. JUL\u0026rsquo;s default ConsoleHandler writes every record — INFO, lifecycle chatter, routine startup messages, all of it — to stderr. GKE\u0026rsquo;s logging agent, given no structured severity to go on, infers a line\u0026rsquo;s severity from the stream it arrived on: stdout becomes INFO, and stderr becomes ERROR.[1] The result is that in Cloud Logging\u0026rsquo;s Logs Explorer, every Tomcat log line shows up red as an ERROR — including the entirely routine ones. 
Real errors are buried in a sea of false ones, and any alerting policy keyed on ERROR severity is worthless.\nEmitting structured JSON with an explicit severity field fixes this directly: the agent parses the JSON and honors the per-line severity instead of falling back to the stream. An INFO Tomcat record is now classified INFO, even though it still rides stderr. The Logs Explorer goes from uniformly red to correctly leveled — the difference between logs you can alert on and logs you scroll past.\nValidating it Because the failure modes are subtle (a misplaced classpath resource silently does nothing; a JVM property in the wrong place is silently ignored), the change ships with a containerized integration test: it boots the real application image with the layer enabled and asserts that the emitted log lines parse as JSON with the expected fields, above a threshold ratio. That turns \u0026ldquo;the logs look right\u0026rdquo; into a check that fails loudly in CI.\nPortable lessons Extend through the seam the framework already gives you. A classloader resource-discovery path, an entrypoint hook, a JVM property override — these let you change behavior from outside the artifact. Forking the vendor image should be the last resort, not the first. Never mutate config that lives on a persistent volume. Write an ephemeral copy and override the pointer to it. The change stays stateless and the revert is free. Logging must fail open. Anything in the observability path that can fail for external reasons should degrade, never block startup. Normalize the schema at the source. One consistent shape out of the workload beats N reconciliations downstream. Make severity explicit; never let the platform infer it from the stream. A subsystem that logs everything to stderr gets silently classified as all-errors by stream-based heuristics. An explicit severity field in the payload is the only reliable signal. 
[1] When a log line carries no explicit severity, Google Cloud\u0026rsquo;s logging agent infers it from the output stream — stdout → INFO, stderr → ERROR. Emitting a severity field in structured JSON logs overrides that inference.\n","permalink":"https://allenz.net/writing/structured-json-logging-for-a-legacy-java-app-on-kubernetes-without-forking-the-image/","summary":"How to retrofit machine-parseable JSON logging onto a legacy Java app on Kubernetes — both log4j2 and Tomcat\u0026rsquo;s JUL — without forking the vendor image, as an opt-in, fail-open, cleanly revertible layer.","tags":["Observability","Kubernetes"],"title":"Structured JSON logging for a legacy Java app on Kubernetes — without forking the image"},{"content":"A common GKE security choice: provision clusters as private clusters — no public IP on the Kubernetes API server. This significantly reduces attack surface; the control plane is unreachable from the public internet.\nThe architectural question that follows: how does authorized traffic — engineers, CI, operators — reach the API server?\nMost teams reach for one of these:\nBastion host or VPN. Works, but adds infrastructure to maintain and a separate authentication chain. Authorized networks (public IP with allowlist). Narrowly exposes the control plane, but defeats the \u0026ldquo;private by default\u0026rdquo; stance and requires managing IP allowlists. Use Connect Gateway instead. It\u0026rsquo;s the GKE-native, platform-included answer:\nNo bastion to run, no VPN to maintain. No public IP on the control plane — the private-cluster invariant stays intact. GCP IAM authenticates the user to the gateway; Kubernetes RBAC authorizes them inside the cluster. Clean separation, standard tooling. Included in the GKE management fee (~$0.10 / hour / cluster as of 2026). No GKE Enterprise license required for base functionality. Minimum config Cluster registered as a member of a GKE Fleet (formerly GKE Hub). 
The accessing identity has both:\nGCP IAM: roles/gkehub.gatewayReader (or gatewayEditor / gatewayAdmin) plus roles/gkehub.viewer. Kubernetes RBAC: a ClusterRoleBinding granting the appropriate role. (GKE provides implicit cluster-admin mapping for roles/container.admin / Project Owner; explicit bindings are better for least-privilege.) After Fleet registration, getting cluster access is a single command:\ngcloud container fleet memberships get-credentials \u0026lt;MEMBERSHIP_NAME\u0026gt; This rewrites the local kubeconfig to route through the Connect Gateway endpoint instead of the cluster\u0026rsquo;s private master IP. kubectl works normally afterwards.\nWhy this is worth featuring as a recommendation For most teams running private GKE clusters, Connect Gateway is a strong default — and underused, because the \u0026ldquo;I need a bastion\u0026rdquo; instinct is older than the gateway is. It\u0026rsquo;s the platform-native, no-extra-infrastructure path. The auth model (IAM outside, RBAC inside) lines up cleanly with how Kubernetes-on-GCP is already authenticated everywhere else. Set the Fleet membership; grant the IAM + RBAC; done.\nReferences Connect Gateway Overview (Google Cloud docs) Using Connect Gateway (Google Cloud docs) ","permalink":"https://allenz.net/writing/use-gke-connect-gateway-to-protect-your-private-control-plane/","summary":"Reach a private GKE cluster\u0026rsquo;s API server without a bastion or authorized-networks — using the GKE-native Connect Gateway, with GCP IAM outside the gateway and Kubernetes RBAC inside.","tags":["Kubernetes","GCP","Security"],"title":"Use GKE Connect Gateway to protect your private control plane"},{"content":"Here\u0026rsquo;s a quietly dangerous failure mode in a Crossplane + GitOps stack: a chart change is committed, ArgoCD reports Synced, the Crossplane resource reports Ready, every dashboard is green — and the change never actually took effect. The new behavior you shipped silently isn\u0026rsquo;t running. 
The cause is a two-part interaction between Crossplane management policies and Kubernetes immutability.\nThe setup Crossplane\u0026rsquo;s provider-kubernetes lets a composition manage an arbitrary Kubernetes resource by wrapping it in an Object. Ours wrapped a database-grant Job:\n# compositions/30-k8s.yaml apiVersion: kubernetes.crossplane.io/v1alpha2 kind: Object spec: managementPolicies: [\u0026#34;Create\u0026#34;, \u0026#34;Observe\u0026#34;, \u0026#34;Delete\u0026#34;] # note: no \u0026#34;Update\u0026#34; forProvider: manifest: apiVersion: batch/v1 kind: Job # ...db-grant job spec... managementPolicies controls which verbs Crossplane is allowed to perform on the external resource. This one can Create the Job, Observe it, and Delete it — but it is not permitted to Update it.\nThe failure A later chart change added a step to the Job — an auth-readiness wait before the grant runs. The change merged, ArgoCD synced the new composition, Crossplane reported the Object Ready. But on any environment where the Job already existed (a re-used \u0026ldquo;green\u0026rdquo; instance), the new readiness wait never ran.\nTwo facts combine to produce this:\nmanagementPolicies has no Update. Crossplane created the Job once. With Update absent, it will never reconcile the live Job toward a changed spec — drift is simply not its job. So a new manifest in the composition is observed but never applied. A Job is immutable anyway. Even with Update, you can\u0026rsquo;t kubectl apply a changed spec.template onto an existing Job — Job.spec.template is immutable. Propagating the change requires delete-and-recreate, which Crossplane will do on an immutability conflict only if Delete and Update are both in the policy set. So the Job froze at its first-created spec. Everything upstream reported success because, from Crossplane\u0026rsquo;s and ArgoCD\u0026rsquo;s point of view, nothing was wrong — the Object matched its allowed policy exactly. 
The desired-state change just had no path to the cluster.\nThe fix For an immutable Kubernetes resource managed through a Crossplane Object, you need Crossplane to be allowed to replace it, and you need to understand it will do so by delete-and-recreate:\nmanagementPolicies: [\u0026#34;Create\u0026#34;, \u0026#34;Observe\u0026#34;, \u0026#34;Update\u0026#34;, \u0026#34;Delete\u0026#34;] With Update (and Delete) present, Crossplane detects the spec drift, hits the immutability conflict on apply, and falls back to delete-and-recreate — so the new Job spec actually lands. (If recreating the Job mid-flight is unacceptable, the alternative is to make the Job name a function of its content — a spec hash in the name — so a changed spec produces a new Job under Create semantics rather than mutating an existing one.)\nThe general principle Two transferable lessons:\nAn allow-list of actions silently caps reconciliation. Anything that lets you restrict which verbs a controller may perform — Crossplane managementPolicies, RBAC, provider scopes — will, when under-scoped, produce a system that looks reconciled but isn\u0026rsquo;t. The resource matches policy; policy just doesn\u0026rsquo;t include \u0026ldquo;make it current.\u0026rdquo; Default to the full action set and remove verbs deliberately, not by omission. \u0026ldquo;Synced\u0026rdquo; and \u0026ldquo;Ready\u0026rdquo; mean conformant to intent, not currently correct. A green GitOps dashboard tells you the controller did everything it was permitted to do. It does not tell you the live resource reflects your latest change — especially across an immutability boundary, where the path from desired to actual requires a replace the controller may not be allowed (or able) to perform. When a change \u0026ldquo;deploys\u0026rdquo; but the behavior doesn\u0026rsquo;t move, suspect the gap between what the controller is permitted to do and what the resource requires to change. 
","permalink":"https://allenz.net/writing/the-crossplane-object-that-synced-green-and-changed-nothing/","summary":"Everything\u0026rsquo;s green — Argo CD Synced, Crossplane Ready — but the change never took effect. The trap where Crossplane management policies without Update meet Kubernetes immutability.","tags":["Infrastructure as Code","Kubernetes","GitOps"],"title":"The Crossplane Object that synced green and changed nothing"},{"content":"You remove an environment variable from a Helm values file, commit, and ArgoCD syncs. Status: Synced. But the variable is still set on the running pod. You didn\u0026rsquo;t change it — you deleted it — and it\u0026rsquo;s still there. This isn\u0026rsquo;t an ArgoCD bug; it\u0026rsquo;s a consequence of how Kubernetes merges lists, and it has a specific fix.\nThe failure A StatefulSet\u0026rsquo;s container env was rendered from a customEnv block in values. We removed two entries (an HTTP-port override used only for local port-forwarding) and pushed to the GitOps repo. ArgoCD synced the new revision cleanly — Status=Synced, the latest commit hash, no errors. Yet the pod template still carried the removed entries:\nspec: template: spec: containers: - env: - name: APP_SERVER_HOST # we deleted this from values value: localhost # still here - name: HTTP_PORT # and this value: \u0026#34;8081\u0026#34; # still here A manual one-shot sync with Replace=true finally dropped them. So the desired state was right; the apply mechanism wasn\u0026rsquo;t removing what we\u0026rsquo;d taken out.\nThe cause: strategic-merge-patch can\u0026rsquo;t say \u0026ldquo;remove from list\u0026rdquo; ArgoCD\u0026rsquo;s default apply is a client-side, three-way strategic merge patch1 — the same machinery as kubectl apply. For lists, Kubernetes uses a merge key where one is defined. A container\u0026rsquo;s env list has merge key name. That means env entries are merged by name, not replaced wholesale:\nAdd an entry to desired state → the patch adds it. 
Change an entry\u0026rsquo;s value → the patch updates it by name. Remove an entry from desired state → the patch contains nothing about it. A strategic merge patch describes what should be present; it has no vocabulary for \u0026ldquo;this used to be here, delete it.\u0026rdquo; When you drop an env var from values, the rendered manifest simply stops mentioning it — and \u0026ldquo;stops mentioning\u0026rdquo; reads as \u0026ldquo;no change\u0026rdquo; to the merge, so the live entry survives. ArgoCD honestly reports Synced, because it successfully applied the patch it computed.\nThe fixes, least to most surgical Server-Side Apply2 (recommended). Switch the Application to SSA:\nsyncPolicy: syncOptions: - ServerSideApply=true SSA tracks field ownership. Because the ArgoCD applier previously owned that env entry, dropping it from the desired manifest causes SSA to relinquish and prune the field. It removes what you removed, without the heavy hammer below. This is the modern, correct default for this whole class of \u0026ldquo;removal doesn\u0026rsquo;t propagate\u0026rdquo; problem.\nReplace=true. Forces a full kubectl replace instead of a patch — the live object becomes exactly the desired manifest, so removed list items disappear. It works, but it\u0026rsquo;s heavyweight: a full replace on every sync, with recreate semantics for some resources. Reserve it for one-shot remediation, not steady state.\nNull it chart-side. Keep the key but render it absent/empty so the desired state explicitly overrides the old value. Workable for a known field, but it\u0026rsquo;s a manual patch per field and doesn\u0026rsquo;t generalize.\nThe general principle A declarative system is only as declarative as its apply semantics allow. Strategic-merge-patch is additive-by-default for merge-keyed lists: it reconciles presence and values, but not absence. 
So \u0026ldquo;GitOps is the source of truth\u0026rdquo; quietly fails for deletions inside merge-keyed lists — env vars, volumes, volumeMounts, containers, ports, anything with a merge key (name for most of these, mountPath for volumeMounts, containerPort for ports) — unless the apply mechanism can express removal.\nServer-Side Apply closed this gap by making field ownership explicit, so relinquishing a field means pruning it. If you run ArgoCD (or raw kubectl apply) and rely on removing list items by deleting them from values, turn on Server-Side Apply — otherwise your desired state and your cluster will silently diverge precisely on the things you took away.\nA strategic merge patch merges list items by a defined merge key (here, name) instead of replacing the list — so it expresses additions and changes, but not removals.\nKubernetes Server-Side Apply makes field ownership explicit, so a field dropped from the desired manifest is pruned. In Argo CD, enable it per-Application with the ServerSideApply=true sync option.\n","permalink":"https://allenz.net/writing/why-deleting-an-env-var-from-your-gitops-values-doesnt-remove-it-from-the-pod/","summary":"You delete an env var from your Helm values, Argo CD reports Synced — and it\u0026rsquo;s still on the pod. Why strategic-merge-patch can\u0026rsquo;t remove list items, and the Server-Side Apply fix.","tags":["GitOps","Kubernetes"],"title":"Why deleting an env var from your GitOps values doesn't remove it from the pod"},{"content":"Argo CD needs three things from GitHub, and they\u0026rsquo;re easy to conflate: it has to read your Git repositories (a machine reading code), know when you push (event delivery), and let humans log in (interactive identity). The default instinct is a personal access token for the first, polling for the second, and local accounts for the third. 
A single GitHub App does all three — better — but only if you keep straight that it\u0026rsquo;s serving two fundamentally different kinds of token.\nThe three needs Repository access — Argo CD clones and polls your GitOps repos to render manifests. Webhooks — without them, Argo CD polls on a timer (every ~3 minutes by default); with them, a push triggers a near-instant sync. SSO — operators log into the Argo CD UI/CLI, and their access should be governed by something you already manage (your GitHub org), not a pile of local accounts. Machine identity: repo access via a GitHub App installation Create a GitHub App, install it on your org (scoped to the repos Argo CD should see), and give Argo CD three values as a githubApp repository credential: the App ID, the Installation ID, and the App\u0026rsquo;s private key. Argo CD uses these to mint short-lived installation access tokens on demand.\nWhy this beats a PAT or a deploy key:\nShort-lived, not long-lived. Installation tokens expire in an hour and are minted as needed. There\u0026rsquo;s no durable secret sitting in a cluster waiting to leak. The only stored material is the App private key, which never travels over the wire to GitHub as a credential — it signs a JWT locally to request the token. Org-owned, not person-owned. A PAT is tied to a human; when they leave, it dies and your sync breaks. A GitHub App is owned by the org and survives staff churn. Fine-grained. The App grants exactly the repo permissions it needs (contents: read, and webhook admin if you want it to manage hooks), nothing more. Higher rate limits. Installation tokens get per-installation rate limits, well above a single user\u0026rsquo;s PAT ceiling — which matters once Argo CD is polling many repos. Event delivery: webhooks the App can manage Polling is a fine default and a bad steady state — three minutes of \u0026ldquo;is it merged yet?\u0026rdquo; on every change. 
The same GitHub App can hold webhook permissions, so pushes are delivered to Argo CD\u0026rsquo;s /api/webhook endpoint and trigger an immediate refresh of the affected applications. You secure the delivery with a shared webhook secret (stored in your secret manager, referenced by Argo CD), so the server can verify the payload\u0026rsquo;s HMAC signature and reject forgeries. The result is event-driven GitOps: merge, and the cluster starts reconciling in seconds, not minutes.\nHuman identity: SSO via the App\u0026rsquo;s OAuth client, through Dex Here\u0026rsquo;s the part that surprises people: the same GitHub App also carries an OAuth client (a client ID and client secret). That client drives the browser login flow, which is a completely different mechanism from the installation token above. Wire the client into Argo CD\u0026rsquo;s bundled Dex with the github connector, and:\nOperators click \u0026ldquo;Log in via GitHub,\u0026rdquo; do the standard OAuth consent, and land in the Argo CD UI. Dex reads their org and team membership and surfaces it as groups. Argo CD\u0026rsquo;s RBAC maps those groups to roles — e.g. members of the platform-admins team get role:admin, everyone else gets read-only. Access is now governed entirely by your GitHub org. Add someone to a team, they can log in; remove them, they can\u0026rsquo;t. 
No local Argo CD accounts to provision or deprovision.\nThe conceptual point: two token types, one App The thing to hold onto is that one GitHub App is serving two different grant types for two different audiences:\nRepo access + webhooks SSO Token type Installation access token User OAuth token Audience A machine (Argo CD\u0026rsquo;s repo server) A human (browser) Identity proven \u0026ldquo;this app, installed here\u0026rdquo; \u0026ldquo;this person, in this org\u0026rdquo; Credential used App ID + Installation ID + private key OAuth client ID + secret Flow JWT → installation token (server-to-server) Browser redirect → OAuth consent Almost every \u0026ldquo;I set up the GitHub App, why doesn\u0026rsquo;t X work\u0026rdquo; problem traces to mixing these up: feeding the OAuth client ID where the App ID belongs, or expecting the installation to grant UI login, or expecting the OAuth client to clone a repo. They\u0026rsquo;re orthogonal. The App is a container for both; the two halves are configured independently and fail independently.\nPractical shape Everything sensitive lives in a secret manager, not in Git or a values file:\nApp private key → referenced by the repo-credential config. Webhook secret → referenced by the webhook receiver. OAuth client ID / client secret → referenced by the Dex connector. The Argo CD config then references those secrets by name. In a Terraform/GitOps setup, the App registration can even be automated with a GitHub App manifest flow (you POST a manifest describing the permissions and events, GitHub walks the user through creation and hands back the credentials), so the whole thing is reproducible rather than a click-ops checklist.\nPortable lessons Prefer GitHub App installation tokens over PATs for machine access. Short-lived, fine-grained, org-owned, churn-proof, higher limits. 
For modern GitHub-based GitOps, a long-lived PAT in a cluster is rarely the right answer — the exceptions are legacy tooling, air-gapped environments, or integrations without App support. Separate machine identity from human identity in your head, even when one App hosts both. Installation tokens are for servers; OAuth tokens are for people. Most integration bugs are a category error between the two. Turn polling into events. Webhooks are a small amount of config that converts GitOps from \u0026ldquo;eventually\u0026rdquo; to \u0026ldquo;immediately,\u0026rdquo; and the App you already created for repo access can carry them. Govern human access through an identity you already manage. Org/team → RBAC means access control is a side effect of org membership, not a second system to keep in sync. ","permalink":"https://allenz.net/writing/one-github-app-two-auth-models-repo-credentials-webhooks-and-sso-for-argo-cd/","summary":"Argo CD needs three different things from GitHub — repo reads, webhook delivery, and human SSO. How a single GitHub App covers all three with short-lived installation tokens instead of a leak-prone PAT.","tags":["GitOps","Security"],"title":"One GitHub App, two auth models: repo credentials, webhooks, and SSO for Argo CD"},{"content":"Distributing a complex cloud platform install — dozens of enabled APIs, IAM bootstrapping, Terraform, secrets, a GitOps repo — is where good infrastructure goes to die in support tickets. \u0026ldquo;Which APIs do I enable?\u0026rdquo; \u0026ldquo;It says I don\u0026rsquo;t have permission.\u0026rdquo; \u0026ldquo;What version of Terraform?\u0026rdquo; \u0026ldquo;Where does the state live?\u0026rdquo; Each of those is a local-environment problem, and each one is avoidable. What if we could turn a multistep platform install into a browser-only, guided, clone-and-go onboarding? 
That is possible with two GCP features that are not combined often enough: Cloud Shell tutorials and Infrastructure Manager.\nThe Problem with a README Runbook A common runbook shape is a SETUP.md with many numbered steps, which invites several failures:\nUsers skip the API-enablement step and hit a cryptic error twenty minutes later. Users run Terraform with their personal owner credentials. Users keep state on their laptop, where it can leak sensitive information or be deleted by accident. Users are on the wrong tool version, leading to hard-to-debug errors. Users paste the wrong project ID into step 14. The runbook is documentation pretending to be a procedure — nothing verifies that step N actually happened before step N+1 runs.\nThe fix is to make the runbook executable and guided, and to take Terraform off the user\u0026rsquo;s machine entirely.\nPiece 1: Open in Cloud Shell A single \u0026ldquo;Open in Cloud Shell\u0026rdquo; deep link (cloudshell_open with the repo URL) clones the installer repository into the user\u0026rsquo;s Cloud Shell and drops them into it. Cloud Shell already has gcloud, an editor, and an authenticated identity — so there is no local toolchain to install and nothing to authenticate. The user goes from a link to a ready environment in one click. That single move eliminates the entire class of \u0026ldquo;works on my machine\u0026rdquo; issues, because everyone is now on the same machine: Google\u0026rsquo;s.\nPiece 2: A Cloud Shell Tutorial (the Runbook as a Program) Cloud Shell renders an interactive walkthrough from a Markdown file (teachme tutorial.md) — a side panel that guides the user step by step. It\u0026rsquo;s just Markdown with \u0026lt;walkthrough-*\u0026gt; directives, versioned in the repo alongside the code. The high-value ones:\n\u0026lt;walkthrough-project-setup billing=\u0026quot;true\u0026quot;\u0026gt; — a project picker that confirms a billing-enabled project is selected before anything else runs. 
No more \u0026ldquo;I deployed into the wrong project.\u0026rdquo; \u0026lt;walkthrough-enable-apis apis=\u0026quot;...\u0026quot;\u0026gt; — a one-click button that enables the exact list of required APIs. The tutorial declares the list, so the user enables exactly the right APIs in one click — no guessing, no missed API. \u0026lt;walkthrough-editor-open-file\u0026gt; — opens a specific file (e.g. the Terraform variables) in the Cloud Shell editor at the right moment, so the user edits the real file in place rather than having to \u0026ldquo;go find and edit X.\u0026rdquo; Inline runnable commands — fenced shell blocks the user runs with one click, with the selected project ID interpolated in (\u0026lt;walkthrough-project-id/\u0026gt;), so there\u0026rsquo;s no copy-paste-the-wrong-value step. A slice of the tutorial.md reads like this:\n## Select project \u0026lt;walkthrough-project-setup billing=\u0026#34;true\u0026#34; required=\u0026#34;true\u0026#34;\u0026gt;\u0026lt;/walkthrough-project-setup\u0026gt; ```sh gcloud config set project \u0026lt;walkthrough-project-id/\u0026gt; ``` ## Enable APIs \u0026lt;walkthrough-enable-apis apis=\u0026#34;config.googleapis.com,cloudbuild.googleapis.com,compute.googleapis.com,container.googleapis.com,iam.googleapis.com\u0026#34;\u0026gt;\u0026lt;/walkthrough-enable-apis\u0026gt; ## Configure and apply \u0026lt;walkthrough-editor-open-file filePath=\u0026#34;./setup.sh\u0026#34;\u0026gt;Open setup.sh\u0026lt;/walkthrough-editor-open-file\u0026gt; ```sh ./setup.sh \u0026lt;walkthrough-project-id/\u0026gt; ``` The difference from a README is that the walkthrough is stateful and active, not just text. It knows which project is selected and injects that into every command, and the project-setup step won\u0026rsquo;t continue until a billing-enabled project is chosen. 
The procedure can\u0026rsquo;t drift from the documentation because the procedure is the documentation, executing.\nWhat the built-in directives don\u0026rsquo;t cover, the scripts those steps run can. setup.sh is an ordinary shell script, so it can prompt for input and run its own checks — and that\u0026rsquo;s where real verification lives:\n# the walkthrough sequences steps; a script is what actually verifies state if [ \u0026#34;$(gcloud billing projects describe \u0026#34;$PROJECT\u0026#34; \\ --format=\u0026#39;value(billingEnabled)\u0026#39;)\u0026#34; != \u0026#34;True\u0026#34; ]; then echo \u0026#34;Enable billing on $PROJECT, then re-run.\u0026#34; \u0026gt;\u0026amp;2 exit 1 fi read -rp \u0026#34;Region [us-central1]: \u0026#34; REGION REGION=\u0026#34;${REGION:-us-central1}\u0026#34; The walkthrough sequences the steps; any gate beyond the project-and-billing selection is only as strong as the checks you write into the scripts it runs.\nPiece 3: Infrastructure Manager Runs the Terraform, Not the User This is the part that most changes the risk profile. Instead of the user running terraform apply locally — with their own broad credentials, their own state file, their own tool version — the install hands the Terraform to Infrastructure Manager (config.googleapis.com), GCP\u0026rsquo;s managed Terraform service. Infrastructure Manager:\nruns the Terraform server-side, as a dedicated runner service account (least-privilege, not the user\u0026rsquo;s owner credentials); manages state for you in a Google-owned bucket, so there\u0026rsquo;s no \u0026ldquo;who has the state, and is it locked?\u0026rdquo; problem; pins the execution environment, so tool-version drift disappears; exposes deployments as first-class, observable GCP resources. The user never installs Terraform, never holds state, and never applies infrastructure with their personal credentials. 
They trigger a build (here, via Cloud Build, which invokes Infrastructure Manager), and watch it in the console.\nThe IAM Bootstrap That Makes It Work Infrastructure Manager needs a small, specific permission setup, which a bootstrap script does once:\nEnable config.googleapis.com (and Cloud Build) and create the Infrastructure Manager service identity (gcloud beta services identity create --service=config.googleapis.com). Grant that service agent the config.agent role on the project and the ability to act as the runner service account (iam.serviceAccountUser), so it can execute the Terraform as the scoped runner. Grant the trigger (the Cloud Build service account) permission to manage deployments (config.admin) and to impersonate the runner. This means that a managed service runs your Terraform as a service account that you scoped, triggered by a build, with state it owns.\nWhy This Is an Effective Pattern for Distributing Infrastructure Zero local setup. Browser only. No SDK, no Terraform, no auth dance. The \u0026ldquo;my environment\u0026rdquo; support surface all but disappears. No credentials on the laptop. The user authenticates to Cloud Shell with their Google identity; the apply runs as a least-privilege runner SA inside GCP. No owner PAT, no exported service-account key. Project and APIs are handled up front. The project-setup step requires a billing-enabled project, and a one-click step enables the exact APIs the build needs — so the most common silent failures are prevented. State and tool version are managed. Infrastructure Manager owns both, so two different operators get identical, reproducible runs. Reproducible, not click-ops. The tutorial and the Terraform are versioned together; an install is a known revision of a repo, not a person\u0026rsquo;s memory of a Slack thread. Where this fits (and where it doesn\u0026rsquo;t) This is a day-0 pattern — onboarding, trials, demos, the first install of a self-hostable stack. 
Against what it replaces — a SETUP.md and a pile of bash scripts the user runs on their own laptop — it\u0026rsquo;s a strict upgrade: no local toolchain, no personal credentials on the apply, managed state, no version drift.\nIt\u0026rsquo;s not a day-2 management story. The install is one-shot. Infrastructure Manager can update a deployment, but there\u0026rsquo;s no reconciliation loop, no drift detection, no pull-request change flow. Once the stack is something a team runs in production — upgraded, drifting, owned by more than one person — you\u0026rsquo;ve outgrown the guided installer, and the rest of the landscape takes over:\nTerraform modules distribute the building blocks, but assume the consumer already has state, credentials, a runner, and a pipeline — the setup this pattern removes. Good for teams that already run Terraform; no help for onboarding. GitOps with a Terraform controller (Flux\u0026rsquo;s OpenTofu controller, or Argo CD) reconciles infrastructure from Git continuously, the way Flux and Argo reconcile apps. It\u0026rsquo;s the real day-2 answer — drift correction, PR-based change — but it needs a cluster to run in, so it can\u0026rsquo;t bootstrap the cluster it lives in. A governed IaC platform (Spacelift blueprints, HCP Terraform no-code modules) adds self-service plus ongoing management — policy, managed runners, drift, PR flow — at the cost of adopting another control plane. Infrastructure as an API (Crossplane) turns infrastructure into Kubernetes resources a consumer claims and a control plane reconciles. Strong for a platform team, heavy for a one-off install. A Marketplace listing productizes the same one-click idea with discovery and billing attached — closer to selling the install than onboarding to it. This isn\u0026rsquo;t competing with GitOps or Spacelift; it\u0026rsquo;s the on-ramp before them. 
Get someone from a link to a running stack in the browser, then hand that stack to whatever manages the rest of your infrastructure.\nPortable Lessons Meet users in the browser. Cloud Shell (or any hosted shell) eliminates an entire category of onboarding failure by removing the local environment as a variable. Encode the runbook as an executable tutorial, not a document. A walkthrough that picks the project, enables the APIs, and injects the right values can\u0026rsquo;t be skipped or fat-fingered the way a numbered list can. Run Terraform as a managed service, not on the operator\u0026rsquo;s machine. Infrastructure Manager (or any server-side Terraform runner) removes credential sprawl, state-handling mistakes, and version drift in one move — the apply runs as a scoped identity you control, not as whoever happened to click the button. ","permalink":"https://allenz.net/writing/a-clone-and-go-installer-gcp-cloud-shell-tutorials--infrastructure-manager/","summary":"Turning a many-step platform install — APIs, IAM, Terraform, state, secrets — into a browser-only, guided, clone-and-go onboarding with GCP Cloud Shell tutorials and Infrastructure Manager.","tags":["Infrastructure as Code","GCP"],"title":"A clone-and-go installer: GCP Cloud Shell tutorials + Infrastructure Manager"},{"content":" A software engineer whose career spans application development and, in recent years, cloud-native platform engineering. I build and operate Kubernetes platforms on GCP — Terraform, Argo CD, Helm, GitOps — and bring a long software-engineering background to the infrastructure underneath them. 
I work system-first and for low entropy: designing for the whole system\u0026rsquo;s behavior over time, and keeping it predictable, reversible, and self-evident to operate.\nSelected experience Senior Software Engineer · Liferay · 2025–present\nBuilding Liferay\u0026rsquo;s self-hosted Cloud Native offering on GCP / GKE — a GitOps platform built with Terraform, Argo CD, Crossplane, and Helm, including cross-environment backup/restore and database-identity portability. Also prototyped a Kubernetes operator (Python, kopf) for client-extension orchestration, run on a Docker Compose \u0026ldquo;Kubernetes-lite\u0026rdquo; harness that exercises CRD-and-controller patterns — build, reconcile, deploy — without a full cluster. Plus structured-logging and search work, and a RabbitMQ messaging reference implementation built with a team in Budapest.\nSenior Software Engineer · Liferay Cloud · 2023–2025\nLiferay Cloud PaaS + SaaS on Google Cloud (Kubernetes, Cloud SQL, Cloud Storage). Evaluated Cloud SQL performance and implemented support for migrating customers across MySQL and PostgreSQL versions; led a self-service maintenance-page offering; provided deep production support through Cloud log analysis.\nTeam Lead / Senior Software Engineer · Liferay · 2022–2023\nLed and mentored two development teams — set technical direction, reviewed pull requests, enforced standards, ran reviews and OKRs. Delivered full-stack platform projects including a new Marketplace site (Python + headless APIs) and the migration of learn.liferay.com.\nSenior Software Engineer · Liferay · 2018–2021\nCorporate web platform. 
Migrated from on-prem VMs to a managed PaaS (AWS, then GCP); built DevOps/build infrastructure (Gradle, Jenkins, Kubernetes liveness probes, Dynatrace); an Elasticsearch + React search experience; SEO and performance work; and platform version upgrades.\nEarlier Front-End Engineer · Liferay GmbH, Germany · 2016–2018 — marketing microsites and an Appcelerator Titanium events mobile app.\nSAS-BI Consultant · anaxima GmbH, Germany · 2008–2015 — risk and regulatory reporting for banking (Commerzbank market-risk, Solvency II), and a sales-management reporting platform used by thousands of bank employees.\nIndependent Consultant · 2002–2008 — C# / .NET, Postscript composition, and Oracle data migration.\nSenior Software Engineer / Technical Lead / Instructor · Active Data Corp., Baltimore · 1997–2002 — built database-driven applications (SQLWindows, C/C++, Oracle) and delivered developer training, including a five-day application-development course.\nSelected independent work Vollrad Kutscher — artist catalog site (Astro, Tailwind, Airtable, GitHub Actions) City of Wiesbaden, Germany — city-hall history exhibit (Python, MkDocs, GitHub Actions) Ingeborg Lüscher — artist catalog (custom WordPress theme) Ilona Surrey — digital fine-art portfolio (WordPress) Skills Platform \u0026amp; infrastructure: Kubernetes, GKE, Terraform, Argo CD / GitOps, Helm, Kubernetes operators (kopf), GCP (Cloud SQL, Cloud Storage), Crossplane, Docker CI/CD \u0026amp; operations: GitHub Actions, Jenkins, Dynatrace, Cloudflare, Workload Identity Federation Languages: Java, Python, JavaScript / TypeScript, Go Also: OSGi, Elasticsearch, Okta SSO / SAML, AWS AI: Claude Code, Gemini CLI Spoken languages: English and German (fluent) Contact Find me on LinkedIn and GitHub.\n","permalink":"https://allenz.net/about/","summary":"\u003cfigure class=\"avatar\"\u003e\n    \u003cimg loading=\"lazy\" src=\"/profile.jpg\"\n         alt=\"Allen Ziegenfus\" width=\"150\"/\u003e 
\n\u003c/figure\u003e\n\n\u003cp\u003eA software engineer whose career spans application development and, in recent years, cloud-native platform engineering. I build and operate Kubernetes platforms on GCP — Terraform, Argo CD, Helm, GitOps — and bring a long software-engineering background to the infrastructure underneath them. I work \u003cstrong\u003esystem-first and for low entropy\u003c/strong\u003e: designing for the whole system\u0026rsquo;s behavior over time, and keeping it predictable, reversible, and self-evident to operate.\u003c/p\u003e\n\u003ch2 id=\"selected-experience\"\u003eSelected experience\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eSenior Software Engineer\u003c/strong\u003e · Liferay · \u003cem\u003e2025–present\u003c/em\u003e\u003c/p\u003e","tags":null,"title":"About"}]