Your deployment looked fine. kubectl get pods said Running. But one pod was serving a build from three days ago.
You roll out a deployment. All pods show Running. Your readiness probes pass. Metrics look normal — mostly. But one service is throwing intermittent 500s that disappear on retry. Your logs are inconsistent. Feature X works sometimes, fails others. You restart the pod and it goes away.
You close the incident. You blame flakiness.
Two weeks later it happens again.
What you probably missed: your two replica pods were running different code.
Same tag. Same deployment spec. Different bytes underneath.
Here’s how that happens, how to catch it, and how to make sure it never silently bites you again.
How Two Pods End Up Running Different Code
This scenario has a few common causes. All of them are silent.
Cause 1: Someone re-pushed to the same tag
CI finishes a build and pushes wasaa-web-service:production-latest. Twenty minutes later, a hotfix merges and CI pushes again — to the same tag. No one changes the deployment spec. Kubernetes doesn't know the tag changed underneath it.
Pod A was already running. It pulled the old build. Kubernetes doesn’t restart running pods when a tag moves.
Pod B gets scheduled on a new node (autoscaler event, node rotation, OOM eviction — pick one). It pulls fresh. It gets the new build.
Now you have two pods on the same tag, in the same deployment, running different code. kubectl get pods shows both Running. Nothing is obviously wrong.
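If you suspect this happened, pod start times tell the story: pods that started before the re-push carry the older digest. A quick sketch (assumes single-container pods and the label app=wasaa-web-service):
kubectl get pods -n production -l app=wasaa-web-service \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}'
Line the start times up against your CI push history and the skew explains itself.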
Cause 2: imagePullPolicy inconsistency
If imagePullPolicy: IfNotPresent (the default for non-:latest tags), the kubelet checks if the image is already cached on the node. If it is, it skips the pull entirely.
Node A has an old cached layer. Node B pulled fresh last week. Same tag, different cached state.
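You can check which policy each pod is actually running with (a quick sketch, again assuming single-container pods):
kubectl get pods -n production -l app=wasaa-web-service \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].imagePullPolicy}{"\n"}{end}'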
Cause 3: A tag was silently re-pointed in your registry
Someone with registry write access re-tags an image. Or a third-party base image you depend on moves :stable to a new commit. Your deployment spec hasn't changed. Your pods haven't restarted. But the next pod that pulls gets something different.
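If you have pull access and crane installed, ask the registry what the tag resolves to right now, then compare it against a running pod (tag and pod name here are illustrative):
crane digest ghcr.io/your-org/wasaa-web-service:production-latest
kubectl get pod wasaa-web-service-7d9f8b-xk2pq -n production \
  -o jsonpath='{.status.containerStatuses[0].imageID}'
If the two digests differ, the tag moved after that pod pulled.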
Cause 4: A skewed rollout that never finished cleanly
A rolling update got stuck — one pod updated, one didn’t — and the deployment controller gave up or was manually interrupted. Old pod stays old. New pod is new. Both are Running.
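An interrupted rollout leaves a visible trace: two ReplicaSets with live pods. Something like this surfaces it (label is illustrative):
kubectl get rs -n production -l app=wasaa-web-service
Two rows with DESIRED greater than zero means the old and new pod templates are both still serving traffic.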
The Thing That Exposes All of It: The Image Digest
Every container image has two identifiers:
Tag — a mutable name. production-202605122058-c117b4b, :latest, :stable. Tags are just pointers. Anyone with push access can move them. The same tag can point to entirely different bytes tomorrow.
Digest — an immutable SHA-256 fingerprint, computed over the image manifest. If the bytes change, the digest changes. Two different images cannot share a digest unless SHA-256 itself is broken; producing a collision is computationally infeasible.
Kubernetes reports the digest in each pod's status:
imageID: ghcr.io/your-org/wasaa-web-service@sha256:b8f57b6b03a8...ea2956
That single line is a cryptographic proof of what is actually running inside that pod.
If two pods on the same deployment report different imageID digests, they are running different code. Full stop.
How to Check It Right Now
Method 1: kubectl describe — manual, per pod
kubectl describe pod wasaa-web-service-7d9f8b-xk2pq -n production | grep "Image ID"
Output:
Image ID: ghcr.io/your-org/wasaa-web-service@sha256:b8f57b6b03a8...ea2956
Do this for every replica and compare. If they match — you’re clean. If they don’t — you have a problem.
Method 2: kubectl get pods with jsonpath — fast, scriptable
kubectl get pods -n production \
  -l app=wasaa-web-service \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}'
Output:
wasaa-web-service-7d9f8b-xk2pq ghcr.io/.../wasaa-web-service@sha256:b8f57b6b03a8...ea2956
wasaa-web-service-7d9f8b-mn4rs ghcr.io/.../wasaa-web-service@sha256:b8f57b6b03a8...ea2956
Both digests identical = both pods are byte-identical. This is what you want to see.
Now watch what a skewed deployment looks like:
wasaa-web-service-7d9f8b-xk2pq ghcr.io/.../wasaa-web-service@sha256:b8f57b6b03a8...ea2956
wasaa-web-service-7d9f8b-mn4rs ghcr.io/.../wasaa-web-service@sha256:c3a91f2d77e4...1b8034
Two different digests. One tag. Two different builds. That’s your incident.
Method 3: A shell one-liner across your whole namespace
kubectl get pods -n production \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}' \
  | awk '{print $2, $1}' | sort
This groups pods by digest (it assumes one container per pod; the audit script below handles the general case). If any deployment's pods cluster into two different digest groups, it appears immediately.
Method 4: Write it as a proper audit script
#!/bin/bash
# digest-audit.sh — detect pods running different image digests within the same deployment.
# Exits non-zero on a mismatch so CI can use it as a gate.
NAMESPACE=${1:-production}
echo "Checking namespace: $NAMESPACE"
echo "---"
kubectl get pods -n "$NAMESPACE" -o json | jq -r '
  .items[] |
  .metadata.name as $pod |
  (.metadata.labels["app"] // "unlabeled") as $app |
  .status.containerStatuses[]? |
  [$app, $pod, .imageID] |
  @tsv
' | sort | awk -F'\t' '
  {
    app = $1; pod = $2; digest = $3
    key = app SUBSEP digest   # composite key: portable POSIX awk, no gawk-only arrays-of-arrays
    pods[key] = pods[key] " " pod
    if (!(key in seen)) { seen[key] = 1; ndigests[app]++ }
  }
  END {
    mismatch = 0
    # first pass: flag every app that spans more than one digest
    for (key in pods) {
      split(key, parts, SUBSEP)
      if (ndigests[parts[1]] > 1) flagged[parts[1]] = 1
    }
    # second pass: print each flagged app with its digest groups
    for (app in flagged) {
      mismatch = 1
      print "⚠️ DIGEST MISMATCH: " app
      for (key in pods) {
        split(key, parts, SUBSEP)
        if (parts[1] == app)
          print "  ..." substr(parts[2], length(parts[2]) - 15) " →" pods[key]
      }
      print ""
    }
    if (mismatch == 0) print "✅ All pods within each deployment are running identical images."
    exit mismatch
  }
'
Run this after every deploy as a smoke test. Better yet, run it in CI as a post-deploy verification step.
What Errors Actually Surface When Pods Are Skewed
This is the part that makes digest skew so dangerous: the errors look like something else entirely.
Intermittent 500s that resolve on retry. Your load balancer is round-robining across pods. One pod handles the request correctly (new code). The other returns an error (old code, missing a migration, incompatible API shape). Retry hits the good pod. You close the ticket as “transient.”
Feature flags behaving inconsistently. New code checks a feature flag. Old code doesn’t have that check — it either always runs the feature or never does. Users in the same session see different behaviour depending on which pod serves them.
Database schema errors on specific pods. You deployed a migration and a new build together. The new build expects column X. The old build doesn’t know about column X and panics when it encounters rows that have it. Two pods, same table, different reactions.
Inconsistent log formats breaking your alerting. New build changed a log field name. Your alerts match on the old field name. The old pod keeps firing alerts. The new pod is silent. Your on-call sees noise from one pod and assumes it’s a fluke.
Health check passing on both. Unless your health check explicitly validates the build version or a feature introduced in the new build, it will pass on both pods regardless. 200 OK from both. Nothing obviously wrong.
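One cheap countermeasure: bake the build SHA into the app and expose it on an endpoint, then compare across pods. A sketch, assuming a hypothetical /version endpoint on port 8080 and curl present in the container image:
for pod in $(kubectl get pods -n production -l app=wasaa-web-service -o name); do
  echo -n "$pod → "
  kubectl exec -n production "$pod" -- curl -s localhost:8080/version
  echo
done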
How to Prevent It
1. Always use timestamped or commit-sha tags — never :latest in production
image: ghcr.io/your-org/wasaa-web-service:production-202605122058-c117b4b
This doesn’t prevent skew, but it makes re-pushes to the same tag much harder to do accidentally — because CI would have to deliberately push to an existing timestamp tag.
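In CI that can be as simple as deriving the tag from the clock and the commit, so reusing one would take deliberate effort (a sketch of the idea):
TAG="production-$(date -u +%Y%m%d%H%M)-$(git rev-parse --short HEAD)"
docker build -t "ghcr.io/your-org/wasaa-web-service:$TAG" .
docker push "ghcr.io/your-org/wasaa-web-service:$TAG"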
2. Pin to digest in production
image: ghcr.io/your-org/wasaa-web-service@sha256:b8f57b6b03a8...ea2956
This is the hard guarantee. No matter what happens to the tag, this pod will always pull exactly those bytes. A re-push to the tag does nothing. A registry admin moving pointers does nothing. The digest is what gets pulled.
The tradeoff: your deployment spec becomes harder to read and requires tooling to update. Tools like crane digest or cosign triangulate can resolve a tag to its current digest in CI and write it into the spec automatically.
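A minimal CI sketch with crane (the manifest path and sed pattern are illustrative; Helm and Kustomize have their own mechanisms for injecting this):
IMAGE="ghcr.io/your-org/wasaa-web-service"
DIGEST=$(crane digest "$IMAGE:production-202605122058-c117b4b")
sed -i "s|image: $IMAGE.*|image: $IMAGE@$DIGEST|" deploy/production/deployment.yaml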
3. Set imagePullPolicy: Always in production
imagePullPolicy: Always
Forces every pod start to check the registry. Combined with digest pinning, this means every pod start verifies it has the right bytes before running. Without Always, a cached old image on the node will be used even if you've updated the tag.
4. Enforce digest-only images with Kyverno
If you’re running Kyverno, you can write a policy that rejects any deployment that specifies a tag instead of a digest:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digest
spec:
  validationFailureAction: enforce
  rules:
    - name: check-image-digest
      match:
        resources:
          kinds: [Pod]
      validate:
        message: "Image must be specified with a digest, not a tag."
        pattern:
          spec:
            containers:
              - image: "*@sha256:*"
Now if any deployment tries to deploy with a tag, the admission controller rejects it. The only way in is through a digest.
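Worth smoke-testing after you apply it. A tag-only pod should now bounce off admission (the exact error text varies by Kyverno version):
kubectl run digest-test -n production \
  --image=ghcr.io/your-org/wasaa-web-service:production-latest
# expected: the admission webhook denies the request, citing require-image-digest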
5. Add a post-deploy digest verification step to CI
After your helm upgrade or kubectl apply, add a step that waits for the rollout to settle and then runs the audit script above. If any pods in the deployment come up with different digests, fail the pipeline. A GitHub Actions-style sketch (the deployment name is illustrative):
- name: Verify image digest consistency
  run: |
    kubectl rollout status deployment/wasaa-web-service -n production --timeout=120s
    if ! ./scripts/digest-audit.sh production; then
      echo "Digest mismatch detected. Rolling back."
      exit 1
    fi
Testing the command directly matters here: CI shells usually run with -e, so a bare $? check after a failing command never executes.
The Quick Checklist When You Suspect Skewed Pods
□ kubectl get pods -o jsonpath ... (Method 2 above)
→ Do all replicas show the same digest?
□ kubectl rollout status deployment/your-service -n production
→ Did the rollout actually complete?
□ Check your registry: was the tag re-pushed recently?
→ Compare push timestamps to when pods were scheduled
□ Check node image cache: was one node holding a stale layer?
→ kubectl get node <node> -o json | jq '.status.images[].names' (the kubelet's cached images live in node status)
□ If skewed: kubectl rollout restart deployment/your-service
→ Replaces every pod; with imagePullPolicy: Always (or digest pinning) they all come back on the same bytes
Closing Thought
The tag is a lie. A useful lie — human-readable, version-labeled, easy to communicate — but a lie. The digest is the truth.
Your monitoring, your alerting, your incident response all operate on the assumption that pods in the same deployment are running the same code. When that assumption breaks silently, you spend hours debugging symptoms instead of finding the actual cause.
Make digest verification part of your deploy checklist. Run it after every rollout. Add it to your runbooks. It takes 30 seconds and it has saved us from multi-hour investigations more than once.
The next time your logs are inconsistent and your on-call can’t explain why — check the digests first.
Follow me for more Kubernetes war stories, SRE practices, and the kind of infrastructure bugs that only show up when you least expect them. More coming on supply-chain integrity, image signing with cosign, and Kyverno policy patterns.
If you’ve been hit by digest skew before — drop it in the comments. I want to know what the symptom was that finally gave it away.
Tags: Kubernetes · DevOps · SRE · Container Security · Supply Chain · Platform Engineering