Your deployment looked fine. kubectl get pods said Running. But one pod was serving a build from three days ago.
You roll out a deployment. All pods show Running. Your readiness probes pass. Metrics look normal — mostly. But one service is throwing intermittent 500s that disappear on retry. Your logs are inconsistent. Feature X works sometimes, fails others. You restart the pod and it goes away.
You close the incident. You blame flakiness.
Two weeks later it happens again.
What you probably missed: your two replica pods were running different code.
Same tag. Same deployment spec. Different bytes underneath.
Here’s how that happens, how to catch it, and how to make sure it never silently bites you again.
How Two Pods End Up Running Different Code
This scenario has a few common causes. All of them are silent.
Cause 1: Someone re-pushed to the same tag
CI finishes a build and pushes wasaa-web-service:production-latest. Twenty minutes later, a hotfix merges and CI pushes again — to the same tag. No one changes the deployment spec. Kubernetes doesn't know the tag changed underneath it.
Pod A was already running. It pulled the old build. Kubernetes doesn’t restart running pods when a tag moves.
Pod B gets scheduled on a new node (autoscaler event, node rotation, OOM eviction — pick one). It pulls fresh. It gets the new build.
Now you have two pods on the same tag, in the same deployment, running different code. kubectl get pods shows both Running. Nothing is obviously wrong.
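If you suspect this happened, pod start times tell the story: pods that started before the re-push carry the older digest. A quick sketch (assumes single-container pods and the label app=wasaa-web-service):
kubectl get pods -n production -l app=wasaa-web-service \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}'
Line the start times up against your CI push history and the skew explains itself.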
Cause 2: imagePullPolicy inconsistency
If imagePullPolicy: IfNotPresent (the default for non-:latest tags), the kubelet checks if the image is already cached on the node. If it is, it skips the pull entirely.
Node A has an old cached layer. Node B pulled fresh last week. Same tag, different cached state.
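You can check which policy each pod is actually running with (a quick sketch, again assuming single-container pods):
kubectl get pods -n production -l app=wasaa-web-service \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].imagePullPolicy}{"\n"}{end}'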
Cause 3: A tag was silently re-pointed in your registry
Someone with registry write access re-tags an image. Or a third-party base image you depend on moves :stable to a new commit. Your deployment spec hasn't changed. Your pods haven't restarted. But the next pod that pulls gets something different.
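If you have pull access and crane installed, ask the registry what the tag resolves to right now, then compare it against a running pod (tag and pod name here are illustrative):
crane digest ghcr.io/your-org/wasaa-web-service:production-latest
kubectl get pod wasaa-web-service-7d9f8b-xk2pq -n production \
  -o jsonpath='{.status.containerStatuses[0].imageID}'
If the two digests differ, the tag moved after that pod pulled.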
Cause 4: A skewed rollout that never finished cleanly
A rolling update got stuck — one pod updated, one didn’t — and the deployment controller gave up or was manually interrupted. Old pod stays old. New pod is new. Both are Running.
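An interrupted rollout leaves a visible trace: two ReplicaSets with live pods. Something like this surfaces it (label is illustrative):
kubectl get rs -n production -l app=wasaa-web-service
Two rows with DESIRED greater than zero means the old and new pod templates are both still serving traffic.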
The Thing That Exposes All of It: The Image Digest
Every container image has two identifiers:
Tag — a mutable name. production-202605122058-c117b4b, :latest, :stable. Tags are just pointers. Anyone with push access can move them. The same tag can point to entirely different bytes tomorrow.
Digest — an immutable SHA-256 fingerprint, computed over the image manifest. If the bytes change, the digest changes. Two different images cannot share a digest unless SHA-256 itself is broken; producing a collision is computationally infeasible.
Kubernetes reports the digest in each pod's status:
imageID: ghcr.io/your-org/wasaa-web-service@sha256:b8f57b6b03a8...ea2956
That single line is a cryptographic proof of what is actually running inside that pod.
If two pods on the same deployment report different imageID digests, they are running different code. Full stop.
How to Check It Right Now
Method 1: kubectl describe — manual, per pod
kubectl describe pod wasaa-web-service-7d9f8b-xk2pq -n production | grep "Image ID"
Output:
Image ID: ghcr.io/your-org/wasaa-web-service@sha256:b8f57b6b03a8...ea2956
Do this for every replica and compare. If they match — you’re clean. If they don’t — you have a problem.
Method 2: kubectl get pods with jsonpath — fast, scriptable
kubectl get pods -n production \
  -l app=wasaa-web-service \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}'
Output:
wasaa-web-service-7d9f8b-xk2pq ghcr.io/.../wasaa-web-service@sha256:b8f57b6b03a8...ea2956
wasaa-web-service-7d9f8b-mn4rs ghcr.io/.../wasaa-web-service@sha256:b8f57b6b03a8...ea2956
Both digests identical = both pods are byte-identical. This is what you want to see.
Now watch what a skewed deployment looks like:
wasaa-web-service-7d9f8b-xk2pq ghcr.io/.../wasaa-web-service@sha256:b8f57b6b03a8...ea2956
wasaa-web-service-7d9f8b-mn4rs ghcr.io/.../wasaa-web-service@sha256:c3a91f2d77e4...1b8034
Two different digests. One tag. Two different builds. That’s your incident.
Method 3: A shell one-liner across your whole namespace
kubectl get pods -n production \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}' \
  | awk '{print $2, $1}' | sort
This groups pods by digest (it assumes one container per pod; the audit script below handles the general case). If any deployment's pods cluster into two different digest groups, it appears immediately.
Method 4: Write it as a proper audit script
#!/bin/bash
# digest-audit.sh — detect pods running different image digests within the same deployment.
# Exits non-zero on a mismatch so CI can use it as a gate.
NAMESPACE=${1:-production}
echo "Checking namespace: $NAMESPACE"
echo "---"
kubectl get pods -n "$NAMESPACE" -o json | jq -r '
  .items[] |
  .metadata.name as $pod |
  (.metadata.labels["app"] // "unlabeled") as $app |
  .status.containerStatuses[]? |
  [$app, $pod, .imageID] |
  @tsv
' | sort | awk -F'\t' '
  {
    app = $1; pod = $2; digest = $3
    key = app SUBSEP digest   # composite key: portable POSIX awk, no gawk-only arrays-of-arrays
    pods[key] = pods[key] " " pod
    if (!(key in seen)) { seen[key] = 1; ndigests[app]++ }
  }
  END {
    mismatch = 0
    # first pass: flag every app that spans more than one digest
    for (key in pods) {
      split(key, parts, SUBSEP)
      if (ndigests[parts[1]] > 1) flagged[parts[1]] = 1
    }
    # second pass: print each flagged app with its digest groups
    for (app in flagged) {
      mismatch = 1
      print "⚠️ DIGEST MISMATCH: " app
      for (key in pods) {
        split(key, parts, SUBSEP)
        if (parts[1] == app)
          print "  ..." substr(parts[2], length(parts[2]) - 15) " →" pods[key]
      }
      print ""
    }
    if (mismatch == 0) print "✅ All pods within each deployment are running identical images."
    exit mismatch
  }
'
Run this after every deploy as a smoke test. Better yet, run it in CI as a post-deploy verification step.
What Errors Actually Surface When Pods Are Skewed
This is the part that makes digest skew so dangerous: the errors look like something else entirely.
Intermittent 500s that resolve on retry. Your load balancer is round-robining across pods. One pod handles the request correctly (new code). The other returns an error (old code, missing a migration, incompatible API shape). Retry hits the good pod. You close the ticket as “transient.”
Feature flags behaving inconsistently. New code checks a feature flag. Old code doesn’t have that check — it either always runs the feature or never does. Users in the same session see different behaviour depending on which pod serves them.
Database schema errors on specific pods. You deployed a migration and a new build together. The new build expects column X. The old build doesn’t know about column X and panics when it encounters rows that have it. Two pods, same table, different reactions.
Inconsistent log formats breaking your alerting. New build changed a log field name. Your alerts match on the old field name. The old pod keeps firing alerts. The new pod is silent. Your on-call sees noise from one pod and assumes it’s a fluke.
Health check passing on both. Unless your health check explicitly validates the build version or a feature introduced in the new build, it will pass on both pods regardless. 200 OK from both. Nothing obviously wrong.
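One cheap countermeasure: bake the build SHA into the app and expose it on an endpoint, then compare across pods. A sketch, assuming a hypothetical /version endpoint on port 8080 and curl present in the container image:
for pod in $(kubectl get pods -n production -l app=wasaa-web-service -o name); do
  echo -n "$pod → "
  kubectl exec -n production "$pod" -- curl -s localhost:8080/version
  echo
done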
How to Prevent It
1. Always use timestamped or commit-sha tags — never :latest in production
image: ghcr.io/your-org/wasaa-web-service:production-202605122058-c117b4b
This doesn’t prevent skew, but it makes re-pushes to the same tag much harder to do accidentally — because CI would have to deliberately push to an existing timestamp tag.
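In CI that can be as simple as deriving the tag from the clock and the commit, so reusing one would take deliberate effort (a sketch of the idea):
TAG="production-$(date -u +%Y%m%d%H%M)-$(git rev-parse --short HEAD)"
docker build -t "ghcr.io/your-org/wasaa-web-service:$TAG" .
docker push "ghcr.io/your-org/wasaa-web-service:$TAG"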
2. Pin to digest in production
image: ghcr.io/your-org/wasaa-web-service@sha256:b8f57b6b03a8...ea2956
This is the hard guarantee. No matter what happens to the tag, this pod will always pull exactly those bytes. A re-push to the tag does nothing. A registry admin moving pointers does nothing. The digest is what gets pulled.
The tradeoff: your deployment spec becomes harder to read and requires tooling to update. Tools like crane digest or cosign triangulate can resolve a tag to its current digest in CI and write it into the spec automatically.
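A minimal CI sketch with crane (the manifest path and sed pattern are illustrative; Helm and Kustomize have their own mechanisms for injecting this):
IMAGE="ghcr.io/your-org/wasaa-web-service"
DIGEST=$(crane digest "$IMAGE:production-202605122058-c117b4b")
sed -i "s|image: $IMAGE.*|image: $IMAGE@$DIGEST|" deploy/production/deployment.yaml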
3. Set imagePullPolicy: Always in production
imagePullPolicy: Always
Forces every pod start to check the registry. Combined with digest pinning, this means every pod start verifies it has the right bytes before running. Without Always, a cached old image on the node will be used even if you've updated the tag.
4. Enforce digest-only images with Kyverno
If you’re running Kyverno, you can write a policy that rejects any deployment that specifies a tag instead of a digest:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digest
spec:
  validationFailureAction: enforce
  rules:
    - name: check-image-digest
      match:
        resources:
          kinds: [Pod]
      validate:
        message: "Image must be specified with a digest, not a tag."
        pattern:
          spec:
            containers:
              - image: "*@sha256:*"
Now if any deployment tries to deploy with a tag, the admission controller rejects it. The only way in is through a digest.
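Worth smoke-testing after you apply it. A tag-only pod should now bounce off admission (the exact error text varies by Kyverno version):
kubectl run digest-test -n production \
  --image=ghcr.io/your-org/wasaa-web-service:production-latest
# expected: the admission webhook denies the request, citing require-image-digest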
5. Add a post-deploy digest verification step to CI
After your helm upgrade or kubectl apply, add a step that waits for the rollout to settle and then runs the audit script above. If any pods in the deployment come up with different digests, fail the pipeline. A GitHub Actions-style sketch (the deployment name is illustrative):
- name: Verify image digest consistency
  run: |
    kubectl rollout status deployment/wasaa-web-service -n production --timeout=120s
    if ! ./scripts/digest-audit.sh production; then
      echo "Digest mismatch detected. Rolling back."
      exit 1
    fi
Testing the command directly matters here: CI shells usually run with -e, so a bare $? check after a failing command never executes.
The Quick Checklist When You Suspect Skewed Pods
□ kubectl get pods -o jsonpath ... (Method 2 above)
→ Do all replicas show the same digest?
□ kubectl rollout status deployment/your-service -n production
→ Did the rollout actually complete?
□ Check your registry: was the tag re-pushed recently?
→ Compare push timestamps to when pods were scheduled
□ Check node image cache: was one node holding a stale layer?
→ kubectl get node <node> -o json | jq '.status.images[].names' (the kubelet's cached images live in node status)
□ If skewed: kubectl rollout restart deployment/your-service
→ Replaces every pod; with imagePullPolicy: Always (or digest pinning) they all come back on the same bytes
Closing Thought
The tag is a lie. A useful lie — human-readable, version-labeled, easy to communicate — but a lie. The digest is the truth.
Your monitoring, your alerting, your incident response all operate on the assumption that pods in the same deployment are running the same code. When that assumption breaks silently, you spend hours debugging symptoms instead of finding the actual cause.
Make digest verification part of your deploy checklist. Run it after every rollout. Add it to your runbooks. It takes 30 seconds and it has saved us from multi-hour investigations more than once.
The next time your logs are inconsistent and your on-call can’t explain why — check the digests first.
Follow me for more Kubernetes war stories, SRE practices, and the kind of infrastructure bugs that only show up when you least expect them. More coming on supply-chain integrity, image signing with cosign, and Kyverno policy patterns.
If you’ve been hit by digest skew before — drop it in the comments. I want to know what the symptom was that finally gave it away.
Tags: Kubernetes · DevOps · SRE · Container Security · Supply Chain · Platform Engineering