
Hands-On Task: Debugging a Failing Pod

Task Overview

In this task, we intentionally break a Kubernetes Pod, observe how Kubernetes reacts, and then fix the issue without deleting the Pod.

Task Goals

  • Create a Pod named debug-pod

  • Force it into a failure state; we will use an ErrImagePull / ImagePullBackOff failure

  • Debug the failure using Kubernetes events

  • Fix the issue live

  • Verify the Pod reaches the Running state

  • Capture and document every step you take


Step 1: Create the Pod with an Invalid Image

Create a Pod named debug-pod using an incorrect container image name.

The mistake is intentional.

Example idea (it does not matter exactly how you create it):

  • Image name: ngimx:alpine (typo)
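One way to create it, assuming plain kubectl run (any method that sets the bad image works):

```bash
# Intentionally misspelled image name ("ngimx" instead of "nginx")
kubectl run debug-pod --image=ngimx:alpine
```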

After creation, check the Pod status:
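For example:

```bash
kubectl get pod debug-pod
# or watch it retry in real time:
kubectl get pod debug-pod -w
```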

Expected Result

After some retries, it transitions to:
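The exact age will differ, and the status may briefly show ErrImagePull before settling into the back-off state, but it should look similar to:

```
NAME        READY   STATUS             RESTARTS   AGE
debug-pod   0/1     ImagePullBackOff   0          60s
```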


Step 2: Inspect the Pod and Identify the Failure

Describe the Pod to understand what went wrong:
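For example:

```bash
kubectl describe pod debug-pod
```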

In the describe output, the Events section reports the ErrImagePull error.

What to Look For

In the Events section, you should see entries similar to:

  • Pulling image "ngimx:alpine"

  • Failed to pull image

  • ErrImagePull

  • ImagePullBackOff

This confirms:

  • Kubernetes scheduling worked

  • The failure happened at the image pull stage

  • The issue is an external dependency failure (the image cannot be pulled), not scheduling or YAML syntax


Step 3: Fix the Pod Without Deleting It

Instead of deleting and recreating the Pod, edit it live:
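Assuming kubectl, open the live object in your editor. On a running Pod, the container image is one of the few fields Kubernetes allows you to change in place:

```bash
# Opens the live Pod spec in your $EDITOR
kubectl edit pod debug-pod
```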

Fix the image name:
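In the editor, correct the image under spec.containers. The container name below is an assumption; keep whatever name your Pod was created with:

```yaml
spec:
  containers:
    - name: debug-pod
      image: nginx:alpine   # was: ngimx:alpine
```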

Save and exit.


Step 4: Verify Pod Recovery

Check the Pod status again:
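For example:

```bash
kubectl get pod debug-pod
```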

Expected Result
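The age will vary; the key change is the STATUS column:

```
NAME        READY   STATUS    RESTARTS   AGE
debug-pod   1/1     Running   0          5m
```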

Describe the Pod again:
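For example:

```bash
kubectl describe pod debug-pod
```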

The output should look similar to:
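An illustrative Events section (ages, ordering, and exact messages will vary):

```
Events:
  Type     Reason   Age   From     Message
  ----     ------   ----  ----     -------
  Normal   Pulling  5m    kubelet  Pulling image "ngimx:alpine"
  Warning  Failed   5m    kubelet  Failed to pull image "ngimx:alpine"
  Warning  Failed   5m    kubelet  Error: ErrImagePull
  Normal   BackOff  4m    kubelet  Back-off pulling image "ngimx:alpine"
  Warning  Failed   4m    kubelet  Error: ImagePullBackOff
  Normal   Pulling  2m    kubelet  Pulling image "nginx:alpine"
  Normal   Pulled   2m    kubelet  Successfully pulled image "nginx:alpine"
  Normal   Created  2m    kubelet  Created container debug-pod
  Normal   Started  2m    kubelet  Started container debug-pod
```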

Observations

  • Container is now Running

  • Image was pulled successfully

  • Previous failure events are still visible

This is expected behavior.


Step 5: Verify Pod YAML

Export the final Pod manifest:
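For example:

```bash
kubectl get pod debug-pod -o yaml
# optionally save it for your notes:
kubectl get pod debug-pod -o yaml > debug-pod.yaml
```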

Key things to notice in the YAML:

  • Correct image name

  • restartPolicy: Always

  • qosClass: BestEffort (no resource requests or limits)
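A trimmed excerpt showing just these fields (most of the exported manifest is omitted, and the container name is an assumption):

```yaml
spec:
  containers:
    - name: debug-pod
      image: nginx:alpine
  restartPolicy: Always
status:
  qosClass: BestEffort
```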


Important Notes

Why Old Events Still Exist

If you run:
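For example, one way to list only this Pod's events (assuming the default namespace):

```bash
kubectl get events --field-selector involvedObject.name=debug-pod
```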

you may see the same events as in a previous kubectl describe.

This happens because Kubernetes:

  • Does not rewrite or delete old events when a Pod is fixed.

  • Treats events as append-only and time-limited.

  • Keeps old ErrImagePull events visible even after the Pod recovers; this is normal.

Kubernetes does not retain events forever. Each event has a time-to-live (TTL) and is deleted automatically after a certain period. By default, Kubernetes events expire 1 hour after they are created.
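The retention window is set cluster-wide on the control plane; on managed clusters you typically cannot change it:

```bash
# API server flag controlling event retention (default shown)
kube-apiserver --event-ttl=1h0m0s
```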

Hidden Observation

The Pod is running with:
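The relevant field from the Pod's status:

```yaml
status:
  qosClass: BestEffort
```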

This means:

  • No CPU or memory requests

  • No resource guarantees

  • First candidate for eviction under node pressure


Final Result

By the end of this task, we:

  • Forced a real Kubernetes failure

  • Used events to identify the root cause

  • Fixed the issue without recreating resources

  • Observed actual Pod state transitions

