
Hands-On Task: Debugging a Failing Pod

Task Overview

In this task, we intentionally break a Kubernetes Pod, observe how Kubernetes reacts, and then fix the issue without deleting the Pod.

Task Goals

  • Create a Pod named debug-pod

  • Force it into a failure state; we will use an ErrImagePull / ImagePullBackOff failure

  • Debug the failure using Kubernetes events

  • Fix the issue live

  • Verify the Pod reaches the Running state

  • Capture and document every step you take


Step 1: Create the Pod with an Invalid Image

Create a Pod named debug-pod using an incorrect container image name.

The mistake is intentional.

Example idea (it does not matter exactly how you create it):

  • Image name: ngimx:alpine (typo)
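One way to create it, assuming plain kubectl run (any method that sets the bad image works):

```bash
# Intentionally misspelled image name ("ngimx" instead of "nginx")
kubectl run debug-pod --image=ngimx:alpine
```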

After creation, check the Pod status:
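For example:

```bash
kubectl get pod debug-pod
# or watch it retry in real time:
kubectl get pod debug-pod -w
```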

Expected Result

After some retries, it transitions to:
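The exact age will differ, and the status may briefly show ErrImagePull before settling into the back-off state, but it should look similar to:

```
NAME        READY   STATUS             RESTARTS   AGE
debug-pod   0/1     ImagePullBackOff   0          60s
```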


Step 2: Inspect the Pod and Identify the Failure

Describe the Pod to understand what went wrong:
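For example:

```bash
kubectl describe pod debug-pod
```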

In the describe output, the Events section reports the ErrImagePull error.

What to Look For

In the Events section, you should see entries similar to:

  • Pulling image "ngimx:alpine"

  • Failed to pull image

  • ErrImagePull

  • ImagePullBackOff

This confirms:

  • Kubernetes scheduling worked

  • The failure happened at the image pull stage

  • The issue is an external dependency failure (the image cannot be pulled), not scheduling or YAML syntax


Step 3: Fix the Pod Without Deleting It

Instead of deleting and recreating the Pod, edit it live:
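Assuming kubectl, open the live object in your editor. On a running Pod, the container image is one of the few fields Kubernetes allows you to change in place:

```bash
# Opens the live Pod spec in your $EDITOR
kubectl edit pod debug-pod
```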

Fix the image name:
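In the editor, correct the image under spec.containers. The container name below is an assumption; keep whatever name your Pod was created with:

```yaml
spec:
  containers:
    - name: debug-pod
      image: nginx:alpine   # was: ngimx:alpine
```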

Save and exit.


Step 4: Verify Pod Recovery

Check the Pod status again:
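For example:

```bash
kubectl get pod debug-pod
```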

Expected Result
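The age will vary; the key change is the STATUS column:

```
NAME        READY   STATUS    RESTARTS   AGE
debug-pod   1/1     Running   0          5m
```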

Describe the Pod again:
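For example:

```bash
kubectl describe pod debug-pod
```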

The output should look similar to:
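An illustrative Events section (ages, ordering, and exact messages will vary):

```
Events:
  Type     Reason   Age   From     Message
  ----     ------   ----  ----     -------
  Normal   Pulling  5m    kubelet  Pulling image "ngimx:alpine"
  Warning  Failed   5m    kubelet  Failed to pull image "ngimx:alpine"
  Warning  Failed   5m    kubelet  Error: ErrImagePull
  Normal   BackOff  4m    kubelet  Back-off pulling image "ngimx:alpine"
  Warning  Failed   4m    kubelet  Error: ImagePullBackOff
  Normal   Pulling  2m    kubelet  Pulling image "nginx:alpine"
  Normal   Pulled   2m    kubelet  Successfully pulled image "nginx:alpine"
  Normal   Created  2m    kubelet  Created container debug-pod
  Normal   Started  2m    kubelet  Started container debug-pod
```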

Observations

  • Container is now Running

  • Image was pulled successfully

  • Previous failure events are still visible

This is expected behavior.


Step 5: Verify Pod YAML

Export the final Pod manifest:
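For example:

```bash
kubectl get pod debug-pod -o yaml
# optionally save it for your notes:
kubectl get pod debug-pod -o yaml > debug-pod.yaml
```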

Key things to notice in the YAML:

  • Correct image name

  • restartPolicy: Always

  • qosClass: BestEffort (no resource requests or limits)
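A trimmed excerpt showing just these fields (most of the exported manifest is omitted, and the container name is an assumption):

```yaml
spec:
  containers:
    - name: debug-pod
      image: nginx:alpine
  restartPolicy: Always
status:
  qosClass: BestEffort
```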


Important Notes

Why Old Events Still Exist

If you run:
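For example, one way to list only this Pod's events (assuming the default namespace):

```bash
kubectl get events --field-selector involvedObject.name=debug-pod
```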

you may see the same events as in a previous kubectl describe.

This happens because Kubernetes:

  • Does not rewrite or delete old events when a Pod is fixed.

  • Treats events as append-only and time-limited.

  • Keeps old ErrImagePull events visible even after the Pod recovers; this is normal.

Kubernetes does not retain events forever. Each event has a time-to-live (TTL) and is deleted automatically after a certain period. By default, Kubernetes events expire 1 hour after they are created.
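The retention window is set cluster-wide on the control plane; on managed clusters you typically cannot change it:

```bash
# API server flag controlling event retention (default shown)
kube-apiserver --event-ttl=1h0m0s
```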

Hidden Observation

The Pod is running with:
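The relevant field from the Pod's status:

```yaml
status:
  qosClass: BestEffort
```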

This means:

  • No CPU or memory requests

  • No resource guarantees

  • First candidate for eviction under node pressure


Final Result

By the end of this task, we:

  • Forced a real Kubernetes failure

  • Used events to identify the root cause

  • Fixed the issue without recreating resources

  • Observed actual Pod state transitions

