Published January 22, 2025

Troubleshooting the top 10 Kubernetes errors

Sheldon Lo-A-Njoe
Senior Solutions Architect

As Kubernetes adoption continues to soar, even seasoned engineers can find themselves grappling with perplexing errors. 

Whether you're managing a handful of pods or an enterprise-scale cluster, these common hiccups can quickly turn into a Kubernetes troubleshooting nightmare.

The challenges only grow when you start to introduce:

  • Different environments, like edge and air-gapped sites
  • More teams of administrators and developers, each with different access privileges
  • More complex stacks of software integrations in each cluster, each with images and dependencies

Let's look at ten of the most common Kubernetes errors and arm you with debugging strategies, useful tools, and resources to keep your deployments running smoothly.

1. CrashLoopBackOff

This Kubernetes error message is so frequent and feared that it has spawned its own memes.

What does it mean? 

The error CrashLoopBackOff means a containerized application within a pod is repeatedly crashing after starting. After the crash, the kubelet restarts the container, it crashes again, then backs off and waits to try again. 
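
You can watch the crash-and-restart cycle as it happens. The output below is illustrative, not literal:

# Watch the pod; the RESTARTS count climbs with each backoff cycle
kubectl get pod <pod_name> -w

NAME    READY   STATUS             RESTARTS      AGE
mypod   0/1     CrashLoopBackOff   5 (30s ago)   4m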

What’s the cause?

  • Application code issues
  • Misconfiguration
  • Missing dependencies
  • Insufficient resources

How can you troubleshoot it?

1. Inspect the logs of the crashing pod using:

kubectl logs <pod_name>

This command helps you identify the root cause by showing the output and error logs of the application running inside the pod.

2. Check the event logs for the pod:

kubectl describe pod <pod_name>

This provides detailed information about the pod's lifecycle and recent events, which can indicate configuration issues or resource constraints.

3. Inspect the previous container instance with:

kubectl logs <pod_name> --previous

This shows you the logs of the container from before it last crashed. You can use -p as a shorthand for --previous.

4. Adjust resource limits (CPU, memory) if needed. To adjust resource limits in a Pod, you need to modify the pod’s YAML specification to include the resource requests and limits under each container specification. Resource requests specify the amount of CPU and memory a container needs, while resource limits specify the maximum amount it can use. Here's an example for a Pod:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: myimage
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

Here's an example for a Kubernetes Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mydeployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: mycontainer
        image: myimage
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"

requests: The minimum amount of CPU and memory required for the container.

limits: The maximum amount of CPU and memory that the container can use.

memory: Memory is specified in bytes, with suffixes like K, M, G, Ki, Mi, Gi.

cpu: CPU is specified in whole cores or millicores (m). For example, 250m means 0.25 CPU cores.
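
If you'd rather not edit YAML by hand, kubectl can also patch requests and limits in place. A quick sketch; the deployment and container names are placeholders:

# Update requests and limits on a running Deployment's container
kubectl set resources deployment mydeployment -c=mycontainer \
  --requests=cpu=250m,memory=64Mi --limits=cpu=500m,memory=128Mi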

2. ImagePullBackOff

What does it mean? 

When Kubernetes creates a pod, it looks in the pod specification for the container image it needs, and attempts to pull it from the specified registry. If it can’t retrieve the image, it backs off, waiting progressively longer between each retry.

What causes it?

  • Incorrect image name or tag
  • Authentication issues with the registry
  • Network problems

How can you troubleshoot it?

1. Check the image name and tag in your deployment configuration to ensure they are correct.

2. Ensure you have the correct credentials for the container registry and that they are configured in your Kubernetes cluster. Check your registry credentials with:

kubectl get secret
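
If the credentials are missing or wrong, you can create a registry pull secret and reference it from the pod spec. A minimal sketch; the secret name regcred and the registry details are placeholders:

# Create a pull secret for the registry
kubectl create secret docker-registry regcred \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password>

Then reference it in the pod specification:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: mycontainer
    image: <registry-url>/myimage:tag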

3. Use the describe command to get more details:

kubectl describe pod <pod_name>

4. Test network connectivity to your registry.

# You can use curl to directly test the HTTP connection to the registry
curl -v <registry-url>/v2/

# If you suspect DNS issues, use dig or nslookup to resolve the registry URL. Check if the hostname can be resolved.
dig <registry-url>
nslookup <registry-url> 

# Check if you can reach the server. You can use the FQDN or the IP.
ping <registry-url>

# Test if you can connect to the port exposing the registry. This can be any port number, like 443 or (by default) 5000 for certain registries. 
nc -v <registry-url> <registry-port>

3. ErrImagePull

What does it mean?

This error is very similar to ImagePullBackOff — it also means that Kubernetes failed to pull the container image.

What causes it?

  • Incorrect image name or tag
  • Registry authentication issues

How can you troubleshoot it?

  1. Verify the image name and tag in your deployment configuration.
  2. Check if the image exists in the registry (see the sketch below this list).
  3. Inspect events for more details:

kubectl describe pod <pod_name>
kubectl events
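
To confirm the image actually exists under that name and tag, you can query the registry directly. A sketch assuming you have Docker or skopeo installed locally; the image reference is a placeholder:

# With Docker: fetch the manifest without pulling the full image
docker manifest inspect <registry-url>/myimage:tag

# With skopeo: inspect the remote image's metadata
skopeo inspect docker://<registry-url>/myimage:tag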

4. CreateContainerConfigError

What does it mean? 

Kubernetes failed to create the container configuration.

What causes it?

  • Invalid configuration in the pod specification, like environment variables that reference a missing ConfigMap or Secret, incorrect volume mounts, or security context issues.

How can you troubleshoot it?

1. Validate your pod configuration against the Kubernetes API spec.
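
A server-side dry run asks the API server to validate the manifest without creating anything, and since this error often traces back to a missing ConfigMap or Secret, it's worth confirming those exist too. A sketch; mypod.yaml and the resource names are placeholders:

# Validate the manifest against the live API server without applying it
kubectl apply --dry-run=server -f mypod.yaml

# Confirm any referenced ConfigMaps and Secrets exist in the namespace
kubectl get configmap <configmap_name>
kubectl get secret <secret_name>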

2. Use kubectl describe to get more information:

kubectl describe pod <pod_name>

5. PodInitializing

What does it mean? 

The pod is stuck in the PodInitializing state, indicating that the pod's containers are in the process of starting up.

What causes it?

  • Slow initialization due to complex startup scripts or commands
  • Issues with initialization containers that need to complete before the main containers start
  • Problems with network connectivity or mounted volumes

How can you troubleshoot it?

1. Check if any init containers are causing delays or failures. Use the following command to describe the pod and inspect the init containers' status:

kubectl describe pod <pod_name>

Look for the Init Containers section in the output.

2. View the logs of any init containers to identify issues.

kubectl logs <pod_name> -c <init_container_name>

3. Ensure the pod has network access to any required services or endpoints. You can use a simple tool like curl or ping inside the init container or main container to test connectivity.
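
If the container image doesn't ship with these tools, an ephemeral debug container is a handy workaround. A sketch assuming a reasonably recent Kubernetes version with ephemeral containers enabled; busybox is just one convenient image choice:

# Attach an ephemeral debug container to the running pod
kubectl debug -it <pod_name> --image=busybox --target=<container_name>

# Or, if the tools exist in the container, exec in directly
kubectl exec -it <pod_name> -c <container_name> -- ping <service_host>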

4. Check if the pod is waiting on volumes that are not properly mounted or are slow to initialize. Inspect the volume status in the pod description:

kubectl describe pod <pod_name>
kubectl get pvc
kubectl get pv

5. A valuable Kubernetes troubleshooting tip: ensure that the pod's resource requests and limits are correctly set, to avoid scheduling issues that could delay initialization.

6. NodeNotReady

What does it mean?

The node is not ready to accept pods.

What causes it?

  • Node health issues
  • Network problems

How can you troubleshoot it?

1. Check the status of your nodes:

kubectl get nodes

2. Inspect the node for issues using:

kubectl describe node <node_name>
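
If the node's conditions point at the kubelet, check it on the machine itself. A sketch assuming you have SSH access to the node and it runs systemd:

# On the node: confirm the kubelet service is running
systemctl status kubelet

# Review recent kubelet logs for errors
journalctl -u kubelet --since "10 minutes ago"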

7. Pending

What does it mean?

The Pod or Service is stuck in the pending state.

What causes it?

  • Lack of available resources (CPU, memory)
  • Unschedulable due to affinity/anti-affinity rules
  • For Services, it might be due to missing external resources, like load balancers

How can you troubleshoot it?

1. Check resource availability:

# List all details for a specific Pod
kubectl describe pod <pod_name>

# List the details for a specific Service
kubectl describe service <service_name>

# List all Services.
kubectl get services

2. Review scheduling constraints and policies such as node selectors, taints, and tolerations.
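
To see at a glance what the scheduler has to work with, inspect node labels and taints. The node name is a placeholder:

# Show node labels, which node selectors and affinity rules match against
kubectl get nodes --show-labels

# Show the taints on a specific node
kubectl describe node <node_name> | grep Taints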

3. For Services, check if the infrastructure supports LoadBalancer services or if tools like MetalLB or kube-vip are deployed and configured correctly.

8. FailedScheduling

What does it mean? 

Kubernetes failed to schedule the pod on any available node.

What causes it?

  • Resource constraints
  • Taints and tolerations
  • Node affinity rules

How can you troubleshoot it?

1. Review resource requests and limits in your pod specification.

2. Check node taints and pod tolerations using:

kubectl describe pod <pod_name>

3. Taints: If required, add or remove taints on a node. To add a taint:

kubectl taint nodes <node_name> key=value:taint-effect

key: The taint key to be added.

value: The taint value to be added.

taint-effect: Can be one of NoSchedule, PreferNoSchedule, or NoExecute.
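
Conversely, if an existing taint is what's blocking the pod, you can remove it by appending a trailing hyphen to the same expression:

# Remove the key=value taint with effect NoSchedule from the node
kubectl taint nodes <node_name> key=value:NoSchedule-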

4. Tolerations: If required, add a toleration to the Pod. To add tolerations to a Pod, you need to modify the pod’s YAML specification. Here’s an example YAML:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
  containers:
  - name: mycontainer
    image: myimage

5. Node Affinity: Here's an example of Node Affinity in the Pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
  containers:
  - name: mycontainer
    image: myimage
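
For this rule to be satisfiable, at least one node must actually carry the label. You can check and set it like so (the label key and value mirror the example above; the node name is a placeholder):

# List the nodes that carry the example label
kubectl get nodes -l kubernetes.io/e2e-az-name

# Add the label to a node so the pod can schedule there
kubectl label nodes <node_name> kubernetes.io/e2e-az-name=e2e-az1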

9. ContainerCannotRun

What does it mean?

The container failed to start.

What causes it?

  • Invalid command or arguments in the pod specification
  • Issues with the container image
  • Missing required files or environment variables

How can you troubleshoot it?

1. Inspect the container logs for errors:

kubectl logs <pod_name>

2. Check the container's command and arguments in the pod specification.
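
If you suspect the entrypoint itself, a common trick is to temporarily override the command so the container stays up long enough to inspect it from the inside. A sketch; it assumes the image contains a sleep binary:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: myimage
    # Temporary override: keep the container alive for an hour of debugging
    command: ["sleep"]
    args: ["3600"]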

10. OOMKilled

What does it mean?

The container was killed because it used more memory than its limit allows. The name comes from the Linux out-of-memory (OOM) killer, which terminates processes that exceed their memory limits.

What causes it?

  • Application memory leaks
  • Insufficient memory limits set in the pod specification

How can you troubleshoot and fix it?

1. Check container logs for out-of-memory errors.

# Print the logs of the Pod
kubectl logs <pod_name>

# Print the logs of a specific container in a Pod
kubectl logs <pod_name> -c <container_name>

# List all info on the Pod, including any events
kubectl describe pod <pod_name>
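
You can also confirm the kill reason straight from the container status. A sketch using jsonpath; the index 0 assumes a single-container pod:

# Print the termination reason of the last container instance (expect OOMKilled)
kubectl get pod <pod_name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'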

2. Review and adjust memory requests and limits in your pod specification. See error number one, CrashLoopBackOff, at the top of this blog for examples of how to change resource requests and limits.

Useful Tools for Debugging Kubernetes Errors

  1. kubectl: The primary command-line tool for interacting with Kubernetes clusters.
  2. K9s: (My personal favorite) A terminal UI to interact with your Kubernetes clusters.
  3. kubectx & kubens: Command-line tools that help you switch quickly between contexts and namespaces.
  4. Prometheus and Grafana: For monitoring and alerting.
  5. Helm: For managing Kubernetes applications.
  6. Kubernetes Dashboard: A web UI for managing in-cluster resources.

Conclusion

Troubleshooting Kubernetes errors can be challenging, but understanding common error messages and knowing how to debug them effectively can save you a lot of time and effort. Use the debugging steps, resources, and tools mentioned in this guide to streamline your Kubernetes management and ensure a smoother deployment experience. Happy troubleshooting!

While managing one cluster is doable this way, you might find that the same issue needs to be resolved across multiple clusters and potentially across different infrastructures. With Spectro Cloud's Palette, you can debug the problem on one cluster, apply the fix to the initial cluster/application blueprint, or what we call a Cluster Profile, and then apply the fixed Profile to any cluster running the faulty configuration. 

To see it for yourself, why not book a 1:1 demo with one of our experts?

Tags:
How to
Concepts
Cloud
Developers