Chapter 21: Kubernetes Troubleshooting Techniques

Introduction to Kubernetes Troubleshooting

Troubleshooting Kubernetes can be daunting because clusters are distributed and dynamic. However, a small set of techniques and tools makes diagnosing and resolving problems much simpler. This chapter covers practical strategies for debugging Pods, Services, storage, nodes, and clusters efficiently.

Common Troubleshooting Scenarios

  1. Pods Not Running: CrashLoopBackOff, Pending, or Error states.
  2. Networking Issues: Unreachable services or inter-Pod communication failures.
  3. Persistent Volume Issues: Storage not being provisioned or mounted correctly.
  4. Cluster-Level Failures: Node unavailability, API server errors, or resource constraints.

Step-by-Step Troubleshooting Techniques

Step 1: Troubleshooting Pods

Check Pod Status

1. View Pod Details:

    kubectl get pods

2. Describe a Problematic Pod:

    kubectl describe pod <pod-name>

Look for events like ImagePullBackOff or FailedScheduling.

3. Check Pod Logs:

    kubectl logs <pod-name>

4. Stream Live Logs:

    kubectl logs <pod-name> -f
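When a namespace holds many Pods, it can help to filter the list down to the ones that need attention. The following is a minimal sketch, assuming the default column layout of kubectl get pods --no-headers in a single namespace (NAME, READY, STATUS, ...). The function name unhealthy_pods is hypothetical.

```shell
# Print the name and status of every Pod whose STATUS column
# is neither Running nor Completed.
# Reads `kubectl get pods --no-headers` output on stdin.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (requires cluster access):
#   kubectl get pods --no-headers | unhealthy_pods
```

With --all-namespaces the columns shift by one, so the field numbers would need adjusting.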

Fix Common Pod Issues

• ImagePullBackOff:
  • Check the image name and registry credentials in the Pod’s events:

    kubectl describe pod <pod-name>

  • Update the image name, or add an image pull secret for private registries:

    kubectl set image deployment/<deployment-name> <container-name>=<new-image>

• CrashLoopBackOff:
  • Debug container errors by starting an interactive shell. This works only while the container is running, and images without bash need /bin/sh instead:

    kubectl exec -it <pod-name> -- /bin/bash
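A container in CrashLoopBackOff is often down at the moment you try to open a shell in it, so the termination reason and the logs of the previous instance are usually a better starting point. This is a sketch under that assumption. The function name crash_report and the 50-line tail are illustrative choices, not part of the original steps.

```shell
# Show why the previous container instance of a Pod terminated,
# then print the tail of its logs.
crash_report() {
  pod="$1"
  # Per container: name, last termination reason (for example OOMKilled
  # or Error) and exit code.
  kubectl get pod "$pod" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
  # Logs from the previous (crashed) instance rather than the current one.
  kubectl logs "$pod" --previous --tail=50
}

# Usage (requires cluster access):
#   crash_report <pod-name>
```

For multi-container Pods, add -c <container-name> to the kubectl logs call to pick the container that is crashing.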

Step 2: Troubleshooting Services

Verify Service Configuration

1. Check Service Details:

    kubectl get svc

2. Describe the Service:

    kubectl describe svc <service-name>

3. Test Service Reachability:

    kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- curl <service-name>:<port>

The --rm -it flags print the response in your terminal and delete the test Pod when the command finishes.

Common Service Issues

• No Endpoint:
  • Verify Pods are labeled correctly:

    kubectl get pods --selector=<label>

• DNS Resolution Failures:
  • Check the CoreDNS logs:

    kubectl logs -n kube-system -l k8s-app=kube-dns
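Both service issues above can be checked from the command line: an empty endpoints list points at a selector mismatch, and a failed lookup from inside the cluster points at DNS. The sketch below assumes the default cluster domain cluster.local and uses a temporary busybox:1.28 Pod for the lookup. The function names svc_fqdn and check_service, and the Pod name dns-test, are hypothetical.

```shell
# Build the fully qualified DNS name of a Service.
# Assumes the default cluster domain "cluster.local";
# the namespace falls back to "default" when omitted.
svc_fqdn() {
  printf '%s.%s.svc.cluster.local\n' "$1" "${2:-default}"
}

# List the Service's endpoints, then resolve its name from inside the cluster.
check_service() {
  kubectl get endpoints "$1" -n "${2:-default}"
  kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never \
    -n "${2:-default}" -- nslookup "$(svc_fqdn "$1" "$2")"
}

# Usage (requires cluster access):
#   check_service <service-name> <namespace>
```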

Step 3: Troubleshooting Persistent Volumes

Check Persistent Volume Claims (PVCs)

1. View PVCs:

    kubectl get pvc

2. Describe a PVC:

    kubectl describe pvc <pvc-name>

3. Check Events:

• Look for messages like FailedBinding.

Fix Common PVC Issues

• StorageClass Not Found:
  • Verify the StorageClass:

    kubectl get storageclass

  • Update your PVC to use an existing StorageClass.
• Volume Not Mounted:
  • Ensure the Pod’s volumeMounts are configured correctly in the spec.
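The StorageClass check can be scripted by reading the class name from the PVC and asking the cluster whether it exists. This is a rough sketch. The function name pvc_storageclass_check and its messages are hypothetical, and it treats an empty and an unset storageClassName the same way.

```shell
# Report whether the StorageClass requested by a PVC exists in the cluster.
pvc_storageclass_check() {
  # Read the requested class name from the PVC spec (empty if none is set).
  sc=$(kubectl get pvc "$1" -o jsonpath='{.spec.storageClassName}')
  if [ -z "$sc" ]; then
    echo "PVC $1 has an empty or unset storageClassName"
  elif kubectl get storageclass "$sc" >/dev/null 2>&1; then
    echo "StorageClass $sc exists"
  else
    echo "StorageClass $sc not found"
  fi
}

# Usage (requires cluster access):
#   pvc_storageclass_check <pvc-name>
```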

Step 4: Troubleshooting Cluster Issues

Check Node Status

1. View All Nodes:

    kubectl get nodes

2. Describe a Node:

    kubectl describe node <node-name>

3. Check Node Logs:

SSH into the node and check logs:

    sudo journalctl -u kubelet

Fix Node Issues

• Node Not Ready:
  • Verify system resources (CPU, memory, disk).
  • Restart kubelet:

    sudo systemctl restart kubelet

• Pods Evicted:
  • Check resource limits and quotas:

    kubectl describe quota
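On larger clusters, scanning the node list by eye is error-prone. The sketch below filters it, assuming the default column layout of kubectl get nodes --no-headers (NAME, STATUS, ...). The function name problem_nodes is hypothetical.

```shell
# Print every node whose STATUS column is anything other than plain "Ready",
# which includes NotReady nodes and cordoned ones (Ready,SchedulingDisabled).
# Reads `kubectl get nodes --no-headers` output on stdin.
problem_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | problem_nodes
#
# Evicted Pods stay in phase Failed until they are deleted, so they can be listed with:
#   kubectl get pods --all-namespaces --field-selector=status.phase=Failed
```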

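Step 5: Advanced Debugging Tools

Using kubectl debug

1. Start a Debug Container:

    kubectl debug -it <pod-name> --image=busybox --target=<container-name>

This adds an ephemeral busybox container to the running Pod and attaches a shell to it. The --target flag shares the process namespace of the named container, if the container runtime supports it.

2. Debug a Copy of the Pod:

    kubectl debug <pod-name> -it --image=busybox --share-processes --copy-to=<debug-pod-name>

Use this when the original Pod must stay untouched. Delete <debug-pod-name> when you are done.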

Using kube-ops-view

1. Deploy kube-ops-view:

    kubectl apply -f https://github.com/hjacobs/kube-ops-view/releases/latest/download/kube-ops-view.yaml

2. Access the Dashboard:

Forward the service port and open http://localhost:8080 in a browser:

    kubectl port-forward svc/kube-ops-view 8080:80

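Using Prometheus for Troubleshooting

1. Check Resource Metrics:

• Access the Prometheus UI. The Service name below assumes a kube-prometheus-stack Helm release named prometheus; adjust it and add -n <namespace> to match your installation:

    kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090

• Query CPU or memory usage, for example the total CPU cores consumed by all containers over the last five minutes:

    sum(rate(container_cpu_usage_seconds_total[5m]))

2. Set Alerts:

• Create alert rules for resource thresholds (e.g., high memory usage).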
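The same query can also be run from a terminal through the Prometheus HTTP API, which is handy when you want to script a check. This is a minimal sketch, assuming the port-forward to localhost:9090 is active. The function name prom_query and the PROM_URL variable are hypothetical.

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL defaults to the port-forwarded address used in this chapter.
prom_query() {
  # -G sends the URL-encoded query as a GET query string.
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage (requires the port-forward to be running):
#   prom_query 'sum(rate(container_cpu_usage_seconds_total[5m]))'
```

The response is JSON, so piping it through a JSON formatter makes the result easier to read.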

Best Practices for Troubleshooting

1. Use Namespaces:
  • Isolate workloads to make debugging easier.
2. Leverage Dashboards:
  • Use Grafana or kube-ops-view for visual insights.
3. Audit Logs:
  • Regularly review API server and kubelet logs.
4. Document Resolutions:
  • Maintain a knowledge base for recurring issues.

Production Example: Debugging a Payment Service Outage

1. Scenario:
  • A payment microservice is unreachable during high traffic.
2. Steps:
  • Check Pods:

    kubectl get pods -l app=payment
    kubectl logs <pod-name>

  • Verify Service and DNS:

    kubectl describe svc payment-service
    kubectl logs -n kube-system -l k8s-app=kube-dns

  • Check Node Resources:

    kubectl describe node <node-name>
    kubectl top nodes

3. Resolution:
  • Scale the deployment to handle high traffic:

    kubectl scale deployment payment --replicas=5
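Scaling returns immediately, before the new replicas are serving traffic, so it is worth waiting for the Deployment to settle before declaring the incident resolved. A sketch of that follow-up step is shown here. The function name scale_and_wait and the 120-second timeout are assumptions.

```shell
# Scale a Deployment, then block until the new replicas are available
# or the timeout expires.
scale_and_wait() {
  kubectl scale deployment "$1" --replicas="$2" &&
    kubectl rollout status "deployment/$1" --timeout=120s
}

# Usage (requires cluster access):
#   scale_and_wait payment 5
```

A non-zero exit status means the replicas did not become available in time, which usually sends you back to the Pod and node checks above.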

Conclusion

You’ve now mastered the essentials of Kubernetes troubleshooting! By applying these techniques, you can efficiently diagnose and resolve issues in your clusters, ensuring high availability and performance.