Introduction to Kubernetes Troubleshooting
Troubleshooting Kubernetes can be daunting because the system is distributed and its state changes constantly. A small set of techniques and tools covers most diagnoses, though. This chapter equips you with practical strategies to debug Pods, Services, storage, nodes, and clusters efficiently.
Common Troubleshooting Scenarios
- Pods Not Running: CrashLoopBackOff, Pending, or Error states.
- Networking Issues: Unreachable services or inter-Pod communication failures.
- Persistent Volume Issues: Storage not being provisioned or mounted correctly.
- Cluster-Level Failures: Node unavailability, API server errors, or resource constraints.
Step-by-Step Troubleshooting Techniques
Step 1: Troubleshooting Pods
Check Pod Status
1. List Pods and Their Status:
kubectl get pods
2. Describe a Problematic Pod:
kubectl describe pod <pod-name>
Look for events like ImagePullBackOff or FailedScheduling.
3. Check Pod Logs:
kubectl logs <pod-name>
4. Stream Live Logs:
kubectl logs <pod-name> -f
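The status check above can be condensed into a small filter. The following is a minimal sketch, assuming the default table output of kubectl get pods for a single namespace, where STATUS is the third column. The function name unhealthy_pods is hypothetical and not part of kubectl.

```shell
# Print the name and status of every Pod that is neither Running nor Completed.
# Reads the default table output of `kubectl get pods` on stdin
# (assumed columns: NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

For example, kubectl get pods | unhealthy_pods prints one line per Pod that needs attention. With -A (all namespaces) a NAMESPACE column is prepended, so the field numbers would need adjusting.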
Fix Common Pod Issues
- ImagePullBackOff:
- Check the image name and registry credentials:
kubectl describe pod <pod-name>
- Update the image name, or add an image pull secret (imagePullSecrets) for private registries. To change the image:
kubectl set image deployment/<deployment-name> <container-name>=<new-image>
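- CrashLoopBackOff:
- Read the logs of the previous (crashed) container instance first:
kubectl logs <pod-name> --previous
- If the container stays up long enough, open an interactive shell (use /bin/bash instead if the image includes it):
kubectl exec -it <pod-name> -- /bin/sh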
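As a rough aid for triage, a Pod status can be mapped to a first diagnostic command. This is only a sketch: the function name next_step is hypothetical, and the mapping is illustrative rather than exhaustive.

```shell
# Suggest a first diagnostic command for a Pod, given its STATUS value
# (as shown by `kubectl get pods`) and its name.
# The mapping below is an illustrative assumption, not an official rule set.
next_step() {
  status=$1
  pod=$2
  case "$status" in
    ImagePullBackOff|ErrImagePull|Pending)
      # Image pull and scheduling problems show up in the Pod's events.
      echo "kubectl describe pod $pod" ;;
    CrashLoopBackOff)
      # The current container may have just restarted; read the crashed one.
      echo "kubectl logs $pod --previous" ;;
    Error)
      echo "kubectl logs $pod" ;;
    *)
      echo "kubectl get pod $pod -o yaml" ;;
  esac
}
```

Calling next_step CrashLoopBackOff web-2 prints the command to run rather than running it, which keeps the helper safe to use in notes or runbooks.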
Step 2: Troubleshooting Services
Verify Service Configuration
1. Check Service Details:
kubectl get svc
2. Describe the Service:
kubectl describe svc <service-name>
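3. Test Service Reachability from Inside the Cluster (--rm -it prints the response and deletes the test Pod when it exits):
kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never -- curl <service-name>:<port>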
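A short Service name resolves only from Pods in the same namespace, so a reachability test from another namespace needs the fully qualified name. The sketch below builds that URL. It assumes the default cluster domain cluster.local (clusters can be configured differently), and the function name service_url is hypothetical.

```shell
# Build the in-cluster HTTP URL of a Service.
# Usage: service_url <service-name> <port> [namespace]
# Assumes the default cluster domain "cluster.local".
service_url() {
  name=$1
  port=$2
  namespace=${3:-default}
  echo "http://${name}.${namespace}.svc.cluster.local:${port}"
}
```

The result can be passed to the curl test Pod in place of the short service name.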
Common Service Issues
- No Endpoint:
- Verify Pods are labeled correctly:
kubectl get pods --selector=<label>
- DNS Resolution Failures:
- Check the CoreDNS logs:
kubectl logs -n kube-system -l k8s-app=kube-dns
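A Service has no endpoints when its selector does not match the labels on any ready Pod. The check can be sketched as a comparison of two comma-separated strings: the selector (as shown in the SELECTOR column of kubectl get svc -o wide) and a Pod's labels (as shown by kubectl get pods --show-labels). The function name selector_matches is hypothetical, and the sketch handles only equality-based key=value selectors.

```shell
# Succeed (exit status 0) when every key=value pair of the selector
# also appears in the Pod's label list. Both arguments are
# comma-separated, e.g. "app=payment,tier=backend".
selector_matches() {
  selector=$1
  labels=$2
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    # Wrap in commas so "app=payment" does not match "app=payments".
    case ",$labels," in
      *",$pair,"*) ;;
      *) IFS=$old_ifs; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}
```

A call such as selector_matches "app=payment" "app=payment,version=v2" succeeds, while a typo in either the selector or the label makes it fail.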
Step 3: Troubleshooting Persistent Volumes
Check Persistent Volume Claims (PVCs)
1. View PVCs:
kubectl get pvc
2. Describe a PVC:
kubectl describe pvc <pvc-name>
3. Check Events:
- Look for messages like FailedBinding.
Fix Common PVC Issues
- StorageClass Not Found:
- Verify that the StorageClass exists:
kubectl get storageclass
- Update your PVC to use an existing StorageClass.
- Volume Not Mounted:
- Ensure the Pod’s volumeMounts are configured correctly in the spec.
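To spot claims that are stuck, the PVC listing can be filtered for anything that is not Bound. This is a minimal sketch that assumes the default table output of kubectl get pvc, with NAME and STATUS as the first two columns. The function name unbound_pvcs is hypothetical.

```shell
# Print the name and status of every PVC whose status is not "Bound".
# Reads the default table output of `kubectl get pvc` on stdin.
# Only the first two columns (NAME, STATUS) are used, because the
# later columns are empty for claims that are still Pending.
unbound_pvcs() {
  awk 'NR > 1 && $2 != "Bound" { print $1, $2 }'
}
```

Running kubectl get pvc | unbound_pvcs gives the list of claims to inspect with kubectl describe pvc.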
Step 4: Troubleshooting Cluster Issues
Check Node Status
1. View All Nodes:
kubectl get nodes
2. Describe a Node:
kubectl describe node <node-name>
3. Check Node Logs:
SSH into the node and check logs:
sudo journalctl -u kubelet
Fix Node Issues
- Node Not Ready:
- Verify system resources (CPU, memory, disk).
- Restart kubelet:
sudo systemctl restart kubelet
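- Pods Evicted:
- Evictions usually follow node resource pressure, so check the node's conditions (MemoryPressure, DiskPressure) with kubectl describe node, then review resource limits and namespace quotas:
kubectl describe quota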
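The STATUS column of kubectl get nodes can hold several comma-separated flags, such as Ready,SchedulingDisabled for a cordoned node. The sketch below classifies nodes from that column. It assumes the default table output with STATUS as the second column, and the function name node_problems is hypothetical.

```shell
# Report nodes that are not Ready, or that are Ready but cordoned.
# Reads the default table output of `kubectl get nodes` on stdin
# (assumed columns: NAME STATUS ROLES AGE VERSION).
node_problems() {
  awk 'NR > 1 {
    n = split($2, flags, ",")
    ready = 0
    cordoned = 0
    for (i = 1; i <= n; i++) {
      if (flags[i] == "Ready") ready = 1
      if (flags[i] == "SchedulingDisabled") cordoned = 1
    }
    if (!ready) print $1, "not-ready"
    else if (cordoned) print $1, "cordoned"
  }'
}
```

Piping kubectl get nodes into node_problems yields the nodes worth describing or logging into first.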
Step 5: Advanced Debugging Tools
Using kubectl Debug
1. Attach an Ephemeral Debug Container to a Running Pod:
kubectl debug -it <pod-name> --image=busybox --target=<container-name>
2. Or Debug a Copy of the Pod:
kubectl debug <pod-name> -it --image=busybox --copy-to=<debug-pod-name>
Using kube-ops-view
1. Deploy kube-ops-view:
kubectl apply -f https://github.com/hjacobs/kube-ops-view/releases/latest/download/kube-ops-view.yaml
2. Access the Dashboard:
Forward the service port, then open http://localhost:8080 in a browser:
kubectl port-forward svc/kube-ops-view 8080:80
Using Prometheus for Troubleshooting
1. Check Resource Metrics:
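- Access Prometheus UI (this Service name is typical of a kube-prometheus-stack Helm release named prometheus; adjust the name and add -n <namespace> to match your installation):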
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090
- Query aggregate CPU usage across all containers:
sum(rate(container_cpu_usage_seconds_total[5m]))
2. Set Alerts:
Create alert rules for resource thresholds (e.g., high memory usage).
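The cluster-wide CPU query above is often too coarse during an incident. A narrower query can be generated per namespace, as in the sketch below. The function name cpu_query is hypothetical, and the namespace and pod label names are assumptions about how the cAdvisor metrics are labeled in your Prometheus setup.

```shell
# Print a PromQL expression for per-Pod CPU usage in one namespace.
# Usage: cpu_query <namespace> [rate-window]
# Assumes the metric carries "namespace" and "pod" labels.
cpu_query() {
  namespace=$1
  window=${2:-5m}
  printf 'sum(rate(container_cpu_usage_seconds_total{namespace="%s"}[%s])) by (pod)\n' \
    "$namespace" "$window"
}
```

With the port-forward running, the expression can be pasted into the Prometheus UI or sent to the HTTP API, for example curl -s -G http://localhost:9090/api/v1/query --data-urlencode "query=$(cpu_query payments)".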
Best Practices for Troubleshooting
- Use Namespaces:
- Isolate workloads to make debugging easier.
- Leverage Dashboards:
- Use Grafana or kube-ops-view for visual insights.
- Audit Logs:
- Regularly review API server and kubelet logs.
- Document Resolutions:
- Maintain a knowledge base for recurring issues.
Production Example: Debugging a Payment Service Outage
- Scenario:
- A payment microservice is unreachable during high traffic.
- Steps:
- Check Pods:
kubectl get pods -l app=payment
kubectl logs <pod-name>
- Verify Service and DNS:
kubectl describe svc payment-service
kubectl logs -n kube-system -l k8s-app=kube-dns
- Check Node Resources (kubectl top requires the metrics-server add-on):
kubectl describe node <node-name>
kubectl top nodes
- Resolution:
- Scale the deployment to handle high traffic:
kubectl scale deployment payment --replicas=5
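Before scaling out, it helps to confirm that the nodes have headroom for the extra replicas. The sketch below flags nodes whose CPU or memory utilization exceeds a threshold. It assumes the default table output of kubectl top nodes (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%), and the function name hot_nodes and the 80 percent default are hypothetical choices.

```shell
# Print nodes whose CPU% or MEMORY% is above a threshold (default 80).
# Reads the default table output of `kubectl top nodes` on stdin.
# Adding 0 makes awk convert values such as "95%" to the number 95.
hot_nodes() {
  threshold=${1:-80}
  awk -v limit="$threshold" \
    'NR > 1 && ($3 + 0 > limit || $5 + 0 > limit) { print $1, $3, $5 }'
}
```

If kubectl top nodes | hot_nodes 80 lists most of the cluster, adding replicas alone is unlikely to help and node capacity needs attention as well.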
Conclusion
This chapter covered the essentials of Kubernetes troubleshooting: inspecting Pods, Services, persistent volumes, and nodes, and using debugging and monitoring tools. Applying these techniques helps you diagnose and resolve cluster issues faster and keep workloads available.