Unixchips : Troubleshooting tips for AKS

Here I describe some common issues we face in AKS clusters and how to troubleshoot them.

1. In some cases we may need to log in to the node hosting our pods using "SSH" to collect logs, troubleshoot, and so on. Let's check how to configure that.

First, create an SSH connection to the Linux node. In this example, one pod is running in the cluster.

To connect, use the kubectl debug command, which runs a container image on the node and attaches to it.
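As a rough sketch of that step, the Python snippet below only assembles the kubectl debug invocation for a node and prints it. The node name and the busybox image are placeholders chosen for illustration, not values from this cluster, and it assumes kubectl is already configured for the cluster.

```python
import subprocess
from typing import List


def node_debug_command(node_name: str, image: str) -> List[str]:
    """Build the argv for an interactive debug container on a node."""
    # `kubectl debug node/<name> -it --image=<image>` starts a pod on the
    # node and attaches a terminal to it.
    return ["kubectl", "debug", f"node/{node_name}", "-it", f"--image={image}"]


# Hypothetical node name and image; replace them with your own values.
cmd = node_debug_command("aks-nodepool1-00000000-vmss000000", "busybox")
print(" ".join(cmd))

# Uncomment to actually open the session from a terminal:
# subprocess.run(cmd, check=True)
```

Inside the debug container the node's root filesystem is mounted at /host, so running chroot /host there gives a shell that sees the node's own files and logs.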

2. If we get a "quota exceeded" error during cluster creation or upgrade, we have to request more vCPUs by creating a support request. The error looks like this:

Code=OperationNotAllowed

Message=Operation results in exceeding quota limits of Core.

Maximum allowed: 4, Current in use: 4, Additional requested: 2.

To request the increase in the Azure portal:

Select Subscriptions

Select the subscription whose quota we need to increase

Select Usage + quotas

Select Request increase and choose the metrics we need to increase
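The figures in the quota error tell us how much to ask for in the support request. The sketch below is a small illustrative helper, not part of any Azure tooling. It assumes the message keeps the "Maximum allowed / Current in use / Additional requested" wording shown above, and it pulls out the three numbers to compute the limit we would need.

```python
import re
from typing import Tuple

# Matches the three counters in the quota error text shown in this post.
QUOTA_PATTERN = re.compile(
    r"Maximum allowed:\s*(\d+),\s*Current in use:\s*(\d+),"
    r"\s*Additional requested:\s*(\d+)"
)


def required_core_quota(message: str) -> Tuple[int, int]:
    """Return (needed_limit, shortfall) for a quota error message."""
    match = QUOTA_PATTERN.search(message)
    if match is None:
        raise ValueError("no quota figures found in message")
    allowed, in_use, requested = (int(group) for group in match.groups())
    needed = in_use + requested      # limit required for the operation
    shortfall = needed - allowed     # how far the current limit falls short
    return needed, shortfall


message = (
    "Operation results in exceeding quota limits of Core. "
    "Maximum allowed: 4, Current in use: 4, Additional requested: 2."
)
needed, shortfall = required_core_quota(message)
print(f"Request a limit of at least {needed} cores ({shortfall} more than now)")
```

For the error above this gives a needed limit of 6 cores, which is 2 more than the current maximum of 4.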

3. Troubleshooting cluster issues with the AKS diagnostic tool

Azure provides a good tool along with AKS to identify common cluster and network related issues. It is called "Diagnose and solve problems" and sits in the left-hand menu of the AKS resource. Two types of diagnostics are available: cluster insights and networking.

Cluster insights lists the cluster checks; we can open each link to get more details about the diagnosis.

The other set of checks covers the networking side.

Once we click on each tab, we get more details on that networking check.

This is one of the best ways to identify cluster and network related issues in AKS.

4. Getting the error "Error dialing backend TCP ..." while connecting to the Kube API server

In this case we have to make sure that the "aks-link" or "tunnelfront" pod is running in the output of the "kubectl get pods --namespace kube-system" command. If it is not running, we may need to delete the pod so that it is recreated.
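A minimal sketch of that check: the helper below scans the text printed by "kubectl get pods --namespace kube-system" and reports any aks-link or tunnelfront pod whose STATUS column is not Running. The pod names and the sample output are invented for illustration, and the parsing assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE).

```python
from typing import List

# Name prefixes of the tunnel pods mentioned in this tip.
TUNNEL_PREFIXES = ("aks-link", "tunnelfront")


def unhealthy_tunnel_pods(kubectl_output: str) -> List[str]:
    """Return names of tunnel pods whose STATUS is not 'Running'."""
    unhealthy = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to hold a STATUS column.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        if name.startswith(TUNNEL_PREFIXES) and status != "Running":
            unhealthy.append(name)
    return unhealthy


# Hypothetical output, made up for this example.
sample = """\
NAME                      READY   STATUS             RESTARTS   AGE
coredns-abc123-xyz        1/1     Running            0          5d
tunnelfront-def456-uvw    0/1     CrashLoopBackOff   12         5d
"""
print(unhealthy_tunnel_pods(sample))
```

Any pod reported here can be removed with "kubectl delete pod <pod-name> --namespace kube-system"; its controller normally creates a replacement.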

5. When we try to upgrade or scale the cluster, we get the error below

"Changing property (image reference) is not allowed"

This error is caused by modifying or deleting the tags on the agent nodes inside the AKS cluster. Such changes to the AKS cluster properties lead to this unexpected error.

6. The next error we tend to get while scaling the cluster is "cluster is in failed state and upgrading or scaling will not work until it is fixed"

This issue is due to a lack of compute resources, so first we have to bring the cluster back to a stable state within the quota, then create a support request to increase the quota.

7. "Too many requests" - 429 errors

When a Kubernetes cluster on Azure (AKS or not) scales up and down frequently or uses the cluster autoscaler (CA), those operations can generate a large number of HTTP calls that exceed the assigned subscription quota, leading to failures.

Service returned an error. Status=429 Code=\"OperationNotAllowed\" Message=\"The server rejected the request because too many requests have been received for this subscription.\" Details=[{\"code\":\"TooManyRequests\",\"message\":\"{\\\"operationGroup\\\":\\\"HighCostGetVMScaleSet30Min\\\",\\\"startTime\\\":\\\"2021-05-20T07:13:55.2177346+00:00\\\",\\\"endTime\\\":\\\"2021-05-20T07:28:55.2177346+00:00\\\",\\\"allowedRequestCount\\\":1800,\\\"measuredRequestCount\\\":2208}\",\"target\":\"HighCostGetVMScaleSet30Min\"}] InnerError={\"internalErrorCode\":\"TooManyRequestsReceived\"}"}
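The counters buried in that error show how far over the limit the subscription went. As a hedged sketch, the snippet below extracts them with regular expressions rather than decoding the nested, escaped JSON. The function name is my own, and the sample string is a shortened excerpt of the error above.

```python
import re


def throttle_summary(error_text: str) -> dict:
    """Extract the throttled operation group and request counters."""
    summary = {}
    # The key may be followed by any mix of backslashes, quotes, colons and
    # spaces before the value, depending on how deeply the JSON is escaped.
    group = re.search(r'operationGroup[\\":\s]*([A-Za-z0-9]+)', error_text)
    summary["operationGroup"] = group.group(1) if group else None
    for key in ("allowedRequestCount", "measuredRequestCount"):
        match = re.search(key + r'[\\":\s]*(\d+)', error_text)
        if match is None:
            raise ValueError(f"{key} not found in error text")
        summary[key] = int(match.group(1))
    summary["overage"] = (
        summary["measuredRequestCount"] - summary["allowedRequestCount"]
    )
    return summary


# Shortened excerpt of the error message, escaping kept as it was logged.
sample = (
    r'{\\\"operationGroup\\\":\\\"HighCostGetVMScaleSet30Min\\\",'
    r'\\\"allowedRequestCount\\\":1800,\\\"measuredRequestCount\\\":2208}'
)
print(throttle_summary(sample))
```

For this error the subscription made 2208 calls in the HighCostGetVMScaleSet30Min group against an allowance of 1800, which is 408 over the limit.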

Make sure you are running at least AKS 1.18.x; if not, we may need to upgrade to the latest version.

We can integrate Prometheus with Azure Monitor to monitor cluster and container issues very closely, and I will explain that in another post.

Thank you for reading.

Unixchips

Friday, June 25, 2021
