I’m trying out the Cloud Service Mesh that was announced at Google Cloud Next 2024. I’m running it on a GKE Standard cluster with 4 nodes, following this guide:
https://cloud.google.com/service-mesh/docs/onboarding/provision-control-plane
During the first few days, everything seems fine. But then the pods in my namespace suddenly cannot reach each other anymore.
The pods reach each other via Kubernetes Services of type NodePort: Pod A -> Service B -> Deployment B -> Pod B.
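For reference, each of those Services is a NodePort Service selecting the target Deployment's pods; a minimal sketch with placeholder names and ports (not my actual manifest):

apiVersion: v1
kind: Service
metadata:
  name: service-b
  namespace: my-namespace
spec:
  type: NodePort
  selector:
    app: app-b            # matches the pod template labels of Deployment B
  ports:
  - port: 8080            # port other pods call, e.g. http://service-b:8080
    targetPort: 8080      # container port on Pod B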
At first I thought something had gone wrong during installation, so I followed https://cloud.google.com/service-mesh/docs/uninstall to uninstall and reinstall the service mesh, but after a few days the same issue came back.
I have verified, as described in the guide, that the service mesh is active and running:
gcloud container fleet mesh describe
createTime: '2024-07-16T01:31:43.731699992Z'
membershipSpecs:
  projects/123456789123/locations/asia-southeast1/memberships/my-cluster:
    mesh:
      management: MANAGEMENT_AUTOMATIC
membershipStates:
  projects/123456789123/locations/asia-southeast1/memberships/my-cluster:
    servicemesh:
      conditions:
      - code: VPCSC_GA_SUPPORTED
        details: This control plane supports VPC-SC GA.
        documentationLink: http://cloud.google.com/service-mesh/docs/managed/vpc-sc
        severity: INFO
      controlPlaneManagement:
        details:
        - code: REVISION_READY
          details: 'Ready: asm-managed'
        implementation: TRAFFIC_DIRECTOR
        state: ACTIVE
      dataPlaneManagement:
        details:
        - code: OK
          details: Service is running.
        state: ACTIVE
    state:
      code: OK
      description: |-
        Revision ready for use: asm-managed.
        All Canonical Services have been reconciled successfully.
      updateTime: '2024-08-15T07:19:40.846768015Z'
name: projects/my-project/locations/global/features/servicemesh
resourceState:
  state: ACTIVE
spec: {}
updateTime: '2024-08-15T05:34:35.455135300Z'
kubectl describe controlplanerevision -n istio-system
Name:         asm-managed
Namespace:    istio-system
Labels:       app.kubernetes.io/created-by=mesh.googleapis.com
              istio.io/owned-by=mesh.googleapis.com
              mesh.cloud.google.com/managed-cni-enabled=true
Annotations:  mesh.cloud.google.com/proxy: {"managed":"true"}
              mesh.cloud.google.com/vpcsc-ga: false
API Version:  mesh.cloud.google.com/v1beta1
Kind:         ControlPlaneRevision
Metadata:
  Creation Timestamp:  2024-08-13T10:32:05Z
  Generation:          1
  Resource Version:    1101894861
  UID:                 cd4b26f5-d7d1-4a45-b220-2adcf618fa23
Spec:
  Channel:  regular
  Type:     managed_service
Status:
  Conditions:
    Last Transition Time:  2024-08-15T05:55:32Z
    Message:               The provisioning process has completed successfully
    Reason:                Provisioned
    Status:                True
    Type:                  Reconciled
    Last Transition Time:  2024-08-15T05:55:32Z
    Message:               Provisioning has finished
    Reason:                ProvisioningFinished
    Status:                True
    Type:                  ProvisioningFinished
    Last Transition Time:  2024-08-15T05:55:32Z
    Message:               Provisioning has not stalled
    Reason:                NotStalled
    Status:                False
    Type:                  Stalled
Events:  <none>
I have verified that my namespace is labeled with istio-injection=enabled, and I can see the istio-proxy sidecar injected into every pod.
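Roughly, the checks I ran (namespace name as above; the jsonpath just lists each pod's containers):

kubectl get namespace my-namespace --show-labels
kubectl get pods -n my-namespace -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}'

The istio-proxy sidecar itself seems to be starting properly: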
INFO 2024-08-15T05:42:54.174322288Z [resource.labels.containerName: istio-validation] 2024-08-15T05:42:54.173151Z info Starting iptables validation. This check verifies that iptables rules are properly established for the network.
INFO 2024-08-15T05:42:54.174372588Z [resource.labels.containerName: istio-validation] 2024-08-15T05:42:54.173236Z info Listening on 127.0.0.1:15001
INFO 2024-08-15T05:42:54.174376768Z [resource.labels.containerName: istio-validation] 2024-08-15T05:42:54.173423Z info Listening on 127.0.0.1:15006
INFO 2024-08-15T05:42:54.174380538Z [resource.labels.containerName: istio-validation] 2024-08-15T05:42:54.173810Z info Local addr 127.0.0.1:15006
INFO 2024-08-15T05:42:54.174403398Z [resource.labels.containerName: istio-validation] 2024-08-15T05:42:54.173823Z info Original addr 127.0.0.1: 15002
INFO 2024-08-15T05:42:54.174411368Z [resource.labels.containerName: istio-validation] 2024-08-15T05:42:54.173920Z info Validation passed, iptables rules established
INFO 2024-08-15T05:43:03.301899214Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301721Z info FLAG: --concurrency="0"
INFO 2024-08-15T05:43:03.301965054Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301757Z info FLAG: --domain="my-namespace.svc.cluster.local"
INFO 2024-08-15T05:43:03.301971444Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301764Z info FLAG: --help="false"
INFO 2024-08-15T05:43:03.301976384Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301768Z info FLAG: --log_as_json="false"
INFO 2024-08-15T05:43:03.301981364Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301771Z info FLAG: --log_caller=""
INFO 2024-08-15T05:43:03.301986134Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301774Z info FLAG: --log_output_level="default:info"
INFO 2024-08-15T05:43:03.301990984Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301777Z info FLAG: --log_rotate=""
INFO 2024-08-15T05:43:03.301995774Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301781Z info FLAG: --log_rotate_max_age="30"
INFO 2024-08-15T05:43:03.302000094Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301784Z info FLAG: --log_rotate_max_backups="1000"
INFO 2024-08-15T05:43:03.302004424Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301788Z info FLAG: --log_rotate_max_size="104857600"
INFO 2024-08-15T05:43:03.302009284Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301791Z info FLAG: --log_stacktrace_level="default:none"
INFO 2024-08-15T05:43:03.302014394Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301803Z info FLAG: --log_target="[stdout]"
INFO 2024-08-15T05:43:03.302018274Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301808Z info FLAG: --meshConfig="./etc/istio/config/mesh"
INFO 2024-08-15T05:43:03.302021414Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301811Z info FLAG: --outlierLogPath=""
INFO 2024-08-15T05:43:03.302024434Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301815Z info FLAG: --profiling="true"
INFO 2024-08-15T05:43:03.302027454Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301818Z info FLAG: --proxyComponentLogLevel="misc:error"
INFO 2024-08-15T05:43:03.302030504Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301822Z info FLAG: --proxyLogLevel="warning"
INFO 2024-08-15T05:43:03.302033784Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301826Z info FLAG: --serviceCluster="istio-proxy"
INFO 2024-08-15T05:43:03.302036864Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301838Z info FLAG: --stsPort="15463"
INFO 2024-08-15T05:43:03.302070544Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301905Z info FLAG: --templateFile=""
INFO 2024-08-15T05:43:03.302077534Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301916Z info FLAG: --tokenManagerPlugin="GoogleTokenExchange"
INFO 2024-08-15T05:43:03.302082334Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301928Z info FLAG: --vklog="0"
INFO 2024-08-15T05:43:03.302087684Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.301940Z info Version 1.19.10-asm.6-491aae094c181ecc5467c78ddd3591b27a5c84cc-Clean
INFO 2024-08-15T05:43:03.302210354Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.302115Z warn failed running ulimit command:
INFO 2024-08-15T05:43:03.302843944Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.302379Z info Proxy role ips=[10.84.6.35] type=sidecar id=my-pod-c6c9d855d-fj9pz.my-namespace domain=my-namespace.svc.cluster.local
INFO 2024-08-15T05:43:03.302859354Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.302485Z info Apply proxy config from env {"discoveryAddress":"meshconfig.googleapis.com:443","proxyMetadata":{"CA_PROVIDER":"GoogleCA","CA_ROOT_CA":"/etc/ssl/certs/ca-certificates.crt","CA_TRUSTANCHOR":"","FLEET_PROJECT_NUMBER":"123456789123","GCP_METADATA":"my-project|123456789123|my-cluster|asia-southeast1-a","OUTPUT_CERTS":"/etc/istio/proxy","PROXY_CONFIG_XDS_AGENT":"true","XDS_AUTH_PROVIDER":"gcp","XDS_ROOT_CA":"/etc/ssl/certs/ca-certificates.crt"},"meshId":"proj-123456789123"}
INFO 2024-08-15T05:43:03.302865444Z [resource.labels.containerName: istio-proxy] {}
INFO 2024-08-15T05:43:03.305151943Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.305025Z info cpu limit detected as 2, setting concurrency
INFO 2024-08-15T05:43:03.305726793Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.305632Z info Effective config: binaryPath: /usr/local/bin/envoy
INFO 2024-08-15T05:43:03.305742173Z [resource.labels.containerName: istio-proxy] concurrency: 2
INFO 2024-08-15T05:43:03.305748413Z [resource.labels.containerName: istio-proxy] configPath: ./etc/istio/proxy
INFO 2024-08-15T05:43:03.305754654Z [resource.labels.containerName: istio-proxy] controlPlaneAuthPolicy: MUTUAL_TLS
INFO 2024-08-15T05:43:03.305759594Z [resource.labels.containerName: istio-proxy] discoveryAddress: meshconfig.googleapis.com:443
INFO 2024-08-15T05:43:03.305765343Z [resource.labels.containerName: istio-proxy] drainDuration: 45s
INFO 2024-08-15T05:43:03.305769883Z [resource.labels.containerName: istio-proxy] meshId: proj-123456789123
INFO 2024-08-15T05:43:03.305774314Z [resource.labels.containerName: istio-proxy] proxyAdminPort: 15000
INFO 2024-08-15T05:43:03.305779434Z [resource.labels.containerName: istio-proxy] proxyMetadata:
INFO 2024-08-15T05:43:03.305824074Z [resource.labels.containerName: istio-proxy] CA_PROVIDER: GoogleCA
INFO 2024-08-15T05:43:03.305829003Z [resource.labels.containerName: istio-proxy] CA_ROOT_CA: /etc/ssl/certs/ca-certificates.crt
INFO 2024-08-15T05:43:03.305833774Z [resource.labels.containerName: istio-proxy] CA_TRUSTANCHOR: ""
INFO 2024-08-15T05:43:03.305838354Z [resource.labels.containerName: istio-proxy] FLEET_PROJECT_NUMBER: "123456789123"
INFO 2024-08-15T05:43:03.305844074Z [resource.labels.containerName: istio-proxy] GCP_METADATA: my-project|123456789123|my-cluster|asia-southeast1-a
INFO 2024-08-15T05:43:03.305848414Z [resource.labels.containerName: istio-proxy] OUTPUT_CERTS: /etc/istio/proxy
INFO 2024-08-15T05:43:03.305852854Z [resource.labels.containerName: istio-proxy] PROXY_CONFIG_XDS_AGENT: "true"
INFO 2024-08-15T05:43:03.305857503Z [resource.labels.containerName: istio-proxy] XDS_AUTH_PROVIDER: gcp
INFO 2024-08-15T05:43:03.305862883Z [resource.labels.containerName: istio-proxy] XDS_ROOT_CA: /etc/ssl/certs/ca-certificates.crt
INFO 2024-08-15T05:43:03.305867523Z [resource.labels.containerName: istio-proxy] serviceCluster: istio-proxy
INFO 2024-08-15T05:43:03.305872163Z [resource.labels.containerName: istio-proxy] statNameLength: 189
INFO 2024-08-15T05:43:03.305876963Z [resource.labels.containerName: istio-proxy] statusPort: 15020
INFO 2024-08-15T05:43:03.305881603Z [resource.labels.containerName: istio-proxy] terminationDrainDuration: 5s
INFO 2024-08-15T05:43:03.305886443Z [resource.labels.containerName: istio-proxy] tracing:
INFO 2024-08-15T05:43:03.305890974Z [resource.labels.containerName: istio-proxy] zipkin:
INFO 2024-08-15T05:43:03.305895783Z [resource.labels.containerName: istio-proxy] address: zipkin.istio-system:9411
INFO 2024-08-15T05:43:03.305900343Z [resource.labels.containerName: istio-proxy] {}
INFO 2024-08-15T05:43:03.305904854Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.305656Z info JWT policy is third-party-jwt
INFO 2024-08-15T05:43:03.305910383Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.305662Z info using credential fetcher of JWT type in my-project.svc.id.goog trust domain
INFO 2024-08-15T05:43:03.305915654Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.305674Z info stsclient GKE_CLUSTER_URL is not set, fetched cluster URL from metadata server: "https://container.googleapis.com/v1/projects/my-project/locations/asia-southeast1-a/clusters/my-cluster"
INFO 2024-08-15T05:43:03.317072502Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.316881Z info stsserver Start listening on 127.0.0.1:15463
INFO 2024-08-15T05:43:03.317265023Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.317124Z info platform detected is GCP
INFO 2024-08-15T05:43:03.318458742Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.318285Z info Workload SDS socket not found. Starting Istio SDS Server
INFO 2024-08-15T05:43:03.318508033Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.318321Z info CA Endpoint meshca.googleapis.com:443, provider GoogleCA
INFO 2024-08-15T05:43:03.318540842Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.318310Z info Opening status port 15020
INFO 2024-08-15T05:43:03.319659752Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.319553Z info ads All caches have been synced up in 18.479078ms, marking server ready
INFO 2024-08-15T05:43:03.320576522Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.320469Z info xdsproxy Initializing with upstream address "meshconfig.googleapis.com:443" and cluster "cn-my-project-asia-southeast1-a-my-cluster"
INFO 2024-08-15T05:43:03.326444482Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.326206Z info Pilot SAN: [meshconfig.googleapis.com]
INFO 2024-08-15T05:43:03.327752852Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.327567Z info LRS for MCP is enabled
INFO 2024-08-15T05:43:03.328462842Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.328306Z info Starting proxy agent
INFO 2024-08-15T05:43:03.328481212Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.328331Z info starting
INFO 2024-08-15T05:43:03.328488422Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.328373Z info Envoy command: [-c etc/istio/proxy/envoy-rev.json --drain-time-s 45 --drain-strategy immediate --local-address-ip-version v4 --file-flush-interval-msec 1000 --disable-hot-restart --allow-unknown-static-fields --log-format %Y-%m-%dT%T.%fZ %l envoy %n %g:%# %v thread=%t -l warning --component-log-level misc:error --concurrency 2]
INFO 2024-08-15T05:43:03.336131571Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.335792Z info sds Starting SDS grpc server
INFO 2024-08-15T05:43:03.336158351Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.335939Z info starting Http service at 127.0.0.1:15004
INFO 2024-08-15T05:43:03.422323294Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.422160Z info token Prepared federated token request for aud "identitynamespace:my-project.svc.id.goog:https://container.googleapis.com/v1/projects/my-project/locations/asia-southeast1-a/clusters/my-cluster"
INFO 2024-08-15T05:43:03.435771723Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.435623Z info token Prepared federated token request for aud "identitynamespace:my-project.svc.id.goog:https://container.googleapis.com/v1/projects/my-project/locations/asia-southeast1-a/clusters/my-cluster"
INFO 2024-08-15T05:43:03.478943780Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.478781Z info token fetched federated token latency=56.396066ms ttl=3599
INFO 2024-08-15T05:43:03.481962759Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.481629Z info googleca Cert created with GoogleCA asia-southeast1-a chain length 3
INFO 2024-08-15T05:43:03.481995519Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.481744Z info cache generated new workload certificate latency=161.087877ms ttl=23h59m59.518258291s
INFO 2024-08-15T05:43:03.482001569Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.481775Z info cache Root cert has changed, start rotating root cert
INFO 2024-08-15T05:43:03.482006279Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.481793Z info ads XDS: Incremental Pushing ConnectedEndpoints:0 Version:
INFO 2024-08-15T05:43:03.482240759Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.482118Z info cache returned workload trust anchor from cache ttl=23h59m59.517883411s
INFO 2024-08-15T05:43:03.482810239Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.482563Z info token fetched federated token latency=46.758686ms ttl=3599
INFO 2024-08-15T05:43:03.531310485Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.531089Z info token fetched access token latency=48.340526ms ttl=59m59.468912605s
INFO 2024-08-15T05:43:03.537333795Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.537095Z info token fetched access token latency=58.073286ms ttl=59m59.462908625s
INFO 2024-08-15T05:43:03.537381655Z [resource.labels.containerName: istio-proxy] 2024-08-15T05:43:03.537212Z info xdsproxy connected to upstream XDS server: meshconfig.googleapis.com:443
I also looked into the kube-system namespace of my cluster, where some new workloads were added for the service mesh: a DaemonSet “istio-cni-node”, a DaemonSet “snk”, and a Deployment “mdp-controller”.
I checked the logs of istio-cni-node and mdp-controller and they seem fine.
But there is an issue with the snk DaemonSet: 2 of its pods keep restarting because they get OOMKilled.
Name        Status    Restarts   Created on
snk-cqfg4   Running   95         Aug 13, 2024, 11:23:07 PM
snk-455cv   Running   13         Aug 15, 2024, 7:38:19 AM
snk-7fg4n   Running   0          Aug 15, 2024, 7:54:17 AM
snk-95hc4   Running   0          Aug 15, 2024, 11:13:48 AM
WARNING 2024-08-15T07:37:58Z [resource.labels.nodeName: gke-my-cluster-default-pool-6bd5b0f1-shpw] Memory cgroup out of memory: Killed process 709708 (snk) total-vm:2167296kB, anon-rss:29588kB, file-rss:36552kB, shmem-rss:0kB, UID:2692 pgtables:360kB oom_score_adj:999
DEFAULT 2024-08-15T07:37:58.013949Z [resource.labels.nodeName: gke-my-cluster-default-pool-6bd5b0f1-shpw] I0815 07:37:58.013791 2649 log_monitor.go:159] New status generated: &{Source:kernel-monitor Events:[{Severity:warn Timestamp:2024-08-15 07:37:57.564042014 +0000 UTC m=+25195.175669820 Reason:OOMKilling Message:Memory cgroup out of memory: Killed process 709708 (snk) total-vm:2167296kB, anon-rss:29588kB, file-rss:36552kB, shmem-rss:0kB, UID:2692 pgtables:360kB oom_score_adj:999}] Conditions:[{Type:KernelDeadlock Status:False Transition:2024-08-15 00:38:03.556889826 +0000 UTC m=+1.168517601 Reason:KernelHasNoDeadlock Message:kernel has no deadlock} {Type:ReadonlyFilesystem Status:False Transition:2024-08-15 00:38:03.556889935 +0000 UTC m=+1.168517721 Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only}]}
[Chart: snk pod memory usage]
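To see which nodes the snk pods run on and confirm why they restart, I checked with something like this (grepping on the pod name, since I don't know the DaemonSet's label selector):

kubectl get pods -n kube-system -o wide | grep snk
kubectl describe pod snk-cqfg4 -n kube-system | grep -A 5 'Last State'

The restarting pods show Terminated with reason OOMKilled there.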
As I understand it, a DaemonSet runs one pod on each node of the cluster, so I guess the OOM issue has something to do with the node itself.
From the logs of the snk pod, it seems to be gathering the IP addresses of the pods on its node.
The 2 nodes hosting the snk pods with many OOMKilled restarts also have more pods than the other 2 nodes, mostly pods created by Kubernetes CronJobs in the cluster. Those CronJob pods have Istio sidecar injection disabled, since the istio-proxy sidecar prevents a Job pod from completing and shutting down.
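For reference, I disable injection on those CronJob pods with the standard sidecar.istio.io/inject pod-template annotation; roughly like this (names and schedule are placeholders):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cronjob
  namespace: my-namespace
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"   # keep the Job pod sidecar-free so it can complete
        spec:
          restartPolicy: Never
          containers:
          - name: job
            image: busybox
            command: ["sh", "-c", "echo done"]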
If I stop scheduling more pods to those 2 nodes, the OOMKilled issue seems to stop.
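(By "stop scheduling" I mean keeping new pods off those nodes, for example by cordoning them; shown here just for clarity, using one of the node names from the logs above:)

kubectl cordon gke-my-cluster-default-pool-6bd5b0f1-shpw
# later, to allow scheduling again
kubectl uncordon gke-my-cluster-default-pool-6bd5b0f1-shpw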
I tried to increase the memory limit of the snk DaemonSet to 100MiB, but it gets reverted back to 30MiB a few minutes later, presumably by the managed service mesh.
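The change I tried was along these lines (the container name after -c is my guess); whatever I set gets reconciled back to 30MiB within minutes:

kubectl -n kube-system set resources daemonset snk -c snk --limits=memory=100Mi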
I am not sure what to do from here. I cannot simply keep pods from being scheduled onto those 2 nodes, since those pods run my business logic. I am also not sure whether the OOMKilled snk pods are really the root cause of my pods being unable to reach each other.