Solving the great Linkerd mystery: A Kubernetes outage story


How Grammarly's Kubernetes migration led us down a rabbit hole of proxy denials, until our Production Engineering team discovered that the real villain was hiding in plain sight. As the team responsible for ensuring that millions of users can access Grammarly without interruption, we knew we needed to solve this high-stakes infrastructure mystery, and quickly. In this blog post, we'll walk you through the exact timeline of our investigation and how we finally caught the villain.

The migration launch—January 6, 2025

When we completed the migration of our text processing platform (i.e., the core services that analyze and improve users' writing, which we call the data plane) from a legacy container service to our new Kubernetes-based data plane on AWS, we expected the usual growing pains. What we didn't expect was for one of our production clusters to erupt into a storm of mysterious proxy "denied" errors, just as peak hours started.

 

To fix this issue, we reached out to Buoyant, the company behind Linkerd, the open-source service mesh we had deployed to secure and monitor communication between our Kubernetes services. Through our communication with Buoyant's support team, we learned that the proxy started refusing connections after the main API launched a WebSocket storm. Yet the very same cluster looked healthy as soon as we drained traffic away or rebooted its nodes.

 

These first scares planted a dangerous seed: Is there a bug in the service mesh?

What’s the “Linkerd denial” error anyway?

Before we dive deeper into our investigation, let's clarify what these "denied" errors actually represent; this distinction turned out to be crucial to understanding our mystery.

Authorization denies vs. protocol detection denies

When Kubernetes pods have Linkerd's lightweight proxy injected as a sidecar container (making them "meshed pods"), Linkerd's authorization policy lets you control which kinds of traffic can reach them. For example, you can specify that communication with a particular service (or HTTP route on a service) can only come from certain other services. When these policies block unauthorized traffic, the request gets rejected with a PermissionDenied error.
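To make this concrete, here is a minimal sketch of what such a policy can look like, assuming a hypothetical text-checker workload that should only accept HTTP calls from an API gateway's service account. Resource names, namespaces, and ports are illustrative, and the policy.linkerd.io API versions may differ slightly between Linkerd releases:

```yaml
# Hypothetical example: only the gateway's ServiceAccount may call the
# text-checker pods on their HTTP port. All names are illustrative.
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: text-checker-http
  namespace: text-processing
spec:
  podSelector:
    matchLabels:
      app: text-checker
  port: 8080
  proxyProtocol: HTTP/1
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: text-checker-allow-gateway
  namespace: text-processing
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: text-checker-http
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: MeshTLSAuthentication
      name: gateway-identity
---
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: gateway-identity
  namespace: text-processing
spec:
  identityRefs:
    - kind: ServiceAccount
      name: api-gateway
      namespace: gateway
```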

 

But that's not what we were seeing.

 

Our Linkerd denial errors were actually related to protocol detection failures. When a cluster is under heavy load, the application might not send the initial bytes quickly enough for Linkerd to detect the protocol. In that situation, Linkerd falls back to treating the connection as raw TCP, and all HTTP-specific features are disabled for that connection.
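As a side note, for ports where protocol detection can never succeed (for example, server-speaks-first protocols), Linkerd documents an escape hatch: marking the port as opaque so the proxy treats it as raw TCP immediately instead of waiting out the detection timeout. A minimal sketch, with an illustrative workload and port:

```yaml
# Illustrative only: a workload whose port 5432 speaks a server-first protocol,
# marked opaque so the injected proxy skips protocol detection for it entirely.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-db-proxy
spec:
  selector:
    matchLabels:
      app: example-db-proxy
  template:
    metadata:
      labels:
        app: example-db-proxy
      annotations:
        config.linkerd.io/opaque-ports: "5432"
    spec:
      containers:
        - name: app
          image: example-db-proxy:latest
          ports:
            - containerPort: 5432
```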

The TCP vs. HTTP authorization problem

Here's where our confusion began: Linkerd's authorization policy lets us control which kinds of traffic are allowed to reach meshed pods. By default, many setups are configured to allow HTTP traffic but not raw TCP traffic.

 

We found that when protocol detection failed under load, Linkerd would fall back to treating connections as TCP. But because our authorization policies only permitted HTTP traffic, these TCP-fallback connections would be denied. This produced what looked like authorization errors but were actually symptoms of protocol detection timeouts.

 

Looking back, the messages themselves weren't the mystery: protocol detection timeouts and 10-second connection delays are documented Linkerd behaviors. The real puzzle was why our cluster kept hitting this condition so consistently, and why Linkerd kept denying what should have been ordinary HTTP traffic.

A quiet January, a lucky escape

With no clear resolution to the mystery, we decided to reroute traffic to other clusters to buy ourselves more time, which was workable since January is a quiet month. During this lull, we began an effort to optimize and reduce infrastructure costs. As part of this effort, we completed a full node rotation on the affected cluster, which seemed to "fix" the denies for the rest of the month. At the time, every on-call note ended with the same refrain: "Keep an eye on the service mesh denying allowed traffic," though in hindsight, this merely masked the real culprit.

Meanwhile, in the Grammarly control plane: "Everyone is a group now"

At the same time, Grammarly launched a new business model, which required a migration that transformed each paying user into a separate group record. The Grammarly control plane services that managed these records suddenly became key dependencies for the suggestion pipeline.

 

As a result, the extra load made these services brittle. Whenever they stalled, user traffic vanished, autoscalers dutifully scaled the data plane down, and we unknowingly set ourselves up for the whiplash that would follow.

By mid-February, the data plane felt like a haunted house: Every time we touched it, Linkerd "denies" howled through the logs, and users lost suggestions. Three outages in one frenetic week forced us to hit pause on cost cuts and launch a "data plane stabilization" program. Let's take a look at each of these outages in more detail.

January 22—The first cold shower

Duration: ~2 hours

 

The first real outage struck during the European evening, the US morning peak. The Grammarly control plane hiccupped, traffic dropped, and the data plane collapsed to half its usual size.

 

On-call engineers stabilized things the only way they could: pinning minReplicas to peak-hour numbers by hand for the busiest services.
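For illustration, pinning looked roughly like raising the floor on the relevant HorizontalPodAutoscaler objects; the sketch below uses hypothetical names and replica counts:

```yaml
# Illustrative only: pin minReplicas near the peak-hour replica count so a
# traffic dip cannot scale the service below what peak load requires.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: text-checker
  namespace: text-processing
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: text-checker
  minReplicas: 80   # hypothetical peak-hour floor
  maxReplicas: 200
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```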

 

Users barely noticed, but the message was clear: aggressive autoscaling plus a flaky control plane equals trouble.

February 11—The lock that took us offline

Duration: ~2.5 hours

 

Three weeks later, a bulk delete of 2,000 seat licenses locked the database. Requests backed up, and the main API could no longer establish new WebSockets. Autoscalers trimmed the text-checking fleet; when the DB recovered, there weren't enough pods left to carry the surge, and Grammarly was impacted for 2.5 hours.

 

Slack erupted with "How do we scale everything up?" messages and frantic readiness probe tweaks. But something else caught our attention: During the scramble, we saw another wave of Linkerd denies, coinciding with CPU spikes on the main API service pods.

Cost pressure and a nagging theory

All this unfolded against soaring cloud infrastructure bills. The Kubernetes migration added network charges, GPU-heavy ML models, and a fair bit of overprovisioning. A cost-cutting program kicked off in February, pushing for smaller pods and faster scale-down cycles.

 

It made perfect financial sense, and it amplified every weakness we'd just discovered.

 

By mid-February, our working theory was as follows: CPU spikes → Linkerd denies → outage.

It felt consistent with Buoyant's analysis and the charts we saw.

February 19—Redeploy roulette

Duration: 2 hours

 

A routine Helm chart change redeployed a dozen text-processing workloads right at the European evening peak. The burst of new pods stormed Linkerd's request-validation path, triggering a two-hour incident in which error rates on text checking peaked at 60%.

 

We tried the usual dance:

 

  • Shift traffic away from sick clusters.
  • Manually scale CoreDNS vertically on the failing cluster because of some DNS resolution errors in the logs; when it didn't help, we kept blaming Linkerd's TCP connection interception for DNS queries.
  • Scale the biggest backends horizontally.
  • Trim the service-mesh pod budget to "give Linkerd some air."

It worked, but only after we had thrown extra CPU at almost every layer, reinforcing the belief that Linkerd was simply choking under load.

February 24—The main backend's self-inflicted pileup

Duration: 2 hours

 

Four days later, an innocent attempt to move the main text-checking backend pod to a separate node pool accidentally restarted 17 deployments at once, since there are 17 versions of the service deployed in the clusters. Their heavy startup, plus mistuned readiness probes and pod disruption budgets, formed a perfect retry storm: Text checking overloaded, and suggestions limped for two hours.

Again, we blamed Linkerd denies, and again, the real fixes were basic hygiene: probe tuning, selective traffic shaping, and manual upscaling.

February 25—Terraform butterfly effect

Duration: ~2.5 hours

 

The next afternoon, a failed Terraform apply in the control plane deleted critical security-group rules, severing traffic.

 

The outage unfolded in two acts:

 

  1. Control plane blackout (~20 minutes): Re-adding the rules revived logins and billing.
  2. Data plane starvation (140 minutes): While traffic was low, autoscalers happily shrank the text-checking services. As a safety measure, engineers decided to scale up all services to 80% of their allowed maxReplicas, which was too much. Not only did it trigger the "Linkerd denies" problem, but it also broke Karpenter. Karpenter, attempting to parse ~4,500 stale NodeClaims on each cluster, crash-looped with out-of-memory failures, which prevented any new nodes from launching.

We watched denies spike again during the frantic scale-out, but traffic graphs told a clearer story: The real villain was the capacity surge, not the mesh.

Daily 15:30 EET "mini storms"

Meanwhile, every weekday, the main API rolled out during the European rush hours on schedule. As a result, each rollout briefly doubled downstream calls, coaxed Linkerd's deny counter into the red, and gave us a fresh scare.

February 27—"Stop cutting, start fixing"

By the end of the week, we finally admitted the inevitable:

 

  • We couldn't optimize infrastructure utilization and fix the infrastructure bugs at the same time, so we decided to pause cost optimization on data plane clusters for three weeks.
  • We opened a data plane stabilization track to hunt the root cause, harden probes and pod disruption budgets (PDBs), audit scaling rules, and figure out the Linkerd issues.

March 3: Still dancing around the real issue

We rode out another service mesh "denial" wave that slowed every text-processing cluster for about 50 minutes. The post-incident review again pointed at Linkerd overload during a main API redeploy, which we mitigated by simply upscaling services, the same playbook we had used in February.

 

We took the CPU-starvation hypothesis seriously: Buoyant's own analysis had shown the main API and its downstreams pegged at 100% during every connection storm. So we isolated the API onto its own NodeGroup with generous headroom and paused our cost optimization program. As a result, the March 14 stabilization update proudly reported zero outages for a whole week.

 

We thought we were winning, but that stability was fake: A new experiment off-loaded traffic to internal LLMs. This meant there were fewer cross-service interconnections during peak hours, so we weren't reaching the traffic threshold where we would crumble. But we didn't understand this yet.

The plot twist that changed everything

When we investigated this further with Buoyant, their CTO suspected we were "treating the smoke, not the fire." His intuition proved right when we discovered that denials may also be reported when the first read on a socket returns 0 bytes, that is, when connections were being closed before protocol detection could complete. This pointed to a completely different issue altogether.

 

This wasn't about authorization policies at all. It was about network-level connection failures that prevented protocol detection from succeeding in the first place. The "denials" we were seeing were a symptom, not the cause.

March 17: The pattern repeats

An emergency rollback of the main API during peak EU traffic triggered the Linkerd denials problem again. During the rollback, traffic returned to the usual text-processing backends and bypassed the LLMs, which had already been seeing reduced loads during experiments since the beginning of March. Denials spiked exactly while new pods registered; the dashboards looked painfully familiar.


Duration: 5 hours

 

The facade cracked the next day. We had added a scaleback prevention mechanism that used the historical number of replicas to mitigate rapid scale-downs caused by denies during high-traffic periods. However, this scaleback prevention system was "expecting" the main API to be released, based on the release patterns from the previous week. Even though we hadn't actually deployed the API, the system didn't know that. Instead, it prepared for the scaling behavior of the phantom release from the previous week. The result was the largest storm of denials we had ever seen, leading to a five-hour, company-wide outage.

 

The team took numerous actions to stabilize: We did manual scale-ups of the busiest backends, pinned minReplicas, restarted the main API, sped up the main API rollout, opened fuse limits, disabled Linkerd on one cluster, and more. But ultimately, what helped was the natural traffic drop after peak hours.

 

The crucial hint: An AWS representative joined the outage call, confirmed nothing obvious on their side, but mentioned various components we could look at. One of them was CoreDNS, which turned out to be the key insight.

 

CoreDNS is a flexible, extensible DNS server that can serve as the Kubernetes cluster DNS. When you launch an Amazon EKS cluster with at least one node, two replicas of the CoreDNS image are deployed by default, regardless of the number of nodes in your cluster. The CoreDNS pods provide name resolution for all pods in the cluster.

March 19: The correlation becomes clear

The next day, the team analyzed the CoreDNS graphs. Nothing significant or especially suspicious was found, but we decided to scale the number of pods up to 12 on one cluster just in case. In the evening, the familiar pattern started again, except on the cluster with 12 CoreDNS pods. We fanned replicas out to 12 on every cluster, and denials disappeared within minutes.
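On EKS, CoreDNS runs as a regular kube-system Deployment, so bumping the replica count is a small change; one way to apply it is a simple patch like the sketch below (the exact mechanism here is illustrative, and if CoreDNS is managed as an EKS add-on, the add-on configuration may need to carry the same value so a later add-on update does not revert it):

```yaml
# Illustrative strategic-merge patch, applied with something like:
#   kubectl -n kube-system patch deployment coredns --patch-file coredns-replicas.yaml
spec:
  replicas: 12
```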

 

For the first time, the mesh looked innocent; our DNS layer suddenly looked very, very guilty.

The detective work: Uncovering ENA conntrack limits

 

Over the following week, the team:

  • Rolled out NodeLocal DNSCache in production to offload DNS resolution from the centralized CoreDNS to local caches
  • Prepared a load-test setup in preprod to reproduce the symptom without users watching
  • Enabled the ethtool metrics in node_exporter
  • Started redeploying the main API in preprod under load until denies started happening

The smoking gun: We saw that the counter node_ethtool_conntrack_allowance_exceeded jumped exactly when Linkerd denials were reported. We weren't hitting Linux nf_conntrack limits at all. Instead, we were silently blowing through the per-Elastic Network Adapter (ENA) conntrack allowance on AWS for the instances that were running CoreDNS, which mercilessly dropped packets without leaving a kernel trace. Each drop resulted in a cascading chain of failures: DNS request failure, retries, client back-offs, connection closures, Linkerd protocol detection timeouts, and, ultimately, the denial.
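For reference, the ethtool collector mentioned above is not enabled in node_exporter by default. A sketch of turning it on when node_exporter runs as a DaemonSet; names, image tag, and the optional metrics filter are illustrative and may vary by node_exporter version:

```yaml
# Sketch: enable node_exporter's ethtool collector (off by default) so the ENA
# driver's conntrack_allowance_exceeded stat is exported as
# node_ethtool_conntrack_allowance_exceeded.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true   # the collector reads stats from the host's network interfaces
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.8.1
          args:
            - --collector.ethtool
            # Optionally narrow what is exported (flag availability may vary by version):
            # - --collector.ethtool.metrics-include=^(conntrack|bw_in|bw_out|pps)_.*allowance_exceeded$
```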

March 28: Closing the loop

By March 28, we were able to declare success in our epic "data plane stabilization" effort:

 

  • We fixed CoreDNS at 4 replicas to increase the ENA conntrack capacity.
  • We deployed NodeLocal DNSCache everywhere to distribute the load away from the central CoreDNS and cache DNS responses.
  • We permanently added ENA-level conntrack metrics to our dashboards and alerts to catch this issue in the future.

Watch the whole stack—don't stop at application metrics

Our biggest blind spot was trusting surface-level metrics. We were monitoring CPU, memory, and kernel-level networking, but we completely missed the ENA-level conntrack_allowance metric that signaled the silent packet drops. As a result, we blamed the service mesh for a network device limit that existed several layers below.

 

In practice: We now monitor ENA conntrack metrics alongside traditional application metrics and have set up alerts tied to these deeper infrastructure counters.

Scale in more than one dimension

CoreDNS had plenty of CPU and memory, but we were hitting per-pod UDP flow limits on the AWS ENA network adapter. Adding more replicas (horizontal scaling) distributed the connection load and solved the problem, something vertical scaling never could have achieved.

 

In practice: When troubleshooting performance issues, we now consider connection distribution, not just computational resources. We maintain a minimum of 4 CoreDNS replicas to keep per-pod UDP flow counts below ENA thresholds, and we have a NodeLocal DNS cache on every node.

Operational maturity: Infrastructure hygiene pays dividends

Throughout February and March, we systematically hardened our services: tuning readiness and liveness probes, configuring appropriate PDBs, rightsizing CPU and memory requests, and fixing autoscaling behavior. While none of these fixes individually solved our DNS problem, they eliminated noise from our dashboards and made the real signal visible.

 

In practice: We now maintain a "service hardening" checklist covering probes, PDBs, resource requests, and autoscaling configuration that new services must complete before production deployment.
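A condensed, hypothetical example of the kind of configuration that checklist covers, with all names and numbers illustrative rather than our production values:

```yaml
# Illustrative sketch: a readiness probe tuned for a slow-starting backend plus
# a PDB that keeps most replicas available during node rotations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: text-checker
spec:
  selector:
    matchLabels:
      app: text-checker
  template:
    metadata:
      labels:
        app: text-checker
    spec:
      containers:
        - name: app
          image: text-checker:latest
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
            failureThreshold: 3
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: text-checker
spec:
  maxUnavailable: 10%
  selector:
    matchLabels:
      app: text-checker
```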

“Palette architecture”: The power of identical clusters

Having six identical Kubernetes clusters serving different portions of traffic proved invaluable for both experimentation and risk mitigation. We could test different CPU settings, autoscaling targets, and even risky updates on one cluster while keeping the others stable.

 

In practice: This architecture became our controlled testing environment, allowing us to isolate variables like CPU limits, separate node groups, and different Linkerd sidecar configurations across clusters before rolling out changes fleet-wide.

Validate suspicions quickly with systematic testing

When we suspected CPU starvation, we immediately isolated the main API onto dedicated nodes and paused cost-cutting measures. While this wasn't the root cause, ruling it out allowed us to focus our investigation elsewhere rather than chasing false leads for weeks.

 

In practice: We used a "hypothesis → test → verdict" approach for our data plane stabilization experiments, documenting what we ruled out as much as what we confirmed.

For AWS EKS users: Monitor ENA metrics

If you're running workloads on AWS EKS, especially DNS-intensive services, set up monitoring for ENA network performance metrics. The conntrack_allowance_exceeded counter can help you detect connection tracking issues before they impact your applications.

 

In practice: Enable the ethtool collector in node_exporter using the --collector.ethtool command-line argument. Monitor queries like rate(node_ethtool_conntrack_allowance_exceeded[5m]) and alert when they exceed 0.
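A sketch of what such an alert can look like as a Prometheus rule, assuming the ethtool collector is enabled; the threshold, window, and labels are illustrative:

```yaml
# Illustrative Prometheus alerting rule for the ENA connection-tracking counter.
groups:
  - name: ena-network-allowances
    rules:
      - alert: EnaConntrackAllowanceExceeded
        expr: rate(node_ethtool_conntrack_allowance_exceeded[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ENA conntrack allowance exceeded on {{ $labels.instance }}"
          description: "The ENA adapter is silently dropping packets because the per-instance connection-tracking allowance was exceeded."
```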


For Linkerd users: Update to 2.18+

Linkerd release 2.18 was heavily influenced by the story we're sharing with you. It includes plenty of fixes, plus clearer metrics and logs to help you understand what's happening at the service-mesh level.

 

To share a few important ones: Buoyant found that Linkerd was placing a much heavier load than expected on CoreDNS, which was fixed in the Linkerd 2.18 release by PR #3807.

 

To reduce the "protocol detection denies" and the "no bytes read" case reported as a deny error, Linkerd 2.18 introduced support for the appProtocol field in Services, allowing protocol detection to be bypassed when the protocol is known ahead of time. It introduced transport protocol headers for cross-proxy traffic, removing the need for inbound protocol detection entirely, since the peer proxy now shares the protocol. Finally, it now exposes distinct metrics to clearly distinguish between authorization policy violations and protocol detection issues, making it easier for operators to identify which kind of "deny" they're actually dealing with.
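For example, declaring the protocol up front via the standard Kubernetes appProtocol field on a Service port looks like this; the service name and the value shown are illustrative, so check the Linkerd 2.18 docs for the exact values it honors:

```yaml
# Minimal sketch of a Service declaring its protocol via appProtocol so a
# mesh that supports it (Linkerd 2.18+) can skip inbound protocol detection.
apiVersion: v1
kind: Service
metadata:
  name: text-checker
  namespace: text-processing
spec:
  selector:
    app: text-checker
  ports:
    - name: http
      port: 80
      targetPort: 8080
      appProtocol: http
```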

Sometimes the villain isn't the obvious suspect screaming in your logs. Sometimes it's the quiet component you took for granted, silently dropping packets at the network level while a perfectly innocent service mesh takes the blame.

 

The stormy trilogy that began with Linkerd denies ended with a quiet line on a Grafana dashboard, but it rewired how we observe, test, and run on Kubernetes.

And that, finally, feels sustainable.

A complex, multiweek incident like this one takes an entire organization to resolve. Our thanks go to:

 

  • Platform/SRE team on-call engineers, incident commanders, and experiment leads for round-the-clock firefighting, root-cause sleuthing, and the "data plane stabilization" program
  • Core backend squads for fast probe, PDB, and rollout-strategy fixes that bought us breathing room
  • And everyone across Engineering who cleared roadblocks, merged emergency MRs, or kept the night shift company

Your collective effort turned a puzzling stream of outages into a stable, better-instrumented, cost-optimal, and scalable platform. Thank you.
