In a postmortem, OpenAI explained that the outage resulted from multiple systems and processes failing simultaneously. The company could not implement a fix quickly because the overwhelmed Kubernetes API servers left engineers unable to reach the control plane. To prevent future incidents, OpenAI plans to improve phased rollouts, enhance monitoring for infrastructure changes, and ensure engineers can access Kubernetes API servers under any circumstances. OpenAI apologized for the disruption to its customers, acknowledging that it fell short of its own expectations.
Key takeaways:
- OpenAI experienced a major outage due to a new telemetry service affecting Kubernetes operations.
- The outage was not caused by a security incident or a product launch but by resource-intensive Kubernetes API operations issued by the new telemetry service (see the sketch after this list).
- DNS caching delayed the visibility of the issue, complicating the resolution process.
- OpenAI plans to implement measures like improved monitoring and access mechanisms to prevent future incidents.
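The postmortem does not publish the telemetry service's implementation, so the following is only a minimal sketch of the failure pattern the takeaways describe: a per-node agent that issues cluster-wide, unpaginated LIST calls against the Kubernetes API. The agent's name, polling interval, and use of the official Python client are assumptions for illustration.

```python
# Hypothetical illustration only: OpenAI has not published the telemetry
# service's code. This sketches how a per-node agent doing cluster-wide LIST
# calls can become resource-intensive for the Kubernetes API servers at scale.
import time

from kubernetes import client, config


def run_telemetry_agent(poll_interval_s: float = 30.0) -> None:
    """Poll cluster-wide state from every node (illustrative, not OpenAI's code)."""
    config.load_incluster_config()  # assumes the agent runs as a pod in-cluster
    core = client.CoreV1Api()

    while True:
        # A full, unpaginated LIST of every pod in every namespace. From one
        # client this is cheap; from an agent on each of thousands of nodes,
        # the aggregate load can overwhelm the API servers.
        pods = core.list_pod_for_all_namespaces(watch=False)
        print(f"observed {len(pods.items)} pods")
        time.sleep(poll_interval_s)


if __name__ == "__main__":
    run_telemetry_agent()
```

In practice this pattern is typically mitigated with watch-based informers or paginated LIST requests, and by rolling such agents out cluster by cluster, which is in line with the phased-rollout and monitoring improvements OpenAI says it will adopt.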