On December 11, 2023, OpenAI's ChatGPT and Sora services experienced a 4 hour and 10 minute downtime. The outage was caused by a minor change deployed just three minutes prior, which resulted in a deadlock of the engineering team's access to the cluster.

To rectify the situation, the engineering team employed a serialize scaling-down of the cluster, blocked access to the Kubernetes API, and increased the capacity of the Kubernetes API servers. These measures enabled them to gradually regain control over the cluster and eventually restore normal operation of OpenAI's services.

This incident serves as a reminder of the importance of thorough testing and validation of changes before deployment. Once the issue was identified, the swift response and effective mitigation by the engineering team ensured that OpenAI's services were restored to normal operation without significant disruptions.

For more information about this incident, please refer to the article by [Link to the article on Landian News](https://www.landiannews.com/archives/107098.html).

You can submit articles to [ZaihuaBot](http://t.me/ZaiHuabot), follow us on [Zaihua News Channel](http://t.me/zaihuanews), or join our chat group at [Zaihua Chat](http://t.me/zaihuachat) to stay updated on the latest news and trends.