On March 8th, Datadog went completely down around the world, and the outage lasted for a few days.
I was interested in the causes of that failure purely out of engineering curiosity. However, Datadog has not disclosed anything yet.
I found two analyses written by two different authors, so I am sharing the links.
- Importance of Reliability in Distributed Systems: Datadog Outage Case Study
- Datadog Outage: Multi cloud != reliability
I really want to know the causes of that failure, but 10 days after it, Datadog has only published the article shown below.
- Identify the root causes of issues and bottlenecks in your build pipelines with TeamCity and Datadog