Thursday 12th April 2018

Customer Websites Widespread FarmCentric Outage
Postmortem:

The outage was caused by disk utilization growing rapidly on our Kubernetes Windows nodes until the disks were exhausted. After investigating a surviving node that had close to 100% disk usage but had not died yet, we determined that the disk utilization was caused by Docker: a large number of ~1GB directories were present under a temp directory. This appears to have been caused by a known issue in the Docker daemon itself. We will be following up with Microsoft support to help solidify the diagnosis.
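
As an illustration of the kind of check described above, here is a minimal sketch that ranks the subdirectories of a given directory by total size. It is not the tooling we used during the investigation; the function names dir_size_bytes and largest_subdirs are hypothetical, and the directory to inspect is left as a parameter.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all files under path, skipping unreadable entries."""
    total = 0
    # onerror swallows directories we are not allowed to list
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # File vanished or is locked; ignore it and keep going
                pass
    return total


def largest_subdirs(parent, top=10):
    """Return up to `top` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size_bytes(entry.path), entry.path))
    sizes.sort(reverse=True)
    return sizes[:top]
```

Calling largest_subdirs on a suspect directory returns its biggest immediate subdirectories, which is enough to spot a pile of similarly sized ~1GB directories.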

Instances with a new version of Docker stable (and the latest Windows updates) were put into place last night. We’d previously been running a “preview” version of Docker Enterprise. This change was made to improve stability but had the opposite effect, and the problem did not show up in our testing. The preview version of Docker was put back into place shortly after the outage concluded this morning. We've checked the currently running nodes and there are no directories or files present in the area where the other nodes filled up last night and this morning, so it’s unlikely that last night's issue will repeat itself.

The new generation of instances that was put into place last night has the latest Server 2016 cumulative update (April 2018 - KB4093112) installed, as opposed to the test patches (KB123456 and KB999999) provided by Microsoft to help resolve our DNS issues.

All of these changes were made to reduce or resolve the Windows network/DNS issues that we continue to experience. The issues are still present, but from initial observations we believe we are experiencing fewer DNS lookup failures on the new generation of instances.

Finally, we now have more extensive memory and disk monitoring and alerting in place for all of our Windows instances. This includes real-time and predictive alerts on CPU, memory, disk utilization, and more.
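
To show what a predictive disk alert can look like, here is a minimal sketch that fits a straight line to recent disk usage samples and estimates how long until the disk is full. This is an assumption about one possible approach, not a description of our actual monitoring; the function name hours_until_full and the sample format are hypothetical.

```python
def hours_until_full(samples, capacity_bytes):
    """Estimate hours until the disk fills, from (hour, used_bytes) samples.

    Uses a least-squares slope over the samples and projects forward from
    the most recent sample. Returns None if there are too few samples or
    usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        # All samples share one timestamp; no trend can be computed
        return None
    # Growth rate in bytes per hour
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (capacity_bytes - last_used) / slope)
```

An alert could then fire when the estimate drops below some threshold, for example a few hours, so that rapid growth is flagged well before the disk reaches 100%.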

Update:

The outage has been fixed and steps are being taken to ensure it does not happen again.

Original:

A widespread outage is affecting the FarmCentric cluster. We're currently investigating.