Background Job Server Outage Causing Widespread Issues
Incident Report for OpenWater
Resolved
At approximately 4:55 PM Eastern Time our background job server halted. The problem was caught and resolved by approximately 5:10 PM.
The background job server outage impacted: Reports, Loading of Long List Views, Processing of Carts, and Joining Webinars / Meetings.

Summary of technical issue
Automatic maintenance at Microsoft Azure caused our database log to grow by 1 GB per minute starting around 1 PM Eastern Time. Logs typically grow and shrink as part of normal operation. This maintenance, however, did not execute properly, and the log continued to grow until it reached its cap of 250 GB.
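
For context, transaction log usage on an Azure SQL database can be checked directly. The sketch below is illustrative only: it assumes Python with pyodbc, and the SQL_CONN_STR environment variable is a placeholder rather than part of our actual tooling.

    import os
    import pyodbc

    # Connect to the affected database. SQL_CONN_STR is a placeholder
    # environment variable, not our production configuration.
    conn = pyodbc.connect(os.environ["SQL_CONN_STR"])

    # sys.dm_db_log_space_usage reports transaction log size and usage
    # for the current database.
    row = conn.cursor().execute(
        "SELECT total_log_size_in_bytes, used_log_space_in_percent "
        "FROM sys.dm_db_log_space_usage"
    ).fetchone()

    total_gb = row[0] / (1024 ** 3)
    print(f"Log size: {total_gb:.1f} GB ({row[1]:.1f}% used)")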

Once the cap was hit, all new jobs sent to this server were rejected, and only then did our automated alerts notify the ops team. At the same time, a large number of customers sent in error reports.

At approximately 5:05 PM the database capacity was increased to 1 TB, and by 5:10 PM jobs were flowing correctly again. However, the root cause was still driving a 1 GB/minute increase in database size. The problem was resolved for end users by 5:10 PM, and the investigation continued.
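
For readers curious about the mitigation step, a minimal sketch of that kind of change is shown below, assuming an Azure SQL database whose cap corresponds to its MAXSIZE setting. The database name [jobsdb] and the MASTER_CONN_STR variable are placeholders, not our production values.

    import os
    import pyodbc

    # ALTER DATABASE cannot run inside a transaction, so use autocommit.
    # MASTER_CONN_STR is a placeholder connection string pointing at the
    # logical server's master database.
    conn = pyodbc.connect(os.environ["MASTER_CONN_STR"], autocommit=True)

    # Raise the hypothetical [jobsdb] database's maximum size to 1 TB.
    conn.cursor().execute("ALTER DATABASE [jobsdb] MODIFY (MAXSIZE = 1024 GB)")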

Between 5:15 PM and 6:15 PM we continued the investigation with our ProDirect-level support at Microsoft Azure and identified the runaway maintenance operation as the cause. The process was corrected and the database log began to drain. At 5:00 AM this morning the issue was fully resolved.

As a permanent fix, we have added monitoring that alerts when the database log reaches 100 GB (roughly 100x its normal size), so that we can catch an issue like this hours before it impacts end users.
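
As an illustration of the idea (not our actual monitoring stack), a scheduled check along the following lines would flag this kind of growth hours earlier. It assumes Python with pyodbc and requests; SQL_CONN_STR, ALERT_WEBHOOK_URL, and the threshold value are placeholders.

    import os
    import pyodbc
    import requests

    THRESHOLD_GB = 100  # alert well before the hard cap is reached

    conn = pyodbc.connect(os.environ["SQL_CONN_STR"])
    row = conn.cursor().execute(
        "SELECT total_log_size_in_bytes FROM sys.dm_db_log_space_usage"
    ).fetchone()
    log_gb = row[0] / (1024 ** 3)

    if log_gb > THRESHOLD_GB:
        # ALERT_WEBHOOK_URL is a placeholder, e.g. a Slack or Teams webhook.
        requests.post(
            os.environ["ALERT_WEBHOOK_URL"],
            json={"text": f"Database log at {log_gb:.0f} GB "
                          f"(alert threshold {THRESHOLD_GB} GB)"},
            timeout=10,
        )
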
Posted Oct 21, 2020 - 17:00 EDT