A disk failure in one of our coordination nodes caused some of our write processing to pause. A simple restart solved the problem, but we were slow to detect it because of an unrelated failure, dating from Friday, in our write-queue depth monitoring software. Once we rolled back the monitoring software, we were able to quickly detect the write delay and solve the problem.
We'll fix the monitoring software on Monday and add an inverse check to it that verifies we're actually receiving proper data from this subsystem.
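An inverse check of this kind could look roughly like the Python sketch below. It treats missing or stale queue-depth data as an alert in its own right, so a silent monitor is not mistaken for a healthy system. The function name, thresholds, and status strings are all hypothetical and chosen for illustration; this is not the real monitoring code.

```python
from typing import Optional


def check_queue_depth(
    last_sample_time: Optional[float],
    depth: Optional[int],
    now: float,
    max_age_s: float = 120.0,   # hypothetical freshness limit, in seconds
    max_depth: int = 1000,      # hypothetical queue-depth alert threshold
) -> str:
    """Return a status string for the write-queue depth metric."""
    # Inverse check: no data at all is a failure, not a pass.
    if last_sample_time is None or depth is None:
        return "ALERT: no data"
    # Inverse check: data that stopped updating is also a failure.
    if now - last_sample_time > max_age_s:
        return "ALERT: stale data"
    # Ordinary check: the queue itself is backing up.
    if depth > max_depth:
        return "ALERT: queue depth high"
    return "OK"
```

The ordering matters: the freshness checks run before the depth check, so a broken collector cannot hide a write delay by simply reporting nothing.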
Sorry for the inconvenience!