It’s been more than 24 hours since we experienced our UK data centre outage. Most customers’ normal service has resumed, although we experienced a few further difficulties getting the Woking DC back up to speed with newly installed HA switching equipment. As I write, I am told both data centres are now up and running at last, and we are ready for an evening run of warm standby synchronisation. We are also working hard to solve the backlog of customer issues out there, and we appreciate your work alongside us to get everything sorted out. These situations put everyone under considerable pressure, and I’m very grateful for the support we have had from customers and partners, as well as the tireless efforts of Mimecast staff around the world who’ve all played their part in bringing this difficult situation under control.
Yesterday I referred to a three-hour window of ‘outage’, between 11am and 2pm, from the moment the failure happened to the moment we completed failover to the second data centre. I was, of course, talking about the outage from one particular angle: how long it took us to fail over an entire data centre before we were able to start bringing customers’ services back up.
Of course, when the blog went live, some customers thought I was shutting down the issue and pretending the crisis had only lasted three hours. I wasn’t trying to mislead or underplay it, but I understand how it might have looked that way. And for that, I apologise.
For several customers it wasn’t a three-hour problem; it has been an ongoing one. Although most services are working normally, we can’t put this to bed until the second data centre is fully back in sync and operating normally. You’ll hear from us as soon as that is the case.
The second issue I’d like to address, which has received a lot of attention on Twitter and in the media, is our 100% availability SLA. There is some suggestion that a 100% SLA is impossible, because there’s always a chance a service will go down. What it really means to us is a commitment to a level of service we are willing to be held accountable for. Anything less, and we pay for the shortfall. We give the 100% SLA not because we think we are infallible, but because zero downtime is so important to our customers and to us that we believe we should stand up for it as the gold standard to aspire to, and be accountable in the event we don’t meet it.
Most cloud service providers don’t want to accept that level of accountability (and the cost) with customers. An outage is a very rare event at Mimecast, but yesterday, after 10 years of consistent service, we were reminded the hard way that failures can happen. We fell below that standard. We are sorry about that and will be working with customers affected by yesterday’s outage to arrange SLA compensation.