Seconds out on training and there’s a frisson of excitement in the air at Mimecast as we count down the final 24 hours before the big night. Yes, it’s CRN Fight Night 2013! With the final, gruelling training session complete, Mimecast’s Dave “The Conqueror” Cattermole is making his final preparations and is, frankly, looking good.

The channel world (and a quivering Ross McSorley) has been watching his progress and will know by now that Dave is one to watch. Nothing has been left to chance. Boxing is a sport that requires the greatest of skill, strength and accuracy. But it also demands the greatest psychological willpower and discipline, and for that reason we’ve even drafted in Mimecast CEO, Peter Bauer, for a final training check, too.

Dave has promised to deliver the fight of his life, and you can see for yourself where that confidence comes from by following him on his journey to becoming a fighting legend on Twitter.  I’m sure you’ll agree, Dave means business!

In addition to the personal glory, he has a Mimecast track record to preserve; Fight Night fans may remember our 2012 fighting legend, none other than Dave ‘The Doctor’ Rodger, so we’ll be looking for a double Dave win!

Remember that Dave, along with all the other Warriors, is fighting for charity (Dave is fighting for the Roy Castle Lung Cancer Foundation) and these guys have all given up a huge amount of personal time to prepare for this event, so let’s show our appreciation by digging deep to donate to these very worthy causes.

We wish all the fighters the best of luck as the final countdown to tomorrow evening begins and, remember, there’s still time to vote for your Ultimate Fighter! Many of you will be aware of how hard your colleagues have trained over these past months, so now is the time to nominate them for this accolade, the winner of which will be officially recognised on the night.

Seconds out – this is going to be a great night…look forward to seeing you there!

Channel Director
Mimecast

It probably won’t surprise anyone that my capacity to communicate to customers and partners was limited during last Thursday’s problems. I had my sleeves rolled up, with my team, as we worked to restore service from our UK Woking Data Centre. I am hoping that Peter Bauer’s blog posts, and the other communications platforms we’ve been using, kept you informed during the incident and in its aftermath.

Firstly, it’s important I tell you that our service is fully restored and back to normal operations. There will, no doubt, be a few isolated issues to fix here and there, and we’re continuing to address those now. From a service point of view – all our UK data centres are functioning normally and all processing clusters are fully operational in fully resilient mode. Remote device and application services (MSO and mobile) are up and running normally as are archiving indexes and functions.

It’s also important to know that long term data was not damaged during this incident. A very small number of customer emails bounced back to the senders for retry where delivery was not an option.

So – our services are fully recovered, and all of the data we keep for customers is where it should be.

Let me now take this opportunity to outline the key technical facts behind the incident, and the key learnings we’ll be taking forward from here.

What Happened?

We are working through the formal incident report now and we’ll be delivering it directly to customers on Tuesday 21st May.

The incident can be summarised as follows:

At 10:38 BST on Thursday morning, a high availability core switching solution within our Woking UK data centre failed. It’s important to know that our platform software didn’t cause this issue. Our network engineers tried to revive the switching solution, but finally conceded that it could not be recovered and that the Woking DC was offline. At that point we invoked our full data centre failover process so that we could operate from another UK data centre.

We fully completed the failover process by around 14:00 on the same day and customer emails began flowing some time before that. Three hours is a long time in terms of email backlog, and the queues took some time to clear, so some customers still experienced delayed emails into the evening.
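To illustrate why a backlog can outlast the outage itself, here is a minimal sketch of the queue arithmetic. The message rates are invented for illustration and are not Mimecast figures; the model also assumes mail arrives at a steady rate, which real traffic does not.

```python
def backlog_drain_hours(outage_hours, arrival_rate, processing_rate):
    """Estimate hours needed to clear mail queued during an outage.

    arrival_rate and processing_rate are messages per hour.
    Both are hypothetical inputs, not measured values.
    """
    if processing_rate <= arrival_rate:
        raise ValueError("backlog never clears unless processing outpaces arrival")
    backlog = outage_hours * arrival_rate
    # New mail keeps arriving during recovery, so only the spare
    # capacity (processing minus arrival) actually shrinks the queue.
    return backlog / (processing_rate - arrival_rate)


# Illustrative numbers only: a 3-hour outage, 100,000 messages/hour
# arriving, and capacity to process 150,000 messages/hour.
print(backlog_drain_hours(3, 100_000, 150_000))  # 6.0
```

Under these assumed rates, a three-hour outage leaves a queue that takes a further six hours to clear, which is consistent in spirit with delays running into the evening.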

Knock-on Issues

Some customers were only configured to deliver outbound emails via our Woking data centre, so we had to work with these customers to ensure that their settings were correct before we could get their emails flowing again. Also, some residual DNS issues affected the delivery of email, having a small knock-on effect on a number of North American customers. DNS issues were resolved over the course of the day.

Will It Happen Again?

Like any cloud vendor, we live in the knowledge that at some point, one of our data centres worldwide can fail. After all – that is why we have several of them in a mirrored configuration for every region. We would have expected to recover from such a failure very quickly by using the redundant capabilities we have built into the platform’s design. This case was different as it is the first time that a data centre could not be recovered within a short period of time.

Moving forward, we continue to assume and plan for the eventuality of a volume data centre failing again for any number of reasons – because it will happen. It’s what we do when that happens that matters. We have spent 10 years building software that is able to withstand this kind of catastrophic event – but we need to tweak our services and procedures so that event doesn’t cause us or our customers significant pain.

We are scoping out significantly improved failover processes, which we think could reduce the downtime caused by a single DC failure by as much as 90%, and eliminate the knock-on effects entirely.
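As a rough worked example of what that target would mean, the sketch below applies a 90% reduction to a three-hour outage. The three-hour baseline is taken from this incident; treating the reduction as a simple percentage of that baseline is an assumption.

```python
def reduced_downtime_minutes(baseline_minutes, reduction_percent):
    """Downtime remaining after cutting the baseline by a given percentage."""
    return baseline_minutes * (100 - reduction_percent) / 100


# A three-hour (180-minute) outage reduced by 90%.
print(reduced_downtime_minutes(180, 90))  # 18.0
```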

It’s important to note that the design of our service is good, and we will not need to rewrite significant parts of our code to deal with this kind of incident in future. Most of the systems did exactly what we designed them to do – which is why we had a three-hour total outage as opposed to a total outage lasting 8 or even 12 hours as we have seen with some other cloud service providers.

As painful as this event has been, it has undoubtedly made us stronger technically. We now know the realities of this kind of scenario first hand. It’s a hard way to learn, but it has also set in motion many adjustments that will mean that we can deal with disaster scenarios even more quickly. This gives me confidence in our future ability to deliver our service to customers no matter what.

Finally, I would like to add my apology to Peter’s and thank you all for your patience while we dealt with the incident.

Co-founder and CTO
Mimecast

It’s been more than 24 hours since we experienced our UK data centre outage. Most customers’ normal service has resumed, although we experienced a few further difficulties getting the Woking DC back up to speed with newly installed HA switching equipment. As I write, I am told both data centres are now up and running at last, and we are ready for an evening run of warm standby synchronisation. We are also working hard to solve the backlog of customer issues out there, and we appreciate your work alongside us to get everything sorted out. These situations put everyone under considerable pressure, and I’m very grateful for the support we have had from customers and partners, as well as the tireless efforts of Mimecast staff around the world who’ve all played their part in bringing this difficult situation under control.

Yesterday I referred to a 3-hour window of ‘outage’, between 11am and 2pm, from the moment the failure happened to the moment we completed failover to the second data centre. I was, of course, talking about the outage from one particular angle: how long it took us to fail over an entire data centre before we were able to start bringing customers’ services back up.

Of course, when the blog went live, some customers thought I was shutting down the issue and pretending the crisis had only lasted three hours. I wasn’t trying to mislead or underplay it, but I understand how it might have looked that way.  And for that, I apologise.

For several customers it wasn’t a three-hour problem; it’s been an ongoing one. Although most services are working normally, we can’t put this to bed until the second data centre is fully back in sync and operating normally. You’ll hear from us as soon as that is the case.

The second issue I’d like to take the opportunity to address, and which has received a lot of attention on Twitter and in the media, is our 100% availability SLA. There is some suggestion that a 100% SLA is impossible, because there’s always a chance a service will go down. What it really means to us is a commitment to a level of service we are willing to be held accountable for. Anything less, and we pay for the shortfall. We give the 100% SLA not because we think we are infallible but because zero downtime is so important to our customers and to us that we believe we should stand up for it as the gold standard to aspire to, and be accountable in the event we don’t meet it.

Most cloud service providers don’t want to accept that level of accountability (and the cost) with customers. Failures are very rare at Mimecast, but yesterday, after 10 years of consistent service, we were reminded the hard way that they can happen. Yesterday we fell below that standard. We are sorry about that and will be working with customers affected by yesterday’s outage to arrange SLA compensation.
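To put the shortfall against a 100% target into numbers, here is a minimal sketch of the availability calculation. It assumes the measurement window is a 31-day month and uses the three-hour figure quoted for the data centre failover; the actual SLA terms and measurement period are not stated here.

```python
def availability_percent(period_hours, downtime_hours):
    """Percentage of the period during which the service was available."""
    return 100 * (period_hours - downtime_hours) / period_hours


# Assumed window: a 31-day month (744 hours) with 3 hours of downtime.
hours_in_month = 31 * 24
print(round(availability_percent(hours_in_month, 3), 2))  # 99.6
```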

Co-founder and CEO
Mimecast

This is the closing technical update on the UK Grid outage and recovery.

Mail Flow and Routing

All UK odd and even numbered grid hosts are now processing email as normal. Residual queues and some slowness caused by processing backlogs may continue to persist throughout the day, but in decreasing numbers.

Other Services

There may be some continued slowness in other services, including Mimecast Services for Outlook, Mimecast Services for Exchange and archive searches. This slowness is caused by the accelerated processing of the email backlog. We are aware of this problem and are working to reduce its impact throughout the day.

USA Issues

Some of our North American hosted customers have experienced problems with the DNS resolution of MX Records for outbound email. This has been resolved and we continue to monitor the situation.

Isolated Issues

We are still receiving reports of some isolated issues due to local configurations. If you are experiencing any continued problems with email delivery or AdCon, please contact our Service Delivery team at support@mimecast.com.

What Next?

We will be making an Incident Report available shortly, which will be sent to all affected customers and Partners.

Our Customer Experience teams are taking proactive steps to contact all our affected customers in the coming days and to honour our SLA obligations.

Thank you again for your patience during this incident. We are very sorry we failed to live up to your expectations, and we will keep you updated on how we are working to ensure this never happens again.

The Mimecast Team

Customers hosted on our UK Grid on the Service63 and Service64 cluster pair should now have normal service. Email is flowing normally but there may be some residual queues while the backlog clears.

We are very sorry for this further inconvenience. Any local issues should be reported to our Incident Hotline on 020 7843 2302.

Please continue to check this blog and our Twitter feed https://twitter.com/mimecast for more details. We will provide further updates throughout the day.

Thank you again for your patience.

 

_____________
Previous updates and email routing information can be found here.
Mimecast Outage – Update for Customers on Service63 and Service64
Mimecast Outage – Status Update – 1045hrs
Mimecast Service Outage – Status Update – 0930hrs

Customers hosted on our Service63 and Service64 cluster pair should expect a loss of service for the next hour. Regrettably we have suffered a hardware failure on that system which is affecting the backend storage arrays.

In order to resolve this problem we must take the entire cluster offline for a period of time. Unfortunately there is no alternative for customers hosted on this cluster.

We are very sorry for this further inconvenience. This short outage will only affect customers hosted on Service63 and Service64. Our Incident Hotline is available on 020 7843 2302.

Please continue to check this blog and our Twitter feed https://twitter.com/mimecast for more details. We will provide more updates throughout the day.

Thank you again for your patience.

 

Our infrastructure and technical teams are bringing more services back online in an ordered manner.

Mail delivery in and out of the Mimecast Platform is flowing as normal. Please direct any isolated local problems to our service delivery team on 020 7843 2302.

We are slowly removing the work-arounds in place, although customers will not notice or be affected by these changes.

Some residual problems may still exist with Mimecast Services for Outlook, but we are aware of these issues and are working to resolve them as soon as possible.

Please continue to check this blog and our Twitter feed https://twitter.com/mimecast for more details. We will provide more details throughout the day.

Thank you again for your patience.

 

Our infrastructure and technical teams have been working through the night to restore as many services to normal as possible.

Mail delivery to and from the Mimecast platform is flowing as normal, and in most cases there should be no queues or delays; there may be some residual mail queues where a large backlog existed overnight.

We still have some issues to resolve. The failed network infrastructure that caused the problems yesterday has been fully restored and protected from further outage, and we have made DNS updates that will route email away from hosts that remain offline. There are a small number of temporary work-arounds in place that will provide service to customers and allow us to restore normal service in the background.

Our infrastructure teams will bring service back online in a managed process so normal service will be restored throughout the day. Residual problems with Mimecast Services for Exchange and Mimecast Services for Outlook may occur, and customers may see queued messages with an error of “Unable to send message body” or “could not open file”. We are aware of these problems and will resolve them as quickly as possible.

Please continue to check this blog and our Twitter feed https://twitter.com/mimecast for more details. We will provide more details throughout the day.

Thank you again for your patience.

A Tough Email Day

Today Mimecast suffered an HA network hardware failure that shut down services from one of our data centres in the UK. The outage lasted a little over three hours, from about 11am UK time to just after 2pm. Afterwards some customers experienced slower service responses while the backlog was being cleared.

I wanted to take this opportunity to say sorry personally and on behalf of Mimecast to our customers and partners affected by this issue today. I also wanted to give some background to the problem and our response.

First of all, I know how critically important email is to you and to your businesses. The importance and value of email and the challenges of running robust email infrastructures were among the main reasons Neil Murray, my co-founder and our CTO, and I started Mimecast just over 10 years ago. We sincerely appreciate the faith and trust that you place in us as your email gateway and your email continuity provider. We make promises to you that we will always be there to deliver your messages. We work day and night to meet that promise and invest extensively in our software development, data centres and infrastructure resilience to meet it. We have teams of people who work around the clock to support the service and support you as our customers and partners.

For three hours today we did not live up to our availability promise. We are very sorry.

Over the last ten years we have not had any significant outages because of our infrastructure and because of the constant scenario planning we conduct to ensure we’re mitigating any points of failure.

As a cloud vendor, our platform infrastructure works in an active-active model, where communications are handled by all sides of our grid. If any component becomes unavailable, another part of the grid can take over. Failing over an entire data centre happens extremely rarely, and we deliberately do it manually because an automatic failover of this scale brings significant risks. The plans we had in place underestimated the time it would take to complete the task. We aim for under 30 minutes; however, this one took us over two hours.

We will be reviewing this procedure and making sure that we can do it faster – much faster – should we be called upon to do it again.

In terms of next steps, we will of course honour our SLA obligations, and we’ll be in touch proactively with all affected customers on this issue in the coming days. We appreciate the patience that many of our customers have shown during this tough day and we will be working extra hard to ensure it never happens again.

Co-founder and CEO
Mimecast

First, an apology.  Today Mimecast UK customers have experienced problems with our email services, caused by a network hardware failure at our Woking data centre.  Our infrastructure teams have identified and isolated the problem, and are bringing all affected customer systems back online now.

Customers’ systems will be available soon, but there will be a backlog of email to process, so email may take some time to return to normal.

There are several things you can also do as a customer to make sure your email service is restored as quickly as possible.

  • Firstly check your AdCon addresses for your system availability. Once you can log back into AdCon your email will be flowing too. If you’re not sure of the AdCon address you can click the “Log In” link on our website.
  • Only the even-numbered hostname for AdCon will be available for now. It will look something like this: http://serviceNN.mimecast.com/mimecast/admin, where ‘serviceNN’ is the cluster number for your account.
  • Importantly, please check your outbound SMTP smart host connectors on your Exchange server. You should have two configured, one for each hostname. Mail will only be delivered to the live host because the other half of the pair will be offline. If you only have one hostname, please make sure you add the second hostname to an outbound SMTP smart host connector. Please refer to these KB articles: Exchange 2003, 2007, 2010.
  • MTA2 Customers ONLY: If you’re using MTA2, please check your smart hosts are directed to eu-smtp-1.mimecast.com and eu-smtp-2.mimecast.com
  • Your email will be queued at the sending servers, so you will see email start to flow in and out of your mail server without needing to resend it.
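For anyone who wants to script the checks above, here is a minimal sketch. The function names are hypothetical, the cluster pair (63, 64) is used purely as an illustration, and the configured smart host list would have to be read from your own Exchange connector settings; nothing here calls a Mimecast or Exchange API.

```python
def even_adcon_url(cluster_pair):
    """Build the AdCon URL for the even-numbered host of a cluster pair.

    Follows the http://serviceNN.mimecast.com/mimecast/admin pattern.
    """
    evens = [n for n in cluster_pair if n % 2 == 0]
    if len(evens) != 1:
        raise ValueError("expected exactly one even-numbered host in the pair")
    return f"http://service{evens[0]}.mimecast.com/mimecast/admin"


def missing_smart_hosts(configured, required):
    """Return the required smart hosts that are not configured.

    Hostnames are compared case-insensitively, ignoring stray whitespace.
    """
    have = {host.strip().lower() for host in configured}
    return [host for host in required if host.lower() not in have]


# Illustrative cluster pair only.
print(even_adcon_url((63, 64)))

# An MTA2 customer with only one smart host configured.
required = ["eu-smtp-1.mimecast.com", "eu-smtp-2.mimecast.com"]
print(missing_smart_hosts(["eu-smtp-1.mimecast.com"], required))
```

An empty list from `missing_smart_hosts` means both hostnames are present; any hostname it returns still needs adding to an outbound SMTP smart host connector.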

Please note, this is not a full incident report, but an interim update to help you get back online. Our CEO, Peter Bauer, will be posting about the issue later today. But in the meantime, we hope this list of tips helps to resolve the issue as soon as possible.

We would like to take the opportunity to thank you for your patience and we will continue to post updates on our Twitter feed @mimecast
