RIM outage not only hitting consumers!

In April 2010, Mimecast released a report entitled “Keeping the Enterprise Agile and Mobile” in which we examined the growing pressure to keep BlackBerry services up and running at all times.

At the time, we thought the results were pretty interesting and events over the past few days have played them out pretty well.

Our report found that the expectations of BlackBerry users are extremely high – 66% of respondents said that as much as one hour of downtime per month is not acceptable, and a further 22% said NO downtime is acceptable at all! I can only imagine how these users feel about the last three days’ worth of interruptions…

Given the reported impact on support desks and the board-level fallout that BlackBerry outages seem to cause, we were surprised at the time by the high percentage of organizations (41%) that had no provisions for high availability in place at all. A further 59% said they couldn’t provide continuity for their users and 61% didn’t have an internal BlackBerry availability SLA.

So with these numbers, the corporate world breathed a collective sigh of relief when RIM announced that the outages that they have been having are only affecting their BIS and BBM users… Well, they sighed until their corporate users started complaining about service unavailability.



Cloud based continuity insures against migration mishaps

Exchange migrations tend to be complex. Even smaller organizations running Small Business Server with fewer than 75 users may take a week or more to plan, prepare and execute their email migration.

Any business that’s been through a migration at least once will remember that most of the migration effort was spent in planning. Otherwise they may remember the large mop-up operation and the time spent visiting desktops, recovering mail and rolling aspects of the migration backwards and forwards.

Data loss (what PSTs?), client upgrades and wrongly migrated data tend to come to mind when thinking about what can go wrong, as well as the mail server that crashed during the migration. During a migration a fair amount of change is introduced and additional processing is forced onto both the source and target Exchange platform. For an older platform at the limits of its lifespan or operational capacity, the extra overhead an email migration introduces may be the straw that breaks the camel’s back.

Cloud based email continuity may act as insurance in this regard by enabling client continuity and transactional continuity in case the migration wobbles or breaks. Let’s explore that in a bit more detail.

Migrations are heavily process driven. In order to migrate, a fair amount of surveying, planning, lab testing, etc. needs to be done. It makes sense to use the desktop visit in the plan/survey phase to install the agents that client continuity requires on the desktops.

If an Exchange server in the source or the target organization were to fail during the migration, Outlook clients would be redirected to the cloud, with little or no disruption to service or – crucially – the user experience. This allows the outage to be addressed, and mail flow and client mail service to be restored, without the pressure of fighting two fires concurrently – i.e., a broken environment and a broken migration.

Cloud based email continuity allows you to benefit from the scale of the cloud as a side effect of leveraging continuity in the cloud, provided of course your users have the required network or internet connectivity to beat a path to the cloud.

In our day-to-day lives we’re generally quite comfortable accepting the argument for personal insurance, which guards us against any number of possible scenarios, such as breaking a leg while skiing, falling ill, being burgled, and so on. All of these boil down to paying a small amount of money to a much larger entity and thereby being guaranteed the benefit of that entity’s scale and reach in the case of something unfortunate happening.

As the idea of cloud on demand becomes more pervasive, insuring your migration in the short term against loss of email continuity makes as much sense as taking out insurance on your car before you take it on the road.


Murphy's Laws & Business Continuity

Last week I was reading Robin Gaddum’s post for Continuity Central describing how he communicates the concept of risk management in business continuity by applying Murphy’s Laws. Gaddum proposes three laws, as follows:

If it can go wrong, it will go wrong
If it cannot possibly go wrong, it’ll still go wrong
In real life, puppies die… Get over it. (Or, disasters always have an impact)

Gaddum’s post can be summed up as follows: you must always conduct a risk assessment and invest in risk prevention, but there is always residual risk and there is always an impact. He’s not wrong!

The thing about Murphy’s law is that there is really only one. The adage usually goes “Anything that can go wrong, will go wrong.” There are also less polite ways of putting it, but we humans generally accept that ‘stuff’ does happen, and there is little we can do about it. Murphy’s law helps our brains rationalize and bring order to what is otherwise a wildly chaotic universe. To an extent we try to control that chaos, but not always or all ways.

Of course, I understand that Gaddum is looking for the best way to communicate the concepts of risk management in relation to business continuity, but I’m inclined to think that thinking about risk in relation to “stuff happens” only really achieves an in-depth risk analysis.

The Risk Assessment is a vital part of a Business Continuity Plan and should never be underestimated; all too often I have seen senior managers dismiss a risk because their preconceived ideas are still stuck in the “It’ll never happen to us” or “we’ll deal with it when it happens… until then” mentality. In this situation I always like to ask them how they would feel if the Captain and Co-pilot on their next commercial flight had the same attitude.

When it comes to business continuity, and aviation for that matter, there’s plenty that can go wrong; regardless of how well we prepare, things still do go wrong and accidents still happen. More often than not, when the contributing factors and cause of an incident are examined after the fact, human error is identified as the most significant contributor. As they say, “aeroplanes don’t have accidents, pilots do.” As a result, Human Factors makes up a significant part of the aviation industry, where planning, assessing, designing, building & monitoring around the way ‘humans’ do things and behave is the key.

Gaddum makes a point to remind us of the importance of a business continuity plan, which he describes as:

“…our last ditch defense to enable recovery once that most improbable and unforeseen event has taken us out.”

But I find this quite alarmist. After all, how many BCP documents include an Emergency Action Plan for meteor strikes, or herds of marauding donkeys? Those are “most improbable” and certainly “unforeseen”. Why not think about this in terms of human failures instead – what are the most likely human failings that will cause your business to suffer an outage?

Reliance on Murphy and his (or her) tendency to be right in hindsight will leave us worrying about those donkeys. Thinking instead about what your admins might get wrong when they’re overly tired, or when they have made multiple changes at once, will make your BCP doc much more relevant. It’ll also mean your BCP Planning Team has considered the individuals in your organization and how their actions could affect your continuing business. Looking for a human cause and effect angle takes time but is well worth it in the long run; just ask a pilot.

This is a much more powerful place to be; better than staying awake at night wondering how high a donkey-proof fence needs to be.


Dressed to the Nines; what your Uptime SLA really means

Google and Microsoft have recently been poking holes in each other’s uptime SLAs (Service Level Agreements). The squabble has been summed up here by Paul Thurrott from Windows IT Pro.

In short, Google claimed its Google Apps service had achieved 99.984% uptime in 2010 and, citing an independent report, went on to say this was 46 times more available than Microsoft’s Exchange Server. Microsoft retaliated by saying BPOS achieved 99.9% (or better) uptime in 2010 and this was in line with their SLA. Microsoft quite rightly protested at Google’s definitions of uptime and what should or should not be included.

The discussion continues.

Uptime is one of those things included in your service provider’s SLA that you never really give much attention to, unless it’s alarmingly low: 90%, for example. Most Cloud, SaaS or hosted providers will give uptime SLA figures of between 99.9% (three nines) and 99.999% (five nines). Mimecast proudly offers a 100% uptime SLA.

All of these nines represent different levels of ‘guaranteed’ service availability. For example, one nine (90%) allows for 36.5 days of downtime per year. As I said, alarming. Two nines (99%) would give you 3.65 days of downtime per year, three nines (99.9%) 8.76 hours, four nines (99.99%) 52.56 minutes and five nines (99.999%) 5.26 minutes per year. Lastly, six nines (99.9999%), which is largely academic, gives a mere 31.5 seconds.
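The arithmetic behind those figures is easy to check for yourself. The short Python sketch below is only an illustration: it assumes a 365-day year, and the function and variable names are made up for this post rather than taken from any vendor’s SLA.

```python
# Assumption: a plain 365-day year, with no leap day.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(uptime_percent: float) -> float:
    """Seconds of downtime per year that a given uptime percentage permits."""
    return SECONDS_PER_YEAR * (100.0 - uptime_percent) / 100.0


def humanize(seconds: float) -> str:
    """Express a duration in the largest unit that keeps the number >= 1."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for uptime in (90, 99, 99.9, 99.99, 99.999, 99.9999):
        print(f"{uptime}% uptime allows {humanize(allowed_downtime_seconds(uptime))} per year")
```

On the same 365-day assumption, the 99.984% that Google claimed works out to roughly 84 minutes of downtime over the year.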

What does all of this mean to you as a consumer of these services? In terms of actual service, very little, unless you happen to be in the minority percentage; that is to say everything has gone dark and quiet and you’re suffering a service outage.

What is much more important is how the vendor treats you in the event they don’t achieve 100%. It is hard for any vendor to absolutely guarantee 100% uptime all of the time, so you must make sure there is a provision for service credits or financial compensation in the event of an outage. If not, the SLA is worthless. Any reputable SaaS or Cloud vendor will have absolute confidence in their infrastructure, so based on historical performance a 100% availability SLA will be justifiable. Mimecast offers 100% precisely for this reason.  We have spent a large amount of R&D time on getting the infrastructure right so it can be used to back up our SLA, and as a result we win many customers from vendors whose SLAs have flattered to deceive.


A larger issue perhaps we ought to consider is highlighted by the arrows Google is flinging in Microsoft’s direction: namely, how do vendors really define uptime? What sort of event do they class as an outage? Does the event have to occur for any length of time to qualify? Is planned downtime included in the calculation? And so on.
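To see how much those definitions matter, here is a rough Python sketch. Everything in it is hypothetical: the outage log is invented for illustration, and the two options (whether planned downtime counts, and a minimum length before an event qualifies) are simply the questions above turned into parameters, not any vendor’s actual rules.

```python
from dataclasses import dataclass

# Assumption: a plain 365-day measurement period.
MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class Outage:
    minutes: float
    planned: bool


def measured_uptime(outages, count_planned=True, min_minutes=0.0):
    """Uptime percentage for one year under a chosen definition of 'outage'."""
    counted = sum(
        o.minutes
        for o in outages
        if (count_planned or not o.planned) and o.minutes >= min_minutes
    )
    return 100.0 * (1 - counted / MINUTES_PER_YEAR)


# Invented outage log, for illustration only; not real vendor data.
log = [Outage(240, True), Outage(45, False), Outage(8, False), Outage(5, False)]

# Definition A: every minute of downtime counts.
strict = measured_uptime(log)
# Definition B: planned work is excluded and events under 10 minutes are ignored.
lenient = measured_uptime(log, count_planned=False, min_minutes=10)

print(f"strict: {strict:.3f}%  lenient: {lenient:.3f}%")
```

The same year of service produces two different headline numbers, which is why a bare percentage says little until you know what was counted.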

There is no standard by which uptime is defined and common sense isn’t always applied either. In other markets, consumers are reasonably protected from spurious vendor claims by independent third parties like Consumer Reports or Which. Not so with the claims tech companies make regarding the effectiveness of their solutions, and the result is a great deal of spin, which in turn inevitably leads to misinterpretation and confusion.

Fortunately, we’re not the only ones to see the need for standards here. Although it’s early days still, you can get an overview of current efforts at

Google and Microsoft’s argument is based largely on differences in measurement rather than any meaningful level of service. In a highly competitive market, any small differentiation can be a perceived bonus (by the vendor) but if we’re all using different tape measures to mark our lines, the only reliable way to tell who comes out on top is to talk to the long-term customers.