by Mounil Patel
The 2014 Atlantic Hurricane season is in full swing through November, putting your organization – and mission-critical systems, like email – at sudden risk of exposure to tropical storms, floods and fires.
Ask yourself: When was the last time you tested your business continuity plan? If the answer is one year or longer, you risk significant network downtime, data leakage and financial loss. According to Gartner, depending on your industry, network downtime can typically cost $5,600 per minute or more than $300,000 per hour, on average. Don’t wait for disaster to strike. Treat email like the critical system it is, and avoid making these six mistakes that could jeopardize business continuity – and your job.
Combat downtime during hurricane season by planning ahead.
- Not testing your continuity solution. You’ve devised and implemented what you believe to be a solid continuity solution, but you’ve not given it a production test. Instead, you cross your fingers and hope when (and if) the time comes, the solution works as planned. There are two major problems with not testing your plan from the start. First, things get dusty over time. It’s possible the technology no longer works, or worse, maybe it was not properly configured in the first place. Plus, you might not be regularly backing up critical systems. Without testing the solution, you’ll learn the hard way that data is not being entirely backed up when you perform the restore. Second, when it comes to planning, you need a clear chain of command, should disaster strike. If your network goes down, you need to know who to call, immediately. Performing testing once simply is not enough. You need to test your solution once a year, at a minimum. Depending on the tolerance of your business, you’ll likely have to test more frequently, like quarterly or even monthly.
- Forgetting to test failback. Testing the failover capabilities of your continuity solution is only half the job. Are you prepared for downtime that could last hours, days or even weeks? The ability to go from the primary data center to the secondary one – and then back again – is critical, and this needs to be tested. You need to know that data can be restored into normal systems after downtime.
- Assuming you can easily engage the continuity solution. It’s common to plan for “normal” disasters like power outages and hardware failure. But in the event of something more severe, like a flood or fire, you need to know how difficult it is to trigger a failover. You also need to know where you need to be. For example, can you trigger the failover from your office or data center? It’s critical to know where the necessary tools are located and how long it’ll take you or your team to locate them. Physical access is critical. Distribute tools to multiple data centers, as well as your local environment.
- Excluding policy enforcement. When an outage occurs, you must still account for regulatory and policy-based requirements that impact email communications. This includes archiving, continuity and security policies. Otherwise, you risk non-compliance.
- Trusting agreed RTO and RPO. Your recovery time objective (RTO) and recovery point objective (RPO) are a balance between risk and budget. When an outage happens, will the email downtime agreed upon by the business really stick? In other words, will the CEO really be able to tolerate no access to email for two hours? And will it be acceptable for customers to be out of touch with you for one day? The cost associated with RTO and RPO could cause a gap in data restore. If you budget for a two-day email restore, be prepared that during an outage, this realistically means two days without email for the entire organization. As part of your testing methodology, you may discover that you need more or less time to back up and restore data. It’s possible that, as a result, you may need to implement more resilient technology – like moving from risky tape backup to more scalable and accessible cloud storage.
- Neglecting to include cloud services. Even when you implement cloud technologies to deliver key services, such as email, you still have the responsibility of planning for disruptions. Your cloud vendor will include disaster recovery planning on their end to provide reliable services, but mishaps – and disasters – still happen. Mitigate this risk by stacking multi-vendor solutions wherever possible to ensure redundancy, especially for services like high availability gateways in front of cloud-based email services, or cloud backups of key data.
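One way to make the RTO and RPO point in the list above concrete is to compare the results of a continuity test against the agreed targets. The following is a minimal sketch, not a prescribed method: the function name `check_objectives` and all of the durations are hypothetical, and it assumes the worst-case data loss is roughly the interval between backups.

```python
from datetime import timedelta


def check_objectives(rto, rpo, measured_restore, backup_interval):
    """Return a list of gaps found in a continuity test.

    rto: agreed recovery time objective (maximum tolerable downtime)
    rpo: agreed recovery point objective (maximum tolerable data loss)
    measured_restore: how long the restore actually took in the test
    backup_interval: time between backups, used here as an assumed
        approximation of worst-case data loss
    An empty list means the test met both objectives.
    """
    gaps = []
    if measured_restore > rto:
        gaps.append(f"restore took {measured_restore}, exceeding the RTO of {rto}")
    if backup_interval > rpo:
        gaps.append(f"backups run every {backup_interval}, exceeding the RPO of {rpo}")
    return gaps


# Hypothetical figures: the business agreed to two hours of downtime and
# one hour of data loss, but the test restore took five hours from a
# nightly backup.
for gap in check_objectives(
    rto=timedelta(hours=2),
    rpo=timedelta(hours=1),
    measured_restore=timedelta(hours=5),
    backup_interval=timedelta(hours=24),
):
    print("Gap:", gap)
```

Any gap the check reports is a prompt to either renegotiate the targets with the business or invest in faster, more frequent backup and restore.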
With the proper testing and upfront business continuity preparation, you can significantly reduce – or even prevent – email downtime, data leakage and financial loss after disaster strikes.
by Barry Gill
Exchange migrations tend to be complex. Even smaller organizations running Small Business Server with fewer than 75 users may take a week or more to plan, prepare and execute their email migration.
Any business that’s been through a migration at least once will remember that most of the migration effort was spent in planning. Otherwise they may remember the large mop-up operation and the time spent visiting desktops, recovering mail and rolling aspects of the migration backwards and forwards.
Data loss (what PSTs?), client upgrades and wrongly migrated data tend to come to mind when thinking about what can go wrong, as well as the mail server that crashed during the migration. During a migration a fair amount of change is introduced and additional processing is forced onto both the source and target Exchange platform. For an older platform at the limits of its lifespan or operational capacity, the extra overhead an email migration introduces may be the straw that breaks the camel’s back.
Cloud-based email continuity may act as insurance in this regard by enabling client continuity and transactional continuity in case the migration wobbles or breaks. Let’s explore that in a bit more detail.
Migrations are heavily process driven. In order to migrate, a fair amount of surveying, planning, lab testing, etc. needs to be accomplished. It makes sense to use the desktop visit of the plan/survey component to install the agents required on the desktops to make client continuity possible.
If an Exchange server in the source or the target organization were to fail during the migration, Outlook clients would be redirected to the cloud, with little or no disruption to service or – crucially – the user experience. This allows the outage to be addressed, and mail flow and client mail service to be restored, without the pressure of fighting two fires concurrently – i.e., a broken environment and a broken migration.
Cloud-based email continuity allows you to benefit from the scale of the cloud as a side effect of leveraging continuity in the cloud, provided of course your users have the required network or internet connectivity to beat a path to the cloud.
In our day-to-day lives we’re generally quite comfortable accepting the argument for personal insurance, which guards us against any number of possible scenarios, such as breaking a leg while skiing, a medical emergency, theft, and so on. All of these boil down to paying a small amount of money to a much larger entity and thereby being guaranteed the benefit of that entity’s scale and reach in the case of something unfortunate happening.
As the idea of cloud on demand becomes more pervasive, insuring your migration in the short term against loss of email continuity makes as much sense as taking out insurance on your car before you take it on the road.
by Orlando Scott-Cowley
Google and Microsoft have recently been poking holes in each other’s uptime SLAs (Service Level Agreements). The squabble has been summed up by Paul Thurrott of Windows IT Pro.
In short, Google claimed its Google Apps service had achieved 99.984% uptime in 2010 and, citing an independent report, went on to say this was 46 times more available than Microsoft’s Exchange Server. Microsoft retaliated by saying BPOS achieved 99.9% (or better) uptime in 2010 and this was in line with their SLA. Microsoft quite rightly protested at Google’s definitions of uptime and what should or should not be included.
The discussion continues.
Uptime is one of those things included in your service provider’s SLA that you never really give much attention to, unless it’s alarmingly low: 90%, for example. Most Cloud, SaaS or hosted providers will give uptime SLA figures of between 99.9% (three nines) and 99.999% (five nines). Mimecast proudly offers a 100% uptime SLA.
All of these nines represent different levels of ‘guaranteed’ service availability. For example, one nine (90%) allows for 36.5 days of downtime per year. As I said, alarming. Two nines (99%) would give you 3.65 days of downtime per year, three nines (99.9%) 8.76 hours, four nines (99.99%) 52.56 minutes and five nines (99.999%) 5.26 minutes per year. Lastly six nines, which is largely academic, gives a mere 31.5 seconds.
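The downtime figures above follow from simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of a year. A minimal sketch of that calculation, assuming a 365-day year (the function and constant names are illustrative only):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def allowed_downtime_seconds(availability_percent):
    """Seconds of downtime per year permitted by an availability figure."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def describe(seconds):
    """Render a duration in the largest unit that keeps the value at 1 or more."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.1f} seconds"


for percent in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    print(f"{percent}% uptime allows {describe(allowed_downtime_seconds(percent))} of downtime per year")
```

The same function can be applied to any figure a vendor quotes, such as the 99.984% Google claimed for 2010.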
What does all of this mean to you as a consumer of these services? In terms of actual service, very little, unless you happen to be in the minority percentage; that is to say everything has gone dark and quiet and you’re suffering a service outage.
What is much more important is how the vendor treats you in the event they don’t achieve 100%. It is hard for any vendor to absolutely guarantee 100% uptime all of the time, so you must make sure there is a provision for service credits or financial compensation in the event of an outage. If not, the SLA is worthless. Any reputable SaaS or Cloud vendor will have absolute confidence in their infrastructure, so based on historical performance a 100% availability SLA will be justifiable. Mimecast offers 100% precisely for this reason. We have spent a large amount of R&D time on getting the infrastructure right so it can be used to back up our SLA, and as a result we win many customers from vendors whose SLAs have flattered to deceive.
A larger issue perhaps we ought to consider is highlighted by the arrows Google is flinging in Microsoft’s direction: namely, how do vendors really define uptime? What sort of event do they class as an outage? Does the event have to occur for any length of time to qualify? Is planned downtime included in the calculation? And so on.
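To see how much these definitions matter, consider a rough sketch in which the same outage log is scored under two different rules. Everything here is hypothetical: the outage durations are invented for illustration, and the two rules (whether planned downtime counts, and how long an event must last to qualify) are assumptions standing in for whatever a given vendor's SLA actually says.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


@dataclass(frozen=True)
class Outage:
    minutes: float
    planned: bool


def uptime_percent(outages, include_planned=True, min_qualifying_minutes=0.0):
    """Uptime for one year under a given definition of what counts as an outage."""
    counted = sum(
        o.minutes
        for o in outages
        if (include_planned or not o.planned) and o.minutes >= min_qualifying_minutes
    )
    return 100 * (1 - counted / MINUTES_PER_YEAR)


# Hypothetical year: one planned maintenance window and three unplanned incidents.
log = [Outage(240, True), Outage(45, False), Outage(8, False), Outage(4, False)]

# Strict definition: every minute of unavailability counts.
strict = uptime_percent(log)
# Lenient definition: planned work is excluded and short blips are ignored.
lenient = uptime_percent(log, include_planned=False, min_qualifying_minutes=10)

print(f"strict:  {strict:.3f}%")
print(f"lenient: {lenient:.3f}%")
```

The service delivered to users is identical in both cases; only the tape measure has changed.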
There is no standard by which uptime is defined and common sense isn’t always applied either. In other markets, consumers are reasonably protected from spurious vendor claims by independent third parties like Consumer Reports or Which. Not so with the claims tech companies make regarding the effectiveness of their solutions, and the result is a great deal of spin, which in turn inevitably leads to misinterpretation and confusion.
Fortunately, we’re not the only ones to see the need for standards here. Although it’s early days still, you can get an overview of ongoing efforts at cloud-standards.org.
Google and Microsoft’s argument is based largely on differences in measurement rather than any meaningful level of service. In a highly competitive market, any small differentiation can be a perceived bonus (by the vendor) but if we’re all using different tape measures to mark our lines, the only reliable way to tell who comes out on top is to talk to the long-term customers.
by Orlando Scott-Cowley
This week’s #list is a lighthearted look at the start-to-finish phases of an email outage.
1. Everything is working just fine. It’s been a while since your last outage, if ever.
2. Users are happily sending and receiving email and doing what users do.
3. The odd 200MB attachment causes a slowdown. Thank you, users!
Trouble is Brewing
4. Your mail stores are getting a little full, but you’re not that worried.
5. The mail server or OS software vendor releases a major update.
6. You know this could mean trouble. You put off applying the update.
The Holes in the Cheese Line Up
7. You find out the update fixes a problem you’ve been having with the system.
8. You tell the CIO you’ll have to apply the update.
9. The CIO tells you downtime is only available at 10pm on Sunday night, this week.
10. You cancel the plans you had for Sunday, and pre-empt spending the night on the couch as your partner won’t be happy.
You Plan and Prepare – 7 P’s right?
11. You’ve planned the update, backed up the server, you’re ready.
12. You bounce the server one last time (just to make sure).
13. You start to apply the update, diligently pressing the Next button... Next, next, next...
Then it Happens
14. Then the blue progress bar stops moving. Your anxiety level (already high) begins to rise.
15. You wonder whether the vendor has included an anxiety detector, as the longer this takes the more stressed you get, and therefore the longer it takes.
16. You wait... and wait... and 45 minutes later you decide. Reboot... Heck, why not?
17. Hey, everything looks ok, it’s back up. No problem.
18. Then you check the services, and the ohnosecond happens.
19. You realize that although everything looks good, a critical service isn’t starting. Maybe a key service for the email server.
20. Panic sets in... This is going to be a long night.
What Happens Next?
21. Support might work, it might not. Roll back plan might work, it might not.
22. You know the users will be connecting soon.
23. Reverting to a previous back-up is a scary prospect.
24. You call the CIO to explain; the CIO tells you to get it working by 8am on Monday and goes back to sleep.
25. You’re in for a long night… You wish you had pushed harder for that email continuity solution.
26. The rest depends on luck, skill, the alignment of the planets and anything else you’d care to channel.
27. You promise yourself to get the CIO to sign off on the email continuity solution.
28. No one noticed. No one cares.
29. Many espressos, a 4-pack of Red Bull, zero sleep and despair have given you the appearance of a vagrant.
30. Your colleagues tell you to take it easy on the weekends; you need more sleep…