(this is a follow on to the previous post on measuring business impact, and are my own thoughts on the case for moving to Exchange 2007)
There are plenty of resources already published which talk about the top reasons to deploy, upgrade or migrate to Exchange 2007 – the top 10 reasons page would be a good place to start. I’d like to draw out some tangible benefits which are maybe less obvious than the headline-grabbing “reduce costs”/”make everyone’s life better” type reasons. I’ll approach these reasons over a number of posts, otherwise this blog will end up reading like a whitepaper (and nobody will read it…)
GOAL: Be more available at a realistic price
High availability is one of those aims which is often harder to achieve than it first appears. If you want a really highly-available system, you need to think hard not only about which bits need to be procured and deployed (eg clustered hardware and the appropriate software that works with it), but the systems management and operations teams need to be structured in such a way that they can actually deliver the promised availability. Also, a bit like disaster recovery, high availability is always easier to justify following an event where not having it is sorely missed… eg if a failure happens and knocks the systems out of production for a while, it’ll be easier to go cap-in-hand to the budget holder and ask for more money to stop it happening again.
Example: Mrs Dalton runs her own business, and like many SMBs, money was tight when the company was starting up – to the extent that they used hand-me-down PC hardware to run their main file/print/mail server. I had always said that this needed to be temporary only, and how they really should buy something better, and it was always something that was going to happen in the future.
Since I do all the IT in the business (and I don’t claim to do it well – only well enough that it stops being a burden for me… another characteristic of small businesses, I think), and Mrs D is the 1st line support for anyone in the office if/when things go wrong, it can be a house of cards if we’re both away. A year or two after they started, the (temporary) server blew its power supply whilst we were abroad on holiday, meaning there was no IT services at all – no internal or internet access (since the DHCP server was now offline) which ultimately meant no networked printers, no file shares with all the client docs, no mail (obviously) – basically everything stopped.
A local PC repair company was called in and managed to replace the PSU and restore working order (at a predictably high degree of expense), restoring normal service after 2 days of almost complete downtime.
Guess what? When we got back, the order went in for a nice shiny server with redundant PSU, redundant disks etc etc. No more questions asked…
Now a historical approach to making Exchange highly available would be to cluster the servers – something I’ve talked about previously in a Clustering & High Availability post.
The principal downside to the traditional Exchange 2003-style cluster (now known as a Single Copy Cluster) was that it required a Storage Area Network (at least if you wanted more than 2 nodes), which could be expensive compared to the kind of high-capacity local disk drives that might be the choice for a stand-alone server. Managing a SAN can be a costly and complex activity, especially if all you want to do with it is to use it with Exchange.
Also, with the Single-Copy model, there’s still a single point of failure – if the data on the SAN got corrupted (or worst case, the SAN itself goes boom), then everything is lost and you have to go back to the last backup, which could have been hours or even days old.
NET: Clustering Exchange, in the traditional sense, can help you deliver a better quality of service. Downtime through routine maintenance is reduced and fault tolerance of servers is automatically provided (to a point).
Now accepting that a single copy cluster (SCC) solution might be fine for reducing downtime due to more minor hardware failure or for managing the service uptime during routine maintenance, it doesn’t provide a true disaster-tolerant solution. Tragic events like the Sept 11th attacks, or the public transport bombs in cities such as London and Madrid, made a lot of organisations take the threat of total loss of their service more seriously … meaning more started looking at meaningful ways of providing a lights-out disaster recovery datacenter. In some industries, this is even a regulatory requirement.
Replication, Replication, Replication
Thinking about true site-tolerant DR just makes everything more complex by multiples – in the SCC environment, the only supported way to replicate data to the DR site will be to do it synchronously – ie the Exchange servers in site A write data to their SAN, which replicates that write to the SAN in site B, which acknowledges that it has received that data, all before the SAN in site A can acknowledge to the servers that the data has successfully been written. All this adds huge latency to the process, and can consume large amounts of high-speed bandwidth not to mention duplication of hardware and typically expensive software (to manage the replication) at both sides.
If you plan to shortcut this approach and use some other piece of replication software (which is installed on the Exchange servers at both ends) to manage the process, be careful – there are some clear supportability boundaries which you need to be aware of. Ask yourself – is taking a short cut to save money in a high availability solution, just a false economy? Check out the Deployment Guidelines for multi-site replication in Exchange 2003.
There are other approaches which could be relevant to you for site-loss resilience. In most cases, were you to completely lose a site (and for a period of time measured at least in days and possibly indefinitely), there will be other applications which need to be brought online more quickly than perhaps your email system – critical business systems on which your organisation depends. Also, if you lost a site entirely, there’s the logistics of managing where all the people are going to go? Work from home? Sit in temporary offices?
One practical solution here is to use something in Exchange 2003 or 2007 called Dial-tone recovery. In essence, it’s a way of bringing up Exchange service at a remote location without having immediate access to all the Exchange data. So your users can at least log in and receive mail, and be able to use email to communicate during the time of adjustment, with the premise that at some point in the near future (once all the other important systems are up & running), their previous Exchange mailbox data will be brought back online and they can access it again. Maybe that data isn’t going to be complete, though – it could be simply a copy of the last night’s backup which can be restored onto the servers at the secondary site.
Using Dial-tone (and an associated model called Standby clustering, where manual activation of standby servers in a secondary datacenter can bring service – and maybe data – online), can provide you a way of keep service availability high (albeit with temporary lowering of the quality, since all the historic data isn’t there) at a time when you might really need that service (ie in a true disaster). Both of these approaches can be achieved without the complexity and expense of sharing disk storage, and without having to replicate the data in real-time to a secondary location.
Exchange 2007 can help you solve this problem, out of the box
Exchange 2007 introduced a new model called Cluster Continuous Replication (CCR) which provides a near-real-time replication process. This is modelled in such
a way that you have a pair of Exchange mailbox servers (and they can only be doing the mailbox role, meaning you’re going to need other servers to take care of servicing web clients, performing mail delivery etc), and one of the servers is “active” at any time, with CCR taking care of the process of making sure that the copy of the data is also kept up to date, and providing the mechanism to automatically (or manually) fail over between the two nodes, and the two copies of the data.
What’s perhaps most significant about CCR is (apart from the fact that it’s in the box and therefore fully supported by Microsoft), is that there is no longer a requirement for the cluster nodes to access shared disk resources… meaning you don’t need a SAN (now, you may still have reasons for wanting a SAN, but it’s just not a requirement any more).
NET: Cluster Continuous Replication in Exchange 2007 can deliver a 2-node shared-nothing cluster architecture, where total failure of all components on one side can be automatically dealt with. Since there’s no requirement to share disk resources between the nodes, it may be possible to use high-speed, dedicated disks for each node, reducing the cost of procurement and the cost & complexity of managing the storage.
Exchange 2007 also offers Local Continuous Replication (LCR), designed for stand-alone servers to keep 2 copies of their databases on different sets of disks. LCR could be used to provide a low-cost way of keeping a copy of the data in a different place, ready to be brought online through a manual process. It is only applicable in a disaster recovery scenario, since it will not offer any form of failover in the event of a server failure or planned downtime.
Standby Continuous Replication (SCR) is the name given to another component of Exchange 2007, due to be part of the next service pack. This will provide a means to have standby, manually-activated, servers at a remote location, which receive a replica of data from a primary site, but without requiring the servers to be clustered. SCR could be used in conjunction with CCR, so a cluster which provides high availability at one location could also send a 3rd replica of its data to a remote site, to be used in case of total failure of the primary site.
The key point is “reasonable price”
In summary, then: reducing downtime in your Exchange environment through clustering presents some challenges.
- If you only have one site, you can cluster servers to share disk storage and get a higher level of service availability (assuming you have the skills to manage the cluster properly). To do this, you’ll need some form of storage area network or iSCSI NAS appliance.
- If you need to provide site-loss resilience (either temporary but major, such as a complete power loss, or catastrophic, such as total loss of the site), there are 3rd-party software-based replication approaches which may be effective, but are not supported by Microsoft. Although these solutions may work well, you will need to factor in the possible additional risk of a more complex support arrangement. The time you least want to be struggling to find out who can and should be helping you get through a problem, is when you’ve had a site loss and are desperately trying to restore service.
- Fully supported site-loss resilience with Exchange 2003 can only be achieved by replicating data at a storage subsystem level – in essence, you have servers and SANs at both sites, and the SANs take care of real-time, synchronous, replication of the data between the sites. This can be expensive to procure (with proprietary replication technology not to mention high speed, low latency network to connect the sites – typically dark fibre), and complex to manage.
- There are manual approaches which can be used to provide a level of service at a secondary site, without requiring 3rd party software or hardware solutions – but these approaches are designed to be used for true disaster recovery, not necessarily appropriate for short-term outages such as temporary power failure or server hardware failure.
- The Cluster Continuous Replication approach in Exchange 2007 can be used to deliver a highly-available cluster in one site, or can be spanned across sites (subject to network capacity etc) to provide high-availability for server maintenance, and a degree of protection against total site failure of either location.
NET: The 3 different replication models which are integral to Exchange 2007 (LCR, CCR and SCR) can help satisfy an organisation’s requirements to provide a highly-available, and disaster-tolerant, enterprise messaging system. This can be achieved without requiring proprietary and expensive 3rd party software and/or hardware solutions, compared with what would be required to deliver the same service using Exchange 2003.
Topics to come in the next installments of the business case for Exchange 2007 include:
- Lower the risk of being non-compliant
- Reduce the backup burden
- Make flexible working easier