Welcome to the second article on my view of High Availability. In the first part, we took a look at what high availability is and what the potential impact of requiring a higher availability rate might be.
Today, we’re going to focus on the question: “what should be measured?” and how you take the answer to that question and build your solution.
I ended my previous article by explaining that you’d want to measure a functionality rather than the building blocks of your application architecture. Regular monitoring won’t get you far though. Instead, so-called synthetic transactions (tests which mimic a user’s actions, e.g. logging into OWA) allow you to test a functionality end-to-end. This allows you to determine whether or not the functionality being tested is working as expected. Underneath, there is still a collection of building blocks which may or may not be aware of each other’s presence. This is something you, as an architect/designer/whatever, should definitely be cautious about.
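To make the idea of a synthetic transaction a bit more tangible, here is a minimal Python sketch. It is an illustration under assumptions, not how any particular monitoring product does it: the names `run_synthetic_check` and `owa_logon_page_probe` and the URL `https://mail.example.com/owa/` are made up for this example, and the probe only fetches a logon page, whereas a real synthetic transaction would also log on and open a mailbox.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool
    seconds: float
    detail: str


def run_synthetic_check(probe: Callable[[], None], max_seconds: float) -> CheckResult:
    """Run one end-to-end probe and judge it the way a user would:
    it has to succeed AND finish within an acceptable time."""
    start = time.monotonic()
    try:
        probe()
    except Exception as exc:  # any failure along the path counts as "not working"
        return CheckResult(False, time.monotonic() - start, f"failed: {exc}")
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:
        return CheckResult(False, elapsed, "too slow")
    return CheckResult(True, elapsed, "ok")


def owa_logon_page_probe(url: str = "https://mail.example.com/owa/") -> None:
    """Hypothetical probe: fetch the OWA logon page (placeholder URL).
    Run it from the client network so the whole path is exercised."""
    with urllib.request.urlopen(url, timeout=10) as response:
        if response.status != 200:
            raise RuntimeError(f"unexpected HTTP status {response.status}")
```

You would call something like `run_synthetic_check(owa_logon_page_probe, max_seconds=5.0)` on a schedule from a machine in the client network. The check deliberately knows nothing about individual servers or services: it only reports whether the functionality worked from where the user sits.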
All too often I come across designs that do not take inter-component dependencies into account: from an application point of view everything might seem operational, but that does not necessarily reflect the situation as it is for an end user…
For example, your Exchange services might be up and running and everything might seem OK from the backend, but if the connection between the server and client networks is broken or otherwise not functioning, your end users won’t be able to use their email (or perhaps other services that depend on email) anyway.
This means that, when designing your solution, you should try to (mentally) reconstruct the interaction between the end user and the application (in this case: Exchange). While doing so, write down every component that comes into play. Soon you’ll end up with a list that also includes elements that, at first sight, aren’t really part of your application’s core architecture:
- Networking (switches, routers, WAN-links, …)
- Storage (very important, e.g., in scenarios where Exchange is virtualized)
- Physical hosts (hypervisors, clustering, …)
- Human interaction
You’ll find out that the above elements play an important part in your application architecture, which automatically makes them part of the core architecture after all…
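A quick calculation shows why every item on such a list matters. If the user’s path depends on all components at once, and you assume they fail independently of each other, the end-to-end availability is the product of the individual availabilities. The figures below are made up purely for illustration (and I left human interaction out because it is hard to put a number on); the independence assumption is a simplification as well.

```python
from math import prod

# Made-up availability figures, for illustration only.
components = {
    "Exchange services": 0.999,
    "Networking": 0.999,
    "Storage": 0.999,
    "Physical hosts": 0.999,
}


def end_to_end_availability(availabilities):
    """Availability of a chain in which every component must be up,
    assuming the components fail independently of each other."""
    return prod(availabilities)


total = end_to_end_availability(components.values())
print(f"End-to-end availability: {total:.2%}")
```

Four components at 99.9% each leave you with roughly 99.6% end to end, so the chain as a whole is always less available than its best (or even its worst) link.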
Negotiate your SLAs
Whether or not these ‘external’ elements are your responsibility probably depends on how your IT department is organized. I can imagine that if you have no influence at all on e.g. the networking components, you don’t want to be responsible if something ever goes wrong at that layer. While it is still important to know which components might influence the health of your application, it wouldn’t be a bad idea to leave these components out of the SLAs. In other words: if the outage of your application was due to the network layer, it wouldn’t count towards your SLA.
In my opinion, that defeats the entire purpose of defining SLAs and trying to make IT work as a service for the business. After all: they don’t care what caused an outage, they only care about how long it takes you (or someone else) to get the service/functionality back up and running.
Now that I’ve brought that up, imagine the following example: one of the business requirements states that mails to recipients outside your company should always be delivered within x period of time (yes, I deliberately left the timeframe out because it’s irrelevant to the point I’m trying to make). When doing a component breakdown, you could come up with something like this (high-level):
- Client network
- Mailbox Server(s)
- Hub Transport Server(s)
- Server network
- WAN-links (internet)
While the first four components might lie within your reach to manage and remediate in case of an outage, the fifth (the WAN link) usually doesn’t. So if it takes your ISP 8 hours to solve an issue (because their SLA allows them to, for instance), you might think twice before accepting a 99.9% uptime in your SLA… However, if that isn’t an option, you could try finding an ISP who can solve your issues quicker, or you could install a backup internet connection. Bottom line: you also need to take external factors into account when designing your solution.
In some cases, I’ve seen WAN links (or outages caused by them) marked as an exception to the SLA, just because the probability of an outage was very low (and the cost of an additional backup link was too high).
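To put the 8-hour ISP example next to a 99.9% target, it helps to translate the percentage into a downtime budget. The sketch below does just that; the helper name `allowed_downtime_hours` and the choice of a 365-day year and a 30-day month as measurement periods are my own assumptions for the example.

```python
def allowed_downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Hours of downtime an availability target leaves you within a period."""
    return period_hours * (1 - availability_pct / 100)


HOURS_PER_YEAR = 365 * 24   # 8760
HOURS_PER_MONTH = 30 * 24   # 720

yearly_budget = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
monthly_budget = allowed_downtime_hours(99.9, HOURS_PER_MONTH)
isp_fix_hours = 8  # the repair time from the example above

print(f"Yearly budget at 99.9%:  {yearly_budget:.2f} hours")
print(f"Monthly budget at 99.9%: {monthly_budget * 60:.1f} minutes")
print(f"Left this year after one ISP outage: {yearly_budget - isp_fix_hours:.2f} hours")
```

At 99.9% you get about 8.76 hours per year, or about 43 minutes per 30-day month. A single 8-hour ISP outage therefore eats almost the entire yearly budget, and it blows a monthly budget many times over.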
Probability vs. impact
When you are designing your solution, you don’t always have to take into account every little thing that could go wrong, simply because you cannot account for everything that can go wrong (Murphy?). While your design should definitely take into account whatever it can, it should also pay attention to the cost-effectiveness of the solution. Remember the graph in the first part which showed that costs tend to grow exponentially when trying to achieve a higher availability rate?
This means that sometimes, because the cost of mitigating a single point of failure or risk cannot be justified, you’ll have to settle for less. In such a case, you’d want to assess what the probability of a potential error/fault is and how it might affect your application. If both the probability that it occurs and the impact on your application are low, it’s sometimes far more sensible to accept a risk than to try to mitigate/solve it. On the other hand, if there’s an error which is very likely to occur and might knock down your environment for a few hours, you might reconsider and try solving it.
Solving such an issue can be done in various ways (depending on what the error/fault is): either increase the application’s resiliency or solve the issue at the layer where it occurs. For instance: if you’ve got a dodgy (physical) network in one of your two sites, you might rethink your design to make more use of the site that has a proper network, OR you could try solving the issues at the network layer to make it less dodgy (which I would prefer).
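The probability-versus-impact reasoning above can be sketched as a small calculation: multiply the chance that something happens by the downtime it would cause, put a price on that downtime, and compare it to what mitigation would cost. This is a deliberately naive model and every number in it is invented for the example; the `Risk` class, `worth_mitigating` and the cost per outage hour are assumptions, not figures from a real environment.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    yearly_probability: float  # chance it happens in a given year (0..1)
    outage_hours: float        # downtime if it does happen
    mitigation_cost: float     # yearly cost of removing the risk


def expected_outage_hours(risk: Risk) -> float:
    """Probability times impact: average downtime per year from this risk."""
    return risk.yearly_probability * risk.outage_hours


def worth_mitigating(risk: Risk, cost_per_outage_hour: float) -> bool:
    """Mitigate only if the expected loss exceeds the price of mitigation."""
    expected_loss = expected_outage_hours(risk) * cost_per_outage_hour
    return expected_loss > risk.mitigation_cost


# Invented numbers, for illustration only.
wan_link = Risk("WAN link outage", 0.05, 8, mitigation_cost=12_000)
dodgy_network = Risk("Dodgy site network", 0.9, 4, mitigation_cost=5_000)

for risk in (wan_link, dodgy_network):
    decision = "mitigate" if worth_mitigating(risk, cost_per_outage_hour=2_000) else "accept"
    print(f"{risk.name}: {expected_outage_hours(risk):.1f} expected hours/year -> {decision}")
```

With these made-up inputs the unlikely WAN outage is cheaper to accept than to fix, while the likely network problem is worth solving. In practice, the hard part is of course agreeing on the probabilities and on what an hour of downtime really costs the business.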
Although I’m convinced that what I wrote didn’t surprise you, by now you should realize that creating a highly available (Exchange) solution takes proper planning. There are far more elements that come into play than one might think at first. Also keep in mind that I only touched on these different aspects superficially; when dealing with potential risks like human error, other things come into play, e.g. defining a training plan to lower the risk of human error.
I personally believe that the importance of these elements will only grow in the future. I’m sure you’ve already heard of the phenomenon “IT as a service”? When approaching the aspect of high availability, try thinking of yourself as the electricity supplier and of the business as the customer (which they actually are). As a customer, you don’t care how electricity gets to your home or, if it doesn’t, why it doesn’t. All you care about is having electricity in the end…