Achieving High Availability with Windows Azure Environment - Part 1

Technology is growing tremendously, and the world is becoming more connected day by day. When I started my studies, people were using the telegram to send messages quickly. Now mobile technologies are becoming powerful, and new spectrum releases such as 2G, 3G and 4G have changed the world by making almost everything possible over the internet. Cloud computing is one such technology: it lets a business host all or part of its operations on the cloud and achieve difficult goals in an easy way.

When one of my friends travelled to the US on a business visit, he needed to work on his company network for some deliverables. Cloud computing helped him a lot, as his company had hosted a part of its network on the cloud and created VPN connectivity to its enterprise network. He just needed to connect to the internet and then to the company VPN on the cloud. He connected to the internet through a satellite internet provider, HughesNet, and completed his deliverables easily.

Cloud computing provides a lot of flexibility, including scalability, availability and pay-per-use pricing. In this article, I point out some of the availability features of Microsoft Azure that can be used to achieve high availability on the cloud.

Note: This three-page article covers features that are in production with Microsoft Azure, such as Cloud Services, Storage Services, and SQL Azure. It does not cover Virtual Machines, Networking, etc.

Cloud Provider - Windows Azure
Windows Azure is an open cloud platform that enables customers to develop applications on their own platform and deploy them to Microsoft-managed datacenters. It also enables them to monitor and manage the hosted applications in multiple ways, such as the Management console, PowerShell scripts or application APIs.

Apart from cost benefits, Windows Azure provides various advantages such as high availability and scalability. This paper provides information about achieving high availability in the Windows Azure environment.

High Availability
Wikipedia defines high availability as follows: High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.

High availability calculation
High availability is measured by the downtime of the system, that is, the time during which end users are unable to access the application. Downtime falls into two types: scheduled and unscheduled.

Scheduled downtime – This downtime occurs due to system maintenance, such as applying patches to system software or OS updates. It can be planned well in advance, and precautions can be taken before it starts.

Unscheduled downtime – This downtime occurs due to network failure, power outage, CPU/RAM failure, application or system crashes, etc. As it cannot be predicted, it requires more consideration when designing for high availability.

The Service Level Agreement (SLA) is expressed as the percentage of availability defined for the system. The following table shows availability percentages and the corresponding downtime periods.

Availability %	Downtime per year	Downtime per month	Downtime per week
90% ("one nine")	36.5 days	72 hours	16.8 hours
95%	18.25 days	36 hours	8.4 hours
97%	10.96 days	21.6 hours	5.04 hours
98%	7.30 days	14.4 hours	3.36 hours
99% ("two nines")	3.65 days	7.20 hours	1.68 hours
99.5%	1.83 days	3.60 hours	50.4 minutes
99.8%	17.52 hours	86.23 minutes	20.16 minutes
99.9% ("three nines")	8.76 hours	43.2 minutes	10.1 minutes
99.95%	4.38 hours	21.56 minutes	5.04 minutes
99.99% ("four nines")	52.56 minutes	4.32 minutes	1.01 minutes
99.999% ("five nines")	5.26 minutes	25.9 seconds	6.05 seconds
99.9999% ("six nines")	31.5 seconds	2.59 seconds	0.605 seconds
99.99999% ("seven nines")	3.15 seconds	0.259 seconds	0.0605 seconds
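
The figures in the table follow from simple arithmetic: allowed downtime = (100 - availability %) / 100 x length of the period. The following is a minimal Python sketch of that calculation. The function name and the period lengths (a 365-day year, a 30-day month and a 7-day week) are my own assumptions, so the results can differ slightly from some of the monthly values in the table.

```python
# Convert an availability percentage into the downtime it allows.
# Assumption: 365-day year, 30-day month, 7-day week.
HOURS_PER_PERIOD = {"year": 365 * 24, "month": 30 * 24, "week": 7 * 24}


def allowed_downtime_hours(availability_percent):
    """Return the allowed downtime in hours for each period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return {
        period: hours * unavailable_fraction
        for period, hours in HOURS_PER_PERIOD.items()
    }


if __name__ == "__main__":
    for percent in (99.0, 99.9, 99.95, 99.99):
        downtime = allowed_downtime_hours(percent)
        print(
            f"{percent}%: {downtime['year']:.2f} h/year, "
            f"{downtime['month'] * 60:.2f} min/month, "
            f"{downtime['week'] * 60:.2f} min/week"
        )
```

For 99.9% availability, for example, this gives 8.76 hours per year and 43.2 minutes per month, matching the "three nines" row.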

For Azure services, the SLA varies by component. The following table shows the SLA for each component that Microsoft offers.

Azure Services	SLA	Terms	Current SLA information
Cloud Services	99.95%	2+ instances	http://go.microsoft.com/fwlink/?LinkId=159704
Storage	99.9%	See the URL	http://go.microsoft.com/fwlink/?LinkId=159705
SQL Database	99.9%	See the URL	http://go.microsoft.com/fwlink/?LinkId=159706
SQL Reporting	99.9%	See the URL	http://go.microsoft.com/fwlink/?LinkId=253477
Service Bus	99.9%	See the URL	http://go.microsoft.com/fwlink/?LinkId=159707
Access Control	99.9%	See the URL	http://go.microsoft.com/fwlink/?LinkId=159707
Caching	99.9%	See the URL	http://go.microsoft.com/fwlink/?LinkId=159707
CDN	99.9%	See the URL	http://go.microsoft.com/fwlink/?LinkId=195943

High Availability on Windows Azure by Default
Windows Azure provides high availability by default for all applications deployed on it. Below are some of the reasons why the Windows Azure environment is highly available.
1. Microsoft has set up world-class datacenters in multiple geographic locations across the globe. When a datacenter goes down because of a disaster such as an earthquake, wildfire, tornado or nuclear reactor meltdown, another datacenter can take over and respond to user requests.
2. The datacenters are designed and constructed with stringent levels of physical security and access control, power redundancy and efficiency, environment control, and recoverability capabilities.
3. The datacenter facilities comply with a broad set of industry standards, including ISO 27001 and SOC / SSAE 16 / SAS 70 Type II and, within the United States, FISMA certification.
4. To ensure the recovery of the Windows Azure platform core components, Microsoft established an Enterprise Business Continuity Program based on the DRII Professional Practice Statement and the BCI Good Practice Guidelines. This program also aligns with FISMA and ISO 27001 Continuity Control requirements.
5. Azure provides built-in network load balancing and automatic OS and service patching.

Windows Azure Cloud Services
Azure Cloud Services provides 99.95 percent availability by default to the subscribers.

Windows Azure Compute Services provides high availability by spreading instances across fault domains (fully isolated groupings of hardware and network devices) and upgrade domains (groups of instances that are upgraded together). For this to take effect, the role must be deployed with at least two instances.

A fault domain is a physical unit that acts as a separate rack with dedicated hardware and network infrastructure for deploying VMs. When a role is deployed with two instances, the Windows Azure Fabric Controller places each instance in a different fault domain, for example Instance #1 in Fault Domain #1 and Instance #2 in Fault Domain #2. The developer has no control over fault domain allocation, either through configuration or through API calls.

An upgrade domain is a logical unit that determines how a role is upgraded. It allows the instances of a role to be separated into different upgrade domains, which are then upgraded one at a time when the role is upgraded.

Figure 1 - Cloud Service Upgrade Domain and Fault Domain representation

In Figure 1, four instances are deployed on Azure across two fault domains and two upgrade domains. If Fault Domain #1 fails because of a network failure or a hardware issue, Fault Domain #2 (instance #2 and instance #4) remains available to respond to users. The Fabric Controller notices within a short time that instance #1 and instance #3 are not responding, redeploys them in a fault domain other than Fault Domain #2, and brings them back into service.

When software patches or configuration changes are rolled out, the changes are applied first to Upgrade Domain #1 (instance #1 and instance #2) while Upgrade Domain #2 (instance #3 and instance #4) remains available to respond to users. When Upgrade Domain #1 completes the upgrade, Upgrade Domain #2 starts its upgrade and Upgrade Domain #1 is available to respond to users.

The allocation of fault domains and upgrade domains is handled by the Fabric Controller and depends on cluster availability at the time of deployment.
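
To make the walk-through concrete, here is a small Python sketch of the four-instance layout. It assumes that Fault Domain #2 holds instances #2 and #4 and that Upgrade Domain #1 holds instances #1 and #2; the dictionary and the function name are illustrative only. In practice the Fabric Controller decides the placement and the developer cannot set it.

```python
# Illustrative layout for the four instances in Figure 1.
# Assumption: the real placement is chosen by the Fabric Controller.
INSTANCES = {
    1: {"fault_domain": 1, "upgrade_domain": 1},
    2: {"fault_domain": 2, "upgrade_domain": 1},
    3: {"fault_domain": 1, "upgrade_domain": 2},
    4: {"fault_domain": 2, "upgrade_domain": 2},
}


def available_instances(domain_kind, domain_out_of_service):
    """Return the instances still serving while one domain is unavailable."""
    return sorted(
        instance
        for instance, placement in INSTANCES.items()
        if placement[domain_kind] != domain_out_of_service
    )


if __name__ == "__main__":
    # Fault Domain #1 fails: instances in Fault Domain #2 keep serving.
    print(available_instances("fault_domain", 1))
    # Upgrade Domain #1 is being patched: Upgrade Domain #2 keeps serving.
    print(available_instances("upgrade_domain", 1))
```

In both cases two of the four instances stay available, which is why a role needs at least two instances to benefit from this design.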

As explained, when a compute service such as a Web Role, Worker Role or VM Role is deployed with more than one instance, Microsoft ensures that the instances are placed in different fault domains. The application therefore stays available even if one fault domain fails, because another fault domain takes over responding to users.

4.1 Windows Azure Traffic Manager
As we saw previously, a cloud service provides 99.95% availability by default. An application hosted on Azure as a Web Role or Worker Role (or both) with an instance count of two is therefore available 99.95% of the time, which means downtime could reach 4.38 hours in a year, 21.56 minutes in a month or 5.04 minutes in a week.

When an enterprise expects higher availability, such as 99.99%, it can use a component called Windows Azure Traffic Manager, which allows the same application to be hosted in multiple regions (or in the same region) and combined under a single domain URL. When a user requests the URL provided by Traffic Manager, the request first goes to the Traffic Manager load balancing module and is then routed to one of the cloud services based on the rules defined.
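
As a rough illustration of why hosting the same application in several places raises availability, the sketch below estimates the combined figure as 1 - (1 - a)^n for n deployments that each offer availability a. This is a theoretical estimate, not an SLA published by Microsoft. It assumes the deployments fail independently and ignores the availability of Traffic Manager itself, and the function name is my own.

```python
def combined_availability(per_deployment_percent, deployment_count):
    """Estimate availability when at least one deployment must be up.

    Assumption: deployments fail independently, which real regions only
    approximate. The availability of the routing layer is ignored.
    """
    if deployment_count < 1:
        raise ValueError("need at least one deployment")
    failure_probability = (1 - per_deployment_percent / 100) ** deployment_count
    return (1 - failure_probability) * 100


if __name__ == "__main__":
    for count in (1, 2, 3):
        print(count, f"{combined_availability(99.95, count):.7f}%")
```

With two deployments at 99.95% each, the estimate is 1 - 0.0005^2 = 99.999975%, so even a second region moves the theoretical figure well past four nines.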

For example, an enterprise wants to host a Sales Order application in the Windows Azure environment, to be used by customers across the world. The application is expected to offer a higher availability rate (>99.95%). In such a scenario, Windows Azure allows the enterprise to do the following.
1. Host the same application in different regions. The regions can be chosen according to where most users are expected to use the application.
  For example, for an application hosted in three regions, the URLs are as follows:
  http://salesorderus.cloudapp.net/
  http://salesordereurope.cloudapp.net/
  http://salesorderasia.cloudapp.net/
2. Create a Traffic Manager policy with rules that suit the enterprise's requirements, and choose a common DNS prefix for the policy.
3. Activate the policy. When activation completes, Traffic Manager configures the required settings and provides a complete URL that can be used to reach all three cloud services hosted in the different regions.
  For example: http://salesorderenterprise.trafficmanager.net/
4. When the enterprise points a DNS entry in its custom domain at the Traffic Manager domain, users can request the application using the enterprise domain URL. For example: http://shellappdomain.shell.com
5. Users can then request the application through the Traffic Manager URL instead of a cloud service URL. Traffic Manager routes each request to the appropriate cloud service based on the policy rules and the current status of the cloud services.

Figure 2 shows conceptually how Windows Azure Traffic Manager works.

Figure 2 - Conceptual diagram of how Windows Azure Traffic Manager works
1. The user requests information using the application domain name. The process of resolving the DNS name to an IP address begins.
2. The DNS resource record for the application domain points to a Traffic Manager domain maintained in Windows Azure Traffic Manager.
3. Traffic enters through the Traffic Manager domain, and the policy dictates how to route that traffic.
4. The Traffic Manager policy uses the chosen load balancing method and the monitoring status to determine which Windows Azure hosted service should service the request.
5. Traffic Manager returns the DNS name of the chosen hosted service to the user. The user's local DNS resolver then resolves that name to the IP address of the chosen hosted service.
6. The user calls the hosted service directly using the returned IP address. The user continues to interact with the chosen hosted service until the local DNS cache expires.
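
The resolution chain in the steps above can be sketched as a pair of lookup tables in Python. This is only a simulation of the flow, not how DNS or Traffic Manager is implemented. The host names come from the Sales Order example, while the IP addresses are placeholders from the 203.0.113.0/24 documentation range and the function names are hypothetical.

```python
# Hypothetical DNS records. Host names are taken from the example above;
# the IP addresses are placeholders, not real Azure addresses.
CNAME_RECORDS = {
    "shellappdomain.shell.com": "salesorderenterprise.trafficmanager.net",
}
A_RECORDS = {
    "salesorderus.cloudapp.net": "203.0.113.10",
    "salesordereurope.cloudapp.net": "203.0.113.20",
    "salesorderasia.cloudapp.net": "203.0.113.30",
}


def resolve(name, choose_hosted_service):
    """Follow company domain -> Traffic Manager -> hosted service -> IP."""
    # Steps 1-2: the company domain points at the Traffic Manager domain.
    traffic_manager_domain = CNAME_RECORDS[name]
    # Steps 3-4: the policy picks a hosted service for this request.
    hosted_service = choose_hosted_service(traffic_manager_domain)
    # Step 5: the hosted service name is resolved to an IP address.
    return hosted_service, A_RECORDS[hosted_service]


if __name__ == "__main__":
    # A stand-in policy that always answers with the European deployment.
    def policy(domain):
        return "salesordereurope.cloudapp.net"

    print(resolve("shellappdomain.shell.com", policy))
```

The point of the sketch is that Traffic Manager only takes part in name resolution. Once the user has an IP address, requests go straight to the hosted service (step 6).
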
4.1.1 Load balancing methods in Windows Azure Traffic Manager
Three load balancing methods are available in Traffic Manager. Only one method can be chosen per policy, but multiple policies can be declared for a single hosted service.

4.1.1.1 Performance
When a single application is deployed to multiple hosted services, each in a different region, the Performance load balancing method determines where the traffic originates and routes it to the closest datacenter. Because requests are served near the user, the application achieves the best possible performance.

Traffic Manager determines the closest datacenter using a network performance table that holds the round-trip times between various IP addresses and each Windows Azure datacenter. This table is updated periodically to reflect current performance across the Internet.

The Performance method does not take the load on a particular datacenter into account, even when that load is heavy. It considers only which datacenter is closest to the user and routes accordingly.
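
The selection rule can be sketched in a few lines of Python. The round-trip times below are invented for illustration and the function name is my own; the real table is maintained by Traffic Manager and is not something the application supplies.

```python
def closest_hosted_service(round_trip_ms):
    """Pick the hosted service with the lowest round-trip time (milliseconds)."""
    if not round_trip_ms:
        raise ValueError("no hosted services to choose from")
    return min(round_trip_ms, key=round_trip_ms.get)


if __name__ == "__main__":
    # Hypothetical measurements for a user located in Europe.
    sample_table = {
        "salesorderus.cloudapp.net": 180,
        "salesordereurope.cloudapp.net": 35,
        "salesorderasia.cloudapp.net": 240,
    }
    print(closest_hosted_service(sample_table))
```

Note that the sketch, like the method it illustrates, looks only at latency and never at how busy each deployment is.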

4.1.1.2 Failover
When a policy is declared with the Failover load balancing method, the first hosted service in the list is treated as the primary hosted service, and the subsequent hosted services follow in order of preference.

When the primary hosted service is offline, for example because the datacenter goes down or the region suffers major damage, the next hosted service in the list takes over and responds to user requests. When both the first and second hosted services are offline, requests are routed to the third hosted service in the list, and so on.

This method helps meet the high availability requirements of enterprise applications.
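
A minimal Python sketch of this ordering rule follows. The function name and the `is_online` callback are hypothetical stand-ins for the monitoring status that Traffic Manager keeps for each hosted service.

```python
def failover_target(ordered_services, is_online):
    """Return the first online service in priority order, or None if all are down."""
    for service in ordered_services:
        if is_online(service):
            return service
    return None


if __name__ == "__main__":
    services = [
        "salesorderus.cloudapp.net",      # primary
        "salesordereurope.cloudapp.net",  # second preference
        "salesorderasia.cloudapp.net",    # third preference
    ]
    offline = {"salesorderus.cloudapp.net"}
    print(failover_target(services, lambda service: service not in offline))
```

As soon as the primary comes back online it is chosen again, because the list is always scanned from the top.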

4.1.1.3 Round Robin
This method splits the incoming user traffic across the hosted services, so that all of them receive an equal share of the load. Traffic Manager keeps track of the last hosted service that received traffic and sends the next request to the next one in the policy's list.

When monitoring is set up for the policy, Traffic Manager does not route traffic to hosted services that are offline.
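
The rotation, including the skipping of offline services, can be sketched as follows in Python. The class and method names are my own, and the `is_online` callback again stands in for the monitoring status; this is an illustration of the idea rather than Traffic Manager's actual implementation.

```python
class RoundRobinPolicy:
    """Hand out hosted services in turn, skipping any that are offline."""

    def __init__(self, services):
        self._services = list(services)
        self._next_index = 0  # position of the service to try next

    def next_service(self, is_online=lambda service: True):
        # Try each service at most once per call, starting after the last pick.
        for _ in range(len(self._services)):
            candidate = self._services[self._next_index]
            self._next_index = (self._next_index + 1) % len(self._services)
            if is_online(candidate):
                return candidate
        return None  # every service is offline (or the list is empty)


if __name__ == "__main__":
    policy = RoundRobinPolicy(["us", "europe", "asia"])
    print([policy.next_service() for _ in range(4)])
```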

4.1.2 Monitoring the hosted service using Traffic Manager
Windows Azure Traffic Manager can monitor the hosted services to ensure they are available. Monitoring must be set up for every policy.

When “/” is chosen as the path to monitor, Traffic Manager tries to access the default directory of each service in the policy.

To monitor a specific path and file name, follow these steps:
1. Create a file with the same name on each hosted service that you plan to include in the policy.
2. Allow Traffic Manager to perform an HTTP(S) GET on the file.
3. Specify the monitoring endpoint in the Specify a monitoring endpoint section of the Create Traffic Manager Policy screen. Three values are required:
  - Protocol – Whether the file is accessed over http or https. (Ex: http)
  - Port – The port on which the application runs, or the port used to request the file. (Ex: 80 for http)
  - Relative path – The path and name of the file that the monitoring system will attempt to access. (Ex: /WATMMonitorfile.htm)
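
Before activating a policy, it can be useful to check by hand that the monitoring file is reachable on each hosted service. The Python sketch below builds the URL from the three settings and performs a GET using the standard library. The function names, the 10-second timeout and the choice to treat only HTTP status 200 as online are assumptions of this sketch, not documented Traffic Manager behaviour.

```python
import urllib.error
import urllib.request


def monitor_url(protocol, host, port, relative_path):
    """Build the URL to probe from the three monitoring settings."""
    if protocol not in ("http", "https"):
        raise ValueError("protocol must be http or https")
    if not relative_path.startswith("/"):
        relative_path = "/" + relative_path
    return f"{protocol}://{host}:{port}{relative_path}"


def is_online(url, timeout_seconds=10):
    """Return True if a GET on the URL answers with status 200 (assumed rule)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection failures, timeouts and HTTP error codes count as offline.
        return False


if __name__ == "__main__":
    url = monitor_url("http", "salesorderus.cloudapp.net", 80, "/WATMMonitorfile.htm")
    print(url, "online" if is_online(url) else "offline")
```

Running the same check against every hosted service in the policy confirms that step 1 was completed on each of them.
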
4.1.3 Monitor poll state alert
Traffic Manager displays the health of policies and hosted services in the Management Portal. The poll state column displays the most recent monitoring status of the Traffic Manager policies. This status helps you understand the health of the domains according to the Traffic Manager monitoring settings. When the policy is healthy, DNS queries are distributed to the hosted services based on the policy selected.