Cloud-based services offer greater flexibility and economy than many traditional information services. But can they meet or even exceed end-users’ expectations for reliability and availability?
Cloud computing offers a compelling business model for information services. Consequently, many new applications are being developed explicitly for cloud deployment, while many traditional applications will eventually evolve to the cloud. End users want these cloud-based services to be at least as reliable and available as traditional offerings. And to meet these expectations, cloud service providers and cloud consumers need to gain a solid understanding of the unique challenges of cloud computing and learn how to mitigate risks.
The new challenges are primarily related to virtualization, rapid elasticity and resource sharing. These capabilities enable a new level of flexibility, convenience and economy, but they also make cloud computing inherently more complicated than traditional computing. This complexity adds more areas for potential failures.
Delivering reliable and available cloud-based services must start with an awareness of how operations have changed in the cloud, including recognition of where new points of vulnerability lie. For example, load distribution, overload control and data management are all more complex in the cloud, and new usage models enabled by cloud computing can increase the impact of a site or server failure.
After carefully identifying these issues, cloud service providers and cloud consumers can then take advantage of architectural opportunities for mitigating the risks. When this approach is backed by traditional engineering diligence, cloud-based services have the potential to meet or exceed the service reliability and availability requirements of traditional deployments. Satisfying these requirements can be crucial for all players in the cloud environment, where accountability is often split between cloud service providers and cloud consumers — and where standards bodies are still working to establish clear outage measurement rules.
Evolving to the cloud
To benefit from elastic growth and other new capabilities offered by cloud computing, many traditional applications will evolve to a cloud environment over several releases. The following usage scenarios, organized from the simplest to the most complex, illustrate the advantages that virtualization enables along the way.
- Hardware independence – Virtualization minimizes the dependence of an application on its underlying hardware. The application may still require the same machine instruction set, such as Intel's. But its software is decoupled from hardware-based details, such as physical memory and storage, so the application can be easily moved onto new hardware.
- Server consolidation – In this case, virtualization increases resource utilization because multiple applications can share hardware resources, including previously underutilized hardware.
- Multi-tenant – With this usage scenario, multiple independent instances of an application, such as e-mail or web service, can be consolidated on a single virtualized platform. The instances are then available simultaneously to diverse user communities.
- Virtual appliance – In this vision of virtualization, defined by the Distributed Management Task Force (DSP2017), applications are delivered as turnkey software, prepackaged with operating systems, protocol stacks and supporting software. This approach allows suppliers to thoroughly test the production configuration of all system software, while customers enjoy simpler installation and maintenance.
- Cloud deployment – This usage scenario includes rapid elasticity and is the typical endpoint of the evolution of an application to the cloud. It offers the most flexible configuration, which can expand or contract automatically in response to changing workloads.
Understanding the impact
While the new usage scenarios of the cloud deliver important benefits, they also present new challenges. For example:
Co-residency: This type of server consolidation usage model makes it more difficult to predict application performance. Vulnerability to service impairments due to “noisy neighbor” applications is greater.
These challenges can be mitigated with a fully tested, high-availability architecture that supports failure containment and recovery of each of the applications.
Multi-tenancy: Multi-tenancy has the same cost benefits and challenges as co-residency. But the challenges are more pronounced because failures may impact multiple user populations. Multi-tenancy also poses a greater security challenge, as user populations must be kept completely separate.
To mitigate these challenges, a high-availability architecture is required. It should support rigid failure containment and independent service recovery. Workflows should be tested under various failure scenarios. Robustness testing must ensure that each tenant is appropriately isolated; security testing should make sure that there is no cross-tenant access to applications or resources.
Establishing new ways to handle service load
In a cloud environment, service load can potentially be distributed seamlessly across multiple servers, locations and cloud providers, with the assistance of load balancing mechanisms and policies. The challenge is to satisfy wide-ranging requirements, such as subscriber affinity, redundancy, latency, availability, security, capacity, and even regulatory issues.
For example, appropriate load distribution architecture should consider the number of application instances, their proximity to end users, and application and data redundancy. Policies must also be clearly defined, so service distribution can be managed in accordance with latency, regulatory and security requirements. The distance between data centers should be considered, too, particularly when data exchanges are frequent and high transactional reliability is required.
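The policy-driven selection described above can be sketched in a few lines of Python. This is a minimal illustration, not a production load balancer: the `Instance` fields, the region names, the 0.8 load ceiling and the instance names are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instance:
    name: str
    region: str        # where the instance and its data reside (assumed label)
    latency_ms: float  # measured round-trip latency to the requesting user
    load: float        # current utilization, 0.0 to 1.0

def pick_instance(instances, allowed_regions, max_latency_ms, max_load=0.8):
    """Return the lowest-latency instance that satisfies every policy, or None."""
    eligible = [
        i for i in instances
        if i.region in allowed_regions       # regulatory policy
        and i.latency_ms <= max_latency_ms   # latency policy
        and i.load < max_load                # capacity policy
    ]
    return min(eligible, key=lambda i: i.latency_ms, default=None)

# Hypothetical deployment: two European instances and one in the US.
instances = [
    Instance("app-eu-1", "eu", 40.0, 0.50),
    Instance("app-eu-2", "eu", 25.0, 0.95),
    Instance("app-us-1", "us", 15.0, 0.30),
]
choice = pick_instance(instances, allowed_regions={"eu"}, max_latency_ms=100.0)
print(choice.name)
```

Here the fastest instance overall is excluded by the regional policy and the fastest European instance is excluded because it is nearly saturated, so the request goes to the remaining eligible instance.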
Overload control: To handle overload events, traditional systems set capacity thresholds, then shed or reject traffic as needed to keep the system from crashing. Cloud management mechanisms, however, can add new instance(s) of the application, to share the growing traffic load. For example, rapid elasticity can be used to address traffic spikes and shorten the time a system is in overload as extra service capacity is brought online. Native overload control mechanisms should also be present to handle any excess traffic during the interval before scaling activates and the new instances are sharing the traffic. In addition, the mechanisms should be there to manage traffic when the offered load exceeds maximum elastic capacity (for example, license or policy limits).
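As a rough sketch of how native overload control and elastic growth can work together, the following Python function models one measurement interval. The capacity figures, the 80 percent scaling trigger and the assumption that a requested instance comes online in the next interval are illustrative choices, not values from any particular platform.

```python
def handle_interval(offered, capacity_per_instance, instances,
                    max_instances, scale_threshold=0.8):
    """Return (served, shed, instances_next) for one measurement interval."""
    capacity = capacity_per_instance * instances
    served = min(offered, capacity)
    shed = offered - served  # native overload control rejects the excess
    instances_next = instances
    # Request one more instance when load nears capacity, up to the
    # license or policy limit; it is assumed to be online next interval.
    if offered > scale_threshold * capacity and instances < max_instances:
        instances_next = instances + 1
    return served, shed, instances_next

# Hypothetical traffic spike, in requests per interval.
instances = 1
for offered in [50, 120, 250, 250, 400]:
    served, shed, instances = handle_interval(offered, 100, instances, max_instances=3)
    print(f"offered={offered} served={served} shed={shed} next_instances={instances}")
```

Traffic is shed only while the new capacity is coming online or once the elastic limit of three instances has been reached, which are the two cases where the text calls for native overload mechanisms.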
Rapid elasticity: Besides supporting overload control, this powerful mechanism enables more efficient use of hardware resources. It can automatically increase or decrease the resources of a virtual machine (vertical growth), or expand or reduce the number of virtual machines (horizontal growth). Horizontal growth can occur within the limits of a single data center or extend into an additional data center. Outgrowth expands capacity by adding resources in other cloud data centers.
Effective use of rapid elasticity is based on resource monitoring, policies and thresholds. Hysteresis (that is, different growth and shrink thresholds) should be used to prevent capacity oscillations. To mitigate the risks associated with rapid elasticity, systems must be thoroughly tested and cloud-based applications must be designed to:
- Manage scaling and de-scaling
- Accurately monitor resource utilization and performance
- Support well-defined policies, backed by robust trigger mechanisms to control growth and contraction
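The effect of hysteresis can be shown with a small simulation. The sketch below assumes each instance handles 100 units of load and uses illustrative thresholds (grow above 80 percent utilization, shrink below 40 percent); real thresholds would come from the monitoring data and policies mentioned above.

```python
def next_instance_count(current, utilization, grow_at=0.80, shrink_at=0.40,
                        min_instances=1, max_instances=10):
    """Apply separate growth and shrink thresholds to one utilization sample."""
    if utilization > grow_at and current < max_instances:
        return current + 1
    if utilization < shrink_at and current > min_instances:
        return current - 1
    return current

def simulate(load, start, steps, **thresholds):
    """Track the instance count under a constant load (100 units per instance)."""
    counts = [start]
    for _ in range(steps):
        utilization = load / (100.0 * counts[-1])
        counts.append(next_instance_count(counts[-1], utilization, **thresholds))
    return counts

print(simulate(170, 2, 5))                  # separate thresholds (hysteresis)
print(simulate(170, 2, 5, shrink_at=0.80))  # one shared threshold
```

With separate thresholds the instance count grows once and then holds steady. With a single shared threshold the same constant load makes the count alternate between two and three instances on every sample, which is the capacity oscillation that hysteresis is meant to prevent.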
Addressing widespread data storage
For service reliability, all data must be redundantly stored and managed to survive the failure of a component. In addition, data synchronization presents new challenges, because cloud transactions can span multiple application instances and be stored in several locations. ACID and BASE mechanisms are typically used to keep data synchronized.
- ACID (atomicity, consistency, isolation, durability) properties are essential for transactional reliability and immediate consistency. However, they can be resource intensive and introduce latency into transactions.
- BASE (basically available, soft state, eventual consistency) properties enable simpler solutions that are less resource intensive. They are appropriate when data consistency can be achieved over longer time periods. For example, they are well suited to many web services, such as e-mail.
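The contrast between the two approaches can be illustrated with a short Python sketch. The ACID half uses the standard `sqlite3` module, where a transfer either commits both updates or rolls both back. The BASE half uses a last-writer-wins merge of two replicas, which is only one simple reconciliation strategy chosen here for illustration. The account names, balances and mailbox keys are hypothetical.

```python
import sqlite3

def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (name TEXT PRIMARY KEY, "
        "balance INTEGER CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """ACID: credit and debit commit together, or neither is applied."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:  # the CHECK constraint rejected an overdraft
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))

def merge(replica_a, replica_b):
    """BASE: last-writer-wins merge of {key: (timestamp, value)} maps."""
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

db = make_db()
print(transfer(db, "a", "b", 150), balances(db))  # overdraft attempt
print(transfer(db, "a", "b", 30), balances(db))   # valid transfer

# Two mail replicas that accepted writes independently, then synchronized.
r1 = {"inbox": (1, "2 unread")}
r2 = {"inbox": (2, "3 unread"), "draft": (1, "hello")}
print(merge(r1, r2) == merge(r2, r1))
```

The failed transfer leaves both balances untouched even though the credit was executed before the debit failed. The replicas, by contrast, may disagree for a while and converge only after they exchange state, which is acceptable for services such as e-mail.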
Using additional mechanisms for high availability
Cloud services should be redundant at the software and hardware levels and incorporate high-availability mechanisms at their foundation, including automatic failure detection, reporting and recovery mechanisms. To enhance the internal mechanisms, the virtualization platform can provide an additional layer of failure detection and recovery at the virtual machine level. Architects must ensure that the two mechanisms can peacefully coexist and don’t collide during failure recovery.
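One possible way to keep the two layers from colliding is to have the platform-level monitor defer to the application for a bounded grace period before it escalates. The sketch below is only an illustration of that idea under stated assumptions: the function name, the action labels and the 30-second grace period are hypothetical and do not describe any specific virtualization platform.

```python
def recovery_action(seconds_since_failure, app_recovery_in_progress,
                    grace_period=30.0):
    """Decide what the VM-level monitor should do about a failed instance."""
    if app_recovery_in_progress and seconds_since_failure < grace_period:
        # Let the application's own failover finish without interference.
        return "wait"
    # Application-level recovery is absent or has stalled: escalate.
    return "restart_vm"

print(recovery_action(5.0, app_recovery_in_progress=True))
print(recovery_action(45.0, app_recovery_in_progress=True))
print(recovery_action(5.0, app_recovery_in_progress=False))
```

The design choice here is a simple ordering rule: the faster, finer-grained application mechanism acts first, and the coarser virtual machine restart is used only as a fallback.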
Dealing with latency challenges
For isochronous applications like video calling, it’s crucial to prevent latencies that disrupt service quality. But with virtualized configurations, resource contention, real-time notification latency, and virtualization overhead can all add latency. To address these issues, architects need to take the following actions:
- Carefully identify the real-time isochronous expectations for a virtualized platform. For example, the maximum notification latency must explicitly represent how “late” a real-time notification interrupt can be.
- Determine whether the target platform or infrastructure service can actually meet the identified requirements.
- Establish a recommended architecture and configuration for optimal isochronous performance on the specified platform or infrastructure service.
- Prototype and test the service to validate whether it is technically feasible to meet its requirements on a virtualized platform.
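A prototype for the last step can start with something as simple as measuring how late periodic wakeups arrive on the candidate platform. The following Python sketch uses only the standard `time` module; the 10 ms period, the tick count and the 2 ms budget are hypothetical, and a real test would use the application's own timing mechanism and run under realistic load.

```python
import time

def lateness_stats(scheduled, actual):
    """Return (maximum, mean) lateness in seconds for paired timestamps."""
    late = [a - s for s, a in zip(scheduled, actual)]
    return max(late), sum(late) / len(late)

def measure(period_s=0.01, ticks=50):
    """Schedule periodic wakeups and record when each one actually happens."""
    start = time.monotonic()
    scheduled, actual = [], []
    for n in range(1, ticks + 1):
        target = start + n * period_s
        delay = target - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        scheduled.append(target)
        actual.append(time.monotonic())
    return lateness_stats(scheduled, actual)

def meets_requirement(max_lateness_s, budget_s):
    """Compare the worst observed lateness with the stated latency budget."""
    return max_lateness_s <= budget_s

worst, mean = measure()
print(f"worst={worst * 1000:.3f} ms mean={mean * 1000:.3f} ms")
print("within 2 ms budget:", meets_requirement(worst, 0.002))
```

The worst-case figure, not the mean, is the number to compare against the maximum notification latency identified in the first step, since a single late wakeup can be enough to disrupt an isochronous stream.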
For an in-depth analysis of these challenges, with recommendations for mitigating risks, see Reliability and Availability of Cloud Computing, which will be published by Wiley-IEEE Press in summer 2012.
Maintaining traditional engineering diligence
Cloud computing introduces new technologies with unique benefits and risks. But it does not change the basic structure or importance of the engineering diligence required to maintain reliability and availability. The process of maintaining this diligence can be summarized in the following steps:
- Clearly define service reliability and availability requirements.
- Model and analyze overall solution architecture to ensure that it is capable of meeting reliability requirements over the long term.
- Carry out reliability diligence on individual components to make sure they can meet the overall solution requirements.
- Test the solution thoroughly and make sure that automated methods of failure detection and recovery work effectively.
- Track the performance of the solution in the field and follow up with corrective actions as needed.
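The modeling step can begin with basic availability arithmetic. The sketch below uses the standard series and parallel formulas and assumes that component failures are independent, which a real analysis would need to verify. The component availabilities are hypothetical figures chosen for illustration.

```python
def series(*availabilities):
    """All components are required, so their availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(availability, n):
    """n redundant copies where any one suffices (independent failures assumed)."""
    return 1.0 - (1.0 - availability) ** n

def downtime_minutes_per_year(availability):
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60

# Hypothetical solution: a redundant pair of application servers (0.99 each)
# in series with a database tier (0.9995).
solution = series(parallel(0.99, 2), 0.9995)
print(f"solution availability: {solution:.6f}")
print(f"expected downtime: {downtime_minutes_per_year(solution):.1f} minutes per year")
```

A model like this makes it easy to see which component dominates the downtime budget and therefore where the component-level diligence in the next step should focus.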
When this diligence process is applied to mitigate both traditional risks and the new challenges of the cloud, cloud-based services have the potential to meet or exceed the service reliability and availability requirements of traditional deployments.
To contact the authors or request additional information, please send an e-mail to techzine.editor@alcatel-lucent.com.

Reliability and availability are important factors in offering cloud computing services, and I believe that when we drill down to identify the underlying paradigms and concepts associated with them, we find that resource management and infrastructure optimization are of core importance. This creates a need for consistent monitoring of all kinds of objects and attributes across the whole implementation, in an agile and continuous manner.
Cloud-based applications rely on the (virtualized) compute, memory, storage and networking resources offered by the cloud service provider. Disruptions, delay and degradation in delivering those resources (e.g., fewer CPU cycles, reduced network or disk I/O) can impact the user service exposed by the applications using those resources. Thus, for applications to deliver consistently acceptable service to their end users it is important for the underlying resources offered by cloud to consistently deliver acceptable service as well.
Dear Editor:
1. I do not agree with exceeding end-user’s expectation as in ‘But can they meet or even exceed end-users’ expectations..’
2. We need to use abbreviations more systematically. The ‘BASE (basically available, soft state, eventual consistency)’ better be adjusted to “BASE (Basic, available, soft-state, and eventual-consistency)”.
HFR
In general, users expect ‘new’ service offerings that are substantially the same as traditional offerings to be at least as good as those traditional offerings. For example, the baseline for customers’ service quality expectations for (newer) IP-TV is often the (older) cable TV service that they previously had, and customers will compare service accessibility, retainability and performance of 4G/LTE wireless service to the 3G or 2G service that they previously had. In some cases services are so different (e.g., wireless telephony v. wireline telephony) that other service characteristics (e.g., mobility) are more important than differences in service quality, reliability or availability. However, from an end user perspective, cloud deployment is often not about a better (e.g., cheaper, faster, more agile, more elastic) way for an enterprise to deliver application services to end users. In fact, most end users may neither recognize nor care if a service is delivered from a traditional data center or from a cloud, so long as the service quality, reliability, latency and availability can meet or exceed their expectations.