
Optimizing Cloud Data Centers through Machine Learning

Background
Many companies are now moving towards a private cloud following a
SaaS (Software as a Service) model. The company centralizes all its applications
and provides some means for users to access them. A very popular
choice is to use a cluster of hypervisor servers (such as VMware ESXi or Microsoft Hyper-V) to host a farm of Citrix servers that in turn host the required applications. In
such cloud systems, various algorithms are at play at different layers: the
hypervisors, the virtual machines and the application virtualization layer. This
makes it nearly impossible to deterministically predict end-user performance under
different conditions.
However, if IT engineers want to optimize such an infrastructure in terms of server
consolidation and power consumption, they need to predict application
performance under different circumstances with a relatively small error. To make
the process practical, they also have to be able to predict application
performance, with high confidence, from server metrics alone rather than by
polling individual users for feedback. Hypervisor vendors do provide
clustering mechanisms that share resources fairly among virtual machines, but the
algorithms behind these mechanisms are very simple. The typical approach is to
measure the current load (resource consumption) on each server; virtual
machines are then migrated between clustered servers when the difference in load
between servers crosses a pre-determined threshold, as sketched below. The major
drawback is that such an approach is inherently reactive and cannot be used to
provide guaranteed performance.
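The following minimal Python sketch illustrates this reactive, threshold-based scheme;
the server names, the CPU-only load metric and the 20% threshold are my own
illustrative assumptions, not the behaviour of any particular hypervisor product.

    # Reactive load balancing: act only after an imbalance has appeared.
    THRESHOLD = 0.20  # assumed imbalance threshold (20 percentage points)

    def pick_migration(cluster_load):
        """cluster_load maps server name -> CPU utilisation in [0, 1]."""
        busiest = max(cluster_load, key=cluster_load.get)
        idlest = min(cluster_load, key=cluster_load.get)
        if cluster_load[busiest] - cluster_load[idlest] > THRESHOLD:
            # A real scheduler would now choose which VM to move; here we
            # only report the source and destination of a migration.
            return busiest, idlest
        return None  # no migration: imbalance still below threshold

    print(pick_migration({"esx1": 0.85, "esx2": 0.55, "esx3": 0.40}))

Because the decision is taken only after the load difference has already crossed
the threshold, the scheme reacts to congestion instead of anticipating it, which is
exactly the drawback noted above.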
In hybrid clouds, the term cloud bursting refers to the scenario where an
application scales out into the public cloud when a surge in demand exhausts the
resources of the private cloud. This adds another dimension to the
problem because companies have to pay for public cloud usage.
Cloud bursting can also be optimized if application performance is well understood.
In public clouds, a major concern is to provide a guaranteed level of performance
while allocating as few resources as possible.
Thus there are multiple goals that can be achieved only if application performance
can be understood and guaranteed from server metrics alone. Yet such a
complete understanding of application performance under different load conditions
eludes even the most experienced IT engineers.
Research Aim

The primary goal of my research is to provide a clear and complete understanding
of application performance in cloud infrastructures, even to a novice engineer. This
will have two main effects on the cloud data center:
1. Reduced operating costs in terms of infrastructure investment, power and
cooling requirements, and management.
2. Guaranteed end-user performance.
Research Methodology
I want to use machine learning (ML) algorithms, especially artificial neural networks,
to achieve these goals. The major reason for using ML is that it supports
black-box modelling: typically no information about the internals of the application
is required. This will allow me to create a generic model.
I have previously employed linear regression to study this problem. The linear
regression model was able to relate server metrics to end-user performance with a
relatively small error in the 50-70 second response-time zone. This
was an encouraging result, indicating that more advanced algorithms such as
artificial neural networks can produce more accurate results.
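As a rough illustration of that earlier experiment, the sketch below fits a linear
model from server metrics to measured end-user response times; the file name and
column names are hypothetical placeholders for the data set I collected.

    # Relating server metrics to end-user response time with linear regression.
    # Data file and column names are assumed, not real.
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("server_metrics.csv")           # hypothetical data set
    X = data[["cpu_pct", "mem_pct", "disk_iops", "net_mbps"]]
    y = data["response_time_s"]                        # end-user response time

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("Mean absolute error (s):",
          mean_absolute_error(y_test, model.predict(X_test)))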
Dr. Anshul Gandhi of Stony Brook University has done a similar study in which he
modeled the performance of three-tier web applications [Adaptive, Model-driven
Autoscaling for Cloud Applications]. He used queueing theory and an EKF (Extended
Kalman Filter) to model the application. However, as he explained to me in an
email, this approach was taken because web applications are better modeled this
way. Also, since prior information about the application was available, the approach
was used as a grey-box model. This can serve as an alternative method which can
then target individual applications. However, in order to model the
system as a whole, ML methods are probably better, and the results should
generalize well to arbitrary applications. Furthermore, as Dr. Gandhi explained to
me, the response time of a single application was assumed to be related in a
simple form to the inverse of the CPU load, or to the ratio of the number of jobs in
the system to the arrival rate. However, this would also necessitate monitoring
individual virtual machines rather than the server as a whole.
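To make that assumed relation concrete, the sketch below shows the two standard
queueing forms it suggests: Little's law (response time as jobs in the system divided
by the arrival rate) and the M/M/1 result, where response time grows with the
inverse of the spare capacity. This only illustrates the general shape of the relation,
not the exact model used in Dr. Gandhi's work, and the numbers are made up.

    # Two textbook relations between load and response time (illustrative only).

    def little_response_time(jobs_in_system, arrival_rate):
        # Little's law: mean response time R = N / lambda
        return jobs_in_system / arrival_rate

    def mm1_response_time(service_rate, arrival_rate):
        # M/M/1 queue: R = 1 / (mu - lambda); response time blows up as the
        # arrival rate approaches the service capacity, i.e. it varies with
        # the inverse of the spare CPU capacity.
        return 1.0 / (service_rate - arrival_rate)

    print(little_response_time(jobs_in_system=12, arrival_rate=4.0))  # 3.0 s
    print(mm1_response_time(service_rate=10.0, arrival_rate=8.0))     # 0.5 s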
Once application performance is modeled with a low error margin, further ML
methods such as K-means clustering or SVMs (Support Vector Machines) can be used to
cluster virtual machines together for optimal performance, as sketched below. These
two methods can also be used to build a simulator that verifies changes in system
parameters due to changes in the environment.
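As a minimal sketch of the clustering step, the example below groups virtual
machines by their resource profiles with K-means; the feature set, the number of
clusters and the sample values are assumptions made purely for illustration.

    # Group VMs by resource profile with k-means (illustrative data).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # One row per VM: mean CPU %, mean memory %, mean network Mbps.
    vm_metrics = np.array([
        [80, 60, 120],
        [75, 65, 110],
        [20, 30,  10],
        [25, 35,  15],
        [50, 90, 300],
    ])

    X = StandardScaler().fit_transform(vm_metrics)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # VMs sharing a label have similar demand profiles and can
                   # be placed together (or deliberately separated) accordingly.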
Reflections

A major obstacle in the proposed study is that obtaining the best results would
probably require recurrent neural networks. Recurrent networks are notoriously
difficult to train, and I do not have much experience with them.
Conclusion
I intend to statistically relate server metrics to end-user application performance in
a cloud infrastructure. Such relationships can be used to optimize cloud
infrastructure. The major benefits are reduced costs and guaranteed performance
levels.
