Thesis

This section of the Web site provides thesis and project proposals for students. The main topics are Cloud computing, energy management in Cloud service centers, and performance evaluation of virtualized systems.

If you are interested in one of the proposals, contact me by e-mail.

Capacity planning in shared Spark and Hadoop clusters

Recent years have seen a rapid growth of interest in enterprise applications built on top of data-intensive technologies such as MapReduce/Hadoop, Spark, NoSQL databases, and stream processing systems fed by mobile and sensor data. Moreover, Cloud platform services for Big Data (e.g., Amazon Elastic MapReduce, S3, Kinesis; Microsoft HDInsight) are now creating massive growth opportunities for software vendors to develop and sell novel data-intensive cloud applications in various market segments, from predictive analytics to environmental monitoring, from e-government to smart cities. Since the software development market is expected to be dominated by data-intensive cloud applications in the coming years, there is now an urgent need for novel, highly productive software engineering methodologies capable of supporting the design of data-intensive applications. The focus of the thesis is to define a quality-driven framework for developing data-intensive applications that leverage Big Data technologies hosted in private or public clouds. The thesis will develop a methodology for the capacity planning of Spark- and MapReduce-based applications in shared clusters.
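
To give a first flavour of what such capacity planning involves, here is a minimal Python sketch (all job figures are hypothetical) that derives the smallest number of cluster slots letting a MapReduce job meet a deadline, using a crude bound based on the total task work. The actual methodology will rely on far more accurate performance models; the sketch only fixes ideas.

    import math

    def min_slots(n_map, t_map, n_red, t_red, deadline):
        """Smallest number of slots for which the crude completion-time
        bound (n_map*t_map + n_red*t_red) / slots meets the deadline."""
        work = n_map * t_map + n_red * t_red  # total task-seconds
        return math.ceil(work / deadline)

    # Hypothetical job profile: 400 map tasks of 30 s each, 50 reduce
    # tasks of 60 s each, and a 10-minute deadline.
    print(min_slots(400, 30.0, 50, 60.0, 600.0))  # -> 25 slots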

The thesis will be developed within the context of the DICE H2020 European project:

http://www.dice-h2020.eu

Pre-requisites: Performance evaluation, Optimization models (linear and non-linear programming), Spark, Hadoop.

Petri Net Models to Estimate Big Data Application Performance

Nowadays more and more companies deal with large amounts of raw, unstructured data during business operations. This trend is fostering the emergence of the Big Data market, which is growing at a fast 27% worldwide compound annual growth rate through 2017 and, in Europe, at a 31.96% compound annual growth rate through 2016. Moreover, nearly 40% of Big Data worldwide will likely be hosted on public Clouds by 2020, while Hadoop is expected to touch half of the world's data during the same period. Apache Hadoop, an open-source implementation of the MapReduce framework, holds a central role in the Big Data paradigm and is widely adopted in the industry to process huge datasets. Alongside Hadoop, whose I/O-bound workflow mainly targets batch processing, in-memory frameworks such as Spark are now available, enabling faster execution of iterative algorithms, e.g., regression, classification, and other machine learning applications.

Cost-effectiveness considerations encourage sharing computational clusters among heterogeneous classes of workloads, but this practice gives rise to difficulties in performance prediction. Furthermore, real-world applications are usually bound to meet Service Level Agreements providing, e.g., an upper bound for query execution time, thus requiring careful resource allocation. Among other approaches, multi-class systems can be studied by adopting Petri Nets. At the expense of significant computational complexity, these tools allow for great accuracy in performance prediction. In addition, Petri Nets appear to be good abstractions for data-intensive applications: a token circulating in the model naturally represents a request being processed, while atomic fork/join operations and colors can be profitably exploited to express at the same time the memory, disk read/write operations, network/stream traffic, and other concurrent operations that a single request implies on the available computational resources.

The aim of this thesis is the development and validation of Petri Net models through an experimental campaign conducted on Hadoop and Spark, so as to quantify the achieved precision in performance prediction and to reach a good balance with simulation times. The results will then be exploited to solve the capacity allocation problem and to assess the possible advantages of admission control. In particular, these models will address the multi-class admission control and capacity allocation problem, with possible extensions to design-time or run-time applications, such as tools for optimal resource sizing, application development, or scheduling. The thesis will be developed within the H2020 EU project DICE (http://www.dice-h2020.eu).
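
To fix ideas on the fork/join behaviour that a timed Petri Net would capture for a MapReduce stage, here is a minimal Monte Carlo sketch in Python (not a Petri Net solver; the task counts and service times are hypothetical):

    import random

    def stage_time(n_tasks, slots, mean_service):
        """Fork/join stage: n_tasks exponentially distributed tasks are
        list-scheduled on 'slots' servers; the join fires when the last
        task (on the most loaded server) finishes."""
        finish = [0.0] * slots
        for _ in range(n_tasks):
            s = min(range(slots), key=lambda i: finish[i])
            finish[s] += random.expovariate(1.0 / mean_service)
        return max(finish)

    random.seed(1)
    runs = [stage_time(100, 16, 20.0) for _ in range(1000)]
    print(sum(runs) / len(runs))  # mean stage completion time (s)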

Machine Learning Techniques to Model Data-Intensive Application Performance

Nowadays Big Data is becoming more and more important, and many sectors of our economy are now guided by data-driven decision processes. Big Data and business intelligence applications are facilitated by the MapReduce programming model while, at the infrastructure layer, cloud computing provides flexible and cost-effective solutions for allocating large clusters on demand. In order to use such resources efficiently, a model of these systems that is at the same time accurate and efficient to use is required.

Black-box machine learning techniques can provide models that are accurate and scalable at the same time.

This thesis involves the development and validation of models for Big Data clusters (Hadoop 2.x/Spark) obtained by applying Support Vector Regression to empirical data. Support Vector Regression is a technique akin to neural networks, but it is more robust to noise and outliers in the training data and enjoys solid theoretical properties.
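
As an illustration of the approach, the following sketch trains a Support Vector Regression model with scikit-learn on synthetic data standing in for empirical profiling measurements (the features and figures are hypothetical):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    # Synthetic stand-in for profiling data: each row holds (cores,
    # dataset size in GB, concurrent jobs); the target is the observed
    # job execution time in seconds.
    rng = np.random.default_rng(0)
    X = rng.uniform([4, 1, 1], [64, 100, 10], size=(200, 3))
    y = 50 + 8 * X[:, 1] / X[:, 0] + 12 * X[:, 2] + rng.normal(0, 5, 200)

    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))
    model.fit(X[:150], y[:150])

    # Predicted execution time for an unseen configuration.
    print(model.predict([[32.0, 80.0, 4.0]]))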

Game Theory Models for Cloud Resource Allocation

In recent years the evolution and widespread adoption of virtualization, service-oriented architectures, autonomic computing, and utility computing have converged, letting a new paradigm emerge: Cloud computing. Cloud computing aims at streamlining the on-demand provisioning of software, hardware, and data as services, providing end-users with flexible and scalable services accessible through the Internet. Since the Cloud offering is becoming wider and more attractive to business owners, the development of efficient resource provisioning policies for Cloud-based services becomes increasingly challenging. Indeed, modern Cloud services operate in an open and dynamic world characterized by continuous changes, where strategic interaction among different economic agents takes place. This thesis aims to study the hourly service provisioning and capacity allocation problem through the formulation of a mathematical model based on a noncooperative game-theoretic approach. We take the perspective of Software as a Service (SaaS) providers that want to minimize the costs associated with the virtual machine instances allocated in a multi-IaaS (Infrastructure as a Service) scenario, while avoiding penalties for request execution failures and providing quality of service guarantees. SaaS providers compete and bid for the use of infrastructural resources, while the IaaS providers want to maximize the revenues obtained by providing virtualized resources. The thesis will also focus on the uncertainty related to workload prediction and to the estimation of resource demands, leading to a “robust” game.
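
A toy Python sketch of the best-response dynamics such a game involves, with hypothetical cost functions (a congestion-dependent VM price plus an SLA penalty term that decreases with the rented capacity):

    def cost(x_i, x_other, penalty, p0=0.1, K=100.0):
        """Cost of one SaaS provider: the unit VM price grows with the
        total load placed on the IaaS, while the expected SLA penalty
        shrinks as the provider rents more capacity."""
        price = p0 * (1.0 + (x_i + x_other) / K)
        return price * x_i + penalty / x_i

    def best_response(x_other, penalty):
        return min(range(1, 201), key=lambda x: cost(x, x_other, penalty))

    # Two providers with different penalty weights iterate best responses.
    x1, x2 = 10, 10
    for _ in range(50):
        x1, x2 = best_response(x2, 40.0), best_response(x1, 90.0)
    print(x1, x2)  # an (approximate) equilibrium of the toy game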

Model Driven Design of Big Data Applications

Recent years have seen a rapid growth of interest in enterprise applications built on top of data-intensive technologies such as MapReduce/Hadoop, NoSQL databases, and stream processing systems fed by mobile and sensor data. Moreover, Cloud platform services for Big Data (e.g., Amazon Elastic MapReduce, S3, Kinesis; Microsoft HDInsight) are now creating massive growth opportunities for software vendors to develop and sell novel data-intensive cloud applications in various market segments, from predictive analytics to environmental monitoring, from e-government to smart cities. Since the software development market is expected to be dominated by data-intensive cloud applications in the coming years, there is now an urgent need for novel, highly productive software engineering methodologies capable of supporting the design of data-intensive applications.

The focus of the thesis is to define a quality-driven framework for developing data-intensive applications that leverage Big Data technologies hosted in private or public clouds. The thesis will develop a methodology and tools for data-aware quality-driven development. The work will focus on quality assessment, architecture enhancement, agile delivery and continuous monitoring of data-intensive applications.

Pre-requisites: Model Driven Methodologies, Performance evaluation, Optimization models (linear and non-linear programming, AMPL, etc.), Hadoop, Spark.

Strategies for Cloud Systems Run-time Adaptation

Cloud infrastructures live in an open world characterized by continuous changes in the environment and in the requirements they have to meet. These changes occur unpredictably and are beyond the control of the cloud provider. Nevertheless, cloud-based services must be provided under Service Level Agreements (SLAs) in terms of reliability, security, and performance. The aim of this thesis is to develop solutions for performance guarantees that dynamically adapt the resources of the cloud infrastructure in order to satisfy SLAs and minimize costs. Capacity allocation and load balancing techniques able to coordinate multiple distributed resource controllers working across geographically distributed cloud sites will be developed.
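
A minimal sketch of one such capacity allocation rule, assuming each VM is modelled as an M/M/1 queue and the load is split evenly among VMs (the monitored rates and the SLA figure are hypothetical):

    def vms_needed(arrival_rate, service_rate, sla_resp_time):
        """Smallest number of VMs such that the M/M/1 response time
        estimate R = 1 / (mu - lambda/n) meets the SLA."""
        n = 1
        while (arrival_rate / n >= service_rate
               or 1.0 / (service_rate - arrival_rate / n) > sla_resp_time):
            n += 1
        return n

    # Hypothetical monitored values: 180 req/s in total, 25 req/s served
    # per VM, and an SLA of 0.2 s on the average response time.
    print(vms_needed(180.0, 25.0, 0.2))  # -> 9 VMs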

Pre-requisites: Optimization models (linear and non-linear programming, AMPL, etc.), Performance Models (queueing networks), Game Theory, Java programming.

Filling the gap between Run-time Behaviour of Cloud Systems and the Corresponding Design Model

Models play a central role in software engineering. They may be used to reason about requirements and to identify possible missing parts or conflicts. They may be used at design time to analyze the effects and trade-offs of different architectural choices before starting an implementation, anticipating the discovery of defects that might otherwise be uncovered at later stages, when they can be difficult or very expensive to remove. They may also be used at run time to support continuous monitoring of the compliance of the running system with respect to the desired model. However, models are abstractions of real systems, hence design-time predictions need to be validated once the system is deployed in a real environment. For example, in cloud environments, the request mix assumed at design time might differ from the one observed in the production system, depending on customer preferences. Similarly, the performance or reliability profile of certain Cloud resources in practice may differ from the figures assumed at design time. The aim of this thesis is to define a feedback loop between the operational systems deployed in the cloud and software design. Quality and cost models used at design time will be kept alive at run time and refined by exploiting the information gathered by monitoring the underlying cloud system. The feedback loop will integrate run-time data into design-time models for fine tuning and will provide recommendations to the software designer to improve the design-time QoS and cost estimates.
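
As a small example of the kind of feedback step involved, the sketch below refines a design-time service demand estimate of a queueing model with monitored (utilization, throughput) samples via the utilization law D = U/X (all figures are hypothetical):

    def refined_demand(design_demand, samples, alpha=0.3):
        """Blend a design-time service demand with run-time estimates
        D = U / X (utilization law), exponentially smoothed to damp
        measurement noise."""
        d = design_demand
        for utilization, throughput in samples:
            d = (1 - alpha) * d + alpha * (utilization / throughput)
        return d

    # Design-time assumption: 50 ms of CPU per request; monitored
    # samples of (CPU utilization, throughput in req/s).
    samples = [(0.62, 10.0), (0.70, 11.5), (0.66, 10.8)]
    print(refined_demand(0.050, samples))  # refined demand in seconds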

Pre-requisites: Performance Models (queueing networks).

Energy Aware Policies for Federated Clouds

The reduction of carbon dioxide emissions targeted for the next years is fostering an increased utilization of renewable energy sources (green energy) and, more generally, a decreased environmental impact (carbon footprint) of human activities. ICT plays a key role in this greening process, as ICT solutions can greatly improve the environmental performance of other sectors of the world economy. However, the carbon emissions of the ICT sector itself also have to be carefully considered. Recent studies show that service centers account for 2-4% of global CO2 emissions, a share projected to reach up to 10% in 5-10 years, fuelled by the expected massive adoption of Cloud services. Nowadays, service centers consume as much power as medium-size cities, and Cloud providers are among the largest customers of electricity providers. One of the main challenges for the adoption of Cloud services is therefore to reduce their energy consumption and carbon emissions, while keeping up with the high growth rate of the associated data storage, server, and communication infrastructures. The thesis aims at defining a unifying energy load management framework which takes into account both the economic perspective of Cloud providers (performance levels, energy costs, etc.) and the overall efficiency of the energy distribution system (load/production balancing, load forecasting and guarantees), with the aim of minimizing the usage of brown energy sources and reducing the environmental footprint of Cloud systems, enabling cooperation among Cloud service providers, communication networks, and the electrical grid in multiple scenarios and with different cooperation levels.
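
A toy sketch of the kind of optimization involved: splitting load between two Cloud sites so as to minimize the cost of brown power, given the green power available at each site (a linear program solved with SciPy; all figures are hypothetical):

    from scipy.optimize import linprog

    demand = 1000.0           # total load to place (req/s)
    energy_per_req = 2.0      # power drawn per unit load (W per req/s)
    green = [1200.0, 400.0]   # green power available per site (W)
    cap = [800.0, 800.0]      # per-site capacity (req/s)
    brown_price = [1.0, 1.5]  # cost per W of brown power per site

    # Variables: x1, x2 (load per site) and b1, b2 (brown power drawn).
    # Minimize brown cost subject to meeting the demand, the capacity
    # limits, and the power balance energy_per_req*x_i - b_i <= green_i.
    c = [0.0, 0.0] + brown_price
    A_ub = [[energy_per_req, 0, -1, 0], [0, energy_per_req, 0, -1]]
    res = linprog(c, A_ub=A_ub, b_ub=green,
                  A_eq=[[1, 1, 0, 0]], b_eq=[demand],
                  bounds=[(0, cap[0]), (0, cap[1]), (0, None), (0, None)])
    print(res.x)  # load split and brown power per site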

Pre-requisites: Optimization models (linear and non-linear programming, AMPL, etc.), Performance evaluation, Java programming.

Some Previous Theses