Thesis

This section of the Web site provides thesis and project proposals for students. The main topics are Cloud computing, Big Data cluster management, Deep Learning applications, and GPGPU systems performance evaluation.

If you are interested in one of the proposals, contact me by e-mail.

Bayesian Optimization for Sizing Big Data and Deep Learning Applications Cloud Clusters

Advisors: Prof. Alessandra Guglielmi, Prof. Danilo Ardagna

Today data mining, along with big data analytics in general, is profoundly changing our society, e.g., in the financial sector and in healthcare. Companies are becoming more and more aware of the benefits of data processing technologies; across almost every sector, most industries use or plan to use machine learning techniques.

In particular, deep learning methods are gaining momentum across various domains for tackling different problems, ranging from image recognition and classification to text processing and speech recognition.

Picking the right cloud cluster configuration for recurring big data/deep learning analytics is hard: there can be tens of virtual machine/GPU instance types and even more cluster sizes to pick from, and choosing poorly leads to performance degradation and higher costs to run an application. Identifying the best configuration within such a broad spectrum of cloud alternatives is therefore challenging.

The goal of this thesis is to identify novel Bayesian Optimization methods to build performance models for various big data and deep learning applications based on Spark, currently the most prominent big data framework and one expected to play a leading role in the big data market over the next 5-10 years.

The aim of this research work is to build accurate machine learning models that estimate the performance of Spark applications (possibly running on GPU clusters) from only a few test runs on reference systems, and to identify optimal or close-to-optimal configurations. Bayesian methods will be combined with traditional performance modelling techniques, such as computer systems simulation or bounding techniques.
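
A minimal sketch of such an optimization loop is given below: it searches a hypothetical catalogue of VM types and cluster sizes for the cheapest configuration meeting a deadline. The prices, the synthetic cost surface, and the distance-weighted surrogate (a crude stand-in for the Gaussian process posterior used in real Bayesian Optimization) are all illustrative assumptions, not part of the thesis.

```python
import math

# Hypothetical VM catalogue: name -> hourly price (assumed numbers).
VM_TYPES = {"small": 0.10, "medium": 0.20, "large": 0.40}
DEADLINE = 40.0

def job_cost(vm, n):
    """Assumed black-box objective: in the thesis this would be a real
    Spark test run; here it is a synthetic cost-plus-penalty surface."""
    price = VM_TYPES[vm]
    runtime = 20.0 + 100.0 / (n * (1.0 + price))  # toy scaling model
    cost = price * n * runtime
    if runtime > DEADLINE:                        # missed deadline
        cost += 50.0
    return cost

CANDIDATES = [(vm, n) for vm in VM_TYPES for n in range(2, 17)]

def _dist(a, b):
    # Crude distance in (price, cluster size) space.
    return 10.0 * abs(VM_TYPES[a[0]] - VM_TYPES[b[0]]) + abs(a[1] - b[1])

def surrogate(history, point):
    """Distance-weighted mean/spread of the observed costs: a toy
    surrogate standing in for a Gaussian-process posterior."""
    ws = [1.0 / (1.0 + _dist(p, point)) for p, _ in history]
    mu = sum(w * c for w, (_, c) in zip(ws, history)) / sum(ws)
    var = sum(w * (c - mu) ** 2 for w, (_, c) in zip(ws, history)) / sum(ws)
    return mu, math.sqrt(var)

def bayes_opt(n_iter=10, kappa=2.0):
    # Deterministic initial design: first, middle, and last candidate.
    init = (CANDIDATES[0], CANDIDATES[len(CANDIDATES) // 2], CANDIDATES[-1])
    history = [(p, job_cost(*p)) for p in init]
    for _ in range(n_iter):
        tried = {p for p, _ in history}
        pending = [p for p in CANDIDATES if p not in tried]
        # Lower-confidence-bound acquisition: prefer low predicted cost,
        # but reward uncertainty (exploration).
        nxt = min(pending, key=lambda p: surrogate(history, p)[0]
                                         - kappa * surrogate(history, p)[1])
        history.append((nxt, job_cost(*nxt)))
    return min(history, key=lambda h: h[1])

best_cfg, best_cost = bayes_opt()
print(best_cfg, round(best_cost, 2))
```

Only 13 of the 45 candidate configurations are ever evaluated, which is the point of the approach: each "evaluation" stands in for an expensive test run on a real cluster.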

References

  1. E. Brochu, V. M. Cora, N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.
  2. J. Snoek, H. Larochelle, R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms.
  3. S. Venkataraman, Z. Yang, M. Franklin, B. Recht, I. Stoica. Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. NSDI 2016 Proceedings.
  4. O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, M. Zhang. CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics. NSDI 2017 Proceedings.

Machine Learning techniques to Model Data Intensive and Deep Learning Applications Performance

Nowadays, Big Data is becoming more and more important: many sectors of our economy are now guided by data-driven decision processes. Spark is becoming the reference framework, while at the infrastructure layer cloud computing provides flexible and cost-effective solutions for allocating large clusters on demand, often based on GPGPUs. To use such resources efficiently, a performance model of these systems is required that is at the same time accurate and efficient to use.

One common way to model the performance of ICT systems makes use of analytical models like queueing networks or Petri nets. However, despite their great accuracy in performance prediction, their significant computational complexity limits their usage. Machine learning techniques can overcome this problem and yield models that are accurate and scalable at the same time.

This thesis involves the development and validation of performance models for Big Data clusters based on Spark, or based on GPGPUs to support the training of deep learning applications. The research work will compare multiple machine learning algorithms (e.g., Support Vector Regression, Linear Regression, Random Forests, Neural Networks) and will develop feature engineering solutions to identify compact and, possibly, interpretable models that predict the performance of large clusters.
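
As a concrete example of the simplest candidate model, the sketch below fits an Ernest-style scaling law, runtime ≈ a + b/n (n = number of machines), to a handful of profiling runs via closed-form least squares. The training data and coefficients are synthetic stand-ins for real test runs on a reference cluster.

```python
def fit_scaling_law(samples):
    """Closed-form ordinary least squares for y = a + b * x,
    with feature x = 1 / machines (the Ernest-style regressor)."""
    xs = [1.0 / n for n, _ in samples]
    ys = [t for _, t in samples]
    m = len(samples)
    mx, my = sum(xs) / m, sum(ys) / m
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def predict(a, b, n):
    return a + b / n

# Synthetic, noiseless profiling runs generated from t = 30 + 400 / n.
runs = [(n, 30.0 + 400.0 / n) for n in (2, 4, 8, 16)]
a, b = fit_scaling_law(runs)
print(round(a, 2), round(b, 2))  # → 30.0 400.0
```

With noiseless data the fit recovers the generating coefficients exactly; on real profiling data one would compare this baseline against the richer models listed above (SVR, Random Forests, Neural Networks).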

References

  1. S. Venkataraman, Z. Yang, M. Franklin, B. Recht, I. Stoica. Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. NSDI 2016 Proceedings.
  2. N. Yigitbasi, T. L. Willke, G. Liao, D. Epema. Towards Machine Learning-Based Auto-tuning of MapReduce. MASCOTS 2013, 11-20.
  3. A. D. Popescu, A. Balmin, V. Ercegovac, A. Ailamaki. Predict: Towards Predicting the Runtime of Large Scale Iterative Analytics. VLDB 2013, 1678-1689.
  4. E. Ataie, E. Gianniti, D. Ardagna, A. Movaghar. A Combined Analytical Modeling Machine Learning Approach for Performance Prediction of MapReduce Jobs in Cloud Environment. SYNASC 2016, 431-439.

Robust Games for the Run-time Management of Cloud Systems

Cloud Computing aims at streamlining the on-demand provisioning of software, hardware, and data as services, providing end users with flexible and scalable services accessible through the Internet. As the Cloud offering becomes wider and more attractive to business owners, developing efficient resource provisioning policies for Cloud-based services becomes increasingly challenging. Indeed, modern Cloud services operate in an open and dynamic world characterized by continuous changes, where strategic interaction among different economic agents takes place.

This thesis aims to study the run-time service provisioning and capacity allocation problem through the formulation of a mathematical model based on a noncooperative game-theoretic approach. We take the perspective of Software as a Service (SaaS) providers that want to minimize the costs of the virtual machine/container instances allocated in a multi-IaaS (Infrastructure as a Service) scenario, while avoiding penalties for request execution failures and providing quality of service guarantees. SaaS providers compete and bid for the use of infrastructural resources, while the IaaS providers want to maximize the revenues obtained by providing the underlying resources. The thesis will also address the uncertainty in workload prediction and resource demand estimation, leading to a “robust” game.
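
The flavour of this strategic interaction can be conveyed with a textbook toy: two providers in Cournot-style competition for capacity, whose Nash equilibrium is found by iterated best responses. This is a stand-in illustration only, not the robust provisioning game of the thesis; the demand intercept and unit cost are arbitrary numbers.

```python
P, c = 10.0, 1.0   # assumed inverse-demand intercept and unit cost

def best_response(x_other):
    """Maximizer of the quadratic profit x * (P - x - x_other) - c * x,
    clipped at zero (capacities cannot be negative)."""
    return max(0.0, (P - c - x_other) / 2.0)

def best_response_dynamics(x1=0.0, x2=0.0, tol=1e-9):
    """Alternate best responses until a fixed point (Nash equilibrium)."""
    for _ in range(1000):
        nx1 = best_response(x2)
        nx2 = best_response(nx1)
        if abs(nx1 - x1) < tol and abs(nx2 - x2) < tol:
            return nx1, nx2
        x1, x2 = nx1, nx2
    return x1, x2

x1, x2 = best_response_dynamics()
print(round(x1, 4), round(x2, 4))  # both converge to (P - c) / 3 = 3.0
```

In the symmetric Cournot game the equilibrium is (P - c) / 3 for each player, which the iteration recovers; the thesis's generalized Nash formulation replaces this closed-form best response with the solution of each provider's constrained cost-minimization problem.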

References

  1. D. Bertsimas, M. Sim. The price of robustness. Operations Research, 52(1):35–53, 2004.
  2. D. Ardagna, B. Panicucci, M. Passacantando. A Game Theoretic Formulation of the Service Provisioning Problem in Cloud Systems. WWW 2011, 177-186.
  3. D. Ardagna, M. Ciavotta, M. Passacantando. Generalized Nash Equilibria for the Service Provisioning Problem in Multi-Cloud Systems. IEEE Trans. Services Computing 10(3): 381-395, 2017.

Hierarchical Resource Management of Very Large Cloud Platforms

Worldwide interest in the delivery of computing and storage capacity as a service continues to grow at a rapid pace. Thanks to the development of virtualized and container-based systems and micro-services architectures, cloud platforms are becoming more and more flexible, but their complexity requires advanced resource management solutions capable of dynamically adapting the underlying infrastructure while providing continuous service and performance guarantees.

Cloud systems are continuously growing in size: today, cloud service centers include up to 10,000 servers, and each server hosts several VMs and possibly many more containers. In this context, centralized solutions suffer from critical design limitations, including a lack of scalability and expensive monitoring communication costs, and cannot provide fast and effective control.

The goal of this thesis is to devise resource allocation policies for virtualized and container-based environments that satisfy performance and availability guarantees while minimizing the operating costs (e.g., energy) of very large cloud service centers. The work will develop a scalable, distributed hierarchical framework based on mixed-integer nonlinear optimization acting at multiple timescales.
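
A toy two-level sketch of the hierarchical idea: a central manager splits the server budget across application managers proportionally to their predicted loads (slow timescale), and each local manager packs VM demands onto its share greedily (fast timescale). All names and figures are illustrative assumptions; the thesis targets a full mixed-integer nonlinear formulation rather than these simple heuristics.

```python
def central_split(total_servers, loads):
    """Slow timescale: proportional split of the server budget with
    largest-remainder rounding, so shares always sum to the budget."""
    total_load = sum(loads)
    raw = [total_servers * l / total_load for l in loads]
    alloc = [int(r) for r in raw]
    leftovers = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i],
                       reverse=True)
    for i in leftovers[: total_servers - sum(alloc)]:
        alloc[i] += 1
    return alloc

def local_pack(servers, vm_demands, per_server_capacity=8):
    """Fast timescale: greedy first-fit-decreasing of VM demands onto
    the local server share; VMs that do not fit are left unplaced."""
    free = [per_server_capacity] * servers
    placed = []
    for d in sorted(vm_demands, reverse=True):
        for s in range(servers):
            if free[s] >= d:
                free[s] -= d
                placed.append((d, s))
                break
    return placed

shares = central_split(100, [50, 30, 20])
print(shares)                          # → [50, 30, 20]
print(local_pack(2, [5, 4, 3, 8]))     # → [(8, 0), (5, 1), (3, 1)]
```

Note that the 4-unit VM in the second call is left unplaced: the local manager's capacity constraint binds, which is exactly the kind of feedback a hierarchical controller would report upward to trigger a new split.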

References

  1. T. Nowicki, M. S. Squillante, C. W. Wu. Fundamentals of Dynamic Decentralized Optimization in Autonomic Computing Systems. Self-Star Properties in Complex Information Systems, 204-218, Springer-Verlag, 2005.
  2. B. Addis, D. Ardagna, B. Panicucci, M. S. Squillante, L. Zhang. A Hierarchical Approach for the Resource Management of Very Large Cloud Platforms. IEEE Trans. Dependable Sec. Comput. 10(5): 253-272, 2013.
  3. M. Sedaghat, F. Hernandez-Rodriguez, E. Elmroth. Decentralized Cloud Datacenter Reconsolidation Through Emergent and Topology-Aware Behavior. Future Generation Computer Systems, 56: 51-63, 2016.
  4. F. Farahnakian, T. Pahikkala, P. Liljeberg, J. Plosila, H. Tenhunen. Hierarchical VM Management Architecture for Cloud Data Centers. CloudCom 2014, 306-311.

Optimizing the infrastructure for training Deep Neural Networks

Nowadays, deep learning (DL) methods are fruitfully exploited in a wide gamut of products across industries, ranging from medical diagnosis to public security. Among them, convolutional neural networks (CNNs) are the most popular technique, most notably for image recognition and classification tasks, which represented their first successful application. Beyond these established applications, CNN models are now widely used for other use cases, such as speech recognition and machine translation. Over time, many frameworks have been developed to provide high-level APIs for CNN design, training, and deployment; among the best known are Torch, PyTorch, TensorFlow, and Caffe.

DL models are usually trained on GPGPU systems, which achieve speedups from 5x up to 40x compared to CPU deployments. Cloud platforms are becoming a popular means to provide GPGPUs on demand: the GPU-as-a-service market was estimated at 200 million US dollars in 2016, with a compound annual growth rate of over 30% from 2017.

High-end systems (e.g., NVIDIA DGX-2 and DGX-1) provide up to 16 fully interconnected GPUs which, thanks to NVSwitch technology, can be partitioned among multiple training jobs submitted by different users. The goal of the thesis is to design an advanced scheduler which: (i) allows users to specify job deadlines, and (ii) automatically partitions the available GPUs across the submitted jobs to minimize costs (in the cloud case) and provide deadline guarantees.
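
The partitioning step in (ii) can be sketched as follows: give each job the fewest GPUs that still meets its deadline under an optimistic linear-speedup assumption, and report infeasibility when total demand exceeds the 16 GPUs of a DGX-2-class box. Job names and figures are invented for illustration; a real scheduler would use measured, sublinear scaling curves.

```python
import math

TOTAL_GPUS = 16  # a DGX-2-class box

def partition(jobs):
    """jobs: list of (name, gpu_hours, deadline_hours) tuples.
    Minimum GPUs per job so that gpu_hours / gpus <= deadline,
    assuming (optimistically) linear speedup with GPU count."""
    alloc = {name: math.ceil(gpu_hours / deadline)
             for name, gpu_hours, deadline in jobs}
    feasible = sum(alloc.values()) <= TOTAL_GPUS
    return alloc, feasible

# Illustrative jobs: name, total GPU-hours of work, deadline in hours.
jobs = [("resnet", 24.0, 6.0), ("bert", 40.0, 8.0), ("gan", 10.0, 10.0)]
alloc, ok = partition(jobs)
print(alloc, ok)  # → {'resnet': 4, 'bert': 5, 'gan': 1} True
```

Since 4 + 5 + 1 = 10 GPUs fit within the box, all deadlines are met; the leftover 6 GPUs are the slack a cost-minimizing scheduler could release (cloud case) or use to speed jobs up further.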

References

  1. NVIDIA. The Challenge of Scaling to Meet the Demands of Modern AI and Deep learning.  http://images.nvidia.com/content/pdf/dgx-2-print-datasheet-738070-nvidia-a4-web.pdf

AutoML++ Optimization of Deep Networks

Advisors: Prof. Matteo Matteucci, Prof. Danilo Ardagna

Cloud AutoML is a suite of machine learning products that enables developers with limited machine learning expertise to train high-quality models specific to their business needs, by leveraging Google’s state-of-the-art transfer learning and Neural Architecture Search technology.

Deep neural networks form a powerful framework for machine learning and have achieved remarkable performance in several areas in recent years. However, despite the compelling arguments for using neural networks as a general template for solving machine learning problems, training these models and designing the right network for a given task remain filled with theoretical gaps and practical concerns.

To train a neural network, one needs to specify the parameters of a typically large network architecture with several layers and units, and then solve a difficult non-convex optimization problem. Moreover, if a network architecture is specified a priori and trained using back-propagation, the model will always have as many layers as the one specified. Since not all machine learning problems admit the same level of difficulty, and different tasks naturally require varying levels of complexity, models trained with an insufficient number of layers can provide unsatisfactory accuracy. AutoML helps by automatically changing the network architecture and its parameters.

The goal of this thesis is to: (i) compare and analyse the available open-source AutoML toolkits, (ii) integrate one such toolkit with the performance analysis tools developed at Politecnico di Milano, and (iii) provide a Bayesian optimization framework that extends AutoML toolkits to drive the search for the best deep (convolutional) neural network architecture while also providing execution time/budget guarantees (e.g., run 100,000 epochs in <8 h, cost <1K$).
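
As a minimal illustration of budget-aware architecture search, the sketch below runs plain random search over network depth and width with invented accuracy/time proxy models, refusing any trial that would exceed an 8-hour budget. The thesis's Bayesian optimization framework would replace the random sampler with a model-guided one; every number here is an assumption.

```python
import random

random.seed(7)  # reproducible toy run
SEARCH_SPACE = {"layers": range(2, 11), "units": (64, 128, 256, 512)}
BUDGET_HOURS = 8.0

def estimate(layers, units):
    """Invented proxy models: accuracy saturates with model capacity,
    while training time grows linearly with it. Stand-ins for real runs."""
    capacity = layers * units
    accuracy = 1.0 - 1.0 / (1.0 + capacity / 800.0)
    hours = 0.002 * capacity
    return accuracy, hours

def search(max_trials=100):
    spent, best = 0.0, None
    for _ in range(max_trials):
        layers = random.choice(list(SEARCH_SPACE["layers"]))
        units = random.choice(SEARCH_SPACE["units"])
        accuracy, hours = estimate(layers, units)
        if spent + hours > BUDGET_HOURS:  # budget guarantee: skip trial
            continue
        spent += hours
        if best is None or accuracy > best[0]:
            best = (accuracy, layers, units)
    return best, spent

best, spent = search()
print(best, round(spent, 2))
```

The budget guard is the point of item (iii): the search never charges a trial it cannot afford, so total "training" time stays below the stated limit by construction.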

References

  1. Google. Cloud AutoML (beta). https://cloud.google.com/automl
  2. AdaNet. https://github.com/tensorflow/adanet
  3. C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, S. Yang. AdaNet: Adaptive Structural Learning of Artificial Neural Networks. https://arxiv.org/pdf/1607.01097.pdf
  4. E. Gianniti, L. Zhang, D. Ardagna. Performance Prediction of GPU-based Deep Learning Applications. SBAC-PAD 2018 Proceedings. Lyon, France.

Studying Spark performance on GPGPU-based systems

In today’s world, the amount of data created on a daily basis has reached quintillions of bytes. Big data plays a major role in many different areas, as it enables extraordinary changes in the analysis of the available data.

Apache Spark is one of the most widely used big data frameworks. The Spark project emerged with the motivation of democratizing the power of big data, through a unified tool that performs various big data analyses and provides high-level APIs.

Today, deep learning methods perform tasks that were considered unthinkable a few years ago. Yet most available deep learning frameworks still provide only low-level APIs that require manual work and experimentation. As an exception, the new Spark Deep Learning Pipelines framework provides high-level abstractions for deploying deep learning methods.

Within the context of investigating the capabilities of big data technologies, performance evaluation techniques play a central role. There is a strong need to predict application execution times, especially when data analyses support business decisions. Providing deadline guarantees is becoming critical for running both batch and interactive queries. Besides the prediction of execution times, performance models can also be used to perform capacity planning or to manage production environments at runtime.

The goal of this thesis is to: (i) profile the Spark Deep Learning Pipelines framework when GPGPUs are used, by running a benchmark developed at Polimi, and (ii) develop and compare multiple performance models based on machine learning.
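
Step (i) boils down to collecting (input size, runtime) samples that the models in step (ii) can be fit to. A minimal profiling harness is sketched below, with a pure-Python placeholder kernel standing in for a SparkDL pipeline stage; the sizes and repetition count are arbitrary choices.

```python
import time

def workload(size):
    """Placeholder compute kernel standing in for a SparkDL stage."""
    return sum(i * i for i in range(size))

def profile(sizes, repetitions=3):
    """Time each input size a few times and keep the best run,
    yielding (size, runtime) samples for a performance model."""
    samples = []
    for size in sizes:
        runs = []
        for _ in range(repetitions):
            t0 = time.perf_counter()
            workload(size)
            runs.append(time.perf_counter() - t0)
        samples.append((size, min(runs)))  # best-of-n damps timing noise
    return samples

samples = profile([10_000, 20_000, 40_000])
print(samples)
```

On a real cluster the same harness shape applies, but each sample is an entire benchmark run at a given dataset size and GPU count rather than a local function call.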

References

  1. Databricks. Deep Learning Pipelines. https://docs.databricks.com/applications/deep-learning/deep-learning-pipelines.html
  2. D. S. Saplik, D. Ardagna. Spark Deep Learning. https://github.com/eubr-atmosphere/spark-deep-learning
  3. E. Sahin, M. Lattuada. A performance MLLibrary. https://github.com/eubr-atmosphere/a-MLLibrary

Some Previous Thesis Works