PhD Thesis

This section of the Web site lists the PhD thesis topics currently available. General information on doctoral studies at Politecnico is available at the DEIB Web site. Interested candidates should contact me by e-mail. Theses can be funded by the ATMOSPHERE H2020 European project, run in collaboration with Brazil.

Bayesian Optimization for Sizing Big Data and Deep Learning Applications Cloud Clusters

Today data mining, along with big data analytics techniques in general, is profoundly changing our society, e.g., in the financial sector or healthcare. Companies are increasingly aware of the benefits of data processing technologies: in almost every sector, most companies use or plan to use machine learning techniques.

In particular, deep learning methods are gaining momentum across various domains for tackling different problems, ranging from image recognition and classification to text processing and speech recognition.

Picking the right cloud cluster configuration for recurring big data/deep learning analytics is hard, because there can be tens of possible virtual machine/GPU instance types and even more cluster sizes to pick from. A poor choice can degrade performance and raise the cost of running an application, yet identifying the best configuration among such a broad spectrum of cloud alternatives is challenging.

The goal of this thesis is to identify novel Bayesian Optimization methods to build performance models for various big data and deep learning applications based on Spark, the most promising big data framework, which will probably dominate the big data market in the next 5-10 years.

The aim of this research work is to build accurate machine learning models that estimate the performance of Spark applications (possibly running on GPU clusters) from only a few test runs on reference systems, and to identify optimal or close to optimal configurations. Bayesian methods will be combined with traditional performance modelling techniques, which include computer system simulation and bounding techniques.
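
To make the idea concrete, the following is a minimal Python sketch of a Bayesian Optimization loop that selects a cluster configuration from a handful of test runs. It is an illustration under stated assumptions, not the method to be developed in the thesis: run_benchmark is a synthetic, hypothetical objective standing in for a real test run, the candidate grid (cores per VM, number of VMs), the feature scaling, the RBF kernel and the expected-improvement acquisition function are all choices made only for this example, and the Gaussian process is implemented with the standard library alone to keep the sketch self-contained.

```python
import math
import random
import statistics

def kernel(a, b, length=1.0):
    """Squared-exponential (RBF) kernel between two feature vectors."""
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / (2 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(X, y, x_new, noise=1e-4):
    """Zero-mean Gaussian process posterior mean and standard deviation at x_new."""
    K = [[kernel(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k = [kernel(a, x_new) for a in X]
    mean = sum(ki * ai for ki, ai in zip(k, solve(K, y)))
    var = kernel(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, math.sqrt(max(var, 1e-12))

def expected_improvement(mean, std, best):
    """Expected improvement over the incumbent `best` when minimizing."""
    z = (best - mean) / std
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return (best - mean) * cdf + std * pdf

def features(cfg):
    """Map a (cores per VM, number of VMs) pair to comparable scales (assumed)."""
    cores, vms = cfg
    return (math.log2(cores), vms / 4)

def run_benchmark(cfg):
    """Hypothetical stand-in for a real test run; returns a synthetic cost."""
    cores, vms = cfg
    runtime_s = 120 + 28800 / (cores * vms) + 20 * vms   # synthetic runtime
    dollars = runtime_s / 3600 * vms * 0.05 * cores      # synthetic price
    return dollars + runtime_s / 600                     # assumed time penalty

def bayes_opt(candidates, objective, n_init=4, n_iter=8, seed=0):
    """Evaluate n_init random configurations, then n_iter chosen by EI."""
    rng = random.Random(seed)
    tried = rng.sample(candidates, n_init)
    costs = [objective(c) for c in tried]
    for _ in range(n_iter):
        # Standardize observed costs so the zero-mean, unit-variance prior fits.
        mu, sd = statistics.mean(costs), statistics.pstdev(costs) or 1.0
        y = [(c - mu) / sd for c in costs]
        X = [features(c) for c in tried]
        best = min(y)
        remaining = [c for c in candidates if c not in tried]
        nxt = max(remaining, key=lambda c: expected_improvement(
            *gp_posterior(X, y, features(c)), best))
        tried.append(nxt)
        costs.append(objective(nxt))
    i = min(range(len(costs)), key=costs.__getitem__)
    return tried[i], costs[i], len(tried)

CANDIDATES = [(c, v) for c in (2, 4, 8, 16) for v in range(2, 17, 2)]
best_cfg, best_cost, n_runs = bayes_opt(CANDIDATES, run_benchmark)
print(f"best configuration after {n_runs} test runs: {best_cfg}, cost {best_cost:.2f}")
```

In this sketch only 12 of the 32 candidate configurations are ever benchmarked; the Gaussian process surrogate decides which one to try next. In a real setting each call to the objective would be an actual (and expensive) run of the application on the cloud.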

References

  1. E. Brochu, V. M. Cora, N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.
  2. J. Snoek, H. Larochelle, R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms.
  3. S. Venkataraman, Z. Yang, M. Franklin, B. Recht, I. Stoica. Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. NSDI 2016 Proceedings.
  4. O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, M. Zhang. CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics. NSDI 2017 Proceedings.

Machine Learning Techniques to Model Data Intensive and Deep Learning Applications Performance

Nowadays, Big Data are becoming more and more important, and many sectors of our economy are now guided by data-driven decision processes. Spark is becoming the reference framework, while at the infrastructure layer cloud computing provides flexible and cost-effective solutions for allocating large clusters on demand, often based on GPGPUs. Using such resources efficiently requires a performance model of these systems that is both precise and efficient to use.

One common way to model the performance of ICT systems relies on analytical models such as queueing networks or Petri nets. However, despite their high accuracy in performance prediction, their significant computational complexity limits their use. Machine learning techniques can address this problem, yielding models that are both accurate and scalable.

This thesis involves the development and validation of models for Big Data clusters based on Spark, or on GPGPUs to support the training of deep learning applications. The research work will compare multiple machine learning algorithms, such as Support Vector Regression, Linear Regression, Random Forests and Neural Networks, and will develop feature engineering solutions to identify compact and, possibly, interpretable models to predict the performance of large clusters.
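
As a small illustration of why feature engineering matters for such models, the Python sketch below fits a linear regression to execution times measured at different cluster sizes, once on the raw cluster size and once on engineered features, and compares the two by leave-one-out cross-validation. Everything in it is an assumption made for the example: the runtimes are synthetic (generated by a made-up formula with a small deterministic perturbation), the engineered features 1/n, log2(n) and n are loosely in the spirit of the Ernest paper listed in the references, ordinary least squares is used rather than any specific method from the literature, and leave-one-out error is just one possible validation criterion. The same comparison loop could be applied to the other algorithms named above.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    d = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    Xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    return solve(XtX, Xty)

def raw_features(n):
    """Intercept plus the raw cluster size."""
    return [1.0, float(n)]

def engineered_features(n):
    """Intercept, parallel-work term, coordination term, per-node overhead."""
    return [1.0, 1.0 / n, math.log2(n), float(n)]

def loo_mape(feature_map, sizes, runtimes):
    """Leave-one-out mean absolute percentage error of a linear model."""
    errors = []
    for i in range(len(sizes)):
        train = [j for j in range(len(sizes)) if j != i]
        w = fit_least_squares([feature_map(sizes[j]) for j in train],
                              [runtimes[j] for j in train])
        pred = sum(wk * fk for wk, fk in zip(w, feature_map(sizes[i])))
        errors.append(abs(pred - runtimes[i]) / runtimes[i])
    return sum(errors) / len(errors)

# Synthetic data (assumed): runtime in seconds for clusters of 2, 4, ..., 32 nodes.
SIZES = list(range(2, 33, 2))
RUNTIMES = [(30 + 900 / n + 5 * math.log2(n) + 0.8 * n) * (1 + 0.02 * math.sin(7 * n))
            for n in SIZES]

errors = {name: loo_mape(fm, SIZES, RUNTIMES)
          for name, fm in [("raw", raw_features), ("engineered", engineered_features)]}
for name, err in errors.items():
    print(f"{name:>10} features: leave-one-out MAPE = {err:.1%}")
```

The engineered model has only four coefficients, each with a direct interpretation (fixed cost, parallelizable work, coordination cost, per-node overhead), which is the kind of compactness and interpretability the thesis aims for.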

References

  1. S. Venkataraman, Z. Yang, M. Franklin, B. Recht, I. Stoica. Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. NSDI 2016 Proceedings.
  2. N. Yigitbasi, T. L. Willke, G. Liao, D. Epema. Towards Machine Learning-Based Auto-Tuning of MapReduce. MASCOTS 2013: 11-20.
  3. A. D. Popescu, A. Balmin, V. Ercegovac, A. Ailamaki. Predict: Towards Predicting the Runtime of Large Scale Iterative Analytics. VLDB 2013: 1678-1689.
  4. E. Ataie, E. Gianniti, D. Ardagna, A. Movaghar. A Combined Analytical Modeling Machine Learning Approach for Performance Prediction of MapReduce Jobs in Cloud Environment. SYNASC 2016: 431-439.