Leveraging Heterogeneity and Multiplicity in Cloud Storage Systems

Author: Kumar Rishabh
Date: 2019-08-06
Report no: IIIT/TH/2019/101
Advisor: Vasudeva Varma

Abstract

Cloud storage is central to the working of any data center, and the explosion of data has moved it further to center stage. The per-gigabyte cost of cloud storage has dropped drastically over the last five years. But while the demand for compute power increases additively, the accompanying provisioned storage increases multiplicatively. In addition, to provision data reliably, a storage system needs to provide features such as scalability, reliability and de-duplication out of the box. Providing such features on bare-metal storage is expensive and would put such a system out of reach of most end users.

The advent of Software Defined Storage (SDS) has helped to solve this problem. SDS frameworks typically decouple the task of managing and provisioning data from the underlying hardware, so the data plane can reside on cheap commercial hardware, making it accessible to end users. But different SDS systems provide different novel features out of the box, and the myriad of solutions makes it difficult for users to choose one over the other. We try to address this heterogeneity of storage systems by studying ideas that can potentially help us leverage the novel features of different storage systems. Specifically, we study four ideas that can potentially help us combine heterogeneous storage systems.

• As the first case study, we address IO workload assignment in a heterogeneous SDS environment using offline statistical modelling to estimate performance measures such as throughput and IOPS. As a proof of concept, we use support vector regression to estimate the performance of individual IO workloads on each available SDS system for optimal assignment, and build a working prototype comprising HDFS, GlusterFS, and Ceph.

• As the second case study, we tackle the same problem of IO workload assignment using online statistical modelling. Finding ground truth for such a system can be cumbersome. To this end, we use hierarchical clustering algorithms to downsample the data, find canonical data points, and determine which data points are worth finding ground truth for.

• As the third case study, we study MultiStack, a cloud orchestration platform for provisioning jobs in a hybrid cloud environment. We leverage this platform to provision storage in a hybrid cloud environment. As a proof of concept, we use the platform to provision storage across Amazon Web Services and OpenStack. The performance measure we try to optimize is the job completion time of IO workloads.

• As the fourth case study, we move to the more basic key-value domain. We leverage multiple hash rings to see whether the load distribution characteristics of distributed hash tables (DHTs) change. We establish that, overall, this does not substantially change the load distribution characteristics, but in the process we find that it improves load distribution under adversarial traffic.

This work is essentially a collection of attempts to find avenues in the cloud storage ecosystem where heterogeneity can be leveraged to improve cloud performance metrics. Where we establish that such a gain is not substantial, we see that heterogeneity can still improve other metrics, such as load distribution under adversarial traffic.
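
To make the multiple-hash-ring idea of the fourth case study concrete, the following is a minimal Python sketch and not the thesis implementation. It assumes that each ring hashes nodes and keys with its own salt, and that a key is stored on the least-loaded of the candidate nodes the rings propose; that placement rule, together with the names `HashRing`, `MultiRing`, `vnodes` and `num_rings`, is an illustrative assumption that the abstract does not specify.

```python
import bisect
import hashlib
from collections import Counter


def _hash(value: str) -> int:
    """Map a string to a 64-bit integer position on a ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """One consistent-hash ring; `salt` makes each ring lay out nodes differently."""

    def __init__(self, nodes, salt, vnodes=16):
        points = []
        for node in nodes:
            for i in range(vnodes):  # several virtual points per node
                points.append((_hash(f"{salt}:{node}:{i}"), node))
        points.sort()
        self._positions = [pos for pos, _ in points]
        self._owners = [node for _, node in points]
        self._salt = salt

    def lookup(self, key):
        """Return the node owning the first point clockwise from the key."""
        pos = _hash(f"{self._salt}:{key}")
        idx = bisect.bisect_right(self._positions, pos) % len(self._positions)
        return self._owners[idx]


class MultiRing:
    """Hypothetical combiner (an assumption, not the thesis design):
    ask every ring for a candidate and store on the least-loaded one."""

    def __init__(self, nodes, num_rings=2):
        self.rings = [HashRing(nodes, salt=f"ring{r}") for r in range(num_rings)]
        self.load = Counter({node: 0 for node in nodes})

    def candidates(self, key):
        return [ring.lookup(key) for ring in self.rings]

    def place(self, key):
        # Ties go to the earlier ring, so placement stays deterministic.
        target = min(self.candidates(key), key=lambda n: self.load[n])
        self.load[target] += 1
        return target


nodes = [f"node{i}" for i in range(8)]
single = MultiRing(nodes, num_rings=1)  # behaves like one plain hash ring
double = MultiRing(nodes, num_rings=2)
for i in range(10_000):
    single.place(f"key-{i}")
    double.place(f"key-{i}")
print("max load, 1 ring :", max(single.load.values()))
print("max load, 2 rings:", max(double.load.values()))
```

Under this assumed rule, a read has to query the candidate node from every ring, because a key may have been stored on any of them. Other ways of combining rings are possible, and the one shown here is only one such choice.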

Full thesis: pdf

Centre for Search and Information Extraction Lab

IIIT Hyderabad Publications
