The exorbitant costs of cloud computing can stifle machine learning and data science projects, and many organizations use multiple public clouds for different purposes to save money. However, a multi-cloud approach can add significant complexity, as not everyone is an expert in cloud infrastructure.
To solve this problem, researchers at UC Berkeley’s Sky Computing Lab launched SkyPilot, an open-source framework for running ML and data science batch jobs on any cloud, or across multiple clouds, through a single cloud-independent interface.
SkyPilot uses an optimization algorithm to determine which cloud zone or service provider is the most cost-effective for a given job. The program considers a workload’s resource requirements (whether it needs CPUs, GPUs, or TPUs), determines which locations (zone/region/cloud) have the compute available to complete the task, and then sends the job to the least expensive option.
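SkyPilot’s actual optimizer is more involved, but the core placement decision it describes, filter candidate locations by what hardware they offer, then pick the cheapest feasible one, can be sketched in plain Python. All location names and hourly prices below are made up for illustration; this is not SkyPilot’s API.

```python
# Hypothetical sketch of an intercloud broker's placement decision:
# keep only locations that offer the required accelerator, then
# choose the one with the lowest hourly price.

def cheapest_location(candidates, required_accelerator):
    """Return the lowest-cost location offering the required accelerator."""
    feasible = [c for c in candidates if required_accelerator in c["accelerators"]]
    if not feasible:
        raise RuntimeError(f"No location offers {required_accelerator}")
    return min(feasible, key=lambda c: c["hourly_price"])

# Illustrative candidate pool (names and prices are invented).
candidates = [
    {"name": "cloud-a/us-east", "accelerators": {"A100"}, "hourly_price": 3.67},
    {"name": "cloud-b/us-west", "accelerators": {"A100"}, "hourly_price": 3.40},
    {"name": "cloud-c/eu",      "accelerators": {"V100"}, "hourly_price": 2.48},
]

best = cheapest_location(candidates, "A100")
print(best["name"])  # the cheapest zone that actually has an A100
```

Note that the cheapest location overall (cloud-c/eu) loses because it cannot satisfy the A100 requirement; feasibility is checked before price.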
The solution automates some of the most challenging aspects of running cloud workloads. The makers of SkyPilot claim the program can reliably provision a cluster, with automatic failover to other locations if capacity or quota errors occur; sync user code and files from local machines or cloud buckets to the cluster; and manage task queuing and execution. The researchers say this comes with dramatically reduced costs, in some cases by more than 3x.
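The automatic-failover behavior described above amounts to trying candidate locations in order and moving on when one reports a capacity or quota error. Here is a minimal, self-contained sketch of that loop; the exception type, function names, and location strings are all hypothetical stand-ins, not SkyPilot internals.

```python
# Hypothetical sketch of provisioning with automatic failover: attempt
# each candidate location in preference order, falling back to the next
# when a capacity/quota error occurs.

class CapacityError(Exception):
    """Simulates a cloud reporting no available capacity."""

def provision(location, available):
    """Pretend to provision a cluster; fail if the location lacks capacity."""
    if location not in available:
        raise CapacityError(location)
    return f"cluster-on-{location}"

def provision_with_failover(locations, available):
    """Return the first cluster that provisions successfully."""
    for loc in locations:
        try:
            return provision(loc, available)
        except CapacityError:
            continue  # capacity/quota error: fail over to the next location
    raise RuntimeError("All candidate locations are out of capacity")

# The first choice is out of capacity, so provisioning fails over.
cluster = provision_with_failover(
    ["cloud-a/us-east", "cloud-b/us-west"], available={"cloud-b/us-west"})
print(cluster)
```

The key design point is that failover is handled inside the broker, so the user submits one job description and never retries provisioning by hand.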
SkyPilot developer and postdoctoral researcher Zongheng Yang said in a blog post that the growing trend of multi-cloud and multi-region strategies led the team to create SkyPilot, calling it an “intercloud broker”. He notes that organizations are strategically choosing a multi-cloud approach for greater reliability, avoiding cloud provider lock-in, and stronger bargaining leverage, to name a few reasons.
To reduce costs, SkyPilot exploits the large price differences between cloud providers for similar hardware. Yang cites the example of Nvidia A100 GPUs: Azure currently offers the cheapest A100 instances, while Google Cloud and AWS charge 8% and 20% premiums, respectively, for the same compute power. For CPU instances, some price differences exceed 50%.
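To make those premiums concrete, a quick back-of-the-envelope calculation shows how they compound over a long training run. Only the 8% and 20% figures come from the article; the Azure base price and the run length below are assumptions for illustration.

```python
# Working through the premiums Yang cites. The $3.00/hour Azure base
# price is hypothetical; only the 8% and 20% premiums are from the article.
azure_a100 = 3.00             # assumed $/hour for one A100 on the cheapest cloud
gcp_a100 = azure_a100 * 1.08  # 8% premium
aws_a100 = azure_a100 * 1.20  # 20% premium

hours = 100  # assumed length of a training run
gcp_extra = (gcp_a100 - azure_a100) * hours
aws_extra = (aws_a100 - azure_a100) * hours

print(f"GCP costs ${gcp_extra:.2f} more over {hours} hours")
print(f"AWS costs ${aws_extra:.2f} more over {hours} hours")
```

Even single-digit percentage premiums translate into real money at scale, which is why an automated broker that always picks the cheapest feasible location can pay for itself quickly.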
Specialized hardware is also a reason to shop around, as many cloud providers now offer custom options for different workloads. For example, Google Cloud offers TPUs for ML training, AWS has Inferentia for ML inference and Graviton processors for CPU workloads, and Azure provides Intel SGX nodes for confidential computing. The scarcity of these specialized resources is another reason to use multiple clouds, as high-end GPUs are often unavailable or come with long wait times.
Whatever the benefits of multi-cloud, there is often added complexity, and the Berkeley team has experienced this by using public clouds to run projects in ML, data science, systems, databases and security. Yang notes that using a single cloud is hard enough, but using multiple clouds exacerbates the burden on the end user, which the SkyPilot developers aim to alleviate.
The project has been under active development for more than a year in Berkeley’s Sky Computing Lab, according to Yang, and is used by more than 10 organizations for use cases including GPU/TPU model training, distributed hyperparameter tuning, and batch jobs on CPU spot instances. According to Yang, users report benefits including reliable provisioning of GPU instances, queuing multiple tasks on a cluster, and running hundreds of hyperparameter trials concurrently.
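Running hundreds of hyperparameter trials concurrently boils down to enumerating a search grid and submitting one task per configuration. The sketch below shows that enumeration step in plain Python; the parameter names and the idea of a task list are illustrative, not SkyPilot’s interface.

```python
import itertools

# Hypothetical sketch of a hyperparameter sweep: expand a grid into one
# task per configuration. A real framework would submit each task to a
# provisioned cluster; here the "queue" is just a list.
grid = {
    "lr": [1e-4, 3e-4, 1e-3],
    "batch_size": [32, 64],
    "dropout": [0.0, 0.1],
}

keys = list(grid)
tasks = [dict(zip(keys, values))
         for values in itertools.product(*(grid[k] for k in keys))]

print(len(tasks))  # 3 * 2 * 2 = 12 configurations
print(tasks[0])
```

With the grid expanded into independent tasks, a broker can queue them onto one cluster or fan them out across whichever clouds have capacity.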
To learn more about how SkyPilot works, check out Yang’s blog post or read the project documentation.