Storing and Retrieving Machine Learning Models at Scale With Distributed Object Storage
- Posted by Daitan Innovation Team
- On September 6, 2019
- Machine Learning, Object Storage
The need to quickly create, store, and fetch machine learning models at scale is rapidly increasing. Examples of applications include recommender systems that are based on individual customers’ purchasing habits or detection of fraud attempts based on each customer’s past behavior, among many others.
As a result, deploying a scalable, cost-effective, and reliable storage solution becomes central to platforms that handle a large number of individual machine learning models.
In this post, we explore object storage systems as a potential alternative to address this specific scenario and test two implementations: AWS S3 (public cloud) and Red Hat Ceph (public/private cloud and bare metal), although other good contenders are available, such as OpenStack Swift.
We analyze not only if they can be used, but also how they compare to each other. Although our tests are focused on a specific use case, the results can be useful for other scenarios involving a large number of blob (binary large objects) items.
How Do Distributed Object Storage Systems Work?
Object storage systems wrap data into objects, identified through a unique hash key. This key is usually based on the name of the uploaded file and determines the location (including the HDD or server) where it will be stored.
With that in mind, we might want to avoid uploading multiple files containing the same initial characters in their names. This practice can help some systems (like it did for S3, up until July 2018) to spread data across the nodes, improving performance.
This architecture also implies that it is not possible to append data to the end of files or to execute them directly. We must first retrieve and unwrap the original file from the object containing it.
S3 alone stores a few trillion objects without sacrificing latency or reliability. Dynamo, its underlying technology, was one of the first solutions to employ concepts like sloppy quorums and multiple Merkle trees (one for each node) simultaneously, allowing the platform to remain operational through server failures and network partitions.
Most offerings can also store multiple copies of a file, to improve availability. As a result, if an object is overwritten and immediately read, the system might not have time to propagate changes to the additional copies.
A user might receive the older version of the file (while S3 simply delivers both versions simultaneously, Ceph, for example, only marks the object as updated after the changes are propagated to all of the copies).
On the other hand, this improves performance in environments where files are simultaneously requested by multiple users and allows the system to fulfill requests, even if a node is offline.
On top of that, some solutions can improve performance by dividing each copy of an object into smaller pieces (in a similar way to RAID 0). Ceph splits objects into 4MB slices by default and allows users to adjust the number of concurrent write operations.
This enables the writing and retrieval of objects from multiple disks in parallel and provides an increased queue depth. On some devices (and up to a point), this increase can be translated into greater performance.
Besides security or regulations, one of the main advantages of deploying an object storage solution on-premises, such as in a private cloud, is the increased opportunity for performance tuning.
In the context of machine learning, quickly writing and retrieving datasets and models can be central to the pipeline. For this test, however, we deployed Ceph on EC2 instances (public cloud) for a better comparison to S3.
We executed performance tests on both Ceph and S3, through Boto3, the AWS SDK for python. By using the same interface to interact with both platforms we can ignore the overhead caused by the test itself.
Also, many solutions are designed to operate with S3, and by using the AWS API we enable scenarios that wouldn’t be possible otherwise (such as using Ceph with Apache Spark, for example).
In our tests, Ceph was deployed across three machines, containing SSDs capable of delivering 20.000 IOPS each. The SSDs were subdivided into three partitions and deployed through nine object storage daemons (one for each partition).
Each machine had access to two CPU cores and 4GB of RAM. Additionally to the OSDs, we deployed one instance of the Ceph Monitor Daemon and one instance of the Ceph Manager Daemon to each of them.
We also deployed the Rados Gateway Daemon to one of the machines.
The requests were executed from a fourth machine, with the same hardware configurations. Lastly, we connected all machines to a 10 Gbps network to avoid bottlenecks.
We measured upload and download speeds of 100 files, with sizes varying from 100KB and 100MB, and transferring one file at a time.
You can check the results on the plot below, where each point represents the average between five measurements.
Download Speeds Per File Size
Ceph delivered download speeds of up to 230MB/s, for a single file. Considering all files larger than 25MB, the solution speeds averaged at 211MB/s.
S3 measurements peaked at 150.5MB/s, with an average of 137MB/s considering all files larger than 25MB.
We can see more details in the table below.
Table 1: Measured download bandwidth of Red Hat Ceph and AWS S3, for files with sizes between 25MB and 100MB.
On S3, we observed some occasional drops in bandwidth while downloading larger files, to a minimum of 62.02MB/s.
This is the main reason for the observed uncertainty levels for the solution. In the plot below, we can see the confidence interval of each point, at 95%.
S3 Download Speed Per File Size
Ceph delivered more consistent results, with major deviations only appearing at the slope around 16MB to 25MB points.
Ceph Download Speed Per File Size
Both solutions presented regions of sharp increases in download speeds. S3 exhibited two of them, at the 8MB and 16MB points.
In this platform, the second jump was caused by a switch to multipart uploads. The next table presents details about measurements taken in between the slopes.
Table 2: Measured download bandwidth of AWS S3, with and without the use of multi-part transfers.
As we can see below, changing the parameter
multipart_thresholdin Boto3 to 8MB also changes the location of this slope.
We can observe that the bandwidth increases more gradually, exemplifying the diminishing returns of multi-part transfers for smaller file sizes.
S3 Download Speed Per File Size, with Boto3’s Multipart_Threshold Set at 8MB
The causes of the other slopes, however, are not that clear. We would love to hear your ideas on it!
Based on our preliminary experiment, both S3 and Ceph are potential candidates to address our goal of storing and retrieving machine learning models at scale with reasonable cost and performance.
From the performance perspective, both will be able to retrieve a relatively large 100MB model in ~800 milliseconds or less, on average, which is acceptable in our case.
While better performance would be nice, their ability to provide a reliable and scalable storage solution at low-cost is a very important aspect.
Comparing each solution’s results, we noticed that for larger files, Ceph offered improvements over the performance registered by S3, exemplifying the potential of deploying object storage solutions on custom hardware.
The importance of properly tuning our connection mechanisms also becomes apparent. By simply activating multipart-transfers, we increased S3 download speeds by an average of 59.45MB/s.
Tweaking settings like TCP Window scaling could increase speeds of both solutions even more.
It’s worth noting that we didn’t change the maximum concurrent writes or the size of the object slices while testing Ceph, choosing to keep the default configuration.
When optimized for a specific environment, the solution could deliver even better performance and you should test multiple settings before deploying it to production.
Also, while an on-premises bare-metal object storage solution is a viable option and can even deliver greater performance (depending on the hardware), there is a tradeoff.
S3 is a cheap and managed system that scales automatically and, if performance per dollar is the main factor for you, it might be the best choice. Deploying Ceph to a private cloud could be an intermediate solution.
While many tests were already made on top of these solutions, the exact behavior of these platforms is not widely known. We hope we showed their performance from a new perspective and are excited to see what the community will uncover in the future.
For example, another interesting test, depending on your use case, would be measuring performance under the stress caused by multiple concurrent users (load test). We’d love to learn if you have done it already!
Thanks for reading!
By Ranieri Castellar, Ivan Marin, and Fernando Moraes from Daitan’s Innovation and Technical Council teams. Thanks to the reviewers Felipe Pereira, Thalles Silva, and Kathleen McCabe.