Tuning - Hot Rodding Decentralized Storage Part 2
Good (Uses 4 CPU Cores)
These settings on a test 100GB file with adequate compute and a supporting network connection resulted in an observed transfer speed of 596Mb/s (74.5MB/s).
Better (Uses 8 CPU Cores)
These settings on a test 100GB file with adequate compute and a supporting network connection resulted in an observed transfer speed of 1839.2Mb/s (229.9MB/s).
Best (Uses 24 CPU Cores)
These settings on a test 100GB file with adequate compute and a supporting network connection resulted in an observed transfer speed of 2651.2Mb/s (331.4MB/s).
More is not always better (Uses 24 CPU Cores)
These settings on a test 100GB file with adequate compute and a supporting network connection resulted in an observed transfer speed of 2676Mb/s (334.5MB/s). This is about 1% faster than the previous configuration but completely saturates the server's compute, showing that at this scale there isn't much advantage to exceeding the available CPU cores.
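The figures quoted for the four download tests can be checked with a little arithmetic. The following is a minimal sketch that only uses the observed speeds listed above; the function names are illustrative and not part of any tool.

```python
def megabits_to_megabytes(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8


def percent_gain(baseline: float, candidate: float) -> float:
    """Relative improvement of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100


# Observed download speeds from the tests above, in Mb/s.
observed = {
    "Good (4 cores)": 596.0,
    "Better (8 cores)": 1839.2,
    "Best (24 cores)": 2651.2,
    "More is not always better (24 cores)": 2676.0,
}

for name, mbps in observed.items():
    print(f"{name}: {mbps} Mb/s = {megabits_to_megabytes(mbps):.1f} MB/s")

gain = percent_gain(observed["Best (24 cores)"],
                    observed["More is not always better (24 cores)"])
print(f"Gain from pushing past Best: {gain:.1f}%")
```

The last line shows why the extra load is hard to justify: the gain is under one percent.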
Download Tuning - (Many Files)
If downloading many small files, we need to use a different method to get parallelism and thus higher performance. Small files (<64MB) contain only one segment and do not have the opportunity to use segment parallelism to reach high speeds.
To get the greatest overall throughput, we download many files at one time, and RCLONE is great for this. You can do this through the hosted Gateway MT or through native methods. A native method bypasses our edge services, which makes it the slightly more performant path. The commands below work the same way regardless of how you set up RCLONE.
The goal of the following tuning is to get the best overall total throughput for your data transfer session.
Upload - Choose your own adventure
Let’s start with a focus on the goal of best overall throughput. You likely fall somewhere in the gradient listed below:
- I want to upload one to a few large files
- I want to upload many smaller files
Given that we get outstanding performance with parallel actions, we need to calculate how to get the most out of what you have. We will list a few examples below. Please note these examples focus on the maximum possible performance and can require significant memory resources. We address this in detail in the tuning section.
Single to few large files (upload)
In testing, we find that upload performance can be fully realized with single files as small as 6GB. With the proper resources (memory and network), you can realize the best theoretical performance with the following figures.
- 1x 6GB file uploaded with a chunk size of 64M and concurrency of 96
- 2x 3GB files uploaded with a chunk size of 64M and concurrency of 48
- 4x 1.5GB files uploaded with a chunk size of 64M and concurrency of 24
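The three figures above follow one pattern: each file gets a concurrency equal to its number of 64MB segments, and the total across all files stays at 96. Here is a minimal sketch of that reasoning. The helper name and the idea of splitting the 96 budget evenly are assumptions made for illustration.

```python
import math


def per_file_concurrency(file_mb: int, file_count: int,
                         chunk_mb: int = 64, total_limit: int = 96) -> int:
    """Concurrency to use per file when uploading file_count equal-sized files.

    A file cannot use more concurrency than it has segments, and the sum
    across all files should not exceed total_limit.
    """
    segments = math.ceil(file_mb / chunk_mb)   # segments available in one file
    share = total_limit // file_count          # even split of the total budget
    return max(1, min(segments, share))


for size_mb, count in [(6144, 1), (3072, 2), (1536, 4)]:
    c = per_file_concurrency(size_mb, count)
    print(f"{count}x {size_mb}MB files: concurrency {c} each, {c * count} total")
```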
Many smaller files (upload)
If uploading many small files, we need to use a different method to get parallelism and thus higher performance. Small files (<64MB) contain only one segment and do not have the opportunity to use segment parallelism to reach high speeds.
To get the greatest overall throughput, we upload many files at one time, and RCLONE is great for this.
Let’s quickly review what we just learned above:
- Files 64MB or smaller are a single segment, so to make full use of the network, we need to move many of them at the same time
- Large files can benefit from parallel segment transfers
- You can combine concurrent file transfer and segment parallelism for several large files if you haven’t reached our theoretical limits
Upload Tuning (Large Files)
Rclone via Gateway MT
When uploading files larger than 64MB, the fastest solution is using a tool that supports multipart upload via the S3 standard and our S3 compatible hosted Gateway MT. We advise using RCLONE with our Gateway MT integration pattern to enable multipart uploads and concurrency tuning.
Upload performance is generally memory bound and uses very little compute, as the burden of encryption and erasure coding is handled by the hosted Gateway MT service. Memory requirements are as follows:
- (S3 Upload Concurrency * Chunk Size) * Concurrent File transfers
Simply put, when maximizing throughput, memory requirements equal the total size of the files being uploaded. To upload a 2GB file (2048MB/64MB=32 segments) with the ideal S3 upload concurrency of 32, you'll need 2GB of free memory. With an S3 upload concurrency of 16, you'll need only 1GB. If you upload two such files simultaneously, each with the ideal concurrency of 32, you'll need 4GB of memory.
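The memory formula is easy to turn into a quick calculator. This is a minimal sketch; the function name is illustrative, and it reproduces the 2GB, 1GB, and 4GB examples from the paragraph above.

```python
def upload_memory_mb(concurrency: int, chunk_mb: int, transfers: int = 1) -> int:
    """(S3 upload concurrency * chunk size) * concurrent file transfers, in MB."""
    return concurrency * chunk_mb * transfers


# One 2GB file, all 32 segments in flight.
print(upload_memory_mb(32, 64))                # 2048 MB = 2GB
# Same file with half the concurrency.
print(upload_memory_mb(16, 64))                # 1024 MB = 1GB
# Two 2GB files at once, 32 concurrency each.
print(upload_memory_mb(32, 64, transfers=2))   # 4096 MB = 4GB
```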
As a general rule, enormous files tend to reach peak transfer speeds at around 16GB of memory usage, which works out to S3 upload concurrency of 96 and chunk size of 64M. We’ve demonstrated over 1340Mb/s (168MB/s) with such settings and resources.
Ultimately, your compute, RAM, or internet connection is likely to be the gating factor. We advise that you perform a speed test to understand the capability of your network before testing.
Single 1GB File
Below, you have an optimal command to achieve the best theoretical transfer speed. Higher concurrency or segment size won’t help as 1024MB/64MB=16 total segments. The following command uploads all 16 segments in parallel.
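As a rough sketch of how the segment count maps onto an Rclone invocation, the snippet below computes the number of segments and assembles a command line. The flags `--s3-upload-concurrency`, `--s3-chunk-size`, and `--progress` are standard Rclone options; the file name `1gb.zip` and the destination `remote:bucket` are placeholders, not values from this guide.

```python
import math
import shlex


def segment_count(file_mb: int, chunk_mb: int = 64) -> int:
    """Number of chunks a file of file_mb splits into at the given chunk size."""
    return math.ceil(file_mb / chunk_mb)


def rclone_upload_command(source: str, destination: str,
                          concurrency: int, chunk_mb: int) -> str:
    """Build an rclone copy command line (as a string; nothing is executed)."""
    args = [
        "rclone", "copy", "--progress",
        "--s3-upload-concurrency", str(concurrency),
        "--s3-chunk-size", f"{chunk_mb}M",
        source, destination,   # placeholder paths, replace with your own
    ]
    return shlex.join(args)


segments = segment_count(1024)   # 1024MB / 64MB = 16
print(rclone_upload_command("1gb.zip", "remote:bucket", segments, 64))
```

With 16 segments and a concurrency of 16, every segment of the file is in flight at once, which is why raising either value further does not help.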
Single 10GB File
Below, you can see us using three different amounts of memory to achieve up to the best theoretical speed. RAM usage equals concurrency * chunk size, and as a rule of thumb, we like concurrency to stay at or under 96. Because of this, we increased chunk size in 64MB steps to improve performance.
Good (uses 2.5GB of RAM to upload)
This will achieve 25% of theoretical max performance but uses much less RAM:
Better (uses 5GB of RAM to upload)
This will achieve 50% of theoretical max performance but uses less RAM:
Best (uses 10GB of RAM to upload)
10,240MB/64MB=160 segments, so 10GB is the largest concurrency * chunk size product that is useful for a 10GB file.
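One way to apply the rule of thumb is to start from a memory budget, begin with a 64MB chunk size, and raise the chunk size in 64MB steps until the concurrency fits under 96. The sketch below does exactly that. It is an illustration of the rule as described, with a hypothetical function name, and the values it prints are derived from the rule rather than copied from tested commands.

```python
def settings_for_budget(budget_mb: int, max_concurrency: int = 96,
                        step_mb: int = 64) -> tuple[int, int]:
    """Return (concurrency, chunk_mb) so that concurrency * chunk_mb fits the budget.

    Chunk size grows in step_mb increments until concurrency <= max_concurrency.
    """
    chunk_mb = step_mb
    while budget_mb // chunk_mb > max_concurrency:
        chunk_mb += step_mb
    return max(1, budget_mb // chunk_mb), chunk_mb


for label, budget_mb in [("Good", 2560), ("Better", 5120), ("Best", 10240)]:
    concurrency, chunk_mb = settings_for_budget(budget_mb)
    print(f"{label}: concurrency {concurrency}, chunk size {chunk_mb}M, "
          f"RAM {concurrency * chunk_mb}MB")
```

For the 10GB budget, a 64MB chunk would need a concurrency of 160, so the sketch steps up to 128MB chunks and lands on a concurrency of 80.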
Single 100GB File
Things get fun with enormous files: fun, fast, and resource-intensive. At these sizes and speeds, we start to run into limitations outside the client configuration. In some instances, your host may limit your performance via tools like Quality of Service (QoS). Below you can see us using three different amounts of memory to achieve up to the best speed. We find that the best possible performance occurs around 16GB of memory usage, where we’ve observed over 1340Mb/s (168MB/s) inclusive of encryption and erasure coding.
RAM usage equals concurrency * chunk size, and as a rule of thumb, we like concurrency to stay at or under 96. Because of this, we increased the chunk size to 1024MB to improve performance and greatly reduce the concurrency requirement.
Good (uses 4GB of RAM to upload)
This will achieve >25% of max performance but uses half the RAM of our Better settings. We’ve observed 488Mb/s (61MB/s) with this configuration.
Better (uses 8GB of RAM to upload)
This will achieve >50% of max performance but uses half the RAM of our Best settings. We’ve observed 720Mb/s (90MB/s) with this configuration.
Best (uses 16GB of RAM to upload)
Although you can use more concurrency or larger chunk sizes, we find performance tends to top out above 1200Mb/s (150MB/s) around this level.
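With a 1024MB chunk size, the concurrency for each tier falls straight out of the RAM budget, and the observed speeds can be compared against the best observed figure. This is a small sketch that uses only the numbers quoted in this section. Treating 1340Mb/s as the reference peak is an assumption made for the comparison.

```python
CHUNK_MB = 1024    # chunk size used for the 100GB examples
PEAK_MBPS = 1340   # best observed speed, used here as the reference


def concurrency_for(ram_gb: int, chunk_mb: int = CHUNK_MB) -> int:
    """Concurrency that fills ram_gb of memory at the given chunk size."""
    return (ram_gb * 1024) // chunk_mb


def share_of_peak(observed_mbps: float, peak_mbps: float = PEAK_MBPS) -> float:
    """Observed speed as a percentage of the reference peak."""
    return observed_mbps / peak_mbps * 100


tiers = [("Good", 4, 488), ("Better", 8, 720), ("Best", 16, 1340)]
for label, ram_gb, mbps in tiers:
    print(f"{label}: {ram_gb}GB RAM, concurrency {concurrency_for(ram_gb)}, "
          f"{mbps}Mb/s ({share_of_peak(mbps):.1f}% of peak)")
```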
Multi 1GB File
As discussed earlier, uploading is RAM limited, so we’ll explore calculating upload settings featuring the --transfers flag. We do the same calculation as for a single 1GB file and then multiply it by the number of files you want to upload simultaneously.
As calculated earlier, 1024MB/64MB=16 total segments, so a 1GB file is 16 segments with a chunk size of 64MB. Let's optimize around 4GB of RAM, assuming enough upstream bandwidth is available.
Working backward, 4GB allows us to upload four files in parallel while also uploading all 16 segments in each file simultaneously. This process is equivalent to a 4GB file uploaded with a concurrency of 64. Avoid total concurrency figures above 96. Let’s look at two commands that will present your system with the same load in different ways.
4x 1GB Files 16 concurrency per file (max)
The default in Rclone is --transfers 4.
8x 1GB Files 8 concurrency per file
Here we explore 8x transfers with a concurrency of 8 each, so total concurrency stays at 64. This should perform similarly to the command above. There is no “best” command between the two. The extra example is intended to offer additional insight into how we arrive at a total concurrency load.
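To see that the two configurations present the same load, it helps to compute total concurrency and memory for each. The sketch below is illustrative only; the `UploadPlan` class is a hypothetical helper, and its fields correspond to the --transfers, --s3-upload-concurrency, and --s3-chunk-size settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UploadPlan:
    transfers: int      # files uploaded at once (--transfers)
    concurrency: int    # segments in flight per file (--s3-upload-concurrency)
    chunk_mb: int = 64  # chunk size in MB (--s3-chunk-size)

    @property
    def total_concurrency(self) -> int:
        return self.transfers * self.concurrency

    @property
    def memory_mb(self) -> int:
        return self.total_concurrency * self.chunk_mb

    def within_limit(self, limit: int = 96) -> bool:
        """True when total concurrency stays at or under the advised ceiling."""
        return self.total_concurrency <= limit


for plan in (UploadPlan(transfers=4, concurrency=16),
             UploadPlan(transfers=8, concurrency=8)):
    print(plan, plan.total_concurrency, f"{plan.memory_mb}MB", plan.within_limit())
```

Both plans come out at a total concurrency of 64 and 4096MB of memory, which is why neither is inherently better than the other.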
Did you happen to miss part one in this series? You can find it here.