Hi folks, Ned here again. A few years ago, I demonstrated using Storage Replica as an extreme data mover, not just as a DR solution; copying blocks is a heck of a lot more efficient than copying files. At the time, having even a single NVMe drive and RDMA networking was gee-whiz stuff. Well, times have changed, and all-flash storage deployments are everywhere. Even better, RDMA networking like iWARP is becoming commonplace. When you combine Windows Server 2016 or the newly announced Windows Server 2019 with ultrafast flash storage and ultrafast networking, you can get amazing speed results.
What sort of speeds are we talking about here?
The Gear
The good folks at Chelsio – makers of the iWARP RDMA networking used by SMB Direct – set up a pair of servers with the following config:
- OS: Windows Server 2016
- System Model: 2x Supermicro X10DRG-Q
- RAM: 128GB per node
- CPU: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz (2 sockets, 24 cores) per node
- Intel NVMe SSD Model: SSDPECME016T4 (1.6TB) – 5x in source node
- Micron NVMe SSD Model: MTFDHAX2T4MCF-1AN1ZABYY (2.4TB) – 5x in destination node
- 2x Chelsio T6225-CR 25Gb iWARP RNICs
- 2x Chelsio T62100-CR 100Gb iWARP RNICs
25 and 100 gigabit networking that offloads the transfers from the CPU and does remote direct memory placement with SMB!? Yes please.
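If you want to confirm your own gear will light up SMB Direct the same way, a quick sanity check on each node looks something like this (a minimal sketch, nothing lab-specific):

```powershell
# Is RDMA enabled on the NICs themselves?
Get-NetAdapterRdma | Format-Table Name, Enabled, InterfaceDescription

# Does SMB see RDMA-capable interfaces on both the client and server side?
Get-SmbClientNetworkInterface | Where-Object RdmaCapable
Get-SmbServerNetworkInterface | Where-Object RdmaCapable
```

If the adapters show up in all three lists, SMB 3 will negotiate SMB Direct on its own; no extra configuration is needed.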
The Goal
We wanted to see if Storage Replica block copying with NVMe could fully utilize an iWARP RDMA network, and what the CPU overhead would look like. With NVMe drives, a server under a heavy data transfer workload is far more likely to run out of network bandwidth than out of storage IOPS or throughput. 10Gb Ethernet and TCP simply cannot keep up, and their reliance on the host CPU for all the work restricts performance even further.
We already know that straight file copying cannot match the performance of Storage Replica block copying, and that it shows significant CPU usage on each node. But where would the bottleneck be now?
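If you're wondering where the bottleneck would land on your own hardware, Storage Replica ships a cmdlet that measures exactly this before you deploy. Here's a rough sketch; the server names, volume letters, and report path are placeholders, not what we used in this lab:

```powershell
# Measure storage and network performance between two candidate nodes before
# committing to a design. Names, volumes, and paths below are placeholders.
Test-SRTopology -SourceComputerName "SR-SRV01" -SourceVolumeName "D:" -SourceLogVolumeName "E:" `
                -DestinationComputerName "SR-SRV02" -DestinationVolumeName "D:" -DestinationLogVolumeName "E:" `
                -DurationInMinutes 10 -ResultPath "C:\Temp"
# The HTML report written to C:\Temp shows measured IOPS and throughput,
# so you can see whether storage or network runs out first.
```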
The Grades
25Gb
First, I tried the 25Gb RDMA network, configuring Storage Replica to perform initial sync and clone the entire 2TB volume residing on top of the storage pool.
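For the curious, standing up a partnership like this is a one-liner. A sketch of the shape of it follows; the computer names, replication group names, volume letters, and log size are placeholders rather than the lab's exact values:

```powershell
# Create source and destination replication groups and kick off the initial
# block-level sync. All names and sizes below are placeholders.
New-SRPartnership -SourceComputerName "SR-SRV01" -SourceRGName "RG01" `
                  -SourceVolumeName "D:" -SourceLogVolumeName "E:" `
                  -DestinationComputerName "SR-SRV02" -DestinationRGName "RG02" `
                  -DestinationVolumeName "D:" -DestinationLogVolumeName "E:" `
                  -LogSizeInBytes 8GB
```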
As you can see, this immediately consumed the entire 25Gb network. The NVMe is just too fast, and Storage Replica is a kernel-mode disk filter that pumps data blocks at the line rate of the storage.
CPU and memory usage stay very low. This is the advantage that SMB Direct and RDMA offloading bring to the table; the server is left with all of its resources to do its real job, not to deal with user-mode nonsense.
In the end, this was quite a respectable run and the data moved very fast. Copying 2TB in 12 minutes works out to roughly 2.8GB/sec, right around the practical limit of a 25Gb link, with no real CPU or memory hit. That's great by any definition.
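If you want to watch a run like this while it's in flight, the replication group on the destination node reports how much data is left to copy. A minimal check, using the placeholder group name from the sketch above:

```powershell
# On the destination node: replication state and bytes remaining in initial sync.
(Get-SRGroup -Name "RG02").Replicas |
    Select-Object DataVolume, ReplicationMode, ReplicationStatus, NumOfBytesRemaining
```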
But we can do better.
100Gb
Same test with the same servers, storage, volumes, and Storage Replica configuration – except this time I'm using 100Gb Chelsio iWARP networking.
I like videos. Let’s watch a video this time (turn on CC if you’re less familiar with SR and crank the resolution).
Holy smokes!!! The storage cannot keep up with the networking. Let me restate:
The striped NVMe drives cannot keep up with SMB Direct and iWARP.
We just pushed 2 terabytes of data over SMB 3.1.1 and RDMA in under three minutes. That's roughly 10 gigabytes a second, north of 80Gb/sec on the wire.
The Rundown
When you combine Windows Server and Chelsio iWARP RDMA, you get ultra-low latency, low-CPU, low-memory, high throughput SMB and workload performance in:
- Storage Spaces Direct
- Storage Replica
- Hyper-V Live Migration
- Windows Server and Windows 10 Enterprise client SMB operations
You will not be disappointed.
A huge thanks to the good folks at Chelsio for the use of their loaner gear and lab. Y’all rock.
– Ned Pyle
PS: