Note: this post originally appeared on https://aka.ms/clausjor by Claus Joergensen.
Hello, Claus here again, this time at 30,000 feet on a plane going back to Denmark for my dad’s 80th birthday. I think it is time that we explore some of the inner workings of Storage Spaces Direct (S2D) – it is much more exciting than any movie in the entertainment system. We are going to look at the Software Storage Bus, which is the central nervous system of Storage Spaces Direct. If you don’t already know what Storage Spaces Direct is, please see my blog post introducing Storage Spaces Direct.
Software Storage Bus introduction
The Software Storage Bus (SSB) is a virtual storage bus spanning all the servers that make up the cluster. SSB essentially makes it possible for each server to see all disks across all servers in the cluster, providing full mesh connectivity. SSB consists of two components on each server in the cluster: ClusPort and ClusBlft. ClusPort implements a virtual HBA that allows the node to connect to disk devices in all the other servers in the cluster. ClusBlft implements virtualization of the disk devices and enclosures in each server for ClusPort in other servers to connect to.
Figure 1: Windows Server storage stack with the Software Storage Bus in green.
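To make the full-mesh idea a bit more concrete, here is a minimal sketch in plain Python (not actual S2D code, and all the names are made up) of what ClusPort and ClusBlft accomplish together: every node ends up enumerating every disk in the cluster, not just its own.

```python
# Minimal sketch (not actual S2D code) of the full-mesh view the Software
# Storage Bus provides: every server's virtual HBA (ClusPort) connects to the
# disk/enclosure virtualization component (ClusBlft) on every server, so each
# node enumerates every disk in the cluster. All names here are illustrative.

from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    local_disks: list = field(default_factory=list)  # disks exposed via ClusBlft

def enumerate_cluster_disks(servers):
    """The disk view a node gets through its ClusPort virtual HBA.
    Because connectivity is full mesh, the view is identical from every node."""
    visible = []
    for target in servers:
        for disk in target.local_disks:
            visible.append((target.name, disk))
    return visible

servers = [
    Server("Node1", ["HDD1", "HDD2", "SSD1"]),
    Server("Node2", ["HDD3", "HDD4", "SSD2"]),
    Server("Node3", ["HDD5", "HDD6", "SSD3"]),
]

# Every node sees all nine disks, regardless of which server they sit in.
print(enumerate_cluster_disks(servers))
```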
SMB as transport
SSB uses SMB3 and SMB Direct as the transport for communication between the servers in the cluster. SSB uses a separate named instance of SMB in each server, which separates it from other consumers of SMB, such as CSVFS, to provide additional resiliency. Using SMB3 enables SSB to take advantage of the innovations we have made in SMB3, including SMB Multichannel and SMB Direct. SMB Multichannel can aggregate bandwidth across multiple network interfaces for higher throughput and provide resiliency to a failed network interface (for more information about SMB Multichannel, go here). SMB Direct enables the use of RDMA-enabled network adapters, including iWARP and RoCE, which can dramatically lower the CPU overhead of doing IO over the network and reduce the latency to disk devices (for more information about SMB Direct, go here). I did a demo at the Microsoft Ignite conference back in May showing the IOPS difference in a system with and without RDMA enabled (the demo is towards the end of the presentation).
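To illustrate what SMB Multichannel buys us (this is just an illustrative Python sketch, not the SMB implementation, and the NIC names are invented), think of IO being spread across all healthy network interfaces: bandwidth aggregates, and if a NIC fails, IO simply flows over the remaining paths.

```python
# Illustrative sketch (not the SMB implementation) of what SMB Multichannel
# gives the Software Storage Bus: IO is spread across all healthy network
# interfaces, so throughput aggregates and a failed NIC only removes one path.

def spread_io(requests, channels):
    """Round-robin IO requests over the channels that are still up."""
    healthy = [c for c in channels if c["up"]]
    if not healthy:
        raise RuntimeError("no healthy channels left")
    assignment = {c["name"]: [] for c in healthy}
    for i, req in enumerate(requests):
        target = healthy[i % len(healthy)]
        assignment[target["name"]].append(req)
    return assignment

channels = [
    {"name": "rdma-nic-1", "up": True},
    {"name": "rdma-nic-2", "up": True},
]
ios = [f"io-{n}" for n in range(8)]

print(spread_io(ios, channels))   # both NICs carry IO (aggregated bandwidth)
channels[0]["up"] = False         # simulate a NIC failure
print(spread_io(ios, channels))   # all IO flows over the surviving NIC
```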
Software Storage Bus Bandwidth Management
SSB also implements a fair access algorithm that ensures fair device access from any server, to protect against one server starving out the others. It also implements an IO prioritization algorithm that prioritizes Application IO, which usually is IO from virtual machines, over system IO, which usually would be rebalance or repair operations, while still ensuring that rebalance and repair operations can make forward progress. Finally, it implements an algorithm that de-randomizes IO going to rotational disk devices to drive a more sequential IO pattern on these devices, even though the IO coming from the applications (virtual machines) is a random IO pattern.
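Here is a rough sketch of those last two ideas in Python. This is not the actual SSB scheduler, just a toy illustration of reserving a slice of each batch for system IO so it keeps making progress, and sorting a batch by disk offset so a rotational disk sees a more sequential sweep.

```python
# Rough sketch of the two ideas described above, not the actual SSB scheduler:
# (1) Application IO is dispatched ahead of system IO (rebalance/repair), but a
#     small share of each batch is reserved for system IO so it makes progress.
# (2) IO destined for a rotational disk is sorted by offset so the drive sees a
#     more sequential pattern even when the VMs issue random IO.

def pick_batch(app_io, system_io, batch_size=16, system_share=0.25):
    """Take mostly application IO, but guarantee some system IO per batch."""
    reserved = max(1, int(batch_size * system_share)) if system_io else 0
    return app_io[:batch_size - reserved] + system_io[:reserved]

def derandomize(batch):
    """Order the batch by on-disk offset to approximate a sequential sweep."""
    return sorted(batch, key=lambda io: io["offset"])

app_io = [{"src": "vm", "offset": o} for o in (900, 10, 400, 250, 700)]
sys_io = [{"src": "repair", "offset": o} for o in (120, 600)]

batch = pick_batch(app_io, sys_io, batch_size=6)
print(derandomize(batch))   # ordered by offset; repair IO still gets a slot
```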
Software Storage Bus Cache
Finally, SSB implements a caching mechanism, which we call the Storage Bus Cache (SBC). SBC is scoped to each server (a per-node cache) and is agnostic to the storage pools and virtual disks defined in the system. SBC is resilient to failures, as it sits underneath the virtual disk, which provides resiliency by writing data copies to different nodes. When S2D is enabled in a cluster, SBC identifies which devices to use as caching devices and which devices are capacity devices. Caching devices will, as the name suggests, cache data for the capacity devices, essentially creating hybrid disks. Once it has been determined whether a device is a caching device or a capacity device, the capacity devices are bound to a caching device in a round-robin manner, as shown in the diagram below. Rebinding will occur if there is a topology change, such as a caching device failing.
Figure 2: Storage Bus Cache in a hybrid storage configuration with SATA SSD and SATA HDD
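A simplified sketch of that round-robin binding could look like the following (illustrative Python only; the device names are made up):

```python
# Simplified sketch of the round-robin binding described above (illustrative
# only): each capacity device is bound to one caching device in turn, and a
# topology change such as a failed caching device triggers a rebind.

def bind_round_robin(caching_devices, capacity_devices):
    """Map each capacity device to a caching device, spreading them evenly."""
    if not caching_devices:
        return {}
    return {
        capacity: caching_devices[i % len(caching_devices)]
        for i, capacity in enumerate(capacity_devices)
    }

caching = ["SSD1", "SSD2"]
capacity = ["HDD1", "HDD2", "HDD3", "HDD4"]

print(bind_round_robin(caching, capacity))
# {'HDD1': 'SSD1', 'HDD2': 'SSD2', 'HDD3': 'SSD1', 'HDD4': 'SSD2'}

# If SSD1 fails, the surviving caching device picks up its capacity devices.
print(bind_round_robin(["SSD2"], capacity))
```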
The behavior of the caching devices is determined by the actual disk configuration of the system, as outlined below:
In systems with rotational capacity devices (HDD), SBC will act as both a read and write cache, because there is a seek penalty on rotational disk devices. In systems with all-flash devices (NVMe SSD + SATA SSD), SBC will only act as a write cache. Because the NVMe devices absorb most of the writes in the system, it is possible to use mixed-use or even read-intensive SATA SSD devices, which can lower the overall cost of flash in the system. In systems with only a single tier of devices, such as an all-NVMe or all-SATA SSD system, SBC will need to be disabled. For more details on how to configure SBC, please see the Storage Spaces Direct experience guide here.
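If you want to reason about which mode you will end up in, the rules above boil down to something like this little sketch (again, illustrative Python, not the actual S2D decision logic):

```python
# Hedged sketch of the cache-mode selection described above; the real decision
# is made inside S2D, these rules just restate the text.

def cache_mode(device_types):
    """device_types: media types present, e.g. {'NVMe', 'SATA_SSD', 'HDD'}."""
    if len(device_types) == 1:
        return "cache disabled"       # single tier: all NVMe or all SATA SSD
    if "HDD" in device_types:
        return "read + write cache"   # rotational capacity devices have a seek penalty
    return "write cache only"         # all-flash, e.g. NVMe SSD + SATA SSD

print(cache_mode({"SATA_SSD", "HDD"}))   # read + write cache
print(cache_mode({"NVMe", "SATA_SSD"}))  # write cache only
print(cache_mode({"NVMe"}))              # cache disabled
```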
SBC creates a special partition on each caching device that, by default, consumes all available capacity except 32GB. The 32GB is used for storage pool and virtual disk metadata. SBC uses memory for runtime data structures, about 10GB of memory per TB of caching devices in the node. For instance, a system with 4x 800GB caching devices requires about 32GB of memory to manage the cache, in addition to what is needed for the base operating system and any hosted hyper-converged virtual machines.
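As a quick back-of-the-envelope check of that example (plain Python, using the 10GB-per-TB figure from the text; the device sizes are the example's):

```python
# Back-of-the-envelope calculation matching the example above.
GB_PER_TB_OF_CACHE = 10   # runtime metadata memory per TB of caching devices

def sbc_memory_gb(cache_device_sizes_gb):
    total_cache_tb = sum(cache_device_sizes_gb) / 1000
    return total_cache_tb * GB_PER_TB_OF_CACHE

print(sbc_memory_gb([800, 800, 800, 800]))  # 4x 800GB caching devices -> 32.0 GB
```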
I hope you enjoyed reading this as much as I enjoyed writing it. I still have a couple of hours left on my flight, maybe I should try and catch some sleep. Until next time.