What’s New in Data Deduplication?
If I had to pick two words to sum up the major changes for Data Deduplication coming in the next version of Windows Server, they would be “scale” and “performance”. In this posting, I’ll explain what these changes are and provide some recommendations of what to evaluate in Windows Server Technical Preview 2.
In Windows Server 2016, we are making major investments to enable Data Deduplication (or “dedup” for short) to scale more effectively to larger amounts of data. For example, customers have been telling us that they are using dedup for scenarios such as backing up all the tenant VMs for a hosting business, with data sets ranging from hundreds of terabytes to petabytes. For these cases, they want to use larger volumes and files while still getting the great space savings they currently get from Windows Server.
Dedup Improvement #1: Use the volume size you need, up to 64TB
Dedup in Windows Server 2012 R2 optimizes data using a single-threaded job and I/O queue for each volume. It works great, but you do have to be careful not to make the volumes so big that the dedup processing can’t keep up with the rate of data changes, or “churn”. In a previous blog posting (Sizing Volumes for Data Deduplication in Windows Server), we explained in detail how to determine the right volume size for your workload, and we have typically recommended keeping volume sizes under 10TB.
That all changes in Windows Server 2016 with a full redesign of dedup optimization processing. We now run multiple threads in parallel using multiple I/O queues on a single volume, resulting in performance that was previously possible only by dividing your data into multiple, smaller volumes.
The result is that our volume guidance changes to a very simple statement: Use the volume size you need, up to 64TB.
Dedup Improvement #2: File sizes up to 1TB are good for dedup
While the current version of Windows Server supports file sizes up to 1TB, files “approaching” this size are noted as “not good candidates” for dedup. The reason is how the current algorithms scale: operations such as scanning for and inserting changes can slow down as the total data set grows. This has all been redesigned for Windows Server 2016 with new stream map structures and improved partial file optimization, so you can go ahead and dedup files up to 1TB without worrying about whether they are good candidates. These changes also improve overall optimization performance, adding to the “performance” part of the story for Windows Server 2016.
Dedup Improvement #3: Virtualized backup is a new usage type
We announced support for the use of dedup with virtualized backup applications using Windows Server 2012 R2 at TechEd last November, and there has been a lot of customer interest in this scenario since then. We also published a TechNet article with the DPM Team (see Deduplicating DPM Storage) with a reference configuration that lists the specific dedup configuration settings to make the scenario optimal.
With a new release we can do more to simplify these kinds of deployments, and in Windows Server 2016 we have combined all the dedup configuration settings into a new usage type called, as you might expect, “Backup”. This both simplifies the deployment and helps to “future proof” your configuration, since any future setting changes can be rolled into this usage type and applied automatically when you set it.
Suggestions for What to Check Out in Windows Server TP2
What should you try out in Windows Server TP2? Of course, we encourage you to evaluate the new version of dedup overall on your own workloads and datasets. This applies to any deployment you may be using or interested in evaluating for dedup, including volumes for general file shares or for supporting a VDI deployment, as described in our previous blog article on Large Scale VDI Deployment.
But specifically for the new features, here are a couple of areas we think it would be great for you to try.
Volume Sizes
Try larger volume sizes, up to 64TB. This is especially interesting if you have wanted to use larger volumes in the past but were limited by the requirements for smaller volume sizes to keep up with optimization processing.
Basically the guidance for this evaluation is to only follow the first section of our previous blog article Sizing Volumes for Data Deduplication in Windows Server, “Checking Your Current Configuration”, which describes how to verify that dedup optimization is completing successfully on your volume. Use the volume size that works best for your overall storage configuration and verify that dedup is scaling as expected.
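As a rough sketch of that check, the following reads back the job queue and the result of the last optimization run. The drive letter in $volume is a hypothetical placeholder, and the choice of properties to display is my own; the previous article remains the reference for what to look at. Generally, a LastOptimizationResult of 0x00000000 and an OptimizedFilesCount that keeps pace with InPolicyFilesCount suggest that optimization is keeping up.

```powershell
# Hypothetical volume; replace with the volume you are evaluating.
$volume = 'E:'

# Show any dedup jobs that are currently queued or running on the volume.
Get-DedupJob -Volume $volume

# Show the outcome of the most recent optimization job and how many
# in-policy files have been optimized so far.
Get-DedupStatus -Volume $volume |
    Format-List Volume, LastOptimizationTime, LastOptimizationResult,
                LastOptimizationResultMessage, OptimizedFilesCount, InPolicyFilesCount
```

Running this periodically over a few days of normal churn gives a better picture than a single snapshot.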
Virtualized Backup
In the TechNet article I mentioned above, Deduplicating DPM Storage, there are two changes you can make to the configuration guidance.
Change #1: Use the new “Backup” usage type to configure dedup
In the section “Plan and set up deduplicated volumes” and in the following section “Plan and set up the Windows File Server cluster”, replace all the dedup configuration commands with the single command to set the new “Backup” usage type.
Specifically, replace all these commands in the article:
# For each volume
Enable-DedupVolume -Volume <volume> -UsageType HyperV
Set-DedupVolume -Volume <volume> -MinimumFileAgeDays 0 -OptimizePartialFiles:$false

# For each cluster node
Set-ItemProperty -Path HKLM:\Cluster\Dedup -Name DeepGCInterval -Value 0xFFFFFFFF
Set-ItemProperty -Path HKLM:\Cluster\Dedup -Name HashIndexFullKeyReservationPercent -Value 70
Set-ItemProperty -Path HKLM:\Cluster\Dedup -Name EnablePriorityOptimization -Value 1
…with this one new command:
# For each volume
Enable-DedupVolume -Volume <volume> -UsageType Backup
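If you have several volumes, one way to apply the usage type and confirm it took effect is to loop over them and read the settings back. This is a minimal sketch: the $volumes list is a hypothetical placeholder, and the properties shown are simply the ones I would expect to be interesting. The exact values that the “Backup” usage type applies are not listed here and may differ between builds.

```powershell
# Hypothetical volume list; replace with the volumes that hold your backup data.
$volumes = 'E:', 'F:'

foreach ($volume in $volumes) {
    # Apply the new usage type (the single command from this post).
    Enable-DedupVolume -Volume $volume -UsageType Backup

    # Read the settings back to see what the usage type configured.
    Get-DedupVolume -Volume $volume |
        Format-List Volume, Enabled, UsageType, MinimumFileAgeDays, OptimizePartialFiles
}
```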
Change #2: Use the volume size you need for the DPM backup data
In the article section “Plan and set up deduplicated volumes”, a volume size of 7.2TB is specified for the volumes containing the deduplicated VHDX files containing the DPM backup data. For evaluating Windows Server TP2, the guidance is to use the volume size you need, up to 64TB. Note that you still need to follow the other configuration guidance, e.g., for configuring Storage Spaces and NTFS. But go ahead and use larger volumes as needed, up to 64TB.
Conclusion
We think that these improvements to Data Deduplication, coming in Windows Server 2016 and available for you to try out now in Windows Server Technical Preview 2, will give you great results as you scale up your data sizes and deploy dedup with virtualized backup solutions.
And we would love to hear your feedback and results. Please send email to dedupfeedback@microsoft.com and let us know how your evaluation goes and, of course, any questions you may have.
Thanks!