Large Scale Study and System Design for Primary Data Deduplication accepted by USENIX

Microsoft Research (MSR) and the Windows File Server team worked together to build the new Data Deduplication feature in Windows Server 2012. The feature grew out of two years of collaboration with MSR on the design. The architecture and the deduplication algorithms were driven, in part, by analysis of data in a large global enterprise. Our Large Scale Study and System Design paper was accepted at the USENIX Annual Technical Conference (ATC), held June 13-15, where we also gave a talk about our findings. The paper and presentation video are now public on the USENIX website.

The paper describes the algorithms used to chunk data and to identify unique data chunks using indexes on chunk hashes, and it explains how deduplication resources scale to large amounts of data, including performance evaluation numbers. The paper and talk review the analysis carried out on the datasets and show how the insights were used to determine design points that address the challenges of primary data deduplication. Many of the design decisions were made to balance on-disk space savings, resource usage, performance, and transparency. The key feature is that deduplication can be installed on primary data volumes without impacting the server’s regular workload and still offer significant savings.
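
To make the idea of chunking plus an index on chunk hashes more concrete, here is a minimal Python sketch. It is not the algorithm from the paper or the Windows Server 2012 implementation: the rolling hash, the size limits, the SHA-256 digest, and the names split_chunks and ChunkStore are all assumptions chosen for illustration.

```python
import hashlib

def split_chunks(data, mask=0x3FF, min_size=256, max_size=8192):
    """Yield variable-size chunks whose boundaries depend on content.

    Illustrative only: a simple shift-and-add hash stands in for a real
    rolling fingerprint. A boundary is declared when the low bits of the
    hash match `mask`, subject to hypothetical min/max chunk sizes.
    """
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the low bits, so the boundary test
        # depends only on the most recent bytes of content.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk

class ChunkStore:
    """Toy in-memory chunk store keyed by chunk hash (hypothetical)."""

    def __init__(self):
        self.index = {}  # SHA-256 digest -> chunk bytes

    def put(self, data):
        """Store data and return its recipe: the ordered list of chunk hashes."""
        recipe = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).digest()
            # A chunk that is already indexed is not stored a second time.
            self.index.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.index[d] for d in recipe)

    def stored_bytes(self):
        """Total size of the unique chunks held by the store."""
        return sum(len(c) for c in self.index.values())
```

Storing two files that share most of their content should leave stored_bytes() well below the combined file sizes, because shared chunks hash to the same index entry. A production system would also have to address index memory footprint, on-disk layout, and hash collisions, which this sketch ignores.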

Overview:

A large-scale study of primary data deduplication on 7 TB of data across 15 globally distributed servers in a large enterprise.

Architecture overview of deduplication in Windows Server 2012 and the design decisions that were driven by data analysis.

How deduplication is made friendly to the server’s primary workload, and how its CPU, memory, and disk I/O usage scales efficiently with the size of the data.

Highlights of the innovations that went into the areas of data chunking / compression, chunk indexing, data partitioning and reconciliation.

Primary data serving, reliability, and resiliency aspects of the system are not covered in this paper.

Check out the live video of the talk given by Sudipta Sengupta and Adi Oltean and download the PDF of the paper here: https://www.usenix.org/conference/usenixfederatedconferencesweek/primary-data-deduplication%E2%80%94large-scale-study-and-system

Cheers,
Scott M. Johnson
Program Manager II
Data Deduplication Team
