= 2010-07-26 =

== File System : Data Deduplication ==

 * Today I read that Dell is acquiring Ocarina Networks to gain data deduplication capability, so I searched again for free software that can do deduplication. So far there appear to be three file system solutions: (1) [http://hub.opensolaris.org/bin/view/Community+Group+zfs/dedup ZFS] (2) [http://www.lessfs.com/wordpress/ lessfs] (3) [http://www.opendedup.org/ SDFS]. On the application software side I found: (1) [http://backuppc.sourceforge.net/ backuppc] (2) [http://www.bacula.org/ bacula] (3) [http://code.google.com/p/ostor/ OStor]
 * [http://www.digitimes.com.tw/tw/dt/n/shwnws.asp?Cnlid=4&cat=400&cat1=10&cat1=&id=0000192291_E963XHUF8IVGE92T9CQPV Hardware leader Dell pushes hard to build its software strength]
 {{{
 On the 19th (2010-07-19), Dell announced the acquisition of storage software vendor Ocarina Networks, whose software is known for compression and data deduplication.
 With Ocarina Networks' software, storage efficiency can be optimized, effectively reducing hardware, energy, and other related costs.
 }}}
 * [http://punetech.com/understanding-data-de-duplication/ Definition and classification of data deduplication]
   * Definition: [http://en.wikipedia.org/wiki/Data_deduplication Wikipedia]
   {{{
   Data deduplication or Single Instancing essentially refers to the elimination of redundant data. In the deduplication process, duplicate data is deleted, leaving only one copy (single instance) of the data to be stored. However, indexing of all data is still retained should that data ever be required.
   }}}
     * 2009-09-23 : [http://www.linux-mag.com/id/7535 Deduping Storage Deduplication]
     * 2007-07-12 : [http://www.backupcentral.com/content/view/58/47/ What is deduplication? (updated 6-08)]
   * Classification:
     1. Point of Application – Source vs Target
     2. Time of Application – Inline vs Post-Process
     3. Granularity – File vs Sub-File level
     4. Algorithm – Fixed size blocks vs Variable length data segments
     * [[Image(http://blog.druva.com/wp-content/uploads/2009/01/dedup-tree.jpg)]]
   * '''Target based Deduplication''' vs '''Source based Deduplication'''
     * [[Image(http://blog.druva.com/wp-content/uploads/2009/01/target-source-dedup.jpg)]]
     * 2007-07-30 : [http://www.backupcentral.com/content/view/129/47/ Two different types of de-duplication]
     * 2007-07-31 : [http://www.backupcentral.com/content/view/130/47/ De-duplication & remote restores]
   * '''Inline-process Deduplication''' vs '''Post-process Deduplication'''
     * [[Image(http://blog.druva.com/wp-content/uploads/2009/01/inline-post-dedup.jpg)]]
   * '''File Level Deduplication''' vs '''Sub-file Level Deduplication'''
   * '''Fixed-Length Blocks''' vs '''Variable-Length Data Segments'''
     * [[Image(http://blog.druva.com/wp-content/uploads/2009/01/file-bocks.jpg)]]
 * Paper: FAST'08 - [http://www.usenix.org/events/fast08/tech/full_papers/zhu/zhu_html/index.html Avoiding the Disk Bottleneck in the Data Domain Deduplication File System]
 * [http://www.snia.org/education/tutorials/2009/spring/data-management/ SNIA Data Protection and Management 2009] - [http://www.snia.org/education/tutorials/2009/spring/data-management/DanielBudiansky_Understanding_Data_Deduplication.pdf Understanding Data Deduplication] - Daniel Budiansky, Larry Freeman
 * 2010-05-13 : [http://www.enterprisestorageforum.com/continuity/article.php/11568_3882106_2/Open-Source-Deduplication-Ready-for-Enterprises.htm Open Source Deduplication: Ready for Enterprises?]
   * The article mentions [http://www.baculasystems.com/eng Bacula System], which is also open source; the free software version can only be found at http://www.bacula.org/.
   * [http://www.zmanda.com/images/logo-index-main.png Zmanda] is a company I run into all the time, whether at Linux World or recently while looking into MySQL backup. According to the company website, its products fall into (1) network backup based on the [http://www.amanda.org/ free software Amanda], (2) MySQL backup, and (3) cloud backup.
   * [http://www.nexenta.com/corp/nexentastor-overview/nexentastor-releases/nexentastor-30 Nexenta Systems], for its part, does inline deduplication based on ZFS.
 * [http://www.opendedup.org/ Opendedup's SDFS] - apparently only became free software in early 2010 - licensed under GPLv2 - [http://code.google.com/p/opendedup/ Google Code project site]
   * 2010-03-25 : [http://ostatic.com/blog/sdfs-a-robust-deduplication-file-system-for-linux SDFS: A Robust Deduplication File System for Linux]
   * 2010-03-25 : [http://www.cio.com.au/article/340870/open_source_deduplication_software_released_linux/ Open source deduplication software released for Linux]
   * [[Image(http://opendedup.googlecode.com/files/Screenshot-1.png)]]
 * 2010-02-22 : [http://searchstorage.techtarget.com.au/articles/38919-Two-open-source-data-deduplication-tools Two open-source data deduplication tools]
   * Besides ZFS, this article also introduces a package called [http://backuppc.sourceforge.net/ backuppc].
   * Someone has also built a virtual appliance version of backuppc - [http://gotitsolutions.org/2007/01/15/open-source-backup-and-data-de-duplication-virtual-appliance-2.html Open source backup and data de-duplication virtual appliance]
 * [http://code.google.com/p/ostor/ OStor] - [http://ostor.sourceforge.net/ old SourceForge site]
   * 2009-11-01 : [http://ppraveen.wordpress.com/2009/11/01/introducing-ostor-data-deduplication-in-the-cloud-open-source-project/ Introducing OStor – data deduplication in the cloud. Open source project.]
   * 2009-11-01 : [http://dedup.wordpress.com/2009/11/01/ostor-data-deduplication-in-the-cloud-howto/ OStor – data deduplication in the cloud – HowTo]
 * [cloud:wiki:jazz/10-05-20 2010-05-20] Gave a talk at National Chung Hsing University. Afterwards I chatted with the speaker of another session on virtualization, Mr. 陳中欣 of 麟瑞科技, who mentioned that NetApp's deduplication works at the block level, which is why it can deduplicate virtual machine disks.
   * [[Image(http://trac.nchc.org.tw/cloud/raw-attachment/wiki/jazz/10-05-20/10-05-30_NetApp_Block-level-deduplication.png)]]
 * [cloud:wiki:jazz/10-03-03 2010-03-03] We invited Sun to give a talk on ZFS and learned that ZFS also has a deduplication feature. Great!
   * 2009-12-03 : [http://hub.opensolaris.org/bin/view/Community+Group+zfs/dedup ZFS Deduplication Frequently Asked Questions (FAQ)]
   * 2009-11-03 : [http://www.h-online.com/open/news/item/ZFS-with-data-deduplication-848638.html ZFS with data deduplication]
   * 2009-11-02 : [http://blogs.sun.com/bonwick/entry/zfs_dedup ZFS Deduplication]
 * [wiki:jazz/09-02-11#A-SISCOW 2009-02-11 : About A-SIS & COW]
   * [http://en.wikipedia.org/wiki/NTFS#Single_Instance_Storage_.28SIS.29 Single Instance Storage (SIS)] is a feature of NTFS.
   * SIS behaves much like [http://en.wikipedia.org/wiki/Copy-on-write Copy-on-Write (COW)], and the most widespread use of COW is in the file systems used by virtualization technologies such as QEMU.
   * [http://blog.scottlowe.org/2007/09/21/nfs-for-vmware-storage/ Quite a few people are discussing running VMware on NFS]
   * ITHome has also reported that [http://www.ithome.com.tw/itadm/article.php?c=45228&s=11 NetApp A-SIS] can help with [http://www.ithome.com.tw/itadm/article.php?c=52322 making good use of storage virtualization: meeting massive data-access demands with a small amount of physical space]
   * !NetApp's official introduction to A-SIS is [http://media.netapp.com/documents/tr-3505.pdf here].
   * The [http://www.redbooks.ibm.com/redpapers/pdfs/redp4320.pdf IBM RedBook] explains that IBM's own storage systems also support A-SIS. This figure is a classic illustration of how deduplication is done at the block level.
   * [[Image(wiki:jazz/09-02-11:SIS_Deduplication.jpg)]]
   * Few file systems currently implement deduplication. It may be worth learning more about how A-SIS does block-level deduplication, which might help reduce the disk space needed for DRBL AOE Windows and Xen virtualization.
 * I have long been watching the data duplication problem that virtualization brings. !NetApp is very strong here: it tackles the problem at the file system level and condenses duplicate files (deduplication). Today I happened to see the Linux Magazine article "[http://www.linux-mag.com/id/7535 Deduping Storage Deduplication]", which covers many current commercial solutions. But what about free software? So far there seems to be only [http://sourceforge.net/projects/lessfs/ lessfs], written with FUSE. Its official website http://www.lessfs.com/ does not have much information yet, and I hope more file systems of this kind appear in the future. The first question that came to mind: how do you deduplicate duplicate files inside a loop device image? Likewise, can virtual disks such as vmdk be deduplicated? ([wiki:jazz/09-09-23#FileSystem:lessfs:deduplication 2009-09-23])
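
As a rough way to think about the vmdk / loop device image question, here is a minimal Python sketch of fixed-size block-level deduplication. It is only a toy model, not how NetApp A-SIS, ZFS, lessfs, or SDFS are actually implemented. The `BlockStore` class, the 4096-byte block size, and the two fake "images" are all assumptions made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


class BlockStore:
    """Toy content-addressed store: each unique block is kept only once."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # SHA-256 digest -> block bytes
        self.files = {}   # file name -> ordered list of digests

    def write(self, name, data):
        """Split data into fixed-size blocks and store only unseen blocks."""
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # skip blocks already stored
            recipe.append(digest)
        self.files[name] = recipe

    def read(self, name):
        """Rebuild a file from its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        """Bytes physically kept after deduplication."""
        return sum(len(b) for b in self.blocks.values())


# Two hypothetical disk images sharing 8 base blocks, differing in the last one.
base = b"".join(bytes([i]) * BLOCK_SIZE for i in range(8))
image_a = base + b"A" * BLOCK_SIZE
image_b = base + b"B" * BLOCK_SIZE

store = BlockStore()
store.write("a.img", image_a)
store.write("b.img", image_b)

logical = len(image_a) + len(image_b)
print(logical)               # 73728 bytes written by the two images
print(store.stored_bytes())  # 40960 bytes kept (10 unique blocks)
```

The two images differ as whole files, so file-level deduplication would save nothing here, while the block-level store keeps the 8 shared blocks once. This is the same reason block-level deduplication can help with virtual machine disks. Fixed-size blocks only match when duplicate data sits at the same block alignment, which is where the "Variable-Length Data Segments" category in the classification above comes in.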