Adding WRITE SAME support to SPDK

From the column 存儲系統札記 (Storage System Notes)

WRITE SAME is an optional SCSI command whose main use is resetting the contents of a device. A typical case is ESXi zeroing an entire volume when it creates a thick-provisioned, eager-zeroed disk. In cloud deployments a VM usually maps to several volumes, each somewhere between gigabytes and terabytes in size. For stable performance, many distributed systems need to write each volume once in full before running workloads or benchmarks. The point of the full write is to let the backend storage allocate its metadata up front and warm up.

If the upper layer issues the full-volume write itself, for example with plain write calls, a terabyte-scale volume takes a very long time: a 1 TB volume at a sequential write speed of 500 MB/s needs about 2000 s, a little over 30 minutes. WRITE SAME is designed for this case. By issuing WRITE SAME at the block-device layer, the upper layer no longer transfers all that data. It sends only a small amount (512 B, one logical block), and the backend writes that block repeatedly, which offloads the write to the backend.

In short, WRITE SAME aims to:

- greatly reduce the amount of data transferred;
- offload the full-volume write to the backend;
- when combined with UNMAP, avoid writes as far as possible and improve performance further.

The agenda for this post:

- some background on SCSI provisioning;
- how to inspect the SCSI provisioning and WRITE SAME/UNMAP information a device reports;
- how to test WRITE SAME.

provision

Provisioning determines how logical blocks map to physical blocks. According to the SBC-3 specification, provisioning falls into the following categories:

- full provisioning: logical blocks and physical blocks map one to one
- logical block provisioning:
  - resource provisioning: there are enough resources for every logical block to map to a physical block, but some are currently unmapped or anchored;
  - thin provisioning: the device can be over-committed, that is, the number of LBs can exceed the number of physical blocks

Terms

- anchor: reserved; the LB has a corresponding physical block, but the block isn't in use
- unmapped: the LB has no corresponding physical block

Inspecting SCSI features with sg_utils

Check whether logical block provisioning is supported:

    $ sg_readcap /dev/sda -l
    Read Capacity results:
       Protection: prot_en=0, p_type=0, p_i_exponent=0
       Logical block provisioning: lbpme=1, lbprz=0
       Last logical block address=629145599 (0x257fffff), Number of logical blocks=629145600
       Logical block length=512 bytes
       Logical blocks per physical block exponent=0
       Lowest aligned logical block address=0
    Hence:
       Device size: 322122547200 bytes, 307200.0 MiB, 322.12 GB

Check the block limits:

    $ sg_vpd -p bl /dev/sdb
    Block limits VPD page (SBC):
      Write same no zero (WSNZ): 0
      Maximum compare and write length: 0 blocks
      Optimal transfer length granularity: 1 blocks
      Maximum transfer length: 2097152 blocks
      Optimal transfer length: 64 blocks
      Maximum prefetch length: 0 blocks
      Maximum unmap LBA count: 4294967295
      Maximum unmap block descriptor count: 256
      Optimal unmap granularity: 1
      Unmap granularity alignment valid: 0
      Unmap granularity alignment: 0
      Maximum write same length: 0xffff blocks

Check the logical block provisioning (LBP) information of a disk:

    $ sg_vpd -p lbpv /dev/sda
    Logical block provisioning VPD page (SBC):
      Unmap command supported (LBPU): 0
      Write same (16) with unmap bit supported (LBWS): 1
      Write same (10) with unmap bit supported (LBWS10): 0
      Logical block provisioning read zeros (LBPRZ): 0
      Anchored LBAs supported (ANC_SUP): 0
      Threshold exponent: 0
      Descriptor present (DP): 0
      Provisioning type: 0

These commands come up often when checking a device's support for SCSI provisioning, WRITE SAME and UNMAP; a SCSI device exposes its capabilities through these queries. The key abbreviations in the output above mean:

- LBPU: logical block provisioning UNMAP; the UNMAP command is supported
- LBWS: logical block provisioning WRITE SAME; WRITE SAME (16) with the UNMAP bit is supported
- LBWS10: WRITE SAME (10) with the UNMAP bit is supported
- LBPRZ: logical block provisioning read zeros
- lbpme: logical block provisioning management enabled; 1 means logical block provisioning is supported
- lbprz: logical block provisioning read zeros; 1 means reads from unmapped blocks return zeros
- Maximum write same length: 0xffff blocks is the largest range a single WRITE SAME command can write

Check the mapping status:

    $ sg_get_lba_status /dev/sda
    descriptor LBA: 0x0000000000000000  blocks: 838860800  mapped

Testing with scsi_debug

    modprobe scsi_debug lbprz=1 lbpu=1 lbpws=1 dev_size_mb=1024
    $ sg_readcap /dev/sdb -l
    Read Capacity results:
       Protection: prot_en=0, p_type=0, p_i_exponent=0
       Logical block provisioning: lbpme=1, lbprz=1
       Last logical block address=2097151 (0x1fffff), Number of logical blocks=2097152
       Logical block length=512 bytes
       Logical blocks per physical block exponent=0
       Lowest aligned logical block address=0
    Hence:
       Device size: 1073741824 bytes, 1024.0 MiB, 1.07 GB
    $ sg_vpd -p lbpv /dev/sdb
    Logical block provisioning VPD page (SBC):
      Unmap command supported (LBPU): 1
      Write same (16) with unmap bit supported (LBWS): 1
      Write same (10) with unmap bit supported (LBWS10): 0
      Logical block provisioning read zeros (LBPRZ): 1
      Anchored LBAs supported (ANC_SUP): 0
      Threshold exponent: 0
      Descriptor present (DP): 0
      Provisioning type: 0

    $ rmmod scsi_debug && modprobe scsi_debug lbprz=1 lbpu=0 lbpws=0 dev_size_mb=1024
    $ sg_readcap /dev/sdb -l && sg_vpd -p lbpv /dev/sdb
    Read Capacity results:
       Protection: prot_en=0, p_type=0, p_i_exponent=0
       Logical block provisioning: lbpme=0, lbprz=0
       Last logical block address=2097151 (0x1fffff), Number of logical blocks=2097152
       Logical block length=512 bytes
       Logical blocks per physical block exponent=0
       Lowest aligned logical block address=0
    Hence:
       Device size: 1073741824 bytes, 1024.0 MiB, 1.07 GB
    Logical block provisioning VPD page (SBC):
      Unmap command supported (LBPU): 0
      Write same (16) with unmap bit supported (LBWS): 0
      Write same (10) with unmap bit supported (LBWS10): 0
      Logical block provisioning read zeros (LBPRZ): 1
      Anchored LBAs supported (ANC_SUP): 0
      Threshold exponent: 0
      Descriptor present (DP): 0
      Provisioning type: 0

    $ rmmod scsi_debug && modprobe scsi_debug lbprz=1 lbpu=1 lbpws=0 dev_size_mb=1024
    $ sg_readcap /dev/sdb -l && sg_vpd -p lbpv /dev/sdb
    Read Capacity results:
       Protection: prot_en=0, p_type=0, p_i_exponent=0
       Logical block provisioning: lbpme=1, lbprz=1
       Last logical block address=2097151 (0x1fffff), Number of logical blocks=2097152
       Logical block length=512 bytes
       Logical blocks per physical block exponent=0
       Lowest aligned logical block address=0
    Hence:
       Device size: 1073741824 bytes, 1024.0 MiB, 1.07 GB
    Logical block provisioning VPD page (SBC):
      Unmap command supported (LBPU): 1
      Write same (16) with unmap bit supported (LBWS): 0
      Write same (10) with unmap bit supported (LBWS10): 0
      Logical block provisioning read zeros (LBPRZ): 1
      Anchored LBAs supported (ANC_SUP): 0
      Threshold exponent: 0
      Descriptor present (DP): 0
      Provisioning type: 0

Turning off logical block provisioning

    $ rmmod scsi_debug && modprobe scsi_debug lbprz=1 lbpu=0 lbpws=0 dev_size_mb=1024
    $ sg_readcap /dev/sdb -l && sg_vpd -p lbpv /dev/sdb
    Read Capacity results:
       Protection: prot_en=0, p_type=0, p_i_exponent=0
       Logical block provisioning: lbpme=0, lbprz=0
       Last logical block address=2097151 (0x1fffff), Number of logical blocks=2097152
       Logical block length=512 bytes
       Logical blocks per physical block exponent=0
       Lowest aligned logical block address=0
    Hence:
       Device size: 1073741824 bytes, 1024.0 MiB, 1.07 GB
    Logical block provisioning VPD page (SBC):
      Unmap command supported (LBPU): 0
      Write same (16) with unmap bit supported (LBWS): 0
      Write same (10) with unmap bit supported (LBWS10): 0
      Logical block provisioning read zeros (LBPRZ): 1
      Anchored LBAs supported (ANC_SUP): 0
      Threshold exponent: 0
      Descriptor present (DP): 0
      Provisioning type: 0
    $ sg_get_lba_status -l 1024 /dev/sdb
    Get LBA Status command not supported

Testing unmap

    $ rmmod scsi_debug && modprobe scsi_debug lbprz=1 lbpu=1 lbpws=1 dev_size_mb=1024
    $ dd if=/dev/zero of=/dev/sdb bs=512 seek=1024 count=138 && sg_get_lba_status -l 1024 /dev/sdb
    138+0 records in
    138+0 records out
    70656 bytes (71 kB) copied, 0.0192505 s, 3.7 MB/s
    descriptor LBA: 0x0000000000000400  blocks: 144  mapped
    $ sg_unmap -v -l 1024 -n 16 /dev/sdb
        unmap cdb: 42 00 00 00 00 00 00 00 18 00
    $ sg_get_lba_status -l 1024 /dev/sdb
    descriptor LBA: 0x0000000000000400  blocks: 16  deallocated
    $ sg_get_lba_status -l 1040 /dev/sdb
    descriptor LBA: 0x0000000000000410  blocks: 128  mapped

write same

Details are here: https://www.systutorials.com/docs/linux/man/8-sg_write_same/

    $ rmmod scsi_debug && modprobe scsi_debug lbprz=1 lbpu=1 lbpws=1 dev_size_mb=1024
    $ dd if=/dev/zero of=/dev/sdb bs=512 seek=1024 count=138 && sg_get_lba_status -l 1024 /dev/sdb
    138+0 records in
    138+0 records out
    70656 bytes (71 kB) copied, 0.0187926 s, 3.8 MB/s
    descriptor LBA: 0x0000000000000400  blocks: 144  mapped
    $ sg_write_same -U --in /dev/zero --num=128 --lba=1024 /dev/sdb
    $ sg_get_lba_status -l 1024 /dev/sdb
    descriptor LBA: 0x0000000000000400  blocks: 128  deallocated
    $ dd if=/dev/zero of=/dev/sdb bs=512 seek=1024 count=138 && sg_get_lba_status -l 1024 /dev/sdb
    138+0 records in
    138+0 records out
    70656 bytes (71 kB) copied, 0.0186514 s, 3.8 MB/s
    descriptor LBA: 0x0000000000000400  blocks: 144  mapped
    $ sg_write_same --in /dev/zero --num=128 --lba=1024 /dev/sdb
    $ sg_get_lba_status -l 1024 /dev/sdb
    descriptor LBA: 0x0000000000000400  blocks: 144  mapped
    $ cat /sys/bus/pseudo/drivers/scsi_debug/map
    1152-1167

WRITE SAME test cases once the SPDK implementation is done

1. Write 512 bytes of zeros

    dd if=/dev/urandom of=/dev/sdf bs=512 seek=0 count=2 && sg_get_lba_status -l 0 /dev/sdf
    hexdump -C -n1024 /dev/sdf
    sg_write_same --num=1 --lba=0 /dev/sdf -vvv
    hexdump -C -n1024 /dev/sdf

2. unmap

    # not supported
    sg_write_same -U --in buf --num=1 --lba=0 /dev/sdf -vvv

3. Write less than 512 bytes

    perl -e 'print("-" x 504, "+" x 4);' > buf
    time sg_write_same -U --in buf --num=4 --lba=0 /dev/sdf -vvv

4. Write more than 512 bytes

Unaligned:

    perl -e 'print("-" x 512, "+" x 4);' > buf
    time sg_write_same -U --in buf --num=4 --lba=0 /dev/sdf -vvv

Aligned:

    perl -e 'print("-" x 510, "+" x 514);' > buf
    time sg_write_same --in buf --num=4 --lba=0 /dev/sdf -vvv
    hexdump -C -n1024 /dev/sdf

5. Performance test

Write 256 MB of data:

    time sg_write_same --num=$((256*1024*1024/512)) --lba=0 /dev/sdf -vvv

no-data-out:

    time sg_write_same -N --num=$((256*1024*1024/512)) --lba=0 /dev/sdf -vvv

ESXi test case

Create a VM, create a volume of type "Thick Provision Eager Zeroed" (厚製備立即置零), and open Wireshark; you can see the WRITE SAME commands that ESXi issues.

Summary

Implementing SCSI WRITE SAME is fairly simple: just follow the spec. I'll clean up the patch and post it when I have time. With WRITE SAME implemented, ESXi thick provision eager zeroed runs more than 5x faster on my test machine.
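As a rough illustration of what following the spec involves on the target side, the sketch below builds and decodes a WRITE SAME (16) CDB and replays the single data block across an in-memory buffer. It isn't the SPDK patch mentioned above. The function names (`build_write_same16`, `parse_write_same16`, `apply_write_same`) and the `bytearray` standing in for the backend are assumptions made for illustration. The CDB layout used here (opcode 0x93, UNMAP at bit 3 and NDOB at bit 0 of byte 1, an 8-byte LBA, a 4-byte block count) is the one defined by SBC; the UNMAP bit is decoded but not acted on.

```python
import struct

WRITE_SAME_16 = 0x93   # WRITE SAME (16) opcode defined by SBC
BLOCK_SIZE = 512       # logical block length, as in the sg_readcap output above


def build_write_same16(lba, num_blocks, unmap=False, ndob=False):
    """Pack a 16-byte WRITE SAME (16) CDB; all fields are big-endian."""
    flags = (int(unmap) << 3) | int(ndob)   # byte 1: UNMAP = bit 3, NDOB = bit 0
    # opcode, flags, 8-byte LBA, 4-byte block count, group number, control
    return struct.pack(">BBQIBB", WRITE_SAME_16, flags, lba, num_blocks, 0, 0)


def parse_write_same16(cdb):
    """Decode the fields a target needs from a WRITE SAME (16) CDB."""
    if len(cdb) != 16 or cdb[0] != WRITE_SAME_16:
        raise ValueError("not a WRITE SAME (16) CDB")
    _, flags, lba, num_blocks, _, _ = struct.unpack(">BBQIBB", cdb)
    return {
        "lba": lba,
        "num_blocks": num_blocks,
        "unmap": bool(flags & 0x08),
        "ndob": bool(flags & 0x01),
    }


def apply_write_same(disk, cdb, data_out=b""):
    """Hypothetical backend: replay one logical block over the LBA range.

    The special meaning of a zero block count is not handled in this sketch.
    """
    req = parse_write_same16(cdb)
    if req["ndob"]:
        block = bytes(BLOCK_SIZE)            # no Data-Out buffer: write zeros
    elif len(data_out) == BLOCK_SIZE:
        block = data_out
    else:
        raise ValueError("Data-Out buffer must be exactly one logical block")
    start = req["lba"] * BLOCK_SIZE
    end = start + req["num_blocks"] * BLOCK_SIZE
    if end > len(disk):
        raise ValueError("LBA range exceeds the device")
    disk[start:end] = block * req["num_blocks"]
    return req


if __name__ == "__main__":
    disk = bytearray(b"\xff" * (8 * BLOCK_SIZE))      # 8-block toy device
    cdb = build_write_same16(lba=2, num_blocks=4, ndob=True)
    apply_write_same(disk, cdb)
    print(cdb.hex(" "))
```

The initiator sends 16 bytes of CDB plus at most one block of data, whatever the size of the range; the repetition happens entirely on the backend. That's where the savings measured in the performance test come from.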
