1. 24 Jun 2019, 6 commits
    • vmdk: Add read-only support for seSparse snapshots · 98eb9733
      Authored by Sam Eiderman
      Until ESXi 6.5, VMware used the vmfsSparse format for snapshots (VMDK3 in
      QEMU).
      
      This format was lacking in the following:
      
          * Grain directory (L1) and grain table (L2) entries were 32-bit,
            allowing access to only 2TB (slightly less) of data.
          * The grain size (default) was 512 bytes - leading to data
            fragmentation and many grain tables.
          * For space reclamation purposes, it was necessary to find all the
            grains which are not pointed to by any grain table - so a reverse
            mapping of "offset of grain in vmdk" to "grain table" must be
            constructed - which takes large amounts of CPU/RAM.
      
      The format specification can be found in VMware's documentation:
      https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
      
      In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
      introduced: SESparse (Space Efficient).
      
      This format fixes the above issues:
      
          * All entries are now 64-bit.
          * The grain size (default) is 4KB.
          * Grain directory and grain tables are now located at the beginning
            of the file.
            + seSparse format reserves space for all grain tables.
            + Grain tables can be addressed using an index.
            + Grains are located at the end of the file and can also
              be addressed with an index.
            - seSparse vmdks of large disks (64TB) have huge preallocated
              headers - mainly due to L2 tables, even for empty snapshots.
          * The header contains a reverse mapping ("backmap") of "offset of
            grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
            specifies for each grain - whether it is allocated or not.
            Using these data structures we can implement space reclamation
            efficiently.
          * Because the header now maintains two mappings:
              * The regular one (grain directory & grain tables)
              * A reverse one (backmap and free bitmap)
            these data structures can lose consistency upon a crash and
            result in a corrupted VMDK.
            Therefore, a journal is also added to the VMDK and is
            replayed when VMware reopens the file after a crash.
      
      Since ESXi 6.7, seSparse is the only snapshot format available.
      
      Unfortunately, VMware does not provide documentation regarding the new
      seSparse format.
      
      This commit is based on black-box research of the seSparse format.
      Various in-guest block operations and their effect on the snapshot file
      were tested.
      
      The only VMware-provided source of information (regarding the underlying
      implementation) was a log file on the ESXi host:
      
          /var/log/hostd.log
      
      Whenever a seSparse snapshot is created, the log is populated
      with seSparse records.
      
      Relevant log records are of the form:
      
      [...] Const Header:
      [...]  constMagic     = 0xcafebabe
      [...]  version        = 2.1
      [...]  capacity       = 204800
      [...]  grainSize      = 8
      [...]  grainTableSize = 64
      [...]  flags          = 0
      [...] Extents:
      [...]  Header         : <1 : 1>
      [...]  JournalHdr     : <2 : 2>
      [...]  Journal        : <2048 : 2048>
      [...]  GrainDirectory : <4096 : 2048>
      [...]  GrainTables    : <6144 : 2048>
      [...]  FreeBitmap     : <8192 : 2048>
      [...]  BackMap        : <10240 : 2048>
      [...]  Grain          : <12288 : 204800>
      [...] Volatile Header:
      [...] volatileMagic     = 0xcafecafe
      [...] FreeGTNumber      = 0
      [...] nextTxnSeqNumber  = 0
      [...] replayJournal     = 0
      
      The sizes seen in the log file are in sectors.
      Extents are given in the format <offset : size>
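      
      Read back against the extent list, the log record above implies a
      plausible C layout for the seSparse "const header". The sketch
      below only mirrors the fields visible in the log; the exact byte
      layout, field order and padding are assumptions from this
      black-box research, not from VMware documentation:
      
          #include <stdint.h>
      
          /* All offsets/sizes are in 512-byte sectors, as in hostd.log. */
          typedef struct SESparseExtent {
              uint64_t offset;
              uint64_t size;
          } SESparseExtent;
      
          typedef struct SESparseConstHeader {
              uint64_t magic;            /* expected 0xcafebabe */
              uint64_t version;          /* expected 2.1 (encoding assumed) */
              uint64_t capacity;         /* virtual disk size, in sectors */
              uint64_t grain_size;       /* expected 8 sectors (4KB) */
              uint64_t grain_table_size; /* expected 64 sectors */
              uint64_t flags;            /* expected 0 */
              SESparseExtent journal_header;
              SESparseExtent journal;
              SESparseExtent grain_directory;
              SESparseExtent grain_tables;
              SESparseExtent free_bitmap;
              SESparseExtent back_map;
              SESparseExtent grains;
          } SESparseConstHeader;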
      
      This commit is a strict implementation which enforces:
          * magics
          * version number 2.1
          * grain size of 8 sectors  (4KB)
          * grain table size of 64 sectors
          * zero flags
          * extent locations
      
      Additionally, this commit provides only a subset of the functionality
      offered by seSparse's format:
          * Read-only
          * No journal replay
          * No space reclamation
          * No unmap support
      
      Hence, the journal header, journal, free bitmap and backmap extents are
      unused; only the "classic" (L1 -> L2 -> data) grain access is
      implemented.
      
      However, there are several differences in the grain access itself;
      a decoding sketch in C follows the two lists below.
      Grain directory (L1):
          * Grain directory entries are indexes (not offsets) to grain
            tables.
          * Valid grain directory entries have their highest nibble set to
            0x1.
          * Since grain tables are always located at the beginning of
            the file, the index fits into 32 bits, so we can use the
            entry's low part when it is valid.
      Grain table (L2):
          * Grain table entries are indexes (not offsets) to grains.
          * If the highest nibble of the entry is:
              0x0:
                  The grain is not allocated.
                  The rest of the bytes are 0.
              0x1:
                  The grain is unmapped - guest sees a zero grain.
                  The rest of the bits point to the previously mapped grain,
                  see 0x3 case.
              0x2:
                  The grain is zero.
              0x3:
                  The grain is allocated - to get the index calculate:
                  ((entry & 0x0fff000000000000) >> 48) |
                  ((entry & 0x0000ffffffffffff) << 12)
          * The difference between 0x1 and 0x2 is that 0x1 marks an
            unmapped grain which results from the guest using sg_unmap
            to unmap the grain - but the grain itself still exists in
            the grain extent, and a space reclamation procedure should
            delete it.
            Unmapping a zero grain has no effect (0x2 will not change
            to 0x1), but unmapping an unallocated grain will (0x0 to
            0x1) - naturally.
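      
      A minimal C sketch of this decoding, using only the nibble values
      and masks quoted above (function names are illustrative, not
      QEMU's):
      
          #include <stdint.h>
          #include <stdbool.h>
      
          /* Grain directory (L1) entry: valid entries have their
           * highest nibble set to 0x1; since grain tables live at the
           * beginning of the file, the index fits in the low 32 bits. */
          static bool gd_entry_valid(uint64_t entry)
          {
              return (entry >> 60) == 0x1;
          }
      
          static uint32_t gd_entry_gt_index(uint64_t entry)
          {
              return (uint32_t)entry;
          }
      
          /* Grain table (L2) entry: the highest nibble selects the case
           * (0x0 unallocated, 0x1 unmapped, 0x2 zero, 0x3 allocated). */
          static unsigned gt_entry_type(uint64_t entry)
          {
              return entry >> 60;
          }
      
          /* For an allocated grain (0x3), recover the grain index. */
          static uint64_t gt_entry_grain_index(uint64_t entry)
          {
              return ((entry & 0x0fff000000000000ULL) >> 48) |
                     ((entry & 0x0000ffffffffffffULL) << 12);
          }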
      
      In order to implement seSparse, some fields had to be changed to support
      both 32-bit and 64-bit entry sizes.
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
      Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
      Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
      Message-id: 20190620091057.47441-4-shmuel.eiderman@oracle.com
      Signed-off-by: Max Reitz <mreitz@redhat.com>
    • vmdk: Reduce the max bound for L1 table size · 59d6ee48
      Authored by Sam Eiderman
      512M of L1 entries is a very loose bound; only 32M are required to store
      the maximal supported VMDK file size of 2TB.
      
      Fixed qemu-iotest 059 - the failure now occurs earlier, on the
      impossible L1 table size.
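      
      (The implied arithmetic, inferred from the numbers above rather
      than stated in the patch: 2TB / 32M entries = 2^41 / 2^25 = 2^16
      = 64KB of coverage per L1 entry, well below what any real grain
      table covers - so 32M is still a loose upper bound, just 16x
      tighter than 512M.)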
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
      Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
      Message-id: 20190620091057.47441-3-shmuel.eiderman@oracle.com
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Signed-off-by: Max Reitz <mreitz@redhat.com>
    • vmdk: Fix comment regarding max l1_size coverage · 940a2cd5
      Authored by Sam Eiderman
      Commit b0651b8c ("vmdk: Move l1_size check into vmdk_add_extent")
      extended the l1_size check from VMDK4 to VMDK3 but did not update the
      default coverage in the moved comment.
      
      The previous vmdk4 calculation:
      
          (512 * 1024 * 1024) * 512(l2 entries) * 65536(grain) = 16PB
      
      The added vmdk3 calculation:
      
          (512 * 1024 * 1024) * 4096(l2 entries) * 512(grain) = 1PB
      
      This commit adds the vmdk3 calculation to the comment.
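      
      Both products check out when written as powers of two:
      
          vmdk4: 2^29 * 2^9  * 2^16 = 2^54 = 16PB
          vmdk3: 2^29 * 2^12 * 2^9  = 2^50 = 1PB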
      
      In any case, VMware does not offer virtual disks larger than 2TB for
      vmdk4/vmdk3, or 64TB for the new undocumented seSparse format, which
      is not yet implemented in qemu.
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
      Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
      Message-id: 20190620091057.47441-2-shmuel.eiderman@oracle.com
      Reviewed-by: yuchenlin <yuchenlin@synology.com>
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Signed-off-by: Max Reitz <mreitz@redhat.com>
    • iotest 134: test cluster-misaligned encrypted write · 6ec889eb
      Authored by Anton Nefedov
      COW (even empty/zero) areas require encryption too: when a write is
      misaligned to the cluster size, the untouched head and tail of the
      cluster are copied on write, and those copied areas must be
      encrypted before they are written out.
      Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Reviewed-by: Alberto Garcia <berto@igalia.com>
      Message-id: 20190516143028.81155-1-anton.nefedov@virtuozzo.com
      Signed-off-by: Max Reitz <mreitz@redhat.com>
    • blockdev: enable non-root nodes for transaction drive-backup source · 85c9d133
      Authored by Vladimir Sementsov-Ogievskiy
      We forgot to enable it for the transaction's .prepare, while it is
      already enabled in do_drive_backup since commit a2d665c1
      ("blockdev: loosen restrictions on drive-backup source node").
      Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
      Message-id: 20190618140804.59214-1-vsementsov@virtuozzo.com
      Reviewed-by: John Snow <jsnow@redhat.com>
      Signed-off-by: Max Reitz <mreitz@redhat.com>
    • nvme: do not advertise support for unsupported arbitration mechanism · 1cc354ac
      Authored by Klaus Birkelund Jensen
      The device mistakenly reports that the Weighted Round Robin with Urgent
      Priority Class arbitration mechanism is supported.
      
      It is not.
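      
      For reference, arbitration support is advertised in the controller's
      CAP register (NVMe 1.3, bits 18:17, "AMS"), where bit 17 is Weighted
      Round Robin with Urgent Priority Class. A C sketch of the fix's
      intent (macro names are illustrative, not necessarily QEMU's):
      
          #include <stdint.h>
      
          /* NVMe CAP.AMS bit 17: Weighted Round Robin with Urgent
           * Priority Class (WRRU). */
          #define NVME_CAP_AMS_WRRU   (1ULL << 17)
      
          /* Advertise only the mandatory round-robin arbitration,
           * which needs no AMS bit at all. */
          static uint64_t nvme_cap_without_wrru(uint64_t cap)
          {
              return cap & ~NVME_CAP_AMS_WRRU;
          }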
      Signed-off-by: Klaus Birkelund Jensen <klaus.jensen@cnexlabs.com>
      Message-id: 20190606092530.14206-1-klaus@birkelund.eu
      Acked-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Max Reitz <mreitz@redhat.com>
  2. 21 Jun 2019, 34 commits