1. 01 July 2019, 4 commits
    • pcie: work around for racy guest init · 110c477c
      Committed by Michael S. Tsirkin
      During boot, Linux guests tend to clear all bits in the PCIe slot status
      register, which is used for hotplug.
      If they clear bits that weren't set, this is racy and loses events:
      not a big problem for manual hotplug on bare metal, but a problem for us.
      
      For example, the following is broken ATM:
      
      /x86_64-softmmu/qemu-system-x86_64 -enable-kvm -S -machine q35  \
          -device pcie-root-port,id=pcie_root_port_0,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
          -device virtio-balloon-pci,id=balloon,bus=pcie_root_port_0 \
          -monitor stdio disk.qcow2
      (qemu)device_del balloon
      (qemu)cont
      
      Balloon isn't deleted as it should.
      
      As a work-around, detect this attempt to clear slot status and revert
      the status to what it was before the write (a sketch follows this entry).
      
      Note: in theory this can be detected as a duplicate button press,
      which cancels the previous press. That does not seem to happen in
      practice, as guests only seem to have this bug during init.
      
      Note2: the right thing to do is probably to fix Linux to
      read status before clearing it, and act on the bits that are set.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
      Reviewed-by: Igor Mammedov <imammedo@redhat.com>
      Tested-by: Igor Mammedov <imammedo@redhat.com>
      110c477c
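      The idea behind the work-around can be sketched as follows. This is a
      simplified, self-contained illustration, not the actual QEMU patch; the
      helper name and the pre-write status argument are assumptions made for
      the example:

          #include <stdint.h>

          /*
           * Slot status bits are write-1-to-clear.  If the guest tries to
           * clear bits that were not set before the write, treat it as the
           * blanket "clear everything" done during init and keep the old
           * value, so pending hotplug events are not lost.
           */
          static uint16_t slot_status_after_write(uint16_t old_sta, uint16_t w1c_val)
          {
              if (w1c_val & ~old_sta) {
                  return old_sta;            /* racy init-time clear: revert */
              }
              return old_sta & ~w1c_val;     /* normal write-1-to-clear */
          }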
    • pcie: check that slt ctrl changed before deleting · 2841ab43
      Committed by Michael S. Tsirkin
      During boot, Linux would sometimes overwrite the control register of a
      powered-off slot before powering it on. Unfortunately, QEMU interprets
      that as a power-off request and ejects the device.
      
      For example:
      
      /x86_64-softmmu/qemu-system-x86_64 -enable-kvm -S -machine q35  \
          -device pcie-root-port,id=pcie_root_port_0,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
          -monitor stdio disk.qcow2
      (qemu)device_add virtio-balloon-pci,id=balloon,bus=pcie_root_port_0
      (qemu)cont
      
      Balloon is deleted during guest boot.
      
      To fix this, save the control register beforehand and check that the
      power or LED state actually changed before ejecting (see the sketch
      after this entry).
      
      Note: this is more a hack than a solution; ideally we'd find a better
      way to detect ejects, or move away from ejects completely and instead
      monitor whether it is safe to delete the device based on e.g. its
      power state.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
      Reviewed-by: Igor Mammedov <imammedo@redhat.com>
      Tested-by: Igor Mammedov <imammedo@redhat.com>
      2841ab43
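      The check can be sketched as follows. The masks reflect the PCIe Slot
      Control register layout (Power Indicator Control in bits 9:8, Power
      Controller Control in bit 10); the macro and helper names are assumptions
      for the example and this is not the actual QEMU code:

          #include <stdbool.h>
          #include <stdint.h>

          #define SLTCTL_PWR_IND   0x0300   /* Power Indicator Control, bits 9:8 */
          #define SLTCTL_PWR_CTRL  0x0400   /* Power Controller Control, bit 10  */

          /*
           * Only let the existing power-off/eject handling run when the power
           * or LED state actually changed relative to the value saved before
           * the guest's write; otherwise ignore the redundant write.
           */
          static bool power_or_led_changed(uint16_t old_ctl, uint16_t new_ctl)
          {
              return ((old_ctl ^ new_ctl) & (SLTCTL_PWR_IND | SLTCTL_PWR_CTRL)) != 0;
          }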
    • pcie: don't skip multi-mask events · 861dc735
      Committed by Michael S. Tsirkin
      If we are trying to set multiple bits at once, testing that just one of
      them is already set gives a false positive. As a result we won't
      interrupt the guest if e.g. presence detect change and attention button
      press are both set, which happens with multi-function device removal
      (see the sketch after this entry).
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Igor Mammedov <imammedo@redhat.com>
      Reviewed-by: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
      861dc735
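      The distinction can be sketched as follows (illustrative helper, not the
      actual QEMU code):

          #include <stdbool.h>
          #include <stdint.h>

          /*
           * When raising a hotplug event that sets several slot-status bits at
           * once, only skip the interrupt if *all* of those bits were already
           * set, not if *any* one of them was.
           */
          static bool need_interrupt(uint16_t old_status, uint16_t event_bits)
          {
              /* Buggy check: if (old_status & event_bits) -> skip interrupt. */
              /* Correct check: interrupt unless every bit was already set.   */
              return (old_status & event_bits) != event_bits;
          }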
    • Merge remote-tracking branch 'remotes/maxreitz/tags/pull-block-2019-06-24' into staging · 7fec76a0
      Committed by Peter Maydell
      Block patches:
      - The SSH block driver now uses libssh instead of libssh2
      - The VMDK block driver gets read-only support for the seSparse
        subformat
      - Various fixes
      
      # gpg: Signature made Mon 24 Jun 2019 15:42:56 BST
      # gpg:                using RSA key 91BEB60A30DB3E8857D11829F407DB0061D5CF40
      # gpg:                issuer "mreitz@redhat.com"
      # gpg: Good signature from "Max Reitz <mreitz@redhat.com>" [full]
      # Primary key fingerprint: 91BE B60A 30DB 3E88 57D1  1829 F407 DB00 61D5 CF40
      
      * remotes/maxreitz/tags/pull-block-2019-06-24:
        iotests: Fix 205 for concurrent runs
        ssh: switch from libssh2 to libssh
        vmdk: Add read-only support for seSparse snapshots
        vmdk: Reduce the max bound for L1 table size
        vmdk: Fix comment regarding max l1_size coverage
        iotest 134: test cluster-misaligned encrypted write
        blockdev: enable non-root nodes for transaction drive-backup source
        nvme: do not advertise support for unsupported arbitration mechanism
      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
      7fec76a0
  2. 24 June 2019, 8 commits
    • iotests: Fix 205 for concurrent runs · ab5d4a30
      Committed by Max Reitz
      Tests should place their files into the test directory.  This includes
      Unix sockets.  Test 205 currently fails to do so, which prevents it
      from being run concurrently.
      Signed-off-by: Max Reitz <mreitz@redhat.com>
      Message-id: 20190618210238.9524-1-mreitz@redhat.com
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Max Reitz <mreitz@redhat.com>
      ab5d4a30
    • ssh: switch from libssh2 to libssh · b10d49d7
      Committed by Pino Toscano
      Rewrite the implementation of the ssh block driver to use libssh instead
      of libssh2.  The libssh library has various advantages over libssh2:
      - easier API for authentication (for example for using ssh-agent)
      - easier API for known_hosts handling
      - supports newer types of keys in known_hosts
      
      Use APIs/features available in libssh 0.8 conditionally, so that older
      versions (which are not recommended) remain supported (a minimal
      connection sketch follows this entry).
      
      Adjust iotest 207 to match the different error messages and to find the
      default key type for localhost (so the fingerprint can be compared
      properly).
      Contributed-by: Max Reitz <mreitz@redhat.com>
      
      Adjust the various Docker/Travis scripts to use libssh when available
      instead of libssh2. The mingw/mxe testing is dropped for now, as there
      are no packages for it.
      Signed-off-by: Pino Toscano <ptoscano@redhat.com>
      Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
      Acked-by: Alex Bennée <alex.bennee@linaro.org>
      Message-id: 20190620200840.17655-1-ptoscano@redhat.com
      Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
      Message-id: 5873173.t2JhDm7DL7@lindworm.usersys.redhat.com
      Signed-off-by: Max Reitz <mreitz@redhat.com>
      b10d49d7
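      For reference, a minimal libssh connection sequence (using only public
      libssh APIs, assuming libssh >= 0.8; an illustrative sketch, not the
      block driver's actual code) looks roughly like this:

          #include <libssh/libssh.h>
          #include <stdio.h>
          #include <stdlib.h>

          int main(void)
          {
              int ret = EXIT_FAILURE;
              ssh_session s = ssh_new();

              if (!s) {
                  return EXIT_FAILURE;
              }
              ssh_options_set(s, SSH_OPTIONS_HOST, "localhost");
              ssh_options_set(s, SSH_OPTIONS_USER, "user");

              if (ssh_connect(s) != SSH_OK) {
                  fprintf(stderr, "connect: %s\n", ssh_get_error(s));
                  ssh_free(s);
                  return EXIT_FAILURE;
              }
              /* known_hosts check; this is the libssh >= 0.8 spelling,
               * older releases use ssh_is_server_known() instead. */
              if (ssh_session_is_known_server(s) != SSH_KNOWN_HOSTS_OK) {
                  fprintf(stderr, "host key verification failed\n");
              } else if (ssh_userauth_publickey_auto(s, NULL, NULL) != SSH_AUTH_SUCCESS) {
                  /* tries ssh-agent and on-disk keys automatically */
                  fprintf(stderr, "auth: %s\n", ssh_get_error(s));
              } else {
                  /* ... open an sftp session and do block I/O here ... */
                  ret = EXIT_SUCCESS;
              }
              ssh_disconnect(s);
              ssh_free(s);
              return ret;
          }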
    • vmdk: Add read-only support for seSparse snapshots · 98eb9733
      Committed by Sam Eiderman
      Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
      QEMU).
      
      This format was lacking in the following:
      
          * Grain directory (L1) and grain table (L2) entries were 32-bit,
            allowing access to only 2TB (slightly less) of data.
          * The grain size (default) was 512 bytes - leading to data
            fragmentation and many grain tables.
          * For space reclamation purposes, it was necessary to find all the
            grains which are not pointed to by any grain table - so a reverse
            mapping of "offset of grain in vmdk" to "grain table" must be
            constructed - which takes large amounts of CPU/RAM.
      
      The format specification can be found in VMware's documentation:
      https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
      
      In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
      introduced: SESparse (Space Efficient).
      
      This format fixes the above issues:
      
          * All entries are now 64-bit.
          * The grain size (default) is 4KB.
          * Grain directory and grain tables are now located at the beginning
            of the file.
            + seSparse format reserves space for all grain tables.
            + Grain tables can be addressed using an index.
            + Grains are located in the end of the file and can also be
              addressed with an index.
            - seSparse vmdks of large disks (64TB) have huge preallocated
              headers - mainly due to L2 tables, even for empty snapshots.
          * The header contains a reverse mapping ("backmap") of "offset of
            grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
            specifies for each grain - whether it is allocated or not.
            Using these data structures we can implement space reclamation
            efficiently.
          * Because the header now maintains two mappings:
              * The regular one (grain directory & grain tables)
              * A reverse one (backmap and free bitmap)
            these data structures can lose consistency upon a crash and result
            in a corrupted VMDK.
            Therefore, a journal is also added to the VMDK and is replayed
            when VMware reopens the file after a crash.
      
      Since ESXi 6.7, SESparse is the only snapshot format available.
      
      Unfortunately, VMware does not provide documentation regarding the new
      seSparse format.
      
      This commit is based on black-box research of the seSparse format.
      Various in-guest block operations and their effect on the snapshot file
      were tested.
      
      The only VMware-provided source of information (regarding the underlying
      implementation) was a log file on the ESXi host:
      
          /var/log/hostd.log
      
      Whenever an seSparse snapshot is created, the log is populated with
      seSparse records.
      
      Relevant log records are of the form:
      
      [...] Const Header:
      [...]  constMagic     = 0xcafebabe
      [...]  version        = 2.1
      [...]  capacity       = 204800
      [...]  grainSize      = 8
      [...]  grainTableSize = 64
      [...]  flags          = 0
      [...] Extents:
      [...]  Header         : <1 : 1>
      [...]  JournalHdr     : <2 : 2>
      [...]  Journal        : <2048 : 2048>
      [...]  GrainDirectory : <4096 : 2048>
      [...]  GrainTables    : <6144 : 2048>
      [...]  FreeBitmap     : <8192 : 2048>
      [...]  BackMap        : <10240 : 2048>
      [...]  Grain          : <12288 : 204800>
      [...] Volatile Header:
      [...] volatileMagic     = 0xcafecafe
      [...] FreeGTNumber      = 0
      [...] nextTxnSeqNumber  = 0
      [...] replayJournal     = 0
      
      The sizes that are seen in the log file are in sectors.
      Extents are of the following format: <offset : size>
      
      This commit is a strict implementation which enforces:
          * magics
          * version number 2.1
          * grain size of 8 sectors  (4KB)
          * grain table size of 64 sectors
          * zero flags
          * extent locations
      
      Additionally, this commit provides only a subset of the functionality
      offered by the seSparse format:
          * Read-only
          * No journal replay
          * No space reclamation
          * No unmap support
      
      Hence, the journal header, journal, free bitmap and backmap extents are
      unused; only the "classic" (L1 -> L2 -> data) grain access is
      implemented.
      
      However, there are several differences in the grain access itself.
      Grain directory (L1):
          * Grain directory entries are indexes (not offsets) to grain
            tables.
          * Valid grain directory entries have their highest nibble set to
            0x1.
          * Since grain tables are always located at the beginning of the
            file, the index fits into 32 bits, so we can use its low part
            when it is valid.
      Grain table (L2):
          * Grain table entries are indexes (not offsets) to grains.
          * If the highest nibble of the entry is:
              0x0:
                  The grain is not allocated.
                  The rest of the bytes are 0.
              0x1:
                  The grain is unmapped - guest sees a zero grain.
                  The rest of the bits point to the previously mapped grain,
                  see 0x3 case.
              0x2:
                  The grain is zero.
              0x3:
                  The grain is allocated - to get the index calculate:
                  ((entry & 0x0fff000000000000) >> 48) |
                  ((entry & 0x0000ffffffffffff) << 12)
          * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
            grain which results from the guest using sg_unmap to unmap the
            grain; the grain itself still exists in the grain extent, and a
            space reclamation procedure should delete it.
            Unmapping a zero grain has no effect (0x2 will not change to 0x1),
            but unmapping an unallocated grain naturally will (0x0 to 0x1).
      
      In order to implement seSparse, some fields had to be changed to support
      both 32-bit and 64-bit entry sizes (a grain-table decoding sketch follows
      this entry).
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
      Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
      Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
      Message-id: 20190620091057.47441-4-shmuel.eiderman@oracle.com
      Signed-off-by: Max Reitz <mreitz@redhat.com>
      98eb9733
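      The grain table entry decoding described above can be sketched as follows
      (a self-contained illustration based on the black-box description in this
      commit message, not the exact QEMU code):

          #include <stdint.h>

          /* Grain states, taken from the highest nibble of an L2 entry. */
          enum sesparse_grain_type {
              GRAIN_UNALLOCATED = 0x0,
              GRAIN_UNMAPPED    = 0x1,   /* guest reads zeroes */
              GRAIN_ZERO        = 0x2,
              GRAIN_ALLOCATED   = 0x3,
          };

          static enum sesparse_grain_type grain_type(uint64_t entry)
          {
              return (enum sesparse_grain_type)(entry >> 60);
          }

          /* For an allocated grain (type 0x3), recover the grain index
           * exactly as described in the commit message. */
          static uint64_t grain_index(uint64_t entry)
          {
              return ((entry & 0x0fff000000000000ULL) >> 48) |
                     ((entry & 0x0000ffffffffffffULL) << 12);
          }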
    • vmdk: Reduce the max bound for L1 table size · 59d6ee48
      Committed by Sam Eiderman
      512M L1 entries is a very loose bound; only 32M are required to store
      the maximal supported VMDK file size of 2TB.
      
      This also fixes qemu-iotest 59: the failure now occurs earlier, on the
      impossible L1 table size.
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
      Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
      Message-id: 20190620091057.47441-3-shmuel.eiderman@oracle.com
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Signed-off-by: Max Reitz <mreitz@redhat.com>
      59d6ee48
    • vmdk: Fix comment regarding max l1_size coverage · 940a2cd5
      Committed by Sam Eiderman
      Commit b0651b8c ("vmdk: Move l1_size check into vmdk_add_extent")
      extended the l1_size check from VMDK4 to VMDK3 but did not update the
      default coverage in the moved comment.
      
      The previous vmdk4 calculation:
      
          (512 * 1024 * 1024) * 512(l2 entries) * 65536(grain) = 16PB
      
      The added vmdk3 calculation:
      
          (512 * 1024 * 1024) * 4096(l2 entries) * 512(grain) = 1PB
      
      Add the vmdk3 calculation to the comment (both figures are verified by
      the sketch after this entry).
      
      In any case, VMware does not offer virtual disks larger than 2TB for
      vmdk4/vmdk3, or 64TB for the new undocumented seSparse format, which is
      not yet implemented in QEMU.
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
      Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
      Message-id: 20190620091057.47441-2-shmuel.eiderman@oracle.com
      Reviewed-by: yuchenlin <yuchenlin@synology.com>
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Signed-off-by: Max Reitz <mreitz@redhat.com>
      940a2cd5
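      Both coverage figures can be verified with a tiny self-contained check
      (illustrative only, not part of the patch):

          #include <inttypes.h>
          #include <stdint.h>
          #include <stdio.h>

          /* Recompute the two coverage figures quoted above. */
          int main(void)
          {
              const uint64_t max_l1_entries = 512ULL * 1024 * 1024;

              uint64_t vmdk4 = max_l1_entries * 512 * 65536;   /* l2 entries * grain */
              uint64_t vmdk3 = max_l1_entries * 4096 * 512;

              printf("vmdk4: %" PRIu64 " PB\n", vmdk4 >> 50);   /* prints 16 */
              printf("vmdk3: %" PRIu64 " PB\n", vmdk3 >> 50);   /* prints 1  */
              return 0;
          }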
    • iotest 134: test cluster-misaligned encrypted write · 6ec889eb
      Committed by Anton Nefedov
      COW areas (even empty/zero ones) require encryption too.
      Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Reviewed-by: Alberto Garcia <berto@igalia.com>
      Message-id: 20190516143028.81155-1-anton.nefedov@virtuozzo.com
      Signed-off-by: Max Reitz <mreitz@redhat.com>
      6ec889eb
    • blockdev: enable non-root nodes for transaction drive-backup source · 85c9d133
      Committed by Vladimir Sementsov-Ogievskiy
      We forgot to enable it for the transaction's .prepare, while it is
      already enabled in do_drive_backup since commit a2d665c1
          ("blockdev: loosen restrictions on drive-backup source node").
      Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
      Message-id: 20190618140804.59214-1-vsementsov@virtuozzo.com
      Reviewed-by: John Snow <jsnow@redhat.com>
      Signed-off-by: Max Reitz <mreitz@redhat.com>
      85c9d133
    • nvme: do not advertise support for unsupported arbitration mechanism · 1cc354ac
      Committed by Klaus Birkelund Jensen
      The device mistakenly reports that the Weighted Round Robin with Urgent
      Priority Class arbitration mechanism is supported.
      
      It is not (see the sketch after this entry).
      Signed-off-by: Klaus Birkelund Jensen <klaus.jensen@cnexlabs.com>
      Message-id: 20190606092530.14206-1-klaus@birkelund.eu
      Acked-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Max Reitz <mreitz@redhat.com>
      1cc354ac
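      Per the NVMe specification, the supported arbitration mechanisms are
      advertised in the CAP.AMS field (bits 18:17 of the Controller
      Capabilities register), where bit 17 indicates Weighted Round Robin with
      Urgent Priority Class. A hedged sketch of not advertising that bit
      (illustrative only; the macro and helper names are assumptions, not
      QEMU's actual code):

          #include <stdint.h>

          /* CAP.AMS bit 17: Weighted Round Robin with Urgent Priority Class. */
          #define NVME_CAP_AMS_WRRU   (1ULL << 17)

          /* Build the CAP value without claiming WRR-with-Urgent support,
           * since the emulated controller only implements round robin. */
          static uint64_t build_cap(uint64_t cap)
          {
              return cap & ~NVME_CAP_AMS_WRRU;
          }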
  3. 21 June 2019, 28 commits