1. 07 Feb, 2018 8 commits
    • A
      vfio/pci: Allow relocating MSI-X MMIO · 89d5202e
      Alex Williamson committed
      Recently proposed vfio-pci kernel changes (v4.16) remove the
      restriction preventing userspace from mmap'ing PCI BARs in areas
      overlapping the MSI-X vector table.  This change is primarily intended
      to benefit host platforms which make use of system page sizes larger
      than the PCI spec recommendation for alignment of MSI-X data
      structures (ie. not x86_64).  In the case of POWER systems, the SPAPR
      spec requires the VM to program MSI-X using hypercalls, rendering the
      MSI-X vector table unused in the VM view of the device.  However,
      ARM64 platforms also support 64KB pages and rely on QEMU emulation of
      MSI-X.  Regardless of the kernel driver allowing mmaps overlapping
      the MSI-X vector table, emulation of the MSI-X vector table also
      prevents direct mapping of device MMIO spaces overlapping this page.
      Thanks to the fact that PCI devices have a standard self discovery
      mechanism, we can try to resolve this by relocating the MSI-X data
      structures, either by creating a new PCI BAR or extending an existing
      BAR and updating the MSI-X capability for the new location.  There's
      even a very slim chance that this could benefit devices which do not
      adhere to the PCI spec alignment guidelines on x86_64 systems.
      
      This new x-msix-relocation option accepts the following choices:
      
        off: Disable MSI-X relocation, use native device config (default)
        auto: Use a known good combination for the platform/device (none yet)
        bar0..bar5: Specify the target BAR for MSI-X data structures
      
      If compatible, the target BAR will either be created or extended and
      the new portion will be used for MSI-X emulation.
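
As a rough illustration of the accepted values, the following Python sketch validates an x-msix-relocation string. The function name parse_msix_relocation is hypothetical and this is not QEMU's property parser; it only mirrors the three choices listed above.

```python
def parse_msix_relocation(value):
    """Map an option string to 'off', 'auto', or a BAR index 0..5.

    Hypothetical helper for illustration only, not QEMU code.
    """
    if value in ("off", "auto"):
        return value
    # bar0..bar5: exactly four characters, ending in a digit 0 through 5
    if len(value) == 4 and value.startswith("bar") and value[3] in "012345":
        return int(value[3])
    raise ValueError("expected off, auto, or bar0..bar5: %r" % (value,))
```

Anything outside the listed choices, such as bar6, is rejected because there are only six BAR slots.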
      
      The first obvious user question with this option is how to determine
      whether a given platform and device might benefit from this option.
      In most cases, the answer is that it won't, especially on x86_64.
      Devices often dedicate an entire BAR to MSI-X and therefore no
      performance sensitive registers overlap the MSI-X area.  Take for
      example:
      
      # lspci -vvvs 0a:00.0
      0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection
      	...
      	Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K]
      	Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K]
      	...
      	Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
      		Vector table: BAR=3 offset=00000000
      		PBA: BAR=3 offset=00002000
      
      This device uses the 16K bar3 for MSI-X with the vector table at
      offset zero and the pending bits array at offset 8K, fully honoring
      the PCI spec alignment guidance.  The data sheet specifically refers
      to this as an MSI-X BAR.  This device would not see a benefit from
      MSI-X relocation regardless of the platform, regardless of the page
      size.
      
      However, here's another example:
      
      # lspci -vvvs 02:00.0
      02:00.0 Serial Attached SCSI controller: xxxxxxxx
      	...
      	Region 0: I/O ports at c000 [size=256]
      	Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K]
      	Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K]
      	...
      	Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
      		Vector table: BAR=1 offset=0000e000
      		PBA: BAR=1 offset=0000f000
      
      Here the MSI-X data structures are placed on separate 4K pages at the
      end of a 64KB BAR.  If our host page size is 4K, we're likely fine,
      but at 64KB page size, MSI-X emulation at that location prevents the
      entire BAR from being directly mapped into the VM address space.
      Overlapping performance sensitive registers then starts to be a very
      likely scenario on such a platform.  At this point, the user could
      enable tracing on vfio_region_read and vfio_region_write to determine
      more conclusively if device accesses are being trapped through QEMU.
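
The effect of host page size on this device can be estimated with a small calculation. The sketch below is an illustrative model, not QEMU or kernel code: it assumes that any host page touching the vector table or PBA must be emulated, uses the PCI-defined 16 bytes per vector table entry, and sizes the PBA at 8 bytes (one bit per vector, rounded up to a 64-bit word). The function names are hypothetical.

```python
def trapped_range(offset, length, page_size):
    """Return the page-aligned [start, end) span covering a structure."""
    start = offset - (offset % page_size)
    end = -(-(offset + length) // page_size) * page_size  # round up
    return start, end

def mappable_bytes(bar_size, structures, page_size):
    """Count BAR bytes that could still be directly mapped.

    structures is a list of (offset, length) pairs for the MSI-X
    vector table and PBA within the BAR.
    """
    trapped_pages = set()
    for offset, length in structures:
        start, end = trapped_range(offset, length, page_size)
        trapped_pages.update(range(start, min(end, bar_size), page_size))
    # A page larger than the BAR can only trap what the BAR contains.
    trapped = sum(min(page_size, bar_size - p) for p in trapped_pages)
    return bar_size - trapped

# BAR1 of the example device: 64K, table at 0xe000, PBA at 0xf000
msix = [(0xE000, 16 * 16), (0xF000, 8)]
print(mappable_bytes(0x10000, msix, 4096))    # 4K host pages
print(mappable_bytes(0x10000, msix, 65536))   # 64K host pages
```

Under these assumptions, 4K pages leave 56K of the BAR directly mappable, while 64K pages leave nothing.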
      
      Upon finding a device and platform in need of MSI-X relocation, the
      next problem is how to choose the target PCI BAR to host the MSI-X data
      structures.  A few key rules to keep in mind for this selection
      include:
      
       * There are only 6 BAR slots, bar0..bar5
       * 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot
       * PCI BARs are always a power of 2 in size, extending == doubling
       * The maximum size of a 32-bit BAR is 2GB
       * MSI-X data structures must reside in an MMIO BAR
      
      Using these rules, we can evaluate each BAR of the second example
      device above as follows:
      
       bar0: I/O port BAR, incompatible with MSI-X tables
       bar1: BAR could be extended, incurring another 64KB of MMIO
       bar2: Unavailable, bar1 is 64-bit, this register is used by bar1
       bar3: BAR could be extended, incurring another 256KB of MMIO
       bar4: Unavailable, bar3 is 64-bit, this register is used by bar3
       bar5: Available, empty BAR, minimum additional MMIO
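
The evaluation above can be expressed as a short sketch. This is a simplified model of the listed rules, not the logic in this patch: the function names, the tuple layout, and the new_bar_size parameter (the size assumed for a freshly created BAR) are all hypothetical.

```python
MAX_32BIT_BAR = 2 << 30  # 2GB limit for a 32-bit BAR

def relocation_cost(bars, new_bar_size):
    """Return {slot: additional MMIO bytes, or None if unusable}.

    bars maps a slot number to (kind, size), where kind is "io",
    "mem32" or "mem64"; slots missing from the map are empty.
    """
    costs = {}
    consumed = set()  # upper halves of 64-bit BARs
    for slot in range(6):
        bar = bars.get(slot)
        if slot in consumed:
            costs[slot] = None              # register used by previous BAR
        elif bar is None:
            costs[slot] = new_bar_size      # create a new BAR here
        elif bar[0] == "io":
            costs[slot] = None              # MSI-X needs an MMIO BAR
        else:
            kind, size = bar
            if kind == "mem64":
                consumed.add(slot + 1)
            too_big = kind == "mem32" and size * 2 > MAX_32BIT_BAR
            costs[slot] = None if too_big else size  # extending == doubling
    return costs

def preference_order(costs):
    """Usable slots, cheapest additional MMIO first."""
    usable = [slot for slot, cost in costs.items() if cost is not None]
    return sorted(usable, key=costs.get)

# The second example device, assuming a new BAR would be 4K
device = {0: ("io", 256), 1: ("mem64", 64 << 10), 3: ("mem64", 256 << 10)}
print(preference_order(relocation_cost(device, 4096)))
```

For the example device this yields slots 5, 1 and 3, in that order.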
      
      A secondary optimization we might wish to make in relocating MSI-X
      is to minimize the additional MMIO required for the device, therefore
      we might test the available choices in order of preference as bar5,
      bar1, and finally bar3.  The original proposal for this feature
      included an 'auto' option which would choose bar5 in this case, but
      various drivers have been found that make assumptions about the
      properties of the "first" BAR or the size of BARs such that there
      appears to be no foolproof automatic selection available, requiring
      known good combinations to be sourced from users.  This patch is
      pre-enabled for an 'auto' selection making use of a validated lookup
      table, but no entries are yet identified.

      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Tested-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • A
      qapi: Create DEFINE_PROP_OFF_AUTO_PCIBAR · c3bbbdbf
      Alex Williamson committed
      Add an option which allows the user to specify a PCI BAR number,
      including an 'off' and 'auto' selection.
      
      Cc: Markus Armbruster <armbru@redhat.com>
      Cc: Eric Blake <eblake@redhat.com>
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Tested-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Markus Armbruster <armbru@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • A
      vfio/pci: Emulate BARs · 04f336b0
      Alex Williamson committed
      The kernel provides similar emulation of PCI BAR register access to
      QEMU, so up until now we've used that for things like BAR sizing and
      storing the BAR address.  However, if we intend to resize BARs or add
      BARs that don't exist on the physical device, we need to switch to the
      pure QEMU emulation of the BAR.

      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Tested-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • A
      vfio/pci: Add base BAR MemoryRegion · 3a286732
      Alex Williamson committed
      Add one more layer to our stack of MemoryRegions, this base region
      allows us to register BARs independently of the vfio region or to
      extend the size of BARs which do map to a region.  This will be
      useful when we want hypervisor defined BARs or sections of BARs,
      for purposes such as relocating MSI-X emulation.  We therefore call
      msix_init() based on this new base MemoryRegion, while the quirks,
      which only modify regions, still operate on those sub-MemoryRegions.

      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • A
      vfio/pci: Fixup VFIOMSIXInfo comment · edd09278
      Alex Williamson committed
      The fields were removed in the referenced commit, but the comment
      still mentions them.
      
      Fixes: 2fb9636e ("vfio-pci: Remove unused fields from VFIOMSIXInfo")
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Tested-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • A
      spapr/iommu: Enable in-kernel TCE acceleration via VFIO KVM device · 9ded780c
      Alexey Kardashevskiy committed
      In order to enable TCE operations support in KVM, we have to inform
      the KVM about VFIO groups being attached to specific LIOBNs;
      the necessary bits are implemented already by IOMMU MR and VFIO.
      
      This defines get_attr() for the SPAPR TCE IOMMU MR which makes VFIO
      call the KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE ioctl and establish
      LIOBN-to-IOMMU link.
      
      This changes spapr_tce_set_need_vfio() to avoid TCE table reallocation
      if the kernel supports the TCE acceleration.

      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: David Gibson <david@gibson.dropbear.id.au>
      [aw - remove unnecessary sys/ioctl.h include]
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • A
      vfio/spapr: Use iommu memory region's get_attr() · 07bc681a
      Alexey Kardashevskiy committed
      In order to enable TCE operations support in KVM, we have to inform
      the KVM about VFIO groups being attached to specific LIOBNs. The KVM
      already knows about VFIO groups, the only bit missing is which
      in-kernel TCE table (the one with user visible TCEs) should update
      the attached groups. There is a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE
      attribute of the VFIO KVM device which receives a groupfd/tablefd couple.
      
      This uses a new memory_region_iommu_get_attr() helper to get the IOMMU fd
      and calls KVM to establish the link.
      
      As get_attr() is not implemented yet, this should cause no behavioural
      change.

      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • A
      memory/iommu: Add get_attr() · f1334de6
      Alexey Kardashevskiy committed
      This adds get_attr() to IOMMUMemoryRegionClass, like
      iommu_ops::domain_get_attr in the Linux kernel.
      
      This defines the first attribute - IOMMU_ATTR_SPAPR_TCE_FD - which
      will be used between the pSeries machine and VFIO-PCI.

      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  2. 06 Feb, 2018 1 commit
  3. 05 Feb, 2018 4 commits
  4. 03 Feb, 2018 7 commits
  5. 02 Feb, 2018 20 commits
    • P
      Merge remote-tracking branch 'remotes/kraxel/tags/audio-20180202-pull-request' into staging · fabbd691
      Peter Maydell committed
      audio: two small fixes.
      
      # gpg: Signature made Fri 02 Feb 2018 07:49:20 GMT
      # gpg:                using RSA key 4CB6D8EED3E87138
      # gpg: Good signature from "Gerd Hoffmann (work) <kraxel@redhat.com>"
      # gpg:                 aka "Gerd Hoffmann <gerd@kraxel.org>"
      # gpg:                 aka "Gerd Hoffmann (private) <kraxel@gmail.com>"
      # Primary key fingerprint: A032 8CFF B93A 17A7 9901  FE7D 4CB6 D8EE D3E8 7138
      
      * remotes/kraxel/tags/audio-20180202-pull-request:
        hw/audio/sb16.c: change dolog() to qemu_log_mask()
        hw/audio/wm8750: move WM8750 declarations from i2c/i2c.h to audio/wm8750.h

      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
    • P
      Merge remote-tracking branch 'remotes/cminyard/tags/for-release-20180201' into staging · 6a95e258
      Peter Maydell committed
      Lots of little miscellaneous fixes for the IPMI code, plus
      add me as the IPMI maintainer.
      
      # gpg: Signature made Thu 01 Feb 2018 18:44:55 GMT
      # gpg:                using RSA key 61F38C90919BFF81
      # gpg: Good signature from "Corey Minyard <cminyard@mvista.com>"
      # gpg:                 aka "Corey Minyard <minyard@acm.org>"
      # gpg:                 aka "Corey Minyard <corey@minyard.net>"
      # gpg:                 aka "Corey Minyard <minyard@mvista.com>"
      # gpg: WARNING: This key is not certified with a trusted signature!
      # gpg:          There is no indication that the signature belongs to the owner.
      # Primary key fingerprint: FD0D 5CE6 7CE0 F59A 6688  2686 61F3 8C90 919B FF81
      
      * remotes/cminyard/tags/for-release-20180201:
        ipmi: Allow BMC device properties to be set
        ipmi: disable IRQ and ATN on an external disconnect
        ipmi: Fix macro issues
        ipmi: Add the platform event message command
        ipmi: Don't set the timestamp on add events that don't have it
        ipmi: Fix SEL get/set time commands
        Add maintainer for the IPMI code

      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
    • P
      Merge remote-tracking branch 'remotes/elmarco/tags/dump-pull-request' into staging · e486b528
      Peter Maydell committed
      # gpg: Signature made Thu 01 Feb 2018 11:15:42 GMT
      # gpg:                using RSA key DAE8E10975969CE5
      # gpg: Good signature from "Marc-André Lureau <marcandre.lureau@redhat.com>"
      # gpg:                 aka "Marc-André Lureau <marcandre.lureau@gmail.com>"
      # Primary key fingerprint: 87A9 BD93 3F87 C606 D276  F62D DAE8 E109 7596 9CE5
      
      * remotes/elmarco/tags/dump-pull-request:
        dump-guest-memory.py: skip vmcoreinfo section if not available

      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
    • G
      tests: virtio-9p: add FLUSH operation test · 357e2f7f
      Greg Kurz committed
      The idea is to send a victim request that will possibly block in the
      server and to send a flush request to cancel the victim request.
      
      This patch adds two tests to verify that:
      - the server does not reply to a victim request that was actually
        cancelled
      - the server replies to the flush request after replying to the
        victim request if it could not cancel it
      
      9p request cancellation reference:
      
      http://man.cat-v.org/plan_9/5/flush

      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      (groug, change the test to only write a single byte to avoid
              any alignment or endianness consideration)
    • G
      libqos/virtio: return length written into used descriptor · be3a6781
      Greg Kurz committed
      When a 9p request is flushed (ie, cancelled) by the guest, the device
      is expected to simply mark the request as used, without sending a 9p
      reply (ie, without writing anything into the used buffer).
      
      To be able to test this, we need access to the length written by the
      device into the used descriptor. This patch adds a uint32_t * argument
      to qvirtqueue_get_buf() and qvirtio_wait_used_elem() for this purpose.
      
      All existing users are updated accordingly.

      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    • P
      Merge remote-tracking branch 'remotes/cody/tags/block-pull-request' into staging · 707eafb8
      Peter Maydell committed
      # gpg: Signature made Thu 01 Feb 2018 04:05:22 GMT
      # gpg:                using RSA key BDBE7B27C0DE3057
      # gpg: Good signature from "Jeffrey Cody <jcody@redhat.com>"
      # gpg:                 aka "Jeffrey Cody <jeff@codyprime.org>"
      # gpg:                 aka "Jeffrey Cody <codyprime@gmail.com>"
      # Primary key fingerprint: 9957 4B4D 3474 90E7 9D98  D624 BDBE 7B27 C0DE 3057
      
      * remotes/cody/tags/block-pull-request:
        iotests: Make 200 run on tmpfs
        block/ssh: fix possible segmentation fault when .desc is not null-terminated

      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
    • P
      virtio-gpu: disallow vIOMMU · 34e304e9
      Peter Xu committed
      virtio-gpu has a special code path that bypasses vIOMMU protection.  So
      for now let's disable iommu_platform for the device until we fully
      support that (if needed).
      
      After the patch, neither virtio-vga nor virtio-gpu will allow booting
      with the iommu_platform parameter set.
      
      CC: Gerd Hoffmann <kraxel@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-id: 20180131040401.3550-1-peterx@redhat.com
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    • J
      hw/audio/sb16.c: change dolog() to qemu_log_mask() · 8ec660b8
      John Arbuckle committed
      Changes all the occurrences of dolog() to qemu_log_mask().

      Signed-off-by: John Arbuckle <programmingkidx@gmail.com>
      Message-id: 20180201172744.7504-1-programmingkidx@gmail.com
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    • P
      hw/audio/wm8750: move WM8750 declarations from i2c/i2c.h to audio/wm8750.h · 7ab14c5a
      Philippe Mathieu-Daudé committed
      While here, use TYPE_WM8750 and declare a data_req_cb() typedef.

      Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
      Message-id: 20170919123053.32675-1-f4bug@amsat.org
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    • D
      ui: correctly advance output buffer when writing SASL data · 627ebec2
      Daniel P. Berrangé committed
      In this previous commit:
      
        commit 8f61f1c5
        Author: Daniel P. Berrange <berrange@redhat.com>
        Date:   Mon Dec 18 19:12:20 2017 +0000
      
          ui: track how much decoded data we consumed when doing SASL encoding
      
      I attempted to fix a flaw with tracking how much data had actually been
      processed when encoding with SASL. With that flaw, the VNC server could
      mistakenly discard queued data that had not been sent.
      
      The fix was not quite right though, because it merely decremented the
      vs->output.offset value. This is effectively discarding data from the
      end of the pending output buffer. We actually need to discard data from
      the start of the pending output buffer. We also want to free memory that
      is no longer required. The correct way to handle this is to use the
      buffer_advance() helper method instead of directly manipulating the
      offset value.
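
The difference between the two approaches can be shown with a toy model of the pending output buffer. This sketch uses a plain bytes object and hypothetical function names; it is not QEMU's Buffer implementation, only an illustration of why shrinking the offset drops the wrong bytes.

```python
def discard_by_offset(pending, sent):
    """Flawed approach: shrink the length, dropping bytes from the END."""
    return pending[:len(pending) - sent]

def advance(pending, sent):
    """Intended behaviour: drop the already-sent bytes from the START."""
    return pending[sent:]

pending = b"ABCDEF"   # six bytes queued, the first two were just sent
print(discard_by_offset(pending, 2))  # keeps the sent bytes, loses unsent ones
print(advance(pending, 2))            # keeps exactly the unsent bytes
```

With the flawed approach the two bytes already sent stay queued and the last two unsent bytes are lost.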

      Reported-by: Laszlo Ersek <lersek@redhat.com>
      Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Laszlo Ersek <lersek@redhat.com>
      Message-id: 20180201155841.27509-1-berrange@redhat.com
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    • D
      ui: convert VNC server to QIONetListener · 13e1d0e7
      Daniel P. Berrange committed
      The VNC server already has the ability to listen on multiple sockets.
      Converting it to use the QIONetListener APIs though, will reduce the
      amount of code in the VNC server and improve the clarity of what is
      left.

      Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
      Message-id: 20180201164514.10330-1-berrange@redhat.com
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    • D
      ui: fix mixup between qnum and qcode in SDL1 key handling · 8ea9c80a
      Daniel P. Berrangé committed
      The previous commit:
      
        commit 2ec78706
        Author: Daniel P. Berrange <berrange@redhat.com>
        Date:   Wed Jan 17 16:47:15 2018 +0000
      
          ui: convert GTK and SDL1 frontends to keycodemapdb
      
      changed the x_keymap.c keymap so that its target was qcodes instead of
      qnums. It updated the GTK frontend to take account of this change, but
      forgot to update the SDL1 frontend. Thus the SDL frontend was getting
      qcodes but dispatching them as if they were qnums. IOW, keyboard input
      was completely hosed with SDL1. Since the keyboard layout tables are
      still all based on qnums, it is easier to just keep SDL1 using qnums as
      it will be deleted in a few releases time.

      Reported-by: BALATON Zoltan <balaton@eik.bme.hu>
      Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
      Tested-by: BALATON Zoltan <balaton@eik.bme.hu>
      Message-id: 20180201180033.14255-1-berrange@redhat.com
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    • G
      tests: virtio-9p: add WRITE operation test · 354b86f8
      Greg Kurz committed
      Trivial test of a successful write.

      Signed-off-by: Greg Kurz <groug@kaod.org>
      (groug, handle potential overflow when computing request size,
              add missing g_free(buf),
              backend handles one written byte at a time to validate
              the server doesn't do short-reads)
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    • G
      tests: virtio-9p: add LOPEN operation test · 82469aae
      Greg Kurz committed
      Trivial test of a successful open.

      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    • G
      tests: virtio-9p: use the synth backend · 2893ddd5
      Greg Kurz committed
      The purpose of virtio-9p-test is to test the virtio-9p device, especially
      the 9p server state machine. We don't really care what fsdev backend we're
      using. Moreover, if we want to be able to test the flush request or a
      device reset with in-flight I/O, it is close to impossible to achieve
      with a physical backend because we cannot ask it reliably to put an I/O
      on hold at a specific point in time.
      
      Fortunately, we can do that with the synthetic backend, which allows
      registering callbacks on read/write accesses to a specific file. This will
      be used by a later patch to test the 9P flush request.
      
      The walk request test is converted to using the synth backend.

      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    • G
      tests: virtio-9p: wait for completion in the test code · 60b1fa9d
      Greg Kurz committed
      In order to test request cancellation, we will need to send multiple
      requests and wait for the associated replies. Since we poll the ISR
      to know if a request completed, we may have several replies to parse
      when we detect ISR was set to 1.
      
      This patch moves the waiting out of the reply parsing path, up into
      the functional tests.

      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    • G
      tests: virtio-9p: move request tag to the test functions · 693b21d2
      Greg Kurz committed
      It doesn't really make sense to hide the request tag from the test
      functions. It prevents testing the 9p server behavior when passed
      a wrong tag (ie, still in use or different from P9_NOTAG for a
      version request). Also the spec says that a tag is reusable as soon
      as the corresponding request was replied to or flushed: no need to
      always increment tags like we do now. And finally, an upcoming test
      of the flush command will need to manipulate tags explicitly.
      
      This simply changes all request functions to have a tag argument.
      Except for the version request which needs P9_NOTAG, all other
      tests can pass 0 since they wait for the reply before sending
      another request.

      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    • K
      9pfs: Correctly handle cancelled requests · fc78d5ee
      Keno Fischer committed
      # Background
      
      I was investigating spurious non-deterministic EINTR returns from
      various 9p file system operations in a Linux guest served from the
      qemu 9p server.
      
       ## EINTR, ERESTARTSYS and the linux kernel
      
      When a signal arrives that the Linux kernel needs to deliver to user-space
      while a given thread is blocked (in the 9p case waiting for a reply to its
      request in 9p_client_rpc -> wait_event_interruptible), it asks whatever
      driver is currently running to abort its current operation (in the 9p case
      causing the submission of a TFLUSH message) and return to user space.
      In these situations, the error message reported is generally ERESTARTSYS.
      If the userspace process specified SA_RESTART, this means that the
      system call will get restarted upon completion of the signal handler
      delivery (assuming the signal handler doesn't modify the process state
      in complicated ways not relevant here). If SA_RESTART is not specified,
      ERESTARTSYS gets translated to EINTR and user space is expected to handle
      the restart itself.
      
       ## The 9p TFLUSH command
      
      The 9p TFLUSH command requests that the server abort an ongoing operation.
      The man page [1] specifies:
      
      ```
      If it recognizes oldtag as the tag of a pending transaction, it should
      abort any pending response and discard that tag.
      [...]
      When the client sends a Tflush, it must wait to receive the corresponding
      Rflush before reusing oldtag for subsequent messages. If a response to the
      flushed request is received before the Rflush, the client must honor the
      response as if it had not been flushed, since the completed request may
      signify a state change in the server
      ```
      
      In particular, this means that the server must not send a reply with the
      original tag in response to the cancellation request, because the client is
      obligated to interpret such a reply as a coincidental reply to the original
      request.
      
       # The bug
      
      When qemu receives a TFlush request, it sets the `cancelled` flag on the
      relevant pdu. This flag is periodically checked, e.g. in
      `v9fs_co_name_to_path`, and if set, the operation is aborted and the error
      is set to EINTR. However, the server then violates the spec, by returning
      to the client an Rerror response, rather than discarding the message
      entirely. As a result, the client is required to assume that said Rerror
      response is a result of the original request, not a result of the
      cancellation and thus passes the EINTR error back to user space.
      This is not the worst thing it could do, however as discussed above, the
      correct error code would have been ERESTARTSYS, such that user space
      programs with SA_RESTART set get correctly restarted upon completion of
      the signal handler.
      Instead, such programs get spurious EINTR results that they were not
      expecting to handle.
      
      It should be noted that there are plenty of user space programs that do not
      set SA_RESTART and do not correctly handle EINTR either. However, that is
      then a userspace bug. It should also be noted that this bug has been
      mitigated by a recent commit to the Linux kernel [2], which essentially
      prevents the kernel from sending Tflush requests unless the process is about
      to die (in which case the process likely doesn't care about the response).
      Nevertheless, for older kernels and to comply with the spec, I believe this
      change is beneficial.
      
       # Implementation
      
      The fix is fairly simple, just skipping notification of a reply if
      the pdu was previously cancelled. We do however, also notify the transport
      layer that we're doing this, so it can clean up any resources it may be
      holding. I also added a new trace event to distinguish
      operations that caused an error reply from those that were cancelled.
      
      One complication is that we only omit sending the message on EINTR errors in
      order to avoid confusing the rest of the code (which may assume that a
      client knows about a fid if it successfully passed it off to pdu_complete
      without checking for cancellation status). This does mean that if the server
      acts upon the cancellation flag, it always needs to set err to EINTR. I
      believe this is true of the current code.
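
The decision described above can be summarised as a small model. This is a hedged sketch of the rule as stated in this message, not the code in hw/9pfs: the function name and the string results are hypothetical, and err follows the convention of a negative errno value on failure.

```python
import errno

def completion_action(cancelled, err):
    """Decide what the server sends when a request finishes.

    Returns "discard" (no 9p reply, the transport is only told to
    clean up), "Rerror" (error reply) or "reply" (normal reply).
    """
    # Only an EINTR result on a cancelled pdu is suppressed; any other
    # outcome is still reported so the client's view of fids stays valid.
    if cancelled and err == -errno.EINTR:
        return "discard"
    if err < 0:
        return "Rerror"
    return "reply"

print(completion_action(True, -errno.EINTR))   # flushed and aborted
print(completion_action(True, 0))              # flushed too late, completed
print(completion_action(False, -errno.ENOENT)) # ordinary failure
```

A request that completes despite the flush is still answered normally, which matches the client-side rule that a response received before the Rflush must be honored.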
      
      [1] https://9fans.github.io/plan9port/man/man9/flush.html
      [2] https://github.com/torvalds/linux/commit/9523feac272ccad2ad8186ba4fcc891

      Signed-off-by: Keno Fischer <keno@juliacomputing.com>
      Reviewed-by: Greg Kurz <groug@kaod.org>
      [groug, send a zero-sized reply instead of detaching the buffer]
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
    • G
      9pfs: drop v9fs_register_transport() · 066eb006
      Greg Kurz committed
      No good reasons to do this outside of v9fs_device_realize_common().

      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>