1. 18 6月, 2019 1 次提交
  2. 27 5月, 2019 4 次提交
    • J
      vhost: scsi: add weight support · c1ea02f1
      Jason Wang 提交于
      This patch will check the weight and exit the loop if we exceeds the
      weight. This is useful for preventing scsi kthread from hogging cpu
      which is guest triggerable.
      
      This addresses CVE-2019-3900.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Fixes: 057cbf49 ("tcm_vhost: Initial merge for vhost level target fabric driver")
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com>
      c1ea02f1
    • J
      vhost: vsock: add weight support · e79b431f
      Jason Wang 提交于
      This patch will check the weight and exit the loop if we exceeds the
      weight. This is useful for preventing vsock kthread from hogging cpu
      which is guest triggerable. The weight can help to avoid starving the
      request from on direction while another direction is being processed.
      
      The value of weight is picked from vhost-net.
      
      This addresses CVE-2019-3900.
      
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Fixes: 433fc58e ("VSOCK: Introduce vhost_vsock.ko")
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      e79b431f
    • J
      vhost_net: fix possible infinite loop · e2412c07
      Jason Wang 提交于
      When the rx buffer is too small for a packet, we will discard the vq
      descriptor and retry it for the next packet:
      
      while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
      					      &busyloop_intr))) {
      ...
      	/* On overrun, truncate and discard */
      	if (unlikely(headcount > UIO_MAXIOV)) {
      		iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
      		err = sock->ops->recvmsg(sock, &msg,
      					 1, MSG_DONTWAIT | MSG_TRUNC);
      		pr_debug("Discarded rx packet: len %zd\n", sock_len);
      		continue;
      	}
      ...
      }
      
      This makes it possible to trigger a infinite while..continue loop
      through the co-opreation of two VMs like:
      
      1) Malicious VM1 allocate 1 byte rx buffer and try to slow down the
         vhost process as much as possible e.g using indirect descriptors or
         other.
      2) Malicious VM2 generate packets to VM1 as fast as possible
      
      Fixing this by checking against weight at the end of RX and TX
      loop. This also eliminate other similar cases when:
      
      - userspace is consuming the packets in the meanwhile
      - theoretical TOCTOU attack if guest moving avail index back and forth
        to hit the continue after vhost find guest just add new buffers
      
      This addresses CVE-2019-3900.
      
      Fixes: d8316f39 ("vhost: fix total length when packets are too short")
      Fixes: 3a4d5c94 ("vhost_net: a kernel-level virtio server")
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      e2412c07
    • J
      vhost: introduce vhost_exceeds_weight() · e82b9b07
      Jason Wang 提交于
      We used to have vhost_exceeds_weight() for vhost-net to:
      
      - prevent vhost kthread from hogging the cpu
      - balance the time spent between TX and RX
      
      This function could be useful for vsock and scsi as well. So move it
      to vhost.c. Device must specify a weight which counts the number of
      requests, or it can also specific a byte_weight which counts the
      number of bytes that has been processed.
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      e82b9b07
  3. 21 5月, 2019 2 次提交
  4. 15 5月, 2019 1 次提交
    • I
      mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
      Ira Weiny 提交于
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.comSigned-off-by: NIra Weiny <ira.weiny@intel.com>
      Reviewed-by: NMike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73b0140b
  5. 13 5月, 2019 1 次提交
  6. 11 4月, 2019 1 次提交
  7. 09 3月, 2019 1 次提交
  8. 07 3月, 2019 1 次提交
    • A
      vhost: silence an unused-variable warning · cfdbb4ed
      Arnd Bergmann 提交于
      On some architectures, the MMU can be disabled, leading to access_ok()
      becoming an empty macro that does not evaluate its size argument,
      which in turn produces an unused-variable warning:
      
      drivers/vhost/vhost.c:1191:9: error: unused variable 's' [-Werror,-Wunused-variable]
              size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
      
      Mark the variable as __maybe_unused to shut up that warning.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      cfdbb4ed
  9. 20 2月, 2019 1 次提交
  10. 05 2月, 2019 1 次提交
  11. 29 1月, 2019 1 次提交
    • J
      vhost: fix OOB in get_rx_bufs() · b46a0bf7
      Jason Wang 提交于
      After batched used ring updating was introduced in commit e2b3b35e
      ("vhost_net: batch used ring update in rx"). We tend to batch heads in
      vq->heads for more than one packet. But the quota passed to
      get_rx_bufs() was not correctly limited, which can result a OOB write
      in vq->heads.
      
              headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
                          vhost_len, &in, vq_log, &log,
                          likely(mergeable) ? UIO_MAXIOV : 1);
      
      UIO_MAXIOV was still used which is wrong since we could have batched
      used in vq->heads, this will cause OOB if the next buffer needs more
      than 960 (1024 (UIO_MAXIOV) - 64 (VHOST_NET_BATCH)) heads after we've
      batched 64 (VHOST_NET_BATCH) heads:
      Acked-by: NStefan Hajnoczi <stefanha@redhat.com>
      
      =============================================================================
      BUG kmalloc-8k (Tainted: G    B            ): Redzone overwritten
      -----------------------------------------------------------------------------
      
      INFO: 0x00000000fd93b7a2-0x00000000f0713384. First byte 0xa9 instead of 0xcc
      INFO: Allocated in alloc_pd+0x22/0x60 age=3933677 cpu=2 pid=2674
          kmem_cache_alloc_trace+0xbb/0x140
          alloc_pd+0x22/0x60
          gen8_ppgtt_create+0x11d/0x5f0
          i915_ppgtt_create+0x16/0x80
          i915_gem_create_context+0x248/0x390
          i915_gem_context_create_ioctl+0x4b/0xe0
          drm_ioctl_kernel+0xa5/0xf0
          drm_ioctl+0x2ed/0x3a0
          do_vfs_ioctl+0x9f/0x620
          ksys_ioctl+0x6b/0x80
          __x64_sys_ioctl+0x11/0x20
          do_syscall_64+0x43/0xf0
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      INFO: Slab 0x00000000d13e87af objects=3 used=3 fp=0x          (null) flags=0x200000000010201
      INFO: Object 0x0000000003278802 @offset=17064 fp=0x00000000e2e6652b
      
      Fixing this by allocating UIO_MAXIOV + VHOST_NET_BATCH iovs for
      vhost-net. This is done through set the limitation through
      vhost_dev_init(), then set_owner can allocate the number of iov in a
      per device manner.
      
      This fixes CVE-2018-16880.
      
      Fixes: e2b3b35e ("vhost_net: batch used ring update in rx")
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b46a0bf7
  12. 18 1月, 2019 1 次提交
    • J
      vhost: log dirty page correctly · cc5e7107
      Jason Wang 提交于
      Vhost dirty page logging API is designed to sync through GPA. But we
      try to log GIOVA when device IOTLB is enabled. This is wrong and may
      lead to missing data after migration.
      
      To solve this issue, when logging with device IOTLB enabled, we will:
      
      1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
         get HVA, for writable descriptor, get HVA through iovec. For used
         ring update, translate its GIOVA to HVA
      2) traverse the GPA->HVA mapping to get the possible GPA and log
         through GPA. Pay attention this reverse mapping is not guaranteed
         to be unique, so we should log each possible GPA in this case.
      
      This fix the failure of scp to guest during migration. In -next, we
      will probably support passing GIOVA->GPA instead of GIOVA->HVA.
      
      Fixes: 6b1e6cc7 ("vhost: new device IOTLB API")
      Reported-by: NJintack Lim <jintack@cs.columbia.edu>
      Cc: Jintack Lim <jintack@cs.columbia.edu>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc5e7107
  13. 15 1月, 2019 2 次提交
  14. 12 1月, 2019 1 次提交
    • Z
      vhost/vsock: fix vhost vsock cid hashing inconsistent · 7fbe078c
      Zha Bin 提交于
      The vsock core only supports 32bit CID, but the Virtio-vsock spec define
      CID (dst_cid and src_cid) as u64 and the upper 32bits is reserved as
      zero. This inconsistency causes one bug in vhost vsock driver. The
      scenarios is:
      
        0. A hash table (vhost_vsock_hash) is used to map an CID to a vsock
        object. And hash_min() is used to compute the hash key. hash_min() is
        defined as:
        (sizeof(val) <= 4 ? hash_32(val, bits) : hash_long(val, bits)).
        That means the hash algorithm has dependency on the size of macro
        argument 'val'.
        0. In function vhost_vsock_set_cid(), a 64bit CID is passed to
        hash_min() to compute the hash key when inserting a vsock object into
        the hash table.
        0. In function vhost_vsock_get(), a 32bit CID is passed to hash_min()
        to compute the hash key when looking up a vsock for an CID.
      
      Because the different size of the CID, hash_min() returns different hash
      key, thus fails to look up the vsock object for an CID.
      
      To fix this bug, we keep CID as u64 in the IOCTLs and virtio message
      headers, but explicitly convert u64 to u32 when deal with the hash table
      and vsock core.
      
      Fixes: 834e772c ("vhost/vsock: fix use-after-free in network stack callers")
      Link: https://github.com/stefanha/virtio/blob/vsock/trunk/content.texSigned-off-by: NZha Bin <zhabin@linux.alibaba.com>
      Reviewed-by: NLiu Jiang <gerry@linux.alibaba.com>
      Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com>
      Acked-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7fbe078c
  15. 04 1月, 2019 1 次提交
    • L
      Remove 'type' argument from access_ok() function · 96d4f267
      Linus Torvalds 提交于
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96d4f267
  16. 20 12月, 2018 2 次提交
  17. 13 12月, 2018 3 次提交
  18. 07 12月, 2018 2 次提交
    • S
      vhost/vsock: fix use-after-free in network stack callers · 834e772c
      Stefan Hajnoczi 提交于
      If the network stack calls .send_pkt()/.cancel_pkt() during .release(),
      a struct vhost_vsock use-after-free is possible.  This occurs because
      .release() does not wait for other CPUs to stop using struct
      vhost_vsock.
      
      Switch to an RCU-enabled hashtable (indexed by guest CID) so that
      .release() can wait for other CPUs by calling synchronize_rcu().  This
      also eliminates vhost_vsock_lock acquisition in the data path so it
      could have a positive effect on performance.
      
      This is CVE-2018-14625 "kernel: use-after-free Read in vhost_transport_send_pkt".
      
      Cc: stable@vger.kernel.org
      Reported-and-tested-by: syzbot+bd391451452fb0b93039@syzkaller.appspotmail.com
      Reported-by: syzbot+e3e074963495f92a89ed@syzkaller.appspotmail.com
      Reported-by: syzbot+d5a0a170c5069658b141@syzkaller.appspotmail.com
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Acked-by: NJason Wang <jasowang@redhat.com>
      834e772c
    • S
      vhost/vsock: fix reset orphans race with close timeout · c38f57da
      Stefan Hajnoczi 提交于
      If a local process has closed a connected socket and hasn't received a
      RST packet yet, then the socket remains in the table until a timeout
      expires.
      
      When a vhost_vsock instance is released with the timeout still pending,
      the socket is never freed because vhost_vsock has already set the
      SOCK_DONE flag.
      
      Check if the close timer is pending and let it close the socket.  This
      prevents the race which can leak sockets.
      Reported-by: NMaximilian Riemensberger <riemensberger@cadami.net>
      Cc: Graham Whaley <graham.whaley@gmail.com>
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      c38f57da
  19. 04 12月, 2018 1 次提交
  20. 29 11月, 2018 2 次提交
  21. 28 11月, 2018 1 次提交
  22. 18 11月, 2018 1 次提交
    • J
      vhost_net: mitigate page reference counting during page frag refill · e4dab1e6
      Jason Wang 提交于
      We do a get_page() which involves a atomic operation. This patch tries
      to mitigate a per packet atomic operation by maintaining a reference
      bias which is initially USHRT_MAX. Each time a page is got, instead of
      calling get_page() we decrease the bias and when we find it's time to
      use a new page we will decrease the bias at one time through
      __page_cache_drain_cache().
      
      Testpmd(virtio_user + vhost_net) + XDP_DROP on TAP shows about 1.6%
      improvement.
      
      Before: 4.63Mpps
      After:  4.71Mpps
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e4dab1e6
  23. 01 11月, 2018 1 次提交
  24. 25 10月, 2018 4 次提交
  25. 08 10月, 2018 1 次提交
  26. 27 9月, 2018 2 次提交
    • T
      net: vhost: add rx busy polling in tx path · 441abde4
      Tonghao Zhang 提交于
      This patch improves the guest receive performance.
      On the handle_tx side, we poll the sock receive queue at the
      same time. handle_rx do that in the same way.
      
      We set the poll-us=100us and use the netperf to test throughput
      and mean latency. When running the tests, the vhost-net kthread
      of that VM, is alway 100% CPU. The commands are shown as below.
      
      Rx performance is greatly improved by this patch. There is not
      notable performance change on tx with this series though. This
      patch is useful for bi-directional traffic.
      
      netperf -H IP -t TCP_STREAM -l 20 -- -O "THROUGHPUT, THROUGHPUT_UNITS, MEAN_LATENCY"
      
      Topology:
      [Host] ->linux bridge -> tap vhost-net ->[Guest]
      
      TCP_STREAM:
      * Without the patch:  19842.95 Mbps, 6.50 us mean latency
      * With the patch:     37598.20 Mbps, 3.43 us mean latency
      Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      441abde4
    • T
      net: vhost: factor out busy polling logic to vhost_net_busy_poll() · dc151282
      Tonghao Zhang 提交于
      Factor out generic busy polling logic and will be
      used for in tx path in the next patch. And with the patch,
      qemu can set differently the busyloop_timeout for rx queue.
      
      To avoid duplicate codes, introduce the helper functions:
      * sock_has_rx_data(changed from sk_has_rx_data)
      * vhost_net_busy_poll_try_queue
      Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dc151282