1. 19 8月, 2019 11 次提交
    • D
      xfs: fix use-after-free race in xfs_buf_rele · 8475cb1c
      Dave Chinner 提交于
      commit 37fd1678245f7a5898c1b05128bc481fb403c290 upstream.
      
      When looking at a 4.18 based KASAN use after free report, I noticed
      that racing xfs_buf_rele() may race on dropping the last reference
      to the buffer and taking the buffer lock. This was the symptom
      displayed by the KASAN report, but the actual issue that was
      reported had already been fixed in 4.19-rc1 by commit e339dd8d
      ("xfs: use sync buffer I/O for sync delwri queue submission").
      
      Despite this, I think there is still an issue with xfs_buf_rele()
      in this code:
      
              release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
              spin_lock(&bp->b_lock);
              if (!release) {
      .....
      
      If two threads race on the b_lock after both dropping a reference
      and one getting dropping the last reference so release = true, we
      end up with:
      
      CPU 0				CPU 1
      atomic_dec_and_lock()
      				atomic_dec_and_lock()
      				spin_lock(&bp->b_lock)
      spin_lock(&bp->b_lock)
      <spins>
      				<release = true bp->b_lru_ref = 0>
      				<remove from lists>
      				freebuf = true
      				spin_unlock(&bp->b_lock)
      				xfs_buf_free(bp)
      <gets lock, reading and writing freed memory>
      <accesses freed memory>
      spin_unlock(&bp->b_lock) <reads/writes freed memory>
      
      IOWs, we can't safely take bp->b_lock after dropping the hold
      reference because the buffer may go away at any time after we
      drop that reference. However, this can be fixed simply by taking the
      bp->b_lock before we drop the reference.
      
      It is safe to nest the pag_buf_lock inside bp->b_lock as the
      pag_buf_lock is only used to serialise against lookup in
      xfs_buf_find() and no other locks are held over or under the
      pag_buf_lock there. Make this clear by documenting the buffer lock
      orders at the top of the file.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      8475cb1c
    • W
      x86: uaccess: Inhibit speculation past access_ok() in user_access_begin() · 226356df
      Will Deacon 提交于
      commit 6e693b3ffecb0b478c7050b44a4842854154f715 upstream.
      
      Commit 594cc251fdd0 ("make 'user_access_begin()' do 'access_ok()'")
      makes the access_ok() check part of the user_access_begin() preceding a
      series of 'unsafe' accesses.  This has the desirable effect of ensuring
      that all 'unsafe' accesses have been range-checked, without having to
      pick through all of the callsites to verify whether the appropriate
      checking has been made.
      
      However, the consolidated range check does not inhibit speculation, so
      it is still up to the caller to ensure that they are not susceptible to
      any speculative side-channel attacks for user addresses that ultimately
      fail the access_ok() check.
      
      This is an oversight, so use __uaccess_begin_nospec() to ensure that
      speculation is inhibited until the access_ok() check has passed.
      Reported-by: NJulien Thierry <julien.thierry@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      226356df
    • L
      make 'user_access_begin()' do 'access_ok()' · 6342e75a
      Linus Torvalds 提交于
      commit 594cc251fdd0d231d342d88b2fdff4bc42fb0690 upstream.
      
      Originally, the rule used to be that you'd have to do access_ok()
      separately, and then user_access_begin() before actually doing the
      direct (optimized) user access.
      
      But experience has shown that people then decide not to do access_ok()
      at all, and instead rely on it being implied by other operations or
      similar.  Which makes it very hard to verify that the access has
      actually been range-checked.
      
      If you use the unsafe direct user accesses, hardware features (either
      SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged
      Access Never - on ARM) do force you to use user_access_begin().  But
      nothing really forces the range check.
      
      By putting the range check into user_access_begin(), we actually force
      people to do the right thing (tm), and the range check vill be visible
      near the actual accesses.  We have way too long a history of people
      trying to avoid them.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      
      [ Shile: fix following conflicts by adding a dummy arguments ]
      Conflicts:
      	kernel/compat.c
      	kernel/exit.c
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      6342e75a
    • L
      i915: fix missing user_access_end() in page fault exception case · 89b31e43
      Linus Torvalds 提交于
      commit 0b2c8f8b6b0c7530e2866c95862546d0da2057b0 upstream.
      
      When commit fddcd00a49e9 ("drm/i915: Force the slow path after a
      user-write error") unified the error handling for various user access
      problems, it didn't do the user_access_end() that is needed for the
      unsafe_put_user() case.
      
      It's not a huge deal: a missed user_access_end() will only mean that
      SMAP protection isn't active afterwards, and for the error case we'll be
      returning to user mode soon enough anyway.  But it's wrong, and adding
      the proper user_access_end() is trivial enough (and doing it for the
      other error cases where it isn't needed doesn't hurt).
      
      I noticed it while doing the same prep-work for changing
      user_access_begin() that precipitated the access_ok() changes in commit
      96d4f267e40f ("Remove 'type' argument from access_ok() function").
      
      Fixes: fddcd00a49e9 ("drm/i915: Force the slow path after a user-write error")
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: stable@kernel.org # v4.20
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      89b31e43
    • C
      drm/i915: Force the slow path after a user-write error · 7e6c8a93
      Chris Wilson 提交于
      commit fddcd00a49e9122a3579247151e9cb3ce5a1a36e upstream.
      
      If we fail to write the user relocation back when it is changed, force
      ourselves to take the slow relocation path where we can handle faults in
      the write path. There is still an element of dubiousness as having
      patched up the batch to use the correct offset, it no longer matches the
      presumed_offset in the relocation, so a second pass may miss any changes
      in layout.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20180903083337.13134-3-chris@chris-wilson.co.ukSigned-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      7e6c8a93
    • A
      userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults · 8e823df5
      Andrea Arcangeli 提交于
      commit 3b9aadf7278d16d7bed4d5d808501065f70898d8 upstream.
      
      get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called a get_user_pages that would
      not be waiting for userfaults before failing and it would hit on a SIGBUS
      instead.  Using get_user_pages_locked/unlocked instead will allow
      get_mempolicy to allow userfaults to resolve the fault and fill the hole,
      before grabbing the node id of the page.
      
      If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
      address inside an area managed by uffd and there is no page at that
      address, the page allocation from within get_mempolicy() will fail
      because get_user_pages() does not allow for page fault retry required
      for uffd; the user will get SIGBUS.
      
      With this patch, the page fault will be resolved by the uffd and the
      get_mempolicy() will continue normally.
      
      Background:
      
      Via code review, previously the syscall would have returned -EFAULT
      (vm_fault_to_errno), now it will block and wait for an userfault (if
      it's waken before the fault is resolved it'll still -EFAULT).
      
      This way get_mempolicy will give a chance to an "unaware" app to be
      compliant with userfaults.
      
      The reason this visible change is that becoming "userfault compliant"
      cannot regress anything: all other syscalls including read(2)/write(2)
      had to become "userfault compliant" long time ago (that's one of the
      things userfaultfd can do that PROT_NONE and trapping segfaults can't).
      
      So this is just one more syscall that become "userfault compliant" like
      all other major ones already were.
      
      This has been happening on virtio-bridge dpdk process which just called
      get_mempolicy on the guest space post live migration, but before the
      memory had a chance to be migrated to destination.
      
      I didn't run an strace to be able to show the -EFAULT going away, but
      I've the confirmation of the below debug aid information (only visible
      with CONFIG_DEBUG_VM=y) going away with the patch:
      
          [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
          [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
          [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
          [20116.371466] Call Trace:
          [20116.371473]  dump_stack+0x5c/0x80
          [20116.371476]  handle_userfault.cold.37+0x1b/0x22
          [20116.371479]  ? remove_wait_queue+0x20/0x60
          [20116.371481]  ? poll_freewait+0x45/0xa0
          [20116.371483]  ? do_sys_poll+0x31c/0x520
          [20116.371485]  ? radix_tree_lookup_slot+0x1e/0x50
          [20116.371488]  shmem_getpage_gfp+0xce7/0xe50
          [20116.371491]  ? page_add_file_rmap+0x1a/0x2c0
          [20116.371493]  shmem_fault+0x78/0x1e0
          [20116.371495]  ? filemap_map_pages+0x3a1/0x450
          [20116.371498]  __do_fault+0x1f/0xc0
          [20116.371500]  __handle_mm_fault+0xe2e/0x12f0
          [20116.371502]  handle_mm_fault+0xda/0x200
          [20116.371504]  __get_user_pages+0x238/0x790
          [20116.371506]  get_user_pages+0x3e/0x50
          [20116.371510]  kernel_get_mempolicy+0x40b/0x700
          [20116.371512]  ? vfs_write+0x170/0x1a0
          [20116.371515]  __x64_sys_get_mempolicy+0x21/0x30
          [20116.371517]  do_syscall_64+0x5b/0x160
          [20116.371520]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above harmless debug message (not a kernel crash, just a
      dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
      and improve kernel spots that may have to become "userfaultfd
      compliant" like this one (without having to run an strace and search
      for syscall misbehavior).  Spots like the above are more closer to a
      kernel bug for the non-cooperative usages that Mike focuses on, than
      for for dpdk qemu-cooperative usages that reproduced it, but it's still
      nicer to get this fixed for dpdk too.
      
      The part of the patch that caused me to think is only the
      implementation issue of mpol_get, but it looks like it should work safe
      no matter the kind of mempolicy structure that is (the default static
      policy also starts at 1 so it'll go to 2 and back to 1 without crashing
      everything at 0).
      
      [rppt@linux.vnet.ibm.com: changelog addition]
        http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
      Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NMaxime Coquelin <maxime.coquelin@redhat.com>
      Tested-by: NDr. David Alan Gilbert <dgilbert@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      8e823df5
    • X
      random: speed up the initialization of module · e1064533
      Xingjun Liu 提交于
      During the module initialization phase, entropy will be added
      to entropy pool for every interrupt, the change should speed up
      initialization of the random module.
      
      Before optimization:
      [   22.180236] random: crng init done
      
      After optimization:
      [    1.474832] random: crng init done
      Signed-off-by: NXingjun Liu <xingjun.lxj@alibaba-inc.com>
      Reviewed-by: NLiu Jiang <gerry@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: jia zhang's avatarJia Zhang <zhang.jia@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      e1064533
    • X
      random: introduce the initialization seed · 8c233ea3
      Xingjun Liu 提交于
      Add random entropy with the module parameter as the initialization
      seed when the kernel startup.
      
      For guest OS working in VM, the random entropy will be less,
      it cause the random module to initialize very slowly, and if
      the application which running in guest os gets a certain amount of
      random numbers in the initialization phase, it will be blocked.
      
      This patch allows the VMM to provide a certain amount of random seed
      when starting guest OS, speeding up the initialization of the entire
      guest OS random module.
      
      Before optimization:
      [   22.180236] random: crng init done
      
      After optimization:
      [    1.553362] random: crng init done
      Signed-off-by: NXingjun Liu <xingjun.lxj@alibaba-inc.com>
      Reviewed-by: NLiu Jiang <gerry@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: jia zhang's avatarJia Zhang <zhang.jia@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      8c233ea3
    • B
      cpufreq/intel_pstate: Load only on Intel hardware · 748fc02e
      Borislav Petkov 提交于
      commit 4ab526468344c11d2d1807ae95feb1f5305dc014 upstream.
      
      This driver is Intel-only so loading on anything which is not Intel is
      pointless. Prevent it from doing so.
      
      While at it, correct the "not supported" print statement to say CPU
      "model" which is what that test does.
      
      Fixes: 076b862c7e44 (cpufreq: intel_pstate: Add reasons for failure and debug messages)
      Suggested-by: NErwan Velu <e.velu@criteo.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NThomas Renninger <trenn@suse.de>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      748fc02e
    • E
      cpufreq: intel_pstate: Add reasons for failure and debug messages · 3bc2474e
      Erwan Velu 提交于
      commit 076b862c7e4409d2dcacfda19f7eaf8d07ab9200 upstream.
      
      The init code path has several exceptions where the driver can
      decide not to load.
      
      As CONFIG_X86_INTEL_PSTATE is generally set to Y, the return code is
      not reachable.  The initialization code is neither verbose of the
      reason why it did choose to prematurely exit, so it is difficult for
      a user to determine, on a given platform, why the driver didn't load
      properly.
      
      This patch is about reporting to the user the reason/context of why
      the driver failed to load.  That is a precious hint when debugging
      a platform.
      Signed-off-by: NErwan Velu <e.velu@criteo.com>
      [ rjw: Subject & changelog, minor fixups ]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      3bc2474e
    • S
      cpufreq: intel_pstate: Force HWP min perf before offline · 5602eeb8
      Srinivas Pandruvada 提交于
      commit af3b7379e2d709f2d7c6966b8a6f5ec6bd134241 upstream.
      
      Force HWP Request MAX = HWP Request MIN = HWP Capability MIN and EPP to
      0xFF. In this way the performance limits on the offlined CPU will not
      influence performance limits on its sibling CPU, which is still online.
      
      If the sibling CPU is calling for higher performance, it will impact the
      max core performance. Here core performance will follow higher of the
      performance requests from each sibling.
      Reported-and-tested-by: NChen Yu <yu.c.chen@intel.com>
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      5602eeb8
  2. 17 8月, 2019 29 次提交