  1. 26 Sep, 2018: 1 commit
  2. 23 Aug, 2018: 1 commit
      mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7
      Michal Hocko committed
      There are several blockable mmu notifiers which might sleep in
      mmu_notifier_invalidate_range_start, and that is a problem for the
      oom_reaper because it needs to guarantee forward progress, so it cannot
      depend on any sleepable locks.
      
      Currently we simply back off and mark an oom victim with blockable mmu
      notifiers as done after a short sleep.  That can result in selecting a new
      oom victim prematurely because the previous one still hasn't torn its
      memory down yet.
      
      We can do much better though.  Even if mmu notifiers use sleepable locks,
      there is no reason to automatically assume those locks are held.  Moreover,
      the majority of notifiers only care about a portion of the address space,
      and there is absolutely zero reason to fail when we are unmapping an
      unrelated range.  Many notifiers really do block and wait for HW, which is
      harder to handle, and in that case we have to bail out.
      
      This patch handles the low hanging fruit.
      __mmu_notifier_invalidate_range_start gets a blockable flag, and callbacks
      are not allowed to sleep if the flag is set to false.  This is achieved by
      using trylock instead of the sleepable lock for most callbacks and
      continuing as long as we do not block down the call chain.
      
      I think we can improve that even further because there is a common pattern
      to do a range lookup first and then do something about that.  The first
      part can be done without a sleeping lock in most cases AFAICS.
      
      The oom_reaper end then simply retries if there is at least one notifier
      which couldn't make any progress in !blockable mode.  A retry loop is
      already implemented to wait for the mmap_sem and this is basically the
      same thing.
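
      A userspace sketch of the pattern described above, assuming POSIX threads
      as a stand-in for the kernel's sleepable locks: the callback ignores ranges
      it does not track, uses pthread_mutex_trylock() in !blockable mode and
      reports -EAGAIN instead of sleeping, and a reaper-style caller retries a
      bounded number of times. The names fake_notifier,
      fake_invalidate_range_start and fake_reap are hypothetical and are not the
      kernel's mmu notifier API.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical notifier: a driver lock plus the address range it tracks. */
struct fake_notifier {
    pthread_mutex_t lock;   /* stands in for a sleepable kernel lock */
    unsigned long start;    /* tracked range is [start, end) */
    unsigned long end;
};

/*
 * Returns 0 when the range was handled (or is unrelated), -EAGAIN when the
 * callback would have to sleep but the caller asked for !blockable mode.
 */
static int fake_invalidate_range_start(struct fake_notifier *n,
                                       unsigned long start, unsigned long end,
                                       bool blockable)
{
    /* Unrelated range: no reason to fail, even in !blockable mode. */
    if (end <= n->start || start >= n->end)
        return 0;

    if (blockable)
        pthread_mutex_lock(&n->lock);            /* may sleep */
    else if (pthread_mutex_trylock(&n->lock) != 0)
        return -EAGAIN;                          /* held: back off, do not sleep */

    /* ... invalidate the overlapping part of the range here ... */

    pthread_mutex_unlock(&n->lock);
    return 0;
}

/* Reaper-style caller: retry a bounded number of times in !blockable mode. */
static bool fake_reap(struct fake_notifier *n, unsigned long start,
                      unsigned long end, int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (fake_invalidate_range_start(n, start, end, false) == 0)
            return true;
        /* a real reaper would wait briefly here before retrying */
    }
    return false;
}
```

      The bounded loop only illustrates the "retry instead of giving up" idea;
      the sketch leaves out the wait between attempts and what happens once the
      retries are exhausted.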
      
      The simplest way for driver developers to test this code path is to wrap
      userspace code which uses these notifiers into a memcg and set the hard
      limit to hit the oom.  This can be done, e.g., after the test faults in all
      the mmu-notifier-managed memory, by setting the hard limit to something
      really small.  Then we are looking for a proper process teardown.
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: minor code simplification]
      Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
      Reported-by: David Rientjes <rientjes@google.com>
      Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Cc: Sudeep Dutt <sudeep.dutt@intel.com>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 15 Aug, 2018: 1 commit
  4. 13 Aug, 2018: 1 commit
  5. 11 Aug, 2018: 1 commit
  6. 08 Aug, 2018: 1 commit
      RDMA/mlx5: Fix shift overflow in mlx5_ib_create_wq · 0dfe4522
      Leon Romanovsky committed
      [   61.182439] UBSAN: Undefined behaviour in drivers/infiniband/hw/mlx5/qp.c:5366:34
      [   61.183673] shift exponent 4294967288 is too large for 32-bit type 'unsigned int'
      [   61.185530] CPU: 0 PID: 639 Comm: qp Not tainted 4.18.0-rc1-00037-g4aa1d69a9c60-dirty #96
      [   61.186981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
      [   61.188315] Call Trace:
      [   61.188661]  dump_stack+0xc7/0x13b
      [   61.190427]  ubsan_epilogue+0x9/0x49
      [   61.190899]  __ubsan_handle_shift_out_of_bounds+0x1ea/0x22f
      [   61.197040]  mlx5_ib_create_wq+0x1c99/0x1d50
      [   61.206632]  ib_uverbs_ex_create_wq+0x499/0x820
      [   61.213892]  ib_uverbs_write+0x77e/0xae0
      [   61.248018]  vfs_write+0x121/0x3b0
      [   61.249831]  ksys_write+0xa1/0x120
      [   61.254024]  do_syscall_64+0x7c/0x2a0
      [   61.256178]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   61.259211] RIP: 0033:0x7f54bab70e99
      [   61.262125] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89
      [   61.268678] RSP: 002b:00007ffe1541c318 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [   61.271076] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54bab70e99
      [   61.273795] RDX: 0000000000000070 RSI: 0000000020000240 RDI: 0000000000000003
      [   61.276982] RBP: 00007ffe1541c330 R08: 00000000200078e0 R09: 0000000000000002
      [   61.280035] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004005c0
      [   61.283279] R13: 00007ffe1541c420 R14: 0000000000000000 R15: 0000000000000000
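
      The reported exponent 4294967288 is 2^32 - 8, i.e. the value -8 wrapped
      around in an unsigned 32-bit variable. The message does not show the fix
      itself, so the following is only a generic illustration, not the mlx5
      change: a hypothetical helper checked_shl_u32() that rejects an
      out-of-range exponent before shifting, instead of letting the shift
      invoke undefined behaviour.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: store base << shift in *out only if it is well defined. */
static int checked_shl_u32(uint32_t base, uint32_t shift, uint32_t *out)
{
    /* Shifting a 32-bit value by 32 or more is undefined behaviour in C;
     * a wrapped negative exponent such as (uint32_t)-8 is rejected here. */
    if (shift >= 32)
        return -EINVAL;
    /* Reject values whose set bits would be shifted out of the result. */
    if (base > (UINT32_MAX >> shift))
        return -EINVAL;
    *out = base << shift;
    return 0;
}
```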
      
      Cc: <stable@vger.kernel.org> # 4.7
      Fixes: 79b20a6c ("IB/mlx5: Add receive Work Queue verbs")
      Cc: syzkaller <syzkaller@googlegroups.com>
      Reported-by: Noa Osherovich <noaos@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  7. 03 Aug, 2018: 1 commit
      RDMA/netdev: Use priv_destructor for netdev cleanup · 9f49a5b5
      Jason Gunthorpe committed
      Now that the unregister_netdev flow for IPoIB no longer relies on external
      code, we can introduce the use of priv_destructor and needs_free_netdev.
      
      The rdma_netdev flow is switched to use the common netdev priv_destructor
      instead of the special free_rdma_netdev, and the IPoIB ULP is adjusted:
       - priv_destructor needs to switch to point to the ULP's destructor
         which will then call the rdma_ndev's in the right order
       - We need to be careful around the error unwind of register_netdev
         as it sometimes calls priv_destructor on failure
       - ULPs need to use ndo_init/uninit to ensure proper ordering
         of failures around register_netdev
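
      The destructor chaining in the first bullet can be sketched with plain
      function pointers. netdev_sketch, rdma_netdev_destructor, ulp_destructor
      and ulp_take_over_destructor are hypothetical stand-ins, not struct
      net_device or the IPoIB code; the trace array exists only to record the
      teardown order.

```c
#include <stddef.h>

/* Hypothetical, stripped-down netdev carrying only the destructor hooks. */
struct netdev_sketch {
    void (*priv_destructor)(struct netdev_sketch *dev);
    void (*lower_destructor)(struct netdev_sketch *dev); /* saved rdma_netdev one */
    char trace[8];   /* records teardown order for demonstration */
    size_t n;
};

/* Lower layer: frees its own state last. */
static void rdma_netdev_destructor(struct netdev_sketch *dev)
{
    dev->trace[dev->n++] = 'R';
}

/* ULP layer: tears down its own state first, then hands over. */
static void ulp_destructor(struct netdev_sketch *dev)
{
    dev->trace[dev->n++] = 'U';
    dev->lower_destructor(dev);
}

/* ULP setup: interpose its destructor in front of the rdma_netdev one. */
static void ulp_take_over_destructor(struct netdev_sketch *dev)
{
    dev->lower_destructor = dev->priv_destructor;
    dev->priv_destructor = ulp_destructor;
}
```

      Saving the previous pointer before overwriting it keeps a single
      priv_destructor entry point while still guaranteeing that the ULP runs
      before the lower layer.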
      
      Switching to priv_destructor is a necessary prerequisite for using
      the rtnl new_link mechanism.
      
      The VNIC user for rdma_netdev should also be revised, but that is left for
      another patch.

      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Denis Drozdov <denisd@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
  8. 02 Aug, 2018: 1 commit
      IB/uverbs: Do not pass struct ib_device to the ioctl methods · e83f0ecd
      Jason Gunthorpe committed
      This does the same as the patch before, except for ioctl. The rules are
      the same, but for the ioctl methods the core code handles setting up the
      uobject.
      
      - Retrieve the ib_dev from the uobject->context->device. This is
        safe under ioctl as the core has already done rdma_alloc_begin_uobject
        and so CREATE calls are entirely protected by the rwsem.
      - Retrieve the ib_dev from uobject->object
      - Call ib_uverbs_get_ucontext()

      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  9. 01 Aug, 2018: 1 commit
  10. 31 Jul, 2018: 4 commits
  11. 27 Jul, 2018: 1 commit
  12. 26 Jul, 2018: 2 commits
      IB/uverbs: Fix locking around struct ib_uverbs_file ucontext · 22fa27fb
      Jason Gunthorpe committed
      We have a parallel unlocked reader and writer with ib_uverbs_get_context()
      vs everything else, and nothing guarantees this works properly.
      
      Audit and fix all of the places that access ucontext to use one of the
      following locking schemes:
      - Call ib_uverbs_get_ucontext() under SRCU and check for failure
      - Access the ucontext through a struct ib_uobject context member
        while holding a READ or WRITE lock on the uobject.
        This value cannot be NULL and has no race.
      - Hold the ucontext_lock and check that ufile->ucontext is not NULL
      
      This also re-implements ib_uverbs_get_ucontext() in a way that is safe
      against concurrent ib_uverbs_get_context() and disassociation.
      
      As a side effect, every access to ucontext in the commands is via
      ib_uverbs_get_context() with an error check, or via the uobject, so there
      is no longer any need for the core code to check ucontext on every command
      call. These checks are also removed.
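
      A userspace analogue of the "publish once, readers check for NULL" idea
      behind the first and third schemes, assuming C11 atomics and a pthread
      mutex in place of the kernel's ucontext_lock and its release/acquire
      barriers. SRCU-protected disassociation is not modelled. The names
      ufile_sketch, publish_ucontext and get_ucontext are hypothetical, and
      returning -EINVAL on a second publish is a choice of this sketch.

```c
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct ucontext_sketch { int id; };

/* Hypothetical stand-in for the per-file state. */
struct ufile_sketch {
    pthread_mutex_t ucontext_lock;               /* serializes writers */
    _Atomic(struct ucontext_sketch *) ucontext;  /* NULL until published */
};

/* Writer: publish the context at most once, under the lock. */
static int publish_ucontext(struct ufile_sketch *f, struct ucontext_sketch *ctx)
{
    int ret = 0;

    pthread_mutex_lock(&f->ucontext_lock);
    if (atomic_load_explicit(&f->ucontext, memory_order_relaxed) != NULL)
        ret = -EINVAL;   /* already created */
    else
        /* release: a reader that sees ctx also sees its initialized fields */
        atomic_store_explicit(&f->ucontext, ctx, memory_order_release);
    pthread_mutex_unlock(&f->ucontext_lock);
    return ret;
}

/* Reader: lock-free, but every caller must handle the NULL case. */
static struct ucontext_sketch *get_ucontext(struct ufile_sketch *f)
{
    return atomic_load_explicit(&f->ucontext, memory_order_acquire);
}
```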

      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      IB/mlx5: Use the ucontext from the uobj, not the file · c36ee46d
      Jason Gunthorpe committed
      This approach matches the standard flow of the typical write method that
      relies on the HW object to store the device and the uobject to access the
      ucontext.  Avoiding the use of devx_ufile2uctx in several places will
      make revising the semantics of ib_uverbs_get_ucontext() in the next patch
      simpler.

      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  13. 25 Jul, 2018: 5 commits
  14. 24 Jul, 2018: 1 commit
  15. 19 Jul, 2018: 1 commit
  16. 14 Jul, 2018: 2 commits
      RDMA/mlx5: Check that supplied blue flame index doesn't overflow · 05f58ceb
      Leon Romanovsky committed
      The user-supplied index is checked against the total number of system
      pages, but this number already includes num_static_sys_pages, so adding
      that value to the supplied index causes the error below when accessing
      sys_pages[].
      
      BUG: KASAN: slab-out-of-bounds in bfregn_to_uar_index+0x34f/0x400
      Read of size 4 at addr ffff880065561904 by task syz-executor446/314
      
      CPU: 0 PID: 314 Comm: syz-executor446 Not tainted 4.18.0-rc1+ #256
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
      Call Trace:
       dump_stack+0xef/0x17e
       print_address_description+0x83/0x3b0
       kasan_report+0x18d/0x4d0
       bfregn_to_uar_index+0x34f/0x400
       create_user_qp+0x272/0x227d
       create_qp_common+0x32eb/0x43e0
       mlx5_ib_create_qp+0x379/0x1ca0
       create_qp.isra.5+0xc94/0x22d0
       ib_uverbs_create_qp+0x21b/0x2a0
       ib_uverbs_write+0xc2c/0x1010
       vfs_write+0x1b0/0x550
       ksys_write+0xc6/0x1a0
       do_syscall_64+0xa7/0x590
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x433679
      Code: fd ff 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b 91 fd ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fff2b3d8e48 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00000000004002f8 RCX: 0000000000433679
      RDX: 0000000000000040 RSI: 0000000020000240 RDI: 0000000000000003
      RBP: 00000000006d4018 R08: 00000000004002f8 R09: 00000000004002f8
      R10: 00000000004002f8 R11: 0000000000000217 R12: 0000000000000000
      R13: 000000000040cb00 R14: 000000000040cb90 R15: 0000000000000006
      
      Allocated by task 314:
       kasan_kmalloc+0xa0/0xd0
       __kmalloc+0x1a9/0x510
       mlx5_ib_alloc_ucontext+0x966/0x2620
       ib_uverbs_get_context+0x23f/0xa60
       ib_uverbs_write+0xc2c/0x1010
       __vfs_write+0x10d/0x720
       vfs_write+0x1b0/0x550
       ksys_write+0xc6/0x1a0
       do_syscall_64+0xa7/0x590
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 1:
       __kasan_slab_free+0x12e/0x180
       kfree+0x159/0x630
       kvfree+0x37/0x50
       single_release+0x8e/0xf0
       __fput+0x2d8/0x900
       task_work_run+0x102/0x1f0
       exit_to_usermode_loop+0x159/0x1c0
       do_syscall_64+0x408/0x590
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff880065561100
       which belongs to the cache kmalloc-4096 of size 4096
      The buggy address is located 2052 bytes inside of
       4096-byte region [ffff880065561100, ffff880065562100)
      The buggy address belongs to the page:
      page:ffffea0001955800 count:1 mapcount:0 mapping:ffff88006c402480 index:0x0 compound_mapcount: 0
      flags: 0x4000000000008100(slab|head)
      raw: 4000000000008100 ffffea0001a7c000 0000000200000002 ffff88006c402480
      raw: 0000000000000000 0000000080070007 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff880065561800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff880065561880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      >ffff880065561900: 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                         ^
       ffff880065561980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff880065561a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
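
      A simplified sketch of a bounds check that accounts for the static offset
      described at the top of this message, assuming sys_pages[] holds
      num_sys_pages entries of which the first num_static_sys_pages are static.
      The struct bfreg_info_sketch and dyn_index_to_uar() are hypothetical, not
      the mlx5 driver's structures or the actual fix.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified bookkeeping; not the mlx5 driver's structures. */
struct bfreg_info_sketch {
    uint32_t num_sys_pages;         /* total entries, static pages included */
    uint32_t num_static_sys_pages;  /* static pages occupy the first slots */
    const uint32_t *sys_pages;      /* num_sys_pages UAR indexes */
};

/* Translate a user-supplied dynamic page index into a UAR index. */
static int dyn_index_to_uar(const struct bfreg_info_sketch *b,
                            uint32_t user_index, uint32_t *uar)
{
    if (b->num_static_sys_pages > b->num_sys_pages)
        return -EINVAL;             /* inconsistent bookkeeping */
    /*
     * Only num_sys_pages - num_static_sys_pages dynamic slots exist.
     * Comparing user_index against num_sys_pages alone would accept
     * indexes that run past the array once the static offset is added.
     * Subtracting on this side also avoids overflowing the addition.
     */
    if (user_index >= b->num_sys_pages - b->num_static_sys_pages)
        return -EINVAL;
    *uar = b->sys_pages[user_index + b->num_static_sys_pages];
    return 0;
}
```

      With 4 system pages of which 2 are static, only user indexes 0 and 1 are
      valid; a check against the total alone would also have let 2 and 3 through.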
      
      Cc: <stable@vger.kernel.org> # 4.15
      Fixes: 1ee47ab3 ("IB/mlx5: Enable QP creation with a given blue flame index")
      Reported-by: Noa Osherovich <noaos@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      RDMA/mlx5: Melt consecutive calls to alloc_bfreg() in one call · ffaf58de
      Leon Romanovsky committed
      There is no need for three consecutive calls to alloc_bfreg(). It can be
      implemented with one function.

      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  17. 12 Jul, 2018: 1 commit
  18. 11 Jul, 2018: 3 commits
  19. 10 Jul, 2018: 3 commits
  20. 05 Jul, 2018: 8 commits