1. 06 Sep, 2018 3 commits
  2. 05 Sep, 2018 1 commit
  3. 23 Aug, 2018 1 commit
    • mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7
      Committed by Michal Hocko
      There are several blockable mmu notifiers which might sleep in
      mmu_notifier_invalidate_range_start and that is a problem for the
      oom_reaper because it needs to guarantee a forward progress so it cannot
      depend on any sleepable locks.
      
      Currently we simply back off and mark an oom victim with blockable mmu
      notifiers as done after a short sleep.  That can result in selecting a new
      oom victim prematurely because the previous one still hasn't torn its
      memory down yet.
      
      We can do much better though.  Even if mmu notifiers use sleepable locks
      there is no reason to automatically assume those locks are held.  Moreover,
      the majority of notifiers only care about a portion of the address space,
      and there is absolutely zero reason to fail when we are unmapping an
      unrelated range.  Many notifiers really do block and wait for HW, though;
      that is harder to handle, and in those cases we have to bail out.
      
      This patch handles the low hanging fruit.
      __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
      are not allowed to sleep if the flag is set to false.  This is achieved by
      using trylock instead of the sleepable lock for most callbacks and
      continuing as long as we do not block down the call chain.
      
      I think we can improve that even further because there is a common pattern
      to do a range lookup first and then do something about that.  The first
      part can be done without a sleeping lock in most cases AFAICS.
      
      The oom_reaper end then simply retries if there is at least one notifier
      which couldn't make any progress in !blockable mode.  A retry loop is
      already implemented to wait for the mmap_sem and this is basically the
      same thing.
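
The trylock-and-retry pattern described above can be illustrated with a small userspace sketch. This is not the kernel code from the patch: it uses a pthread mutex in place of a kernel sleepable lock, and every name in it (demo_notifier, demo_invalidate_range_start, demo_reap) is hypothetical. It assumes the tracked range never changes, so the overlap check can run without taking the lock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical notifier state: one lock guarding one tracked range. */
struct demo_notifier {
    pthread_mutex_t lock;
    unsigned long start, end;   /* tracked range [start, end), assumed immutable */
};

/*
 * Returns 0 on success, or -EAGAIN when the work would have to wait for
 * the lock while blockable is false.
 */
int demo_invalidate_range_start(struct demo_notifier *n,
                                unsigned long start, unsigned long end,
                                bool blockable)
{
    assert(start < end);

    /* An unrelated range needs no lock, so it cannot fail. */
    if (end <= n->start || start >= n->end)
        return 0;

    if (blockable) {
        pthread_mutex_lock(&n->lock);            /* may sleep */
    } else if (pthread_mutex_trylock(&n->lock) != 0) {
        return -EAGAIN;                          /* would block: back off */
    }

    /* ... tear down the overlapping part of the range here ... */

    pthread_mutex_unlock(&n->lock);
    return 0;
}

/*
 * Caller side, loosely modelled on the retry described for the oom_reaper:
 * call in non-blockable mode and retry a bounded number of times.
 */
bool demo_reap(struct demo_notifier *n, unsigned long start,
               unsigned long end, int max_attempts)
{
    for (int i = 0; i < max_attempts; i++) {
        if (demo_invalidate_range_start(n, start, end, false) == 0)
            return true;
        /* A real reaper would sleep briefly here before retrying. */
    }
    return false;
}
```

In this sketch a caller that passes blockable = false never sleeps on the lock: it either makes progress or gets -EAGAIN and decides for itself when to try again.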
      
      The simplest way for driver developers to test this code path is to wrap
      userspace code which uses these notifiers into a memcg and set the hard
      limit to hit the oom.  This can be done, e.g., by setting the hard limit
      to something really small after the test has faulted in all the mmu
      notifier managed memory.  Then we are looking for a proper process tear
      down.
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: minor code simplification]
      Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
      Reported-by: David Rientjes <rientjes@google.com>
      Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Cc: Sudeep Dutt <sudeep.dutt@intel.com>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 21 Aug, 2018 1 commit
    • IB/hfi1: Invalid NUMA node information can cause a divide by zero · c513de49
      Committed by Michael J. Ruhl
      If the system BIOS does not supply NUMA node information to the
      PCI devices, the NUMA node is selected by choosing the current
      node.
      
      This can lead to the following crash:
      
      divide error: 0000 SMP
      CPU: 0 PID: 4 Comm: kworker/0:0 Tainted: G          IOE
      ------------   3.10.0-693.21.1.el7.x86_64 #1
      Hardware name: Intel Corporation S2600KP/S2600KP, BIOS
      SE5C610.86B.01.01.0005.101720141054 10/17/2014
      Workqueue: events work_for_cpu_fn
      task: ffff880174480fd0 ti: ffff880174488000 task.ti: ffff880174488000
      RIP: 0010: [<ffffffffc020ac69>] hfi1_dev_affinity_init+0x129/0x6a0 [hfi1]
      RSP: 0018:ffff88017448bbf8  EFLAGS: 00010246
      RAX: 0000000000000011 RBX: ffff88107ffba6c0 RCX: ffff88085c22e130
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880824ad0000
      RBP: ffff88017448bc48 R08: 0000000000000011 R09: 0000000000000002
      R10: ffff8808582b6ca0 R11: 0000000000003151 R12: ffff8808582b6ca0
      R13: ffff8808582b6518 R14: ffff8808582b6010 R15: 0000000000000012
      FS:  0000000000000000(0000) GS:ffff88085ec00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007efc707404f0 CR3: 0000000001a02000 CR4: 00000000001607f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Call Trace:
       hfi1_init_dd+0x14b3/0x27a0 [hfi1]
       ? pcie_capability_write_word+0x46/0x70
       ? hfi1_pcie_init+0xc0/0x200 [hfi1]
       do_init_one+0x153/0x4c0 [hfi1]
       ? sched_clock_cpu+0x85/0xc0
       init_one+0x1b5/0x260 [hfi1]
       local_pci_probe+0x4a/0xb0
       work_for_cpu_fn+0x1a/0x30
       process_one_work+0x17f/0x440
       worker_thread+0x278/0x3c0
       ? manage_workers.isra.24+0x2a0/0x2a0
       kthread+0xd1/0xe0
       ? insert_kthread_work+0x40/0x40
       ret_from_fork+0x77/0xb0
       ? insert_kthread_work+0x40/0x40
      
      If the BIOS is not supplying NUMA information:
        - set the default table count to 1 for all possible nodes
        - select node 0 (instead of the current NUMA node) to get consistent
          performance
        - generate an error indicating that the BIOS should be upgraded
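
The three steps listed above can be sketched in plain C. This is an illustration only, not the hfi1 driver code: the struct, the function names, and the idea that the crashing division uses a per-node device count as its divisor are assumptions made for the example. DEMO_NO_NODE follows the kernel convention that NUMA_NO_NODE is -1.

```c
#include <assert.h>
#include <stdio.h>

#define DEMO_MAX_NODES 8
#define DEMO_NO_NODE   (-1)   /* same value as the kernel's NUMA_NO_NODE */

/* Hypothetical per-node table: how many devices share each NUMA node. */
struct demo_affinity {
    int num_nodes;
    unsigned int devs_per_node[DEMO_MAX_NODES];
};

/* Step 1: default every possible node to 1 so later divisions are safe. */
void demo_affinity_init(struct demo_affinity *a, int num_nodes)
{
    assert(num_nodes > 0 && num_nodes <= DEMO_MAX_NODES);
    a->num_nodes = num_nodes;
    for (int i = 0; i < num_nodes; i++)
        a->devs_per_node[i] = 1;
}

/* Steps 2 and 3: fall back to node 0 and report the missing information. */
int demo_pick_node(const struct demo_affinity *a, int reported_node)
{
    if (reported_node < 0 || reported_node >= a->num_nodes) {
        fprintf(stderr,
                "no NUMA node reported, using node 0; consider a BIOS upgrade\n");
        return 0;
    }
    return reported_node;
}

/* The divisor is at least 1 after demo_affinity_init(), so this cannot trap. */
unsigned int demo_cpus_per_dev(const struct demo_affinity *a,
                               int reported_node, unsigned int cpus_on_node)
{
    int node = demo_pick_node(a, reported_node);
    return cpus_on_node / a->devs_per_node[node];
}
```

Choosing a fixed node rather than whichever node the probing CPU happens to be on keeps the result the same from one boot to the next, which is the "consistent performance" point in the list.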
      Reviewed-by: Gary Leshner <gary.s.leshner@intel.com>
      Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  5. 16 Aug, 2018 1 commit
  6. 15 Aug, 2018 4 commits
  7. 13 Aug, 2018 1 commit
  8. 11 Aug, 2018 1 commit
  9. 08 Aug, 2018 2 commits
    • iw_cxgb4: pass window scale in flowc work request · 2e51e45c
      Committed by Potnuri Bharat Teja
      Pass the negotiated TCP window scale to FW in the FLOWC WR.  This allows
      FW to avoid sending more data to TP (which would then need to be buffered).
      
      Also refactor send_flowc() a bit to clean it up.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Fix shift overflow in mlx5_ib_create_wq · 0dfe4522
      Committed by Leon Romanovsky
      [   61.182439] UBSAN: Undefined behaviour in drivers/infiniband/hw/mlx5/qp.c:5366:34
      [   61.183673] shift exponent 4294967288 is too large for 32-bit type 'unsigned int'
      [   61.185530] CPU: 0 PID: 639 Comm: qp Not tainted 4.18.0-rc1-00037-g4aa1d69a9c60-dirty #96
      [   61.186981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
      [   61.188315] Call Trace:
      [   61.188661]  dump_stack+0xc7/0x13b
      [   61.190427]  ubsan_epilogue+0x9/0x49
      [   61.190899]  __ubsan_handle_shift_out_of_bounds+0x1ea/0x22f
      [   61.197040]  mlx5_ib_create_wq+0x1c99/0x1d50
      [   61.206632]  ib_uverbs_ex_create_wq+0x499/0x820
      [   61.213892]  ib_uverbs_write+0x77e/0xae0
      [   61.248018]  vfs_write+0x121/0x3b0
      [   61.249831]  ksys_write+0xa1/0x120
      [   61.254024]  do_syscall_64+0x7c/0x2a0
      [   61.256178]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   61.259211] RIP: 0033:0x7f54bab70e99
      [   61.262125] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89
      [   61.268678] RSP: 002b:00007ffe1541c318 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [   61.271076] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54bab70e99
      [   61.273795] RDX: 0000000000000070 RSI: 0000000020000240 RDI: 0000000000000003
      [   61.276982] RBP: 00007ffe1541c330 R08: 00000000200078e0 R09: 0000000000000002
      [   61.280035] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004005c0
      [   61.283279] R13: 00007ffe1541c420 R14: 0000000000000000 R15: 0000000000000000
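
The UBSAN report above shows a shift exponent of 4294967288 applied to a 32-bit unsigned int, where the largest valid exponent is 31. The sketch below shows one generic way to guard such a shift when the exponent comes from an untrusted source. It is not the actual mlx5 fix: demo_checked_shl_u32 is a hypothetical helper, and it validates only the exponent, not whether significant bits of the value are shifted out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Shift value left by shift bits only when the exponent is valid for a
 * 32-bit type. Returns false for an out-of-range exponent, which a kernel
 * caller would typically turn into -EINVAL.
 */
bool demo_checked_shl_u32(uint32_t value, uint32_t shift, uint32_t *out)
{
    assert(out != NULL);

    /* Shifting by >= the width of the type is undefined behaviour in C. */
    if (shift >= sizeof(value) * CHAR_BIT)
        return false;

    *out = value << shift;
    return true;
}
```

With this helper, an exponent such as the 4294967288 from the report is rejected before any shift is evaluated, while exponents from 0 to 31 go through unchanged.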
      
      Cc: <stable@vger.kernel.org> # 4.7
      Fixes: 79b20a6c ("IB/mlx5: Add receive Work Queue verbs")
      Cc: syzkaller <syzkaller@googlegroups.com>
      Reported-by: Noa Osherovich <noaos@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  10. 03 Aug, 2018 5 commits
  11. 02 Aug, 2018 1 commit
    • IB/uverbs: Do not pass struct ib_device to the ioctl methods · e83f0ecd
      Committed by Jason Gunthorpe
      This does the same as the patch before, except for ioctl. The rules are
      the same, but for the ioctl methods the core code handles setting up the
      uobject.
      
      - Retrieve the ib_dev from the uobject->context->device. This is
        safe under ioctl as the core has already done rdma_alloc_begin_uobject
        and so CREATE calls are entirely protected by the rwsem.
      - Retrieve the ib_dev from uobject->object
      - Call ib_uverbs_get_ucontext()
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  12. 01 Aug, 2018 4 commits
  13. 31 Jul, 2018 12 commits
  14. 27 Jul, 2018 3 commits