1. 07 9月, 2018 1 次提交
  2. 06 9月, 2018 5 次提交
  3. 23 8月, 2018 1 次提交
    • M
      mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7
      Michal Hocko 提交于
      There are several blockable mmu notifiers which might sleep in
      mmu_notifier_invalidate_range_start and that is a problem for the
      oom_reaper because it needs to guarantee a forward progress so it cannot
      depend on any sleepable locks.
      
      Currently we simply back off and mark an oom victim with blockable mmu
      notifiers as done after a short sleep.  That can result in selecting a new
      oom victim prematurely because the previous one still hasn't torn its
      memory down yet.
      
      We can do much better though.  Even if mmu notifiers use sleepable locks
      there is no reason to automatically assume those locks are held.  Moreover
      majority of notifiers only care about a portion of the address space and
      there is absolutely zero reason to fail when we are unmapping an unrelated
      range.  Many notifiers do really block and wait for HW which is harder to
      handle and we have to bail out though.
      
      This patch handles the low hanging fruit.
      __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
      are not allowed to sleep if the flag is set to false.  This is achieved by
      using trylock instead of the sleepable lock for most callbacks and
      continue as long as we do not block down the call chain.
      
      I think we can improve that even further because there is a common pattern
      to do a range lookup first and then do something about that.  The first
      part can be done without a sleeping lock in most cases AFAICS.
      
      The oom_reaper end then simply retries if there is at least one notifier
      which couldn't make any progress in !blockable mode.  A retry loop is
      already implemented to wait for the mmap_sem and this is basically the
      same thing.
      
      The simplest way for driver developers to test this code path is to wrap
      userspace code which uses these notifiers into a memcg and set the hard
      limit to hit the oom.  This can be done e.g.  after the test faults in all
      the mmu notifier managed memory and set the hard limit to something really
      small.  Then we are looking for a proper process tear down.
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: minor code simplification]
      Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
      Reported-by: NDavid Rientjes <rientjes@google.com>
      Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Cc: Sudeep Dutt <sudeep.dutt@intel.com>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      93065ac7
  4. 18 8月, 2018 1 次提交
  5. 17 8月, 2018 1 次提交
  6. 13 8月, 2018 3 次提交
    • J
      IB/uverbs: Remove struct uverbs_root_spec and all supporting code · 51d0a2b4
      Jason Gunthorpe 提交于
      Everything now uses the uverbs_uapi data structure.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      51d0a2b4
    • J
      IB/uverbs: Use uverbs_api to unmarshal ioctl commands · 3a863577
      Jason Gunthorpe 提交于
      Convert the ioctl method syscall path to use the uverbs_api data
      structures. The new uapi structure includes all the same information, just
      in a different and more optimal way.
      
       - Use attr_bkey instead of 2 level radix trees for everything related to
         attributes. This includes the attribute storage, presence, and
         detection of missing mandatory attributes.
       - Avoid iterating over all attribute storage at finish, instead use
         find_first_bit with the attr_bkey to locate only those attrs that need
         cleanup.
       - Organize things to always run, and always rely on, cleanup. This
         avoids a bunch of tricky error unwind cases.
       - Locate the method using the radix tree, and locate the attributes
         using a very efficient incremental radix tree lookup
       - Use the precomputed destroy_bkey to handle uobject destruction
       - Use the precomputed allocation sizes and precomputed 'need_stack'
         to avoid maths in the fast path. This is optimal if userspace
         does not pass (many) unsupported attributes.
      
      Overall this results in much better codegen for the attribute accessors,
      everything is now stored in bitmaps or linear arrays indexed by attr_bkey.
      The compiler can compute attr_bkey values at compile time for all method
      attributes, meaning things like uverbs_attr_is_valid() now compile into
      single instruction bit tests.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      3a863577
    • J
      IB/uverbs: Add a simple allocator to uverbs_attr_bundle · 461bb2ee
      Jason Gunthorpe 提交于
      This is similar in spirit to devm, it keeps track of any allocations
      linked to this method call and ensures they are all freed when the method
      exits. Further, if there is space in the internal/onstack buffer then the
      allocator will hand out that memory and avoid an expensive call to
      kalloc/kfree in the syscall path.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: NLeon Romanovsky <leonro@mellanox.com>
      461bb2ee
  7. 11 8月, 2018 5 次提交
    • J
      IB/uverbs: Remove the ib_uverbs_attr pointer from each attr · 6a1f444f
      Jason Gunthorpe 提交于
      Memory in the bundle is valuable, do not waste it holding an 8 byte
      pointer for the rare case of writing to a PTR_OUT. We can compute the
      pointer by storing a small 1 byte array offset and the base address of the
      uattr memory in the bundle private memory.
      
      This also means we can access the kernel's copy of the ib_uverbs_attr, so
      drop the copy of flags as well.
      
      Since the uattr base should be private bundle information this also
      de-inlines the already too big uverbs_copy_to inline and moves
      create_udata into uverbs_ioctl.c so they can see the private struct
      definition.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: NLeon Romanovsky <leonro@mellanox.com>
      6a1f444f
    • J
      IB/uverbs: Provide implementation private memory for the uverbs_attr_bundle · 4b3dd2bb
      Jason Gunthorpe 提交于
      This already existed as the anonymous 'ctx' structure, but this was not
      really a useful form. Hoist this struct into bundle_priv and rework the
      internal things to use it instead.
      
      Move a bunch of the processing internal state into the priv and reduce the
      excessive use of function arguments.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: NLeon Romanovsky <leonro@mellanox.com>
      4b3dd2bb
    • J
      IB/uverbs: Use uverbs_api to manage the object type inside the uobject · 6b0d08f4
      Jason Gunthorpe 提交于
      Currently the struct uverbs_obj_type stored in the ib_uobject is part of
      the .rodata segment of the module that defines the object. This is a
      problem if drivers define new uapi objects as we will be left with a
      dangling pointer after device disassociation.
      
      Switch the uverbs_obj_type for struct uverbs_api_object, which is
      allocated memory that is part of the uverbs_api and is guaranteed to
      always exist. Further this moves the 'type_class' into this memory which
      means access to the IDR/FD function pointers is also guaranteed. Drivers
      cannot define new types.
      
      This makes it safe to continue to use all uobjects, including driver
      defined ones, after disassociation.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      6b0d08f4
    • J
      IB/uverbs: Build the specs into a radix tree at runtime · 9ed3e5f4
      Jason Gunthorpe 提交于
      This radix tree datastructure is intended to replace the 'hash' structure
      used today for parsing ioctl methods during system calls. This first
      commit introduces the structure and builds it from the existing .rodata
      descriptions.
      
      The so-called hash arrangement is actually a 5 level open coded radix tree.
      This new version uses a 3 level radix tree built using the radix tree
      library.
      
      Overall this is much less code and much easier to build as the radix tree
      API allows for dynamic modification during the building. There is a small
      memory penalty to pay for this, but since the radix tree is allocated on
      a per device basis, a few kb of RAM seems immaterial considering the
      gained simplicity.
      
      The radix tree is similar to the existing tree, but also has a 'attr_bkey'
      concept, which is a small value'd index for each method attribute. This is
      used to simplify and improve performance of everything in the next
      patches.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: NLeon Romanovsky <leonro@mellanox.com>
      Reviewed-by: NMichael J. Ruhl <michael.j.ruhl@intel.com>
      9ed3e5f4
    • J
      IB/uverbs: Have the core code create the uverbs_root_spec · 7d96c9b1
      Jason Gunthorpe 提交于
      There is no reason for drivers to do this, the core code should take of
      everything. The drivers will provide their information from rodata to
      describe their modifications to the core's base uapi specification.
      
      The core uses this to build up the runtime uapi for each device.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: NMichael J. Ruhl <michael.j.ruhl@intel.com>
      Reviewed-by: NLeon Romanovsky <leonro@mellanox.com>
      7d96c9b1
  8. 03 8月, 2018 1 次提交
    • J
      RDMA/netdev: Use priv_destructor for netdev cleanup · 9f49a5b5
      Jason Gunthorpe 提交于
      Now that the unregister_netdev flow for IPoIB no longer relies on external
      code we can now introduce the use of priv_destructor and
      needs_free_netdev.
      
      The rdma_netdev flow is switched to use the netdev common priv_destructor
      instead of the special free_rdma_netdev and the IPOIB ULP adjusted:
       - priv_destructor needs to switch to point to the ULP's destructor
         which will then call the rdma_ndev's in the right order
       - We need to be careful around the error unwind of register_netdev
         as it sometimes calls priv_destructor on failure
       - ULPs need to use ndo_init/uninit to ensure proper ordering
         of failures around register_netdev
      
      Switching to priv_destructor is a necessary pre-requisite to using
      the rtnl new_link mechanism.
      
      The VNIC user for rdma_netdev should also be revised, but that is left for
      another patch.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: NDenis Drozdov <denisd@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
      9f49a5b5
  9. 02 8月, 2018 7 次提交
    • J
      IB/uverbs: Allow all DESTROY commands to succeed after disassociate · 0f50d88a
      Jason Gunthorpe 提交于
      The disassociate function was broken by design because it failed all
      commands. This prevents userspace from calling destroy on a uobject after
      it has detected a device fatal error and thus reclaiming the resources in
      userspace is prevented.
      
      This fix is now straightforward, when anything destroys a uobject that is
      not the user the object remains on the IDR with a NULL context and object
      pointer. All lookup locking modes other than DESTROY will fail. When the
      user ultimately calls the destroy function it is simply dropped from the
      IDR while any related information is returned.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      0f50d88a
    • J
      IB/uverbs: Do not pass struct ib_device to the ioctl methods · e83f0ecd
      Jason Gunthorpe 提交于
      This does the same as the patch before, except for ioctl. The rules are
      the same, but for the ioctl methods the core code handles setting up the
      uobject.
      
      - Retrieve the ib_dev from the uobject->context->device. This is
        safe under ioctl as the core has already done rdma_alloc_begin_uobject
        and so CREATE calls are entirely protected by the rwsem.
      - Retrieve the ib_dev from uobject->object
      - Call ib_uverbs_get_ucontext()
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      e83f0ecd
    • J
      IB/uverbs: Do not pass struct ib_device to the write based methods · bbd51e88
      Jason Gunthorpe 提交于
      This is a step to get rid of the global check for disassociation. In this
      model, the ib_dev is not proven to be valid by the core code and cannot be
      provided to the method. Instead, every method decides if it is able to
      run after disassociation and obtains the ib_dev using one of three
      different approaches:
      
      - Call srcu_dereference on the udevice's ib_dev. As before, this means
        the method cannot be called after disassociation begins.
        (eg alloc ucontext)
      - Retrieve the ib_dev from the ucontext, via ib_uverbs_get_ucontext()
      - Retrieve the ib_dev from the uobject->object after checking
        under SRCU if disassociation has started (eg uobj_get)
      
      Largely, the code is all ready for this, the main work is to provide a
      ib_dev after calling uobj_alloc(). The few other places simply use
      ib_uverbs_get_ucontext() to get the ib_dev.
      
      This flexibility will let the next patches allow destroy to operate
      after disassociation.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      bbd51e88
    • J
      IB/uverbs: Allow RDMA_REMOVE_DESTROY to work concurrently with disassociate · 7452a3c7
      Jason Gunthorpe 提交于
      After all the recent structural changes this is now straightfoward, hoist
      the hw_destroy_rwsem up out of rdma_destroy_explicit and wrap it around
      the uobject write lock as well as the destroy.
      
      This is necessary as obtaining a write lock concurrently with
      uverbs_destroy_ufile_hw() will cause malfunction.
      
      After this change none of the destroy callbacks require the
      disassociate_srcu lock to be correct.
      
      This requires introducing a new lookup mode, UVERBS_LOOKUP_DESTROY as the
      IOCTL interface needs to hold an unlocked kref until all command
      verification is completed.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      7452a3c7
    • J
      IB/uverbs: Convert 'bool exclusive' into an enum · 9867f5c6
      Jason Gunthorpe 提交于
      This is more readable, and future patches will need a 3rd lookup type.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      9867f5c6
    • J
      IB/uverbs: Consolidate uobject destruction · 87ad80ab
      Jason Gunthorpe 提交于
      There are several flows that can destroy a uobject and each one is
      minimized and sprinkled throughout the code base, making it difficult to
      understand and very hard to modify the destroy path.
      
      Consolidate all of these into uverbs_destroy_uobject() and call it in all
      cases where a uobject has to be destroyed.
      
      This makes one change to the lifecycle, during any abort (eg when
      alloc_commit is not called) we always call out to alloc_abort, even if
      remove_commit needs to be called to delete a HW object.
      
      This also renames RDMA_REMOVE_DURING_CLEANUP to RDMA_REMOVE_ABORT to
      clarify its actual usage and revises some of the comments to reflect what
      the life cycle is for the type implementation.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      87ad80ab
    • J
      IB/uverbs: Make the write path destroy methods use the same flow as ioctl · 32ed5c00
      Jason Gunthorpe 提交于
      The ridiculous dance with uobj_remove_commit() is not needed, the write
      path can follow the same flow as ioctl - lock and destroy the HW object
      then use the data left over in the uobject to form the response to
      userspace.
      
      Two helpers are introduced to make this flow straightforward for the
      caller.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      32ed5c00
  10. 31 7月, 2018 6 次提交
  11. 28 7月, 2018 1 次提交
  12. 26 7月, 2018 6 次提交
    • P
      IB/core: Introduce and use sgid_attr in CM requests · cee10433
      Parav Pandit 提交于
      For RoCE, when CM requests are received for RC and UD connections,
      netdevice of the incoming request is unavailable. Because of that CM
      requests are always forwarded to init_net namespace.
      
      Now that we have the GID attribute available, introduce SGID attribute in
      incoming CM requests and refer to the netdevice of it.  This is similar to
      existing SGID attribute field in outgoing CM requests for RC and UD
      transports.
      Signed-off-by: NParav Pandit <parav@mellanox.com>
      Reviewed-by: NDaniel Jurgens <danielj@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      cee10433
    • J
      IB/uverbs: Fix locking around struct ib_uverbs_file ucontext · 22fa27fb
      Jason Gunthorpe 提交于
      We have a parallel unlocked reader and writer with ib_uverbs_get_context()
      vs everything else, and nothing guarantees this works properly.
      
      Audit and fix all of the places that access ucontext to use one of the
      following locking schemes:
      - Call ib_uverbs_get_ucontext() under SRCU and check for failure
      - Access the ucontext through an struct ib_uobject context member
        while holding a READ or WRITE lock on the uobject.
        This value cannot be NULL and has no race.
      - Hold the ucontext_lock and check for ufile->ucontext !NULL
      
      This also re-implements ib_uverbs_get_ucontext() in a way that is safe
      against concurrent ib_uverbs_get_context() and disassociation.
      
      As a side effect, every access to ucontext in the commands is via
      ib_uverbs_get_context() with an error check, or via the uobject, so there
      is no longer any need for the core code to check ucontext on every command
      call. These checks are also removed.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      22fa27fb
    • J
      IB/uverbs: Move the FD uobj type struct file allocation to alloc_commit · aba94548
      Jason Gunthorpe 提交于
      Allocating the struct file during alloc_begin creates this strange
      asymmetry with IDR, where the FD has two krefs pointing at it during the
      pre-commit phase. In particular this makes the abort process for FD very
      strange and confusing.
      
      For instance abort currently calls the type's destroy_object twice, and
      the fops release once if abort is done. This is very counter intuitive. No
      fops should be called until alloc_commit succeeds, and destroy_object
      should only ever be called once.
      
      Moving the struct file allocation to the alloc_commit is now simple, as we
      already support failure of rdma_alloc_commit_uobject, with all the
      required rollback pieces.
      
      This creates an understandable symmetry with IDR and simplifies/fixes the
      abort handling for FD types.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      aba94548
    • J
      IB/uverbs: Always propagate errors from rdma_alloc_commit_uobject() · 2c96eb7d
      Jason Gunthorpe 提交于
      The ioctl framework already does this correctly, but the write path did
      not. This is trivially fixed by simply using a standard pattern to return
      uobj_alloc_commit() as the last statement in every function.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      2c96eb7d
    • J
      IB/uverbs: Rework the locking for cleaning up the ucontext · e951747a
      Jason Gunthorpe 提交于
      The locking here has always been a bit crazy and spread out, upon some
      careful analysis we can simplify things.
      
      Create a single function uverbs_destroy_ufile_hw() that internally handles
      all locking. This pulls together pieces of this process that were
      sprinkled all over the places into one place, and covers them with one
      lock.
      
      This eliminates several duplicate/confusing locks and makes the control
      flow in ib_uverbs_close() and ib_uverbs_free_hw_resources() extremely
      simple.
      
      Unfortunately we have to keep an extra mutex, ucontext_lock.  This lock is
      logically part of the rwsem and provides the 'down write, fail if write
      locked, wait if read locked' semantic we require.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      e951747a
    • J
      IB/uverbs: Handle IDR and FD types without truncation · 1250c304
      Jason Gunthorpe 提交于
      Our ABI for write() uses a s32 for FDs and a u32 for IDRs, but internally
      we ended up implicitly casting these ABI values into an 'int'. For ioctl()
      we use a s64 for FDs and a u64 for IDRs, again casting to an int.
      
      The various casts to int are all missing range checks which can cause
      userspace values that should be considered invalid to be accepted.
      
      Fix this by making the generic lookup routine accept a s64, which does not
      truncate the write API's u32/s32 or the ioctl API's s64. Then push the
      detailed range checking down to the actual type implementations to be
      shared by both interfaces.
      
      Finally, change the copy of the uobj->id to sign extend into a s64, so eg,
      if we ever wish to return a negative value for a FD it is carried
      properly.
      
      This ensures that userspace values are never weirdly interpreted due to
      the various trunctations and everything that is really out of range gets
      an EINVAL.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      1250c304
  13. 25 7月, 2018 2 次提交