1. 17 4月, 2021 1 次提交
    • R
      lib: remove "expecting prototype" kernel-doc warnings · c95c2d32
      Randy Dunlap 提交于
      Fix various kernel-doc warnings in lib/ due to missing or erroneous
      function names.
      
      Add kernel-doc for some function parameters that was missing.  Use
      kernel-doc "Return:" notation in earlycpio.c.
      
      Quietens the following warnings:
      
        lib/earlycpio.c:61: warning: expecting prototype for cpio_data find_cpio_data(). Prototype was for find_cpio_data() instead
      
        lib/lru_cache.c:640: warning: expecting prototype for lc_dump(). Prototype was for lc_seq_dump_details() instead
        lru_cache.c:90: warning: Function parameter or member 'cache' not described in 'lc_create'
      
        lib/parman.c:368: warning: expecting prototype for parman_item_del(). Prototype was for parman_item_remove() instead
        parman.c:309: warning: Excess function parameter 'prority' description in 'parman_prio_init'
      
        lib/radix-tree.c:703: warning: expecting prototype for __radix_tree_insert(). Prototype was for radix_tree_insert() instead
        radix-tree.c:180: warning: Excess function parameter 'addr' description in 'radix_tree_find_next_bit'
        radix-tree.c:180: warning: Excess function parameter 'size' description in 'radix_tree_find_next_bit'
        radix-tree.c:931: warning: Function parameter or member 'iter' not described in 'radix_tree_iter_replace'
      
      Link: https://lkml.kernel.org/r/20210411221756.15461-1-rdunlap@infradead.orgSigned-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
      Cc: Jiri Pirko <jiri@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c95c2d32
  2. 17 10月, 2020 1 次提交
  3. 07 10月, 2020 1 次提交
  4. 17 7月, 2020 1 次提交
  5. 28 5月, 2020 1 次提交
  6. 01 2月, 2020 1 次提交
  7. 03 11月, 2019 1 次提交
  8. 31 5月, 2019 1 次提交
  9. 06 12月, 2018 1 次提交
    • M
      radix tree: Don't return retry entries from lookup · eff3860b
      Matthew Wilcox 提交于
      Commit 66ee620f ("idr: Permit any valid kernel pointer to be stored")
      changed the radix tree lookup so that it stops when reaching the bottom
      of the tree.  However, the condition was added in the wrong place,
      making it possible to return retry entries to the caller.  Reorder the
      tests to check for the retry entry before checking whether we're at the
      bottom of the tree.  The retry entry should never be found in the tree
      root, so it's safe to defer the check until the end of the loop.
      
      Add a regression test to the test-suite to be sure this doesn't come
      back.
      
      Fixes: 66ee620f ("idr: Permit any valid kernel pointer to be stored")
      Reported-by: NGreg Kurz <groug@kaod.org>
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      eff3860b
  10. 21 10月, 2018 13 次提交
  11. 30 9月, 2018 3 次提交
    • M
      xarray: Change definition of sibling entries · 02c02bf1
      Matthew Wilcox 提交于
      Instead of storing a pointer to the slot containing the canonical entry,
      store the offset of the slot.  Produces slightly more efficient code
      (~300 bytes) and simplifies the implementation.
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      02c02bf1
    • M
      xarray: Replace exceptional entries · 3159f943
      Matthew Wilcox 提交于
      Introduce xarray value entries and tagged pointers to replace radix
      tree exceptional entries.  This is a slight change in encoding to allow
      the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
      value entry).  It is also a change in emphasis; exceptional entries are
      intimidating and different.  As the comment explains, you can choose
      to store values or pointers in the xarray and they are both first-class
      citizens.
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      3159f943
    • M
      idr: Permit any valid kernel pointer to be stored · 66ee620f
      Matthew Wilcox 提交于
      An upcoming change to the encoding of internal entries will set the bottom
      two bits to 0b10.  Unfortunately, m68k only aligns some data structures
      to 2 bytes, so the IDR will interpret them as internal entries and things
      will go badly wrong.
      
      Change the radix tree so that it stops either when the node indicates
      that it's the bottom of the tree (shift == 0) or when the entry is not an
      internal entry.  This means we cannot insert an arbitrary kernel pointer
      as a multiorder entry, but the IDR does not permit multiorder entries.
      
      Annoyingly, this means the IDR can no longer take advantage of the radix
      tree's ability to store a single entry at offset 0 without allocating
      memory.  A pointer which is 2-byte aligned cannot be stored directly in
      the root as it would be indistinguishable from a node, so we must allocate
      a node in order to store a 2-byte pointer at index 0.  The idr_replace()
      function does not take a GFP flags argument, so cannot allocate memory.
      If a user inserts a 4-byte aligned pointer at index 0 and then replaces
      it with a 2-byte aligned pointer, we must be able to store it.
      
      Arbitrary pointer values are still not permitted; pointers of the
      form 2 + (i * 4) for values of i between 0 and 1023 are reserved for
      the implementation.  These are not valid kernel pointers as they would
      point into the zero page.
      
      This change does cause a runtime memory consumption regression for
      the IDA.  I will recover that later.
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      Tested-by: NGuenter Roeck <linux@roeck-us.net>
      66ee620f
  12. 22 8月, 2018 2 次提交
    • M
      ida: Remove old API · b03f8e43
      Matthew Wilcox 提交于
      Delete ida_pre_get(), ida_get_new(), ida_get_new_above() and ida_remove()
      from the public API.  Some of these functions still exist as internal
      helpers, but they should not be called by consumers.
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      b03f8e43
    • M
      radix-tree: Fix UBSAN warning · 76f070b4
      Matthew Wilcox 提交于
      get_slot_offset() can be called with a NULL 'parent' argument.
      In this case, the calculated value will not be used, but calculating
      it is undefined.  Rather than fixing the caller (__radix_tree_delete)
      to not call get_slot_offset(), make get_slot_offset() robust against
      being called with a NULL parent.
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      76f070b4
  13. 26 5月, 2018 1 次提交
    • M
      idr: fix invalid ptr dereference on item delete · 7a4deea1
      Matthew Wilcox 提交于
      If the radix tree underlying the IDR happens to be full and we attempt
      to remove an id which is larger than any id in the IDR, we will call
      __radix_tree_delete() with an uninitialised 'slot' pointer, at which
      point anything could happen.  This was easiest to hit with a single
      entry at id 0 and attempting to remove a non-0 id, but it could have
      happened with 64 entries and attempting to remove an id >= 64.
      
      Roman said:
      
        The syzcaller test boils down to opening /dev/kvm, creating an
        eventfd, and calling a couple of KVM ioctls. None of this requires
        superuser. And the result is dereferencing an uninitialized pointer
        which is likely a crash. The specific path caught by syzbot is via
        KVM_HYPERV_EVENTD ioctl which is new in 4.17. But I guess there are
        other user-triggerable paths, so cc:stable is probably justified.
      
      Matthew added:
      
        We have around 250 calls to idr_remove() in the kernel today. Many of
        them pass an ID which is embedded in the object they're removing, so
        they're safe. Picking a few likely candidates:
      
        drivers/firewire/core-cdev.c looks unsafe; the ID comes from an ioctl.
        drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c is similar
        drivers/atm/nicstar.c could be taken down by a handcrafted packet
      
      Link: http://lkml.kernel.org/r/20180518175025.GD6361@bombadil.infradead.org
      Fixes: 0a835c4f ("Reimplement IDR and IDA using the radix tree")
      Reported-by: <syzbot+35666cba7f0a337e2e79@syzkaller.appspotmail.com>
      Debugged-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a4deea1
  14. 19 5月, 2018 1 次提交
    • R
      radix tree: fix multi-order iteration race · 9f418224
      Ross Zwisler 提交于
      Fix a race in the multi-order iteration code which causes the kernel to
      hit a GP fault.  This was first seen with a production v4.15 based
      kernel (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used
      order 9 PMD DAX entries.
      
      The race has to do with how we tear down multi-order sibling entries
      when we are removing an item from the tree.  Remember for example that
      an order 2 entry looks like this:
      
        struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
      
      where 'entry' is in some slot in the struct radix_tree_node, and the
      three slots following 'entry' contain sibling pointers which point back
      to 'entry.'
      
      When we delete 'entry' from the tree, we call :
      
        radix_tree_delete()
          radix_tree_delete_item()
            __radix_tree_delete()
              replace_slot()
      
      replace_slot() first removes the siblings in order from the first to the
      last, then at then replaces 'entry' with NULL.  This means that for a
      brief period of time we end up with one or more of the siblings removed,
      so:
      
        struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
      
      This causes an issue if you have a reader iterating over the slots in
      the tree via radix_tree_for_each_slot() while only under
      rcu_read_lock()/rcu_read_unlock() protection.  This is a common case in
      mm/filemap.c.
      
      The issue is that when __radix_tree_next_slot() => skip_siblings() tries
      to skip over the sibling entries in the slots, it currently does so with
      an exact match on the slot directly preceding our current slot.
      Normally this works:
      
                                            V preceding slot
        struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
                                                    ^ current slot
      
      This lets you find the first sibling, and you skip them all in order.
      
      But in the case where one of the siblings is NULL, that slot is skipped
      and then our sibling detection is interrupted:
      
                                                   V preceding slot
        struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
                                                          ^ current slot
      
      This means that the sibling pointers aren't recognized since they point
      all the way back to 'entry', so we think that they are normal internal
      radix tree pointers.  This causes us to think we need to walk down to a
      struct radix_tree_node starting at the address of 'entry'.
      
      In a real running kernel this will crash the thread with a GP fault when
      you try and dereference the slots in your broken node starting at
      'entry'.
      
      We fix this race by fixing the way that skip_siblings() detects sibling
      nodes.  Instead of testing against the preceding slot we instead look
      for siblings via is_sibling_entry() which compares against the position
      of the struct radix_tree_node.slots[] array.  This ensures that sibling
      entries are properly identified, even if they are no longer contiguous
      with the 'entry' they point to.
      
      Link: http://lkml.kernel.org/r/20180503192430.7582-6-ross.zwisler@linux.intel.com
      Fixes: 148deab2 ("radix-tree: improve multiorder iterators")
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: NCR, Sapthagirish <sapthagirish.cr@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f418224
  15. 12 4月, 2018 1 次提交
    • M
      radix tree: use GFP_ZONEMASK bits of gfp_t for flags · fa290cda
      Matthew Wilcox 提交于
      Patch series "XArray", v9.  (First part thereof).
      
      This patchset is, I believe, appropriate for merging for 4.17.  It
      contains the XArray implementation, to eventually replace the radix
      tree, and converts the page cache to use it.
      
      This conversion keeps the radix tree and XArray data structures in sync
      at all times.  That allows us to convert the page cache one function at
      a time and should allow for easier bisection.  Other than renaming some
      elements of the structures, the data structures are fundamentally
      unchanged; a radix tree walk and an XArray walk will touch the same
      number of cachelines.  I have changes planned to the XArray data
      structure, but those will happen in future patches.
      
      Improvements the XArray has over the radix tree:
      
       - The radix tree provides operations like other trees do; 'insert' and
         'delete'. But what most users really want is an automatically
         resizing array, and so it makes more sense to give users an API that
         is like an array -- 'load' and 'store'. We still have an 'insert'
         operation for users that really want that semantic.
      
       - The XArray considers locking as part of its API. This simplifies a
         lot of users who formerly had to manage their own locking just for
         the radix tree. It also improves code generation as we can now tell
         RCU that we're holding a lock and it doesn't need to generate as much
         fencing code. The other advantage is that tree nodes can be moved
         (not yet implemented).
      
       - GFP flags are now parameters to calls which may need to allocate
         memory. The radix tree forced users to decide what the allocation
         flags would be at creation time. It's much clearer to specify them at
         allocation time.
      
       - Memory is not preloaded; we don't tie up dozens of pages on the off
         chance that the slab allocator fails. Instead, we drop the lock,
         allocate a new node and retry the operation. We have to convert all
         the radix tree, IDA and IDR preload users before we can realise this
         benefit, but I have not yet found a user which cannot be converted.
      
       - The XArray provides a cmpxchg operation. The radix tree forces users
         to roll their own (and at least four have).
      
       - Iterators take a 'max' parameter. That simplifies many users and will
         reduce the amount of iteration done.
      
       - Iteration can proceed backwards. We only have one user for this, but
         since it's called as part of the pagefault readahead algorithm, that
         seemed worth mentioning.
      
       - RCU-protected pointers are not exposed as part of the API. There are
         some fun bugs where the page cache forgets to use rcu_dereference()
         in the current codebase.
      
       - Value entries gain an extra bit compared to radix tree exceptional
         entries. That gives us the extra bit we need to put huge page swap
         entries in the page cache.
      
       - Some iterators now take a 'filter' argument instead of having
         separate iterators for tagged/untagged iterations.
      
      The page cache is improved by this:
      
       - Shorter, easier to read code
      
       - More efficient iterations
      
       - Reduction in size of struct address_space
      
       - Fewer walks from the top of the data structure; the XArray API
         encourages staying at the leaf node and conducting operations there.
      
      This patch (of 8):
      
      None of these bits may be used for slab allocations, so we can use them
      as radix tree flags as long as we mask them off before passing them to
      the slab allocator. Move the IDR flag from the high bits to the
      GFP_ZONEMASK bits.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-3-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Acked-by: NJeff Layton <jlayton@kernel.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa290cda
  16. 22 2月, 2018 1 次提交
  17. 07 2月, 2018 1 次提交
    • M
      idr: Remove idr_alloc_ext · 460488c5
      Matthew Wilcox 提交于
      It has no more users, so remove it.  Move idr_alloc() back into idr.c,
      move the guts of idr_alloc_cmn() into idr_alloc_u32(), remove the
      wrappers around idr_get_free_cmn() and rename it to idr_get_free().
      While there is now no interface to allocate IDs larger than a u32,
      the IDR internals remain ready to handle a larger ID should a need arise.
      
      These changes make it possible to provide the guarantee that, if the
      nextid pointer points into the object, the object's ID will be initialised
      before a concurrent lookup can find the object.
      Signed-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      460488c5
  18. 16 11月, 2017 1 次提交
    • M
      mm, truncate: do not check mapping for every page being truncated · c7df8ad2
      Mel Gorman 提交于
      During truncation, the mapping has already been checked for shmem and
      dax so it's known that workingset_update_node is required.
      
      This patch avoids the checks on mapping for each page being truncated.
      In all other cases, a lookup helper is used to determine if
      workingset_update_node() needs to be called.  The one danger is that the
      API is slightly harder to use as calling workingset_update_node directly
      without checking for dax or shmem mappings could lead to surprises.
      However, the API rarely needs to be used and hopefully the comment is
      enough to give people the hint.
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                   oneirq-v1r1        pickhelper-v1r1
      Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
      1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
      2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
      Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
      Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
      Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
      Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
      Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
      Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
      Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
      Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
      Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
      Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
      Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
      Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
      Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)
      
      As you'd expect, the gain is marginal but it can be detected.  The
      differences in bonnie are all within the noise which is not surprising
      given the impact on the microbenchmark.
      
      radix_tree_update_node_t is a callback for some radix operations that
      optionally passes in a private field.  The only user of the callback is
      workingset_update_node and as it no longer requires a mapping, the
      private field is removed.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c7df8ad2
  19. 09 9月, 2017 1 次提交
  20. 31 8月, 2017 1 次提交
  21. 18 8月, 2017 1 次提交
    • C
      drm/i915: Replace execbuf vma ht with an idr · d1b48c1e
      Chris Wilson 提交于
      This was the competing idea long ago, but it was only with the rewrite
      of the idr as an radixtree and using the radixtree directly ourselves,
      along with the realisation that we can store the vma directly in the
      radixtree and only need a list for the reverse mapping, that made the
      patch performant enough to displace using a hashtable. Though the vma ht
      is fast and doesn't require any extra allocation (as we can embed the node
      inside the vma), it does require a thread for resizing and serialization
      and will have the occasional slow lookup. That is hairy enough to
      investigate alternatives and favour them if equivalent in peak performance.
      One advantage of allocating an indirection entry is that we can support a
      single shared bo between many clients, something that was done on a
      first-come first-serve basis for shared GGTT vma previously. To offset
      the extra allocations, we create yet another kmem_cache for them.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20170816085210.4199-5-chris@chris-wilson.co.uk
      d1b48c1e
  22. 04 5月, 2017 1 次提交
    • M
      lockdep: allow to disable reclaim lockup detection · 7e784422
      Michal Hocko 提交于
      The current implementation of the reclaim lockup detection can lead to
      false positives and those even happen and usually lead to tweak the code
      to silence the lockdep by using GFP_NOFS even though the context can use
      __GFP_FS just fine.
      
      See
      
        http://lkml.kernel.org/r/20160512080321.GA18496@dastard
      
      as an example.
      
        =================================
        [ INFO: inconsistent lock state ]
        4.5.0-rc2+ #4 Tainted: G           O
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
      
        (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]
      
        {RECLAIM_FS-ON-R} state was registered at:
          mark_held_locks+0x79/0xa0
          lockdep_trace_alloc+0xb3/0x100
          kmem_cache_alloc+0x33/0x230
          kmem_zone_alloc+0x81/0x120 [xfs]
          xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
          __xfs_refcount_find_shared+0x75/0x580 [xfs]
          xfs_refcount_find_shared+0x84/0xb0 [xfs]
          xfs_getbmap+0x608/0x8c0 [xfs]
          xfs_vn_fiemap+0xab/0xc0 [xfs]
          do_vfs_ioctl+0x498/0x670
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x12/0x6f
      
               CPU0
               ----
          lock(&xfs_nondir_ilock_class);
          <Interrupt>
            lock(&xfs_nondir_ilock_class);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/543:
      
        stack backtrace:
        CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
        Call Trace:
         lock_acquire+0xd8/0x1e0
         down_write_nested+0x5e/0xc0
         xfs_ilock+0x177/0x200 [xfs]
         xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
         xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
         evict+0xc5/0x190
         dispose_list+0x39/0x60
         prune_icache_sb+0x4b/0x60
         super_cache_scan+0x14f/0x1a0
         shrink_slab.part.63.constprop.79+0x1e9/0x4e0
         shrink_zone+0x15e/0x170
         kswapd+0x4f1/0xa80
         kthread+0xf2/0x110
         ret_from_fork+0x3f/0x70
      
      To quote Dave:
       "Ignoring whether reflink should be doing anything or not, that's a
        "xfs_refcountbt_init_cursor() gets called both outside and inside
        transactions" lockdep false positive case. The problem here is lockdep
        has seen this allocation from within a transaction, hence a GFP_NOFS
        allocation, and now it's seeing it in a GFP_KERNEL context. Also note
        that we have an active reference to this inode.
      
        So, because the reclaim annotations overload the interrupt level
        detections and it's seen the inode ilock been taken in reclaim
        ("interrupt") context, this triggers a reclaim context warning where
        it thinks it is unsafe to do this allocation in GFP_KERNEL context
        holding the inode ilock..."
      
      This sounds like a fundamental problem of the reclaim lock detection.
      It is really impossible to annotate such a special usecase IMHO unless
      the reclaim lockup detection is reworked completely.  Until then it is
      much better to provide a way to add "I know what I am doing flag" and
      mark problematic places.  This would prevent from abusing GFP_NOFS flag
      which has a runtime effect even on configurations which have lockdep
      disabled.
      
      Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
      skip the current allocation request.
      
      While we are at it also make sure that the radix tree doesn't
      accidentaly override tags stored in the upper part of the gfp_mask.
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Suggested-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e784422
  23. 08 3月, 2017 1 次提交
    • M
      ida: Free correct IDA bitmap · 4ecd9542
      Matthew Wilcox 提交于
      There's a relatively rare race where we look at the per-cpu preallocated
      IDA bitmap, see it's NULL, allocate a new one, and atomically update it.
      If the kmalloc() happened to sleep and we were rescheduled to a different
      CPU, or an interrupt came in at the exact right time, another task
      might have successfully allocated a bitmap and already deposited it.
      I forgot what the semantics of cmpxchg() were and ended up freeing the
      wrong bitmap leading to KASAN reporting a use-after-free.
      
      Dmitry found the bug with syzkaller & wrote the patch.  I wrote the test
      case that will reproduce the bug without his patch being applied.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      4ecd9542
  24. 14 2月, 2017 2 次提交