1. 12 10月, 2016 1 次提交
    • R
      radix-tree: 'slot' can be NULL in radix_tree_next_slot() · 915045fe
      Ross Zwisler 提交于
      There are four cases I can see where we could end up with a NULL 'slot' in
      radix_tree_next_slot().  Yet radix_tree_next_slot() never actually checks
      whether 'slot' is NULL.  It just happens that for the cases where 'slot'
      is NULL, some other combination of factors prevents us from dereferencing
      it.
      
      It would be very easy for someone to unwittingly change one of these
      factors without realizing that we are implicitly depending on it to save
      us from a NULL pointer dereference.
      
      Add a comment documenting the things that allow 'slot' to be safely passed
      as NULL to radix_tree_next_slot().
      
      Here are details on the four cases:
      
      1) radix_tree_iter_retry() via a non-tagged iteration like
      radix_tree_for_each_slot().  In this case we currently aren't seeing a bug
      because radix_tree_iter_retry() sets
      
      	iter->next_index = iter->index;
      
      which means that in in the else case in radix_tree_next_slot(), 'count' is
      zero, so we skip over the while() loop and effectively just return NULL
      without ever dereferencing 'slot'.
      
      2) radix_tree_iter_retry() via tagged iteration like
      radix_tree_for_each_tagged().  This case was giving us NULL pointer
      dereferences in testing, and was fixed with this commit:
      
      commit 3cb9185c ("radix-tree: fix radix_tree_iter_retry() for tagged
      iterators.")
      
      This fix doesn't explicitly check for 'slot' being NULL, though, it works
      around the NULL pointer dereference by instead zeroing iter->tags in
      radix_tree_iter_retry(), which makes us bail out of the if() case in
      radix_tree_next_slot() before we dereference 'slot'.
      
      3) radix_tree_iter_next() via via a non-tagged iteration like
      radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
      and shmem_partial_swap_usage().
      
      As with non-tagged iteration, 'count' in the else case of
      radix_tree_next_slot() is zero, so we skip over the while() loop and
      effectively just return NULL without ever dereferencing 'slot'.
      
      4) radix_tree_iter_next() via tagged iteration like
      radix_tree_for_each_tagged().  This happens in shmem_wait_for_pins().
      
      radix_tree_iter_next() zeros out iter->tags, so we end up exiting
      radix_tree_next_slot() here:
      
      	if (flags & RADIX_TREE_ITER_TAGGED) {
      		void *canon = slot;
      
      		iter->tags >>= 1;
      		if (unlikely(!iter->tags))
      			return NULL;
      
      Link: http://lkml.kernel.org/r/20160815194237.25967-2-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      915045fe
  2. 06 10月, 2016 1 次提交
    • J
      mm: filemap: don't plant shadow entries without radix tree node · d3798ae8
      Johannes Weiner 提交于
      When the underflow checks were added to workingset_node_shadow_dec(),
      they triggered immediately:
      
        kernel BUG at ./include/linux/swap.h:276!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
         soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
        CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60b #1
        Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
        task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
        RIP: page_cache_tree_insert+0xf1/0x100
        Call Trace:
          __add_to_page_cache_locked+0x12e/0x270
          add_to_page_cache_lru+0x4e/0xe0
          mpage_readpages+0x112/0x1d0
          blkdev_readpages+0x1d/0x20
          __do_page_cache_readahead+0x1ad/0x290
          force_page_cache_readahead+0xaa/0x100
          page_cache_sync_readahead+0x3f/0x50
          generic_file_read_iter+0x5af/0x740
          blkdev_read_iter+0x35/0x40
          __vfs_read+0xe1/0x130
          vfs_read+0x96/0x130
          SyS_read+0x55/0xc0
          entry_SYSCALL_64_fastpath+0x13/0x8f
        Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
        RIP  page_cache_tree_insert+0xf1/0x100
      
      This is a long-standing bug in the way shadow entries are accounted in
      the radix tree nodes. The shrinker needs to know when radix tree nodes
      contain only shadow entries, no pages, so node->count is split in half
      to count shadows in the upper bits and pages in the lower bits.
      
      Unfortunately, the radix tree implementation doesn't know of this and
      assumes all entries are in node->count. When there is a shadow entry
      directly in root->rnode and the tree is later extended, the radix tree
      implementation will copy that entry into the new node and and bump its
      node->count, i.e. increases the page count bits. Once the shadow gets
      removed and we subtract from the upper counter, node->count underflows
      and triggers the warning. Afterwards, without node->count reaching 0
      again, the radix tree node is leaked.
      
      Limit shadow entries to when we have actual radix tree nodes and can
      count them properly. That means we lose the ability to detect refaults
      from files that had only the first page faulted in at eviction time.
      
      Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-and-tested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3798ae8
  3. 03 8月, 2016 1 次提交
  4. 27 7月, 2016 1 次提交
  5. 23 7月, 2016 1 次提交
  6. 21 5月, 2016 13 次提交
  7. 17 5月, 2016 1 次提交
  8. 18 3月, 2016 3 次提交
  9. 06 2月, 2016 1 次提交
  10. 04 2月, 2016 1 次提交
    • M
      radix-tree: fix race in gang lookup · 46437f9a
      Matthew Wilcox 提交于
      If the indirect_ptr bit is set on a slot, that indicates we need to redo
      the lookup.  Introduce a new function radix_tree_iter_retry() which
      forces the loop to retry the lookup by setting 'slot' to NULL and
      turning the iterator back to point at the problematic entry.
      
      This is a pretty rare problem to hit at the moment; the lookup has to
      race with a grow of the radix tree from a height of 0.  The consequences
      of hitting this race are that gang lookup could return a pointer to a
      radix_tree_node instead of a pointer to whatever the user had inserted
      in the tree.
      
      Fixes: cebbd29e ("radix-tree: rewrite gang lookup using iterator")
      Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ohad Ben-Cohen <ohad@wizery.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46437f9a
  11. 23 1月, 2016 1 次提交
    • R
      dax: support dirty DAX entries in radix tree · f9fe48be
      Ross Zwisler 提交于
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9fe48be
  12. 21 1月, 2016 1 次提交
  13. 04 4月, 2014 4 次提交
    • J
      mm: keep page cache radix tree nodes in check · 449dd698
      Johannes Weiner 提交于
      Previously, page cache radix tree nodes were freed after reclaim emptied
      out their page pointers.  But now reclaim stores shadow entries in their
      place, which are only reclaimed when the inodes themselves are
      reclaimed.  This is problematic for bigger files that are still in use
      after they have a significant amount of their cache reclaimed, without
      any of those pages actually refaulting.  The shadow entries will just
      sit there and waste memory.  In the worst case, the shadow entries will
      accumulate until the machine runs out of memory.
      
      To get this under control, the VM will track radix tree nodes
      exclusively containing shadow entries on a per-NUMA node list.  Per-NUMA
      rather than global because we expect the radix tree nodes themselves to
      be allocated node-locally and we want to reduce cross-node references of
      otherwise independent cache workloads.  A simple shrinker will then
      reclaim these nodes on memory pressure.
      
      A few things need to be stored in the radix tree node to implement the
      shadow node LRU and allow tree deletions coming from the list:
      
      1. There is no index available that would describe the reverse path
         from the node up to the tree root, which is needed to perform a
         deletion.  To solve this, encode in each node its offset inside the
         parent.  This can be stored in the unused upper bits of the same
         member that stores the node's height at no extra space cost.
      
      2. The number of shadow entries needs to be counted in addition to the
         regular entries, to quickly detect when the node is ready to go to
         the shadow node LRU list.  The current entry count is an unsigned
         int but the maximum number of entries is 64, so a shadow counter
         can easily be stored in the unused upper bits.
      
      3. Tree modification needs tree lock and tree root, which are located
         in the address space, so store an address_space backpointer in the
         node.  The parent pointer of the node is in a union with the 2-word
         rcu_head, so the backpointer comes at no extra cost as well.
      
      4. The node needs to be linked to an LRU list, which requires a list
         head inside the node.  This does increase the size of the node, but
         it does not change the number of objects that fit into a slab page.
      
      [akpm@linux-foundation.org: export the right function]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      449dd698
    • J
      lib: radix_tree: tree node interface · 139e5616
      Johannes Weiner 提交于
      Make struct radix_tree_node part of the public interface and provide API
      functions to create, look up, and delete whole nodes.  Refactor the
      existing insert, look up, delete functions on top of these new node
      primitives.
      
      This will allow the VM to track and garbage collect page cache radix
      tree nodes.
      
      [sasha.levin@oracle.com: return correct error code on insertion failure]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      139e5616
    • J
      mm: filemap: move radix tree hole searching here · e7b563bb
      Johannes Weiner 提交于
      The radix tree hole searching code is only used for page cache, for
      example the readahead code trying to get a a picture of the area
      surrounding a fault.
      
      It sufficed to rely on the radix tree definition of holes, which is
      "empty tree slot".  But this is about to change, though, as shadow page
      descriptors will be stored in the page cache after the actual pages get
      evicted from memory.
      
      Move the functions over to mm/filemap.c and make them native page cache
      operations, where they can later be adapted to handle the new definition
      of "page cache hole".
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7b563bb
    • J
      lib: radix-tree: add radix_tree_delete_item() · 53c59f26
      Johannes Weiner 提交于
      Provide a function that does not just delete an entry at a given index,
      but also allows passing in an expected item.  Delete only if that item
      is still located at the specified index.
      
      This is handy when lockless tree traversals want to delete entries as
      well because they don't have to do an second, locked lookup to verify
      the slot has not changed under them before deleting the entry.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53c59f26
  14. 12 9月, 2013 1 次提交
    • J
      lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt · 5e4c0d97
      Jan Kara 提交于
      With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
      one such possible user), the following race can happen:
      
      radix_tree_preload()
      ...
      radix_tree_insert()
        radix_tree_node_alloc()
          if (rtp->nr) {
            ret = rtp->nodes[rtp->nr - 1];
      <interrupt>
      ...
      radix_tree_preload()
      ...
      radix_tree_insert()
        radix_tree_node_alloc()
          if (rtp->nr) {
            ret = rtp->nodes[rtp->nr - 1];
      
      And we give out one radix tree node twice.  That clearly results in radix
      tree corruption with different results (usually OOPS) depending on which
      two users of radix tree race.
      
      We fix the problem by making radix_tree_node_alloc() always allocate fresh
      radix tree nodes when in interrupt.  Using preloading when in interrupt
      doesn't make sense since all the allocations have to be atomic anyway and
      we cannot steal nodes from process-context users because some users rely
      on radix_tree_insert() succeeding after radix_tree_preload().
      in_interrupt() check is somewhat ugly but we cannot simply key off passed
      gfp_mask as that is acquired from root_gfp_mask() and thus the same for
      all preload users.
      
      Another part of the fix is to avoid node preallocation in
      radix_tree_preload() when passed gfp_mask doesn't allow waiting.  Again,
      preallocation in such case doesn't make sense and when preallocation would
      happen in interrupt we could possibly leak some allocated nodes.  However,
      some users of radix_tree_preload() require following radix_tree_insert()
      to succeed.  To avoid unexpected effects for these users,
      radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
      and we provide a new function radix_tree_maybe_preload() for those users
      which get different gfp mask from different call sites and which are
      prepared to handle radix_tree_insert() failure.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e4c0d97
  15. 06 6月, 2012 1 次提交
  16. 29 3月, 2012 1 次提交
    • K
      radix-tree: introduce bit-optimized iterator · 78c1d784
      Konstantin Khlebnikov 提交于
      A series of radix tree cleanups, and usage of them in the core pagecache
      code.
      
      Micro-benchmark:
      
      lookup 14 slots (typical page-vector size)
      in radix-tree there earch <step> slot filled and tagged
      before/after - nsec per full scan through tree
      
      * Intel Sandy Bridge i7-2620M 4Mb L3
      New code always faster
      
      * AMD Athlon 6000+ 2x1Mb L2, without L3
      New code generally faster,
      Minor degradation (marked with "*") for huge sparse trees
      
      * i386 on Sandy Bridge
      New code faster for common cases: tagged and dense trees.
      Some degradations for non-tagged lookup on sparse trees.
      
      Ideally, there might help __ffs() analog for searching first non-zero
      long element in array, gcc sometimes cannot optimize this loop corretly.
      
      Numbers:
      
      CPU: Intel Sandy Bridge i7-2620M 4Mb L3
      
      radix-tree with 1024 slots:
      
      tagged lookup
      
      step  1      before  7156        after  3613
      step  2      before  5399        after  2696
      step  3      before  4779        after  1928
      step  4      before  4456        after  1429
      step  5      before  4292        after  1213
      step  6      before  4183        after  1052
      step  7      before  4157        after  951
      step  8      before  4016        after  812
      step  9      before  3952        after  851
      step  10     before  3937        after  732
      step  11     before  4023        after  709
      step  12     before  3872        after  657
      step  13     before  3892        after  633
      step  14     before  3720        after  591
      step  15     before  3879        after  578
      step  16     before  3561        after  513
      
      normal lookup
      
      step  1      before  4266       after  3301
      step  2      before  2695       after  2129
      step  3      before  2083       after  1712
      step  4      before  1801       after  1534
      step  5      before  1628       after  1313
      step  6      before  1551       after  1263
      step  7      before  1475       after  1185
      step  8      before  1432       after  1167
      step  9      before  1373       after  1092
      step  10     before  1339       after  1134
      step  11     before  1292       after  1056
      step  12     before  1319       after  1030
      step  13     before  1276       after  1004
      step  14     before  1256       after  987
      step  15     before  1228       after  992
      step  16     before  1247       after  999
      
      radix-tree with 1024*1024*128 slots:
      
      tagged lookup
      
      step  1      before  1086102841  after  674196409
      step  2      before  816839155   after  498138306
      step  7      before  599728907   after  240676762
      step  15     before  555729253   after  185219677
      step  63     before  606637748   after  128585664
      step  64     before  608384432   after  102945089
      step  65     before  596987114   after  123996019
      step  128    before  304459225   after  56783056
      step  256    before  158846855   after  31232481
      step  512    before  86085652    after  18950595
      step  12345  before  6517189     after  1674057
      
      normal lookup
      
      step  1      before  626064869  after  544418266
      step  2      before  418809975  after  336321473
      step  7      before  242303598  after  207755560
      step  15     before  208380563  after  176496355
      step  63     before  186854206  after  167283638
      step  64     before  176188060  after  170143976
      step  65     before  185139608  after  167487116
      step  128    before  88181865   after  86913490
      step  256    before  45733628   after  45143534
      step  512    before  24506038   after  23859036
      step  12345  before  2177425    after  2018662
      
      * AMD Athlon 6000+ 2x1Mb L2, without L3
      
      radix-tree with 1024 slots:
      
      tag-lookup
      
      step  1      before  8164        after  5379
      step  2      before  5818        after  5581
      step  3      before  4959        after  4213
      step  4      before  4371        after  3386
      step  5      before  4204        after  2997
      step  6      before  4950        after  2744
      step  7      before  4598        after  2480
      step  8      before  4251        after  2288
      step  9      before  4262        after  2243
      step  10     before  4175        after  2131
      step  11     before  3999        after  2024
      step  12     before  3979        after  1994
      step  13     before  3842        after  1929
      step  14     before  3750        after  1810
      step  15     before  3735        after  1810
      step  16     before  3532        after  1660
      
      normal-lookup
      
      step  1      before  7875        after  5847
      step  2      before  4808        after  4071
      step  3      before  4073        after  3462
      step  4      before  3677        after  3074
      step  5      before  4308        after  2978
      step  6      before  3911        after  3807
      step  7      before  3635        after  3522
      step  8      before  3313        after  3202
      step  9      before  3280        after  3257
      step  10     before  3166        after  3083
      step  11     before  3066        after  3026
      step  12     before  2985        after  2982
      step  13     before  2925        after  2924
      step  14     before  2834        after  2808
      step  15     before  2805        after  2803
      step  16     before  2647        after  2622
      
      radix-tree with 1024*1024*128 slots:
      
      tag-lookup
      
      step  1      before  1288059720  after  951736580
      step  2      before  961292300   after  884212140
      step  7      before  768905140   after  547267580
      step  15     before  771319480   after  456550640
      step  63     before  504847640   after  242704304
      step  64     before  392484800   after  177920786
      step  65     before  491162160   after  246895264
      step  128    before  208084064   after  97348392
      step  256    before  112401035   after  51408126
      step  512    before  75825834    after  29145070
      step  12345  before  5603166     after  2847330
      
      normal-lookup
      
      step  1      before  1025677120  after  861375100
      step  2      before  647220080   after  572258540
      step  7      before  505518960   after  484041813
      step  15     before  430483053   after  444815320	*
      step  63     before  388113453   after  404250546	*
      step  64     before  374154666   after  396027440	*
      step  65     before  381423973   after  396704853	*
      step  128    before  190078700   after  202619384	*
      step  256    before  100886756   after  102829108	*
      step  512    before  64074505    after  56158720
      step  12345  before  4237289     after  4422299		*
      
      * i686 on Sandy bridge
      
      radix-tree with 1024 slots:
      
      tagged lookup
      
      step  1      before  7990        after  4019
      step  2      before  5698        after  2897
      step  3      before  5013        after  2475
      step  4      before  4630        after  1721
      step  5      before  4346        after  1759
      step  6      before  4299        after  1556
      step  7      before  4098        after  1513
      step  8      before  4115        after  1222
      step  9      before  3983        after  1390
      step  10     before  4077        after  1207
      step  11     before  3921        after  1231
      step  12     before  3894        after  1116
      step  13     before  3840        after  1147
      step  14     before  3799        after  1090
      step  15     before  3797        after  1059
      step  16     before  3783        after  745
      
      normal lookup
      
      step  1      before  5103       after  3499
      step  2      before  3299       after  2550
      step  3      before  2489       after  2370
      step  4      before  2034       after  2302		*
      step  5      before  1846       after  2268		*
      step  6      before  1752       after  2249		*
      step  7      before  1679       after  2164		*
      step  8      before  1627       after  2153		*
      step  9      before  1542       after  2095		*
      step  10     before  1479       after  2109		*
      step  11     before  1469       after  2009		*
      step  12     before  1445       after  2039		*
      step  13     before  1411       after  2013		*
      step  14     before  1374       after  2046		*
      step  15     before  1340       after  1975		*
      step  16     before  1331       after  2000		*
      
      radix-tree with 1024*1024*128 slots:
      
      tagged lookup
      
      step  1      before  1225865377  after  667153553
      step  2      before  842427423   after  471533007
      step  7      before  609296153   after  276260116
      step  15     before  544232060   after  226859105
      step  63     before  519209199   after  141343043
      step  64     before  588980279   after  141951339
      step  65     before  521099710   after  138282060
      step  128    before  298476778   after  83390628
      step  256    before  149358342   after  43602609
      step  512    before  76994713    after  22911077
      step  12345  before  53286669     after  1472111
      
      normal lookup
      
      step  1      before  819284564  after  533635310
      step  2      before  512421605  after  364956155
      step  7      before  271443305  after  305721345	*
      step  15     before  223591630  after  273960216	*
      step  63     before  190320247  after  217770207	*
      step  64     before  178538168  after  267411372	*
      step  65     before  186400423  after  215347937	*
      step  128    before  88106045   after  140540612	*
      step  256    before  44812420   after  70660377		*
      step  512    before  24435438   after  36328275		*
      step  12345  before  2123924    after  2148062		*
      
      bloat-o-meter delta for this patchset + patchset with related shmem cleanups
      
      bloat-o-meter: x86_64
      
      add/remove: 4/3 grow/shrink: 5/6 up/down: 928/-939 (-11)
      function                                     old     new   delta
      radix_tree_next_chunk                          -     499    +499
      shmem_unuse                                  428     554    +126
      shmem_radix_tree_replace                     131     227     +96
      find_get_pages_tag                           354     419     +65
      find_get_pages_contig                        345     407     +62
      find_get_pages                               362     396     +34
      __kstrtab_radix_tree_next_chunk                -      22     +22
      __ksymtab_radix_tree_next_chunk                -      16     +16
      __kcrctab_radix_tree_next_chunk                -       8      +8
      radix_tree_gang_lookup_slot                  204     203      -1
      static.shmem_xattr_set                       384     381      -3
      radix_tree_gang_lookup_tag_slot              208     191     -17
      radix_tree_gang_lookup                       231     187     -44
      radix_tree_gang_lookup_tag                   247     199     -48
      shmem_unlock_mapping                         278     190     -88
      __lookup                                     217       -    -217
      __lookup_tag                                 242       -    -242
      radix_tree_locate_item                       279       -    -279
      
      bloat-o-meter: i386
      
      add/remove: 3/3 grow/shrink: 8/9 up/down: 1075/-1275 (-200)
      function                                     old     new   delta
      radix_tree_next_chunk                          -     757    +757
      shmem_unuse                                  352     449     +97
      find_get_pages_contig                        269     322     +53
      shmem_radix_tree_replace                     113     154     +41
      find_get_pages_tag                           277     318     +41
      dcache_dir_lseek                             426     458     +32
      __kstrtab_radix_tree_next_chunk                -      22     +22
      vc_do_resize                                 968     977      +9
      snd_pcm_lib_read1                            725     733      +8
      __ksymtab_radix_tree_next_chunk                -       8      +8
      netlbl_cipsov4_list                         1120    1127      +7
      find_get_pages                               293     291      -2
      new_slab                                     467     459      -8
      bitfill_unaligned_rev                        425     417      -8
      radix_tree_gang_lookup_tag_slot              177     146     -31
      blk_dump_cmd                                 267     229     -38
      radix_tree_gang_lookup_slot                  212     134     -78
      shmem_unlock_mapping                         221     128     -93
      radix_tree_gang_lookup_tag                   275     162    -113
      radix_tree_gang_lookup                       255     126    -129
      __lookup                                     227       -    -227
      __lookup_tag                                 271       -    -271
      radix_tree_locate_item                       277       -    -277
      
      This patch:
      
      Implement a clean, simple and effective radix-tree iteration routine.
      
      Iterating divided into two phases:
      * lookup next chunk in radix-tree leaf node
      * iterating through slots in this chunk
      
      Main iterator function radix_tree_next_chunk() returns pointer to first
      slot, and stores in the struct radix_tree_iter index of next-to-last slot.
       For tagged-iterating it also constuct bitmask of tags for retunted chunk.
       All additional logic implemented as static-inline functions and macroses.
      
      Also adds radix_tree_find_next_bit() static-inline variant of
      find_next_bit() optimized for small constant size arrays, because
      find_next_bit() too heavy for searching in an array with one/two long
      elements.
      
      [akpm@linux-foundation.org: rework comments a bit]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Tested-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78c1d784
  17. 05 3月, 2012 1 次提交
    • P
      BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882
      Paul Gortmaker 提交于
      If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
      other BUG variant in a static inline (i.e. not in a #define) then
      that header really should be including <linux/bug.h> and not just
      expecting it to be implicitly present.
      
      We can make this change risk-free, since if the files using these
      headers didn't have exposure to linux/bug.h already, they would have
      been causing compile failures/warnings.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      187f1882
  18. 13 1月, 2012 1 次提交
  19. 04 8月, 2011 2 次提交
    • H
      tmpfs radix_tree: locate_item to speed up swapoff · e504f3fd
      Hugh Dickins 提交于
      We have already acknowledged that swapoff of a tmpfs file is slower than
      it was before conversion to the generic radix_tree: a little slower
      there will be acceptable, if the hotter paths are faster.
      
      But it was a shock to find swapoff of a 500MB file 20 times slower on my
      laptop, taking 10 minutes; and at that rate it significantly slows down
      my testing.
      
      Now, most of that turned out to be overhead from PROVE_LOCKING and
      PROVE_RCU: without those it was only 4 times slower than before; and
      more realistic tests on other machines don't fare as badly.
      
      I've tried a number of things to improve it, including tagging the swap
      entries, then doing lookup by tag: I'd expected that to halve the time,
      but in practice it's erratic, and often counter-productive.
      
      The only change I've so far found to make a consistent improvement, is
      to short-circuit the way we go back and forth, gang lookup packing
      entries into the array supplied, then shmem scanning that array for the
      target entry.  Scanning in place doubles the speed, so it's now only
      twice as slow as before (or three times slower when the PROVEs are on).
      
      So, add radix_tree_locate_item() as an expedient, once-off,
      single-caller hack to do the lookup directly in place.  #ifdef it on
      CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
      applicability as save space in other configurations.  And, sadly,
      #include sched.h for cond_resched().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e504f3fd
    • H
      radix_tree: exceptional entries and indices · 6328650b
      Hugh Dickins 提交于
      A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
      peculiar swap vector, instead keeping a file's swap entries in the same
      radix tree as its struct page pointers: thus saving memory, and
      simplifying its code and locking.
      
      This patch:
      
      The radix_tree is used by several subsystems for different purposes.  A
      major use is to store the struct page pointers of a file's pagecache for
      memory management.  But what if mm wanted to store something other than
      page pointers there too?
      
      The low bit of a radix_tree entry is already used to denote an indirect
      pointer, for internal use, and the unlikely radix_tree_deref_retry()
      case.
      
      Define the next bit as denoting an exceptional entry, and supply inline
      functions radix_tree_exception() to return non-0 in either unlikely
      case, and radix_tree_exceptional_entry() to return non-0 in the second
      case.
      
      If a subsystem already uses radix_tree with that bit set, no problem: it
      does not affect internal workings at all, but is defined for the
      convenience of those storing well-aligned pointers in the radix_tree.
      
      The radix_tree_gang_lookups have an implicit assumption that the caller
      can deduce the offset of each entry returned e.g.  by the page->index of
      a struct page.  But that may not be feasible for some kinds of item to
      be stored there.
      
      radix_tree_gang_lookup_slot() allow for an optional indices argument,
      output array in which to return those offsets.  The same could be added
      to other radix_tree_gang_lookups, but for now keep it to the only one
      for which we need it.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6328650b
  20. 14 1月, 2011 1 次提交
    • M
      mm: migration: use rcu_dereference_protected when dereferencing the radix tree... · 29c1f677
      Mel Gorman 提交于
      mm: migration: use rcu_dereference_protected when dereferencing the radix tree slot during file page migration
      
      migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for
      anonymous pages, as introduced by git commit
      989f89c5 ("fix rcu_read_lock() in page
      migraton").  The point of the RCU protection there is part of getting a
      stable reference to anon_vma and is only held for anon pages as file pages
      are locked which is sufficient protection against freeing.
      
      However, while a file page's mapping is being migrated, the radix tree is
      double checked to ensure it is the expected page.  This uses
      radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held
      triggering the following warning.
      
      [  173.674290] ===================================================
      [  173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
      [  173.676016] ---------------------------------------------------
      [  173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
      [  173.676016]
      [  173.676016] other info that might help us debug this:
      [  173.676016]
      [  173.676016]
      [  173.676016] rcu_scheduler_active = 1, debug_locks = 0
      [  173.676016] 1 lock held by hugeadm/2899:
      [  173.676016]  #0:  (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
      [  173.676016]
      [  173.676016] stack backtrace:
      [  173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
      [  173.676016] Call Trace:
      [  173.676016]  [<c128cc01>] ? printk+0x14/0x1b
      [  173.676016]  [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
      [  173.676016]  [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
      [  173.676016]  [<c10e41ad>] migrate_page+0x23/0x39
      [  173.676016]  [<c10e491b>] buffer_migrate_page+0x22/0x107
      [  173.676016]  [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
      [  173.676016]  [<c10e425d>] move_to_new_page+0x9a/0x1ae
      [  173.676016]  [<c10e47e6>] migrate_pages+0x1e7/0x2fa
      
      This patch introduces radix_tree_deref_slot_protected() which calls
      rcu_dereference_protected().  Users of it must pass in the
      mapping->tree_lock that is protecting this dereference.  Holding the tree
      lock protects against parallel updaters of the radix tree meaning that
      rcu_dereference_protected is allowable.
      
      [akpm@linux-foundation.org: remove unneeded casts]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Milton Miller <miltonm@bga.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>		[2.6.37.early]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29c1f677
  21. 12 11月, 2010 1 次提交
    • N
      radix-tree: fix RCU bug · 27d20fdd
      Nick Piggin 提交于
      Salman Qazi describes the following radix-tree bug:
      
      In the following case, we get can get a deadlock:
      
      0.  The radix tree contains two items, one has the index 0.
      1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
      2.  The reader acquires slot(s) for item(s) including the index 0 item.
      3.  The non-zero index item is deleted, and as a consequence the other item is
          moved to the root of the tree. The place where it used to be is queued for
          deletion after the readers finish.
      3b. The zero item is deleted, removing it from the direct slot, it remains in
          the rcu-delayed indirect node.
      4.  The reader looks at the index 0 slot, and finds that the page has 0 ref
          count
      5.  The reader looks at it again, hoping that the item will either be freed or
          the ref count will increase. This never happens, as the slot it is looking
          at will never be updated. Also, this slot can never be reclaimed because
          the reader is holding rcu_read_lock and is in an infinite loop.
      
      The fix is to re-use the same "indirect" pointer case that requires a slot
      lookup retry into a general "retry the lookup" bit.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      Reported-by: NSalman Qazi <sqazi@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27d20fdd
  22. 20 8月, 2010 1 次提交