1. 29 3月, 2012 1 次提交
    • K
      radix-tree: introduce bit-optimized iterator · 78c1d784
      Konstantin Khlebnikov 提交于
      A series of radix tree cleanups, and usage of them in the core pagecache
      code.
      
      Micro-benchmark:
      
      lookup 14 slots (typical page-vector size)
      in radix-tree there earch <step> slot filled and tagged
      before/after - nsec per full scan through tree
      
      * Intel Sandy Bridge i7-2620M 4Mb L3
      New code always faster
      
      * AMD Athlon 6000+ 2x1Mb L2, without L3
      New code generally faster,
      Minor degradation (marked with "*") for huge sparse trees
      
      * i386 on Sandy Bridge
      New code faster for common cases: tagged and dense trees.
      Some degradations for non-tagged lookup on sparse trees.
      
      Ideally, there might help __ffs() analog for searching first non-zero
      long element in array, gcc sometimes cannot optimize this loop corretly.
      
      Numbers:
      
      CPU: Intel Sandy Bridge i7-2620M 4Mb L3
      
      radix-tree with 1024 slots:
      
      tagged lookup
      
      step  1      before  7156        after  3613
      step  2      before  5399        after  2696
      step  3      before  4779        after  1928
      step  4      before  4456        after  1429
      step  5      before  4292        after  1213
      step  6      before  4183        after  1052
      step  7      before  4157        after  951
      step  8      before  4016        after  812
      step  9      before  3952        after  851
      step  10     before  3937        after  732
      step  11     before  4023        after  709
      step  12     before  3872        after  657
      step  13     before  3892        after  633
      step  14     before  3720        after  591
      step  15     before  3879        after  578
      step  16     before  3561        after  513
      
      normal lookup
      
      step  1      before  4266       after  3301
      step  2      before  2695       after  2129
      step  3      before  2083       after  1712
      step  4      before  1801       after  1534
      step  5      before  1628       after  1313
      step  6      before  1551       after  1263
      step  7      before  1475       after  1185
      step  8      before  1432       after  1167
      step  9      before  1373       after  1092
      step  10     before  1339       after  1134
      step  11     before  1292       after  1056
      step  12     before  1319       after  1030
      step  13     before  1276       after  1004
      step  14     before  1256       after  987
      step  15     before  1228       after  992
      step  16     before  1247       after  999
      
      radix-tree with 1024*1024*128 slots:
      
      tagged lookup
      
      step  1      before  1086102841  after  674196409
      step  2      before  816839155   after  498138306
      step  7      before  599728907   after  240676762
      step  15     before  555729253   after  185219677
      step  63     before  606637748   after  128585664
      step  64     before  608384432   after  102945089
      step  65     before  596987114   after  123996019
      step  128    before  304459225   after  56783056
      step  256    before  158846855   after  31232481
      step  512    before  86085652    after  18950595
      step  12345  before  6517189     after  1674057
      
      normal lookup
      
      step  1      before  626064869  after  544418266
      step  2      before  418809975  after  336321473
      step  7      before  242303598  after  207755560
      step  15     before  208380563  after  176496355
      step  63     before  186854206  after  167283638
      step  64     before  176188060  after  170143976
      step  65     before  185139608  after  167487116
      step  128    before  88181865   after  86913490
      step  256    before  45733628   after  45143534
      step  512    before  24506038   after  23859036
      step  12345  before  2177425    after  2018662
      
      * AMD Athlon 6000+ 2x1Mb L2, without L3
      
      radix-tree with 1024 slots:
      
      tag-lookup
      
      step  1      before  8164        after  5379
      step  2      before  5818        after  5581
      step  3      before  4959        after  4213
      step  4      before  4371        after  3386
      step  5      before  4204        after  2997
      step  6      before  4950        after  2744
      step  7      before  4598        after  2480
      step  8      before  4251        after  2288
      step  9      before  4262        after  2243
      step  10     before  4175        after  2131
      step  11     before  3999        after  2024
      step  12     before  3979        after  1994
      step  13     before  3842        after  1929
      step  14     before  3750        after  1810
      step  15     before  3735        after  1810
      step  16     before  3532        after  1660
      
      normal-lookup
      
      step  1      before  7875        after  5847
      step  2      before  4808        after  4071
      step  3      before  4073        after  3462
      step  4      before  3677        after  3074
      step  5      before  4308        after  2978
      step  6      before  3911        after  3807
      step  7      before  3635        after  3522
      step  8      before  3313        after  3202
      step  9      before  3280        after  3257
      step  10     before  3166        after  3083
      step  11     before  3066        after  3026
      step  12     before  2985        after  2982
      step  13     before  2925        after  2924
      step  14     before  2834        after  2808
      step  15     before  2805        after  2803
      step  16     before  2647        after  2622
      
      radix-tree with 1024*1024*128 slots:
      
      tag-lookup
      
      step  1      before  1288059720  after  951736580
      step  2      before  961292300   after  884212140
      step  7      before  768905140   after  547267580
      step  15     before  771319480   after  456550640
      step  63     before  504847640   after  242704304
      step  64     before  392484800   after  177920786
      step  65     before  491162160   after  246895264
      step  128    before  208084064   after  97348392
      step  256    before  112401035   after  51408126
      step  512    before  75825834    after  29145070
      step  12345  before  5603166     after  2847330
      
      normal-lookup
      
      step  1      before  1025677120  after  861375100
      step  2      before  647220080   after  572258540
      step  7      before  505518960   after  484041813
      step  15     before  430483053   after  444815320	*
      step  63     before  388113453   after  404250546	*
      step  64     before  374154666   after  396027440	*
      step  65     before  381423973   after  396704853	*
      step  128    before  190078700   after  202619384	*
      step  256    before  100886756   after  102829108	*
      step  512    before  64074505    after  56158720
      step  12345  before  4237289     after  4422299		*
      
      * i686 on Sandy bridge
      
      radix-tree with 1024 slots:
      
      tagged lookup
      
      step  1      before  7990        after  4019
      step  2      before  5698        after  2897
      step  3      before  5013        after  2475
      step  4      before  4630        after  1721
      step  5      before  4346        after  1759
      step  6      before  4299        after  1556
      step  7      before  4098        after  1513
      step  8      before  4115        after  1222
      step  9      before  3983        after  1390
      step  10     before  4077        after  1207
      step  11     before  3921        after  1231
      step  12     before  3894        after  1116
      step  13     before  3840        after  1147
      step  14     before  3799        after  1090
      step  15     before  3797        after  1059
      step  16     before  3783        after  745
      
      normal lookup
      
      step  1      before  5103       after  3499
      step  2      before  3299       after  2550
      step  3      before  2489       after  2370
      step  4      before  2034       after  2302		*
      step  5      before  1846       after  2268		*
      step  6      before  1752       after  2249		*
      step  7      before  1679       after  2164		*
      step  8      before  1627       after  2153		*
      step  9      before  1542       after  2095		*
      step  10     before  1479       after  2109		*
      step  11     before  1469       after  2009		*
      step  12     before  1445       after  2039		*
      step  13     before  1411       after  2013		*
      step  14     before  1374       after  2046		*
      step  15     before  1340       after  1975		*
      step  16     before  1331       after  2000		*
      
      radix-tree with 1024*1024*128 slots:
      
      tagged lookup
      
      step  1      before  1225865377  after  667153553
      step  2      before  842427423   after  471533007
      step  7      before  609296153   after  276260116
      step  15     before  544232060   after  226859105
      step  63     before  519209199   after  141343043
      step  64     before  588980279   after  141951339
      step  65     before  521099710   after  138282060
      step  128    before  298476778   after  83390628
      step  256    before  149358342   after  43602609
      step  512    before  76994713    after  22911077
      step  12345  before  53286669     after  1472111
      
      normal lookup
      
      step  1      before  819284564  after  533635310
      step  2      before  512421605  after  364956155
      step  7      before  271443305  after  305721345	*
      step  15     before  223591630  after  273960216	*
      step  63     before  190320247  after  217770207	*
      step  64     before  178538168  after  267411372	*
      step  65     before  186400423  after  215347937	*
      step  128    before  88106045   after  140540612	*
      step  256    before  44812420   after  70660377		*
      step  512    before  24435438   after  36328275		*
      step  12345  before  2123924    after  2148062		*
      
      bloat-o-meter delta for this patchset + patchset with related shmem cleanups
      
      bloat-o-meter: x86_64
      
      add/remove: 4/3 grow/shrink: 5/6 up/down: 928/-939 (-11)
      function                                     old     new   delta
      radix_tree_next_chunk                          -     499    +499
      shmem_unuse                                  428     554    +126
      shmem_radix_tree_replace                     131     227     +96
      find_get_pages_tag                           354     419     +65
      find_get_pages_contig                        345     407     +62
      find_get_pages                               362     396     +34
      __kstrtab_radix_tree_next_chunk                -      22     +22
      __ksymtab_radix_tree_next_chunk                -      16     +16
      __kcrctab_radix_tree_next_chunk                -       8      +8
      radix_tree_gang_lookup_slot                  204     203      -1
      static.shmem_xattr_set                       384     381      -3
      radix_tree_gang_lookup_tag_slot              208     191     -17
      radix_tree_gang_lookup                       231     187     -44
      radix_tree_gang_lookup_tag                   247     199     -48
      shmem_unlock_mapping                         278     190     -88
      __lookup                                     217       -    -217
      __lookup_tag                                 242       -    -242
      radix_tree_locate_item                       279       -    -279
      
      bloat-o-meter: i386
      
      add/remove: 3/3 grow/shrink: 8/9 up/down: 1075/-1275 (-200)
      function                                     old     new   delta
      radix_tree_next_chunk                          -     757    +757
      shmem_unuse                                  352     449     +97
      find_get_pages_contig                        269     322     +53
      shmem_radix_tree_replace                     113     154     +41
      find_get_pages_tag                           277     318     +41
      dcache_dir_lseek                             426     458     +32
      __kstrtab_radix_tree_next_chunk                -      22     +22
      vc_do_resize                                 968     977      +9
      snd_pcm_lib_read1                            725     733      +8
      __ksymtab_radix_tree_next_chunk                -       8      +8
      netlbl_cipsov4_list                         1120    1127      +7
      find_get_pages                               293     291      -2
      new_slab                                     467     459      -8
      bitfill_unaligned_rev                        425     417      -8
      radix_tree_gang_lookup_tag_slot              177     146     -31
      blk_dump_cmd                                 267     229     -38
      radix_tree_gang_lookup_slot                  212     134     -78
      shmem_unlock_mapping                         221     128     -93
      radix_tree_gang_lookup_tag                   275     162    -113
      radix_tree_gang_lookup                       255     126    -129
      __lookup                                     227       -    -227
      __lookup_tag                                 271       -    -271
      radix_tree_locate_item                       277       -    -277
      
      This patch:
      
      Implement a clean, simple and effective radix-tree iteration routine.
      
      Iterating divided into two phases:
      * lookup next chunk in radix-tree leaf node
      * iterating through slots in this chunk
      
      Main iterator function radix_tree_next_chunk() returns pointer to first
      slot, and stores in the struct radix_tree_iter index of next-to-last slot.
       For tagged-iterating it also constuct bitmask of tags for retunted chunk.
       All additional logic implemented as static-inline functions and macroses.
      
      Also adds radix_tree_find_next_bit() static-inline variant of
      find_next_bit() optimized for small constant size arrays, because
      find_next_bit() too heavy for searching in an array with one/two long
      elements.
      
      [akpm@linux-foundation.org: rework comments a bit]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Tested-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78c1d784
  2. 08 3月, 2012 1 次提交
  3. 13 1月, 2012 1 次提交
    • H
      radix_tree: take radix_tree_path off stack · e2bdb933
      Hugh Dickins 提交于
      Down, down in the deepest depths of GFP_NOIO page reclaim, we have
      shrink_page_list() calling __remove_mapping() calling __delete_from_
      swap_cache() or __delete_from_page_cache().
      
      You would not expect those to need much stack, but in fact they call
      radix_tree_delete(): which declares a 192-byte radix_tree_path array on
      its stack (to record the node,offsets it visits when descending, in case
      it needs to ascend to update them).  And if any tag is still set [1],
      that calls radix_tree_tag_clear(), which declares a further such
      192-byte radix_tree_path array on the stack.  (At least we have
      interrupts disabled here, so won't then be pushing registers too.)
      
      That was probably a good choice when most users were 32-bit (array of
      half the size), and adding fields to radix_tree_node would have bloated
      it unnecessarily.  But nowadays many are 64-bit, and each
      radix_tree_node contains a struct rcu_head, which is only used when
      freeing; whereas the radix_tree_path info is only used for updating the
      tree (deleting, clearing tags or setting tags if tagged) when a lock
      must be held, of no interest when accessing the tree locklessly.
      
      So add a parent pointer to the radix_tree_node, in union with the
      rcu_head, and remove all uses of the radix_tree_path.  There would be
      space in that union to save the offset when descending as before (we can
      argue that a lock must already be held to exclude other users), but
      recalculating it when ascending is both easy (a constant shift and a
      constant mask) and uncommon, so it seems better just to do that.
      
      Two little optimizations: no need to decrement height when descending,
      adjusting shift is enough; and once radix_tree_tag_if_tagged() has set
      tag on a node and its ancestors, it need not ascend from that node
      again.
      
      perf on the radix tree test harness reports radix_tree_insert() as 2%
      slower (now having to set parent), but radix_tree_delete() 24% faster.
      Surely that's an exaggeration from rtth's artificially low map shift 3,
      but forcing it back to 6 still rates radix_tree_delete() 8% faster.
      
      [1] Can a pagecache tag (dirty, writeback or towrite) actually still be
      set at the time of radix_tree_delete()? Perhaps not if the filesystem is
      well-behaved.  But although I've not tracked any stack overflow down to
      this cause, I have observed a curious case in which a dirty tag is set
      and left set on tmpfs: page migration's migrate_page_copy() happens to
      use __set_page_dirty_nobuffers() to set PageDirty on the newpage, and
      that sets PAGECACHE_TAG_DIRTY as a side-effect - harmless to a
      filesystem which doesn't use tags, except for this stack depth issue.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nai Xia <nai.xia@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2bdb933
  4. 01 11月, 2011 1 次提交
  5. 04 8月, 2011 2 次提交
    • H
      tmpfs radix_tree: locate_item to speed up swapoff · e504f3fd
      Hugh Dickins 提交于
      We have already acknowledged that swapoff of a tmpfs file is slower than
      it was before conversion to the generic radix_tree: a little slower
      there will be acceptable, if the hotter paths are faster.
      
      But it was a shock to find swapoff of a 500MB file 20 times slower on my
      laptop, taking 10 minutes; and at that rate it significantly slows down
      my testing.
      
      Now, most of that turned out to be overhead from PROVE_LOCKING and
      PROVE_RCU: without those it was only 4 times slower than before; and
      more realistic tests on other machines don't fare as badly.
      
      I've tried a number of things to improve it, including tagging the swap
      entries, then doing lookup by tag: I'd expected that to halve the time,
      but in practice it's erratic, and often counter-productive.
      
      The only change I've so far found to make a consistent improvement, is
      to short-circuit the way we go back and forth, gang lookup packing
      entries into the array supplied, then shmem scanning that array for the
      target entry.  Scanning in place doubles the speed, so it's now only
      twice as slow as before (or three times slower when the PROVEs are on).
      
      So, add radix_tree_locate_item() as an expedient, once-off,
      single-caller hack to do the lookup directly in place.  #ifdef it on
      CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
      applicability as save space in other configurations.  And, sadly,
      #include sched.h for cond_resched().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e504f3fd
    • H
      radix_tree: exceptional entries and indices · 6328650b
      Hugh Dickins 提交于
      A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
      peculiar swap vector, instead keeping a file's swap entries in the same
      radix tree as its struct page pointers: thus saving memory, and
      simplifying its code and locking.
      
      This patch:
      
      The radix_tree is used by several subsystems for different purposes.  A
      major use is to store the struct page pointers of a file's pagecache for
      memory management.  But what if mm wanted to store something other than
      page pointers there too?
      
      The low bit of a radix_tree entry is already used to denote an indirect
      pointer, for internal use, and the unlikely radix_tree_deref_retry()
      case.
      
      Define the next bit as denoting an exceptional entry, and supply inline
      functions radix_tree_exception() to return non-0 in either unlikely
      case, and radix_tree_exceptional_entry() to return non-0 in the second
      case.
      
      If a subsystem already uses radix_tree with that bit set, no problem: it
      does not affect internal workings at all, but is defined for the
      convenience of those storing well-aligned pointers in the radix_tree.
      
      The radix_tree_gang_lookups have an implicit assumption that the caller
      can deduce the offset of each entry returned e.g.  by the page->index of
      a struct page.  But that may not be feasible for some kinds of item to
      be stored there.
      
      radix_tree_gang_lookup_slot() allow for an optional indices argument,
      output array in which to return those offsets.  The same could be added
      to other radix_tree_gang_lookups, but for now keep it to the only one
      for which we need it.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6328650b
  6. 26 1月, 2011 1 次提交
    • T
      radix_tree: radix_tree_gang_lookup_tag_slot() may never return · ac15ee69
      Toshiyuki Okajima 提交于
      Executed command: fsstress -d /mnt -n 600 -p 850
      
        crash> bt
        PID: 7947   TASK: ffff880160546a70  CPU: 0   COMMAND: "fsstress"
         #0 [ffff8800dfc07d00] machine_kexec at ffffffff81030db9
         #1 [ffff8800dfc07d70] crash_kexec at ffffffff810a7952
         #2 [ffff8800dfc07e40] oops_end at ffffffff814aa7c8
         #3 [ffff8800dfc07e70] die_nmi at ffffffff814aa969
         #4 [ffff8800dfc07ea0] do_nmi_callback at ffffffff8102b07b
         #5 [ffff8800dfc07f10] do_nmi at ffffffff814aa514
         #6 [ffff8800dfc07f50] nmi at ffffffff814a9d60
            [exception RIP: __lookup_tag+100]
            RIP: ffffffff812274b4  RSP: ffff88016056b998  RFLAGS: 00000287
            RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000006
            RDX: 000000000000001d  RSI: ffff88016056bb18  RDI: ffff8800c85366e0
            RBP: ffff88016056b9c8   R8: ffff88016056b9e8   R9: 0000000000000000
            R10: 000000000000000e  R11: ffff8800c8536908  R12: 0000000000000010
            R13: 0000000000000040  R14: ffffffffffffffc0  R15: ffff8800c85366e0
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
        <NMI exception stack>
         #7 [ffff88016056b998] __lookup_tag at ffffffff812274b4
         #8 [ffff88016056b9d0] radix_tree_gang_lookup_tag_slot at ffffffff81227605
         #9 [ffff88016056ba20] find_get_pages_tag at ffffffff810fc110
        #10 [ffff88016056ba80] pagevec_lookup_tag at ffffffff81105e85
        #11 [ffff88016056baa0] write_cache_pages at ffffffff81104c47
        #12 [ffff88016056bbd0] generic_writepages at ffffffff81105014
        #13 [ffff88016056bbe0] do_writepages at ffffffff81105055
        #14 [ffff88016056bbf0] __filemap_fdatawrite_range at ffffffff810fb2cb
        #15 [ffff88016056bc40] filemap_write_and_wait_range at ffffffff810fb32a
        #16 [ffff88016056bc70] generic_file_direct_write at ffffffff810fb3dc
        #17 [ffff88016056bce0] __generic_file_aio_write at ffffffff810fcee5
        #18 [ffff88016056bda0] generic_file_aio_write at ffffffff810fd085
        #19 [ffff88016056bdf0] do_sync_write at ffffffff8114f9ea
        #20 [ffff88016056bf00] vfs_write at ffffffff8114fcf8
        #21 [ffff88016056bf30] sys_write at ffffffff81150691
        #22 [ffff88016056bf80] system_call_fastpath at ffffffff8100c0b2
      
      I think this root cause is the following:
      
       radix_tree_range_tag_if_tagged() always tags the root tag with settag
       if the root tag is set with iftag even if there are no iftag tags
       in the specified range (Of course, there are some iftag tags
       outside the specified range).
      
      ===============================================================================
      [[[Detailed description]]]
      
      (1) Why cannot radix_tree_gang_lookup_tag_slot() return forever?
      
      __lookup_tag():
       - Return with 0.
       - Return with the index which is not bigger than the old one as the
         input parameter.
      
      Therefore the following "while" repeats forever because the above
      conditions cause "ret" not to be updated and the cur_index cannot be
      changed into the bigger one.
      
      (So, radix_tree_gang_lookup_tag_slot() cannot return forever.)
      
      radix_tree_gang_lookup_tag_slot():
      1178         while (ret < max_items) {
      1179                 unsigned int slots_found;
      1180                 unsigned long next_index;       /* Index of next search */
      1181
      1182                 if (cur_index > max_index)
      1183                         break;
      1184                 slots_found = __lookup_tag(node, results + ret,
      1185                                 cur_index, max_items - ret, &next_index,
      tag);
      1186                 ret += slots_found;
      			// cannot update ret because slots_found == 0.
      			// so, this while loops forever.
      1187                 if (next_index == 0)
      1188                         break;
      1189                 cur_index = next_index;
      1190         }
      
      (2) Why does __lookup_tag() return with 0 and doesn't update the index?
      
      Assuming the following:
        - the one of the slot in radix_tree_node is NULL.
        - the one of the tag which corresponds to the slot sets with
          PAGECACHE_TAG_TOWRITE or other.
        - In a certain height(!=0), the corresponding index is 0.
      
      a) __lookup_tag() notices that the tag is set.
      
      1005 static unsigned int
      1006 __lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index,
      1007         unsigned int max_items, unsigned long *next_index, unsigned int tag)
      1008 {
      1009         unsigned int nr_found = 0;
      1010         unsigned int shift, height;
      1011
      1012         height = slot->height;
      1013         if (height == 0)
      1014                 goto out;
      1015         shift = (height-1) * RADIX_TREE_MAP_SHIFT;
      1016
      1017         while (height > 0) {
      1018                 unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK ;
      1019
      1020                 for (;;) {
      1021                         if (tag_get(slot, tag, i))
      1022                                 break;
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      * the index is not updated yet.
      
      b) __lookup_tag() notices that the slot is NULL.
      
      1023                         index &= ~((1UL << shift) - 1);
      1024                         index += 1UL << shift;
      1025                         if (index == 0)
      1026                                 goto out;       /* 32-bit wraparound */
      1027                         i++;
      1028                         if (i == RADIX_TREE_MAP_SIZE)
      1029                                 goto out;
      1030                 }
      1031                 height--;
      1032                 if (height == 0) {      /* Bottom level: grab some items */
      ...
      1055                 }
      1056                 shift -= RADIX_TREE_MAP_SHIFT;
      1057                 slot = rcu_dereference_raw(slot->slots[i]);
      1058                 if (slot == NULL)
      1059                         break;
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      
      c) __lookup_tag() doesn't update the index and return with 0.
      
      1060         }
      1061 out:
      1062         *next_index = index;
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      1063         return nr_found;
      1064 }
      
      (3) Why is the slot NULL even if the tag is set?
      
      Because radix_tree_range_tag_if_tagged() always sets the root tag with
      PAGECACHE_TAG_TOWRITE if the root tag is set with PAGECACHE_TAG_DIRTY,
      even if there is no tag which can be set with PAGECACHE_TAG_TOWRITE
      in the specified range (from *first_indexp to last_index). Of course,
      some PAGECACHE_TAG_DIRTY nodes must exist outside the specified range.
      (radix_tree_range_tag_if_tagged() is called only from tag_pages_for_writeback())
      
       640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root
      *root,
       641                 unsigned long *first_indexp, unsigned long last_index,
       642                 unsigned long nr_to_tag,
       643                 unsigned int iftag, unsigned int settag)
       644 {
       645         unsigned int height = root->height;
       646         struct radix_tree_path path[height];
       647         struct radix_tree_path *pathp = path;
       648         struct radix_tree_node *slot;
       649         unsigned int shift;
       650         unsigned long tagged = 0;
       651         unsigned long index = *first_indexp;
       652
       653         last_index = min(last_index, radix_tree_maxindex(height));
       654         if (index > last_index)
       655                 return 0;
       656         if (!nr_to_tag)
       657                 return 0;
       658         if (!root_tag_get(root, iftag)) {
       659                 *first_indexp = last_index + 1;
       660                 return 0;
       661         }
       662         if (height == 0) {
       663                 *first_indexp = last_index + 1;
       664                 root_tag_set(root, settag);
       665                 return 1;
       666         }
      ...
       733         root_tag_set(root, settag);
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       734         *first_indexp = index;
       735
       736         return tagged;
       737 }
      
      As the result, there is no radix_tree_node which is set with
      PAGECACHE_TAG_TOWRITE but the root tag(radix_tree_root) is set with
      PAGECACHE_TAG_TOWRITE.
      
      [figure: inside radix_tree]
      (Please see the figure with typewriter font)
      ===========================================
                [roottag = DIRTY]
                       |             tag=0:NOTHING
               tag[0 0 0 1]              1:DIRTY
                  [x x x +]              2:WRITEBACK
                         |               3:DIRTY,WRITEBACK
                         p               4:TOWRITE
                   <--->                 5:DIRTY,TOWRITE ...
           specified range (index: 0 to 2)
      
      * There is no DIRTY tag within the specified range.
       (But there is a DIRTY tag outside that range.)
      
                  | | | | | | | | |
          after calling tag_pages_for_writeback()
                  | | | | | | | | |
                  v v v v v v v v v
      
                [roottag = DIRTY,TOWRITE]
                       |                 p is "page".
               tag[0 0 0 1]              x is NULL.
                  [x x x +]              +- is a pointer to "page".
                         |
                         p
      
      * But TOWRITE tag is set on the root tag.
      ============================================
      
      After that, radix_tree_extend() via radix_tree_insert() is called
      when the page is added.
      This function sets the new radix_tree_node with PAGECACHE_TAG_TOWRITE
      to succeed the status of the root tag.
      
       246 static int radix_tree_extend(struct radix_tree_root *root, unsigned long
      index)
       247 {
       248         struct radix_tree_node *node;
       249         unsigned int height;
       250         int tag;
       251
       252         /* Figure out what the height should be.  */
       253         height = root->height + 1;
       254         while (index > radix_tree_maxindex(height))
       255                 height++;
       256
       257         if (root->rnode == NULL) {
       258                 root->height = height;
       259                 goto out;
       260         }
       261
       262         do {
       263                 unsigned int newheight;
       264                 if (!(node = radix_tree_node_alloc(root)))
       265                         return -ENOMEM;
       266
       267                 /* Increase the height.  */
       268                 node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
       269
       270                 /* Propagate the aggregated tag info into the new root */
       271                 for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
       272                         if (root_tag_get(root, tag))
       273                                 tag_set(node, tag, 0);
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       274                 }
      
      ===========================================
                [roottag = DIRTY,TOWRITE]
                       |     :
               tag[0 0 0 1] [0 0 0 0]
                  [x x x +] [+ x x x]
                         |   |
                         p   p (new page)
      
                  | | | | | | | | |
          after calling radix_tree_insert
                  | | | | | | | | |
                  v v v v v v v v v
      
                [roottag = DIRTY,TOWRITE]
                       |
               tag [5 0 0 0]    *  DIRTY and TOWRITE tags are
                   [+ + x x]       succeeded to the new node.
                    | |
        tag [0 0 0 1] [0 0 0 0]
            [x x x +] [+ x x x]
                   |   |
                   p   p
      ============================================
      
      After that, the index 3 page is released by remove_from_page_cache().
      Then we can make the situation that the tag is set with PAGECACHE_TAG_TOWRITE
      and that the slot which corresponds to the tag is NULL.
      ===========================================
                [roottag = DIRTY,TOWRITE]
                       |
               tag [5 0 0 0]
                   [+ + x x]
                    | |
        tag [0 0 0 1] [0 0 0 0]
            [x x x +] [+ x x x]
                   |   |
                   p   p
               (remove)
      
                  | | | | | | | | |
          after calling remove_page_cache
                  | | | | | | | | |
                  v v v v v v v v v
      
                [roottag = DIRTY,TOWRITE]
                       |
               tag [4 0 0 0]      * Only DIRTY tag is cleared
                   [x + x x]        because no TOWRITE tag is existed
                      |             in the bottom node.
                      [0 0 0 0]
                      [+ x x x]
                       |
                       p
      ============================================
      
      To solve this problem
      
      Change to that radix_tree_tag_if_tagged() doesn't tag the root tag
      if it doesn't set any tags within the specified range.
      
      Like this.
      ============================================
       640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root
      *root,
       641                 unsigned long *first_indexp, unsigned long last_index,
       642                 unsigned long nr_to_tag,
       643                 unsigned int iftag, unsigned int settag)
       644 {
       650         unsigned long tagged = 0;
      ...
       733 	     if (tagged)
      ^^^^^^^^^^^^^^^^^^^^^^^^
       734            root_tag_set(root, settag);
       735         *first_indexp = index;
       736
       737         return tagged;
       738 }
      
      ============================================
      Signed-off-by: NToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac15ee69
  7. 12 11月, 2010 1 次提交
    • N
      radix-tree: fix RCU bug · 27d20fdd
      Nick Piggin 提交于
      Salman Qazi describes the following radix-tree bug:
      
      In the following case, we get can get a deadlock:
      
      0.  The radix tree contains two items, one has the index 0.
      1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
      2.  The reader acquires slot(s) for item(s) including the index 0 item.
      3.  The non-zero index item is deleted, and as a consequence the other item is
          moved to the root of the tree. The place where it used to be is queued for
          deletion after the readers finish.
      3b. The zero item is deleted, removing it from the direct slot, it remains in
          the rcu-delayed indirect node.
      4.  The reader looks at the index 0 slot, and finds that the page has 0 ref
          count
      5.  The reader looks at it again, hoping that the item will either be freed or
          the ref count will increase. This never happens, as the slot it is looking
          at will never be updated. Also, this slot can never be reclaimed because
          the reader is holding rcu_read_lock and is in an infinite loop.
      
      The fix is to re-use the same "indirect" pointer case that requires a slot
      lookup retry into a general "retry the lookup" bit.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      Reported-by: NSalman Qazi <sqazi@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27d20fdd
  8. 23 8月, 2010 2 次提交
    • D
      radix-tree: radix_tree_range_tag_if_tagged() can set incorrect tags · 144dcfc0
      Dave Chinner 提交于
      Commit ebf8aa44 ("radix-tree:
      omplement function radix_tree_range_tag_if_tagged") does not safely
      set tags on on intermediate tree nodes. The code walks down the tree
      setting tags before it has fully resolved the path to the leaf under
      the assumption there will be a leaf slot with the tag set in the
      range it is searching.
      
      Unfortunately, this is not a valid assumption - we can abort after
      setting a tag on an intermediate node if we overrun the number of
      tags we are allowed to set in a batch, or stop scanning because we
      we have passed the last scan index before we reach a leaf slot with
      the tag we are searching for set.
      
      As a result, we can leave the function with tags set on intemediate
      nodes which can be tripped over later by tag-based lookups. The
      result of these stale tags is that lookup may end prematurely or
      livelock because the lookup cannot make progress.
      
      The fix for the problem involves reocrding the traversal path we
      take to the leaf nodes, and only propagating the tags back up the
      tree once the tag is set in the leaf node slot. We are already
      recording the path for efficient traversal, so there is no
      additional overhead to do the intermediately node tag setting in
      this manner.
      
      This fixes a radix tree lookup livelock triggered by the new
      writeback sync livelock avoidance code introduced in commit
      f446daae ("mm: implement writeback
      livelock avoidance using page tagging").
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Acked-by: NJan Kara <jack@suse.cz>
      144dcfc0
    • D
      radix-tree: clear all tags in radix_tree_node_rcu_free · b6dd0865
      Dave Chinner 提交于
      Commit f446daae ("mm: implement
      writeback livelock avoidance using page tagging") introduced a new
      radix tree tag, increasing the number of tags in each node from 2 to
      3. It did not, however, fix up the code in
      radix_tree_node_rcu_free() that cleans up after radix_tree_shrink()
      and hence could leave stray tags set in the new tag array.
      
      The result is that the livelock avoidance code added in the the
      above commit would hit stale tags when doing tag based lookups,
      resulting in livelocks when trying to traverse the tree.
      
      Fix this problem in radix_tree_node_rcu_free() so it doesn't happen
      again in the future by using a loop to walk all the tags up to
      RADIX_TREE_MAX_TAGS to clear the stray tags radix_tree_shrink()
      leaves behind.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Acked-by: NNick Piggin <npiggin@kernel.dk>
      Acked-by: NJan Kara <jack@suse.cz>
      b6dd0865
  9. 21 8月, 2010 1 次提交
  10. 20 8月, 2010 1 次提交
  11. 10 8月, 2010 1 次提交
  12. 28 5月, 2010 1 次提交
  13. 10 4月, 2010 1 次提交
    • D
      radix_tree_tag_get() is not as safe as the docs make out [ver #2] · ce82653d
      David Howells 提交于
      radix_tree_tag_get() is not safe to use concurrently with radix_tree_tag_set()
      or radix_tree_tag_clear().  The problem is that the double tag_get() in
      radix_tree_tag_get():
      
      		if (!tag_get(node, tag, offset))
      			saw_unset_tag = 1;
      		if (height == 1) {
      			int ret = tag_get(node, tag, offset);
      
      may see the value change due to the action of set/clear.  RCU is no protection
      against this as no pointers are being changed, no nodes are being replaced
      according to a COW protocol - set/clear alter the node directly.
      
      The documentation in linux/radix-tree.h, however, says that
      radix_tree_tag_get() is an exception to the rule that "any function modifying
      the tree or tags (...) must exclude other modifications, and exclude any
      functions reading the tree".
      
      The problem is that the next statement in radix_tree_tag_get() checks that the
      tag doesn't vary over time:
      
      			BUG_ON(ret && saw_unset_tag);
      
      This has been seen happening in FS-Cache:
      
      	https://www.redhat.com/archives/linux-cachefs/2010-April/msg00013.html
      
      To this end, remove the BUG_ON() from radix_tree_tag_get() and note in various
      comments that the value of the tag may change whilst the RCU read lock is held,
      and thus that the return value of radix_tree_tag_get() may not be relied upon
      unless radix_tree_tag_set/clear() and radix_tree_delete() are excluded from
      running concurrently with it.
      Reported-by: NRomain DEGEZ <romain.degez@smartjog.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce82653d
  14. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  15. 25 2月, 2010 1 次提交
    • P
      radix-tree: Disable RCU lockdep checking in radix tree · 2676a58c
      Paul E. McKenney 提交于
      Because the radix tree is used with many different locking
      designs, we cannot do any effective checking without changing
      the radix-tree APIs. It might make sense to do this later, but
      only if the RCU lockdep checking proves itself sufficiently
      valuable.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-10-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2676a58c
  16. 20 11月, 2009 2 次提交
    • D
      FS-Cache: Don't delete pending pages from the page-store tracking tree · 285e728b
      David Howells 提交于
      Don't delete pending pages from the page-store tracking tree, but rather send
      them for another write as they've presumably been updated.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      285e728b
    • D
      FS-Cache: Use radix tree preload correctly in tracking of pages to be stored · b34df792
      David Howells 提交于
      __fscache_write_page() attempts to load the radix tree preallocation pool for
      the CPU it is on before calling radix_tree_insert(), as the insertion must be
      done inside a pair of spinlocks.
      
      Use of the preallocation pool, however, is contingent on the radix tree being
      initialised without __GFP_WAIT specified.  __fscache_acquire_cookie() was
      passing GFP_NOFS to INIT_RADIX_TREE() - but that includes __GFP_WAIT.
      
      The solution is to AND out __GFP_WAIT.
      
      Additionally, the banner comment to radix_tree_preload() is altered to make
      note of this prerequisite.  Possibly there should be a WARN_ON() too.
      
      Without this fix, I have seen the following recursive deadlock caused by
      radix_tree_insert() attempting to allocate memory inside the spinlocked
      region, which resulted in FS-Cache being called back into to release memory -
      which required the spinlock already held.
      
      =============================================
      [ INFO: possible recursive locking detected ]
      2.6.32-rc6-cachefs #24
      ---------------------------------------------
      nfsiod/7916 is trying to acquire lock:
       (&cookie->lock){+.+.-.}, at: [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache]
      
      but task is already holding lock:
       (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache]
      
      other info that might help us debug this:
      5 locks held by nfsiod/7916:
       #0:  (nfsiod){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2
       #1:  (&task->u.tk_work#2){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2
       #2:  (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache]
       #3:  (&object->lock#2){+.+.-.}, at: [<ffffffffa0076b07>] __fscache_write_page+0x197/0x3f3 [fscache]
       #4:  (&cookie->stores_lock){+.+...}, at: [<ffffffffa0076b0f>] __fscache_write_page+0x19f/0x3f3 [fscache]
      
      stack backtrace:
      Pid: 7916, comm: nfsiod Not tainted 2.6.32-rc6-cachefs #24
      Call Trace:
       [<ffffffff8105ac7f>] __lock_acquire+0x1649/0x16e3
       [<ffffffff81059ded>] ? __lock_acquire+0x7b7/0x16e3
       [<ffffffff8100e27d>] ? dump_trace+0x248/0x257
       [<ffffffff8105ad70>] lock_acquire+0x57/0x6d
       [<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache]
       [<ffffffff8135467c>] _spin_lock+0x2c/0x3b
       [<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache]
       [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache]
       [<ffffffffa0077eb7>] ? __fscache_check_page_write+0x0/0x71 [fscache]
       [<ffffffffa00b4755>] nfs_fscache_release_page+0x86/0xc4 [nfs]
       [<ffffffffa00907f0>] nfs_release_page+0x3c/0x41 [nfs]
       [<ffffffff81087ffb>] try_to_release_page+0x32/0x3b
       [<ffffffff81092c2b>] shrink_page_list+0x316/0x4ac
       [<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70
       [<ffffffff8135451b>] ? _spin_unlock_irq+0x2b/0x31
       [<ffffffff81093153>] shrink_inactive_list+0x392/0x67c
       [<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70
       [<ffffffff810934ca>] shrink_list+0x8d/0x8f
       [<ffffffff81093744>] shrink_zone+0x278/0x33c
       [<ffffffff81052c70>] ? ktime_get_ts+0xad/0xba
       [<ffffffff8109453b>] try_to_free_pages+0x22e/0x392
       [<ffffffff8109184c>] ? isolate_pages_global+0x0/0x212
       [<ffffffff8108e16b>] __alloc_pages_nodemask+0x3dc/0x5cf
       [<ffffffff810ae24a>] cache_alloc_refill+0x34d/0x6c1
       [<ffffffff811bcf74>] ? radix_tree_node_alloc+0x52/0x5c
       [<ffffffff810ae929>] kmem_cache_alloc+0xb2/0x118
       [<ffffffff811bcf74>] radix_tree_node_alloc+0x52/0x5c
       [<ffffffff811bcfd5>] radix_tree_insert+0x57/0x19c
       [<ffffffffa0076b53>] __fscache_write_page+0x1e3/0x3f3 [fscache]
       [<ffffffffa00b4248>] __nfs_readpage_to_fscache+0x58/0x11e [nfs]
       [<ffffffffa009bb77>] nfs_readpage_release+0x34/0x9b [nfs]
       [<ffffffffa009c0d9>] nfs_readpage_release_full+0x32/0x4b [nfs]
       [<ffffffffa0006cff>] rpc_release_calldata+0x12/0x14 [sunrpc]
       [<ffffffffa0006e2d>] rpc_free_task+0x59/0x61 [sunrpc]
       [<ffffffffa0006f03>] rpc_async_release+0x10/0x12 [sunrpc]
       [<ffffffff810482e5>] worker_thread+0x1ef/0x2e2
       [<ffffffff81048290>] ? worker_thread+0x19a/0x2e2
       [<ffffffff81352433>] ? thread_return+0x3e/0x101
       [<ffffffffa0006ef3>] ? rpc_async_release+0x0/0x12 [sunrpc]
       [<ffffffff8104bff5>] ? autoremove_wake_function+0x0/0x34
       [<ffffffff81058d25>] ? trace_hardirqs_on+0xd/0xf
       [<ffffffff810480f6>] ? worker_thread+0x0/0x2e2
       [<ffffffff8104bd21>] kthread+0x7a/0x82
       [<ffffffff8100beda>] child_rip+0xa/0x20
       [<ffffffff8100b87c>] ? restore_args+0x0/0x30
       [<ffffffff8104c2b9>] ? add_wait_queue+0x15/0x44
       [<ffffffff8104bca7>] ? kthread+0x0/0x82
       [<ffffffff8100bed0>] ? child_rip+0x0/0x20
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      b34df792
  17. 17 6月, 2009 2 次提交
  18. 07 1月, 2009 1 次提交
  19. 06 1月, 2009 1 次提交
  20. 27 7月, 2008 2 次提交
  21. 05 7月, 2008 1 次提交
  22. 13 6月, 2008 1 次提交
    • N
      radix-tree: fix small lockless radix-tree bug · 643b52b9
      Nick Piggin 提交于
      We shrink a radix tree when its root node has only one child, in the left
      most slot.  The child becomes the new root node.  To perform this
      operation in a manner compatible with concurrent lockless lookups, we
      atomically switch the root pointer from the parent to its child.
      
      However a concurrent lockless lookup may now have loaded a pointer to the
      parent (and is presently deciding what to do next).  For this reason, we
      also have to keep the parent node in a valid state after shrinking the
      tree, until the next RCU grace period -- otherwise this lookup with the
      parent pointer may not do the right thing.  Notably, we need to keep the
      child in the left most slot there in case that is requested by the lookup.
      
      This is all pretty standard RCU stuff.  It is worth repeating because in
      my eagerness to obey the radix tree node constructor scheme, I had broken
      it by zeroing the radix tree node before the grace period.
      
      What could happen is that a lookup can load the parent pointer, then
      decide it wants to follow the left most child slot, only to find the slot
      contained NULL due to the concurrent shrinker having zeroed the parent
      node before waiting for a grace period.  The lookup would return a false
      negative as a result.
      
      Fix it by doing that clearing in the RCU callback.  I would normally want
      to rip out the constructor entirely, but radix tree nodes are one of those
      places where they make sense (only few cachelines will be touched soon
      after allocation).
      
      This was never actually found in any lockless pagecache testing or by the
      test harness, but by seeing the odd problem with my scalable vmap rewrite.
       I have not tickled the test harness into reproducing it yet, but I'll
      keep working at it.
      
      Fortunately, it is not a problem anywhere lockless pagecache is used in
      mainline kernels (pagecache probe is not a guarantee, and brd does not
      have concurrent lookups and deletes).
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      643b52b9
  23. 28 4月, 2008 1 次提交
    • C
      Remove set_migrateflags() · 488514d1
      Christoph Lameter 提交于
      Migrate flags must be set on slab creation as agreed upon when the antifrag
      logic was reviewed.  Otherwise some slabs of a slabcache will end up in the
      unmovable and others in the reclaimable section depending on which flag was
      active when a new slab page was allocated.
      
      This likely slid in somehow when antifrag was merged. Remove it.
      
      The buffer_heads are always allocated with __GFP_RECLAIMABLE because the
      SLAB_RECLAIM_ACCOUNT option is set.  The set_migrateflags() never had any
      effect there.
      
      Radix tree allocations are not directly reclaimable but they are allocated
      with __GFP_RECLAIMABLE set on each allocation.  We now set
      SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix
      tree slabs are consistently placed in the reclaimable section.  Radix tree
      slabs will also be accounted as such.
      
      There is then no user left of set_migratepages. So remove it.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      488514d1
  24. 06 2月, 2008 1 次提交
    • N
      radix-tree: avoid atomic allocations for preloaded insertions · e2848a0e
      Nick Piggin 提交于
      Most pagecache (and some other) radix tree insertions have the great
      opportunity to preallocate a few nodes with relaxed gfp flags.  But the
      preallocation is squandered when it comes time to allocate a node, we
      default to first attempting a GFP_ATOMIC allocation -- that doesn't
      normally fail, but it can eat into atomic memory reserves that we don't
      need to be using.
      
      Another upshot of this is that it removes the sometimes highly contended
      zone->lock from underneath tree_lock.  Pagecache insertions are always
      performed with a radix tree preload, and after this change, such a
      situation will never fall back to kmem_cache_alloc within
      radix_tree_node_alloc.
      
      David Miller reports seeing this allocation fail on a highly threaded
      sparc64 system:
      
      [527319.459981] dd: page allocation failure. order:0, mode:0x20
      [527319.460403] Call Trace:
      [527319.460568]  [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
      [527319.460636]  [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
      [527319.460698]  [000000000055309c] radix_tree_node_alloc+0x20/0x90
      [527319.460763]  [0000000000553238] radix_tree_insert+0x12c/0x260
      [527319.460830]  [0000000000495cd0] add_to_page_cache+0x38/0xb0
      [527319.460893]  [00000000004e4794] mpage_readpages+0x6c/0x134
      [527319.460955]  [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
      [527319.461028]  [000000000049cc88] ondemand_readahead+0x208/0x214
      [527319.461094]  [0000000000496018] do_generic_mapping_read+0xe8/0x428
      [527319.461152]  [0000000000497948] generic_file_aio_read+0x108/0x170
      [527319.461217]  [00000000004badac] do_sync_read+0x88/0xd0
      [527319.461292]  [00000000004bb5cc] vfs_read+0x78/0x10c
      [527319.461361]  [00000000004bb920] sys_read+0x34/0x60
      [527319.461424]  [0000000000406294] linux_sparc_syscall32+0x3c/0x40
      
      The calltrace is significant: __do_page_cache_readahead allocates a number
      of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
      memory to satisfy GFP_ATOMIC allocations.  However after the list of pages
      goes to mpage_readpages, there can be significant intervals (including disk
      IO) before all the pages are inserted into the radix-tree.  So the reserves
      can easily be depleted at that point.  The patch is confirmed to fix the
      problem.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2848a0e
  25. 17 10月, 2007 6 次提交
    • P
      avoid negative (and full-width) shifts in radix-tree.c · 430d275a
      Peter Lund 提交于
      Negative shifts are not allowed in C (the result is undefined).  Same thing
      with full-width shifts.
      
      It works on most platforms but not on the VAX with gcc 4.0.1 (it results in an
      "operand reserved" fault).
      
      Shifting by more than the width of the value on the left is also not
      allowed.  I think the extra '>> 1' tacked on at the end in the original
      code was an attempt to work around that.  Getting rid of that is an extra
      feature of this patch.
      
      Here's the chapter and verse, taken from the final draft of the C99
      standard ("6.5.7 Bitwise shift operators", paragraph 3):
      
        "The integer promotions are performed on each of the operands. The
        type of the result is that of the promoted left operand. If the
        value of the right operand is negative or is greater than or equal
        to the width of the promoted left operand, the behavior is
        undefined."
      
      Thank you to Jan-Benedict Glaw, Christoph Hellwig, Maciej Rozycki, Pekka
      Enberg, Andreas Schwab, and Christoph Lameter for review.  Special thanks
      to Andreas for spotting that my fix only removed half the undefined
      behaviour.
      Signed-off-by: NPeter Lund <firefly@vax64.dk>
      Christoph Lameter <clameter@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Andreas Schwab <schwab@suse.de>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      430d275a
    • C
      Slab API: remove useless ctor parameter and reorder parameters · 4ba9b9d0
      Christoph Lameter 提交于
      Slab constructors currently have a flags parameter that is never used.  And
      the order of the arguments is opposite to other slab functions.  The object
      pointer is placed before the kmem_cache pointer.
      
      Convert
      
              ctor(void *object, struct kmem_cache *s, unsigned long flags)
      
      to
      
              ctor(struct kmem_cache *s, void *object)
      
      throughout the kernel
      
      [akpm@linux-foundation.org: coupla fixes]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4ba9b9d0
    • M
      Group short-lived and reclaimable kernel allocations · e12ba74d
      Mel Gorman 提交于
      This patch marks a number of allocations that are either short-lived such as
      network buffers or are reclaimable such as inode allocations.  When something
      like updatedb is called, long-lived and unmovable kernel allocations tend to
      be spread throughout the address space which increases fragmentation.
      
      This patch groups these allocations together as much as possible by adding a
      new MIGRATE_TYPE.  The MIGRATE_RECLAIMABLE type is for allocations that can be
      reclaimed on demand, but not moved.  i.e.  they can be migrated by deleting
      them and re-reading the information from elsewhere.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e12ba74d
    • J
      fix the max path calculation in radix-tree.c · 26fb1589
      Jeff Moyer 提交于
      A while back, Nick Piggin introduced a patch to reduce the node memory
      usage for small files (commit cfd9b7df):
      
      -#define RADIX_TREE_MAP_SHIFT	6
      +#define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
      
      Unfortunately, he didn't take into account the fact that the
      calculation of the maximum path was based on an assumption of having
      to round up:
      
      #define RADIX_TREE_MAX_PATH (RADIX_TREE_INDEX_BITS/RADIX_TREE_MAP_SHIFT + 2)
      
      So, if CONFIG_BASE_SMALL is set, you will end up with a
      RADIX_TREE_MAX_PATH that is one greater than necessary.  The practical
      upshot of this is just a bit of wasted memory (one long in the
      height_to_maxindex array, an extra pre-allocated radix tree node per
      cpu, and extra stack usage in a couple of functions), but it seems
      worth getting right.
      
      It's also worth noting that I never build with CONFIG_BASE_SMALL.
      What I did to test this was duplicate the code in a small user-space
      program and check the results of the calculations for max path and the
      contents of the height_to_maxindex array.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NNick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26fb1589
    • N
      radix-tree: use indirect bit · c0bc9875
      Nick Piggin 提交于
      Rather than sign direct radix-tree pointers with a special bit, sign the
      indirect one that hangs off the root.  This means that, given a lookup_slot
      operation, the invalid result will be differentiated from the valid
      (previously, valid results could have the bit either set or clear).
      
      This does not affect slot lookups which occur under lock -- they can never
      return an invalid result.  Is needed in future for lockless pagecache.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0bc9875
    • F
      radixtree: introduce radix_tree_next_hole() · 6df8ba4f
      Fengguang Wu 提交于
      Introduce radix_tree_next_hole(root, index, max_scan) to scan radix tree for
      the first hole.  It will be used in interleaved readahead.
      
      The implementation is dumb and obviously correct.  It can help debug(and
      document) the possible smart one in future.
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6df8ba4f
  26. 20 7月, 2007 1 次提交
    • P
      mm: Remove slab destructors from kmem_cache_create(). · 20c2df83
      Paul Mundt 提交于
      Slab destructors were no longer supported after Christoph's
      c59def9f change. They've been
      BUGs for both slab and slub, and slob never supported them
      either.
      
      This rips out support for the dtor pointer from kmem_cache_create()
      completely and fixes up every single callsite in the kernel (there were
      about 224, not including the slab allocator definitions themselves,
      or the documentation references).
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      20c2df83
  27. 14 7月, 2007 1 次提交
  28. 10 5月, 2007 1 次提交
    • R
      Add suspend-related notifications for CPU hotplug · 8bb78442
      Rafael J. Wysocki 提交于
      Since nonboot CPUs are now disabled after tasks and devices have been
      frozen and the CPU hotplug infrastructure is used for this purpose, we need
      special CPU hotplug notifications that will help the CPU-hotplug-aware
      subsystems distinguish normal CPU hotplug events from CPU hotplug events
      related to a system-wide suspend or resume operation in progress.  This
      patch introduces such notifications and causes them to be used during
      suspend and resume transitions.  It also changes all of the
      CPU-hotplug-aware subsystems to take these notifications into consideration
      (for now they are handled in the same way as the corresponding "normal"
      ones).
      
      [oleg@tv-sign.ru: cleanups]
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8bb78442
  29. 08 12月, 2006 2 次提交
    • I
      [PATCH] hotplug CPU: clean up hotcpu_notifier() use · 02316067
      Ingo Molnar 提交于
      There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
      prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
      generating compiler warnings of unused symbols, hence forcing people to add
      #ifdefs.
      
      the compiler can skip truly unused functions just fine:
      
          text    data     bss     dec     hex filename
       1624412  728710 3674856 6027978  5bfaca vmlinux.before
       1624412  728710 3674856 6027978  5bfaca vmlinux.after
      
      [akpm@osdl.org: topology.c fix]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      02316067
    • N
      [PATCH] radix-tree: RCU lockless readside · 7cf9c2c7
      Nick Piggin 提交于
      Make radix tree lookups safe to be performed without locks.  Readers are
      protected against nodes being deleted by using RCU based freeing.  Readers
      are protected against new node insertion by using memory barriers to ensure
      the node itself will be properly written before it is visible in the radix
      tree.
      
      Each radix tree node keeps a record of their height (above leaf nodes).
      This height does not change after insertion -- when the radix tree is
      extended, higher nodes are only inserted in the top.  So a lookup can take
      the pointer to what is *now* the root node, and traverse down it even if
      the tree is concurrently extended and this node becomes a subtree of a new
      root.
      
      "Direct" pointers (tree height of 0, where root->rnode points directly to
      the data item) are handled by using the low bit of the pointer to signal
      whether rnode is a direct pointer or a pointer to a radix tree node.
      
      When a reader wants to traverse the next branch, they will take a copy of
      the pointer.  This pointer will be either NULL (and the branch is empty) or
      non-NULL (and will point to a valid node).
      
      [akpm@osdl.org: cleanups]
      [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
      [clameter@sgi.com: build fix]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7cf9c2c7