1. 18 12月, 2012 6 次提交
    • O
      percpu_rw_semaphore: add lockdep annotations · 8ebe3473
      Oleg Nesterov 提交于
      Add lockdep annotations.  Not only this can help to find the potential
      problems, we do not want the false warnings if, say, the task takes two
      different percpu_rw_semaphore's for reading.  IOW, at least ->rw_sem
      should not use a single class.
      
      This patch exposes this internal lock to lockdep so that it represents the
      whole percpu_rw_semaphore.  This way we do not need to add another "fake"
      ->lockdep_map and lock_class_key.  More importantly, this also makes the
      output from lockdep much more understandable if it finds the problem.
      
      In short, with this patch from lockdep pov percpu_down_read() and
      percpu_up_read() acquire/release ->rw_sem for reading, this matches the
      actual semantics.  This abuses __up_read() but I hope this is fine and in
      fact I'd like to have down_read_no_lockdep() as well,
      percpu_down_read_recursive_readers() will need it.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ebe3473
    • O
      percpu_rw_semaphore: kill ->writer_mutex, add ->write_ctr · 9390ef0c
      Oleg Nesterov 提交于
      percpu_rw_semaphore->writer_mutex was only added to simplify the initial
      rewrite, the only thing it protects is clear_fast_ctr() which otherwise
      could be called by multiple writers.  ->rw_sem is enough to serialize the
      writers.
      
      Kill this mutex and add "atomic_t write_ctr" instead.  The writers
      increment/decrement this counter, the readers check it is zero instead of
      mutex_is_locked().
      
      Move atomic_add(clear_fast_ctr(), slow_read_ctr) under down_write() to
      avoid the race with other writers.  This is a bit sub-optimal, only the
      first writer needs this and we do not need to exclude the readers at this
      stage.  But this is simple, we do not want another internal lock until we
      add more features.
      
      And this speeds up the write-contended case.  Before this patch the racing
      writers sleep in synchronize_sched_expedited() sequentially, with this
      patch multiple synchronize_sched_expedited's can "overlap" with each
      other.  Note: we can do more optimizations, this is only the first step.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9390ef0c
    • O
      percpu_rw_semaphore: reimplement to not block the readers unnecessarily · a1fd3e24
      Oleg Nesterov 提交于
      Currently the writer does msleep() plus synchronize_sched() 3 times to
      acquire/release the semaphore, and during this time the readers are
      blocked completely.  Even if the "write" section was not actually started
      or if it was already finished.
      
      With this patch down_write/up_write does synchronize_sched() twice and
      down_read/up_read are still possible during this time, just they use the
      slow path.
      
      percpu_down_write() first forces the readers to use rw_semaphore and
      increment the "slow" counter to take the lock for reading, then it
      takes that rw_semaphore for writing and blocks the readers.
      
      Also.  With this patch the code relies on the documented behaviour of
      synchronize_sched(), it doesn't try to pair synchronize_sched() with
      barrier.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a1fd3e24
    • J
      sscanf: don't ignore field widths for numeric conversions · 53809751
      Jan Beulich 提交于
      This is another step towards better standard conformance.  Rather than
      adding a local buffer to store the specified portion of the string (with
      the need to enforce an arbitrary maximum supported width to limit the
      buffer size), do a maximum width conversion and then drop as much of it as
      is necessary to meet the caller's request.
      
      Also fail on negative field widths.
      
      Uses the deprecated simple_strto*() functions because kstrtoXX() fail on
      non-zero terminated strings.
      Signed-off-by: NJan Beulich <jbeulich@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53809751
    • A
      lib: dynamic_debug: use kbasename() · 35367ab2
      Andy Shevchenko 提交于
      Remove the custom implementation of the functionality similar to kbasename().
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35367ab2
    • J
      lib/vsprintf.c: fix handling of %zd when using ssize_t · ef124960
      Jason Gunthorpe 提交于
      Documentation/printk-formats.txt says to use %zd for a ssize_t argument
      and some drivers do.  Unfortunately this prints a positive number for
      negative values eg:
      
        tpm_tis 70030000.tpm_tis: tpm_transmit: tpm_send: error 4294967234
      
      Add a case to va_args a ssize_t type if the interpretation should be
      signed.
      
      Tested on PPC32.
      Signed-off-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef124960
  2. 12 12月, 2012 1 次提交
    • J
      bootmem: fix wrong call parameter for free_bootmem() · 81df9bff
      Joonsoo Kim 提交于
      It is strange that alloc_bootmem() returns a virtual address and
      free_bootmem() requires a physical address.  Anyway, free_bootmem()'s
      first parameter should be physical address.
      
      There are some call sites for free_bootmem() with virtual address.  So fix
      them.
      
      [akpm@linux-foundation.org: improve free_bootmem() and free_bootmem_pate() documentation]
      Signed-off-by: NJoonsoo Kim <js1304@gmail.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81df9bff
  3. 06 12月, 2012 2 次提交
    • N
      propagate name change to comments in kernel source · 6d49e352
      Nadia Yvette Chambers 提交于
      I've legally changed my name with New York State, the US Social Security
      Administration, et al. This patch propagates the name change and change
      in initials and login to comments in the kernel source as well.
      Signed-off-by: NNadia Yvette Chambers <nyc@holomorphy.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      6d49e352
    • T
      lib/Makefile: Fix oid_registry build dependency · 527897cc
      Tim Gardner 提交于
      It is $(obj)/oid_registry.o that is dependent on $(obj)/oid_registry_data.c.
      The object file cannot be built until $(obj)/oid_registry_data.c has been
      generated.
      
      A periodic and hard to reproduce parallel build failure is due to
      this incorrect lib/Makefile dependency. The compile error is completely
      disingenuous.
      
        GEN     lib/oid_registry_data.c
      Compiling 49 OIDs
        CC      lib/oid_registry.o
      gcc: error: lib/oid_registry.c: No such file or directory
      gcc: fatal error: no input files
      compilation terminated.
      make[3]: *** [lib/oid_registry.o] Error 4
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NTim Gardner <tim.gardner@canonical.com>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      527897cc
  4. 05 12月, 2012 1 次提交
  5. 03 12月, 2012 1 次提交
  6. 29 11月, 2012 1 次提交
  7. 24 11月, 2012 1 次提交
  8. 14 11月, 2012 1 次提交
  9. 30 10月, 2012 7 次提交
  10. 26 10月, 2012 1 次提交
    • T
      genalloc: stop crashing the system when destroying a pool · eedce141
      Thadeu Lima de Souza Cascardo 提交于
      The genalloc code uses the bitmap API from include/linux/bitmap.h and
      lib/bitmap.c, which is based on long values.  Both bitmap_set from
      lib/bitmap.c and bitmap_set_ll, which is the lockless version from
      genalloc.c, use BITMAP_LAST_WORD_MASK to set the first bits in a long in
      the bitmap.
      
      That one uses (1 << bits) - 1, 0b111, if you are setting the first three
      bits.  This means that the API counts from the least significant bits
      (LSB from now on) to the MSB.  The LSB in the first long is bit 0, then.
      The same works for the lookup functions.
      
      The genalloc code uses longs for the bitmap, as it should.  In
      include/linux/genalloc.h, struct gen_pool_chunk has unsigned long
      bits[0] as its last member.  When allocating the struct, genalloc should
      reserve enough space for the bitmap.  This should be a proper number of
      longs that can fit the amount of bits in the bitmap.
      
      However, genalloc allocates an integer number of bytes that fit the
      amount of bits, but may not be an integer amount of longs.  9 bytes, for
      example, could be allocated for 70 bits.
      
      This is a problem in itself if the Least Significat Bit in a long is in
      the byte with the largest address, which happens in Big Endian machines.
      This means genalloc is not allocating the byte in which it will try to
      set or check for a bit.
      
      This may end up in memory corruption, where genalloc will try to set the
      bits it has not allocated.  In fact, genalloc may not set these bits
      because it may find them already set, because they were not zeroed since
      they were not allocated.  And that's what causes a BUG when
      gen_pool_destroy is called and check for any set bits.
      
      What really happens is that genalloc uses kmalloc_node with __GFP_ZERO
      on gen_pool_add_virt.  With SLAB and SLUB, this means the whole slab
      will be cleared, not only the requested bytes.  Since struct
      gen_pool_chunk has a size that is a multiple of 8, and slab sizes are
      multiples of 8, we get lucky and allocate and clear the right amount of
      bytes.
      
      Hower, this is not the case with SLOB or with older code that did memset
      after allocating instead of using __GFP_ZERO.
      
      So, a simple module as this (running 3.6.0), will cause a crash when
      rmmod'ed.
      
        [root@phantom-lp2 foo]# cat foo.c
        #include <linux/kernel.h>
        #include <linux/module.h>
        #include <linux/init.h>
        #include <linux/genalloc.h>
      
        MODULE_LICENSE("GPL");
        MODULE_VERSION("0.1");
      
        static struct gen_pool *foo_pool;
      
        static __init int foo_init(void)
        {
                int ret;
                foo_pool = gen_pool_create(10, -1);
                if (!foo_pool)
                        return -ENOMEM;
                ret = gen_pool_add(foo_pool, 0xa0000000, 32 << 10, -1);
                if (ret) {
                        gen_pool_destroy(foo_pool);
                        return ret;
                }
                return 0;
        }
      
        static __exit void foo_exit(void)
        {
                gen_pool_destroy(foo_pool);
        }
      
        module_init(foo_init);
        module_exit(foo_exit);
        [root@phantom-lp2 foo]# zcat /proc/config.gz | grep SLOB
        CONFIG_SLOB=y
        [root@phantom-lp2 foo]# insmod ./foo.ko
        [root@phantom-lp2 foo]# rmmod foo
        ------------[ cut here ]------------
        kernel BUG at lib/genalloc.c:243!
        cpu 0x4: Vector: 700 (Program Check) at [c0000000bb0e7960]
            pc: c0000000003cb50c: .gen_pool_destroy+0xac/0x110
            lr: c0000000003cb4fc: .gen_pool_destroy+0x9c/0x110
            sp: c0000000bb0e7be0
           msr: 8000000000029032
          current = 0xc0000000bb0e0000
          paca    = 0xc000000006d30e00   softe: 0        irq_happened: 0x01
            pid   = 13044, comm = rmmod
        kernel BUG at lib/genalloc.c:243!
        [c0000000bb0e7ca0] d000000004b00020 .foo_exit+0x20/0x38 [foo]
        [c0000000bb0e7d20] c0000000000dff98 .SyS_delete_module+0x1a8/0x290
        [c0000000bb0e7e30] c0000000000097d4 syscall_exit+0x0/0x94
        --- Exception: c00 (System Call) at 000000800753d1a0
        SP (fffd0b0e640) is in userspace
      Signed-off-by: NThadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Benjamin Gaignard <benjamin.gaignard@stericsson.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eedce141
  11. 20 10月, 2012 1 次提交
  12. 11 10月, 2012 1 次提交
  13. 10 10月, 2012 1 次提交
  14. 09 10月, 2012 15 次提交
    • M
      mm: add CONFIG_DEBUG_VM_RB build option · ed8ea815
      Michel Lespinasse 提交于
      Add a CONFIG_DEBUG_VM_RB build option for the previously existing
      DEBUG_MM_RB code.  Now that Andi Kleen modified it to avoid using
      recursive algorithms, we can expose it a bit more.
      
      Also extend this code to validate_mm() after stack expansion, and to check
      that the vma's start and last pgoffs have not changed since the nodes were
      inserted on the anon vma interval tree (as it is important that the nodes
      be reindexed after each such update).
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed8ea815
    • M
      mm: interval tree updates · 9826a516
      Michel Lespinasse 提交于
      Update the generic interval tree code that was introduced in "mm: replace
      vma prio_tree with an interval tree".
      
      Changes:
      
      - fixed 'endpoing' typo noticed by Andrew Morton
      
      - replaced include/linux/interval_tree_tmpl.h, which was used as a
        template (including it automatically defined the interval tree
        functions) with include/linux/interval_tree_generic.h, which only
        defines a preprocessor macro INTERVAL_TREE_DEFINE(), which itself
        defines the interval tree functions when invoked. Now that is a very
        long macro which is unfortunate, but it does make the usage sites
        (lib/interval_tree.c and mm/interval_tree.c) a bit nicer than previously.
      
      - make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro,
        instead of duplicating that code in the interval tree template.
      
      - replaced vma_interval_tree_add(), which was actually handling the
        nonlinear and interval tree cases, with vma_interval_tree_insert_after()
        which handles only the interval tree case and has an API that is more
        consistent with the other interval tree handling functions.
        The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap().
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9826a516
    • M
      rbtree: move augmented rbtree functionality to rbtree_augmented.h · 9c079add
      Michel Lespinasse 提交于
      Provide rb_insert_augmented() and rb_erase_augmented() through a new
      rbtree_augmented.h include file.  rb_erase_augmented() is defined there as
      an __always_inline function, in order to allow inlining of augmented
      rbtree callbacks into it.  Since this generates a relatively large
      function, each augmented rbtree user should make sure to have a single
      call site.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c079add
    • M
      prio_tree: remove · 147e615f
      Michel Lespinasse 提交于
      After both prio_tree users have been converted to use red-black trees,
      there is no need to keep around the prio tree library anymore.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      147e615f
    • M
      mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse 提交于
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b2dbba8
    • M
      rbtree: add prio tree and interval tree tests · fff3fd8a
      Michel Lespinasse 提交于
      Patch 1 implements support for interval trees, on top of the augmented
      rbtree API. It also adds synthetic tests to compare the performance of
      interval trees vs prio trees. Short answers is that interval trees are
      slightly faster (~25%) on insert/erase, and much faster (~2.4 - 3x)
      on search. It is debatable how realistic the synthetic test is, and I have
      not made such measurements yet, but my impression is that interval trees
      would still come out faster.
      
      Patch 2 uses a preprocessor template to make the interval tree generic,
      and uses it as a replacement for the vma prio_tree.
      
      Patch 3 takes the other prio_tree user, kmemleak, and converts it to use
      a basic rbtree. We don't actually need the augmented rbtree support here
      because the intervals are always non-overlapping.
      
      Patch 4 removes the now-unused prio tree library.
      
      Patch 5 proposes an additional optimization to rb_erase_augmented, now
      providing it as an inline function so that the augmented callbacks can be
      inlined in. This provides an additional 5-10% performance improvement
      for the interval tree insert/erase benchmark. There is a maintainance cost
      as it exposes augmented rbtree users to some of the rbtree library internals;
      however I think this cost shouldn't be too high as I expect the augmented
      rbtree will always have much less users than the base rbtree.
      
      I should probably add a quick summary of why I think it makes sense to
      replace prio trees with augmented rbtree based interval trees now.  One of
      the drivers is that we need augmented rbtrees for Rik's vma gap finding
      code, and once you have them, it just makes sense to use them for interval
      trees as well, as this is the simpler and more well known algorithm.  prio
      trees, in comparison, seem *too* clever: they impose an additional 'heap'
      constraint on the tree, which they use to guarantee a faster worst-case
      complexity of O(k+log N) for stabbing queries in a well-balanced prio
      tree, vs O(k*log N) for interval trees (where k=number of matches,
      N=number of intervals).  Now this sounds great, but in practice prio trees
      don't realize this theorical benefit.  First, the additional constraint
      makes them harder to update, so that the kernel implementation has to
      simplify things by balancing them like a radix tree, which is not always
      ideal.  Second, the fact that there are both index and heap properties
      makes both tree manipulation and search more complex, which results in a
      higher multiplicative time constant.  As it turns out, the simple interval
      tree algorithm ends up running faster than the more clever prio tree.
      
      This patch:
      
      Add two test modules:
      
      - prio_tree_test measures the performance of lib/prio_tree.c, both for
        insertion/removal and for stabbing searches
      
      - interval_tree_test measures the performance of a library of equivalent
        functionality, built using the augmented rbtree support.
      
      In order to support the second test module, lib/interval_tree.c is
      introduced. It is kept separate from the interval_tree_test main file
      for two reasons: first we don't want to provide an unfair advantage
      over prio_tree_test by having everything in a single compilation unit,
      and second there is the possibility that the interval tree functionality
      could get some non-test users in kernel over time.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fff3fd8a
    • M
      rbtree: add RB_DECLARE_CALLBACKS() macro · 3908836a
      Michel Lespinasse 提交于
      As proposed by Peter Zijlstra, this makes it easier to define the augmented
      rbtree callbacks.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3908836a
    • M
      rbtree: remove prior augmented rbtree implementation · 9d9e6f97
      Michel Lespinasse 提交于
      convert arch/x86/mm/pat_rbtree.c to the proposed augmented rbtree api
      and remove the old augmented rbtree implementation.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d9e6f97
    • M
      rbtree: faster augmented rbtree manipulation · 14b94af0
      Michel Lespinasse 提交于
      Introduce new augmented rbtree APIs that allow minimal recalculation of
      augmented node information.
      
      A new callback is added to the rbtree insertion and erase rebalancing
      functions, to be called on each tree rotations. Such rotations preserve
      the subtree's root augmented value, but require recalculation of the one
      child that was previously located at the subtree root.
      
      In the insertion case, the handcoded search phase must be updated to
      maintain the augmented information on insertion, and then the rbtree
      coloring/rebalancing algorithms keep it up to date.
      
      In the erase case, things are more complicated since it is library
      code that manipulates the rbtree in order to remove internal nodes.
      This requires a couple additional callbacks to copy a subtree's
      augmented value when a new root is stitched in, and to recompute
      augmented values down the ancestry path when a node is removed from
      the tree.
      
      In order to preserve maximum speed for the non-augmented case,
      we provide two versions of each tree manipulation function.
      rb_insert_augmented() is the augmented equivalent of rb_insert_color(),
      and rb_erase_augmented() is the augmented equivalent of rb_erase().
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      14b94af0
    • M
      rbtree: augmented rbtree test · dadf9353
      Michel Lespinasse 提交于
      Small test to measure the performance of augmented rbtrees.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dadf9353
    • M
      rbtree: low level optimizations in rb_erase() · 4f035ad6
      Michel Lespinasse 提交于
      Various minor optimizations in rb_erase():
      - Avoid multiple loading of node->__rb_parent_color when computing parent
        and color information (possibly not in close sequence, as there might
        be further branches in the algorithm)
      - In the 1-child subcase of case 1, copy the __rb_parent_color field from
        the erased node to the child instead of recomputing it from the desired
        parent and color
      - When searching for the erased node's successor, differentiate between
        cases 2 and 3 based on whether any left links were followed. This avoids
        a condition later down.
      - In case 3, keep a pointer to the erased node's right child so we don't
        have to refetch it later to adjust its parent.
      - In the no-childs subcase of cases 2 and 3, place the rebalance assigment
        last so that the compiler can remove the following if(rebalance) test.
      
      Also, added some comments to illustrate cases 2 and 3.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f035ad6
    • M
      rbtree: handle 1-child recoloring in rb_erase() instead of rb_erase_color() · 46b6135a
      Michel Lespinasse 提交于
      An interesting observation for rb_erase() is that when a node has
      exactly one child, the node must be black and the child must be red.
      An interesting consequence is that removing such a node can be done by
      simply replacing it with its child and making the child black,
      which we can do efficiently in rb_erase(). __rb_erase_color() then
      only needs to handle the no-childs case and can be modified accordingly.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46b6135a
    • M
      rbtree: place easiest case first in rb_erase() · 60670b80
      Michel Lespinasse 提交于
      In rb_erase, move the easy case (node to erase has no more than
      1 child) first. I feel the code reads easier that way.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60670b80
    • M
      rbtree: add __rb_change_child() helper function · 7abc704a
      Michel Lespinasse 提交于
      Add __rb_change_child() as an inline helper function to replace code that
      would otherwise be duplicated 4 times in the source.
      
      No changes to binary size or speed.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7abc704a
    • M
      rbtree test: fix sparse warning about 64-bit constant · 28d75309
      Michel Lespinasse 提交于
      Just a small fix to make sparse happy.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Reported-by: NFengguang Wu <wfg@linux.intel.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28d75309