1. 28 May 2015, 3 commits
    • rcu: Conditionally compile RCU's eqs warnings · 1ce46ee5
      Committed by Paul E. McKenney
      This commit applies some warning-omission micro-optimizations to RCU's
      various extended-quiescent-state functions, which are on the kernel/user
      hotpath for CONFIG_NO_HZ_FULL=y.
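      
      The summary does not show the mechanism, but the usual kernel pattern for
      compiling such checks out is to gate them behind a Kconfig symbol via
      IS_ENABLED().  A minimal sketch, assuming a debug option named
      CONFIG_RCU_EQS_DEBUG and an illustrative check (not the literal patch):
      
        #include <linux/bug.h>
        #include <linux/kconfig.h>
        #include <linux/sched.h>
        
        /* Sketch only: when the assumed CONFIG_RCU_EQS_DEBUG option is =n,
         * IS_ENABLED() folds to 0 at compile time and the whole warning is
         * dropped from the NO_HZ_FULL kernel/user transition hotpath. */
        static void rcu_eqs_enter_sketch(bool user)
        {
        	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
        		     !user && !is_idle_task(current));
        	/* ... the actual extended-quiescent-state entry work ... */
        }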
      Reported-by: Rik van Riel <riel@redhat.com>
      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      1ce46ee5
    • rcu: Directly drive TASKS_RCU from Kconfig · 82d0f4c0
      Committed by Paul E. McKenney
      Currently, Kconfig will ask the user whether TASKS_RCU should be set.
      This is silly because Kconfig already has all the information that it
      needs to set this parameter.  This commit therefore directly drives
      the value of TASKS_RCU via "select" statements, which means that
      subsystems requiring TASKS_RCU will need to add "select" statements
      of their own.
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
      82d0f4c0
    • rcu: Provide diagnostic option to slow down grace-period scans · 0f41c0dd
      Committed by Paul E. McKenney
      Grace-period scans of the rcu_node combining tree normally
      proceed quite quickly, so that it is very difficult to reproduce
      races against them.  This commit therefore allows grace-period
      pre-initialization and cleanup to be artificially slowed down,
      increasing race-reproduction probability.  Two pairs of new Kconfig
      parameters are provided: RCU_TORTURE_TEST_SLOW_PREINIT to
      enable the slowing down of propagating CPU-hotplug changes up the
      combining tree along with RCU_TORTURE_TEST_SLOW_PREINIT_DELAY to
      specify the delay in jiffies, and RCU_TORTURE_TEST_SLOW_CLEANUP
      to enable the slowing down of the end-of-grace-period cleanup scan
      along with RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY to specify the delay
      in jiffies.  Boot-time parameters named rcutree.gp_preinit_delay and
      rcutree.gp_cleanup_delay allow these delays to be specified at boot time.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      0f41c0dd
  2. 06 May 2015, 2 commits
  3. 04 May 2015, 1 commit
    • lib: make memzero_explicit more robust against dead store elimination · 7829fb09
      Committed by Daniel Borkmann
      In commit 0b053c95 ("lib: memzero_explicit: use barrier instead
      of OPTIMIZER_HIDE_VAR"), we made memzero_explicit() more robust in
      case LTO would decide to inline memzero_explicit() and eventually
      find out it could be eliminated as a dead store.
      
      While using barrier() works well for the case of gcc, recent efforts
      from the LLVMLinux people suggest using llvm as an alternative to gcc,
      and there, Stephan found in a simple stand-alone user-space example
      that llvm could nevertheless optimize away and thus eliminate the memset().
      A similar issue has been observed in the referenced llvm bug report,
      which is regarded as not-a-bug.
      
      Based on some experiments, icc is a bit of a special case: while it
      doesn't seem to eliminate the memset() here, it could do so with its
      own implementation and then produce similar findings as llvm.
      
      The fix in this patch now works for all three compilers (also tested
      with more aggressive optimization levels). Arguably, in the current
      kernel tree it's more of a theoretical issue, but imho, it's better
      to be pedantic about it.
      
      It's clearly visible with gcc/llvm though, with the code below: had we
      used barrier() only here, llvm would have omitted the clearing; not so
      with the barrier_data() variant:
      
        static inline void memzero_explicit(void *s, size_t count)
        {
          memset(s, 0, count);
          barrier_data(s);
        }
      
        int main(void)
        {
          char buff[20];
          memzero_explicit(buff, sizeof(buff));
          return 0;
        }
      
        $ gcc -O2 test.c
        $ gdb a.out
        (gdb) disassemble main
        Dump of assembler code for function main:
         0x0000000000400400  <+0>: lea   -0x28(%rsp),%rax
         0x0000000000400405  <+5>: movq  $0x0,-0x28(%rsp)
         0x000000000040040e <+14>: movq  $0x0,-0x20(%rsp)
         0x0000000000400417 <+23>: movl  $0x0,-0x18(%rsp)
         0x000000000040041f <+31>: xor   %eax,%eax
         0x0000000000400421 <+33>: retq
        End of assembler dump.
      
        $ clang -O2 test.c
        $ gdb a.out
        (gdb) disassemble main
        Dump of assembler code for function main:
         0x00000000004004f0  <+0>: xorps  %xmm0,%xmm0
         0x00000000004004f3  <+3>: movaps %xmm0,-0x18(%rsp)
         0x00000000004004f8  <+8>: movl   $0x0,-0x8(%rsp)
         0x0000000000400500 <+16>: lea    -0x18(%rsp),%rax
         0x0000000000400505 <+21>: xor    %eax,%eax
         0x0000000000400507 <+23>: retq
        End of assembler dump.
      
      Since gcc, clang, and also icc define __GNUC__, it's sufficient to
      define this in compiler-gcc.h only for it to be picked up. For a
      fallback or otherwise unsupported compiler, we define it as a plain
      barrier. Similarly for ecc, which does not support gcc inline asm.
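      
      For reference, a minimal sketch of how such a barrier_data() can be
      expressed with gcc-style inline asm (the exact definition in
      compiler-gcc.h may differ in detail):
      
        #ifdef __GNUC__
        /* The "r"(ptr) input together with the "memory" clobber makes the
         * zeroed buffer observable to the compiler, so the memset() above
         * it cannot be treated as a dead store. */
        #define barrier_data(ptr) __asm__ __volatile__("" : : "r" (ptr) : "memory")
        #else
        /* Fallback for compilers without gcc inline asm: degrade to the
         * plain compiler barrier(). */
        #define barrier_data(ptr) barrier()
        #endif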
      
      Reference: https://llvm.org/bugs/show_bug.cgi?id=15495
      Reported-by: Stephan Mueller <smueller@chronox.de>
      Tested-by: Stephan Mueller <smueller@chronox.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Stephan Mueller <smueller@chronox.de>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: mancha security <mancha1@zoho.com>
      Cc: Mark Charlebois <charlebm@gmail.com>
      Cc: Behan Webster <behanw@converseincode.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      7829fb09
  4. 23 April 2015, 2 commits
  5. 22 April 2015, 4 commits
    • md/raid6 algorithms: xor_syndrome() for SSE2 · a582564b
      Committed by Markus Stockhausen
      The second (and last) optimized XOR syndrome calculation. This version
      supports right and left side optimization. All CPUs with architecture
      older than Haswell will benefit from it.
      
      It should be noted that SSE2 movntdq kills performance for memory areas
      that are read and written simultaneously in chunks smaller than cache
      line size. So use movdqa instead for P/Q writes in sse21 and sse22 XOR
      functions.
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      a582564b
    • md/raid6 algorithms: xor_syndrome() for generic int · 9a5ce91d
      Committed by Markus Stockhausen
      Start the algorithms with the very basic one. It is left and right
      optimized. That means we can avoid all calculations for unneeded pages
      above the right stop offset. For pages below the left start offset we
      still need the syndrome multiplication but without reading data pages.
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      9a5ce91d
    • md/raid6 algorithms: improve test program · 7e92e1d7
      Committed by Markus Stockhausen
      It is always helpful to have a test tool in place when we implement
      new data-critical algorithms. So add some test routines to the raid6
      checker that can verify that the new xor_syndrome() works as expected.
      
      Run through all permutations of start/stop pages per algorithm and
      simulate an xor_syndrome()-assisted rmw run. After each rmw, check
      that the recovery algorithm still confirms that the stripe is fine.
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      7e92e1d7
    • md/raid6 algorithms: delta syndrome functions · fe5cbc6e
      Committed by Markus Stockhausen
      v3: s-o-b comment, explanation of performance and decision for
      the start/stop implementation
      
      Implementing rmw functionality for RAID6 requires optimized syndrome
      calculation. Up to now we can only generate a complete syndrome. The
      target P/Q pages are always overwritten. With this patch we provide
      a framework for in-place P/Q modification. As a first step, simply
      fill those function pointers with NULL values.
      
      xor_syndrome() has two additional parameters: start & stop. These
      indicate the first and last page that change during an rmw run. That
      makes it possible to avoid several unnecessary loops and speed up the
      calculation. The caller needs to implement the following logic to make
      the functions work (a sketch follows further below).
      
      1) xor_syndrome(disks, start, stop, ...): "Remove" all data of source
      blocks inside P/Q between (and including) start and end.
      
      2) modify any block with start <= block <= stop
      
      3) xor_syndrome(disks, start, stop, ...): "Reinsert" all data of
      source blocks into P/Q between (and including) start and end.
      
      Pages between start and stop that won't be changed should be filled
      with a pointer to the kernel zero page. The reasons for not taking NULL
      pages are:
      
      1) The algorithms cross the whole source data line by line; this way
      additional branches are avoided.
      
      2) Having a NULL page would avoid calculating the XOR P parity but
      would still need calculation steps for the Q parity. Depending on the
      algorithm unrolling, that might only be a difference of 2 instructions
      per loop.
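      
      A hedged sketch of that caller-side remove/modify/reinsert sequence
      (the exact xor_syndrome() signature, its placement next to
      gen_syndrome() in the raid6 calls structure, and the helper
      modify_changed_pages() are assumptions for illustration):
      
        /* Illustrative fragment, not the md/raid5 code itself.  Pages in
         * [start, stop] that do not change point at the zero page, per the
         * rules above. */
        int disks;		/* data disks plus P and Q */
        int start, stop;	/* first/last page being rewritten */
        void **ptrs;		/* disks-2 data pages, then P, then Q */
        
        raid6_call.xor_syndrome(disks, start, stop, PAGE_SIZE, ptrs); /* 1) "remove" old data from P/Q */
        modify_changed_pages(ptrs, start, stop);                      /* 2) hypothetical helper: rewrite pages */
        raid6_call.xor_syndrome(disks, start, stop, PAGE_SIZE, ptrs); /* 3) "reinsert" new data into P/Q */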
      
      The benchmark numbers of the gen_syndrome() functions are displayed in
      the kernel log. Do the same for the xor_syndrome() functions. This
      will help to analyze performance problems and give a rough estimate of
      how well the algorithm works. The choice of the fastest algorithm will
      still depend on the gen_syndrome() performance.
      
      With the start/stop page implementation the speed can vary a lot in real
      life. E.g. a change of page 0 & page 15 on a stripe will be harder to
      compute than the case where page 0 & page 1 are XOR candidates. To
      avoid being too enthusiastic about the expected speeds, we will run a
      worst-case test that simulates a change on the upper half of the
      stripe. So we do:
      
      1) calculation of P/Q for the upper pages
      
      2) continuation of Q for the lower (empty) pages
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      fe5cbc6e
  6. 21 April 2015, 2 commits
  7. 20 April 2015, 1 commit
    • hexdump: avoid warning in test function · 17974c05
      Committed by Linus Torvalds
      The test_data_1_le[] array is a const array of const char *.  To avoid
      dropping any const information, we need to use "const char * const *",
      not just "const char **".
      
      I'm not sure why the different test arrays end up having different
      const'ness, but let's make the pointer we use to traverse them as const
      as possible, since we modify neither the array of pointers _nor_ the
      pointers we find in the array.
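      
      A minimal standalone illustration of the difference (the array contents
      here are made up, not the kernel test data):
      
        static const char * const test_data[] = { "4c", "de", "7a" };
        
        const char * const *p = test_data;  /* fine: keeps both levels of const */
        const char **q = test_data;         /* gcc warns: discards a 'const' qualifier */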
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17974c05
  8. 19 April 2015, 4 commits
  9. 18 April 2015, 1 commit
  10. 17 April 2015, 10 commits
    • lib/Kconfig: fix up HAVE_ARCH_BITREVERSE help text · 9e522c0d
      Committed by Andrew Morton
      Cc: Yalin Wang <yalin.wang@sonymobile.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9e522c0d
    • cpumask: don't perform while loop in cpumask_next_and() · 534b483a
      Committed by Sergey Senozhatsky
      cpumask_next_and() calls cpumask_next() on src1 in a loop and tests
      whether the cpu it finds is also present in src2. Remove that loop:
      perform cpumask_and() of src1 and src2 first and use the resulting
      mask to find the next cpu with cpumask_next().
      
      Apart from removing the while loop, ./bloat-o-meter on x86_64 shows:
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-8 (-8)
      function                                     old     new   delta
      cpumask_next_and                              62      54      -8
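      
      A sketch of the reworked logic as described above; the in-tree version
      may differ in detail (for example in annotations or in how the
      temporary mask is handled):
      
        #include <linux/cpumask.h>
        
        /* AND the two masks once, then do a single cpumask_next() on the
         * result instead of looping over src1 and re-testing each cpu in
         * src2. */
        int cpumask_next_and(int n, const struct cpumask *src1p,
        		     const struct cpumask *src2p)
        {
        	struct cpumask tmp;
        
        	if (cpumask_and(&tmp, src1p, src2p))
        		return cpumask_next(n, &tmp);
        	return nr_cpu_ids;
        }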
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      534b483a
    • lib/bitmap.c: bitmap_[empty,full]: remove code duplication · 2afe27c7
      Committed by Yury Norov
      bitmap_empty() has its own implementation.  But it's clearly as simple as:
      
      	find_first_bit(src, nbits) == nbits
      
      The same is true for 'bitmap_full'.
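      
      In header terms, that boils down to something like the following
      sketch (the real wrappers may keep a fast path for small constant
      nbits):
      
        #include <linux/bitmap.h>
        
        static inline bool bitmap_empty_sketch(const unsigned long *src,
        				       unsigned int nbits)
        {
        	return find_first_bit(src, nbits) == nbits;
        }
        
        static inline bool bitmap_full_sketch(const unsigned long *src,
        				      unsigned int nbits)
        {
        	return find_first_zero_bit(src, nbits) == nbits;
        }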
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Cc: George Spelvin <linux@horizon.com>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2afe27c7
    • lib/vsprintf.c: improve put_dec_trunc8 slightly · 675cf53c
      Committed by Rasmus Villemoes
      I hadn't had enough coffee when I wrote this. Currently, the final
      increment of buf depends on the value loaded from the table, and
      causes gcc to emit a cmov immediately before the return. It is smarter
      to let it depend on r, since the increment can then be computed in
      parallel with the final load/store pair. It also shaves 16 bytes of
      .text.
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      675cf53c
    • lib/dma-debug: fix bucket_find_contain() · a7a2c02a
      Committed by Sebastian Ott
      bucket_find_contain() will search the bucket list for a dma_debug_entry.
      When the entry isn't found it needs to search other buckets too, since
      only the start address of a dma range is hashed (which might be in a
      different bucket).
      
      A copy of the dma_debug_entry is used to get the previous hash bucket,
      but when that bucket's list is searched, the original dma_debug_entry
      has to be used, not its modified copy.
      
      This fixes false "device driver tries to sync DMA memory it has not allocated"
      warnings.
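      
      A hedged sketch of the shape of the fix (helper and constant names
      below are hypothetical, not the actual lib/dma-debug.c API): the copy
      is only used to step backwards through the hash buckets, while the
      lookup itself keeps comparing against the original entry.
      
        /* 'ref' is the original dma_debug_entry being looked up; 'probe' is
         * a scratch copy used only to step the hashed start address down
         * into earlier buckets. */
        struct dma_debug_entry probe = *ref;
        struct dma_debug_entry *entry;
        int i;
        
        for (i = 0; i < MAX_RANGE_STEPS; i++) {
        	/* The bug was searching with &probe (the modified copy)
        	 * instead of ref, so containing entries never matched. */
        	entry = lookup_in_bucket(bucket, ref);
        	if (entry)
        		return entry;
        
        	probe.dev_addr -= BUCKET_RANGE;	/* move to an earlier range */
        	bucket = bucket_for(&probe);
        }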
      Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: Horia Geanta <horia.geanta@freescale.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a7a2c02a
    • lib/vsprintf.c: even faster binary to decimal conversion · 7c43d9a3
      Committed by Rasmus Villemoes
      The most expensive part of decimal conversion is the divisions by 10
      (albeit done using reciprocal multiplication with appropriately chosen
      constants).  I decided to see if one could eliminate around half of
      these multiplications by emitting two digits at a time, at the cost of a
      200 byte lookup table, and it does indeed seem like there is something
      to be gained, especially on 64 bits.  Microbenchmarking shows
      improvements ranging from -50% (for numbers uniformly distributed in [0,
      2^64-1]) to -25% (for numbers heavily biased toward the smaller end, a
      more realistic distribution).
      
      On a larger scale, perf shows that top, one of the big consumers of /proc
      data, uses 0.5-1.0% fewer cpu cycles.
      
      I had to jump through some hoops to get the 32 bit code to compile and run
      on my 64 bit machine, so I'm not sure how relevant these numbers are, but
      just for comparison the microbenchmark showed improvements between -30%
      and -10%.
      
      The bloat-o-meter costs are around 150 bytes (the generated code is a
      little smaller, so it's not the full 200 bytes) on both 32 and 64 bit.
      I'm aware that extra cache misses won't show up in a microbenchmark as
      used above, but on the other hand decimal conversions often happen in bulk
      (for example in the case of top).
      
      I have of course tested that the new code generates the same output as the
      old, for both the first and last 1e10 numbers in [0,2^64-1] and 4e9
      'random' numbers in-between.
      
      Test and verification code on github: https://github.com/Villemoes/dec.
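      
      A standalone sketch of the two-digits-per-iteration idea (not the
      kernel's exact put_dec code; like the kernel helpers it emits digits
      least-significant first and leaves reversing to the caller):
      
        #include <stdio.h>
        
        /* 100 two-character entries "00".."99"; one table load yields two
         * digits, so only one division by 100 is needed per digit pair. */
        static char decpair[200];
        
        static void init_decpair(void)
        {
        	for (int i = 0; i < 100; i++) {
        		decpair[2 * i]     = '0' + i / 10;
        		decpair[2 * i + 1] = '0' + i % 10;
        	}
        }
        
        static char *put_dec_sketch(char *buf, unsigned long long n)
        {
        	while (n >= 100) {
        		unsigned int r = n % 100;
        
        		n /= 100;
        		*buf++ = decpair[2 * r + 1];
        		*buf++ = decpair[2 * r];
        	}
        	if (n >= 10) {
        		*buf++ = decpair[2 * n + 1];
        		*buf++ = decpair[2 * n];
        	} else {
        		*buf++ = '0' + n;
        	}
        	return buf;
        }
        
        int main(void)
        {
        	char buf[32];
        	char *end;
        
        	init_decpair();
        	end = put_dec_sketch(buf, 18446744073709551615ULL);
        	while (end > buf)		/* print in the normal order */
        		putchar(*--end);
        	putchar('\n');
        	return 0;
        }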
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Tested-by: Jeff Epler <jepler@unpythonic.net>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c43d9a3
    • lib: rename lib/find_next_bit.c to lib/find_bit.c · 840620a1
      Committed by Yury Norov
      This file contains the implementation of all find_*_bit{,_le}
      functions, so giving it a more generic name looks reasonable.
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: George Spelvin <linux@horizon.com>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Valentin Rothberg <valentinrothberg@gmail.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      840620a1
    • lib: move find_last_bit to lib/find_next_bit.c · 8f6f19dd
      Committed by Yury Norov
      Currently the whole 'find_*_bit' family is located in
      lib/find_next_bit.c, except 'find_last_bit', which is in
      lib/find_last_bit.c. There seems to be no major benefit to keeping it
      separate.
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: George Spelvin <linux@horizon.com>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Valentin Rothberg <valentinrothberg@gmail.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f6f19dd
    • lib: find_*_bit reimplementation · 2c57a0e2
      Committed by Yury Norov
      This patchset reworks the find_bit function family to achieve better
      performance and decrease the size of .text.  All the rework is done in
      patch 1; patches 2 and 3 are about code moving and renaming.
      
      It was boot-tested on x86_64 and MIPS (big-endian) machines.
      Performance tests were run in userspace with code like this:
      
      	/* addr[] is filled from /dev/urandom */
      	start = clock();
      	while (ret < nbits)
      		ret = find_next_bit(addr, nbits, ret + 1);
      
      	end = clock();
      	printf("%ld\t", (unsigned long) end - start);
      
      On Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz measurements are: (for
      find_next_bit, nbits is 8M, for find_first_bit - 80K)
      
      	find_next_bit:		find_first_bit:
      	new	current		new	current
      	26932	43151		14777	14925
      	26947	43182		14521	15423
      	26507	43824		15053	14705
      	27329	43759		14473	14777
      	26895	43367		14847	15023
      	26990	43693		15103	15163
      	26775	43299		15067	15232
      	27282	42752		14544	15121
      	27504	43088		14644	14858
      	26761	43856		14699	15193
      	26692	43075		14781	14681
      	27137	42969		14451	15061
      	...			...
      
      find_next_bit performance gain is 35-40%;
      find_first_bit - no measurable difference.
      
      On ARM machines, there is an arch-specific implementation of find_bit.
      
      Thanks a lot to George Spelvin and Rasmus Villemoes for hints and
      helpful discussions.
      
      This patch (of 3):
      
      The new implementation takes less space in the source file (see the
      diffstat) and in the object file.  For me it's 710 vs 453 bytes of
      .text.  It also shows better performance.
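      
      A simplified standalone sketch of the word-at-a-time idea (not the
      kernel's exact code, which differs in detail):
      
        #include <stdio.h>
        
        #define BITS_PER_LONG (8 * sizeof(unsigned long))
        /* Mask covering bit 'start' and everything above it within one word. */
        #define FIRST_WORD_MASK(start) (~0UL << ((start) % BITS_PER_LONG))
        
        static unsigned long find_next_bit_sketch(const unsigned long *addr,
        					  unsigned long nbits,
        					  unsigned long start)
        {
        	unsigned long tmp;
        
        	if (start >= nbits)
        		return nbits;
        
        	/* Mask off the bits below 'start' in its word, then walk whole
        	 * words until a non-zero word is found -- no per-bit loop. */
        	tmp = addr[start / BITS_PER_LONG] & FIRST_WORD_MASK(start);
        	start -= start % BITS_PER_LONG;
        
        	while (!tmp) {
        		start += BITS_PER_LONG;
        		if (start >= nbits)
        			return nbits;
        		tmp = addr[start / BITS_PER_LONG];
        	}
        
        	/* __builtin_ctzl() plays the role of the kernel's __ffs(). */
        	start += __builtin_ctzl(tmp);
        	return start < nbits ? start : nbits;
        }
        
        int main(void)
        {
        	unsigned long map[2] = { 0, 1UL << 5 };	/* bit 69 set on 64-bit */
        
        	printf("%lu\n", find_next_bit_sketch(map, 128, 3));
        	return 0;
        }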
      
      find_last_bit description fixed due to obvious typo.
      
      [akpm@linux-foundation.org: include linux/bitmap.h, per Rasmus]
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: George Spelvin <linux@horizon.com>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Valentin Rothberg <valentinrothberg@gmail.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c57a0e2
    • sparc: Break up monolithic iommu table/lock into finer granularity pools and lock · 10b88a4b
      Committed by Sowmini Varadhan
      Investigation of multithreaded iperf experiments on an ethernet
      interface shows the iommu->lock as the hottest lock identified by
      lockstat, with something of the order of 21M contentions out of
      27M acquisitions, and an average wait time of 26 us for the lock.
      This is not efficient. A more scalable design is to follow the ppc
      model, where the iommu_table has multiple pools, each stretching
      over a segment of the map, and with a separate lock for each pool.
      This model allows for better parallelization of the iommu map search.
      
      This patch adds the iommu range alloc/free function infrastructure.
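      
      A hedged sketch of the data-structure shape being described (field and
      constant names here are illustrative, not the exact sparc/ppc
      definitions):
      
        #include <linux/spinlock.h>
        
        #define IOMMU_NR_POOLS	4	/* illustrative pool count */
        
        /* Each pool covers one segment of the IOMMU map and has its own
         * lock, so concurrent mappings in different segments no longer
         * contend on a single global lock. */
        struct iommu_pool_sketch {
        	unsigned long	start;	/* first entry covered by this pool */
        	unsigned long	end;	/* one past the last entry */
        	unsigned long	hint;	/* next-fit allocation hint */
        	spinlock_t	lock;
        };
        
        struct iommu_map_table_sketch {
        	unsigned long			table_map_base;
        	struct iommu_pool_sketch	pools[IOMMU_NR_POOLS];
        	unsigned long			*map;	/* underlying bitmap */
        };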
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10b88a4b
  11. 16 April 2015, 10 commits