1. 13 5月, 2011 1 次提交
    • I
      vsprintf: Turn kptr_restrict off by default · 411f05f1
      Ingo Molnar 提交于
      kptr_restrict has been triggering bugs in apps such as perf, and it also makes
      the system less useful by default, so turn it off by default.
      
      This is how we generally handle security features that remove functionality,
      such as firewall code or SELinux - they have to be configured and activated
      from user-space.
      
      Distributions can turn kptr_restrict on again via this line in
      /etc/sysctrl.conf:
      
      kernel.kptr_restrict = 1
      
      ( Also mark the variable __read_mostly while at it, as it's typically modified
        only once per bootup, or not at all. )
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      411f05f1
  2. 02 5月, 2011 1 次提交
  3. 29 4月, 2011 2 次提交
  4. 15 4月, 2011 2 次提交
  5. 05 4月, 2011 1 次提交
  6. 31 3月, 2011 1 次提交
  7. 25 3月, 2011 1 次提交
  8. 24 3月, 2011 4 次提交
    • N
      vsprintf: Introduce %pB format specifier · 0f77a8d3
      Namhyung Kim 提交于
      The %pB format specifier is for stack backtrace. Its handler
      sprint_backtrace() does symbol lookup using (address-1) to
      ensure the address will not point outside of the function.
      
      If there is a tail-call to the function marked "noreturn",
      gcc optimized out the code after the call then causes saved
      return address points outside of the function (i.e. the start
      of the next function), so pollutes call trace somewhat.
      
      This patch adds the %pB printk mechanism that allows architecture
      call-trace printout functions to improve backtrace printouts.
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-arch@vger.kernel.org
      LKML-Reference: <1300934550-21394-1-git-send-email-namhyung@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0f77a8d3
    • A
      bitops: introduce CONFIG_GENERIC_FIND_BIT_LE · 0664996b
      Akinobu Mita 提交于
      This introduces CONFIG_GENERIC_FIND_BIT_LE to tell whether to use generic
      implementation of find_*_bit_le() in lib/find_next_bit.c or not.
      
      For now we select CONFIG_GENERIC_FIND_BIT_LE for all architectures which
      enable CONFIG_GENERIC_FIND_NEXT_BIT.
      
      But m68knommu wants to define own faster find_next_zero_bit_le() and
      continues using generic find_next_{,zero_}bit().
      (CONFIG_GENERIC_FIND_NEXT_BIT and !CONFIG_GENERIC_FIND_BIT_LE)
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0664996b
    • A
      asm-generic: change little-endian bitops to take any pointer types · a56560b3
      Akinobu Mita 提交于
      This makes the little-endian bitops take any pointer types by changing the
      prototypes and adding casts in the preprocessor macros.
      
      That would seem to at least make all the filesystem code happier, and they
      can continue to do just something like
      
        #define ext2_set_bit __test_and_set_bit_le
      
      (or whatever the exact sequence ends up being).
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Grant Grundler <grundler@parisc-linux.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a56560b3
    • A
      asm-generic: rename generic little-endian bitops functions · c4945b9e
      Akinobu Mita 提交于
      As a preparation for providing little-endian bitops for all architectures,
      This renames generic implementation of little-endian bitops.  (remove
      "generic_" prefix and postfix "_le")
      
      s/generic_find_next_le_bit/find_next_bit_le/
      s/generic_find_next_zero_le_bit/find_next_zero_bit_le/
      s/generic_find_first_zero_le_bit/find_first_zero_bit_le/
      s/generic___test_and_set_le_bit/__test_and_set_bit_le/
      s/generic___test_and_clear_le_bit/__test_and_clear_bit_le/
      s/generic_test_le_bit/test_bit_le/
      s/generic___set_le_bit/__set_bit_le/
      s/generic___clear_le_bit/__clear_bit_le/
      s/generic_test_and_set_le_bit/test_and_set_bit_le/
      s/generic_test_and_clear_le_bit/test_and_clear_bit_le/
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NHans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Andreas Schwab <schwab@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4945b9e
  9. 23 3月, 2011 7 次提交
    • J
      zlib: slim down zlib_deflate() workspace when possible · 565d76cb
      Jim Keniston 提交于
      Instead of always creating a huge (268K) deflate_workspace with the
      maximum compression parameters (windowBits=15, memLevel=8), allow the
      caller to obtain a smaller workspace by specifying smaller parameter
      values.
      
      For example, when capturing oops and panic reports to a medium with
      limited capacity, such as NVRAM, compression may be the only way to
      capture the whole report.  In this case, a small workspace (24K works
      fine) is a win, whether you allocate the workspace when you need it (i.e.,
      during an oops or panic) or at boot time.
      
      I've verified that this patch works with all accepted values of windowBits
      (positive and negative), memLevel, and compression level.
      Signed-off-by: NJim Keniston <jkenisto@us.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: David Miller <davem@davemloft.net>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      565d76cb
    • A
      kstrto*: converting strings to integers done (hopefully) right · 33ee3b2e
      Alexey Dobriyan 提交于
      1. simple_strto*() do not contain overflow checks and crufty,
         libc way to indicate failure.
      2. strict_strto*() also do not have overflow checks but the name and
         comments pretend they do.
      3. Both families have only "long long" and "long" variants,
         but users want strtou8()
      4. Both "simple" and "strict" prefixes are wrong:
         Simple doesn't exactly say what's so simple, strict should not exist
         because conversion should be strict by default.
      
      The solution is to use "k" prefix and add convertors for more types.
      Enter
      	kstrtoull()
      	kstrtoll()
      	kstrtoul()
      	kstrtol()
      	kstrtouint()
      	kstrtoint()
      
      	kstrtou64()
      	kstrtos64()
      	kstrtou32()
      	kstrtos32()
      	kstrtou16()
      	kstrtos16()
      	kstrtou8()
      	kstrtos8()
      
      Include runtime testsuite (somewhat incomplete) as well.
      
      strict_strto*() become deprecated, stubbed to kstrto*() and
      eventually will be removed altogether.
      
      Use kstrto*() in code today!
      
      Note: on some archs _kstrtoul() and _kstrtol() are left in tree, even if
            they'll be unused at runtime. This is temporarily solution,
            because I don't want to hardcode list of archs where these
            functions aren't needed. Current solution with sizeof() and
            __alignof__ at least always works.
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33ee3b2e
    • M
      printk: allow setting DEFAULT_MESSAGE_LEVEL via Kconfig · 5af5bcb8
      Mandeep Singh Baines 提交于
      We've been burned by regressions/bugs which we later realized could have
      been triaged quicker if only we'd paid closer attention to dmesg.  To make
      it easier to audit dmesg, we'd like to make DEFAULT_MESSAGE_LEVEL
      Kconfig-settable.  That way we can set it to KERN_NOTICE and audit any
      messages <= KERN_WARNING.
      Signed-off-by: NMandeep Singh Baines <msb@chromium.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joe Perches <joe@perches.com>
      Cc: Olof Johansson <olofj@chromium.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5af5bcb8
    • K
      printk: use %pK for /proc/kallsyms and /proc/modules · 9f36e2c4
      Kees Cook 提交于
      In an effort to reduce kernel address leaks that might be used to help
      target kernel privilege escalation exploits, this patch uses %pK when
      displaying addresses in /proc/kallsyms, /proc/modules, and
      /sys/module/*/sections/*.
      
      Note that this changes %x to %p, so some legitimately 0 values in
      /proc/kallsyms would have changed from 00000000 to "(null)".  To avoid
      this, "(null)" is not used when using the "K" format.  Anything that was
      already successfully parsing "(null)" in addition to full hex digits
      should have no problem with this change.  (Thanks to Joe Perches for the
      suggestion.) Due to the %x to %p, "void *" casts are needed since these
      addresses are already "unsigned long" everywhere internally, due to their
      starting life as ELF section offsets.
      Signed-off-by: NKees Cook <kees.cook@canonical.com>
      Cc: Eugene Teo <eugene@redhat.com>
      Cc: Dan Rosenberg <drosenberg@vsecurity.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f36e2c4
    • J
      vsprintf: neaten %pK kptr_restrict, save a bit of code space · 26297607
      Joe Perches 提交于
      If kptr restrictions are on, just set the passed pointer to NULL.
      
      $ size lib/vsprintf.o.*
         text	   data	    bss	    dec	    hex	filename
         8247	      4	      2	   8253	   203d	lib/vsprintf.o.new
         8282	      4	      2	   8288	   2060	lib/vsprintf.o.old
      Signed-off-by: NJoe Perches <joe@perches.com>
      Cc: Dan Rosenberg <drosenberg@vsecurity.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26297607
    • D
      kernel/watchdog.c: allow hardlockup to panic by default · fef2c9bc
      Don Zickus 提交于
      When a cpu is considered stuck, instead of limping along and just printing
      a warning, it is sometimes preferred to just panic, let kdump capture the
      vmcore and reboot.  This gets the machine back into a stable state quickly
      while saving the info that got it into a stuck state to begin with.
      
      Add a Kconfig option to allow users to set the hardlockup to panic
      by default.  Also add in a 'nmi_watchdog=nopanic' to override this.
      
      [akpm@linux-foundation.org: fix strncmp length]
      Signed-off-by: NDon Zickus <dzickus@redhat.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: NWANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fef2c9bc
    • D
      oom: suppress nodes that are not allowed from meminfo on oom kill · ddd588b5
      David Rientjes 提交于
      The oom killer is extremely verbose for machines with a large number of
      cpus and/or nodes.  This verbosity can often be harmful if it causes other
      important messages to be scrolled from the kernel log and incurs a
      signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT >
      8.
      
      This patch causes only memory information to be displayed for nodes that
      are allowed by current's cpuset when dumping the VM state.  Information
      for all other nodes is irrelevant to the oom condition; we don't care if
      there's an abundance of memory elsewhere if we can't access it.
      
      This only affects the behavior of dumping memory information when an oom
      is triggered.  Other dumps, such as for sysrq+m, still display the
      unfiltered form when using the existing show_mem() interface.
      
      Additionally, the per-cpu pageset statistics are extremely verbose in oom
      killer output, so it is now suppressed.  This removes
      
      	nodes_weight(current->mems_allowed) * (1 + nr_cpus)
      
      lines from the oom killer output.
      
      Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed
      nodes.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ddd588b5
  10. 21 3月, 2011 1 次提交
  11. 12 3月, 2011 2 次提交
  12. 11 3月, 2011 1 次提交
    • I
      lib: add shared BCH ECC library · 437aa565
      Ivan Djelic 提交于
      This is a new software BCH encoding/decoding library, similar to the shared
      Reed-Solomon library.
      
      Binary BCH (Bose-Chaudhuri-Hocquenghem) codes are widely used to correct
      errors in NAND flash devices requiring more than 1-bit ecc correction; they
      are generally better suited for NAND flash than RS codes because NAND bit
      errors do not occur in bursts. Latest SLC NAND devices typically require at
      least 4-bit ecc protection per 512 bytes block.
      
      This library provides software encoding/decoding, but may also be used with
      ASIC/SoC hardware BCH engines to perform error correction. It is being
      currently used for this purpose on an OMAP3630 board (4bit/8bit HW BCH). It
      has also been used to decode raw dumps of NAND devices with on-die BCH ecc
      engines (e.g. Micron 4bit ecc SLC devices).
      
      Latest NAND devices (including SLC) can exhibit high error rates (typically
      a dozen or more bitflips per hour during stress tests); in order to
      minimize the performance impact of error correction, this library
      implements recently developed algorithms for fast polynomial root finding
      (see bch.c header for details) instead of the traditional exhaustive Chien
      root search; a few performance figures are provided below:
      
      Platform: arm926ejs @ 468 MHz, 32 KiB icache, 16 KiB dcache
      BCH ecc : 4-bit per 512 bytes
      
      Encoding average throughput: 250 Mbits/s
      
      Error correction time (compared with Chien search):
      
              average   worst      average (Chien)  worst (Chien)
      ----------------------------------------------------------
      1 bit    8.5 µs   11 µs         200 µs           383 µs
      2 bit    9.7 µs   12.5 µs       477 µs           728 µs
      3 bit   18.1 µs   20.6 µs       758 µs          1010 µs
      4 bit   19.5 µs   23 µs        1028 µs          1280 µs
      
      In the above figures, "worst" is meant in terms of error pattern, not in
      terms of cache miss / page faults effects (not taken into account here).
      
      The library has been extensively tested on the following platforms: x86,
      x86_64, arm926ejs, omap3630, qemu-ppc64, qemu-mips.
      Signed-off-by: NIvan Djelic <ivan.djelic@parrot.com>
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      437aa565
  13. 08 3月, 2011 1 次提交
    • S
      debugobjects: Add hint for better object identification · 99777288
      Stanislaw Gruszka 提交于
      In complex subsystems like mac80211 structures can contain several
      timers and work structs, so identifying a specific instance from the
      call trace and object type output of debugobjects can be hard.
      
      Allow the subsystems which support debugobjects to provide a hint
      function. This function returns a pointer to a kernel address
      (preferrably the objects callback function) which is printed along
      with the debugobjects type.
      
      Add hint methods for timer_list, work_struct and hrtimer.
      
      [ tglx: Massaged changelog, made it compile ]
      Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <20110307085809.GA9334@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      99777288
  14. 05 3月, 2011 2 次提交
    • A
      BKL: That's all, folks · 4ba8216c
      Arnd Bergmann 提交于
      This removes the implementation of the big kernel lock,
      at last. A lot of people have worked on this in the
      past, I so the credit for this patch should be with
      everyone who participated in the hunt.
      
      The names on the Cc list are the people that were the
      most active in this, according to the recorded git
      history, in alphabetical order.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NAlan Cox <alan@linux.intel.com>
      Cc: Alessio Igor Bogani <abogani@texware.it>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Hendry <andrew.hendry@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Jan Blunck <jblunck@infradead.org>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Oliver Neukum <oliver@neukum.org>
      Cc: Paul Menage <menage@google.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      4ba8216c
    • M
      lib-average: Make config option selectable · a7a9a24d
      Michael Buesch 提交于
      Make CONFIG_AVERAGE selectable for out-of-tree users
      such as compat-wireless.
      Signed-off-by: NMichael Buesch <mb@bu3sch.de>
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      a7a9a24d
  15. 02 3月, 2011 1 次提交
  16. 01 3月, 2011 1 次提交
  17. 26 2月, 2011 1 次提交
  18. 24 2月, 2011 1 次提交
  19. 19 2月, 2011 1 次提交
    • L
      Expand CONFIG_DEBUG_LIST to several other list operations · 3c18d4de
      Linus Torvalds 提交于
      When list debugging is enabled, we aim to readably show list corruption
      errors, and the basic list_add/list_del operations end up having extra
      debugging code in them to do some basic validation of the list entries.
      
      However, "list_del_init()" and "list_move[_tail]()" ended up avoiding
      the debug code due to how they were written. This fixes that.
      
      So the _next_ time we have list_move() problems with stale list entries,
      we'll hopefully have an easier time finding them..
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3c18d4de
  20. 08 2月, 2011 1 次提交
  21. 04 2月, 2011 2 次提交
  22. 28 1月, 2011 1 次提交
  23. 27 1月, 2011 1 次提交
    • T
      rwsem: Remove redundant asmregparm annotation · d1233754
      Thomas Gleixner 提交于
      Peter Zijlstra pointed out, that the only user of asmregparm (x86) is
      compiling the kernel already with -mregparm=3. So the annotation of
      the rwsem functions is redundant. Remove it.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Chris Zankel <chris@zankel.net>
      LKML-Reference: <alpine.LFD.2.00.1101262130450.31804@localhost6.localdomain6>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      d1233754
  24. 26 1月, 2011 1 次提交
    • T
      radix_tree: radix_tree_gang_lookup_tag_slot() may never return · ac15ee69
      Toshiyuki Okajima 提交于
      Executed command: fsstress -d /mnt -n 600 -p 850
      
        crash> bt
        PID: 7947   TASK: ffff880160546a70  CPU: 0   COMMAND: "fsstress"
         #0 [ffff8800dfc07d00] machine_kexec at ffffffff81030db9
         #1 [ffff8800dfc07d70] crash_kexec at ffffffff810a7952
         #2 [ffff8800dfc07e40] oops_end at ffffffff814aa7c8
         #3 [ffff8800dfc07e70] die_nmi at ffffffff814aa969
         #4 [ffff8800dfc07ea0] do_nmi_callback at ffffffff8102b07b
         #5 [ffff8800dfc07f10] do_nmi at ffffffff814aa514
         #6 [ffff8800dfc07f50] nmi at ffffffff814a9d60
            [exception RIP: __lookup_tag+100]
            RIP: ffffffff812274b4  RSP: ffff88016056b998  RFLAGS: 00000287
            RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000006
            RDX: 000000000000001d  RSI: ffff88016056bb18  RDI: ffff8800c85366e0
            RBP: ffff88016056b9c8   R8: ffff88016056b9e8   R9: 0000000000000000
            R10: 000000000000000e  R11: ffff8800c8536908  R12: 0000000000000010
            R13: 0000000000000040  R14: ffffffffffffffc0  R15: ffff8800c85366e0
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
        <NMI exception stack>
         #7 [ffff88016056b998] __lookup_tag at ffffffff812274b4
         #8 [ffff88016056b9d0] radix_tree_gang_lookup_tag_slot at ffffffff81227605
         #9 [ffff88016056ba20] find_get_pages_tag at ffffffff810fc110
        #10 [ffff88016056ba80] pagevec_lookup_tag at ffffffff81105e85
        #11 [ffff88016056baa0] write_cache_pages at ffffffff81104c47
        #12 [ffff88016056bbd0] generic_writepages at ffffffff81105014
        #13 [ffff88016056bbe0] do_writepages at ffffffff81105055
        #14 [ffff88016056bbf0] __filemap_fdatawrite_range at ffffffff810fb2cb
        #15 [ffff88016056bc40] filemap_write_and_wait_range at ffffffff810fb32a
        #16 [ffff88016056bc70] generic_file_direct_write at ffffffff810fb3dc
        #17 [ffff88016056bce0] __generic_file_aio_write at ffffffff810fcee5
        #18 [ffff88016056bda0] generic_file_aio_write at ffffffff810fd085
        #19 [ffff88016056bdf0] do_sync_write at ffffffff8114f9ea
        #20 [ffff88016056bf00] vfs_write at ffffffff8114fcf8
        #21 [ffff88016056bf30] sys_write at ffffffff81150691
        #22 [ffff88016056bf80] system_call_fastpath at ffffffff8100c0b2
      
      I think this root cause is the following:
      
       radix_tree_range_tag_if_tagged() always tags the root tag with settag
       if the root tag is set with iftag even if there are no iftag tags
       in the specified range (Of course, there are some iftag tags
       outside the specified range).
      
      ===============================================================================
      [[[Detailed description]]]
      
      (1) Why cannot radix_tree_gang_lookup_tag_slot() return forever?
      
      __lookup_tag():
       - Return with 0.
       - Return with the index which is not bigger than the old one as the
         input parameter.
      
      Therefore the following "while" repeats forever because the above
      conditions cause "ret" not to be updated and the cur_index cannot be
      changed into the bigger one.
      
      (So, radix_tree_gang_lookup_tag_slot() cannot return forever.)
      
      radix_tree_gang_lookup_tag_slot():
      1178         while (ret < max_items) {
      1179                 unsigned int slots_found;
      1180                 unsigned long next_index;       /* Index of next search */
      1181
      1182                 if (cur_index > max_index)
      1183                         break;
      1184                 slots_found = __lookup_tag(node, results + ret,
      1185                                 cur_index, max_items - ret, &next_index,
      tag);
      1186                 ret += slots_found;
      			// cannot update ret because slots_found == 0.
      			// so, this while loops forever.
      1187                 if (next_index == 0)
      1188                         break;
      1189                 cur_index = next_index;
      1190         }
      
      (2) Why does __lookup_tag() return with 0 and doesn't update the index?
      
      Assuming the following:
        - the one of the slot in radix_tree_node is NULL.
        - the one of the tag which corresponds to the slot sets with
          PAGECACHE_TAG_TOWRITE or other.
        - In a certain height(!=0), the corresponding index is 0.
      
      a) __lookup_tag() notices that the tag is set.
      
      1005 static unsigned int
      1006 __lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index,
      1007         unsigned int max_items, unsigned long *next_index, unsigned int tag)
      1008 {
      1009         unsigned int nr_found = 0;
      1010         unsigned int shift, height;
      1011
      1012         height = slot->height;
      1013         if (height == 0)
      1014                 goto out;
      1015         shift = (height-1) * RADIX_TREE_MAP_SHIFT;
      1016
      1017         while (height > 0) {
      1018                 unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK ;
      1019
      1020                 for (;;) {
      1021                         if (tag_get(slot, tag, i))
      1022                                 break;
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      * the index is not updated yet.
      
      b) __lookup_tag() notices that the slot is NULL.
      
      1023                         index &= ~((1UL << shift) - 1);
      1024                         index += 1UL << shift;
      1025                         if (index == 0)
      1026                                 goto out;       /* 32-bit wraparound */
      1027                         i++;
      1028                         if (i == RADIX_TREE_MAP_SIZE)
      1029                                 goto out;
      1030                 }
      1031                 height--;
      1032                 if (height == 0) {      /* Bottom level: grab some items */
      ...
      1055                 }
      1056                 shift -= RADIX_TREE_MAP_SHIFT;
      1057                 slot = rcu_dereference_raw(slot->slots[i]);
      1058                 if (slot == NULL)
      1059                         break;
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      
      c) __lookup_tag() doesn't update the index and return with 0.
      
      1060         }
      1061 out:
      1062         *next_index = index;
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      1063         return nr_found;
      1064 }
      
      (3) Why is the slot NULL even if the tag is set?
      
      Because radix_tree_range_tag_if_tagged() always sets the root tag with
      PAGECACHE_TAG_TOWRITE if the root tag is set with PAGECACHE_TAG_DIRTY,
      even if there is no tag which can be set with PAGECACHE_TAG_TOWRITE
      in the specified range (from *first_indexp to last_index). Of course,
      some PAGECACHE_TAG_DIRTY nodes must exist outside the specified range.
      (radix_tree_range_tag_if_tagged() is called only from tag_pages_for_writeback())
      
       640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root
      *root,
       641                 unsigned long *first_indexp, unsigned long last_index,
       642                 unsigned long nr_to_tag,
       643                 unsigned int iftag, unsigned int settag)
       644 {
       645         unsigned int height = root->height;
       646         struct radix_tree_path path[height];
       647         struct radix_tree_path *pathp = path;
       648         struct radix_tree_node *slot;
       649         unsigned int shift;
       650         unsigned long tagged = 0;
       651         unsigned long index = *first_indexp;
       652
       653         last_index = min(last_index, radix_tree_maxindex(height));
       654         if (index > last_index)
       655                 return 0;
       656         if (!nr_to_tag)
       657                 return 0;
       658         if (!root_tag_get(root, iftag)) {
       659                 *first_indexp = last_index + 1;
       660                 return 0;
       661         }
       662         if (height == 0) {
       663                 *first_indexp = last_index + 1;
       664                 root_tag_set(root, settag);
       665                 return 1;
       666         }
      ...
       733         root_tag_set(root, settag);
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       734         *first_indexp = index;
       735
       736         return tagged;
       737 }
      
      As the result, there is no radix_tree_node which is set with
      PAGECACHE_TAG_TOWRITE but the root tag(radix_tree_root) is set with
      PAGECACHE_TAG_TOWRITE.
      
      [figure: inside radix_tree]
      (Please see the figure with typewriter font)
      ===========================================
                [roottag = DIRTY]
                       |             tag=0:NOTHING
               tag[0 0 0 1]              1:DIRTY
                  [x x x +]              2:WRITEBACK
                         |               3:DIRTY,WRITEBACK
                         p               4:TOWRITE
                   <--->                 5:DIRTY,TOWRITE ...
           specified range (index: 0 to 2)
      
      * There is no DIRTY tag within the specified range.
       (But there is a DIRTY tag outside that range.)
      
                  | | | | | | | | |
          after calling tag_pages_for_writeback()
                  | | | | | | | | |
                  v v v v v v v v v
      
                [roottag = DIRTY,TOWRITE]
                       |                 p is "page".
               tag[0 0 0 1]              x is NULL.
                  [x x x +]              +- is a pointer to "page".
                         |
                         p
      
      * But TOWRITE tag is set on the root tag.
      ============================================
      
      After that, radix_tree_extend() via radix_tree_insert() is called
      when the page is added.
      This function sets the new radix_tree_node with PAGECACHE_TAG_TOWRITE
      to succeed the status of the root tag.
      
       246 static int radix_tree_extend(struct radix_tree_root *root, unsigned long
      index)
       247 {
       248         struct radix_tree_node *node;
       249         unsigned int height;
       250         int tag;
       251
       252         /* Figure out what the height should be.  */
       253         height = root->height + 1;
       254         while (index > radix_tree_maxindex(height))
       255                 height++;
       256
       257         if (root->rnode == NULL) {
       258                 root->height = height;
       259                 goto out;
       260         }
       261
       262         do {
       263                 unsigned int newheight;
       264                 if (!(node = radix_tree_node_alloc(root)))
       265                         return -ENOMEM;
       266
       267                 /* Increase the height.  */
       268                 node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
       269
       270                 /* Propagate the aggregated tag info into the new root */
       271                 for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
       272                         if (root_tag_get(root, tag))
       273                                 tag_set(node, tag, 0);
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       274                 }
      
      ===========================================
                [roottag = DIRTY,TOWRITE]
                       |     :
               tag[0 0 0 1] [0 0 0 0]
                  [x x x +] [+ x x x]
                         |   |
                         p   p (new page)
      
                  | | | | | | | | |
          after calling radix_tree_insert
                  | | | | | | | | |
                  v v v v v v v v v
      
                [roottag = DIRTY,TOWRITE]
                       |
               tag [5 0 0 0]    *  DIRTY and TOWRITE tags are
                   [+ + x x]       succeeded to the new node.
                    | |
        tag [0 0 0 1] [0 0 0 0]
            [x x x +] [+ x x x]
                   |   |
                   p   p
      ============================================
      
      After that, the index 3 page is released by remove_from_page_cache().
      Then we can make the situation that the tag is set with PAGECACHE_TAG_TOWRITE
      and that the slot which corresponds to the tag is NULL.
      ===========================================
                [roottag = DIRTY,TOWRITE]
                       |
               tag [5 0 0 0]
                   [+ + x x]
                    | |
        tag [0 0 0 1] [0 0 0 0]
            [x x x +] [+ x x x]
                   |   |
                   p   p
               (remove)
      
                  | | | | | | | | |
          after calling remove_page_cache
                  | | | | | | | | |
                  v v v v v v v v v
      
                [roottag = DIRTY,TOWRITE]
                       |
               tag [4 0 0 0]      * Only DIRTY tag is cleared
                   [x + x x]        because no TOWRITE tag is existed
                      |             in the bottom node.
                      [0 0 0 0]
                      [+ x x x]
                       |
                       p
      ============================================
      
      To solve this problem
      
      Change to that radix_tree_tag_if_tagged() doesn't tag the root tag
      if it doesn't set any tags within the specified range.
      
      Like this.
      ============================================
       640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root
      *root,
       641                 unsigned long *first_indexp, unsigned long last_index,
       642                 unsigned long nr_to_tag,
       643                 unsigned int iftag, unsigned int settag)
       644 {
       650         unsigned long tagged = 0;
      ...
       733 	     if (tagged)
      ^^^^^^^^^^^^^^^^^^^^^^^^
       734            root_tag_set(root, settag);
       735         *first_indexp = index;
       736
       737         return tagged;
       738 }
      
      ============================================
      Signed-off-by: NToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac15ee69
  25. 25 1月, 2011 2 次提交