1. 30 5月, 2012 1 次提交
    • A
      mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition · 26c19178
      Andrea Arcangeli 提交于
      When holding the mmap_sem for reading, pmd_offset_map_lock should only
      run on a pmd_t that has been read atomically from the pmdp pointer,
      otherwise we may read only half of it leading to this crash.
      
      PID: 11679  TASK: f06e8000  CPU: 3   COMMAND: "do_race_2_panic"
       #0 [f06a9dd8] crash_kexec at c049b5ec
       #1 [f06a9e2c] oops_end at c083d1c2
       #2 [f06a9e40] no_context at c0433ded
       #3 [f06a9e64] bad_area_nosemaphore at c043401a
       #4 [f06a9e6c] __do_page_fault at c0434493
       #5 [f06a9eec] do_page_fault at c083eb45
       #6 [f06a9f04] error_code (via page_fault) at c083c5d5
          EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
          00000000
          DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
          CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
       #7 [f06a9f38] _spin_lock at c083bc14
       #8 [f06a9f44] sys_mincore at c0507b7d
       #9 [f06a9fb0] system_call at c083becd
                               start           len
          EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
          DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
          SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
          CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286
      
      This should be a longstanding bug affecting x86 32bit PAE without THP.
      Only archs with 64bit large pmd_t and 32bit unsigned long should be
      affected.
      
      With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
      would partly hide the bug when the pmd transition from none to stable,
      by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
      enabled a new set of problem arises by the fact could then transition
      freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
      So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
      unconditional isn't good idea and it would be a flakey solution.
      
      This should be fully fixed by introducing a pmd_read_atomic that reads
      the pmd in order with THP disabled, or by reading the pmd atomically
      with cmpxchg8b with THP enabled.
      
      Luckily this new race condition only triggers in the places that must
      already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
      is localized there but this bug is not related to THP.
      
      NOTE: this can trigger on x86 32bit systems with PAE enabled with more
      than 4G of ram, otherwise the high part of the pmd will never risk to be
      truncated because it would be zero at all times, in turn so hiding the
      SMP race.
      
      This bug was discovered and fully debugged by Ulrich, quote:
      
      ----
      [..]
      pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
      eax.
      
          496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
          *pmd)
          497 {
          498         /* depend on compiler for an atomic pmd read */
          499         pmd_t pmdval = *pmd;
      
                                      // edi = pmd pointer
      0xc0507a74 <sys_mincore+548>:   mov    0x8(%esp),%edi
      ...
                                      // edx = PTE page table high address
      0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
      ...
                                      // eax = PTE page table low address
      0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax
      
      [..]
      
      Please note that the PMD is not read atomically. These are two "mov"
      instructions where the high order bits of the PMD entry are fetched
      first. Hence, the above machine code is prone to the following race.
      
      -  The PMD entry {high|low} is 0x0000000000000000.
         The "mov" at 0xc0507a84 loads 0x00000000 into edx.
      
      -  A page fault (on another CPU) sneaks in between the two "mov"
         instructions and instantiates the PMD.
      
      -  The PMD entry {high|low} is now 0x00000003fda38067.
         The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
      ----
      Reported-by: NUlrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26c19178
  2. 29 5月, 2012 1 次提交
  3. 27 5月, 2012 1 次提交
    • L
      word-at-a-time: make the interfaces truly generic · 36126f8f
      Linus Torvalds 提交于
      This changes the interfaces in <asm/word-at-a-time.h> to be a bit more
      complicated, but a lot more generic.
      
      In particular, it allows us to really do the operations efficiently on
      both little-endian and big-endian machines, pretty much regardless of
      machine details.  For example, if you can rely on a fast population
      count instruction on your architecture, this will allow you to make your
      optimized <asm/word-at-a-time.h> file with that.
      
      NOTE! The "generic" version in include/asm-generic/word-at-a-time.h is
      not truly generic, it actually only works on big-endian.  Why? Because
      on little-endian the generic algorithms are wasteful, since you can
      inevitably do better. The x86 implementation is an example of that.
      
      (The only truly non-generic part of the asm-generic implementation is
      the "find_zero()" function, and you could make a little-endian version
      of it.  And if the Kbuild infrastructure allowed us to pick a particular
      header file, that would be lovely)
      
      The <asm/word-at-a-time.h> functions are as follows:
      
       - WORD_AT_A_TIME_CONSTANTS: specific constants that the algorithm
         uses.
      
       - has_zero(): take a word, and determine if it has a zero byte in it.
         It gets the word, the pointer to the constant pool, and a pointer to
         an intermediate "data" field it can set.
      
         This is the "quick-and-dirty" zero tester: it's what is run inside
         the hot loops.
      
       - "prep_zero_mask()": take the word, the data that has_zero() produced,
         and the constant pool, and generate an *exact* mask of which byte had
         the first zero.  This is run directly *outside* the loop, and allows
         the "has_zero()" function to answer the "is there a zero byte"
         question without necessarily getting exactly *which* byte is the
         first one to contain a zero.
      
         If you do multiple byte lookups concurrently (eg "hash_name()", which
         looks for both NUL and '/' bytes), after you've done the prep_zero_mask()
         phase, the result of those can be or'ed together to get the "either
         or" case.
      
       - The result from "prep_zero_mask()" can then be fed into "find_zero()"
         (to find the byte offset of the first byte that was zero) or into
         "zero_bytemask()" (to find the bytemask of the bytes preceding the
         zero byte).
      
         The existence of zero_bytemask() is optional, and is not necessary
         for the normal string routines.  But dentry name hashing needs it, so
         if you enable DENTRY_WORD_AT_A_TIME you need to expose it.
      
      This changes the generic strncpy_from_user() function and the dentry
      hashing functions to use these modified word-at-a-time interfaces.  This
      gets us back to the optimized state of the x86 strncpy that we lost in
      the previous commit when moving over to the generic version.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      36126f8f
  4. 26 5月, 2012 1 次提交
    • C
      arch/tile: allow building Linux with transparent huge pages enabled · 73636b1a
      Chris Metcalf 提交于
      The change adds some infrastructure for managing tile pmd's more generally,
      using pte_pmd() and pmd_pte() methods to translate pmd values to and
      from ptes, since on TILEPro a pmd is really just a nested structure
      holding a pgd (aka pte).  Several existing pmd methods are moved into
      this framework, and a whole raft of additional pmd accessors are defined
      that are used by the transparent hugepage framework.
      
      The tile PTE now has a "client2" bit.  The bit is used to indicate a
      transparent huge page is in the process of being split into subpages.
      
      This change also fixes a generic bug where the return value of the
      generic pmdp_splitting_flush() was incorrect.
      Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
      73636b1a
  5. 21 5月, 2012 3 次提交
  6. 19 5月, 2012 3 次提交
  7. 17 5月, 2012 1 次提交
  8. 01 5月, 2012 2 次提交
    • B
      PCI: work around Stratus ftServer broken PCIe hierarchy · 284f5f9d
      Bjorn Helgaas 提交于
      A PCIe downstream port is a P2P bridge.  Its secondary interface is
      a link that should lead only to device 0 (unless ARI is enabled)[1], so
      we don't probe for non-zero device numbers.
      
      Some Stratus ftServer systems have a PCIe downstream port (02:00.0) that
      leads to both an upstream port (03:00.0) and a downstream port (03:01.0),
      and 03:01.0 has important devices below it:
      
        [0000:02]-+-00.0-[03-3c]--+-00.0-[04-09]--...
                                  \-01.0-[0a-0d]--+-[USB]
                                                  +-[NIC]
                                                  +-...
      
      Previously, we didn't enumerate device 03:01.0, so USB and the network
      didn't work.  This patch adds a DMI quirk to scan all device numbers,
      not just 0, below a downstream port.
      
      Based on a patch by Prarit Bhargava.
      
      [1] PCIe spec r3.0, sec 7.3.1
      
      CC: Myron Stowe <mstowe@redhat.com>
      CC: Don Dutile <ddutile@redhat.com>
      CC: James Paradis <james.paradis@stratus.com>
      CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
      CC: Jesse Barnes <jbarnes@virtuousgeek.org>
      CC: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      284f5f9d
    • H
      asm-generic: Use __BITS_PER_LONG in statfs.h · f5c2347e
      H. Peter Anvin 提交于
      <asm-generic/statfs.h> is exported to userspace, so using
      BITS_PER_LONG is invalid.  We need to use __BITS_PER_LONG instead.
      
      This is kernel bugzilla 43165.
      Reported-by: NH.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1335465916-16965-1-git-send-email-hpa@linux.intel.comAcked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: <stable@vger.kernel.org>
      f5c2347e
  9. 24 4月, 2012 1 次提交
  10. 21 4月, 2012 1 次提交
  11. 14 4月, 2012 3 次提交
    • W
      seccomp: Add SECCOMP_RET_TRAP · bb6ea430
      Will Drewry 提交于
      Adds a new return value to seccomp filters that triggers a SIGSYS to be
      delivered with the new SYS_SECCOMP si_code.
      
      This allows in-process system call emulation, including just specifying
      an errno or cleanly dumping core, rather than just dying.
      Suggested-by: NMarkus Gutschke <markus@chromium.org>
      Suggested-by: NJulien Tinnes <jln@chromium.org>
      Signed-off-by: NWill Drewry <wad@chromium.org>
      Acked-by: NEric Paris <eparis@redhat.com>
      
      v18: - acked-by, rebase
           - don't mention secure_computing_int() anymore
      v15: - use audit_seccomp/skip
           - pad out error spacing; clean up switch (indan@nul.nu)
      v14: - n/a
      v13: - rebase on to 88ebdda6
      v12: - rebase on to linux-next
      v11: - clarify the comment (indan@nul.nu)
           - s/sigtrap/sigsys
      v10: - use SIGSYS, syscall_get_arch, updates arch/Kconfig
             note suggested-by (though original suggestion had other behaviors)
      v9:  - changes to SIGILL
      v8:  - clean up based on changes to dependent patches
      v7:  - introduction
      Signed-off-by: NJames Morris <james.l.morris@oracle.com>
      bb6ea430
    • W
      signal, x86: add SIGSYS info and make it synchronous. · a0727e8c
      Will Drewry 提交于
      This change enables SIGSYS, defines _sigfields._sigsys, and adds
      x86 (compat) arch support.  _sigsys defines fields which allow
      a signal handler to receive the triggering system call number,
      the relevant AUDIT_ARCH_* value for that number, and the address
      of the callsite.
      
      SIGSYS is added to the SYNCHRONOUS_MASK because it is desirable for it
      to have setup_frame() called for it. The goal is to ensure that
      ucontext_t reflects the machine state from the time-of-syscall and not
      from another signal handler.
      
      The first consumer of SIGSYS would be seccomp filter.  In particular,
      a filter program could specify a new return value, SECCOMP_RET_TRAP,
      which would result in the system call being denied and the calling
      thread signaled.  This also means that implementing arch-specific
      support can be dependent upon HAVE_ARCH_SECCOMP_FILTER.
      Suggested-by: NH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NWill Drewry <wad@chromium.org>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Reviewed-by: NH. Peter Anvin <hpa@zytor.com>
      Acked-by: NEric Paris <eparis@redhat.com>
      
      v18: - added acked by, rebase
      v17: - rebase and reviewed-by addition
      v14: - rebase/nochanges
      v13: - rebase on to 88ebdda6
      v12: - reworded changelog (oleg@redhat.com)
      v11: - fix dropped words in the change description
           - added fallback copy_siginfo support.
           - added __ARCH_SIGSYS define to allow stepped arch support.
      v10: - first version based on suggestion
      Signed-off-by: NJames Morris <james.l.morris@oracle.com>
      a0727e8c
    • W
      asm/syscall.h: add syscall_get_arch · 07bd18d0
      Will Drewry 提交于
      Adds a stub for a function that will return the AUDIT_ARCH_* value
      appropriate to the supplied task based on the system call convention.
      
      For audit's use, the value can generally be hard-coded at the
      audit-site.  However, for other functionality not inlined into syscall
      entry/exit, this makes that information available.  seccomp_filter is
      the first planned consumer and, as such, the comment indicates a tie to
      CONFIG_HAVE_ARCH_SECCOMP_FILTER.
      Suggested-by: NRoland McGrath <mcgrathr@chromium.org>
      Signed-off-by: NWill Drewry <wad@chromium.org>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: NEric Paris <eparis@redhat.com>
      
      v18: comment and change reword and rebase.
      v14: rebase/nochanges
      v13: rebase on to 88ebdda6
      v12: rebase on to linux-next
      v11: fixed improper return type
      v10: introduced
      Signed-off-by: NJames Morris <james.l.morris@oracle.com>
      07bd18d0
  12. 08 4月, 2012 1 次提交
  13. 03 4月, 2012 1 次提交
    • P
      asm-generic: add linux/types.h to cmpxchg.h · 80da6a4f
      Paul Gortmaker 提交于
      Builds of the openrisc or1ksim_defconfig show the following:
      
        In file included from arch/openrisc/include/generated/asm/cmpxchg.h:1:0,
                         from include/asm-generic/atomic.h:18,
                         from arch/openrisc/include/generated/asm/atomic.h:1,
                         from include/linux/atomic.h:4,
                         from include/linux/dcache.h:4,
                         from fs/notify/fsnotify.c:19:
        include/asm-generic/cmpxchg.h: In function '__xchg':
        include/asm-generic/cmpxchg.h:34:20: error: expected ')' before 'u8'
        include/asm-generic/cmpxchg.h:34:20: warning: type defaults to 'int' in type name
      
      and many more lines of similar errors.  It seems specific to the or32
      because most other platforms have an arch specific component that would
      have already included types.h ahead of time, but the o32 does not.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Jonas Bonn <jonas@southpole.se>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: David Howells <dhowells@redhat.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80da6a4f
  14. 29 3月, 2012 8 次提交
  15. 28 3月, 2012 1 次提交
    • C
      compat: use sys_sendfile64() implementation for sendfile syscall · 1631fcea
      Chris Metcalf 提交于
      <asm-generic/unistd.h> was set up to use sys_sendfile() for the 32-bit
      compat API instead of sys_sendfile64(), but in fact the right thing to
      do is to use sys_sendfile64() in all cases.  The 32-bit sendfile64() API
      in glibc uses the sendfile64 syscall, so it has to be capable of doing
      full 64-bit operations.  But the sys_sendfile() kernel implementation
      has a MAX_NON_LFS test in it which explicitly limits the offset to 2^32.
      So, we need to use the sys_sendfile64() implementation in the kernel
      for this case.
      
      Cc: <stable@kernel.org>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
      1631fcea
  16. 26 3月, 2012 1 次提交
    • P
      params: <level>_initcall-like kernel parameters · 026cee00
      Pawel Moll 提交于
      This patch adds a set of macros that can be used to declare
      kernel parameters to be parsed _before_ initcalls at a chosen
      level are executed.  We rename the now-unused "flags" field of
      struct kernel_param as the level.  It's signed, for when we
      use this for early params as well, in future.
      
      Linker macro collating init calls had to be modified in order
      to add additional symbols between levels that are later used
      by the init code to split the calls into blocks.
      Signed-off-by: NPawel Moll <pawel.moll@arm.com>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      026cee00
  17. 24 3月, 2012 2 次提交
    • J
      coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP · accb61fe
      Jason Baron 提交于
      Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
      for 'VM_NODUMP' flag.  The idea is is to add a new madvise() flag:
      MADV_DONTDUMP, which can be set by applications to specifically request
      memory regions which should not dump core.
      
      The specific application I have in mind is qemu: we can add a flag there
      that wouldn't dump all of guest memory when qemu dumps core.  This flag
      might also be useful for security sensitive apps that want to absolutely
      make sure that parts of memory are not dumped.  To clear the flag use:
      MADV_DODUMP.
      
      [akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
      [akpm@linux-foundation.org: fix up the architectures which broke]
      Signed-off-by: NJason Baron <jbaron@redhat.com>
      Acked-by: NRoland McGrath <roland@hack.frob.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      accb61fe
    • J
      consolidate WARN_...ONCE() static variables · 7ccaba53
      Jan Beulich 提交于
      Due to the alignment of following variables, these typically consume
      more than just the single byte that 'bool' requires, and as there are a
      few hundred instances, the cache pollution (not so much the waste of
      memory) sums up.  Put these variables into their own section, outside of
      any half way frequently used memory range.
      
      Do the same also to the __warned variable of rcu_lockdep_assert().
      (Don't, however, include the ones used by printk_once() and alike, as
      they can potentially be hot.)
      Signed-off-by: NJan Beulich <jbeulich@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ccaba53
  18. 22 3月, 2012 1 次提交
    • A
      mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906
      Andrea Arcangeli 提交于
      In some cases it may happen that pmd_none_or_clear_bad() is called with
      the mmap_sem hold in read mode.  In those cases the huge page faults can
      allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
      false positive from pmd_bad() that will not like to see a pmd
      materializing as trans huge.
      
      It's not khugepaged causing the problem, khugepaged holds the mmap_sem
      in write mode (and all those sites must hold the mmap_sem in read mode
      to prevent pagetables to go away from under them, during code review it
      seems vm86 mode on 32bit kernels requires that too unless it's
      restricted to 1 thread per process or UP builds).  The race is only with
      the huge pagefaults that can convert a pmd_none() into a
      pmd_trans_huge().
      
      Effectively all these pmd_none_or_clear_bad() sites running with
      mmap_sem in read mode are somewhat speculative with the page faults, and
      the result is always undefined when they run simultaneously.  This is
      probably why it wasn't common to run into this.  For example if the
      madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
      fault, the hugepage will not be zapped, if the page fault runs first it
      will be zapped.
      
      Altering pmd_bad() not to error out if it finds hugepmds won't be enough
      to fix this, because zap_pmd_range would then proceed to call
      zap_pte_range (which would be incorrect if the pmd become a
      pmd_trans_huge()).
      
      The simplest way to fix this is to read the pmd in the local stack
      (regardless of what we read, no need of actual CPU barriers, only
      compiler barrier needed), and be sure it is not changing under the code
      that computes its value.  Even if the real pmd is changing under the
      value we hold on the stack, we don't care.  If we actually end up in
      zap_pte_range it means the pmd was not none already and it was not huge,
      and it can't become huge from under us (khugepaged locking explained
      above).
      
      All we need is to enforce that there is no way anymore that in a code
      path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
      can run into a hugepmd.  The overhead of a barrier() is just a compiler
      tweak and should not be measurable (I only added it for THP builds).  I
      don't exclude different compiler versions may have prevented the race
      too by caching the value of *pmd on the stack (that hasn't been
      verified, but it wouldn't be impossible considering
      pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
      and there's no external function called in between pmd_trans_huge and
      pmd_none_or_clear_bad).
      
      		if (pmd_trans_huge(*pmd)) {
      			if (next-addr != HPAGE_PMD_SIZE) {
      				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
      				split_huge_page_pmd(vma->vm_mm, pmd);
      			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      				continue;
      			/* fall through */
      		}
      		if (pmd_none_or_clear_bad(pmd))
      
      Because this race condition could be exercised without special
      privileges this was reported in CVE-2012-1179.
      
      The race was identified and fully explained by Ulrich who debugged it.
      I'm quoting his accurate explanation below, for reference.
      
      ====== start quote =======
            mapcount 0 page_mapcount 1
            kernel BUG at mm/huge_memory.c:1384!
      
          At some point prior to the panic, a "bad pmd ..." message similar to the
          following is logged on the console:
      
            mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
      
          The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
          the page's PMD table entry.
      
              143 void pmd_clear_bad(pmd_t *pmd)
              144 {
          ->  145         pmd_ERROR(*pmd);
              146         pmd_clear(pmd);
              147 }
      
          After the PMD table entry has been cleared, there is an inconsistency
          between the actual number of PMD table entries that are mapping the page
          and the page's map count (_mapcount field in struct page). When the page
          is subsequently reclaimed, __split_huge_page() detects this inconsistency.
      
             1381         if (mapcount != page_mapcount(page))
             1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
             1383                        mapcount, page_mapcount(page));
          -> 1384         BUG_ON(mapcount != page_mapcount(page));
      
          The root cause of the problem is a race of two threads in a multithreaded
          process. Thread B incurs a page fault on a virtual address that has never
          been accessed (PMD entry is zero) while Thread A is executing an madvise()
          system call on a virtual address within the same 2 MB (huge page) range.
      
                     virtual address space
                    .---------------------.
                    |                     |
                    |                     |
                  .-|---------------------|
                  | |                     |
                  | |                     |<-- B(fault)
                  | |                     |
            2 MB  | |/////////////////////|-.
            huge <  |/////////////////////|  > A(range)
            page  | |/////////////////////|-'
                  | |                     |
                  | |                     |
                  '-|---------------------|
                    |                     |
                    |                     |
                    '---------------------'
      
          - Thread A is executing an madvise(..., MADV_DONTNEED) system call
            on the virtual address range "A(range)" shown in the picture.
      
          sys_madvise
            // Acquire the semaphore in shared mode.
            down_read(&current->mm->mmap_sem)
            ...
            madvise_vma
              switch (behavior)
              case MADV_DONTNEED:
                   madvise_dontneed
                     zap_page_range
                       unmap_vmas
                         unmap_page_range
                           zap_pud_range
                             zap_pmd_range
                               //
                               // Assume that this huge page has never been accessed.
                               // I.e. content of the PMD entry is zero (not mapped).
                               //
                               if (pmd_trans_huge(*pmd)) {
                                   // We don't get here due to the above assumption.
                               }
                               //
                               // Assume that Thread B incurred a page fault and
                   .---------> // sneaks in here as shown below.
                   |           //
                   |           if (pmd_none_or_clear_bad(pmd))
                   |               {
                   |                 if (unlikely(pmd_bad(*pmd)))
                   |                     pmd_clear_bad
                   |                     {
                   |                       pmd_ERROR
                   |                         // Log "bad pmd ..." message here.
                   |                       pmd_clear
                   |                         // Clear the page's PMD entry.
                   |                         // Thread B incremented the map count
                   |                         // in page_add_new_anon_rmap(), but
                   |                         // now the page is no longer mapped
                   |                         // by a PMD entry (-> inconsistency).
                   |                     }
                   |               }
                   |
                   v
          - Thread B is handling a page fault on virtual address "B(fault)" shown
            in the picture.
      
          ...
          do_page_fault
            __do_page_fault
              // Acquire the semaphore in shared mode.
              down_read_trylock(&mm->mmap_sem)
              ...
              handle_mm_fault
                if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
                    // We get here due to the above assumption (PMD entry is zero).
                    do_huge_pmd_anonymous_page
                      alloc_hugepage_vma
                        // Allocate a new transparent huge page here.
                      ...
                      __do_huge_pmd_anonymous_page
                        ...
                        spin_lock(&mm->page_table_lock)
                        ...
                        page_add_new_anon_rmap
                          // Here we increment the page's map count (starts at -1).
                          atomic_set(&page->_mapcount, 0)
                        set_pmd_at
                          // Here we set the page's PMD entry which will be cleared
                          // when Thread A calls pmd_clear_bad().
                        ...
                        spin_unlock(&mm->page_table_lock)
      
          The mmap_sem does not prevent the race because both threads are acquiring
          it in shared mode (down_read).  Thread B holds the page_table_lock while
          the page's map count and PMD table entry are updated.  However, Thread A
          does not synchronize on that lock.
      
      ====== end quote =======
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Reported-by: NUlrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: NLarry Woodman <lwoodman@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>		[2.6.38+]
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a5a9906
  19. 05 3月, 2012 1 次提交
    • P
      BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882
      Paul Gortmaker 提交于
      If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
      other BUG variant in a static inline (i.e. not in a #define) then
      that header really should be including <linux/bug.h> and not just
      expecting it to be implicitly present.
      
      We can make this change risk-free, since if the files using these
      headers didn't have exposure to linux/bug.h already, they would have
      been causing compile failures/warnings.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      187f1882
  20. 03 3月, 2012 1 次提交
  21. 27 2月, 2012 1 次提交
    • J
      [PARISC] fix compile break caused by iomap: make IOPORT/PCI mapping functions conditional · 97a29d59
      James Bottomley 提交于
      The problem in
      
      commit fea80311
      Author: Randy Dunlap <rdunlap@xenotime.net>
      Date:   Sun Jul 24 11:39:14 2011 -0700
      
          iomap: make IOPORT/PCI mapping functions conditional
      
      is that if your architecture supplies pci_iomap/pci_iounmap, it expects
      always to supply them.  Adding empty body defitions in the !CONFIG_PCI
      case, which is what this patch does, breaks the parisc compile because
      the functions become doubly defined.  It took us a while to spot this,
      because we don't actually build !CONFIG_PCI very often (only if someone
      is brave enough to test the snake/asp machines).
      
      Since the note in the commit log says this is to fix a
      CONFIG_GENERIC_IOMAP issue (which it does because CONFIG_GENERIC_IOMAP
      supplies pci_iounmap only if CONFIG_PCI is set), there should actually
      have been a condition upon this.  This should make sure no other
      architecture's !CONFIG_PCI compile breaks in the same way as parisc.
      
      The fix had to be updated to take account of the GENERIC_PCI_IOMAP
      separation.
      Reported-by: NRolf Eike Beer <eike@sf-mail.de>
      Signed-off-by: NJames Bottomley <JBottomley@Parallels.com>
      97a29d59
  22. 25 2月, 2012 2 次提交
    • O
      epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() · d80e731e
      Oleg Nesterov 提交于
      This patch is intentionally incomplete to simplify the review.
      It ignores ep_unregister_pollwait() which plays with the same wqh.
      See the next change.
      
      epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
      f_op->poll() needs. In particular it assumes that the wait queue
      can't go away until eventpoll_release(). This is not true in case
      of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
      which is not connected to the file.
      
      This patch adds the special event, POLLFREE, currently only for
      epoll. It expects that init_poll_funcptr()'ed hook should do the
      necessary cleanup. Perhaps it should be defined as EPOLLFREE in
      eventpoll.
      
      __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
      ->signalfd_wqh is not empty, we add the new signalfd_cleanup()
      helper.
      
      ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
      This make this poll entry inconsistent, but we don't care. If you
      share epoll fd which contains our sigfd with another process you
      should blame yourself. signalfd is "really special". I simply do
      not know how we can define the "right" semantics if it used with
      epoll.
      
      The main problem is, epoll calls signalfd_poll() once to establish
      the connection with the wait queue, after that signalfd_poll(NULL)
      returns the different/inconsistent results depending on who does
      EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
      has nothing to do with the file, it works with the current thread.
      
      In short: this patch is the hack which tries to fix the symptoms.
      It also assumes that nobody can take tasklist_lock under epoll
      locks, this seems to be true.
      
      Note:
      
      	- we do not have wake_up_all_poll() but wake_up_poll()
      	  is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.
      
      	- signalfd_cleanup() uses POLLHUP along with POLLFREE,
      	  we need a couple of simple changes in eventpoll.c to
      	  make sure it can't be "lost".
      Reported-by: NMaxime Bizon <mbizon@freebox.fr>
      Cc: <stable@kernel.org>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d80e731e
    • J
      bitops: Add missing parentheses to new get_order macro · b893485d
      Joerg Roedel 提交于
      The new get_order macro introcuded in commit
      
      	d66acc39
      
      does not use parentheses around all uses of the parameter n.
      This causes new compile warnings, for example in the
      amd_iommu_init.c function:
      
      drivers/iommu/amd_iommu_init.c:561:6: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses]
      drivers/iommu/amd_iommu_init.c:561:6: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses]
      
      Fix those warnings by adding the missing parentheses.
      Reported-by: NIngo Molnar <mingo@elte.hu>
      Cc: David Howells <dhowells@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NJoerg Roedel <joerg.roedel@amd.com>
      Link: http://lkml.kernel.org/r/1330088295-28732-1-git-send-email-joerg.roedel@amd.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      b893485d
  23. 24 2月, 2012 2 次提交