提交 · bd0e162d0312aa95e8b85ba883efddebf27be121 · OpenHarmony / kernel_linux

30 5月, 2012 1 次提交

mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition · 26c19178

由 Andrea Arcangeli 提交于 5月 29, 2012

When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.

PID: 11679  TASK: f06e8000  CPU: 3   COMMAND: "do_race_2_panic"
 #0 [f06a9dd8] crash_kexec at c049b5ec
 #1 [f06a9e2c] oops_end at c083d1c2
 #2 [f06a9e40] no_context at c0433ded
 #3 [f06a9e64] bad_area_nosemaphore at c043401a
 #4 [f06a9e6c] __do_page_fault at c0434493
 #5 [f06a9eec] do_page_fault at c083eb45
 #6 [f06a9f04] error_code (via page_fault) at c083c5d5
    EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
    00000000
    DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
    CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
 #7 [f06a9f38] _spin_lock at c083bc14
 #8 [f06a9f44] sys_mincore at c0507b7d
 #9 [f06a9fb0] system_call at c083becd
                         start           len
    EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
    SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
    CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286

This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.

With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.

This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.

Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.

NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.

This bug was discovered and fully debugged by Ulrich, quote:

----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.

    496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
    *pmd)
    497 {
    498         /* depend on compiler for an atomic pmd read */
    499         pmd_t pmdval = *pmd;

                                // edi = pmd pointer
0xc0507a74 <sys_mincore+548>:   mov    0x8(%esp),%edi
...
                                // edx = PTE page table high address
0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
...
                                // eax = PTE page table low address
0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax

[..]

Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.

-  The PMD entry {high|low} is 0x0000000000000000.
   The "mov" at 0xc0507a84 loads 0x00000000 into edx.

-  A page fault (on another CPU) sneaks in between the two "mov"
   instructions and instantiates the PMD.

-  The PMD entry {high|low} is now 0x00000003fda38067.
   The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----
Reported-by: NUlrich Obergfell <uobergfe@redhat.com>
Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

26c19178

29 5月, 2012 1 次提交

KVM: Export asm-generic/kvm_para.h · 56457f38

由 Avi Kivity 提交于 5月 28, 2012

Prevents build failures on non-KVM archs.
Tested-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: NAvi Kivity <avi@redhat.com>

56457f38

27 5月, 2012 1 次提交

word-at-a-time: make the interfaces truly generic · 36126f8f

由 Linus Torvalds 提交于 5月 26, 2012

This changes the interfaces in <asm/word-at-a-time.h> to be a bit more
complicated, but a lot more generic.

In particular, it allows us to really do the operations efficiently on
both little-endian and big-endian machines, pretty much regardless of
machine details.  For example, if you can rely on a fast population
count instruction on your architecture, this will allow you to make your
optimized <asm/word-at-a-time.h> file with that.

NOTE! The "generic" version in include/asm-generic/word-at-a-time.h is
not truly generic, it actually only works on big-endian.  Why? Because
on little-endian the generic algorithms are wasteful, since you can
inevitably do better. The x86 implementation is an example of that.

(The only truly non-generic part of the asm-generic implementation is
the "find_zero()" function, and you could make a little-endian version
of it.  And if the Kbuild infrastructure allowed us to pick a particular
header file, that would be lovely)

The <asm/word-at-a-time.h> functions are as follows:

 - WORD_AT_A_TIME_CONSTANTS: specific constants that the algorithm
   uses.

 - has_zero(): take a word, and determine if it has a zero byte in it.
   It gets the word, the pointer to the constant pool, and a pointer to
   an intermediate "data" field it can set.

   This is the "quick-and-dirty" zero tester: it's what is run inside
   the hot loops.

 - "prep_zero_mask()": take the word, the data that has_zero() produced,
   and the constant pool, and generate an *exact* mask of which byte had
   the first zero.  This is run directly *outside* the loop, and allows
   the "has_zero()" function to answer the "is there a zero byte"
   question without necessarily getting exactly *which* byte is the
   first one to contain a zero.

   If you do multiple byte lookups concurrently (eg "hash_name()", which
   looks for both NUL and '/' bytes), after you've done the prep_zero_mask()
   phase, the result of those can be or'ed together to get the "either
   or" case.

 - The result from "prep_zero_mask()" can then be fed into "find_zero()"
   (to find the byte offset of the first byte that was zero) or into
   "zero_bytemask()" (to find the bytemask of the bytes preceding the
   zero byte).

   The existence of zero_bytemask() is optional, and is not necessary
   for the normal string routines.  But dentry name hashing needs it, so
   if you enable DENTRY_WORD_AT_A_TIME you need to expose it.

This changes the generic strncpy_from_user() function and the dentry
hashing functions to use these modified word-at-a-time interfaces.  This
gets us back to the optimized state of the x86 strncpy that we lost in
the previous commit when moving over to the generic version.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

36126f8f

26 5月, 2012 1 次提交

arch/tile: allow building Linux with transparent huge pages enabled · 73636b1a

由 Chris Metcalf 提交于 3月 28, 2012

The change adds some infrastructure for managing tile pmd's more generally,
using pte_pmd() and pmd_pte() methods to translate pmd values to and
from ptes, since on TILEPro a pmd is really just a nested structure
holding a pgd (aka pte). Several existing pmd methods are moved into
this framework, and a whole raft of additional pmd accessors are defined
that are used by the transparent hugepage framework.

The tile PTE now has a "client2" bit. The bit is used to indicate a
transparent huge page is in the process of being split into subpages.

This change also fixes a generic bug where the return value of the
generic pmdp_splitting_flush() was incorrect.
Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>

73636b1a

21 5月, 2012 3 次提交

KVM: make asm-generic/kvm_para.h have an ifdef __KERNEL__ block · 322728e5

由 Paul Gortmaker 提交于 5月 18, 2012

There are two functions in this asm-generic file.  Looking at
other arch which do not use the generic version, these two fcns
are within an #ifdef __KERNEL__ block, so make the generic one
consistent with those.
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: NAvi Kivity <avi@redhat.com>

322728e5

drivers: add Contiguous Memory Allocator · c64be2bb

由 Marek Szyprowski 提交于 12月 29, 2011

The Contiguous Memory Allocator is a set of helper functions for DMA
mapping framework that improves allocations of contiguous memory chunks.

CMA grabs memory on system boot, marks it with MIGRATE_CMA migrate type
and gives back to the system. Kernel is allowed to allocate only movable
pages within CMA's managed memory so that it can be used for example for
page cache when DMA mapping do not use it. On
dma_alloc_from_contiguous() request such pages are migrated out of CMA
area to free required contiguous block and fulfill the request. This
allows to allocate large contiguous chunks of memory at any time
assuming that there is enough free memory available in the system.

This code is heavily based on earlier works by Michal Nazarewicz.
Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: NMichal Nazarewicz <mina86@mina86.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Tested-by: NRob Clark <rob.clark@linaro.org>
Tested-by: NOhad Ben-Cohen <ohad@wizery.com>
Tested-by: NBenjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: NRobert Nelson <robertcnelson@gmail.com>
Tested-by: NBarry Song <Baohua.Song@csr.com>

c64be2bb

common: add dma_mmap_from_coherent() function · bca0fa5f

由 Marek Szyprowski 提交于 3月 23, 2012

Add a common helper for dma-mapping core for mapping a coherent buffer
to userspace.
Reported-by: NSubash Patel <subashrp@gmail.com>
Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com>
Acked-by: NKyungmin Park <kyungmin.park@samsung.com>
Tested-By: NSubash Patel <subash.ramaswamy@linaro.org>

bca0fa5f

19 5月, 2012 3 次提交

gpiolib: Remove 'const' from data argument of gpiochip_find() · 07ce8ec7

由 Grant Likely 提交于 5月 18, 2012

Commit 3d0f7cf0 "gpio: Adjust of_xlate API to support multiple GPIO
chips" changed the api of gpiochip_find to drop const from the data
parameter of the match hook, but didn't also drop const from data
causing a build warning.
Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>

07ce8ec7

gpio: Adjust of_xlate API to support multiple GPIO chips · 3d0f7cf0

由 Grant Likely 提交于 5月 17, 2012

This patch changes the of_xlate API to make it possible for multiple
gpio_chips to refer to the same device tree node.  This is useful for
banked GPIO controllers that use multiple gpio_chips for a single
device.  With this change the core code will try calling of_xlate on
each gpio_chip that references the device_node and will return the
gpio number for the first one to return 'true'.
Tested-by: NRoland Stigge <stigge@antcom.de>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>

3d0f7cf0

gpiolib: Implement devm_gpio_request_one() · 09d71ff1

由 Mark Brown 提交于 5月 02, 2012

Allow drivers to use the modern request and configure idiom together
with devres.

As with plain gpio_request() and gpio_request_one() we can't implement
the old school version in terms of _one() as this would force the
explicit selection of a direction in gpio_request() which could break
systems if we pick the wrong one.  Implementing devm_gpio_request_one()
in terms of devm_gpio_request() would needlessly complicate things or
lead to duplication from the unmanaged version depending on how it's
done.
Signed-off-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Acked-by: NLinus Walleij <linus.walleij@linaro.org>
Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>

09d71ff1

17 5月, 2012 1 次提交

ftrace: Sort all function addresses, not just per page · 9fd49328

由 Steven Rostedt 提交于 4月 24, 2012

Instead of just sorting the ip's of the functions per ftrace page,
sort the entire list before adding them to the ftrace pages.

This will allow the bsearch algorithm to be sped up as it can
also sort by pages, not just records within a page.
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

9fd49328

01 5月, 2012 2 次提交

PCI: work around Stratus ftServer broken PCIe hierarchy · 284f5f9d

由 Bjorn Helgaas 提交于 4月 30, 2012

A PCIe downstream port is a P2P bridge.  Its secondary interface is
a link that should lead only to device 0 (unless ARI is enabled)[1], so
we don't probe for non-zero device numbers.

Some Stratus ftServer systems have a PCIe downstream port (02:00.0) that
leads to both an upstream port (03:00.0) and a downstream port (03:01.0),
and 03:01.0 has important devices below it:

  [0000:02]-+-00.0-[03-3c]--+-00.0-[04-09]--...
                            \-01.0-[0a-0d]--+-[USB]
                                            +-[NIC]
                                            +-...

Previously, we didn't enumerate device 03:01.0, so USB and the network
didn't work.  This patch adds a DMI quirk to scan all device numbers,
not just 0, below a downstream port.

Based on a patch by Prarit Bhargava.

[1] PCIe spec r3.0, sec 7.3.1

CC: Myron Stowe <mstowe@redhat.com>
CC: Don Dutile <ddutile@redhat.com>
CC: James Paradis <james.paradis@stratus.com>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Jesse Barnes <jbarnes@virtuousgeek.org>
CC: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>

284f5f9d

asm-generic: Use __BITS_PER_LONG in statfs.h · f5c2347e

由 H. Peter Anvin 提交于 4月 26, 2012

<asm-generic/statfs.h> is exported to userspace, so using
BITS_PER_LONG is invalid.  We need to use __BITS_PER_LONG instead.

This is kernel bugzilla 43165.
Reported-by: NH.J. Lu <hjl.tools@gmail.com>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1335465916-16965-1-git-send-email-hpa@linux.intel.comAcked-by: NArnd Bergmann <arnd@arndb.de>
Cc: <stable@vger.kernel.org>

f5c2347e

24 4月, 2012 1 次提交

asm-generic: Allow overriding clock_t and add attributes to siginfo_t · d643bdca

由 H. Peter Anvin 提交于 4月 23, 2012

For the particular issue of x32, which shares code with i386 in the
handling of compat_siginfo_t, the use of a 64-bit clock_t bumps the
sigchld structure out of alignment, which triggers a messy cascade of
padding.

This was already handled on the kernel compat side, but it needs
handling on the user space side, which uses the generic header.  To
make that possible:

1. Allow __kernel_clock_t to be overridden in struct siginfo;
2. Allow there to be attributes added to struct siginfo.
Reported-by: NH.J. Lu <hjl.rools@gmail.com>
Cc: Bruce J. Beare <bruce.j.beare@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: http://lkml.kernel.org/r/CAMe9rOqF6Kh6-NK7oP0Fpzkd4SBAWU%2BG53hwBbSD4iA2UzyxuA@mail.gmail.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

d643bdca

21 4月, 2012 1 次提交

KVM: add kvm_arch_para_features stub to asm-generic/kvm_para.h · 1f15d109

由 Marcelo Tosatti 提交于 4月 20, 2012

Needed by kvm_para_has_feature().
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

1f15d109

14 4月, 2012 3 次提交

seccomp: Add SECCOMP_RET_TRAP · bb6ea430

由 Will Drewry 提交于 4月 12, 2012

Adds a new return value to seccomp filters that triggers a SIGSYS to be
delivered with the new SYS_SECCOMP si_code.

This allows in-process system call emulation, including just specifying
an errno or cleanly dumping core, rather than just dying.
Suggested-by: NMarkus Gutschke <markus@chromium.org>
Suggested-by: NJulien Tinnes <jln@chromium.org>
Signed-off-by: NWill Drewry <wad@chromium.org>
Acked-by: NEric Paris <eparis@redhat.com>

v18: - acked-by, rebase
     - don't mention secure_computing_int() anymore
v15: - use audit_seccomp/skip
     - pad out error spacing; clean up switch (indan@nul.nu)
v14: - n/a
v13: - rebase on to 88ebdda6
v12: - rebase on to linux-next
v11: - clarify the comment (indan@nul.nu)
     - s/sigtrap/sigsys
v10: - use SIGSYS, syscall_get_arch, updates arch/Kconfig
       note suggested-by (though original suggestion had other behaviors)
v9:  - changes to SIGILL
v8:  - clean up based on changes to dependent patches
v7:  - introduction
Signed-off-by: NJames Morris <james.l.morris@oracle.com>

bb6ea430

signal, x86: add SIGSYS info and make it synchronous. · a0727e8c

由 Will Drewry 提交于 4月 12, 2012

This change enables SIGSYS, defines _sigfields._sigsys, and adds
x86 (compat) arch support.  _sigsys defines fields which allow
a signal handler to receive the triggering system call number,
the relevant AUDIT_ARCH_* value for that number, and the address
of the callsite.

SIGSYS is added to the SYNCHRONOUS_MASK because it is desirable for it
to have setup_frame() called for it. The goal is to ensure that
ucontext_t reflects the machine state from the time-of-syscall and not
from another signal handler.

The first consumer of SIGSYS would be seccomp filter.  In particular,
a filter program could specify a new return value, SECCOMP_RET_TRAP,
which would result in the system call being denied and the calling
thread signaled.  This also means that implementing arch-specific
support can be dependent upon HAVE_ARCH_SECCOMP_FILTER.
Suggested-by: NH. Peter Anvin <hpa@zytor.com>
Signed-off-by: NWill Drewry <wad@chromium.org>
Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
Reviewed-by: NH. Peter Anvin <hpa@zytor.com>
Acked-by: NEric Paris <eparis@redhat.com>

v18: - added acked by, rebase
v17: - rebase and reviewed-by addition
v14: - rebase/nochanges
v13: - rebase on to 88ebdda6
v12: - reworded changelog (oleg@redhat.com)
v11: - fix dropped words in the change description
     - added fallback copy_siginfo support.
     - added __ARCH_SIGSYS define to allow stepped arch support.
v10: - first version based on suggestion
Signed-off-by: NJames Morris <james.l.morris@oracle.com>

a0727e8c

asm/syscall.h: add syscall_get_arch · 07bd18d0

由 Will Drewry 提交于 4月 12, 2012

Adds a stub for a function that will return the AUDIT_ARCH_* value
appropriate to the supplied task based on the system call convention.

For audit's use, the value can generally be hard-coded at the
audit-site.  However, for other functionality not inlined into syscall
entry/exit, this makes that information available.  seccomp_filter is
the first planned consumer and, as such, the comment indicates a tie to
CONFIG_HAVE_ARCH_SECCOMP_FILTER.
Suggested-by: NRoland McGrath <mcgrathr@chromium.org>
Signed-off-by: NWill Drewry <wad@chromium.org>
Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
Acked-by: NEric Paris <eparis@redhat.com>

v18: comment and change reword and rebase.
v14: rebase/nochanges
v13: rebase on to 88ebdda6
v12: rebase on to linux-next
v11: fixed improper return type
v10: introduced
Signed-off-by: NJames Morris <james.l.morris@oracle.com>

07bd18d0

08 4月, 2012 1 次提交

kvmclock: Add functions to check if the host has stopped the vm · 3b5d56b9

由 Eric B Munson 提交于 3月 10, 2012

When a host stops or suspends a VM it will set a flag to show this.  The
watchdog will use these functions to determine if a softlockup is real, or the
result of a suspended VM.
Signed-off-by: NEric B Munson <emunson@mgebm.net>
asm-generic changes Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NAvi Kivity <avi@redhat.com>

3b5d56b9

03 4月, 2012 1 次提交

asm-generic: add linux/types.h to cmpxchg.h · 80da6a4f

由 Paul Gortmaker 提交于 4月 01, 2012

Builds of the openrisc or1ksim_defconfig show the following:

  In file included from arch/openrisc/include/generated/asm/cmpxchg.h:1:0,
                   from include/asm-generic/atomic.h:18,
                   from arch/openrisc/include/generated/asm/atomic.h:1,
                   from include/linux/atomic.h:4,
                   from include/linux/dcache.h:4,
                   from fs/notify/fsnotify.c:19:
  include/asm-generic/cmpxchg.h: In function '__xchg':
  include/asm-generic/cmpxchg.h:34:20: error: expected ')' before 'u8'
  include/asm-generic/cmpxchg.h:34:20: warning: type defaults to 'int' in type name

and many more lines of similar errors.  It seems specific to the or32
because most other platforms have an arch specific component that would
have already included types.h ahead of time, but the o32 does not.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jonas Bonn <jonas@southpole.se>
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: David Howells <dhowells@redhat.com
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

80da6a4f

29 3月, 2012 8 次提交

Delete all instances of asm/system.h · 141124c0

由 David Howells 提交于 3月 28, 2012

Delete all instances of asm/system.h as they should be redundant by this
point.
Signed-off-by: NDavid Howells <dhowells@redhat.com>

141124c0

Remove all #inclusions of asm/system.h · 9ffc93f2

由 David Howells 提交于 3月 28, 2012

Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`
Signed-off-by: NDavid Howells <dhowells@redhat.com>

9ffc93f2

Add #includes needed to permit the removal of asm/system.h · 96f951ed

由 David Howells 提交于 3月 28, 2012

asm/system.h is a cause of circular dependency problems because it contains
commonly used primitive stuff like barrier definitions and uncommonly used
stuff like switch_to() that might require MMU definitions.

asm/system.h has been disintegrated by this point on all arches into the
following common segments:

 (1) asm/barrier.h

     Moved memory barrier definitions here.

 (2) asm/cmpxchg.h

     Moved xchg() and cmpxchg() here.  #included in asm/atomic.h.

 (3) asm/bug.h

     Moved die() and similar here.

 (4) asm/exec.h

     Moved arch_align_stack() here.

 (5) asm/elf.h

     Moved AT_VECTOR_SIZE_ARCH here.

 (6) asm/switch_to.h

     Moved switch_to() here.
Signed-off-by: NDavid Howells <dhowells@redhat.com>

96f951ed

Split arch_align_stack() out from asm-generic/system.h · 5d125066

由 David Howells 提交于 3月 28, 2012

Split arch_align_stack() out from asm-generic/system.h into its own header of
asm-generic/exec.h as part of the asm/system.h disintegration.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>

5d125066

Split the switch_to() wrapper out of asm-generic/system.h · 158bc507

由 David Howells 提交于 3月 28, 2012

Split the switch_to() wrapper out of asm-generic/system.h into its own
asm-generic/system.h as part of the asm/system.h disintegration.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>

158bc507

Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h · b4816afa

由 David Howells 提交于 3月 28, 2012

Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
to simplify disintegration of asm/system.h.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>

b4816afa

Create asm-generic/barrier.h · 885df91c

由 David Howells 提交于 3月 28, 2012

Create asm-generic/barrier.h and move the barrier definitions from
asm-generic/system.h to it.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>

885df91c

Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h · 34484277

由 David Howells 提交于 3月 28, 2012

Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h as all arch
files that #include the former also #include the latter.  See:

	grep -rl asm-generic/cmpxchg-local[.]h arch/ | sort > b
	grep -rl asm-generic/cmpxchg[.]h arch/ | sort > a
	comm a b

This simplifies the disintegration of asm-generic/system.h for arches that
don't have their own.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>

34484277

28 3月, 2012 1 次提交

compat: use sys_sendfile64() implementation for sendfile syscall · 1631fcea

由 Chris Metcalf 提交于 3月 26, 2012

<asm-generic/unistd.h> was set up to use sys_sendfile() for the 32-bit
compat API instead of sys_sendfile64(), but in fact the right thing to
do is to use sys_sendfile64() in all cases.  The 32-bit sendfile64() API
in glibc uses the sendfile64 syscall, so it has to be capable of doing
full 64-bit operations.  But the sys_sendfile() kernel implementation
has a MAX_NON_LFS test in it which explicitly limits the offset to 2^32.
So, we need to use the sys_sendfile64() implementation in the kernel
for this case.

Cc: <stable@kernel.org>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>

1631fcea

26 3月, 2012 1 次提交

params: <level>_initcall-like kernel parameters · 026cee00

由 Pawel Moll 提交于 3月 26, 2012

This patch adds a set of macros that can be used to declare
kernel parameters to be parsed _before_ initcalls at a chosen
level are executed.  We rename the now-unused "flags" field of
struct kernel_param as the level.  It's signed, for when we
use this for early params as well, in future.

Linker macro collating init calls had to be modified in order
to add additional symbols between levels that are later used
by the init code to split the calls into blocks.
Signed-off-by: NPawel Moll <pawel.moll@arm.com>
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>

026cee00

24 3月, 2012 2 次提交

coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP · accb61fe

由 Jason Baron 提交于 3月 23, 2012

Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
for 'VM_NODUMP' flag.  The idea is is to add a new madvise() flag:
MADV_DONTDUMP, which can be set by applications to specifically request
memory regions which should not dump core.

The specific application I have in mind is qemu: we can add a flag there
that wouldn't dump all of guest memory when qemu dumps core.  This flag
might also be useful for security sensitive apps that want to absolutely
make sure that parts of memory are not dumped.  To clear the flag use:
MADV_DODUMP.

[akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
[akpm@linux-foundation.org: fix up the architectures which broke]
Signed-off-by: NJason Baron <jbaron@redhat.com>
Acked-by: NRoland McGrath <roland@hack.frob.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

accb61fe

consolidate WARN_...ONCE() static variables · 7ccaba53

由 Jan Beulich 提交于 3月 23, 2012

Due to the alignment of following variables, these typically consume
more than just the single byte that 'bool' requires, and as there are a
few hundred instances, the cache pollution (not so much the waste of
memory) sums up.  Put these variables into their own section, outside of
any half way frequently used memory range.

Do the same also to the __warned variable of rcu_lockdep_assert().
(Don't, however, include the ones used by printk_once() and alike, as
they can potentially be hot.)
Signed-off-by: NJan Beulich <jbeulich@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7ccaba53

22 3月, 2012 1 次提交

mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906

由 Andrea Arcangeli 提交于 3月 21, 2012

In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode.  In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.

It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds).  The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().

Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously.  This is
probably why it wasn't common to run into this.  For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.

Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).

The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value.  Even if the real pmd is changing under the
value we hold on the stack, we don't care.  If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).

All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd.  The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds).  I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).

		if (pmd_trans_huge(*pmd)) {
			if (next-addr != HPAGE_PMD_SIZE) {
				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
				split_huge_page_pmd(vma->vm_mm, pmd);
			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
				continue;
			/* fall through */
		}
		if (pmd_none_or_clear_bad(pmd))

Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.

The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.

====== start quote =======
      mapcount 0 page_mapcount 1
      kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

      mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

        143 void pmd_clear_bad(pmd_t *pmd)
        144 {
    ->  145         pmd_ERROR(*pmd);
        146         pmd_clear(pmd);
        147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

       1381         if (mapcount != page_mapcount(page))
       1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
       1383                        mapcount, page_mapcount(page));
    -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

               virtual address space
              .---------------------.
              |                     |
              |                     |
            .-|---------------------|
            | |                     |
            | |                     |<-- B(fault)
            | |                     |
      2 MB  | |/////////////////////|-.
      huge <  |/////////////////////|  > A(range)
      page  | |/////////////////////|-'
            | |                     |
            | |                     |
            '-|---------------------|
              |                     |
              |                     |
              '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
      on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
             madvise_dontneed
               zap_page_range
                 unmap_vmas
                   unmap_page_range
                     zap_pud_range
                       zap_pmd_range
                         //
                         // Assume that this huge page has never been accessed.
                         // I.e. content of the PMD entry is zero (not mapped).
                         //
                         if (pmd_trans_huge(*pmd)) {
                             // We don't get here due to the above assumption.
                         }
                         //
                         // Assume that Thread B incurred a page fault and
             .---------> // sneaks in here as shown below.
             |           //
             |           if (pmd_none_or_clear_bad(pmd))
             |               {
             |                 if (unlikely(pmd_bad(*pmd)))
             |                     pmd_clear_bad
             |                     {
             |                       pmd_ERROR
             |                         // Log "bad pmd ..." message here.
             |                       pmd_clear
             |                         // Clear the page's PMD entry.
             |                         // Thread B incremented the map count
             |                         // in page_add_new_anon_rmap(), but
             |                         // now the page is no longer mapped
             |                         // by a PMD entry (-> inconsistency).
             |                     }
             |               }
             |
             v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
      in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
              // We get here due to the above assumption (PMD entry is zero).
              do_huge_pmd_anonymous_page
                alloc_hugepage_vma
                  // Allocate a new transparent huge page here.
                ...
                __do_huge_pmd_anonymous_page
                  ...
                  spin_lock(&mm->page_table_lock)
                  ...
                  page_add_new_anon_rmap
                    // Here we increment the page's map count (starts at -1).
                    atomic_set(&page->_mapcount, 0)
                  set_pmd_at
                    // Here we set the page's PMD entry which will be cleared
                    // when Thread A calls pmd_clear_bad().
                  ...
                  spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read).  Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated.  However, Thread A
    does not synchronize on that lock.

====== end quote =======

[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: NUlrich Obergfell <uobergfe@redhat.com>
Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: NLarry Woodman <lwoodman@redhat.com>
Acked-by: NRik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>		[2.6.38+]
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1a5a9906

05 3月, 2012 1 次提交

BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882

由 Paul Gortmaker 提交于 11月 23, 2011

If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
other BUG variant in a static inline (i.e. not in a #define) then
that header really should be including <linux/bug.h> and not just
expecting it to be implicitly present.

We can make this change risk-free, since if the files using these
headers didn't have exposure to linux/bug.h already, they would have
been causing compile failures/warnings.
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>

187f1882

03 3月, 2012 1 次提交
- G
  gpio: constify the data parameter to gpiochip_find() · 6e2cf651
  由 Grant Likely 提交于 3月 02, 2012
```
Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>
```
  6e2cf651
27 2月, 2012 1 次提交

[PARISC] fix compile break caused by iomap: make IOPORT/PCI mapping functions conditional · 97a29d59

由 James Bottomley 提交于 1月 30, 2012

The problem in

commit fea80311
Author: Randy Dunlap <rdunlap@xenotime.net>
Date:   Sun Jul 24 11:39:14 2011 -0700

    iomap: make IOPORT/PCI mapping functions conditional

is that if your architecture supplies pci_iomap/pci_iounmap, it expects
always to supply them.  Adding empty body defitions in the !CONFIG_PCI
case, which is what this patch does, breaks the parisc compile because
the functions become doubly defined.  It took us a while to spot this,
because we don't actually build !CONFIG_PCI very often (only if someone
is brave enough to test the snake/asp machines).

Since the note in the commit log says this is to fix a
CONFIG_GENERIC_IOMAP issue (which it does because CONFIG_GENERIC_IOMAP
supplies pci_iounmap only if CONFIG_PCI is set), there should actually
have been a condition upon this.  This should make sure no other
architecture's !CONFIG_PCI compile breaks in the same way as parisc.

The fix had to be updated to take account of the GENERIC_PCI_IOMAP
separation.
Reported-by: NRolf Eike Beer <eike@sf-mail.de>
Signed-off-by: NJames Bottomley <JBottomley@Parallels.com>

97a29d59

25 2月, 2012 2 次提交

epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() · d80e731e

由 Oleg Nesterov 提交于 2月 24, 2012

This patch is intentionally incomplete to simplify the review.
It ignores ep_unregister_pollwait() which plays with the same wqh.
See the next change.

epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
f_op->poll() needs. In particular it assumes that the wait queue
can't go away until eventpoll_release(). This is not true in case
of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
which is not connected to the file.

This patch adds the special event, POLLFREE, currently only for
epoll. It expects that init_poll_funcptr()'ed hook should do the
necessary cleanup. Perhaps it should be defined as EPOLLFREE in
eventpoll.

__cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
->signalfd_wqh is not empty, we add the new signalfd_cleanup()
helper.

ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
This make this poll entry inconsistent, but we don't care. If you
share epoll fd which contains our sigfd with another process you
should blame yourself. signalfd is "really special". I simply do
not know how we can define the "right" semantics if it used with
epoll.

The main problem is, epoll calls signalfd_poll() once to establish
the connection with the wait queue, after that signalfd_poll(NULL)
returns the different/inconsistent results depending on who does
EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
has nothing to do with the file, it works with the current thread.

In short: this patch is the hack which tries to fix the symptoms.
It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.

Note:

	- we do not have wake_up_all_poll() but wake_up_poll()
	  is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

	- signalfd_cleanup() uses POLLHUP along with POLLFREE,
	  we need a couple of simple changes in eventpoll.c to
	  make sure it can't be "lost".
Reported-by: NMaxime Bizon <mbizon@freebox.fr>
Cc: <stable@kernel.org>
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d80e731e

bitops: Add missing parentheses to new get_order macro · b893485d

由 Joerg Roedel 提交于 2月 24, 2012

The new get_order macro introcuded in commit

d66acc39

does not use parentheses around all uses of the parameter n.
This causes new compile warnings, for example in the
amd_iommu_init.c function:

drivers/iommu/amd_iommu_init.c:561:6: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses]
drivers/iommu/amd_iommu_init.c:561:6: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses]

Fix those warnings by adding the missing parentheses.
Reported-by: NIngo Molnar <mingo@elte.hu>
Cc: David Howells <dhowells@redhat.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NJoerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/1330088295-28732-1-git-send-email-joerg.roedel@amd.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

b893485d

24 2月, 2012 2 次提交

net: Add framework to allow sending packets with customized CRC. · 3bdc0eba

由 Ben Greear 提交于 2月 11, 2012

This is useful for testing RX handling of frames with bad
CRCs.

Requires driver support to actually put the packet on the
wire properly.
Signed-off-by: NBen Greear <greearb@candelatech.com>
Tested-by: NAaron Brown <aaron.f.brown@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

3bdc0eba

PCI: collapse pcibios_resource_to_bus · fb127cb9

由 Bjorn Helgaas 提交于 2月 23, 2012

Everybody uses the generic pcibios_resource_to_bus() supplied by the core
now, so remove the ARCH_HAS_GENERIC_PCI_OFFSETS used during conversion.
Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>

fb127cb9

OpenHarmony / kernel_linux 上一次同步 大约 4 年

OpenHarmony / kernel_linux
上一次同步大约 4 年