Commits · 14318efb322e2fe1a034c69463d725209eb9d548 · openeuler / Kernel

03 Dec, 2012 1 commit

ARM: 7587/1: implement optimized percpu variable access · 14318efb

Committed by Rob Herring on Nov 29, 2012

Use the previously unused TPIDRPRW register to store percpu offsets.
TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.

This replaces 2 loads with an mrc instruction for each percpu variable
access. With hackbench, the performance improvement is 1.4% on Cortex-A9
(highbank). Taking an average of 30 runs of "hackbench -l 1000" yields:

Before: 6.2191
After: 6.1348

Will Deacon reported a similar delta on v6 with 11MPCore.

The asm "memory" clobbers are needed here to ensure the percpu offset
gets reloaded. Testing by Will found that this would not happen in
__schedule() which is a bit of a special case as preemption is disabled
but the execution can move cores.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

26 Nov, 2012 1 commit

ARM: 7582/2: rename kvm_seq to vmalloc_seq so to avoid confusion with KVM · 3e99675a

Committed by Nicolas Pitre on Nov 25, 2012

The kvm_seq value has nothing to do whatsoever with this other KVM.
Given that KVM support on ARM is imminent, it's best to rename kvm_seq
into something else to clearly identify what it is about i.e. a sequence
number for vmalloc section mappings.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

23 Nov, 2012 2 commits

ARM: 7585/1: kernel: fix nr_cpu_ids check in DT logical map init · ce7b1756

Committed by Lorenzo Pieralisi on Nov 22, 2012

If a kernel is configured with a DT containing more /cpu nodes than
nr_cpu_ids, the number of cpus must be capped in the DT parsing
code. Current code carries out the check, but fails to cap the
value and the check is executed after the cpu logical index is used,
which can lead to memory corruption due to index overflow.

This patch refactors the check against nr_cpu_ids and moves it before
any computed index is used in the parsing code.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
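
The ordering problem described in this commit message can be illustrated with a small user-space sketch. It is not the kernel's DT parsing code: NR_CPU_IDS, INVALID_HWID and build_logical_map() are hypothetical names chosen for this example, and the only point is that the bound is checked (and the node skipped) before the computed index touches the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPU_IDS   4u          /* hypothetical stand-in for nr_cpu_ids */
#define INVALID_HWID UINT32_MAX  /* marks an unused map slot */

/*
 * Build a logical CPU map from the hardware ids found in the DT.
 * The boot CPU always gets logical index 0. Nodes that would need an
 * index >= NR_CPU_IDS are skipped before any array access happens.
 * Returns the number of CPUs actually mapped.
 */
unsigned int build_logical_map(const uint32_t *hwids, size_t count,
                               uint32_t boot_hwid, uint32_t *map)
{
    unsigned int next = 1;   /* next free non-boot logical index */
    unsigned int mapped = 0;
    int boot_seen = 0;

    for (size_t i = 0; i < NR_CPU_IDS; i++)
        map[i] = INVALID_HWID;

    for (size_t i = 0; i < count; i++) {
        unsigned int idx;

        if (hwids[i] == boot_hwid && !boot_seen) {
            idx = 0;
            boot_seen = 1;
        } else {
            idx = next;
            /* Bounds check comes first: cap instead of overflowing. */
            if (idx >= NR_CPU_IDS)
                continue;
            next++;
        }

        assert(idx < NR_CPU_IDS);
        map[idx] = hwids[i];
        mapped++;
    }
    return mapped;
}
```

With six CPU nodes and room for four, the two surplus nodes are dropped and nothing is written past the end of the map.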

ARM: 7584/1: perf: fix link error when CONFIG_HW_PERF_EVENTS is not selected · c7cc504b

Committed by Marc Zyngier on Nov 22, 2012

Commit e50c5418 (ARM: perf: add guest vs host discrimination) broke the
link as perf_instruction_pointer and perf_misc_flags are not defined
when CONFIG_HW_PERF_EVENTS is not selected.

As it makes little sense to try and profile a guest without any HW event,
just fall back to the original code when this config option is not selected.
Reported-by: Russell King <linux@arm.linux.org.uk>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

21 Nov, 2012 2 commits

Merge branch 'bl-cpuinfo' of git://linux-arm.org/linux-2.6-lp into devel-stable · 82b5df7b

Committed by Russell King on Nov 20, 2012

Merge branch 'cluster-boot-protocol' of git://linux-arm.org/linux-2.6-lp into devel-stable · e38eb34a

Committed by Russell King on Nov 20, 2012

19 Nov, 2012 19 commits

ARM: gic: use a private mapping for CPU target interfaces · 384a2902

Committed by Nicolas Pitre on Apr 11, 2012

The GIC interface numbering does not necessarily follow the logical
CPU numbering, especially for complex topologies such as multi-cluster
systems.

Fortunately we can easily probe the GIC to create a mapping as the
Interrupt Processor Targets Registers for the first 32 interrupts are
read-only, and each field returns a value that always corresponds to
the processor reading the register.

Initially all mappings target all CPUs in case an IPI is required to
boot secondary CPUs.  It is refined as those CPUs discover what their
actual mapping is.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
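
The "start broad, then refine" mapping described above might look roughly like the following sketch. It is a user-space model, not the GIC driver: the array size, the function names and the probed_mask parameter (which stands in for reading the Interrupt Processor Targets Register) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NR_GIC_CPU_IF 8u   /* assumed number of GIC CPU interfaces */

uint8_t gic_cpu_map[NR_GIC_CPU_IF];

/* Before any CPU has probed itself, every logical CPU targets all interfaces. */
void gic_map_init(void)
{
    for (unsigned int i = 0; i < NR_GIC_CPU_IF; i++)
        gic_cpu_map[i] = 0xff;
}

/*
 * Called by logical CPU `cpu` once it has read its own interface mask.
 * `probed_mask` stands in for the value returned by the hardware register.
 */
void gic_cpu_discover(unsigned int cpu, uint8_t probed_mask)
{
    assert(cpu < NR_GIC_CPU_IF);
    gic_cpu_map[cpu] = probed_mask;

    /* No other logical CPU can own this interface, so drop it elsewhere. */
    for (unsigned int i = 0; i < NR_GIC_CPU_IF; i++)
        if (i != cpu)
            gic_cpu_map[i] &= (uint8_t)~probed_mask;
}
```

Until a secondary CPU has run its discovery step, an IPI aimed at it still reaches every interface, which is what makes booting secondaries through an IPI possible in this model.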

ARM: kernel: add logical mappings look-up · 7f124aaf

Committed by Lorenzo Pieralisi on Nov 17, 2011

In ARM SMP systems the MPIDR register ([23:0] bits) is used to uniquely
identify CPUs.

In order to retrieve the logical CPU index corresponding to a given
MPIDR value and guarantee a consistent translation throughout the kernel,
this patch adds a look-up based on the MPIDR[23:0] so that kernel subsystems
can use it whenever the logical cpu index corresponding to a given MPIDR
value is needed.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Nicolas Pitre <nico@linaro.org>
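
A look-up of this kind can be sketched as a linear search over the logical map, keyed on MPIDR[23:0]. The function signature, the -1 return value and the sample map contents below are assumptions for illustration; the kernel's own helper may differ in name, arguments and error code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MPIDR_HWID_BITMASK 0xFFFFFFu   /* MPIDR[23:0] */

/*
 * Return the logical index whose map entry matches MPIDR[23:0],
 * or -1 if the hardware id is not present in the map.
 */
int get_logical_index(const uint32_t *map, int nr_cpus, uint32_t mpidr)
{
    uint32_t hwid = mpidr & MPIDR_HWID_BITMASK;

    assert(map != NULL && nr_cpus >= 0);
    for (int cpu = 0; cpu < nr_cpus; cpu++)
        if (map[cpu] == hwid)
            return cpu;
    return -1;
}
```

Masking before the comparison means that bits outside [23:0] in the raw register value do not affect the result.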

ARM: kernel: add cpu logical map DT init in setup_arch · 5587164e

Committed by Lorenzo Pieralisi on Dec 14, 2011

As soon as the device tree is unflattened the cpu logical to physical
mapping is carried out in setup_arch to build a proper array of MPIDR and
corresponding logical indexes.

The mapping could have been carried out using the flattened DT blob and
related primitives, but since the mapping is not needed by early boot
code it can safely be executed when the device tree has been uncompressed to
its tree data structure.

This patch adds the arm_dt_init_cpu_maps() function call in setup_arch().

If the kernel is not compiled with DT support the function is empty and
no logical mapping takes place through it; the mapping carried out in
smp_setup_processor_id() is left unchanged.
If DT is supported the mapping created in smp_setup_processor_id() is overridden.
The DT mapping also sets the possible cpus mask, hence platform
code need not set it again in the respective smp_init_cpus() functions.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Nicolas Pitre <nico@linaro.org>

ARM: kernel: add device tree init map function · a0ae0240

Committed by Lorenzo Pieralisi on Nov 17, 2011

When booting through a device tree, the kernel cpu logical id map can be
initialized using device tree data passed by FW or through an embedded blob.

This patch adds a function that parses device tree "cpu" nodes and
retrieves the corresponding CPUs hardware identifiers (MPIDR).
It sets the possible cpus and the cpu logical map values according to
the number of CPUs defined in the device tree and respective properties.

The device tree HW identifiers are considered valid if all CPU nodes contain
a "reg" property, there are no duplicate "reg" entries and the DT defines a
CPU node whose "reg" property matches the MPIDR[23:0] of the boot CPU.

The primary CPU is assigned cpu logical number 0 to keep the current convention
valid.

Current bindings documentation is included in the patch:

Documentation/devicetree/bindings/arm/cpus.txt
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Nicolas Pitre <nico@linaro.org>

ARM: kernel: smp_setup_processor_id() updates · cb8cf4f8

Committed by Lorenzo Pieralisi on Nov 08, 2012

This patch applies some basic changes to the smp_setup_processor_id()
ARM implementation to make the code that builds cpu_logical_map more
uniform across the kernel.

The function now prints the full extent of the boot CPU MPIDR[23:0] and
initializes the cpu_logical_map for CPUs up to nr_cpu_ids.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>

ARM: kernel: update topology to use new MPIDR macros · 71db5bfe

Committed by Lorenzo Pieralisi on Nov 16, 2012

This patch updates the topology initialization code to use the newly
defined accessors to retrieve the MPIDR affinity levels.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Nicolas Pitre <nico@linaro.org>

ARM: kernel: enhance MPIDR macro definitions · dca463da

Committed by Lorenzo Pieralisi on Nov 15, 2012

Kernel subsystems other than the topology layer need the MPIDR
mask definitions to access the MPIDR without relying on hardcoded
masks. This patch moves the MPIDR register masks definition to
a header file and defines a macro to simplify access to MPIDR bit fields
representing affinity levels.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Nicolas Pitre <nico@linaro.org>
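
As an illustration of what an affinity-level accessor does, the sketch below extracts one 8-bit affinity field from MPIDR[23:0]. The macro and function names are chosen for this example and should be treated as assumptions rather than the exact definitions in the kernel header.

```c
#include <assert.h>
#include <stdint.h>

#define MPIDR_LEVEL_BITS 8
#define MPIDR_LEVEL_MASK ((1u << MPIDR_LEVEL_BITS) - 1)

/* Extract affinity level 0, 1 or 2 from MPIDR[23:0]. */
uint32_t mpidr_affinity_level(uint32_t mpidr, unsigned int level)
{
    assert(level < 3);
    return (mpidr >> (MPIDR_LEVEL_BITS * level)) & MPIDR_LEVEL_MASK;
}
```

For an MPIDR value of 0x80020103 this yields 0x03, 0x01 and 0x02 for levels 0, 1 and 2, so callers no longer need to hardcode shifts and masks.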

ARM: kernel: update cpuinfo to print all online CPUs features · b4b8f770

Committed by Lorenzo Pieralisi on Sep 10, 2012

Currently, reading /proc/cpuinfo provides userspace with CPU ID of
the CPU carrying out the read from the file. This is fine as long as all
CPUs in the system are the same. With the advent of big.LITTLE and
heterogeneous ARM systems this approach provides user space with incorrect
bits of information since CPU ids in the system might differ from the one
provided by the CPU reading the file.

This patch updates the cpuinfo show function so that a read from
/proc/cpuinfo prints HW information for all online CPUs at once, mirroring
x86 behaviour.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Nicolas Pitre <nico@linaro.org>

ARM: kernel: add MIDR to per-CPU information data · e8d432c9

Committed by Lorenzo Pieralisi on Nov 06, 2012

The advent of big.LITTLE ARM platforms requires the kernel to be able
to identify the MIDRs of all online CPUs upon request. MIDRs are stashed
at boot time so that kernel subsystems can detect the MIDR of online CPUs
by simply retrieving per-CPU data updated by all booted CPUs.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Nicolas Pitre <nico@linaro.org>

Merge branch 'asid-allocation' of... · 2079f30e

Committed by Russell King on Nov 19, 2012

Merge branch 'asid-allocation' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into devel-stable

Merge branch 'for-rmk/prot-none' of... · f27d9b71

Committed by Russell King on Nov 19, 2012

Merge branch 'for-rmk/prot-none' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into devel-stable

Merge branch 'hw-breakpoint' of... · c71d4aa7

Committed by Russell King on Nov 19, 2012

Merge branch 'hw-breakpoint' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into devel-stable

Merge branch 'perf/updates' of... · 667832da

Committed by Russell King on Nov 19, 2012

Merge branch 'perf/updates' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into devel-stable

fanotify: fix FAN_Q_OVERFLOW case of fanotify_read() · 3587b1b0

Committed by Al Viro on Nov 18, 2012

If the FAN_Q_OVERFLOW bit is set in event->mask, the fanotify event
metadata will not contain a valid file descriptor, but
copy_event_to_user() didn't check for that, and unconditionally does a
fd_install() on the file descriptor.

Which in turn will cause a BUG_ON() in __fd_install().

Introduced by commit 352e3b24 ("fanotify: sanitize failure exits in
copy_event_to_user()")

Mea culpa - missed that path ;-/
Reported-by: Alex Shi <lkml.alex@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
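
The missing check can be modeled in user space as follows. This is only a sketch of the idea (skip the install step when the event carries no valid descriptor); every name prefixed with demo_ or DEMO_ is hypothetical, and the constants are stand-ins rather than values taken from the fanotify headers.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_Q_OVERFLOW 0x4000u   /* stand-in for the overflow mask bit */
#define DEMO_NOFD       (-1)      /* stand-in for "no file descriptor" */

struct demo_event {
    unsigned int mask;
    int fd;
};

int demo_fd_installs;   /* counts calls to the fake install step */

static void demo_fd_install(int fd)
{
    (void)fd;
    demo_fd_installs++;
}

/* Deliver one event; overflow events never reach the install step. */
int demo_copy_event(const struct demo_event *ev)
{
    assert(ev != NULL);
    if (ev->mask & DEMO_Q_OVERFLOW)
        return DEMO_NOFD;

    demo_fd_install(ev->fd);
    return ev->fd;
}
```

An overflow event passes through without touching the descriptor table, while an ordinary event is installed exactly once.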

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 8d938105

Committed by Linus Torvalds on Nov 18, 2012

Pull misc VFS fixes from Al Viro:
 "Remove a bogus BUG_ON() that can trigger spuriously + alpha bits of
  do_mount() constification I'd missed during the merge window."

This pull request came in a week ago, I missed it for some reason.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill bogus BUG_ON() in do_close_on_exec()
  missing const in alpha callers of do_mount()

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k · aa7202c2

Committed by Linus Torvalds on Nov 18, 2012

Pull m68k fix from Geert Uytterhoeven:
 "This is a bug fix for asm constraints that affect sending RT signals,
  also destined for -stable."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k: fix sigset_t accessor functions

Merge tag 'gpio-fixes-for-v3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio · 5ad27d6c

Committed by Linus Torvalds on Nov 18, 2012

Pull last minute GPIO fixes from Linus Walleij:

 - Disable blinking on the Orion GPIO driver

 - Two Kconfig-style fixes to avoid broken builds

* tag 'gpio-fixes-for-v3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpio-mcp23s08: Build I2C support even when CONFIG_I2C=m
  gpio: adnp: Depend on OF_GPIO instead of OF
  mvebu-gpio: Disable blinking when enabling a GPIO for output

Merge tag 'for-linus-v3.7-rc7' of git://oss.sgi.com/xfs/xfs · d28d3730

Committed by Linus Torvalds on Nov 18, 2012

Pull xfs bugfixes from Ben Myers:

 - fix attr tree double split corruption

 - fix broken error handling in xfs_vm_writepage

 - drop buffer io reference when a bad bio is built

* tag 'for-linus-v3.7-rc7' of git://oss.sgi.com/xfs/xfs:
  xfs: drop buffer io reference when a bad bio is built
  xfs: fix broken error handling in xfs_vm_writepage
  xfs: fix attr tree double split corruption

Merge tag 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev · 5e30c089

Committed by Linus Torvalds on Nov 18, 2012

Pull libata fixes from Jeff Garzik:
"If you were going to shoot me for not sending these earlier, you would
be right. -rc6 beat me by ~2 hours it seems, and they really should
have gone out long before that.

These have been in libata-dev.git for a day or so (unfortunately
linux-next is on vacation). The main one is #1, with the others being
minor bits. #1 has multiple tested-by, and can be considered a
regression fix IMO.

1) Fix ACPI oops:

https://bugzilla.kernel.org/show_bug.cgi?id=48211

2) Temporary WARN_ONCE() debugging patch for further ACPI debugging.

The code already oopses here, and so this merely gives slightly
better info. Related to

https://bugzilla.kernel.org/show_bug.cgi?id=49151

which has been bisected down to a patch that _exposes_ a latent
bug, but said bisection target does not actually appear to be the
root cause itself.

3) sata_svw: fix longstanding error recovery bug, which was
preventing kdump, by adding missing DMA-start bit check. Core
code was already checking DMA-start, but ancillary, less-used
routines were not. Fixed.

4) sata_highbank: fix minor __init/__devinit warning

5) Fix minor warning, if CONFIG_PM is set, but CONFIG_PM_SLEEP is not
set

6) pata_arasan: proper functioning requires clock setting"

* tag 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
[libata] PM callbacks should be conditionally compiled on CONFIG_PM_SLEEP
sata_svw: check DMA start bit before reset
libata debugging: Warn when unable to find timing descriptor based on xfer_mode
sata_highbank: mark ahci_highbank_probe as __devinit
pata_arasan: Initialize cf clock to 166MHz
libata-acpi: Fix NULL ptr derference in ata_acpi_dev_handle

18 Nov, 2012 4 commits

m68k: fix sigset_t accessor functions · 34fa78b5

Committed by Andreas Schwab on Nov 17, 2012

The sigaddset/sigdelset/sigismember functions that are implemented with
bitfield insn cannot allow the sigset argument to be placed in a data
register since the sigset is wider than 32 bits.  Remove the "d"
constraint from the asm statements.

The effect of the bug is that sending RT signals does not work: the signal
number is truncated modulo 32.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: stable@vger.kernel.org
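
The symptom (a signal number truncated modulo 32) can be reproduced with a portable model. This is not the m68k bitfield assembly and does not follow its bit numbering; demo_sigset_t and the demo_ functions are hypothetical names that only contrast a set spanning two 32-bit words with one squeezed into a single 32-bit register.

```c
#include <assert.h>
#include <stdint.h>

#define SIGSET_WORDS 2   /* 64 signals stored as two 32-bit words */

typedef struct {
    uint32_t sig[SIGSET_WORDS];
} demo_sigset_t;         /* hypothetical type for this example */

/* Correct: pick the word first, then the bit inside it. */
void demo_sigaddset(demo_sigset_t *set, int signo)
{
    assert(signo >= 1 && signo <= 32 * SIGSET_WORDS);
    unsigned int bit = (unsigned int)signo - 1;
    set->sig[bit / 32] |= UINT32_C(1) << (bit % 32);
}

/* Models the bug: the set is treated as one 32-bit register, so the index wraps. */
void demo_sigaddset_truncated(demo_sigset_t *set, int signo)
{
    assert(signo >= 1 && signo <= 32 * SIGSET_WORDS);
    unsigned int bit = ((unsigned int)signo - 1) % 32;
    set->sig[0] |= UINT32_C(1) << bit;
}

int demo_sigismember(const demo_sigset_t *set, int signo)
{
    assert(signo >= 1 && signo <= 32 * SIGSET_WORDS);
    unsigned int bit = (unsigned int)signo - 1;
    return (int)((set->sig[bit / 32] >> (bit % 32)) & 1u);
}
```

In this model, adding signal 34 through the truncating variant sets the bit for signal 2 instead, which matches the "truncated modulo 32" behaviour the message describes.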

gpio-mcp23s08: Build I2C support even when CONFIG_I2C=m · cbf24fad

Committed by Daniel M. Weeks on Nov 06, 2012

The driver has both SPI and I2C pieces. The appropriate pieces are built based
on whether SPI and/or I2C is/are enabled. However, it was only checking if I2C
was built-in, never if it was built as a module. This patch checks for either
since building both this driver and I2C as modules is possible.
Signed-off-by: Daniel M. Weeks <dan@danweeks.net>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
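
The difference between the two checks can be shown with the preprocessor alone. Kconfig defines CONFIG_FOO for a built-in option and CONFIG_FOO_MODULE for a modular one; the sketch below simulates the modular case by hand. The OLD_/NEW_ macro names are made up for this example, and the snippet makes no claim about the exact form the driver patch used (the kernel's IS_ENABLED() helper expresses the same either-or test).

```c
#include <assert.h>

/* Simulate what Kconfig emits when I2C is built as a module (CONFIG_I2C=m). */
#define CONFIG_I2C_MODULE 1

/* Old-style test: true only when I2C is built in (=y). */
#ifdef CONFIG_I2C
#define OLD_CHECK_BUILDS_I2C 1
#else
#define OLD_CHECK_BUILDS_I2C 0
#endif

/* Fixed test: true for built-in or module. */
#if defined(CONFIG_I2C) || defined(CONFIG_I2C_MODULE)
#define NEW_CHECK_BUILDS_I2C 1
#else
#define NEW_CHECK_BUILDS_I2C 0
#endif

static_assert(OLD_CHECK_BUILDS_I2C == 0, "old check misses the module case");
static_assert(NEW_CHECK_BUILDS_I2C == 1, "new check covers the module case");

int old_check_builds_i2c(void) { return OLD_CHECK_BUILDS_I2C; }
int new_check_builds_i2c(void) { return NEW_CHECK_BUILDS_I2C; }
```

With only the _MODULE symbol defined, the old test silently compiles the I2C half of the driver out, while the fixed test keeps it.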

gpio: adnp: Depend on OF_GPIO instead of OF · cb144fe8

Committed by Thierry Reding on Nov 01, 2012

The driver accesses the of_node field of struct gpio_chip, which is only
available if OF_GPIO is selected. This solves a build issue on SPARC
which conflicts with OF_GPIO and therefore does not provide this field.
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

mvebu-gpio: Disable blinking when enabling a GPIO for output · e9133760

Committed by Jamie Lentin on Oct 28, 2012

The plat-orion GPIO driver would disable any pin blinking whenever
using a pin for output. Do the same here, as a blinking LED will
continue to blink regardless of what the GPIO pin level is.
Signed-off-by: Jamie Lentin <jm@lentin.co.uk>
Acked-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

17 Nov, 2012 11 commits

xfs: drop buffer io reference when a bad bio is built · d69043c4

Committed by Dave Chinner on Nov 12, 2012

Error handling in xfs_buf_ioapply_map() does not handle IO reference
counts correctly. We increment the b_io_remaining count before
building the bio, but then fail to decrement it in the failure case.
This leads to the buffer never running IO completion and releasing
the reference that the IO holds, so at unmount we can leak the
buffer. This leak is captured by this assert failure during unmount:

XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273

This is not a new bug - the b_io_remaining accounting has had this
problem for a long, long time - it's just very hard to get a
zero length bio being built by this code...

Further, the buffer IO error can be overwritten on a multi-segment
buffer by subsequent bio completions for partial sections of the
buffer. Hence we should only set the buffer error status if the
buffer is not already carrying an error status. This ensures that a
partial IO error on a multi-segment buffer will not be lost. This
part of the problem is a regression, however.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
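
Both halves of this fix (give the IO reference back on the failure path, and never overwrite an error that is already recorded) can be sketched in a few lines. struct demo_buf and the demo_ functions are hypothetical and only loosely modeled on the description above; they are not the XFS buffer code.

```c
#include <assert.h>
#include <errno.h>

struct demo_buf {          /* hypothetical, loosely modeled on the description */
    int io_remaining;      /* outstanding IO references */
    int error;             /* first error seen, 0 if none */
};

/* Record an error only if the buffer is not already carrying one. */
void demo_buf_set_error(struct demo_buf *bp, int error)
{
    if (!bp->error)
        bp->error = error;
}

/* Try to issue one segment; nr_pages == 0 stands in for a bad (empty) bio. */
int demo_ioapply_segment(struct demo_buf *bp, int nr_pages)
{
    bp->io_remaining++;                 /* take the IO reference up front */
    if (nr_pages == 0) {
        bp->io_remaining--;             /* failure path must give it back */
        assert(bp->io_remaining >= 0);
        demo_buf_set_error(bp, EIO);
        return -1;
    }
    return 0;                           /* reference is dropped at IO completion */
}
```

After a failed segment the count is back where it started and the first error sticks, so a later error from another segment cannot mask it.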

xfs: fix broken error handling in xfs_vm_writepage · 3daed8bc

Committed by Dave Chinner on Nov 12, 2012

When we shut down the filesystem, it might first be detected in
writeback when we are allocating an inode size transaction. This
happens after we have moved all the pages into the writeback state
and unlocked them. Unfortunately, if we fail to set up the
transaction we then abort writeback and try to invalidate the
current page. This then triggers a BUG() in block_invalidatepage()
because we are trying to invalidate an unlocked page.

Fixing this is a bit of a chicken and egg problem - we can't
allocate the transaction until we've clustered all the pages into
the IO and we know the size of it (i.e. whether the last block of
the IO is beyond the current EOF or not). However, we don't want to
hold pages locked for long periods of time, especially while we lock
other pages to cluster them into the write.

To fix this, we need to make a clear delineation in writeback where
errors can only be handled by IO completion processing. That is,
once we have marked a page for writeback and unlocked it, we have to
report errors via IO completion because we've already started the
IO. We may not have submitted any IO, but we've changed the page
state to indicate that it is under IO so we must now use the IO
completion path to report errors.

To do this, add an error field to xfs_submit_ioend() to pass it the
error that occurred during the building of the ioend chain. When
this is non-zero, mark each ioend with the error and call
xfs_finish_ioend() directly rather than building bios. This will
immediately push the ioends through completion processing with the
error that has occurred.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
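
The control flow described in the last paragraph, where a non-zero error routes every ioend straight to completion instead of building bios, might be sketched like this. struct demo_ioend, its fields and demo_submit_ioend() are hypothetical, and the error is assumed to be zero or a negative errno-style value.

```c
#include <assert.h>
#include <stddef.h>

struct demo_ioend {            /* hypothetical model of one ioend in a chain */
    struct demo_ioend *next;
    int error;                 /* error reported through completion, 0 if none */
    int submitted;             /* set when IO would have been built and sent */
    int completed;             /* set when completion processing ran directly */
};

/* `error` is 0 on success or a negative errno-style value. */
void demo_submit_ioend(struct demo_ioend *head, int error)
{
    assert(error <= 0);
    for (struct demo_ioend *io = head; io != NULL; io = io->next) {
        if (error) {
            /* Pages are already marked under IO: report via completion. */
            io->error = error;
            io->completed = 1;
            continue;
        }
        io->submitted = 1;
    }
}
```

No ioend in a failed chain is ever submitted, yet each one still passes through completion carrying the error.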

xfs: fix attr tree double split corruption · 42e2976f

Committed by Dave Chinner on Nov 12, 2012

In certain circumstances, a double split of an attribute tree is
needed to insert or replace an attribute. In rare situations, this
can go wrong, leaving the attribute tree corrupted. In this case,
the attr being replaced is the last attr in a leaf node, and the
replacement is larger so doesn't fit in the same leaf node.

Assume the initial condition is a node format attribute
btree with two leaves at index 1 and 2. Call them L1 and L2.  The
leaf L1 is completely full, there is not a single byte of free space
in it. L2 is mostly empty.  The attribute being replaced - call it X
- is the last attribute in L1.

The way an attribute replace is executed is that the replacement
attribute - call it Y - is first inserted into the tree, but has an
INCOMPLETE flag set on it so that list traversals ignore it. Once
this transaction is committed, a second transaction is run to
atomically mark Y as COMPLETE and X as INCOMPLETE, so that a
traversal will now find Y and skip X. Once that transaction is
committed, attribute X is then removed.

So, the initial condition is:

     +--------+     +--------+
     |   L1   |     |   L2   |
     | fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |
     | fsp: 0 |     | fsp: N |
     |--------|     |--------|
     | attr A |     | attr 1 |
     |--------|     |--------|
     | attr B |     | attr 2 |
     |--------|     |--------|
     ..........     ..........
     |--------|     |--------|
     | attr X |     | attr n |
     +--------+     +--------+

So now we go to replace X, and see that L1:fsp = 0 - it is full so
we can't insert Y in the same leaf. So we record the location of
attribute X so we can track it for later use, then we split L1 into
L1 and L3 and rebalance across the two leaves. We end with:

     +--------+     +--------+     +--------+
     |   L1   |     |   L3   |     |   L2   |
     | fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |<----| bwd: 3 |
     | fsp: M |     | fsp: J |     | fsp: N |
     |--------|     |--------|     |--------|
     | attr A |     | attr X |     | attr 1 |
     |--------|     +--------+     |--------|
     | attr B |                    | attr 2 |
     |--------|                    |--------|
     ..........                    ..........
     |--------|                    |--------|
     | attr W |                    | attr n |
     +--------+                    +--------+

And we track that the original attribute is now at L3:0.

We then try to insert Y into L1 again, and find that there isn't
enough room because the new attribute is larger than the old one.
Hence we have to split again to make room for Y. We end up with
this:

     +--------+     +--------+     +--------+     +--------+
     |   L1   |     |   L4   |     |   L3   |     |   L2   |
     | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
     | fsp: M |     | fsp: J |     | fsp: J |     | fsp: N |
     |--------|     |--------|     |--------|     |--------|
     | attr A |     | attr Y |     | attr X |     | attr 1 |
     |--------|     + INCOMP +     +--------+     |--------|
     | attr B |     +--------+                    | attr 2 |
     |--------|                                   |--------|
     ..........                                   ..........
     |--------|                                   |--------|
     | attr W |                                   | attr n |
     +--------+                                   +--------+

And now we have the new (incomplete) attribute @ L4:0, and the
original attribute at L3:0. At this point, the first transaction is
committed, and we move to the flipping of the flags.

This is where we are supposed to end up with this:

     +--------+     +--------+     +--------+     +--------+
     |   L1   |     |   L4   |     |   L3   |     |   L2   |
     | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
     | fsp: M |     | fsp: J |     | fsp: J |     | fsp: N |
     |--------|     |--------|     |--------|     |--------|
     | attr A |     | attr Y |     | attr X |     | attr 1 |
     |--------|     +--------+     + INCOMP +     |--------|
     | attr B |                    +--------+     | attr 2 |
     |--------|                                   |--------|
     ..........                                   ..........
     |--------|                                   |--------|
     | attr W |                                   | attr n |
     +--------+                                   +--------+

But that doesn't happen properly - the attribute tracking indexes
are not pointing to the right locations. What we end up with is both
the old attribute to be removed pointing at L4:0 and the new
attribute at L4:1.  On a debug kernel, this assert fails like so:

XFS: Assertion failed: args->index2 < be16_to_cpu(leaf2->hdr.count), file: fs/xfs/xfs_attr_leaf.c, line: 2725

because the new attribute location does not exist. On a production
kernel, this goes unnoticed and the code proceeds ahead merrily and
removes L4 because it thinks that is the block that is no longer
needed. This leaves the hash index node pointing to entries
L1, L4 and L2, but only blocks L1, L3 and L2 to exist. Further, the
leaf level sibling list is L1 <-> L4 <-> L2, but L4 is now free
space, and so everything is busted. This corruption is caused by the
removal of the old attribute triggering a join - it joins everything
correctly but then frees the wrong block.

xfs_repair will report something like:

bad sibling back pointer for block 4 in attribute fork for inode 131
problem with attribute contents in inode 131
would clear attr fork
bad nblocks 8 for inode 131, would reset to 3
bad anextents 4 for inode 131, would reset to 0

The problem lies in the assignment of the old/new blocks for
tracking purposes when the double leaf split occurs. The first split
tries to place the new attribute inside the current leaf (i.e.
"inleaf == true") and moves the old attribute (X) to the new block.
This sets up the old block/index to L1:X, and newly allocated
block to L3:0. It then moves attr X to the new block and tries to
insert attr Y at the old index. That fails, so it splits again.

With the second split, the rebalance ends up placing the new attr in
the second new block - L4:0 - and this is where the code goes wrong.
What it does is set both the new and old block index to the
second new block. Hence it inserts attr Y at the right place (L4:0)
but overwrites the current location of the attr to replace that is
held in the new block index (currently L3:0). It overwrites it with
L4:1 - the index we later assert fail on.

Hopefully this table will show this in a format that is a bit easier
to understand:

Split           old attr index          new attr index
                vanilla patched         vanilla patched
before 1st      L1:26   L1:26           N/A     N/A
after 1st       L3:0    L3:0            L1:26   L1:26
after 2nd       L4:0    L3:0            L4:1    L4:0
                ^^^^                    ^^^^
                wrong                   wrong

The fix is surprisingly simple, for all this analysis - just stop
the rebalance on the out-of leaf case from overwriting the new attr
index - it's already correct for the double split case.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

Linux 3.7-rc6 · f4a75d2e

Committed by Linus Torvalds on Nov 16, 2012

Merge git://git.kernel.org/pub/scm/virt/kvm/kvm · 51844b0f

Committed by Linus Torvalds on Nov 16, 2012

Pull KVM fix from Marcelo Tosatti:
 "A correction for oops on module init with older Intel hosts."

* git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: Fix invalid secondary exec controls in vmx_cpuid_update()

Merge branch 'akpm' (Fixes from Andrew) · 0cad3ff4

Committed by Linus Torvalds on Nov 16, 2012

Merge misc fixes from Andrew Morton.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (12 patches)
  revert "mm: fix-up zone present pages"
  tmpfs: change final i_blocks BUG to WARNING
  tmpfs: fix shmem_getpage_gfp() VM_BUG_ON
  mm: highmem: don't treat PKMAP_ADDR(LAST_PKMAP) as a highmem address
  mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
  rapidio: fix kernel-doc warnings
  swapfile: fix name leak in swapoff
  memcg: fix hotplugged memory zone oops
  mips, arc: fix build failure
  memcg: oom: fix totalpages calculation for memory.swappiness==0
  mm: fix build warning for uninitialized value
  mm: add anon_vma_lock to validate_mm()

revert "mm: fix-up zone present pages" · 5576646f

Committed by Andrew Morton on Nov 16, 2012

Revert commit 7f1290f2 ("mm: fix-up zone present pages")

That patch tried to fix an issue when calculating zone->present_pages,
but it caused a regression on 32bit systems with HIGHMEM.  With that
change, reset_zone_present_pages() resets all zone->present_pages to
zero, and fixup_zone_present_pages() is called to recalculate
zone->present_pages when the boot allocator frees core memory pages into
buddy allocator.  Because highmem pages are not freed by bootmem
allocator, all highmem zones' present_pages become zero.

Various options for improving the situation are being discussed but for
now, let's return to the 3.6 code.

Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

tmpfs: change final i_blocks BUG to WARNING · 0f3c42f5

Committed by Hugh Dickins on Nov 16, 2012

Under a particular load on one machine, I have hit shmem_evict_inode()'s
BUG_ON(inode->i_blocks), enough times to narrow it down to a particular
race between swapout and eviction.

It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(),
and the lack of coherent locking between mapping's nrpages and shmem's
swapped count.  There's a window in shmem_writepage(), between lowering
nrpages in shmem_delete_from_page_cache() and then raising swapped
count, when the freed count appears to be +1 when it should be 0, and
then the asymmetry stops it from being corrected with -1 before hitting
the BUG.

One answer is coherent locking: using tree_lock throughout, without
info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on
used_blocks makes that messier than expected.  Another answer may be a
further effort to eliminate the weird shmem_recalc_inode() altogether,
but previous attempts at that failed.

So far undecided, but for now change the BUG_ON to WARN_ON: in usual
circumstances it remains a useful consistency check.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

tmpfs: fix shmem_getpage_gfp() VM_BUG_ON · 215c02bc

Committed by Hugh Dickins on Nov 16, 2012

Fuzzing with trinity hit the "impossible" VM_BUG_ON(error) (which Fedora
has converted to WARNING) in shmem_getpage_gfp():

  WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
  Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
  Call Trace:
    warn_slowpath_common+0x7f/0xc0
    warn_slowpath_null+0x1a/0x20
    shmem_getpage_gfp+0xa5c/0xa70
    shmem_fault+0x4f/0xa0
    __do_fault+0x71/0x5c0
    handle_pte_fault+0x97/0xae0
    handle_mm_fault+0x289/0x350
    __do_page_fault+0x18e/0x530
    do_page_fault+0x2b/0x50
    page_fault+0x28/0x30
    tracesys+0xe1/0xe6

Thanks to Johannes for pointing to truncation: free_swap_and_cache()
only does a trylock on the page, so the page lock we've held since
before confirming swap is not enough to protect against truncation.

What cleanup is needed in this case? Just delete_from_swap_cache(),
which takes care of the memcg uncharge.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: highmem: don't treat PKMAP_ADDR(LAST_PKMAP) as a highmem address · 498c2280

Committed by Will Deacon on Nov 16, 2012

kmap_to_page returns the corresponding struct page for a virtual address
of an arbitrary mapping.  This works by checking whether the address
falls in the pkmap region and using the pkmap page tables instead of the
linear mapping if appropriate.

Unfortunately, the bounds checking means that PKMAP_ADDR(LAST_PKMAP) is
incorrectly treated as a highmem address and we can end up walking off
the end of pkmap_page_table and subsequently passing junk to pte_page.

This patch fixes the bound check to stay within the pkmap tables.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
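
The off-by-one can be illustrated with a pair of range checks. The sketch assumes the faulty check used an inclusive upper bound, which the message implies but does not quote; PKMAP_BASE, LAST_PKMAP and PAGE_SHIFT are given made-up values here, and the two function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PKMAP_BASE  0xff800000u   /* hypothetical base of the pkmap window */
#define LAST_PKMAP  512u          /* hypothetical number of pkmap slots */
#define PKMAP_ADDR(nr) (PKMAP_BASE + ((uint32_t)(nr) << PAGE_SHIFT))

static_assert(PKMAP_ADDR(LAST_PKMAP) > PKMAP_BASE, "pkmap window must not wrap");

/* Inclusive upper bound: wrongly accepts the first address past the window. */
int in_pkmap_inclusive(uint32_t addr)
{
    return addr >= PKMAP_ADDR(0) && addr <= PKMAP_ADDR(LAST_PKMAP);
}

/* Exclusive upper bound: valid slots are 0 .. LAST_PKMAP - 1 only. */
int in_pkmap_exclusive(uint32_t addr)
{
    return addr >= PKMAP_ADDR(0) && addr < PKMAP_ADDR(LAST_PKMAP);
}
```

Only the exclusive form rejects PKMAP_ADDR(LAST_PKMAP), which would otherwise index one entry past the end of a table with LAST_PKMAP slots.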

mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" · 96710098

Committed by Mel Gorman on Nov 16, 2012

Jiri Slaby reported the following:

	(It's an effective revert of "mm: vmscan: scale number of pages
	reclaimed by reclaim/compaction based on failures".) Given kswapd
	had hours of runtime in ps/top output yesterday in the morning
	and after the revert it's now 2 minutes in sum for the last 24h,
	I would say, it's gone.

The intention of the patch in question was to compensate for the loss of
lumpy reclaim.  Part of the reason lumpy reclaim worked is because it
aggressively reclaimed pages and this patch was meant to be a sane
compromise.

When compaction fails, it gets deferred and both compaction and
reclaim/compaction are deferred to avoid excessive reclaim.  However, since
commit c6543459 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up
each time and continues reclaiming which was not taken into account when
the patch was developed.

Attempts to address the problem ended up just changing the shape of the
problem instead of fixing it.  The release window gets closer and while
a THP allocation failing is not a major problem, kswapd chewing up a lot
of CPU is.

This patch reverts commit 83fde0f2 ("mm: vmscan: scale number of
pages reclaimed by reclaim/compaction based on failures") and will be
revisited in the future.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

openeuler / Kernel synced successfully more than 1 year ago
