Commits · 3fc1f1e27a5b807791d72e5d992aa33b668a6626 · openanolis / cloud-kernel

07 May, 2010 2 commits

stop_machine: reimplement using cpu_stop · 3fc1f1e2

Committed by Tejun Heo on May 06, 2010

Reimplement stop_machine using cpu_stop.  As cpu stoppers are
guaranteed to be available for all online cpus,
stop_machine_create/destroy() are no longer necessary and removed.

With resource management and synchronization handled by cpu_stop, the
new implementation is much simpler.  Asking the cpu_stop to execute
the stop_cpu() state machine on all online cpus with cpu hotplug
disabled is enough.

stop_machine itself doesn't need to manage any global resources
anymore, so all per-instance information is rolled into struct
stop_machine_data and the mutex and all static data variables are
removed.

The previous implementation created and destroyed RT workqueues as
necessary which made stop_machine() calls highly expensive on very
large machines.  According to Dimitri Sivanich, preventing the dynamic
creation/destruction makes booting more than twice as fast on very
large machines.  cpu_stop resources are preallocated for all online
cpus and should have the same effect.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>

3fc1f1e2

cpu_stop: implement stop_cpu[s]() · 1142d810

Committed by Tejun Heo on May 06, 2010

Implement a simplistic per-cpu maximum priority cpu monopolization
mechanism.  A non-sleeping callback can be scheduled to run on one or
multiple cpus with maximum priority, monopolizing those cpus.  This is
primarily to replace and unify RT workqueue usage in stop_machine and
scheduler migration_thread which currently is serving multiple
purposes.

Four functions are provided - stop_one_cpu(), stop_one_cpu_nowait(),
stop_cpus() and try_stop_cpus().

This is to allow clean sharing of resources among stop_cpu and all the
migration thread users.  One stopper thread per cpu is created which
is currently named "stopper/CPU".  This will eventually replace the
migration thread and take on its name.

* This facility was originally named cpuhog and lived in separate
  files, but Peter Zijlstra nacked the name, so it was renamed to
  cpu_stop and moved into stop_machine.c.

* Better reporting of preemption leak as per Peter's suggestion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>

1142d810

23 Apr, 2010 3 commits

sched: Fix select_idle_sibling() logic in select_task_rq_fair() · 99bd5e2f

Committed by Suresh Siddha on Mar 31, 2010

Issues in the current select_idle_sibling() logic in select_task_rq_fair()
in the context of a task wake-up:

a) Once we select the idle sibling, we use that domain (spanning the cpu
on which the task is currently being woken up and the idle sibling that we
found) in our wake_affine() decisions. This domain is completely different
from the domain we are supposed to use, which spans the cpu on which the
task is currently being woken up and the cpu where the task previously ran.

b) We do the select_idle_sibling() check only for the cpu that the task is
currently being woken up on. If select_task_rq_fair() selects the previously
run cpu for waking the task, doing a select_idle_sibling() check
for that cpu also helps, and we don't do this currently.

c) In scenarios where the cpu on which the task is woken up is busy but
its HT siblings are idle, we select an idle HT sibling for the wake-up
instead of the core that the task previously ran on and that is currently
completely idle. That is, we are not making decisions based on
wake_affine() but directly selecting an idle sibling, which can cause
an imbalance at the SMT/MC level that will later be corrected by the
periodic load balancer.

Fix this by first going through the load imbalance calculations using
wake_affine() and, once we have decided between the woken-up cpu and the
previously-ran cpu, then choosing a possible idle sibling on which to wake
up the task.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1270079265.7835.8.camel@sbs-t61.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

99bd5e2f

sched: Pre-compute cpumask_weight(sched_domain_span(sd)) · 669c55e9

Committed by Peter Zijlstra on Apr 16, 2010

Dave reported that his large SPARC machines spend lots of time in
hweight64().  Try to optimize away some of those needless cpumask_weight()
invocations (with the large offstack cpumasks these are especially
expensive).
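
As a rough user-space illustration of the idea (not the kernel change itself), the sketch below counts the bits of a mask once, when the mask is set up, and stores the result next to it so that hot paths read a field instead of recounting. The names `struct domain`, `span_weight`, `domain_init()` and `mask_weight()` are hypothetical, and a single 64-bit word stands in for a real cpumask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a sched domain: one 64-bit word as the cpu span. */
struct domain {
	uint64_t span;            /* bit n set => cpu n belongs to the domain */
	unsigned int span_weight; /* cached number of set bits in span */
};

/* Count set bits; this is the cost we want to pay once, not on every use. */
unsigned int mask_weight(uint64_t mask)
{
	unsigned int weight = 0;

	while (mask) {
		mask &= mask - 1; /* clear the lowest set bit */
		weight++;
	}
	return weight;
}

/* Compute the weight when the span is set up and keep it alongside the mask. */
void domain_init(struct domain *d, uint64_t span)
{
	assert(d != NULL);
	d->span = span;
	d->span_weight = mask_weight(span);
}
```

Callers that previously recomputed the weight would read `d->span_weight` instead; the cached value is only valid as long as the span is not modified after `domain_init()`.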
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

669c55e9

sched: Cure load average vs NO_HZ woes · 74f5187a

Committed by Peter Zijlstra on Apr 22, 2010

Chase reported that due to us decrementing calc_load_task prematurely
(before the next LOAD_FREQ sample), the load average could be skewed
by as much as the number of CPUs in the machine.

This patch, based on Chase's patch, cures the problem by keeping the
delta of the CPU going into NO_HZ idle separately and folding that in
on the next LOAD_FREQ update.

This restores the balance and we get strict LOAD_FREQ period samples.
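
The bookkeeping described above can be sketched in user space roughly as follows. This is a simplified model, not the kernel code: `struct cpu_load`, `calc_tasks`, `calc_tasks_idle`, `enter_nohz_idle()` and `load_freq_update()` are hypothetical names, and locking and atomics are left out entirely.

```c
#include <assert.h>

/* Hypothetical per-cpu state: current active count and the last value folded in. */
struct cpu_load {
	long nr_active;
	long last_reported;
};

long calc_tasks;      /* global count that the LOAD_FREQ sample reads */
long calc_tasks_idle; /* deltas parked by cpus that went NO_HZ idle */

/* Return the change in this cpu's active count since it was last folded. */
long fold_active(struct cpu_load *c)
{
	long delta;

	assert(c->nr_active >= 0);
	delta = c->nr_active - c->last_reported;
	c->last_reported = c->nr_active;
	return delta;
}

/* Going idle: park the delta instead of touching the global count early. */
void enter_nohz_idle(struct cpu_load *c)
{
	calc_tasks_idle += fold_active(c);
}

/* LOAD_FREQ update: fold this cpu's delta plus everything parked by idle cpus. */
long load_freq_update(struct cpu_load *c)
{
	long delta = fold_active(c) + calc_tasks_idle;

	calc_tasks_idle = 0;
	calc_tasks += delta;
	return calc_tasks;
}
```

In this model a cpu that drops to zero runnable tasks and goes idle leaves `calc_tasks` unchanged until some cpu performs the next `load_freq_update()`, which is the behaviour the commit message describes.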
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Chase Douglas <chase.douglas@canonical.com>
LKML-Reference: <1271934490.1776.343.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

74f5187a

15 Apr, 2010 2 commits

sched: Fix UP update_avg() build warning · 09a40af5

Committed by Mike Galbraith on Apr 15, 2010

update_avg() is only used for SMP builds, so move it to the nearest
SMP block.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1271309399.14779.17.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

09a40af5

Merge branch 'linus' into sched/core · b257c14c

Committed by Ingo Molnar on Apr 15, 2010

Merge reason: merge the latest fixes, update to -rc4.
Signed-off-by: Ingo Molnar <mingo@elte.hu>

b257c14c

14 Apr, 2010 4 commits

Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 · 2ba3abd8

Committed by Linus Torvalds on Apr 13, 2010

* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM / Hibernate: user.c, fix SNAPSHOT_SET_SWAP_AREA handling

2ba3abd8

Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 · 0fdfe5ad

Committed by Linus Torvalds on Apr 13, 2010

* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFSv4: fix delegated locking
  NFS: Ensure that the WRITE and COMMIT RPC calls are always uninterruptible
  NFS: Fix a race with the new commit code
  NFS: Ensure that writeback_single_inode() calls write_inode() when syncing
  NFS: Fix the mode calculation in nfs_find_open_context
  NFSv4: Fall back to ordinary lookup if nfs4_atomic_open() returns EISDIR

0fdfe5ad

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6 · 44d2d371

Committed by Linus Torvalds on Apr 13, 2010

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6:
  sparc64: Add some more commentary to __raw_local_irq_save()
  sparc64: Fix memory leak in pci_register_iommu_region().
  sparc64: Add kmemleak annotation to sun4v_build_virq()
  sparc64: Support kmemleak.
  sparc64: Add function graph tracer support.
  sparc64: Give a stack frame to the ftrace call sites.
  sparc64: Use a seperate counter for timer interrupts and NMI checks, like x86.
  sparc64: Remove profiling from some low-level bits.
  sparc64: Kill unnecessary static on local var in ftrace_call_replace().
  sparc64: Kill CONFIG_STACK_DEBUG code.
  sparc64: Add HAVE_FUNCTION_TRACE_MCOUNT_TEST and tidy up.
  sparc64: Adjust __raw_local_irq_save() to cooperate in NMIs.
  sparc64: Use kstack_valid() in die_if_kernel().

44d2d371

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 · 465de2ba

Committed by Linus Torvalds on Apr 13, 2010

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (25 commits)
  smc91c92_cs: define multicast_table as unsigned char
  can: avoids a false warning
  e1000e: stop cleaning when we reach tx_ring->next_to_use
  igb: restrict WoL for 82576 ET2 Quad Port Server Adapter
  virtio_net: missing sg_init_table
  Revert "tcp: Set CHECKSUM_UNNECESSARY in tcp_init_nondata_skb"
  iwlwifi: need check for valid qos packet before free
  tcp: Set CHECKSUM_UNNECESSARY in tcp_init_nondata_skb
  udp: fix for unicast RX path optimization
  myri10ge: fix rx_pause in myri10ge_set_pauseparam
  net: corrected documentation for hardware time stamping
  stmmac: use resource_size()
  x.25 attempts to negotiate invalid throughput
  x25: Patch to fix bug 15678 - x25 accesses fields beyond end of packet.
  bridge: Fix IGMP3 report parsing
  cnic: Fix crash during bnx2x MTU change.
  qlcnic: fix set mac addr
  r6040: fix r6040_multicast_list
  vhost-net: fix vq_memory_access_ok error checking
  ath9k: fix double calls to ath_radio_enable
  ...

465de2ba

13 Apr, 2010 29 commits

smc91c92_cs: define multicast_table as unsigned char · a6d37024

Committed by Ken Kawasaki on Apr 10, 2010

smc91c92_cs:
  * define multicast_table as unsigned char
  * remove unnecessary "#ifndef final_version"
Signed-off-by: Ken Kawasaki <ken_kawasaki@spring.nifty.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

a6d37024

can: avoids a false warning · 4ffa8701

Committed by Eric Dumazet on Apr 09, 2010

At this point optlen == sizeof(sfilter) but some compilers are dumb.

Reported-by: Németh Márton <nm127@freemail.h
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Oliver Hartkopp <oliver@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

4ffa8701

e1000e: stop cleaning when we reach tx_ring->next_to_use · dac87619

Committed by Terry Loftin on Apr 09, 2010

Tx ring buffers after tx_ring->next_to_use are volatile and could
change, possibly causing a crash.  Stop cleaning when we hit
tx_ring->next_to_use.
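
A minimal user-space sketch of that stopping condition, assuming a fixed-size ring with a per-slot "done" flag. `struct tx_ring`, `RING_SIZE`, the `done[]` array and `clean_tx_ring()` are hypothetical simplifications and do not mirror the driver's actual descriptor handling.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 8u /* hypothetical ring size for the sketch */

struct tx_ring {
	unsigned int next_to_clean; /* first slot not yet reclaimed */
	unsigned int next_to_use;   /* first slot the transmit path will fill next */
	bool done[RING_SIZE];       /* stand-in for a descriptor's "done" status */
};

/* Reclaim completed slots, but never walk past next_to_use. */
unsigned int clean_tx_ring(struct tx_ring *r)
{
	unsigned int i = r->next_to_clean;
	unsigned int cleaned = 0;

	assert(r->next_to_use < RING_SIZE);

	/* Slots at or after next_to_use may hold stale state, so stop there. */
	while (i != r->next_to_use && r->done[i]) {
		r->done[i] = false;
		i = (i + 1) % RING_SIZE;
		cleaned++;
	}
	r->next_to_clean = i;
	return cleaned;
}
```

Even if a stale "done" flag is still set on a slot at or beyond `next_to_use`, the loop above leaves that slot alone.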
Signed-off-by: Terry Loftin <terry.loftin@hp.com>
Acked-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dac87619

igb: restrict WoL for 82576 ET2 Quad Port Server Adapter · d5aa2252

Committed by Stefan Assmann on Apr 09, 2010

Restrict Wake-on-LAN to first port on 82576 ET2 quad port NICs, as it is
only supported there.
Signed-off-by: Stefan Assmann <sassmann@redhat.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d5aa2252

sparc64: Add some more commentary to __raw_local_irq_save() · c011f80b

Committed by David S. Miller on Apr 13, 2010

Suggested by Peter Zijlstra
Signed-off-by: David S. Miller <davem@davemloft.net>

c011f80b

Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ · 9343af08

Committed by David S. Miller on Apr 13, 2010

Conflicts:
	lib/Kconfig.debug

9343af08

sparc64: Fix memory leak in pci_register_iommu_region(). · e182c77c

Committed by David S. Miller on Apr 10, 2010

Found by kmemleak.

If request_resource() fails, we leak the struct resource we
allocated to represent the IOMMU mapping area.

This actually happens on sun4v machines because the IOMEM area is only
reported sans the IOMMU region, unlike all previous systems.  I'll
need to fix that at some point, but for now fix the leak.
Signed-off-by: David S. Miller <davem@davemloft.net>

e182c77c

sparc64: Add kmemleak annotation to sun4v_build_virq() · 25ad403f

Committed by David S. Miller on Apr 10, 2010

The only reference we store to this memory is in the form of a
physical address, so kmemleak can't see it.

Add a kmemleak_not_leak() annotation.

It's probably useful to be able to look at a dump of these things
either via debugfs or similar, and thus we could at some point store
them in some kind of table and therefore get rid of this annotation.
Signed-off-by: David S. Miller <davem@davemloft.net>

25ad403f

sparc64: Support kmemleak. · 8b8d8e28

Committed by David S. Miller on Apr 09, 2010

Only missing thing was an _sdata marker in vmlinux.lds.S
Signed-off-by: David S. Miller <davem@davemloft.net>

8b8d8e28

sparc64: Add function graph tracer support. · 9960e9e8

Committed by David S. Miller on Apr 07, 2010

Signed-off-by: David S. Miller <davem@davemloft.net>

9960e9e8

sparc64: Give a stack frame to the ftrace call sites. · a71d1d6b

Committed by David S. Miller on Apr 06, 2010

It's the only way we'll be able to implement the function
graph tracer properly.

A positive is that we no longer have to worry about the
linker over-optimizing the tail call, since we don't
use a tail call any more.
Signed-off-by: David S. Miller <davem@davemloft.net>

a71d1d6b

sparc64: Use a seperate counter for timer interrupts and NMI checks, like x86. · daecbf58

Committed by David S. Miller on Apr 06, 2010

This keeps us from having to call kstat_irqs_cpu(), which is a profiled
function, from the NMI handler.

Instead we use a currently empty slot in cpu_data.
Signed-off-by: David S. Miller <davem@davemloft.net>

daecbf58

sparc64: Remove profiling from some low-level bits. · f8e8a8e8

Committed by David S. Miller on Apr 06, 2010

These include the timer implementation, perf events support, and the
performance counter register (pcr) programming layer.
Signed-off-by: David S. Miller <davem@davemloft.net>

f8e8a8e8

sparc64: Kill unnecessary static on local var in ftrace_call_replace(). · d96478d5

Committed by David S. Miller on Apr 06, 2010

Signed-off-by: David S. Miller <davem@davemloft.net>

d96478d5

sparc64: Kill CONFIG_STACK_DEBUG code. · ddacd0bc

Committed by David S. Miller on Apr 12, 2010

The generic stack tracer does this job just as well.
Signed-off-by: David S. Miller <davem@davemloft.net>

ddacd0bc

sparc64: Add HAVE_FUNCTION_TRACE_MCOUNT_TEST and tidy up. · 63b75495

Committed by David S. Miller on Apr 12, 2010

Check function_trace_stop at ftrace_caller

Toss mcount_call and dummy call of ftrace_stub, unnecessary.

Document problems we'll have if the final kernel image link
ever turns on relaxation.

Properly size 'ftrace_call' so it looks right when inspecting
instructions under gdb et al.
Signed-off-by: David S. Miller <davem@davemloft.net>

63b75495

sparc64: Adjust __raw_local_irq_save() to cooperate in NMIs. · 0c25e9e6

Committed by David S. Miller on Apr 12, 2010

If we are in an NMI then doing a plain raw_local_irq_disable() will
write PIL_NORMAL_MAX into %pil, which is lower than PIL_NMI, and thus
we'll re-enable NMIs and recurse.

Doing a simple:

	%pil = %pil | PIL_NORMAL_MAX

does what we want, if we're already at PIL_NMI (15) we leave it at
that setting, else we set it to PIL_NORMAL_MAX (14).

This should get the function tracer working on sparc64.
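
The arithmetic can be checked with a small user-space sketch. The constants 14 and 15 come from the message above; `pil_plain_disable()` and `pil_or_disable()` are hypothetical helpers that only model the value written to %pil. The sketch assumes that 0, PIL_NORMAL_MAX and PIL_NMI are the only values ever programmed into %pil; with any other odd value the OR would also yield 15.

```c
#include <assert.h>

#define PIL_NORMAL_MAX 14u /* value given in the commit message */
#define PIL_NMI        15u /* value given in the commit message */

/* Plain disable: always writes PIL_NORMAL_MAX, lowering an NMI-level %pil. */
unsigned int pil_plain_disable(unsigned int pil)
{
	assert(pil <= PIL_NMI);
	(void)pil; /* the old value is ignored, which is the problem */
	return PIL_NORMAL_MAX;
}

/* OR-based disable: 15 | 14 stays 15, while 0 | 14 and 14 | 14 give 14. */
unsigned int pil_or_disable(unsigned int pil)
{
	assert(pil <= PIL_NMI);
	return pil | PIL_NORMAL_MAX;
}
```

Because PIL_NMI has all four low bits set, OR-ing anything into it leaves it unchanged, so the NMI level is never lowered.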
Signed-off-by: David S. Miller <davem@davemloft.net>

0c25e9e6

sparc64: Use kstack_valid() in die_if_kernel(). · cb256aa6

Committed by David S. Miller on Apr 12, 2010

This gets rid of a local function (is_kernel_stack()) which tries to
do the same thing, yet poorly in that it doesn't handle IRQ stacks
properly.
Signed-off-by: David S. Miller <davem@davemloft.net>

cb256aa6

virtio_net: missing sg_init_table · 0e413f22

Committed by Shirley Ma on Mar 29, 2010

Add the missing sg_init_table() call for sg_set_buf() in virtio_net; the
omission was introduced by the defer skb patch.
Reported-by: Thomas Müller <thomas@mathtm.de>
Tested-by: Thomas Müller <thomas@mathtm.de>
Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0e413f22

Linux 2.6.34-rc4 · 0d0fb0f9

Committed by Linus Torvalds on Apr 12, 2010

0d0fb0f9

Merge branch 'anonvma' · 64a8920f

Committed by Linus Torvalds on Apr 12, 2010

* anonvma:
  anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma
  anon_vma: clone the anon_vma chain in the right order
  vma_adjust: fix the copying of anon_vma chains
  Simplify and comment on anon_vma re-use for anon_vma_prepare()

64a8920f

Merge master.kernel.org:/home/rmk/linux-2.6-arm · 50b88c46

Committed by Linus Torvalds on Apr 12, 2010

* master.kernel.org:/home/rmk/linux-2.6-arm: (21 commits)
  ARM: Fix ioremap_cached()/ioremap_wc() for SMP platforms
  ARM: 6043/1: AT91 slow-clock resume: Don't wait for a disabled PLL to lock
  ARM: 6031/1: fix Thumb-2 decompressor
  ARM: 6029/1: ep93xx: gpio.c: local functions should be static
  ARM: 6028/1: ARM: add MAINTAINERS for U300
  ARM: 6024/1: bcmring: fix missing down on semaphore in dma.c
  MXC: mach_armadillo5x0: Add USB Host support.
  ARM mach-mx3: duplicated include
  ARM mach-mx3: duplicated include
  imx31: add watchdog device on litekit board.
  imx3: Add watchdog platform device support
  MXC: mach-mx31_3ds: add support for freescale mc13783 power management device.
  MXC: mach-mx31_3ds: Add SPI1 device support.
  MXC: mach-mx31_3ds: Add support for on board NAND Flash.
  MXC: mach-mx31_3ds: Update variable names over recent mach name modification.
  imx31: fix parent clock for rtc
  i.MX51: remove NFC AXI static mapping
  i.MX51: determine silicon revision dynamically
  i.MX51: map TZIC dynamically
  i.MX51: Use correct clock for gpt
  ...

50b88c46

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable · d6cf853d

Committed by Linus Torvalds on Apr 12, 2010

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: make sure the chunk allocator doesn't create zero length chunks
  Btrfs: fix data enospc check overflow

d6cf853d

Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 · 6a945f38

Committed by Linus Torvalds on Apr 12, 2010

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  quota: Fix possible dq_flags corruption
  quota: Hide warnings about writes to the filesystem before quota was turned on
  ext3: symlink must be handled via filesystem specific operation
  ext2: symlink must be handled via filesystem specific operation

6a945f38

Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6 · 50fc88cb

Committed by Linus Torvalds on Apr 12, 2010

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
  udf: add speciffic ->setattr callback
  udf: potential integer overflow

50fc88cb

Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus · 4505a493

Committed by Linus Torvalds on Apr 12, 2010

* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus: (36 commits)
  MIPS: Calculate proper ebase value for 64-bit kernels
  MIPS: Alchemy: DB1200: Remove custom wait implementation
  MIPS: Big Sur: Make defconfig more useful.
  MIPS: Fix __vmalloc() etc. on MIPS for non-GPL modules
  MIPS: Sibyte: Fix M3 TLB exception handler workaround.
  MIPS: BCM63xx: Fix build failure in board_bcm963xx.c
  MIPS: uasm: Add OR instruction.
  MIPS: Sibyte: Apply M3 workaround only on affected chip types and versions.
  MIPS: BCM63xx: Initialize gpio_out_low & out_high to current value at boot.
  MIPS: BCM63xx: Register SSB SPROM fallback in board's first stage callback
  MIPS: BCM63xx: Fix typo in cpu-feature-overrides file.
  MIPS: BCM63xx: Add support for second uart.
  MIPS: BCM63xx: Fix double gpio registration.
  MIPS: BCM63xx: Add DWVS0 board
  MIPS: BCM63xx: Add the RTA1025W-16 BCM6348-based board to suppported boards.
  MIPS: BCM63xx: Fix BCM6338 and BCM6345 gpio count
  MIPS: libgcc.h: Checkpatch cleanup
  MIPS: Loongson-2F: Flush the branch target history in BTB and RAS
  MIPS: Move signal trampolines off of the stack.
  MIPS: Preliminary VDSO
  ...

4505a493

Merge branch 'for-2.6.34' of git://linux-nfs.org/~bfields/linux · fedfb947

Committed by Linus Torvalds on Apr 12, 2010

* 'for-2.6.34' of git://linux-nfs.org/~bfields/linux:
  svcrdma: RDMA support not yet compatible with RPC6

fedfb947

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 · 44fa2b4b

Committed by Linus Torvalds on Apr 12, 2010

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix typo "numer" -> "number" in alloc.c
  nilfs2: Remove an uninitialization warning in nilfs_btree_propagate_v()
  nilfs2: fix a wrong type conversion in nilfs_ioctl()

44fa2b4b

anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma · ea90002b

Committed by Linus Torvalds on Apr 12, 2010

Otherwise we might be mapping in a page in a new mapping, but that page
(through the swapcache) would later be mapped into an old mapping too.
The page->mapping must be the case that works for everybody, not just
the mapping that happened to page it in first.

Here's the scenario:

 - page gets allocated/mapped by process A. Let's call the anon_vma we
   associate the page with 'A' to keep it easy to track.

 - Process A forks, creating process B. The anon_vma in B is 'B', and has
   a chain that looks like 'B' -> 'A'. Everything is fine.

 - Swapping happens. The page (with mapping pointing to 'A') gets swapped
   out (perhaps not to disk - it's enough to assume that it's just not
   mapped any more, and lives entirely in the swap-cache)

 - Process B pages it in, which goes like this:

        do_swap_page ->
          page = lookup_swap_cache(entry);
         ...
          set_pte_at(mm, address, page_table, pte);
          page_add_anon_rmap(page, vma, address);

   And think about what happens here!

   In particular, what happens is that this will now be the "first"
   mapping of that page, so page_add_anon_rmap() used to do

        if (first)
                __page_set_anon_rmap(page, vma, address);

   and notice what anon_vma it will use? It will use the anon_vma for
   process B!

   What happens then? Trivial: process 'A' also pages it in (nothing
   happens, it's not the first mapping), and then process 'B' execve's
   or exits or unmaps, making anon_vma B go away.

   End result: process A has a page that points to anon_vma B, but
   anon_vma B does not exist any more.  This can go on forever.  Forget
   about RCU grace periods, forget about locking, forget anything like
   that.  The bug is simply that page->mapping points to an anon_vma
   that was correct at one point, but was _not_ the one that was shared
   by all users of that possible mapping.

Changing it to always use the deepest anon_vma in the anonvma chain gets
us to the safest model.

This can be improved in certain cases: if we know the page is private to
just this particular mapping (for example, it's a new page, or it is the
only swapcache entry), we could pick the top (most specific) anon_vma.

But that's a future optimization. Make it _work_ reliably first.
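
To make the "deepest anon_vma in the chain" rule concrete, here is a small user-space sketch. It is not the kernel data structure: `struct demo_anon_vma`, `struct demo_chain` and `oldest_anon_vma()` are hypothetical names, and the chain is modelled as a singly linked list whose head is the most specific entry ('B' in the scenario) and whose tail is the oldest one ('A').

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified anon_vma: just a label for the sketch. */
struct demo_anon_vma {
	const char *name;
};

/* Chain link: the head is the most specific anon_vma, ->older leads to the root. */
struct demo_chain {
	struct demo_anon_vma *anon_vma;
	struct demo_chain *older;
};

/* Walk to the end of the chain and return the oldest (deepest) anon_vma. */
struct demo_anon_vma *oldest_anon_vma(struct demo_chain *head)
{
	assert(head != NULL);
	while (head->older)
		head = head->older;
	return head->anon_vma;
}
```

With a chain 'B' -> 'A', the lookup returns 'A' whether it starts from process B's link or from process A's, so the mapping chosen for the page does not depend on which process happened to page it in first.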
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Borislav Petkov <bp@alien8.de> [ "What do you know, I think you fixed it!" ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ea90002b

openanolis / cloud-kernel synced successfully more than 1 year ago
