提交 · f310642123e0d32d919c60ca3fab5acd130c4ba3 · OpenHarmony / kernel_linux

29 5月, 2011 2 次提交

idle governor: Avoid lock acquisition to read pm_qos before entering idle · 333c5ae9

由 Tim Chen 提交于 2月 11, 2011

Thanks to the reviews and comments by Rafael, James, Mark and Andi.
Here's version 2 of the patch incorporating your comments and also some
update to my previous patch comments.

I noticed that before entering idle state, the menu idle governor will
look up the current pm_qos target value according to the list of qos
requests received.  This look up currently needs the acquisition of a
lock to access the list of qos requests to find the qos target value,
slowing down the entrance into idle state due to contention by multiple
cpus to access this list.  The contention is severe when there are a lot
of cpus waking and going into idle.  For example, for a simple workload
that has 32 pair of processes ping ponging messages to each other, where
64 cpu cores are active in test system, I see the following profile with
37.82% of cpu cycles spent in contention of pm_qos_lock:

-     37.82%          swapper  [kernel.kallsyms]          [k]
_raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
      - 95.65% pm_qos_request
           menu_select
           cpuidle_idle_call
         - cpu_idle
              99.98% start_secondary

A better approach will be to cache the updated pm_qos target value so
reading it does not require lock acquisition as in the patch below.
With this patch the contention for pm_qos_lock is removed and I saw a
2.2X increase in throughput for my message passing workload.

cc: stable@kernel.org
Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com>
Acked-by: NAndi Kleen <ak@linux.intel.com>
Acked-by: NJames Bottomley <James.Bottomley@suse.de>
Acked-by: Nmark gross <markgross@thegnar.org>
Signed-off-by: NLen Brown <len.brown@intel.com>

333c5ae9

Cache xattr security drop check for write v2 · 69b45732

由 Andi Kleen 提交于 5月 28, 2011

Some recent benchmarking on btrfs showed that a major scaling bottleneck
on large systems on btrfs is currently the xattr lookup on every write.

Why xattr lookup on every write I hear you ask?

write wants to drop suid and security related xattrs that could set o
capabilities for executables.  To do that it currently looks up
security.capability on EVERY write (even for non executables) to decide
whether to drop it or not.

In btrfs this causes an additional tree walk, hitting some per file system
locks and quite bad scalability. In a simple read workload on a 8S
system I saw over 90% CPU time in spinlocks related to that.

Chris Mason tells me this is also a problem in ext4, where it hits
the global mbcache lock.

This patch adds a simple per inode to avoid this problem.  We only
do the lookup once per file and then if there is no xattr cache
the decision. All xattr changes clear the flag.

I also used the same flag to avoid the suid check, although
that one is pretty cheap.

A file system can also set this flag when it creates the inode,
if it has a cheap way to do so.  This is done for some common file systems
in followon patches.

With this patch a major part of the lock contention disappears
for btrfs. Some testing on smaller systems didn't show significant
performance changes, but at least it helps the larger systems
and is generally more efficient.

v2: Rename is_sgid. add file system helper.
Cc: chris.mason@oracle.com
Cc: josef@redhat.com
Cc: viro@zeniv.linux.org.uk
Cc: agruen@linbit.com
Cc: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

69b45732

28 5月, 2011 6 次提交

atomic: Add atomic_or() · 55c2945a

由 Paul E. McKenney 提交于 5月 11, 2011

An atomic_or() function is needed by TREE_RCU to avoid deadlock, so
add a generic version.
Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

55c2945a

cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed · 1e1b6c51

由 KOSAKI Motohiro 提交于 5月 19, 2011

The rule is, we have to update tsk->rt.nr_cpus_allowed if we change
tsk->cpus_allowed. Otherwise RT scheduler may confuse.
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

1e1b6c51

gpio: make gpio_{request,free}_array gpio array parameter const · 7c295975

由 Lars-Peter Clausen 提交于 5月 25, 2011

gpio_{request,free}_array should not (and do not) modify the passed gpio
array, so make the parameter const.
Signed-off-by: NLars-Peter Clausen <lars@metafoo.de>
Acked-by: NEric Miao <eric.y.miao@gmail.com>
Acked-by: NWolfram Sang <w.sang@pengutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>

7c295975

net: Kill ratelimit.h dependency in linux/net.h · c5c177b4

由 David S. Miller 提交于 5月 27, 2011

Ingo Molnar noticed that we have this unnecessary ratelimit.h
dependency in linux/net.h, which hid compilation problems from
people doing builds only with CONFIG_NET enabled.

Move this stuff out to a seperate net/net_ratelimit.h file and
include that in the only two places where this thing is needed.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
Acked-by: NIngo Molnar <mingo@elte.hu>

c5c177b4

net: Add linux/sysctl.h includes where needed. · bee95250

由 David S. Miller 提交于 5月 26, 2011

Several networking headers were depending upon the implicit
linux/sysctl.h include they get when including linux/net.h

Add explicit includes.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bee95250

net: Kill ether_table[] declaration. · 9d931dd2

由 David S. Miller 提交于 5月 26, 2011

This got missed back in 2006 when Jes Sorensen deleted
net/ethernet/sysctl_net_ether.c
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9d931dd2

27 5月, 2011 32 次提交

fs: pass exact type of data dirties to ->dirty_inode · aa385729

由 Christoph Hellwig 提交于 5月 27, 2011

Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
anything else, so that the filesystem can track internally if it
needs to push out a transaction for fdatasync or not.

This is just the prototype change with no user for it yet.  I plan
to push large XFS changes for the next merge window, and getting
this trivial infrastructure in this window would help a lot to avoid
tree interdependencies.

Also remove incorrect comments that ->dirty_inode can't block.  That
has been changed a long time ago, and many implementations rely on it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

aa385729

TPS65911: Comparator: Add comparator driver · 6851ad3a

由 Jorge Eduardo Candelaria 提交于 5月 16, 2011

This driver adds functionality to the tps65911 chip driver.

Two of the comparators are configurable by software and measures
VCCS voltage to detect high or low voltage scenarios.
Signed-off-by: NJorge Eduardo Candelaria <jedu@slimlogic.co.uk>
Acked-by: NSamuel Ortiz <sameo@linux.intel.com>
Signed-off-by: NLiam Girdwood <lrg@slimlogic.co.uk>

6851ad3a

TPS65911: Add support for added GPIO lines · 11ad14f8

由 Jorge Eduardo Candelaria 提交于 5月 16, 2011

GPIO 1 to 8 are added for TPS65911 chip version. The gpio driver
now handles more than one gpio lines. Subsequent versions of the
chip family can add new GPIO lines with minimal driver changes.
Signed-off-by: NJorge Eduardo Candelaria <jedu@slimlogic.co.uk>
Acked-by: NGrant Likely <grant.likely@secretlab.ca>
Signed-off-by: NLiam Girdwood <lrg@slimlogic.co.uk>

11ad14f8

TPS65911: Add new irq definitions · a2974732

由 Jorge Eduardo Candelaria 提交于 5月 16, 2011

TPS65911 adds new interrupt sources, as well as two new registers
to handle them, one for interrupt status and one for interrupt
masking. The added irqs are:

-VMBCH2 - Low and High threshold
-GPIO1-8 - Rising and falling edge detection
-WTCHDG - Watchdog interrupt
-PWRDN	- PWRDN reset interrupt

The code should handle these new registers only when the chip
version is TPS65911.
Signed-off-by: NJorge Eduardo Candelaria <jedu@slimlogic.co.uk>
Acked-by: NSamuel Ortiz <sameo@linux.intel.com>
Signed-off-by: NLiam Girdwood <lrg@slimlogic.co.uk>

a2974732

regulator: tps65911: Add new chip version · a320e3c3

由 Jorge Eduardo Candelaria 提交于 5月 16, 2011

The tps65911 chip introduces new features, including changes in
the regulator module.

- VDD1 and VDD2 remain unchanged.
- VDD3 is now named VDDCTRL and has a wider voltage range.
- LDOs are now named LDO1...8 and voltage ranges are sequential,
  making LDOs easier to handle.
Signed-off-by: NJorge Eduardo Candelaria <jedu@slimlogic.co.uk>
Acked-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: NLiam Girdwood <lrg@slimlogic.co.uk>

a320e3c3

MFD: TPS65910: Add support for TPS65911 device · 79557056

由 Jorge Eduardo Candelaria 提交于 5月 16, 2011

The TPS65911 is the next generation of the TPS65910 family of
PMIC chips. It adds a few features:

- Watchdog Timer
- PWM & LED generators
- Comparators for system control status

It also adds a set of Interrupts and GPIOs, among other things.

The driver exports a function to identify between different
versions of the tps65910 family, allowing other modules to
identify the capabilities of the current chip.
Signed-off-by: NJorge Eduardo Candelaria <jedu@slimlogic.co.uk>
Acked-by: NSamuel Ortiz <sameo@linux.intel.com>
Signed-off-by: NLiam Girdwood <lrg@slimlogic.co.uk>

79557056

regulator: Remove MAX8997_REG_BUCK1DVS/MAX8997_REG_BUCK2DVS/MAX8997_REG_BUCK5DVS macros · ecb9c4f5

由 Axel Lin 提交于 5月 16, 2011

In current implementation, the original macro implementation assumes the caller
pass the parameter starting from 1 (to match the register names in datasheet).
Thus we have unneeded plus one then minus one operations
when using MAX8997_REG_BUCK1DVS/MAX8997_REG_BUCK2DVS/MAX8997_REG_BUCK5DVS macros.

This patch removes these macros to avoid unneeded plus one then minus one operations
without reducing readability.
Signed-off-by: NAxel Lin <axel.lin@gmail.com>
Acked-by: NKyungmin Park <kyungmin.park@samsung.com>
Acked-by: NMyungJoo Ham <myungjoo.ham@samsung.com>
Acked-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: NLiam Girdwood <lrg@slimlogic.co.uk>

ecb9c4f5

TPS65910: Add tps65910 regulator driver · 518fb721

由 Graeme Gregory 提交于 5月 02, 2011

The regulator module consists of 3 DCDCs and 8 LDOs. The output
voltages are configurable and are meant to supply power to the
main processor and other components
Signed-off-by: NGraeme Gregory <gg@slimlogic.co.uk>
Signed-off-by: NJorge Eduardo Candelaria <jedu@slimlogic.co.uk>
Acked-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: NLiam Girdwood <lrg@slimlogic.co.uk>

518fb721

TPS65910: IRQ: Add interrupt controller · e3471bdc

由 Graeme Gregory 提交于 5月 02, 2011

This module controls the interrupt handling for the tps chip. The
interrupt sources are the following:

- GPIO falling/rising edge detection
- Battery voltage below/above threshold
- PWRON signal
- PWRHOLD signal
- Temperature detection
- RTC alarm and periodic event
Signed-off-by: NGraeme Gregory <gg@slimlogic.co.uk>
Signed-off-by: NJorge Eduardo Candelaria <jedu@slimlogic.co.uk>
Reviewed-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Acked-by: NSamuel Ortiz <sameo@linux.intel.com>
Signed-off-by: NLiam Girdwood <lrg@slimlogic.co.uk>

e3471bdc

TPS65910: GPIO: Add GPIO driver · 2537df72

由 Graeme Gregory 提交于 5月 02, 2011

TPS65910 has one configurable GPIO that can be used for several
purposes. Subsequent versions of the TPS chip support more than
one GPIO.
Signed-off-by: NGraeme Gregory <gg@slimlogic.co.uk>
Signed-off-by: NJorge Eduardo Candelaria <jedu@slimlogic.co.uk>
Acked-by: NGrant Likely <grant.likely@secretlab.ca>
Signed-off-by: NLiam Girdwood <lrg@slimlogic.co.uk>

2537df72

MFD: TPS65910: Add new mfd device for TPS65910 · 27c6750e

由 Graeme Gregory 提交于 5月 02, 2011

The TPS65910 chip is a power management IC for multimedia and handheld
devices. It contains the following components:

- Regulators
- GPIO controller
- RTC

The tps65910 core driver is registered as a platform driver and provides
communication through I2C with the host device for the different
components.
Signed-off-by: NGraeme Gregory <gg@slimlogic.co.uk>
Signed-off-by: NJorge Eduardo Candelaria <jedu@slimlogic.co.uk>
Acked-by: NSamuel Ortiz <sameo@linux.intel.com>
Signed-off-by: NLiam Girdwood <lrg@slimlogic.co.uk>

27c6750e

regulator: Support voltage offsets to compensate for drops in system · bf5892a8

由 Mark Brown 提交于 5月 08, 2011

Some systems, particularly physically large systems used for early
prototyping, may experience substantial voltage drops between the regulator
and the consumers as a result of long traces in the system. With these
systems voltages may need to be set higher than requested in order to
ensure reliable system operation.

Allow systems to work around such hardware issues by allowing constraints
to supply an offset to be applied to any requested and reported voltages.
This is not ideal, especially since the voltage drop may be load dependant,
but is sufficient for most affected systems, it is not expected to be used
in production hardware. The offset is applied after all constraint
processing so constraints should be specified in terms of consumer values
not physically configured values.
Signed-off-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: NLiam Girdwood <lrg@slimlogic.co.uk>

bf5892a8

regulator: Remove supply_regulator_dev from machine configuration · 492c826b

由 Mark Brown 提交于 5月 08, 2011

supply_regulator_dev (using a struct pointer) has been deprecated in favour
of supply_regulator (using a regulator name) for quite a few releases
now with a warning generated if it is used and there are no current in tree
users so just remove the code.
Signed-off-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: NLiam Girdwood <lrg@slimlogic.co.uk>

492c826b

gpio: Convert gpio_is_valid to return bool · 3474cb3c

由 Joe Perches 提交于 5月 10, 2011

Make the code a bit more readable.

Instead of casting an int to an unsigned then comparing to
MAX_NR_GPIOS, add a >= 0 test and let the compiler optimizer
do the conversion to unsigned.

The generated code should be the same.
Signed-off-by: NJoe Perches <joe@perches.com>
Acked-by: NLinus Walleij <linus.walleij@linaro.org>
Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>

3474cb3c

arch: remove CONFIG_GENERIC_FIND_{NEXT_BIT,BIT_LE,LAST_BIT} · 63e424c8

由 Akinobu Mita 提交于 5月 26, 2011

By the previous style change, CONFIG_GENERIC_FIND_NEXT_BIT,
CONFIG_GENERIC_FIND_BIT_LE, and CONFIG_GENERIC_FIND_LAST_BIT are not used
to test for existence of find bitops anymore.
Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
Acked-by: NGreg Ungerer <gerg@uclinux.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

63e424c8

bitops: add #ifndef for each of find bitops · 19de85ef

由 Akinobu Mita 提交于 5月 26, 2011

The style that we normally use in asm-generic is to test the macro itself
for existence, so in asm-generic, do:

	#ifndef find_next_zero_bit_le
	extern unsigned long find_next_zero_bit_le(const void *addr,
		unsigned long size, unsigned long offset);
	#endif

and in the architectures, write

	static inline unsigned long find_next_zero_bit_le(const void *addr,
		unsigned long size, unsigned long offset)
	#define find_next_zero_bit_le find_next_zero_bit_le

This adds the #ifndef for each of the find bitops in the generic header
and source files.
Suggested-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
Acked-by: NRussell King <rmk+kernel@arm.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

19de85ef

pid: fix typo in function description · 26498e89

由 Sisir Koppaka 提交于 5月 26, 2011

finds is misspelt as finr. No functional change.
Signed-off-by: NSisir Koppaka <sisir.koppaka@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

26498e89

ipmi: convert to seq_file interface · 07412736

由 Alexey Dobriyan 提交于 5月 26, 2011

The ->read_proc interface is going away, convert to seq_file.
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Cc:Corey Minyard <minyard@acm.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

07412736

fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages · 997c136f

由 Olaf Hering 提交于 5月 26, 2011

The balloon driver in a Xen guest frees guest pages and marks them as
mmio.  When the kernel crashes and the crash kernel attempts to read the
oldmem via /proc/vmcore a read from ballooned pages will generate 100%
load in dom0 because Xen asks qemu-dm for the page content.  Since the
reads come in as 8byte requests each ballooned page is tried 512 times.

With this change a hook can be registered which checks wether the given
pfn is really ram.  The hook has to return a value > 0 for ram pages, a
value < 0 on error (because the hypercall is not known) and 0 for non-ram
pages.

This will reduce the time to read /proc/vmcore.  Without this change a
512M guest with 128M crashkernel region needs 200 seconds to read it, with
this change it takes just 2 seconds.
Signed-off-by: NOlaf Hering <olaf@aepfle.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

997c136f

mm: extract exe_file handling from procfs · 38646013

由 Jiri Slaby 提交于 5月 26, 2011

Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
This was because exe_file was needed only for /proc/<pid>/exe.  Since we
will need the exe_file functionality also for core dumps (so core name can
contain full binary path), built this functionality always into the
kernel.

To achieve that move that out of proc FS to the kernel/ where in fact it
should belong.  By doing that we can make dup_mm_exe_file static.  Also we
can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.
Signed-off-by: NJiri Slaby <jslaby@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

38646013

memcg: add the pagefault count into memcg stats · 456f998e

由 Ying Han 提交于 5月 26, 2011

Two new stats in per-memcg memory.stat which tracks the number of page
faults and number of major page faults.

  "pgfault"
  "pgmajfault"

They are different from "pgpgin"/"pgpgout" stat which count number of
pages charged/discharged to the cgroup and have no meaning of reading/
writing page to disk.

It is valuable to track the two stats for both measuring application's
performance as well as the efficiency of the kernel page reclaim path.
Counting pagefaults per process is useful, but we also need the aggregated
value since processes are monitored and controlled in cgroup basis in
memcg.

Functional test: check the total number of pgfault/pgmajfault of all
memcgs and compare with global vmstat value:

  $ cat /proc/vmstat | grep fault
  pgfault 1070751
  pgmajfault 553

  $ cat /dev/cgroup/memory.stat | grep fault
  pgfault 1071138
  pgmajfault 553
  total_pgfault 1071142
  total_pgmajfault 553

  $ cat /dev/cgroup/A/memory.stat | grep fault
  pgfault 199
  pgmajfault 0
  total_pgfault 199
  total_pgmajfault 0

Performance test: run page fault test(pft) wit 16 thread on faulting in
15G anon pages in 16G container.  There is no regression noticed on the
"flt/cpu/s"

Sample output from pft:

  TAG pft:anon-sys-default:
    Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
    15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260

  +-------------------------------------------------------------------------+
      N           Min           Max        Median           Avg        Stddev
  x  10     16682.962     17344.027     16913.524     16928.812      166.5362
  +  10     16695.568     16923.896     16820.604     16824.652     84.816568
  No difference proven at 95.0% confidence

[akpm@linux-foundation.org: fix build]
[hughd@google.com: shmem fix]
Signed-off-by: NYing Han <yinghan@google.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

456f998e

memcg: rename mem_cgroup_zone_nr_pages() to mem_cgroup_zone_nr_lru_pages() · 1bac180b

由 Ying Han 提交于 5月 26, 2011

The caller of the function has been renamed to zone_nr_lru_pages(), and
this is just fixing up in the memcg code.  The current name is easily to
be mis-read as zone's total number of pages.
Signed-off-by: NYing Han <yinghan@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1bac180b

memcg: fix get_scan_count() for small targets · 246e87a9

由 KAMEZAWA Hiroyuki 提交于 5月 26, 2011

During memory reclaim we determine the number of pages to be scanned per
zone as

	(anon + file) >> priority.
Assume
	scan = (anon + file) >> priority.

If scan < SWAP_CLUSTER_MAX, the scan will be skipped for this time and
priority gets higher.  This has some problems.

  1. This increases priority as 1 without any scan.
     To do scan in this priority, amount of pages should be larger than 512M.
     If pages>>priority < SWAP_CLUSTER_MAX, it's recorded and scan will be
     batched, later. (But we lose 1 priority.)
     If memory size is below 16M, pages >> priority is 0 and no scan in
     DEF_PRIORITY forever.

  2. If zone->all_unreclaimabe==true, it's scanned only when priority==0.
     So, x86's ZONE_DMA will never be recoverred until the user of pages
     frees memory by itself.

  3. With memcg, the limit of memory can be small. When using small memcg,
     it gets priority < DEF_PRIORITY-2 very easily and need to call
     wait_iff_congested().
     For doing scan before priorty=9, 64MB of memory should be used.

Then, this patch tries to scan SWAP_CLUSTER_MAX of pages in force...when

  1. the target is enough small.
  2. it's kswapd or memcg reclaim.

Then we can avoid rapid priority drop and may be able to recover
all_unreclaimable in a small zones.  And this patch removes nr_saved_scan.
 This will allow scanning in this priority even when pages >> priority is
very small.
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: NYing Han <yinghan@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

246e87a9

memcg: reclaim memory from nodes in round-robin order · 889976db

由 Ying Han 提交于 5月 26, 2011

Presently, memory cgroup's direct reclaim frees memory from the current
node.  But this has some troubles.  Usually when a set of threads works in
a cooperative way, they tend to operate on the same node.  So if they hit
limits under memcg they will reclaim memory from themselves, damaging the
active working set.

For example, assume 2 node system which has Node 0 and Node 1 and a memcg
which has 1G limit.  After some work, file cache remains and the usages
are

   Node 0:  1M
   Node 1:  998M.

and run an application on Node 0, it will eat its foot before freeing
unnecessary file caches.

This patch adds round-robin for NUMA and adds equal pressure to each node.
When using cpuset's spread memory feature, this will work very well.

But yes, a better algorithm is needed.

[akpm@linux-foundation.org: comment editing]
[kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons]
Signed-off-by: NYing Han <yinghan@google.com>
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

889976db

memcg: count the soft_limit reclaim in global background reclaim · 0ae5e89c

由 Ying Han 提交于 5月 26, 2011

The global kswapd scans per-zone LRU and reclaims pages regardless of the
cgroup. It breaks memory isolation since one cgroup can end up reclaiming
pages from another cgroup. Instead we should rely on memcg-aware target
reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under
memory pressure.

In the global background reclaim, we do soft reclaim before scanning the
per-zone LRU. However, the return value is ignored. This patch is the first
step to skip shrink_zone() if soft_limit reclaim does enough work.

This is part of the effort which tries to reduce reclaiming pages in global
LRU in memcg. The per-memcg background reclaim patchset further enhances the
per-cgroup targetting reclaim, which I should have V4 posted shortly.

Try running multiple memory intensive workloads within seperate memcgs. Watch
the counters of soft_steal in memory.stat.

  $ cat /dev/cgroup/A/memory.stat | grep 'soft'
  soft_steal 240000
  soft_scan 240000
  total_soft_steal 240000
  total_soft_scan 240000

This patch:

In the global background reclaim, we do soft reclaim before scanning the
per-zone LRU.  However, the return value is ignored.

We would like to skip shrink_zone() if soft_limit reclaim does enough
work.  Also, we need to make the memory pressure balanced across per-memcg
zones, like the logic vm-core.  This patch is the first step where we
start with counting the nr_scanned and nr_reclaimed from soft_limit
reclaim into the global scan_control.
Signed-off-by: NYing Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0ae5e89c

mm: move enum vm_event_item into a standalone header file · f042e707

由 Andrew Morton 提交于 5月 26, 2011

enums are problematic because they cannot be forward-declared:

  akpm2:/home/akpm> cat t.c

  enum foo;

  static inline void bar(enum foo f)
  {
  }
  akpm2:/home/akpm> gcc -c t.c
  t.c:4: error: parameter 1 ('f') has incomplete type

So move the enum's definition into a standalone header file which can be used
wherever its definition is needed.

Cc: Ying Han <yinghan@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f042e707

cgroup: remove the ns_cgroup · a77aea92

由 Daniel Lezcano 提交于 5月 26, 2011

The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
leads to some problems:

  * cgroup creation is out-of-control
  * cgroup name can conflict when pids are looping
  * it is not possible to have a single process handling a lot of
    namespaces without falling in a exponential creation time
  * we may want to create a namespace without creating a cgroup

  The ns_cgroup was replaced by a compatibility flag 'clone_children',
  where a newly created cgroup will copy the parent cgroup values.
  The userspace has to manually create a cgroup and add a task to
  the 'tasks' file.

This patch removes the ns_cgroup as suggested in the following thread:

https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

The 'cgroup_clone' function is removed because it is no longer used.

This is a userspace-visible change.  Commit 45531757 ("cgroup: notify
ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
printk warning users that the feature is planned for removal.  Since that
time we have heard from XXX users who were affected by this.
Signed-off-by: NDaniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: NSerge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
Reviewed-by: NLi Zefan <lizf@cn.fujitsu.com>
Acked-by: NPaul Menage <menage@google.com>
Acked-by: NMatt Helsley <matthltc@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a77aea92

cgroups: add per-thread subsystem callbacks · f780bdb7

由 Ben Blum 提交于 5月 26, 2011

Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
for cgroups's subsystem interface.  Unlike can_attach and attach, these
are for per-thread operations, to be called potentially many times when
attaching an entire threadgroup.

Also, the old "bool threadgroup" interface is removed, as replaced by
this.  All subsystems are modified for the new interface - of note is
cpuset, which requires from/to nodemasks for attach to be globally scoped
(though per-cpuset would work too) to persist from its pre_attach to
attach_task and attach.

This is a pre-patch for cgroup-procs-writable.patch.
Signed-off-by: NBen Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: NPaul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f780bdb7

cgroups: read-write lock CLONE_THREAD forking per threadgroup · 4714d1d3

由 Ben Blum 提交于 5月 26, 2011

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

Add an rwsem that lives in a threadgroup's signal_struct that's taken for
reading in the fork path, under CONFIG_CGROUPS.  If another part of the
kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other
system would both depend on.

This is a pre-patch for cgroup-procs-write.patch.
Signed-off-by: NBen Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: NPaul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4714d1d3

drivers/rtc/rtc-mxc.c: remove defines already included in rtc.h · 9796cc96

由 Wolfram Sang 提交于 5月 26, 2011

[akpm@linux-foundation.org: retain the code comments]
Signed-off-by: NWolfram Sang <w.sang@pengutronix.de>
Cc: Vladimir Zapolskiy <vzapolskiy@gmail.com>
Cc: Alessandro Zummo <alessandro.zummo@towertech.it>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9796cc96

flex_array: avoid divisions when accessing elements · 704f15dd

由 Jesse Gross 提交于 5月 26, 2011

On most architectures division is an expensive operation and accessing an
element currently requires four of them.  This performance penalty
effectively precludes flex arrays from being used on any kind of fast
path.  However, two of these divisions can be handled at creation time and
the others can be replaced by a reciprocal divide, completely avoiding
real divisions on access.

[eparis@redhat.com: rebase on top of changes to support 0 len elements]
[eparis@redhat.com: initialize part_nr when array fits entirely in base]
Signed-off-by: NJesse Gross <jesse@nicira.com>
Signed-off-by: NEric Paris <eparis@redhat.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

704f15dd

Intel xhci: Add PCI id for Panther Point xHCI host. · 7fe4fc88

由 Sarah Sharp 提交于 5月 25, 2011

This adds the PCI ID for the xHCI (USB 3.0) host controller in the Intel
Panther Point chipset.  It will be used by both the EHCI and xHCI driver
in the following patches.
Signed-off-by: NSarah Sharp <sarah.a.sharp@linux.intel.com>

7fe4fc88

OpenHarmony / kernel_linux 上一次同步 大约 4 年

OpenHarmony / kernel_linux
上一次同步大约 4 年