Commits · 834392a7d92677ff2bdc1c709b1171ee585b55c9 · openanolis / cloud-kernel

31 Mar, 2016 8 commits

serial: doc: Un-document non-existing uart_write_console() · 834392a7

Authored by Geert Uytterhoeven on Mar 14, 2016

uart_write_console() never existed, not even when the "new
uart_write_console function" was documented.

Fixes: 67ab7f59 ("[SERIAL] Update serial driver documentation")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

Documentation: mmc: Add the introduction for mmc-utils · f3e27a80

Authored by Baolin Wang on Mar 16, 2016

This patch introduces an MMC test tool called mmc-utils, which is convenient
for exercising and testing MMC/SD devices from userspace.

Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

Documentation: update URLs for Richard Gooch's articles · f7ca20de

Authored by Luis de Bethencourt on Mar 19, 2016

The current URL for "Kernel API changes from 2.0 to 2.2" has been unavailable
for some time; update it. The second article, about changes from 2.2 to 2.4,
is missing a URL; add it.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

Documentation: update URL of Analysis of the Ext2fs structure · dc831ab3

Authored by Luis de Bethencourt on Mar 19, 2016

The current URL has been down for some time; update it to a working one.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

Documentation: add Linux Kernel Development book · a06bdd49

Authored by Luis de Bethencourt on Mar 19, 2016

The Linux Kernel Development book by Robert Love has been recommended to me
by multiple kernel hackers. It is worth having in the list of books in
kernel-docs.txt for newbies looking for good learning resources.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

Documentation: update missing index files in block/00-INDEX · 5408e5a4

Authored by Wei Fang on Mar 21, 2016

Update missing index files in block/00-INDEX.

Signed-off-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

Documentation/IRQ-domain.txt: Document irq_domain_create_{linear, tree} · dbe7fcda

Authored by Jianyu Zhan on Mar 27, 2016

They have the same functionality as irq_domain_add_{linear, tree},
except that they accept a different first argument.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

bpf: doc: "neg" opcode has no operands · 83d26b63

Authored by Dave Anderson on Mar 28, 2016

Fixes a copy-paste-o in the BPF opcode table: "neg" takes no arguments
and thus has no addressing modes.

Signed-off-by: Dave Anderson <danderson@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

26 Mar, 2016 1 commit

mm, kasan: SLAB support · 7ed2f9e6

Authored by Alexander Potapenko on Mar 25, 2016

Add KASAN hooks to the SLAB allocator.

This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.

Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

24 Mar, 2016 2 commits

PM / AVS: rockchip-io: add io selectors and supplies for rk3399 · f447671b

Authored by David Wu on Mar 16, 2016

This adds the necessary data for handling io voltage domains on the rk3399.
As an interesting tidbit, the rk3399 contains two separate iodomain areas:
one in the regular General Register Files (GRF) and one in PMUGRF in the
pmu power domain.

Signed-off-by: David Wu <david.wu@rock-chips.com>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Acked-by: Kevin Hilman <khilman@baylibre.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Documentation/ABI: Update sysfs-driver-toshiba_acpi file · 33f857a4

Authored by Azael Avalos on Jan 25, 2016

This patch updates the documentation file, adding the Cooling Method
entry.

Signed-off-by: Azael Avalos <coproscefalo@gmail.com>
Signed-off-by: Darren Hart <dvhart@linux.intel.com>

23 Mar, 2016 6 commits

kernel: add kcov code coverage · 5c9a8750

Authored by Dmitry Vyukov on Mar 22, 2016

kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing).  Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system.  A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/).  However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.

kcov does not aim to collect as much coverage as possible.  It aims to
collect more or less stable coverage that is a function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts, and instrumentation of some inherently non-deterministic or
non-interesting parts of the kernel is disabled (e.g.  scheduler, locking).

Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes.  Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch).  I've
dropped the second mode for simplicity.

This patch adds the necessary support on the kernel side.  The complementary
compiler support was added in gcc revision 231296.

We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:

  https://github.com/google/syzkaller/wiki/Found-Bugs

We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation".  For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.

Why not gcov?  A typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
typical coverage can be just a dozen basic blocks (e.g.  an invalid
input).  In such a context gcov becomes prohibitively expensive, as the
reset/collect coverage steps depend on the total number of basic
blocks/edges in the program (in the case of the kernel it is about 2M).
The cost of kcov depends only on the number of executed basic
blocks/edges.  On top of that, the kernel requires per-thread coverage
because there are always background threads and unrelated processes that
also produce coverage.  With inlined gcov instrumentation per-thread
coverage is not possible.
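
The cost argument above can be illustrated with a toy model. This is only a sketch, not the kernel interface: `CounterTable`, `TraceBuffer`, `fuzz_iteration` and the block numbers are hypothetical names chosen for illustration. It assumes a gcov-style table with one counter per basic block in the program, and a kcov-style trace buffer whose first word holds the number of recorded entries. The returned "cost" is simply the number of words touched by the reset and collect steps.

```python
class CounterTable:
    """gcov-style model: one counter per basic block in the whole program."""

    def __init__(self, total_blocks):
        self.counters = [0] * total_blocks

    def reset(self):
        # Every counter is cleared, so the cost grows with program size.
        for i in range(len(self.counters)):
            self.counters[i] = 0
        return len(self.counters)

    def hit(self, block):
        self.counters[block] += 1

    def collect(self):
        # The whole table is scanned to find the blocks that ran.
        covered = [i for i, c in enumerate(self.counters) if c]
        return covered, len(self.counters)


class TraceBuffer:
    """kcov-style model: word 0 is the entry count, entries follow it."""

    def __init__(self, capacity):
        self.words = [0] * (capacity + 1)

    def reset(self):
        # Only the count word is cleared.
        self.words[0] = 0
        return 1

    def hit(self, block):
        n = self.words[0]
        if n + 1 < len(self.words):  # entries are dropped once the buffer is full
            self.words[n + 1] = block
            self.words[0] = n + 1

    def collect(self):
        # Only the count word and the recorded entries are read.
        n = self.words[0]
        return self.words[1:n + 1], n + 1


def fuzz_iteration(cov, executed_blocks):
    """One loop step: (1) reset, (2) execute, (3) collect."""
    cost = cov.reset()
    for block in executed_blocks:
        cov.hit(block)
    covered, collect_cost = cov.collect()
    return sorted(set(covered)), cost + collect_cost
```

Both models report the same covered blocks for the same input. The counter table pays for every block in the program on each iteration, while the trace buffer pays only for what was executed.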

kcov exposes kernel PCs and control flow to user-space which is
insecure.  But debugfs should not be mapped as user accessible.

Based on a patch by Quentin Casasnovas.

[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

rapidio: add mport char device driver · e8de3701

Authored by Alexandre Bounine on Mar 22, 2016

Add mport character device driver to provide a user space interface to
basic RapidIO subsystem operations.

See included Documentation/rapidio/mport_cdev.txt for more details.

[akpm@linux-foundation.org: fix printk warning on i386]
[dan.carpenter@oracle.com: mport_cdev: fix some error codes]
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Tested-by: Barry Wood <barry.wood@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
Cc: Barry Wood <barry.wood@idt.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

rapidio/tsi721: add filtered debug output · 72d8a0d2

Authored by Alexandre Bounine on Mar 22, 2016

Replace "all-or-nothing" debug output with controlled debug output using
functional block masks.  This allows run time control of debug messages
through the 'dbg_level' module parameter.

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fat: add config option to set UTF-8 mount option by default · 38739380

Authored by Maciej S. Szmigiero on Mar 22, 2016

FAT has long supported its own default file name encoding config
setting, separate from CONFIG_NLS_DEFAULT.

However, if UTF-8 encoded file names are desired, the FAT character set
should not be set to utf8, since this would make file names case
sensitive even if case insensitive matching is requested.  Instead, the
"utf8" mount option should be provided to enable UTF-8 file names in
a FAT file system.

Unfortunately, there was no way to set the default value of this
option, so on a UTF-8 system the "utf8" mount option had to be added
manually to most FAT mounts.

This patch adds a config option to set that default value.

Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ocfs2: add feature document for online file check · d750c42a

Authored by Gang He on Mar 22, 2016

This document describes the OCFS2 online file check feature.  OCFS2 is
often used in high-availability systems.  However, OCFS2 usually
converts the filesystem to read-only when it encounters an error.  This
may not be necessary, since turning the filesystem read-only would affect
other running processes as well, decreasing availability.

Therefore a mount option (errors=continue) is introduced, which returns
the -EIO errno to the calling process and terminates further processing
so that the filesystem is not corrupted further.  The filesystem is not
converted to read-only, and the problematic file's inode number is
reported in the kernel log.  The user can try to check/fix this file via
the online filecheck feature.

Signed-off-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cpufreq: powernv: Add sysfs attributes to show throttle stats · 1b028984

Authored by Shilpasri G Bhat on Mar 22, 2016

Create sysfs attributes to export throttle information in
/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats directory. The
newly added sysfs files are as follows:

 1)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/turbo_stat
 2)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/sub-turbo_stat
 3)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/unthrottle
 4)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/powercap
 5)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/overtemp
 6)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/supply_fault
 7)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/overcurrent
 8)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/occ_reset

Detailed explanation of each attribute is added to
Documentation/ABI/testing/sysfs-devices-system-cpu
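
As a minimal sketch of how userspace might read the attributes listed above: the helper name `read_throttle_stats` and its `sysfs_root` parameter are hypothetical, and the file contents are returned as raw text because the commit message does not describe their format.

```python
import os

# Attribute names taken from the list in the commit message.
THROTTLE_ATTRS = (
    "turbo_stat", "sub-turbo_stat", "unthrottle", "powercap",
    "overtemp", "supply_fault", "overcurrent", "occ_reset",
)


def read_throttle_stats(cpu, sysfs_root="/sys"):
    """Return {attribute: raw text} for one CPU; missing files are skipped."""
    base = os.path.join(sysfs_root, "devices", "system", "cpu",
                        "cpu%d" % cpu, "cpufreq", "throttle_stats")
    stats = {}
    for name in THROTTLE_ATTRS:
        try:
            with open(os.path.join(base, name)) as f:
                stats[name] = f.read().strip()
        except OSError:
            # Attribute not present (other driver, older kernel, other CPU).
            continue
    return stats
```

On a system without the powernv cpufreq driver the function simply returns an empty dictionary.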
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

22 Mar, 2016 3 commits

igmp: Document sysctl_igmp_max_msf · 537377d3

Authored by Benjamin Poirier on Mar 21, 2016

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Fix indentation of the conf/ documentation block · 6b226e2f

Authored by Benjamin Poirier on Mar 21, 2016

Commit d67ef35f ("clarify documentation for
net.ipv4.igmp_max_memberships") mistakenly indented a block of
documentation such that it now looks like it belongs to a specific sysctl.
Restore that block's original position.

Cc: Jeremy Eder <jeder@redhat.com>
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtc: mcp795: add devicetree support · 7f8a5892

Authored by Emil Bartczak on Mar 21, 2016

Add device tree support to the rtc-mcp795 driver.

Signed-off-by: Emil Bartczak <emilbart@gmail.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>

21 Mar, 2016 3 commits

Documentation: dt: mailbox: Add TI Message Manager · 94b5293d

Authored by Nishanth Menon on Mar 16, 2016

Message Manager is a hardware block used to communicate with various
processor systems within certain Texas Instruments' Keystone
generation SoCs.

This hardware engine is used to transfer messages from various compute
entities (or processors) within the SoC. It is designed to be self
contained without needing software initialization for operation.

Signed-off-by: Nishanth Menon <nm@ti.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>

irqchip/mbigen: Adjust DT bindings to handle multiple devices in a module · d0e28641

Authored by MaJun on Mar 17, 2016

A mbigen hardware module can contain more than one device node. These device
nodes contain the same register definition.

mbigen_dev1:intc_dev1 {
	...
	reg = <0x0 0xc0080000 0x0 0x10000>;
	...
};

mbigen_dev2:intc_dev2 {
	...
	reg = <0x0 0xc0080000 0x0 0x10000>;
	...
};

In this case both devices try to request the same resource resulting in a
resource conflict.

To address this problem the devices need to be subnodes of the mbigen hardware
module, which then contains the unique register space.

[ tglx: Massaged changelog ]
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ma Jun <majun258@huawei.com>
Cc: jason@lakedaemon.net
Cc: marc.zyngier@arm.com
Cc: Catalin.Marinas@arm.com
Cc: guohanjun@huawei.com
Cc: Will.Deacon@arm.com
Cc: huxinwei@huawei.com
Cc: lizefan@huawei.com
Cc: dingtianhong@huawei.com
Cc: zhaojunhua@hisilicon.com
Cc: liguozhu@hisilicon.com
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/20160203111602.GA1234@leverpostej
Link: http://lkml.kernel.org/r/1458203641-17172-2-git-send-email-majun258@huawei.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

dma-buf: Update docs for SYNC ioctl · 87e332d5

Authored by Daniel Vetter on Mar 21, 2016

Just a bit of wording polish plus mentioning that it can fail and must
be restarted.

Requested by Sumit.

v2: Fix them typos (Hans).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tiago Vignatti <tiago.vignatti@intel.com>
Cc: Stéphane Marchesin <marcheu@chromium.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Daniel Vetter <daniel.vetter@intel.com>
CC: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: intel-gfx@lists.freedesktop.org
Cc: devel@driverdev.osuosl.org
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Acked-by: Sumit Semwal <sumit.semwal@linaro.org>
Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

18 Mar, 2016 12 commits

nfsd: add SCSI layout support · f99d4fbd

Authored by Christoph Hellwig on Mar 04, 2016

This is a simple extension to the block layout driver to use SCSI
persistent reservations for access control and fencing, as well as
SCSI VPD pages for device identification.

For this we need to pass the nfs4_client to the proc_getdeviceinfo method
to generate the reservation key, and add a new fence_client method
to allow for fence actions in the layout driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

of: Add vendor prefix for eGalax_eMPIA Technology Inc · 5027e19d

Authored by Fabio Estevam on Mar 04, 2016

eGalax_eMPIA Technology Inc (EETI) is a company specializing in
touchscreen controller solutions.

Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
Signed-off-by: Rob Herring <robh@kernel.org>

dt-bindings: i2c: Spelling s/propoerty/property/ · ddf3dc82

Authored by Geert Uytterhoeven on Mar 15, 2016

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>

fix Christoph's email addresses · 93e205a7

Authored by Christoph Lameter on Mar 17, 2016

There are various email addresses for me throughout the kernel.  Use the
one that will always be valid.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

proc: add /proc/<pid>/timerslack_ns interface · 5de23d43

Authored by John Stultz on Mar 17, 2016

This patch provides a /proc/<pid>/timerslack_ns interface which exposes a
task's timerslack value in nanoseconds and allows it to be changed.

This allows power/performance management software to set timer slack for
other threads according to its policy for the thread (such as when the
thread is designated foreground vs.  background activity).

If the value written is non-zero, slack is set to that value.  Otherwise
it is set to the default for the thread.

This interface checks that the calling task has permission to use
PTRACE_MODE_ATTACH_FSCREDS on the target task, so that we can ensure
arbitrary apps do not change the timer slack for other apps.
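
A minimal sketch of how a management daemon might use this interface from userspace. The helper names (`timerslack_path`, `get_timerslack_ns`, `set_timerslack_ns`) and the `proc_root` parameter are hypothetical and exist only for illustration. The sketch assumes the file holds a single decimal integer; per the description above, writing 0 requests the thread's default slack.

```python
import os


def timerslack_path(pid, proc_root="/proc"):
    """Build the path of the timerslack_ns file for a given pid."""
    return os.path.join(proc_root, str(pid), "timerslack_ns")


def get_timerslack_ns(pid, proc_root="/proc"):
    """Read the task's current timer slack in nanoseconds."""
    with open(timerslack_path(pid, proc_root)) as f:
        return int(f.read().strip())


def set_timerslack_ns(pid, slack_ns, proc_root="/proc"):
    """Write a slack value in nanoseconds; 0 asks for the thread's default."""
    if slack_ns < 0:
        raise ValueError("slack must be >= 0")
    with open(timerslack_path(pid, proc_root), "w") as f:
        f.write("%d\n" % slack_ns)
```

On a real system the write fails with a permission error unless the caller passes the ptrace access check described above.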
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage · b6e6edcf

Authored by Johannes Weiner on Mar 17, 2016

Setting the original memory.limit_in_bytes hardlimit is subject to a
race condition when the desired value is below the current usage. The
code tries a few times to first reclaim and then see if the usage has
dropped to where we would like it to be, but there is no locking, and
the workload is free to continue making new charges up to the old limit.
Thus, attempting to shrink a workload relies on pure luck and hope that
the workload happens to cooperate.

To fix this in the cgroup2 memory.max knob, do it the other way round:
set the limit first, then try enforcement. And if reclaim is not able
to succeed, trigger OOM kills in the group. Keep going until the new
limit is met, we run out of OOM victims and there's only unreclaimable
memory left, or the task writing to memory.max is killed. This allows
users to shrink groups reliably, and the behavior is consistent with
what happens when new charges are attempted in excess of memory.max.
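
The ordering described above (limit first, then reclaim, then OOM kills) can be sketched as a toy model. This is not the kernel code: `shrink_to_limit` and its parameters are hypothetical, memory is counted in abstract pages, each entry in `victims` is the number of pages freed by killing that task, and the case where the writing task itself is killed is left out.

```python
def shrink_to_limit(usage, reclaimable, victims, new_limit):
    """Toy model of shrinking a group below its current usage.

    Returns (final_usage, tasks_killed).
    """
    # Step 1: the new limit takes effect first, so from here on the
    # workload cannot charge up to the old limit any more.
    limit = new_limit
    victims = list(victims)
    killed = 0

    # Step 2: enforce the limit until it is met or nothing more can be done.
    while usage > limit:
        if reclaimable > 0:
            # Reclaim as much as is needed, bounded by what is reclaimable.
            freed = min(reclaimable, usage - limit)
            reclaimable -= freed
            usage -= freed
            continue
        if not victims:
            # No OOM victims left and only unreclaimable memory remains.
            break
        usage = max(0, usage - victims.pop(0))
        killed += 1
    return usage, killed
```

For example, with 100 pages in use, 30 reclaimable and a new limit of 50, the model reclaims 30 pages and then kills one task to get under the limit.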
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: thp: set THP defrag by default to madvise and add a stall-free defrag option · 444eb2a4

Authored by Mel Gorman on Mar 17, 2016

THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure.  The problem is that
THP allocation requests potentially enter reclaim/compaction.  This
potentially incurs a severe stall that is not guaranteed to be offset by
reduced TLB misses.  While there has been considerable effort to reduce
the impact of reclaim/compaction, it is still a high cost and workloads
that should fit in memory fail to do so.  Specifically, a simple
anon/file streaming workload will enter direct reclaim on NUMA at least
even though the working set size is 80% of RAM.  It's been years and
it's time to throw in the towel.

First, this patch defines THP defrag as follows:

 madvise: A failed allocation will direct reclaim/compact if the application requests it
 never:   Neither reclaim/compact nor wake kswapd
 defer:   A failed allocation will wake kswapd/kcompactd
 always:  A failed allocation will direct reclaim/compact (historical behaviour)
          khugepaged defrag will enter direct/reclaim but not wake kswapd.

Next it sets the default defrag option to be "madvise" to only enter
direct reclaim/compaction for applications that specifically requested
it.
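
The four modes above can be summarised in a short sketch. Two details are not taken from the commit text and are assumptions here: the sysfs path in `DEFRAG_PATH`, and the convention that the active mode is shown in square brackets when the file is read. The function names are hypothetical.

```python
# Assumed location of the knob; not stated in the commit message.
DEFRAG_PATH = "/sys/kernel/mm/transparent_hugepage/defrag"


def parse_selected(text):
    """Parse text such as 'always defer [madvise] never'.

    Returns (selected_mode, all_modes); the bracketed token is the active one.
    """
    options = []
    selected = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        options.append(token)
    return selected, options


def direct_reclaim_on_fault(mode, region_madvised):
    """Does a failed THP allocation stall in direct reclaim/compaction?"""
    if mode == "always":
        return True                # historical behaviour
    if mode == "madvise":
        return region_madvised     # only if the application asked for it
    if mode in ("defer", "never"):
        return False               # defer wakes kswapd/kcompactd instead
    raise ValueError("unknown defrag mode: %r" % mode)
```

With the new default of "madvise", only regions the application has explicitly marked can stall on a failed allocation.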

Lastly, it removes a check from the page allocator slowpath that is
related to __GFP_THISNODE to allow "defer" to work.  The callers that
really care are slub/slab and they are updated accordingly.  The slab
one may be surprising because it also corrects a comment as kswapd was
never woken up by that path.

This means that a THP fault will no longer stall for most applications
by default, which is the ideal for most users: they get THP if they are
immediately available.  There are still options for users that prefer a
stall at startup of a new application by either restoring historical
behaviour with "always" or picking a half-way point with "defer" where
kswapd does some of the work in the background and wakes kcompactd if
necessary.  THP defrag for khugepaged remains enabled and will enter
direct/reclaim but not wake up kswapd or kcompactd.

After this patch a THP allocation failure will quickly fall back and rely
on khugepaged to recover the situation at some time in the future.  In
some cases, this will reduce THP usage but the benefit of THP is hard to
measure and not a universal win, whereas a stall to reclaim/compaction
is definitely measurable and can be painful.

The first test for this is using "usemem" to read a large file and write
a large anonymous mapping (to avoid the zero page) multiple times.  The
total size of the mappings is 80% of RAM and the benchmark simply
measures how long it takes to complete.  It uses multiple threads to see
if that is a factor.  On UMA, the performance is almost identical so is
not reported but on NUMA, we see this

usemem
                                   4.4.0                 4.4.0
                          kcompactd-v1r1         nodefrag-v1r3
Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)
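
The percentage column in the table above appears to be the relative reduction against the baseline (first) column. A small sketch of that arithmetic follows; the function name `improvement_pct` is hypothetical. Recomputing from the rounded means shown here can differ from the printed figure in the last digit, presumably because the table was produced from unrounded data.

```python
def improvement_pct(baseline, patched):
    """Relative reduction against the baseline, in percent (positive = better)."""
    return (baseline - patched) / baseline * 100.0


# Values copied from the usemem table above.
elapsed_1 = improvement_pct(185.84, 105.51)   # Elapsd-1 row
elapsed_4 = improvement_pct(26.19, 25.58)     # Elapsd-4 row
system_4 = improvement_pct(37.85, 34.02)      # System-4 row

for name, value in (("Elapsd-1", elapsed_1),
                    ("Elapsd-4", elapsed_4),
                    ("System-4", system_4)):
    print("%-9s %6.2f%%" % (name, value))
```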

For a single thread, the benchmark completes 43.23% faster with this
patch applied, with smaller benefits as the thread count increases.
Similarly, notice the large reduction in most cases in system CPU usage.
The overall CPU time is

               4.4.0       4.4.0
        kcompactd-v1r1 nodefrag-v1r3
User        10357.65    10438.33
System       3988.88     3543.94
Elapsed      2203.01     1634.41

Which is substantial. Now, the reclaim figures

                                 4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
Minor Faults                 128458477   278352931
Major Faults                   2174976         225
Swap Ins                      16904701           0
Swap Outs                     17359627           0
Allocation stalls                43611           0
DMA allocs                           0           0
DMA32 allocs                  19832646    19448017
Normal allocs                614488453   580941839
Movable allocs                       0           0
Direct pages scanned          24163800           0
Kswapd pages scanned                 0           0
Kswapd pages reclaimed               0           0
Direct pages reclaimed        20691346           0
Compaction stalls                42263           0
Compaction success                 938           0
Compaction failures              41325           0

This patch eliminates almost all swapping and direct reclaim activity.
There is still overhead but it's from NUMA balancing which does not
identify that it's pointless trying to do anything with this workload.

I also tried the thpscale benchmark which forces a corner case where
compaction can be used heavily and measures the latency of whether base
or huge pages were used

thpscale Fault Latencies
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)

The average time to fault pages is substantially reduced in the majority
of cases but with the obvious caveat that fewer THPs are actually used
in this adverse workload

                                   4.4.0                 4.4.0
                          kcompactd-v1r1         nodefrag-v1r3
Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)

                                 4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
Minor Faults                  37429143    47564000
Major Faults                      1916        1558
Swap Ins                          1466        1079
Swap Outs                      2936863      149626
Allocation stalls                62510           3
DMA allocs                           0           0
DMA32 allocs                   6566458     6401314
Normal allocs                216361697   216538171
Movable allocs                       0           0
Direct pages scanned          25977580       17998
Kswapd pages scanned                 0     3638931
Kswapd pages reclaimed               0      207236
Direct pages reclaimed         8833714          88
Compaction stalls               103349           5
Compaction success                 270           4
Compaction failures             103079           1

Note again that while this does swap as it's an aggressive workload, the
direct reclaim activity and allocation stalls are substantially reduced.
There is some kswapd activity but ftrace showed that the kswapd activity
was due to normal wakeups from 4K pages being allocated.
Compaction-related stalls and activity are almost eliminated.

I also tried the stutter benchmark.  For this, I do not have figures for
NUMA but it's something that does impact UMA so I'll report what is
available

stutter
                                 4.4.0                 4.4.0
                        kcompactd-v1r1         nodefrag-v1r3
Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)

This benchmark is trying to fault an anonymous mapping while there is a
heavy IO load -- a scenario that desktop users used to complain about
frequently.  This shows a mix because the ideal case of mapping with THP
is not hit as often.  However, note that 99% of the mappings complete
13.79% faster.  The CPU usage here is particularly interesting

               4.4.0       4.4.0
        kcompactd-v1r1 nodefrag-v1r3
User           67.50        0.99
System       1327.88       91.30
Elapsed      2079.00     2128.98

And once again we look at the reclaim figures

                                 4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
Minor Faults                 335241922  1314582827
Major Faults                       715         819
Swap Ins                             0           0
Swap Outs                            0           0
Allocation stalls               532723           0
DMA allocs                           0           0
DMA32 allocs                1822364341  1177950222
Normal allocs               1815640808  1517844854
Movable allocs                       0           0
Direct pages scanned          21892772           0
Kswapd pages scanned          20015890    41879484
Kswapd pages reclaimed        19961986    41822072
Direct pages reclaimed        21892741           0
Compaction stalls              1065755           0
Compaction success                 514           0
Compaction failures            1065241           0

Allocation stalls and all direct reclaim activity are eliminated, as are
compaction-related stalls.

THP gives impressive gains in some cases but only if they are quickly
available.  We're not going to reach the point where they are completely
free so let's take the costs out of the fast paths finally and defer the
cost to kswapd, kcompactd and khugepaged where it belongs.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

444eb2a4

mm: scale kswapd watermarks in proportion to memory · 795ae7a0

Committed by Johannes Weiner on Mar 17, 2016

In machines with 140G of memory and enterprise flash storage, we have
seen read and write bursts routinely exceed the kswapd watermarks and
cause thundering herds in direct reclaim. Unfortunately, the only way
to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
system's emergency reserves - which is entirely unrelated to the
system's latency requirements. In order to get kswapd to maintain a
250M buffer of free memory, the emergency reserves need to be set to 1G.
That is a lot of memory wasted for no good reason.

On the other hand, it's reasonable to assume that allocation bursts and
overall allocation concurrency scale with memory capacity, so it makes
sense to make kswapd aggressiveness a function of that as well.

Change the kswapd watermark scale factor from the currently fixed 25% of
the tunable emergency reserve to a tunable 0.1% of memory.

Beyond 1G of memory, this will produce bigger watermark steps than the
current formula in default settings. Ensure that the new formula never
chooses steps smaller than that, i.e. 25% of the emergency reserve.

On a 140G machine, this raises the default watermark steps - the
distance between min and low, and low and high - from 16M to 143M.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

795ae7a0
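
The arithmetic in the commit above (795ae7a0) can be checked with a small
sketch. It is a simplification under stated assumptions: the kernel works
per zone and in pages, while this version uses whole-machine byte counts;
the function name is hypothetical; the scale factor is assumed to be
expressed in units of 0.01% of memory, so that 10 means the 0.1% default;
and the 64M reserve is not given in the commit but is what a 16M step
implies under the old 25% rule.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def watermark_step(managed_bytes: int, min_free_bytes: int, scale_factor: int = 10) -> int:
    """Distance between the min/low and low/high watermarks, in bytes.

    scale_factor is in units of 0.01% of memory, so 10 means 0.1%.
    """
    old_step = min_free_bytes // 4                         # 25% of the emergency reserve
    scaled_step = managed_bytes * scale_factor // 10000    # fraction of total memory
    # Never choose a step smaller than the old formula would have produced.
    return max(old_step, scaled_step)


memory = 140 * GIB
reserve = 64 * MIB  # assumption: inferred from the 16M old step, not stated in the commit

old_step_mib = reserve // 4 // MIB
new_step_mib = watermark_step(memory, reserve) // MIB
print(old_step_mib, new_step_mib)
```

With these inputs the step grows from 16M to 143M, matching the figures in
the commit message. On a small machine the 0.1% term falls below 25% of the
reserve and the max() keeps the old step.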

thp, vmstats: count deferred split events · f9719a03

Committed by Kirill A. Shutemov on Mar 17, 2016

Count how many times we put a THP in split queue.  Currently, it happens
on partial unmap of a THP.

A rapidly growing value can indicate that an application behaves in a
THP-unfriendly way: it often faults in a huge page and then unmaps part
of it.  This leads to unnecessary memory fragmentation and the
application may require tuning.

The event also can help with debugging kernel [mis-]behaviour.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f9719a03
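
One way to watch for the "rapidly growing value" described in the commit
above (f9719a03) is to sample the counter twice and compute a rate. The
sketch below assumes the event is exposed in /proc/vmstat as
thp_deferred_split_page, which is the name used in mainline kernels but is
not given in the commit message. The sample text and the helper names are
hypothetical; on a real system the text would be read from /proc/vmstat.

```python
def read_counter(vmstat_text: str, name: str) -> int:
    """Return the value of one counter from /proc/vmstat-style text."""
    for line in vmstat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == name:
            return int(value)
    raise KeyError(name)


def events_per_second(before: str, after: str, interval_s: float,
                      name: str = "thp_deferred_split_page") -> float:
    """Rate of growth of a counter between two samples."""
    return (read_counter(after, name) - read_counter(before, name)) / interval_s


# Hypothetical samples taken 10 seconds apart.
sample_t0 = "thp_fault_alloc 1200\nthp_deferred_split_page 40\n"
sample_t1 = "thp_fault_alloc 1900\nthp_deferred_split_page 640\n"

rate = events_per_second(sample_t0, sample_t1, 10.0)
print(rate)
```

What counts as "rapid" depends on the workload; comparing the rate with the
growth of thp_fault_alloc over the same interval gives some context.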

mm: memcontrol: report kernel stack usage in cgroup2 memory.stat · 12580e4b

Committed by Vladimir Davydov on Mar 17, 2016

Show how much memory is allocated to kernel stacks.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

12580e4b

mm: memcontrol: report slab usage in cgroup2 memory.stat · 27ee57c9

Committed by Vladimir Davydov on Mar 17, 2016

Show how much memory is used for storing reclaimable and unreclaimable
in-kernel data structures allocated from slab caches.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

27ee57c9
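
The two memory.stat commits above (12580e4b and 27ee57c9) add kernel stack
and slab figures to the cgroup2 statistics file. The sketch below shows how
such a file might be read from userspace. It assumes the keys are named
kernel_stack, slab, slab_reclaimable and slab_unreclaimable and that values
are in bytes, as in mainline cgroup2 documentation; the commit messages do
not spell the keys out. The sample values and the helper name are
hypothetical.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse 'key value' lines from a cgroup2 memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        stats[key] = int(value)
    return stats


# Hypothetical contents of <cgroup>/memory.stat; values are bytes.
sample = """anon 104857600
file 52428800
kernel_stack 1048576
slab 8388608
slab_reclaimable 6291456
slab_unreclaimable 2097152
"""

stats = parse_memory_stat(sample)
kernel_overhead = stats["kernel_stack"] + stats["slab"]
print(kernel_overhead)
```

Splitting slab into reclaimable and unreclaimable parts matters when
judging memory pressure, since only the former can be shrunk on demand.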

sh: add device tree support and generic board using device tree · 7480e0aa

Committed by Rich Felker on Jan 23, 2016

Add a new pseudo-board, within the existing SH boards/machine-vectors
framework, which does not represent any actual hardware but instead
requires all hardware to be described by the device tree blob provided
by the boot loader. Changes made are thus non-invasive and do not risk
breaking support for legacy boards.

New hardware, including the open-hardware J2 and associated SoC
devices, will use device tree from the outset. Legacy SH boards can
transition to device tree once all their hardware has device tree
bindings, driver support for device tree, and a dts file for the
board.

It is intended that, once all boards are supported in the new
framework, the existing machine-vectors framework should be removed
and the new device tree setup code integrated directly.

Signed-off-by: Rich Felker <dalias@libc.org>

7480e0aa

17 Mar, 2016 5 commits

Documentation: bindings: add description of phy for sdhci-of-arasan · 18e8d812

Committed by Shawn Lin on Mar 07, 2016

This patch adds phys and phy-names for sdhci-of-arasan as required
properties for arasan,sdhci-5.1, and details the example as well.

Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

18e8d812

net: arc_emac: add phy reset is optional for device tree · 8700eee6

Committed by Caesar Wang on Mar 14, 2016

This patch adds the following properties for arc_emac.

1) phy-reset-gpios:
phy-reset-gpios is an optional property for arc emac device tree boot.
Change the binding document to match the driver code.

2) phy-reset-duration:
Different boards may require different phy reset durations. Add the
phy-reset-duration property for device tree probe, so that boards that
need a longer reset duration can specify it in their device tree.

Both properties can therefore be added for arc emac.

Signed-off-by: Caesar Wang <wxt@rock-chips.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: devicetree@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Cc: Alexander Kochetkov <al.kochet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8700eee6

net: arc_emac: make the rockchip emac document more compatible · 434242cd

Committed by Caesar Wang on Mar 14, 2016

Add the rk3036 SoC to the binding document so that it matches the driver,
since the emac driver already supports the rk3036.

This patch adds the rk3036/rk3066/rk3188 SoCs to the compatible list in
the rockchip emac document. The same scheme will also suit other SoCs in
the future.

Signed-off-by: Caesar Wang <wxt@rock-chips.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: devicetree@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Kochetkov <al.kochet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

434242cd

watchdog: Add support for minimum time between heartbeats · 15013ad8

Committed by Guenter Roeck on Feb 28, 2016

Some watchdogs require a minimum time between heartbeats.
Examples are the watchdogs in DA9062 and AT91SAM9x.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>

15013ad8
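
The constraint in the commit above (15013ad8) amounts to delaying a
heartbeat that arrives too soon after the previous one. The sketch below is
only an illustration of that idea, not the kernel implementation. The
function and its parameters are hypothetical, although min_hw_heartbeat_ms
is, to the best of my knowledge, the name of the corresponding field in the
kernel's watchdog device structure.

```python
def ping_delay_ms(now_ms: int, last_hw_ping_ms: int, min_hw_heartbeat_ms: int) -> int:
    """How long to wait before the hardware may be pinged again, in ms."""
    earliest_allowed = last_hw_ping_ms + min_hw_heartbeat_ms
    # A ping that is already late enough needs no delay at all.
    return max(0, earliest_allowed - now_ms)


# Last ping at t=900 ms, hardware needs at least 250 ms between heartbeats.
too_early = ping_delay_ms(now_ms=1000, last_hw_ping_ms=900, min_hw_heartbeat_ms=250)
late_enough = ping_delay_ms(now_ms=2000, last_hw_ping_ms=900, min_hw_heartbeat_ms=250)
print(too_early, late_enough)
```

A heartbeat requested 100 ms after the previous one is held back for the
remaining 150 ms, while one requested much later goes out immediately.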

watchdog: Make stop function optional · d0684c8a

Committed by Guenter Roeck on Feb 28, 2016

Not all hardware watchdogs can be stopped. The driver for
such watchdogs would typically only set the WATCHDOG_HW_RUNNING
flag in its stop function. Make the stop function optional and set
WATCHDOG_HW_RUNNING in the watchdog core if it is not provided.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>

d0684c8a
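
The fallback described in the commit above (d0684c8a) can be modelled in a
few lines. This is a toy model and not the kernel's C code: the Watchdog
class and core_stop function are hypothetical, and only the
WATCHDOG_HW_RUNNING flag name comes from the commit message.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass
class Watchdog:
    """Toy stand-in for a watchdog driver with an optional stop callback."""
    stop: Optional[Callable[[], None]] = None
    status: Set[str] = field(default_factory=set)


def core_stop(wdd: Watchdog) -> None:
    """Stop the watchdog if the driver can; otherwise record that it still runs."""
    if wdd.stop is not None:
        wdd.stop()
    else:
        # The hardware cannot be stopped, so the core sets the flag itself.
        wdd.status.add("WATCHDOG_HW_RUNNING")


unstoppable = Watchdog()
core_stop(unstoppable)

calls = []
stoppable = Watchdog(stop=lambda: calls.append("stopped"))
core_stop(stoppable)
print(unstoppable.status, calls)
```

Moving the fallback into the core means a driver for unstoppable hardware
no longer needs a stop function that only sets the flag.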
