- 01 December 2017, 1 commit
-
-
Submitted by Peter Rosin
With an nxp,se97 chip on an atmel sama5d31 board, the I2C adapter driver is not always capable of avoiding the 25-35 ms timeout specified by the SMBUS protocol. This may cause silent corruption of the last bit of any transfer, e.g. a one is read instead of a zero if the sensor chip times out. This also affects the eeprom half of the nxp-se97 chip, where this silent corruption was originally noticed. Other I2C adapters probably suffer similar issues; bit-banging, for example, comes to mind as risky. The SMBUS register in the nxp chip is not a standard JEDEC register, but it is not special to the nxp chips either; at least the atmel chips have the same mechanism. Therefore, do not special-case this on the manufacturer; it is opt-in via the device property anyway. Cc: stable@vger.kernel.org # 4.9+ Signed-off-by: Peter Rosin <peda@axentia.se> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
-
- 30 November 2017, 1 commit
-
-
Submitted by Michal Hocko
This reverts commit 0f6d24f8 ("mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical") because it causes false positive warnings during OOM situations, as noticed by Tetsuo Handa:
Node 0 active_anon:3525940kB inactive_anon:8372kB active_file:216kB inactive_file:1872kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2504kB dirty:52kB writeback:0kB shmem:8660kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 636928kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
Node 0 DMA free:14848kB min:284kB low:352kB high:420kB active_anon:992kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 2687 3645 3645
Node 0 DMA32 free:53004kB min:49608kB low:62008kB high:74408kB active_anon:2712648kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2773132kB mlocked:0kB kernel_stack:96kB pagetables:5096kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 958 958
Node 0 Normal free:17140kB min:17684kB low:22104kB high:26524kB active_anon:812300kB inactive_anon:8372kB active_file:1228kB inactive_file:1868kB unevictable:0kB writepending:52kB present:1048576kB managed:981224kB mlocked:0kB kernel_stack:3520kB pagetables:8552kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
[...]
Out of memory: Kill process 8459 (a.out) score 999 or sacrifice child
Killed process 8459 (a.out) total-vm:4180kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 8459 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
vm direct limit must be set greater than background limit.
The problem is that both thresh and bg_thresh will be 0 if available_memory is less than 4 pages when evaluating global_dirtyable_memory. While this might be worked around, the whole point of the warning is dubious at best. We do rely on admins to do sensible things when changing tunable knobs. Dirty memory writeback knobs are not special in that regard, so revert the warning rather than adding more hacks to work this around. Debugged by Yafang Shao. Link: http://lkml.kernel.org/r/20171127091939.tahb77nznytcxw55@dhcp22.suse.cz Fixes: 0f6d24f8 ("mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 29 November 2017, 3 commits
-
-
Submitted by Tobin C. Harding
printk specifier %p now hashes all addresses before printing. Sometimes we need to see the actual unmodified address. This can be achieved using %lx, but then we face the risk that if in future we want to change the way the kernel handles printing of pointers, we will have to grep through the already existing 50 000 %lx call sites. Let's add specifier %px as a clear, opt-in way to print a pointer and maintain some level of isolation from all the other hex integer output within the kernel. Add printk specifier %px to print the actual unmodified address. Signed-off-by: Tobin C. Harding <me@tobin.cc>
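As a rough illustration (the function and buffer below are made up for this sketch, not taken from the patch), the two specifiers would behave like this in a driver log statement:

```c
#include <linux/printk.h>

static void debug_dump_pointer(void *buf)
{
	/* %p prints a hashed identifier; safe for general logging */
	pr_info("buffer id:   %p\n", buf);
	/* %px prints the raw, unmodified address; opt-in only, leaks kernel layout */
	pr_info("buffer addr: %px\n", buf);
}
```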
-
Submitted by Tobin C. Harding
Currently there exist approximately 14 000 places in the kernel where addresses are being printed using an unadorned %p. This potentially leaks sensitive information regarding the kernel layout in memory. Many of these calls are stale; instead of fixing every call, let's hash the address by default before printing. This will of course break some users, forcing code printing needed addresses to be updated. Code that _really_ needs the address will soon be able to use the new printk specifier %px to print the address. For what it's worth, usage of unadorned %p can be broken down as follows (thanks to Joe Perches):
$ git grep -E '%p[^A-Za-z0-9]' | cut -f1 -d"/" | sort | uniq -c
1084 arch
20 block
10 crypto
32 Documentation
8121 drivers
1221 fs
143 include
101 kernel
69 lib
100 mm
1510 net
40 samples
7 scripts
11 security
166 sound
152 tools
2 virt
Add function ptr_to_id() to map an address to a 32 bit unique identifier. Hash any unadorned usage of specifier %p and any malformed specifiers. Signed-off-by: Tobin C. Harding <me@tobin.cc>
-
Submitted by Tobin C. Harding
Current documentation indicates that %pK prints a leading '0x'. This is not the case. Correct the documentation for the printk specifier %pK. Signed-off-by: Tobin C. Harding <me@tobin.cc>
-
- 22 November 2017, 1 commit
-
-
Submitted by Kees Cook
All users of init_timer() have been updated. Remove the ancient interface. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Kees Cook <keescook@chromium.org>
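For context, here is a minimal sketch of the pattern that replaced init_timer() in current kernels; the structure, callback name, and timeout value are illustrative assumptions, not part of this commit:

```c
#include <linux/timer.h>
#include <linux/jiffies.h>

struct my_dev {
	struct timer_list watchdog;	/* hypothetical per-device timer */
};

static void my_watchdog_fn(struct timer_list *t)
{
	struct my_dev *dev = from_timer(dev, t, watchdog);
	/* handle the timeout for dev here ... */
}

static void my_dev_start_watchdog(struct my_dev *dev)
{
	/* timer_setup() is the modern replacement for the removed init_timer() */
	timer_setup(&dev->watchdog, my_watchdog_fn, 0);
	mod_timer(&dev->watchdog, jiffies + HZ);	/* fire in ~1 second */
}
```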
-
- 21 November 2017, 8 commits
-
-
Submitted by Dave Hansen
Now that CPUs that implement Memory Protection Keys are publicly available, we can be a bit less oblique about where it is available. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20171111001228.DC748A10@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Fabio Estevam
Improve the bindings text by doing the following changes:
- Remove the i.MX53 reference, as the RTC on i.MX53 is different hardware
- Add 'clocks' to the list of required properties
- Explain that the optional security violation irq is the second entry
- Use the real unit address and irq numbers for i.MX25
Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com> Acked-by: Juergen Borleis <jbe@pengutronix.de> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
-
Submitted by Jonathan Neuschäfer
This device's bindings are not trivial: additional properties are documented in Documentation/devicetree/bindings/mfd/mc13xxx.txt. Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Reviewed-by: Fabio Estevam <fabio.estevam@nxp.com> Signed-off-by: Rob Herring <robh@kernel.org>
-
Submitted by Randy Dunlap
Correct the formatting of several additions to the profile= option by using <profiletype> and listing the choices for it. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
-
Submitted by Randy Dunlap
Drop CONFIG_VIDEO_400_HACK info completely. Drop CONFIG_VIDEO_RETAIN and CONFIG_VIDEO_LOCAL completely. Drop CONFIG_VIDEO_COMPACT and CONFIG_VIDEO_VESA info completely. Drop CONFIG_VIDEO_SVGA info since it has been removed. Drop chapter number & section number references since they are wrong. Drop the (bad) ftp URL for the 800x600 Thinkpad XF86Config. Rename CONFIG_VIDEO_GFX_HACK to VIDEO_GFX_HACK since it is not a Kconfig symbol and to match the source code. Build options are controlled by the kernel kconfig utility. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-By: Martin Mares <mj@ucw.cz> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
-
Submitted by SeongJae Park
This commit applies an upstream change, commit d92f842b ("memory-barriers.txt: Fix typo in pairing example"), to the Korean translation. Signed-off-by: SeongJae Park <sj38.park@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
-
Submitted by SeongJae Park
This commit applies two upstream changes, commit f1ab25a3 ("memory-barriers: Replace uses of "transitive"") and commit 0902b1f4 ("memory-barriers: Rework multicopy-atomicity section"), to the Korean translation. The two changes are applied in this single commit because the second change is an improvement of the first one. Signed-off-by: SeongJae Park <sj38.park@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
-
Submitted by Greg Kroah-Hartman
Sometimes a single patch is the result of multiple authors. As git can only have one "author" of a patch, it is still good to properly give credit to the other developers of a commit. To address this, document the "Co-Developed-by:" tag, which can be used to show other authors of the patch. Note, these other authors must also provide a Signed-off-by: tag, as it is their work that is being submitted here. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
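For illustration, a patch with two authors might carry a trailer block like the one below (names and addresses are hypothetical); note that the co-author's own Signed-off-by accompanies the new tag:

```
Co-Developed-by: Alice Example <alice@example.org>
Signed-off-by: Alice Example <alice@example.org>
Signed-off-by: Bob Submitter <bob@example.org>
```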
-
- 19 November 2017, 1 commit
-
-
Submitted by Logan Gunthorpe
The switchtec_ntb driver has a couple of requirements on the switchtec's hardware configuration, so add these notes to the documentation. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Stephen Bates <sbates@raithlin.com> Reviewed-by: Kurt Schwemmer <kurt.schwemmer@microsemi.com> Acked-by: Allen Hubbe <Allen.Hubbe@dell.com> Signed-off-by: Jon Mason <jdmason@kudzu.us>
-
- 18 November 2017, 6 commits
-
-
Submitted by Bjørn Forsman
Most places use pwd and rely on $PATH lookup. Move the remaining absolute-path /bin/pwd users over for consistency. Also, a reason for doing /bin/pwd -> pwd instead of the other way around is that I believe build systems should make as few assumptions as possible about host filesystem layout. Case in point, we already do this kind of patching in NixOS. Ref. commit 028568d8 ("kbuild: revert $(realpath ...) to $(shell cd ... && /bin/pwd)"). Signed-off-by: Bjørn Forsman <bjorn.forsman@gmail.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
-
Submitted by Victor Chibotaru
The updated documentation describes the new KCOV mode for collecting comparison operands. Link: http://lkml.kernel.org/r/20171011095459.70721-3-glider@google.com Signed-off-by: Victor Chibotaru <tchibo@google.com> Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Popov <alex.popov@linux.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Kees Cook <keescook@chromium.org> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: <syzkaller@googlegroups.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
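For orientation, here is a hedged userspace sketch of enabling the comparison-collection mode through the KCOV debugfs interface. The buffer size and error handling are simplified, and it assumes the uapi header <linux/kcov.h> (which defines KCOV_TRACE_CMP) is available; see Documentation/dev-tools/kcov.rst for the authoritative example:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kcov.h>

#define COVER_SIZE (64 << 10)	/* buffer entries; illustrative, not mandated */

int main(void)
{
	int fd = open("/sys/kernel/debug/kcov", O_RDWR);
	if (fd < 0) { perror("open kcov"); return 1; }

	/* Size the in-kernel coverage buffer (in unsigned-long entries). */
	if (ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE)) { perror("KCOV_INIT_TRACE"); return 1; }

	unsigned long *cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
				    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (cover == MAP_FAILED) { perror("mmap"); return 1; }

	/* KCOV_TRACE_CMP collects comparison operands instead of program counters. */
	if (ioctl(fd, KCOV_ENABLE, KCOV_TRACE_CMP)) { perror("KCOV_ENABLE"); return 1; }
	cover[0] = 0;			/* reset the collected-record counter */

	/* ... issue the syscalls to be traced here ... */

	ioctl(fd, KCOV_DISABLE, 0);
	close(fd);
	return 0;
}
```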
-
Submitted by Kangmin Park
Link: http://lkml.kernel.org/r/CAKW4uUyCi=PnKf3epgFVz8z=1tMtHSOHNm+fdNxrNw3-THvRCA@mail.gmail.com Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com> Cc: Jiri Kosina <trivial@kernel.org> Cc: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Randy Dunlap
Fix a minor typo. Fix missing words in the explanation of parsing the last line number. Link: http://lkml.kernel.org/r/ebb7ff42-4945-103f-d5b4-f07a6f3343a7@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Andi Kleen
I like _ONCE warnings because it's guaranteed that they don't flood the log. During testing I find it useful to reset the state of the once warnings, so that I can rerun tests and see if they trigger again, or can guarantee that a test run always hits the same warnings. This patch adds a debugfs interface to reset all the _ONCE warnings so that they appear again:
echo 1 > /sys/kernel/debug/clear_warn_once
This is implemented by putting all the warning booleans into a special section, and clearing it. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20171017221455.6740-1-andi@firstfloor.org Signed-off-by: Andi Kleen <ak@linux.intel.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Roman Gushchin
Right now there is no convenient way to check if a process is being coredumped at the moment. It might be necessary to recognize such a state to prevent killing the process and getting a broken coredump. Writing a large core might take significant time, and the process is unresponsive during it, so it might be killed by timeout if another process is monitoring and killing/restarting hanging tasks. We're getting a significant number of corrupted coredump files on machines in our fleet, just because processes are being killed by timeout in the middle of the core writing process. We do have a process health check, and some agent is responsible for restarting processes which are not responding to health check requests. Writing a large coredump to the disk can easily exceed the reasonable timeout (especially on an overloaded machine). This flag will allow the agent to distinguish processes which are being coredumped, extend the timeout for them, and let them produce a full coredump file. To provide an ability to detect if a process is in the state of being coredumped, we can expose a boolean CoreDumping flag in /proc/pid/status. Example:
$ cat core.sh
#!/bin/sh
echo "|/usr/bin/sleep 10" > /proc/sys/kernel/core_pattern
sleep 1000 &
PID=$!
cat /proc/$PID/status | grep CoreDumping
kill -ABRT $PID
sleep 1
cat /proc/$PID/status | grep CoreDumping
$ ./core.sh
CoreDumping: 0
CoreDumping: 1
[guro@fb.com: document CoreDumping flag in /proc/<pid>/status] Link: http://lkml.kernel.org/r/20170928135357.GA8470@castle.DHCP.thefacebook.com Link: http://lkml.kernel.org/r/20170920230634.31572-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 17 November 2017, 5 commits
-
-
Submitted by Rafael J. Wysocki
The check for "active" children in __pm_runtime_set_status(), when trying to set the parent device status to "suspended", doesn't really make sense, because in fact it is not invalid to set the status of a device with runtime PM disabled to "suspended" in any case. It is invalid to enable runtime PM for a device with its status set to "suspended" while its child_count reference counter is nonzero, but the check in __pm_runtime_set_status() doesn't really cover that situation. For this reason, drop the children check from __pm_runtime_set_status() and add a check against child_count reference counters of "suspended" devices to pm_runtime_enable(). Fixes: a8636c89 (PM / Runtime: Don't allow to suspend a device with an active child) Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: NUlf Hansson <ulf.hansson@linaro.org> Reviewed-by: NJohan Hovold <johan@kernel.org>
-
Submitted by Johan Hovold
Hub nodes and host-controller nodes with child nodes must specify values for #address-cells (1) and #size-cells (0). Also make the definition of the related reg property a bit more stringent, and add comments to the example source. Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Rob Herring <robh@kernel.org>
-
Submitted by Johan Hovold
Add quotation marks around the compatible string to avoid ambiguity due to following punctuation, and define the VID and PID components. Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Rob Herring <robh@kernel.org>
-
Submitted by Johan Hovold
The USB hub port-number range for USB 2.0 is 1-255, not 1-31, which reflects an arbitrary limit set by the current Linux implementation. Note that for USB 3.1 hubs the valid range is 1-15. Increase the documented valid range in the binding to 255, which is the maximum allowed by the specifications. Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Rob Herring <robh@kernel.org>
-
Submitted by Johan Hovold
According to the OF Recommended Practice for USB, hub nodes shall be named "hub", but the example had mixed up the label and node names. Fix the node name and drop the redundant label. While at it, remove a newline and add a missing semicolon to the example source. Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Rob Herring <robh@kernel.org>
-
- 16 November 2017, 10 commits
-
-
Submitted by Claudio Scordino
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Acked-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/1510658366-28995-1-git-send-email-claudio@evidence.eu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Kemi Wang
This is the second step, which introduces a tunable interface that allows NUMA stats to be made configurable, for optimizing zone_statistics(), as suggested by Dave Hansen and Ying Huang.
When page allocation performance becomes a bottleneck and you can tolerate some possible tool breakage and decreased NUMA counter precision, you can do:
echo 0 > /proc/sys/vm/numa_stat
In this case, NUMA counter updates are ignored. We see about a *4.8%* (185->176) drop of cpu cycles per single page allocation and reclaim on Jesper's page_bench01 (single thread) and an *8.1%* (343->315) drop of cpu cycles per single page allocation and reclaim on Jesper's page_bench03 (88 threads) running on a 2-socket Broadwell-based server (88 threads, 126G memory). Benchmark link provided by Jesper D Brouer (increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
When page allocation performance is not a bottleneck and you want all tooling to work, you can do:
echo 1 > /proc/sys/vm/numa_stat
This is the system default setting. Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka for comments to help improve the original patch. [keescook@chromium.org: make sure mutex is a global static] Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com Signed-off-by: Kemi Wang <kemi.wang@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Suggested-by: Ying Huang <ying.huang@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christopher Lameter <cl@linux.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Sasha Levin
Fix up makefiles, remove references, and git rm kmemcheck. Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vegard Nossum <vegardno@ifi.uio.no> Cc: Pekka Enberg <penberg@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexander Potapenko <glider@google.com> Cc: Tim Hansen <devtimhansen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Kirill A. Shutemov
Currently, we account page tables separately for each page table level, but that's redundant: we only make use of the total memory allocated to page tables for the oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to a single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes because the page table size for different levels of the page table tree may be different. The change has a user-visible effect: we no longer report VmPMD and VmPUD in /proc/[pid]/status. Not sure if anybody uses them. (As an alternative, we can always report 0 kB for them.) The OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing the number of counters per mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different sizes of page tables depending on level, or where page tables are less than a page in size. The only downside can be debuggability, because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters, so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Kirill A. Shutemov
On a machine with 5-level paging support, a process can allocate a significant amount of memory and stay unnoticed by the oom-killer and memory cgroup. The trick is to allocate a lot of PUD page tables. We don't account PUD page tables, only PMD and PTE. We already addressed the same issue for PMD page tables; see commit dc6c9a35 ("mm: account pmd page tables to the process"). The introduction of 5-level paging brings the same issue for PUD page tables. The patch expands accounting to the PUD level. [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/] Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting] Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Jérôme Glisse
This patch only affects users of the mmu_notifier->invalidate_range callback, which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM and the like, and it is an optimization for those users. Everyone else is unaffected by it. When clearing a pte/pmd we are given a choice to notify the event under the page table lock (the notify versions of the *_clear_flush helpers do call mmu_notifier_invalidate_range). But that notification is not necessary in all cases. This patch removes almost all cases where it is useless to have a call to mmu_notifier_invalidate_range before mmu_notifier_invalidate_range_end. It also adds documentation in all those cases explaining why. Below is a more in-depth analysis of why this is fine to do:
For a secondary TLB (non-CPU TLB) like an IOMMU TLB or a device TLB (when a device uses something like ATS/PASID to get the IOMMU to walk the CPU page table to access a process virtual address space), there are only 2 cases when you need to notify those secondary TLBs while holding the page table lock when clearing a pte/pmd:
A) the page backing the address is freed before mmu_notifier_invalidate_range_end
B) a page table entry is updated to point to a new page (COW, write fault on zero page, __replace_page(), ...)
Case A is obvious: you do not want to take the risk of the device writing to a page that might now be used by something completely different. Case B is more subtle. For correctness it requires the following sequence to happen:
- take the page table lock
- clear the page table entry and notify (pmd/pte_huge_clear_flush_notify())
- set the page table entry to point to the new page
If clearing the page table entry is not followed by a notify before setting the new pte/pmd value, then you can break the memory model, e.g. C11 or C++11, for the device. Consider the following scenario (the device uses a feature similar to ATS/PASID): two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE; we assume they are write protected for COW (other cases of B apply too).
[Time N] -----------------------------------------------------------------
CPU-thread-0 {try to write to addrA}
CPU-thread-1 {try to write to addrB}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA and populate device TLB}
DEV-thread-2 {read addrB and populate device TLB}
[Time N+1] ---------------------------------------------------------------
CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+2] ---------------------------------------------------------------
CPU-thread-0 {COW_step1: {update page table point to new page for addrA}}
CPU-thread-1 {COW_step1: {update page table point to new page for addrB}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {write to addrA which is a write to new page}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {}
CPU-thread-3 {write to addrB which is a write to new page}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+4] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+5] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA from old page}
DEV-thread-2 {read addrB from new page}
So here, because at time N+2 the cleared page table entry was not paired with a notification to invalidate the secondary TLB, the device sees the new value for addrB before seeing the new value for addrA. This breaks total memory ordering for the device. When changing a pte to write-protect, or to point to a new write-protected page with the same content (KSM), it is ok to delay the invalidate_range callback to mmu_notifier_invalidate_range_end() outside the page table lock. This is true even if the thread doing the page table update is preempted right after releasing the page table lock, before calling mmu_notifier_invalidate_range_end. Thanks to Andrea for thinking of a problematic scenario for COW. [jglisse@redhat.com: v2] Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Alistair Popple <alistair@popple.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Yafang Shao
The vm direct limit setting must be set greater than the vm background limit setting. Otherwise print a warning to help the operator figure out that the vm dirtiness settings are in an illogical state. Link: http://lkml.kernel.org/r/1506592464-30962-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Julia Lawall
The wiki is no longer available. Signed-off-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
-
Submitted by Eric Biggers
When keyctl_read() is passed a buffer that is too small, the behavior is inconsistent. Some key types will fill as much of the buffer as possible, while others won't copy anything. Moreover, the in-kernel documentation contradicted the man page on this point. Update the in-kernel documentation to say that this point is unspecified. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
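Given that the short-buffer behavior is unspecified, a hedged userspace sketch of the robust pattern is to query the payload size first and then read with a buffer known to be large enough (uses libkeyutils, link with -lkeyutils; error handling is simplified, and the payload can in principle change between the two calls):

```c
#include <keyutils.h>
#include <stdlib.h>

/* Read a key payload safely: size the buffer first, then read. */
static char *read_key_payload(key_serial_t key, long *len_out)
{
	/* With a NULL buffer, keyctl_read() returns the payload size. */
	long len = keyctl_read(key, NULL, 0);
	if (len < 0)
		return NULL;

	char *buf = malloc(len);
	if (!buf)
		return NULL;

	/* Second call with a sufficiently large buffer copies the data. */
	if (keyctl_read(key, buf, len) < 0) {
		free(buf);
		return NULL;
	}
	*len_out = len;
	return buf;
}
```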
-
Submitted by Baolin Wang
This patch adds the binding documentation for the Spreadtrum SC27xx series RTC device. Signed-off-by: Baolin Wang <baolin.wang@spreadtrum.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
-
- 15 November 2017, 4 commits
-
-
Submitted by Yoshihiro Shimoda
Add device tree bindings for the PWM controller found on R-Car D3 SoCs. Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Thierry Reding <thierry.reding@gmail.com>
-
Submitted by Luca Miccio
BFQ currently creates, and updates, its own instance of the whole set of blkio statistics that cfq creates. Yet, from the comments of Tejun Heo in [1], it turned out that most of these statistics are meant/useful only for debugging. This commit makes BFQ create the latter debugging statistics only if the option CONFIG_DEBUG_BLK_CGROUP is set. By doing so, this commit also enables BFQ to enjoy a high performance boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then BFQ has to update far fewer statistics, and, in particular, not the heaviest to update. To give an idea of the benefits: if CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on null_blk (configured with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS (+30%). We have measured similar or even much higher boosts with other CPUs: e.g., +45% with an ARM Cortex-A53 octa-core. Our results have been obtained, and can be reproduced very easily, with the script in [1]. [1] https://www.spinics.net/lists/linux-block/msg18943.html Suggested-by: Tejun Heo <tj@kernel.org> Suggested-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Paolo Valente
bfq invokes various blkg_*stats_* functions to update the statistics contained in the special files blkio.bfq.* in the blkio controller groups, i.e., the I/O accounting related to the proportional-share policy provided by bfq. The execution of these functions takes a considerable percentage, about 40%, of the total per-request execution time of bfq (i.e., of the sum of the execution time of all the bfq functions that have to be executed to process an I/O request from its creation to its destruction). This noticeably reduces the request-processing rate sustainable by bfq, even on a multicore CPU. In fact, the bfq functions that invoke blkg_*stats_* functions cannot be executed in parallel with the rest of the code of bfq, because both are executed under the same per-device scheduler lock. To reduce this slowdown, this commit moves, wherever possible, the invocation of these functions (more precisely, of the bfq functions that invoke blkg_*stats_* functions) outside the critical sections protected by the scheduler lock. With this change, and with all blkio.bfq.* statistics enabled, the throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel i7-4850HQ, in the case of 8 threads doing random I/O in parallel on null_blk, with the latter configured with 0 latency. We obtained the same or higher throughput boosts, up to +30%, with other processors (some figures are reported in the documentation). For our tests, we used the script [1], with which our results can be easily reproduced. NOTE. This commit still protects the invocation of blkg_*stats_* functions with the request_queue lock, because the group these functions are invoked on may otherwise disappear before or while these functions are executed. Fortunately, tests without even this lock show, by difference, that the serialization caused by this lock has little impact (at most ~5% throughput reduction). [1] https://github.com/Algodev-github/IOSpeed Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Paolo Valente
We have investigated more deeply the performance of BFQ, in terms of the number of IOPS that can be processed by the CPU when BFQ is used as I/O scheduler. In more detail, using the script [1], we have measured the number of IOPS reached on top of a null block device configured with zero latency, as a function of the workload (sequential read, sequential write, random read, random write) and of the system (we considered desktops, laptops and embedded systems). Based on the resulting figures, with this commit we update the current, conservative IOPS range reported in the BFQ documentation. In particular, the documentation now reports, for each of three different systems, the lowest number of IOPS obtained for that system with the above test (namely, the value obtained with the workload leading to the lowest IOPS). [1] https://github.com/Algodev-github/IOSpeed Reviewed-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-