1. 17 2月, 2015 6 次提交
  2. 14 2月, 2015 9 次提交
    • G
      rtc: armada38x: add the device tree binding documentation · bb624047
      Gregory CLEMENT 提交于
      The Marvell Armada 38x SoCs contains an RTC which differs from the RTC
      used in the other mvebu SoCs until now.  This forth version of the patch
      set adds support for this new IP and enable it in the Device Tree of the
      Armada 38x SoC.
      
      This patch (of 5):
      
      The Armada 38x SoCs come with a new RTC which differs from the one used in
      the other mvebu SoCs until now.  This patch describes the binding of this
      RTC.
      Signed-off-by: NGregory CLEMENT <gregory.clement@free-electrons.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Cc: Arnaud Ebalard <arno@natisbad.org>
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
      Cc: Boris BREZILLON <boris.brezillon@free-electrons.com>
      Cc: Lior Amsalem <alior@marvell.com>
      Cc: Tawfik Bayouk <tawfik@marvell.com>
      Cc: Nadav Haklai <nadavh@marvell.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb624047
    • A
      rtc: add support for Abracon AB-RTCMC-32.768kHz-B5ZE-S3 I2C RTC chip · 0b2f6228
      Arnaud Ebalard 提交于
      This patch adds support for Abracon AB-RTCMC-32.768kHz-B5ZE-S3
      RTC/Calendar module w/ I2C interface.
      
      This support includes RTC time reading and setting, Alarm (1 minute
      accuracy) reading and setting, and battery low detection.  The device also
      supports frequency adjustment and two timers but those features are
      currently not implemented in this driver.  Due to alarm accuracy
      limitation (and current lack of timer support in the driver), UIE mode is
      not supported.
      Signed-off-by: NArnaud Ebalard <arno@natisbad.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Peter Huewe <peter.huewe@infineon.com>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rob Herring <robherring2@gmail.com>
      Cc: Pawel Moll <pawel.moll@arm.com>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Ian Campbell <ijc+devicetree@hellion.org.uk>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Cc: Kumar Gala <galak@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b2f6228
    • A
      of: add vendor prefix for Abracon Corporation · 446810f2
      Arnaud Ebalard 提交于
      This series adds support for Abracon AB-RTCMC-32.768kHz-B5ZE-S3 I2C RTC
      chip. Unlike many RTC chips, it includes an internal oscillator which
      spares room on the PCB. It also has some interesting features, like
      battery low detection (which the driver in this series supports). The
      only small "limitation" (mainly due to what RTC subsystem expects from
      RTC chips) is the fact that its alarm is accurate to the second. This
      series provides a solution (described below) for that limitation using
      another mechanism of the chip.
      
      I decided to split support between three different patches for
      this v0:
      
      - Patch 1/3: it simply references Abracon Corporation in vendor-prefixes
        documentation file. As Abracon has no NASDAQ ticker symbol; I have
        decided to use "abcn" (I initially started my work w/ "ab" but later
        changed for "abcn" which looked more meaningful)
      - Patch 2/3: it adds initial support for the chip and provides the
        ability to read/write time and also read/write alarm. As the alarm
        the chip provides is accurate to the minute, the support provided
        by this patch also has this limitation (e.g. UIE mode is not
        supported).
      - Patch 3/3: the chip supports a watchdog timer which can be used to
        extend the alarm mechanism in patch 2/3 in order to provide support
        for alarms under one minute (e.g. support UIE mode). In practice,
        the logic I implemented is to use the watchdog timer for alarms which
        are at most 4 minutes in the future and use the common alarm mechanism
        for alarms which are set to larger values. With that additional patch
        the device fully passes the rtctest.c program.
      
      I decided to split the driver between two patches (2 and 3 of 3) in
      order to ease review: patch 2 should be pretty straightforward to read
      for someone familiar w/ RTC subsystem. Patch 3 only extends what is in
      patch 2 regarding alarms.
      
      This patch (of 3):
      
      Documentation/devicetree/bindings/vendor-prefixes.txt: add vendor prefix
      for Abracon Corporation
      Signed-off-by: NArnaud Ebalard <arno@natisbad.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Peter Huewe <peter.huewe@infineon.com>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rob Herring <robherring2@gmail.com>
      Cc: Pawel Moll <pawel.moll@arm.com>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Ian Campbell <ijc+devicetree@hellion.org.uk>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Cc: Kumar Gala <galak@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      446810f2
    • A
      rtc: rtc-isl12057: add isil,irq2-can-wakeup-machine property for in-tree users · 298ff012
      Arnaud Ebalard 提交于
      Current in-tree users of ISL12057 RTC chip (NETGEAR ReadyNAS 102, 104 and
      2120) do not have the IRQ#2 pin of the chip (associated w/ the Alarm1
      mechanism) connected to their SoC, but to a PMIC (TPS65251 FWIW).  This
      specific hardware configuration allows the NAS to wake up when the alarms
      rings.
      
      Recently introduced alarm support for ISL12057 relies on the provision of
      an "interrupts" property in system .dts file, which previous three users
      will never get.  For that reason, alarm support on those devices is not
      function.  To support this use case, this patch adds a new DT property for
      ISL12057 (isil,irq2-can-wakeup-machine) to indicate that the chip is
      capable of waking up the device using its IRQ#2 pin (even though it does
      not have its IRQ#2 pin connected directly to the SoC).
      
      This specific configuration was tested on a ReadyNAS 102 by setting an
      alarm, powering off the device and see it reboot as expected when the
      alarm rang w/:
      
        # echo `date '+%s' -d '+ 1 minutes'` > /sys/class/rtc/rtc0/wakealarm
        # shutdown -h now
      
      As a side note, the ISL12057 remains in the list of trivial devices,
      because the property is not per se required by the device to work but can
      help handle system w/ specific requirements.  In exchange, the new feature
      is described in details in a specific documentation file.
      Signed-off-by: NArnaud Ebalard <arno@natisbad.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Peter Huewe <peter.huewe@infineon.com>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Darshana Padmadas <darshanapadmadas@gmail.com>
      Cc: Rob Herring <rob.herring@calxeda.com>
      Cc: Pawel Moll <pawel.moll@arm.com>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Ian Campbell <ijc+devicetree@hellion.org.uk>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Cc: Kumar Gala <galak@codeaurora.org>
      Cc: Uwe Kleine-König <uwe@kleine-koenig.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      298ff012
    • J
      drivers/rtc/rtc-pcf2123.c: add support for devicetree · 3fc70077
      Joshua Clayton 提交于
      Add compatible string "nxp,rtc-pcf2123"
      Document the binding
      Signed-off-by: NJoshua Clayton <stillcompiling@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Grant Likely <grant.likely@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3fc70077
    • A
      kasan: enable instrumentation of global variables · bebf56a1
      Andrey Ryabinin 提交于
      This feature let us to detect accesses out of bounds of global variables.
      This will work as for globals in kernel image, so for globals in modules.
      Currently this won't work for symbols in user-specified sections (e.g.
      __init, __read_mostly, ...)
      
      The idea of this is simple.  Compiler increases each global variable by
      redzone size and add constructors invoking __asan_register_globals()
      function.  Information about global variable (address, size, size with
      redzone ...) passed to __asan_register_globals() so we could poison
      variable's redzone.
      
      This patch also forces module_alloc() to return 8*PAGE_SIZE aligned
      address making shadow memory handling (
      kasan_module_alloc()/kasan_module_free() ) more simple.  Such alignment
      guarantees that each shadow page backing modules address space correspond
      to only one module_alloc() allocation.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bebf56a1
    • A
      x86_64: add KASan support · ef7f0d6a
      Andrey Ryabinin 提交于
      This patch adds arch specific code for kernel address sanitizer.
      
      16TB of virtual addressed used for shadow memory.  It's located in range
      [ffffec0000000000 - fffffc0000000000] between vmemmap and %esp fixup
      stacks.
      
      At early stage we map whole shadow region with zero page.  Latter, after
      pages mapped to direct mapping address range we unmap zero pages from
      corresponding shadow (see kasan_map_shadow()) and allocate and map a real
      shadow memory reusing vmemmap_populate() function.
      
      Also replace __pa with __pa_nodebug before shadow initialized.  __pa with
      CONFIG_DEBUG_VIRTUAL=y make external function call (__phys_addr)
      __phys_addr is instrumented, so __asan_load could be called before shadow
      area initialized.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef7f0d6a
    • A
      kasan: add kernel address sanitizer infrastructure · 0b24becc
      Andrey Ryabinin 提交于
      Kernel Address sanitizer (KASan) is a dynamic memory error detector.  It
      provides fast and comprehensive solution for finding use-after-free and
      out-of-bounds bugs.
      
      KASAN uses compile-time instrumentation for checking every memory access,
      therefore GCC > v4.9.2 required.  v4.9.2 almost works, but has issues with
      putting symbol aliases into the wrong section, which breaks kasan
      instrumentation of globals.
      
      This patch only adds infrastructure for kernel address sanitizer.  It's
      not available for use yet.  The idea and some code was borrowed from [1].
      
      Basic idea:
      
      The main idea of KASAN is to use shadow memory to record whether each byte
      of memory is safe to access or not, and use compiler's instrumentation to
      check the shadow memory on each memory access.
      
      Address sanitizer uses 1/8 of the memory addressable in kernel for shadow
      memory and uses direct mapping with a scale and offset to translate a
      memory address to its corresponding shadow address.
      
      Here is function to translate address to corresponding shadow address:
      
           unsigned long kasan_mem_to_shadow(unsigned long addr)
           {
                      return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
           }
      
      where KASAN_SHADOW_SCALE_SHIFT = 3.
      
      So for every 8 bytes there is one corresponding byte of shadow memory.
      The following encoding used for each shadow byte: 0 means that all 8 bytes
      of the corresponding memory region are valid for access; k (1 <= k <= 7)
      means that the first k bytes are valid for access, and other (8 - k) bytes
      are not; Any negative value indicates that the entire 8-bytes are
      inaccessible.  Different negative values used to distinguish between
      different kinds of inaccessible memory (redzones, freed memory) (see
      mm/kasan/kasan.h).
      
      To be able to detect accesses to bad memory we need a special compiler.
      Such compiler inserts a specific function calls (__asan_load*(addr),
      __asan_store*(addr)) before each memory access of size 1, 2, 4, 8 or 16.
      
      These functions check whether memory region is valid to access or not by
      checking corresponding shadow memory.  If access is not valid an error
      printed.
      
      Historical background of the address sanitizer from Dmitry Vyukov:
      
      	"We've developed the set of tools, AddressSanitizer (Asan),
      	ThreadSanitizer and MemorySanitizer, for user space. We actively use
      	them for testing inside of Google (continuous testing, fuzzing,
      	running prod services). To date the tools have found more than 10'000
      	scary bugs in Chromium, Google internal codebase and various
      	open-source projects (Firefox, OpenSSL, gcc, clang, ffmpeg, MySQL and
      	lots of others): [2] [3] [4].
      	The tools are part of both gcc and clang compilers.
      
      	We have not yet done massive testing under the Kernel AddressSanitizer
      	(it's kind of chicken and egg problem, you need it to be upstream to
      	start applying it extensively). To date it has found about 50 bugs.
      	Bugs that we've found in upstream kernel are listed in [5].
      	We've also found ~20 bugs in out internal version of the kernel. Also
      	people from Samsung and Oracle have found some.
      
      	[...]
      
      	As others noted, the main feature of AddressSanitizer is its
      	performance due to inline compiler instrumentation and simple linear
      	shadow memory. User-space Asan has ~2x slowdown on computational
      	programs and ~2x memory consumption increase. Taking into account that
      	kernel usually consumes only small fraction of CPU and memory when
      	running real user-space programs, I would expect that kernel Asan will
      	have ~10-30% slowdown and similar memory consumption increase (when we
      	finish all tuning).
      
      	I agree that Asan can well replace kmemcheck. We have plans to start
      	working on Kernel MemorySanitizer that finds uses of unitialized
      	memory. Asan+Msan will provide feature-parity with kmemcheck. As
      	others noted, Asan will unlikely replace debug slab and pagealloc that
      	can be enabled at runtime. Asan uses compiler instrumentation, so even
      	if it is disabled, it still incurs visible overheads.
      
      	Asan technology is easily portable to other architectures. Compiler
      	instrumentation is fully portable. Runtime has some arch-dependent
      	parts like shadow mapping and atomic operation interception. They are
      	relatively easy to port."
      
      Comparison with other debugging features:
      ========================================
      
      KMEMCHECK:
      
        - KASan can do almost everything that kmemcheck can.  KASan uses
          compile-time instrumentation, which makes it significantly faster than
          kmemcheck.  The only advantage of kmemcheck over KASan is detection of
          uninitialized memory reads.
      
          Some brief performance testing showed that kasan could be
          x500-x600 times faster than kmemcheck:
      
      $ netperf -l 30
      		MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET
      		Recv   Send    Send
      		Socket Socket  Message  Elapsed
      		Size   Size    Size     Time     Throughput
      		bytes  bytes   bytes    secs.    10^6bits/sec
      
      no debug:	87380  16384  16384    30.00    41624.72
      
      kasan inline:	87380  16384  16384    30.00    12870.54
      
      kasan outline:	87380  16384  16384    30.00    10586.39
      
      kmemcheck: 	87380  16384  16384    30.03      20.23
      
        - Also kmemcheck couldn't work on several CPUs.  It always sets
          number of CPUs to 1.  KASan doesn't have such limitation.
      
      DEBUG_PAGEALLOC:
      	- KASan is slower than DEBUG_PAGEALLOC, but KASan works on sub-page
      	  granularity level, so it able to find more bugs.
      
      SLUB_DEBUG (poisoning, redzones):
      	- SLUB_DEBUG has lower overhead than KASan.
      
      	- SLUB_DEBUG in most cases are not able to detect bad reads,
      	  KASan able to detect both reads and writes.
      
      	- In some cases (e.g. redzone overwritten) SLUB_DEBUG detect
      	  bugs only on allocation/freeing of object. KASan catch
      	  bugs right before it will happen, so we always know exact
      	  place of first bad read/write.
      
      [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
      [2] https://code.google.com/p/address-sanitizer/wiki/FoundBugs
      [3] https://code.google.com/p/thread-sanitizer/wiki/FoundBugs
      [4] https://code.google.com/p/memory-sanitizer/wiki/FoundBugs
      [5] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel#Trophies
      
      Based on work by Andrey Konovalov.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Acked-by: NMichal Marek <mmarek@suse.cz>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b24becc
    • A
      MODULE_DEVICE_TABLE: fix some callsites · 0f989f74
      Andrew Morton 提交于
      The patch "module: fix types of device tables aliases" newly requires that
      invocations of
      
      MODULE_DEVICE_TABLE(type, name);
      
      come *after* the definition of `name'.  That is reasonable, but some
      drivers weren't doing this.  Fix them.
      
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Acked-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f989f74
  3. 13 2月, 2015 3 次提交
  4. 12 2月, 2015 6 次提交
    • C
      Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files · 740a5ddb
      Cyrill Gorcunov 提交于
      [akpm@linux-foundation.org: tweaks]
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Calvin Owens <calvinowens@fb.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      740a5ddb
    • K
      mm: account pmd page tables to the process · dc6c9a35
      Kirill A. Shutemov 提交于
      Dave noticed that unprivileged process can allocate significant amount of
      memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
      memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
      kernel doesn't account PMD tables to the process, only PTE.
      
      The use-cases below use few tricks to allocate a lot of PMD page tables
      while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by account PMD tables to the process the
      same way we account PTE.
      
      The main place where PMD tables is accounted is __pmd_alloc() and
      free_pmd_range(). But there're few corner cases:
      
       - HugeTLB can share PMD page tables. The patch handles by accounting
         the table to all processes who share it.
      
       - x86 PAE pre-allocates few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
         check on exit(2).
      
      Accounting only happens on configuration where PMD page table's level is
      present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
      counter value is used to calculate baseline for badness score by
      oom-killer.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
    • J
      mm: memcontrol: default hierarchy interface for memory · 241994ed
      Johannes Weiner 提交于
      Introduce the basic control files to account, partition, and limit
      memory using cgroups in default hierarchy mode.
      
      This interface versioning allows us to address fundamental design
      issues in the existing memory cgroup interface, further explained
      below.  The old interface will be maintained indefinitely, but a
      clearer model and improved workload performance should encourage
      existing users to switch over to the new one eventually.
      
      The control files are thus:
      
        - memory.current shows the current consumption of the cgroup and its
          descendants, in bytes.
      
        - memory.low configures the lower end of the cgroup's expected
          memory consumption range.  The kernel considers memory below that
          boundary to be a reserve - the minimum that the workload needs in
          order to make forward progress - and generally avoids reclaiming
          it, unless there is an imminent risk of entering an OOM situation.
      
        - memory.high configures the upper end of the cgroup's expected
          memory consumption range.  A cgroup whose consumption grows beyond
          this threshold is forced into direct reclaim, to work off the
          excess and to throttle new allocations heavily, but is generally
          allowed to continue and the OOM killer is not invoked.
      
        - memory.max configures the hard maximum amount of memory that the
          cgroup is allowed to consume before the OOM killer is invoked.
      
        - memory.events shows event counters that indicate how often the
          cgroup was reclaimed while below memory.low, how often it was
          forced to reclaim excess beyond memory.high, how often it hit
          memory.max, and how often it entered OOM due to memory.max.  This
          allows users to identify configuration problems when observing a
          degradation in workload performance.  An overcommitted system will
          have an increased rate of low boundary breaches, whereas increased
          rates of high limit breaches, maximum hits, or even OOM situations
          will indicate internally overcommitted cgroups.
      
      For existing users of memory cgroups, the following deviations from
      the current interface are worth pointing out and explaining:
      
        - The original lower boundary, the soft limit, is defined as a limit
          that is per default unset.  As a result, the set of cgroups that
          global reclaim prefers is opt-in, rather than opt-out.  The costs
          for optimizing these mostly negative lookups are so high that the
          implementation, despite its enormous size, does not even provide
          the basic desirable behavior.  First off, the soft limit has no
          hierarchical meaning.  All configured groups are organized in a
          global rbtree and treated like equal peers, regardless where they
          are located in the hierarchy.  This makes subtree delegation
          impossible.  Second, the soft limit reclaim pass is so aggressive
          that it not just introduces high allocation latencies into the
          system, but also impacts system performance due to overreclaim, to
          the point where the feature becomes self-defeating.
      
          The memory.low boundary on the other hand is a top-down allocated
          reserve.  A cgroup enjoys reclaim protection when it and all its
          ancestors are below their low boundaries, which makes delegation
          of subtrees possible.  Secondly, new cgroups have no reserve per
          default and in the common case most cgroups are eligible for the
          preferred reclaim pass.  This allows the new low boundary to be
          efficiently implemented with just a minor addition to the generic
          reclaim code, without the need for out-of-band data structures and
          reclaim passes.  Because the generic reclaim code considers all
          cgroups except for the ones running low in the preferred first
          reclaim pass, overreclaim of individual groups is eliminated as
          well, resulting in much better overall workload performance.
      
        - The original high boundary, the hard limit, is defined as a strict
          limit that can not budge, even if the OOM killer has to be called.
          But this generally goes against the goal of making the most out of
          the available memory.  The memory consumption of workloads varies
          during runtime, and that requires users to overcommit.  But doing
          that with a strict upper limit requires either a fairly accurate
          prediction of the working set size or adding slack to the limit.
          Since working set size estimation is hard and error prone, and
          getting it wrong results in OOM kills, most users tend to err on
          the side of a looser limit and end up wasting precious resources.
      
          The memory.high boundary on the other hand can be set much more
          conservatively.  When hit, it throttles allocations by forcing
          them into direct reclaim to work off the excess, but it never
          invokes the OOM killer.  As a result, a high boundary that is
          chosen too aggressively will not terminate the processes, but
          instead it will lead to gradual performance degradation.  The user
          can monitor this and make corrections until the minimal memory
          footprint that still gives acceptable performance is found.
      
          In extreme cases, with many concurrent allocations and a complete
          breakdown of reclaim progress within the group, the high boundary
          can be exceeded.  But even then it's mostly better to satisfy the
          allocation from the slack available in other groups or the rest of
          the system than killing the group.  Otherwise, memory.max is there
          to limit this type of spillover and ultimately contain buggy or
          even malicious applications.
      
        - The original control file names are unwieldy and inconsistent in
          many different ways.  For example, the upper boundary hit count is
          exported in the memory.failcnt file, but an OOM event count has to
          be manually counted by listening to memory.oom_control events, and
          lower boundary / soft limit events have to be counted by first
          setting a threshold for that value and then counting those events.
          Also, usage and limit files encode their units in the filename.
          That makes the filenames very long, even though this is not
          information that a user needs to be reminded of every time they
          type out those names.
      
          To address these naming issues, as well as to signal clearly that
          the new interface carries a new configuration model, the naming
          conventions in it necessarily differ from the old interface.
      
        - The original limit files indicate the state of an unset limit with
          a very high number, and a configured limit can be unset by echoing
          -1 into those files.  But that very high number is implementation
          and architecture dependent and not very descriptive.  And while -1
          can be understood as an underflow into the highest possible value,
          -2 or -10M etc. do not work, so it's not inconsistent.
      
          memory.low, memory.high, and memory.max will use the string
          "infinity" to indicate and set the highest possible value.
      
      [akpm@linux-foundation.org: use seq_puts() for basic strings]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      241994ed
    • W
      mm:add KPF_ZERO_PAGE flag for /proc/kpageflags · 56873f43
      Wang, Yalin 提交于
      Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can
      detect zero_page in /proc/kpageflags, and then do memory analysis more
      accurately.
      Signed-off-by: NYalin Wang <yalin.wang@sonymobile.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56873f43
    • J
      f2fs: introduce a batched trim · bba681cb
      Jaegeuk Kim 提交于
      This patch introduces a batched trimming feature, which submits split discard
      commands.
      
      This is to avoid long latency due to huge trim commands.
      If fstrim was triggered ranging from 0 to the end of device, we should lock
      all the checkpoint-related mutexes, resulting in very long latency.
      Reviewed-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      bba681cb
    • J
      f2fs: support norecovery mount option · 2d834bf9
      Jaegeuk Kim 提交于
      This patch adds a mount option, norecovery, which is mostly same as
      disable_roll_forward. The only difference is that norecovery should be activated
      with read-only mount option.
      
      This can be used when user wants to check whether f2fs is mountable or not
      without any recovery process. (e.g., xfstests/200)
      Reviewed-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      2d834bf9
  5. 11 2月, 2015 6 次提交
    • A
      drm/exynos: Add DECON driver · 96976c3d
      Ajay Kumar 提交于
      This patch is based on exynos-drm-next branch of Inki Dae's tree at:
      git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos.git
      
      DECON(Display and Enhancement Controller) is the new IP
      in exynos7 SOC for generating video signals using pixel data.
      
      DECON driver can be used to drive 2 different interfaces on Exynos7:
      DECON-INT(video controller) and DECON-EXT(Mixer for HDMI)
      
      The existing FIMD driver code was used as a template to create
      DECON driver. Only DECON-INT is supported as of now, and
      DECON-EXT support will be added later.
      
      The current version of the driver supports video mode displays.
      
      Changelog v2:
      - Change config name, DRM_EXYNOS_DECON to DRM_EXYNOS7_DECON.
      Signed-off-by: NAkshu Agrawal <akshua@gmail.com>
      Signed-off-by: NAjay Kumar <ajaykumar.rs@samsung.com>
      Signed-off-by: NInki Dae <inki.dae@samsung.com>
      96976c3d
    • K
      rmap: drop support of non-linear mappings · 27ba0644
      Kirill A. Shutemov 提交于
      We don't create non-linear mappings anymore.  Let's drop code which
      handles them in rmap.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27ba0644
    • K
      mm: replace remap_file_pages() syscall with emulation · c8d78c18
      Kirill A. Shutemov 提交于
      remap_file_pages(2) was invented to be able efficiently map parts of
      huge file into limited 32-bit virtual address space such as in database
      workloads.
      
      Nonlinear mappings are pain to support and it seems there's no
      legitimate use-cases nowadays since 64-bit systems are widely available.
      
      Let's drop it and get rid of all these special-cased code.
      
      The patch replaces the syscall with emulation which creates new VMA on
      each remap_file_pages(), unless they it can be merged with an adjacent
      one.
      
      I didn't find *any* real code that uses remap_file_pages(2) to test
      emulation impact on.  I've checked Debian code search and source of all
      packages in ALT Linux.  No real users: libc wrappers, mentions in
      strace, gdb, valgrind and this kind of stuff.
      
      There are few basic tests in LTP for the syscall.  They work just fine
      with emulation.
      
      To test performance impact, I've written small test case which
      demonstrate pretty much worst case scenario: map 4G shmfs file, write to
      begin of every page pgoff of the page, remap pages in reverse order,
      read every page.
      
      The test creates 1 million of VMAs if emulation is in use, so I had to
      set vm.max_map_count to 1100000 to avoid -ENOMEM.
      
      Before:		23.3 ( +-  4.31% ) seconds
      After:		43.9 ( +-  0.85% ) seconds
      Slowdown:	1.88x
      
      I believe we can live with that.
      
      Test case:
      
              #define _GNU_SOURCE
              #include <assert.h>
              #include <stdlib.h>
              #include <stdio.h>
              #include <sys/mman.h>
      
              #define MB	(1024UL * 1024)
              #define SIZE	(4096 * MB)
      
              int main(int argc, char **argv)
              {
                      unsigned long *p;
                      long i, pass;
      
                      for (pass = 0; pass < 10; pass++) {
                              p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                              if (p == MAP_FAILED) {
                                      perror("mmap");
                                      return -1;
                              }
      
                              for (i = 0; i < SIZE / 4096; i++)
                                      p[i * 4096 / sizeof(*p)] = i;
      
                              for (i = 0; i < SIZE / 4096; i++) {
                                      if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                                      0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                                              perror("remap_file_pages");
                                              return -1;
                                      }
                              }
      
                              for (i = SIZE / 4096 - 1; i >= 0; i--)
                                      assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
      
                              munmap(p, SIZE);
                      }
      
                      return 0;
              }
      
      [akpm@linux-foundation.org: fix spello]
      [sasha.levin@oracle.com: initialize populate before usage]
      [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
      Signed-off-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Armin Rigo <arigo@tunes.org>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c8d78c18
    • D
      fsioctl.c: make generic_block_fiemap() signal-tolerant · 913e027c
      Dmitry Monakhov 提交于
      __generic_block_fiemap may spin very long time for large sparse files.
      
      Without this patch an unprivileged user may abuse system resources simply
      by spawning a vast number of unkilable busyloops (works on ext2/ext3):
      
        truncate --size 1T test
        for ((i=0;i<1024;i++))
        do
               filefrag test > /dev/null &
        done
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      913e027c
    • A
      ocfs2: add a mount option journal_async_commit on ocfs2 filesystem · 1dfeb768
      alex chen 提交于
      Add a mount option to support JBD2 feature:
      
      JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT.  When this feature is opened, journal
      commit block can be written to disk without waiting for descriptor blocks,
      which can improve journal commit performance.  This option will enable
      'journal_checksum' internally.
      
      Using the fs_mark benchmark, using journal_async_commit shows a 50%
      improvement, the files per second go up from 215.2 to 317.5.
      
      test script:
      fs_mark  -d  /mnt/ocfs2/  -s  10240  -n  1000
      
      default:
      FSUse%        Count         Size    Files/sec     App Overhead
           0         1000        10240        215.2            17878
      
      with journal_async_commit option:
      FSUse%        Count         Size    Files/sec     App Overhead
           0         1000        10240        317.5            17881
      Signed-off-by: NAlex Chen <alex.chen@huawei.com>
      Signed-off-by: NWeiwei Wang <wangww631@huawei.comm>
      Reviewed-by: NJoseph Qi <joseph.qi@huawei.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1dfeb768
    • Z
      inotify: update documentation to reflect code changes · a5b2f95d
      Zhang Zhen 提交于
      The inotify interface has changed a lot.  The user interface was too
      old, and the kernel interface was removed by Eric Paris in commit:
      2dfc1cae inotify: remove inotify in kernel interface.
      Signed-off-by: NZhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Wang Kai <morgan.wang@huawei.com>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Robert Love <robert.w.love@intel.com>
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5b2f95d
  6. 10 2月, 2015 2 次提交
  7. 09 2月, 2015 1 次提交
  8. 08 2月, 2015 1 次提交
    • N
      tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks · 032ee423
      Neal Cardwell 提交于
      Helpers for mitigating ACK loops by rate-limiting dupacks sent in
      response to incoming out-of-window packets.
      
      This patch includes:
      
      - rate-limiting logic
      - sysctl to control how often we allow dupacks to out-of-window packets
      - SNMP counter for cases where we rate-limited our dupack sending
      
      The rate-limiting logic in this patch decides to not send dupacks in
      response to out-of-window segments if (a) they are SYNs or pure ACKs
      and (b) the remote endpoint is sending them faster than the configured
      rate limit.
      
      We rate-limit our responses rather than blocking them entirely or
      resetting the connection, because legitimate connections can rely on
      dupacks in response to some out-of-window segments. For example, zero
      window probes are typically sent with a sequence number that is below
      the current window, and ZWPs thus expect to thus elicit a dupack in
      response.
      
      We allow dupacks in response to TCP segments with data, because these
      may be spurious retransmissions for which the remote endpoint wants to
      receive DSACKs. This is safe because segments with data can't
      realistically be part of ACK loops, which by their nature consist of
      each side sending pure/data-less ACKs to each other.
      
      The dupack interval is controlled by a new sysctl knob,
      tcp_invalid_ratelimit, given in milliseconds, in case an administrator
      needs to dial this upward in the face of a high-rate DoS attack. The
      name and units are chosen to be analogous to the existing analogous
      knob for ICMP, icmp_ratelimit.
      
      The default value for tcp_invalid_ratelimit is 500ms, which allows at
      most one such dupack per 500ms. This is chosen to be 2x faster than
      the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
      2.4). We allow the extra 2x factor because network delay variations
      can cause packets sent at 1 second intervals to be compressed and
      arrive much closer.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      032ee423
  9. 07 2月, 2015 3 次提交
  10. 06 2月, 2015 3 次提交
    • L
      mailbox: Add Altera mailbox driver · f62092f6
      Ley Foon Tan 提交于
      The Altera mailbox allows for interprocessor communication. It supports
      only one channel and work as either sender or receiver.
      Signed-off-by: NLey Foon Tan <lftan@altera.com>
      f62092f6
    • I
      cxl: Export optional AFU configuration record in sysfs · b087e619
      Ian Munsie 提交于
      An AFU may optionally contain one or more PCIe like configuration
      records, which can be used to identify the AFU.
      
      This patch adds support for exposing the raw config space and the
      vendor, device and class code under sysfs. These will appear in a
      subdirectory of the AFU device corresponding with the configuration
      record number, e.g.
      
      cat /sys/class/cxl/afu0.0/cr0/vendor
      0x1014
      
      cat /sys/class/cxl/afu0.0/cr0/device
      0x4350
      
      cat /sys/class/cxl/afu0.0/cr0/class
      0x120000
      
      hexdump -C /sys/class/cxl/afu0.0/cr0/config
      00000000  14 10 50 43 00 00 00 00  06 00 00 12 00 00 00 00  |..PC............|
      00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
      *
      00000100
      
      These files behave in much the same way as the equivalent files for PCI
      devices, with one exception being that the config file is currently
      read-only and restricted to the root user. It is not necessarily
      required to be this strict, but we currently do not have a compelling
      use-case to make it writable and/or world-readable, so I erred on the
      side of being restrictive.
      Signed-off-by: NIan Munsie <imunsie@au1.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      b087e619
    • E
      xfs: fix panic_mask documentation · de8bd0eb
      Eric Sandeen 提交于
      This bit of the docs didn't quite reflect reality.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      de8bd0eb