1. 13 Jul 2017, 18 commits
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Committed by Michal Hocko
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantics
      in the page allocator.  This has been true, but only for allocation
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has always been
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantics for those requests, and they are
      considered too important to fail, so they might end up looping in the
      page allocator forever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of the __GFP_REPEAT flag has been removed for !costly requests,
      we can give the original flag a better name and, more importantly, a
      more useful semantic.  Let's rename it to __GFP_RETRY_MAYFAIL, which
      tells the user that the allocator will try really hard but there is no
      promise of success.  This works independently of the order and
      overrides the default allocator behavior.  Page allocator users have
      several levels of guarantee vs. cost options (take GFP_KERNEL as an
      example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick background reclaim. Should be used carefully
         because it might deplete the memory and the next user might hit
         more aggressive reclaim.
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non-sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already relied on this semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL
      if there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted
      except the most disruptive one (the OOM killer), and a user-defined
      fallback behavior is more sensible than continuing to retry in the
      page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
    • random,stackprotect: introduce get_random_canary function · 022c2040
      Committed by Rik van Riel
      Patch series "stackprotector: ascii armor the stack canary", v2.
      
      Zero out the first byte of the stack canary value on 64-bit systems, in
      order to mitigate unterminated C string overflows.
      
      The null byte both prevents C string functions from reading the canary,
      and from writing it if the canary value were guessed or obtained through
      some other means.
      
      Reducing the entropy by 8 bits is acceptable on 64-bit systems, which
      will still have 56 bits of entropy left, but not on 32-bit systems, so
      the "ascii armor" canary is only implemented on 64-bit systems.
      
      Inspired by the "ascii armor" code in execshield and Daniel Micay's
      linux-hardened tree.
      
      Also see https://github.com/thestinger/linux-hardened/
      
      This patch (of 5):
      
      Introduce get_random_canary(), which provides a random unsigned long
      canary value with the first byte zeroed out on 64-bit architectures, in
      order to mitigate non-terminated C string overflows.
      
      The null byte both prevents C string functions from reading the canary,
      and from writing it if the canary value were guessed or obtained through
      some other means.
      
      Reducing the entropy by 8 bits is acceptable on 64-bit systems, which
      will still have 56 bits of entropy left, but not on 32-bit systems, so
      the "ascii armor" canary is only implemented on 64-bit systems.
      
      Inspired by the "ascii armor" code in the old execshield patches, and
      Daniel Micay's linux-hardened tree.
      
      Link: http://lkml.kernel.org/r/20170524155751.424-2-riel@redhat.com
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      022c2040
    • include/linux/string.h: add the option of fortified string.h functions · 6974f0c4
      Committed by Daniel Micay
      This adds support for compiling with a rough equivalent to the glibc
      _FORTIFY_SOURCE=1 feature, providing compile-time and runtime buffer
      overflow checks for string.h functions when the compiler determines the
      size of the source or destination buffer at compile-time.  Unlike glibc,
      it covers buffer reads in addition to writes.
      
      GNU C __builtin_*_chk intrinsics are avoided because they would force a
      much more complex implementation.  They aren't designed to detect read
      overflows and offer no real benefit when using an implementation based
      on inline checks.  Inline checks don't add up to much code size and
      allow full use of the regular string intrinsics while avoiding the need
      for a bunch of _chk functions and per-arch assembly to avoid wrapper
      overhead.
      
      This detects various overflows at compile-time in various drivers and
      some non-x86 core kernel code.  There will likely be issues caught in
      regular use at runtime too.
      
      Future improvements were left out of the initial implementation for
      simplicity, as they are all optional and can be done incrementally:
      
      * Some of the fortified string functions (strncpy, strcat) don't yet
        place a limit on reads from the source based on __builtin_object_size
        of the source buffer.
      
      * Extending coverage to more string functions like strlcat.
      
      * It should be possible to optionally use __builtin_object_size(x, 1) for
        some functions (C strings) to detect intra-object overflows (like
        glibc's _FORTIFY_SOURCE=2), but for now this takes the conservative
        approach to avoid likely compatibility issues.
      
      * The compile-time checks should be made available via a separate config
        option which can be enabled by default (or always enabled) once enough
        time has passed to get the issues it catches fixed.
      
      Kees said:
       "This is great to have. While it was out-of-tree code, it would have
        blocked at least CVE-2016-3858 from being exploitable (improper size
        argument to strlcpy()). I've sent a number of fixes for
        out-of-bounds-reads that this detected upstream already"
      
      [arnd@arndb.de: x86: fix fortified memcpy]
        Link: http://lkml.kernel.org/r/20170627150047.660360-1-arnd@arndb.de
      [keescook@chromium.org: avoid panic() in favor of BUG()]
        Link: http://lkml.kernel.org/r/20170626235122.GA25261@beast
      [keescook@chromium.org: move from -mm, add ARCH_HAS_FORTIFY_SOURCE, tweak Kconfig help]
      Link: http://lkml.kernel.org/r/20170526095404.20439-1-danielmicay@gmail.com
      Link: http://lkml.kernel.org/r/1497903987-21002-8-git-send-email-keescook@chromium.org
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6974f0c4
    • kernel/watchdog: split up config options · 05a4a952
      Committed by Nicholas Piggin
      Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
      HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.
      
      LOCKUP_DETECTOR implies the general boot, sysctl, and programming
      interfaces for the lockup detectors.
      
      An architecture that wants to use a hard lockup detector must define
      HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.
      
      Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the
      minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and
      does not implement the LOCKUP_DETECTOR interfaces.
      
      sparc is unusual in that it has started to implement some of the
      interfaces, but not fully yet.  It should probably be converted to a full
      HAVE_HARDLOCKUP_DETECTOR_ARCH.
      
      [npiggin@gmail.com: fix]
        Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
      Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Don Zickus <dzickus@redhat.com>
      Reviewed-by: Babu Moger <babu.moger@oracle.com>
      Tested-by: Babu Moger <babu.moger@oracle.com>	[sparc]
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05a4a952
    • kernel/watchdog: introduce arch_touch_nmi_watchdog() · f2e0cff8
      Committed by Nicholas Piggin
      For architectures that define HAVE_NMI_WATCHDOG, instead of having them
      provide the complete touch_nmi_watchdog() function, just have them
      provide arch_touch_nmi_watchdog().
      
      This gives the generic code more flexibility in implementing this
      function, and arch implementations don't miss out on touching the
      softlockup watchdog or other generic details.
      
      Link: http://lkml.kernel.org/r/20170616065715.18390-3-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Don Zickus <dzickus@redhat.com>
      Reviewed-by: Babu Moger <babu.moger@oracle.com>
      Tested-by: Babu Moger <babu.moger@oracle.com>	[sparc]
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2e0cff8
    • kernel/watchdog: remove unused declaration · 24bb4461
      Committed by Nicholas Piggin
      Patch series "Improve watchdog config for arch watchdogs", v4.
      
      A series to make the hardlockup watchdog more easily replaceable by arch
      code.  The last patch provides some justification for why we want to do
      this (existing sparc watchdog is another that could benefit).
      
      This patch (of 5):
      
      Remove unused declaration.
      
      Link: http://lkml.kernel.org/r/20170616065715.18390-2-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Don Zickus <dzickus@redhat.com>
      Reviewed-by: Babu Moger <babu.moger@oracle.com>
      Tested-by: Babu Moger <babu.moger@oracle.com>	[sparc]
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24bb4461
    • include/linux/sem.h: correctly document sem_ctime · 2cd648c1
      Committed by Manfred Spraul
      sem_ctime is initialized to the semget() time and then updated at every
      semctl() that changes the array.
      
      Thus it does not represent the time of the last change.
      
      In particular, semop() calls are only stored in sem_otime, not in
      sem_ctime.
      
      This is already described in ipc/sem.c; I just overlooked that the
      same comment also exists in include/linux/sem.h and in man semctl(2).
      
      So: correct the wrong comments.
      
      Link: http://lkml.kernel.org/r/20170515171912.6298-4-manfred@colorfullife.com
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: <1vier1@web.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2cd648c1
    • ipc: merge ipc_rcu and kern_ipc_perm · dba4cdd3
      Committed by Manfred Spraul
      ipc has two management structures that exist for every id:
       - struct kern_ipc_perm, it contains e.g. the permissions.
       - struct ipc_rcu, it contains the rcu head for rcu handling and the
         refcount.
      
      The patch merges both structures.
      
      As a bonus, we may save one cacheline, because both structures are
      cacheline aligned.  In addition, it reduces the number of casts;
      instead, most codepaths can use container_of().
      
      To simplify the code, ipc_rcu_alloc() initializes the allocation to 0.
      
      [manfred@colorfullife.com: really include the memset() into ipc_alloc_rcu()]
        Link: http://lkml.kernel.org/r/564f8612-0601-b267-514f-a9f650ec9b32@colorfullife.com
      Link: http://lkml.kernel.org/r/20170525185107.12869-3-manfred@colorfullife.com
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dba4cdd3
    • ipc/sem.c: remove sem_base, embed struct sem · 1a233956
      Committed by Manfred Spraul
      sma->sem_base is initialized with
      
      	sma->sem_base = (struct sem *) &sma[1];
      
      The current code has four problems:
       - There is an unnecessary pointer dereference - sem_base is not needed.
       - Alignment for struct sem only works by chance.
       - The current code causes false positives in static code analysis.
       - It is a cast between different non-void types, which the future
         randstruct GCC plugin warns about.
      
      And, as a bonus, the code size gets smaller:
      
        Before:
          0 .text         00003770
        After:
          0 .text         0000374e
      
      [manfred@colorfullife.com: s/[0]/[]/, per hch]
        Link: http://lkml.kernel.org/r/20170525185107.12869-2-manfred@colorfullife.com
      Link: http://lkml.kernel.org/r/20170515171912.6298-2-manfred@colorfullife.com
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: <1vier1@web.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a233956
    • fault-inject: support systematic fault injection · e41d5818
      Committed by Dmitry Vyukov
      Add a /proc/self/task/<current-tid>/fail-nth file that allows
      systematically failing the 0th, 1st, 2nd, and so on call in the
      current task.
      Excerpt from the added documentation:
      
       "Write to this file of integer N makes N-th call in the current task
        fail (N is 0-based). Read from this file returns a single char 'Y' or
        'N' that says if the fault setup with a previous write to this file
        was injected or not, and disables the fault if it wasn't yet injected.
        Note that this file enables all types of faults (slab, futex, etc).
        This setting takes precedence over all other generic settings like
        probability, interval, times, etc. But per-capability settings (e.g.
        fail_futex/ignore-private) take precedence over it. This feature is
        intended for systematic testing of faults in a single system call. See
        an example below"
      
      Why add a new setting:
      1. Existing settings are global rather than per-task,
         so parallel testing is not possible.
      2. attr->interval is close, but it depends on attr->count,
         which is not reset to 0, so interval does not work as expected.
      3. Modeling this with existing settings requires manipulating all of
         probability, interval, times, space, task-filter and the
         unexposed count and per-task make-it-fail files.
      4. Existing settings are per-failure-type, and the set of failure
         types is potentially expanding.
      5. make-it-fail can't be changed by an unprivileged user, and
         aggressive stress testing is better done as an unprivileged user.
         Similarly, this would require opening the debugfs files to the
         unprivileged user, who would need to reopen at least the times file
         (not possible to pre-open it before dropping privileges).
      
      The proposed interface solves all of the above (see the example).
      
      We want to integrate this into the syzkaller fuzzer.  A prototype has
      found 10 bugs in the kernel in the first day of usage:
      
        https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance
      
      I've made the current interface work with all types of our sandboxes.
      For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to
      make /proc entries non-root owned.  So I am fine with the current
      version of the code.
      
      [akpm@linux-foundation.org: fix build]
      Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e41d5818
    • kcmp: fs/epoll: wrap kcmp code with CONFIG_CHECKPOINT_RESTORE · 92ef6da3
      Committed by Cyrill Gorcunov
      The kcmp syscall is built only if CONFIG_CHECKPOINT_RESTORE is
      selected, so wrap the relevant helpers in the epoll code with that
      config option to build them conditionally.
      
      Link: http://lkml.kernel.org/r/20170513083456.GG1881@uranus.lan
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Reported-by: Andrew Morton <akpm@linuxfoundation.org>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92ef6da3
    • kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files · 0791e364
      Committed by Cyrill Gorcunov
      With the current epoll architecture, target files are addressed by
      file_struct and file descriptor number, where the latter is not
      unique.  Moreover, files can be transferred from another process via a
      unix socket, added into the queue, and then closed, so we won't find
      this descriptor in the task's fdinfo list.
      
      Thus, to checkpoint and restore such processes, CRIU needs to find out
      where exactly the target file is present, in order to add it into the
      epoll queue.  For this, one can use the kcmp call, where some
      particular target file from the queue is compared with an arbitrary
      file passed as an argument.
      
      Because epoll target files can have the same file descriptor number
      but different file_struct, a caller should explicitly specify the
      offset within the queue.
      
      To test whether some particular file matches an entry inside epoll,
      one has to
      
       - fill a kcmp_epoll_slot structure with the epoll file descriptor,
         target file number and target file offset (if only one target is
         present, it should be 0)
      
       - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
          - the kernel fetches the file pointer matching file descriptor @fd
            of pid1
          - looks up the file struct in the epoll queue of pid2 and returns
            the traditional 0, 1, 2 result for sorting purposes
      
      Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0791e364
    • sysctl: add unsigned int range support · 61d9b56a
      Committed by Luis R. Rodriguez
      To keep parity with the regular int interfaces, provide an unsigned
      int counterpart, proc_douintvec_minmax(), which allows you to specify
      a range of allowed valid numbers.
      
      Adding proc_douintvec_minmax_sysadmin() is easy, but we can wait for
      an actual user of it.
      
      Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
      Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      61d9b56a
    • kdump: protect vmcoreinfo data under the crash memory · 1229384f
      Committed by Xunlei Pang
      Currently, vmcoreinfo data is updated at boot time in a
      subsys_initcall(), so it risks being modified by buggy code while the
      system is running.
      
      As a result, the dumped vmcore may contain wrong vmcoreinfo.  Later
      on, when using the "crash", "makedumpfile", etc. utilities to parse
      this vmcore, we will probably get a "Segmentation fault" or other
      unexpected errors.
      
      E.g.  1) wrong code overwrites vmcoreinfo_data; 2) the system then
      crashes; 3) kdump is triggered, and we will obviously fail to
      recognize the crash context correctly due to the corrupted vmcoreinfo.
      
      Now, except for vmcoreinfo, all the crash data is well protected
      (including the cpu note, which is fully updated in the crash path, so
      its correctness is guaranteed).  Given that vmcoreinfo data is a large
      chunk prepared for kdump, we had better protect it as well.
      
      To solve this, we relocate and copy vmcoreinfo_data to the crash memory
      when kdump is loading via kexec syscalls.  Because the whole crash
      memory will be protected by existing arch_kexec_protect_crashkres()
      mechanism, we naturally protect vmcoreinfo_data from write (even read)
      access under kernel direct mapping after kdump is loaded.
      
      Since kdump is usually loaded at the very early stage after boot, we can
      trust the correctness of the vmcoreinfo data copied.
      
      On the other hand, we still need to access the safe copy of
      vmcoreinfo when a crash happens, in order to generate vmcoreinfo_note
      again; we rely on vmap() to map a new kernel virtual address and
      switch to using that one in the subsequent crash_save_vmcoreinfo().
      
      BTW, we do not touch vmcoreinfo_note, because it will be fully updated
      using the protected vmcoreinfo_data after the crash, so it is surely
      correct, just like the cpu crash note.
      
      Link: http://lkml.kernel.org/r/1493281021-20737-3-git-send-email-xlpang@redhat.com
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1229384f
    • powerpc/fadump: use the correct VMCOREINFO_NOTE_SIZE for phdr · 5203f499
      Committed by Xunlei Pang
      vmcoreinfo_max_size stands for the size of vmcoreinfo_data; the
      correct one to use here is vmcoreinfo_note, whose total size is
      VMCOREINFO_NOTE_SIZE.
      
      As explained in commit 77019967 ("kdump: fix exported size of
      vmcoreinfo note"), this should not affect the actual function, but we
      had better fix it; the change is safe and backward compatible.
      
      After this, we can get rid of the variable vmcoreinfo_max_size; let's
      use the corresponding macros directly.  Fewer variables mean more
      safety for vmcoreinfo operations.
      
      [xlpang@redhat.com: fix build warning]
        Link: http://lkml.kernel.org/r/1494830606-27736-1-git-send-email-xlpang@redhat.com
      Link: http://lkml.kernel.org/r/1493281021-20737-2-git-send-email-xlpang@redhat.com
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Reviewed-by: Dave Young <dyoung@redhat.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5203f499
    • kexec: move vmcoreinfo out of the kernel's .bss section · 203e9e41
      Committed by Xunlei Pang
      As Eric said,
       "what we need to do is move the variable vmcoreinfo_note out of the
        kernel's .bss section. And modify the code to regenerate and keep this
        information in something like the control page.
      
        Definitely something like this needs a page all to itself, and ideally
        far away from any other kernel data structures. I clearly was not
        watching closely the day someone decided to keep this silly thing in
        the kernel's .bss section."
      
      This patch allocates extra pages for these vmcoreinfo_XXX variables.
      One advantage is that it improves the safety of vmcoreinfo, because
      vmcoreinfo is now kept far away from other kernel data structures.
      
      Link: http://lkml.kernel.org/r/1493281021-20737-1-git-send-email-xlpang@redhat.com
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Suggested-by: Eric Biederman <ebiederm@xmission.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      203e9e41
    • kernel.h: handle pointers to arrays better in container_of() · c7acec71
      Committed by Ian Abbott
      If the first parameter of container_of() is a pointer to a
      non-const-qualified array type (and the third parameter names a
      non-const-qualified array member), the local variable __mptr will be
      defined with a const-qualified array type.  In ISO C, these types are
      incompatible.  They work as expected in GNU C, but some versions will
      issue warnings.  For example, GCC 4.9 produces the warning
      "initialization from incompatible pointer type".
      
      Here is an example of where the problem occurs:
      
      -------------------------------------------------------
        #include <linux/kernel.h>
        #include <linux/module.h>
      
        MODULE_LICENSE("GPL");
      
        struct st {
        	int a;
        	char b[16];
        };
      
        static int __init example_init(void) {
        	struct st t = { .a = 101, .b = "hello" };
        	char (*p)[16] = &t.b;
        	struct st *x = container_of(p, struct st, b);
        	printk(KERN_DEBUG "%p %p\n", (void *)&t, (void *)x);
        	return 0;
        }
      
        static void __exit example_exit(void) {
        }
      
        module_init(example_init);
        module_exit(example_exit);
      -------------------------------------------------------
      
      Building the module with gcc-4.9 results in these warnings (where '{m}'
      is the module source and '{k}' is the kernel source):
      
      -------------------------------------------------------
        In file included from {m}/example.c:1:0:
        {m}/example.c: In function `example_init':
        {k}/include/linux/kernel.h:854:48: warning: initialization from incompatible pointer type
          const typeof( ((type *)0)->member ) *__mptr = (ptr); \
                                                        ^
        {m}/example.c:14:17: note: in expansion of macro `container_of'
          struct st *x = container_of(p, struct st, b);
                         ^
        {k}/include/linux/kernel.h:854:48: warning: (near initialization for `x')
          const typeof( ((type *)0)->member ) *__mptr = (ptr); \
                                                        ^
        {m}/example.c:14:17: note: in expansion of macro `container_of'
          struct st *x = container_of(p, struct st, b);
                         ^
      -------------------------------------------------------
      
      Replace the type checking performed by the macro to avoid these
      warnings.  Make sure `*(ptr)` either has type compatible with the
      member, or has type compatible with `void`, ignoring qualifiers.  Raise
      compiler errors if this is not true.  This is stronger than the previous
      behaviour, which only resulted in compiler warnings for a type mismatch.
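      The reworked check can be sketched in userspace (a hypothetical
      stand-alone demo: the `same_type` helper and `_Static_assert` stand in
      for the kernel's `__same_type` and `BUILD_BUG_ON_MSG`, and the GCC/Clang
      `typeof` and statement-expression extensions are assumed):

      ```c
      #include <stddef.h>
      #include <stdio.h>

      /* Userspace sketch of the reworked container_of(): __mptr is now a
       * plain void *, so no const-qualified typeof initializer is generated,
       * and a compile-time assertion enforces the stricter type check:
       * *(ptr) must match the member's type (or be void), else a hard error. */
      #define same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))

      #define container_of(ptr, type, member) ({			\
      	void *__mptr = (void *)(ptr);				\
      	_Static_assert(same_type(*(ptr), ((type *)0)->member) ||	\
      		       same_type(*(ptr), void),			\
      		       "pointer type mismatch in container_of()");	\
      	((type *)((char *)__mptr - offsetof(type, member))); })

      struct st {
      	int a;
      	char b[16];
      };

      int main(void)
      {
      	struct st t = { .a = 101, .b = "hello" };
      	char (*p)[16] = &t.b;
      	/* char[16] matches the member's type, so this now compiles
      	 * cleanly with no incompatible-pointer warning. */
      	struct st *x = container_of(p, struct st, b);
      	printf("%d\n", x == &t);	/* prints 1 */
      	return 0;
      }
      ```

      Passing a pointer of an unrelated type instead of `p` would make the
      `_Static_assert` fire at compile time, mirroring the new hard error.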
      
      [arnd@arndb.de: fix new warnings for container_of()]
        Link: http://lkml.kernel.org/r/20170620200940.90557-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20170525120316.24473-7-abbotti@mev.co.uk
      Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Johannes Berg <johannes.berg@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c7acec71
    • S
      include/linux/dcache.h: use unsigned chars in struct name_snapshot · 0a2c13d9
      Committed by Stephen Rothwell
      "kernel.h: handle pointers to arrays better in container_of()" triggers:
      
      In file included from include/uapi/linux/stddef.h:1:0,
                       from include/linux/stddef.h:4,
                       from include/uapi/linux/posix_types.h:4,
                       from include/uapi/linux/types.h:13,
                       from include/linux/types.h:5,
                       from include/linux/syscalls.h:71,
                       from fs/dcache.c:17:
      fs/dcache.c: In function 'release_dentry_name_snapshot':
      include/linux/compiler.h:542:38: error: call to '__compiletime_assert_305' declared with attribute error: pointer type mismatch in container_of()
        _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                            ^
      include/linux/compiler.h:525:4: note: in definition of macro '__compiletime_assert'
          prefix ## suffix();    \
          ^
      include/linux/compiler.h:542:2: note: in expansion of macro '_compiletime_assert'
        _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
        ^
      include/linux/build_bug.h:46:37: note: in expansion of macro 'compiletime_assert'
       #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                           ^
      include/linux/kernel.h:860:2: note: in expansion of macro 'BUILD_BUG_ON_MSG'
        BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
        ^
      fs/dcache.c:305:7: note: in expansion of macro 'container_of'
         p = container_of(name->name, struct external_name, name[0]);
      
      Switch name_snapshot to use unsigned chars, matching struct qstr and
      struct external_name.
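      The mismatch behind this build failure can be seen in a minimal
      userspace check (a hypothetical demo; the `same_type` helper stands in
      for the kernel's `__same_type`, using the GCC/Clang `typeof` extension):

      ```c
      #include <stdio.h>

      /* Plain char, signed char, and unsigned char are three distinct types
       * in C, regardless of whether plain char is unsigned on the target
       * ABI.  The stricter container_of() type check therefore rejects a
       * char pointer aimed at an unsigned char member, which is why
       * name_snapshot had to switch to unsigned chars to match struct qstr
       * and struct external_name. */
      #define same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))

      int main(void)
      {
      	printf("%d\n", same_type(char, unsigned char));	/* prints 0 */
      	printf("%d\n", same_type(char, char));		/* prints 1 */
      	return 0;
      }
      ```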
      
      Link: http://lkml.kernel.org/r/20170710152134.0f78c1e6@canb.auug.org.au
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a2c13d9
  2. 11 Jul, 2017 (22 commits)