1. 15 Jul 2017: 2 commits
  2. 14 Jul 2017: 21 commits
  3. 13 Jul 2017: 17 commits
    • kvm: x86: hyperv: add KVM_CAP_HYPERV_SYNIC2 · efc479e6
      Committed by Roman Kagan
      There is a flaw in the Hyper-V SynIC implementation in KVM: when the
      message page or event flags page is enabled by setting the
      corresponding MSR, KVM zeroes it out.  This is problematic because on
      migration the corresponding MSRs are loaded on the destination, so
      the content of those pages is lost.
      
      This went unnoticed so far because the only users of those pages were
      the in-KVM Hyper-V SynIC timers, which could continue working despite
      that zeroing.
      
      Newer QEMU uses those pages for the Hyper-V VMBus implementation,
      and zeroing them breaks migration.
      
      Besides, in newer QEMU the content of those pages is fully managed by
      QEMU, so zeroing them is undesirable even when the MSRs are written
      from the guest side.
      
      To support this new scheme, introduce a new capability,
      KVM_CAP_HYPERV_SYNIC2, which, when enabled, makes sure that the SynIC
      pages aren't zeroed out in KVM (see the sketch after this entry).
      Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      efc479e6
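      A minimal userspace sketch of opting in (vcpu_fd is assumed to be an
      open KVM vCPU file descriptor; the capability is enabled through the
      standard KVM_ENABLE_CAP ioctl):
      
          #include <stdio.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>
          
          static void enable_synic2(int vcpu_fd)
          {
                  struct kvm_enable_cap cap = { .cap = KVM_CAP_HYPERV_SYNIC2 };
          
                  if (ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap) < 0)
                          perror("KVM_ENABLE_CAP(KVM_CAP_HYPERV_SYNIC2)");
                  /* from here on, writes to the SynIC message/event-flags
                   * MSRs no longer zero the backing pages */
          }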
    • clk: Provide bulk prepare_enable disable_unprepare variants · 3c48d86c
      Committed by Bjorn Andersson
      This extends the existing set of bulk helpers with prepare_enable
      and disable_unprepare variants (a usage sketch follows this entry).
      
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Dong Aisheng <aisheng.dong@nxp.com>
      Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      3c48d86c
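      A driver-side usage sketch (the clock names and the devm getter are
      illustrative, not part of this patch):
      
          #include <linux/clk.h>
          
          static struct clk_bulk_data my_clks[] = {
                  { .id = "core" },       /* hypothetical clock names */
                  { .id = "bus" },
          };
          
          static int my_probe(struct device *dev)
          {
                  int ret;
          
                  ret = devm_clk_bulk_get(dev, ARRAY_SIZE(my_clks), my_clks);
                  if (ret)
                          return ret;
          
                  /* one call instead of clk_bulk_prepare() + clk_bulk_enable() */
                  return clk_bulk_prepare_enable(ARRAY_SIZE(my_clks), my_clks);
          }
          
          static void my_teardown(struct device *dev)
          {
                  clk_bulk_disable_unprepare(ARRAY_SIZE(my_clks), my_clks);
          }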
    • writeback: rework wb_[dec|inc]_stat family of functions · 3e8f399d
      Committed by Nikolay Borisov
      Currently the writeback statistics code uses percpu counters to hold
      various statistics.  Furthermore we have 2 families of functions -
      those which disable local irq and those which don't, whose names
      begin with a double underscore.  However, they both end up calling
      __add_wb_stat, which in turn calls percpu_counter_add_batch, which is
      already irq-safe.
      
      Exploiting this fact allows us to eliminate the __wb_* functions,
      since they add no protection beyond what we already have.
      Furthermore, refactor the wb_* functions to call __add_wb_stat
      directly, without the irq-disabling dance (see the sketch after this
      entry).  This will likely make code which modifies the stat counters
      run faster.
      
      While at it, also document why percpu_counter_add_batch is in fact
      preempt- and irq-safe, since at least 3 people got confused.
      
      Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e8f399d
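      The shape of the refactoring, roughly (a sketch based on the
      description above; percpu_counter_add_batch() is already irq-safe,
      so the explicit irq fencing was redundant):
      
          /* before: irq-disabling wrapper around an irq-safe counter */
          static inline void inc_wb_stat(struct bdi_writeback *wb,
                                         enum wb_stat_item item)
          {
                  unsigned long flags;
          
                  local_irq_save(flags);
                  __inc_wb_stat(wb, item);
                  local_irq_restore(flags);
          }
          
          /* after: call the underlying irq-safe helper directly */
          static inline void inc_wb_stat(struct bdi_writeback *wb,
                                         enum wb_stat_item item)
          {
                  __add_wb_stat(wb, item, 1);
          }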
    • mm, migration: do not trigger OOM killer when migrating memory · 0f556856
      Committed by Michal Hocko
      Page migration (for memory hotplug, soft_offline_page or mbind)
      needs to allocate new memory.  This can trigger the OOM killer if the
      target memory is depleted.  Although quite unlikely, this is still
      possible, especially for memory hotplug (offlining of memory).
      
      Up to now we didn't really have reasonable means to back off.
      __GFP_NORETRY can fail just too easily and __GFP_THISNODE sticks to a
      single node, which is not suitable for all callers.
      
      But now that we have __GFP_RETRY_MAYFAIL we should use it (see the
      sketch after this entry).  It is preferable to fail the migration
      rather than disrupt the system by killing some processes.
      
      Link: http://lkml.kernel.org/r/20170623085345.11304-7-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f556856
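      A sketch of the resulting allocation policy for migration target
      pages (the callback shape and flags follow the description; real
      call sites differ in detail):
      
          /* sketch: allocate a migration target page, preferring a failed
           * migration over invoking the OOM killer */
          static struct page *new_page(struct page *page, unsigned long node)
          {
                  gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE |
                                   __GFP_RETRY_MAYFAIL;
          
                  return __alloc_pages_node(node, gfp_mask, 0);
          }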
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Committed by Michal Hocko
      __GFP_REPEAT was designed to allow retry-but-eventually-fail
      semantics in the page allocator.  This has been true, but only for
      allocation requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has
      always been ignored for smaller sizes.  This is a bit unfortunate
      because there is no way to express the same semantics for those
      requests, and they are considered too important to fail, so they
      might end up looping in the page allocator forever, similarly to
      GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of the __GFP_REPEAT flag has been removed for !costly requests,
      we can give the original flag a better name and, more importantly, a
      more useful semantic.  Let's rename it to __GFP_RETRY_MAYFAIL, which
      tells the user that the allocator will try really hard but there is
      no promise of success.  This works independently of the order and
      overrides the default allocator behavior.  Page allocator users have
      several levels of guarantee vs. cost options (take GFP_KERNEL as an
      example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick background reclaim. Should be used carefully
         because it might deplete the memory and the next user might hit
         the more aggressive reclaim.
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non-sleeping allocation with an expensive fallback so it can
         access some portion of memory reserves. Usually used from
         interrupt/bh context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because this matches their existing semantics.  No new users are
      added.  __alloc_pages_slowpath is changed to bail out for
      __GFP_RETRY_MAYFAIL if there is no progress and we have already
      passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted
      except the most disruptive one (the OOM killer), and a user-defined
      fallback behavior is more sensible than retrying forever in the page
      allocator (see the sketch after this entry).
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
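      The canonical fallback pattern this enables, sketched (essentially
      the kvmalloc() idiom: try hard for contiguous memory, then degrade
      gracefully instead of disturbing the system):
      
          /* sketch: prefer a physically contiguous buffer, but fall back
           * to vmalloc() rather than trigger the OOM killer */
          static void *alloc_big_buffer(size_t size)
          {
                  void *buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
          
                  if (!buf)
                          buf = vmalloc(size);
                  return buf;     /* NULL only if vmalloc failed too */
          }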
    • random,stackprotect: introduce get_random_canary function · 022c2040
      Committed by Rik van Riel
      Patch series "stackprotector: ascii armor the stack canary", v2.
      
      Zero out the first byte of the stack canary value on 64-bit systems,
      in order to mitigate unterminated C string overflows.
      
      The null byte both prevents C string functions from reading the
      canary, and from writing it if the canary value were guessed or
      obtained through some other means.
      
      Reducing the entropy by 8 bits is acceptable on 64-bit systems, which
      will still have 56 bits of entropy left, but not on 32-bit systems,
      so the "ascii armor" canary is only implemented on 64-bit systems.
      
      Inspired by the "ascii armor" code in execshield and Daniel Micay's
      linux-hardened tree.
      
      Also see https://github.com/thestinger/linux-hardened/
      
      This patch (of 5):
      
      Introduce get_random_canary(), which provides a random unsigned long
      canary value with the first byte zeroed out on 64-bit architectures,
      in order to mitigate non-terminated C string overflows.
      
      The null byte both prevents C string functions from reading the
      canary, and from writing it if the canary value were guessed or
      obtained through some other means.
      
      Reducing the entropy by 8 bits is acceptable on 64-bit systems, which
      will still have 56 bits of entropy left, but not on 32-bit systems,
      so the "ascii armor" canary is only implemented on 64-bit systems.
      
      Inspired by the "ascii armor" code in the old execshield patches, and
      Daniel Micay's linux-hardened tree.
      
      Link: http://lkml.kernel.org/r/20170524155751.424-2-riel@redhat.com
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      022c2040
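      A sketch of what the helper looks like, per the description above
      (the mask value follows from "first byte zeroed on 64-bit"; the real
      definitions live in include/linux/random.h):
      
          /* zero the leading byte on 64-bit so an overflowing C string
           * read or write stops at the canary */
          #ifdef CONFIG_64BIT
          #define CANARY_MASK 0xffffffffffffff00UL
          #else
          #define CANARY_MASK 0xffffffffUL        /* no armor on 32-bit */
          #endif
          
          static inline unsigned long get_random_canary(void)
          {
                  unsigned long val = get_random_long();
          
                  return val & CANARY_MASK;
          }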
    • include/linux/string.h: add the option of fortified string.h functions · 6974f0c4
      Committed by Daniel Micay
      This adds support for compiling with a rough equivalent to the glibc
      _FORTIFY_SOURCE=1 feature, providing compile-time and runtime buffer
      overflow checks for string.h functions when the compiler determines the
      size of the source or destination buffer at compile-time.  Unlike glibc,
      it covers buffer reads in addition to writes.
      
      GNU C __builtin_*_chk intrinsics are avoided because they would force a
      much more complex implementation.  They aren't designed to detect read
      overflows and offer no real benefit when using an implementation based
      on inline checks.  Inline checks don't add up to much code size and
      allow full use of the regular string intrinsics while avoiding the need
      for a bunch of _chk functions and per-arch assembly to avoid wrapper
      overhead.
      
      This detects various overflows at compile-time in various drivers and
      some non-x86 core kernel code.  There will likely be issues caught in
      regular use at runtime too.
      
      Future improvements left out of initial implementation for simplicity,
      as it's all quite optional and can be done incrementally:
      
      * Some of the fortified string functions (strncpy, strcat) don't yet
        place a limit on reads from the source based on
        __builtin_object_size of the source buffer.
      
      * Extending coverage to more string functions like strlcat.
      
      * It should be possible to optionally use __builtin_object_size(x, 1) for
        some functions (C strings) to detect intra-object overflows (like
        glibc's _FORTIFY_SOURCE=2), but for now this takes the conservative
        approach to avoid likely compatibility issues.
      
      * The compile-time checks should be made available via a separate config
        option which can be enabled by default (or always enabled) once enough
        time has passed to get the issues it catches fixed.
      
      Kees said:
       "This is great to have. While it was out-of-tree code, it would have
        blocked at least CVE-2016-3858 from being exploitable (improper size
        argument to strlcpy()). I've sent a number of fixes for
        out-of-bounds-reads that this detected upstream already"
      
      [arnd@arndb.de: x86: fix fortified memcpy]
        Link: http://lkml.kernel.org/r/20170627150047.660360-1-arnd@arndb.de
      [keescook@chromium.org: avoid panic() in favor of BUG()]
        Link: http://lkml.kernel.org/r/20170626235122.GA25261@beast
      [keescook@chromium.org: move from -mm, add ARCH_HAS_FORTIFY_SOURCE, tweak Kconfig help]
      Link: http://lkml.kernel.org/r/20170526095404.20439-1-danielmicay@gmail.com
      Link: http://lkml.kernel.org/r/1497903987-21002-8-git-send-email-keescook@chromium.org
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6974f0c4
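      The flavor of the inline checks, sketched (simplified from the
      approach described above; the real header covers many more functions
      and diagnostics):
      
          /* when the compiler knows a buffer size, check against it;
           * otherwise fall through to the regular intrinsic */
          __FORTIFY_INLINE char *strcpy(char *p, const char *q)
          {
                  size_t p_size = __builtin_object_size(p, 0);
                  size_t q_size = __builtin_object_size(q, 0);
          
                  if (p_size == (size_t)-1 && q_size == (size_t)-1)
                          return __builtin_strcpy(p, q);
                  if (strscpy(p, q, p_size < q_size ? p_size : q_size) < 0)
                          fortify_panic(__func__);
                  return p;
          }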
    • kernel/watchdog: split up config options · 05a4a952
      Committed by Nicholas Piggin
      Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
      HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.
      
      LOCKUP_DETECTOR implies the general boot, sysctl, and programming
      interfaces for the lockup detectors.
      
      An architecture that wants to use a hard lockup detector must define
      HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.
      
      Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides
      the minimal arch_touch_nmi_watchdog(); it otherwise does its own
      thing and does not implement the LOCKUP_DETECTOR interfaces.
      
      sparc is unusual in that it has started to implement some of the
      interfaces, but not fully yet.  It should probably be converted to a full
      HAVE_HARDLOCKUP_DETECTOR_ARCH.
      
      [npiggin@gmail.com: fix]
        Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
      Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Don Zickus <dzickus@redhat.com>
      Reviewed-by: Babu Moger <babu.moger@oracle.com>
      Tested-by: Babu Moger <babu.moger@oracle.com>	[sparc]
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05a4a952
    • kernel/watchdog: introduce arch_touch_nmi_watchdog() · f2e0cff8
      Committed by Nicholas Piggin
      For architectures that define HAVE_NMI_WATCHDOG, instead of having them
      provide the complete touch_nmi_watchdog() function, just have them
      provide arch_touch_nmi_watchdog().
      
      This gives the generic code more flexibility in implementing this
      function, and arch implementations don't miss out on touching the
      softlockup watchdog or other generic details (see the sketch after
      this entry).
      
      Link: http://lkml.kernel.org/r/20170616065715.18390-3-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Don Zickus <dzickus@redhat.com>
      Reviewed-by: Babu Moger <babu.moger@oracle.com>
      Tested-by: Babu Moger <babu.moger@oracle.com>	[sparc]
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2e0cff8
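      Roughly the resulting layering in the generic header (a sketch; the
      CONFIG guards are elided):
      
          /* touch_nmi_watchdog() composes the arch hook with the generic
           * softlockup touch instead of being wholly arch-provided */
          static inline void touch_nmi_watchdog(void)
          {
                  arch_touch_nmi_watchdog();      /* arch hard-lockup part */
                  touch_softlockup_watchdog();    /* generic part is never
                                                   * missed */
          }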
    • kernel/watchdog: remove unused declaration · 24bb4461
      Committed by Nicholas Piggin
      Patch series "Improve watchdog config for arch watchdogs", v4.
      
      A series to make the hardlockup watchdog more easily replaceable by arch
      code.  The last patch provides some justification for why we want to do
      this (existing sparc watchdog is another that could benefit).
      
      This patch (of 5):
      
      Remove unused declaration.
      
      Link: http://lkml.kernel.org/r/20170616065715.18390-2-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Don Zickus <dzickus@redhat.com>
      Reviewed-by: Babu Moger <babu.moger@oracle.com>
      Tested-by: Babu Moger <babu.moger@oracle.com>	[sparc]
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24bb4461
    • include/linux/sem.h: correctly document sem_ctime · 2cd648c1
      Committed by Manfred Spraul
      sem_ctime is initialized to the semget() time and then updated at every
      semctl() that changes the array.
      
      Thus it does not represent the time of the last change.
      
      Especially, semop() calls are only stored in sem_otime, not in
      sem_ctime.
      
      This is already described in ipc/sem.c; I just overlooked that there
      are comments in include/linux/sem.h and man semctl(2) as well.
      
      So: correct the wrong comments.
      
      Link: http://lkml.kernel.org/r/20170515171912.6298-4-manfred@colorfullife.com
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: <1vier1@web.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2cd648c1
    • ipc: merge ipc_rcu and kern_ipc_perm · dba4cdd3
      Committed by Manfred Spraul
      ipc has two management structures that exist for every id:
       - struct kern_ipc_perm; it contains e.g. the permissions.
       - struct ipc_rcu; it contains the rcu head for rcu handling and the
         refcount.
      
      The patch merges both structures (see the sketch after this entry).
      
      As a bonus, we may save one cacheline, because both structures are
      cacheline aligned.  In addition, it reduces the number of casts;
      instead, most codepaths can use container_of.
      
      To simplify the code, ipc_rcu_alloc initializes the allocation to 0.
      
      [manfred@colorfullife.com: really include the memset() into ipc_alloc_rcu()]
        Link: http://lkml.kernel.org/r/564f8612-0601-b267-514f-a9f650ec9b32@colorfullife.com
      Link: http://lkml.kernel.org/r/20170525185107.12869-3-manfred@colorfullife.com
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dba4cdd3
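      The gist of the merge, sketched (only the relevant fields are shown,
      and the helper name below is illustrative):
      
          /* the rcu head and refcount fold into the per-id perm struct */
          struct kern_ipc_perm {
                  /* ... key, uid/gid, mode, sequence number ... */
                  atomic_t refcount;
                  struct rcu_head rcu;
          } ____cacheline_aligned_in_smp;
          
          /* casts disappear; codepaths use container_of() instead, e.g. */
          static struct sem_array *to_sem_array(struct kern_ipc_perm *ipcp)
          {
                  return container_of(ipcp, struct sem_array, sem_perm);
          }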
    • ipc/sem.c: remove sem_base, embed struct sem · 1a233956
      Committed by Manfred Spraul
      sma->sem_base is initialized with
      
      	sma->sem_base = (struct sem *) &sma[1];
      
      The current code has four problems:
       - There is an unnecessary pointer dereference - sem_base is not
         needed.
       - Alignment for struct sem only works by chance.
       - The current code causes false positives in static code analysis.
       - This is a cast between different non-void types, which the future
         randstruct GCC plugin warns about.
      
      And, as a bonus, the code size gets smaller (a layout sketch follows
      this entry):
      
        Before:
          0 .text         00003770
        After:
          0 .text         0000374e
      
      [manfred@colorfullife.com: s/[0]/[]/, per hch]
        Link: http://lkml.kernel.org/r/20170525185107.12869-2-manfred@colorfullife.com
      Link: http://lkml.kernel.org/r/20170515171912.6298-2-manfred@colorfullife.com
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: <1vier1@web.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a233956
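      The layout change, sketched (per the description and the s/[0]/[]/
      note above; unrelated fields elided):
      
          /* embed the per-semaphore array as a flexible array member */
          struct sem_array {
                  struct kern_ipc_perm sem_perm;
                  /* ... timestamps, pending lists, counters ... */
                  struct sem sems[];      /* replaces the old
                                           * (struct sem *)&sma[1] cast */
          };
          
          /* accesses change from sma->sem_base[i] to sma->sems[i] */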
    • fault-inject: support systematic fault injection · e41d5818
      Committed by Dmitry Vyukov
      Add a /proc/self/task/<current-tid>/fail-nth file that allows
      systematically failing the 0th, 1st, 2nd, and so on call (a usage
      sketch follows this entry).  Excerpt from the added documentation:
      
       "Write to this file of integer N makes N-th call in the current task
        fail (N is 0-based). Read from this file returns a single char 'Y' or
        'N' that says if the fault setup with a previous write to this file
        was injected or not, and disables the fault if it wasn't yet injected.
        Note that this file enables all types of faults (slab, futex, etc).
        This setting takes precedence over all other generic settings like
        probability, interval, times, etc. But per-capability settings (e.g.
        fail_futex/ignore-private) take precedence over it. This feature is
        intended for systematic testing of faults in a single system call. See
        an example below"
      
      Why add a new setting:
      1. Existing settings are global rather than per-task,
         so parallel testing is not possible.
      2. attr->interval is close, but it depends on attr->count,
         which is not reset to 0, so interval does not work as expected.
      3. Trying to model this with existing settings requires manipulating
         all of probability, interval, times, space, task-filter and the
         unexposed count and per-task make-it-fail files.
      4. Existing settings are per-failure-type, and the set of failure
         types is potentially expanding.
      5. make-it-fail can't be changed by an unprivileged user, and
         aggressive stress testing is better done as an unprivileged user.
         Similarly, this would require opening the debugfs files to the
         unprivileged user, who would need to reopen at least the times
         file (not possible to pre-open before dropping privs).
      
      The proposed interface solves all of the above (see the example).
      
      We want to integrate this into the syzkaller fuzzer.  A prototype
      found 10 bugs in the kernel on the first day of usage:
      
        https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance
      
      I've made the current interface work with all types of our sandboxes.
      For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to
      make /proc entries non-root owned.  So I am fine with the current
      version of the code.
      
      [akpm@linux-foundation.org: fix build]
      Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e41d5818
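      A hedged userspace sketch of the workflow from the documentation
      excerpt (the written index and the call under test are illustrative):
      
          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>
          #include <sys/syscall.h>
          
          int main(void)
          {
                  char path[64], res;
                  int fd;
          
                  snprintf(path, sizeof(path),
                           "/proc/self/task/%ld/fail-nth",
                           (long)syscall(SYS_gettid));
                  fd = open(path, O_RDWR);
          
                  write(fd, "5", 1);      /* fail call number 5 (0-based) */
                  /* ... issue the system call under test here ... */
                  pread(fd, &res, 1, 0);  /* 'Y' means the fault was
                                           * injected; reading also disarms
                                           * a not-yet-injected fault */
                  close(fd);
                  return 0;
          }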
    • kcmp: fs/epoll: wrap kcmp code with CONFIG_CHECKPOINT_RESTORE · 92ef6da3
      Committed by Cyrill Gorcunov
      The kcmp syscall is built only if CONFIG_CHECKPOINT_RESTORE is
      selected, so wrap the appropriate helpers in the epoll code with this
      config option to build them conditionally.
      
      Link: http://lkml.kernel.org/r/20170513083456.GG1881@uranus.lan
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Reported-by: Andrew Morton <akpm@linuxfoundation.org>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92ef6da3
    • kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files · 0791e364
      Committed by Cyrill Gorcunov
      With the current epoll architecture, target files are addressed with
      a file_struct and a file descriptor number, where the latter is not
      unique.  Moreover, files can be transferred from another process via
      a unix socket, added into the epoll queue, and then closed, so we
      won't find the descriptor in the task's fdinfo list.
      
      Thus, to checkpoint and restore such processes, CRIU needs to find
      out where exactly the target file is present, to add it into the
      epoll queue.  For this sake one can use the kcmp call, where some
      particular target file from the queue is compared with an arbitrary
      file passed as an argument.
      
      Because epoll target files can have the same file descriptor number
      but a different file_struct, a caller should explicitly specify the
      offset within.
      
      To test whether some particular file matches an entry inside an
      epoll instance, one has to
      
       - fill a kcmp_epoll_slot structure with the epoll file descriptor,
         the target file number and the target file offset (if only
         one target is present, it should be 0)
      
       - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
          - the kernel fetches the file pointer matching file descriptor
            @fd of pid1
          - looks up the file struct in the epoll queue of pid2 and returns
            the traditional 0,1,2 result for sorting purposes
      
      A userspace sketch follows this entry.
      
      Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0791e364
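      A hedged sketch of the call from CRIU-like userspace (kcmp has no
      libc wrapper, hence raw syscall(); the pids and descriptors are
      illustrative):
      
          #include <unistd.h>
          #include <sys/syscall.h>
          #include <linux/kcmp.h>
          
          /* returns 0 if @fd in @pid1 is the file at (@epfd, @tfd, 0)
           * inside @pid2's epoll; 1 or 2 give an ordering for sorting */
          static int epoll_has_file(pid_t pid1, pid_t pid2, int fd,
                                    int epfd, int tfd)
          {
                  struct kcmp_epoll_slot slot = {
                          .efd  = epfd,   /* epoll descriptor in pid2 */
                          .tfd  = tfd,    /* target file number inside it */
                          .toff = 0,      /* 0: target occurs only once */
                  };
          
                  return syscall(SYS_kcmp, pid1, pid2, KCMP_EPOLL_TFD,
                                 fd, (unsigned long)&slot);
          }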