1. 04 May, 2016 1 commit
    • signals/sigaltstack: Report current flag bits in sigaltstack() · 0318bc8a
      Andy Lutomirski committed
      sigaltstack()'s reported previous state uses a somewhat odd
      convention, but the concept of flag bits is new, and we can do the
      flag bits sensibly.  Specifically, let's just report them directly.
      
      This will allow saving and restoring the sigaltstack state using
      sigaltstack() to work correctly.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Amanieu d'Antras <amanieu@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stas Sergeev <stsp@list.ru>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/94b291ec9fd47741a9264851e316e158ded0b00d.1462296606.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 03 May, 2016 2 commits
    • signals/sigaltstack: Implement SS_AUTODISARM flag · 2a742138
      Stas Sergeev committed
      This patch implements the SS_AUTODISARM flag that can be OR-ed with
      SS_ONSTACK when forming ss_flags.
      
      When this flag is set, the alternate signal stack is disabled on entry
      to the signal handler; more precisely, right after the sigaltstack
      state (sas) has been saved to uc_stack. On return from the signal
      handler, the sigaltstack state is restored from uc_stack.
      
      With this flag it is safe to switch away from a signal handler using
      swapcontext(). Without it, a subsequent signal would corrupt the
      state of the switched-away signal handler.
      
      To detect the support of this functionality, one can do:
      
        err = sigaltstack(SS_DISABLE | SS_AUTODISARM);
        if (err && errno == EINVAL)
      	unsupported();
      Signed-off-by: Stas Sergeev <stsp@list.ru>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Amanieu d'Antras <amanieu@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Moore <pmoore@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1460665206-13646-4-git-send-email-stsp@list.ru
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • signals/sigaltstack: Prepare to add new SS_xxx flags · 407bc16a
      Stas Sergeev committed
      This patch adds SS_FLAG_BITS, the mask that separates sigaltstack
      mode values from bit-flags. Since there are no bit-flags yet, the
      mask is defined as 0. The flags are added by subsequent patches.
      With every new flag, the mask should have the appropriate bit cleared.
      
      This ensures that if a flag is used on a kernel that doesn't
      support it, -EINVAL is returned, because such a flag is treated
      as an invalid mode rather than as a bit-flag.
      
      That way the existence of particular features can be probed
      at run-time.
      
      This change was suggested by Andy Lutomirski:
      
        https://lkml.org/lkml/2016/3/6/158
      Signed-off-by: Stas Sergeev <stsp@list.ru>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Amanieu d'Antras <amanieu@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1460665206-13646-3-git-send-email-stsp@list.ru
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 29 April, 2016 4 commits
  4. 28 April, 2016 1 commit
    • perf/core: Fix perf_event_open() vs. execve() race · 79c9ce57
      Peter Zijlstra committed
      Jann reported that the ptrace_may_access() check in
      find_lively_task_by_vpid() is racy against exec().
      
      Specifically:
      
        perf_event_open()		execve()
      
        ptrace_may_access()
      				commit_creds()
        ...				if (get_dumpable() != SUID_DUMP_USER)
      				  perf_event_exit_task();
        perf_install_in_context()
      
      would result in installing a counter across the creds boundary.
      
      Fix this by wrapping most of perf_event_open() in cred_guard_mutex.
      This should be fine as perf_event_exit_task() is already called with
      cred_guard_mutex held, so all perf locks already nest inside it.
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 27 April, 2016 1 commit
  6. 26 April, 2016 2 commits
    • workqueue: fix ghost PENDING flag while doing MQ IO · 346c09f8
      Roman Pen committed
      A bug in the workqueue code leads to a stalled IO request in MQ
      ctx->rq_list, with the following backtrace:
      
      [  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
      [  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
      [  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
      [  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
      [  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
      [  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
      [  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
      [  601.350965] Call Trace:
      [  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
      [  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
      [  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
      [  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
      [  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
      [  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
      [  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
      [  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
      [  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
      [  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
      [  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
      [  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
      [  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
      [  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
      [  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
      [  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
      [  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
      [  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
      [  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
      [  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
      [  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50
      
      The underlying device is a null_blk, with default parameters:
      
        queue_mode    = MQ
        submit_queues = 1
      
      Verification that nullb0 has something inflight:
      
      root@pserver8:~# cat /sys/block/nullb0/inflight
             0        1
      root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
      ...
      /sys/block/nullb0/mq/0/cpu2/rq_list
      CTX pending:
              ffff8838038e2400
      ...
      
      During debugging it became clear that the stalled request is always
      inserted into the rq_list via the following path:
      
         save_stack_trace_tsk + 34
         blk_mq_insert_requests + 231
         blk_mq_flush_plug_list + 281
         blk_flush_plug_list + 199
         wait_on_page_bit + 192
         __filemap_fdatawait_range + 228
         filemap_fdatawait_range + 20
         filemap_write_and_wait_range + 63
         blkdev_fsync + 27
         vfs_fsync_range + 73
         blkdev_write_iter + 202
         __vfs_write + 170
         vfs_write + 169
         kernel_write + 56
      
      So blk_flush_plug_list() was called with from_schedule == true.
      
      If from_schedule is true, blk_mq_insert_requests() ultimately offloads
      execution of __blk_mq_run_hw_queue() to the kblockd workqueue,
      i.e. it calls kblockd_schedule_delayed_work_on().
      
      That means that we race with another CPU, which is about to execute
      the __blk_mq_run_hw_queue() work.
      
      Further debugging shows the following traces from different CPUs:
      
        CPU#0                                  CPU#1
        ----------------------------------     -------------------------------
        request A inserted
        STORE hctx->ctx_map[0] bit marked
        kblockd_schedule...() returns 1
        <schedule to kblockd workqueue>
                                               request B inserted
                                               STORE hctx->ctx_map[1] bit marked
                                               kblockd_schedule...() returns 0
        *** WORK PENDING bit is cleared ***
        flush_busy_ctxs() is executed, but
        bit 1, set by CPU#1, is not observed
      
      As a result, request B stays pending forever.
      
      This behaviour can be explained by a speculative LOAD of hctx->ctx_map on
      CPU#0, which is reordered with the clearing of the PENDING bit and executed
      _before_ the actual STORE of bit 1 on CPU#1.
      
      The proper fix is an explicit full barrier <mfence>, which guarantees
      that the clearing of the PENDING bit is executed before all possible
      speculative LOADs or STOREs inside the actual work function.
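The ordering requirement can be modelled in userspace with C11 atomics. This is only a sketch under stated assumptions, not the kernel change: `work_data`, `ctx_map`, `insert_request()` and `run_work()` are hypothetical stand-ins, the sequentially consistent fence in `run_work()` plays the role of the full barrier described above, and the fence on the producer side is an assumption standing in for the ordering the kernel gets from its atomic test-and-set of the PENDING bit.

```c
#include <assert.h>
#include <stdatomic.h>

#define WORK_PENDING 1UL

static atomic_ulong work_data;	/* models the work item's PENDING bit */
static atomic_ulong ctx_map;	/* models hctx->ctx_map */

/* Producer: mark a ctx bit, then try to queue; returns 1 if newly queued. */
static int insert_request(unsigned int ctx)
{
	atomic_fetch_or_explicit(&ctx_map, 1UL << ctx, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return !(atomic_fetch_or_explicit(&work_data, WORK_PENDING,
					  memory_order_relaxed) & WORK_PENDING);
}

/* Worker: clear PENDING, full barrier, and only then look at the map. */
static unsigned long run_work(void)
{
	atomic_fetch_and_explicit(&work_data, ~WORK_PENDING,
				  memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* models the fix */
	return atomic_exchange_explicit(&ctx_map, 0, memory_order_relaxed);
}

/* Single-threaded walk through the sequence from the table above. */
static void demo_pending(void)
{
	assert(insert_request(0) == 1);	/* PENDING was clear: work queued */
	assert(insert_request(1) == 0);	/* already pending: relies on worker */
	assert(run_work() == 0x3UL);	/* worker observes both bits */
	assert(insert_request(2) == 1);	/* PENDING cleared, queued again */
}
```

With the fence in place, a concurrent producer either sees PENDING already cleared and queues the work again, or the worker sees the producer's bit in the map; without it, both sides may miss each other, which is the stall described in this commit.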
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
      Cc: Michael Wang <yun.wang@profitbricks.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback · 5cf1cacb
      Tejun Heo committed
      Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset
      kicks off asynchronous NUMA node migration if necessary during task
      migration and flushes it from cpuset_post_attach_flush() which is
      called at the end of __cgroup_procs_write().  This is to avoid
      performing migration with cgroup_threadgroup_rwsem write-locked which
      can lead to deadlock through dependency on kworker creation.
      
      memcg has a similar issue with charge moving, so let's convert it to
      an official callback rather than the current one-off cpuset specific
      function.  This patch adds cgroup_subsys->post_attach callback and
      makes cpuset register cpuset_post_attach_flush() as its ->post_attach.
      
      The conversion is mostly one-to-one except that the new callback is
      called under cgroup_mutex.  This is to guarantee that no other
      migration operations are started before ->post_attach callbacks are
      finished.  cgroup_mutex is one of the outermost mutexes in the system
      and has never been and shouldn't be a problem.  We can add specialized
      synchronization around __cgroup_procs_write() but I don't think
      there's any noticeable benefit.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
  7. 23 April, 2016 3 commits
  8. 22 April, 2016 1 commit
    • cpu/hotplug: Fix rollback during error-out in __cpu_disable() · 3b9d6da6
      Sebastian Andrzej Siewior committed
      The recent introduction of the hotplug thread which invokes the callbacks on
      the plugged cpu, caused the following regression:
      
      If takedown_cpu() fails, then we run into several issues:
      
       1) The rollback of the target cpu states is not invoked. That leaves the smp
          threads and the hotplug thread in disabled state.
      
       2) notify_online() is executed due to a missing skip_onerr flag. As a
          result, both CPU_DOWN_FAILED and CPU_ONLINE notifications are invoked,
          which confuses quite a few notifiers.
      
       3) The CPU_DOWN_FAILED notification is not invoked on the target CPU. That's
          not an issue per se, but it is inconsistent and in consequence blocks the
          patches which rely on these states being invoked on the target CPU and not
          on the controlling cpu. It also does not preserve the strict call order on
          rollback which is problematic for the ongoing state machine conversion as
          well.
      
      To fix this we add a rollback flag to the remote callback machinery and invoke
      the rollback including the CPU_DOWN_FAILED notification on the remote
      cpu. Further mark the notify online state with 'skip_onerr' so we don't get a
      double invocation.
      
      This workaround will go away once we have moved the unplug invocation to the
      target cpu itself.
      
      [ tglx: Massaged changelog and moved the CPU_DOWN_FAILED notification to the
        	target cpu ]
      
      Fixes: 4cb28ced ("cpu/hotplug: Create hotplug threads")
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: linux-s390@vger.kernel.org
      Cc: rt@linutronix.de
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Link: http://lkml.kernel.org/r/20160408124015.GA21960@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  9. 21 April, 2016 2 commits
  10. 20 April, 2016 1 commit
  11. 19 April, 2016 1 commit
    • locking/pvqspinlock: Fix division by zero in qstat_read() · 66876595
      Davidlohr Bueso committed
      While playing with the qstat statistics (in <debugfs>/qlockstat/) I ran into
      the following splat on a VM when opening pv_hash_hops:
      
        divide error: 0000 [#1] SMP
        ...
        RIP: 0010:[<ffffffff810b61fe>]  [<ffffffff810b61fe>] qstat_read+0x12e/0x1e0
        ...
        Call Trace:
          [<ffffffff811cad7c>] ? mem_cgroup_commit_charge+0x6c/0xd0
          [<ffffffff8119750c>] ? page_add_new_anon_rmap+0x8c/0xd0
          [<ffffffff8118d3b9>] ? handle_mm_fault+0x1439/0x1b40
          [<ffffffff811937a9>] ? do_mmap+0x449/0x550
          [<ffffffff811d3de3>] ? __vfs_read+0x23/0xd0
          [<ffffffff811d4ab2>] ? rw_verify_area+0x52/0xd0
          [<ffffffff811d4bb1>] ? vfs_read+0x81/0x120
          [<ffffffff811d5f12>] ? SyS_read+0x42/0xa0
          [<ffffffff815720f6>] ? entry_SYSCALL_64_fastpath+0x1e/0xa8
      
      Fix this by verifying that qstat_pv_kick_unlock is in fact non-zero,
      similarly to what the qstat_pv_latency_wake case does. If nothing
      else, a zero count can result from resetting the statistics, so having
      0 kicks is quite valid in this context.
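The guard amounts to checking the divisor before computing an average. A minimal sketch follows, assuming a hypothetical helper `avg_hops_x100()` that reports the average in hundredths; it is not the kernel's qstat_read() code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel function: average hash hops per
 * kick in hundredths, or 0 when no kicks have been recorded.
 */
static uint64_t avg_hops_x100(uint64_t hops, uint64_t kicks)
{
	if (!kicks)		/* counters may have just been reset */
		return 0;
	return (hops * 100) / kicks;
}

/* Usage: a zero kick count yields 0 instead of a divide error. */
static void demo_avg_hops(void)
{
	assert(avg_hops_x100(10, 4) == 250);	/* 2.50 hops per kick */
	assert(avg_hops_x100(5, 0) == 0);	/* no division attempted */
}
```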
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Waiman Long <Waiman.Long@hpe.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: waiman.long@hpe.com
      Link: http://lkml.kernel.org/r/1460961103-24953-1-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 15 April, 2016 1 commit
  13. 14 April, 2016 1 commit
  14. 05 April, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov committed
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely that it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
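To illustrate what the renames mean for calling code, here is a small userspace sketch. The 4 KiB page size and the helper names `file_page_index()` and `offset_in_file_page()` are assumptions for illustration; only the macro names come from the list above.

```c
#include <assert.h>

/* Assumption for illustration: 4 KiB pages. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))

/* Formerly this would have been written with PAGE_CACHE_SHIFT. */
static unsigned long file_page_index(unsigned long pos)
{
	return pos >> PAGE_SHIFT;
}

/* Formerly this would have been written with PAGE_CACHE_MASK. */
static unsigned long offset_in_file_page(unsigned long pos)
{
	return pos & ~PAGE_MASK;
}

/* Usage: byte 5 of the third page maps to index 2, offset 5. */
static void demo_page_macros(void)
{
	assert(file_page_index(2 * PAGE_SIZE + 5) == 2);
	assert(offset_in_file_page(2 * PAGE_SIZE + 5) == 5);
}
```

Since both macro families always had the same values, the conversion changes spelling only and leaves behaviour untouched.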
      
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 04 April, 2016 1 commit
  16. 31 March, 2016 3 commits
    • locking/lockdep: Print chain_key collision information · 39e2e173
      Alfredo Alvarez Fernandez committed
      A sequence of pairs [class_idx -> corresponding chain_key iteration]
      is printed for both the current held_lock chain and the cached chain.
      
      That exposes the two different class_idx sequences that led to that
      particular hash value.
      
      This helps with debugging hash chain collision reports.
      Signed-off-by: Alfredo Alvarez Fernandez <alfredoalvarezfernandez@gmail.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: sedat.dilek@gmail.com
      Cc: tytso@mit.edu
      Link: http://lkml.kernel.org/r/1459357416-19190-1-git-send-email-alfredoalvarezernandez@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Don't leak event in the syscall error path · 201c2f85
      Alexander Shishkin committed
      In the error path, event_file not being NULL is used to determine
      whether the event itself still needs to be free'd, so fix it up to
      avoid leaking.
      Reported-by: Leon Yu <chianglungyu@gmail.com>
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 13005627 ("perf: Do not double free")
      Link: http://lkml.kernel.org/r/87twk06yxp.fsf@ashishki-desk.ger.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix time tracking bug with multiplexing · 8fdc6539
      Peter Zijlstra committed
      Stephane reported that commit:
      
        3cbaa590 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
      
      introduced a regression wrt. time tracking, as easily observed by:
      
      > This patch introduce a bug in the time tracking of events when
      > multiplexing is used.
      >
      > The issue is easily reproducible with the following perf run:
      >
      >  $ perf stat -a -C 0 -e branches,branches,branches,branches,branches,branches -I 1000
      >      1.000730239            652,394      branches   (66.41%)
      >      1.000730239            597,809      branches   (66.41%)
      >      1.000730239            593,870      branches   (66.63%)
      >      1.000730239            651,440      branches   (67.03%)
      >      1.000730239            656,725      branches   (66.96%)
      >      1.000730239      <not counted>      branches
      >
      > One branches event is shown as not having run. Yet, with
      > multiplexing, all events should run especially with a 1s (-I 1000)
      > interval. The delta for time_running comes out to 0. Yet, the event
      > has run because the kernel is actually multiplexing the events. The
      > problem is that the time tracking in the kernel and especially in
      > ctx_sched_out() is wrong now.
      >
      > The problem is that in case that the kernel enters ctx_sched_out() with the
      > following state:
      >    ctx->is_active=0x7 event_type=0x1
      >    Call Trace:
      >     [<ffffffff813ddd41>] dump_stack+0x63/0x82
      >     [<ffffffff81182bdc>] ctx_sched_out+0x2bc/0x2d0
      >     [<ffffffff81183896>] perf_mux_hrtimer_handler+0xf6/0x2c0
      >     [<ffffffff811837a0>] ? __perf_install_in_context+0x130/0x130
      >     [<ffffffff810f5818>] __hrtimer_run_queues+0xf8/0x2f0
      >     [<ffffffff810f6097>] hrtimer_interrupt+0xb7/0x1d0
      >     [<ffffffff810509a8>] local_apic_timer_interrupt+0x38/0x60
      >     [<ffffffff8175ca9d>] smp_apic_timer_interrupt+0x3d/0x50
      >     [<ffffffff8175ac7c>] apic_timer_interrupt+0x8c/0xa0
      >
      > In that case, the test:
      >       if (is_active & EVENT_TIME)
      >
      > will be false and the time will not be updated. Time must always be updated on
      > sched out.
      
      Fix this by always updating time if EVENT_TIME was set, as opposed to
      only updating time when EVENT_TIME changed.
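The difference between the two conditions can be shown with a small model of the flag arithmetic. This is a simplified sketch, not the kernel's ctx_sched_out(): the helper names are hypothetical, the EVENT_* values are assumptions chosen to match the quoted state (is_active=0x7, event_type=0x1), and the model ignores everything except the decision whether time gets updated.

```c
#include <assert.h>

/* Assumed values, chosen to match the state quoted in the report. */
#define EVENT_FLEXIBLE	0x1
#define EVENT_PINNED	0x2
#define EVENT_TIME	0x4

/* Before the fix: time is updated only if the EVENT_TIME bit changed. */
static int updates_time_old(int is_active, int event_type)
{
	int now_active = is_active & ~event_type;
	int changed = is_active ^ now_active;

	return (changed & EVENT_TIME) != 0;
}

/* After the fix: time is updated whenever EVENT_TIME was set. */
static int updates_time_new(int is_active, int event_type)
{
	(void)event_type;	/* the decision no longer depends on it */
	return (is_active & EVENT_TIME) != 0;
}

/* Usage: the reported state skips the update before and performs it after. */
static void demo_event_time(void)
{
	assert(updates_time_old(0x7, EVENT_FLEXIBLE) == 0);
	assert(updates_time_new(0x7, EVENT_FLEXIBLE) == 1);
}
```

In the reported case only the flexible events are scheduled out, so EVENT_TIME stays set and never shows up among the changed bits, which is why the old test was false.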
      Reported-by: Stephane Eranian <eranian@google.com>
      Tested-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Cc: namhyung@kernel.org
      Fixes: 3cbaa590 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
      Link: http://lkml.kernel.org/r/20160329072644.GB3408@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 29 March, 2016 2 commits
  18. 26 March, 2016 3 commits
    • arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections · be7635e7
      Alexander Potapenko committed
      KASAN needs to know whether the allocation happens in an IRQ handler.
      This lets us strip everything below the IRQ entry point to reduce the
      number of unique stack traces needed to be stored.
      
      Move the definition of __irq_entry to <linux/interrupt.h> so that the
      users don't need to pull in <linux/ftrace.h>.  Also introduce the
      __softirq_entry macro which is similar to __irq_entry, but puts the
      corresponding functions into the .softirqentry.text section.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space · 36324a99
      Michal Hocko committed
      When oom_reaper manages to unmap all the eligible vmas, there shouldn't
      be much freeable memory held by the oom victim left anymore, so it
      makes sense to clear the TIF_MEMDIE flag for the victim and allow the
      OOM killer to select another task.
      
      The lack of TIF_MEMDIE also means that the victim cannot access memory
      reserves anymore, but that shouldn't be a problem because it would get
      the access again if it needs to allocate and hits the OOM killer again
      due to the fatal_signal_pending or PF_EXITING check.  We can safely
      hide the task from the OOM killer because it is clearly not a good
      candidate anymore as everything reclaimable has been torn down already.
      
      This patch makes it possible to cap the time an OOM victim can keep
      TIF_MEMDIE and thus hold off further global OOM killer actions, provided
      the oom reaper is able to take mmap_sem for the associated mm struct.
      This is not guaranteed now, but further steps should make sure that
      taking mmap_sem for write is blocked killable, which will help to reduce
      such lock contention.  This is not done by this patch.
      
      Note that exit_oom_victim might be called on a remote task from
      __oom_reap_task now, so we have to check and clear the flag atomically;
      otherwise we might race and underflow oom_victims or wake up waiters too
      early.
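The need for an atomic check-and-clear can be illustrated with a userspace model built on C11 atomics. This is a sketch under stated assumptions, not the kernel code: `victim_flags`, `oom_victim_count`, `mark_victim()` and `exit_victim()` are hypothetical stand-ins for the thread flag, the victim counter and the two paths that touch them.

```c
#include <assert.h>
#include <stdatomic.h>

#define TIF_MEMDIE_BIT 1U

static atomic_uint victim_flags;	/* models the victim's thread flags */
static atomic_int oom_victim_count;	/* models the global victim counter */

/* Mark the task as an OOM victim; count it only on the 0 -> 1 transition. */
static void mark_victim(void)
{
	if (!(atomic_fetch_or(&victim_flags, TIF_MEMDIE_BIT) & TIF_MEMDIE_BIT))
		atomic_fetch_add(&oom_victim_count, 1);
}

/* May be called by the victim and by the reaper: only one caller wins. */
static void exit_victim(void)
{
	if (!(atomic_fetch_and(&victim_flags, ~TIF_MEMDIE_BIT) & TIF_MEMDIE_BIT))
		return;		/* flag was already cleared by the other path */
	atomic_fetch_sub(&oom_victim_count, 1);
}

/* Usage: a second exit call must not drive the counter below zero. */
static void demo_exit_victim(void)
{
	mark_victim();
	exit_victim();
	exit_victim();
	assert(atomic_load(&oom_victim_count) == 0);
}
```

With a separate test followed by a clear, both callers could observe the flag as set and decrement twice, which is the underflow the commit message warns about.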
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sched: add schedule_timeout_idle() · 69b27baf
      Andrew Morton committed
      This will be needed in the patch "mm, oom: introduce oom reaper".
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 25 March, 2016 1 commit
  20. 23 March, 2016 8 commits
    • PM / sleep: Clear pm_suspend_global_flags upon hibernate · 27614273
      Lukas Wunner committed
      When suspending to RAM, waking up and later suspending to disk,
      we gratuitously runtime resume devices after the thaw phase.
      This does not occur if we always suspend to RAM or always to disk.
      
      pm_complete_with_resume_check(), which gets called from
      pci_pm_complete() among others, schedules a runtime resume
      if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during
      a suspend-to-RAM cycle. It is cleared at the beginning of
      the suspend-to-RAM cycle but not afterwards and it is not
      cleared during a suspend-to-disk cycle at all. Fix it.
      
      Fixes: ef25ba04 (PM / sleep: Add flags to indicate platform firmware involvement)
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      27614273
    • J
      kernel/...: convert pr_warning to pr_warn · a395d6a7
      Committed by Joe Perches
      Use the more common logging method with the eventual goal of removing
      pr_warning altogether.
      
      Miscellanea:
      
       - Realign arguments
       - Coalesce formats
       - Add missing space between a few coalesced formats
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[kernel/power/suspend.c]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a395d6a7
    • B
      memremap: add MEMREMAP_WC flag · c907e0eb
      Committed by Brian Starkey
      Add a flag to memremap() for writecombine mappings.  Mappings satisfied
      by this flag will not be cached, however writes may be delayed or
      combined into more efficient bursts.  This is most suitable for buffers
      written sequentially by the CPU for use by other DMA devices.
      Signed-off-by: Brian Starkey <brian.starkey@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c907e0eb
    • B
      memremap: don't modify flags · cf61e2a1
      Committed by Brian Starkey
      These patches implement a MEMREMAP_WC flag for memremap(), which can be
      used to obtain writecombine mappings.  This is then used for setting up
      dma_coherent_mem regions which use the DMA_MEMORY_MAP flag.
      
      The motivation is to fix an alignment fault on arm64, and the suggestion
      to implement MEMREMAP_WC for this case was made at [1].  That particular
      issue is handled in patch 4, which makes sure that the appropriate
      memset function is used when zeroing allocations mapped as IO memory.
      
      This patch (of 4):
      
      Don't modify the flags input argument to memremap(). MEMREMAP_WB is
      already a special case so we can check for it directly instead of
      clearing flag bits in each mapper.
      Signed-off-by: Brian Starkey <brian.starkey@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf61e2a1
    • H
      kernel/signal.c: add compile-time check for __ARCH_SI_PREAMBLE_SIZE · 41b27154
      Committed by Helge Deller
      The value of __ARCH_SI_PREAMBLE_SIZE defines the size (including
      padding) of the part of the struct siginfo that is before the union, and
      it is then used to calculate the needed padding (SI_PAD_SIZE) to make
      the size of struct siginfo equal to 128 (SI_MAX_SIZE) bytes.
      
      Depending on the target architecture and word width it equals either
      3 or 4 times sizeof(int).
      
      Since the very beginning we had __ARCH_SI_PREAMBLE_SIZE wrong on the
      parisc architecture for the 64bit kernel build.  It's even more
      frustrating, because it can easily be checked at compile time if the
      value was defined correctly.
      
      This patch adds such a check for the correctness of
      __ARCH_SI_PREAMBLE_SIZE in the hope that it will prevent existing and
      future architectures from running into the same problem.
      
      I refrained from replacing __ARCH_SI_PREAMBLE_SIZE by offsetof() in
      copy_siginfo() in include/asm-generic/siginfo.h, because a) it doesn't
      make any difference and b) it's used in the Documentation/kmemcheck.txt
      example.
      
      I ran this patch through the 0-DAY kernel test infrastructure and only
      the parisc architecture triggered as expected.  That means that this
      patch should be OK for all major architectures.
      Signed-off-by: Helge Deller <deller@gmx.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41b27154
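The kind of compile-time check described in commit 41b27154 can be sketched in user-space C11. This is an illustration, not the kernel's code: `struct demo_siginfo` and the `DEMO_*` macros are hypothetical simplifications, and the preamble size is derived here from the pointer width, assuming pointers wider than `int` are also aligned to their own size.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, size_t */

#define DEMO_SI_MAX_SIZE 128

/* Assumed preamble: 4 ints when pointers are wider than int (the union
 * is then aligned past the three leading ints), otherwise 3 ints. */
#define DEMO_SI_PREAMBLE_SIZE \
    ((sizeof(void *) > sizeof(int) ? 4 : 3) * sizeof(int))

/* Padding, in ints, needed to bring the whole struct to 128 bytes. */
#define DEMO_SI_PAD_SIZE \
    ((DEMO_SI_MAX_SIZE - DEMO_SI_PREAMBLE_SIZE) / sizeof(int))

struct demo_siginfo {
    int si_signo;
    int si_errno;
    int si_code;
    union {
        int pad[DEMO_SI_PAD_SIZE];
        void *ptr;   /* pointer member determines the union's alignment */
    } fields;
};

/* The build fails if the declared preamble size does not match the
 * real offset of the union, or if the total size drifts from 128. */
static_assert(offsetof(struct demo_siginfo, fields) == DEMO_SI_PREAMBLE_SIZE,
              "preamble size does not match the offset of the union");
static_assert(sizeof(struct demo_siginfo) == DEMO_SI_MAX_SIZE,
              "struct demo_siginfo is not 128 bytes");

/* Run-time accessor, present only so the offset can be inspected. */
size_t demo_preamble_offset(void)
{
    return offsetof(struct demo_siginfo, fields);
}
```

Defining `DEMO_SI_PREAMBLE_SIZE` as a fixed `3 * sizeof(int)` on a typical 64-bit target would make the first assertion fail at build time, which is the class of mistake the commit message describes for the 64-bit parisc build.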
    • D
      kernel: add kcov code coverage · 5c9a8750
      Committed by Dmitry Vyukov
      kcov provides code coverage collection for coverage-guided fuzzing
      (randomized testing).  Coverage-guided fuzzing is a testing technique
      that uses coverage feedback to determine new interesting inputs to a
      system.  A notable user-space example is AFL
      (http://lcamtuf.coredump.cx/afl/).  However, this technique is not
      widely used for kernel testing due to missing compiler and kernel
      support.
      
      kcov does not aim to collect as much coverage as possible.  It aims to
      collect more or less stable coverage that is a function of syscall
      inputs.  To achieve this goal it does not collect coverage in soft/hard
      interrupts, and instrumentation of some inherently non-deterministic or
      non-interesting parts of the kernel is disabled (e.g.  scheduler, locking).
      
      Currently there is a single coverage collection mode (tracing), but the
      API anticipates additional collection modes.  Initially I also
      implemented a second mode which exposes coverage in a fixed-size hash
      table of counters (what Quentin used in his original patch).  I've
      dropped the second mode for simplicity.
      
      This patch adds the necessary support on kernel side.  The complementary
      compiler support was added in gcc revision 231296.
      
      We've used this support to build syzkaller system call fuzzer, which has
      found 90 kernel bugs in just 2 months:
      
        https://github.com/google/syzkaller/wiki/Found-Bugs
      
      We've also found 30+ bugs in our internal systems with syzkaller.
      Another (yet unexplored) direction where kcov coverage would greatly
      help is more traditional "blob mutation".  For example, mounting a
      random blob as a filesystem, or receiving a random blob over wire.
      
      Why not gcov?  A typical fuzzing loop looks as follows: (1) reset
      coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
      typical coverage can be just a dozen basic blocks (e.g.  an invalid
      input).  In such context gcov becomes prohibitively expensive as
      reset/collect coverage steps depend on total number of basic
      blocks/edges in program (in case of kernel it is about 2M).  Cost of
      kcov depends only on number of executed basic blocks/edges.  On top of
      that, kernel requires per-thread coverage because there are always
      background threads and unrelated processes that also produce coverage.
      With inlined gcov instrumentation per-thread coverage is not possible.
      
      kcov exposes kernel PCs and control flow to user-space which is
      insecure.  But debugfs should not be mapped as user accessible.
      
      Based on a patch by Quentin Casasnovas.
      
      [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
      [akpm@linux-foundation.org: unbreak allmodconfig]
      [akpm@linux-foundation.org: follow x86 Makefile layout standards]
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tavis Ormandy <taviso@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Drysdale <drysdale@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c9a8750
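The reset / execute / collect loop described in commit 5c9a8750 can be sketched with a user-space model of a trace buffer. This is not the kcov interface itself: the `demo_*` names, the buffer layout (word 0 holds the number of recorded PCs) and the fake PC values are assumptions made for illustration.

```c
#include <stddef.h>

/* Simplified model of a coverage trace buffer. Assumed layout:
 * area[0] is the number of recorded PCs, area[1..] are the PCs. */
struct demo_cover {
    unsigned long *area;
    size_t size;          /* total words in area, counter word included */
};

/* Step (1): reset. Constant cost, independent of program size. */
void demo_cover_reset(struct demo_cover *c)
{
    c->area[0] = 0;
}

/* Called once per executed basic block. Cost grows only with the number
 * of blocks actually executed; PCs are dropped once the buffer is full. */
void demo_trace_pc(struct demo_cover *c, unsigned long pc)
{
    unsigned long pos = c->area[0] + 1;

    if (pos < c->size) {
        c->area[pos] = pc;
        c->area[0] = pos;
    }
}

/* Step (3): collect. The number of recorded PCs is read from word 0. */
size_t demo_cover_count(const struct demo_cover *c)
{
    return c->area[0];
}

/* Step (2): hypothetical code under test with made-up block addresses.
 * An invalid (negative) input touches a single basic block. */
void demo_run_input(struct demo_cover *c, int input)
{
    demo_trace_pc(c, 0x1000);   /* entry block */
    if (input < 0)
        return;                 /* rejected early */
    demo_trace_pc(c, 0x1010);   /* validation passed */
    demo_trace_pc(c, 0x1020);   /* main work */
}
```

A fuzzer built on such a buffer would call `demo_cover_reset()`, run one input, then read `demo_cover_count()` and the recorded PCs. None of those steps touches a table sized by the total number of basic blocks in the program, which is the cost argument the commit message makes against gcov.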
    • A
      profile: hide unused functions when !CONFIG_PROC_FS · ade356b9
      Committed by Arnd Bergmann
      A couple of functions and variables in the profile implementation are
      used only on SMP systems by the procfs code, but are unused if either
      procfs is disabled or in uniprocessor kernels.  gcc prints a harmless
      warning about the unused symbols:
      
        kernel/profile.c:243:13: error: 'profile_flip_buffers' defined but not used [-Werror=unused-function]
         static void profile_flip_buffers(void)
                     ^
        kernel/profile.c:266:13: error: 'profile_discard_flip_buffers' defined but not used [-Werror=unused-function]
         static void profile_discard_flip_buffers(void)
                     ^
        kernel/profile.c:330:12: error: 'profile_cpu_callback' defined but not used [-Werror=unused-function]
         static int profile_cpu_callback(struct notifier_block *info,
                    ^
      
      This adds further #ifdef to the file, to annotate exactly in which cases
      they are used.  I have done several thousand ARM randconfig kernels with
      this patch applied and no longer get any warnings in this file.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ade356b9
    • H
      panic: change nmi_panic from macro to function · ebc41f20
      Committed by Hidehiro Kawai
      Commit 1717f209 ("panic, x86: Fix re-entrance problem due to panic
      on NMI") and commit 58c5661f ("panic, x86: Allow CPUs to save
      registers even if looping in NMI context") introduced nmi_panic() which
      prevents concurrent/recursive execution of panic().  It also saves
      registers for the crash dump on x86.
      
      However, there are some cases where NMI handlers still use panic().
      This patch set partially replaces them with nmi_panic() in those cases.
      
      Even if this patchset is applied, some NMI or similar handlers (e.g.  MCE
      handler) continue to use panic().  This is because I can't test them
      well and actual problems won't happen.  For example, the possibility
      that normal panic and panic on MCE happen simultaneously is very low.
      
      This patch (of 3):
      
      Convert nmi_panic() to a proper function and export it instead of
      exporting internal implementation details to modules, for obvious
      reasons.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: Javi Merino <javi.merino@arm.com>
      Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com>
      Cc: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebc41f20
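How a function like the one described in commit ebc41f20 might prevent concurrent or recursive panics can be sketched with C11 atomics. This is a user-space model under stated assumptions: the `demo_*` names are hypothetical, and the function returns the decision instead of panicking or stopping a CPU.

```c
#include <stdatomic.h>

#define DEMO_PANIC_CPU_INVALID (-1)

enum demo_panic_action {
    DEMO_DO_PANIC,    /* first CPU to arrive: go ahead and panic */
    DEMO_SELF_STOP,   /* another CPU is already panicking: park this one */
    DEMO_RETURN       /* this CPU is already panicking (re-entry): return */
};

/* Records which CPU, if any, owns the panic path. */
static atomic_int demo_panic_cpu = DEMO_PANIC_CPU_INVALID;

/* A single compare-and-swap decides the owner, so two CPUs can never
 * both conclude that they should run the panic path. */
enum demo_panic_action demo_nmi_panic(int cpu)
{
    int expected = DEMO_PANIC_CPU_INVALID;

    if (atomic_compare_exchange_strong(&demo_panic_cpu, &expected, cpu))
        return DEMO_DO_PANIC;
    if (expected != cpu)
        return DEMO_SELF_STOP;
    return DEMO_RETURN;
}

/* Test helper only; a real panic has no way back. */
void demo_panic_reset(void)
{
    atomic_store(&demo_panic_cpu, DEMO_PANIC_CPU_INVALID);
}
```

Keeping this logic in one exported function rather than a macro means callers see only the function, not the owner variable behind it, which matches the motivation given in the commit message.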