1. 23 Apr, 2016 2 commits
    • perf/core: Add ::write_backward attribute to perf event · 9ecda41a
      Wang Nan committed
      This patch introduces a 'write_backward' bit in perf_event_attr, which
      controls the direction of a ring buffer. When it is set, the corresponding
      ring buffer is written from end to beginning. This feature is designed to
      support reading from an overwritable ring buffer.
      
      A ring buffer can be created by mapping a perf event fd. The kernel puts
      event records into the ring buffer, and user tooling such as perf fetches
      them from the address returned by mmap(). To prevent races between the
      kernel and tooling, they communicate through 'head' and 'tail' pointers.
      The kernel maintains the 'head' pointer, which points to the next free
      area (the tail of the last record). Tooling maintains the 'tail' pointer,
      which points to the tail of the last consumed record (a record that has
      already been fetched). The kernel determines the available space in a ring
      buffer from these two pointers to avoid overwriting unfetched records.
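
The space check described above can be sketched as follows. This is a simplified model, not the kernel's code: it assumes 'head' and 'tail' are free-running byte counters that are never wrapped to the buffer size, and the names rb_free_space() and rb_can_write() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Free space left in a non-overwritable ring buffer.
 * Assumption: 'head' and 'tail' are free-running byte counters, so their
 * difference is the number of bytes written but not yet consumed.
 */
static uint64_t rb_free_space(uint64_t head, uint64_t tail, uint64_t size)
{
    assert(head - tail <= size);    /* the writer never runs past the reader */
    return size - (head - tail);
}

/* Would a record of 'len' bytes fit without overwriting unread data? */
static int rb_can_write(uint64_t head, uint64_t tail, uint64_t size,
                        uint64_t len)
{
    return len <= rb_free_space(head, tail, size);
}
```

With a 128-byte buffer, head at 100 and tail at 40, 60 bytes are unread and 68 bytes remain, so a 68-byte record fits and a 69-byte record does not.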
      
      By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
      Unlike a normal ring buffer, tooling is unable to maintain the 'tail'
      pointer because writing is forbidden. Therefore, for this type of ring
      buffer, the kernel overwrites old records unconditionally, working like a
      flight recorder. This feature would be useful if reading from an
      overwritable ring buffer were as easy as reading from a normal ring
      buffer. However, there's an obscure problem.
      
      The following figure shows a full overwritable ring buffer. In this
      figure, the 'head' pointer points to the end of the last record, and a
      long record 'E' is pending. For a normal ring buffer, a 'tail' pointer
      would have pointed to position (X), so the kernel would know there's no
      more space in the ring buffer. However, for an overwritable ring buffer,
      the kernel ignores the 'tail' pointer.
      
         (X)                              head
          .                                |
          .                                V
          +------+-------+----------+------+---+
          |A....A|B.....B|C........C|D....D|   |
          +------+-------+----------+------+---+
      
      Record 'A' is overwritten by event 'E':
      
            head
             |
             V
          +--+---+-------+----------+------+---+
          |.E|..A|B.....B|C........C|D....D|E..|
          +--+---+-------+----------+------+---+
      
      Now tooling decides to read from this ring buffer. However, neither of
      the two natural positions, 'head' and the start of the ring buffer,
      points to the head of a record. Even though the full ring buffer can be
      accessed by tooling, it is unable to find a position to start decoding.
      
      The first attempt to solve this problem, AFAIK, can be found in [1]. It
      makes the kernel maintain the 'tail' pointer, updating it when the ring
      buffer is half full. However, this approach introduces overhead to the
      fast path. Test results show a 1% overhead [2]. In addition, this method
      utilizes no more than 50% of the records.
      
      Another attempt can be found in [3], which allows putting the size of
      an event at the end of each record. This approach allows tooling to find
      records in a backward manner from the 'head' pointer by reading the size
      of a record from its tail. However, because of alignment requirements, it
      needs 8 bytes to record the size of a record, which is a huge waste. Its
      performance is also not good, because more data needs to be written.
      This approach also introduces some extra branch instructions to the fast
      path.
      
      'write_backward' is a better solution to this problem.
      
      The following figure shows the state of the overwritable ring buffer
      when 'write_backward' is set, before overwriting:
      
             head
              |
              V
          +---+------+----------+-------+------+
          |   |D....D|C........C|B.....B|A....A|
          +---+------+----------+-------+------+
      
      and after overwriting:
                                           head
                                            |
                                            V
          +---+------+----------+-------+---+--+
          |..E|D....D|C........C|B.....B|A..|E.|
          +---+------+----------+-------+---+--+
      
      In each situation, 'head' points to the beginning of the newest record.
      From this record, tooling can iterate over the full ring buffer and fetch
      records one by one.
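
The behaviour shown in the two figures can be modelled with a small user-space simulation. This is only a sketch under stated assumptions: struct rec_header is a simplified stand-in for the real record header, the 64-byte buffer size and all function names are hypothetical, and only headers (not payloads) are stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RB_SIZE 64u                 /* hypothetical size, must be a power of two */
#define RB_MASK (RB_SIZE - 1u)

struct rec_header {                 /* simplified stand-in for a record header */
    uint16_t type;
    uint16_t size;                  /* total record size, header included */
};

struct ring {
    uint8_t  data[RB_SIZE];
    uint64_t head;                  /* free-running offset of the newest record */
};

/* Copy 'len' bytes into the ring at offset 'off', wrapping at the end. */
static void ring_put(struct ring *rb, uint64_t off, const void *src, size_t len)
{
    const uint8_t *s = src;

    for (size_t i = 0; i < len; i++)
        rb->data[(off + i) & RB_MASK] = s[i];
}

/* Copy 'len' bytes out of the ring from offset 'off', wrapping at the end. */
static void ring_get(const struct ring *rb, uint64_t off, void *dst, size_t len)
{
    uint8_t *d = dst;

    for (size_t i = 0; i < len; i++)
        d[i] = rb->data[(off + i) & RB_MASK];
}

/* Backward write: move 'head' down by the record size, then store the header. */
static void ring_write_backward(struct ring *rb, uint16_t type, uint16_t size)
{
    struct rec_header h = { .type = type, .size = size };

    assert(size >= sizeof(h) && size <= RB_SIZE);
    rb->head -= size;
    ring_put(rb, rb->head, &h, sizeof(h));
}

/*
 * Walk forward from 'head', newest record first. Stop at unused space or at
 * a record that would extend past one full buffer length, which means it has
 * been partially overwritten. Returns the number of records decoded.
 */
static int ring_read_all(const struct ring *rb, uint16_t *types, int max)
{
    uint64_t done = 0;              /* bytes consumed so far */
    int n = 0;

    while (n < max && done + sizeof(struct rec_header) <= RB_SIZE) {
        struct rec_header h;

        ring_get(rb, rb->head + done, &h, sizeof(h));
        if (h.size < sizeof(h) || done + h.size > RB_SIZE)
            break;
        types[n++] = h.type;
        done += h.size;
    }
    return n;
}
```

After four 16-byte records the reader sees them newest first. Writing a fifth, 24-byte record clobbers the oldest record and part of the second oldest, and the reader then returns only the three newest records. The model assumes the buffer is not modified while it is being read.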
      
      The only limitation that needs to be considered is back-to-back reading.
      Because user programs are non-deterministic, it is impossible to ensure
      that the ring buffer stays stable during reading. Consider an extreme
      situation: tooling is scheduled out after reading record 'D', then a burst
      of events arrives and eats up the whole ring buffer (one or multiple
      rounds). When the tooling process comes back, reading after 'D' is
      incorrect.
      
      To prevent this problem, we need to find a way to ensure the ring buffer
      is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
      suggested because its overhead is lower than
      ioctl(PERF_EVENT_IOC_ENABLE).
      
      By carefully verifying the 'head' pointer, a reader can avoid pausing the
      ring buffer. For example:
      
          /* A union of all possible events */
          union perf_event event;
      
          p = head = perf_mmap__read_head();
          while (true) {
              /* copy header of next event */
              fetch(&event.header, p, sizeof(event.header));
      
              /* read 'head' pointer */
              head = perf_mmap__read_head();
      
              /* check overwritten: is the header good? */
              if (!verify(sizeof(event.header), p, head))
                  break;
      
              /* copy the whole event */
              fetch(&event, p, event.header.size);
      
              /* read 'head' pointer again */
              head = perf_mmap__read_head();
      
              /* is the whole event good? */
              if (!verify(event.header.size, p, head))
                  break;
              p += event.header.size;
          }
      
      However, the overhead is high because:
      
       a) In-place decoding is not safe.
          Copying-verifying-decoding is required.
       b) Fetching 'head' pointer requires additional synchronization.
      
      (From Alexei Starovoitov:
      
      Even when this trick works, pause is needed for more than stability of
      reading. When we collect the events into overwrite buffer we're waiting
      for some other trigger (like all cpu utilization spike or just one cpu
      running and all others are idle) and when it happens the buffer has
      valuable info from the past. At this point new events are no longer
      interesting and buffer should be paused, events read and unpaused until
      next trigger comes.)
      
      This patch utilizes the event's default overflow_handler introduced
      previously. perf_event_output_backward() is created as the default
      overflow handler for backward ring buffers. To avoid extra overhead on the
      fast path, the original perf_event_output() becomes __perf_event_output()
      and is marked '__always_inline'. In theory, no extra overhead is
      introduced to the fast path.
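
The force-inlined helper with thin wrappers can be illustrated with a minimal sketch. The function names and the arithmetic below are hypothetical and only show the pattern; this is not the kernel's implementation, and it assumes a GCC- or Clang-compatible compiler for the always_inline attribute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Shared body. It is force-inlined and 'backward' is a compile-time constant
 * at each call site, so the compiler can drop the untaken branch in every
 * wrapper and neither direction pays for the other.
 */
static inline __attribute__((always_inline))
uint64_t output_begin_common(uint64_t head, uint32_t len, bool backward)
{
    return backward ? head - len : head + len;
}

/* Forward direction: the new head is after the record. */
static uint64_t output_forward(uint64_t head, uint32_t len)
{
    return output_begin_common(head, len, false);
}

/* Backward direction: the new head is before the record. */
static uint64_t output_backward(uint64_t head, uint32_t len)
{
    return output_begin_common(head, len, true);
}
```

Each wrapper then compiles to a specialised function, which is why adding the second direction should not cost the first one anything at run time.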
      
      Performance testing:
      
      Call 'close(-1)' 3000000 times and use gettimeofday() to measure the
      duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
      system calls. Results are in ns.
      
      Testing environment:
      
        CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
        Kernel : v4.5.0
      
                          MEAN         STDVAR
       BASE            800214.950    2853.083
       PRE1           2253846.700    9997.014
       PRE2           2257495.540    8516.293
       POST           2250896.100    8933.921
      
      Where 'BASE' is pure performance without capturing. 'PRE1' is test
      result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
      patch. 'POST' is test result after this patch. See [4] for the detailed
      experimental setup.
      
      Considering the stdvar, this patch doesn't introduce performance
      overhead to the fast path.
      
       [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
       [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
       [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
       [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: <acme@kernel.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com
      [ Fixed the changelog some more. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Make sysctl_perf_cpu_time_max_percent conform to documentation · b303e7c1
      Peter Zijlstra committed
      Markus reported that 0 should also disable the throttling, as per
      Documentation/sysctl/kernel.txt.
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 91a612ee ("perf/core: Fix dynamic interrupt throttle")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 15 Apr, 2016 1 commit
  3. 05 Apr, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov committed
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely that it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
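
The first rewrite rules rely on both sets of macros having the same values, so the shifts by (PAGE_CACHE_SHIFT - PAGE_SHIFT) are shifts by zero. A minimal user-space sketch of that equivalence, assuming 4 KiB pages and using the hypothetical helpers index_old() and index_new():

```c
#include <assert.h>

#define PAGE_SHIFT        12                    /* assumption: 4 KiB pages */
#define PAGE_SIZE         (1UL << PAGE_SHIFT)

/* The old aliases, defined here as identical to the PAGE_* values. */
#define PAGE_CACHE_SHIFT  PAGE_SHIFT
#define PAGE_CACHE_SIZE   PAGE_SIZE

/* Before: convert a page-cache index to a page index (a shift by zero). */
static unsigned long index_old(unsigned long idx)
{
    return idx << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

/* After: the shift is dropped, as in the first rewrite rule. */
static unsigned long index_new(unsigned long idx)
{
    return idx;
}
```

Because the two functions return the same value for every input, replacing one form with the other does not change behaviour.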
      
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 04 Apr, 2016 1 commit
  5. 31 Mar, 2016 11 commits
    • locking/lockdep: Print chain_key collision information · 39e2e173
      Alfredo Alvarez Fernandez committed
      A sequence of pairs [class_idx -> corresponding chain_key iteration]
      is printed for both the current held_lock chain and the cached chain.
      
      That exposes the two different class_idx sequences that led to that
      particular hash value.
      
      This helps with debugging hash chain collision reports.
      Signed-off-by: Alfredo Alvarez Fernandez <alfredoalvarezfernandez@gmail.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: sedat.dilek@gmail.com
      Cc: tytso@mit.edu
      Link: http://lkml.kernel.org/r/1459357416-19190-1-git-send-email-alfredoalvarezernandez@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/ring_buffer: Prepare writing into the ring-buffer from the end · d1b26c70
      Wang Nan committed
      Convert perf_output_begin() to __perf_output_begin() and make the latter
      function able to write records from the end of the ring-buffer.
      
      The following commits will utilize the 'backward' flag.
      
      This is the core patch to support writing to the ring-buffer backwards,
      which will be introduced by upcoming patches to support reading from
      overwritable ring-buffers.
      
      In theory, this patch should not introduce any extra performance
      overhead since we use always_inline, but it does not hurt to double
      check that assumption:
      
      When CONFIG_OPTIMIZE_INLINING is disabled, the output object is nearly
      identical to the original one. See:
      
         http://lkml.kernel.org/g/56F52E83.70409@huawei.com
      
      When CONFIG_OPTIMIZE_INLINING is enabled, the resulting object file becomes
      smaller:
      
       $ size kernel/events/ring_buffer.o*
         text       data        bss        dec        hex    filename
         4641          4          8       4653       122d kernel/events/ring_buffer.o.old
         4545          4          8       4557       11cd kernel/events/ring_buffer.o.new
      
      Performance testing results:
      
      Call 'close(-1)' 3000000 times and use gettimeofday() to measure the
      duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
      system calls. Results are in ns.
      
      Testing environment:
      
       CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
       Kernel : v4.5.0
      
                           MEAN         STDVAR
        BASE            800214.950    2853.083
        PRE            2253846.700    9997.014
        POST           2257495.540    8516.293
      
      Where 'BASE' is pure performance without capturing. 'PRE' is test
      result of pure 'v4.5.0' kernel. 'POST' is test result after this
      patch.
      
      Considering the stdvar, this patch doesn't hurt performance, within
      noise margin.
      
      For testing details, see:
      
        http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459147292-239310-4-git-send-email-wangnan0@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Set event's default ::overflow_handler() · 1879445d
      Wang Nan committed
      Set a default event->overflow_handler in perf_event_alloc() so that we
      don't need to check event->overflow_handler in __perf_event_overflow().
      The following commits can install a different default overflow_handler.
      
      Initial idea comes from Peter:
      
        http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net
      
      Since the default value of event->overflow_handler is not NULL, existing
      'if (!overflow_handler)' checks need to be changed.
      
      is_default_overflow_handler() is introduced for this.
      
      No extra performance overhead is introduced into the hot path because in the
      original code we still need to read this handler from memory. A conditional
      branch is avoided so actually we remove some instructions.
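
The idea of a non-NULL default handler plus a helper that recognises it can be sketched in plain C. All names here (struct event, default_output(), is_default_handler(), event_overflow()) are hypothetical stand-ins for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct event;
typedef void (*overflow_handler_t)(struct event *ev);

struct event {
    overflow_handler_t overflow_handler;
    int outputs;                    /* counts default-handler calls (demo only) */
};

/* Stand-in for the default output routine. */
static void default_output(struct event *ev)
{
    ev->outputs++;
}

/* At allocation time, install the default when no handler is supplied. */
static void event_init(struct event *ev, overflow_handler_t custom)
{
    ev->outputs = 0;
    ev->overflow_handler = custom ? custom : default_output;
}

/* Replaces the old 'if (!overflow_handler)' style of check. */
static bool is_default_handler(const struct event *ev)
{
    return ev->overflow_handler == default_output;
}

/* Hot path: the handler is always set, so no NULL test is needed. */
static void event_overflow(struct event *ev)
{
    ev->overflow_handler(ev);
}
```

The hot path still loads the pointer from memory, as before, but no longer branches on it.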
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459147292-239310-3-git-send-email-wangnan0@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/ring_buffer: Introduce new ioctl options to pause and resume the ring-buffer · 86e7972f
      Wang Nan committed
      Add a new ioctl() to pause/resume ring-buffer output.
      
      In some situations we want to read from the ring-buffer only when we
      can ensure that nothing writes to it during reading. Without
      this patch we have to turn off all events attached to this ring-buffer
      to achieve this.
      
      This patch is a prerequisite for enabling overwrite support for the
      perf ring-buffer. The following commits will introduce new methods that
      support reading from an overwrite ring buffer. Before reading, the caller
      must ensure the ring buffer is frozen, or the reading is unreliable.
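
From user space, pausing and resuming might look like the sketch below. It assumes a Linux system whose installed kernel headers already define PERF_EVENT_IOC_PAUSE_OUTPUT; the wrapper name rb_set_paused() is hypothetical, and obtaining the perf event file descriptor is left out.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <sys/ioctl.h>

/*
 * Pause (pause != 0) or resume (pause == 0) output into the ring buffer
 * attached to 'perf_fd'. Returns 0 on success, or -1 with errno set.
 */
static int rb_set_paused(int perf_fd, int pause)
{
    return ioctl(perf_fd, PERF_EVENT_IOC_PAUSE_OUTPUT, pause);
}

/*
 * Intended call sequence around a read:
 *
 *     rb_set_paused(fd, 1);    freeze the buffer
 *     ...decode the records...
 *     rb_set_paused(fd, 0);    let the kernel write again
 */
```

Calling the wrapper on an invalid descriptor fails with -1, which makes it easy to check the error path without a real perf event.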
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <pi3orama@163.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • ftrace/perf: Check sample types only for sampling events · 0a74c5b3
      Jiri Olsa committed
      Currently we check the sample type for ftrace:function events
      even if the event is not created as a sampling event. That prevents
      creating an ftrace_function event in counting mode.
      
      Make sure we check sample types only for sampling events.
      
      Before:
        $ sudo perf stat -e ftrace:function ls
        ...
      
         Performance counter stats for 'ls':
      
           <not supported>      ftrace:function
      
               0.001983662 seconds time elapsed
      
      After:
        $ sudo perf stat -e ftrace:function ls
        ...
      
         Performance counter stats for 'ls':
      
                    44,498      ftrace:function
      
               0.037534722 seconds time elapsed
      Suggested-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1458138873-1553-2-git-send-email-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/ring_buffer: Document AUX API usage · af5bb4ed
      Alexander Shishkin committed
      In order to ensure safe AUX buffer management, we rely on the assumption
      that pmu::stop() stops its ongoing AUX transaction and not just the hw.
      
      This patch documents this requirement for the perf_aux_output_{begin,end}()
      APIs.
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/1457098969-21595-4-git-send-email-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Free AUX pages in unmap path · 95ff4ca2
      Alexander Shishkin committed
      Now that we can ensure that when ring buffer's AUX area is on the way
      to getting unmapped new transactions won't start, we only need to stop
      all events that can potentially be writing aux data to our ring buffer.
      
      Having done that, we can safely free the AUX pages and corresponding
      PMU data, as this time it is guaranteed to be the last aux reference
      holder.
      
      This partially reverts:
      
        57ffc5ca ("perf: Fix AUX buffer refcounting")
      
      ... which was made to defer deallocation that was otherwise possible
      from an NMI context. Now it is no longer the case; the last call to
      rb_free_aux() that drops the last AUX reference has to happen in
      perf_mmap_close() on that AUX area.
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/87d1qtz23d.fsf@ashishki-desk.ger.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/ring_buffer: Refuse to begin AUX transaction after rb->aux_mmap_count drops · dcb10a96
      Alexander Shishkin committed
      When a ring buffer's AUX area is unmapped and rb->aux_mmap_count drops to
      zero, new AUX transactions into this buffer can still be started,
      even though the buffer is en route to deallocation.
      
      This patch adds a check to perf_aux_output_begin() for rb->aux_mmap_count
      being zero, in which case there is no point in starting new transactions.
      In other words, ring buffers that pass a certain point in
      perf_mmap_close() will not have their events sending new data, which
      clears the path for freeing those buffers' pages right there and then,
      provided that no active transactions are holding the AUX reference.
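
The added check can be pictured with the following user-space sketch. It is a simplified model built on C11 atomics: the struct layout and function names are hypothetical, and the extra care the kernel needs for races between the check and the reference grab is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ring_buffer {
    atomic_int aux_mmap_count;      /* live user mappings of the AUX area */
    atomic_int aux_refcount;        /* references held by open transactions */
};

/* Refuse to start a transaction once the AUX area is no longer mapped. */
static bool aux_output_begin(struct ring_buffer *rb)
{
    if (atomic_load(&rb->aux_mmap_count) == 0)
        return false;
    atomic_fetch_add(&rb->aux_refcount, 1);
    return true;
}

/* Finish a transaction and drop its reference. */
static void aux_output_end(struct ring_buffer *rb)
{
    atomic_fetch_sub(&rb->aux_refcount, 1);
}
```

Once the mapping count reaches zero, no new reference can be taken, so the pages can be freed as soon as the outstanding references are dropped.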
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/1457098969-21595-2-git-send-email-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Verify we have a single perf_hw_context PMU · 26657848
      Peter Zijlstra committed
      There should (and can) only be a single PMU for perf_hw_context
      events.
      
      This is because of how we schedule events: once a hardware event fails to
      schedule (the PMU is 'full') we stop trying to add more. The trivial
      'fix' would break the Round-Robin scheduling we do.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Don't leak event in the syscall error path · 201c2f85
      Alexander Shishkin committed
      In the error path, event_file not being NULL is used to determine
      whether the event itself still needs to be free'd, so fix it up to
      avoid leaking.
      Reported-by: Leon Yu <chianglungyu@gmail.com>
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 13005627 ("perf: Do not double free")
      Link: http://lkml.kernel.org/r/87twk06yxp.fsf@ashishki-desk.ger.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix time tracking bug with multiplexing · 8fdc6539
      Peter Zijlstra committed
      Stephane reported that commit:
      
        3cbaa590 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
      
      introduced a regression wrt. time tracking, as easily observed by:
      
      > This patch introduce a bug in the time tracking of events when
      > multiplexing is used.
      >
      > The issue is easily reproducible with the following perf run:
      >
      >  $ perf stat -a -C 0 -e branches,branches,branches,branches,branches,branches -I 1000
      >      1.000730239            652,394      branches   (66.41%)
      >      1.000730239            597,809      branches   (66.41%)
      >      1.000730239            593,870      branches   (66.63%)
      >      1.000730239            651,440      branches   (67.03%)
      >      1.000730239            656,725      branches   (66.96%)
      >      1.000730239      <not counted>      branches
      >
      > One branches event is shown as not having run. Yet, with
      > multiplexing, all events should run especially with a 1s (-I 1000)
      > interval. The delta for time_running comes out to 0. Yet, the event
      > has run because the kernel is actually multiplexing the events. The
      > problem is that the time tracking is the kernel and especially in
      > ctx_sched_out() is wrong now.
      >
      > The problem is that in case that the kernel enters ctx_sched_out() with the
      > following state:
      >    ctx->is_active=0x7 event_type=0x1
      >    Call Trace:
      >     [<ffffffff813ddd41>] dump_stack+0x63/0x82
      >     [<ffffffff81182bdc>] ctx_sched_out+0x2bc/0x2d0
      >     [<ffffffff81183896>] perf_mux_hrtimer_handler+0xf6/0x2c0
      >     [<ffffffff811837a0>] ? __perf_install_in_context+0x130/0x130
      >     [<ffffffff810f5818>] __hrtimer_run_queues+0xf8/0x2f0
      >     [<ffffffff810f6097>] hrtimer_interrupt+0xb7/0x1d0
      >     [<ffffffff810509a8>] local_apic_timer_interrupt+0x38/0x60
      >     [<ffffffff8175ca9d>] smp_apic_timer_interrupt+0x3d/0x50
      >     [<ffffffff8175ac7c>] apic_timer_interrupt+0x8c/0xa0
      >
      > In that case, the test:
      >       if (is_active & EVENT_TIME)
      >
      > will be false and the time will not be updated. Time must always be updated on
      > sched out.
      
      Fix this by always updating time if EVENT_TIME was set, as opposed to
      only updating time when EVENT_TIME changed.
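
The difference between the two conditions can be sketched in user space. This is a minimal model, not the kernel code: the `demo_*` names are hypothetical, and the flag values are assumptions modelled on the kernel's event_type bits (flexible = 0x1, pinned = 0x2, time = 0x4). It replays the reported state (is_active=0x7, event_type=0x1).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed flag values, modelled on the kernel's enum event_type_t. */
enum demo_event_type {
	DEMO_EVENT_FLEXIBLE = 0x1,
	DEMO_EVENT_PINNED   = 0x2,
	DEMO_EVENT_TIME     = 0x4,
	DEMO_EVENT_ALL      = DEMO_EVENT_FLEXIBLE | DEMO_EVENT_PINNED,
};

/* New is_active value after scheduling out event_type. */
static int demo_sched_out_state(int is_active, int event_type)
{
	assert((is_active & ~0x7) == 0);	/* only the three known bits */
	is_active &= ~event_type;
	if (!(is_active & DEMO_EVENT_ALL))
		is_active = 0;	/* nothing left running, so time stops too */
	return is_active;
}

/* Old behaviour: update time only if the EVENT_TIME bit changed. */
static bool demo_time_updated_old(int is_active, int event_type)
{
	int changed = is_active ^ demo_sched_out_state(is_active, event_type);

	return (changed & DEMO_EVENT_TIME) != 0;
}

/* Fixed behaviour: update time whenever EVENT_TIME was set on entry. */
static bool demo_time_updated_fixed(int is_active, int event_type)
{
	(void)event_type;	/* the decision no longer depends on it */
	return (is_active & DEMO_EVENT_TIME) != 0;
}
```

With is_active=0x7 and event_type=0x1, only the flexible bit changes, so the old test skips the time update while the fixed test performs it.
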
      Reported-by: Stephane Eranian <eranian@google.com>
      Tested-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Cc: namhyung@kernel.org
      Fixes: 3cbaa590 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
      Link: http://lkml.kernel.org/r/20160329072644.GB3408@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8fdc6539
  6. 29 Mar, 2016 2 commits
  7. 26 Mar, 2016 3 commits
    • A
      arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections · be7635e7
      Alexander Potapenko committed
      KASAN needs to know whether the allocation happens in an IRQ handler.
      This lets us strip everything below the IRQ entry point to reduce the
      number of unique stack traces needed to be stored.
      
      Move the definition of __irq_entry to <linux/interrupt.h> so that the
      users don't need to pull in <linux/ftrace.h>.  Also introduce the
      __softirq_entry macro which is similar to __irq_entry, but puts the
      corresponding functions into the .softirqentry.text section.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be7635e7
    • M
      oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space · 36324a99
      Michal Hocko committed
      When oom_reaper manages to unmap all the eligible vmas there shouldn't
      be much freeable memory held by the oom victim left, so it makes sense
      to clear the TIF_MEMDIE flag for the victim and allow the OOM killer to
      select another task.
      
      The lack of TIF_MEMDIE also means that the victim cannot access memory
      reserves anymore, but that shouldn't be a problem because it would get
      the access again if it needs to allocate and hits the OOM killer again,
      due to the fatal_signal_pending resp.  PF_EXITING check.  We can safely
      hide the task from the OOM killer because it is clearly not a good
      candidate anymore as everything reclaimable has been torn down already.
      
      This patch makes it possible to cap the time an OOM victim can keep
      TIF_MEMDIE and thus hold off further global OOM killer actions, provided
      the oom reaper is able to take mmap_sem for the associated mm struct.
      This is not guaranteed now, but further steps should make sure that
      taking mmap_sem for write blocks killably, which will help to reduce
      such lock contention.  This is not done by this patch.
      
      Note that exit_oom_victim might be called on a remote task from
      __oom_reap_task now so we have to check and clear the flag atomically
      otherwise we might race and underflow oom_victims or wake up waiters too
      early.
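
The "check and clear atomically" requirement can be illustrated with C11 atomics. This is a minimal user-space sketch, not the kernel implementation: `demo_task`, `DEMO_TIF_MEMDIE` and the `demo_*` functions are hypothetical stand-ins, and the two racing callers are modelled sequentially.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_TIF_MEMDIE 0x1u	/* hypothetical bit value */

struct demo_task {
	atomic_uint flags;
};

static atomic_int demo_oom_victims;	/* global count of marked victims */

static void demo_mark_oom_victim(struct demo_task *tsk)
{
	/* Count the task only if the flag was not already set. */
	if (!(atomic_fetch_or(&tsk->flags, DEMO_TIF_MEMDIE) & DEMO_TIF_MEMDIE))
		atomic_fetch_add(&demo_oom_victims, 1);
}

/* Returns true only for the caller that actually cleared the flag. */
static bool demo_exit_oom_victim(struct demo_task *tsk)
{
	/* Test and clear in one atomic step, so only one caller sees it set. */
	unsigned int old = atomic_fetch_and(&tsk->flags, ~DEMO_TIF_MEMDIE);

	if (!(old & DEMO_TIF_MEMDIE))
		return false;	/* someone else already cleared it */

	int left = atomic_fetch_sub(&demo_oom_victims, 1) - 1;
	assert(left >= 0);	/* would fire on an underflow */
	return true;
}

/* Usage: two callers (e.g. the reaper and the exiting task) decrement once. */
static int demo_double_exit_count(void)
{
	struct demo_task victim = { 0 };

	demo_mark_oom_victim(&victim);
	demo_exit_oom_victim(&victim);
	demo_exit_oom_victim(&victim);
	return atomic_load(&demo_oom_victims);
}
```

If the flag were tested and cleared in two separate steps, both callers could see it set and the counter would be decremented twice.
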
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36324a99
    • A
      sched: add schedule_timeout_idle() · 69b27baf
      Andrew Morton committed
      This will be needed in the patch "mm, oom: introduce oom reaper".
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69b27baf
  8. 25 Mar, 2016 1 commit
  9. 23 Mar, 2016 16 commits
    • L
      PM / sleep: Clear pm_suspend_global_flags upon hibernate · 27614273
      Lukas Wunner committed
      When suspending to RAM, waking up and later suspending to disk,
      we gratuitously runtime resume devices after the thaw phase.
      This does not occur if we always suspend to RAM or always to disk.
      
      pm_complete_with_resume_check(), which gets called from
      pci_pm_complete() among others, schedules a runtime resume
      if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during
      a suspend-to-RAM cycle. It is cleared at the beginning of
      the suspend-to-RAM cycle but not afterwards and it is not
      cleared during a suspend-to-disk cycle at all. Fix it.
      
      Fixes: ef25ba04 (PM / sleep: Add flags to indicate platform firmware involvement)
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      27614273
    • J
      kernel/...: convert pr_warning to pr_warn · a395d6a7
      Joe Perches committed
      Use the more common logging method with the eventual goal of removing
      pr_warning altogether.
      
      Miscellanea:
      
       - Realign arguments
       - Coalesce formats
       - Add missing space between a few coalesced formats
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[kernel/power/suspend.c]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a395d6a7
    • B
      memremap: add MEMREMAP_WC flag · c907e0eb
      Brian Starkey committed
      Add a flag to memremap() for writecombine mappings.  Mappings satisfied
      by this flag will not be cached, however writes may be delayed or
      combined into more efficient bursts.  This is most suitable for buffers
      written sequentially by the CPU for use by other DMA devices.
      Signed-off-by: Brian Starkey <brian.starkey@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c907e0eb
    • B
      memremap: don't modify flags · cf61e2a1
      Brian Starkey committed
      These patches implement a MEMREMAP_WC flag for memremap(), which can be
      used to obtain writecombine mappings.  This is then used for setting up
      dma_coherent_mem regions which use the DMA_MEMORY_MAP flag.
      
      The motivation is to fix an alignment fault on arm64, and the suggestion
      to implement MEMREMAP_WC for this case was made at [1].  That particular
      issue is handled in patch 4, which makes sure that the appropriate
      memset function is used when zeroing allocations mapped as IO memory.
      
      This patch (of 4):
      
      Don't modify the flags input argument to memremap(). MEMREMAP_WB is
      already a special case so we can check for it directly instead of
      clearing flag bits in each mapper.
      Signed-off-by: Brian Starkey <brian.starkey@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf61e2a1
    • H
      kernel/signal.c: add compile-time check for __ARCH_SI_PREAMBLE_SIZE · 41b27154
      Helge Deller committed
      The value of __ARCH_SI_PREAMBLE_SIZE defines the size (including
      padding) of the part of the struct siginfo that is before the union, and
      it is then used to calculate the needed padding (SI_PAD_SIZE) to make
      the size of struct siginfo equal to 128 (SI_MAX_SIZE) bytes.
      
      Depending on the target architecture and word width it equals either
      3 or 4 times sizeof(int).
      
      Since the very beginning we had __ARCH_SI_PREAMBLE_SIZE wrong on the
      parisc architecture for the 64bit kernel build.  It's even more
      frustrating, because it can easily be checked at compile time if the
      value was defined correctly.
      
      This patch adds such a check for the correctness of
      __ARCH_SI_PREAMBLE_SIZE in the hope that it will prevent existing and
      future architectures from running into the same problem.
      
      I refrained from replacing __ARCH_SI_PREAMBLE_SIZE by offsetof() in
      copy_siginfo() in include/asm-generic/siginfo.h, because a) it doesn't
      make any difference and b) it's used in the Documentation/kmemcheck.txt
      example.
      
      I ran this patch through the 0-DAY kernel test infrastructure and only
      the parisc architecture triggered as expected.  That means that this
      patch should be OK for all major architectures.
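
A compile-time check of this kind can be sketched with C11 `_Static_assert` and `offsetof()`. The sketch below is an assumption-laden model, not the patch itself: `struct demo_siginfo` and the `DEMO_*` macros are hypothetical, the layout assumes a 4-byte int and naturally aligned pointers, and the 3-versus-4 choice is keyed on pointer width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SI_MAX_SIZE 128

/* Preamble: three ints, plus padding when the union is 8-byte aligned. */
#if UINTPTR_MAX > 0xffffffffu
#define DEMO_SI_PREAMBLE_SIZE (4 * sizeof(int))
#else
#define DEMO_SI_PREAMBLE_SIZE (3 * sizeof(int))
#endif

#define DEMO_SI_PAD_SIZE \
	((DEMO_SI_MAX_SIZE - DEMO_SI_PREAMBLE_SIZE) / sizeof(int))

struct demo_siginfo {
	int si_signo;
	int si_errno;
	int si_code;
	union {
		int pad[DEMO_SI_PAD_SIZE];
		void *ptr;	/* forces pointer alignment on the union */
	} fields;
};

/* The build fails here if the preamble macro disagrees with the layout. */
_Static_assert(offsetof(struct demo_siginfo, fields) == DEMO_SI_PREAMBLE_SIZE,
	       "preamble size does not match the offset of the union");
_Static_assert(sizeof(struct demo_siginfo) == DEMO_SI_MAX_SIZE,
	       "struct does not have the expected total size");

/* Runtime twin of the checks above, handy in a unit test. */
static void demo_check_siginfo_layout(void)
{
	assert(offsetof(struct demo_siginfo, fields) == DEMO_SI_PREAMBLE_SIZE);
	assert(sizeof(struct demo_siginfo) == DEMO_SI_MAX_SIZE);
}
```

Defining the preamble as 3 ints on a 64-bit build would make the first assertion fail at compile time, which is the class of mistake the commit describes for parisc.
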
      Signed-off-by: Helge Deller <deller@gmx.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41b27154
    • D
      kernel: add kcov code coverage · 5c9a8750
      Dmitry Vyukov committed
      kcov provides code coverage collection for coverage-guided fuzzing
      (randomized testing).  Coverage-guided fuzzing is a testing technique
      that uses coverage feedback to determine new interesting inputs to a
      system.  A notable user-space example is AFL
      (http://lcamtuf.coredump.cx/afl/).  However, this technique is not
      widely used for kernel testing due to missing compiler and kernel
      support.
      
      kcov does not aim to collect as much coverage as possible.  It aims to
      collect more or less stable coverage that is a function of syscall
      inputs.  To achieve this goal it does not collect coverage in soft/hard
      interrupts, and instrumentation of some inherently non-deterministic or
      non-interesting parts of the kernel is disabled (e.g.  scheduler, locking).
      
      Currently there is a single coverage collection mode (tracing), but the
      API anticipates additional collection modes.  Initially I also
      implemented a second mode which exposes coverage in a fixed-size hash
      table of counters (what Quentin used in his original patch).  I've
      dropped the second mode for simplicity.
      
      This patch adds the necessary support on the kernel side.  The
      complementary compiler support was added in gcc revision 231296.
      
      We've used this support to build syzkaller system call fuzzer, which has
      found 90 kernel bugs in just 2 months:
      
        https://github.com/google/syzkaller/wiki/Found-Bugs
      
      We've also found 30+ bugs in our internal systems with syzkaller.
      Another (yet unexplored) direction where kcov coverage would greatly
      help is more traditional "blob mutation".  For example, mounting a
      random blob as a filesystem, or receiving a random blob over wire.
      
      Why not gcov?  A typical fuzzing loop looks as follows: (1) reset
      coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
      typical coverage can be just a dozen basic blocks (e.g.  an invalid
      input).  In such a context gcov becomes prohibitively expensive, as the
      reset/collect coverage steps depend on the total number of basic
      blocks/edges in the program (in the case of the kernel it is about 2M).
      The cost of kcov depends only on the number of executed basic
      blocks/edges.  On top of that, the kernel requires per-thread coverage
      because there are always background threads and unrelated processes
      that also produce coverage.  With inlined gcov instrumentation
      per-thread coverage is not possible.
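
The cost argument can be made concrete with a small user-space simulation. This is a sketch only and does not use the kcov interface: all `demo_*` names are hypothetical, and the buffer layout (word 0 holds the number of recorded PCs, the PCs follow) is an assumption chosen so that reset is constant-time and collection touches only what was executed.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_COVER_WORDS 64	/* hypothetical buffer size, in words */

/* Word 0 holds the number of recorded PCs; the PCs start at word 1. */
static unsigned long demo_cover[DEMO_COVER_WORDS];
static unsigned long demo_collected[DEMO_COVER_WORDS];

/* (1) Reset: constant time, independent of the size of the program. */
static void demo_cover_reset(void)
{
	demo_cover[0] = 0;
}

/* Called once per executed basic block; drops PCs once the buffer is full. */
static void demo_cover_record(unsigned long pc)
{
	unsigned long pos = demo_cover[0] + 1;

	if (pos < DEMO_COVER_WORDS) {
		demo_cover[pos] = pc;
		demo_cover[0] = pos;
	}
}

/* (3) Collect: copies only the PCs that were actually recorded. */
static size_t demo_cover_collect(void)
{
	size_t n = demo_cover[0];

	assert(n < DEMO_COVER_WORDS);
	for (size_t i = 0; i < n; i++)
		demo_collected[i] = demo_cover[i + 1];
	return n;
}

/* Stand-in for "code under test" that rejects an invalid input early. */
static void demo_short_path(void)
{
	demo_cover_record(0x1000);
	demo_cover_record(0x1010);
	demo_cover_record(0x1024);
}

/* One fuzzing iteration: reset, execute, collect. */
static size_t demo_fuzz_iteration(void (*run)(void))
{
	demo_cover_reset();
	run();			/* (2) execute a bit of code */
	return demo_cover_collect();
}
```

Each iteration here costs time proportional to the three recorded blocks, whereas a counter-per-block scheme would have to clear and scan every counter in the program.
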
      
      kcov exposes kernel PCs and control flow to user-space which is
      insecure.  But debugfs should not be mapped as user accessible.
      
      Based on a patch by Quentin Casasnovas.
      
      [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
      [akpm@linux-foundation.org: unbreak allmodconfig]
      [akpm@linux-foundation.org: follow x86 Makefile layout standards]
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tavis Ormandy <taviso@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Drysdale <drysdale@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c9a8750
    • A
      profile: hide unused functions when !CONFIG_PROC_FS · ade356b9
      Arnd Bergmann committed
      A couple of functions and variables in the profile implementation are
      used only on SMP systems by the procfs code, but are unused if either
      procfs is disabled or in uniprocessor kernels.  gcc prints a harmless
      warning about the unused symbols:
      
        kernel/profile.c:243:13: error: 'profile_flip_buffers' defined but not used [-Werror=unused-function]
         static void profile_flip_buffers(void)
                     ^
        kernel/profile.c:266:13: error: 'profile_discard_flip_buffers' defined but not used [-Werror=unused-function]
         static void profile_discard_flip_buffers(void)
                     ^
        kernel/profile.c:330:12: error: 'profile_cpu_callback' defined but not used [-Werror=unused-function]
         static int profile_cpu_callback(struct notifier_block *info,
                    ^
      
      This adds further #ifdef to the file, to annotate exactly in which cases
      they are used.  I have built several thousand ARM randconfig kernels with
      this patch applied and no longer get any warnings in this file.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ade356b9
    • H
      panic: change nmi_panic from macro to function · ebc41f20
      Hidehiro Kawai committed
      Commit 1717f209 ("panic, x86: Fix re-entrance problem due to panic
      on NMI") and commit 58c5661f ("panic, x86: Allow CPUs to save
      registers even if looping in NMI context") introduced nmi_panic() which
      prevents concurrent/recursive execution of panic().  It also saves
      registers for the crash dump on x86.
      
      However, there are some cases where NMI handlers still use panic().
      This patch set partially replaces them with nmi_panic() in those cases.
      
      Even after this patchset is applied, some NMI or similar handlers (e.g.
      MCE handler) continue to use panic().  This is because I can't test them
      well and actual problems are unlikely to happen.  For example, the
      possibility that normal panic and panic on MCE happen simultaneously is
      very low.
      
      This patch (of 3):
      
      Convert nmi_panic() to a proper function and export it instead of
      exporting internal implementation details to modules, for obvious
      reasons.
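
How a function like this can prevent concurrent and recursive panics may be sketched with a compare-and-exchange on a "panicking CPU" variable. The sketch is a user-space model under assumptions: the `demo_*` names and the returned action codes are hypothetical, and the real function would call panic() or stop the CPU rather than return a value.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_PANIC_CPU_INVALID (-1)

enum demo_panic_action {
	DEMO_DO_PANIC,	/* this CPU won the race and proceeds to panic */
	DEMO_SELF_STOP,	/* another CPU is already panicking: stop here */
	DEMO_RETURN,	/* this CPU is already panicking: avoid recursion */
};

/* ID of the CPU that owns the panic, or DEMO_PANIC_CPU_INVALID. */
static atomic_int demo_panic_cpu = DEMO_PANIC_CPU_INVALID;

static enum demo_panic_action demo_nmi_panic(int this_cpu)
{
	int expected = DEMO_PANIC_CPU_INVALID;

	assert(this_cpu >= 0);

	/* Only the first caller can swap INVALID for its own CPU id. */
	if (atomic_compare_exchange_strong(&demo_panic_cpu, &expected, this_cpu))
		return DEMO_DO_PANIC;

	/* On failure, expected holds the id of the CPU that owns the panic. */
	return expected == this_cpu ? DEMO_RETURN : DEMO_SELF_STOP;
}
```

Keeping this logic in one exported function means modules never need access to the ownership variable itself.
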
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: Javi Merino <javi.merino@arm.com>
      Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com>
      Cc: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebc41f20
    • J
      fs/coredump: prevent fsuid=0 dumps into user-controlled directories · 378c6520
      Jann Horn committed
      This commit fixes the following security hole affecting systems where
      all of the following conditions are fulfilled:
      
       - The fs.suid_dumpable sysctl is set to 2.
       - The kernel.core_pattern sysctl's value starts with "/". (Systems
         where kernel.core_pattern starts with "|/" are not affected.)
       - Unprivileged user namespace creation is permitted. (This is
         true on Linux >=3.8, but some distributions disallow it by
         default using a distro patch.)
      
      Under these conditions, if a program executes under secure exec rules,
      causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
      namespace, changes its root directory and crashes, the coredump will be
      written using fsuid=0 and a path derived from kernel.core_pattern - but
      this path is interpreted relative to the root directory of the process,
      allowing the attacker to control where a coredump will be written with
      root privileges.
      
      To fix the security issue, always interpret core_pattern for dumps that
      are written under SUID_DUMP_ROOT relative to the root directory of init.
      Signed-off-by: Jann Horn <jann@thejh.net>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      378c6520
    • O
      ptrace: change __ptrace_unlink() to clear ->ptrace under ->siglock · 1333ab03
      Oleg Nesterov committed
      This test case (a simplified version of one generated by syzkaller)
      
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      
      	void test(void)
      	{
      		for (;;) {
      			if (fork()) {
      				wait(NULL);
      				continue;
      			}
      
      			ptrace(PTRACE_SEIZE, getppid(), 0, 0);
      			ptrace(PTRACE_INTERRUPT, getppid(), 0, 0);
      			_exit(0);
      		}
      	}
      
      	int main(void)
      	{
      		int np;
      
      		for (np = 0; np < 8; ++np)
      			if (!fork())
      				test();
      
      		while (wait(NULL) > 0)
      			;
      		return 0;
      	}
      
      triggers the 2nd WARN_ON_ONCE(!signr) warning in do_jobctl_trap().  The
      problem is that __ptrace_unlink() clears task->jobctl under siglock but
      task->ptrace is cleared without this lock held; this fools the "else"
      branch which assumes that !PT_SEIZED means PT_PTRACED.
      
      Note also that most other PTRACE_SEIZE checks can race with detach
      from the exiting tracer too.  Say, the callers of ptrace_trap_notify()
      assume that SEIZED can't go away after it was checked.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1333ab03
    • A
      auditsc: for seccomp events, log syscall compat state using in_compat_syscall · efbc0fbf
      Andy Lutomirski committed
      Except on SPARC, this is what the code always did.  SPARC compat seccomp
      was buggy, although the impact of the bug was limited because SPARC
      32-bit and 64-bit syscall numbers are the same.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efbc0fbf
    • A
      ptrace: in PEEK_SIGINFO, check syscall bitness, not task bitness · 5c465217
      Andy Lutomirski committed
      Users of the 32-bit ptrace() ABI expect the full 32-bit ABI.  siginfo
      translation should check ptrace() ABI, not caller task ABI.
      
      This is an ABI change on SPARC.  Let's hope that no one relied on the
      old buggy ABI.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c465217
    • A
      seccomp: check in_compat_syscall, not is_compat_task, in strict mode · 5c38065e
      Andy Lutomirski committed
      Seccomp wants to know the syscall bitness, not the caller task bitness,
      when it selects the syscall whitelist.
      
      As far as I know, this makes no difference on any architecture, so it's
      not a security problem.  (It generates identical code everywhere except
      sparc, and, on sparc, the syscall numbering is the same for both ABIs.)
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c38065e
    • T
      kernel/hung_task.c: use timeout diff when timeout is updated · b4aa14a6
      Tetsuo Handa committed
      When a new timeout is written to /proc/sys/kernel/hung_task_timeout_secs,
      khungtaskd is interrupted and again sleeps for the full timeout duration.
      
      This means that hung tasks will not be checked if a new timeout is written
      periodically within the old timeout duration, and/or checking of hung
      tasks will be delayed for up to the previous timeout duration.  Fix this
      by remembering the last time khungtaskd checked for hung tasks.
      
      This change will allow other watchdog tasks (if any) to share khungtaskd
      by sleeping for minimal timeout diff of all watchdog tasks.  Doing more
      watchdog tasks from khungtaskd will reduce the possibility of printk()
      collisions by multiple watchdog threads.
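
The "timeout diff" idea can be sketched as a small calculation. This is a minimal model in seconds rather than jiffies: `demo_hung_timeout_left` and its parameters are hypothetical names, and treating a timeout of 0 as "checking disabled" is an assumption about the sysctl's semantics.

```c
#include <assert.h>
#include <limits.h>

/*
 * Seconds left to sleep before the next hung-task check.
 * A result <= 0 means the check is already due.
 */
static long demo_hung_timeout_left(long last_checked, long timeout, long now)
{
	assert(now >= last_checked);

	if (timeout == 0)
		return LONG_MAX;	/* assumed: 0 disables checking */

	/* Sleep only for what remains since the last check. */
	return last_checked + timeout - now;
}
```

For example, with the last check at t=100 and a timeout of 120, a wakeup at t=150 leaves 70 seconds to sleep instead of a fresh 120; if the timeout is lowered to 30 at that moment, the result is -20 and the check runs immediately.
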
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4aa14a6
    • P
      tracing: Record and show NMI state · 7e6867bf
      Peter Zijlstra committed
      The latency tracer format has a nice column to indicate IRQ state, but
      this is not able to tell us about NMI state.
      
      When tracing perf interrupt handlers (which often run in NMI context)
      it is very useful to see how the events nest.
      
      Link: http://lkml.kernel.org/r/20160318153022.105068893@infradead.org
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      7e6867bf
    • S
      tracing: Fix trace_printk() to print when not using bprintk() · 3debb0a9
      Steven Rostedt (Red Hat) committed
      The trace_printk() code will allocate extra buffers if the compiler detects
      that a trace_printk() is used. To do this, the format of the trace_printk()
      is saved to the __trace_printk_fmt section, and if that section is bigger
      than zero, the buffers are allocated (along with a message that this has
      happened).
      
      If trace_printk() uses a format that is not a constant, and thus something
      not guaranteed to be around when the print happens, the compiler optimizes
      the fmt out, as it is not used, and the __trace_printk_fmt section is not
      filled. This means the kernel will not allocate the special buffers needed
      for the trace_printk() and the trace_printk() will not write anything to the
      tracing buffer.
      
      Adding a "__used" to the variable in the __trace_printk_fmt section will
      keep it around, even though it is set to NULL. This will keep the string
      from being printed in the debugfs/tracing/printk_formats section as it is
      not needed.
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Fixes: 07d777fe "tracing: Add percpu buffers for trace_printk()"
      Cc: stable@vger.kernel.org # v3.5+
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      3debb0a9
  10. 21 Mar, 2016 2 commits