1. 20 Feb 2016, 1 commit
  2. 06 Feb 2016, 3 commits
    • A
      bpf: add lookup/update support for per-cpu hash and array maps · 15a07b33
      Alexei Starovoitov authored
      The functions bpf_map_lookup_elem(map, key, value) and
      bpf_map_update_elem(map, key, value, flags) need to get/set
      values from all cpus for per-cpu hash and array maps,
      so that user space can aggregate/update them as necessary.
      
      Example of single counter aggregation in user space:
        unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
        long values[nr_cpus];
        long value = 0;
        int i;

        bpf_lookup_elem(fd, key, values);
        for (i = 0; i < nr_cpus; i++)
          value += values[i];
      
      User space must provide an array of round_up(value_size, 8) * nr_cpus
      bytes to get/set values, since the kernel uses 'long'-sized copies of
      the per-cpu values to try to transfer each counter atomically.
      This is best-effort, since bpf programs and user space are racing
      to access the same memory.
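      
      For the update direction the same layout applies. Below is a minimal
      user-space sketch, assuming the samples/bpf libbpf.h wrappers used in
      the example above (bpf_update_elem() and the BPF_ANY flag); the
      reset_counter() helper itself is illustrative:
      
        #include <string.h>
        #include <unistd.h>
        #include <linux/bpf.h>  /* BPF_ANY */
        #include "libbpf.h"     /* samples/bpf wrappers (assumption) */
      
        /* Zero one counter on every cpu with a single syscall. */
        static int reset_counter(int map_fd, void *key)
        {
                unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
                long values[nr_cpus];
      
                memset(values, 0, sizeof(values));
                return bpf_update_elem(map_fd, key, values, BPF_ANY);
        }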
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15a07b33
    • A
      bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map · a10423b8
      Alexei Starovoitov authored
      The primary use case is a latency histogram, where a bpf program
      computes the latency of block requests or other events and stores
      the histogram in an array of 64 elements.
      All cpus are constantly running, so a plain increment is not accurate
      and bpf_xadd causes cache-line ping-pong; this per-cpu approach gives
      the fastest collision-free counters.
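      
      A hedged sketch of such a program, in the samples/bpf style of the
      period (struct bpf_map_def and SEC() come from samples/bpf; the attach
      point and the bucketing, elided here, are illustrative):
      
        #include <uapi/linux/bpf.h>
        #include <uapi/linux/ptrace.h>
        #include "bpf_helpers.h"        /* samples/bpf helpers (assumption) */
      
        struct bpf_map_def SEC("maps") lat_map = {
                .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
                .key_size    = sizeof(u32),
                .value_size  = sizeof(u64),
                .max_entries = 64,
        };
      
        SEC("kprobe/blk_account_io_done")       /* illustrative attach point */
        int count_latency(struct pt_regs *ctx)
        {
                u32 slot = 0;   /* stand-in for log2(latency) bucketing */
                u64 *cnt = bpf_map_lookup_elem(&lat_map, &slot);
      
                if (cnt)
                        (*cnt)++;       /* the slot is this_cpu's copy */
                return 0;
        }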
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a10423b8
    • A
      bpf: introduce BPF_MAP_TYPE_PERCPU_HASH map · 824bd0ce
      Alexei Starovoitov authored
      Introduce the BPF_MAP_TYPE_PERCPU_HASH map type, which is used to
      keep accurate counters without the BPF_XADD instruction, which
      turned out to be too costly for high-performance network monitoring.
      In the typical use case the 'key' is a flow tuple or another
      long-lived object that sees a lot of events per second.
      
      bpf_map_lookup_elem() returns a per-cpu area.
      Example:

        struct {
                u32 packets;
                u32 bytes;
        } *ptr = bpf_map_lookup_elem(&map, &key);

        /* ptr points to the this_cpu area of the value, so the following
         * increments will not collide with other cpus
         */
        ptr->packets++;
        ptr->bytes += skb->len;
      
      bpf_update_elem() atomically creates a new element in which all
      per-cpu values are zero-initialized and the this_cpu value is
      populated with the given 'value'.
      Note that a non-per-cpu hash map always allocates a new element
      and deletes the old one after an RCU grace period to keep updates
      atomic; a per-cpu hash map updates element values in place.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      824bd0ce
  3. 01 Feb 2016, 1 commit
  4. 30 Jan 2016, 2 commits
    • Z
      pid: Fix spelling in comments · 840d6fe7
      Zhen Lei authored
      I discovered this typo by accident while studying this module.
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tianhong Ding <dingtianhong@huawei.com>
      Cc: Xinwei Hu <huxinwei@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1454119457-11272-1-git-send-email-thunder.leizhen@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      840d6fe7
    • D
      devm_memremap_pages: fix vmem_altmap lifetime + alignment handling · eb7d78c9
      Dan Williams authored
      to_vmem_altmap() needs to return valid results until
      arch_remove_memory() completes.  It also needs to be valid for any pfn
      in a section regardless of whether that pfn maps to data.  This escape
      was a result of a bug in the unit test.
      
      The signature of this bug is that free_pagetable() fails to retrieve a
      vmem_altmap and goes off into the weeds:
      
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<ffffffff811d2629>] get_pfnblock_flags_mask+0x49/0x60
       [..]
       Call Trace:
        [<ffffffff811d3477>] free_hot_cold_page+0x97/0x1d0
        [<ffffffff811d367a>] __free_pages+0x2a/0x40
        [<ffffffff8191e669>] free_pagetable+0x8c/0xd4
        [<ffffffff8191ef4e>] remove_pagetable+0x37a/0x808
        [<ffffffff8191b210>] vmemmap_free+0x10/0x20
      
      Fixes: 4b94ffdc ("x86, mm: introduce vmem_altmap to augment vmemmap_populate()")
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      eb7d78c9
  5. 29 Jan 2016, 13 commits
    • P
      perf: Remove/simplify lockdep annotation · 5fa7c8ec
      Peter Zijlstra authored
      Now that the perf_event_ctx_lock_nested() call has moved from
      put_event() into perf_event_release_kernel(), the first reason is no
      longer valid as that can no longer happen.
      
      The second reason seems to have been invalidated when Al Viro made fput()
      unconditionally async in the following commit:
      
        4a9d4b02 ("switch fput to task_work_add")
      
      such that munmap()->fput()->release()->perf_release() would no longer happen.
      
      Therefore, remove the annotation. This should increase the efficiency
      of lockdep coverage of perf locking.
      Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5fa7c8ec
    • P
      perf: Synchronously clean up child events · c6e5b732
      Peter Zijlstra authored
      The orphan cleanup workqueue doesn't always catch orphans, for example,
      if they never schedule after they are orphaned. IOW, the event leak is
      still very real. It also wouldn't work for kernel counters.
      
      Doing it synchronously is a little hairy due to lock inversion
      issues, but it is made to work.
      
      Patch based on work by Alexander Shishkin.
      Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c6e5b732
    • P
      perf: Untangle 'owner' confusion · 60beda84
      Peter Zijlstra authored
      There are two concepts of 'owner' with respect to an event:
      
       - event::owner / event::owner_list,
         used by prctl(.option = PR_TASK_PERF_EVENTS_{EN,DIS}ABLE).
      
       - the 'owner' of the event object, typically the file descriptor.
      
      Currently these two concepts are conflated, which causes trouble with
      scm_rights passing of file descriptors: passing the event and then
      closing the creating task would render the event 'orphan' and have it
      cleared out, which is unlikely to be what is expected.
      
      This patch untangles these two concepts by using PERF_EVENT_STATE_EXIT
      to denote the second type.
      Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      60beda84
    • P
      perf: Add flags argument to perf_remove_from_context() · 45a0e07a
      Peter Zijlstra authored
      In preparation for adding more options, convert the boolean argument
      into a flags word.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      45a0e07a
    • P
      perf: Clean up sync_child_event() · 8ba289b8
      Peter Zijlstra authored
      sync_child_event() has outgrown its purpose; it does far too much.
      Bring it back to its named purpose.
      
      Rename __perf_event_exit_task() to perf_event_exit_event() to better
      reflect what it does and move the event->state assignment under the
      ctx->lock, like state changes ought to be.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8ba289b8
    • P
      perf: Robustify event->owner usage and SMP ordering · f47c02c0
      Peter Zijlstra authored
      Use smp_store_release() to clear event->owner and
      lockless_dereference() to observe it. Further use READ_ONCE() for all
      lockless reads.
      
      This changes perf_remove_from_owner() to leave event->owner cleared.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f47c02c0
    • P
      perf: Fix STATE_EXIT usage · 6e801e01
      Peter Zijlstra authored
      We should never attempt to enable a STATE_EXIT event.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6e801e01
    • P
      perf: Update locking order · 07c4a776
      Peter Zijlstra authored
      Update the locking order to note that ctx::lock nests inside of
      child_mutex, as per:
      
        perf_ioctl():                ctx::mutex
        -> perf_event_for_each():    event::child_mutex
          -> _perf_event_enable():   ctx::lock
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      07c4a776
    • P
      perf: Remove __free_event() · a0733e69
      Peter Zijlstra authored
      There is but a single caller; remove the function. We already have
      _free_event(), so the extra indirection is nonsensical.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a0733e69
    • A
      perf/bpf: Convert perf_event_array to use struct file · e03e7ee3
      Alexei Starovoitov authored
      Robustify refcounting.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160126045947.GA40151@ast-mbp.thefacebook.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e03e7ee3
    • P
      perf: Fix NULL deref · 828b6f0e
      Peter Zijlstra authored
      Dan reported:
      
        1229                  if (ctx->task == TASK_TOMBSTONE ||
        1230                      !atomic_inc_not_zero(&ctx->refcount)) {
        1231                          raw_spin_unlock(&ctx->lock);
        1232                          ctx = NULL;
                                      ^^^^^^^^^^
      ctx is NULL.
      
        1233                  }
        1234
        1235                  WARN_ON_ONCE(ctx->task != task);
                                           ^^^^^^^^^^^^^^^^^
      That is, the patch being fixed introduced a NULL dereference.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 63b6da39 ("perf: Fix perf_event_exit_task() race")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      828b6f0e
    • P
      perf: Fix race in perf_event_exit_task_context() · 6a3351b6
      Peter Zijlstra authored
      There is a race between perf_event_exit_task_context() and
      orphans_remove_work() which results in a use-after-free.
      
      We mark ctx->task with TASK_TOMBSTONE to indicate that a context is
      'dead', under ctx->lock. From that point on, event_function_call()
      on any event of that context will be a NOP.
      
      A concurrent orphans_remove_work() will only hold ctx->mutex for
      the list iteration and does not serialize against this. Therefore it
      is possible that orphans_remove_work()'s perf_remove_from_context()
      call will fail, but we'll continue to free the event, leaving the
      freed memory on the lists.
      
      Once perf_event_exit_task_context() gets around to acquiring
      ctx->mutex it too will iterate the event list, encounter the
      already-freed event and proceed to free it _again_. This trips
      the WARN in free_event().
      
      Plug the race by having perf_event_exit_task_context() hold
      ctx::mutex over the whole tear-down, thereby 'naturally'
      serializing against all other sites, including the orphan work.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: alexander.shishkin@linux.intel.com
      Cc: dsahern@gmail.com
      Cc: namhyung@kernel.org
      Link: http://lkml.kernel.org/r/20160125130954.GY6357@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6a3351b6
    • P
      perf: Fix orphan hole · 78cd2c74
      Peter Zijlstra authored
      We should set event->owner before we install the event,
      otherwise there is a hole where the target task can fork() and
      we'll not inherit the event because it thinks the event is
      orphaned.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      78cd2c74
  6. 28 Jan 2016, 1 commit
    • A
      PM: APM_EMULATION does not depend on PM · 993e9fe1
      Arnd Bergmann authored
      The APM emulation code does multiple things, and some of them depend on
      PM_SLEEP, while the battery management does not. However, selecting
      the symbol, as SHARPSL_PM does, causes a Kconfig warning:
      
      warning: (SHARPSL_PM && PMAC_APM_EMU) selects APM_EMULATION which has unmet direct dependencies (PM && SYS_SUPPORTS_APM_EMULATION)
      
      From all I can tell, this is completely harmless, and we can simply allow
      APM_EMULATION to be enabled here, even if PM is not.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      993e9fe1
  7. 27 Jan 2016, 2 commits
  8. 26 Jan 2016, 3 commits
    • A
      tick/sched: Hide unused oneshot timer code · 7809998a
      Arnd Bergmann authored
      A couple of functions in kernel/time/tick-sched.c are only
      relevant in oneshot timer mode, i.e. when high-resolution timers or
      nohz mode are enabled. If both are disabled, we get gcc warnings
      about them:
      
      kernel/time/tick-sched.c:98:16: warning: 'tick_init_jiffy_update' defined but not used [-Wunused-function]
       static ktime_t tick_init_jiffy_update(void)
                      ^
      kernel/time/tick-sched.c:112:13: warning: 'tick_sched_do_timer' defined but not used [-Wunused-function]
       static void tick_sched_do_timer(ktime_t now)
                   ^
      kernel/time/tick-sched.c:134:13: warning: 'tick_sched_handle' defined but not used [-Wunused-function]
       static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
                   ^
      
      This encloses the whole set of functions in an appropriate ifdef
      to avoid the warning and to make it clearer when they are used.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/1453736525-1959191-1-git-send-email-arnd@arndb.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      7809998a
    • M
      irqdomain: Allow domain lookup with DOMAIN_BUS_WIRED token · 530cbe10
      Marc Zyngier authored
      Let's take the (outlandish) example of an interrupt controller
      capable of handling both wired interrupts and PCI MSIs.
      
      With the current code, the PCI MSI domain is going to be tagged
      with DOMAIN_BUS_PCI_MSI, and the wired domain with DOMAIN_BUS_ANY.
      
      Things get hairy when we start looking up the domain for a wired
      interrupt (typically when creating it based on some firmware
      information - DT or ACPI).
      
      In irq_create_fwspec_mapping(), we perform the lookup using
      DOMAIN_BUS_ANY, which is actually used as a wildcard. This gives
      us one chance out of two to end up with the wrong domain, and
      we try to configure a wired interrupt with the MSI domain.
      Everything grinds to a halt pretty quickly.
      
      What we really need to do is to start looking for a domain that
      would uniquely identify a wired interrupt domain, and only use
      DOMAIN_BUS_ANY as a fallback.
      
      In order to solve this, let's introduce a new DOMAIN_BUS_WIRED
      token, which is going to be used exactly as described above.
      Of course, this depends on the irqchip setting up the domain
      bus_token, and nobody has had to implement this so far.
      
      Only so far.
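      
      The lookup thus becomes a two-step affair. A sketch of the pattern in
      irq_create_fwspec_mapping(), using the irq_find_matching_fwnode()
      lookup of that era (the exact call-site details are an assumption):
      
        /* prefer an explicitly wired domain, fall back to the wildcard */
        domain = irq_find_matching_fwnode(fwspec->fwnode, DOMAIN_BUS_WIRED);
        if (!domain)
                domain = irq_find_matching_fwnode(fwspec->fwnode,
                                                  DOMAIN_BUS_ANY);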
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Link: http://lkml.kernel.org/r/1453816347-32720-2-git-send-email-marc.zyngier@arm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      530cbe10
    • T
      rtmutex: Make wait_lock irq safe · b4abf910
      Thomas Gleixner authored
      Sasha reported a lockdep splat about a potential deadlock between RCU boosting
      rtmutex and the posix timer it_lock.
      
      CPU0					CPU1
      
      rtmutex_lock(&rcu->rt_mutex)
        spin_lock(&rcu->rt_mutex.wait_lock)
      					local_irq_disable()
      					spin_lock(&timer->it_lock)
      					spin_lock(&rcu->mutex.wait_lock)
      --> Interrupt
          spin_lock(&timer->it_lock)
      
      This is caused by the following code sequence on CPU1
      
        rcu_read_lock();
        x = lookup();
        if (x)
                spin_lock_irqsave(&x->it_lock, flags);
        rcu_read_unlock();
        return x;
      
      We could fix that in the posix timer code by keeping rcu read locked across
      the spinlocked and irq disabled section, but the above sequence is common and
      there is no reason not to support it.
      
      Taking rt_mutex.wait_lock irq safe prevents the deadlock.
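      
      A sketch of the resulting locking pattern in the rtmutex slow paths
      (an illustrative fragment; 'lock' is the struct rt_mutex being
      operated on):
      
        unsigned long flags;
      
        raw_spin_lock_irqsave(&lock->wait_lock, flags);
        /* enqueue/dequeue waiters with interrupts disabled */
        raw_spin_unlock_irqrestore(&lock->wait_lock, flags);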
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      b4abf910
  9. 23 Jan 2016, 1 commit
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro authored
      Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      with inode_foo(inode) being mutex_foo(&inode->i_mutex).

      Please use these for access to ->i_mutex; over the coming cycle
      ->i_mutex will become an rwsem, with ->lookup() done with it held
      only shared.
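      
      A sketch of the wrappers, following the naming scheme described above
      (the full set would also cover is_locked and lock_nested):
      
        static inline void inode_lock(struct inode *inode)
        {
                mutex_lock(&inode->i_mutex);
        }
      
        static inline void inode_unlock(struct inode *inode)
        {
                mutex_unlock(&inode->i_mutex);
        }
      
        static inline int inode_trylock(struct inode *inode)
        {
                return mutex_trylock(&inode->i_mutex);
        }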
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      5955102c
  10. 22 Jan 2016, 13 commits
    • G
      sched/numa: Fix use-after-free bug in the task_numa_compare · 1dff76b9
      Gavin Guo authored
      The following message can be observed on an Ubuntu v3.13.0-65 kernel with KASan
      backported:
      
        ==================================================================
        BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8
        Read of size 8 by task qemu-system-x86/3998900
        =============================================================================
        BUG kmalloc-128 (Tainted: G    B        ): kasan: bad access detected
        -----------------------------------------------------------------------------
      
        INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890
      	__slab_alloc+0x4f8/0x560
      	__kmalloc+0x1eb/0x280
      	task_numa_fault+0xc1b/0xed0
      	do_numa_page+0x192/0x200
      	handle_mm_fault+0x808/0x1160
      	__do_page_fault+0x218/0x750
      	do_page_fault+0x1a/0x70
      	page_fault+0x28/0x30
      	SyS_poll+0x66/0x1a0
      	system_call_fastpath+0x1a/0x1f
        INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0
      	__slab_free+0x2ab/0x3f0
      	kfree+0x161/0x170
      	task_numa_free+0x1d2/0x200
      	finish_task_switch+0x1d2/0x210
      	__schedule+0x5d4/0xc60
      	schedule_preempt_disabled+0x40/0xc0
      	cpu_startup_entry+0x2da/0x340
      	start_secondary+0x28f/0x360
        Call Trace:
         [<ffffffff81a6ce35>] dump_stack+0x45/0x56
         [<ffffffff81244aed>] print_trailer+0xfd/0x170
         [<ffffffff8124ac36>] object_err+0x36/0x40
         [<ffffffff8124cbf9>] kasan_report_error+0x1e9/0x3a0
         [<ffffffff8124d260>] kasan_report+0x40/0x50
         [<ffffffff810dda7c>] ? task_numa_find_cpu+0x64c/0x890
         [<ffffffff8124bee9>] __asan_load8+0x69/0xa0
         [<ffffffff814f5c38>] ? find_next_bit+0xd8/0x120
         [<ffffffff810dda7c>] task_numa_find_cpu+0x64c/0x890
         [<ffffffff810de16c>] task_numa_migrate+0x4ac/0x7b0
         [<ffffffff810de523>] numa_migrate_preferred+0xb3/0xc0
         [<ffffffff810e0b88>] task_numa_fault+0xb88/0xed0
         [<ffffffff8120ef02>] do_numa_page+0x192/0x200
         [<ffffffff81211038>] handle_mm_fault+0x808/0x1160
         [<ffffffff810d7dbd>] ? sched_clock_cpu+0x10d/0x160
         [<ffffffff81068c52>] ? native_load_tls+0x82/0xa0
         [<ffffffff81a7bd68>] __do_page_fault+0x218/0x750
         [<ffffffff810c2186>] ? hrtimer_try_to_cancel+0x76/0x160
         [<ffffffff81a6f5e7>] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0
         [<ffffffff81a7c2ba>] do_page_fault+0x1a/0x70
         [<ffffffff81a772e8>] page_fault+0x28/0x30
         [<ffffffff8128cbd4>] ? do_sys_poll+0x1c4/0x6d0
         [<ffffffff810e64f6>] ? enqueue_task_fair+0x4b6/0xaa0
         [<ffffffff810233c9>] ? sched_clock+0x9/0x10
         [<ffffffff810cf70a>] ? resched_task+0x7a/0xc0
         [<ffffffff810d0663>] ? check_preempt_curr+0xb3/0x130
         [<ffffffff8128b5c0>] ? poll_select_copy_remaining+0x170/0x170
         [<ffffffff810d3bc0>] ? wake_up_state+0x10/0x20
         [<ffffffff8112a28f>] ? drop_futex_key_refs.isra.14+0x1f/0x90
         [<ffffffff8112d40e>] ? futex_requeue+0x3de/0xba0
         [<ffffffff8112e49e>] ? do_futex+0xbe/0x8f0
         [<ffffffff81022c89>] ? read_tsc+0x9/0x20
         [<ffffffff8111bd9d>] ? ktime_get_ts+0x12d/0x170
         [<ffffffff8108f699>] ? timespec_add_safe+0x59/0xe0
         [<ffffffff8128d1f6>] SyS_poll+0x66/0x1a0
         [<ffffffff81a830dd>] system_call_fastpath+0x1a/0x1f
      
      As commit 1effd9f1 ("sched/numa: Fix unsafe get_task_struct() in
      task_numa_assign()") points out, rcu_read_lock() cannot protect the
      task_struct from being freed in finish_task_switch(). The bug
      happens while calculating imp, which requires access to
      p->numa_faults, memory that is freed in the following path:
      
      do_exit()
          current->flags |= PF_EXITING;
          release_task()
              ~~delayed_put_task_struct()~~
          schedule()
          ...
          ...
      rq->curr = next;
          context_switch()
              finish_task_switch()
                  put_task_struct()
                      __put_task_struct()
                          task_numa_free()
      
      The fix here is to call get_task_struct() early, before dst_rq->lock
      is released, to protect the calculation, and to call
      put_task_struct() at the corresponding point if dst_rq->curr
      ultimately cannot be assigned.
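      
      A hedged sketch of that pattern (not the exact diff; the names follow
      task_numa_compare(), and the 'assigned' flag is illustrative):
      
        rcu_read_lock();
        cur = READ_ONCE(dst_rq->curr);
        if (cur && (cur->flags & PF_EXITING))
                cur = NULL;
        if (cur)
                get_task_struct(cur);   /* pin while still under RCU */
        rcu_read_unlock();
      
        /* ... compute imp from cur->numa_faults, now safely pinned ... */
      
        if (cur && !assigned)
                put_task_struct(cur);   /* every bail-out path drops the pin */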
      
      Additional credit to Liang Chen, who helped fix the error logic and
      add the put_task_struct() call to the place that was missing it.
      Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jay.vosburgh@canonical.com
      Cc: liang.chen@canonical.com
      Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1dff76b9
    • J
      ntp: Fix ADJ_SETOFFSET being used w/ ADJ_NANO · dd4e17ab
      John Stultz authored
      Recently, in commit 37cf4dc3 I forgot to check if the timeval being passed
      was actually a timespec (as is signaled with ADJ_NANO).
      
      This resulted in that patch breaking ADJ_SETOFFSET users who set
      ADJ_NANO, by rejecting valid timespecs because they were checked
      against timeval ranges.
      
      This patch addresses this by checking for the ADJ_NANO flag and
      using the timespec check instead in that case.
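      
      A sketch of the validation branch, assuming the
      timespec/timeval_inject_offset_valid() helpers introduced by 37cf4dc3
      (their exact placement in ntp_validate_timex() is an assumption):
      
        if (txc->modes & ADJ_SETOFFSET) {
                if (txc->modes & ADJ_NANO) {
                        struct timespec ts;
      
                        ts.tv_sec  = txc->time.tv_sec;
                        ts.tv_nsec = txc->time.tv_usec;
                        if (!timespec_inject_offset_valid(&ts))
                                return -EINVAL;
                } else {
                        if (!timeval_inject_offset_valid(&txc->time))
                                return -EINVAL;
                }
        }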
      Reported-by: Harald Hoyer <harald@redhat.com>
      Reported-by: Kay Sievers <kay@vrfy.org>
      Fixes: 37cf4dc3 "time: Verify time values in adjtimex ADJ_SETOFFSET to avoid overflow"
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Link: http://lkml.kernel.org/r/1453417415-19110-2-git-send-email-john.stultz@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      dd4e17ab
    • S
      cpuidle: fix fallback mechanism for suspend to idle in absence of enter_freeze · 6f16886b
      Sudeep Holla authored
      Commit 51164251 "sched / idle: Drop default_idle_call() fallback
      from call_cpuidle()" made find_deepest_state() return a non-negative
      value and check all states with index > 0.  As a result,
      find_deepest_state() returns 0 even when the enter_freeze callbacks
      are not implemented, and enter_freeze_proper() is then called, which
      ends up crashing the kernel.
      
      This patch updates the check for index > 0 in cpuidle_enter_freeze()
      and cpuidle_idle_call() (when idle_should_freeze() is true) to restore
      the suspend-to-idle functionality in the absence of an enter_freeze
      callback.
      
      Fixes: 51164251 "sched / idle: Drop default_idle_call() fallback from call_cpuidle()"
      Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      6f16886b
    • A
      perf: Synchronously free aux pages in case of allocation failure · 45c815f0
      Alexander Shishkin authored
      We are currently using asynchronous deallocation in the error path in
      AUX mmap code, which is unnecessary and also presents a problem for users
      that wish to probe for the biggest possible buffer size they can get:
      they'll get -EINVAL on all subsequent attempts to allocate a smaller
      buffer before the asynchronous deallocation callback frees up the pages
      from the previous unsuccessful attempt.
      
      Currently, gdb does that for allocating AUX buffers for Intel PT traces.
      More specifically, overwrite mode of AUX pmus that don't support hardware
      sg (some implementations of Intel PT, for instance) is limited to only
      one contiguous high order allocation for its buffer and there is no way
      of knowing its size without trying.
      
      This patch changes the error-path freeing to be synchronous, as there won't
      be any contenders for the AUX pages at that point.
      Reported-by: Markus Metzger <markus.t.metzger@intel.com>
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/1453216469-9509-1-git-send-email-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      45c815f0
    • P
      perf: Fix perf_event_exit_task() race · 63b6da39
      Peter Zijlstra authored
      There is a race against perf_event_exit_task() vs
      event_function_call(), find_get_context(), perf_install_in_context()
      (iow, everyone).
      
      Since there is no permanent marker on a context that it is dead, it is
      quite possible that we access (and even modify) a context after it
      has passed through perf_event_exit_task().
      
      For instance, find_get_context() might find the context still
      installed, but by the time we get to perf_install_in_context() it
      might already have passed through perf_event_exit_task() and be
      considered dead; we will, however, still add the event to it.
      
      Solve this by marking a ctx as dead: set its ctx->task value to -1;
      it must be non-zero so we still know it is a (former) task context.
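      
      A sketch of the sentinel and the marking described above (an
      illustrative fragment, not the exact diff):
      
        #define TASK_TOMBSTONE ((struct task_struct *)-1L)
      
        /* mark the context dead, under ctx->lock */
        raw_spin_lock_irq(&child_ctx->lock);
        WRITE_ONCE(child_ctx->task, TASK_TOMBSTONE);
        raw_spin_unlock_irq(&child_ctx->lock);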
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      63b6da39
    • P
      perf: Add more assertions · c97f4736
      Peter Zijlstra authored
      Try to trigger warnings before races do damage.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c97f4736
    • P
      perf: Collapse and fix event_function_call() users · fae3fde6
      Peter Zijlstra authored
      There is one common bug left in all the event_function_call() users:
      between loading ctx->task and getting to the remote_function(),
      ctx->task can already have been changed.

      Therefore we need to double-check and retry if ctx->task != current.
      
      Insert another trampoline specific to event_function_call() that
      checks for this and further validates state. This also allows getting
      rid of the active/inactive functions.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fae3fde6
    • P
      perf: Specialize perf_event_exit_task() · 32132a3d
      Peter Zijlstra authored
      The perf_remove_from_context() usage in __perf_event_exit_task() is
      different from the other usages in that this site has already
      detached and scheduled out the task context.
      
      This will stand in the way of stronger assertions checking the (task)
      context scheduling invariants.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      32132a3d
    • P
      perf: Fix task context scheduling · 39a43640
      Peter Zijlstra authored
      There is a very nasty problem wrt disabling the perf task scheduling
      hooks.
      
      Currently we {set,clear} ctx->is_active on every
      __perf_event_task_sched_{in,out}, _however_ this means that if we
      disable these calls we'll have task contexts with ->is_active set that
      are not active and 'active' task contexts without ->is_active set.
      
      This can result in event_function_call() looping on the ctx->is_active
      condition basically indefinitely.
      
      Resolve this by changing things such that contexts without events do
      not set ->is_active like we used to. From this invariant it trivially
      follows that if there are no (task) events, every task ctx is inactive
      and disabling the context switch hooks is harmless.
      
      This leaves two places that need attention (and already had
      accumulated weird and wonderful hacks to work around, without
      recognising this actual problem).
      
      Namely:
      
       - perf_install_in_context() will need to deal with installing events
         in an inactive context, meaning it cannot rely on ctx->is_active for
         its IPIs.
      
       - perf_remove_from_context() will have to mark a context as inactive
         when it removes the last event.
      
      For specific detail, see the patch/comments.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      39a43640
    • P
      perf: Make ctx->is_active and cpuctx->task_ctx consistent · 63e30d3e
      Peter Zijlstra authored
      For no apparent reason, and to great confusion, the rules for
      ctx->is_active and cpuctx->task_ctx are different. This means that it
      is not always possible to find all active (task) contexts.
      
      Fix this such that if ctx->is_active gets set, we also set (or verify)
      cpuctx->task_ctx.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      63e30d3e
    • P
      perf: Optimize perf_sched_events() usage · 25432ae9
      Peter Zijlstra authored
      It doesn't make sense to take up to _4_ references on
      perf_sched_events() per event; avoid doing this.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      25432ae9
    • P
      perf: Simplify/fix perf_event_enable() event scheduling · aee7dbc4
      Peter Zijlstra authored
      Like perf_enable_on_exec(), perf_event_enable() event scheduling has problems
      respecting the context hierarchy when trying to schedule events (for
      example, it will try and add a pinned event without first removing
      existing flexible events).
      
      So simplify it by using the new ctx_resched() call which will DTRT.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      aee7dbc4
    • P
      perf: Use task_ctx_sched_out() · 8833d0e2
      Peter Zijlstra authored
      We have a function that does exactly what we want here; use it. This
      reduces the amount of cpuctx->task_ctx muckery.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8833d0e2