1. 11 Apr, 2018 (1 commit)
    • bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog · 3a38bb98
      Committed by Yonghong Song
      syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
      The error details:
        ======================================================
        WARNING: possible circular locking dependency detected
        4.16.0-rc7+ #3 Not tainted
        ------------------------------------------------------
        syz-executor7/24531 is trying to acquire lock:
         (bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
      
        but task is already holding lock:
         (&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (&mm->mmap_sem){++++}:
             __might_fault+0x13a/0x1d0 mm/memory.c:4571
             _copy_to_user+0x2c/0xc0 lib/usercopy.c:25
             copy_to_user include/linux/uaccess.h:155 [inline]
             bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
             perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
             _perf_ioctl kernel/events/core.c:4750 [inline]
             perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
             vfs_ioctl fs/ioctl.c:46 [inline]
             do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
             SYSC_ioctl fs/ioctl.c:701 [inline]
             SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
             do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
             entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
        -> #0 (bpf_event_mutex){+.+.}:
             lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
             __mutex_lock_common kernel/locking/mutex.c:756 [inline]
             __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
             mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
             perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
             perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
             _free_event+0xbdb/0x10f0 kernel/events/core.c:4116
             put_event+0x24/0x30 kernel/events/core.c:4204
             perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
             remove_vma+0xb4/0x1b0 mm/mmap.c:172
             remove_vma_list mm/mmap.c:2490 [inline]
             do_munmap+0x82a/0xdf0 mm/mmap.c:2731
             mmap_region+0x59e/0x15a0 mm/mmap.c:1646
             do_mmap+0x6c0/0xe00 mm/mmap.c:1483
             do_mmap_pgoff include/linux/mm.h:2223 [inline]
             vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
             SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
             SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
             SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
             SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
             do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
             entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&mm->mmap_sem);
                                       lock(bpf_event_mutex);
                                       lock(&mm->mmap_sem);
          lock(bpf_event_mutex);
      
         *** DEADLOCK ***
        ======================================================
      
      The bug was introduced by commit f371b304 ("bpf/tracing: allow
      user space to query prog array on the same tp") where copy_to_user,
      which requires mm->mmap_sem, is called while holding bpf_event_mutex.
      At the same time, during perf_event file descriptor close,
      mm->mmap_sem is held first and the subsequent
      perf_event_detach_bpf_prog then needs bpf_event_mutex.
      Such a scenario causes the deadlock.
      
      As suggested by Daniel, moving copy_to_user out of the
      bpf_event_mutex lock should fix the problem.
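
      A minimal sketch of the resulting pattern (variable names abbreviated
      from kernel/trace/bpf_trace.c; the temporary buffer handling is
      illustrative): the prog ids are gathered into kernel memory while
      holding the mutex, and copy_to_user() runs only after the mutex is
      dropped:

      u32 *ids = kcalloc(ids_len, sizeof(u32), GFP_KERNEL);

      if (!ids)
              return -ENOMEM;

      mutex_lock(&bpf_event_mutex);
      ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
                                     ids, ids_len, &prog_cnt);
      mutex_unlock(&bpf_event_mutex);

      /* copy_to_user() may fault and take mm->mmap_sem -- safe here,
       * because bpf_event_mutex is no longer held */
      if (!ret && copy_to_user(uquery->ids, ids, ids_len * sizeof(u32)))
              ret = -EFAULT;
      kfree(ids);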
      
      Fixes: f371b304 ("bpf/tracing: allow user space to query prog array on the same tp")
      Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      3a38bb98
  2. 06 Apr, 2018 (2 commits)
    • headers: untangle kmemleak.h from mm.h · 514c6032
      Committed by Randy Dunlap
      Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
      reason.  It looks like it's only a convenience, so remove kmemleak.h
      from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
      don't already #include it.  Also remove <linux/kmemleak.h> from source
      files that do not use it.
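
      For example, a hypothetical kmemleak_* caller that previously relied
      on slab.h pulling kmemleak.h in transitively now needs the explicit
      include:

      #include <linux/slab.h>
      #include <linux/kmemleak.h>     /* no longer comes in via slab.h */

      void *alloc_tracked(size_t n)
      {
              void *p = kmalloc(n, GFP_KERNEL);

              if (p)
                      kmemleak_not_leak(p);   /* direct kmemleak_* user */
              return p;
      }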
      
      This is tested on i386 allmodconfig and x86_64 allmodconfig.  It would
      be good to run it through the 0day bot for other $ARCHes.  I have
      neither the horsepower nor the storage space for the other $ARCHes.
      
      Update: This patch has been extensively build-tested by both the 0day
      bot & kisskb/ozlabs build farms.  Both of them reported 2 build failures
      for which patches are included here (in v2).
      
      [ slab.h is the second most used header file after module.h; kernel.h is
        right there with slab.h. There could be some minor error in the
        counting due to some #includes having comments after them and I didn't
        combine all of those. ]
      
      [akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
      Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
      Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Reported-by: Michael Ellerman <mpe@ellerman.id.au>	[2 build failures]
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>	[2 build failures]
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Wei Yongjun <weiyongjun1@huawei.com>
      Cc: Luis R. Rodriguez <mcgrof@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: John Johansen <john.johansen@canonical.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/fork.c: detect early free of a live mm · 3eda69c9
      Committed by Mark Rutland
      KASAN splats indicate that in some cases we free a live mm, then
      continue to access it, with potentially disastrous results.  This is
      likely due to a mismatched mmdrop() somewhere in the kernel, but so far
      the culprit remains elusive.
      
      Let's have __mmdrop() verify that the mm isn't live for the current
      task, similar to the existing check for init_mm.  This way, we can catch
      this class of issue earlier, and without requiring KASAN.
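
      A sketch of the new assertions in __mmdrop(), next to the existing
      init_mm check (the exact shape is assumed from the description above):

      static inline void __mmdrop(struct mm_struct *mm)
      {
              BUG_ON(mm == &init_mm);
              WARN_ON_ONCE(mm == current->mm);
              WARN_ON_ONCE(mm == current->active_mm);
              ...
      }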
      
      Currently, idle_task_exit() leaves active_mm stale after it switches to
      init_mm.  This isn't harmful, but will trigger the new assertions, so we
      must adjust idle_task_exit() to update active_mm.
      
      Link: http://lkml.kernel.org/r/20180312140103.19235-1-mark.rutland@arm.com
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 04 Apr, 2018 (4 commits)
  4. 03 Apr, 2018 (17 commits)
  5. 31 Mar, 2018 (6 commits)
    • locking/rwsem: Add DEBUG_RWSEMS to look for lock/unlock mismatches · 5149cbac
      Committed by Waiman Long
      For a rwsem, locking can either be exclusive or shared. The corresponding
      exclusive or shared unlock must be used. Otherwise, the protected data
      structures may get corrupted or the lock may be in an inconsistent state.
      
      In order to detect such an anomaly, a new configuration option
      DEBUG_RWSEMS is added which can be enabled to look for such mismatches
      and print a warning when one happens.
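
      A sketch of the check (assumed shape): the exclusive locker is recorded
      as the owner, and the unlock paths warn on a mismatch:

      #ifdef CONFIG_DEBUG_RWSEMS
      # define DEBUG_RWSEMS_WARN_ON(c) DEBUG_LOCKS_WARN_ON(c)
      #else
      # define DEBUG_RWSEMS_WARN_ON(c)
      #endif

      void up_write(struct rw_semaphore *sem)
      {
              /* fires e.g. when up_write() is paired with down_read() */
              DEBUG_RWSEMS_WARN_ON(sem->owner != current);
              rwsem_clear_owner(sem);
              __up_write(sem);
      }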
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1522445280-7767-2-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: Post-hooks for sys_bind · aac3fc32
      Committed by Andrey Ignatov
      "Post-hooks" are hooks that are called right before returning from
      sys_bind. At this point the IP and port are already allocated and no
      further changes to `struct sock` can happen before sys_bind returns, but
      the BPF program has a chance to inspect the socket and change the
      sys_bind result.

      Specifically it can e.g. inspect what port was allocated and, if it
      doesn't satisfy some policy, force sys_bind to fail and return EPERM to
      the user.
      
      Another example of usage is recording the IP:port pair in some map for
      use in later calls to sys_connect. E.g. if some TCP server inside a
      cgroup was bound to some IP:port_n, it can be recorded in a map. Later,
      when some TCP client inside the same cgroup tries to connect to
      127.0.0.1:port_n, the BPF hook for sys_connect can override the
      destination and connect the application to IP:port_n instead of
      127.0.0.1:port_n. That helps force all applications inside a cgroup to
      use the desired IP without breaking applications that e.g. use
      localhost to communicate with each other.
      
      == Implementation details ==
      
      Post-hooks are implemented as two new attach types
      `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
      existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
      
      Separate attach types for IPv4 and IPv6 are introduced to avoid
      accessing the IPv6 field in `struct sock` from `inet_bind()` and the
      IPv4 field from `inet6_bind()`, since those fields might not make sense
      in such cases.
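
      A hypothetical post-bind4 program enforcing a port policy as described
      above (the SEC name follows the new attach type; returning 1 allows
      the bind, returning 0 makes sys_bind fail with EPERM):

      SEC("cgroup/post_bind4")
      int enforce_port_policy(struct bpf_sock *sk)
      {
              /* src_ip4/src_port are readable once bind has picked them */
              if (sk->src_port < 1024)        /* hypothetical policy */
                      return 0;               /* sys_bind returns -EPERM */
              return 1;
      }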
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Hooks for sys_connect · d74bad4e
      Committed by Andrey Ignatov
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides a much more reliable in-kernel solution for the 2nd
      part of the problem: making an outgoing connection from a desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      The local end of the connection can be bound to a desired IP using the
      newly introduced BPF helper `bpf_bind()`. It only allows binding to an
      IP though, and doesn't support binding to a port, i.e. it leverages the
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for the port.
      
      As for the remote end (the `struct sockaddr *` passed by the user),
      both parts of it can be overridden: remote IP and remote port. It's
      useful if an application inside a cgroup wants to connect to another
      application inside the same cgroup or to itself, but knows nothing
      about the IP assigned to the cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for the same reason as the
      sys_bind hooks, i.e. to prevent reading from / writing to e.g. user_ip6
      fields when the user passes a sockaddr_in, since it'd be out-of-bounds.
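
      A hypothetical connect4 program implementing the rewrite described
      above (the addresses and the bpf_htonl helper, as found in the
      selftests' bpf_endian.h, are illustrative):

      SEC("cgroup/connect4")
      int rewrite_connect(struct bpf_sock_addr *ctx)
      {
              struct sockaddr_in sa = {
                      .sin_family = AF_INET,
                      .sin_addr.s_addr = bpf_htonl(0x0a000001), /* 10.0.0.1 */
              };

              /* redirect 127.0.0.1:port_n to the cgroup's IP, same port */
              if (ctx->user_ip4 == bpf_htonl(0x7f000001))
                      ctx->user_ip4 = bpf_htonl(0x0a000001);

              /* bind the local end to the cgroup's IP; sin_port stays 0,
               * so IP_BIND_ADDRESS_NO_PORT semantics apply */
              bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa));
              return 1;
      }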
      
      == Implementation notes ==
      
      The patch introduces a new field in `struct proto`: `pre_connect`, a
      pointer to a function with the same signature as `connect` that is
      called before it. The reason is that in some cases BPF hooks should be
      called well before control is passed to `sk->sk_prot->connect`.
      Specifically, `inet_dgram_connect` autobinds the socket before calling
      `sk->sk_prot->connect`, and there is no way to call `bpf_bind()` from
      hooks in e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause a double bind. On the other hand, `proto.pre_connect`
      provides a flexible way to add BPF hooks for connect only for the
      necessary `proto` and call them at the desired time before `connect`.
      Since `bpf_bind()` is allowed to bind only to an IP and the autobind in
      `inet_dgram_connect` binds only the port, there is no chance of a
      double bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind only to an IP,
      regardless of the value of the `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling __inet_bind()
      and __inet6_bind(), since all call-sites where bpf_bind() is called
      already hold the socket lock.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Hooks for sys_bind · 4fbac77d
      Committed by Andrey Ignatov
      == The problem ==
      
      There is a use-case where all processes inside a cgroup should use one
      single IP address on a host that has multiple IPs configured.  Those
      processes should use the IP for both ingress and egress, for TCP and UDP
      traffic. So TCP/UDP servers should be bound to that IP to accept
      incoming connections on it, and TCP/UDP clients should make outgoing
      connections from that IP. It should not require changing application
      code since that's often not possible.
      
      Currently it's solved by intercepting glibc wrappers around syscalls
      such as `bind(2)` and `connect(2)`. It's done by a shared library that
      is preloaded for every process in a cgroup so that whenever a TCP/UDP
      server calls `bind(2)`, the library replaces the IP in the sockaddr
      before passing the arguments to the syscall. When an application calls
      `connect(2)`, the library transparently binds the local end of the
      connection to that IP (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to
      avoid a performance penalty).
      
      The shared library approach is fragile though, e.g.:
      * some applications clear env vars (incl. `LD_PRELOAD`);
      * `/etc/ld.so.preload` doesn't help since some applications are linked
        with option `-z nodefaultlib`;
      * other applications don't use glibc and there is nothing to intercept.
      
      == The solution ==
      
      The patch provides a much more reliable in-kernel solution for the 1st
      part of the problem: binding TCP/UDP servers to a desired IP. It does
      not depend on the application environment and implementation details
      (whether glibc is used or not).
      
      It adds a new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
      attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
      (similar to the already existing `BPF_CGROUP_INET_SOCK_CREATE`).
      
      The new program type is intended to be used with sockets (`struct sock`)
      in a cgroup and the user-provided `struct sockaddr`. Pointers to both of
      them are part of the context passed to programs of the newly added type.
      
      The new attach types provide hooks in the `bind(2)` system call for both
      IPv4 and IPv6 so that one can write a program to override the IP
      addresses and ports a user program tries to bind to, and apply such a
      program to a whole cgroup.
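
      For illustration, a hypothetical bind4 program forcing binds onto a
      single IP (192.0.2.1 is a made-up address):

      SEC("cgroup/bind4")
      int force_bind_ip(struct bpf_sock_addr *ctx)
      {
              /* rewrite INADDR_ANY binds to the cgroup's IP */
              if (ctx->user_ip4 == 0)
                      ctx->user_ip4 = bpf_htonl(0xc0000201); /* 192.0.2.1 */
              return 1;       /* returning 0 would reject the bind */
      }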
      
      == Implementation notes ==
      
      [1]
      Separate attach types for `AF_INET` and `AF_INET6` are added
      intentionally to prevent reading/writing to offsets that don't make
      sense for corresponding socket family. E.g. if user passes `sockaddr_in`
      it doesn't make sense to read from / write to `user_ip6[]` context
      fields.
      
      [2]
      The write access to `struct bpf_sock_addr_kern` is implemented using a
      special field as an additional "register".
      
      There are just two registers in `sock_addr_convert_ctx_access`: `src`
      with the value to write and `dst` with the pointer to the context, which
      can't be changed without breaking later instructions. But the fields
      that may be written to are not available directly; to access them, the
      address of the corresponding pointer has to be loaded first. To get an
      additional register, the first one not used by `src` and `dst` is taken,
      its content is saved to `bpf_sock_addr_kern.tmp_reg`, then the register
      is used to load the address of the pointer field, and finally the
      register's content is restored from the temporary field after the `src`
      value is written.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Check attach type at prog load time · 5e43f899
      Committed by Andrey Ignatov
      == The problem ==
      
      There are use-cases where a program of some type can be attached to
      multiple attach points, and those attach points must have different
      permissions to access the context or to call helpers.
      
      E.g. the context structure may have fields for both IPv4 and IPv6, but
      it doesn't make sense to read from / write to the IPv6 field when the
      attach point is somewhere in the IPv4 stack.
      
      The same applies to BPF helpers: it may make sense to call some helper
      from one attach point, but not from others, for the same prog type.
      
      == The solution ==
      
      Introduce an `expected_attach_type` field in `struct bpf_attr` for the
      `BPF_PROG_LOAD` command. If the scenario described in "The problem"
      section applies to some prog type, the field will be checked twice:
      
      1) At load time the prog type is checked to see if the attach type for
         it must be known to validate program permissions correctly. The prog
         will be rejected with EINVAL if that's the case and
         `expected_attach_type` is not specified or has an invalid value.

      2) At attach time `attach_type` is compared with `expected_attach_type`,
         if the prog type requires one, and, if they differ, the attach will
         be rejected with EINVAL.
      
      The `expected_attach_type` is now available as part of `struct bpf_prog`
      in both `bpf_verifier_ops->is_valid_access()` and
      `bpf_verifier_ops->get_func_proto()` and can be used to check context
      accesses and calls to helpers accordingly.
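
      A sketch of loading a program with the new field set, using raw
      sys_bpf for clarity (insns/license setup omitted):

      union bpf_attr attr = {
              .prog_type            = BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
              .expected_attach_type = BPF_CGROUP_INET4_BIND,
              /* .insns, .insn_cnt, .license ... */
      };
      int prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));

      /* attaching this prog with an attach_type other than
       * BPF_CGROUP_INET4_BIND is then rejected with -EINVAL */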
      
      Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
      Daniel Borkmann <daniel@iogearbox.net> here:
      https://marc.info/?l=linux-netdev&m=152107378717201&w=2
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockmap: initialize sg table entries properly · 6ef6d84c
      Committed by Prashant Bhole
      When CONFIG_DEBUG_SG is set, sg->sg_magic is initialized in
      sg_init_table() and verified by the sg API while navigating the list.
      We hit a BUG_ON when the magic check fails.

      In bpf_tcp_sendpage and bpf_tcp_sendmsg, the struct containing the
      scatterlist is already zeroed out. So to avoid an extra memset, we use
      sg_init_marker() to initialize sg_magic.
      
      This fixes the following:
      - In bpf_tcp_sendpage: initialize the sg using sg_init_marker
      - In bpf_tcp_sendmsg: replace sg_init_table with sg_init_marker
      - In bpf_tcp_push: replace memset with sg_init_table where a consumed
        sg entry needs to be re-initialized
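
      For reference, a sketch of what sg_init_marker() amounts to compared
      with sg_init_table() (shape assumed; it skips the memset on memory
      that is already zeroed):

      static inline void sg_init_marker(struct scatterlist *sgl,
                                        unsigned int nents)
      {
      #ifdef CONFIG_DEBUG_SG
              unsigned int i;

              for (i = 0; i < nents; i++)
                      sgl[i].sg_magic = SG_MAGIC;
      #endif
              sg_mark_end(&sgl[nents - 1]);
      }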
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  6. 30 Mar, 2018 (4 commits)
  7. 29 Mar, 2018 (5 commits)
    • alarmtimer: Init nanosleep alarm timer on stack · bd031430
      Committed by Thomas Gleixner
      syzbot reported the following debugobjects splat:
      
       ODEBUG: object is on stack, but not annotated
       WARNING: CPU: 0 PID: 4185 at lib/debugobjects.c:328
      
       RIP: 0010:debug_object_is_on_stack lib/debugobjects.c:327 [inline]
       debug_object_init+0x17/0x20 lib/debugobjects.c:391
       debug_hrtimer_init kernel/time/hrtimer.c:410 [inline]
       debug_init kernel/time/hrtimer.c:458 [inline]
       hrtimer_init+0x8c/0x410 kernel/time/hrtimer.c:1259
       alarm_init kernel/time/alarmtimer.c:339 [inline]
       alarm_timer_nsleep+0x164/0x4d0 kernel/time/alarmtimer.c:787
       SYSC_clock_nanosleep kernel/time/posix-timers.c:1226 [inline]
       SyS_clock_nanosleep+0x235/0x330 kernel/time/posix-timers.c:1204
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      This happens because the hrtimer for the alarm nanosleep is on the
      stack, but the code does not use the proper debugobjects initialization.
      
      Split out the code for the allocated use cases and invoke
      hrtimer_init_on_stack() for the nanosleep related functions.
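
      A sketch of the split (assumed shape): the shared initialization moves
      into a helper and the nanosleep path uses the on-stack hrtimer variant,
      which carries the debugobjects stack annotation:

      static void alarm_init_on_stack(struct alarm *alarm,
                                      enum alarmtimer_type type,
                                      enum alarmtimer_restart (*function)(struct alarm *, ktime_t))
      {
              hrtimer_init_on_stack(&alarm->timer, alarm_bases[type].base_clockid,
                                    HRTIMER_MODE_ABS);
              __alarm_init(alarm, type, function);
      }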
      
      Reported-by: syzbot+a3e0726462b2e346a31d@syzkaller.appspotmail.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: syzkaller-bugs@googlegroups.com
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1803261528270.1585@nanos.tec.linutronix.de
    • perf/x86/pt, coresight: Clean up address filter structure · 6ed70cf3
      Committed by Alexander Shishkin
      This is a cosmetic patch that deals with the address filter structure's
      ambiguous fields 'filter' and 'range'. The former means that the
      filter's *action* should be to filter the traces to its address range
      if it's set, or to stop tracing if it's unset. This is confusing and
      hard on the eyes, so this patch replaces it with an 'action' enum. The
      'range' field (meaning that the filter is an address range as opposed
      to a single address trigger) is completely redundant, as we can use a
      zero size to mean the same thing.
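
      The replacement enum looks roughly like this (with a zero filter size
      now standing in for the old 'range' bit):

      enum perf_addr_filter_action_t {
              PERF_ADDR_FILTER_ACTION_STOP = 0,
              PERF_ADDR_FILTER_ACTION_START,
              PERF_ADDR_FILTER_ACTION_FILTER,
      };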
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: http://lkml.kernel.org/r/20180329120648.11902-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • lockdep: Make the lock debug output more useful · b3c39758
      Committed by Tetsuo Handa
      The lock debug output in print_lock() has a few shortcomings:
      
       - It prints the hlock->acquire_ip field in %px and %pS format. That's
         redundant information.
      
       - It lacks information about the lock object itself. The lock class is
         not helpful to identify a particular instance of a lock.
      
      Change the output so it prints:
      
       - hlock->instance to allow identification of a particular lock instance.
      
       - only the %pS format of hlock->acquire_ip, which is sufficient to decode
         the actual code line with faddr2line.
      
      The resulting output is:
      
      3 locks held by a.out/31106:
      #0: 00000000b0f753ba (&mm->mmap_sem){++++}, at: copy_process.part.41+0x10d5/0x1fe0
      #1: 00000000ef64d539 (&mm->mmap_sem/1){+.+.}, at: copy_process.part.41+0x10fe/0x1fe0
      #2: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: copy_process.part.41+0x12f2/0x1fe0
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-mm@kvack.org
      Cc: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/201803271941.GBE57310.tVSOJLQOFFOHFM@I-love.SAKURA.ne.jp
    • locking/rtmutex: Handle non enqueued waiters gracefully in remove_waiter() · c28d62cf
      Committed by Peter Zijlstra
      In -RT, task_blocks_on_rt_mutex() may return with -EAGAIN due to
      (->pi_blocked_on == PI_WAKEUP_INPROGRESS) before it has added itself as
      a waiter. In such a case remove_waiter() must not be called, because
      without a waiter it will trigger the BUG_ON() statement.
      
      This was initially reported by Yimin Deng. Thomas Gleixner fixed it then
      with an explicit check for waiters before calling remove_waiter().
      
      Instead of an explicit NULL check before calling rt_mutex_top_waiter(),
      make the function return NULL if there are no waiters. With that fixed,
      the now-pointless NULL check is removed from rt_mutex_slowlock().
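
      A sketch of the reworked helper (assumed shape): it returns NULL when
      the rbtree is empty instead of blindly dereferencing the leftmost node:

      static inline struct rt_mutex_waiter *
      rt_mutex_top_waiter(struct rt_mutex *lock)
      {
              struct rb_node *leftmost = rb_first_cached(&lock->waiters);
              struct rt_mutex_waiter *w = NULL;

              if (leftmost) {
                      w = rb_entry(leftmost, struct rt_mutex_waiter, tree_entry);
                      BUG_ON(w->lock != lock);
              }
              return w;
      }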
      Reported-and-debugged-by: Yimin Deng <yimin11.deng@gmail.com>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/CAAh1qt=DCL9aUXNxanP5BKtiPp3m+qj4yB+gDohhXPVFCxWwzg@mail.gmail.com
      Link: https://lkml.kernel.org/r/20180327121438.sss7hxg3crqy4ecd@linutronix.de
    • bpf: introduce BPF_RAW_TRACEPOINT · c4f6699d
      Committed by Alexei Starovoitov
      Introduce the BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
      kernel-internal arguments of tracepoints in their raw form.
      
      From the bpf program's point of view, access to the arguments looks like:
      struct bpf_raw_tracepoint_args {
             __u64 args[0];
      };
      
      int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
      {
        // program can read args[N] where N depends on tracepoint
        // and statically verified at program load+attach time
      }
      
      The kprobe+bpf infrastructure allows programs to access function
      arguments. This feature allows programs to access raw tracepoint
      arguments.

      Similar to the proposed 'dynamic ftrace events', there are no ABI
      guarantees as to what the tracepoint arguments are and what their
      meaning is. The program needs to type-cast the args properly and use
      the bpf_probe_read() helper to access struct fields when an argument
      is a pointer.
      
      For every tracepoint, a __bpf_trace_##call function is prepared.
      In assembler it looks like:
      (gdb) disassemble __bpf_trace_xdp_exception
      Dump of assembler code for function __bpf_trace_xdp_exception:
         0xffffffff81132080 <+0>:     mov    %ecx,%ecx
         0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>
      
      where
      
      TRACE_EVENT(xdp_exception,
              TP_PROTO(const struct net_device *dev,
                       const struct bpf_prog *xdp, u32 act),
      
      The above assembler snippet is casting the 32-bit 'act' field into 'u64'
      to pass into bpf_trace_run3(), while the 'dev' and 'xdp' args are passed
      as-is. All ~500 __bpf_trace_*() functions are only 5-10 bytes long
      and in total this approach adds 7k bytes to .text.
      
      This approach gives the lowest possible overhead
      when calling trace_xdp_exception() from kernel C code and
      transitioning into bpf land.
      Since tracepoint+bpf is used at speeds of 1M+ events per second,
      this is a valuable optimization.
      
      A new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
      that returns an anon_inode FD of the 'bpf-raw-tracepoint' object.
      
      The user space looks like:
      // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
      prog_fd = bpf_prog_load(...);
      // receive anon_inode fd for given bpf_raw_tracepoint with prog attached
      raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of the tracing daemon or cmdline tool that uses this feature
      will automatically detach the bpf program, unload it and
      unregister the tracepoint probe.
      
      On the kernel side, the __bpf_raw_tp_map section of pointers to the
      tracepoint definition and to the __bpf_trace_*() probe function is used
      to find the tracepoint named "xdp_exception" and the corresponding
      __bpf_trace_xdp_exception() probe function, which are passed to
      tracepoint_probe_register() to connect the probe with the tracepoint.
      
      The addition of bpf_raw_tracepoint doesn't interfere with the ftrace
      and perf tracepoint mechanisms. perf_event_open() can be used in
      parallel on the same tracepoint.
      Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) calls are
      permitted, each with its own bpf program. The kernel will execute
      all tracepoint probes and all attached bpf programs.
      
      In the future bpf_raw_tracepoints can be extended with
      query/introspection logic.
      
      The __bpf_raw_tp_map section logic was contributed by Steven Rostedt.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  8. 28 Mar, 2018 (1 commit)