1. 12 4月, 2018 6 次提交
    • M
      linux/const.h: refactor _BITUL and _BITULL a bit · 21e7bc60
      Masahiro Yamada 提交于
      Minor cleanups available by _UL and _ULL.
      
      Link: http://lkml.kernel.org/r/1519301715-31798-5-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21e7bc60
    • M
      linux/const.h: move UL() macro to include/linux/const.h · 2dd8a62c
      Masahiro Yamada 提交于
      ARM, ARM64 and UniCore32 duplicate the definition of UL():
      
        #define UL(x) _AC(x, UL)
      
      This is not actually arch-specific, so it will be useful to move it to a
      common header.  Currently, we only have the uapi variant for
      linux/const.h, so I am creating include/linux/const.h.
      
      I also added _UL(), _ULL() and ULL() because _AC() is mostly used in
      the form either _AC(..., UL) or _AC(..., ULL).  I expect they will be
      replaced in follow-up cleanups.  The underscore-prefixed ones should
      be used for exported headers.
      
      Link: http://lkml.kernel.org/r/1519301715-31798-4-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: NGuan Xuetao <gxt@mprc.pku.edu.cn>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2dd8a62c
    • M
      linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI · 2a6cc8a6
      Masahiro Yamada 提交于
      Patch series "linux/const.h: cleanups of macros such as UL(), _BITUL(),
      BIT() etc", v3.
      
      ARM, ARM64, UniCore32 define UL() as a shorthand of _AC(..., UL).  More
      architectures may introduce it in the future.
      
      UL() is arch-agnostic, and useful. So let's move it to
      include/linux/const.h
      
      Currently, <asm/memory.h> must be included to use UL().  It pulls in more
      bloats just for defining some bit macros.
      
      I posted V2 one year ago.
      
      The previous posts are:
      https://patchwork.kernel.org/patch/9498273/
      https://patchwork.kernel.org/patch/9498275/
      https://patchwork.kernel.org/patch/9498269/
      https://patchwork.kernel.org/patch/9498271/
      
      At that time, what blocked this series was a comment from
      David Howells:
        You need to be very careful doing this.  Some userspace stuff
        depends on the guard macro names on the kernel header files.
      
      (https://patchwork.kernel.org/patch/9498275/)
      
      Looking at the code closer, I noticed this is not a problem.
      
      See the following line.
      https://github.com/torvalds/linux/blob/v4.16-rc2/scripts/headers_install.sh#L40
      
      scripts/headers_install.sh rips off _UAPI prefix from guard macro names.
      
      I ran "make headers_install" and confirmed the result is what I expect.
      
      So, we can prefix the include guard of include/uapi/linux/const.h,
      and add a new include/linux/const.h.
      
      This patch (of 4):
      
      I am going to add include/linux/const.h for the kernel space.
      
      Add _UAPI to the include guard of include/uapi/linux/const.h to
      prepare for that.
      
      Please notice the guard name of the exported one will be kept as-is.
      So, this commit has no impact to the userspace even if some userspace
      stuff depends on the guard macro names.
      
      scripts/headers_install.sh processes exported headers by SED, and
      rips off "_UAPI" from guard macro names.
      
        #ifndef _UAPI_LINUX_CONST_H
        #define _UAPI_LINUX_CONST_H
      
      will be turned into
      
        #ifndef _LINUX_CONST_H
        #define _LINUX_CONST_H
      
      Link: http://lkml.kernel.org/r/1519301715-31798-2-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a6cc8a6
    • D
      ipc/msg: introduce msgctl(MSG_STAT_ANY) · 23c8cec8
      Davidlohr Bueso 提交于
      There is a permission discrepancy when consulting msq ipc object
      metadata between /proc/sysvipc/msg (0444) and the MSG_STAT shmctl
      command.  The later does permission checks for the object vs S_IRUGO.
      As such there can be cases where EACCESS is returned via syscall but the
      info is displayed anyways in the procfs files.
      
      While this might have security implications via info leaking (albeit no
      writing to the msq metadata), this behavior goes way back and showing
      all the objects regardless of the permissions was most likely an
      overlook - so we are stuck with it.  Furthermore, modifying either the
      syscall or the procfs file can cause userspace programs to break (ie
      ipcs).  Some applications require getting the procfs info (without root
      privileges) and can be rather slow in comparison with a syscall -- up to
      500x in some reported cases for shm.
      
      This patch introduces a new MSG_STAT_ANY command such that the msq ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.netSigned-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Reported-by: NRobert Kettler <robert.kettler@outlook.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23c8cec8
    • D
      ipc/sem: introduce semctl(SEM_STAT_ANY) · a280d6dc
      Davidlohr Bueso 提交于
      There is a permission discrepancy when consulting shm ipc object
      metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl
      command.  The later does permission checks for the object vs S_IRUGO.
      As such there can be cases where EACCESS is returned via syscall but the
      info is displayed anyways in the procfs files.
      
      While this might have security implications via info leaking (albeit no
      writing to the sma metadata), this behavior goes way back and showing
      all the objects regardless of the permissions was most likely an
      overlook - so we are stuck with it.  Furthermore, modifying either the
      syscall or the procfs file can cause userspace programs to break (ie
      ipcs).  Some applications require getting the procfs info (without root
      privileges) and can be rather slow in comparison with a syscall -- up to
      500x in some reported cases for shm.
      
      This patch introduces a new SEM_STAT_ANY command such that the sem ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.netSigned-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Reported-by: NRobert Kettler <robert.kettler@outlook.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a280d6dc
    • D
      ipc/shm: introduce shmctl(SHM_STAT_ANY) · c21a6970
      Davidlohr Bueso 提交于
      Patch series "sysvipc: introduce STAT_ANY commands", v2.
      
      The following patches adds the discussed (see [1]) new command for shm
      as well as for sems and msq as they are subject to the same
      discrepancies for ipc object permission checks between the syscall and
      via procfs.  These new commands are justified in that (1) we are stuck
      with this semantics as changing syscall and procfs can break userland;
      and (2) some users can benefit from performance (for large amounts of
      shm segments, for example) from not having to parse the procfs
      interface.
      
      Once merged, I will submit the necesary manpage updates.  But I'm thinking
      something like:
      
      : diff --git a/man2/shmctl.2 b/man2/shmctl.2
      : index 7bb503999941..bb00bbe21a57 100644
      : --- a/man2/shmctl.2
      : +++ b/man2/shmctl.2
      : @@ -41,6 +41,7 @@
      :  .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new
      :  .\"	attaches to a segment that has already been marked for deletion.
      :  .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions.
      : +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description.
      :  .\"
      :  .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
      :  .SH NAME
      : @@ -242,6 +243,18 @@ However, the
      :  argument is not a segment identifier, but instead an index into
      :  the kernel's internal array that maintains information about
      :  all shared memory segments on the system.
      : +.TP
      : +.BR SHM_STAT_ANY " (Linux-specific)"
      : +Return a
      : +.I shmid_ds
      : +structure as for
      : +.BR SHM_STAT .
      : +However, the
      : +.I shm_perm.mode
      : +is not checked for read access for
      : +.IR shmid ,
      : +resembing the behaviour of
      : +/proc/sysvipc/shm.
      :  .PP
      :  The caller can prevent or allow swapping of a shared
      :  memory segment with the following \fIcmd\fP values:
      : @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the
      :  kernel's internal array recording information about all
      :  shared memory segments.
      :  (This information can be used with repeated
      : -.B SHM_STAT
      : +.B SHM_STAT/SHM_STAT_ANY
      :  operations to obtain information about all shared memory segments
      :  on the system.)
      :  A successful
      : @@ -328,7 +341,7 @@ isn't accessible.
      :  \fIshmid\fP is not a valid identifier, or \fIcmd\fP
      :  is not a valid command.
      :  Or: for a
      : -.B SHM_STAT
      : +.B SHM_STAT/SHM_STAT_ANY
      :  operation, the index value specified in
      :  .I shmid
      :  referred to an array slot that is currently unused.
      
      This patch (of 3):
      
      There is a permission discrepancy when consulting shm ipc object metadata
      between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command.  The
      later does permission checks for the object vs S_IRUGO.  As such there can
      be cases where EACCESS is returned via syscall but the info is displayed
      anyways in the procfs files.
      
      While this might have security implications via info leaking (albeit no
      writing to the shm metadata), this behavior goes way back and showing all
      the objects regardless of the permissions was most likely an overlook - so
      we are stuck with it.  Furthermore, modifying either the syscall or the
      procfs file can cause userspace programs to break (ie ipcs).  Some
      applications require getting the procfs info (without root privileges) and
      can be rather slow in comparison with a syscall -- up to 500x in some
      reported cases.
      
      This patch introduces a new SHM_STAT_ANY command such that the shm ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      [1] https://lkml.org/lkml/2017/12/19/220
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.netSigned-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Robert Kettler <robert.kettler@outlook.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c21a6970
  2. 11 4月, 2018 1 次提交
  3. 05 4月, 2018 1 次提交
    • M
      dm: hold DM table for duration of ioctl rather than use blkdev_get · 971888c4
      Mike Snitzer 提交于
      Commit 519049af ("dm: use blkdev_get rather than bdgrab when issuing
      pass-through ioctl") inadvertantly introduced a regression relative to
      users of device cgroups that issue ioctls (e.g. libvirt).  Using
      blkdev_get() in DM's passthrough ioctl support implicitly introduced a
      cgroup permissions check that would fail unless care were taken to add
      all devices in the IO stack to the device cgroup.  E.g. rather than just
      adding the top-level DM multipath device to the cgroup all the
      underlying devices would need to be allowed.
      
      Fix this, to no longer require allowing all underlying devices, by
      simply holding the live DM table (which includes the table's original
      blkdev_get() reference on the blockdevice that the ioctl will be issued
      to) for the duration of the ioctl.
      
      Also, bump the DM ioctl version so a user can know that their device
      cgroup allow workaround is no longer needed.
      Reported-by: NMichal Privoznik <mprivozn@redhat.com>
      Suggested-by: NMikulas Patocka <mpatocka@redhat.com>
      Fixes: 519049af ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl")
      Cc: stable@vger.kernel.org # 4.16
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      971888c4
  4. 04 4月, 2018 1 次提交
  5. 01 4月, 2018 2 次提交
    • J
      tipc: avoid possible string overflow · 7494cfa6
      Jon Maloy 提交于
      gcc points out that the combined length of the fixed-length inputs to
      l->name is larger than the destination buffer size:
      
      net/tipc/link.c: In function 'tipc_link_create':
      net/tipc/link.c:465:26: error: '%s' directive writing up to 32 bytes
      into a region of size between 26 and 58 [-Werror=format-overflow=]
      sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);
      
      net/tipc/link.c:465:2: note: 'sprintf' output 11 or more bytes
      (assuming 75) into a destination of size 60
      sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);
      
      A detailed analysis reveals that the theoretical maximum length of
      a link name is:
      max self_str + 1 + max if_name + 1 + max peer_str + 1 + max if_name =
      16 + 1 + 15 + 1 + 16 + 1 + 15 = 65
      Since we also need space for a trailing zero we now set MAX_LINK_NAME
      to 68.
      
      Just to be on the safe side we also replace the sprintf() call with
      snprintf().
      
      Fixes: 25b0b9c4 ("tipc: handle collisions of 32-bit node address
      hash values")
      Reported-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7494cfa6
    • J
      tipc: tipc: rename address types in user api · 7a74d39c
      Jon Maloy 提交于
      The three address type structs in the user API have names that in
      reality reflect the specific, non-Linux environment where they were
      originally created.
      
      We now give them more intuitive names, in accordance with how TIPC is
      described in the current documentation.
      
      struct tipc_portid   -> struct tipc_socket_addr
      struct tipc_name     -> struct tipc_service_addr
      struct tipc_name_seq -> struct tipc_service_range
      
      To avoid confusion, we also update some commmets and macro names to
       match the new terminology.
      
      For compatibility, we add macros that map all old names to the new ones.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7a74d39c
  6. 31 3月, 2018 5 次提交
    • A
      bpf: Post-hooks for sys_bind · aac3fc32
      Andrey Ignatov 提交于
      "Post-hooks" are hooks that are called right before returning from
      sys_bind. At this time IP and port are already allocated and no further
      changes to `struct sock` can happen before returning from sys_bind but
      BPF program has a chance to inspect the socket and change sys_bind
      result.
      
      Specifically it can e.g. inspect what port was allocated and if it
      doesn't satisfy some policy, BPF program can force sys_bind to fail and
      return EPERM to user.
      
      Another example of usage is recording the IP:port pair to some map to
      use it in later calls to sys_connect. E.g. if some TCP server inside
      cgroup was bound to some IP:port_n, it can be recorded to a map. And
      later when some TCP client inside same cgroup is trying to connect to
      127.0.0.1:port_n, BPF hook for sys_connect can override the destination
      and connect application to IP:port_n instead of 127.0.0.1:port_n. That
      helps forcing all applications inside a cgroup to use desired IP and not
      break those applications if they e.g. use localhost to communicate
      between each other.
      
      == Implementation details ==
      
      Post-hooks are implemented as two new attach types
      `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
      existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
      
      Separate attach types for IPv4 and IPv6 are introduced to avoid access
      to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
      `inet6_bind()` since those fields might not make sense in such cases.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      aac3fc32
    • A
      bpf: Hooks for sys_connect · d74bad4e
      Andrey Ignatov 提交于
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 2nd
      part of the problem: making outgoing connecttion from desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      Local end of connection can be bound to desired IP using newly
      introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
      and doesn't support binding to port, i.e. leverages
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for port.
      
      As for remote end (`struct sockaddr *` passed by user), both parts of it
      can be overridden, remote IP and remote port. It's useful if an
      application inside cgroup wants to connect to another application inside
      same cgroup or to itself, but knows nothing about IP assigned to the
      cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for same reason as sys_bind
      hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
      when user passes sockaddr_in since it'd be out-of-bound.
      
      == Implementation notes ==
      
      The patch introduces new field in `struct proto`: `pre_connect` that is
      a pointer to a function with same signature as `connect` but is called
      before it. The reason is in some cases BPF hooks should be called way
      before control is passed to `sk->sk_prot->connect`. Specifically
      `inet_dgram_connect` autobinds socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to IP and autobind in `inet_dgram_connect` binds
      only port there is no chance of double-bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
      of value of `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d74bad4e
    • A
      bpf: Hooks for sys_bind · 4fbac77d
      Andrey Ignatov 提交于
      == The problem ==
      
      There is a use-case when all processes inside a cgroup should use one
      single IP address on a host that has multiple IP configured.  Those
      processes should use the IP for both ingress and egress, for TCP and UDP
      traffic. So TCP/UDP servers should be bound to that IP to accept
      incoming connections on it, and TCP/UDP clients should make outgoing
      connections from that IP. It should not require changing application
      code since it's often not possible.
      
      Currently it's solved by intercepting glibc wrappers around syscalls
      such as `bind(2)` and `connect(2)`. It's done by a shared library that
      is preloaded for every process in a cgroup so that whenever TCP/UDP
      server calls `bind(2)`, the library replaces IP in sockaddr before
      passing arguments to syscall. When application calls `connect(2)` the
      library transparently binds the local end of connection to that IP
      (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
      
      Shared library approach is fragile though, e.g.:
      * some applications clear env vars (incl. `LD_PRELOAD`);
      * `/etc/ld.so.preload` doesn't help since some applications are linked
        with option `-z nodefaultlib`;
      * other applications don't use glibc and there is nothing to intercept.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 1st
      part of the problem: binding TCP/UDP servers on desired IP. It does not
      depend on application environment and implementation details (whether
      glibc is used or not).
      
      It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
      attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
      (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
      
      The new program type is intended to be used with sockets (`struct sock`)
      in a cgroup and provided by user `struct sockaddr`. Pointers to both of
      them are parts of the context passed to programs of newly added types.
      
      The new attach types provides hooks in `bind(2)` system call for both
      IPv4 and IPv6 so that one can write a program to override IP addresses
      and ports user program tries to bind to and apply such a program for
      whole cgroup.
      
      == Implementation notes ==
      
      [1]
      Separate attach types for `AF_INET` and `AF_INET6` are added
      intentionally to prevent reading/writing to offsets that don't make
      sense for corresponding socket family. E.g. if user passes `sockaddr_in`
      it doesn't make sense to read from / write to `user_ip6[]` context
      fields.
      
      [2]
      The write access to `struct bpf_sock_addr_kern` is implemented using
      special field as an additional "register".
      
      There are just two registers in `sock_addr_convert_ctx_access`: `src`
      with value to write and `dst` with pointer to context that can't be
      changed not to break later instructions. But the fields, allowed to
      write to, are not available directly and to access them address of
      corresponding pointer has to be loaded first. To get additional register
      the 1st not used by `src` and `dst` one is taken, its content is saved
      to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
      address of pointer field, and finally the register's content is restored
      from the temporary field after writing `src` value.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4fbac77d
    • A
      bpf: Check attach type at prog load time · 5e43f899
      Andrey Ignatov 提交于
      == The problem ==
      
      There are use-cases when a program of some type can be attached to
      multiple attach points and those attach points must have different
      permissions to access context or to call helpers.
      
      E.g. context structure may have fields for both IPv4 and IPv6 but it
      doesn't make sense to read from / write to IPv6 field when attach point
      is somewhere in IPv4 stack.
      
      Same applies to BPF-helpers: it may make sense to call some helper from
      some attach point, but not from other for same prog type.
      
      == The solution ==
      
      Introduce `expected_attach_type` field in in `struct bpf_attr` for
      `BPF_PROG_LOAD` command. If scenario described in "The problem" section
      is the case for some prog type, the field will be checked twice:
      
      1) At load time prog type is checked to see if attach type for it must
         be known to validate program permissions correctly. Prog will be
         rejected with EINVAL if it's the case and `expected_attach_type` is
         not specified or has invalid value.
      
      2) At attach time `attach_type` is compared with `expected_attach_type`,
         if prog type requires to have one, and, if they differ, attach will
         be rejected with EINVAL.
      
      The `expected_attach_type` is now available as part of `struct bpf_prog`
      in both `bpf_verifier_ops->is_valid_access()` and
      `bpf_verifier_ops->get_func_proto()` () and can be used to check context
      accesses and calls to helpers correspondingly.
      
      Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
      Daniel Borkmann <daniel@iogearbox.net> here:
      https://marc.info/?l=linux-netdev&m=152107378717201&w=2Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5e43f899
    • S
      blktrace: fix comment in blktrace_api.h · a5040c2d
      Souvik Banerjee 提交于
      The `__u64 time` field of the blk_io_trace struct refers to
      the time in nanoseconds, not in microseconds. It is set in
      __blk_add_trace, which does the following:
      
          t->time = ktime_to_ns(ktime_get());
      
      ktime_to_ns returns ktime_t in nanoseconds, not microseconds.
      Signed-off-by: NSouvik Banerjee <souvik1997@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a5040c2d
  7. 30 3月, 2018 1 次提交
  8. 29 3月, 2018 7 次提交
    • D
      64bf3d4b
    • D
      nl80211: Implement TX of control port frames · 2576a9ac
      Denis Kenzior 提交于
      This commit implements the TX side of NL80211_CMD_CONTROL_PORT_FRAME.
      Userspace provides the raw EAPoL frame using NL80211_ATTR_FRAME.
      Userspace should also provide the destination address and the protocol
      type to use when sending the frame.  This is used to implement TX of
      Pre-authentication frames.  If CONTROL_PORT_ETHERTYPE_NO_ENCRYPT is
      specified, then the driver will be asked not to encrypt the outgoing
      frame.
      
      A new EXT_FEATURE flag is introduced so that nl80211 code can check
      whether a given wiphy has capability to pass EAPoL frames over nl80211.
      Signed-off-by: NDenis Kenzior <denkenz@gmail.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      2576a9ac
    • D
      nl80211: Add CMD_CONTROL_PORT_FRAME API · 6a671a50
      Denis Kenzior 提交于
      This commit also adds cfg80211_rx_control_port function.  This is used
      to generate a CMD_CONTROL_PORT_FRAME event out to userspace.  The
      conn_owner_nlportid is used as the unicast destination.  This means that
      userspace must specify NL80211_ATTR_SOCKET_OWNER flag if control port
      over nl80211 routing is requested in NL80211_CMD_CONNECT,
      NL80211_CMD_ASSOCIATE, NL80211_CMD_START_AP or IBSS/mesh join.
      Signed-off-by: NDenis Kenzior <denkenz@gmail.com>
      [johannes: fix return value of cfg80211_rx_control_port()]
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      6a671a50
    • D
      466a3061
    • D
      nl80211: Add SOCKET_OWNER support to JOIN_MESH · 188c1b3c
      Denis Kenzior 提交于
      Signed-off-by: NDenis Kenzior <denkenz@gmail.com>
      [johannes: fix race with wdev lock/unlock by just acquiring once]
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      188c1b3c
    • D
      nl80211: Add SOCKET_OWNER support to JOIN_IBSS · f8d16d3e
      Denis Kenzior 提交于
      Signed-off-by: NDenis Kenzior <denkenz@gmail.com>
      [johannes: fix race with wdev lock/unlock by just acquiring once]
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      f8d16d3e
    • A
      bpf: introduce BPF_RAW_TRACEPOINT · c4f6699d
      Alexei Starovoitov 提交于
      Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
      kernel internal arguments of the tracepoints in their raw form.
      
      >From bpf program point of view the access to the arguments look like:
      struct bpf_raw_tracepoint_args {
             __u64 args[0];
      };
      
      int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
      {
        // program can read args[N] where N depends on tracepoint
        // and statically verified at program load+attach time
      }
      
      kprobe+bpf infrastructure allows programs access function arguments.
      This feature allows programs access raw tracepoint arguments.
      
      Similar to proposed 'dynamic ftrace events' there are no abi guarantees
      to what the tracepoints arguments are and what their meaning is.
      The program needs to type cast args properly and use bpf_probe_read()
      helper to access struct fields when argument is a pointer.
      
      For every tracepoint __bpf_trace_##call function is prepared.
      In assembler it looks like:
      (gdb) disassemble __bpf_trace_xdp_exception
      Dump of assembler code for function __bpf_trace_xdp_exception:
         0xffffffff81132080 <+0>:     mov    %ecx,%ecx
         0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>
      
      where
      
      TRACE_EVENT(xdp_exception,
              TP_PROTO(const struct net_device *dev,
                       const struct bpf_prog *xdp, u32 act),
      
      The above assembler snippet is casting 32-bit 'act' field into 'u64'
      to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
      All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
      and in total this approach adds 7k bytes to .text.
      
      This approach gives the lowest possible overhead
      while calling trace_xdp_exception() from kernel C code and
      transitioning into bpf land.
      Since tracepoint+bpf are used at speeds of 1M+ events per second
      this is valuable optimization.
      
      The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
      that returns anon_inode FD of 'bpf-raw-tracepoint' object.
      
      The user space looks like:
      // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
      prog_fd = bpf_prog_load(...);
      // receive anon_inode fd for given bpf_raw_tracepoint with prog attached
      raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of tracing daemon or cmdline tool that uses this feature
      will automatically detach bpf program, unload it and
      unregister tracepoint probe.
      
      On the kernel side the __bpf_raw_tp_map section of pointers to
      tracepoint definition and to __bpf_trace_*() probe function is used
      to find a tracepoint with "xdp_exception" name and
      corresponding __bpf_trace_xdp_exception() probe function
      which are passed to tracepoint_probe_register() to connect probe
      with tracepoint.
      
      Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
      tracepoint mechanisms. perf_event_open() can be used in parallel
      on the same tracepoint.
      Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
      Each with its own bpf program. The kernel will execute
      all tracepoint probes and all attached bpf programs.
      
      In the future bpf_raw_tracepoints can be extended with
      query/introspection logic.
      
      __bpf_raw_tp_map section logic was contributed by Steven Rostedt
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      c4f6699d
  9. 28 3月, 2018 2 次提交
  10. 27 3月, 2018 2 次提交
    • I
      ethtool: Add support for configuring PFC stall prevention in ethtool · e1577c1c
      Inbar Karmy 提交于
      In the event where the device unexpectedly becomes unresponsive
      for a long period of time, flow control mechanism may propagate
      pause frames which will cause congestion spreading to the entire
      network.
      To prevent this scenario, when the device is stalled for a period
      longer than a pre-configured timeout, flow control mechanisms are
      automatically disabled.
      
      This patch adds support for the ETHTOOL_PFC_STALL_PREVENTION
      as a tunable.
      This API provides support for configuring flow control storm prevention
      timeout (msec).
      Signed-off-by: NInbar Karmy <inbark@mellanox.com>
      Cc: Michal Kubecek <mkubecek@suse.cz>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      e1577c1c
    • A
      vfio/pci: Add ioeventfd support · 30656177
      Alex Williamson 提交于
      The ioeventfd here is actually irqfd handling of an ioeventfd such as
      supported in KVM.  A user is able to pre-program a device write to
      occur when the eventfd triggers.  This is yet another instance of
      eventfd-irqfd triggering between KVM and vfio.  The impetus for this
      is high frequency writes to pages which are virtualized in QEMU.
      Enabling this near-direct write path for selected registers within
      the virtualized page can improve performance and reduce overhead.
      Specifically this is initially targeted at NVIDIA graphics cards where
      the driver issues a write to an MMIO register within a virtualized
      region in order to allow the MSI interrupt to re-trigger.
      Reviewed-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      30656177
  11. 24 3月, 2018 2 次提交
    • J
      tipc: add 128-bit node identifier · d50ccc2d
      Jon Maloy 提交于
      We add a 128-bit node identity, as an alternative to the currently used
      32-bit node address.
      
      For the sake of compatibility and to minimize message header changes
      we retain the existing 32-bit address field. When not set explicitly by
      the user, this field will be filled with a hash value generated from the
      much longer node identity, and be used as a shorthand value for the
      latter.
      
      We permit either the address or the identity to be set by configuration,
      but not both, so when the address value is set by a legacy user the
      corresponding 128-bit node identity is generated based on the that value.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d50ccc2d
    • D
      tls: RX path for ktls · c46234eb
      Dave Watson 提交于
      Add rx path for tls software implementation.
      
      recvmsg, splice_read, and poll implemented.
      
      An additional sockopt TLS_RX is added, with the same interface as
      TLS_TX.  Either TLX_RX or TLX_TX may be provided separately, or
      together (with two different setsockopt calls with appropriate keys).
      
      Control messages are passed via CMSG in a similar way to transmit.
      If no cmsg buffer is passed, then only application data records
      will be passed to userspace, and EIO is returned for other types of
      alerts.
      
      EBADMSG is passed for decryption errors, and EMSGSIZE is passed for
      framing too big, and EBADMSG for framing too small (matching openssl
      semantics). EINVAL is returned for TLS versions that do not match the
      original setsockopt call.  All are unrecoverable.
      
      strparser is used to parse TLS framing.   Decryption is done directly
      in to userspace buffers if they are large enough to support it, otherwise
      sk_cow_data is called (similar to ipsec), and buffers are decrypted in
      place and copied.  splice_read always decrypts in place, since no
      buffers are provided to decrypt in to.
      
      sk_poll is overridden, and only returns POLLIN if a full TLS message is
      received.  Otherwise we wait for strparser to finish reading a full frame.
      Actual decryption is only done during recvmsg or splice_read calls.
      Signed-off-by: NDave Watson <davejwatson@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c46234eb
  12. 23 3月, 2018 3 次提交
  13. 22 3月, 2018 3 次提交
  14. 21 3月, 2018 4 次提交