1. 26 Nov, 2019 1 commit
    • vfs: mark pipes and sockets as stream-like file descriptors · d8e464ec
      Linus Torvalds committed
      In commit 3975b097 ("convert stream-like files -> stream_open, even
      if they use noop_llseek") Kirill used a coccinelle script to change
      "nonseekable_open()" to "stream_open()", which changed the trivial cases
      of stream-like file descriptors to the new model with FMODE_STREAM.
      
      However, the two big cases - sockets and pipes - don't actually have
      that trivial pattern at all, and were thus never converted to
      FMODE_STREAM even though it makes lots of sense to do so.
      
      That's particularly true when looking forward to the next change:
      getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
      decide whether f_pos updates are needed or not.  And if they are, we'll
      always do them atomically.
      
      This came up because KCSAN (correctly) noted that the non-locked f_pos
      updates are data races: they are clearly benign for the case where we
      don't care, but it would be good to just not have that issue exist at
      all.
      
      Note that the reason we used FMODE_ATOMIC_POS originally is that only
      doing it for the minimal required case is "safer" in that it's possible
      that the f_pos locking can cause unnecessary serialization across the
      whole write() call.  And in the worst case, that kind of serialization
      can cause deadlock issues: think writers that need readers to empty the
      state using the same file descriptor.
      
      [ Note that the locking is per-file descriptor - because it protects
        "f_pos", which is obviously per-file descriptor - so it only affects
        cases where you literally use the same file descriptor to both read
        and write.
      
        So a regular pipe that has separate reading and writing file
        descriptors doesn't really have this situation even though it's the
        obvious case of "reader empties what a big writer concurrently fills".
      
        But we want to mark pipes as being stream-like anyway, because we
        don't want the unnecessary overhead of locking, and because a named
        pipe can be (ab-)used by reading and writing to the same file
        descriptor. ]
      
      There are likely a lot of other cases that might want FMODE_STREAM, and
      looking for ".llseek = no_llseek" users and other cases that don't have
      an lseek file operation at all and making them use "stream_open()" might
      be a good idea.  But pipes and sockets are likely to be the two main
      cases.
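
      The property that makes pipes and sockets "stream-like" is visible
      from user space: they have no meaningful file position.  The sketch
      below is only an illustration of that property, not the kernel
      change itself; fd_is_streamlike() and the two wrappers are
      hypothetical helper names, and the ESPIPE result is the behaviour
      expected on Linux.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if lseek() on fd fails with ESPIPE, i.e. the descriptor
 * has no usable file position; returns 0 otherwise. */
static int fd_is_streamlike(int fd)
{
    errno = 0;
    return lseek(fd, 0, SEEK_CUR) == (off_t)-1 && errno == ESPIPE;
}

/* Hypothetical helper: checks both ends of a freshly created pipe.
 * Returns 1 if both are stream-like, 0 if not, -1 on setup failure. */
static int pipe_is_streamlike(void)
{
    int p[2];
    int result;

    if (pipe(p) != 0)
        return -1;
    result = fd_is_streamlike(p[0]) && fd_is_streamlike(p[1]);
    close(p[0]);
    close(p[1]);
    return result;
}

/* Hypothetical helper: the same check for a connected AF_UNIX socket pair. */
static int socketpair_is_streamlike(void)
{
    int s[2];
    int result;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, s) != 0)
        return -1;
    result = fd_is_streamlike(s[0]) && fd_is_streamlike(s[1]);
    close(s[0]);
    close(s[1]);
    return result;
}
```

      Since no caller can observe a position on such descriptors, there
      is nothing for an f_pos lock to protect, which is the argument the
      message above makes for marking them FMODE_STREAM.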
      
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 10 Jul, 2019 2 commits
  3. 09 Jul, 2019 2 commits
    • coallocate socket_wq with socket itself · 333f7909
      Al Viro committed
      socket->wq is assign-once, set when we are initializing both
      struct socket it's in and struct socket_wq it points to.  As a
      matter of fact, the only reason for separate allocation was the
      ability to RCU-delay freeing of socket_wq.  RCU-delaying the
      freeing of socket itself gets rid of that need, so we can just
      fold struct socket_wq into the end of struct socket and simplify
      the life both for sock_alloc_inode() (one allocation instead of
      two) and for tun/tap oddballs, where we used to embed struct socket
      and struct socket_wq into the same structure (now - embedding just
      the struct socket).
      
      Note that reference to struct socket_wq in struct sock does remain
      a reference - that's unchanged.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sockfs: switch to ->free_inode() · 6d7855c5
      Al Viro committed
      we do have an RCU-delayed part there already (freeing the wq),
      so it's not like the pipe situation; moreover, it might be
      worth considering coallocating wq with the rest of struct sock_alloc.
      ->sk_wq in struct sock would remain a pointer as it is, but
      the object it normally points to would be coallocated with
      struct socket...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 04 Jul, 2019 1 commit
  5. 28 Jun, 2019 1 commit
    • bpf: implement getsockopt and setsockopt hooks · 0d01da6a
      Stanislav Fomichev committed
      Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
      BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
      
      BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
      passing them down to the kernel or bypass kernel completely.
      BPF_CGROUP_GETSOCKOPT can inspect/modify getsockopt arguments that
      kernel returns.
      Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.
      
      The buffer memory is pre-allocated (because I don't think there is
      a precedent for working with __user memory from bpf). This might be
      slow to do for each {s,g}etsockopt call, that's why I've added
      __cgroup_bpf_prog_array_is_empty that exits early if there is nothing
      attached to a cgroup. Note, however, that there is a race between
      __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
      program layout might have changed; this should not be a problem
      because in general there is a race between multiple calls to
      {s,g}etsockopt and user adding/removing bpf progs from a cgroup.
      
      The return code of the BPF program is handled as follows:
      * 0: EPERM
      * 1: success, continue with next BPF program in the cgroup chain
      
      v9:
      * allow overwriting setsockopt arguments (Alexei Starovoitov):
        * use set_fs (same as kernel_setsockopt)
        * buffer is always kzalloc'd (no small on-stack buffer)
      
      v8:
      * use s32 for optlen (Andrii Nakryiko)
      
      v7:
      * return only 0 or 1 (Alexei Starovoitov)
      * always run all progs (Alexei Starovoitov)
      * use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
        (decided to use optval=-1 instead, optval=0 might be a valid input)
      * call getsockopt hook after kernel handlers (Alexei Starovoitov)
      
      v6:
      * rework cgroup chaining; stop as soon as bpf program returns
        0 or 2; see patch with the documentation for the details
      * drop Andrii's and Martin's Acked-by (not sure they are comfortable
        with the new state of things)
      
      v5:
      * skip copy_to_user() and put_user() when ret == 0 (Martin Lau)
      
      v4:
      * don't export bpf_sk_fullsock helper (Martin Lau)
      * size != sizeof(__u64) for uapi pointers (Martin Lau)
      * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
      
      v3:
      * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
      * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
        Nakryiko)
      * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
      * use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
      * new CG_SOCKOPT_ACCESS macro to wrap repeated parts
      
      v2:
      * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
      * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
      * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
      * added [0,2] return code check to verifier (Martin Lau)
      * dropped unused buf[64] from the stack (Martin Lau)
      * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
      * dropped bpf_target_off from ctx rewrites (Martin Lau)
      * use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  6. 06 Jun, 2019 1 commit
  7. 01 Jun, 2019 1 commit
  8. 31 May, 2019 1 commit
  9. 26 May, 2019 2 commits
    • vfs: Convert sockfs to use the new mount API · fba9be49
      David Howells committed
      Convert the sockfs filesystem to the new internal mount API as the old
      one will be obsoleted and removed.  This allows greater flexibility in
      communication of mount parameters between userspace, the VFS and the
      filesystem.
      
      See Documentation/filesystems/mount_api.txt for more information.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: netdev@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • mount_pseudo(): drop 'name' argument, switch to d_make_root() · 1f58bb18
      Al Viro committed
      Once upon a time we used to set ->d_name of e.g. pipefs root
      so that d_path() on pipes would work.  These days it's
      completely pointless - dentries of pipes are not even connected
      to pipefs root.  However, mount_pseudo() had set the root
      dentry name (passed as the second argument) and callers
      kept inventing names to pass to it.  Including those that
      didn't *have* any non-root dentries to start with...
      
      All of that had been pointless for about 8 years now; it's
      time to get rid of that cargo-culting...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  10. 20 May, 2019 1 commit
    • net: fix kernel-doc warnings for socket.c · 85806af0
      Randy Dunlap committed
      Fix kernel-doc warnings by moving the kernel-doc notation to be
      immediately above the functions that it describes.
      
      Fixes these warnings for sock_sendmsg() and sock_recvmsg():
      
      ../net/socket.c:658: warning: Excess function parameter 'sock' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:658: warning: Excess function parameter 'msg' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:889: warning: Excess function parameter 'sock' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:889: warning: Excess function parameter 'msg' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:889: warning: Excess function parameter 'flags' description in 'INDIRECT_CALLABLE_DECLARE'
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 06 May, 2019 1 commit
  12. 26 Apr, 2019 1 commit
  13. 20 Apr, 2019 2 commits
    • net: socket: implement 64-bit timestamps · 0768e170
      Arnd Bergmann committed
      The 'timeval' and 'timespec' data structures used for socket timestamps
      are going to be redefined in user space based on 64-bit time_t in future
      versions of the C library to deal with the y2038 overflow problem,
      which breaks the ABI definition.
      
      Unlike many modern ioctl commands, SIOCGSTAMP and SIOCGSTAMPNS do not
      use the _IOR() macro to encode the size of the transferred data, so it
      remains ambiguous whether the application uses the old or new layout.
      
      The best workaround I could find is rather ugly: we redefine the command
      code based on the size of the respective data structure with a ternary
      operator. This lets it get evaluated as late as possible, hopefully after
      that structure is visible to the caller. We cannot use an #ifdef here,
      because linux/sockios.h might have been included before any libc header
      that could determine the size of time_t.
      
      The ioctl implementation now interprets the new command codes as always
      referring to the 64-bit structure on all architectures, while the old
      architecture specific command code still refers to the old architecture
      specific layout. The new command number is only used when they are
      actually different.
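
      The size-based selection described above can be sketched as
      follows.  This is a simplified illustration under stated
      assumptions, not the contents of the real header: every DEMO_ and
      demo_ name and both command numbers are hypothetical, and the
      macro takes the type as a parameter only so that both branches can
      be exercised in one program.

```c
#include <stdint.h>

/* Hypothetical command numbers, for illustration only. */
#define DEMO_GSTAMP_OLD 0x1001u
#define DEMO_GSTAMP_NEW 0x2001u

/* Hypothetical stand-ins for the old (32-bit time_t) and new
 * (64-bit time_t) layouts of the timestamp structure. */
struct demo_timeval32 { int32_t tv_sec; int32_t tv_usec; };
struct demo_timeval64 { int64_t tv_sec; int64_t tv_usec; };

/* The command is an expression, not an #ifdef: sizeof() is evaluated
 * where the macro is used, so the structure definition only has to be
 * visible at that point, not where the macro is defined. */
#define DEMO_GSTAMP(type) \
    (sizeof(type) == sizeof(struct demo_timeval32) \
        ? DEMO_GSTAMP_OLD : DEMO_GSTAMP_NEW)
```

      A caller built against the old layout gets DEMO_GSTAMP_OLD and one
      built against the new layout gets DEMO_GSTAMP_NEW, without either
      of them changing its source.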
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rework SIOCGSTAMP ioctl handling · c7cbdbf2
      Arnd Bergmann committed
      The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many
      socket protocol handlers, and all of those end up calling the same
      sock_get_timestamp()/sock_get_timestampns() helper functions, which
      results in a lot of duplicate code.
      
      With the introduction of 64-bit time_t on 32-bit architectures, this
      gets worse, as we then need four different ioctl commands in each
      socket protocol implementation.
      
      To simplify that, let's add a new .gettstamp() operation in
      struct proto_ops, and move ioctl implementation into the common
      sock_ioctl()/compat_sock_ioctl_trans() functions that these all go
      through.
      
      We can reuse the sock_get_timestamp() implementation, but generalize
      it so it can deal with both native and compat mode, as well as
      timeval and timespec structures.
      Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 16 Mar, 2019 1 commit
  15. 26 Feb, 2019 1 commit
    • net: socket: set sock->sk to NULL after calling proto_ops::release() · ff7b11aa
      Eric Biggers committed
      Commit 9060cb71 ("net: crypto set sk to NULL when af_alg_release.")
      fixed a use-after-free in sockfs_setattr() when an AF_ALG socket is
      closed concurrently with fchownat().  However, it ignored that many
      other proto_ops::release() methods don't set sock->sk to NULL and
      therefore allow the same use-after-free:
      
          - base_sock_release
          - bnep_sock_release
          - cmtp_sock_release
          - data_sock_release
          - dn_release
          - hci_sock_release
          - hidp_sock_release
          - iucv_sock_release
          - l2cap_sock_release
          - llcp_sock_release
          - llc_ui_release
          - rawsock_release
          - rfcomm_sock_release
          - sco_sock_release
          - svc_release
          - vcc_release
          - x25_release
      
      Rather than fixing all these and relying on every socket type to get
      this right forever, just make __sock_release() set sock->sk to NULL
      itself after calling proto_ops::release().
      
      Reproducer that produces the KASAN splat when any of these socket types
      are configured into the kernel:
      
          #include <pthread.h>
          #include <stdlib.h>
          #include <sys/socket.h>
          #include <unistd.h>
      
          pthread_t t;
          volatile int fd;
      
          void *close_thread(void *arg)
          {
              for (;;) {
                  usleep(rand() % 100);
                  close(fd);
              }
          }
      
          int main()
          {
              pthread_create(&t, NULL, close_thread, NULL);
              for (;;) {
                  fd = socket(rand() % 50, rand() % 11, 0);
                  fchownat(fd, "", 1000, 1000, 0x1000);
                  close(fd);
              }
          }
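
      The shape of the fix (clear the pointer once, in the common
      release path, instead of in every handler) can be modelled in user
      space as below.  This is a single-threaded sketch under stated
      assumptions: all demo_ names are hypothetical, and the locking
      that makes the real check safe against a concurrent close is not
      modelled.

```c
#include <stddef.h>
#include <stdlib.h>

struct demo_sock { int uid; };

struct demo_socket;

struct demo_proto_ops {
    void (*release)(struct demo_socket *sock);
};

struct demo_socket {
    struct demo_sock *sk;
    const struct demo_proto_ops *ops;
};

/* A handler that frees sk but forgets to clear the pointer, like the
 * release methods listed above. */
static void demo_forgetful_release(struct demo_socket *sock)
{
    free(sock->sk);
}

static const struct demo_proto_ops demo_forgetful_ops = {
    .release = demo_forgetful_release,
};

static int demo_socket_init(struct demo_socket *sock, int uid)
{
    sock->sk = malloc(sizeof(*sock->sk));
    if (!sock->sk)
        return -1;
    sock->sk->uid = uid;
    sock->ops = &demo_forgetful_ops;
    return 0;
}

/* Common release path: clears sk itself, so correctness no longer
 * depends on each handler remembering to do it. */
static void demo_sock_release(struct demo_socket *sock)
{
    if (sock->ops) {
        sock->ops->release(sock);
        sock->sk = NULL;
        sock->ops = NULL;
    }
}

/* A reader that checks for NULL before dereferencing; it sees a stale
 * pointer only if the release path left one behind. */
static int demo_getuid(const struct demo_socket *sock, int *uid)
{
    if (!sock->sk)
        return -1;
    *uid = sock->sk->uid;
    return 0;
}
```

      With the assignment in demo_sock_release(), a lookup after release
      fails cleanly instead of reading freed memory, whichever handler
      was installed.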
      
      Fixes: 86741ec2 ("net: core: Add a UID field to struct sock.")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 04 Feb, 2019 4 commits
  17. 31 Jan, 2019 4 commits
  18. 18 Dec, 2018 1 commit
    • y2038: socket: Add compat_sys_recvmmsg_time64 · e11d4284
      Arnd Bergmann committed
      recvmmsg() takes two pointer arguments to structures that differ
      between 32-bit and 64-bit architectures: mmsghdr and timespec.
      
      For y2038 compatibility, we are changing the native system call from
      timespec to __kernel_timespec with a 64-bit time_t (in another patch),
      and use the existing compat system call on both 32-bit and 64-bit
      architectures for compatibility with traditional 32-bit user space.
      
      As we now have two variants of recvmmsg() for 32-bit tasks that are both
      different from the variant that we use on 64-bit tasks, this means we
      also require two compat system calls!
      
      The solution I picked is to flip things around: The existing
      compat_sys_recvmmsg() call gets moved from net/compat.c into net/socket.c
      and now handles the case for old user space on all architectures that
      have set CONFIG_COMPAT_32BIT_TIME.  A new compat_sys_recvmmsg_time64()
      call gets added in the old place for 64-bit architectures only, this
      one handles the case of a compat mmsghdr structure combined with
      __kernel_timespec.
      
      In the indirect sys_socketcall(), we now need to call either
      do_sys_recvmmsg() or __compat_sys_recvmmsg(), depending on what kind of
      architecture we are on. For compat_sys_socketcall(), no such change is
      needed, we always call __compat_sys_recvmmsg().
      
      I decided to not add a new SYS_RECVMMSG_TIME64 socketcall: Any libc
      implementation for 64-bit time_t will need significant changes including
      an updated asm/unistd.h, and it seems better to consistently use the
      separate syscalls for that configuration, leaving the socketcall only for
      backward compatibility with 32-bit time_t based libc.
      
      The naming is asymmetric for the moment, so both existing syscall
      entry points keep their names, while the new ones are recvmmsg_time32
      and compat_recvmmsg_time64 respectively. I expect that we will rename
      the compat syscalls later as we start using generated syscall tables
      everywhere and add these entry points.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  19. 18 Nov, 2018 1 commit
  20. 24 Oct, 2018 1 commit
    • iov_iter: Separate type from direction and use accessor functions · aa563d7b
      David Howells committed
      In the iov_iter struct, separate the iterator type from the iterator
      direction and use accessor functions to access them in most places.
      
      Convert a bunch of places to use switch-statements to access them rather
      than chains of bitwise-AND statements.  This makes it easier to add further
      iterator types.  It can also be more efficient: to implement a switch over
      small contiguous integers, the compiler can use ~50% fewer compare
      instructions than a chain of bitwise-AND tests needs.
      
      Further, cease passing the iterator type into the iterator setup function.
      The iterator function can set that itself.  Only the direction is required.
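
      The three ideas above (type kept apart from direction, accessor
      functions, and a setup function that takes only the direction) can
      be sketched like this.  It is an illustration, not the kernel's
      iov_iter: every demo_ and DEMO_ name is hypothetical, and the
      contiguous enum values are an assumption made to match the
      "small contiguous integers" wording.

```c
/* Hypothetical iterator kinds, numbered contiguously so that a switch
 * over them can be compiled compactly. */
enum demo_iter_type {
    DEMO_ITER_IOVEC,
    DEMO_ITER_KVEC,
    DEMO_ITER_BVEC,
    DEMO_ITER_PIPE
};

enum demo_iter_dir { DEMO_READ, DEMO_WRITE };

/* Type and direction live in separate fields rather than sharing bits
 * of one word. */
struct demo_iter {
    enum demo_iter_type type;
    enum demo_iter_dir dir;
};

/* Accessors: callers go through these instead of masking bits. */
static enum demo_iter_type demo_iter_get_type(const struct demo_iter *i)
{
    return i->type;
}

static enum demo_iter_dir demo_iter_get_dir(const struct demo_iter *i)
{
    return i->dir;
}

/* Setup takes only the direction; the function sets its own type. */
static void demo_iter_kvec_init(struct demo_iter *i, enum demo_iter_dir dir)
{
    i->type = DEMO_ITER_KVEC;
    i->dir = dir;
}

/* A switch on the type replaces a chain of "if (type & FLAG)" tests. */
static int demo_iter_is_user_memory(const struct demo_iter *i)
{
    switch (demo_iter_get_type(i)) {
    case DEMO_ITER_IOVEC:
        return 1;
    case DEMO_ITER_KVEC:
    case DEMO_ITER_BVEC:
    case DEMO_ITER_PIPE:
        return 0;
    }
    return 0;
}
```

      Adding another iterator kind then means one new enum value and one
      new case label, with no bit to allocate.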
      Signed-off-by: David Howells <dhowells@redhat.com>
  21. 19 Oct, 2018 1 commit
    • net: socket: fix a missing-check bug · b6168562
      Wenwen Wang committed
      In ethtool_ioctl(), the ioctl command 'ethcmd' is checked through a switch
      statement to see whether it is necessary to pre-process the ethtool
      structure, because, as mentioned in the comment, the structure
      ethtool_rxnfc is defined with padding. If yes, a user-space buffer 'rxnfc'
      is allocated through compat_alloc_user_space(). One thing to note here is
      that, if 'ethcmd' is ETHTOOL_GRXCLSRLALL, the size of the buffer 'rxnfc' is
      partially determined by 'rule_cnt', which is actually acquired from the
      user-space buffer 'compat_rxnfc', i.e., 'compat_rxnfc->rule_cnt', through
      get_user(). After 'rxnfc' is allocated, the data in the original user-space
      buffer 'compat_rxnfc' is then copied to 'rxnfc' through copy_in_user(),
      including the 'rule_cnt' field. However, after this copy, no check is
      re-enforced on 'rxnfc->rule_cnt'. So it is possible that a malicious user
      races to change the value in 'compat_rxnfc->rule_cnt' between these two
      copies. In this way, the attacker can bypass the previous check on
      'rule_cnt' and inject malicious data. This can cause undefined behavior of
      the kernel and introduce potential security risk.
      
      This patch avoids the above issue by copying the value acquired by
      get_user() to 'rxnfc->rule_cnt', if 'ethcmd' is ETHTOOL_GRXCLSRLALL.
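
      The double-fetch pattern and the fix can be modelled in user space
      as below.  This is a deterministic sketch, not the kernel code:
      all demo_ and DEMO_ names are hypothetical, the bound is arbitrary,
      and the hook stands in for a second thread that rewrites the
      shared buffer between the two reads.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the structure that is fetched twice. */
struct demo_rxnfc {
    uint32_t cmd;
    uint32_t rule_cnt;
};

#define DEMO_MAX_RULES 16u

/* Called between the two fetches to simulate a racing writer. */
typedef void (*demo_race_hook)(struct demo_rxnfc *shared);

static void demo_attacker(struct demo_rxnfc *shared)
{
    shared->rule_cnt = 0xffffffffu;
}

/* Returns 0 on success, -1 if the first fetch fails validation. */
static int demo_fetch(struct demo_rxnfc *dst, struct demo_rxnfc *shared,
                      demo_race_hook hook)
{
    /* First fetch: the only value that is actually validated. */
    uint32_t rule_cnt = shared->rule_cnt;

    if (rule_cnt > DEMO_MAX_RULES)
        return -1;

    if (hook)
        hook(shared);

    /* Second fetch: copies the whole structure, rule_cnt included. */
    memcpy(dst, shared, sizeof(*dst));

    /* The fix: overwrite the re-fetched field with the checked value. */
    dst->rule_cnt = rule_cnt;
    return 0;
}
```

      Without the final assignment, the copy would carry whatever value
      the racing writer stored after the check had already passed.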
      Signed-off-by: Wenwen Wang <wang6495@umn.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 06 Oct, 2018 1 commit
  23. 14 Sep, 2018 1 commit
  24. 29 Aug, 2018 1 commit
    • y2038: socket: Change recvmmsg to use __kernel_timespec · c2e6c856
      Arnd Bergmann committed
      This converts the recvmmsg() system call in all its variations to use
      'timespec64' internally for its timeout, and have a __kernel_timespec
      argument in the native entry point. This lets us change the type to use
      64-bit time_t at a later point while using the 32-bit compat system call
      emulation for existing user space.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  25. 15 Aug, 2018 1 commit
  26. 01 Aug, 2018 1 commit
  27. 31 Jul, 2018 2 commits
  28. 29 Jul, 2018 2 commits