1. 01 6月, 2018 1 次提交
  2. 18 5月, 2018 2 次提交
  3. 11 5月, 2018 1 次提交
    • J
      tcp: Add mark for TIMEWAIT sockets · 00483690
      Jon Maxwell 提交于
      This version has some suggestions by Eric Dumazet:
      
      - Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP
      races.
      - Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
      statement.
      - Factorize code as sk_fullsock() check is not necessary.
      
      Aidan McGurn from Openwave Mobility systems reported the following bug:
      
      "Marked routing is broken on customer deployment. Its effects are large
      increase in Uplink retransmissions caused by the client never receiving
      the final ACK to their FINACK - this ACK misses the mark and routes out
      of the incorrect route."
      
      Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
      sysctl is enabled. But not for TW sockets that had sk->sk_mark set via
      setsockopt(SO_MARK..).
      
      Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the
      original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location.
      Then progate this so that the skb gets sent with the correct mark. Do the same
      for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
      netfilter rules are still honored.
      Signed-off-by: NJon Maxwell <jmaxwell37@gmail.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      00483690
  4. 31 3月, 2018 1 次提交
    • A
      bpf: Hooks for sys_connect · d74bad4e
      Andrey Ignatov 提交于
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 2nd
      part of the problem: making outgoing connecttion from desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      Local end of connection can be bound to desired IP using newly
      introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
      and doesn't support binding to port, i.e. leverages
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for port.
      
      As for remote end (`struct sockaddr *` passed by user), both parts of it
      can be overridden, remote IP and remote port. It's useful if an
      application inside cgroup wants to connect to another application inside
      same cgroup or to itself, but knows nothing about IP assigned to the
      cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for same reason as sys_bind
      hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
      when user passes sockaddr_in since it'd be out-of-bound.
      
      == Implementation notes ==
      
      The patch introduces new field in `struct proto`: `pre_connect` that is
      a pointer to a function with same signature as `connect` but is called
      before it. The reason is in some cases BPF hooks should be called way
      before control is passed to `sk->sk_prot->connect`. Specifically
      `inet_dgram_connect` autobinds socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to IP and autobind in `inet_dgram_connect` binds
      only port there is no chance of double-bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
      of value of `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d74bad4e
  5. 28 3月, 2018 1 次提交
  6. 27 3月, 2018 1 次提交
  7. 22 2月, 2018 1 次提交
  8. 15 2月, 2018 1 次提交
    • E
      tcp: try to keep packet if SYN_RCV race is lost · e0f9759f
      Eric Dumazet 提交于
      배석진 reported that in some situations, packets for a given 5-tuple
      end up being processed by different CPUS.
      
      This involves RPS, and fragmentation.
      
      배석진 is seeing packet drops when a SYN_RECV request socket is
      moved into ESTABLISH state. Other states are protected by socket lock.
      
      This is caused by a CPU losing the race, and simply not caring enough.
      
      Since this seems to occur frequently, we can do better and perform
      a second lookup.
      
      Note that all needed memory barriers are already in the existing code,
      thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
      and reqsk_put(). The second lookup must find the new socket,
      unless it has already been accepted and closed by another cpu.
      
      Note that the fragmentation could be avoided in the first place by
      use of a correct TCP MSS option in the SYN{ACK} packet, but this
      does not mean we can not be more robust.
      
      Many thanks to 배석진 for a very detailed analysis.
      Reported-by: N배석진 <soukjin.bae@samsung.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e0f9759f
  9. 13 2月, 2018 1 次提交
    • K
      net: Convert pernet_subsys, registered from inet_init() · f84c6821
      Kirill Tkhai 提交于
      arp_net_ops just addr/removes /proc entry.
      
      devinet_ops allocates and frees duplicate of init_net tables
      and (un)registers sysctl entries.
      
      fib_net_ops allocates and frees pernet tables, creates/destroys
      netlink socket and (un)initializes /proc entries. Foreign
      pernet_operations do not touch them.
      
      ip_rt_proc_ops only modifies pernet /proc entries.
      
      xfrm_net_ops creates/destroys /proc entries, allocates/frees
      pernet statistics, hashes and tables, and (un)initializes
      sysctl files. These are not touched by foreigh pernet_operations
      
      xfrm4_net_ops allocates/frees private pernet memory, and
      configures sysctls.
      
      sysctl_route_ops creates/destroys sysctls.
      
      rt_genid_ops only initializes fields of just allocated net.
      
      ipv4_inetpeer_ops allocated/frees net private memory.
      
      igmp_net_ops just creates/destroys /proc files and socket,
      noone else interested in.
      
      tcp_sk_ops seems to be safe, because tcp_sk_init() does not
      depend on any other pernet_operations modifications. Iteration
      over hash table in inet_twsk_purge() is made under RCU lock,
      and it's safe to iterate the table this way. Removing from
      the table happen from inet_twsk_deschedule_put(), but this
      function is safe without any extern locks, as it's synchronized
      inside itself. There are many examples, it's used in different
      context. So, it's safe to leave tcp_sk_exit_batch() unlocked.
      
      tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.
      
      udplite4_net_ops only creates/destroys pernet /proc file.
      
      icmp_sk_ops creates percpu sockets, not touched by foreign
      pernet_operations.
      
      ipmr_net_ops creates/destroys pernet fib tables, (un)registers
      fib rules and /proc files. This seem to be safe to execute
      in parallel with foreign pernet_operations.
      
      af_inet_ops just sets up default parameters of newly created net.
      
      ipv4_mib_ops creates and destroys pernet percpu statistics.
      
      raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops
      and ip_proc_ops only create/destroy pernet /proc files.
      
      ip4_frags_ops creates and destroys sysctl file.
      
      So, it's safe to make the pernet_operations async.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f84c6821
  10. 08 2月, 2018 1 次提交
  11. 17 1月, 2018 1 次提交
    • A
      net: delete /proc THIS_MODULE references · 96890d62
      Alexey Dobriyan 提交于
      /proc has been ignoring struct file_operations::owner field for 10 years.
      Specifically, it started with commit 786d7e16
      ("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
      inode->i_fop is initialized with proxy struct file_operations for
      regular files:
      
      	-               if (de->proc_fops)
      	-                       inode->i_fop = de->proc_fops;
      	+               if (de->proc_fops) {
      	+                       if (S_ISREG(inode->i_mode))
      	+                               inode->i_fop = &proc_reg_file_ops;
      	+                       else
      	+                               inode->i_fop = de->proc_fops;
      	+               }
      
      VFS stopped pinning module at this point.
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      96890d62
  12. 27 12月, 2017 1 次提交
  13. 21 12月, 2017 1 次提交
  14. 13 12月, 2017 1 次提交
  15. 04 12月, 2017 1 次提交
    • E
      tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb() · eeea10b8
      Eric Dumazet 提交于
      James Morris reported kernel stack corruption bug [1] while
      running the SELinux testsuite, and bisected to a recent
      commit bffa72cf ("net: sk_buff rbnode reorg")
      
      We believe this commit is fine, but exposes an older bug.
      
      SELinux code runs from tcp_filter() and might send an ICMP,
      expecting IP options to be found in skb->cb[] using regular IPCB placement.
      
      We need to defer TCP mangling of skb->cb[] after tcp_filter() calls.
      
      This patch adds tcp_v4_fill_cb()/tcp_v4_restore_cb() in a very
      similar way we added them for IPv6.
      
      [1]
      [  339.806024] SELinux: failure in selinux_parse_skb(), unable to parse packet
      [  339.822505] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81745af5
      [  339.822505]
      [  339.852250] CPU: 4 PID: 3642 Comm: client Not tainted 4.15.0-rc1-test #15
      [  339.868498] Hardware name: LENOVO 10FGS0VA1L/30BC, BIOS FWKT68A   01/19/2017
      [  339.885060] Call Trace:
      [  339.896875]  <IRQ>
      [  339.908103]  dump_stack+0x63/0x87
      [  339.920645]  panic+0xe8/0x248
      [  339.932668]  ? ip_push_pending_frames+0x33/0x40
      [  339.946328]  ? icmp_send+0x525/0x530
      [  339.958861]  ? kfree_skbmem+0x60/0x70
      [  339.971431]  __stack_chk_fail+0x1b/0x20
      [  339.984049]  icmp_send+0x525/0x530
      [  339.996205]  ? netlbl_skbuff_err+0x36/0x40
      [  340.008997]  ? selinux_netlbl_err+0x11/0x20
      [  340.021816]  ? selinux_socket_sock_rcv_skb+0x211/0x230
      [  340.035529]  ? security_sock_rcv_skb+0x3b/0x50
      [  340.048471]  ? sk_filter_trim_cap+0x44/0x1c0
      [  340.061246]  ? tcp_v4_inbound_md5_hash+0x69/0x1b0
      [  340.074562]  ? tcp_filter+0x2c/0x40
      [  340.086400]  ? tcp_v4_rcv+0x820/0xa20
      [  340.098329]  ? ip_local_deliver_finish+0x71/0x1a0
      [  340.111279]  ? ip_local_deliver+0x6f/0xe0
      [  340.123535]  ? ip_rcv_finish+0x3a0/0x3a0
      [  340.135523]  ? ip_rcv_finish+0xdb/0x3a0
      [  340.147442]  ? ip_rcv+0x27c/0x3c0
      [  340.158668]  ? inet_del_offload+0x40/0x40
      [  340.170580]  ? __netif_receive_skb_core+0x4ac/0x900
      [  340.183285]  ? rcu_accelerate_cbs+0x5b/0x80
      [  340.195282]  ? __netif_receive_skb+0x18/0x60
      [  340.207288]  ? process_backlog+0x95/0x140
      [  340.218948]  ? net_rx_action+0x26c/0x3b0
      [  340.230416]  ? __do_softirq+0xc9/0x26a
      [  340.241625]  ? do_softirq_own_stack+0x2a/0x40
      [  340.253368]  </IRQ>
      [  340.262673]  ? do_softirq+0x50/0x60
      [  340.273450]  ? __local_bh_enable_ip+0x57/0x60
      [  340.285045]  ? ip_finish_output2+0x175/0x350
      [  340.296403]  ? ip_finish_output+0x127/0x1d0
      [  340.307665]  ? nf_hook_slow+0x3c/0xb0
      [  340.318230]  ? ip_output+0x72/0xe0
      [  340.328524]  ? ip_fragment.constprop.54+0x80/0x80
      [  340.340070]  ? ip_local_out+0x35/0x40
      [  340.350497]  ? ip_queue_xmit+0x15c/0x3f0
      [  340.361060]  ? __kmalloc_reserve.isra.40+0x31/0x90
      [  340.372484]  ? __skb_clone+0x2e/0x130
      [  340.382633]  ? tcp_transmit_skb+0x558/0xa10
      [  340.393262]  ? tcp_connect+0x938/0xad0
      [  340.403370]  ? ktime_get_with_offset+0x4c/0xb0
      [  340.414206]  ? tcp_v4_connect+0x457/0x4e0
      [  340.424471]  ? __inet_stream_connect+0xb3/0x300
      [  340.435195]  ? inet_stream_connect+0x3b/0x60
      [  340.445607]  ? SYSC_connect+0xd9/0x110
      [  340.455455]  ? __audit_syscall_entry+0xaf/0x100
      [  340.466112]  ? syscall_trace_enter+0x1d0/0x2b0
      [  340.476636]  ? __audit_syscall_exit+0x209/0x290
      [  340.487151]  ? SyS_connect+0xe/0x10
      [  340.496453]  ? do_syscall_64+0x67/0x1b0
      [  340.506078]  ? entry_SYSCALL64_slow_path+0x25/0x25
      
      Fixes: 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJames Morris <james.l.morris@oracle.com>
      Tested-by: NJames Morris <james.l.morris@oracle.com>
      Tested-by: NCasey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eeea10b8
  16. 15 11月, 2017 1 次提交
  17. 10 11月, 2017 1 次提交
  18. 28 10月, 2017 10 次提交
  19. 27 10月, 2017 9 次提交
  20. 26 10月, 2017 1 次提交
    • E
      tcp/dccp: fix other lockdep splats accessing ireq_opt · 06f877d6
      Eric Dumazet 提交于
      In my first attempt to fix the lockdep splat, I forgot we could
      enter inet_csk_route_req() with a freshly allocated request socket,
      for which refcount has not yet been elevated, due to complex
      SLAB_TYPESAFE_BY_RCU rules.
      
      We either are in rcu_read_lock() section _or_ we own a refcount on the
      request.
      
      Correct RCU verb to use here is rcu_dereference_check(), although it is
      not possible to prove we actually own a reference on a shared
      refcount :/
      
      In v2, I added ireq_opt_deref() helper and use in three places, to fix other
      possible splats.
      
      [   49.844590]  lockdep_rcu_suspicious+0xea/0xf3
      [   49.846487]  inet_csk_route_req+0x53/0x14d
      [   49.848334]  tcp_v4_route_req+0xe/0x10
      [   49.850174]  tcp_conn_request+0x31c/0x6a0
      [   49.851992]  ? __lock_acquire+0x614/0x822
      [   49.854015]  tcp_v4_conn_request+0x5a/0x79
      [   49.855957]  ? tcp_v4_conn_request+0x5a/0x79
      [   49.858052]  tcp_rcv_state_process+0x98/0xdcc
      [   49.859990]  ? sk_filter_trim_cap+0x2f6/0x307
      [   49.862085]  tcp_v4_do_rcv+0xfc/0x145
      [   49.864055]  ? tcp_v4_do_rcv+0xfc/0x145
      [   49.866173]  tcp_v4_rcv+0x5ab/0xaf9
      [   49.868029]  ip_local_deliver_finish+0x1af/0x2e7
      [   49.870064]  ip_local_deliver+0x1b2/0x1c5
      [   49.871775]  ? inet_del_offload+0x45/0x45
      [   49.873916]  ip_rcv_finish+0x3f7/0x471
      [   49.875476]  ip_rcv+0x3f1/0x42f
      [   49.876991]  ? ip_local_deliver_finish+0x2e7/0x2e7
      [   49.878791]  __netif_receive_skb_core+0x6d3/0x950
      [   49.880701]  ? process_backlog+0x7e/0x216
      [   49.882589]  __netif_receive_skb+0x1d/0x5e
      [   49.884122]  process_backlog+0x10c/0x216
      [   49.885812]  net_rx_action+0x147/0x3df
      
      Fixes: a6ca7abe ("tcp/dccp: fix lockdep splat in inet_csk_route_req()")
      Fixes: c92e8c02 ("tcp/dccp: fix ireq->opt races")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nkernel test robot <fengguang.wu@intel.com>
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06f877d6
  21. 24 10月, 2017 2 次提交