1. 22 2月, 2018 4 次提交
  2. 20 2月, 2018 2 次提交
  3. 17 2月, 2018 3 次提交
    • D
      net: Only honor ifindex in IP_PKTINFO if non-0 · 1cbec076
      David Ahern 提交于
      Only allow ifindex from IP_PKTINFO to override SO_BINDTODEVICE settings
      if the index is actually set in the message.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1cbec076
    • A
      udplite: fix partial checksum initialization · 15f35d49
      Alexey Kodanev 提交于
      Since UDP-Lite is always using checksum, the following path is
      triggered when calculating pseudo header for it:
      
        udp4_csum_init() or udp6_csum_init()
          skb_checksum_init_zero_check()
            __skb_checksum_validate_complete()
      
      The problem can appear if skb->len is less than CHECKSUM_BREAK. In
      this particular case __skb_checksum_validate_complete() also invokes
      __skb_checksum_complete(skb). If UDP-Lite is using partial checksum
      that covers only part of a packet, the function will return bad
      checksum and the packet will be dropped.
      
      It can be fixed if we skip skb_checksum_init_zero_check() and only
      set the required pseudo header checksum for UDP-Lite with partial
      checksum before udp4_csum_init()/udp6_csum_init() functions return.
      
      Fixes: ed70fcfc ("net: Call skb_checksum_init in IPv4")
      Fixes: e4f45b7f ("net: Call skb_checksum_init in IPv6")
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      15f35d49
    • S
      fib_semantics: Don't match route with mismatching tclassid · a8c6db1d
      Stefano Brivio 提交于
      In fib_nh_match(), if output interface or gateway are passed in
      the FIB configuration, we don't have to check next hops of
      multipath routes to conclude whether we have a match or not.
      
      However, we might still have routes with different realms
      matching the same output interface and gateway configuration,
      and this needs to cause the match to fail. Otherwise the first
      route inserted in the FIB will match, regardless of the realms:
      
       # ip route add 1.1.1.1 dev eth0 table 1234 realms 1/2
       # ip route append 1.1.1.1 dev eth0 table 1234 realms 3/4
       # ip route list table 1234
       1.1.1.1 dev eth0 scope link realms 1/2
       1.1.1.1 dev eth0 scope link realms 3/4
       # ip route del 1.1.1.1 dev ens3 table 1234 realms 3/4
       # ip route list table 1234
       1.1.1.1 dev ens3 scope link realms 3/4
      
      whereas route with realms 3/4 should have been deleted instead.
      
      Explicitly check for fc_flow passed in the FIB configuration
      (this comes from RTA_FLOW extracted by rtm_to_fib_config()) and
      fail matching if it differs from nh_tclassid.
      
      The handling of RTA_FLOW for multipath routes later in
      fib_nh_match() is still needed, as we can have multiple RTA_FLOW
      attributes that need to be matched against the tclassid of each
      next hop.
      
      v2: Check that fc_flow is set before discarding the match, so
          that the user can still select the first matching rule by
          not specifying any realm, as suggested by David Ahern.
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8c6db1d
  4. 16 2月, 2018 1 次提交
  5. 15 2月, 2018 2 次提交
    • D
      net: Move ipv4 set_lwt_redirect helper to lwtunnel · 9942895b
      David Ahern 提交于
      IPv4 uses set_lwt_redirect to set the lwtunnel redirect functions as
      needed. Move it to lwtunnel.h as lwtunnel_set_redirect and change
      IPv6 to also use it.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9942895b
    • E
      tcp: try to keep packet if SYN_RCV race is lost · e0f9759f
      Eric Dumazet 提交于
      배석진 reported that in some situations, packets for a given 5-tuple
      end up being processed by different CPUS.
      
      This involves RPS, and fragmentation.
      
      배석진 is seeing packet drops when a SYN_RECV request socket is
      moved into ESTABLISH state. Other states are protected by socket lock.
      
      This is caused by a CPU losing the race, and simply not caring enough.
      
      Since this seems to occur frequently, we can do better and perform
      a second lookup.
      
      Note that all needed memory barriers are already in the existing code,
      thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
      and reqsk_put(). The second lookup must find the new socket,
      unless it has already been accepted and closed by another cpu.
      
      Note that the fragmentation could be avoided in the first place by
      use of a correct TCP MSS option in the SYN{ACK} packet, but this
      does not mean we can not be more robust.
      
      Many thanks to 배석진 for a very detailed analysis.
      Reported-by: N배석진 <soukjin.bae@samsung.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e0f9759f
  6. 14 2月, 2018 2 次提交
  7. 13 2月, 2018 4 次提交
    • K
      net: Convert ipv4_sysctl_ops · 22769a2a
      Kirill Tkhai 提交于
      These pernet_operations create and destroy sysctl,
      which are not touched by anybody else.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22769a2a
    • K
      net: Convert pernet_subsys, registered from inet_init() · f84c6821
      Kirill Tkhai 提交于
      arp_net_ops just addr/removes /proc entry.
      
      devinet_ops allocates and frees duplicate of init_net tables
      and (un)registers sysctl entries.
      
      fib_net_ops allocates and frees pernet tables, creates/destroys
      netlink socket and (un)initializes /proc entries. Foreign
      pernet_operations do not touch them.
      
      ip_rt_proc_ops only modifies pernet /proc entries.
      
      xfrm_net_ops creates/destroys /proc entries, allocates/frees
      pernet statistics, hashes and tables, and (un)initializes
      sysctl files. These are not touched by foreigh pernet_operations
      
      xfrm4_net_ops allocates/frees private pernet memory, and
      configures sysctls.
      
      sysctl_route_ops creates/destroys sysctls.
      
      rt_genid_ops only initializes fields of just allocated net.
      
      ipv4_inetpeer_ops allocated/frees net private memory.
      
      igmp_net_ops just creates/destroys /proc files and socket,
      noone else interested in.
      
      tcp_sk_ops seems to be safe, because tcp_sk_init() does not
      depend on any other pernet_operations modifications. Iteration
      over hash table in inet_twsk_purge() is made under RCU lock,
      and it's safe to iterate the table this way. Removing from
      the table happen from inet_twsk_deschedule_put(), but this
      function is safe without any extern locks, as it's synchronized
      inside itself. There are many examples, it's used in different
      context. So, it's safe to leave tcp_sk_exit_batch() unlocked.
      
      tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.
      
      udplite4_net_ops only creates/destroys pernet /proc file.
      
      icmp_sk_ops creates percpu sockets, not touched by foreign
      pernet_operations.
      
      ipmr_net_ops creates/destroys pernet fib tables, (un)registers
      fib rules and /proc files. This seem to be safe to execute
      in parallel with foreign pernet_operations.
      
      af_inet_ops just sets up default parameters of newly created net.
      
      ipv4_mib_ops creates and destroys pernet percpu statistics.
      
      raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops
      and ip_proc_ops only create/destroy pernet /proc files.
      
      ip4_frags_ops creates and destroys sysctl file.
      
      So, it's safe to make the pernet_operations async.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f84c6821
    • D
      net: make getname() functions return length rather than use int* parameter · 9b2c45d4
      Denys Vlasenko 提交于
      Changes since v1:
      Added changes in these files:
          drivers/infiniband/hw/usnic/usnic_transport.c
          drivers/staging/lustre/lnet/lnet/lib-socket.c
          drivers/target/iscsi/iscsi_target_login.c
          drivers/vhost/net.c
          fs/dlm/lowcomms.c
          fs/ocfs2/cluster/tcp.c
          security/tomoyo/network.c
      
      Before:
      All these functions either return a negative error indicator,
      or store length of sockaddr into "int *socklen" parameter
      and return zero on success.
      
      "int *socklen" parameter is awkward. For example, if caller does not
      care, it still needs to provide on-stack storage for the value
      it does not need.
      
      None of the many FOO_getname() functions of various protocols
      ever used old value of *socklen. They always just overwrite it.
      
      This change drops this parameter, and makes all these functions, on success,
      return length of sockaddr. It's always >= 0 and can be differentiated
      from an error.
      
      Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
      
      rpc_sockname() lost "int buflen" parameter, since its only use was
      to be passed to kernel_getsockname() as &buflen and subsequently
      not used in any way.
      
      Userspace API is not changed.
      
          text    data     bss      dec     hex filename
      30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
      30108109 2633612  873672 33615393 200ee21 vmlinux.o
      Signed-off-by: NDenys Vlasenko <dvlasenk@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: linux-bluetooth@vger.kernel.org
      CC: linux-decnet-user@lists.sourceforge.net
      CC: linux-wireless@vger.kernel.org
      CC: linux-rdma@vger.kernel.org
      CC: linux-sctp@vger.kernel.org
      CC: linux-nfs@vger.kernel.org
      CC: linux-x25@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b2c45d4
    • I
      tcp: Honor the eor bit in tcp_mtu_probe · 808cf9e3
      Ilya Lesokhin 提交于
      Avoid SKB coalescing if eor bit is set in one of the relevant
      SKBs.
      
      Fixes: c134ecb8 ("tcp: Make use of MSG_EOR in tcp_sendmsg")
      Signed-off-by: NIlya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      808cf9e3
  8. 12 2月, 2018 1 次提交
    • L
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds 提交于
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But they keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  9. 08 2月, 2018 1 次提交
  10. 07 2月, 2018 3 次提交
    • P
      netfilter: nf_tables: fix flowtable free · b408c5b0
      Pablo Neira Ayuso 提交于
      Every flow_offload entry is added into the table twice. Because of this,
      rhashtable_free_and_destroy can't be used, since it would call kfree for
      each flow_offload object twice.
      
      This patch cleans up the flowtable via nf_flow_table_iterate() to
      schedule removal of entries by setting on the dying bit, then there is
      an explicitly invocation of the garbage collector to release resources.
      
      Based on patch from Felix Fietkau.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      b408c5b0
    • W
      net: erspan: fix erspan config overwrite · 39f57f67
      William Tu 提交于
      When an erspan tunnel device receives an erpsan packet with different
      tunnel metadata (ex: version, index, hwid, direction), existing code
      overwrites the tunnel device's erspan configuration with the received
      packet's metadata.  The patch fixes it.
      
      Fixes: 1a66a836 ("gre: add collect_md mode to ERSPAN tunnel")
      Fixes: f551c91d ("net: erspan: introduce erspan v2 for ip_gre")
      Fixes: ef7baf5e ("ip6_gre: add ip6 erspan collect_md mode")
      Fixes: 94d7d8f2 ("ip6_gre: add erspan v2 support")
      Signed-off-by: NWilliam Tu <u9012063@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      39f57f67
    • W
      net: erspan: fix metadata extraction · 3df19283
      William Tu 提交于
      Commit d350a823 ("net: erspan: create erspan metadata uapi header")
      moves the erspan 'version' in front of the 'struct erspan_md2' for
      later extensibility reason.  This breaks the existing erspan metadata
      extraction code because the erspan_md2 then has a 4-byte offset
      to between the erspan_metadata and erspan_base_hdr.  This patch
      fixes it.
      
      Fixes: 1a66a836 ("gre: add collect_md mode to ERSPAN tunnel")
      Fixes: ef7baf5e ("ip6_gre: add ip6 erspan collect_md mode")
      Fixes: 1d7e2ed2 ("net: erspan: refactor existing erspan code")
      Signed-off-by: NWilliam Tu <u9012063@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3df19283
  11. 06 2月, 2018 1 次提交
    • J
      net: add a UID to use for ULP socket assignment · b11a632c
      John Fastabend 提交于
      Create a UID field and enum that can be used to assign ULPs to
      sockets. This saves a set of string comparisons if the ULP id
      is known.
      
      For sockmap, which is added in the next patches, a ULP is used to
      hook into TCP sockets close state. In this case the ULP being added
      is done at map insert time and the ULP is known and done on the kernel
      side. In this case the named lookup is not needed. Because we don't
      want to expose psock internals to user space socket options a user
      visible flag is also added. For TLS this is set for BPF it will be
      cleared.
      
      Alos remove pr_notice, user gets an error code back and should check
      that rather than rely on logs.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      b11a632c
  12. 03 2月, 2018 1 次提交
    • R
      Revert "defer call to mem_cgroup_sk_alloc()" · edbe69ef
      Roman Gushchin 提交于
      This patch effectively reverts commit 9f1c2674 ("net: memcontrol:
      defer call to mem_cgroup_sk_alloc()").
      
      Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
      memcg socket memory accounting, as packets received before memcg
      pointer initialization are not accounted and are causing refcounting
      underflow on socket release.
      
      Actually the free-after-use problem was fixed by
      commit c0576e39 ("net: call cgroup_sk_alloc() earlier in
      sk_clone_lock()") for the cgroup pointer.
      
      So, let's revert it and call mem_cgroup_sk_alloc() just before
      cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
      we're cloning, and it holds a reference to the memcg.
      
      Also, let's drop BUG_ON(mem_cgroup_is_root()) check from
      mem_cgroup_sk_alloc(). I see no reasons why bumping the root
      memcg counter is a good reason to panic, and there are no realistic
      ways to hit it.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      edbe69ef
  13. 02 2月, 2018 2 次提交
    • P
      netfilter: flowtable infrastructure depends on NETFILTER_INGRESS · 6be3bcd7
      Pablo Neira Ayuso 提交于
      config NF_FLOW_TABLE depends on NETFILTER_INGRESS. If users forget to
      enable this toggle, flowtable registration fails with EOPNOTSUPP.
      
      Moreover, turn 'select NF_FLOW_TABLE' in every flowtable family flavour
      into dependency instead, otherwise this new dependency on
      NETFILTER_INGRESS causes a warning. This also allows us to remove the
      explicit dependency between family flowtables <-> NF_TABLES and
      NF_CONNTRACK, given they depend on the NF_FLOW_TABLE core that already
      expresses the general dependencies for this new infrastructure.
      
      Moreover, NF_FLOW_TABLE_INET depends on NF_FLOW_TABLE_IPV4 and
      NF_FLOWTABLE_IPV6, which already depends on NF_FLOW_TABLE. So we can get
      rid of direct dependency with NF_FLOW_TABLE.
      
      In general, let's avoid 'select', it just makes things more complicated.
      Reported-by: NJohn Crispin <john@phrozen.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      6be3bcd7
    • E
      net: igmp: add a missing rcu locking section · e7aadb27
      Eric Dumazet 提交于
      Newly added igmpv3_get_srcaddr() needs to be called under rcu lock.
      
      Timer callbacks do not ensure this locking.
      
      =============================
      WARNING: suspicious RCU usage
      4.15.0+ #200 Not tainted
      -----------------------------
      ./include/linux/inetdevice.h:216 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      3 locks held by syzkaller616973/4074:
       #0:  (&mm->mmap_sem){++++}, at: [<00000000bfce669e>] __do_page_fault+0x32d/0xc90 arch/x86/mm/fault.c:1355
       #1:  ((&im->timer)){+.-.}, at: [<00000000619d2f71>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
       #1:  ((&im->timer)){+.-.}, at: [<00000000619d2f71>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1316
       #2:  (&(&im->lock)->rlock){+.-.}, at: [<000000005f833c5c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
       #2:  (&(&im->lock)->rlock){+.-.}, at: [<000000005f833c5c>] igmpv3_send_report+0x98/0x5b0 net/ipv4/igmp.c:600
      
      stack backtrace:
      CPU: 0 PID: 4074 Comm: syzkaller616973 Not tainted 4.15.0+ #200
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:53
       lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4592
       __in_dev_get_rcu include/linux/inetdevice.h:216 [inline]
       igmpv3_get_srcaddr net/ipv4/igmp.c:329 [inline]
       igmpv3_newpack+0xeef/0x12e0 net/ipv4/igmp.c:389
       add_grhead.isra.27+0x235/0x300 net/ipv4/igmp.c:432
       add_grec+0xbd3/0x1170 net/ipv4/igmp.c:565
       igmpv3_send_report+0xd5/0x5b0 net/ipv4/igmp.c:605
       igmp_send_report+0xc43/0x1050 net/ipv4/igmp.c:722
       igmp_timer_expire+0x322/0x5c0 net/ipv4/igmp.c:831
       call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
       expire_timers kernel/time/timer.c:1363 [inline]
       __run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
       run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
       __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
       invoke_softirq kernel/softirq.c:365 [inline]
       irq_exit+0x1cc/0x200 kernel/softirq.c:405
       exiting_irq arch/x86/include/asm/apic.h:541 [inline]
       smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
       apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:938
      
      Fixes: a46182b0 ("net: igmp: Use correct source address on IGMPv3 reports")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7aadb27
  14. 01 2月, 2018 2 次提交
  15. 31 1月, 2018 3 次提交
    • P
      netfilter: on sockopt() acquire sock lock only in the required scope · 3f34cfae
      Paolo Abeni 提交于
      Syzbot reported several deadlocks in the netfilter area caused by
      rtnl lock and socket lock being acquired with a different order on
      different code paths, leading to backtraces like the following one:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      4.15.0-rc9+ #212 Not tainted
      ------------------------------------------------------
      syzkaller041579/3682 is trying to acquire lock:
        (sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>] lock_sock
      include/net/sock.h:1463 [inline]
        (sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>]
      do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
      
      but task is already holding lock:
        (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (rtnl_mutex){+.+.}:
              __mutex_lock_common kernel/locking/mutex.c:756 [inline]
              __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
              mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
              rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
              register_netdevice_notifier+0xad/0x860 net/core/dev.c:1607
              tee_tg_check+0x1a0/0x280 net/netfilter/xt_TEE.c:106
              xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845
              check_target net/ipv6/netfilter/ip6_tables.c:538 [inline]
              find_check_entry.isra.7+0x935/0xcf0
      net/ipv6/netfilter/ip6_tables.c:580
              translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:749
              do_replace net/ipv6/netfilter/ip6_tables.c:1165 [inline]
              do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1691
              nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
              nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
              ipv6_setsockopt+0x115/0x150 net/ipv6/ipv6_sockglue.c:928
              udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
              sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
              SYSC_setsockopt net/socket.c:1849 [inline]
              SyS_setsockopt+0x189/0x360 net/socket.c:1828
              entry_SYSCALL_64_fastpath+0x29/0xa0
      
      -> #0 (sk_lock-AF_INET6){+.+.}:
              lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914
              lock_sock_nested+0xc2/0x110 net/core/sock.c:2780
              lock_sock include/net/sock.h:1463 [inline]
              do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
              ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922
              udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
              sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
              SYSC_setsockopt net/socket.c:1849 [inline]
              SyS_setsockopt+0x189/0x360 net/socket.c:1828
              entry_SYSCALL_64_fastpath+0x29/0xa0
      
      other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(rtnl_mutex);
                                      lock(sk_lock-AF_INET6);
                                      lock(rtnl_mutex);
         lock(sk_lock-AF_INET6);
      
        *** DEADLOCK ***
      
      1 lock held by syzkaller041579/3682:
        #0:  (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      
      The problem, as Florian noted, is that nf_setsockopt() is always
      called with the socket held, even if the lock itself is required only
      for very tight scopes and only for some operation.
      
      This patch addresses the issues moving the lock_sock() call only
      where really needed, namely in ipv*_getorigdst(), so that nf_setsockopt()
      does not need anymore to acquire both locks.
      
      Fixes: 22265a5c ("netfilter: xt_TEE: resolve oif using netdevice notifiers")
      Reported-by: syzbot+a4c2dc980ac1af699b36@syzkaller.appspotmail.com
      Suggested-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      3f34cfae
    • G
      tcp_nv: fix potential integer overflow in tcpnv_acked · e4823fbd
      Gustavo A. R. Silva 提交于
      Add suffix ULL to constant 80000 in order to avoid a potential integer
      overflow and give the compiler complete information about the proper
      arithmetic to use. Notice that this constant is used in a context that
      expects an expression of type u64.
      
      The current cast to u64 effectively applies to the whole expression
      as an argument of type u64 to be passed to div64_u64, but it does
      not prevent it from being evaluated using 32-bit arithmetic instead
      of 64-bit arithmetic.
      
      Also, once the expression is properly evaluated using 64-bit arithmentic,
      there is no need for the parentheses and the external cast to u64.
      
      Addresses-Coverity-ID: 1357588 ("Unintentional integer overflow")
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e4823fbd
    • D
      netfilter: ipt_CLUSTERIP: fix out-of-bounds accesses in clusterip_tg_check() · 1a38956c
      Dmitry Vyukov 提交于
      Commit 136e92bb switched local_nodes from an array to a bitmask
      but did not add proper bounds checks. As the result
      clusterip_config_init_nodelist() can both over-read
      ipt_clusterip_tgt_info.local_nodes and over-write
      clusterip_config.local_nodes.
      
      Add bounds checks for both.
      
      Fixes: 136e92bb ("[NETFILTER] CLUSTERIP: use a bitmap to store node responsibility data")
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      1a38956c
  16. 30 1月, 2018 3 次提交
  17. 26 1月, 2018 5 次提交
    • D
      net/ipv4: Allow send to local broadcast from a socket bound to a VRF · 9515a2e0
      David Ahern 提交于
      Message sends to the local broadcast address (255.255.255.255) require
      uc_index or sk_bound_dev_if to be set to an egress device. However,
      responses or only received if the socket is bound to the device. This
      is overly constraining for processes running in an L3 domain. This
      patch allows a socket bound to the VRF device to send to the local
      broadcast address by using IP_UNICAST_IF to set the egress interface
      with packet receipt handled by the VRF binding.
      
      Similar to IP_MULTICAST_IF, relax the constraint on setting
      IP_UNICAST_IF if a socket is bound to an L3 master device. In this
      case allow uc_index to be set to an enslaved if sk_bound_dev_if is
      an L3 master device and is the master device for the ifindex.
      
      In udp and raw sendmsg, allow uc_index to override the oif if
      uc_index master device is oif (ie., the oif is an L3 master and the
      index is an L3 slave).
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9515a2e0
    • W
      net: erspan: use bitfield instead of mask and offset · c69de58b
      William Tu 提交于
      Originally the erspan fields are defined as a group into a __be16 field,
      and use mask and offset to access each field.  This is more costly due to
      calling ntohs/htons.  The patch changes it to use bitfields.
      Signed-off-by: NWilliam Tu <u9012063@gmail.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c69de58b
    • L
      bpf: Add BPF_SOCK_OPS_STATE_CB · d4487491
      Lawrence Brakmo 提交于
      Adds support for calling sock_ops BPF program when there is a TCP state
      change. Two arguments are used; one for the old state and another for
      the new state.
      
      There is a new enum in include/uapi/linux/bpf.h that exports the TCP
      states that prepends BPF_ to the current TCP state names. If it is ever
      necessary to change the internal TCP state values (other than adding
      more to the end), then it will become necessary to convert from the
      internal TCP state value to the BPF value before calling the BPF
      sock_ops function. There are a set of compile checks added in tcp.c
      to detect if the internal and BPF values differ so we can make the
      necessary fixes.
      
      New op: BPF_SOCK_OPS_STATE_CB.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d4487491
    • L
      bpf: Add BPF_SOCK_OPS_RETRANS_CB · a31ad29e
      Lawrence Brakmo 提交于
      Adds support for calling sock_ops BPF program when there is a
      retransmission. Three arguments are used; one for the sequence number,
      another for the number of segments retransmitted, and the last one for
      the return value of tcp_transmit_skb (0 => success).
      Does not include syn-ack retransmissions.
      
      New op: BPF_SOCK_OPS_RETRANS_CB.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      a31ad29e
    • L
      bpf: Add sock_ops RTO callback · f89013f6
      Lawrence Brakmo 提交于
      Adds an optional call to sock_ops BPF program based on whether the
      BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
      The BPF program is passed 2 arguments: icsk_retransmits and whether the
      RTO has expired.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f89013f6