1. 19 10月, 2015 2 次提交
    • E
      tcp: do not set queue_mapping on SYNACK · dc6ef6be
      Eric Dumazet 提交于
      At the time of commit fff32699 ("tcp: reflect SYN queue_mapping into
      SYNACK packets") we had little ways to cope with SYN floods.
      
      We no longer need to reflect incoming skb queue mappings, and instead
      can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
      affinities.
      
      Note that all SYNACK retransmits were picking TX queue 0, this no longer
      is a win given that SYNACK rtx are now distributed on all cpus.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dc6ef6be
    • L
      ipconfig: send Client-identifier in DHCP requests · 26fb342c
      Li RongQing 提交于
      A dhcp server may provide parameters to a client from a pool of IP
      addresses and using a shared rootfs, or provide a specific set of
      parameters for a specific client, usually using the MAC address to
      identify each client individually. The dhcp protocol also specifies
      a client-id field which can be used to determine the correct
      parameters to supply when no MAC address is available. There is
      currently no way to tell the kernel to supply a specific client-id,
      only the userspace dhcp clients support this feature, but this can
      not be used when the network is needed before userspace is available
      such as when the root filesystem is on NFS.
      
      This patch is to be able to do something like "ip=dhcp,client_id_type,
      client_id_value", as a kernel parameter to enable the kernel to
      identify itself to the server.
      Signed-off-by: NLi RongQing <roy.qing.li@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26fb342c
  2. 16 10月, 2015 4 次提交
    • D
      net: Fix suspicious RCU usage in fib_rebalance · 51161aa9
      David Ahern 提交于
      This command:
        ip route add 192.168.1.0/24 nexthop via 10.2.1.5 dev eth1 nexthop via 10.2.2.5 dev eth2
      
      generated this suspicious RCU usage message:
      
      [ 63.249262]
      [ 63.249939] ===============================
      [ 63.251571] [ INFO: suspicious RCU usage. ]
      [ 63.253250] 4.3.0-rc3+ #298 Not tainted
      [ 63.254724] -------------------------------
      [ 63.256401] ../include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage!
      [ 63.259450]
      [ 63.259450] other info that might help us debug this:
      [ 63.259450]
      [ 63.262297]
      [ 63.262297] rcu_scheduler_active = 1, debug_locks = 1
      [ 63.264647] 1 lock held by ip/2870:
      [ 63.265896] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff813ebfb7>] rtnl_lock+0x12/0x14
      [ 63.268858]
      [ 63.268858] stack backtrace:
      [ 63.270409] CPU: 4 PID: 2870 Comm: ip Not tainted 4.3.0-rc3+ #298
      [ 63.272478] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [ 63.275745] 0000000000000001 ffff8800b8c9f8b8 ffffffff8125f73c ffff88013afcf301
      [ 63.278185] ffff8800bab7a380 ffff8800b8c9f8e8 ffffffff8107bf30 ffff8800bb728000
      [ 63.280634] ffff880139fe9a60 0000000000000000 ffff880139fe9a00 ffff8800b8c9f908
      [ 63.283177] Call Trace:
      [ 63.283959] [<ffffffff8125f73c>] dump_stack+0x4c/0x68
      [ 63.285593] [<ffffffff8107bf30>] lockdep_rcu_suspicious+0xfa/0x103
      [ 63.287500] [<ffffffff8144d752>] __in_dev_get_rcu+0x48/0x4f
      [ 63.289169] [<ffffffff8144d797>] fib_rebalance+0x3e/0x127
      [ 63.290753] [<ffffffff8144d986>] ? rcu_read_unlock+0x3e/0x5f
      [ 63.292442] [<ffffffff8144ea45>] fib_create_info+0xaf9/0xdcc
      [ 63.294093] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
      [ 63.295791] [<ffffffff8145236a>] fib_table_insert+0x8c/0x451
      [ 63.297493] [<ffffffff8144bf9c>] ? fib_get_table+0x36/0x43
      [ 63.299109] [<ffffffff8144c3ca>] inet_rtm_newroute+0x43/0x51
      [ 63.300709] [<ffffffff813ef684>] rtnetlink_rcv_msg+0x182/0x195
      [ 63.302334] [<ffffffff8107d04c>] ? trace_hardirqs_on+0xd/0xf
      [ 63.303888] [<ffffffff813ebfb7>] ? rtnl_lock+0x12/0x14
      [ 63.305346] [<ffffffff813ef502>] ? __rtnl_unlock+0x12/0x12
      [ 63.306878] [<ffffffff81407c4c>] netlink_rcv_skb+0x3d/0x90
      [ 63.308437] [<ffffffff813ec00e>] rtnetlink_rcv+0x21/0x28
      [ 63.309916] [<ffffffff81407742>] netlink_unicast+0xfa/0x17f
      [ 63.311447] [<ffffffff81407a5e>] netlink_sendmsg+0x297/0x2dc
      [ 63.313029] [<ffffffff813c6cd4>] sock_sendmsg_nosec+0x12/0x1d
      [ 63.314597] [<ffffffff813c835b>] ___sys_sendmsg+0x196/0x21b
      [ 63.316125] [<ffffffff8100bf9f>] ? native_sched_clock+0x1f/0x3c
      [ 63.317671] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
      [ 63.319185] [<ffffffff8106c397>] ? sched_clock_cpu+0x9d/0xb6
      [ 63.320693] [<ffffffff8107e2d7>] ? __lock_is_held+0x32/0x54
      [ 63.322145] [<ffffffff81159fcb>] ? __fget_light+0x4b/0x77
      [ 63.323541] [<ffffffff813c8726>] __sys_sendmsg+0x3d/0x5b
      [ 63.324947] [<ffffffff813c8751>] SyS_sendmsg+0xd/0x19
      [ 63.326274] [<ffffffff814c8f57>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      It looks like all of the code paths to fib_rebalance are under rtnl.
      
      Fixes: 0e884c78 ("ipv4: L3 hash-based multipath")
      Cc: Peter Nørlund <pch@ordbogen.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51161aa9
    • E
      tcp/dccp: fix race at listener dismantle phase · ebb516af
      Eric Dumazet 提交于
      Under stress, a close() on a listener can trigger the
      WARN_ON(sk->sk_ack_backlog) in inet_csk_listen_stop()
      
      We need to test if listener is still active before queueing
      a child in inet_csk_reqsk_queue_add()
      
      Create a common inet_child_forget() helper, and use it
      from inet_csk_reqsk_queue_add() and inet_csk_listen_stop()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ebb516af
    • E
      tcp/dccp: add inet_csk_reqsk_queue_drop_and_put() helper · f03f2e15
      Eric Dumazet 提交于
      Let's reduce the confusion about inet_csk_reqsk_queue_drop() :
      In many cases we also need to release reference on request socket,
      so add a helper to do this, reducing code size and complexity.
      
      Fixes: 4bdc3d66 ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f03f2e15
    • E
      Revert "inet: fix double request socket freeing" · ef84d8ce
      Eric Dumazet 提交于
      This reverts commit c6973669.
      
      At the time of above commit, tcp_req_err() and dccp_req_err()
      were dead code, as SYN_RECV request sockets were not yet in ehash table.
      
      Real bug was fixed later in a different commit.
      
      We need to revert to not leak a refcount on request socket.
      
      inet_csk_reqsk_queue_drop_and_put() will be added
      in following commit to make clean inet_csk_reqsk_queue_drop()
      does not release the reference owned by caller.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef84d8ce
  3. 15 10月, 2015 2 次提交
  4. 14 10月, 2015 2 次提交
    • P
      Revert "ipv4/icmp: redirect messages can use the ingress daddr as source" · 02a6d613
      Paolo Abeni 提交于
      Revert the commit e2ca690b ("ipv4/icmp: redirect messages
      can use the ingress daddr as source"), which tried to introduce a more
      suitable behaviour for ICMP redirect messages generated by VRRP routers.
      However RFC 5798 section 8.1.1 states:
      
          The IPv4 source address of an ICMP redirect should be the address
          that the end-host used when making its next-hop routing decision.
      
      while said commit used the generating packet destination
      address, which do not match the above and in most cases leads to
      no redirect packets to be generated.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02a6d613
    • E
      tcp/dccp: fix behavior of stale SYN_RECV request sockets · 4bdc3d66
      Eric Dumazet 提交于
      When a TCP/DCCP listener is closed, its pending SYN_RECV request sockets
      become stale, meaning 3WHS can not complete.
      
      But current behavior is wrong :
      incoming packets finding such stale sockets are dropped.
      
      We need instead to cleanup the request socket and perform another
      lookup :
      - Incoming ACK will give a RST answer,
      - SYN rtx might find another listener if available.
      - We expedite cleanup of request sockets and old listener socket.
      
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4bdc3d66
  5. 13 10月, 2015 5 次提交
    • E
      ipv4: Pass struct net into ip_defrag and ip_check_defrag · 19bcf9f2
      Eric W. Biederman 提交于
      The function ip_defrag is called on both the input and the output
      paths of the networking stack.  In particular conntrack when it is
      tracking outbound packets from the local machine calls ip_defrag.
      
      So add a struct net parameter and stop making ip_defrag guess which
      network namespace it needs to defragment packets in.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19bcf9f2
    • E
      ipv4: Only compute net once in ip_call_ra_chain · 37fcbab6
      Eric W. Biederman 提交于
      ip_call_ra_chain is called early in the forwarding chain from
      ip_forward and ip_mr_input, which makes skb->dev the correct
      expression to get the input network device and dev_net(skb->dev) a
      correct expression for the network namespace the packet is being
      processed in.
      
      Compute the network namespace and store it in a variable to make the
      code clearer.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37fcbab6
    • P
      ipv4/icmp: redirect messages can use the ingress daddr as source · e2ca690b
      Paolo Abeni 提交于
      This patch allows configuring how the source address of ICMP
      redirect messages is selected; by default the old behaviour is
      retained, while setting icmp_redirects_use_orig_daddr force the
      usage of the destination address of the packet that caused the
      redirect.
      
      The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
      following scenario:
      
      Two machines are set up with VRRP to act as routers out of a subnet,
      they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
      x.x.x.254/24.
      
      If a host in said subnet needs to get an ICMP redirect from the VRRP
      router, i.e. to reach a destination behind a different gateway, the
      source IP in the ICMP redirect is chosen as the primary IP on the
      interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
      
      The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
      and will continue to use the wrong next-op.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2ca690b
    • E
      net: shrink struct sock and request_sock by 8 bytes · ed53d0ab
      Eric Dumazet 提交于
      One 32bit hole is following skc_refcnt, use it.
      skc_incoming_cpu can also be an union for request_sock rcv_wnd.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed53d0ab
    • E
      net: SO_INCOMING_CPU setsockopt() support · 70da268b
      Eric Dumazet 提交于
      SO_INCOMING_CPU as added in commit 2c8c56e1 was a getsockopt() command
      to fetch incoming cpu handling a particular TCP flow after accept()
      
      This commits adds setsockopt() support and extends SO_REUSEPORT selection
      logic : If a TCP listener or UDP socket has this option set, a packet is
      delivered to this socket only if CPU handling the packet matches the specified
      one.
      
      This allows to build very efficient TCP servers, using one listener per
      RX queue, as the associated TCP listener should only accept flows handled
      in softirq by the same cpu.
      This provides optimal NUMA behavior and keep cpu caches hot.
      
      Note that __inet_lookup_listener() still has to iterate over the list of
      all listeners. Following patch puts sk_refcnt in a different cache line
      to let this iteration hit only shared and read mostly cache lines.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70da268b
  6. 12 10月, 2015 1 次提交
  7. 11 10月, 2015 1 次提交
  8. 08 10月, 2015 12 次提交
  9. 07 10月, 2015 7 次提交
  10. 06 10月, 2015 1 次提交
  11. 05 10月, 2015 3 次提交