1. 10 Feb, 2015 23 commits
  2. 09 Feb, 2015 5 commits
    • E
      net:rfs: adjust table size checking · 93c1af6c
      Eric Dumazet committed
      Make sure root user does not try something stupid.
      
      Also make sure mask field in struct rps_sock_flow_table
      does not share a cache line with the potentially often dirtied
      flow table.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: 567e4b79 ("net: rfs: add hash collision detection")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93c1af6c
    • E
      net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet committed
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, when flow hash table is populated.
      
      Also, clearing flow in inet_release() makes RFS not very good
      for short lived flows, as many packets can follow close().
      (FIN, ACK packets, ...)
      
      This patch extends the information stored into global hash table
      to not only include cpu number, but upper part of the hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For hosts with fewer than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.

      Since hash bucket selection uses the low order bits of the hash, we have
      a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.

      If the hash found in the flow table does not match, we fall back to RPS (if
      it is enabled for the rxqueue).

      This means that a packet for a non-connected flow can avoid the
      IPI through an unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short lived flows performance.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      567e4b79
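The split described in commit 567e4b79 can be illustrated with a minimal userspace sketch. It assumes the 6-bit case (fewer than 64 possible CPUs); the names `CPU_BITS`, `flow_entry_pack` and `flow_entry_lookup` are hypothetical and are not taken from the kernel source.

```c
#include <stdint.h>

/* Assumption: fewer than 64 possible CPUs, so 6 bits hold the CPU number. */
#define CPU_BITS 6u
#define CPU_MASK ((1u << CPU_BITS) - 1u)

/* Store the upper 26 bits of the hash together with the CPU number. */
uint32_t flow_entry_pack(uint32_t hash, uint32_t cpu)
{
    return (hash & ~CPU_MASK) | (cpu & CPU_MASK);
}

/*
 * Return the stored CPU when the upper hash bits match, or -1 on a
 * mismatch (where the caller would fall back to RPS).
 */
int flow_entry_lookup(uint32_t entry, uint32_t hash)
{
    if ((entry ^ hash) & ~CPU_MASK)
        return -1;
    return (int)(entry & CPU_MASK);
}
```

Because the bucket index already consumes the low-order bits of the hash, comparing only the upper bits stored in the entry is enough to detect a collision when the table is large enough.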
    • S
      gre/ipip: use be16 variants of netlink functions · 3e97fa70
      Sabrina Dubroca committed
      encap.sport and encap.dport are __be16, use nla_{get,put}_be16 instead
      of nla_{get,put}_u16.
      
      Fixes the sparse warnings:
      
      warning: incorrect type in assignment (different base types)
         expected restricted __be32 [addressable] [usertype] o_key
         got restricted __be16 [addressable] [usertype] i_flags
      warning: incorrect type in assignment (different base types)
         expected restricted __be16 [usertype] sport
         got unsigned short
      warning: incorrect type in assignment (different base types)
         expected restricted __be16 [usertype] dport
         got unsigned short
      warning: incorrect type in argument 3 (different base types)
         expected unsigned short [unsigned] [usertype] value
         got restricted __be16 [usertype] sport
      warning: incorrect type in argument 3 (different base types)
         expected unsigned short [unsigned] [usertype] value
         got restricted __be16 [usertype] dport
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3e97fa70
    • J
      tipc: fix bug in socket reception function · 51a00daf
      Jon Paul Maloy committed
      In commit c637c103 ("tipc: resolve race
      problem at unicast message reception") we introduced a time limit
      for how long the function tipc_sk_enqueue() would be allowed to execute
      its loop. Unfortunately, the test for when this limit is passed was put
      in the wrong place, resulting in a lost message when the test is true.
      
      We fix this by moving the test to before we dequeue the next buffer
      from the input queue.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      51a00daf
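The ordering issue fixed in commit 51a00daf can be sketched in plain C. This is not the TIPC code: `struct queue`, `drain` and the integer `budget` (standing in for the time limit) are hypothetical. The point is only that the limit is tested before an item is dequeued, so an item is never removed and then dropped.

```c
#include <stddef.h>

/* Hypothetical array-backed input queue. */
struct queue {
    const int *items;
    size_t head;
    size_t len;
};

/*
 * Move items from the queue to out[] until the budget runs out.
 * The budget is checked BEFORE dequeuing, so the next item stays
 * queued when the limit is reached.
 */
size_t drain(struct queue *q, unsigned int budget, int *out)
{
    size_t done = 0;

    while (q->head < q->len) {
        if (budget == 0)
            break;              /* limit reached: nothing was dequeued */
        budget--;
        out[done++] = q->items[q->head++];
    }
    return done;
}
```

Calling `drain` a second time picks up exactly where the first call stopped, which is the behaviour the fix restores.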
    • M
      rt6_probe_deferred: Do not depend on struct ordering · 662f5533
      Michael Büsch committed
      rt6_probe allocates a struct __rt6_probe_work and schedules a work handler rt6_probe_deferred.
      But rt6_probe_deferred kfree's the struct work_struct instead of struct __rt6_probe_work.
      This works, because struct work_struct is the first element of struct __rt6_probe_work.
      
      Change it to kfree struct __rt6_probe_work to not implicitly depend on
      struct work_struct being the first element.
      
      This does not affect the generated code.
      Signed-off-by: Michael Buesch <m@bues.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      662f5533
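The pattern behind commit 662f5533 can be shown with a small userspace sketch. The structs below are simplified stand-ins, not the kernel definitions, and `container_of` is re-implemented locally with `offsetof`. The handler recovers the enclosing object from the embedded member and frees that, which stays correct even if the member is not the first field.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct work_struct. */
struct work {
    int pending;
};

/* Stand-in for struct __rt6_probe_work; the member is deliberately not first. */
struct probe_work {
    int target;
    struct work work;
};

/* Local re-implementation of the kernel's container_of() idea. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Handler: gets the embedded member, frees the enclosing allocation. */
int probe_deferred(struct work *w)
{
    struct probe_work *pw = container_of(w, struct probe_work, work);
    int target = pw->target;

    free(pw);   /* free the object that was allocated, not the member */
    return target;
}
```

With `work` as the first member both pointers are numerically equal, which is why the original code worked and why the change does not affect the generated code.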
  3. 08 Feb, 2015 12 commits
    • N
      tcp: mitigate ACK loops for connections as tcp_timewait_sock · 4fb17a60
      Neal Cardwell committed
      Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is
      represented by a tcp_timewait_sock, we rate limit dupacks in response
      to incoming packets (a) with TCP timestamps that fail PAWS checks, or
      (b) with sequence numbers that are out of the acceptable window.
      
      We do not send a dupack in response to out-of-window packets if it has
      been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
      last sent a dupack in response to an out-of-window packet.
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4fb17a60
    • N
      tcp: mitigate ACK loops for connections as tcp_sock · f2b2c582
      Neal Cardwell committed
      Ensure that in state ESTABLISHED, where the connection is represented
      by a tcp_sock, we rate limit dupacks in response to incoming packets
      (a) with TCP timestamps that fail PAWS checks, or (b) with sequence
      numbers or ACK numbers that are out of the acceptable window.
      
      We do not send a dupack in response to out-of-window packets if it has
      been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
      last sent a dupack in response to an out-of-window packet.
      
      There is already a similar (although global) rate-limiting mechanism
      for "challenge ACKs". When deciding whether to send a challenge ACK,
      we first consult the new per-connection rate limit, and then the
      global rate limit.
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f2b2c582
    • N
      tcp: mitigate ACK loops for connections as tcp_request_sock · a9b2c06d
      Neal Cardwell committed
      In the SYN_RECV state, where the TCP connection is represented by
      tcp_request_sock, we now rate-limit SYNACKs in response to a client's
      retransmitted SYNs: we do not send a SYNACK in response to client SYN
      if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
      since we last sent a SYNACK in response to a client's retransmitted
      SYN.
      
      This allows the vast majority of legitimate client connections to
      proceed unimpeded, even for the most aggressive platforms, iOS and
      MacOS, which actually retransmit SYNs at 1-second intervals several
      times in a row. They use SYN RTO timeouts following the progression:
      1,1,1,1,1,2,4,8,16,32.
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a9b2c06d
    • N
      tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks · 032ee423
      Neal Cardwell committed
      Helpers for mitigating ACK loops by rate-limiting dupacks sent in
      response to incoming out-of-window packets.
      
      This patch includes:
      
      - rate-limiting logic
      - sysctl to control how often we allow dupacks to out-of-window packets
      - SNMP counter for cases where we rate-limited our dupack sending
      
      The rate-limiting logic in this patch decides to not send dupacks in
      response to out-of-window segments if (a) they are SYNs or pure ACKs
      and (b) the remote endpoint is sending them faster than the configured
      rate limit.
      
      We rate-limit our responses rather than blocking them entirely or
      resetting the connection, because legitimate connections can rely on
      dupacks in response to some out-of-window segments. For example, zero
      window probes are typically sent with a sequence number that is below
      the current window, and ZWPs thus expect to elicit a dupack in
      response.
      
      We allow dupacks in response to TCP segments with data, because these
      may be spurious retransmissions for which the remote endpoint wants to
      receive DSACKs. This is safe because segments with data can't
      realistically be part of ACK loops, which by their nature consist of
      each side sending pure/data-less ACKs to each other.
      
      The dupack interval is controlled by a new sysctl knob,
      tcp_invalid_ratelimit, given in milliseconds, in case an administrator
      needs to dial this upward in the face of a high-rate DoS attack. The
      name and units are chosen to be analogous to the existing
      knob for ICMP, icmp_ratelimit.
      
      The default value for tcp_invalid_ratelimit is 500ms, which allows at
      most one such dupack per 500ms. This is chosen to be 2x faster than
      the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
      2.4). We allow the extra 2x factor because network delay variations
      can cause packets sent at 1 second intervals to be compressed and
      arrive much closer.
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      032ee423
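The rate-limiting rule described in commit 032ee423 can be sketched as a small userspace function. This is an illustration under stated assumptions, not the kernel helper: `struct oow_state`, `oow_rate_limited` and the millisecond clock passed in by the caller are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state: when we last answered an out-of-window packet. */
struct oow_state {
    uint32_t last_ms;
    bool valid;     /* false until the first dupack has been sent */
};

/*
 * Return true when the dupack should be suppressed.
 * Segments carrying data are never rate-limited; SYNs and pure ACKs are
 * suppressed when they arrive sooner than ratelimit_ms after the last
 * dupack we sent for an out-of-window packet.
 */
bool oow_rate_limited(struct oow_state *st, uint32_t now_ms,
                      uint32_t ratelimit_ms, bool has_data)
{
    if (has_data)
        return false;

    if (st->valid && (uint32_t)(now_ms - st->last_ms) < ratelimit_ms)
        return true;

    st->last_ms = now_ms;   /* remember when this dupack goes out */
    st->valid = true;
    return false;
}
```

With the default of 500 ms, a peer retransmitting at the 1-second minimum RTO still gets a response to every packet, while a tight ACK loop is cut to at most one dupack per interval.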
    • P
      openvswitch: Initialize unmasked key and uid len · ca539345
      Pravin B Shelar committed
      Flow alloc needs to initialize the unmasked key pointer. Otherwise
      it can crash the kernel trying to free a random unmasked-key pointer.
      
      general protection fault: 0000 [#1] SMP
      3.19.0-rc6-net-next+ #457
      Hardware name: Supermicro X7DWU/X7DWU, BIOS  1.1 04/30/2008
      RIP: 0010:[<ffffffff8111df0e>] [<ffffffff8111df0e>] kfree+0xac/0x196
      Call Trace:
       [<ffffffffa060bd87>] flow_free+0x21/0x59 [openvswitch]
       [<ffffffffa060bde0>] ovs_flow_free+0x21/0x23 [openvswitch]
       [<ffffffffa0605b4a>] ovs_packet_cmd_execute+0x2f3/0x35f [openvswitch]
       [<ffffffffa0605995>] ? ovs_packet_cmd_execute+0x13e/0x35f [openvswitch]
       [<ffffffff811fe6fb>] ? nla_parse+0x4f/0xec
       [<ffffffff8139a2fc>] genl_family_rcv_msg+0x26d/0x2c9
       [<ffffffff8107620f>] ? __lock_acquire+0x90e/0x9aa
       [<ffffffff8139a3be>] genl_rcv_msg+0x66/0x89
       [<ffffffff8139a358>] ? genl_family_rcv_msg+0x2c9/0x2c9
       [<ffffffff81399591>] netlink_rcv_skb+0x3e/0x95
       [<ffffffff81399898>] ? genl_rcv+0x18/0x37
       [<ffffffff813998a7>] genl_rcv+0x27/0x37
       [<ffffffff81399033>] netlink_unicast+0x103/0x191
       [<ffffffff81399382>] netlink_sendmsg+0x2c1/0x310
       [<ffffffff811007ad>] ? might_fault+0x50/0xa0
       [<ffffffff8135c773>] do_sock_sendmsg+0x5f/0x7a
       [<ffffffff8135c799>] sock_sendmsg+0xb/0xd
       [<ffffffff8135cacf>] ___sys_sendmsg+0x1a3/0x218
       [<ffffffff8113e54b>] ? get_close_on_exec+0x86/0x86
       [<ffffffff8115a9d0>] ? fsnotify+0x32c/0x348
       [<ffffffff8115a720>] ? fsnotify+0x7c/0x348
       [<ffffffff8113e5f5>] ? __fget+0xaa/0xbf
       [<ffffffff8113e54b>] ? get_close_on_exec+0x86/0x86
       [<ffffffff8135cccd>] __sys_sendmsg+0x3d/0x5e
       [<ffffffff8135cd02>] SyS_sendmsg+0x14/0x16
       [<ffffffff81411852>] system_call_fastpath+0x12/0x17
      
      Fixes: 74ed7ab9 ("openvswitch: Add support for unique flow IDs.")
      CC: Joe Stringer <joestringer@nicira.com>
      Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ca539345
    • R
      bridge: add missing bridge port check for offloads · 1fd0bddb
      Roopa Prabhu committed
      This patch fixes a missing bridge port check caught by smatch.
      
      setlink/dellink of attributes like vlans can come for a bridge device
      and there is no need to offload those today. So, this patch adds a bridge
      port check. (In these cases however, the BRIDGE_SELF flags will always be set
      and we may not hit a problem with the current code).
      
      smatch complaint:
      
      The patch 68e331c7: "bridge: offload bridge port attributes to
      switch asic if feature flag set" from Jan 29, 2015, leads to the
      following Smatch complaint:
      
      net/bridge/br_netlink.c:552 br_setlink()
      	 error: we previously assumed 'p' could be null (see line 518)
      
      net/bridge/br_netlink.c
         517
         518		if (p && protinfo) {
                          ^
      Check for NULL.
      Reported-By: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1fd0bddb
    • E
      net: use netif_rx_ni() from process context · 91e83133
      Eric Dumazet committed
      Hotplugging a cpu might be rare, yet we have to use proper
      handlers when taking over packets found in backlog queues.
      
      dev_cpu_callback() runs from process context, thus we should
      call netif_rx_ni() to properly invoke softirq handler.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91e83133
    • S
      rds: Make rds_message_copy_from_user() return 0 on success. · d0a47d32
      Sowmini Varadhan committed
      Commit 083735f4 ("rds: switch rds_message_copy_from_user() to iov_iter")
      breaks rds_message_copy_from_user() semantics on success, and causes it
      to return nbytes copied, when it should return 0.  This commit fixes that bug.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d0a47d32
    • R
      net: rds: Remove repeated function names from debug output · 11ac1199
      Rasmus Villemoes committed
      The macro rdsdebug is defined as
      
        pr_debug("%s(): " fmt, __func__ , ##args)
      
      Hence it doesn't make sense to include the name of the calling
      function explicitly in the format string passed to rdsdebug.
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      11ac1199
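The reasoning in commit 11ac1199 can be seen in a small userspace sketch. The macro `dbg` and the function `connect_path` are hypothetical; `dbg` mimics the shape of the quoted `rdsdebug` definition but writes into a caller-supplied buffer with `snprintf` instead of calling `pr_debug`. The `##__VA_ARGS__` form is a GNU extension accepted by GCC and Clang.

```c
#include <stddef.h>
#include <stdio.h>

/* Prefixes every message with the calling function's name, like rdsdebug. */
#define dbg(buf, size, fmt, ...) \
    snprintf(buf, size, "%s(): " fmt, __func__, ##__VA_ARGS__)

/* The format string does not repeat the function name; the macro adds it. */
int connect_path(char *buf, size_t size, int id)
{
    return dbg(buf, size, "conn %d\n", id);
}
```

Writing `"connect_path: conn %d\n"` in the format string would print the function name twice, which is the duplication the commit removes.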
    • J
      net: openvswitch: Support masked set actions. · 83d2b9ba
      Jarno Rajahalme committed
      OVS userspace already probes the openvswitch kernel module for
      OVS_ACTION_ATTR_SET_MASKED support.  This patch adds the kernel module
      implementation of masked set actions.
      
      The existing set action sets many fields at once.  When only a subset
      of the IP header fields, for example, should be modified, all the IP
      fields need to be exact matched so that the other field values can be
      copied to the set action.  A masked set action allows modification of
      an arbitrary subset of the supported header bits without requiring the
      rest to be matched.
      
      Masked set action is now supported for all writeable key types, except
      for the tunnel key.  The set tunnel action is an exception as any
      input tunnel info is cleared before action processing starts, so there
      is no tunnel info to mask.
      
      The kernel module converts all (non-tunnel) set actions to masked set
      actions.  This makes action processing more uniform, and results in
      less branching and duplicating the action processing code.  When
      returning actions to userspace, the fully masked set actions are
      converted back to normal set actions.  We use a kernel internal action
      code to be able to tell the userspace provided and converted masked
      set actions apart.
      Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      83d2b9ba
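The idea of a masked set in commit 83d2b9ba can be sketched on a single 32-bit field. This is a generic bit-masking illustration, not the openvswitch implementation; the function name `set_masked` is hypothetical.

```c
#include <stdint.h>

/*
 * Overwrite only the bits selected by mask; all other bits of the old
 * value are preserved, so they do not need to be known (matched) in advance.
 */
uint32_t set_masked(uint32_t old, uint32_t value, uint32_t mask)
{
    return (old & ~mask) | (value & mask);
}
```

A fully set mask degenerates to a plain set, which is consistent with the commit converting ordinary set actions into masked ones internally.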
    • D
      rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY · 364d5716
      Daniel Borkmann committed
      ifla_vf_policy[] is wrong in advertising its individual member types as
      NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
      len member as *max* attribute length [0, len].
      
      The issue is that when do_setvfinfo() is being called to set up a VF
      through ndo handler, we could set corrupted data if the attribute length
      is less than the size of the related structure itself.
      
      The intent is exactly the opposite, namely to make sure to pass at least
      data of minimum size of len.
      
      Fixes: ebc08a6f ("rtnetlink: Add VF config code to rtnetlink")
      Cc: Mitch Williams <mitch.a.williams@intel.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      364d5716
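The difference between a maximum-length and a minimum-length check, as described in commit 364d5716, can be sketched as follows. The struct `vf_mac` and both helper functions are hypothetical illustrations and do not reproduce the netlink validation code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed-size payload that a handler would read in full. */
struct vf_mac {
    unsigned int vf;
    unsigned char mac[32];
};

/* Maximum-length semantics: any payload in [0, len] is accepted. */
bool check_max_len(size_t attr_len, size_t len)
{
    return attr_len <= len;
}

/* Minimum-length semantics: the payload must cover at least len bytes. */
bool check_min_len(size_t attr_len, size_t len)
{
    return attr_len >= len;
}
```

A 4-byte attribute passes `check_max_len` against `sizeof(struct vf_mac)` but fails `check_min_len`, which is the case where a handler reading the whole structure would pick up data beyond the attribute.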
    • T
      dsa: correctly determine the number of switches in a system · e04449fc
      Tobias Waldekranz committed
      The number of connected switches was sourced from the number of
      children of the DSA node. Change it to the number of available
      children, skipping any disabled switches.
      
      Fixes: 5e95329b ("dsa: add device tree bindings to register DSA switches")
      Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e04449fc