1. 10 2月, 2015 1 次提交
  2. 09 2月, 2015 5 次提交
    • E
      net:rfs: adjust table size checking · 93c1af6c
      Eric Dumazet 提交于
      Make sure root user does not try something stupid.
      
      Also make sure mask field in struct rps_sock_flow_table
      does not share a cache line with the potentially often dirtied
      flow table.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Fixes: 567e4b79 ("net: rfs: add hash collision detection")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93c1af6c
    • E
      net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet 提交于
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, when flow hash table is populated.
      
      Also, clearing flow in inet_release() makes RFS not very good
      for short lived flows, as many packets can follow close().
      (FIN , ACK packets, ...)
      
      This patch extends the information stored into global hash table
      to not only include cpu number, but upper part of the hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For host with less than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection use low order bits of the hash, we have
      a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in flow table does not match, we fallback to RPS (if
      it is enabled for the rxqueue).
      
      This means that a packet for an non connected flow can avoid the
      IPI through a unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short lived flows performance.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      567e4b79
    • S
      gre/ipip: use be16 variants of netlink functions · 3e97fa70
      Sabrina Dubroca 提交于
      encap.sport and encap.dport are __be16, use nla_{get,put}_be16 instead
      of nla_{get,put}_u16.
      
      Fixes the sparse warnings:
      
      warning: incorrect type in assignment (different base types)
         expected restricted __be32 [addressable] [usertype] o_key
         got restricted __be16 [addressable] [usertype] i_flags
      warning: incorrect type in assignment (different base types)
         expected restricted __be16 [usertype] sport
         got unsigned short
      warning: incorrect type in assignment (different base types)
         expected restricted __be16 [usertype] dport
         got unsigned short
      warning: incorrect type in argument 3 (different base types)
         expected unsigned short [unsigned] [usertype] value
         got restricted __be16 [usertype] sport
      warning: incorrect type in argument 3 (different base types)
         expected unsigned short [unsigned] [usertype] value
         got restricted __be16 [usertype] dport
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e97fa70
    • J
      tipc: fix bug in socket reception function · 51a00daf
      Jon Paul Maloy 提交于
      In commit c637c103 ("tipc: resolve race
      problem at unicast message reception") we introduced a time limit
      for how long the function tipc_sk_eneque() would be allowed to execute
      its loop. Unfortunately, the test for when this limit is passed was put
      in the wrong place, resulting in a lost message when the test is true.
      
      We fix this by moving the test to before we dequeue the next buffer
      from the input queue.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51a00daf
    • M
      rt6_probe_deferred: Do not depend on struct ordering · 662f5533
      Michael Büsch 提交于
      rt6_probe allocates a struct __rt6_probe_work and schedules a work handler rt6_probe_deferred.
      But rt6_probe_deferred kfree's the struct work_struct instead of struct __rt6_probe_work.
      This works, because struct work_struct is the first element of struct __rt6_probe_work.
      
      Change it to kfree struct __rt6_probe_work to not implicitly depend on
      struct work_struct being the first element.
      
      This does not affect the generated code.
      Signed-off-by: NMichael Buesch <m@bues.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      662f5533
  3. 08 2月, 2015 9 次提交
    • N
      tcp: mitigate ACK loops for connections as tcp_timewait_sock · 4fb17a60
      Neal Cardwell 提交于
      Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is
      represented by a tcp_timewait_sock, we rate limit dupacks in response
      to incoming packets (a) with TCP timestamps that fail PAWS checks, or
      (b) with sequence numbers that are out of the acceptable window.
      
      We do not send a dupack in response to out-of-window packets if it has
      been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
      last sent a dupack in response to an out-of-window packet.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4fb17a60
    • N
      tcp: mitigate ACK loops for connections as tcp_sock · f2b2c582
      Neal Cardwell 提交于
      Ensure that in state ESTABLISHED, where the connection is represented
      by a tcp_sock, we rate limit dupacks in response to incoming packets
      (a) with TCP timestamps that fail PAWS checks, or (b) with sequence
      numbers or ACK numbers that are out of the acceptable window.
      
      We do not send a dupack in response to out-of-window packets if it has
      been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
      last sent a dupack in response to an out-of-window packet.
      
      There is already a similar (although global) rate-limiting mechanism
      for "challenge ACKs". When deciding whether to send a challence ACK,
      we first consult the new per-connection rate limit, and then the
      global rate limit.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f2b2c582
    • N
      tcp: mitigate ACK loops for connections as tcp_request_sock · a9b2c06d
      Neal Cardwell 提交于
      In the SYN_RECV state, where the TCP connection is represented by
      tcp_request_sock, we now rate-limit SYNACKs in response to a client's
      retransmitted SYNs: we do not send a SYNACK in response to client SYN
      if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
      since we last sent a SYNACK in response to a client's retransmitted
      SYN.
      
      This allows the vast majority of legitimate client connections to
      proceed unimpeded, even for the most aggressive platforms, iOS and
      MacOS, which actually retransmit SYNs 1-second intervals for several
      times in a row. They use SYN RTO timeouts following the progression:
      1,1,1,1,1,2,4,8,16,32.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9b2c06d
    • N
      tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks · 032ee423
      Neal Cardwell 提交于
      Helpers for mitigating ACK loops by rate-limiting dupacks sent in
      response to incoming out-of-window packets.
      
      This patch includes:
      
      - rate-limiting logic
      - sysctl to control how often we allow dupacks to out-of-window packets
      - SNMP counter for cases where we rate-limited our dupack sending
      
      The rate-limiting logic in this patch decides to not send dupacks in
      response to out-of-window segments if (a) they are SYNs or pure ACKs
      and (b) the remote endpoint is sending them faster than the configured
      rate limit.
      
      We rate-limit our responses rather than blocking them entirely or
      resetting the connection, because legitimate connections can rely on
      dupacks in response to some out-of-window segments. For example, zero
      window probes are typically sent with a sequence number that is below
      the current window, and ZWPs thus expect to thus elicit a dupack in
      response.
      
      We allow dupacks in response to TCP segments with data, because these
      may be spurious retransmissions for which the remote endpoint wants to
      receive DSACKs. This is safe because segments with data can't
      realistically be part of ACK loops, which by their nature consist of
      each side sending pure/data-less ACKs to each other.
      
      The dupack interval is controlled by a new sysctl knob,
      tcp_invalid_ratelimit, given in milliseconds, in case an administrator
      needs to dial this upward in the face of a high-rate DoS attack. The
      name and units are chosen to be analogous to the existing analogous
      knob for ICMP, icmp_ratelimit.
      
      The default value for tcp_invalid_ratelimit is 500ms, which allows at
      most one such dupack per 500ms. This is chosen to be 2x faster than
      the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
      2.4). We allow the extra 2x factor because network delay variations
      can cause packets sent at 1 second intervals to be compressed and
      arrive much closer.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      032ee423
    • P
      openvswitch: Initialize unmasked key and uid len · ca539345
      Pravin B Shelar 提交于
      Flow alloc needs to initialize unmasked key pointer. Otherwise
      it can crash kernel trying to free random unmasked-key pointer.
      
      general protection fault: 0000 [#1] SMP
      3.19.0-rc6-net-next+ #457
      Hardware name: Supermicro X7DWU/X7DWU, BIOS  1.1 04/30/2008
      RIP: 0010:[<ffffffff8111df0e>] [<ffffffff8111df0e>] kfree+0xac/0x196
      Call Trace:
       [<ffffffffa060bd87>] flow_free+0x21/0x59 [openvswitch]
       [<ffffffffa060bde0>] ovs_flow_free+0x21/0x23 [openvswitch]
       [<ffffffffa0605b4a>] ovs_packet_cmd_execute+0x2f3/0x35f [openvswitch]
       [<ffffffffa0605995>] ? ovs_packet_cmd_execute+0x13e/0x35f [openvswitch]
       [<ffffffff811fe6fb>] ? nla_parse+0x4f/0xec
       [<ffffffff8139a2fc>] genl_family_rcv_msg+0x26d/0x2c9
       [<ffffffff8107620f>] ? __lock_acquire+0x90e/0x9aa
       [<ffffffff8139a3be>] genl_rcv_msg+0x66/0x89
       [<ffffffff8139a358>] ? genl_family_rcv_msg+0x2c9/0x2c9
       [<ffffffff81399591>] netlink_rcv_skb+0x3e/0x95
       [<ffffffff81399898>] ? genl_rcv+0x18/0x37
       [<ffffffff813998a7>] genl_rcv+0x27/0x37
       [<ffffffff81399033>] netlink_unicast+0x103/0x191
       [<ffffffff81399382>] netlink_sendmsg+0x2c1/0x310
       [<ffffffff811007ad>] ? might_fault+0x50/0xa0
       [<ffffffff8135c773>] do_sock_sendmsg+0x5f/0x7a
       [<ffffffff8135c799>] sock_sendmsg+0xb/0xd
       [<ffffffff8135cacf>] ___sys_sendmsg+0x1a3/0x218
       [<ffffffff8113e54b>] ? get_close_on_exec+0x86/0x86
       [<ffffffff8115a9d0>] ? fsnotify+0x32c/0x348
       [<ffffffff8115a720>] ? fsnotify+0x7c/0x348
       [<ffffffff8113e5f5>] ? __fget+0xaa/0xbf
       [<ffffffff8113e54b>] ? get_close_on_exec+0x86/0x86
       [<ffffffff8135cccd>] __sys_sendmsg+0x3d/0x5e
       [<ffffffff8135cd02>] SyS_sendmsg+0x14/0x16
       [<ffffffff81411852>] system_call_fastpath+0x12/0x17
      
      Fixes: 74ed7ab9("openvswitch: Add support for unique flow IDs.")
      CC: Joe Stringer <joestringer@nicira.com>
      Reported-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca539345
    • R
      bridge: add missing bridge port check for offloads · 1fd0bddb
      Roopa Prabhu 提交于
      This patch fixes a missing bridge port check caught by smatch.
      
      setlink/dellink of attributes like vlans can come for a bridge device
      and there is no need to offload those today. So, this patch adds a bridge
      port check. (In these cases however, the BRIDGE_SELF flags will always be set
      and we may not hit a problem with the current code).
      
      smatch complaint:
      
      The patch 68e331c7: "bridge: offload bridge port attributes to
      switch asic if feature flag set" from Jan 29, 2015, leads to the
      following Smatch complaint:
      
      net/bridge/br_netlink.c:552 br_setlink()
      	 error: we previously assumed 'p' could be null (see line 518)
      
      net/bridge/br_netlink.c
         517
         518		if (p && protinfo) {
                          ^
      Check for NULL.
      Reported-By: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1fd0bddb
    • S
      rds: Make rds_message_copy_from_user() return 0 on success. · d0a47d32
      Sowmini Varadhan 提交于
      Commit 083735f4 ("rds: switch rds_message_copy_from_user() to iov_iter")
      breaks rds_message_copy_from_user() semantics on success, and causes it
      to return nbytes copied, when it should return 0.  This commit fixes that bug.
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d0a47d32
    • R
      net: rds: Remove repeated function names from debug output · 11ac1199
      Rasmus Villemoes 提交于
      The macro rdsdebug is defined as
      
        pr_debug("%s(): " fmt, __func__ , ##args)
      
      Hence it doesn't make sense to include the name of the calling
      function explicitly in the format string passed to rdsdebug.
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      11ac1199
    • J
      net: openvswitch: Support masked set actions. · 83d2b9ba
      Jarno Rajahalme 提交于
      OVS userspace already probes the openvswitch kernel module for
      OVS_ACTION_ATTR_SET_MASKED support.  This patch adds the kernel module
      implementation of masked set actions.
      
      The existing set action sets many fields at once.  When only a subset
      of the IP header fields, for example, should be modified, all the IP
      fields need to be exact matched so that the other field values can be
      copied to the set action.  A masked set action allows modification of
      an arbitrary subset of the supported header bits without requiring the
      rest to be matched.
      
      Masked set action is now supported for all writeable key types, except
      for the tunnel key.  The set tunnel action is an exception as any
      input tunnel info is cleared before action processing starts, so there
      is no tunnel info to mask.
      
      The kernel module converts all (non-tunnel) set actions to masked set
      actions.  This makes action processing more uniform, and results in
      less branching and duplicating the action processing code.  When
      returning actions to userspace, the fully masked set actions are
      converted back to normal set actions.  We use a kernel internal action
      code to be able to tell the userspace provided and converted masked
      set actions apart.
      Signed-off-by: NJarno Rajahalme <jrajahalme@nicira.com>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      83d2b9ba
  4. 06 2月, 2015 9 次提交
    • J
      tipc: eliminate race condition at multicast reception · cb1b7280
      Jon Paul Maloy 提交于
      In a previous commit in this series we resolved a race problem during
      unicast message reception.
      
      Here, we resolve the same problem at multicast reception. We apply the
      same technique: an input queue serializing the delivery of arriving
      buffers. The main difference is that here we do it in two steps.
      First, the broadcast link feeds arriving buffers into the tail of an
      arrival queue, which head is consumed at the socket level, and where
      destination lookup is performed. Second, if the lookup is successful,
      the resulting buffer clones are fed into a second queue, the input
      queue. This queue is consumed at reception in the socket just like
      in the unicast case. Both queues are protected by the same lock, -the
      one of the input queue.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb1b7280
    • J
      tipc: simplify socket multicast reception · 3c724acd
      Jon Paul Maloy 提交于
      The structure 'tipc_port_list' is used to collect port numbers
      representing multicast destination socket on a receiving node.
      The list is not based on a standard linked list, and is in reality
      optimized for the uncommon case that there are more than one
      multicast destinations per node. This makes the list handling
      unecessarily complex, and as a consequence, even the socket
      multicast reception becomes more complex.
      
      In this commit, we replace 'tipc_port_list' with a new 'struct
      tipc_plist', which is based on a standard list. We give the new
      list stack (push/pop) semantics, someting that simplifies
      the implementation of the function tipc_sk_mcast_rcv().
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c724acd
    • J
      tipc: simplify connection abort notifications when links break · 708ac32c
      Jon Paul Maloy 提交于
      The new input message queue in struct tipc_link can be used for
      delivering connection abort messages to subscribing sockets. This
      makes it possible to simplify the code for such cases.
      
      This commit removes the temporary list in tipc_node_unlock()
      used for transforming abort subscriptions to messages. Instead, the
      abort messages are now created at the moment of lost contact, and
      then added to the last failed link's generic input queue for delivery
      to the sockets concerned.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      708ac32c
    • J
      tipc: resolve race problem at unicast message reception · c637c103
      Jon Paul Maloy 提交于
      TIPC handles message cardinality and sequencing at the link layer,
      before passing messages upwards to the destination sockets. During the
      upcall from link to socket no locks are held. It is therefore possible,
      and we see it happen occasionally, that messages arriving in different
      threads and delivered in sequence still bypass each other before they
      reach the destination socket. This must not happen, since it violates
      the sequentiality guarantee.
      
      We solve this by adding a new input buffer queue to the link structure.
      Arriving messages are added safely to the tail of that queue by the
      link, while the head of the queue is consumed, also safely, by the
      receiving socket. Sequentiality is secured per socket by only allowing
      buffers to be dequeued inside the socket lock. Since there may be multiple
      simultaneous readers of the queue, we use a 'filter' parameter to reduce
      the risk that they peek the same buffer from the queue, hence also
      reducing the risk of contention on the receiving socket locks.
      
      This solves the sequentiality problem, and seems to cause no measurable
      performance degradation.
      
      A nice side effect of this change is that lock handling in the functions
      tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
      will enable future simplifications of those functions.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c637c103
    • J
      tipc: use existing sk_write_queue for outgoing packet chain · 94153e36
      Jon Paul Maloy 提交于
      The list for outgoing traffic buffers from a socket is currently
      allocated on the stack. This forces us to initialize the queue for
      each sent message, something costing extra CPU cycles in the most
      critical data path. Later in this series we will introduce a new
      safe input buffer queue, something that would force us to initialize
      even the spinlock of the outgoing queue. A closer analysis reveals
      that the queue always is filled and emptied within the same lock_sock()
      session. It is therefore safe to use a queue aggregated in the socket
      itself for this purpose. Since there already exists a queue for this
      in struct sock, sk_write_queue, we introduce use of that queue in
      this commit.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      94153e36
    • J
      tipc: split up function tipc_msg_eval() · e3a77561
      Jon Paul Maloy 提交于
      The function tipc_msg_eval() is in reality doing two related, but
      different tasks. First it tries to find a new destination for named
      messages, in case there was no first lookup, or if the first lookup
      failed. Second, it does what its name suggests, evaluating the validity
      of the message and its destination, and returning an appropriate error
      code depending on the result.
      
      This is confusing, and in this commit we choose to break it up into two
      functions. A new function, tipc_msg_lookup_dest(), first attempts to find
      a new destination, if the message is of the right type. If this lookup
      fails, or if the message should not be subject to a second lookup, the
      already existing tipc_msg_reverse() is called. This function performs
      prepares the message for rejection, if applicable.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e3a77561
    • J
      tipc: enqueue arrived buffers in socket in separate function · d570d864
      Jon Paul Maloy 提交于
      The code for enqueuing arriving buffers in the function tipc_sk_rcv()
      contains long code lines and currently goes to two indentation levels.
      As a cosmetic preparaton for the next commits, we break it out into
      a separate function.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d570d864
    • J
      tipc: simplify message forwarding and rejection in socket layer · 1186adf7
      Jon Paul Maloy 提交于
      Despite recent improvements, the handling of error codes and return
      values at reception of messages in the socket layer is still confusing.
      
      In this commit, we try to make it more comprehensible. First, we
      separate between the return values coming from the functions called
      by tipc_sk_rcv(), -those are TIPC specific error codes, and the
      return values returned by tipc_sk_rcv() itself. Second, we don't
      use the returned TIPC error code as indication for whether a buffer
      should be forwarded/rejected or not; instead we use the buffer pointer
      passed along with filter_msg(). This separation is necessary because
      we sometimes want to forward messages even when there is no error
      (i.e., protocol messages and successfully secondary looked up data
      messages).
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1186adf7
    • J
      tipc: reduce usage of context info in socket and link · c5898636
      Jon Paul Maloy 提交于
      The most common usage of namespace information is when we fetch the
      own node addess from the net structure. This leads to a lot of
      passing around of a parameter of type 'struct net *' between
      functions just to make them able to obtain this address.
      
      However, in many cases this is unnecessary. The own node address
      is readily available as a member of both struct tipc_sock and
      tipc_link, and can be fetched from there instead.
      The fact that the vast majority of functions in socket.c and link.c
      anyway are maintaining a pointer to their respective base structures
      makes this option even more compelling.
      
      In this commit, we introduce the inline functions tsk_own_node()
      and link_own_node() to make it easy for functions to fetch the node
      address from those structs instead of having to pass along and
      dereference the namespace struct.
      
      In particular, we make calls to the msg_xx() functions in msg.{h,c}
      context independent by directly passing them the own node address
      as parameter when needed. Those functions should be regarded as
      leaves in the code dependency tree, and it is hence desirable to
      keep them namspace unaware.
      
      Apart from a potential positive effect on cache behavior, these
      changes make it easier to introduce the changes that will follow
      later in this series.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c5898636
  5. 05 2月, 2015 16 次提交
    • E
      sit: fix some __be16/u16 mismatches · a409caec
      Eric Dumazet 提交于
      Fixes following sparse warnings :
      
      net/ipv6/sit.c:1509:32: warning: incorrect type in assignment (different base types)
      net/ipv6/sit.c:1509:32:    expected restricted __be16 [usertype] sport
      net/ipv6/sit.c:1509:32:    got unsigned short
      net/ipv6/sit.c:1514:32: warning: incorrect type in assignment (different base types)
      net/ipv6/sit.c:1514:32:    expected restricted __be16 [usertype] dport
      net/ipv6/sit.c:1514:32:    got unsigned short
      net/ipv6/sit.c:1711:38: warning: incorrect type in argument 3 (different base types)
      net/ipv6/sit.c:1711:38:    expected unsigned short [unsigned] [usertype] value
      net/ipv6/sit.c:1711:38:    got restricted __be16 [usertype] sport
      net/ipv6/sit.c:1713:38: warning: incorrect type in argument 3 (different base types)
      net/ipv6/sit.c:1713:38:    expected unsigned short [unsigned] [usertype] value
      net/ipv6/sit.c:1713:38:    got restricted __be16 [usertype] dport
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a409caec
    • E
      net: remove some sparse warnings · 2ce1ee17
      Eric Dumazet 提交于
      netdev_adjacent_add_links() and netdev_adjacent_del_links()
      are static.
      
      queue->qdisc has __rcu annotation, need to use RCU_INIT_POINTER()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ce1ee17
    • S
      ip6_gre: fix endianness errors in ip6gre_err · d1e158e2
      Sabrina Dubroca 提交于
      info is in network byte order, change it back to host byte order
      before use. In particular, the current code sets the MTU of the tunnel
      to a wrong (too big) value.
      
      Fixes: c12b395a ("gre: Support GRE over IPv6")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1e158e2
    • D
      Revert "bridge: Let bridge not age 'externally' learnt FDB entries, they are... · 3fcf9011
      David S. Miller 提交于
      Revert "bridge: Let bridge not age 'externally' learnt FDB entries, they are removed when 'external' entity notifies the aging"
      
      This reverts commit 9a05dde5.
      
      Requested by Scott Feldman.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3fcf9011
    • E
      pkt_sched: fq: better control of DDOS traffic · 06eb395f
      Eric Dumazet 提交于
      FQ has a fast path for skb attached to a socket, as it does not
      have to compute a flow hash. But for other packets, FQ being non
      stochastic means that hosts exposed to random Internet traffic
      can allocate million of flows structure (104 bytes each) pretty
      easily. Not only host can OOM, but lookup in RB trees can take
      too much cpu and memory resources.
      
      This patch adds a new attribute, orphan_mask, that is adding
      possibility of having a stochastic hash for orphaned skb.
      
      Its default value is 1024 slots, to mimic SFQ behavior.
      
      Note: This does not apply to locally generated TCP traffic,
      and no locally generated traffic will share a flow structure
      with another perfect or stochastic flow.
      
      This patch also handles the specific case of SYNACK messages:
      
      They are attached to the listener socket, and therefore all map
      to a single hash bucket. If listener have set SO_MAX_PACING_RATE,
      hoping to have new accepted socket inherit this rate, SYNACK
      might be paced and even dropped.
      
      This is very similar to an internal patch Google have used more
      than one year.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06eb395f
    • E
      tcp: do not pace pure ack packets · 98781965
      Eric Dumazet 提交于
      When we added pacing to TCP, we decided to let sch_fq take care
      of actual pacing.
      
      All TCP had to do was to compute sk->pacing_rate using simple formula:
      
      sk->pacing_rate = 2 * cwnd * mss / rtt
      
      It works well for senders (bulk flows), but not very well for receivers
      or even RPC :
      
      cwnd on the receiver can be less than 10, rtt can be around 100ms, so we
      can end up pacing ACK packets, slowing down the sender.
      
      Really, only the sender should pace, according to its own logic.
      
      Instead of adding a new bit in skb, or call yet another flow
      dissection, we tweak skb->truesize to a small value (2), and
      we instruct sch_fq to use new helper and not pace pure ack.
      
      Note this also helps TCP small queue, as ack packets present
      in qdisc/NIC do not prevent sending a data packet (RPC workload)
      
      This helps to reduce tx completion overhead, ack packets can use regular
      sock_wfree() instead of tcp_wfree() which is a bit more expensive.
      
      This has no impact in the case packets are sent to loopback interface,
      as we do not coalesce ack packets (were we would detect skb->truesize
      lie)
      
      In case netem (with a delay) is used, skb_orphan_partial() also sets
      skb->truesize to 1.
      
      This patch is a combination of two patches we used for about one year at
      Google.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98781965
    • H
      netfilter: Use rhashtable walk iterator · 9a776628
      Herbert Xu 提交于
      This patch gets rid of the manual rhashtable walk in nft_hash
      which touches rhashtable internals that should not be exposed.
      It does so by using the rhashtable iterator primitives.
      
      Note that I'm leaving nft_hash_destroy alone since it's only
      invoked on shutdown and it shouldn't be affected by changes
      to rhashtable internals (or at least not what I'm planning to
      change).
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9a776628
    • H
      netlink: Use rhashtable walk iterator · 56d28b1e
      Herbert Xu 提交于
      This patch gets rid of the manual rhashtable walk in netlink
      which touches rhashtable internals that should not be exposed.
      It does so by using the rhashtable iterator primitives.
      
      In fact the existing code was very buggy.  Some sockets weren't
      shown at all while others were shown more than once.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      56d28b1e
    • I
      cls_api.c: Fix dumping of non-existing actions' stats. · b057df24
      Ignacy Gawędzki 提交于
      In tcf_exts_dump_stats(), ensure that exts->actions is not empty before
      accessing the first element of that list and calling tcf_action_copy_stats()
      on it.  This fixes some random segvs when adding filters of type "basic" with
      no particular action.
      
      This also fixes the dumping of those "no-action" filters, which more often
      than not made calls to tcf_action_copy_stats() fail and consequently netlink
      attributes added by the caller to be removed by a call to nla_nest_cancel().
      
      Fixes: 33be6271 ("net_sched: act: use standard struct list_head")
      Signed-off-by: NIgnacy Gawędzki <ignacy.gawedzki@green-communications.fr>
      Acked-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b057df24
    • K
      pkt_sched: fq: avoid hang when quantum 0 · 3725a269
      Kenneth Klette Jonassen 提交于
      Configuring fq with quantum 0 hangs the system, presumably because of a
      non-interruptible infinite loop. Either way quantum 0 does not make sense.
      
      Reproduce with:
      sudo tc qdisc add dev lo root fq quantum 0 initial_quantum 0
      ping 127.0.0.1
      Signed-off-by: NKenneth Klette Jonassen <kennetkl@ifi.uio.no>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3725a269
    • M
      net/core: Add event for a change in slave state · 61bd3857
      Moni Shoua 提交于
      Add event which provides an indication on a change in the state
      of a bonding slave. The event handler should cast the pointer to the
      appropriate type (struct netdev_bonding_info) in order to get the
      full info about the slave.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      61bd3857
    • J
      tipc: separate link starting event from link timeout event · af9946fd
      Jon Paul Maloy 提交于
      When a new link instance is created, it is trigged to start by
      sending it a TIPC_STARTING_EVT, whereafter a regular link
      reset is applied to it.
      
      The starting event is codewise treated as a timeout event, and prompts
      a link RESET message to be sent to the peer node, carrying a link
      session identifier. The later link_reset() call nudges this session
      identifier, whereafter all subsequent RESET messages will be sent out
      with the new identifier. The latter session number overrides the former,
      causing the peer to unconditionally accept it irrespective of its
      current working state.
      
      We don't think that this causes any problem, but it is not in accordance
      with the protocol spec, and may cause confusion when debugging TIPC
      sessions.
      
      To avoid this, we make the starting event distinct from the
      subsequent timeout events, by not allowing the former to send
      out any RESET message. This eliminates the described problem.
      Reviewed-by: NErik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af9946fd
    • J
      tipc: eliminate race during node creation · b45db71b
      Jon Paul Maloy 提交于
      Instances of struct node are created in the function tipc_disc_rcv()
      under the assumption that there is no race between received discovery
      messages arriving from the same node. This assumption is wrong.
      When we use more than one bearer, it is possible that discovery
      messages from the same node arrive at the same moment, resulting in
      creation of two instances of struct tipc_node. This may later cause
      confusion during link establishment, and may result in one of the links
      never becoming activated.
      
      We fix this by making lookup and potential creation of nodes atomic.
      Instead of first looking up the node, and in case of failure, create it,
      we now start with looking up the node inside node_link_create(), and
      return a reference to that one if found. Otherwise, we go ahead and
      create the node as we did before.
      Reviewed-by: NErik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b45db71b
    • J
      tipc: avoid stale link after aborted failover · 7d24dcdb
      Jon Paul Maloy 提交于
      During link failover it may happen that the remaining link goes
      down while it is still in the process of taking over traffic
      from a previously failed link. When this happens, we currently
      abort the failover procedure and reset the first failed link to
      non-failover mode, so that it will be ready to re-establish
      contact with its peer when it comes available.
      
      However, if the first link goes down because its bearer was manually
      disabled, it is not enough to reset it; it must also be deleted;
      which is supposed to happen when the failover procedure is finished.
      Otherwise it will remain a zombie link: attached to the owner node
      structure, in mode LINK_STOPPED, and permanently blocking any re-
      establishing of the link to the peer via the interface in question.
      
      We fix this by amending the failover abort procedure. Apart from
      resetting the link to non-failover state, we test if the link is
      also in LINK_STOPPED mode. If so, we delete it, using the conditional
      tipc_link_delete() function introduced in the previous commit.
      Reviewed-by: NErik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d24dcdb
    • J
      tipc: add reference count to struct tipc_link · 2d72d495
      Jon Paul Maloy 提交于
      When a bearer is disabled, all pertaining links will be reset and
      deleted. However, if there is a second active link towards a killed
      link's destination, the delete has to be postponed until the failover
      is finished. During this interval, we currently put the link in zombie
      mode, i.e., we take it out of traffic, delete its timer, but leave it
      attached to the owner node structure until all missing packets have
      been received.  When this is done, we detach the link from its node
      and delete it, assuming that the synchronous timer deletion that was
      initiated earlier in a different thread has finished.
      
      This is unsafe, as the failover may finish before del_timer_sync()
      has returned in the other thread.
      
      We fix this by adding an atomic reference counter of type kref in
      struct tipc_link. The counter keeps track of the references kept
      to the link by the owner node and the timer. We then do a conditional
      delete, based on the reference counter, both after the failover has
      been finished and when the timer expires, if applicable. Whoever
      comes last, will actually delete the link. This approach also implies
      that we can make the deletion of the timer asynchronous.
      Reviewed-by: NErik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d72d495
    • S
      net: rds: use correct size for max unacked packets and bytes · db27ebb1
      Sasha Levin 提交于
      Max unacked packets/bytes is an int while sizeof(long) was used in the
      sysctl table.
      
      This means that when they were getting read we'd also leak kernel memory
      to userspace along with the timeout values.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db27ebb1