1. 13 5月, 2015 12 次提交
  2. 12 5月, 2015 1 次提交
  3. 11 5月, 2015 7 次提交
    • M
      bonding: Implement user key part of port_key in an AD system. · d22a5fc0
      Mahesh Bandewar 提交于
      The port key has three components - user-key, speed-part, and duplex-part.
      The LSBit is for the duplex-part, next 5 bits are for the speed while the
      remaining 10 bits are the user defined key bits. Get these 10 bits
      from the user-space (through the SysFs interface) and use it to form the
      admin port-key. Allowed range for the user-key is 0 - 1023 (10 bits). If
      it is not provided then use zero for the user-key-bits (default).
      
      It can set using following example code -
      
         # modprobe bonding mode=4
         # usr_port_key=$(( RANDOM & 0x3FF ))
         # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key
         # echo +eth1 > /sys/class/net/bond0/bonding/slaves
         ...
         # ip link set bond0 up
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Reviewed-by: NNikolay Aleksandrov <nikolay@redhat.com>
      [jt: * fixed up style issues reported by checkpatch
           * fixed up context from change in ad_actor_sys_prio patch]
      Signed-off-by: NJonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d22a5fc0
    • M
      bonding: Allow userspace to set actors' macaddr in an AD-system. · 74514957
      Mahesh Bandewar 提交于
      In an AD system, the communication between actor and partner is the
      business between these two entities. In the current setup anyone on the
      same L2 can "guess" the LACPDU contents and then possibly send the
      spoofed LACPDUs and trick the partner causing connectivity issues for
      the AD system. This patch allows to use a random mac-address obscuring
      it's identity making it harder for someone in the L2 is do the same thing.
      
      This patch allows user-space to choose the mac-address for the AD-system.
      This mac-address can not be NULL or a Multicast. If the mac-address is set
      from user-space; kernel will honor it and will not overwrite it. In the
      absence (value from user space); the logic will default to using the
      masters' mac as the mac-address for the AD-system.
      
      It can be set using example code below -
      
         # modprobe bonding mode=4
         # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \
                          $(( (RANDOM & 0xFE) | 0x02 )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )))
         # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system
         # echo +eth1 > /sys/class/net/bond0/bonding/slaves
         ...
         # ip link set bond0 up
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Reviewed-by: NNikolay Aleksandrov <nikolay@redhat.com>
      [jt: fixed up style issues reported by checkpatch]
      Signed-off-by: NJonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74514957
    • M
      bonding: Allow userspace to set actors' system_priority in AD system · 6791e466
      Mahesh Bandewar 提交于
      This patch allows user to randomize the system-priority in an ad-system.
      The allowed range is 1 - 0xFFFF while default value is 0xFFFF. If user
      does not specify this value, the system defaults to 0xFFFF, which is
      what it was before this patch.
      
      Following example code could set the value -
          # modprobe bonding mode=4
          # sys_prio=$(( 1 + RANDOM + RANDOM ))
          # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
          # echo +eth1 > /sys/class/net/bond0/bonding/slaves
          ...
          # ip link set bond0 up
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Reviewed-by: NNikolay Aleksandrov <nikolay@redhat.com>
      [jt: * fixed up style issues reported by checkpatch
           * changed how the default value is set in bond_check_params(), this
             makes the default consistent between what gets set for a new bond
             and what the default is claimed to be in the bonding options.]
      Signed-off-by: NJonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6791e466
    • E
      net: kill sk_change_net and sk_release_kernel · affb9792
      Eric W. Biederman 提交于
      These functions are no longer needed and no longer used kill them.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      affb9792
    • E
      net: Modify sk_alloc to not reference count the netns of kernel sockets. · 26abe143
      Eric W. Biederman 提交于
      Now that sk_alloc knows when a kernel socket is being allocated modify
      it to not reference count the network namespace of kernel sockets.
      
      Keep track of if a socket needs reference counting by adding a flag to
      struct sock called sk_net_refcnt.
      
      Update all of the callers of sock_create_kern to stop using
      sk_change_net and sk_release_kernel as those hacks are no longer
      needed, to avoid reference counting a kernel socket.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26abe143
    • E
      net: Pass kern from net_proto_family.create to sk_alloc · 11aa9c28
      Eric W. Biederman 提交于
      In preparation for changing how struct net is refcounted
      on kernel sockets pass the knowledge that we are creating
      a kernel socket from sock_create_kern through to sk_alloc.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      11aa9c28
    • E
      codel: add ce_threshold attribute · 80ba92fa
      Eric Dumazet 提交于
      For DCTCP or similar ECN based deployments on fabrics with shallow
      buffers, hosts are responsible for a good part of the buffering.
      
      This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
      so that DCTCP can have feedback from queuing in the host.
      
      A DCTCP enabled egress port simply have a queue occupancy threshold
      above which ECT packets get CE mark.
      
      In codel language this translates to a sojourn time, so that one doesn't
      have to worry about bytes or bandwidth but delays.
      
      This makes the host an active participant in the health of the whole
      network.
      
      This also helps experimenting DCTCP in a setup without DCTCP compliant
      fabric.
      
      On following example, ce_threshold is set to 1ms, and we can see from
      'ldelay xxx us' that TCP is not trying to go around the 5ms codel
      target.
      
      Queue has more capacity to absorb inelastic bursts (say from UDP
      traffic), as queues are maintained to an optimal level.
      
      lpaa23:~# ./tc -s -d qd sh dev eth1
      qdisc mq 1: dev eth1 root
       Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
       backlog 3108242b 364p requeues 42961
      qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
       rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
        count 0 lastcount 0 ldelay 1.0ms drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
      qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
       rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
        count 0 lastcount 0 ldelay 694us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
      qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
       rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
        count 0 lastcount 0 ldelay 889us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
      ...
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80ba92fa
  4. 10 5月, 2015 4 次提交
  5. 06 5月, 2015 4 次提交
  6. 05 5月, 2015 2 次提交
  7. 04 5月, 2015 2 次提交
    • T
      net: Add flow_keys digest · 2f59e1eb
      Tom Herbert 提交于
      Some users of flow keys (well just sch_choke now) need to pass
      flow_keys in skbuff cb, and use them for exact comparisons of flows
      so that skb->hash is not sufficient. In order to increase size of
      the flow_keys structure, we introduce another structure for
      the purpose of passing flow keys in skbuff cb. We limit this structure
      to sixteen bytes, and we will technically treat this as a digest of
      flow_keys struct hence its name flow_keys_digest. In the first
      incaranation we just copy the flow_keys structure up to 16 bytes--
      this is the same information previously passed in the cb. In the
      future, we'll adapt this for larger flow_keys and could use something
      like SHA-1 over the whole flow_keys to improve the quality of the
      digest.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f59e1eb
    • T
      ipv6: Flow label state ranges · 82a584b7
      Tom Herbert 提交于
      This patch divides the IPv6 flow label space into two ranges:
      0-7ffff is reserved for flow label manager, 80000-fffff will be
      used for creating auto flow labels (per RFC6438). This only affects how
      labels are set on transmit, it does not affect receive. This range split
      can be disbaled by systcl.
      
      Background:
      
      IPv6 flow labels have been an unmitigated disappointment thus far
      in the lifetime of IPv6. Support in HW devices to use them for ECMP
      is lacking, and OSes don't turn them on by default. If we had these
      we could get much better hashing in IPv6 networks without resorting
      to DPI, possibly eliminating some of the motivations to to define new
      encaps in UDP just for getting ECMP.
      
      Unfortunately, the initial specfications of IPv6 did not clarify
      how they are to be used. There has always been a vague concept that
      these can be used for ECMP, flow hashing, etc. and we do now have a
      good standard how to this in RFC6438. The problem is that flow labels
      can be either stateful or stateless (as in RFC6438), and we are
      presented with the possibility that a stateless label may collide
      with a stateful one.  Attempts to split the flow label space were
      rejected in IETF. When we added support in Linux for RFC6438, we
      could not turn on flow labels by default due to this conflict.
      
      This patch splits the flow label space and should give us
      a path to enabling auto flow labels by default for all IPv6 packets.
      This is an API change so we need to consider compatibility with
      existing deployment. The stateful range is chosen to be the lower
      values in hopes that most uses would have chosen small numbers.
      
      Once we resolve the stateless/stateful issue, we can proceed to
      look at enabling RFC6438 flow labels by default (starting with
      scaled testing).
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      82a584b7
  8. 03 5月, 2015 1 次提交
  9. 02 5月, 2015 2 次提交
    • M
      ipv6: Remove DST_METRICS_FORCE_OVERWRITE and _rt6i_peer · afc4eef8
      Martin KaFai Lau 提交于
      _rt6i_peer is no longer needed after the last patch,
      'ipv6: Stop rt6_info from using inet_peer's metrics'.
      
      DST_METRICS_FORCE_OVERWRITE is added by
      commit e5fd387a ("ipv6: do not overwrite inetpeer metrics prematurely").
      Since inetpeer is no longer used for metrics, this bit is also not needed.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      afc4eef8
    • M
      ipv6: Stop rt6_info from using inet_peer's metrics · 4b32b5ad
      Martin KaFai Lau 提交于
      inet_peer is indexed by the dst address alone.  However, the fib6 tree
      could have multiple routing entries (rt6_info) for the same dst. For
      example,
      1. A /128 dst via multiple gateways.
      2. A RTF_CACHE route cloned from a /128 route.
      
      In the above cases, all of them will share the same metrics and
      step on each other.
      
      This patch will steer away from inet_peer's metrics and use
      dst_cow_metrics_generic() for everything.
      
      Change Highlights:
      1. Remove rt6_cow_metrics() which currently acquires metrics from
         inet_peer for DST_HOST route (i.e. /128 route).
      2. Add rt6i_pmtu to take care of the pmtu update to avoid creating a
         full size metrics just to override the RTAX_MTU.
      3. After (2), the RTF_CACHE route can also share the metrics with its
         dst.from route, by:
         dst_init_metrics(&cache_rt->dst, dst_metrics_ptr(cache_rt->dst.from), true);
      4. Stop creating RTF_CACHE route by cloning another RTF_CACHE route.  Instead,
         directly clone from rt->dst.
      
         [ Currently, cloning from another RTF_CACHE is only possible during
           rt6_do_redirect().  Also, the old clone is removed from the tree
           immediately after the new clone is added. ]
      
         In case of cloning from an older redirect RTF_CACHE, it should work as
         before.
      
         In case of cloning from an older pmtu RTF_CACHE, this patch will forget
         the pmtu and re-learn it (if there is any) from the redirected route.
      
      The _rt6i_peer and DST_METRICS_FORCE_OVERWRITE will be removed
      in the next cleanup patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b32b5ad
  10. 30 4月, 2015 2 次提交
  11. 27 4月, 2015 1 次提交
  12. 24 4月, 2015 2 次提交
    • E
      inet: fix possible panic in reqsk_queue_unlink() · b357a364
      Eric Dumazet 提交于
      [ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
       0000000000000080
      [ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243
      
      There is a race when reqsk_timer_handler() and tcp_check_req() call
      inet_csk_reqsk_queue_unlink() on the same req at the same time.
      
      Before commit fa76ce73 ("inet: get rid of central tcp/dccp listener
      timer"), listener spinlock was held and race could not happen.
      
      To solve this bug, we change reqsk_queue_unlink() to not assume req
      must be found, and we return a status, to conditionally release a
      refcount on the request sock.
      
      This also means tcp_check_req() in non fastopen case might or not
      consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
      to properly handle this.
      
      (Same remark for dccp_check_req() and its callers)
      
      inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
      called 4 times in tcp and 3 times in dccp.
      
      Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b357a364
    • E
      mac80211: notify the driver on reordering buffer timeout · b497de63
      Emmanuel Grumbach 提交于
      When frames time out in the reordering buffer, it is a
      good indication that something went wrong and the driver
      may want to know about that to take action or trigger
      debug flows.
      It is pointless to notify the driver about each frame that
      is released. Notify each time the timer fires.
      Signed-off-by: NEmmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      b497de63