1. 12 5月, 2015 4 次提交
  2. 11 5月, 2015 11 次提交
    • D
      net: sched: further simplify handle_ing · d2788d34
      Daniel Borkmann 提交于
      Ingress qdisc has no other purpose than calling into tc_classify()
      that executes attached classifier(s) and action(s).
      
      It has a 1:1 relationship to dev->ingress_queue. After having commit
      087c1a60 ("net: sched: run ingress qdisc without locks") removed
      the central ingress lock, one major contention point is gone.
      
      The extra indirection layers however, are not necessary for calling
      into ingress qdisc. pktgen calling locally into netif_receive_skb()
      with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
      E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps.
      
      We can redirect the private classifier list to the netdev directly,
      without changing any classifier API bits (!) and execute on that from
      handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed,
      ingress qdisc doesn't have a queue and thus dev_deactivate_queue()
      is also not applicable, ingress_cl_list provides similar behaviour.
      In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc.
      
      One next possible step is the removal of the dev's ingress (dummy)
      netdev_queue, and to only have the list member in the netdevice
      itself.
      
      Note, the filter chain is RCU protected and individual filter elements
      are being kfree'd by sched subsystem after RCU grace period. RCU read
      lock is being held by __netif_receive_skb_core().
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2788d34
    • A
      bonding: add netlink support for sys prio, actor sys mac, and port key · 171a42c3
      Andy Gospodarek 提交于
      Adds netlink support for the following bonding options:
      * BOND_OPT_AD_ACTOR_SYS_PRIO
      * BOND_OPT_AD_ACTOR_SYSTEM
      * BOND_OPT_AD_USER_PORT_KEY
      
      When setting the actor system mac address we assume the netlink message
      contains a binary mac and not a string representation of a mac.
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      [jt: completed the setting side of the netlink attributes]
      Signed-off-by: NJonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: NNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      171a42c3
    • M
      bonding: Implement user key part of port_key in an AD system. · d22a5fc0
      Mahesh Bandewar 提交于
      The port key has three components - user-key, speed-part, and duplex-part.
      The LSBit is for the duplex-part, next 5 bits are for the speed while the
      remaining 10 bits are the user defined key bits. Get these 10 bits
      from the user-space (through the SysFs interface) and use it to form the
      admin port-key. Allowed range for the user-key is 0 - 1023 (10 bits). If
      it is not provided then use zero for the user-key-bits (default).
      
      It can set using following example code -
      
         # modprobe bonding mode=4
         # usr_port_key=$(( RANDOM & 0x3FF ))
         # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key
         # echo +eth1 > /sys/class/net/bond0/bonding/slaves
         ...
         # ip link set bond0 up
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Reviewed-by: NNikolay Aleksandrov <nikolay@redhat.com>
      [jt: * fixed up style issues reported by checkpatch
           * fixed up context from change in ad_actor_sys_prio patch]
      Signed-off-by: NJonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d22a5fc0
    • M
      bonding: Allow userspace to set actors' macaddr in an AD-system. · 74514957
      Mahesh Bandewar 提交于
      In an AD system, the communication between actor and partner is the
      business between these two entities. In the current setup anyone on the
      same L2 can "guess" the LACPDU contents and then possibly send the
      spoofed LACPDUs and trick the partner causing connectivity issues for
      the AD system. This patch allows to use a random mac-address obscuring
      it's identity making it harder for someone in the L2 is do the same thing.
      
      This patch allows user-space to choose the mac-address for the AD-system.
      This mac-address can not be NULL or a Multicast. If the mac-address is set
      from user-space; kernel will honor it and will not overwrite it. In the
      absence (value from user space); the logic will default to using the
      masters' mac as the mac-address for the AD-system.
      
      It can be set using example code below -
      
         # modprobe bonding mode=4
         # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \
                          $(( (RANDOM & 0xFE) | 0x02 )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )))
         # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system
         # echo +eth1 > /sys/class/net/bond0/bonding/slaves
         ...
         # ip link set bond0 up
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Reviewed-by: NNikolay Aleksandrov <nikolay@redhat.com>
      [jt: fixed up style issues reported by checkpatch]
      Signed-off-by: NJonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74514957
    • M
      bonding: Allow userspace to set actors' system_priority in AD system · 6791e466
      Mahesh Bandewar 提交于
      This patch allows user to randomize the system-priority in an ad-system.
      The allowed range is 1 - 0xFFFF while default value is 0xFFFF. If user
      does not specify this value, the system defaults to 0xFFFF, which is
      what it was before this patch.
      
      Following example code could set the value -
          # modprobe bonding mode=4
          # sys_prio=$(( 1 + RANDOM + RANDOM ))
          # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
          # echo +eth1 > /sys/class/net/bond0/bonding/slaves
          ...
          # ip link set bond0 up
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Reviewed-by: NNikolay Aleksandrov <nikolay@redhat.com>
      [jt: * fixed up style issues reported by checkpatch
           * changed how the default value is set in bond_check_params(), this
             makes the default consistent between what gets set for a new bond
             and what the default is claimed to be in the bonding options.]
      Signed-off-by: NJonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6791e466
    • E
      net: kill sk_change_net and sk_release_kernel · affb9792
      Eric W. Biederman 提交于
      These functions are no longer needed and no longer used kill them.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      affb9792
    • E
      net: Modify sk_alloc to not reference count the netns of kernel sockets. · 26abe143
      Eric W. Biederman 提交于
      Now that sk_alloc knows when a kernel socket is being allocated modify
      it to not reference count the network namespace of kernel sockets.
      
      Keep track of if a socket needs reference counting by adding a flag to
      struct sock called sk_net_refcnt.
      
      Update all of the callers of sock_create_kern to stop using
      sk_change_net and sk_release_kernel as those hacks are no longer
      needed, to avoid reference counting a kernel socket.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26abe143
    • E
      net: Pass kern from net_proto_family.create to sk_alloc · 11aa9c28
      Eric W. Biederman 提交于
      In preparation for changing how struct net is refcounted
      on kernel sockets pass the knowledge that we are creating
      a kernel socket from sock_create_kern through to sk_alloc.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      11aa9c28
    • E
      net: Add a struct net parameter to sock_create_kern · eeb1bd5c
      Eric W. Biederman 提交于
      This is long overdue, and is part of cleaning up how we allocate kernel
      sockets that don't reference count struct net.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eeb1bd5c
    • E
      tun: Utilize the normal socket network namespace refcounting. · 140e807d
      Eric W. Biederman 提交于
      There is no need for tun to do the weird network namespace refcounting.
      The existing network namespace refcounting in tfile has almost exactly
      the same lifetime.  So rewrite the code to use the struct sock network
      namespace refcounting and remove the unnecessary hand rolled network
      namespace refcounting and the unncesary tfile->net.
      
      This change allows the tun code to directly call sock_put bypassing
      sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary.
      
      Remove the now unncessary tun_release so that if anything tries to use
      the sock_release code path the kernel will oops, and let us know about
      the bug.
      
      The macvtap code already uses it's internal socket this way.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      140e807d
    • E
      codel: add ce_threshold attribute · 80ba92fa
      Eric Dumazet 提交于
      For DCTCP or similar ECN based deployments on fabrics with shallow
      buffers, hosts are responsible for a good part of the buffering.
      
      This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
      so that DCTCP can have feedback from queuing in the host.
      
      A DCTCP enabled egress port simply have a queue occupancy threshold
      above which ECT packets get CE mark.
      
      In codel language this translates to a sojourn time, so that one doesn't
      have to worry about bytes or bandwidth but delays.
      
      This makes the host an active participant in the health of the whole
      network.
      
      This also helps experimenting DCTCP in a setup without DCTCP compliant
      fabric.
      
      On following example, ce_threshold is set to 1ms, and we can see from
      'ldelay xxx us' that TCP is not trying to go around the 5ms codel
      target.
      
      Queue has more capacity to absorb inelastic bursts (say from UDP
      traffic), as queues are maintained to an optimal level.
      
      lpaa23:~# ./tc -s -d qd sh dev eth1
      qdisc mq 1: dev eth1 root
       Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
       backlog 3108242b 364p requeues 42961
      qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
       rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
        count 0 lastcount 0 ldelay 1.0ms drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
      qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
       rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
        count 0 lastcount 0 ldelay 694us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
      qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
       rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
        count 0 lastcount 0 ldelay 889us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
      ...
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80ba92fa
  3. 10 5月, 2015 7 次提交
  4. 06 5月, 2015 7 次提交
  5. 05 5月, 2015 3 次提交
  6. 04 5月, 2015 4 次提交
    • T
      net: Add flow_keys digest · 2f59e1eb
      Tom Herbert 提交于
      Some users of flow keys (well just sch_choke now) need to pass
      flow_keys in skbuff cb, and use them for exact comparisons of flows
      so that skb->hash is not sufficient. In order to increase size of
      the flow_keys structure, we introduce another structure for
      the purpose of passing flow keys in skbuff cb. We limit this structure
      to sixteen bytes, and we will technically treat this as a digest of
      flow_keys struct hence its name flow_keys_digest. In the first
      incaranation we just copy the flow_keys structure up to 16 bytes--
      this is the same information previously passed in the cb. In the
      future, we'll adapt this for larger flow_keys and could use something
      like SHA-1 over the whole flow_keys to improve the quality of the
      digest.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f59e1eb
    • T
      net: Add skb_get_hash_perturb · 50fb7992
      Tom Herbert 提交于
      This calls flow_disect and __skb_get_hash to procure a hash for a
      packet. Input includes a key to initialize jhash. This function
      does not set skb->hash.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      50fb7992
    • A
      etherdev: Process is_multicast_ether_addr at same size as other operations · d54385ce
      Alexander Duyck 提交于
      This change makes it so that we process the address in
      is_multicast_ether_addr at the same size as the other calls.  This allows
      us to avoid duplicate reads when used with other calls such as
      is_zero_ether_addr or eth_addr_copy.  In addition I have added a 64 bit
      version of the function so in eth_type_trans we can process the destination
      address as a 64 bit value throughout.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d54385ce
    • T
      ipv6: Flow label state ranges · 82a584b7
      Tom Herbert 提交于
      This patch divides the IPv6 flow label space into two ranges:
      0-7ffff is reserved for flow label manager, 80000-fffff will be
      used for creating auto flow labels (per RFC6438). This only affects how
      labels are set on transmit, it does not affect receive. This range split
      can be disbaled by systcl.
      
      Background:
      
      IPv6 flow labels have been an unmitigated disappointment thus far
      in the lifetime of IPv6. Support in HW devices to use them for ECMP
      is lacking, and OSes don't turn them on by default. If we had these
      we could get much better hashing in IPv6 networks without resorting
      to DPI, possibly eliminating some of the motivations to to define new
      encaps in UDP just for getting ECMP.
      
      Unfortunately, the initial specfications of IPv6 did not clarify
      how they are to be used. There has always been a vague concept that
      these can be used for ECMP, flow hashing, etc. and we do now have a
      good standard how to this in RFC6438. The problem is that flow labels
      can be either stateful or stateless (as in RFC6438), and we are
      presented with the possibility that a stateless label may collide
      with a stateful one.  Attempts to split the flow label space were
      rejected in IETF. When we added support in Linux for RFC6438, we
      could not turn on flow labels by default due to this conflict.
      
      This patch splits the flow label space and should give us
      a path to enabling auto flow labels by default for all IPv6 packets.
      This is an API change so we need to consider compatibility with
      existing deployment. The stateful range is chosen to be the lower
      values in hopes that most uses would have chosen small numbers.
      
      Once we resolve the stateless/stateful issue, we can proceed to
      look at enabling RFC6438 flow labels by default (starting with
      scaled testing).
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      82a584b7
  7. 03 5月, 2015 1 次提交
  8. 02 5月, 2015 3 次提交
    • S
      virtio: fix typo in vring_need_event() doc comment · e412d3a3
      Stefan Hajnoczi 提交于
      Here the "other side" refers to the guest or host.
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e412d3a3
    • M
      ipv6: Remove DST_METRICS_FORCE_OVERWRITE and _rt6i_peer · afc4eef8
      Martin KaFai Lau 提交于
      _rt6i_peer is no longer needed after the last patch,
      'ipv6: Stop rt6_info from using inet_peer's metrics'.
      
      DST_METRICS_FORCE_OVERWRITE is added by
      commit e5fd387a ("ipv6: do not overwrite inetpeer metrics prematurely").
      Since inetpeer is no longer used for metrics, this bit is also not needed.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      afc4eef8
    • M
      ipv6: Stop rt6_info from using inet_peer's metrics · 4b32b5ad
      Martin KaFai Lau 提交于
      inet_peer is indexed by the dst address alone.  However, the fib6 tree
      could have multiple routing entries (rt6_info) for the same dst. For
      example,
      1. A /128 dst via multiple gateways.
      2. A RTF_CACHE route cloned from a /128 route.
      
      In the above cases, all of them will share the same metrics and
      step on each other.
      
      This patch will steer away from inet_peer's metrics and use
      dst_cow_metrics_generic() for everything.
      
      Change Highlights:
      1. Remove rt6_cow_metrics() which currently acquires metrics from
         inet_peer for DST_HOST route (i.e. /128 route).
      2. Add rt6i_pmtu to take care of the pmtu update to avoid creating a
         full size metrics just to override the RTAX_MTU.
      3. After (2), the RTF_CACHE route can also share the metrics with its
         dst.from route, by:
         dst_init_metrics(&cache_rt->dst, dst_metrics_ptr(cache_rt->dst.from), true);
      4. Stop creating RTF_CACHE route by cloning another RTF_CACHE route.  Instead,
         directly clone from rt->dst.
      
         [ Currently, cloning from another RTF_CACHE is only possible during
           rt6_do_redirect().  Also, the old clone is removed from the tree
           immediately after the new clone is added. ]
      
         In case of cloning from an older redirect RTF_CACHE, it should work as
         before.
      
         In case of cloning from an older pmtu RTF_CACHE, this patch will forget
         the pmtu and re-learn it (if there is any) from the redirected route.
      
      The _rt6i_peer and DST_METRICS_FORCE_OVERWRITE will be removed
      in the next cleanup patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b32b5ad