1. 13 May, 2015 8 commits
  2. 12 May, 2015 15 commits
  3. 11 May, 2015 17 commits
    • D
      Merge branch 'handle_ing_lightweight' · 3bb45001
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      handle_ing update
      
      These are a couple of cleanups to make ingress a bit more lightweight.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: sched: further simplify handle_ing · d2788d34
      Daniel Borkmann committed
      Ingress qdisc has no other purpose than calling into tc_classify()
      that executes attached classifier(s) and action(s).
      
      It has a 1:1 relationship to dev->ingress_queue. After having commit
      087c1a60 ("net: sched: run ingress qdisc without locks") removed
      the central ingress lock, one major contention point is gone.
      
      The extra indirection layers, however, are not necessary for calling
      into ingress qdisc. pktgen calling locally into netif_receive_skb()
      with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
      E3-1240: before ~21.1 Mpps, after patch ~22.9 Mpps.
      
      We can redirect the private classifier list to the netdev directly,
      without changing any classifier API bits (!) and execute on that from
      handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed,
      ingress qdisc doesn't have a queue and thus dev_deactivate_queue()
      is also not applicable, ingress_cl_list provides similar behaviour.
      In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc.
      
      One next possible step is the removal of the dev's ingress (dummy)
      netdev_queue, and to only have the list member in the netdevice
      itself.
      
      Note, the filter chain is RCU protected and individual filter elements
      are being kfree'd by sched subsystem after RCU grace period. RCU read
      lock is being held by __netif_receive_skb_core().
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: sched: consolidate handle_ing and ing_filter · c9e99fd0
      Daniel Borkmann committed
      Given quite some code has been removed from ing_filter(), we can just
      consolidate that function into handle_ing() and get rid of a few
      instructions at the same time.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • X
      test: bpf: extend "load 64-bit immediate" testcase · 986ccfdb
      Xi Wang committed
      Extend the testcase to catch a signedness bug in the arm64 JIT:
      
      test_bpf: #58 load 64-bit immediate jited:1 ret -1 != 1 FAIL (1 times)
      
      This is useful to ensure other JITs won't have a similar bug.
      
      Link: https://lkml.org/lkml/2015/5/8/458
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Xi Wang <xi.wang@gmail.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'bonding_netlink_lacp' · 32f89e5c
      David S. Miller committed
      Jonathan Toppins says:
      
      ====================
      add netlink support for new lacp bonding parameters
      
      This is a resubmit of Mahesh's last 3 bonding patches from this series
      (http://marc.info/?l=linux-netdev&m=142432864626179&w=2) with one
      additional kernel patch which adds the netlink bits. I have noted any
      modifications I did to the original patches just above my signoff line.
      Patch 5 is the iproute2 support for these bonding options. All patches
      were coded against the net-next branch of their respective projects.
      
      v2:
        * rebased
        * only send these new parameters via netlink when bond is in mode 4
        * fixed ad_actor_sys_prio to be 0xFFFF by default even when the bond
          is initially created in mode 0 and switched to mode 4
      
      v3:
        * reverted changes to bond_option_ad_actor_system_set() from v1 in Mahesh's
          patch "bonding: Allow userspace to set actors' macaddr in an AD-system."
          Instead implementing all setting in the option specific set function as
          Nik suggested.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      bonding: add netlink support for sys prio, actor sys mac, and port key · 171a42c3
      Andy Gospodarek committed
      Adds netlink support for the following bonding options:
      * BOND_OPT_AD_ACTOR_SYS_PRIO
      * BOND_OPT_AD_ACTOR_SYSTEM
      * BOND_OPT_AD_USER_PORT_KEY
      
      When setting the actor system mac address we assume the netlink message
      contains a binary mac and not a string representation of a mac.
      Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
      [jt: completed the setting side of the netlink attributes]
      Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
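The commit above notes that the netlink attribute carries a binary MAC rather than its string form. As a rough illustration of that difference, the sketch below turns the usual colon-separated string into the six byte values that would travel in the message. The function name `mac_to_bytes` is hypothetical, the input is assumed to be well-formed hex, and this is not the kernel or iproute2 code.

```shell
# Convert a textual MAC ("02:11:22:33:44:55") into its six byte values.
# Sketch only: mac_to_bytes is a hypothetical helper, and the input is
# assumed to contain valid two-digit hex octets.
mac_to_bytes() {
    # Split on ':' by turning the separators into spaces.
    set -- $(printf '%s' "$1" | tr ':' ' ')
    # A MAC address has exactly six octets.
    [ "$#" -eq 6 ] || return 1
    out=
    for octet in "$@"; do
        # Interpret each octet as a hexadecimal number.
        out="$out${out:+ }$(( 0x$octet ))"
    done
    echo "$out"
}

mac_to_bytes 02:11:22:33:44:55   # prints "2 17 34 51 68 85"
```

The string form is 17 characters long, while the binary form is always 6 bytes, which is why the receiving side has to know which representation to expect.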
    • M
      bonding: Implement user key part of port_key in an AD system. · d22a5fc0
      Mahesh Bandewar committed
      The port key has three components - user-key, speed-part, and duplex-part.
      The LSBit is for the duplex-part, next 5 bits are for the speed while the
      remaining 10 bits are the user defined key bits. Get these 10 bits
      from the user-space (through the SysFs interface) and use it to form the
      admin port-key. Allowed range for the user-key is 0 - 1023 (10 bits). If
      it is not provided then use zero for the user-key-bits (default).
      
      It can be set using the following example code -
      
         # modprobe bonding mode=4
         # usr_port_key=$(( RANDOM & 0x3FF ))
         # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key
         # echo +eth1 > /sys/class/net/bond0/bonding/slaves
         ...
         # ip link set bond0 up
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
      [jt: * fixed up style issues reported by checkpatch
           * fixed up context from change in ad_actor_sys_prio patch]
      Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
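The bit layout described in the commit above (duplex in the least significant bit, speed in the next 5 bits, user key in the remaining 10 bits) can be sketched with shell arithmetic. The function name `make_port_key` and the way the speed value is passed are assumptions made for illustration; the kernel derives the speed and duplex parts itself.

```shell
# Compose a 16-bit admin port key from its three parts (sketch).
# Layout per the commit message: bit 0 = duplex, bits 1-5 = speed,
# bits 6-15 = user key. make_port_key is a hypothetical helper name.
make_port_key() {
    user_key=$1   # 0..1023 (10 bits)
    speed=$2      # 0..31   (5 bits), treated here as an opaque field
    duplex=$3     # 0 or 1  (1 bit)

    # Reject user keys outside the documented 0 - 1023 range.
    if [ "$user_key" -lt 0 ] || [ "$user_key" -gt 1023 ]; then
        echo "user key out of range: $user_key" >&2
        return 1
    fi
    echo $(( (user_key << 6) | ((speed & 0x1F) << 1) | (duplex & 0x1) ))
}

make_port_key 5 2 1   # prints 325 (5<<6 = 320, 2<<1 = 4, duplex = 1)
```

Masking the user key with `0x3FF`, as the commit's own example does, guarantees the value fits in the 10 available bits.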
    • M
      bonding: Allow userspace to set actors' macaddr in an AD-system. · 74514957
      Mahesh Bandewar committed
      In an AD system, the communication between actor and partner is the
      business between these two entities. In the current setup anyone on the
      same L2 can "guess" the LACPDU contents and then possibly send the
      spoofed LACPDUs and trick the partner causing connectivity issues for
      the AD system. This patch allows the use of a random mac-address, obscuring
      its identity and making it harder for someone on the same L2 to do the same thing.
      
      This patch allows user-space to choose the mac-address for the AD-system.
      This mac-address cannot be NULL or a multicast address. If the mac-address
      is set from user-space, the kernel will honor it and will not overwrite it.
      In the absence of a value from user space, the logic will default to using
      the master's mac as the mac-address for the AD-system.
      
      It can be set using example code below -
      
         # modprobe bonding mode=4
         # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \
                          $(( (RANDOM & 0xFE) | 0x02 )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )))
         # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system
         # echo +eth1 > /sys/class/net/bond0/bonding/slaves
         ...
         # ip link set bond0 up
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
      [jt: fixed up style issues reported by checkpatch]
      Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
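The constraint stated in the commit above (the address must not be all-zero or multicast) can be checked in user space before writing to sysfs. This is a minimal sketch, assuming a colon-separated textual address; `is_valid_actor_mac` is a hypothetical helper and does not reproduce the kernel's own validation.

```shell
# Return 0 if $1 looks like a usable AD actor system address:
# six hex octets, not all-zero, and not multicast (the low bit of the
# first octet must be clear). is_valid_actor_mac is a hypothetical name.
is_valid_actor_mac() {
    mac=$1
    # Require exactly six colon-separated two-digit hex octets.
    if ! printf '%s\n' "$mac" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'; then
        return 1
    fi
    # Reject the all-zero address.
    if [ "$mac" = "00:00:00:00:00:00" ]; then
        return 1
    fi
    # Reject multicast: bit 0 of the first octet is the group bit.
    first=${mac%%:*}
    [ $(( 0x$first & 1 )) -eq 0 ]
}

is_valid_actor_mac 02:11:22:33:44:55 && echo "unicast address accepted"
is_valid_actor_mac 01:00:5e:00:00:01 || echo "multicast address rejected"
```

In the commit's example, `(RANDOM & 0xFE) | 0x02` clears the group bit and sets the locally administered bit of the first octet, so a generated address always passes this check.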
    • M
      bonding: Allow userspace to set actors' system_priority in AD system · 6791e466
      Mahesh Bandewar committed
      This patch allows user to randomize the system-priority in an ad-system.
      The allowed range is 1 - 0xFFFF while default value is 0xFFFF. If user
      does not specify this value, the system defaults to 0xFFFF, which is
      what it was before this patch.
      
      The following example code could set the value -
          # modprobe bonding mode=4
          # sys_prio=$(( 1 + RANDOM + RANDOM ))
          # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
          # echo +eth1 > /sys/class/net/bond0/bonding/slaves
          ...
          # ip link set bond0 up
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
      [jt: * fixed up style issues reported by checkpatch
           * changed how the default value is set in bond_check_params(), this
             makes the default consistent between what gets set for a new bond
             and what the default is claimed to be in the bonding options.]
      Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
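The allowed range in the commit above (1 - 0xFFFF) lines up with its example: bash's `RANDOM` yields 0..32767, so `1 + RANDOM + RANDOM` spans 1..65535. A small range check, sketched under the assumption that the value arrives as a decimal string, might look like this; `is_valid_sys_prio` is a hypothetical helper name.

```shell
# Return 0 if $1 is a decimal system priority within 1..65535 (0xFFFF).
# is_valid_sys_prio is a hypothetical helper, not part of the patch.
is_valid_sys_prio() {
    prio=$1
    # Accept only non-empty strings made of decimal digits.
    case "$prio" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$prio" -ge 1 ] && [ "$prio" -le 65535 ]
}

# Upper bound of the commit's example expression: 1 + 32767 + 32767.
echo $(( 1 + 32767 + 32767 ))   # prints 65535
is_valid_sys_prio 65535 && echo "65535 is in range"
is_valid_sys_prio 0 || echo "0 is out of range"
```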
    • D
      Merge branch 'kernel_socket_netns' · 0198e09c
      David S. Miller committed
      Eric W. Biederman says:
      
      ====================
      Cleanup the kernel sockets.
      
      Right now the situation for allocating kernel sockets is a mess.
      - sock_create_kern does not take a namespace parameter.
      - kernel sockets must not reference count a network namespace and keep
        it alive or else we will have a reference counting loop.
      - The way we avoid the reference counting loop with sk_change_net
        and sk_release_kernel are major hacks.
      
      This patchset addresses this mess by fixing sock_create_kern to do
      everything necessary to create a kernel socket.  None of the current
      users of kernel sockets need the network namespace reference counted.
      Either kernel sockets are network namespace aware (and using the current
      hacks) or kernel sockets are limited to the initial network namespace
      in which case it does not matter.
      
      This patchset starts by addressing tun which should be using normal
      userspace sockets like macvtap.
      
      Then sock_create_kern is fixed to take a network namespace.
      Then the in-kernel status of sockets is passed through to sk_alloc.
      Then sk_alloc is fixed to not reference count the network namespace
           of kernel sockets.
      Then the callers of sock_create_kern are fixed up to stop using hacks.
      Then netlink, which uses its own flavor of sock_create_kern, is fixed.
      
      Finally the hacks that are sk_change_net and sk_release_kernel are removed.
      
      When it is all done the code is easier to follow, easier to use, easier
      to maintain and shorter by about 70 lines.
      ====================
      Reported-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: kill sk_change_net and sk_release_kernel · affb9792
      Eric W. Biederman committed
      These functions are no longer needed and no longer used, so kill them.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      netlink: Create kernel netlink sockets in the proper network namespace · 13d3078e
      Eric W. Biederman committed
      Utilize the new functionality of sk_alloc so that nothing needs to be
      done to suppress the reference counting on kernel sockets.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: Modify sk_alloc to not reference count the netns of kernel sockets. · 26abe143
      Eric W. Biederman committed
      Now that sk_alloc knows when a kernel socket is being allocated modify
      it to not reference count the network namespace of kernel sockets.
      
      Keep track of if a socket needs reference counting by adding a flag to
      struct sock called sk_net_refcnt.
      
      Update all of the callers of sock_create_kern to stop using
      sk_change_net and sk_release_kernel as those hacks are no longer
      needed, to avoid reference counting a kernel socket.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: Pass kern from net_proto_family.create to sk_alloc · 11aa9c28
      Eric W. Biederman committed
      In preparation for changing how struct net is refcounted
      on kernel sockets pass the knowledge that we are creating
      a kernel socket from sock_create_kern through to sk_alloc.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: Add a struct net parameter to sock_create_kern · eeb1bd5c
      Eric W. Biederman committed
      This is long overdue, and is part of cleaning up how we allocate kernel
      sockets that don't reference count struct net.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      tun: Utilize the normal socket network namespace refcounting. · 140e807d
      Eric W. Biederman committed
      There is no need for tun to do the weird network namespace refcounting.
      The existing network namespace refcounting in tfile has almost exactly
      the same lifetime.  So rewrite the code to use the struct sock network
      namespace refcounting and remove the unnecessary hand rolled network
      namespace refcounting and the unnecessary tfile->net.
      
      This change allows the tun code to directly call sock_put bypassing
      sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary.
      
      Remove the now unnecessary tun_release so that if anything tries to use
      the sock_release code path the kernel will oops, and let us know about
      the bug.
      
      The macvtap code already uses its internal socket this way.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      codel: add ce_threshold attribute · 80ba92fa
      Eric Dumazet committed
      For DCTCP or similar ECN based deployments on fabrics with shallow
      buffers, hosts are responsible for a good part of the buffering.
      
      This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
      so that DCTCP can have feedback from queuing in the host.
      
      A DCTCP-enabled egress port simply has a queue occupancy threshold
      above which ECT packets get a CE mark.
      
      In codel language this translates to a sojourn time, so that one doesn't
      have to worry about bytes or bandwidth but delays.
      
      This makes the host an active participant in the health of the whole
      network.
      
      This also helps when experimenting with DCTCP in a setup without a
      DCTCP-compliant fabric.
      
      In the following example, ce_threshold is set to 1ms, and we can see from
      'ldelay xxx us' that TCP is not trying to go around the 5ms codel
      target.
      
      Queue has more capacity to absorb inelastic bursts (say from UDP
      traffic), as queues are maintained to an optimal level.
      
      lpaa23:~# ./tc -s -d qd sh dev eth1
      qdisc mq 1: dev eth1 root
       Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
       backlog 3108242b 364p requeues 42961
      qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
       rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
        count 0 lastcount 0 ldelay 1.0ms drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
      qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
       rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
        count 0 lastcount 0 ldelay 694us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
      qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
       rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
        count 0 lastcount 0 ldelay 889us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
      ...
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
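The marking rule described in the commit above (ECT packets get a CE mark once their sojourn time exceeds ce_threshold) can be sketched as a simple predicate. The function name `should_ce_mark`, its argument order, and the strict greater-than comparison are assumptions for illustration; the actual decision is made inside the codel dequeue path.

```shell
# Decide whether a packet would get a CE mark under ce_threshold (sketch).
# Arguments: sojourn time in us, ce_threshold in us, ECT flag (1 = ECT).
# should_ce_mark is a hypothetical helper; the strict ">" is an assumption
# based on the "above which" wording of the commit message.
should_ce_mark() {
    sojourn_us=$1
    ce_threshold_us=$2
    ect=$3
    # Only ECN-capable (ECT) packets can be marked instead of dropped.
    [ "$ect" -eq 1 ] && [ "$sojourn_us" -gt "$ce_threshold_us" ]
}

# 694us is one of the ldelay values in the output above; 1500us is made up.
should_ce_mark 694 1000 1 || echo "694us: below the 1ms threshold, no CE mark"
should_ce_mark 1500 1000 1 && echo "1500us: above the 1ms threshold, CE mark"
```

Expressing the threshold as a delay rather than a byte count is what lets the same setting apply regardless of link bandwidth, as the commit message points out.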