1. 11 May, 2015 10 commits
    • M
      bonding: Allow userspace to set actors' system_priority in AD system · 6791e466
      Mahesh Bandewar committed
      This patch allows the user to randomize the system priority in an AD
      system. The allowed range is 1 - 0xFFFF, and the default value is
      0xFFFF. If the user does not specify this value, the system defaults to
      0xFFFF, which is what it was before this patch.
      
      The following example sets the value:
          # modprobe bonding mode=4
          # sys_prio=$(( 1 + RANDOM + RANDOM ))
          # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
          # echo +eth1 > /sys/class/net/bond0/bonding/slaves
          ...
          # ip link set bond0 up
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
      [jt: * fixed up style issues reported by checkpatch
           * changed how the default value is set in bond_check_params(), this
             makes the default consistent between what gets set for a new bond
             and what the default is claimed to be in the bonding options.]
      Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
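As a rough illustration of the range described in the bonding commit above, the sketch below picks and validates a system priority in Python. The function and constant names are hypothetical and are not part of the bonding driver; only the range 1 - 0xFFFF and the `1 + RANDOM + RANDOM` expression come from the commit message.

```python
import random

AD_SYS_PRIO_MIN = 1
AD_SYS_PRIO_MAX = 0xFFFF  # also the default, per the commit message


def is_valid_sys_prio(value):
    """Return True if value lies in the allowed range 1..0xFFFF."""
    return AD_SYS_PRIO_MIN <= value <= AD_SYS_PRIO_MAX


def uniform_sys_prio(rng=random):
    """Pick a priority uniformly from the allowed range."""
    return rng.randint(AD_SYS_PRIO_MIN, AD_SYS_PRIO_MAX)


def bash_style_sys_prio(rng=random):
    """Mirror `1 + RANDOM + RANDOM`; each bash RANDOM is in 0..32767."""
    return 1 + rng.randint(0, 32767) + rng.randint(0, 32767)
```

The shell expression always lands inside the valid range (1 to 65535), although the sum of two uniform values is not itself uniformly distributed.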
    • D
      Merge branch 'kernel_socket_netns' · 0198e09c
      David S. Miller committed
      Eric W. Biederman says:
      
      ====================
      Cleanup the kernel sockets.
      
      Right now the situation for allocating kernel sockets is a mess.
      - sock_create_kern does not take a namespace parameter.
      - kernel sockets must not reference count a network namespace and keep
        it alive or else we will have a reference counting loop.
      - The way we avoid the reference counting loop with sk_change_net
        and sk_release_kernel are major hacks.
      
      This patchset addresses this mess by fixing sock_create_kern to do
      everything necessary to create a kernel socket.  None of the current
      users of kernel sockets need the network namespace reference counted.
      Either kernel sockets are network namespace aware (and using the current
      hacks) or kernel sockets are limited to the initial network namespace
      in which case it does not matter.
      
      This patchset starts by addressing tun which should be using normal
      userspace sockets like macvtap.
      
      Then sock_create_kern is fixed to take a network namespace.
      Then the in-kernel status of sockets is passed through to sk_alloc.
      Then sk_alloc is fixed to not reference count the network namespace
           of kernel sockets.
      Then the callers of sock_create_kern are fixed up to stop using hacks.
      Then netlink, which uses its own flavor of sock_create_kern, is fixed.
      
      Finally the hacks that are sk_change_net and sk_release_kernel are removed.
      
      When it is all done the code is easier to follow, easier to use, easier
      to maintain and shorter by about 70 lines.
      ====================
      Reported-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: kill sk_change_net and sk_release_kernel · affb9792
      Eric W. Biederman committed
      These functions are no longer needed and no longer used, so kill them.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      netlink: Create kernel netlink sockets in the proper network namespace · 13d3078e
      Eric W. Biederman committed
      Utilize the new functionality of sk_alloc so that nothing needs to be
      done to suppress the reference counting on kernel sockets.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: Modify sk_alloc to not reference count the netns of kernel sockets. · 26abe143
      Eric W. Biederman committed
      Now that sk_alloc knows when a kernel socket is being allocated, modify
      it to not reference count the network namespace of kernel sockets.
      
      Keep track of whether a socket needs reference counting by adding a flag
      to struct sock called sk_net_refcnt.
      
      Update all of the callers of sock_create_kern to stop using
      sk_change_net and sk_release_kernel, as those hacks are no longer
      needed to avoid reference counting a kernel socket.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
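The idea behind sk_net_refcnt can be pictured with a toy model. The Python classes below are purely illustrative stand-ins (they are not kernel structures or APIs); they only show that a socket created with the kernel flag set leaves the namespace's reference count untouched, while a userspace socket pins it.

```python
class Net:
    """Toy stand-in for struct net: just a reference counter."""

    def __init__(self):
        self.refcount = 1  # reference held by the namespace's creator


class Sock:
    """Toy stand-in for struct sock with a net_refcnt flag."""

    def __init__(self, net, kern):
        self.net = net
        # Mirrors the idea of sk_net_refcnt: only non-kernel sockets
        # take a reference on the network namespace.
        self.net_refcnt = not kern
        if self.net_refcnt:
            net.refcount += 1

    def release(self):
        """Drop the namespace reference only if one was taken."""
        if self.net_refcnt:
            self.net.refcount -= 1
            self.net_refcnt = False
```

In this model a kernel socket can live inside a namespace without keeping it alive, which is the property that avoids the reference counting loop described in the series.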
    • E
      net: Pass kern from net_proto_family.create to sk_alloc · 11aa9c28
      Eric W. Biederman committed
      In preparation for changing how struct net is refcounted
      on kernel sockets, pass the knowledge that we are creating
      a kernel socket from sock_create_kern through to sk_alloc.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: Add a struct net parameter to sock_create_kern · eeb1bd5c
      Eric W. Biederman committed
      This is long overdue, and is part of cleaning up how we allocate kernel
      sockets that don't reference count struct net.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      tun: Utilize the normal socket network namespace refcounting. · 140e807d
      Eric W. Biederman committed
      There is no need for tun to do the weird network namespace refcounting.
      The existing network namespace refcounting in tfile has almost exactly
      the same lifetime.  So rewrite the code to use the struct sock network
      namespace refcounting and remove the unnecessary hand-rolled network
      namespace refcounting and the unnecessary tfile->net.
      
      This change allows the tun code to directly call sock_put, bypassing
      sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary.
      
      Remove the now unnecessary tun_release so that if anything tries to use
      the sock_release code path the kernel will oops, and let us know about
      the bug.
      
      The macvtap code already uses its internal socket this way.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      codel: add ce_threshold attribute · 80ba92fa
      Eric Dumazet committed
      For DCTCP or similar ECN based deployments on fabrics with shallow
      buffers, hosts are responsible for a good part of the buffering.
      
      This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
      so that DCTCP can have feedback from queuing in the host.
      
      A DCTCP-enabled egress port simply has a queue occupancy threshold
      above which ECT packets get a CE mark.
      
      In codel language this translates to a sojourn time, so that one doesn't
      have to worry about bytes or bandwidth but delays.
      
      This makes the host an active participant in the health of the whole
      network.
      
      This also helps when experimenting with DCTCP in a setup without a
      DCTCP-compliant fabric.
      
      In the following example, ce_threshold is set to 1ms, and we can see from
      'ldelay xxx us' that TCP is not trying to go around the 5ms codel
      target.
      
      Queue has more capacity to absorb inelastic bursts (say from UDP
      traffic), as queues are maintained to an optimal level.
      
      lpaa23:~# ./tc -s -d qd sh dev eth1
      qdisc mq 1: dev eth1 root
       Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
       backlog 3108242b 364p requeues 42961
      qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
       rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
        count 0 lastcount 0 ldelay 1.0ms drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
      qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
       rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
        count 0 lastcount 0 ldelay 694us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
      qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
       rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
        count 0 lastcount 0 ldelay 889us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
      ...
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
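The marking rule described in the codel commit above (mark ECT packets once their sojourn time exceeds ce_threshold) can be sketched as a small predicate. This is a simplified illustration, not the qdisc code: the function name is hypothetical, and using `None` to mean "threshold not configured" is an assumption of this sketch.

```python
def should_ce_mark(sojourn_us, ce_threshold_us, ect_capable):
    """Return True when a dequeued packet should receive a CE mark.

    sojourn_us      -- time the packet spent in the queue, in microseconds
    ce_threshold_us -- configured threshold, or None when the optional
                       attribute is not set (assumption of this sketch)
    ect_capable     -- whether the packet is ECN-capable (ECT)
    """
    if ce_threshold_us is None:
        return False
    return ect_capable and sojourn_us > ce_threshold_us
```

With a 1ms threshold, a packet that waited 1.5ms would be marked if it is ECT, while one that waited 694us would not. The decision depends only on delay, which is why bytes and bandwidth do not enter into it.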
    • V
      ethernet: qualcomm: use spi instead of spi_device · cf9d0dcc
      Varka Bhadram committed
      All spi based drivers have an instance of struct spi_device
      as spi. This patch renames spi_device to spi to synchronize
      with all the drivers.
      Signed-off-by: Varka Bhadram <varkab@cdac.in>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 10 May, 2015 30 commits
    • D
      Merge branch 'pktgen-next' · 3e3b3468
      David S. Miller committed
      Jesper Dangaard Brouer says:
      
      ====================
      The following series introduces some pktgen changes.
      
      Patch01:
       Cleanup my own work when I introduced NO_TIMESTAMP.
      
      Patch02:
       Took over patch from Alexei, and addressed my own concerns, as Alexei
       is too busy with other work, and this will provide an easy tool for
       measuring ingress path performance, which is a hot topic ATM.
      
       Changes were primarily user interface related.  Introduced a separate
       "xmit_mode" setting, instead of stealing one of the dev flags like
       Alexei did.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      pktgen: introduce xmit_mode '<start_xmit|netif_receive>' · 62f64aed
      Alexei Starovoitov committed
      Introduce xmit_mode 'netif_receive' for pktgen which generates the
      packets using familiar pktgen commands, but feeds them into
      netif_receive_skb() instead of ndo_start_xmit().
      
      Default mode is called 'start_xmit'.
      
      It is designed to test netif_receive_skb and ingress qdisc
      performance only. Make sure to understand how it works before
      using it for other rx benchmarking.
      
      Sample script 'pktgen.sh':
      #!/bin/bash
      function pgset() {
        local result
      
        echo $1 > $PGDEV
      
        result=`cat $PGDEV | fgrep "Result: OK:"`
        if [ "$result" = "" ]; then
          cat $PGDEV | fgrep Result:
        fi
      }
      
      [ -z "$1" ] && echo "Usage: $0 DEV" && exit 1
      ETH=$1
      
      PGDEV=/proc/net/pktgen/kpktgend_0
      pgset "rem_device_all"
      pgset "add_device $ETH"
      
      PGDEV=/proc/net/pktgen/$ETH
      pgset "xmit_mode netif_receive"
      pgset "pkt_size 60"
      pgset "dst 198.18.0.1"
      pgset "dst_mac 90:e2:ba:ff:ff:ff"
      pgset "count 10000000"
      pgset "burst 32"
      
      PGDEV=/proc/net/pktgen/pgctrl
      echo "Running... ctrl^C to stop"
      pgset "start"
      echo "Done"
      cat /proc/net/pktgen/$ETH
      
      Usage:
      $ sudo ./pktgen.sh eth2
      ...
      Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags)
        43033682pps 20656Mb/sec (20656167360bps) errors: 10000000
      
      Raw netif_receive_skb speed should be ~43 million packets
      per second on a 3.7GHz x86, and 'perf report' should look like:
        37.69%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
        25.81%  kpktgend_0   [kernel.vmlinux]  [k] kfree_skb
         7.22%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
         5.68%  kpktgend_0   [pktgen]          [k] pktgen_thread_worker
      
      If fib_table_lookup is seen on top, it means the skb was processed
      by the stack. To benchmark netif_receive_skb only, make sure
      that the 'dst_mac' of your pktgen script is different from the
      receiving device's MAC, so that the packet is dropped by ip_rcv.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
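The throughput figures in the pktgen `Result:` line above can be cross-checked with simple arithmetic. The helper below is a hypothetical sketch, not pktgen's own calculation: it derives packets per second from the packet count and elapsed microseconds, and bits per second from the packet size.

```python
def pktgen_rates(packets, elapsed_usec, pkt_size_bytes):
    """Return (packets per second, bits per second) for a pktgen run."""
    pps = packets * 1_000_000 // elapsed_usec
    bps = pps * pkt_size_bytes * 8
    return pps, bps


# Figures from the sample run: 10000000 packets of 60 bytes in 232376 usec.
pps, bps = pktgen_rates(10_000_000, 232_376, 60)
print(f"{pps} pps, {bps // 1_000_000} Mb/sec")
```

The recomputed rate agrees with the reported 43033682pps to within a few tens of packets per second; the small difference presumably comes from pktgen timing the run at a finer resolution than the rounded microsecond value it prints. The reported 20656167360bps is exactly the reported pps multiplied by 60 bytes and 8 bits.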
    • J
      pktgen: adjust flag NO_TIMESTAMP to be more pktgen compliant · f1f00d8f
      Jesper Dangaard Brouer committed
      Allow flag NO_TIMESTAMP to turn timestamping on again, like other flags,
      with a negation of the flag like !NO_TIMESTAMP.
      
      Also document the option flag NO_TIMESTAMP.
      
      Fixes: afb84b62 ("pktgen: add flag NO_TIMESTAMP to disable timestamping")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'netns-scalability' · 4d95b72f
      David S. Miller committed
      Nicolas Dichtel says:
      
      ====================
      netns: ease netlink use with a lot of netns
      
      This idea was informally discussed in Ottawa / netdev0.1. The goal is to
      ease the use/scalability of netns, from a userland point of view.
      Today, users need to open one netlink socket per family and per netns.
      Thus, when the number of netns increases (for example 5K or more), the
      number of sockets needed to manage them grows a lot.
      
      The goal of this series is to be able to monitor netlink events, for a
      specified family, for a set of netns, with only one netlink socket. For
      this purpose, a netlink socket option is added: NETLINK_LISTEN_ALL_NSID.
      When this option is set on a netlink socket, this socket will receive
      netlink notifications from all netns that have a nsid assigned into the
      netns where the socket has been opened.
      The nsid is sent to userland via ancillary data.
      
      Here is an example with a patched iproute2. vxlan10 is created in the
      current netns (netns0, nsid 0) and then moved to another netns (netns1,
      nsid 1):
      
      $ ip netns exec netns0 ip monitor all-nsid label
      [nsid 0][NSID]nsid 1 (iproute2 netns name: netns1)
      [nsid 0][NEIGH]??? lladdr 00:00:00:00:00:00 REACHABLE,PERMANENT
      [nsid 0][LINK]5: vxlan10@NONE: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default
          link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff
      [nsid 0][LINK]Deleted 5: vxlan10@NONE: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default
          link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff
      [nsid 1][NSID]nsid 0 (iproute2 netns name: netns0)
      [nsid 1][LINK]5: vxlan10@NONE: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default
          link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff link-netnsid 0
      [nsid 1][ADDR]5: vxlan10    inet 192.168.0.249/24 brd 192.168.0.255 scope global vxlan10
             valid_lft forever preferred_lft forever
      [nsid 1][ROUTE]local 192.168.0.249 dev vxlan10  table local  proto kernel  scope host  src 192.168.0.249
      [nsid 1][ROUTE]ff00::/8 dev vxlan10  table local  metric 256  pref medium
      [nsid 1][ROUTE]2001:123::/64 dev vxlan10  proto kernel  metric 256  pref medium
      [nsid 1][LINK]5: vxlan10@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
          link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff link-netnsid 0
      [nsid 1][ROUTE]broadcast 192.168.0.255 dev vxlan10  table local  proto kernel  scope link  src 192.168.0.249
      [nsid 1][ROUTE]192.168.0.0/24 dev vxlan10  proto kernel  scope link  src 192.168.0.249
      [nsid 1][ROUTE]broadcast 192.168.0.0 dev vxlan10  table local  proto kernel  scope link  src 192.168.0.249
      [nsid 1][ROUTE]fe80::/64 dev vxlan10  proto kernel  metric 256  pref medium
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      netlink: allow to listen "all" netns · 59324cf3
      Nicolas Dichtel committed
      More accurately, listen to all netns that have a nsid assigned into the
      netns where the netlink socket is opened.
      For this purpose, a netlink socket option is added:
      NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this
      socket will receive netlink notifications from all netns that have a nsid
      assigned into the netns where the socket has been opened. The nsid is sent
      to userland via ancillary data.
      
      With this patch, a daemon needs only one socket to listen many netns. This
      is useful when the number of netns is high.
      
      Because 0 is a valid value for a nsid, the field nsid_is_set indicates if
      the field nsid is valid or not. skb->cb is initialized to 0 on skb
      allocation, thus we are sure that we will never send a nsid 0 by error to
      the userland.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
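From userspace, using the option described above might look roughly like the following Python sketch. The numeric values of `SOL_NETLINK`, `NETLINK_LISTEN_ALL_NSID` and `RTMGRP_LINK` are taken from the Linux uapi headers as far as I know them and should be checked against your kernel headers; the helper names are hypothetical. The sketch assumes the nsid arrives as a native `int` in a control message whose type is `NETLINK_LISTEN_ALL_NSID`.

```python
import socket
import struct

SOL_NETLINK = 270            # assumed value from <linux/socket.h>
NETLINK_LISTEN_ALL_NSID = 8  # assumed value from <linux/netlink.h>
RTMGRP_LINK = 1              # rtnetlink link-event multicast group


def open_all_nsid_listener(groups=RTMGRP_LINK):
    """Open one rtnetlink socket that listens in every netns with a nsid."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                         socket.NETLINK_ROUTE)
    sock.setsockopt(SOL_NETLINK, NETLINK_LISTEN_ALL_NSID, 1)
    sock.bind((0, groups))
    return sock


def nsid_from_ancdata(ancdata):
    """Extract the nsid from recvmsg() ancillary data, or None if absent.

    None and 0 are deliberately distinct: 0 is a valid nsid.
    """
    for level, ctype, data in ancdata:
        if level == SOL_NETLINK and ctype == NETLINK_LISTEN_ALL_NSID:
            return struct.unpack("=i", data[:4])[0]
    return None
```

A caller would typically do `data, ancdata, flags, addr = sock.recvmsg(65535, socket.CMSG_SPACE(4))` and pass `ancdata` to `nsid_from_ancdata()`. Returning `None` when no control message is present plays the same role as the nsid_is_set field: it keeps "no nsid" apart from nsid 0.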
    • N
      netlink: rename private flags and states · cc3a572f
      Nicolas Dichtel committed
      These flags and states have the same prefix (NETLINK_) as netlink socket
      options. To avoid confusion and to be able to name a flag like a socket
      option, let's use another prefix: NETLINK_[S|F]_.
      
      Note: a comment has been fixed, it was talking about
      NETLINK_RECV_NO_ENOBUFS socket option instead of NETLINK_NO_ENOBUFS.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      netns: use a spin_lock to protect nsid management · 95f38411
      Nicolas Dichtel committed
      Before this patch, nsids were protected by the rtnl lock. The goal of this
      patch is to be able to find a nsid without needing to hold the rtnl lock.
      
      The next patch will introduce a netlink socket option to listen to all
      netns that have a nsid assigned into the netns where the socket is opened.
      Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to
      avoid a recursive lock (nsids are notified via rtnl). This was the main
      reason for the previous patch.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      netns: notify new nsid outside __peernet2id() · 3138dbf8
      Nicolas Dichtel committed
      There is no functional change with this patch. It will ease the refactoring
      of the locking system that protects nsids and the support of the netlink
      socket option NETLINK_LISTEN_ALL_NSID.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      netns: rename peernet2id() to peernet2id_alloc() · 7a0877d4
      Nicolas Dichtel committed
      In a following commit, a new function will be introduced to only look up
      a nsid (no allocation if the nsid doesn't exist). To avoid confusion, the
      existing function is renamed.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      netns: always provide the id to rtnl_net_fill() · cab3c8ec
      Nicolas Dichtel committed
      The goal of this commit is to prepare the rework of the locking of nsid
      protection.
      After this patch, rtnl_net_notifyid() will no longer call __peernet2id(),
      i.e. no idr_* operation in this function.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      netns: returns always an id in __peernet2id() · 109582af
      Nicolas Dichtel committed
      All callers of this function expect a nsid, not an error.
      Thus, return NETNSA_NSID_NOT_ASSIGNED in case of error so that callers
      don't have to convert the error to NETNSA_NSID_NOT_ASSIGNED.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge tag 'linux-can-next-for-4.2-20150506' of... · 43996fdd
      David S. Miller committed
      Merge tag 'linux-can-next-for-4.2-20150506' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
      
      Marc Kleine-Budde says:
      
      ====================
      pull-request: can-next 2015-05-06
      
      This is a pull request of seven patches for net-next/master.
      
      Andreas Gröger contributes two patches for the janz-ican3 driver. In
      the first patch, the documentation for already existing sysfs entries
      is added, the second patch adds support for another module/firmware
      variant. A patch by Shawn Landden makes the padding in the struct
      can_frame explicit. The next 4 patches target the flexcan driver, the
      first one is by David Jander adding some documentation, the remaining
      three by me add more documentation and two small code cleanups.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • H
      net: macb: Add change_mtu callback with jumbo support · a5898ea0
      Harini Katakam committed
      Add macb_change_mtu callback; if jumbo frame support is present allow
      mtu size changes up to (jumbo max length allowed - headers).
      Signed-off-by: Harini Katakam <harinik@xilinx.com>
      Reviewed-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • H
      net: macb: Add support for jumbo frames · 98b5a0f4
      Harini Katakam committed
      Enable jumbo frame support for Zynq Ultrascale+ MPSoC.
      Update the NWCFG register and descriptor length masks accordingly.
      Jumbo max length register should be set according to support in SoC; it is
      set to 10240 for Zynq Ultrascale+ MPSoC.
      Signed-off-by: Harini Katakam <harinik@xilinx.com>
      Reviewed-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
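The bound "jumbo max length allowed - headers" from the two macb commits above can be worked through numerically. The sketch below assumes that "headers" means the 14-byte Ethernet header plus the 4-byte frame check sequence, and that the lower bound is the conventional 68-byte minimum MTU; both are assumptions of this illustration rather than statements taken from the commit messages, and the function names are hypothetical.

```python
ETH_HLEN = 14     # Ethernet header length in bytes
ETH_FCS_LEN = 4   # frame check sequence length in bytes
MIN_MTU = 68      # assumed lower bound for this sketch


def max_mtu(jumbo_max_len):
    """Largest MTU that fits in a frame of jumbo_max_len bytes."""
    return jumbo_max_len - ETH_HLEN - ETH_FCS_LEN


def mtu_allowed(new_mtu, jumbo_max_len):
    """Return True if new_mtu would be accepted under the assumed bounds."""
    return MIN_MTU <= new_mtu <= max_mtu(jumbo_max_len)
```

Under these assumptions, the 10240-byte jumbo max length quoted for Zynq Ultrascale+ MPSoC gives a maximum MTU of 10222 bytes.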
    • H
      net: macb: Add compatible string for Zynq Ultrascale+ MPSoC · 7b61f9c1
      Harini Katakam committed
      Add compatible string and config structure for Zynq Ultrascale+ MPSoC.
      Signed-off-by: Harini Katakam <harinik@xilinx.com>
      Reviewed-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • H
      devicetree: Add compatible string for Zynq Ultrascale+ MPSoC · 988d6f07
      Harini Katakam committed
      Add "cdns,zynqmp-gem" to be used for Zynq Ultrascale+ MPSoC.
      Signed-off-by: Harini Katakam <harinik@xilinx.com>
      Reviewed-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      tcp: set SOCK_NOSPACE under memory pressure · 790ba456
      Jason Baron committed
      Under tcp memory pressure, calling epoll_wait() in edge triggered
      mode after -EAGAIN, can result in an indefinite hang in epoll_wait(),
      even when there is sufficient memory available to continue making
      progress. The problem is that when __sk_mem_schedule() returns 0
      under memory pressure, we do not set the SOCK_NOSPACE flag in the
      tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since
      SOCK_NOSPACE is used to trigger wakeups when incoming acks create
      sufficient new space in the write queue, all outstanding packets
      are acked, but we never wake up with the EPOLLOUT that we are
      expecting from epoll_wait().
      
      This issue is currently limited to epoll() when used in edge trigger
      mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE.
      This is sufficient for poll()/select() and epoll() in level trigger
      mode. However, in edge trigger mode, epoll() is relying on the write
      path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we
      can only call epoll_wait() after read/write return -EAGAIN. Thus, in
      the case of the socket write, we are relying on the fact that
      tcp_sendmsg()/network write paths are going to issue a wakeup for
      us at some point in the future when we get -EAGAIN.
      
      Normally, epoll() edge trigger works fine when we've exceeded the
      sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when
      we return -EAGAIN from the write path b/c we are over the tcp memory
      limits and not b/c we are over the sndbuf, we are never going to get
      another wakeup.
      
      I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule()
      will return 0, or failure more readily with SO_SNDBUF:
      
      1) create socket and set SO_SNDBUF to N
      2) add socket as edge trigger
      3) write to socket and block in epoll on -EAGAIN
      4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem
      
      The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory()
      when the socket is non-blocking. Note that SOCK_NOSPACE, in addition
      to waking up outstanding waiters is also used to expand the size of
      the sk->sndbuf. However, we will not expand it by setting it in this
      case because tcp_should_expand_sndbuf(), ensures that no expansion
      occurs when we are under tcp memory pressure.
      
      Note that we could still hang if sk->sk_wmem_queued is 0, when we get
      the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we
      are waiting for an event that will never happen. I believe
      that this case is harder to hit (and did not hit in my testing),
      in that over the tcp 'soft' memory limits, we continue to guarantee a
      minimum write buffer size. Perhaps, we could return -ENOSPC in this
      case, or maybe we simply issue a wakeup in this case, such that we
      keep retrying the write. Note that this case is not specific to
      epoll() ET, but rather would affect blocking sockets as well. So I
      view this patch as bringing epoll() edge-trigger into sync with the
      current poll()/select()/epoll() level trigger and blocking sockets
      behavior.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
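The edge-triggered usage pattern that the tcp commit above relies on (write until -EAGAIN, then wait in epoll for EPOLLOUT) can be sketched in Python. This is only an illustration of the calling convention: it uses an AF_UNIX socketpair as a stand-in, so it does not exercise TCP memory pressure or the bug itself, and the function names are hypothetical. `select.epoll` is Linux-only.

```python
import select
import socket


def send_until_eagain(sock, chunk):
    """Queue chunk on a non-blocking socket until EAGAIN; return bytes queued."""
    total = 0
    while True:
        try:
            total += sock.send(chunk)
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK
            return total


def edge_triggered_demo():
    """Fill a socket, drain the peer, then check that an EPOLLOUT edge arrives."""
    writer, reader = socket.socketpair()  # AF_UNIX stand-in, not TCP
    writer.setblocking(False)
    reader.setblocking(False)
    ep = select.epoll()
    try:
        ep.register(writer.fileno(), select.EPOLLOUT | select.EPOLLET)
        ep.poll(0)  # consume the initial "writable" edge
        queued = send_until_eagain(writer, b"x" * 4096)
        drained = 0
        while drained < queued:  # the peer reads everything, freeing space
            try:
                drained += len(reader.recv(65536))
            except BlockingIOError:
                break
        # Wait only after the write path has returned EAGAIN.
        events = ep.poll(1.0)
        got_epollout = any(ev & select.EPOLLOUT for _, ev in events)
        return queued, got_epollout
    finally:
        ep.close()
        writer.close()
        reader.close()
```

The application here depends entirely on the kernel delivering a new EPOLLOUT edge once space frees up. The hang described in the commit is the case where that wakeup never comes because SOCK_NOSPACE was not set on the -EAGAIN path.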
    • C
      gianfar: Enable changing mac addr when if up · 3d23a05c
      Claudiu Manoil committed
      Use device flag IFF_LIVE_ADDR_CHANGE to signal that
      the device supports changing the hardware address when
      the device is running.
      This allows eth_mac_addr() to change the mac address
      also when the network device's interface is open.
      This capability is required by certain applications,
      like bonding mode 6 (Adaptive Load Balancing).
      Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • C
      gianfar: Move TxFIFO underrun handling to reset path · bc602280
      Claudiu Manoil committed
      Handle TxFIFO underrun exceptions outside the fast path.
      A controller reset is more reliable in this exceptional
      case, as opposed to re-enabling the Tx DMA on the fly.
      
      As the controller reset is handled outside the fast path
      by the reset_gfar() workqueue handler, the locking
      scheme on the Tx path is significantly simplified.
      Because the Tx processing (xmit queues and tx napi) is
      disabled during controller reset, tstat access from xmit
      does not require locking.  So the scope of the txlock on
      the processing path is now reduced to num_txbdfree, which
      is shared only between process context (xmit) and softirq
      (clean_tx_ring).  As a result, the txlock need not guard
      against interrupt context, and the spin_lock_irqsave()
      from xmit can be replaced by spin_lock_bh().  Likewise,
      the locking has been downgraded for clean_tx_ring().
      Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'bpf_seccomp' · 39d726b7
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      BPF updates
      
      This set gets rid of BPF special handling in seccomp filter preparation
      and provides generic infrastructure from BPF side, which eventually also
      allows for classic BPF JITs to add support for seccomp filters.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      seccomp, filter: add and use bpf_prog_create_from_user from seccomp · ac67eb2c
      Daniel Borkmann committed
      Seccomp has always been a special candidate when it comes to preparation
      of its filters in seccomp_prepare_filter(). Due to the extra checks and
      filter rewrite it partially duplicates code and has BPF internals exposed.
      
      This patch adds a generic API inside the BPF code that seccomp can use
      and thus keep its filter preparation code minimal and more maintainable.
      The other side-effect is that now classic JITs can add seccomp support as
      well by only providing a BPF_LDX | BPF_W | BPF_ABS translation.
      
      Tested with seccomp and BPF test suites.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Nicolas Schichan <nschichan@freebox.fr>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: filter: add __GFP_NOWARN flag for larger kmem allocs · 658da937
      Daniel Borkmann committed
      When seccomp BPF was added, adding the __GFP_NOWARN flag to its
      configuration path was discussed, as e.g. up to 32K allocations are
      more prone to fail under stress. As we're going to reuse the BPF API,
      add __GFP_NOWARN flags where larger kmalloc() and friends allocations
      could fail.
      
      It doesn't make much sense to pass around __GFP_NOWARN everywhere as
      an extra argument only for seccomp while we just as well could run
      into similar issues for socket filters, where it's not desired to
      have a user application throw a WARN() due to allocation failure.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Nicolas Schichan <nschichan@freebox.fr>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      seccomp: simplify seccomp_prepare_filter and reuse bpf_prepare_filter · d9e12f42
      Nicolas Schichan authored
      Remove the calls to bpf_check_classic(), bpf_convert_filter() and
      bpf_migrate_runtime() and let bpf_prepare_filter() take care of that
      instead.
      
      seccomp_check_filter() is passed to bpf_prepare_filter() so that it
      gets called from there, after bpf_check_classic().
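
      The call order described above can be sketched in plain user-space C.
      This is only a simplified model, not the kernel code: struct insn,
      generic_check(), prepare_filter() and subsystem_check() are hypothetical
      stand-ins for the classic BPF instruction type, bpf_check_classic(),
      bpf_prepare_filter() and seccomp_check_filter(), and the checks
      themselves are placeholders.

```c
#include <stddef.h>

/* Hypothetical stand-in for a classic BPF instruction. */
struct insn {
    unsigned short code;
    unsigned int k;
};

/* Subsystem-specific pass that runs after the generic check. */
typedef int (*post_check_fn)(struct insn *prog, size_t len);

/* Placeholder for the generic verifier: only rejects empty programs. */
static int generic_check(const struct insn *prog, size_t len)
{
    (void)prog;
    return len > 0 ? 0 : -1;
}

/*
 * Generic preparation path. The optional callback is invoked only after
 * the generic check has passed, so the caller no longer has to repeat
 * the check (or the later conversion steps) on its own.
 */
static int prepare_filter(struct insn *prog, size_t len, post_check_fn trans)
{
    int err = generic_check(prog, len);

    if (err)
        return err;
    if (trans) {
        err = trans(prog, len);
        if (err)
            return err;
    }
    /* Conversion to the internal representation would follow here. */
    return 0;
}

/* Placeholder subsystem pass: rejects an opcode it does not allow. */
static int subsystem_check(struct insn *prog, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (prog[i].code == 0xffff)
            return -1;
    return 0;
}
```

      A caller with extra rules passes its own pass, as in
      prepare_filter(prog, len, subsystem_check), while a caller without any
      passes NULL and gets only the generic path.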
      
      We can now remove exposure of two internal classic BPF functions
      previously used by seccomp. The export of the bpf_check_classic()
      symbol, previously known as sk_chk_filter(), has been there since
      pre-git times, and no in-tree module was using it, so remove it.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d9e12f42
    • N
      net: filter: add a callback to allow classic post-verifier transformations · 4ae92bc7
      Nicolas Schichan authored
      This is in preparation for use by the seccomp code. The rationale is
      not to duplicate additional code within the seccomp layer, but instead
      to have it abstracted and hidden within the classic BPF API.
      
      As an interim step, this now also makes bpf_prepare_filter() visible
      (though not as an exported symbol), so that seccomp can reuse that code
      path instead of reimplementing it.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ae92bc7
    • D
      Merge tag 'mac80211-next-for-davem-2015-05-06' of... · 0e00a0f7
      David S. Miller authored
      Merge tag 'mac80211-next-for-davem-2015-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
      
      Johannes Berg says:
      
      ====================
      Lots of updates for net-next for this cycle. As usual, we have
      a lot of small fixes and cleanups, the bigger items are:
       * proper mac80211 rate control locking, to fix some random crashes
         (this required changing other locking as well)
       * mac80211 "fast-xmit", a mechanism to reduce, in most cases, the
         amount of code we execute while going from ndo_start_xmit() to
         the driver
       * this also clears the way for properly supporting S/G and checksum
         and segmentation offloads
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0e00a0f7
    • D
      Merge branch 'tcp-more-reliable-window-probes' · 82ae9c60
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: more reliable window probes
      
      This series addresses a problem caused by small rto_min timers in
      datacenters, leading to either timer storms or early flow terminations.
      
      We also add two new SNMP counters for proper monitoring:
      TCPWinProbe and TCPKeepAlive
      
      v2: added TCPKeepAlive counter, as suggested by Yuchung & Neal
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      82ae9c60
    • E
      tcp: add TCPWinProbe and TCPKeepAlive SNMP counters · e520af48
      Eric Dumazet authored
      Diagnosing problems related to Window Probes has been hard because
      we lack a counter.
      
      TCPWinProbe counts the number of ACK packets a sender has to send
      at regular intervals to make sure that a reverse ACK packet reopening
      the window has not been lost.
      
      TCPKeepAlive counts the number of ACK packets sent to keep TCP
      flows alive (SO_KEEPALIVE)
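
      One way to read the two counters from user space, sketched in C. This
      assumes they show up among the TcpExt fields of /proc/net/netstat,
      where a line of field names is followed by a line of values in the same
      order. netstat_counter() is a hypothetical helper, not an existing API.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Walk a line of field names and a line of values in lockstep and return
 * the value paired with `name`, or -1 if the name does not appear.
 */
static long long netstat_counter(const char *names, const char *values,
                                 const char *name)
{
    char key[64], val[64];
    int kn, vn;

    while (sscanf(names, "%63s%n", key, &kn) == 1 &&
           sscanf(values, "%63s%n", val, &vn) == 1) {
        if (strcmp(key, name) == 0)
            return strtoll(val, NULL, 10);
        names += kn;   /* advance past the token just read */
        values += vn;
    }
    return -1;
}
```

      A caller would read the two consecutive lines that start with "TcpExt:"
      (for example with fgets()) and pass them in, then look up "TCPWinProbe"
      and "TCPKeepAlive" by name.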
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e520af48
    • E
      tcp: adjust window probe timers to safer values · 21c8fe99
      Eric Dumazet authored
      With the advent of small rto timers in datacenter TCP
      (ip route ... rto_min x), the following can happen:
      
      1) Qdisc is full, transmit fails.
      
         TCP sets a timer based on icsk_rto to retry the transmit, without
         exponential backoff.
         With a low icsk_rto and a lot of sockets, all CPUs are servicing timer
         interrupts like crazy.
         The intent of the code was to retry with a timer between 200ms
         (TCP_RTO_MIN) and 500ms (TCP_RESOURCE_PROBE_INTERVAL).
      
      2) Receivers can send zero windows if they don't drain their receive queue.
      
         TCP sends zero window probes, based on icsk_rto current value, with
         exponential backoff.
         With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in
         some cases), the sender can abort in less than one or two minutes!
         If the receiver stops the sender, it obviously doesn't care about a
         very tight rto. The probability of dropping the ACK that reopens the
         window is not worth the risk.
      
      Let's change the base timer to be at least 200ms (TCP_RTO_MIN) for these
      events (but not normal RTO based retransmits)
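
      The effect of the clamp can be illustrated with a small user-space
      model in C. It is a sketch under simplifying assumptions, not the
      kernel implementation: the helper names are hypothetical, times are in
      milliseconds, each probe interval is assumed to double, and intervals
      are capped at a caller-supplied maximum.

```c
#include <stdint.h>

#define RTO_MIN_MS 200u /* TCP_RTO_MIN from the text above, in ms */

/* Base interval for probes: never below RTO_MIN_MS. */
static uint64_t probe_base_ms(uint64_t rto_ms)
{
    return rto_ms < RTO_MIN_MS ? RTO_MIN_MS : rto_ms;
}

/* Interval before the next probe: base doubled `backoff` times, capped. */
static uint64_t probe_when_ms(uint64_t base_ms, unsigned backoff,
                              uint64_t max_ms)
{
    uint64_t when = base_ms << backoff;

    return when < max_ms ? when : max_ms;
}

/* Total time covered by `probes` probes with backoff 0..probes-1. */
static uint64_t total_probe_time_ms(uint64_t base_ms, unsigned probes,
                                    uint64_t max_ms)
{
    uint64_t total = 0;

    for (unsigned i = 0; i < probes; i++)
        total += probe_when_ms(base_ms, i, max_ms);
    return total;
}
```

      In this model, an unclamped base of 1ms with 15 probes and a 120s cap
      covers 1 + 2 + ... + 16384 = 32767ms, roughly 33 seconds. With the base
      raised to 200ms by probe_base_ms(), the same 15 probes cover 804600ms,
      more than 13 minutes, which gives the receiver far longer to reopen its
      window.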
      
      A followup patch adds a new SNMP counter, as it would have helped a lot
      in diagnosing this issue.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21c8fe99
    • R
      tipc: send explicit not supported error in nl compat · b063bc5e
      Richard Alpe authored
      The legacy netlink API treated EPERM (permission denied) as
      "operation not supported".
      Reported-by: Tomi Ollila <tomi.ollila@iki.fi>
      Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b063bc5e
    • R
      tipc: add broadcast link window set/get to nl api · 670f4f88
      Richard Alpe authored
      Add the ability to get or set the broadcast link window through the
      new netlink API. The functionality was unintentionally missing from
      the new netlink API. Adding this means that we also fix the breakage
      in the old API when coming through the compat layer.
      
      Fixes: 37e2d484 (tipc: convert legacy nl link prop set to nl compat)
      Reported-by: Tomi Ollila <tomi.ollila@iki.fi>
      Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      670f4f88