1. 30 Nov 2016, 1 commit
    • tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING · 1c885808
      Francis Yan authored
      This patch exports the sender chronograph stats via the socket
      SO_TIMESTAMPING channel. Currently we can instrument how long a
      particular application unit of data was queued in TCP by tracking
      SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Exporting
      these sender chronograph stats alongside those timestamps allows the
      various sender limitations to be broken down further. For example, a
      video server can tell whether a particular chunk of video on a
      connection took a long time to deliver because TCP was constrained by
      a small receive window. Before this patch, that was impossible to
      tell without packet traces.
      
      To obtain these stats, the user needs to set the
      SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
      while requesting other SOF_TIMESTAMPING TX timestamps. When the
      timestamps are available in the error queue, the stats are returned
      in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
      as a list of TLVs (struct nlattr) of types TCP_NLA_BUSY_TIME,
      TCP_NLA_RWND_LIMITED and TCP_NLA_SNDBUF_LIMITED. The unit is
      microseconds.
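      As an illustration, a minimal userspace sketch of that flow might look
      as follows. This is a hedged example, not code from the patch: error
      handling is trimmed, the SCM_TIMESTAMPING_OPT_STATS fallback value is
      an assumption, and since the exact TLV enum spellings in the merged
      uapi headers may differ from the names quoted above, only the numeric
      nla_type is printed.

	#include <stdio.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>
	#include <linux/types.h>
	#include <linux/net_tstamp.h>	/* SOF_TIMESTAMPING_* */
	#include <linux/netlink.h>	/* struct nlattr, NLA_HDRLEN, NLA_ALIGN() */

	#ifndef SCM_TIMESTAMPING_OPT_STATS
	#define SCM_TIMESTAMPING_OPT_STATS 54	/* assumed, per asm-generic/socket.h */
	#endif

	static void enable_tx_stats(int fd)
	{
		int val = SOF_TIMESTAMPING_TX_SOFTWARE |
			  SOF_TIMESTAMPING_SOFTWARE |
			  SOF_TIMESTAMPING_OPT_TSONLY |
			  SOF_TIMESTAMPING_OPT_STATS;

		setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
	}

	static void drain_errqueue(int fd)
	{
		char ctrl[512], data[64];
		struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
		struct msghdr msg = {
			.msg_iov = &iov, .msg_iovlen = 1,
			.msg_control = ctrl, .msg_controllen = sizeof(ctrl),
		};
		struct cmsghdr *cm;

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
			return;

		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			struct nlattr *a;
			int rem;

			if (cm->cmsg_level != SOL_SOCKET ||
			    cm->cmsg_type != SCM_TIMESTAMPING_OPT_STATS)
				continue;

			/* Walk the TLV list; each value is a duration in usec. */
			a = (struct nlattr *)CMSG_DATA(cm);
			rem = cm->cmsg_len - CMSG_LEN(0);
			while (rem >= (int)sizeof(*a) &&
			       a->nla_len >= NLA_HDRLEN + sizeof(__u64)) {
				__u64 usec;

				memcpy(&usec, (char *)a + NLA_HDRLEN, sizeof(usec));
				printf("stat type %u: %llu usec\n",
				       a->nla_type, (unsigned long long)usec);
				rem -= NLA_ALIGN(a->nla_len);
				a = (struct nlattr *)((char *)a + NLA_ALIGN(a->nla_len));
			}
		}
	}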
      Signed-off-by: Francis Yan <francisyyan@gmail.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 29 Nov 2016, 4 commits
  3. 10 Nov 2016, 2 commits
  4. 08 Nov 2016, 1 commit
  5. 17 Oct 2016, 6 commits
  6. 14 Oct 2016, 1 commit
  7. 10 Oct 2016, 1 commit
  8. 28 Sep 2016, 1 commit
  9. 20 Sep 2016, 1 commit
  10. 19 Sep 2016, 1 commit
    • ipvlan: Introduce l3s mode · 4fbae7d8
      Mahesh Bandewar authored
      In a typical IPvlan L3 setup, the master is in the default-ns and
      each slave is in a different (slave) ns. In this setup, egress
      packet processing for traffic originating from a slave-ns hits all
      NF_HOOKs in the slave-ns as well as in the default-ns. However, the
      same is not true for ingress processing: there, the NF_HOOKs are
      hit only in the slave-ns and are skipped in the default-ns. IPvlan
      in L3 mode is therefore restrictive: if admins want to deploy
      iptables rules in the default-ns, this asymmetric data path makes
      it impossible to do so.
      
      This patch makes use of the l3mdev l3_rcv() hook (added as part of
      the l3mdev enhancements) to perform the input route lookup on RX
      packets without changing skb->dev, and then uses an nf_hook at
      NF_INET_LOCAL_IN to change skb->dev just before handing the skb
      over to L4.
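      As a rough illustration of that second step, an NF_INET_LOCAL_IN hook
      re-pointing skb->dev could be sketched as below. This is a hedged
      sketch, not the patch's code; ipvlan_slave_dev_for() is a
      hypothetical lookup helper.

	#include <linux/kernel.h>
	#include <linux/netdevice.h>
	#include <linux/netfilter.h>
	#include <linux/netfilter_ipv4.h>
	#include <linux/skbuff.h>

	/* Hypothetical helper: map an skb received via the master back to
	 * the slave netdev that owns its destination address. */
	struct net_device *ipvlan_slave_dev_for(struct sk_buff *skb);

	static unsigned int ipvlan_l3s_input(void *priv, struct sk_buff *skb,
					     const struct nf_hook_state *state)
	{
		struct net_device *slave = ipvlan_slave_dev_for(skb);

		/* Every default-ns ingress NF_HOOK has run by now; switch
		 * skb->dev to the slave just before L4 sees the packet. */
		if (slave)
			skb->dev = slave;
		return NF_ACCEPT;
	}

	/* Registered (e.g. when the first l3s device appears) so it runs
	 * late at LOCAL_IN, after the admin's iptables rules. */
	static const struct nf_hook_ops ipvlan_l3s_ops = {
		.hook     = ipvlan_l3s_input,
		.pf       = NFPROTO_IPV4,
		.hooknum  = NF_INET_LOCAL_IN,
		.priority = INT_MAX,
	};

      With iproute2 the new mode is selected like the existing ones, e.g.
      "ip link add link eth0 ipvl0 type ipvlan mode l3s".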
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 02 Sep 2016, 1 commit
    • rxrpc: Don't expose skbs to in-kernel users [ver #2] · d001648e
      David Howells authored
      Don't expose skbs to in-kernel users, such as the AFS filesystem.
      Instead, provide one notification hook that indicates that a call
      needs attention and another that indicates that there's a new call
      to be collected.
      
      This makes the following possibilities more achievable:
      
       (1) Call refcounting can be made simpler if skbs don't hold refs to calls.
      
       (2) skbs referring to non-data events will be able to be freed much sooner
           rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
           will be able to consult the call state.
      
       (3) We can shortcut the receive phase when a call is remotely aborted
           because we don't have to go through all the packets to get to the one
           cancelling the operation.
      
       (4) It makes it easier to do encryption/decryption directly between AFS's
           buffers and sk_buffs.
      
 (5) Encryption/decryption can more easily be done in AFS's thread
     contexts - usually that of the userspace process that issued a syscall
     - rather than in one of rxrpc's background threads on a workqueue.
      
       (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.
      
      To make this work, the following interface function has been added:
      
           int rxrpc_kernel_recv_data(
      		struct socket *sock, struct rxrpc_call *call,
      		void *buffer, size_t bufsize, size_t *_offset,
      		bool want_more, u32 *_abort_code);
      
      This is the recvmsg equivalent.  It allows the caller to find out about the
      state of a specific call and to transfer received data into a buffer
      piecemeal.
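
      As a hedged usage sketch (the surrounding function and its error
      handling are hypothetical, not part of the patch), a kernel-side
      caller might pull a whole reply in like this:

	#include <linux/printk.h>
	#include <net/af_rxrpc.h>

	static int example_read_reply(struct socket *sock,
				      struct rxrpc_call *call,
				      void *buf, size_t len)
	{
		size_t offset = 0;
		u32 abort_code = 0;
		int ret;

		/* want_more = false: this buffer should take the rest of
		 * the data; anything beyond it would be an error. */
		ret = rxrpc_kernel_recv_data(sock, call, buf, len, &offset,
					     false, &abort_code);
		if (ret < 0)
			pr_warn("recv failed (%d), abort code %u\n",
				ret, abort_code);
		return ret;
	}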
      
      afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
      logic between them.  They don't wait synchronously yet because the socket
      lock needs to be dealt with.
      
      Five interface functions have been removed:
      
	rxrpc_kernel_is_data_last()
	rxrpc_kernel_get_abort_code()
	rxrpc_kernel_get_error_number()
	rxrpc_kernel_free_skb()
	rxrpc_kernel_data_consumed()
      
      As a temporary hack, sk_buffs going to an in-kernel call are queued on the
      rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
      in-kernel user.  To process the queue internally, a temporary function,
      temp_deliver_data(), has been added.  This will be replaced with
      common code shared between the rxrpc_recvmsg() path and the
      rxrpc_kernel_recv_data() path in a future patch.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 01 Sep 2016, 1 commit
  13. 30 Aug 2016, 2 commits
    • rxrpc: Pass struct socket * to more rxrpc kernel interface functions · 4de48af6
      David Howells authored
      Pass struct socket * to more rxrpc kernel interface functions.  They
      should start from this rather than from the socket pointer in the
      rxrpc_call struct when they need to access the socket.
      
      I have left:
      
      	rxrpc_kernel_is_data_last()
      	rxrpc_kernel_get_abort_code()
      	rxrpc_kernel_get_error_number()
      	rxrpc_kernel_free_skb()
      	rxrpc_kernel_data_consumed()
      
      unmodified as they're all about to be removed (and, in any case, don't
      touch the socket).
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Provide a way for AFS to ask for the peer address of a call · 8324f0bc
      David Howells authored
      Provide a function so that kernel users, such as AFS, can ask for the peer
      address of a call:
      
         void rxrpc_kernel_get_peer(struct rxrpc_call *call,
                                    struct sockaddr_rxrpc *_srx);
      
      In the future the kernel service won't get sk_buffs to look inside.
      Further, this allows us to hide any canonicalisation inside AF_RXRPC for
      when IPv6 support is added.
      
      Also propagate this through to afs_find_server() and issue a warning if we
      can't handle the address family yet.
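
      As a hedged sketch (not the patch's exact code), afs_find_server()
      could use the call roughly like this:

	struct sockaddr_rxrpc srx;

	rxrpc_kernel_get_peer(call, &srx);
	switch (srx.transport.family) {
	case AF_INET:
		/* match the server list by IPv4 address */
		break;
	default:
		pr_warn("AFS does not yet support this address family\n");
		break;
	}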
      Signed-off-by: David Howells <dhowells@redhat.com>
  14. 29 Aug 2016, 1 commit
  15. 27 Aug 2016, 1 commit
    • bridge: switchdev: Add forward mark support for stacked devices · 6bc506b4
      Ido Schimmel authored
      switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
      port netdevs so that packets being flooded by the device won't be
      flooded twice.
      
      It works by assigning a unique identifier (the ifindex of the first
      bridge port) to bridge ports sharing the same parent ID. This prevents
      packets from being flooded twice by the same switch, but will flood
      packets through bridge ports belonging to a different switch.
      
      This method is problematic when stacked devices are taken into account,
      such as VLANs. In such cases, a physical port netdev can have upper
      devices being members in two different bridges, thus requiring two
      different 'offload_fwd_mark's to be configured on the port netdev, which
      is impossible.
      
      The main problem is that packet and netdev marking is performed at the
      physical netdev level, whereas flooding occurs between bridge ports,
      which are not necessarily port netdevs.
      
      Instead, packet and netdev marking should really be done in the bridge
      driver with the switch driver only telling it which packets it already
      forwarded. The bridge driver will mark such packets using the mark
      assigned to the ingress bridge port and will prevent the packet from
      being forwarded through any bridge port sharing the same mark (i.e.
      having the same parent ID).
      
      Remove the current switchdev 'offload_fwd_mark' implementation and
      instead implement the proposed method. In addition, make rocker - the
      sole user of the mark - use the proposed method.
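
      Conceptually, the egress check the bridge driver now performs can be
      sketched as below; this is a hedged approximation of the idea, not
      necessarily the patch's exact code (struct net_bridge_port is the
      bridge's private port type).

	#include "br_private.h"		/* inside the bridge driver */

	/* Allow egress through this bridge port only if the packet was not
	 * already forwarded by the same switch, i.e. the mark stamped on
	 * the skb at ingress differs from this port's mark. */
	static bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
						 const struct sk_buff *skb)
	{
		return !skb->offload_fwd_mark ||
		       skb->offload_fwd_mark != p->offload_fwd_mark;
	}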
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 25 Aug 2016, 1 commit
    • net: dsa: rename switch operations structure · 9d490b4e
      Vivien Didelot authored
      Now that the dsa_switch_driver structure contains only function pointers
      as it is supposed to, rename it to the more appropriate dsa_switch_ops,
      uniformly to any other operations structure in the kernel.
      
      No functional changes here, basically just the result of something like:
      s/dsa_switch_driver *drv/dsa_switch_ops *ops/g
      
      However, keep the {un,}register_switch_driver functions and their
      dsa_switch_drivers list as-is, since they represent the legacy DSA
      registration framework, which is likely to be deprecated soon.
      
      In the meantime, also fix the following checks from checkpatch.pl to
      make it happy with this patch:
      
          CHECK: Comparison to NULL could be written "!ops"
          #403: FILE: net/dsa/dsa.c:470:
          +	if (ops == NULL) {
      
          CHECK: Comparison to NULL could be written "ds->ops->get_strings"
          #773: FILE: net/dsa/slave.c:697:
          +		if (ds->ops->get_strings != NULL)
      
          CHECK: Comparison to NULL could be written "ds->ops->get_ethtool_stats"
          #824: FILE: net/dsa/slave.c:785:
          +	if (ds->ops->get_ethtool_stats != NULL)
      
          CHECK: Comparison to NULL could be written "ds->ops->get_sset_count"
          #835: FILE: net/dsa/slave.c:798:
          +		if (ds->ops->get_sset_count != NULL)
      
          total: 0 errors, 0 warnings, 4 checks, 784 lines checked
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 24 Aug 2016, 1 commit
  18. 18 Aug 2016, 1 commit
  19. 13 Aug 2016, 1 commit
  20. 09 Aug 2016, 1 commit
  21. 06 Aug 2016, 1 commit
    • rxrpc: Fix races between skb free, ACK generation and replying · 372ee163
      David Howells authored
      Inside the kafs filesystem it is occasionally possible for a call to
      be processed and terminated before we've had a chance to check
      whether we need to clean up the rx queue for that call.  This is
      because afs_send_simple_reply() ends the call when it is done, but it
      runs in a workqueue item that might happen to run to completion
      before afs_deliver_to_call() completes.
      
      Further, it is possible for rxrpc_kernel_send_data() to be called to send a
      reply before the last request-phase data skb is released.  The rxrpc skb
      destructor is where the ACK processing is done and the call state is
      advanced upon release of the last skb.  ACK generation is also deferred to
      a work item because it's possible that the skb destructor is not called in
      a context where kernel_sendmsg() can be invoked.
      
      To this end, the following changes are made:
      
 (1) rxrpc_kernel_data_consumed() is added.  This should be called whenever
     an skb is emptied so as to crank the ACK and call states.  This does
     not release the skb, however.  rxrpc_kernel_free_skb() must now be
     called to achieve that.  These together replace
     rxrpc_kernel_data_delivered().

 (2) rxrpc_kernel_data_consumed() is wrapped by afs_data_consumed().
      
     This makes afs_deliver_to_call() easier to work with, as the skb can
     simply be discarded unconditionally here without trying to work out
     what the return value of the ->deliver() function means.
      
           The ->deliver() functions can, via afs_data_complete(),
           afs_transfer_reply() and afs_extract_data() mark that an skb has been
           consumed (thereby cranking the state) without the need to
           conditionally free the skb to make sure the state is correct on an
           incoming call for when the call processor tries to send the reply.
      
 (3) rxrpc_recvmsg() now has to call rxrpc_kernel_data_consumed() when it
           has finished with a packet and MSG_PEEK isn't set.
      
       (4) rxrpc_packet_destructor() no longer calls rxrpc_hard_ACK_data().
      
           Because of this, we no longer need to clear the destructor and put the
           call before we free the skb in cases where we don't want the ACK/call
           state to be cranked.
      
       (5) The ->deliver() call-type callbacks are made to return -EAGAIN rather
           than 0 if they expect more data (afs_extract_data() returns -EAGAIN to
           the delivery function already), and the caller is now responsible for
           producing an abort if that was the last packet.
      
       (6) There are many bits of unmarshalling code where:
      
       		ret = afs_extract_data(call, skb, last, ...);
      		switch (ret) {
      		case 0:		break;
      		case -EAGAIN:	return 0;
      		default:	return ret;
      		}
      
           is to be found.  As -EAGAIN can now be passed back to the caller, we
           now just return if ret < 0:
      
       		ret = afs_extract_data(call, skb, last, ...);
      		if (ret < 0)
      			return ret;
      
 (7) Checks for trailing data and for empty final data packets have been
     consolidated as afs_data_complete().  So:
      
      		if (skb->len > 0)
      			return -EBADMSG;
      		if (!last)
      			return 0;
      
           becomes:
      
      		ret = afs_data_complete(call, skb, last);
      		if (ret < 0)
      			return ret;
      
       (8) afs_transfer_reply() now checks the amount of data it has against the
           amount of data desired and the amount of data in the skb and returns
           an error to induce an abort if we don't get exactly what we want.
      
      Without these changes, the following oops can occasionally be observed,
      particularly if some printks are inserted into the delivery path:
      
      general protection fault: 0000 [#1] SMP
      Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
      CPU: 0 PID: 1305 Comm: kworker/u8:3 Tainted: G            E   4.7.0-fsdevel+ #1303
      Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
      Workqueue: kafsd afs_async_workfn [kafs]
      task: ffff88040be041c0 ti: ffff88040c070000 task.ti: ffff88040c070000
      RIP: 0010:[<ffffffff8108fd3c>]  [<ffffffff8108fd3c>] __lock_acquire+0xcf/0x15a1
      RSP: 0018:ffff88040c073bc0  EFLAGS: 00010002
      RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: ffff88040d29a710
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88040d29a710
      RBP: ffff88040c073c70 R08: 0000000000000001 R09: 0000000000000001
      R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff88040be041c0 R15: ffffffff814c928f
      FS:  0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa4595f4750 CR3: 0000000001c14000 CR4: 00000000001406f0
      Stack:
       0000000000000006 000000000be04930 0000000000000000 ffff880400000000
       ffff880400000000 ffffffff8108f847 ffff88040be041c0 ffffffff81050446
       ffff8803fc08a920 ffff8803fc08a958 ffff88040be041c0 ffff88040c073c38
      Call Trace:
       [<ffffffff8108f847>] ? mark_held_locks+0x5e/0x74
       [<ffffffff81050446>] ? __local_bh_enable_ip+0x9b/0xa1
       [<ffffffff8108f9ca>] ? trace_hardirqs_on_caller+0x16d/0x189
       [<ffffffff810915f4>] lock_acquire+0x122/0x1b6
       [<ffffffff810915f4>] ? lock_acquire+0x122/0x1b6
       [<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
       [<ffffffff81609dbf>] _raw_spin_lock_irqsave+0x35/0x49
       [<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
       [<ffffffff814c928f>] skb_dequeue+0x18/0x61
       [<ffffffffa009aa92>] afs_deliver_to_call+0x344/0x39d [kafs]
       [<ffffffffa009ab37>] afs_process_async_call+0x4c/0xd5 [kafs]
       [<ffffffffa0099e9c>] afs_async_workfn+0xe/0x10 [kafs]
       [<ffffffff81063a3a>] process_one_work+0x29d/0x57c
       [<ffffffff81064ac2>] worker_thread+0x24a/0x385
       [<ffffffff81064878>] ? rescuer_thread+0x2d0/0x2d0
       [<ffffffff810696f5>] kthread+0xf3/0xfb
       [<ffffffff8160a6ff>] ret_from_fork+0x1f/0x40
       [<ffffffff81069602>] ? kthread_create_on_node+0x1cf/0x1cf
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 16 Jul 2016, 2 commits
  23. 14 Jul 2016, 2 commits
  24. 28 Jun 2016, 1 commit
    • drivers: net: stmmac: reworking the PCS code. · 70523e63
      Giuseppe CAVALLARO authored
      The 3.xx and 4.xx Synopsys GMACs have a very similar embedded PCS
      module and share almost the same registers, for example:
        AN_Control, AN_Status, AN_Advertisement, AN_Link_Partner_Ability,
        AN_Expansion, TBI_Extended_Status.

      Only the RGMII/SMII Control/Status register differs.

      This patch therefore reorganizes and enhances the PCS support. It
      removes the existing support from dwmac1000/dwmac4_core.c, moving the
      basic PCS functions into a new file called stmmac_pcs.h.

      The patch also reworks the available APIs so that they can be better
      shared among different hardware and easily enhanced to support new
      features.
      Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 24 Jun 2016, 1 commit
    • netfilter: conntrack: allow increasing bucket size via sysctl too · 3183ab89
      Florian Westphal authored
      No need to restrict this to the module parameter.

      We export a copy of the real hash size: when the user alters the
      value, we allocate the new table, copy the entries, etc., before
      updating the real size to the requested one.
      
      This is also needed because the real size is used by concurrent readers
      and cannot be changed without synchronizing the conntrack generation
      seqcnt.
      
      We only allow changing this value from the initial net namespace.
      
      Tested using http-client-benchmark vs. httpterm while concurrently running:
      
      while true;do
       echo $RANDOM > /proc/sys/net/netfilter/nf_conntrack_buckets
      done
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  26. 17 Jun 2016, 1 commit
  27. 08 Jun 2016, 1 commit
    • net: sched: do not acquire qdisc spinlock in qdisc/class stats dump · edb09eb1
      Eric Dumazet authored
      Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google's BwE
      host agent [1] are problematic at scale:

      For each qdisc/class found in the dump, we currently lock the root
      qdisc spinlock in order to get stats. Sampling stats every 5 seconds
      from thousands of HTB classes is a challenge when the root qdisc
      spinlock is under high pressure. Not only do the dumps take time,
      they also slow down the fast path (packet queue/dequeue) by 10% to
      20% in some cases.

      An audit of existing qdiscs showed that sch_fq_codel is the only
      qdisc that might need the qdisc lock in fq_codel_dump_stats() and
      fq_codel_dump_class_stats().
      
      In v2 of this patch, I now use the Qdisc running seqcount to provide
      consistent reads of packets/bytes counters, regardless of 32/64 bit arches.
      
      I also changed the rate estimators to use the same infrastructure so
      that they no longer need to take the root qdisc lock.
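
      The lock-free read side of such a scheme looks roughly like the
      sketch below; it shows the generic seqcount retry pattern rather than
      the exact helper the patch touches, and the function name is
      hypothetical.

	#include <linux/seqlock.h>
	#include <linux/gen_stats.h>	/* struct gnet_stats_basic_packed */

	static void copy_basic_stats(const seqcount_t *running,
				     const struct gnet_stats_basic_packed *b,
				     __u64 *bytes, __u32 *packets)
	{
		unsigned int seq;

		/* Retry until no writer updated the counters while we were
		 * reading them; no qdisc spinlock is taken, and the result
		 * is consistent on both 32-bit and 64-bit arches. */
		do {
			seq = read_seqcount_begin(running);
			*bytes   = b->bytes;
			*packets = b->packets;
		} while (read_seqcount_retry(running, seq));
	}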
      
      [1] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Kevin Athey <kda@google.com>
      Cc: Xiaotian Pei <xiaotian@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 30 May 2016, 1 commit