1. 06 Sep, 2014 2 commits
    • A
      net: Add function for parsing the header length out of linear ethernet frames · 56193d1b
      Alexander Duyck committed
      This patch updates some of the flow_dissector api so that it can be used to
      parse the length of ethernet buffers stored in fragments.  Most of the
      changes needed were to __skb_get_poff as it needed to be updated to support
      sending a linear buffer instead of a skb.
      
      I have split __skb_get_poff into two functions, the first is skb_get_poff
      and it retains the functionality of the original __skb_get_poff.  The other
      function is __skb_get_poff which now works much like __skb_flow_dissect in
      relation to skb_flow_dissect in that it provides the same functionality but
      works with just a data buffer and hlen instead of needing an skb.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      56193d1b
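To illustrate what "parsing the header length out of a linear buffer" means, here is a minimal user-space sketch. It is not the kernel's __skb_get_poff; the name header_len_linear is hypothetical, and the sketch assumes an untagged Ethernet II frame carrying IPv4 with TCP or UDP. Anything else returns 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: offset of the payload (end of the headers) for an
 * Ethernet II / IPv4 / TCP-or-UDP frame stored in a linear buffer of hlen
 * bytes. Returns 0 if the buffer is too short or the protocol is not
 * handled by this sketch. */
size_t header_len_linear(const uint8_t *data, size_t hlen)
{
    size_t off = 14;                          /* Ethernet II header */

    assert(data != NULL);
    if (hlen < off + 20)
        return 0;
    if (data[12] != 0x08 || data[13] != 0x00) /* EtherType must be IPv4 */
        return 0;
    if ((data[off] >> 4) != 4)                /* IP version field */
        return 0;

    size_t ihl = (size_t)(data[off] & 0x0f) * 4;  /* IPv4 header length */
    if (ihl < 20 || hlen < off + ihl)
        return 0;

    uint8_t proto = data[off + 9];
    off += ihl;

    if (proto == 17) {                        /* UDP: fixed 8-byte header */
        return hlen < off + 8 ? 0 : off + 8;
    }
    if (proto == 6) {                         /* TCP: data offset in words */
        if (hlen < off + 20)
            return 0;
        size_t doff = (size_t)(data[off + 12] >> 4) * 4;
        if (doff < 20 || hlen < off + doff)
            return 0;
        return off + doff;
    }
    return 0;
}
```

A driver-like caller would pass the first bytes of a received fragment and copy only the returned number of bytes into the linear area.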
    • A
      net-timestamp: Make the clone operation stand-alone from phy timestamping · 62bccb8c
      Alexander Duyck committed
      The phy timestamping takes a different path than the regular timestamping
      does in that it will create a clone first so that the packets needing to be
      timestamped can be placed in a queue, or the context block could be used.
      
      In order to support these use cases I am pulling the core of the code out
      so it can be used in other drivers beyond just phy devices.
      
      In addition I have added a destructor named sock_efree which is meant to
      provide a simple way for dropping the reference to skb exceptions that
      aren't part of either the receive or send windows for the socket, and I
      have removed some duplication in spots where this destructor could be used
      in place of sock_edemux.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      62bccb8c
  2. 02 Sep, 2014 2 commits
    • T
      net: Infrastructure for checksum unnecessary conversions · d96535a1
      Tom Herbert committed
      For normal path, added skb_checksum_try_convert which is called
      to attempt to convert CHECKSUM_UNNECESSARY to CHECKSUM_COMPLETE. The
      primary condition to allow this is that ip_summed is CHECKSUM_NONE
      and csum_valid is true, which will be the state after consuming
      a CHECKSUM_UNNECESSARY.
      
      For GRO path, added skb_gro_checksum_try_convert which is the GRO
      analogue of skb_checksum_try_convert. The primary condition to allow
      this is that NAPI_GRO_CB(skb)->csum_cnt == 0 and
      NAPI_GRO_CB(skb)->csum_valid is set. This implies that we have consumed
      all available CHECKSUM_UNNECESSARY checksums in the GRO path.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d96535a1
    • T
      net: Support for csum_bad in skbuff · 5a212329
      Tom Herbert committed
      This flag indicates that an invalid checksum was detected in the
      packet. __skb_mark_checksum_bad helper function was added to set this.
      
      Checksums can be marked bad from a driver or the GRO path (the latter
      is implemented in this patch). csum_bad is checked in
      __skb_checksum_validate_complete (i.e. calling that when ip_summed ==
      CHECKSUM_NONE).
      
      csum_bad works in conjunction with ip_summed value. In the case that
      ip_summed is CHECKSUM_NONE and csum_bad is set, this implies that the
      first (or next) checksum encountered in the packet is bad. When
      ip_summed is CHECKSUM_UNNECESSARY, the first checksum after the last
      one validated is bad. For example, if ip_summed == CHECKSUM_UNNECESSARY,
      csum_level == 1, and csum_bad is set-- then the third checksum in the
      packet is bad. In the normal path, the packet will be dropped when
      processing the protocol layer of the bad checksum:
      __skb_decr_checksum_unnecessary called twice for the good checksums
      changing ip_summed to CHECKSUM_NONE so that
      __skb_checksum_validate_complete is called to validate the third
      checksum and that will fail since csum_bad is set.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5a212329
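The interaction between ip_summed, csum_level and csum_bad described in the commit above can be modeled in a few lines of user-space C. This is a sketch under assumptions: the struct and function names are hypothetical, only two ip_summed states are modeled, and a software-verified checksum is assumed to pass unless csum_bad is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum summed_model { SUM_NONE, SUM_UNNECESSARY };

/* Hypothetical stand-in for the checksum state kept in an skb. */
struct csum_state {
    enum summed_model ip_summed;
    unsigned csum_level;   /* extra checksums already validated by the device */
    bool csum_bad;         /* the next unvalidated checksum is known to be bad */
};

/* Consume one device-validated checksum: lower the level, or fall back to
 * SUM_NONE once the last validated checksum has been used up. */
void consume_unnecessary(struct csum_state *s)
{
    assert(s != NULL);
    if (s->csum_level == 0)
        s->ip_summed = SUM_NONE;
    else
        s->csum_level--;
}

/* Walk `layers` checksummed protocol layers and return the 1-based index of
 * the layer at which the packet would be dropped, or 0 if none. */
unsigned first_failing_layer(struct csum_state s, unsigned layers)
{
    for (unsigned i = 1; i <= layers; i++) {
        if (s.ip_summed == SUM_UNNECESSARY) {
            consume_unnecessary(&s);   /* good checksum, nothing to verify */
            continue;
        }
        if (s.csum_bad)                /* validation fails at this layer */
            return i;
    }
    return 0;
}
```

With ip_summed set to SUM_UNNECESSARY, csum_level == 1 and csum_bad set, the first two layers consume the validated checksums and the third layer fails, matching the example in the commit message.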
  3. 30 Aug, 2014 2 commits
  4. 28 Aug, 2014 1 commit
  5. 26 Aug, 2014 1 commit
    • D
      net: Remove ndo_xmit_flush netdev operation, use signalling instead. · 0b725a2c
      David S. Miller committed
      As reported by Jesper Dangaard Brouer, for high packet rates the
      overhead of having another indirect call in the TX path is
      non-trivial.
      
      There is the indirect call itself, and then there is all of the
      reloading of the state to refetch the tail pointer value and
      then write the device register.
      
      Move to a more passive scheme, which requires very light modifications
      to the device drivers.
      
      The signal is a new skb->xmit_more value, if it is non-zero it means
      that more SKBs are pending to be transmitted on the same queue as the
      current SKB.  And therefore, the driver may elide the tail pointer
      update.
      
      Right now skb->xmit_more is always zero.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b725a2c
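How a driver might use the xmit_more signal can be sketched with a toy ring model. Everything here is hypothetical (tx_ring_model, model_xmit); the real signal is the skb->xmit_more field named in the commit, and a real driver would write a device register where this model just counts doorbells.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a transmit ring: the software tail advances on every packet,
 * the hardware tail only when the driver "rings the doorbell". */
struct tx_ring_model {
    unsigned sw_tail;
    unsigned hw_tail;
    unsigned doorbells;   /* number of simulated device register writes */
};

/* Queue one packet. When xmit_more is true, more packets for the same queue
 * are pending, so the tail pointer update is elided. */
void model_xmit(struct tx_ring_model *ring, bool xmit_more)
{
    assert(ring != NULL);
    ring->sw_tail++;
    if (!xmit_more) {
        ring->hw_tail = ring->sw_tail;
        ring->doorbells++;
    }
}
```

Queuing four packets with xmit_more set on the first three produces a single doorbell write covering all four descriptors.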
  6. 24 Aug, 2014 1 commit
    • D
      net: Allow raw buffers to be passed into the flow dissector. · 690e36e7
      David S. Miller committed
      Drivers, and perhaps other entities we have not yet considered,
      sometimes want to know how deep the protocol headers go before
      deciding how large of an SKB to allocate and how much of the packet to
      place into the linear SKB area.
      
      For example, consider a driver which has a device which DMAs into
      pools of pages and then tells the driver where the data went in the
      DMA descriptor(s).  The driver can then build an SKB and reference
      most of the data via SKB fragments (which are page/offset/length
      triplets).
      
      However at least some of the front of the packet should be placed into
      the linear SKB area, which comes before the fragments, so that packet
      processing can get at the headers efficiently.  The first thing each
      protocol layer is going to do is a "pskb_may_pull()" so we might as
      well aggregate as much of this as possible while we're building the
      SKB in the driver.
      
      Part of supporting this is that we don't have an SKB yet, so we want
      to be able to let the flow dissector operate on a raw buffer in order
      to compute the offset of the end of the headers.
      
      So now we have a __skb_flow_dissect() which takes an explicit data
      pointer and length.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      690e36e7
  7. 12 Aug, 2014 1 commit
    • V
      net: Always untag vlan-tagged traffic on input. · 0d5501c1
      Vlad Yasevich committed
      Currently the functionality to untag traffic on input resides
      as part of the vlan module and is built only when VLAN support
      is enabled in the kernel.  When VLAN is disabled, the function
      vlan_untag() turns into a stub and doesn't really untag the
      packets.  This seems to create an interesting interaction
      between VMs supporting checksum offloading and some network drivers.
      
      There are some drivers that do not allow the user to change
      tx-vlan-offload feature of the driver.  These drivers also seem
      to assume that any VLAN-tagged traffic they transmit will
      have the vlan information in the vlan_tci and not in the vlan
      header already in the skb.  When transmitting skbs that already
      have tagged data with partial checksum set, the checksum doesn't
      appear to be updated correctly by the card thus resulting in a
      failure to establish TCP connections.
      
      The following is a packet trace taken on the receiver where a
      sender is a VM with a VLAN configured.  The host the VM is running on
      does not have VLAN support and the outgoing interface on the
      host is tg3:
      10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
      (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243,
      offset 0, flags [DF], proto TCP (6), length 60)
          10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
      -> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
      4294837885 ecr 0,nop,wscale 7], length 0
      10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
      (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244,
      offset 0, flags [DF], proto TCP (6), length 60)
          10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
      -> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
      4294838888 ecr 0,nop,wscale 7], length 0
      
      This connection finally times out.
      
      I've only access to the TG3 hardware in this configuration thus have
      only tested this with TG3 driver.  There are a lot of other drivers
      that do not permit user changes to vlan acceleration features, and
      I don't know if they all suffer from a similar issue.
      
      The patch attempts to fix this another way.  It moves the vlan header
      stripping code out of the vlan module and always builds it into the
      kernel network core.  This way, even if vlan is not supported on
      a virtualization host, the virtual machines running on top of such
      host will still work with VLANs enabled.
      
      CC: Patrick McHardy <kaber@trash.net>
      CC: Nithin Nayak Sujir <nsujir@broadcom.com>
      CC: Michael Chan <mchan@broadcom.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d5501c1
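What "untagging" does to a frame can be shown on a raw buffer. The sketch below is a user-space illustration, not the kernel's untag routine: untag_frame is a hypothetical name, only the 0x8100 TPID is recognized, and the TCI is returned to the caller in the way the commit describes the tag moving out of the packet data into vlan_tci.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN  14u
#define VLAN_TAG_LEN 4u
#define TPID_8021Q   0x8100u

/* Strip an 802.1Q tag from an Ethernet frame in place. On success the tag
 * control information is stored in *tci, *len shrinks by 4 and 1 is
 * returned. Untagged or too-short frames are left alone and 0 is returned. */
int untag_frame(uint8_t *frame, size_t *len, uint16_t *tci)
{
    assert(frame != NULL && len != NULL && tci != NULL);

    if (*len < ETH_HDR_LEN + VLAN_TAG_LEN)
        return 0;
    if ((((unsigned)frame[12] << 8) | frame[13]) != TPID_8021Q)
        return 0;

    *tci = (uint16_t)(((unsigned)frame[14] << 8) | frame[15]);

    /* Slide the encapsulated EtherType and payload over the 4-byte tag. */
    memmove(frame + 12, frame + 16, *len - 16);
    *len -= VLAN_TAG_LEN;
    return 1;
}
```

For the trace above, the VLAN ID would be recovered as tci & 0x0fff (100) and the frame would continue up the stack with EtherType IPv4.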
  8. 06 Aug, 2014 4 commits
    • W
      net-timestamp: ACK timestamp for bytestreams · e1c8a607
      Willem de Bruijn committed
      Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
      in the send() call is acknowledged. It implements the feature for TCP.
      
      The timestamp is generated when the TCP socket cumulative ACK is moved
      beyond the tracked seqno for the first time. The feature ignores SACK
      and FACK, because those acknowledge the specific byte, but not
      necessarily the entire contents of the buffer up to that byte.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e1c8a607
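The rule "timestamp when the cumulative ACK first moves beyond the tracked seqno" can be sketched as follows. The names tx_ack_track and on_cumulative_ack are hypothetical, and the model assumes the tracked value is the sequence number of the last byte of the send() call; SACK information is deliberately not consulted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-send tracking state. */
struct tx_ack_track {
    uint32_t last_byte_seq;  /* sequence number of the last byte sent */
    bool reported;           /* timestamp already generated */
};

/* Returns true exactly once: the first time the cumulative ACK covers the
 * tracked byte. The signed difference handles 32-bit sequence wraparound
 * (the conversion to int32_t assumes two's complement, as on common
 * platforms). */
bool on_cumulative_ack(struct tx_ack_track *t, uint32_t ack_seq)
{
    assert(t != NULL);
    if (t->reported)
        return false;
    if ((int32_t)(t->last_byte_seq - ack_seq) >= 0)
        return false;        /* tracked byte not yet acknowledged */
    t->reported = true;
    return true;
}
```

An ACK number equal to the tracked sequence number does not trigger the timestamp, because a cumulative ACK names the next byte expected rather than the last byte received.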
    • W
      net-timestamp: SCHED timestamp on entering packet scheduler · e7fd2885
      Willem de Bruijn committed
      Kernel transmit latency is often incurred in the packet scheduler.
      Introduce a new timestamp on transmission just before entering the
      scheduler. When data travels through multiple devices (bonding,
      tunneling, ...) each device will export an individual timestamp.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e7fd2885
    • W
      net-timestamp: add key to disambiguate concurrent datagrams · 09c2d251
      Willem de Bruijn committed
      Datagrams timestamped on transmission can coexist in the kernel stack
      and be reordered in packet scheduling. When reading looped datagrams
      from the socket error queue it is not always possible to uniquely
      correlate looped data with original send() call (for application
      level retransmits). Even if possible, it may be expensive and complex,
      requiring packet inspection.
      
      Introduce a data-independent ID mechanism to associate timestamps with
      send calls. Pass an ID alongside the timestamp in field ee_data of
      sock_extended_err.
      
      The ID is a simple 32 bit unsigned int that is associated with the
      socket and incremented on each send() call for which software tx
      timestamp generation is enabled.
      
      The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
      avoid changing ee_data for existing applications that expect it to be 0.
      The counter is reset each time the flag is reenabled. Reenabling
      does not change the ID of already submitted data. It is possible
      to receive out of order IDs if the timestamp stream is not quiesced
      first.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      09c2d251
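The ID semantics described above (per-socket counter, incremented per send, reset when the flag is re-enabled, ee_data left at 0 when the flag is off) can be modeled in user space. The struct and function names below are hypothetical and stand in for the socket state and the ee_data field of sock_extended_err.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the per-socket state behind SOF_TIMESTAMPING_OPT_ID. */
struct sock_id_model {
    bool opt_id;      /* is the option currently enabled? */
    uint32_t tskey;   /* next ID to hand out */
};

/* Enable or disable the option. The counter restarts from 0 each time the
 * option goes from disabled to enabled. */
void set_opt_id(struct sock_id_model *sk, bool enable)
{
    assert(sk != NULL);
    if (enable && !sk->opt_id)
        sk->tskey = 0;
    sk->opt_id = enable;
}

/* Value that would accompany the timestamp of one send() call: the next ID
 * when the option is on, 0 otherwise (the pre-existing behavior). */
uint32_t send_ee_data(struct sock_id_model *sk)
{
    assert(sk != NULL);
    if (!sk->opt_id)
        return 0;
    return sk->tskey++;
}
```

An application would record the ID it expects for each send() and match it against the value read back from the error queue, instead of inspecting the looped packet data.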
    • W
      net-timestamp: extend SCM_TIMESTAMPING ancillary data struct · f24b9be5
      Willem de Bruijn committed
      Applications that request kernel tx timestamps with SO_TIMESTAMPING
      read timestamps as recvmsg() ancillary data. The response is defined
      implicitly as timespec[3].
      
      1) define struct scm_timestamping explicitly and
      
      2) add support for new tstamp types. On tx, scm_timestamping always
         accompanies a sock_extended_err. Define previously unused field
         ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to
         define the existing behavior.
      
      The reception path is not modified. On rx, no struct similar to
      sock_extended_err is passed along with SCM_TIMESTAMPING.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f24b9be5
  9. 30 Jul, 2014 1 commit
    • W
      net: remove deprecated syststamp timestamp · 4d276eb6
      Willem de Bruijn committed
      The SO_TIMESTAMPING API defines three types of timestamps: software,
      hardware in raw format (hwtstamp) and hardware converted to system
      format (syststamp). The last has been deprecated in favor of combining
      hwtstamp with a PTP clock driver. There are no active users in the
      kernel.
      
      The option was device driver dependent. If set, but without hardware
      support, the correct behavior is to return zero in the relevant field
      in the SCM_TIMESTAMPING ancillary message. Without device drivers
      implementing the option, this field is effectively always zero.
      
      Remove the internal plumbing to dissuade new drivers from implementing
      the feature. Keep the SOF_TIMESTAMPING_SYS_HARDWARE flag, however, to
      avoid breaking existing applications that request the timestamp.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4d276eb6
  10. 23 Jul, 2014 1 commit
  11. 16 Jul, 2014 1 commit
    • W
      net-timestamp: document deprecated syststamp · 26c4fdb0
      Willem de Bruijn committed
      The SO_TIMESTAMPING API defines option SOF_TIMESTAMPING_SYS_HW.
      This feature is deprecated. It should not be implemented by new
      device drivers. Existing drivers do not implement it, either --
      with one exception.
      
      Driver developers are encouraged to expose the NIC hw clock as a
      PTP HW clock source, instead, and synchronize system time to the
      HW source.
      
      The control flag cannot be removed due to being part of the ABI, nor
      can the structure scm_timestamping that is returned. Due to the one
      legacy driver, the internal datapath and structure are not removed.
      
      This patch only clearly marks the interface as deprecated. Device
      drivers should always return a syststamp value of zero.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      
      ----
      
      We can consider adding a WARN_ON_ONCE in __sock_recv_timestamp
      if non-zero syststamp is encountered
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      26c4fdb0
  12. 08 Jul, 2014 2 commits
  13. 15 Jun, 2014 2 commits
  14. 12 Jun, 2014 3 commits
    • T
      net: Save software checksum complete · 7e3cead5
      Tom Herbert committed
      In skb_checksum_complete, if we need to compute the checksum for the
      packet (via skb_checksum), save the result as CHECKSUM_COMPLETE.
      Subsequent checksum verification can use this.
      
      Also, added csum_complete_sw flag to distinguish between software and
      hardware generated checksum complete; we should always be able to trust
      the software computation.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e3cead5
    • T
      net: Preserve CHECKSUM_COMPLETE at validation · 5d0c2b95
      Tom Herbert committed
      Currently when the first checksum in a packet is validated using
      CHECKSUM_COMPLETE, ip_summed is overwritten to be CHECKSUM_UNNECESSARY
      so that any subsequent checksums in the packet are not correctly
      validated.
      
      This patch adds csum_valid flag in sk_buff and uses that to indicate
      validated checksum instead of setting CHECKSUM_UNNECESSARY. The bit
      is set accordingly in the skb_checksum_validate_* functions. The flag
      is checked in skb_checksum_complete, so that validation is communicated
      between checksum_init and checksum_complete sequence in TCP and UDP.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5d0c2b95
    • O
      net: add __pskb_copy_fclone and pskb_copy_for_clone · bad93e9d
      Octavian Purdila committed
      There are several instances where a pskb_copy or __pskb_copy is
      immediately followed by an skb_clone.
      
      Add a couple of new functions to allow the copy skb to be allocated
      from the fclone cache and thus speed up subsequent skb_clone calls.
      
      Cc: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Marek Lindner <mareklindner@neomailbox.ch>
      Cc: Simon Wunderlich <sw@simonwunderlich.de>
      Cc: Antonio Quartulli <antonio@meshcoding.com>
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Cc: Gustavo Padovan <gustavo@padovan.org>
      Cc: Johan Hedberg <johan.hedberg@gmail.com>
      Cc: Arvid Brodin <arvid.brodin@alten.se>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Cc: Lauro Ramos Venancio <lauro.venancio@openbossa.org>
      Cc: Aloisio Almeida Jr <aloisio.almeida@openbossa.org>
      Cc: Samuel Ortiz <sameo@linux.intel.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Allan Stephens <allan.stephens@windriver.com>
      Cc: Andrew Hendry <andrew.hendry@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bad93e9d
  15. 05 Jun, 2014 3 commits
    • T
      gre: Call gso_make_checksum · 4749c09c
      Tom Herbert committed
      Call gso_make_checksum. This should have the benefit of using a
      checksum that may have been previously computed for the packet.
      
      This also adds NETIF_F_GSO_GRE_CSUM to differentiate devices that
      offload GRE GSO with and without the GRE checksum offloaded.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4749c09c
    • T
      net: Add GSO support for UDP tunnels with checksum · 0f4f4ffa
      Tom Herbert committed
      Added a new netif feature for GSO_UDP_TUNNEL_CSUM. This indicates
      that a device is capable of computing the UDP checksum in the
      encapsulating header of a UDP tunnel.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0f4f4ffa
    • T
      net: Support for multiple checksums with gso · 7e2b10c1
      Tom Herbert committed
      When creating a GSO packet segment we may need to set more than
      one checksum in the packet (for instance a TCP checksum and
      UDP checksum for VXLAN encapsulation). To be efficient, we want
      to do checksum calculation for any part of the packet at most once.
      
      This patch adds csum_start offset to skb_gso_cb. This tracks the
      starting offset for skb->csum which is initially set in skb_segment.
      When a protocol needs to compute a transport checksum it calls
      gso_make_checksum which computes the checksum value from the start
      of transport header to csum_start and then adds in skb->csum to get
      the full checksum. skb->csum and csum_start are then updated to reflect
      the checksum of the resultant packet starting from the transport header.
      
      This patch also adds a flag to skbuff, encap_hdr_csum, which is set
      in *gso_segment functions to indicate that a tunnel protocol needs
      checksum calculation.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e2b10c1
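The reuse of a cached partial checksum rests on the additivity of the Internet ones' complement sum. The sketch below demonstrates that property in user space; it is not the kernel's gso_make_checksum. The helper names are hypothetical, the cached value is kept as an unfolded 32-bit sum, and the span between the transport header and csum_start is assumed to have even length so that 16-bit word boundaries line up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add the 16-bit big-endian words of p[0..len) to a running 32-bit sum. */
uint32_t csum_partial_model(const uint8_t *p, size_t len, uint32_t sum)
{
    assert(p != NULL);
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;   /* pad the odd trailing byte */
    return sum;
}

/* Fold the carries back in and complement, giving the 16-bit checksum. */
uint16_t csum_fold_model(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Checksum from the transport header to the end of the packet, computed by
 * summing only [th_off, csum_start) and adding the cached sum that already
 * covers everything from csum_start onward. */
uint16_t make_checksum_model(const uint8_t *pkt, size_t th_off,
                             size_t csum_start, uint32_t cached)
{
    assert(csum_start >= th_off && (csum_start - th_off) % 2 == 0);
    return csum_fold_model(
        csum_partial_model(pkt + th_off, csum_start - th_off, cached));
}
```

Because only the newly exposed header bytes are summed at each encapsulation level, every byte of the packet is touched at most once, which is the efficiency goal stated in the commit message.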
  16. 13 May, 2014 1 commit
  17. 06 May, 2014 1 commit
    • T
      net: Generalize checksum_init functions · 76ba0aae
      Tom Herbert committed
      Create a general __skb_checksum_validate function (actually a
      macro) to subsume the various checksum_init functions. This
      function can either init the checksum, or do the full validation
      (logically checksum_init+skb_checksum_complete); a flag specifies
      if full validation is performed. Also, there is a flag to the function
      to indicate that zero checksums are allowed (to support optional
      UDP checksums).
      
      Added several stub functions for calling __skb_checksum_validate.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76ba0aae
  18. 02 Apr, 2014 2 commits
    • E
      net: Add a test to see if a skb is freeable in irq context · 574f7194
      Eric W. Biederman committed
      Currently netpoll and skb_release_head_state assume that a skb is
      freeable in hard irq context except when skb->destructor is set.
      
      The reality is far from this.  So add a function skb_irq_freeable to
      compute the full test and in the process be the living documentation
      of what the requirements are of actually freeing a skb in hard irq
      context.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      574f7194
    • D
      net: ptp: move PTP classifier in its own file · 408eccce
      Daniel Borkmann committed
      This commit fixes a build error reported by Fengguang, that is
      triggered when CONFIG_NETWORK_PHY_TIMESTAMPING is not set:
      
        ERROR: "ptp_classify_raw" [drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.ko] undefined!
      
      The fix is to introduce its own file for the PTP BPF classifier,
      so that PTP_1588_CLOCK and/or NETWORK_PHY_TIMESTAMPING can select
      it independently from each other. IXP4xx driver on ARM needs to
      select it as well since it does not seem to select PTP_1588_CLOCK
      or similar that would pull it in automatically.
      
      This also allows for hiding all of the internals of the BPF PTP
      program inside that file, and only exporting relevant API bits
      to drivers.
      
      This patch also adds a kdoc documentation of ptp_classify_raw()
      API to make it clear that it can return PTP_CLASS_* defines. Also,
      the BPF program has been translated into bpf_asm code, so that it
      can be more easily read and altered (extensively documented in [1]).
      
      In the kernel tree under tools/net/ we have bpf_asm and bpf_dbg
      tools, so the commented program can simply be translated via
      `./bpf_asm -c prog` where prog is a file that contains the
      commented code. This makes it easily readable/verifiable and when
      there's a need to change something, jump offsets etc do not need
      to be replaced manually which can be very error prone. Instead,
      a newly translated version via bpf_asm can simply replace the old
      code. I have checked opcode diffs before/after and it's the very
      same filter.
      
        [1] Documentation/networking/filter.txt
      
      Fixes: 164d8c66 ("net: ptp: do not reimplement PTP/BPF classifier")
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Jiri Benc <jbenc@redhat.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      408eccce
  19. 28 Mar, 2014 1 commit
  20. 27 Mar, 2014 1 commit
  21. 27 Feb, 2014 1 commit
    • E
      net: add skb_mstamp infrastructure · 363ec392
      Eric Dumazet committed
      ktime_get() is too expensive in some cases, and we'd like to get
      usec resolution timestamps in TCP stack.
      
      This patch adds a light weight facility using a combination of
      local_clock() and jiffies samples.
      
      Instead of :
      
              ktime_t t0, t1;
      
              t0 = ktime_get();
              // stuff
              t1 = ktime_get();
              delta_us = ktime_us_delta(t1, t0);
      
      use :
              struct skb_mstamp t0, t1;
      
              skb_mstamp_get(&t0);
              // stuff
              skb_mstamp_get(&t1);
              delta_us = skb_mstamp_us_delta(&t1, &t0);
      
      Note : local_clock() might have a (bounded) drift between cpus.
      
      Do not use this infra in place of ktime_get() without understanding the
      issues.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Larry Brakmo <brakmo@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      363ec392
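One way a stamp built from a microsecond clock sample plus a jiffies sample could yield a delta is sketched below. This is an illustration under assumptions, not the code added by the commit: the struct layout, the HZ value of 1000 and the fallback rule (use the jiffies difference when the microsecond difference looks implausible) are all assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HZ_ASSUMED 1000u   /* assumed tick rate, for illustration only */

/* Hypothetical two-part stamp: a 32-bit microsecond sample and a tick sample. */
struct mstamp_model {
    uint32_t stamp_us;
    uint32_t stamp_jiffies;
};

/* Microseconds elapsed from t0 to t1. If the microsecond difference is not
 * positive (interval too long for 32 bits, or clock drift between CPUs),
 * or the tick difference is very large, fall back to the coarser tick-based
 * estimate. */
uint32_t mstamp_us_delta_model(const struct mstamp_model *t1,
                               const struct mstamp_model *t0)
{
    assert(t1 != NULL && t0 != NULL);

    int32_t delta_us = (int32_t)(t1->stamp_us - t0->stamp_us);
    uint32_t delta_jiffies = t1->stamp_jiffies - t0->stamp_jiffies;
    uint32_t us_per_jiffy = 1000000u / HZ_ASSUMED;

    if (delta_us <= 0 || delta_jiffies >= (uint32_t)INT32_MAX / us_per_jiffy)
        return delta_jiffies * us_per_jiffy;   /* coarse estimate */
    return (uint32_t)delta_us;
}
```

The tick-based fallback is what makes a bounded drift in the fine-grained clock tolerable, at the cost of precision in exactly those cases.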
  22. 19 Feb, 2014 1 commit
  23. 17 Feb, 2014 1 commit
  24. 14 Feb, 2014 1 commit
    • F
      net: ip, ipv6: handle gso skbs in forwarding path · fe6cc55f
      Florian Westphal committed
      Marcelo Ricardo Leitner reported problems when the forwarding link path
      has a lower mtu than the incoming one if the inbound interface supports GRO.
      
      Given:
      Host <mtu1500> R1 <mtu1200> R2
      
      Host sends tcp stream which is routed via R1 and R2.  R1 performs GRO.
      
      In this case, the kernel will fail to send ICMP fragmentation needed
      messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
      checks in forward path. Instead, Linux tries to send out packets exceeding
      the mtu.
      
      When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
      not fragment the packets when forwarding, and again tries to send out
      packets exceeding R1-R2 link mtu.
      
      This alters the forwarding dstmtu checks to take the individual gso
      segment lengths into account.
      
      For ipv6, we send out pkt too big error for gso if the individual
      segments are too big.
      
      For ipv4, we either send icmp fragmentation needed, or, if the DF bit
      is not set, perform software segmentation and let the output path
      create fragments when the packet is leaving the machine.
      It is not 100% correct as the error message will contain the headers of
      the GRO skb instead of the original/segmented one, but it seems to
      work fine in my (limited) tests.
      
      Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
      software segmentation.
      
      However it turns out that skb_segment() assumes skb nr_frags is related
      to mss size so we would BUG there.  I don't want to mess with it considering
      Herbert and Eric disagree on what the correct behavior should be.
      
      Hannes Frederic Sowa notes that when we would shrink gso_size
      skb_segment would then also need to deal with the case where
      SKB_MAX_FRAGS would be exceeded.
      
      This uses software segmentation in the forward path when we hit ipv4
      non-DF packets and the outgoing link mtu is too small.  It's not perfect,
      but given the lack of bug reports wrt. GRO fwd being broken this is a
      rare case anyway.  Also it's not like this could not be improved later
      once the dust settles.
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fe6cc55f
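The forwarding decision described in this commit can be summarized as a small decision function. The sketch is a user-space model with hypothetical names (fwd_pkt, fwd_decide); it assumes the per-segment length is the network plus transport header length plus gso_size, and it reduces the outcomes to the four cases the commit message mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fwd_action {
    FWD_OK,                 /* forward as is (output path may fragment) */
    FWD_ICMP_FRAG_NEEDED,   /* ipv4 with DF set */
    FWD_ICMPV6_PKT_TOOBIG,  /* ipv6 */
    FWD_SW_SEGMENT          /* ipv4 without DF: segment in software */
};

/* Hypothetical summary of the fields the MTU check looks at. */
struct fwd_pkt {
    bool is_gso;
    bool ipv6;
    bool df;            /* ipv4 Don't Fragment bit */
    unsigned len;       /* network-layer length of the whole packet */
    unsigned hdr_len;   /* network + transport header bytes */
    unsigned gso_size;  /* payload bytes per segment */
};

enum fwd_action fwd_decide(const struct fwd_pkt *p, unsigned mtu)
{
    assert(p != NULL);

    /* For GSO packets, judge the individual segments, not the aggregate. */
    unsigned seglen = p->is_gso ? p->hdr_len + p->gso_size : p->len;

    if (seglen <= mtu)
        return FWD_OK;
    if (p->ipv6)
        return FWD_ICMPV6_PKT_TOOBIG;
    if (p->df)
        return FWD_ICMP_FRAG_NEEDED;
    /* Non-DF ipv4: GSO packets are segmented first; others are simply
     * fragmented by the output path. */
    return p->is_gso ? FWD_SW_SEGMENT : FWD_OK;
}
```

In the Host <mtu1500> R1 <mtu1200> R2 scenario, a GRO-merged packet with 1460-byte segments would trigger one of the three non-OK outcomes on R1 instead of being sent out oversized.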
  25. 12 Feb, 2014 1 commit
  26. 27 Jan, 2014 1 commit
  27. 17 Jan, 2014 1 commit