1. 07 Dec 2010 (1 commit)
    • af_packet: use vmalloc_to_page() instead for the address returned by vmalloc() · 0af55bb5
      Authored by Changli Gao
      The commit quoted below means that pgv->buffer may now point to memory
      returned by vmalloc(), and virt_to_page() cannot be used on a vmalloc
      address.
      
      This patch introduces a new inline function pgv_to_page(), which calls
      vmalloc_to_page() for the vmalloc address, and virt_to_page() for the
      __get_free_pages address.
      
      We used to advance a page pointer to get to the next page, assuming the
      next page sits at the next page address; after Neil's patch this is wrong,
      as the physical addresses may not be contiguous. This patch fixes that as
      well.
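      A minimal sketch of the helper described above (it mirrors the idea of the
      patch; is_vmalloc_addr(), vmalloc_to_page() and virt_to_page() are existing
      kernel primitives, and the include lines are illustrative):

          #include <linux/mm.h>       /* is_vmalloc_addr(), virt_to_page() */
          #include <linux/vmalloc.h>  /* vmalloc_to_page() */

          /* Map a pgv buffer address to its struct page, whichever allocator
           * produced it. */
          static inline struct page *pgv_to_page(void *addr)
          {
                  if (is_vmalloc_addr(addr))
                          return vmalloc_to_page(addr);
                  return virt_to_page(addr);
          }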
      
          commit 0e3125c7
          Author: Neil Horman <nhorman@tuxdriver.com>
          Date:   Tue Nov 16 10:26:47 2010 -0800
      
          packet: Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v4)
      Signed-off-by: Changli Gao <xiaosuo@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 22 Nov 2010 (1 commit)
  3. 20 Nov 2010 (1 commit)
    • filter: optimize sk_run_filter · 93aaae2e
      Authored by Eric Dumazet
      Remove the pc variable to avoid recomputing fentry at each filter
      instruction; jumps now manipulate the fentry pointer directly.
      
      As the last instruction of filter[] is guaranteed to be a RETURN, and all
      jumps land before the last instruction, we don't need to check the filter
      bounds (number of instructions in the filter array) at each iteration, so
      the bound is removed from sk_run_filter()'s parameters.
      
      On x86_32, remove the f_k variable introduced in commit 57fe93b3
      (filter: make sure filters dont read uninitialized memory).
      
      Note: we could use a CONFIG_ARCH_HAS_{FEW|MANY}_REGISTERS option in order
      to avoid too many ifdefs in this code.
      
      This helps the compiler use CPU registers to hold fentry and the A
      accumulator.
      
      On x86_32 this saves 401 bytes and, more importantly, sk_run_filter()
      runs much faster because of reduced register pressure (one less
      conditional branch per BPF instruction).
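      A hedged illustration of the pointer-based loop (not the kernel's exact
      code; the real opcode handling and socket-buffer loads are elided):

          #include <linux/filter.h>   /* struct sock_filter, BPF_* opcode macros */

          /* Walk the filter with a pointer instead of an index: no per-
           * instruction &filter[pc] arithmetic and no bounds check, because a
           * verified filter always ends in a RET instruction. */
          static unsigned int run_filter_sketch(const struct sock_filter *filter)
          {
                  const struct sock_filter *fentry = filter;

                  for (;; fentry++) {
                          switch (fentry->code) {
                          case BPF_JMP | BPF_JA:
                                  fentry += fentry->k;    /* jumps adjust the pointer */
                                  continue;
                          case BPF_RET | BPF_K:
                                  return fentry->k;       /* verified filters end in RET */
                          /* ... ALU/LD/conditional-jump opcodes elided ... */
                          default:
                                  return 0;               /* reject anything unhandled */
                          }
                  }
          }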
      
      # size net/core/filter.o net/core/filter_pre.o
         text    data     bss     dec     hex filename
         2948       0       0    2948     b84 net/core/filter.o
         3349       0       0    3349     d15 net/core/filter_pre.o
      
      on x86_64 :
      # size net/core/filter.o net/core/filter_pre.o
         text    data     bss     dec     hex filename
         5173       0       0    5173    1435 net/core/filter.o
         5224       0       0    5224    1468 net/core/filter_pre.o
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Changli Gao <xiaosuo@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 17 Nov 2010 (1 commit)
    • packet: Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v4) · 0e3125c7
      Authored by Neil Horman
      
      Version 4 of this patch.
      
      Change notes:
      1) Removed extra memset.  Didn't think kcalloc added a GFP_ZERO the way kzalloc did :)
      
      Summary:
      It was shown to me recently that systems under high load were driven very
      deep into swap when tcpdump was run.  The reason this happened was that
      the AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the
      user-space application to specify how many entries an AF_PACKET socket
      will have and how large each entry will be.  It seems the default setting
      for tcpdump is to set the ring buffer to 32 entries of 64 KB each, which
      implies 32 order-5 allocations.  That's difficult under good
      circumstances, and horrid under memory pressure.
      
      I thought it would be good to make that a bit more usable.  I was going to
      do a simple conversion of the ring buffer from contiguous pages to iovecs,
      but unfortunately the metadata which AF_PACKET places in these buffers can
      easily span a page boundary, and given that these buffers get mapped into
      user space, and the data layout doesn't easily allow for a change to the
      padding between frames to avoid that, a simple iovec change would just
      break user-space ABI consistency.
      
      So I've done this: I've added a three-tiered mechanism to the af_packet
      set_ring socket option.  It attempts to allocate memory in the following
      order (a sketch follows the list):
      
      1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
      digging into swap
      
      2) Using vmalloc
      
      3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
      needed to get the memory
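      A hedged sketch of this fallback chain (the helper name and exact flag
      combination are illustrative, not the literal kernel function):

          #include <linux/gfp.h>      /* __get_free_pages(), GFP_* flags */
          #include <linux/vmalloc.h>  /* vmalloc() */

          static char *alloc_ring_block_sketch(unsigned int order)
          {
                  char *buf;

                  /* 1) Fail fast: no retries and no warning if contiguous
                   *    pages aren't readily available. */
                  buf = (char *)__get_free_pages(GFP_KERNEL | __GFP_NORETRY |
                                                 __GFP_NOWARN, order);
                  if (buf)
                          return buf;

                  /* 2) Virtually contiguous memory is good enough. */
                  buf = vmalloc((1UL << order) * PAGE_SIZE);
                  if (buf)
                          return buf;

                  /* 3) Last resort: let the page allocator try as hard as it
                   *    needs to, possibly digging into reclaim/swap. */
                  return (char *)__get_free_pages(GFP_KERNEL, order);
          }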
      
      The effect is that we don't disturb the system as much when we're under load,
      while still being able to conduct tcpdumps effectively.
      
      Tested successfully by me.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Maciej Żenczykowski <zenczykowski@gmail.com>
      Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 13 Nov 2010 (1 commit)
  6. 11 Nov 2010 (1 commit)
  7. 19 Aug 2010 (1 commit)
  8. 02 Jun 2010 (1 commit)
    • packet_mmap: expose hw packet timestamps to network packet capture utilities · 614f60fa
      Authored by Scott McMillan
      This patch adds a setting, PACKET_TIMESTAMP, to specify the packet
      timestamp source that is exported to capture utilities like tcpdump by
      packet_mmap.
      
      PACKET_TIMESTAMP accepts the same integer bit field as
      SO_TIMESTAMPING.  However, only the SOF_TIMESTAMPING_SYS_HARDWARE and
      SOF_TIMESTAMPING_RAW_HARDWARE values are currently recognized by
      PACKET_TIMESTAMP.  SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over
      SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set.
      
      If PACKET_TIMESTAMP is not set, a software timestamp generated inside
      the networking stack is used (the behavior before this setting was
      added).
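      A hedged user-space sketch of selecting hardware timestamps with the new
      option (constants come from <linux/if_packet.h> and <linux/net_tstamp.h>;
      error handling is left to the caller):

          #include <sys/socket.h>
          #include <linux/if_packet.h>   /* PACKET_TIMESTAMP */
          #include <linux/net_tstamp.h>  /* SOF_TIMESTAMPING_* */

          /* Ask packet_mmap to export hardware timestamps for captured frames. */
          static int enable_hw_timestamps(int fd)
          {
                  int req = SOF_TIMESTAMPING_SYS_HARDWARE |
                            SOF_TIMESTAMPING_RAW_HARDWARE;

                  return setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP,
                                    &req, sizeof(req));
          }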
      Signed-off-by: Scott McMillan <scott.a.mcmillan@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 17 Apr 2010 (1 commit)
  10. 13 Apr 2010 (1 commit)
  11. 04 Apr 2010 (2 commits)
  12. 30 Mar 2010 (1 commit)
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h · 5a0e3ad6
      Authored by Tejun Heo
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
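      For example (an illustrative file, not taken from the patch), a source
      file that calls kmalloc() should name its dependencies itself rather than
      rely on the implicit chain:

          /* before: worked only because percpu.h dragged in slab.h and gfp.h */
          #include <linux/percpu.h>

          /* after: include what is actually used */
          #include <linux/slab.h>   /* kmalloc(), kfree() */
          #include <linux/gfp.h>    /* GFP_KERNEL */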
      
      The percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include those
      headers directly instead of assuming their availability.  As this
      conversion needs to touch a large number of source files, the following
      script was used as the basis of the conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following:
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there, i.e. if only gfp is used,
        gfp.h; if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to place the new include so that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered:
        alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints an
        error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition, and for others adding it to an
         implementation .h or embedding .c file was more appropriate.  This
         step added inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them, as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored, as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on the arch to make
         things build (like ipr on powerpc/64, which failed due to a missing
         writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers, which should be easily discoverable on most builds of the
      specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  13. 03 Mar 2010 (1 commit)
  14. 26 Feb 2010 (1 commit)
  15. 25 Feb 2010 (1 commit)
    • net: Add checking to rcu_dereference() primitives · a898def2
      Authored by Paul E. McKenney
      Update rcu_dereference() primitives to use new lockdep-based
      checking. The rcu_dereference() in __in6_dev_get() may be
      protected either by rcu_read_lock() or RTNL, per Eric Dumazet.
      The rcu_dereference() in __sk_free() is protected by the fact
      that it is never reached if an update could change it.  Check
      for this by using rcu_dereference_check() to verify that the
      struct sock's ->sk_wmem_alloc counter is zero.
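      A hedged sketch of that pattern as it applies in __sk_free() (field and
      helper names follow the kernel of that era; later kernels changed
      sk_wmem_alloc to a refcount_t):

          #include <linux/rcupdate.h>   /* rcu_dereference_check() */
          #include <net/sock.h>         /* struct sock, sk_filter_uncharge() */

          static void sketch_release_filter(struct sock *sk)
          {
                  struct sk_filter *filter;

                  /* The second argument documents why the access is safe even
                   * without rcu_read_lock(): no update can still be in flight
                   * once the write-allocation counter has dropped to zero. */
                  filter = rcu_dereference_check(sk->sk_filter,
                                  atomic_read(&sk->sk_wmem_alloc) == 0);
                  if (filter)
                          sk_filter_uncharge(sk, filter);
          }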
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-5-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  16. 23 Feb 2010 (1 commit)
  17. 11 Feb 2010 (1 commit)
  18. 06 Feb 2010 (1 commit)
  19. 05 Feb 2010 (1 commit)
    • packet: Add GSO/csum offload support. · bfd5f4a3
      Authored by Sridhar Samudrala
      This patch adds GSO/checksum offload to af_packet sockets using
      virtio_net_hdr, based on Rusty's patch adding this support to tun. It
      allows GSO/checksum offload to be enabled when using the raw socket
      backend with virtio_net.
      Adds a PACKET_VNET_HDR socket option to prepend a virtio_net_hdr in the
      receive path and process/skip the virtio_net_hdr in the send path. This
      option is only allowed with SOCK_RAW sockets attached to ethernet-type
      devices.
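      A hedged user-space sketch of enabling the option (PACKET_VNET_HDR comes
      from <linux/if_packet.h>; once enabled, every frame read from or written
      to the socket is prefixed with a struct virtio_net_hdr):

          #include <sys/socket.h>
          #include <linux/if_packet.h>   /* PACKET_VNET_HDR */

          static int enable_vnet_hdr(int fd)   /* fd: SOCK_RAW AF_PACKET socket */
          {
                  int on = 1;

                  return setsockopt(fd, SOL_PACKET, PACKET_VNET_HDR,
                                    &on, sizeof(on));
          }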
      
      v2 updates
      ----------
      Michael's Comments
      - Perform length check in packet_snd() when GSO is off even when
        vnet_hdr is present.
      - Check for SKB_GSO_FCOE type and return -EINVAL
      - don't allow tx/rx ring when vnet_hdr is enabled.
      Herbert's Comments
      - Removed ethernet specific code.
      - protocol value is assumed to be passed in by the caller.
      Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 18 Jan 2010 (1 commit)
  21. 12 Jan 2010 (1 commit)
  22. 16 Dec 2009 (1 commit)
  23. 30 Nov 2009 (1 commit)
  24. 26 Nov 2009 (1 commit)
  25. 11 Nov 2009 (1 commit)
  26. 06 Nov 2009 (1 commit)
  27. 02 Nov 2009 (1 commit)
  28. 29 Oct 2009 (1 commit)
  29. 27 Oct 2009 (1 commit)
    • vlan: allow null VLAN ID to be used · 05423b24
      Authored by Eric Dumazet
      We currently use a 16-bit field (vlan_tci) to store the VLAN ID/PRIO on
      an skb.
      
      A null value is used as a special value meaning VLAN tagging is not
      enabled, which forbids use of a null VLAN ID.
      
      As pointed out by David, some drivers use the 3 high-order bits (PRIO).
      
      As the VLAN ID is 12 bits, we can use the remaining bit (CFI) as a flag
      and allow a null VLAN ID.
      
      In case future code really wants to use VLAN_CFI_MASK, we'll have to use
      a bit outside of vlan_tci.
      
      #define VLAN_PRIO_MASK         0xe000 /* Priority Code Point */
      #define VLAN_PRIO_SHIFT        13
      #define VLAN_CFI_MASK          0x1000 /* Canonical Format Indicator */
      #define VLAN_TAG_PRESENT       VLAN_CFI_MASK
      #define VLAN_VID_MASK          0x0fff /* VLAN Identifier */
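      Illustrative helpers (a hedged sketch mirroring the idea of the defines
      above rather than the kernel's exact accessors): with CFI repurposed as a
      "tag present" flag, VLAN ID 0 becomes a legal value.

          #include <linux/if_vlan.h>   /* VLAN_TAG_PRESENT, VLAN_VID_MASK */

          static inline int sketch_vlan_tag_present(u16 vlan_tci)
          {
                  return vlan_tci & VLAN_TAG_PRESENT;   /* set even for VID 0 */
          }

          static inline u16 sketch_vlan_tag_get_id(u16 vlan_tci)
          {
                  return vlan_tci & VLAN_VID_MASK;
          }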
      Reported-by: Gertjan Hofman <gertjan_hofman@yahoo.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  30. 20 Oct 2009 (2 commits)
  31. 13 Oct 2009 (1 commit)
    • net: Generalize socket rx gap / receive queue overflow cmsg · 3b885787
      Authored by Neil Horman
      Create a new socket-level option to report the number of queue overflows.
      
      Recently I augmented the AF_PACKET protocol to report the number of
      frames lost on the socket receive queue between any two enqueued frames.
      This value was exported via a SOL_PACKET level cmsg.  After I completed
      that work it was requested that this feature be generalized so that any
      datagram-oriented socket could make use of this option.  As such I've
      created this patch.  It creates a new SOL_SOCKET level option called
      SO_RXQ_OVFL which, when enabled, exports a SOL_SOCKET level cmsg that
      reports the number of times the sk_receive_queue overflowed between any
      two given frames.  It also augments the AF_PACKET protocol to take
      advantage of this new feature (as it previously did not touch
      sk->sk_drops, which this patch uses to record the overflow count).  Tested
      successfully by me.
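      A hedged user-space sketch of consuming the new cmsg (SO_RXQ_OVFL is the
      constant this patch adds to the socket headers; as noted below, the
      reported value is the cumulative sk_drops count, so deltas are computed by
      the application):

          #include <sys/socket.h>
          #include <stdint.h>
          #include <string.h>

          /* Enable the option once on the socket. */
          static int enable_rxq_ovfl(int fd)
          {
                  int on = 1;

                  return setsockopt(fd, SOL_SOCKET, SO_RXQ_OVFL, &on, sizeof(on));
          }

          /* After recvmsg(), pull the drop counter out of the control data. */
          static uint32_t drops_from_cmsg(struct msghdr *msg)
          {
                  struct cmsghdr *cmsg;
                  uint32_t drops = 0;

                  for (cmsg = CMSG_FIRSTHDR(msg); cmsg;
                       cmsg = CMSG_NXTHDR(msg, cmsg))
                          if (cmsg->cmsg_level == SOL_SOCKET &&
                              cmsg->cmsg_type == SO_RXQ_OVFL)
                                  memcpy(&drops, CMSG_DATA(cmsg), sizeof(drops));
                  return drops;
          }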
      
      Notes:
      
      1) Unlike my previous patch, this patch simply records the sk_drops value, which
      is not a number of drops between packets, but rather a total number of drops.
      Deltas must be computed in user space.
      
      2) While this patch currently works with datagram-oriented protocols, it
      will also be accepted by non-datagram-oriented protocols. I'm not sure if
      that's agreeable to everyone, but my argument in favor of doing so is
      that, for those protocols to which this option isn't applicable, sk_drops
      will always be zero, and reporting no drops on a receive queue that isn't
      used for those non-participating protocols seems reasonable to me.  This
      also saves us having to code in a per-protocol opt-in mechanism.
      
      3) This applies cleanly to net-next assuming that commit
      97775007 (my af packet cmsg patch) is reverted
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  32. 12 Oct 2009 (1 commit)
  33. 07 Oct 2009 (2 commits)
  34. 05 Oct 2009 (1 commit)
    • af_packet: add interframe drop cmsg (v6) · 97775007
      Authored by Neil Horman
      Add ancillary data to better represent loss information.
      
      I've had a few requests recently to provide more detail regarding frame loss
      during an AF_PACKET packet capture session.  Specifically the requestors want to
      see where in a packet sequence frames were lost, i.e. they want to see that 40
      frames were lost between frames 302 and 303 in a packet capture file.  In order
      to do this we need:
      
      1) The kernel to export this data to user space
      2) The applications to make use of it
      
      This patch addresses item (1).  It does this by doing the following:
      
      A) Any time we drop a frame for which we would increment
      po->stats.tp_drops, we also now increment a stat called
      po->stats.tp_gap.
      
      B) Every time we successfully enqueue a frame to sk_receive_queue, we
      record the value of po->stats.tp_gap in skb->mark.  skb->cb would
      nominally be the place to record this, but since all the space there is
      used up, we're overloading skb->mark.  It's safe to do so since any
      enqueued packet is guaranteed to be unshared at this point, and skb->mark
      isn't used for anything else in the rx path to the application.  After we
      record tp_gap in the skb, we zero po->stats.tp_gap.  This allows us to
      keep a count of the number of frames lost between any two enqueued
      packets.
      
      C) When the application goes to dequeue a frame from the packet socket,
      we look at skb->mark for that frame.  If it is non-zero, we add a cmsg
      chunk to the msghdr of level SOL_PACKET and type PACKET_GAPDATA.  It's a
      32-bit integer that represents the number of frames lost between this
      packet and the previous frame received.
      
      Note there is a chance that if there is frame loss after a receive, and then the
      socket is closed, some gap data might be lost.  This is covered by the use of
      the PACKET_AUXDATA socket option, which gives total loss data.  With a bit of
      math, the final gap can be determined that way.
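      A hedged user-space sketch of consuming the cmsg described in (C).
      PACKET_GAPDATA is the constant this patch adds to if_packet.h; note that
      the generalized SO_RXQ_OVFL entry above mentions this change being
      reverted in favour of the socket-level counter, so the constant is only
      present in headers carrying this patch.

          #include <sys/socket.h>
          #include <linux/if_packet.h>   /* PACKET_GAPDATA, from the patched header */
          #include <stdint.h>
          #include <string.h>

          /* Return the number of frames lost immediately before this packet. */
          static uint32_t gap_from_cmsg(struct msghdr *msg)
          {
                  struct cmsghdr *cmsg;
                  uint32_t gap = 0;

                  for (cmsg = CMSG_FIRSTHDR(msg); cmsg;
                       cmsg = CMSG_NXTHDR(msg, cmsg))
                          if (cmsg->cmsg_level == SOL_PACKET &&
                              cmsg->cmsg_type == PACKET_GAPDATA)
                                  memcpy(&gap, CMSG_DATA(cmsg), sizeof(gap));
                  return gap;
          }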
      
      I've tested this patch myself, and it works well.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      
       include/linux/if_packet.h |    2 ++
       net/packet/af_packet.c    |   33 +++++++++++++++++++++++++++++++++
       2 files changed, 35 insertions(+)
      Signed-off-by: David S. Miller <davem@davemloft.net>
  35. 01 Oct 2009 (1 commit)
  36. 28 Sep 2009 (1 commit)
  37. 24 Jul 2009 (1 commit)