1. 19 Nov 2011 (2 commits)
  2. 04 Nov 2011 (1 commit)
    • af_packet: de-inline some helper functions · eea49cc9
      Olof Johansson authored
      This popped some compiler warnings due to mismatched prototypes. Just
      remove most manual inlines; the compiler should be able to figure out
      what makes sense to inline and what doesn't.
      
      net/packet/af_packet.c:252: warning: 'prb_curr_blk_in_use' declared inline after being called
      net/packet/af_packet.c:252: warning: previous declaration of 'prb_curr_blk_in_use' was here
      net/packet/af_packet.c:258: warning: 'prb_queue_frozen' declared inline after being called
      net/packet/af_packet.c:258: warning: previous declaration of 'prb_queue_frozen' was here
      net/packet/af_packet.c:248: warning: 'packet_previous_frame' declared inline after being called
      net/packet/af_packet.c:248: warning: previous declaration of 'packet_previous_frame' was here
      net/packet/af_packet.c:251: warning: 'packet_increment_head' declared inline after being called
      net/packet/af_packet.c:251: warning: previous declaration of 'packet_increment_head' was here
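      For reference, a minimal sketch (hypothetical names, not from the patch)
      of the pattern gcc complains about here: a helper is called through a
      plain forward declaration, and only its later definition is marked inline.

        static int helper(int x);          /* forward declaration, no inline */

        static int caller(int x)
        {
                return helper(x);          /* call site seen before the inline body */
        }

        /* 'inline' appears only here, triggering "declared inline after being
         * called"; dropping the keyword, as this patch does, avoids it */
        static inline int helper(int x)
        {
                return x * 2;
        }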
      Signed-off-by: Olof Johansson <olof@lixom.net>
      Cc: Chetan Loke <loke.chetan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 19 Oct 2011 (1 commit)
  4. 11 Oct 2011 (1 commit)
  5. 04 Oct 2011 (1 commit)
    • make PACKET_STATISTICS getsockopt report consistently between ring and non-ring · 7091fbd8
      Willem de Bruijn authored
      This is a minor change.
      
      Up until kernel 2.6.32, getsockopt(fd, SOL_PACKET, PACKET_STATISTICS,
      ...) would return total and dropped packets since its last invocation. The
      introduction of socket queue overflow reporting [1] changed drop
      rate calculation in the normal packet socket path, but not when using a
      packet ring. As a result, the getsockopt now returns different statistics
      depending on the reception method used. With a ring, it still returns the
      count since the last call, as counts are incremented in tpacket_rcv and
      reset in getsockopt. Without a ring, it returns 0 if no drops occurred
      since the last getsockopt and the total drops over the lifespan of
      the socket otherwise. The culprit is this line in packet_rcv, executed
      on a drop:
      
      drop_n_acct:
              po->stats.tp_drops = atomic_inc_return(&sk->sk_drops);
      
      As it shows, the new drop number is taken from the socket drop counter,
      which is not reset at getsockopt. I put together a small example
      that demonstrates the issue [2]. It runs for 10 seconds and overflows
      the queue/ring on every odd second. The reported drop rates are:
      ring: 16, 0, 16, 0, 16, ...
      non-ring: 0, 15, 0, 30, 0, 46, 0, 60, 0, 74.
      
      Note how the even-position non-ring counts monotonically increase. Because
      the getsockopt adds tp_drops to tp_packets, total counts are similarly
      reported cumulatively. Long story short, reinstating the original code, as
      the patch below does, fixes the issue at the cost of additional per-packet
      cycles. Another solution that does not introduce per-packet overhead is to
      keep the current data path, record the value of sk_drops at getsockopt()
      call N in a new field in struct packet_sock, and subtract that when
      reporting at call N+1. I'd be happy to code that instead; it's just
      messier.
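      For context, a minimal user-space sketch (assuming fd is an already bound
      AF_PACKET socket) of the getsockopt under discussion; with the reinstated
      accounting, each call reports the counts accumulated since the previous
      call and then resets them:

        #include <stdio.h>
        #include <sys/socket.h>
        #include <linux/if_packet.h>

        static void report_stats(int fd)
        {
                struct tpacket_stats st;
                socklen_t len = sizeof(st);

                /* tp_packets includes tp_drops; both reset on read */
                if (getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &st, &len) == 0)
                        printf("received %u, dropped %u\n",
                               st.tp_packets, st.tp_drops);
        }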
      
      [1] http://patchwork.ozlabs.org/patch/35665/
      [2] http://kernel.googlecode.com/files/test-packetsock-getstatistics.c
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 16 Sep 2011 (1 commit)
    • net: consolidate and fix ethtool_ops->get_settings calling · 4bc71cb9
      Jiri Pirko authored
      This patch does several things:
      - introduces __ethtool_get_settings which is called from ethtool code and
        from drivers as well. Put ASSERT_RTNL there.
      - dev_ethtool_get_settings() is replaced by __ethtool_get_settings()
      - changes callers in drivers so that rtnl locking is respected. Previously,
        iboe_get_rate() called ->get_settings() unlocked; this fixes it.
        prb_calc_retire_blk_tmo() in af_packet.c had the same problem, fixed by
        calling __dev_get_by_index() instead of dev_get_by_index() and holding
        rtnl_lock across both calls (see the sketch after this list).
      - introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create()
        so bnx2fc_if_create() and fcoe_if_create() are called locked as they
        are from other places.
      - use __ethtool_get_settings() in bonding code
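      As an illustration, a hedged kernel-side sketch (illustrative function
      name) of the locking pattern this establishes, mirroring the
      prb_calc_retire_blk_tmo() fix:

        /* __dev_get_by_index() takes no reference, so rtnl must stay held
         * across both the lookup and the ->get_settings() call */
        static int get_link_speed(struct net *net, int ifindex, u32 *speed)
        {
                struct ethtool_cmd ecmd;
                struct net_device *dev;
                int err = -ENODEV;

                rtnl_lock();
                dev = __dev_get_by_index(net, ifindex);
                if (dev)
                        err = __ethtool_get_settings(dev, &ecmd); /* ASSERT_RTNL inside */
                rtnl_unlock();

                if (!err)
                        *speed = ethtool_cmd_speed(&ecmd);
                return err;
        }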
      Signed-off-by: Jiri Pirko <jpirko@redhat.com>
      
      v2->v3:
      	-removed dev_ethtool_get_settings()
      	-added ASSERT_RTNL into __ethtool_get_settings()
      	-prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock
      	 around it and __ethtool_get_settings() call
      v1->v2:
              add missing export_symbol
      Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> [except FCoE bits]
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 27 Aug 2011 (1 commit)
  8. 25 Aug 2011 (1 commit)
    • af-packet: TPACKET_V3 flexible buffer implementation. · f6fb8f10
      Chetan Loke authored
      1) Blocks can be configured with non-static frame-size.
      2) Read/poll is at a block-level(as opposed to packet-level).
      3) Added poll timeout to avoid indefinite user-space wait on idle links.
      4) Added user-configurable knobs:
         4.1) block::timeout.
         4.2) tpkt_hdr::sk_rxhash.
      
      Changes:
      C1) tpacket_rcv()
          C1.1) packet_current_frame() is replaced by packet_current_rx_frame().
                The bulk of the processing is then moved into the following chain:
                packet_current_rx_frame()
                  __packet_lookup_frame_in_block()
                    fill_curr_block(),
                    or retire_current_block() + dispatch_next_block(),
                    or return NULL (queue is plugged/paused)
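      A hedged user-space sketch of configuring the new knobs (values are
      illustrative; fd is an AF_PACKET socket):

        struct tpacket_req3 req = {
                .tp_block_size       = 1 << 22,  /* 4 MiB per block */
                .tp_block_nr         = 64,
                .tp_frame_size       = 1 << 11,  /* nominal; frames may vary */
                .tp_frame_nr         = ((1 << 22) / (1 << 11)) * 64,
                .tp_retire_blk_tov   = 60,       /* knob 4.1: block timeout, ms */
                .tp_feature_req_word = TP_FT_REQ_FILL_RXHASH,  /* knob 4.2 */
        };
        int version = TPACKET_V3;

        setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));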
      Signed-off-by: Chetan Loke <loke.chetan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 14 Jul 2011 (1 commit)
  10. 07 Jul 2011 (2 commits)
  11. 06 Jul 2011 (5 commits)
  12. 12 Jun 2011 (1 commit)
    • virtio_net: introduce VIRTIO_NET_HDR_F_DATA_VALID · 10a8d94a
      Jason Wang authored
      There's no need for the guest to validate the checksum if it has been
      validated by the host NICs. So this patch introduces a new flag,
      VIRTIO_NET_HDR_F_DATA_VALID, which is used to bypass checksum
      examination in the guest. The backend (tap/macvtap) may set this flag
      when it encounters skbs with CHECKSUM_UNNECESSARY, to save CPU
      utilization.

      No feature negotiation is needed, as old drivers simply ignore this flag.

      Iperf shows a 12%-30% performance improvement for UDP traffic. For TCP,
      there is no difference when GRO is on, as it produces skbs with partial
      checksums. But when GRO is disabled, a 20% or even higher improvement
      could be measured with netperf.
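      A minimal sketch of the backend side described above (condensed, not the
      full tap/macvtap code):

        /* the host stack already validated the checksum, so tell the guest
         * it can skip its own verification */
        if (skb->ip_summed == CHECKSUM_UNNECESSARY)
                hdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;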
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 07 Jun 2011 (1 commit)
  14. 06 Jun 2011 (2 commits)
  15. 02 Jun 2011 (1 commit)
  16. 24 May 2011 (1 commit)
    • net: convert %p usage to %pK · 71338aa7
      Dan Rosenberg authored
      The %pK format specifier is designed to hide exposed kernel pointers,
      specifically via /proc interfaces.  Exposing these pointers provides an
      easy target for kernel write vulnerabilities, since they reveal the
      locations of writable structures containing easily triggerable function
      pointers.  The behavior of %pK depends on the kptr_restrict sysctl.
      
      If kptr_restrict is set to 0, no deviation from the standard %p behavior
      occurs.  If kptr_restrict is set to 1, the default, if the current user
      (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
      (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
      If kptr_restrict is set to 2, kernel pointers using %pK are printed as
      0's regardless of privileges.  Replacing with 0's was chosen over the
      default "(null)", which cannot be parsed by userland %p, which expects
      "(nil)".
      
      The supporting code for kptr_restrict and %pK is currently in the -mm
      tree.  This patch converts users of %p in net/ to %pK.  Cases of printing
      pointers to the syslog are not covered, since this would eliminate useful
      information for postmortem debugging and the reading of the syslog is
      already optionally protected by the dmesg_restrict sysctl.
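      A one-line illustration of the conversion (hypothetical seq_file line):

        /* before: seq_printf(seq, "sk=%p\n", sk);  exposes the address */
        seq_printf(seq, "sk=%pK\n", sk);  /* hidden per kptr_restrict */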
      Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Thomas Graf <tgraf@infradead.org>
      Cc: Eugene Teo <eugeneteo@kernel.org>
      Cc: Kees Cook <kees.cook@canonical.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 28 Apr 2011 (1 commit)
    • net: filter: Just In Time compiler for x86-64 · 0a14842f
      Eric Dumazet authored
      In order to speed up packet filtering, here is an implementation of a
      JIT compiler for x86_64.
      
      It is disabled by default, and must be enabled by the admin.
      
      echo 1 >/proc/sys/net/core/bpf_jit_enable
      
      It uses module_alloc() and module_free() to get memory in the 2GB text
      kernel range, since we call helper functions from the generated code.
      
      EAX : BPF A accumulator
      EBX : BPF X accumulator
      RDI : pointer to skb   (first argument given to JIT function)
      RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
      r9d : skb->len - skb->data_len (headlen)
      r8  : skb->data
      
      To get a trace of the generated code, use:
      
      echo 2 >/proc/sys/net/core/bpf_jit_enable
      
      Example of generated code:
      
      # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24
      
      flen=18 proglen=147 pass=3 image=ffffffffa00b5000
      JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
      JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
      JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
      JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
      JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
      JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
      JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
      JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
      JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
      JIT code: ffffffffa00b5090: c0 c9 c3
      
      The BPF program is 144 bytes long, so the native program is almost the same size ;)
      
      (000) ldh      [12]
      (001) jeq      #0x800           jt 2    jf 8
      (002) ld       [26]
      (003) and      #0xffffff00
      (004) jeq      #0xc0a81400      jt 16   jf 5
      (005) ld       [30]
      (006) and      #0xffffff00
      (007) jeq      #0xc0a81400      jt 16   jf 17
      (008) jeq      #0x806           jt 10   jf 9
      (009) jeq      #0x8035          jt 10   jf 17
      (010) ld       [28]
      (011) and      #0xffffff00
      (012) jeq      #0xc0a81400      jt 16   jf 13
      (013) ld       [38]
      (014) and      #0xffffff00
      (015) jeq      #0xc0a81400      jt 16   jf 17
      (016) ret      #65535
      (017) ret      #0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 08 Mar 2011 (1 commit)
  19. 12 Feb 2011 (1 commit)
  20. 20 Jan 2011 (1 commit)
  21. 19 Jan 2011 (1 commit)
  22. 17 Dec 2010 (1 commit)
  23. 11 Dec 2010 (1 commit)
  24. 09 Dec 2010 (3 commits)
  25. 07 Dec 2010 (2 commits)
  26. 22 Nov 2010 (1 commit)
  27. 20 Nov 2010 (1 commit)
    • filter: optimize sk_run_filter · 93aaae2e
      Eric Dumazet authored
      Remove the pc variable to avoid recomputing fentry at each filter
      instruction; jumps manipulate the fentry pointer directly.

      As the last instruction of filter[] is guaranteed to be a RETURN, and
      all jumps land before the last instruction, we don't need to check the
      filter bounds (number of instructions in the filter array) at each
      iteration, so the length is removed from sk_run_filter()'s parameters.
      
      On x86_32, remove the f_k variable introduced in commit 57fe93b3
      (filter: make sure filters dont read uninitialized memory).
      
      Note: we could use a CONFIG_ARCH_HAS_{FEW|MANY}_REGISTERS option in
      order to avoid too many ifdefs in this code.

      This helps the compiler use CPU registers to hold fentry and the A
      accumulator.
      
      On x86_32 this saves 401 bytes and, more importantly, sk_run_filter()
      runs much faster because of reduced register pressure (one less
      conditional branch per BPF instruction).
      
      # size net/core/filter.o net/core/filter_pre.o
         text    data     bss     dec     hex filename
         2948       0       0    2948     b84 net/core/filter.o
         3349       0       0    3349     d15 net/core/filter_pre.o
      
      on x86_64 :
      # size net/core/filter.o net/core/filter_pre.o
         text    data     bss     dec     hex filename
         5173       0       0    5173    1435 net/core/filter.o
         5224       0       0    5224    1468 net/core/filter_pre.o
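      A minimal sketch of the resulting loop shape (opcode handling elided;
      not the actual kernel code), where fentry itself acts as the program
      counter:

        unsigned int sk_run_filter(const struct sk_buff *skb,
                                   const struct sock_filter *fentry)
        {
                u32 A = 0;

                for (;; fentry++) {
                        switch (fentry->code) {
                        case BPF_S_JMP_JA:
                                fentry += fentry->k;  /* jump moves the pointer */
                                continue;
                        case BPF_S_RET_K:
                                return fentry->k;     /* guaranteed terminator */
                        case BPF_S_RET_A:
                                return A;
                        /* ... remaining opcodes load and test A, X ... */
                        }
                }
        }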
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Changli Gao <xiaosuo@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 17 Nov 2010 (1 commit)
    • packet: Enhance AF_PACKET implementation to not require high order contiguous... · 0e3125c7
      Neil Horman authored
      packet: Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v4)
      
      Version 4 of this patch.
      
      Change notes:
      1) Removed extra memset.  Didn't think kcalloc added __GFP_ZERO the way kzalloc did :)
      
      Summary:
      It was shown to me recently that systems under high load were driven very deep
      into swap when tcpdump was run.  The reason this happened was because the
      AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
      application to specify how many entries an AF_PACKET socket will have and how
      large each entry will be.  It seems the default setting for tcpdump is to set
      the ring buffer to 32 entries of 64 KB each, which implies 32 order-5
      allocations.  That's difficult under good circumstances, and horrid
      under memory pressure.
      
      I thought it would be good to make that a bit more usable.  I was going to do a
      simple conversion of the ring buffer from contiguous pages to iovecs, but
      unfortunately, the metadata which AF_PACKET places in these buffers can easily
      span a page boundary, and given that these buffers get mapped into user space,
      and the data layout doesn't easily allow for a change to padding between frames
      to avoid that, a simple iovec change is just going to break user space ABI
      consistency.
      
      So I've done this: I've added a three-tiered mechanism to the af_packet
      set_ring socket option.  It attempts to allocate memory in the following
      order:
      
      1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
      digging into swap
      
      2) Using vmalloc
      
      3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
      needed to get the memory
      
      The effect is that we don't disturb the system as much when we're under load,
      while still being able to conduct tcpdumps effectively.
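      A condensed sketch of that fallback order (the helper this patch adds to
      af_packet.c is alloc_one_pg_vec_page(); the version below is simplified):

        static char *alloc_ring_block(unsigned int order)
        {
                char *buf;

                /* 1) fail fast rather than dig into swap */
                buf = (char *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN |
                                               __GFP_NORETRY, order);
                if (buf)
                        return buf;

                /* 2) fall back to virtually contiguous memory */
                buf = vmalloc((size_t)PAGE_SIZE << order);
                if (buf)
                        return buf;

                /* 3) last resort: try as hard as needed */
                return (char *)__get_free_pages(GFP_KERNEL, order);
        }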
      
      Tested successfully by me.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Maciej Żenczykowski <zenczykowski@gmail.com>
      Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  29. 13 Nov 2010 (1 commit)
  30. 11 Nov 2010 (1 commit)