1. 24 9月, 2015 1 次提交
    • D
      Fix AF_PACKET ABI breakage in 4.2 · d3869efe
      David Woodhouse 提交于
      Commit 7d824109 ("virtio: add explicit big-endian support to memory
      accessors") accidentally changed the virtio_net header used by
      AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.
      
      Since virtio_legacy_is_little_endian() is a very long identifier,
      define a vio_le macro and use that throughout the code instead of the
      hard-coded 'false' for little-endian.
      
      This restores the ABI to match 4.1 and earlier kernels, and makes my
      test program work again.
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3869efe
  2. 18 8月, 2015 2 次提交
  3. 30 7月, 2015 1 次提交
  4. 29 7月, 2015 1 次提交
  5. 28 7月, 2015 1 次提交
  6. 23 6月, 2015 1 次提交
  7. 22 6月, 2015 3 次提交
  8. 18 5月, 2015 1 次提交
    • W
      net-packet: fix null pointer exception in rollover mode · 4633c9e0
      Willem de Bruijn 提交于
      Rollover can be enabled as flag or mode. Allocate state in both cases.
      This solves a NULL pointer exception in fanout_demux_rollover on
      referencing po->rollover if using mode rollover.
      
      Also make sure that in rollover mode each silo is tried (contrary
      to rollover flag, where the main socket is excluded after an initial
      try_self).
      
      Tested:
        Passes tools/testing/net/psock_fanout.c, which tests both modes and
        flag. My previous tests were limited to bench_rollover, which only
        stresses the flag. The test now completes safely. it still gives an
        error for mode rollover, because it does not expect the new headroom
        (ROOM_NORMAL) requirement. I will send a separate patch to the test.
      
      Fixes: 0648ab70 ("packet: rollover prepare: per-socket state")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      I should have run this test and caught this before submission, of
      course. Apologies for the oversight.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4633c9e0
  9. 15 5月, 2015 1 次提交
  10. 14 5月, 2015 6 次提交
    • W
      packet: rollover statistics · a9b63918
      Willem de Bruijn 提交于
      Rollover indicates exceptional conditions. Export a counter to inform
      socket owners of this state.
      
      If no socket with sufficient room is found, rollover fails. Also count
      these events.
      
      Finally, also count when flows are rolled over early thanks to huge
      flow detection, to validate its correctness.
      
      Tested:
        Read counters in bench_rollover on all other tests in the patchset
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9b63918
    • W
      packet: rollover huge flows before small flows · 3b3a5b0a
      Willem de Bruijn 提交于
      Migrate flows from a socket to another socket in the fanout group not
      only when the socket is full. Start migrating huge flows early, to
      divert possible 4-tuple attacks without affecting normal traffic.
      
      Introduce fanout_flow_is_huge(). This detects huge flows, which are
      defined as taking up more than half the load. It does so cheaply, by
      storing the rxhashes of the N most recent packets. If over half of
      these are the same rxhash as the current packet, then drop it. This
      only protects against 4-tuple attacks. N is chosen to fit all data in
      a single cache line.
      
      Tested:
        Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
      
          lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
          cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
            0         14         14          0          0          0          0
            1         20         20          0          0          0          0
            2         16         16          0          0          0          0
            3    6168824    6168824          0    4867721    4867721          0
            4    4867741    4867741          0          0          0          0
            5         12         12          0          0          0          0
            6         15         15          0          0          0          0
            7         17         17          0          0          0          0
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b3a5b0a
    • W
      packet: rollover lock contention avoidance · 2ccdbaa6
      Willem de Bruijn 提交于
      Rollover has to call packet_rcv_has_room on sockets in the fanout
      group to find a socket to migrate to. This operation is expensive
      especially if the packet sockets use rings, when a lock has to be
      acquired.
      
      Avoid pounding on the lock by all sockets by temporarily marking a
      socket as "under memory pressure" when such pressure is detected.
      While set, only the socket owner may call packet_rcv_has_room on the
      socket. Once it detects normal conditions, it clears the flag. The
      socket is not used as a victim by any other socket in the meantime.
      
      Under reasonably balanced load, each socket writer frequently calls
      packet_rcv_has_room and clears its own pressure field. As a backup
      for when the socket is rarely written to, also clear the flag on
      reading (packet_recvmsg, packet_poll) if this can be done cheaply
      (i.e., without calling packet_rcv_has_room). This is only for
      edge cases.
      
      Tested:
        Ran bench_rollover: a process with 8 sockets in a single fanout
        group, each pinned to a single cpu that receives one nic recv
        interrupt. RPS and RFS are disabled. The benchmark uses packet
        rx_ring, which has to take a lock when determining whether a
        socket has room.
      
        Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
        uniformly across the packet sockets (and inserted an iptables
        rule to drop in PREROUTING to avoid protocol stack processing).
      
        Without this patch, all sockets try to migrate traffic to
        neighbors, causing lock contention when searching for a non-
        empty neighbor. The lock is the top 9 entries.
      
          perf record -a -g sleep 5
      
          -  17.82%   bench_rollover  [kernel.kallsyms]    [k] _raw_spin_lock
             - _raw_spin_lock
                - 99.00% spin_lock
          	 + 81.77% packet_rcv_has_room.isra.41
          	 + 18.23% tpacket_rcv
                + 0.84% packet_rcv_has_room.isra.41
          +   5.20%      ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.15%      ksoftirqd/1  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.14%      ksoftirqd/2  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.12%      ksoftirqd/7  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.12%      ksoftirqd/5  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.10%      ksoftirqd/4  [kernel.kallsyms]    [k] _raw_spin_lock
          +   4.66%      ksoftirqd/0  [kernel.kallsyms]    [k] _raw_spin_lock
          +   4.45%      ksoftirqd/3  [kernel.kallsyms]    [k] _raw_spin_lock
          +   1.55%   bench_rollover  [kernel.kallsyms]    [k] packet_rcv_has_room.isra.41
      
        On net-next with this patch, this lock contention is no longer a
        top entry. Most time is spent in the actual read function. Next up
        are other locks:
      
          +  15.52%  bench_rollover  bench_rollover     [.] reader
          +   4.68%         swapper  [kernel.kallsyms]  [k] memcpy_erms
          +   2.77%         swapper  [kernel.kallsyms]  [k] packet_lookup_frame.isra.51
          +   2.56%     ksoftirqd/1  [kernel.kallsyms]  [k] memcpy_erms
          +   2.16%         swapper  [kernel.kallsyms]  [k] tpacket_rcv
          +   1.93%         swapper  [kernel.kallsyms]  [k] mlx4_en_process_rx_cq
      
        Looking closer at the remaining _raw_spin_lock, the cost of probing
        in rollover is now comparable to the cost of taking the lock later
        in tpacket_rcv.
      
          -   1.51%         swapper  [kernel.kallsyms]  [k] _raw_spin_lock
             - _raw_spin_lock
                + 33.41% packet_rcv_has_room
                + 28.15% tpacket_rcv
                + 19.54% enqueue_to_backlog
                + 6.45% __free_pages_ok
                + 2.78% packet_rcv_fanout
                + 2.13% fanout_demux_rollover
                + 2.01% netif_receive_skb_internal
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ccdbaa6
    • W
      packet: rollover only to socket with headroom · 9954729b
      Willem de Bruijn 提交于
      Only migrate flows to sockets that have sufficient headroom, where
      sufficient is defined as having at least 25% empty space.
      
      The kernel has three different buffer types: a regular socket, a ring
      with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
      latter two do not expose a read pointer to the kernel, so headroom is
      not computed easily. All three needs a different implementation to
      estimate free space.
      
      Tested:
        Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
      
        bench_rollover has as many sockets as there are NIC receive queues
        in the system. Each socket is owned by a process that is pinned to
        one of the receive cpus. RFS is disabled. RPS is enabled with an
        identity mapping (cpu x -> cpu x), to count drops with softnettop.
      
          lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s
          Press [Enter] to exit
      
          cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
            0         16         16          0          0          0          0
            1         21         21          0          0          0          0
            2    5227502    5227502          0          0          0          0
            3         18         18          0          0          0          0
            4    6083289    6083289          0    5227496          0          0
            5         22         22          0          0          0          0
            6         21         21          0          0          0          0
            7          9          9          0          0          0          0
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9954729b
    • W
      packet: rollover prepare: per-socket state · 0648ab70
      Willem de Bruijn 提交于
      Replace rollover state per fanout group with state per socket. Future
      patches will add fields to the new structure.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0648ab70
    • W
      packet: rollover prepare: move code out of callsites · ad377cab
      Willem de Bruijn 提交于
      packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover
      logic into the callee to simplify these callsites, especially with
      upcoming changes.
      
      The main differences between the two callsites is that the FLAG
      variant tests whether the socket previously selected by another
      mode (RR, RND, HASH, ..) has room before migrating flows, whereas the
      rollover mode has no original socket to test.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad377cab
  11. 11 5月, 2015 2 次提交
  12. 24 3月, 2015 2 次提交
  13. 13 3月, 2015 1 次提交
    • E
      net: Introduce possible_net_t · 0c5c9fb5
      Eric W. Biederman 提交于
      Having to say
      > #ifdef CONFIG_NET_NS
      > 	struct net *net;
      > #endif
      
      in structures is a little bit wordy and a little bit error prone.
      
      Instead it is possible to say:
      > typedef struct {
      > #ifdef CONFIG_NET_NS
      >       struct net *net;
      > #endif
      > } possible_net_t;
      
      And then in a header say:
      
      > 	possible_net_t net;
      
      Which is cleaner and easier to use and easier to test, as the
      possible_net_t is always there no matter what the compile options.
      
      Further this allows read_pnet and write_pnet to be functions in all
      cases which is better at catching typos.
      
      This change adds possible_net_t, updates the definitions of read_pnet
      and write_pnet, updates optional struct net * variables that
      write_pnet uses on to have the type possible_net_t, and finally fixes
      up the b0rked users of read_pnet and write_pnet.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c5c9fb5
  14. 10 3月, 2015 1 次提交
    • F
      net: delete stale packet_mclist entries · 82f17091
      Francesco Ruggeri 提交于
      When an interface is deleted from a net namespace the ifindex in the
      corresponding entries in PF_PACKET sockets' mclists becomes stale.
      This can create inconsistencies if later an interface with the same ifindex
      is moved from a different namespace (not that unlikely since ifindexes are
      per-namespace).
      In particular we saw problems with dev->promiscuity, resulting
      in "promiscuity touches roof, set promiscuity failed. promiscuity
      feature of device might be broken" warnings and EOVERFLOW failures of
      setsockopt(PACKET_ADD_MEMBERSHIP).
      This patch deletes the mclist entries for interfaces that are deleted.
      Since this now causes setsockopt(PACKET_DROP_MEMBERSHIP) to fail with
      EADDRNOTAVAIL if called after the interface is deleted, also make
      packet_mc_drop not fail.
      Signed-off-by: NFrancesco Ruggeri <fruggeri@arista.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      82f17091
  15. 03 3月, 2015 1 次提交
  16. 02 3月, 2015 3 次提交
  17. 25 2月, 2015 1 次提交
    • A
      af_packet: don't pass empty blocks for PACKET_V3 · 41a50d62
      Alexander Drozdov 提交于
      Before da413eec ("packet: Fixed TPACKET V3 to signal poll when block is
      closed rather than every packet") poll listening for an af_packet socket was
      not signaled if there was no packets to process. After the patch poll is
      signaled evety time when block retire timer expires. That happens because
      af_packet closes the current block on timeout even if the block is empty.
      
      Passing empty blocks to the user not only wastes CPU but also wastes ring
      buffer space increasing probability of packets dropping on small timeouts.
      Signed-off-by: NAlexander Drozdov <al.drozdov@gmail.com>
      Cc: Dan Collins <dan@dcollins.co.nz>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Guy Harris <guy@alum.mit.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      41a50d62
  18. 22 2月, 2015 1 次提交
  19. 18 1月, 2015 1 次提交
    • J
      netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg 提交于
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      053c095a
  20. 14 1月, 2015 1 次提交
  21. 13 1月, 2015 1 次提交
  22. 12 1月, 2015 1 次提交
  23. 23 12月, 2014 1 次提交
    • D
      packet: Fixed TPACKET V3 to signal poll when block is closed rather than every packet · da413eec
      Dan Collins 提交于
      Make TPACKET_V3 signal poll when block is closed rather than for every
      packet. Side effect is that poll will be signaled when block retire
      timer expires which didn't previously happen. Issue was visible when
      sending packets at a very low frequency such that all blocks are retired
      before packets are received by TPACKET_V3. This caused avoidable packet
      loss. The fix ensures that the signal is sent when blocks are closed
      which covers the normal path where the block is filled as well as the
      path where the timer expires. The case where a block is filled without
      moving to the next block (ie. all blocks are full) will still cause poll
      to be signaled.
      Signed-off-by: NDan Collins <dan@dcollins.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da413eec
  24. 10 12月, 2014 1 次提交
    • A
      put iov_iter into msghdr · c0371da6
      Al Viro 提交于
      Note that the code _using_ ->msg_iter at that point will be very
      unhappy with anything other than unshifted iovec-backed iov_iter.
      We still need to convert users to proper primitives.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c0371da6
  25. 09 12月, 2014 1 次提交
  26. 25 11月, 2014 1 次提交
    • M
      af_packet: fix sparse warning · 6e58040b
      Michael S. Tsirkin 提交于
      af_packet produces lots of these:
      	net/packet/af_packet.c:384:39: warning: incorrect type in return expression (different modifiers)
      	net/packet/af_packet.c:384:39:    expected struct page [pure] *
      	net/packet/af_packet.c:384:39:    got struct page *
      
      this seems to be because sparse does not realize that _pure
      refers to function, not the returned pointer.
      
      Tweak code slightly to avoid the warning.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e58040b
  27. 24 11月, 2014 2 次提交