1. 16 Nov 2015, 5 commits
  2. 06 Nov 2015, 1 commit
    • packet: race condition in packet_bind · 30f7ea1c
      Authored by Francesco Ruggeri
      There is a race condition between packet_notifier and packet_bind{_spkt}.
      
      It happens if packet_notifier(NETDEV_UNREGISTER) executes between the
      time packet_bind{_spkt} takes a reference on the new netdevice and the
      time packet_do_bind sets po->ifindex.
      In this case the notification can be missed.
      If this happens during a dev_change_net_namespace this can result in the
      netdevice being moved to the new namespace while the packet_sock in the
      old namespace still holds a reference on it. When the netdevice is later
      deleted in the new namespace the deletion hangs, since the packet_sock
      is not found in the new namespace's &net->packet.sklist.
      It can be reproduced with the script below.
      
      This patch makes packet_do_bind check again for the presence of the
      netdevice in the packet_sock's namespace after the synchronize_net
      in unregister_prot_hook.
      More generally, it also uses the rcu lock for the duration of the bind
      to stop dev_change_net_namespace/rollback_registered_many from
      going past the synchronize_net following unlist_netdevice, so that
      no NETDEV_UNREGISTER notifications can happen on the new netdevice
      while the bind is executing. To make this possible, some code from
      packet_bind{_spkt} is consolidated into packet_do_bind.
      
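      A rough sketch of the ordering the fix relies on (a simplified user-space
      model with stubbed helpers, not the actual kernel patch; only the order of
      the steps matters: look up under rcu, drop rcu around the synchronize, then
      re-check the device before committing po->ifindex):

        #include <stdbool.h>
        #include <stdio.h>

        /* Stubs standing in for the kernel primitives named above; only the
         * ordering of the calls is meaningful in this sketch. */
        static void rcu_read_lock(void)              { }
        static void rcu_read_unlock(void)            { }
        static void unregister_prot_hook_sync(void)  { /* would synchronize_net() */ }
        static bool dev_found_in_our_netns(int idx)  { (void)idx; return true; }

        static int packet_do_bind_model(int ifindex)
        {
            int ret = 0;

            rcu_read_lock();
            if (!dev_found_in_our_netns(ifindex)) {   /* look up and hold the device */
                ret = -1;                             /* -ENODEV in the kernel */
                goto out;
            }

            /* Tearing down the old hook ends in synchronize_net(), which must not
             * run under the rcu read lock, so the lock is dropped around it. */
            rcu_read_unlock();
            unregister_prot_hook_sync();
            rcu_read_lock();

            /* The fix: re-check that the device is still in this socket's
             * namespace. A NETDEV_UNREGISTER that fired during the window above
             * would otherwise be missed and the stale reference would keep the
             * device pinned. */
            if (!dev_found_in_our_netns(ifindex)) {
                ret = -1;                             /* drop the reference and bail */
                goto out;
            }

            /* only now set po->ifindex and register_prot_hook() */
        out:
            rcu_read_unlock();
            return ret;
        }

        int main(void)
        {
            printf("bind model returned %d\n", packet_do_bind_model(42));
            return 0;
        }
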
      import socket, os, time, sys
      proto=7
      realDev='em1'
      vlanId=400
      if len(sys.argv) > 1:
         vlanId=int(sys.argv[1])
      dev='vlan%d' % vlanId
      
      os.system('taskset -p 0x10 %d' % os.getpid())
      
      s = socket.socket(socket.PF_PACKET, socket.SOCK_RAW, proto)
      os.system('ip link add link %s name %s type vlan id %d' %
                (realDev, dev, vlanId))
      os.system('ip netns add dummy')
      
      pid=os.fork()
      
      if pid == 0:
         # dev should be moved while packet_do_bind is in synchronize net
         os.system('taskset -p 0x20000 %d' % os.getpid())
         os.system('ip link set %s netns dummy' % dev)
         os.system('ip netns exec dummy ip link del %s' % dev)
         s.close()
         sys.exit(0)
      
      time.sleep(.004)
      try:
         s.bind(('%s' % dev, proto+1))
      except:
         print 'Could not bind socket'
         s.close()
         os.system('ip netns del dummy')
         sys.exit(0)
      
      os.waitpid(pid, 0)
      s.close()
      os.system('ip netns del dummy')
      sys.exit(0)
      Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 13 Oct 2015, 3 commits
  4. 11 Oct 2015, 1 commit
    • bpf: fix cb access in socket filter programs · ff936a04
      Authored by Alexei Starovoitov
      eBPF socket filter programs may see junk in the 'u32 cb[5]' area, since
      it could have been used by protocol layers earlier.
      
      For socket filter programs used in af_packet we need to clean
      20 bytes of skb->cb area if it could be used by the program.
      For programs attached to TCP/UDP sockets we need to save/restore
      these 20 bytes, since they are used by the protocol layers.
      
      Remove SK_RUN_FILTER macro, since it's no longer used.
      
      Long term we may move this bpf cb area to per-cpu scratch, but that
      requires the addition of new 'per-cpu load/store' instructions, so it is
      not suitable as a short-term fix.
      
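      The two cases can be pictured with a small user-space model (a sketch
      only: the struct, the CB_LEN constant and the helper names are
      illustrative; the real code wraps BPF_PROG_RUN and operates on skb->cb):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define CB_LEN 20                       /* the 'u32 cb[5]' area seen by eBPF */

        struct skb_model { uint8_t cb[48]; };

        static void bpf_prog_run_stub(struct skb_model *skb) { (void)skb; }

        /* af_packet: nothing needs cb afterwards, so just zero it before running */
        static void run_filter_clear_cb(struct skb_model *skb)
        {
            memset(skb->cb, 0, CB_LEN);
            bpf_prog_run_stub(skb);
        }

        /* TCP/UDP sockets: protocol layers still rely on cb, so save and restore
         * it around the program (and hand the program a zeroed area meanwhile) */
        static void run_filter_save_cb(struct skb_model *skb)
        {
            uint8_t saved[CB_LEN];

            memcpy(saved, skb->cb, CB_LEN);
            memset(skb->cb, 0, CB_LEN);
            bpf_prog_run_stub(skb);
            memcpy(skb->cb, saved, CB_LEN);
        }

        int main(void)
        {
            struct skb_model skb;

            memset(skb.cb, 0xaa, sizeof(skb.cb));    /* pretend the stack used cb */
            run_filter_save_cb(&skb);
            printf("cb[0] restored to 0x%02x\n", skb.cb[0]);
            run_filter_clear_cb(&skb);
            printf("cb[0] cleared to 0x%02x\n", skb.cb[0]);
            return 0;
        }
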
      Fixes: d691f9e8 ("bpf: allow programs to write to certain skb fields")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 05 Oct 2015, 1 commit
    • bpf, seccomp: prepare for upcoming criu support · bab18991
      Authored by Daniel Borkmann
      The ongoing effort to dump existing cBPF seccomp filters back to user
      space requires holding the pre-transformed instructions, as we already do
      for socket filters on the sk_attach_filter() side, so that they can be
      reloaded in their original form at a later point in time by utilities
      such as criu.
      
      To prepare for this, simply extend the bpf_prog_create_from_user()
      API to hold a flag that tells whether we should store the original
      or not. Also, fanout filters could make use of that in the future for
      things like diag. While fanout filters already use bpf_prog_destroy(),
      move seccomp over to it as well so that original programs are handled
      when present.
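
      The shape of the API change (keep the original cBPF around only when
      asked to) can be sketched as below; the type and function names are
      illustrative, not the kernel's:

        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct cbpf_insn { unsigned short code; unsigned char jt, jf; unsigned int k; };

        struct prog {
            struct cbpf_insn *orig;      /* pre-transformed filter, kept on request */
            size_t orig_len;
            /* ... translated program would live here ... */
        };

        /* The new flag: callers that may need to dump the filter later (seccomp
         * for criu, possibly fanout for diag) pass save_orig = true. */
        static struct prog *prog_create(const struct cbpf_insn *filter, size_t len,
                                        bool save_orig)
        {
            struct prog *p = calloc(1, sizeof(*p));

            if (!p)
                return NULL;
            if (save_orig) {
                p->orig = malloc(len * sizeof(*filter));
                if (!p->orig) { free(p); return NULL; }
                memcpy(p->orig, filter, len * sizeof(*filter));
                p->orig_len = len;
            }
            return p;
        }

        static void prog_destroy(struct prog *p)      /* one destroy path for both */
        {
            if (p) { free(p->orig); free(p); }
        }

        int main(void)
        {
            struct cbpf_insn ret0 = { 0x06, 0, 0, 0 };     /* BPF_RET | BPF_K, 0 */
            struct prog *p = prog_create(&ret0, 1, true);

            printf("original kept: %s\n", p && p->orig ? "yes" : "no");
            prog_destroy(p);
            return 0;
        }
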
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Tycho Andersen <tycho.andersen@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Tested-by: Tycho Andersen <tycho.andersen@canonical.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 24 Sep 2015, 1 commit
    • Fix AF_PACKET ABI breakage in 4.2 · d3869efe
      Authored by David Woodhouse
      Commit 7d824109 ("virtio: add explicit big-endian support to memory
      accessors") accidentally changed the virtio_net header used by
      AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.
      
      Since virtio_legacy_is_little_endian() is a very long identifier,
      define a vio_le macro and use that throughout the code instead of the
      hard-coded 'false' for little-endian.
      
      This restores the ABI to match 4.1 and earlier kernels, and makes my
      test program work again.
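
      The gist of the change, modelled in a few lines of user-space C (a
      sketch: the conversion helper is a simplified stand-in for
      __virtio16_to_cpu, and the endianness probe stands in for the kernel's
      virtio_legacy_is_little_endian()):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Legacy virtio_net headers are host-endian, so "treat as little endian?"
         * must reflect the host byte order, not a hard-coded false. */
        static bool legacy_is_little_endian(void)
        {
            const uint16_t probe = 1;
            return *(const uint8_t *)&probe == 1;
        }

        #define vio_le() legacy_is_little_endian()

        /* Simplified stand-in for __virtio16_to_cpu(): swap only when the stored
         * byte order differs from the host's. */
        static uint16_t virtio16_to_cpu_model(bool stored_le, uint16_t v)
        {
            return stored_le == legacy_is_little_endian()
                   ? v : (uint16_t)((v << 8) | (v >> 8));
        }

        int main(void)
        {
            uint16_t hdr_len = 1500;     /* written by the sender in host order */

            printf("with vio_le():         %u\n",
                   virtio16_to_cpu_model(vio_le(), hdr_len));
            printf("with hard-coded false: %u (the regression on LE hosts)\n",
                   virtio16_to_cpu_model(false, hdr_len));
            return 0;
        }
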
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 18 Aug 2015, 2 commits
  8. 30 Jul 2015, 1 commit
  9. 29 Jul 2015, 1 commit
  10. 28 Jul 2015, 1 commit
  11. 23 Jun 2015, 1 commit
  12. 22 Jun 2015, 3 commits
  13. 18 May 2015, 1 commit
    • net-packet: fix null pointer exception in rollover mode · 4633c9e0
      Authored by Willem de Bruijn
      Rollover can be enabled as a flag or as a mode. Allocate state in both
      cases. This solves a NULL pointer exception in fanout_demux_rollover
      when it references po->rollover in rollover mode.

      Also make sure that in rollover mode each silo is tried (unlike the
      rollover flag, where the main socket is excluded after an initial
      try_self).
      
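      The allocation condition described above boils down to roughly the
      following (a sketch: the constants match linux/if_packet.h, but the
      decoding of the PACKET_FANOUT setsockopt argument is simplified):

        #include <stdbool.h>
        #include <stdio.h>

        #define PACKET_FANOUT_ROLLOVER       3        /* as in linux/if_packet.h */
        #define PACKET_FANOUT_FLAG_ROLLOVER  0x1000

        /* fanout_arg is the value handed to setsockopt(SOL_PACKET, PACKET_FANOUT):
         * low 16 bits carry the group id, high 16 bits the mode plus flags. */
        static bool needs_rollover_state(unsigned int fanout_arg)
        {
            unsigned int type_flags = fanout_arg >> 16;
            unsigned int mode       = type_flags & 0xff;

            /* the fix: allocate po->rollover for the mode, not only for the flag */
            return mode == PACKET_FANOUT_ROLLOVER ||
                   (type_flags & PACKET_FANOUT_FLAG_ROLLOVER);
        }

        int main(void)
        {
            printf("mode rollover: %d\n",
                   needs_rollover_state(PACKET_FANOUT_ROLLOVER << 16));
            printf("hash + flag:   %d\n",
                   needs_rollover_state(PACKET_FANOUT_FLAG_ROLLOVER << 16));
            printf("plain hash:    %d\n", needs_rollover_state(0));
            return 0;
        }
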
      Tested:
        Passes tools/testing/net/psock_fanout.c, which tests both the mode and
        the flag. My previous tests were limited to bench_rollover, which only
        stresses the flag. The test now completes safely. It still gives an
        error for rollover mode, because it does not expect the new headroom
        (ROOM_NORMAL) requirement. I will send a separate patch to the test.
      
      Fixes: 0648ab70 ("packet: rollover prepare: per-socket state")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      
      ----
      
      I should have run this test and caught this before submission, of
      course. Apologies for the oversight.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 15 May 2015, 1 commit
  15. 14 May 2015, 6 commits
    • packet: rollover statistics · a9b63918
      Authored by Willem de Bruijn
      Rollover indicates exceptional conditions. Export a counter to inform
      socket owners of this state.
      
      If no socket with sufficient room is found, rollover fails. Also count
      these events.
      
      Finally, also count when flows are rolled over early thanks to huge
      flow detection, to validate its correctness.
      
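      For consumers, the counters can then be read per socket; a minimal
      sketch, assuming the PACKET_ROLLOVER_STATS getsockopt and struct
      tpacket_rollover_stats (tp_all/tp_huge/tp_failed) exported by this patch
      are present in linux/if_packet.h:

        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <arpa/inet.h>
        #include <linux/if_ether.h>
        #include <linux/if_packet.h>

        int main(void)
        {
            int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));  /* needs CAP_NET_RAW */
            struct tpacket_rollover_stats st;
            socklen_t len = sizeof(st);

            if (fd < 0) {
                perror("socket");
                return 1;
            }
            memset(&st, 0, sizeof(st));
            if (getsockopt(fd, SOL_PACKET, PACKET_ROLLOVER_STATS, &st, &len) < 0)
                perror("getsockopt(PACKET_ROLLOVER_STATS)");
            else
                printf("rollover: all=%llu huge=%llu failed=%llu\n",
                       (unsigned long long)st.tp_all,
                       (unsigned long long)st.tp_huge,
                       (unsigned long long)st.tp_failed);
            return 0;
        }
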
      Tested:
        Read counters in bench_rollover on all other tests in the patchset
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: rollover huge flows before small flows · 3b3a5b0a
      Authored by Willem de Bruijn
      Migrate flows from one socket to another in the fanout group not only
      when a socket is full. Start migrating huge flows early, to
      divert possible 4-tuple attacks without affecting normal traffic.
      
      Introduce fanout_flow_is_huge(). This detects huge flows, which are
      defined as taking up more than half the load. It does so cheaply, by
      storing the rxhashes of the N most recent packets. If over half of
      these are the same rxhash as the current packet, then drop it. This
      only protects against 4-tuple attacks. N is chosen to fit all data in
      a single cache line.
      
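      The detector can be sketched in a few lines (a user-space model of the
      idea; HLEN and the random-replacement policy are illustrative, the
      kernel sizes the history to fit one cache line):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define HLEN 16                    /* N most recent rxhashes kept */

        static uint32_t history[HLEN];

        /* A flow is "huge" when its rxhash accounts for more than half of the
         * recently seen packets; one random slot is overwritten per packet. */
        static bool flow_is_huge(uint32_t rxhash)
        {
            int i, count = 0;

            for (i = 0; i < HLEN; i++)
                if (history[i] == rxhash)
                    count++;

            history[rand() % HLEN] = rxhash;
            return count > HLEN / 2;
        }

        int main(void)
        {
            int i, huge = 0;

            for (i = 0; i < 1000; i++)     /* a single dominant flow */
                huge += flow_is_huge(0xdeadbeef);
            printf("flagged huge %d times out of 1000\n", huge);
            return 0;
        }
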
      Tested:
        Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
      
          lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
          cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
            0         14         14          0          0          0          0
            1         20         20          0          0          0          0
            2         16         16          0          0          0          0
            3    6168824    6168824          0    4867721    4867721          0
            4    4867741    4867741          0          0          0          0
            5         12         12          0          0          0          0
            6         15         15          0          0          0          0
            7         17         17          0          0          0          0
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: rollover lock contention avoidance · 2ccdbaa6
      Authored by Willem de Bruijn
      Rollover has to call packet_rcv_has_room on sockets in the fanout
      group to find a socket to migrate to. This operation is expensive,
      especially if the packet sockets use rings, since a lock then has to
      be acquired.
      
      Avoid pounding on the lock by all sockets by temporarily marking a
      socket as "under memory pressure" when such pressure is detected.
      While set, only the socket owner may call packet_rcv_has_room on the
      socket. Once it detects normal conditions, it clears the flag. The
      socket is not used as a victim by any other socket in the meantime.
      
      Under reasonably balanced load, each socket writer frequently calls
      packet_rcv_has_room and clears its own pressure field. As a backup
      for when the socket is rarely written to, also clear the flag on
      reading (packet_recvmsg, packet_poll) if this can be done cheaply
      (i.e., without calling packet_rcv_has_room). This is only for
      edge cases.
      
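      The scheme reduces to the pattern below (a user-space sketch with
      made-up names; in the kernel the hint is a field of struct packet_sock
      and the expensive probe is packet_rcv_has_room):

        #include <stdbool.h>
        #include <stdio.h>

        struct psock_model {
            bool pressure;           /* cached "this socket looks full" hint */
            int  used, capacity;     /* stand-in for the receive queue / ring */
        };

        /* Expensive probe (may take the ring lock in the kernel); it is also
         * where the pressure hint gets refreshed. */
        static bool rcv_has_room(struct psock_model *po)
        {
            bool room = po->used < po->capacity;

            po->pressure = !room;
            return room;
        }

        /* Check done by *other* sockets looking for a rollover victim: consult
         * the cheap hint first, fall through to the probe only when it looks
         * hopeful. */
        static bool good_rollover_target(struct psock_model *po)
        {
            return !po->pressure && rcv_has_room(po);
        }

        int main(void)
        {
            struct psock_model a = { .used = 100, .capacity = 100 };
            struct psock_model b = { .used = 10,  .capacity = 100 };

            rcv_has_room(&a);        /* the owner notices the pressure on 'a' */
            printf("a eligible: %d, b eligible: %d\n",
                   good_rollover_target(&a), good_rollover_target(&b));
            return 0;
        }
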
      Tested:
        Ran bench_rollover: a process with 8 sockets in a single fanout
        group, each pinned to a single cpu that receives one nic recv
        interrupt. RPS and RFS are disabled. The benchmark uses packet
        rx_ring, which has to take a lock when determining whether a
        socket has room.
      
        Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
        uniformly across the packet sockets (and inserted an iptables
        rule to drop in PREROUTING to avoid protocol stack processing).
      
        Without this patch, all sockets try to migrate traffic to
        neighbors, causing lock contention when searching for a non-
        empty neighbor. The lock shows up as the top 9 entries.
      
          perf record -a -g sleep 5
      
          -  17.82%   bench_rollover  [kernel.kallsyms]    [k] _raw_spin_lock
             - _raw_spin_lock
                - 99.00% spin_lock
          	 + 81.77% packet_rcv_has_room.isra.41
          	 + 18.23% tpacket_rcv
                + 0.84% packet_rcv_has_room.isra.41
          +   5.20%      ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.15%      ksoftirqd/1  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.14%      ksoftirqd/2  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.12%      ksoftirqd/7  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.12%      ksoftirqd/5  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.10%      ksoftirqd/4  [kernel.kallsyms]    [k] _raw_spin_lock
          +   4.66%      ksoftirqd/0  [kernel.kallsyms]    [k] _raw_spin_lock
          +   4.45%      ksoftirqd/3  [kernel.kallsyms]    [k] _raw_spin_lock
          +   1.55%   bench_rollover  [kernel.kallsyms]    [k] packet_rcv_has_room.isra.41
      
        On net-next with this patch, this lock contention is no longer a
        top entry. Most time is spent in the actual read function. Next up
        are other locks:
      
          +  15.52%  bench_rollover  bench_rollover     [.] reader
          +   4.68%         swapper  [kernel.kallsyms]  [k] memcpy_erms
          +   2.77%         swapper  [kernel.kallsyms]  [k] packet_lookup_frame.isra.51
          +   2.56%     ksoftirqd/1  [kernel.kallsyms]  [k] memcpy_erms
          +   2.16%         swapper  [kernel.kallsyms]  [k] tpacket_rcv
          +   1.93%         swapper  [kernel.kallsyms]  [k] mlx4_en_process_rx_cq
      
        Looking closer at the remaining _raw_spin_lock, the cost of probing
        in rollover is now comparable to the cost of taking the lock later
        in tpacket_rcv.
      
          -   1.51%         swapper  [kernel.kallsyms]  [k] _raw_spin_lock
             - _raw_spin_lock
                + 33.41% packet_rcv_has_room
                + 28.15% tpacket_rcv
                + 19.54% enqueue_to_backlog
                + 6.45% __free_pages_ok
                + 2.78% packet_rcv_fanout
                + 2.13% fanout_demux_rollover
                + 2.01% netif_receive_skb_internal
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: rollover only to socket with headroom · 9954729b
      Authored by Willem de Bruijn
      Only migrate flows to sockets that have sufficient headroom, where
      sufficient is defined as having at least 25% empty space.
      
      The kernel has three different buffer types: a regular socket, a ring
      with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
      latter two do not expose a read pointer to the kernel, so headroom is
      not easily computed. Each of the three needs a different implementation
      to estimate free space.
      
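      Once 'used' and 'capacity' have been estimated for the buffer type at
      hand, the acceptance test itself is the same everywhere; roughly (a
      sketch of the 25% rule described above):

        #include <stdio.h>

        enum room { ROOM_NONE, ROOM_LOW, ROOM_NORMAL };

        /* A socket is a valid rollover target only with >25% of its buffer free;
         * how used/capacity are obtained differs per buffer type (socket rcvbuf,
         * TPACKET_V1/2 frame ring, TPACKET_V3 block ring). */
        static enum room room_of(unsigned int used, unsigned int capacity)
        {
            unsigned int avail = capacity > used ? capacity - used : 0;

            if (avail > capacity / 4)
                return ROOM_NORMAL;
            return avail ? ROOM_LOW : ROOM_NONE;
        }

        int main(void)
        {
            printf("%d %d %d\n", room_of(10, 100), room_of(80, 100), room_of(100, 100));
            return 0;
        }
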
      Tested:
        Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
      
        bench_rollover has as many sockets as there are NIC receive queues
        in the system. Each socket is owned by a process that is pinned to
        one of the receive cpus. RFS is disabled. RPS is enabled with an
        identity mapping (cpu x -> cpu x), to count drops with softnettop.
      
          lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s
          Press [Enter] to exit
      
          cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
            0         16         16          0          0          0          0
            1         21         21          0          0          0          0
            2    5227502    5227502          0          0          0          0
            3         18         18          0          0          0          0
            4    6083289    6083289          0    5227496          0          0
            5         22         22          0          0          0          0
            6         21         21          0          0          0          0
            7          9          9          0          0          0          0
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: rollover prepare: per-socket state · 0648ab70
      Authored by Willem de Bruijn
      Replace rollover state per fanout group with state per socket. Future
      patches will add fields to the new structure.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: rollover prepare: move code out of callsites · ad377cab
      Authored by Willem de Bruijn
      packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover
      logic into the callee to simplify these callsites, especially with
      upcoming changes.
      
      The main difference between the two callsites is that the FLAG
      variant tests whether the socket previously selected by another
      mode (RR, RND, HASH, ...) has room before migrating flows, whereas
      rollover mode has no original socket to test.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 11 May 2015, 2 commits
  17. 24 Mar 2015, 2 commits
  18. 10 Mar 2015, 1 commit
    • net: delete stale packet_mclist entries · 82f17091
      Authored by Francesco Ruggeri
      When an interface is deleted from a net namespace the ifindex in the
      corresponding entries in PF_PACKET sockets' mclists becomes stale.
      This can create inconsistencies if later an interface with the same ifindex
      is moved from a different namespace (not that unlikely since ifindexes are
      per-namespace).
      In particular we saw problems with dev->promiscuity, resulting
      in "promiscuity touches roof, set promiscuity failed. promiscuity
      feature of device might be broken" warnings and EOVERFLOW failures of
      setsockopt(PACKET_ADD_MEMBERSHIP).
      This patch deletes the mclist entries for interfaces that are deleted.
      Since this now causes setsockopt(PACKET_DROP_MEMBERSHIP) to fail with
      EADDRNOTAVAIL if called after the interface is deleted, also make
      packet_mc_drop not fail.
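
      The cleanup amounts to a list walk over the socket's memberships (a
      user-space sketch; the kernel does this from its netdevice notifier on
      NETDEV_UNREGISTER):

        #include <stdio.h>
        #include <stdlib.h>

        struct mclist_entry {
            int ifindex;
            struct mclist_entry *next;
        };

        /* Drop every membership that refers to the interface that just went away,
         * so a later interface reusing the same ifindex cannot be confused with it. */
        static void purge_ifindex(struct mclist_entry **head, int gone_ifindex)
        {
            struct mclist_entry **pp = head;

            while (*pp) {
                if ((*pp)->ifindex == gone_ifindex) {
                    struct mclist_entry *victim = *pp;
                    *pp = victim->next;
                    free(victim);           /* the kernel would also undo dev state */
                } else {
                    pp = &(*pp)->next;
                }
            }
        }

        int main(void)
        {
            struct mclist_entry *head = NULL;
            int idx[] = { 3, 7, 3 };

            for (int i = 0; i < 3; i++) {
                struct mclist_entry *e = malloc(sizeof(*e));
                e->ifindex = idx[i];
                e->next = head;
                head = e;
            }
            purge_ifindex(&head, 3);
            for (struct mclist_entry *e = head; e; e = e->next)
                printf("kept ifindex %d\n", e->ifindex);
            return 0;
        }
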
      Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 03 Mar 2015, 1 commit
  20. 02 Mar 2015, 3 commits
  21. 25 Feb 2015, 1 commit
    • af_packet: don't pass empty blocks for PACKET_V3 · 41a50d62
      Authored by Alexander Drozdov
      Before da413eec ("packet: Fixed TPACKET V3 to signal poll when block is
      closed rather than every packet"), poll listening on an af_packet socket
      was not signaled if there were no packets to process. After that patch,
      poll is signaled every time the block retire timer expires. That happens
      because af_packet closes the current block on timeout even if the block
      is empty.

      Passing empty blocks to the user not only wastes CPU but also wastes ring
      buffer space, increasing the probability of packet drops on small
      timeouts.
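
      The behavioural change fits in a couple of lines (a sketch of the
      retire-timer path; the names are illustrative):

        #include <stdio.h>

        struct block_model { unsigned int num_pkts; };

        /* When the retire timer fires on an empty block, keep the block open and
         * re-arm the timer instead of closing it and waking poll() for nothing. */
        static void retire_timer_expired(struct block_model *cur)
        {
            if (cur->num_pkts == 0) {
                printf("empty block: refresh timer only\n");
                return;
            }
            printf("close block (%u packets), wake readers, advance to next block\n",
                   cur->num_pkts);
        }

        int main(void)
        {
            struct block_model empty = { 0 }, busy = { 42 };

            retire_timer_expired(&empty);
            retire_timer_expired(&busy);
            return 0;
        }
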
      Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
      Cc: Dan Collins <dan@dcollins.co.nz>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Guy Harris <guy@alum.mit.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 22 Feb 2015, 1 commit