1. 15 6月, 2019 6 次提交
  2. 12 6月, 2019 1 次提交
  3. 03 6月, 2019 1 次提交
  4. 31 5月, 2019 1 次提交
  5. 10 5月, 2019 1 次提交
    • Y
      packet: Fix error path in packet_init · 36096f2f
      YueHaibing 提交于
      kernel BUG at lib/list_debug.c:47!
      invalid opcode: 0000 [#1
      CPU: 0 PID: 12914 Comm: rmmod Tainted: G        W         5.1.0+ #47
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
      RIP: 0010:__list_del_entry_valid+0x53/0x90
      Code: 48 8b 32 48 39 fe 75 35 48 8b 50 08 48 39 f2 75 40 b8 01 00 00 00 5d c3 48
      89 fe 48 89 c2 48 c7 c7 18 75 fe 82 e8 cb 34 78 ff <0f> 0b 48 89 fe 48 c7 c7 50 75 fe 82 e8 ba 34 78 ff 0f 0b 48 89 f2
      RSP: 0018:ffffc90001c2fe40 EFLAGS: 00010286
      RAX: 000000000000004e RBX: ffffffffa0184000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff888237a17788 RDI: 00000000ffffffff
      RBP: ffffc90001c2fe40 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffc90001c2fe10 R11: 0000000000000000 R12: 0000000000000000
      R13: ffffc90001c2fe50 R14: ffffffffa0184000 R15: 0000000000000000
      FS:  00007f3d83634540(0000) GS:ffff888237a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000555c350ea818 CR3: 0000000231677000 CR4: 00000000000006f0
      Call Trace:
       unregister_pernet_operations+0x34/0x120
       unregister_pernet_subsys+0x1c/0x30
       packet_exit+0x1c/0x369 [af_packet
       __x64_sys_delete_module+0x156/0x260
       ? lockdep_hardirqs_on+0x133/0x1b0
       ? do_syscall_64+0x12/0x1f0
       do_syscall_64+0x6e/0x1f0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      When modprobe af_packet, register_pernet_subsys
      fails and does a cleanup, ops->list is set to LIST_POISON1,
      but the module init is considered to success, then while rmmod it,
      BUG() is triggered in __list_del_entry_valid which is called from
      unregister_pernet_subsys. This patch fix error handing path in
      packet_init to avoid possilbe issue if some error occur.
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      36096f2f
  6. 01 5月, 2019 2 次提交
  7. 20 4月, 2019 1 次提交
  8. 21 3月, 2019 3 次提交
    • P
      net: remove 'fallback' argument from dev->ndo_select_queue() · a350ecce
      Paolo Abeni 提交于
      After the previous patch, all the callers of ndo_select_queue()
      provide as a 'fallback' argument netdev_pick_tx.
      The only exceptions are nested calls to ndo_select_queue(),
      which pass down the 'fallback' available in the current scope
      - still netdev_pick_tx.
      
      We can drop such argument and replace fallback() invocation with
      netdev_pick_tx(). This avoids an indirect call per xmit packet
      in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
      with device drivers implementing such ndo. It also clean the code
      a bit.
      
      Tested with ixgbe and CONFIG_FCOE=m
      
      With pktgen using queue xmit:
      threads		vanilla 	patched
      		(kpps)		(kpps)
      1		2334		2428
      2		4166		4278
      4		7895		8100
      
       v1 -> v2:
       - rebased after helper's name change
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a350ecce
    • P
      packet: rework packet_pick_tx_queue() to use common code selection · b71b5837
      Paolo Abeni 提交于
      Currently packet_pick_tx_queue() is the only caller of
      ndo_select_queue() using a fallback argument other than
      netdev_pick_tx.
      
      Leveraging rx queue, we can obtain a similar queue selection
      behavior using core helpers. After this change, ndo_select_queue()
      is always invoked with netdev_pick_tx() as fallback.
      We can change ndo_select_queue() signature in a followup patch,
      dropping an indirect call per transmitted packet in some scenarios
      (e.g. TCP syn and XDP generic xmit)
      
      This changes slightly how af packet queue selection happens when
      PACKET_QDISC_BYPASS is set. It's now more similar to plan dev_queue_xmit()
      tacking in account both XPS and TC mapping.
      
       v1  -> v2:
        - rebased after helper name change
       RFC -> v1:
        - initialize sender_cpu to the expected value
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b71b5837
    • C
      net/packet: Set __GFP_NOWARN upon allocation in alloc_pg_vec · 398f0132
      Christoph Paasch 提交于
      Since commit fc62814d ("net/packet: fix 4gb buffer limit due to overflow check")
      one can now allocate packet ring buffers >= UINT_MAX. However, syzkaller
      found that that triggers a warning:
      
      [   21.100000] WARNING: CPU: 2 PID: 2075 at mm/page_alloc.c:4584 __alloc_pages_nod0
      [   21.101490] Modules linked in:
      [   21.101921] CPU: 2 PID: 2075 Comm: syz-executor.0 Not tainted 5.0.0 #146
      [   21.102784] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
      [   21.103887] RIP: 0010:__alloc_pages_nodemask+0x2a0/0x630
      [   21.104640] Code: fe ff ff 65 48 8b 04 25 c0 de 01 00 48 05 90 0f 00 00 41 bd 01 00 00 00 48 89 44 24 48 e9 9c fe 3
      [   21.107121] RSP: 0018:ffff88805e1cf920 EFLAGS: 00010246
      [   21.107819] RAX: 0000000000000000 RBX: ffffffff85a488a0 RCX: 0000000000000000
      [   21.108753] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000000
      [   21.109699] RBP: 1ffff1100bc39f28 R08: ffffed100bcefb67 R09: ffffed100bcefb67
      [   21.110646] R10: 0000000000000001 R11: ffffed100bcefb66 R12: 000000000000000d
      [   21.111623] R13: 0000000000000000 R14: ffff88805e77d888 R15: 000000000000000d
      [   21.112552] FS:  00007f7c7de05700(0000) GS:ffff88806d100000(0000) knlGS:0000000000000000
      [   21.113612] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   21.114405] CR2: 000000000065c000 CR3: 000000005e58e006 CR4: 00000000001606e0
      [   21.115367] Call Trace:
      [   21.115705]  ? __alloc_pages_slowpath+0x21c0/0x21c0
      [   21.116362]  alloc_pages_current+0xac/0x1e0
      [   21.116923]  kmalloc_order+0x18/0x70
      [   21.117393]  kmalloc_order_trace+0x18/0x110
      [   21.117949]  packet_set_ring+0x9d5/0x1770
      [   21.118524]  ? packet_rcv_spkt+0x440/0x440
      [   21.119094]  ? lock_downgrade+0x620/0x620
      [   21.119646]  ? __might_fault+0x177/0x1b0
      [   21.120177]  packet_setsockopt+0x981/0x2940
      [   21.120753]  ? __fget+0x2fb/0x4b0
      [   21.121209]  ? packet_release+0xab0/0xab0
      [   21.121740]  ? sock_has_perm+0x1cd/0x260
      [   21.122297]  ? selinux_secmark_relabel_packet+0xd0/0xd0
      [   21.123013]  ? __fget+0x324/0x4b0
      [   21.123451]  ? selinux_netlbl_socket_setsockopt+0x101/0x320
      [   21.124186]  ? selinux_netlbl_sock_rcv_skb+0x3a0/0x3a0
      [   21.124908]  ? __lock_acquire+0x529/0x3200
      [   21.125453]  ? selinux_socket_setsockopt+0x5d/0x70
      [   21.126075]  ? __sys_setsockopt+0x131/0x210
      [   21.126533]  ? packet_release+0xab0/0xab0
      [   21.127004]  __sys_setsockopt+0x131/0x210
      [   21.127449]  ? kernel_accept+0x2f0/0x2f0
      [   21.127911]  ? ret_from_fork+0x8/0x50
      [   21.128313]  ? do_raw_spin_lock+0x11b/0x280
      [   21.128800]  __x64_sys_setsockopt+0xba/0x150
      [   21.129271]  ? lockdep_hardirqs_on+0x37f/0x560
      [   21.129769]  do_syscall_64+0x9f/0x450
      [   21.130182]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      We should allocate with __GFP_NOWARN to handle this.
      
      Cc: Kal Conley <kal.conley@dectris.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Fixes: fc62814d ("net/packet: fix 4gb buffer limit due to overflow check")
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      398f0132
  9. 19 3月, 2019 2 次提交
    • Y
      af_packet: fix the tx skb protocol in raw sockets with ETH_P_ALL · 18bed891
      Yoshiki Komachi 提交于
      I am using "protocol ip" filters in TC to manipulate TC flower
      classifiers, which are only available with "protocol ip". However,
      I faced an issue that packets sent via raw sockets with ETH_P_ALL
      did not match the ip filters even if they did satisfy the condition
      (e.g., DHCP offer from dhcpd).
      
      I have determined that the behavior was caused by an unexpected
      value stored in skb->protocol, namely, ETH_P_ALL instead of ETH_P_IP,
      when packets were sent via raw sockets with ETH_P_ALL set.
      
      IMHO, storing ETH_P_ALL in skb->protocol is not appropriate for
      packets sent via raw sockets because ETH_P_ALL is not a real ether
      type used on wire, but a virtual one.
      
      This patch fixes the tx protocol selection in cases of transmission
      via raw sockets created with ETH_P_ALL so that it asks the driver to
      extract protocol from the Ethernet header.
      
      Fixes: 75c65772 ("net/packet: Ask driver for protocol if not provided by user")
      Signed-off-by: NYoshiki Komachi <komachi.yoshiki@lab.ntt.co.jp>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      18bed891
    • M
      packets: Always register packet sk in the same order · a4dc6a49
      Maxime Chevallier 提交于
      When using fanouts with AF_PACKET, the demux functions such as
      fanout_demux_cpu will return an index in the fanout socket array, which
      corresponds to the selected socket.
      
      The ordering of this array depends on the order the sockets were added
      to a given fanout group, so for FANOUT_CPU this means sockets are bound
      to cpus in the order they are configured, which is OK.
      
      However, when stopping then restarting the interface these sockets are
      bound to, the sockets are reassigned to the fanout group in the reverse
      order, due to the fact that they were inserted at the head of the
      interface's AF_PACKET socket list.
      
      This means that traffic that was directed to the first socket in the
      fanout group is now directed to the last one after an interface restart.
      
      In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
      socket that used to receive traffic from the last CPU after an interface
      restart.
      
      This commit introduces a helper to add a socket at the tail of a list,
      then uses it to register AF_PACKET sockets.
      
      Note that this changes the order in which sockets are listed in /proc and
      with sock_diag.
      
      Fixes: dc99f600 ("packet: Add fanout support")
      Signed-off-by: NMaxime Chevallier <maxime.chevallier@bootlin.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a4dc6a49
  10. 23 2月, 2019 3 次提交
  11. 13 2月, 2019 1 次提交
  12. 18 1月, 2019 1 次提交
    • N
      af_packet: fix raw sockets over 6in4 tunnel · 88a8121d
      Nicolas Dichtel 提交于
      Since commit cb9f1b78, scapy (which uses an AF_PACKET socket in
      SOCK_RAW mode) is unable to send a basic icmp packet over a sit tunnel:
      
      Here is a example of the setup:
      $ ip link set ntfp2 up
      $ ip addr add 10.125.0.1/24 dev ntfp2
      $ ip tunnel add tun1 mode sit ttl 64 local 10.125.0.1 remote 10.125.0.2 dev ntfp2
      $ ip addr add fd00:cafe:cafe::1/128 dev tun1
      $ ip link set dev tun1 up
      $ ip route add fd00:200::/64 dev tun1
      $ scapy
      >>> p = []
      >>> p += IPv6(src='fd00:100::1', dst='fd00:200::1')/ICMPv6EchoRequest()
      >>> send(p, count=1, inter=0.1)
      >>> quit()
      $ ip -s link ls dev tun1 | grep -A1 "TX.*errors"
          TX: bytes  packets  errors  dropped carrier collsns
          0          0        1       0       0       0
      
      The problem is that the network offset is set to the hard_header_len of the
      output device (tun1, ie 14 + 20) and in our case, because the packet is
      small (48 bytes) the pskb_inet_may_pull() fails (it tries to pull 40 bytes
      (ipv6 header) starting from the network offset).
      
      This problem is more generally related to device with variable hard header
      length. To avoid a too intrusive patch in the current release, a (ugly)
      workaround is proposed in this patch. It has to be cleaned up in net-next.
      
      Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=993675a3100b1
      Link: http://patchwork.ozlabs.org/patch/1024489/
      Fixes: cb9f1b78 ("ip: validate header length on virtual device xmit")
      CC: Willem de Bruijn <willemb@google.com>
      CC: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      88a8121d
  13. 09 1月, 2019 1 次提交
  14. 23 12月, 2018 1 次提交
  15. 22 12月, 2018 1 次提交
  16. 18 12月, 2018 1 次提交
  17. 24 11月, 2018 1 次提交
    • W
      packet: copy user buffers before orphan or clone · 5cd8d46e
      Willem de Bruijn 提交于
      tpacket_snd sends packets with user pages linked into skb frags. It
      notifies that pages can be reused when the skb is released by setting
      skb->destructor to tpacket_destruct_skb.
      
      This can cause data corruption if the skb is orphaned (e.g., on
      transmit through veth) or cloned (e.g., on mirror to another psock).
      
      Create a kernel-private copy of data in these cases, same as tun/tap
      zerocopy transmission. Reuse that infrastructure: mark the skb as
      SKBTX_ZEROCOPY_FRAG, which will trigger copy in skb_orphan_frags(_rx).
      
      Unlike other zerocopy packets, do not set shinfo destructor_arg to
      struct ubuf_info. tpacket_destruct_skb already uses that ptr to notify
      when the original skb is released and a timestamp is recorded. Do not
      change this timestamp behavior. The ubuf_info->callback is not needed
      anyway, as no zerocopy notification is expected.
      
      Mark destructor_arg as not-a-uarg by setting the lower bit to 1. The
      resulting value is not a valid ubuf_info pointer, nor a valid
      tpacket_snd frame address. Add skb_zcopy_.._nouarg helpers for this.
      
      The fix relies on features introduced in commit 52267790 ("sock:
      add MSG_ZEROCOPY"), so can be backported as is only to 4.14.
      
      Tested with from `./in_netns.sh ./txring_overwrite` from
      http://github.com/wdebruij/kerneltools/tests
      
      Fixes: 69e3c75f ("net: TX_RING and packet mmap")
      Reported-by: NAnand H. Krishnan <anandhkrishnan@gmail.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5cd8d46e
  18. 05 10月, 2018 1 次提交
  19. 06 9月, 2018 1 次提交
    • V
      packet: add sockopt to ignore outgoing packets · fa788d98
      Vincent Whitchurch 提交于
      Currently, the only way to ignore outgoing packets on a packet socket is
      via the BPF filter.  With MSG_ZEROCOPY, packets that are looped into
      AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
      if the filter run from packet_rcv() would reject them.  So the presence
      of a packet socket on the interface takes away the benefits of
      MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
      packets.  (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
      cloned, but the cost for that is much lower.)
      
      Add a socket option to allow AF_PACKET sockets to ignore outgoing
      packets to solve this.  Note that the *BSDs already have something
      similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.
      
      The first intended user is lldpd.
      Signed-off-by: NVincent Whitchurch <vincent.whitchurch@axis.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa788d98
  20. 01 9月, 2018 1 次提交
  21. 14 8月, 2018 1 次提交
  22. 07 8月, 2018 1 次提交
    • W
      packet: refine ring v3 block size test to hold one frame · 4576cd46
      Willem de Bruijn 提交于
      TPACKET_V3 stores variable length frames in fixed length blocks.
      Blocks must be able to store a block header, optional private space
      and at least one minimum sized frame.
      
      Frames, even for a zero snaplen packet, store metadata headers and
      optional reserved space.
      
      In the block size bounds check, ensure that the frame of the
      chosen configuration fits. This includes sockaddr_ll and optional
      tp_reserve.
      
      Syzbot was able to construct a ring with insuffient room for the
      sockaddr_ll in the header of a zero-length frame, triggering an
      out-of-bounds write in dev_parse_header.
      
      Convert the comparison to less than, as zero is a valid snap len.
      This matches the test for minimum tp_frame_size immediately below.
      
      Fixes: f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer implementation.")
      Fixes: eb73190f ("net/packet: refine check for priv area size")
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4576cd46
  23. 05 8月, 2018 1 次提交
  24. 13 7月, 2018 1 次提交
  25. 10 7月, 2018 2 次提交
  26. 07 7月, 2018 1 次提交
  27. 04 7月, 2018 1 次提交
  28. 29 6月, 2018 1 次提交
    • L
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds 提交于
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a11e1d43