1. 31 3月, 2017 3 次提交
  2. 02 3月, 2017 1 次提交
    • A
      net: don't call strlen() on the user buffer in packet_bind_spkt() · 540e2894
      Alexander Potapenko 提交于
      KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
      uninitialized memory in packet_bind_spkt():
      Acked-by: NEric Dumazet <edumazet@google.com>
      
      ==================================================================
      BUG: KMSAN: use of unitialized memory
      CPU: 0 PID: 1074 Comm: packet Not tainted 4.8.0-rc6+ #1891
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
      01/01/2011
       0000000000000000 ffff88006b6dfc08 ffffffff82559ae8 ffff88006b6dfb48
       ffffffff818a7c91 ffffffff85b9c870 0000000000000092 ffffffff85b9c550
       0000000000000000 0000000000000092 00000000ec400911 0000000000000002
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff82559ae8>] dump_stack+0x238/0x290 lib/dump_stack.c:51
       [<ffffffff818a6626>] kmsan_report+0x276/0x2e0 mm/kmsan/kmsan.c:1003
       [<ffffffff818a783b>] __msan_warning+0x5b/0xb0
      mm/kmsan/kmsan_instr.c:424
       [<     inline     >] strlen lib/string.c:484
       [<ffffffff8259b58d>] strlcpy+0x9d/0x200 lib/string.c:144
       [<ffffffff84b2eca4>] packet_bind_spkt+0x144/0x230
      net/packet/af_packet.c:3132
       [<ffffffff84242e4d>] SYSC_bind+0x40d/0x5f0 net/socket.c:1370
       [<ffffffff84242a22>] SyS_bind+0x82/0xa0 net/socket.c:1356
       [<ffffffff8515991b>] entry_SYSCALL_64_fastpath+0x13/0x8f
      arch/x86/entry/entry_64.o:?
      chained origin: 00000000eba00911
       [<ffffffff810bb787>] save_stack_trace+0x27/0x50
      arch/x86/kernel/stacktrace.c:67
       [<     inline     >] kmsan_save_stack_with_flags mm/kmsan/kmsan.c:322
       [<     inline     >] kmsan_save_stack mm/kmsan/kmsan.c:334
       [<ffffffff818a59f8>] kmsan_internal_chain_origin+0x118/0x1e0
      mm/kmsan/kmsan.c:527
       [<ffffffff818a7773>] __msan_set_alloca_origin4+0xc3/0x130
      mm/kmsan/kmsan_instr.c:380
       [<ffffffff84242b69>] SYSC_bind+0x129/0x5f0 net/socket.c:1356
       [<ffffffff84242a22>] SyS_bind+0x82/0xa0 net/socket.c:1356
       [<ffffffff8515991b>] entry_SYSCALL_64_fastpath+0x13/0x8f
      arch/x86/entry/entry_64.o:?
      origin description: ----address@SYSC_bind (origin=00000000eb400911)
      ==================================================================
      (the line numbers are relative to 4.8-rc6, but the bug persists
      upstream)
      
      , when I run the following program as root:
      
      =====================================
       #include <string.h>
       #include <sys/socket.h>
       #include <netpacket/packet.h>
       #include <net/ethernet.h>
      
       int main() {
         struct sockaddr addr;
         memset(&addr, 0xff, sizeof(addr));
         addr.sa_family = AF_PACKET;
         int fd = socket(PF_PACKET, SOCK_PACKET, htons(ETH_P_ALL));
         bind(fd, &addr, sizeof(addr));
         return 0;
       }
      =====================================
      
      This happens because addr.sa_data copied from the userspace is not
      zero-terminated, and copying it with strlcpy() in packet_bind_spkt()
      results in calling strlen() on the kernel copy of that non-terminated
      buffer.
      Signed-off-by: NAlexander Potapenko <glider@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      540e2894
  3. 18 2月, 2017 1 次提交
    • A
      packet: Do not call fanout_release from atomic contexts · 2bd624b4
      Anoob Soman 提交于
      Commit 66644982 ("packet: call fanout_release, while UNREGISTERING a
      netdev"), unfortunately, introduced the following issues.
      
      1. calling mutex_lock(&fanout_mutex) (fanout_release()) from inside
      rcu_read-side critical section. rcu_read_lock disables preemption, most often,
      which prohibits calling sleeping functions.
      
      [  ] include/linux/rcupdate.h:560 Illegal context switch in RCU read-side critical section!
      [  ]
      [  ] rcu_scheduler_active = 1, debug_locks = 0
      [  ] 4 locks held by ovs-vswitchd/1969:
      [  ]  #0:  (cb_lock){++++++}, at: [<ffffffff8158a6c9>] genl_rcv+0x19/0x40
      [  ]  #1:  (ovs_mutex){+.+.+.}, at: [<ffffffffa04878ca>] ovs_vport_cmd_del+0x4a/0x100 [openvswitch]
      [  ]  #2:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81564157>] rtnl_lock+0x17/0x20
      [  ]  #3:  (rcu_read_lock){......}, at: [<ffffffff81614165>] packet_notifier+0x5/0x3f0
      [  ]
      [  ] Call Trace:
      [  ]  [<ffffffff813770c1>] dump_stack+0x85/0xc4
      [  ]  [<ffffffff810c9077>] lockdep_rcu_suspicious+0x107/0x110
      [  ]  [<ffffffff810a2da7>] ___might_sleep+0x57/0x210
      [  ]  [<ffffffff810a2fd0>] __might_sleep+0x70/0x90
      [  ]  [<ffffffff8162e80c>] mutex_lock_nested+0x3c/0x3a0
      [  ]  [<ffffffff810de93f>] ? vprintk_default+0x1f/0x30
      [  ]  [<ffffffff81186e88>] ? printk+0x4d/0x4f
      [  ]  [<ffffffff816106dd>] fanout_release+0x1d/0xe0
      [  ]  [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0
      
      2. calling mutex_lock(&fanout_mutex) inside spin_lock(&po->bind_lock).
      "sleeping function called from invalid context"
      
      [  ] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
      [  ] in_atomic(): 1, irqs_disabled(): 0, pid: 1969, name: ovs-vswitchd
      [  ] INFO: lockdep is turned off.
      [  ] Call Trace:
      [  ]  [<ffffffff813770c1>] dump_stack+0x85/0xc4
      [  ]  [<ffffffff810a2f52>] ___might_sleep+0x202/0x210
      [  ]  [<ffffffff810a2fd0>] __might_sleep+0x70/0x90
      [  ]  [<ffffffff8162e80c>] mutex_lock_nested+0x3c/0x3a0
      [  ]  [<ffffffff816106dd>] fanout_release+0x1d/0xe0
      [  ]  [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0
      
      3. calling dev_remove_pack(&fanout->prot_hook), from inside
      spin_lock(&po->bind_lock) or rcu_read-side critical-section. dev_remove_pack()
      -> synchronize_net(), which might sleep.
      
      [  ] BUG: scheduling while atomic: ovs-vswitchd/1969/0x00000002
      [  ] INFO: lockdep is turned off.
      [  ] Call Trace:
      [  ]  [<ffffffff813770c1>] dump_stack+0x85/0xc4
      [  ]  [<ffffffff81186274>] __schedule_bug+0x64/0x73
      [  ]  [<ffffffff8162b8cb>] __schedule+0x6b/0xd10
      [  ]  [<ffffffff8162c5db>] schedule+0x6b/0x80
      [  ]  [<ffffffff81630b1d>] schedule_timeout+0x38d/0x410
      [  ]  [<ffffffff810ea3fd>] synchronize_sched_expedited+0x53d/0x810
      [  ]  [<ffffffff810ea6de>] synchronize_rcu_expedited+0xe/0x10
      [  ]  [<ffffffff8154eab5>] synchronize_net+0x35/0x50
      [  ]  [<ffffffff8154eae3>] dev_remove_pack+0x13/0x20
      [  ]  [<ffffffff8161077e>] fanout_release+0xbe/0xe0
      [  ]  [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0
      
      4. fanout_release() races with calls from different CPU.
      
      To fix the above problems, remove the call to fanout_release() under
      rcu_read_lock(). Instead, call __dev_remove_pack(&fanout->prot_hook) and
      netdev_run_todo will be happy that &dev->ptype_specific list is empty. In order
      to achieve this, I moved dev_{add,remove}_pack() out of fanout_{add,release} to
      __fanout_{link,unlink}. So, call to {,__}unregister_prot_hook() will make sure
      fanout->prot_hook is removed as well.
      
      Fixes: 66644982 ("packet: call fanout_release, while UNREGISTERING a netdev")
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NAnoob Soman <anoob.soman@citrix.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bd624b4
  4. 15 2月, 2017 1 次提交
    • E
      packet: fix races in fanout_add() · d199fab6
      Eric Dumazet 提交于
      Multiple threads can call fanout_add() at the same time.
      
      We need to grab fanout_mutex earlier to avoid races that could
      lead to one thread freeing po->rollover that was set by another thread.
      
      Do the same in fanout_release(), for peace of mind, and to help us
      finding lockdep issues earlier.
      
      Fixes: dc99f600 ("packet: Add fanout support.")
      Fixes: 0648ab70 ("packet: rollover prepare: per-socket state")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d199fab6
  5. 09 2月, 2017 1 次提交
  6. 21 1月, 2017 1 次提交
  7. 05 1月, 2017 1 次提交
  8. 04 1月, 2017 1 次提交
    • S
      af_packet: TX_RING support for TPACKET_V3 · 7f953ab2
      Sowmini Varadhan 提交于
      Although TPACKET_V3 Rx has some benefits over TPACKET_V2 Rx, *_v3
      does not currently have TX_RING support. As a result an application
      that wants the best perf for Tx and Rx (e.g. to handle request/response
      transacations) ends up needing 2 sockets, one with *_v2 for Tx and
      another with *_v3 for Rx.
      
      This patch enables TPACKET_V2 compatible Tx features in TPACKET_V3
      so that an application can use a single descriptor to get the benefits
      of _v3 RX_RING and _v2 TX_RING. An application may do a block-send by
      first filling up multiple frames in the Tx ring and then triggering a
      transmit. This patch only support fixed size Tx frames for TPACKET_V3,
      and requires that tp_next_offset must be zero.
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f953ab2
  9. 25 12月, 2016 1 次提交
  10. 06 12月, 2016 1 次提交
    • A
      [iov_iter] new primitives - copy_from_iter_full() and friends · cbbd26b8
      Al Viro 提交于
      copy_from_iter_full(), copy_from_iter_full_nocache() and
      csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
      et.al., advancing iterator only in case of successful full copy
      and returning whether it had been successful or not.
      
      Convert some obvious users.  *NOTE* - do not blindly assume that
      something is a good candidate for those unless you are sure that
      not advancing iov_iter in failure case is the right thing in
      this case.  Anything that does short read/short write kind of
      stuff (or is in a loop, etc.) is unlikely to be a good one.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cbbd26b8
  11. 03 12月, 2016 1 次提交
    • P
      packet: fix race condition in packet_set_ring · 84ac7260
      Philip Pettersson 提交于
      When packet_set_ring creates a ring buffer it will initialize a
      struct timer_list if the packet version is TPACKET_V3. This value
      can then be raced by a different thread calling setsockopt to
      set the version to TPACKET_V1 before packet_set_ring has finished.
      
      This leads to a use-after-free on a function pointer in the
      struct timer_list when the socket is closed as the previously
      initialized timer will not be deleted.
      
      The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
      changing the packet version while also taking the lock at the start
      of packet_set_ring.
      
      Fixes: f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer implementation.")
      Signed-off-by: NPhilip Pettersson <philip.pettersson@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84ac7260
  12. 19 11月, 2016 3 次提交
  13. 30 10月, 2016 1 次提交
    • W
      packet: on direct_xmit, limit tso and csum to supported devices · 104ba78c
      Willem de Bruijn 提交于
      When transmitting on a packet socket with PACKET_VNET_HDR and
      PACKET_QDISC_BYPASS, validate device support for features requested
      in vnet_hdr.
      
      Drop TSO packets sent to devices that do not support TSO or have the
      feature disabled. Note that the latter currently do process those
      packets correctly, regardless of not advertising the feature.
      
      Because of SKB_GSO_DODGY, it is not sufficient to test device features
      with netif_needs_gso. Full validate_xmit_skb is needed.
      
      Switch to software checksum for non-TSO packets that request checksum
      offload if that device feature is unsupported or disabled. Note that
      similar to the TSO case, device drivers may perform checksum offload
      correctly even when not advertising it.
      
      When switching to software checksum, packets hit skb_checksum_help,
      which has two BUG_ON checksum not in linear segment. Packet sockets
      always allocate at least up to csum_start + csum_off + 2 as linear.
      
      Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c
      
        ethtool -K eth0 tso off tx on
        psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v
        psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v -N
      
        ethtool -K eth0 tx off
        psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G
        psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G -N
      
      v2:
        - add EXPORT_SYMBOL_GPL(validate_xmit_skb_list)
      
      Fixes: d346a3fa ("packet: introduce PACKET_QDISC_BYPASS socket option")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      104ba78c
  14. 07 10月, 2016 1 次提交
  15. 22 7月, 2016 1 次提交
  16. 20 7月, 2016 1 次提交
  17. 02 7月, 2016 2 次提交
    • D
      packet: Use symmetric hash for PACKET_FANOUT_HASH. · eb70db87
      David S. Miller 提交于
      People who use PACKET_FANOUT_HASH want a symmetric hash, meaning that
      they want packets going in both directions on a flow to hash to the
      same bucket.
      
      The core kernel SKB hash became non-symmetric when the ipv6 flow label
      and other entities were incorporated into the standard flow hash order
      to increase entropy.
      
      But there are no users of PACKET_FANOUT_HASH who want an assymetric
      hash, they all want a symmetric one.
      
      Therefore, use the flow dissector to compute a flat symmetric hash
      over only the protocol, addresses and ports.  This hash does not get
      installed into and override the normal skb hash, so this change has
      no effect whatsoever on the rest of the stack.
      Reported-by: NEric Leblond <eric@regit.org>
      Tested-by: NEric Leblond <eric@regit.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb70db87
    • D
      bpf: refactor bpf_prog_get and type check into helper · 113214be
      Daniel Borkmann 提交于
      Since bpf_prog_get() and program type check is used in a couple of places,
      refactor this into a small helper function that we can make use of. Since
      the non RO prog->aux part is not used in performance critical paths and a
      program destruction via RCU is rather very unlikley when doing the put, we
      shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
      check, but actually not taking the ref at all (due to being in fdget() /
      fdput() section of the bpf fd) is even cleaner and makes the diff smaller
      as well, so just go for that. Callsites are changed to make use of the new
      helper where possible.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      113214be
  18. 11 6月, 2016 1 次提交
  19. 10 6月, 2016 1 次提交
  20. 15 4月, 2016 1 次提交
  21. 14 4月, 2016 1 次提交
  22. 07 4月, 2016 1 次提交
  23. 05 4月, 2016 1 次提交
    • S
      sock: enable timestamping using control messages · c14ac945
      Soheil Hassas Yeganeh 提交于
      Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
      This is very costly when users want to sample writes to gather
      tx timestamps.
      
      Add support for enabling SO_TIMESTAMPING via control messages by
      using tsflags added in `struct sockcm_cookie` (added in the previous
      patches in this series) to set the tx_flags of the last skb created in
      a sendmsg. With this patch, the timestamp recording bits in tx_flags
      of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.
      
      Please note that this is only effective for overriding the recording
      timestamps flags. Users should enable timestamp reporting (e.g.,
      SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
      socket options and then should ask for SOF_TIMESTAMPING_TX_*
      using control messages per sendmsg to sample timestamps for each
      write.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c14ac945
  24. 10 3月, 2016 1 次提交
  25. 26 2月, 2016 1 次提交
  26. 09 2月, 2016 4 次提交
  27. 30 11月, 2015 1 次提交
  28. 18 11月, 2015 2 次提交
  29. 16 11月, 2015 3 次提交
    • D
      packet: fix tpacket_snd max frame len · 5cfb4c8d
      Daniel Borkmann 提交于
      Since it's introduction in commit 69e3c75f ("net: TX_RING and
      packet mmap"), TX_RING could be used from SOCK_DGRAM and SOCK_RAW
      side. When used with SOCK_DGRAM only, the size_max > dev->mtu +
      reserve check should have reserve as 0, but currently, this is
      unconditionally set (in it's original form as dev->hard_header_len).
      
      I think this is not correct since tpacket_fill_skb() would then
      take dev->mtu and dev->hard_header_len into account for SOCK_DGRAM,
      the extra VLAN_HLEN could be possible in both cases. Presumably, the
      reserve code was copied from packet_snd(), but later on missed the
      check. Make it similar as we have it in packet_snd().
      
      Fixes: 69e3c75f ("net: TX_RING and packet mmap")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5cfb4c8d
    • D
      packet: infer protocol from ethernet header if unset · c72219b7
      Daniel Borkmann 提交于
      In case no struct sockaddr_ll has been passed to packet
      socket's sendmsg() when doing a TX_RING flush run, then
      skb->protocol is set to po->num instead, which is the protocol
      passed via socket(2)/bind(2).
      
      Applications only xmitting can go the path of allocating the
      socket as socket(PF_PACKET, <mode>, 0) and do a bind(2) on the
      TX_RING with sll_protocol of 0. That way, register_prot_hook()
      is neither called on creation nor on bind time, which saves
      cycles when there's no interest in capturing anyway.
      
      That leaves us however with po->num 0 instead and therefore
      the TX_RING flush run sets skb->protocol to 0 as well. Eric
      reported that this leads to problems when using tools like
      trafgen over bonding device. I.e. the bonding's hash function
      could invoke the kernel's flow dissector, which depends on
      skb->protocol being properly set. In the current situation, all
      the traffic is then directed to a single slave.
      
      Fix it up by inferring skb->protocol from the Ethernet header
      when not set and we have ARPHRD_ETHER device type. This is only
      done in case of SOCK_RAW and where we have a dev->hard_header_len
      length. In case of ARPHRD_ETHER devices, this is guaranteed to
      cover ETH_HLEN, and therefore being accessed on the skb after
      the skb_store_bits().
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c72219b7
    • D
      packet: only allow extra vlan len on ethernet devices · 3c70c132
      Daniel Borkmann 提交于
      Packet sockets can be used by various net devices and are not
      really restricted to ARPHRD_ETHER device types. However, when
      currently checking for the extra 4 bytes that can be transmitted
      in VLAN case, our assumption is that we generally probe on
      ARPHRD_ETHER devices. Therefore, before looking into Ethernet
      header, check the device type first.
      
      This also fixes the issue where non-ARPHRD_ETHER devices could
      have no dev->hard_header_len in TX_RING SOCK_RAW case, and thus
      the check would test unfilled linear part of the skb (instead
      of non-linear).
      
      Fixes: 57f89bfa ("network: Allow af_packet to transmit +4 bytes for VLAN packets.")
      Fixes: 52f1454f ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c70c132