1. 29 3月, 2019 1 次提交
  2. 28 3月, 2019 1 次提交
  3. 27 3月, 2019 1 次提交
  4. 22 3月, 2019 1 次提交
    • W
      net-sysfs: Fix memory leak in netdev_register_kobject · 6b70fc94
      Wang Hai 提交于
      When registering struct net_device, it will call
      	register_netdevice ->
      		netdev_register_kobject ->
      			device_initialize(dev);
      			dev_set_name(dev, "%s", ndev->name)
      			device_add(dev)
      			register_queue_kobjects(ndev)
      
      In netdev_register_kobject(), if device_add(dev) or
      register_queue_kobjects(ndev) failed. Register_netdevice()
      will return error, causing netdev_freemem(ndev) to be
      called to free net_device, however put_device(&dev->dev)->..->
      kobject_cleanup() won't be called, resulting in a memory leak.
      
      syzkaller report this:
      BUG: memory leak
      unreferenced object 0xffff8881f4fad168 (size 8):
      comm "syz-executor.0", pid 3575, jiffies 4294778002 (age 20.134s)
      hex dump (first 8 bytes):
        77 70 61 6e 30 00 ff ff                          wpan0...
      backtrace:
        [<000000006d2d91d7>] kstrdup_const+0x3d/0x50 mm/util.c:73
        [<00000000ba9ff953>] kvasprintf_const+0x112/0x170 lib/kasprintf.c:48
        [<000000005555ec09>] kobject_set_name_vargs+0x55/0x130 lib/kobject.c:281
        [<0000000098d28ec3>] dev_set_name+0xbb/0xf0 drivers/base/core.c:1915
        [<00000000b7553017>] netdev_register_kobject+0xc0/0x410 net/core/net-sysfs.c:1727
        [<00000000c826a797>] register_netdevice+0xa51/0xeb0 net/core/dev.c:8711
        [<00000000857bfcfd>] cfg802154_update_iface_num.isra.2+0x13/0x90 [ieee802154]
        [<000000003126e453>] ieee802154_llsec_fill_key_id+0x1d5/0x570 [ieee802154]
        [<00000000e4b3df51>] 0xffffffffc1500e0e
        [<00000000b4319776>] platform_drv_probe+0xc6/0x180 drivers/base/platform.c:614
        [<0000000037669347>] really_probe+0x491/0x7c0 drivers/base/dd.c:509
        [<000000008fed8862>] driver_probe_device+0xdc/0x240 drivers/base/dd.c:671
        [<00000000baf52041>] device_driver_attach+0xf2/0x130 drivers/base/dd.c:945
        [<00000000c7cc8dec>] __driver_attach+0x10e/0x210 drivers/base/dd.c:1022
        [<0000000057a757c2>] bus_for_each_dev+0x154/0x1e0 drivers/base/bus.c:304
        [<000000005f5ae04b>] bus_add_driver+0x427/0x5e0 drivers/base/bus.c:645
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Fixes: 1fa5ae85 ("driver core: get rid of struct device's bus_id string array")
      Signed-off-by: NWang Hai <wanghai26@huawei.com>
      Reviewed-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b70fc94
  5. 20 3月, 2019 1 次提交
  6. 14 3月, 2019 2 次提交
    • M
      bpf: Add bpf_get_listener_sock(struct bpf_sock *sk) helper · dbafd7dd
      Martin KaFai Lau 提交于
      Add a new helper "struct bpf_sock *bpf_get_listener_sock(struct bpf_sock *sk)"
      which returns a bpf_sock in TCP_LISTEN state.  It will trace back to
      the listener sk from a request_sock if possible.  It returns NULL
      for all other cases.
      
      No reference is taken because the helper ensures the sk is
      in SOCK_RCU_FREE (where the TCP_LISTEN sock should be in).
      Hence, bpf_sk_release() is unnecessary and the verifier does not
      allow bpf_sk_release(listen_sk) to be called either.
      
      The following is also allowed because the bpf_prog is run under
      rcu_read_lock():
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) { ... } */
      	listen_sk = bpf_get_listener_sock(sk);
      	/* if (!listen_sk) { ... } */
      	bpf_sk_release(sk);
      	src_port = listen_sk->src_port; /* Allowed */
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      dbafd7dd
    • M
      bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release · 1b986589
      Martin KaFai Lau 提交于
      Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk)
      can still be accessed after bpf_sk_release(sk).
      Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue.
      This patch addresses them together.
      
      A simple reproducer looks like this:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(sk);
      	snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */
      
      The problem is the verifier did not scrub the register's states of
      the tcp_sock ptr (tp) after bpf_sk_release(sk).
      
      [ Note that when calling bpf_tcp_sock(sk), the sk is not always
        refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works
        fine for this case. ]
      
      Currently, the verifier does not track if a helper's return ptr (in REG_0)
      is "carry"-ing one of its argument's refcount status. To carry this info,
      the reg1->id needs to be stored in reg0.
      
      One approach was tried, like "reg0->id = reg1->id", when calling
      "bpf_tcp_sock()".  The main idea was to avoid adding another "ref_obj_id"
      for the same reg.  However, overlapping the NULL marking and ref
      tracking purpose in one "id" does not work well:
      
      	ref_sk = bpf_sk_lookup_tcp();
      	fullsock = bpf_sk_fullsock(ref_sk);
      	tp = bpf_tcp_sock(ref_sk);
      	if (!fullsock) {
      	     bpf_sk_release(ref_sk);
      	     return 0;
      	}
      	/* fullsock_reg->id is marked for NOT-NULL.
      	 * Same for tp_reg->id because they have the same id.
      	 */
      
      	/* oops. verifier did not complain about the missing !tp check */
      	snd_cwnd = tp->snd_cwnd;
      
      Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state".
      With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can
      scrub all reg states which has a ref_obj_id match.  It is done with the
      changes in release_reg_references() in this patch.
      
      While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and
      bpf_sk_fullsock() to avoid these helpers from returning
      another ptr. It will make bpf_sk_release(tp) possible:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(tp);
      
      A separate helper "bpf_get_listener_sock()" will be added in a later
      patch to do sk_to_full_sk().
      
      Misc change notes:
      - To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed
        from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON.  ARG_PTR_TO_SOCKET
        is removed from bpf.h since no helper is using it.
      
      - arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted()
        because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not
        refcounted.  All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock()
        take ARG_PTR_TO_SOCK_COMMON.
      
      - check_refcount_ok() ensures is_acquire_function() cannot take
        arg_type_may_be_refcounted() as its argument.
      
      - The check_func_arg() can only allow one refcount-ed arg.  It is
        guaranteed by check_refcount_ok() which ensures at most one arg can be
        refcounted.  Hence, it is a verifier internal error if >1 refcount arg
        found in check_func_arg().
      
      - In release_reference(), release_reference_state() is called
        first to ensure a match on "reg->ref_obj_id" can be found before
        scrubbing the reg states with release_reg_references().
      
      - reg_is_refcounted() is no longer needed.
        1. In mark_ptr_or_null_regs(), its usage is replaced by
           "ref_obj_id && ref_obj_id == id" because,
           when is_null == true, release_reference_state() should only be
           called on the ref_obj_id obtained by a acquire helper (i.e.
           is_acquire_function() == true).  Otherwise, the following
           would happen:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) { ... } */
      	fullsock = bpf_sk_fullsock(sk);
      	if (!fullsock) {
      		/*
      		 * release_reference_state(fullsock_reg->ref_obj_id)
      		 * where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id.
      		 *
      		 * Hence, the following bpf_sk_release(sk) will fail
      		 * because the ref state has already been released in the
      		 * earlier release_reference_state(fullsock_reg->ref_obj_id).
      		 */
      		bpf_sk_release(sk);
      	}
      
        2. In release_reg_references(), the current reg_is_refcounted() call
           is unnecessary because the id check is enough.
      
      - The type_is_refcounted() and type_is_refcounted_or_null()
        are no longer needed also because reg_is_refcounted() is removed.
      
      Fixes: 655a51e5 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock")
      Reported-by: NLorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      1b986589
  7. 11 3月, 2019 1 次提交
    • E
      gro_cells: make sure device is up in gro_cells_receive() · 2a5ff07a
      Eric Dumazet 提交于
      We keep receiving syzbot reports [1] that show that tunnels do not play
      the rcu/IFF_UP rules properly.
      
      At device dismantle phase, gro_cells_destroy() will be called
      only after a full rcu grace period is observed after IFF_UP
      has been cleared.
      
      This means that IFF_UP needs to be tested before queueing packets
      into netif_rx() or gro_cells.
      
      This patch implements the test in gro_cells_receive() because
      too many callers do not seem to bother enough.
      
      [1]
      BUG: unable to handle kernel paging request at fffff4ca0b9ffffe
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 21 Comm: kworker/u4:1 Not tainted 5.0.0+ #97
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: netns cleanup_net
      RIP: 0010:__skb_unlink include/linux/skbuff.h:1929 [inline]
      RIP: 0010:__skb_dequeue include/linux/skbuff.h:1945 [inline]
      RIP: 0010:__skb_queue_purge include/linux/skbuff.h:2656 [inline]
      RIP: 0010:gro_cells_destroy net/core/gro_cells.c:89 [inline]
      RIP: 0010:gro_cells_destroy+0x19d/0x360 net/core/gro_cells.c:78
      Code: 03 42 80 3c 20 00 0f 85 53 01 00 00 48 8d 7a 08 49 8b 47 08 49 c7 07 00 00 00 00 48 89 f9 49 c7 47 08 00 00 00 00 48 c1 e9 03 <42> 80 3c 21 00 0f 85 10 01 00 00 48 89 c1 48 89 42 08 48 c1 e9 03
      RSP: 0018:ffff8880aa3f79a8 EFLAGS: 00010a02
      RAX: 00ffffffffffffe8 RBX: ffffe8ffffc64b70 RCX: 1ffff8ca0b9ffffe
      RDX: ffffc6505cffffe8 RSI: ffffffff858410ca RDI: ffffc6505cfffff0
      RBP: ffff8880aa3f7a08 R08: ffff8880aa3e8580 R09: fffffbfff1263645
      R10: fffffbfff1263644 R11: ffffffff8931b223 R12: dffffc0000000000
      R13: 0000000000000000 R14: ffffe8ffffc64b80 R15: ffffe8ffffc64b75
      kobject: 'loop2' (000000004bd7d84a): kobject_uevent_env
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: fffff4ca0b9ffffe CR3: 0000000094941000 CR4: 00000000001406f0
      Call Trace:
      kobject: 'loop2' (000000004bd7d84a): fill_kobj_path: path = '/devices/virtual/block/loop2'
       ip_tunnel_dev_free+0x19/0x60 net/ipv4/ip_tunnel.c:1010
       netdev_run_todo+0x51c/0x7d0 net/core/dev.c:8970
       rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:116
       ip_tunnel_delete_nets+0x423/0x5f0 net/ipv4/ip_tunnel.c:1124
       vti_exit_batch_net+0x23/0x30 net/ipv4/ip_vti.c:495
       ops_exit_list.isra.0+0x105/0x160 net/core/net_namespace.c:156
       cleanup_net+0x3fb/0x960 net/core/net_namespace.c:551
       process_one_work+0x98e/0x1790 kernel/workqueue.c:2173
       worker_thread+0x98/0xe40 kernel/workqueue.c:2319
       kthread+0x357/0x430 kernel/kthread.c:246
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
      Modules linked in:
      CR2: fffff4ca0b9ffffe
         [ end trace 513fc9c1338d1cb3 ]
      RIP: 0010:__skb_unlink include/linux/skbuff.h:1929 [inline]
      RIP: 0010:__skb_dequeue include/linux/skbuff.h:1945 [inline]
      RIP: 0010:__skb_queue_purge include/linux/skbuff.h:2656 [inline]
      RIP: 0010:gro_cells_destroy net/core/gro_cells.c:89 [inline]
      RIP: 0010:gro_cells_destroy+0x19d/0x360 net/core/gro_cells.c:78
      Code: 03 42 80 3c 20 00 0f 85 53 01 00 00 48 8d 7a 08 49 8b 47 08 49 c7 07 00 00 00 00 48 89 f9 49 c7 47 08 00 00 00 00 48 c1 e9 03 <42> 80 3c 21 00 0f 85 10 01 00 00 48 89 c1 48 89 42 08 48 c1 e9 03
      RSP: 0018:ffff8880aa3f79a8 EFLAGS: 00010a02
      RAX: 00ffffffffffffe8 RBX: ffffe8ffffc64b70 RCX: 1ffff8ca0b9ffffe
      RDX: ffffc6505cffffe8 RSI: ffffffff858410ca RDI: ffffc6505cfffff0
      RBP: ffff8880aa3f7a08 R08: ffff8880aa3e8580 R09: fffffbfff1263645
      R10: fffffbfff1263644 R11: ffffffff8931b223 R12: dffffc0000000000
      kobject: 'loop3' (00000000e4ee57a6): kobject_uevent_env
      R13: 0000000000000000 R14: ffffe8ffffc64b80 R15: ffffe8ffffc64b75
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: fffff4ca0b9ffffe CR3: 0000000094941000 CR4: 00000000001406f0
      
      Fixes: c9e6bc64 ("net: add gro_cells infrastructure")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2a5ff07a
  8. 08 3月, 2019 1 次提交
  9. 07 3月, 2019 3 次提交
    • J
      bpf: Stop the psock parser before canceling its work · e8e34377
      Jakub Sitnicki 提交于
      We might have never enabled (started) the psock's parser, in which case it
      will not get stopped when destroying the psock. This leads to a warning
      when trying to cancel parser's work from psock's deferred destructor:
      
      [  405.325769] WARNING: CPU: 1 PID: 3216 at net/strparser/strparser.c:526 strp_done+0x3c/0x40
      [  405.326712] Modules linked in: [last unloaded: test_bpf]
      [  405.327359] CPU: 1 PID: 3216 Comm: kworker/1:164 Tainted: G        W         5.0.0 #42
      [  405.328294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
      [  405.329712] Workqueue: events sk_psock_destroy_deferred
      [  405.330254] RIP: 0010:strp_done+0x3c/0x40
      [  405.330706] Code: 28 e8 b8 d5 6b ff 48 8d bb 80 00 00 00 e8 9c d5 6b ff 48 8b 7b 18 48 85 ff 74 0d e8 1e a5 e8 ff 48 c7 43 18 00 00 00 00 5b c3 <0f> 0b eb cf 66 66 66 66 90 55 89 f5 53 48 89 fb 48 83 c7 28 e8 0b
      [  405.332862] RSP: 0018:ffffc900026bbe50 EFLAGS: 00010246
      [  405.333482] RAX: ffffffff819323e0 RBX: ffff88812cb83640 RCX: ffff88812cb829e8
      [  405.334228] RDX: 0000000000000001 RSI: ffff88812cb837e8 RDI: ffff88812cb83640
      [  405.335366] RBP: ffff88813fd22680 R08: 0000000000000000 R09: 000073746e657665
      [  405.336472] R10: 8080808080808080 R11: 0000000000000001 R12: ffff88812cb83600
      [  405.337760] R13: 0000000000000000 R14: ffff88811f401780 R15: ffff88812cb837e8
      [  405.338777] FS:  0000000000000000(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
      [  405.339903] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  405.340821] CR2: 00007fb11489a6b8 CR3: 000000012d4d6000 CR4: 00000000000406e0
      [  405.341981] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  405.343131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  405.344415] Call Trace:
      [  405.344821]  sk_psock_destroy_deferred+0x23/0x1b0
      [  405.345585]  process_one_work+0x1ae/0x3e0
      [  405.346110]  worker_thread+0x3c/0x3b0
      [  405.346576]  ? pwq_unbound_release_workfn+0xd0/0xd0
      [  405.347187]  kthread+0x11d/0x140
      [  405.347601]  ? __kthread_parkme+0x80/0x80
      [  405.348108]  ret_from_fork+0x35/0x40
      [  405.348566] ---[ end trace a4a3af4026a327d4 ]---
      
      Stop psock's parser just before canceling its work.
      
      Fixes: 1d79895a ("sk_msg: Always cancel strp work before freeing the psock")
      Reported-by: Nkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: NJakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      e8e34377
    • P
      net: fix GSO in bpf_lwt_push_ip_encap · ea0371f7
      Peter Oskolkov 提交于
      GSO needs inner headers and inner protocol set properly to work.
      
      skb->inner_mac_header: skb_reset_inner_headers() assigns the current
      mac header value to inner_mac_header; but it is not set at the point,
      so we need to call skb_reset_inner_mac_header, otherwise gre_gso_segment
      fails: it does
      
          int tnl_hlen = skb_inner_mac_header(skb) - skb_transport_header(skb);
          ...
          if (unlikely(!pskb_may_pull(skb, tnl_hlen)))
          ...
      
      skb->inner_protocol should also be correctly set.
      
      Fixes: ca78801a ("bpf: handle GSO in bpf_lwt_push_encap")
      Signed-off-by: NPeter Oskolkov <posk@google.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      ea0371f7
    • W
      bpf: only test gso type on gso packets · 4c3024de
      Willem de Bruijn 提交于
      BPF can adjust gso only for tcp bytestreams. Fail on other gso types.
      
      But only on gso packets. It does not touch this field if !gso_size.
      
      Fixes: b90efd22 ("bpf: only adjust gso_size on bytestream protocols")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4c3024de
  10. 06 3月, 2019 1 次提交
    • A
      mm: replace all open encodings for NUMA_NO_NODE · 98fa15f3
      Anshuman Khandual 提交于
      Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
      
      All these places for replacement were found by running the following
      grep patterns on the entire kernel code.  Please let me know if this
      might have missed some instances.  This might also have replaced some
      false positives.  I will appreciate suggestions, inputs and review.
      
      1. git grep "nid == -1"
      2. git grep "node == -1"
      3. git grep "nid = -1"
      4. git grep "node = -1"
      
      This patch (of 2):
      
      At present there are multiple places where invalid node number is
      encoded as -1.  Even though implicitly understood it is always better to
      have macros in there.  Replace these open encodings for an invalid node
      number with the global macro NUMA_NO_NODE.  This helps remove NUMA
      related assumptions like 'invalid node' from various places redirecting
      them to a common definition.
      
      Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.comSigned-off-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98fa15f3
  11. 05 3月, 2019 3 次提交
  12. 04 3月, 2019 1 次提交
    • Y
      net-sysfs: Fix mem leak in netdev_register_kobject · 895a5e96
      YueHaibing 提交于
      syzkaller report this:
      BUG: memory leak
      unreferenced object 0xffff88837a71a500 (size 256):
        comm "syz-executor.2", pid 9770, jiffies 4297825125 (age 17.843s)
        hex dump (first 32 bytes):
          00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
          ff ff ff ff ff ff ff ff 20 c0 ef 86 ff ff ff ff  ........ .......
        backtrace:
          [<00000000db12624b>] netdev_register_kobject+0x124/0x2e0 net/core/net-sysfs.c:1751
          [<00000000dc49a994>] register_netdevice+0xcc1/0x1270 net/core/dev.c:8516
          [<00000000e5f3fea0>] tun_set_iff drivers/net/tun.c:2649 [inline]
          [<00000000e5f3fea0>] __tun_chr_ioctl+0x2218/0x3d20 drivers/net/tun.c:2883
          [<000000001b8ac127>] vfs_ioctl fs/ioctl.c:46 [inline]
          [<000000001b8ac127>] do_vfs_ioctl+0x1a5/0x10e0 fs/ioctl.c:690
          [<0000000079b269f8>] ksys_ioctl+0x89/0xa0 fs/ioctl.c:705
          [<00000000de649beb>] __do_sys_ioctl fs/ioctl.c:712 [inline]
          [<00000000de649beb>] __se_sys_ioctl fs/ioctl.c:710 [inline]
          [<00000000de649beb>] __x64_sys_ioctl+0x74/0xb0 fs/ioctl.c:710
          [<000000007ebded1e>] do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
          [<00000000db315d36>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<00000000115be9bb>] 0xffffffffffffffff
      
      It should call kset_unregister to free 'dev->queues_kset'
      in error path of register_queue_kobjects, otherwise will cause a mem leak.
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Fixes: 1d24eb48 ("xps: Transmit Packet Steering")
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      895a5e96
  13. 03 3月, 2019 2 次提交
    • E
      net: sched: put back q.qlen into a single location · 46b1c18f
      Eric Dumazet 提交于
      In the series fc8b81a5 ("Merge branch 'lockless-qdisc-series'")
      John made the assumption that the data path had no need to read
      the qdisc qlen (number of packets in the qdisc).
      
      It is true when pfifo_fast is used as the root qdisc, or as direct MQ/MQPRIO
      children.
      
      But pfifo_fast can be used as leaf in class full qdiscs, and existing
      logic needs to access the child qlen in an efficient way.
      
      HTB breaks badly, since it uses cl->leaf.q->q.qlen in :
        htb_activate() -> WARN_ON()
        htb_dequeue_tree() to decide if a class can be htb_deactivated
        when it has no more packets.
      
      HFSC, DRR, CBQ, QFQ have similar issues, and some calls to
      qdisc_tree_reduce_backlog() also read q.qlen directly.
      
      Using qdisc_qlen_sum() (which iterates over all possible cpus)
      in the data path is a non starter.
      
      It seems we have to put back qlen in a central location,
      at least for stable kernels.
      
      For all qdisc but pfifo_fast, qlen is guarded by the qdisc lock,
      so the existing q.qlen{++|--} are correct.
      
      For 'lockless' qdisc (pfifo_fast so far), we need to use atomic_{inc|dec}()
      because the spinlock might be not held (for example from
      pfifo_fast_enqueue() and pfifo_fast_dequeue())
      
      This patch adds atomic_qlen (in the same location than qlen)
      and renames the following helpers, since we want to express
      they can be used without qdisc lock, and that qlen is no longer percpu.
      
      - qdisc_qstats_cpu_qlen_dec -> qdisc_qstats_atomic_qlen_dec()
      - qdisc_qstats_cpu_qlen_inc -> qdisc_qstats_atomic_qlen_inc()
      
      Later (net-next) we might revert this patch by tracking all these
      qlen uses and replace them by a more efficient method (not having
      to access a precise qlen, but an empty/non_empty status that might
      be less expensive to maintain/track).
      
      Another possibility is to have a legacy pfifo_fast version that would
      be used when used a a child qdisc, since the parent qdisc needs
      a spinlock anyway. But then, future lockless qdiscs would also
      have the same problem.
      
      Fixes: 7e66016f ("net: sched: helpers to sum qlen and qlen for per cpu logic")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46b1c18f
    • B
      bpf: add bpf helper bpf_skb_ecn_set_ce · f7c917ba
      brakmo 提交于
      This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce
      "int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to
      BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can
      be attached to the ingress and egress path. The helper is needed
      because his type of bpf_prog cannot modify the skb directly.
      
      This helper is used to set the ECN field of ECN capable IP packets to ce
      (congestion encountered) in the IPv6 or IPv4 header of the skb. It can be
      used by a bpf_prog to manage egress or ingress network bandwdith limit
      per cgroupv2 by inducing an ECN response in the TCP sender.
      This works best when using DCTCP.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f7c917ba
  14. 02 3月, 2019 3 次提交
    • E
      net: support 64bit rates for getsockopt(SO_MAX_PACING_RATE) · 677f136c
      Eric Dumazet 提交于
      For legacy applications using 32bit variable, SO_MAX_PACING_RATE
      has to cap the returned value to 0xFFFFFFFF, meaning that
      rates above 34.35 Gbit are capped.
      
      This patch allows applications to read socket pacing rate
      at full resolution, if they provide a 64bit variable to store it,
      and the kernel is 64bit.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      677f136c
    • E
      net: support 64bit values for setsockopt(SO_MAX_PACING_RATE) · 6bdef102
      Eric Dumazet 提交于
      64bit kernels now support 64bit pacing rates.
      
      This commit changes setsockopt() to accept 64bit
      values provided by applications.
      
      Old applications providing 32bit value are still supported,
      but limited to the old 34Gbit limitation.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6bdef102
    • J
      devlink: fix kdoc · eeaadd82
      Jakub Kicinski 提交于
      devlink suffers from a few kdoc warnings:
      
      net/core/devlink.c:5292: warning: Function parameter or member 'dev' not described in 'devlink_register'
      net/core/devlink.c:5351: warning: Function parameter or member 'port_index' not described in 'devlink_port_register'
      net/core/devlink.c:5753: warning: Function parameter or member 'parent_resource_id' not described in 'devlink_resource_register'
      net/core/devlink.c:5753: warning: Function parameter or member 'size_params' not described in 'devlink_resource_register'
      net/core/devlink.c:5753: warning: Excess function parameter 'top_hierarchy' description in 'devlink_resource_register'
      net/core/devlink.c:5753: warning: Excess function parameter 'reload_required' description in 'devlink_resource_register'
      net/core/devlink.c:5753: warning: Excess function parameter 'parent_reosurce_id' description in 'devlink_resource_register'
      net/core/devlink.c:6451: warning: Function parameter or member 'region' not described in 'devlink_region_snapshot_create'
      net/core/devlink.c:6451: warning: Excess function parameter 'devlink_region' description in 'devlink_region_snapshot_create'
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eeaadd82
  15. 28 2月, 2019 1 次提交
  16. 27 2月, 2019 4 次提交
  17. 25 2月, 2019 5 次提交
  18. 22 2月, 2019 2 次提交
  19. 20 2月, 2019 1 次提交
    • J
      bpf: add skb->queue_mapping write access from tc clsact · 74e31ca8
      Jesper Dangaard Brouer 提交于
      The skb->queue_mapping already have read access, via __sk_buff->queue_mapping.
      
      This patch allow BPF tc qdisc clsact write access to the queue_mapping via
      tc_cls_act_is_valid_access.  Also handle that the value NO_QUEUE_MAPPING
      is not allowed.
      
      It is already possible to change this via TC filter action skbedit
      tc-skbedit(8).  Due to the lack of TC examples, lets show one:
      
        # tc qdisc  add  dev ixgbe1 clsact
        # tc filter add  dev ixgbe1 ingress matchall action skbedit queue_mapping 5
        # tc filter list dev ixgbe1 ingress
      
      The most common mistake is that XPS (Transmit Packet Steering) takes
      precedence over setting skb->queue_mapping. XPS is configured per DEVICE
      via /sys/class/net/DEVICE/queues/tx-*/xps_cpus via a CPU hex mask. To
      disable set mask=00.
      
      The purpose of changing skb->queue_mapping is to influence the selection of
      the net_device "txq" (struct netdev_queue), which influence selection of
      the qdisc "root_lock" (via txq->qdisc->q.lock) and txq->_xmit_lock. When
      using the MQ qdisc the txq->qdisc points to different qdiscs and associated
      locks, and HARD_TX_LOCK (txq->_xmit_lock), allowing for CPU scalability.
      
      Due to lack of TC examples, lets show howto attach clsact BPF programs:
      
       # tc qdisc  add  dev ixgbe2 clsact
       # tc filter add  dev ixgbe2 egress bpf da obj XXX_kern.o sec tc_qmap2cpu
       # tc filter list dev ixgbe2 egress
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      74e31ca8
  20. 18 2月, 2019 5 次提交