1. 08 Apr 2021, 1 commit
  2. 07 Apr 2021, 1 commit
    • virtio_net: Do not pull payload in skb->head · 0f6925b3
      Committed by Eric Dumazet
      Xuan Zhuo reported that commit 3226b158 ("net: avoid 32 x truesize
      under-estimation for tiny skbs") brought a ~10% performance drop.
      
      The reason for the performance drop was that GRO was forced
      to chain sk_buffs (using skb_shinfo(skb)->frag_list), which
      uses more memory and also burdens packet consumers with
      significant overhead when handling all the tiny skbs.
      
      It turns out that virtio_net's page_to_skb() has a wrong strategy:
      it allocates skbs with GOOD_COPY_LEN (128) bytes in skb->head, then
      copies 128 bytes from the page before feeding the packet to the GRO
      stack.
      
      This was suboptimal before commit 3226b158 ("net: avoid 32 x truesize
      under-estimation for tiny skbs") because GRO was using 2 frags per MSS,
      meaning we were not packing MSS with 100% efficiency.
      
      The fix is to pull only the Ethernet header in page_to_skb().
      
      Then, we change virtio_net_hdr_to_skb() to pull the missing
      headers, instead of assuming they were already pulled by callers.
      
      This fixes the performance regression, but could also allow virtio_net
      to accept packets with more than 128 bytes of headers.
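
      A minimal sketch of the idea (not the exact upstream diff; the buffer
      and length variables here are simplified assumptions):

        /* page_to_skb(), sketched: copy only the L2 header into skb->head
         * and leave the payload in the page frag for GRO to pack. */
        unsigned int copy = ETH_HLEN;          /* was GOOD_COPY_LEN (128) */
        skb_put_data(skb, buf, copy);          /* linear area: L2 header only */
        skb_add_rx_frag(skb, 0, page, offset + copy,
                        len - copy, truesize); /* payload stays in the frag */

        /* virtio_net_hdr_to_skb(), sketched: pull headers on demand
         * instead of assuming callers already copied 128 bytes. */
        if (!pskb_may_pull(skb, needed_headers))
                return -EINVAL;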
      
      Many thanks to Xuan Zhuo for his report, and his tests/help.
      
      Fixes: 3226b158 ("net: avoid 32 x truesize under-estimation for tiny skbs")
      Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Link: https://www.spinics.net/lists/netdev/msg731397.html
      Co-Developed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: virtualization@lists.linux-foundation.org
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0f6925b3
  3. 01 Apr 2021, 1 commit
  4. 31 Mar 2021, 2 commits
    • net: ensure mac header is set in virtio_net_hdr_to_skb() · 61431a59
      Committed by Eric Dumazet
      Commit 924a9bc3 ("net: check if protocol extracted by virtio_net_hdr_set_proto is correct")
      added a call to dev_parse_header_protocol(), but mac_header is not yet
      set at that point.

      This means that eth_hdr() reads complete garbage, and syzbot complained
      about it [1].

      This patch resets mac_header earlier, so that dev_parse_header_protocol()
      operates on valid data.

      An audit of virtio_net_hdr_to_skb() callers shows that this change
      should be safe.
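
      The core of the fix, sketched (surrounding logic simplified):

        /* make eth_hdr() well-defined before the protocol is parsed */
        skb_reset_mac_header(skb);      /* mac_header = current data offset */

        if (!skb->protocol) {
                /* ... derive skb->protocol from the virtio_net header;
                 * dev_parse_header_protocol() may call eth_hdr() safely */
                __be16 protocol = dev_parse_header_protocol(skb);
        }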
      
      [1]
      
      BUG: KASAN: use-after-free in eth_header_parse_protocol+0xdc/0xe0 net/ethernet/eth.c:282
      Read of size 2 at addr ffff888017a6200b by task syz-executor313/8409
      
      CPU: 1 PID: 8409 Comm: syz-executor313 Not tainted 5.12.0-rc2-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:79 [inline]
       dump_stack+0x141/0x1d7 lib/dump_stack.c:120
       print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:232
       __kasan_report mm/kasan/report.c:399 [inline]
       kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
       eth_header_parse_protocol+0xdc/0xe0 net/ethernet/eth.c:282
       dev_parse_header_protocol include/linux/netdevice.h:3177 [inline]
       virtio_net_hdr_to_skb.constprop.0+0x99d/0xcd0 include/linux/virtio_net.h:83
       packet_snd net/packet/af_packet.c:2994 [inline]
       packet_sendmsg+0x2325/0x52b0 net/packet/af_packet.c:3031
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:674
       sock_no_sendpage+0xf3/0x130 net/core/sock.c:2860
       kernel_sendpage.part.0+0x1ab/0x350 net/socket.c:3631
       kernel_sendpage net/socket.c:3628 [inline]
       sock_sendpage+0xe5/0x140 net/socket.c:947
       pipe_to_sendpage+0x2ad/0x380 fs/splice.c:364
       splice_from_pipe_feed fs/splice.c:418 [inline]
       __splice_from_pipe+0x43e/0x8a0 fs/splice.c:562
       splice_from_pipe fs/splice.c:597 [inline]
       generic_splice_sendpage+0xd4/0x140 fs/splice.c:746
       do_splice_from fs/splice.c:767 [inline]
       do_splice+0xb7e/0x1940 fs/splice.c:1079
       __do_splice+0x134/0x250 fs/splice.c:1144
       __do_sys_splice fs/splice.c:1350 [inline]
       __se_sys_splice fs/splice.c:1332 [inline]
       __x64_sys_splice+0x198/0x250 fs/splice.c:1332
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
      
      Fixes: 924a9bc3 ("net: check if protocol extracted by virtio_net_hdr_set_proto is correct")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Balazs Nemeth <bnemeth@redhat.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61431a59
    • net: let skb_orphan_partial wake-up waiters. · 9adc89af
      Committed by Paolo Abeni
      Currently the mentioned helper can end up freeing the socket wmem
      without waking up any processes waiting for more write memory.

      If the partially orphaned skb is attached to a UDP (or raw) socket,
      the lack of a wake-up can hang user space.

      Even for TCP sockets, not calling the sk destructor could have bad
      effects on TSQ.
      
      Address the issue by using skb_orphan to release the sk wmem before
      setting the new sock_efree destructor. Additionally, bundle the
      whole ownership update in a new helper, so that other potential
      users can avoid duplicating the code later.
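
      The helper described above looks roughly like this (a sketch based on
      the commit description; the name skb_set_owner_sk_safe is our reading
      of the patch and should be treated as an assumption):

        static inline bool skb_set_owner_sk_safe(struct sk_buff *skb,
                                                 struct sock *sk)
        {
                if (sk && refcount_inc_not_zero(&sk->sk_refcnt)) {
                        skb_orphan(skb);        /* runs the old destructor:
                                                 * wmem is released, waiters
                                                 * are woken */
                        skb->destructor = sock_efree;
                        skb->sk = sk;
                        return true;
                }
                return false;
        }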
      
      v1 -> v2:
       - use skb_orphan() instead of sort of open coding it (Eric)
       - provide a helper for the ownership change (Eric)
      
      Fixes: f6ba8d33 ("netem: fix skb_orphan_partial()")
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9adc89af
  5. 29 Mar 2021, 1 commit
    • can: uapi: can.h: mark union inside struct can_frame packed · f5076c6b
      Committed by Marc Kleine-Budde
      In commit ea780056 ("can: add optional DLC element to Classical
      CAN frame structure") the struct can_frame::can_dlc was put into an
      anonymous union with another u8 variable.
      
      For various reasons some members in struct can_frame and canfd_frame,
      including the first 8 bytes of data, are expected to have the same
      memory layout. This is enforced by a BUILD_BUG_ON check in af_can.c.
      
      Since the above mentioned commit this check fails on ARM kernels
      compiled with the ARM OABI (which means CONFIG_AEABI not set). In this
      case -mabi=apcs-gnu is passed to the compiler, which leads to a
      structure size boundary of 32 bits instead of 8 as with CONFIG_AEABI
      enabled. This means that the union in struct can_frame takes 4 bytes
      instead of the expected 1.
      
      Rong Chen illustrates the problem with pahole in the ARM OABI case:
      
      | struct can_frame {
      |          canid_t                    can_id;               /* 0     4 */
      |          union {
      |                  __u8               len;                  /* 4     1 */
      |                  __u8               can_dlc;              /* 4     1 */
      |          };                                               /* 4     4 */
      |          __u8                       __pad;                /* 8     1 */
      |          __u8                       __res0;               /* 9     1 */
      |          __u8                       len8_dlc;             /* 10    1 */
      |
      |          /* XXX 5 bytes hole, try to pack */
      |
      |          __u8                       data[8]
      | __attribute__((__aligned__(8))); /*    16     8 */
      |
      |          /* size: 24, cachelines: 1, members: 6 */
      |          /* sum members: 19, holes: 1, sum holes: 5 */
      |          /* forced alignments: 1, forced holes: 1, sum forced holes: 5 */
      |          /* last cacheline: 24 bytes */
      | } __attribute__((__aligned__(8)));
      
      Marking the anonymous union as __attribute__((packed)) fixes the
      BUILD_BUG_ON problem on these compilers.
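
      The resulting uapi structure, sketched (comments abbreviated; only the
      packed attribute on the union is new):

        struct can_frame {
                canid_t can_id;   /* 32 bit CAN_ID + EFF/RTR/ERR flags */
                union {
                        __u8 len;     /* frame payload length in byte */
                        __u8 can_dlc; /* deprecated name, same field */
                } __attribute__((packed)); /* no ABI-dependent padding */
                __u8 __pad;       /* padding */
                __u8 __res0;      /* reserved / padding */
                __u8 len8_dlc;    /* optional DLC for 8 byte payload (9 .. 15) */
                __u8 data[CAN_MAX_DLEN] __attribute__((aligned(8)));
        };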
      
      Fixes: ea780056 ("can: add optional DLC element to Classical CAN frame structure")
      Reported-by: kernel test robot <lkp@intel.com>
      Suggested-by: Rong Chen <rong.a.chen@intel.com>
      Link: https://lore.kernel.org/linux-can/2c82ec23-3551-61b5-1bd8-178c3407ee83@hartkopp.net/
      Link: https://lore.kernel.org/r/20210325125850.1620-3-socketcan@hartkopp.net
      Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      f5076c6b
  6. 27 Mar 2021, 1 commit
  7. 26 Mar 2021, 6 commits
    • sch_red: fix off-by-one checks in red_check_params() · 3a87571f
      Committed by Eric Dumazet
      This fixes the following syzbot report:
      
      UBSAN: shift-out-of-bounds in ./include/net/red.h:237:23
      shift exponent 32 is too large for 32-bit type 'unsigned int'
      CPU: 1 PID: 8418 Comm: syz-executor170 Not tainted 5.12.0-rc4-next-20210324-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:79 [inline]
       dump_stack+0x141/0x1d7 lib/dump_stack.c:120
       ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
       __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327
       red_set_parms include/net/red.h:237 [inline]
       choke_change.cold+0x3c/0xc8 net/sched/sch_choke.c:414
       qdisc_create+0x475/0x12f0 net/sched/sch_api.c:1247
       tc_modify_qdisc+0x4c8/0x1a50 net/sched/sch_api.c:1663
       rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5553
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502
       netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
       netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338
       netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:674
       ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2404
       __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x43f039
      Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007ffdfa725168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000400488 RCX: 000000000043f039
      RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000004
      RBP: 0000000000403020 R08: 0000000000400488 R09: 0000000000400488
      R10: 0000000000400488 R11: 0000000000000246 R12: 00000000004030b0
      R13: 0000000000000000 R14: 00000000004ac018 R15: 0000000000400488
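
      The class of bug, illustrated (a hedged sketch, not the upstream diff;
      the hypothetical variable shift stands in for the checked parameters):

        /* a bound that admits shift == 32 on a 32-bit type is off by one:
         * 1u << 32 is undefined behaviour, the largest valid shift is 31 */
        if (shift > 32)         /* buggy: lets shift == 32 through */
                return false;

        if (shift >= 32)        /* fixed */
                return false;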
      
      Fixes: 8afa10cb ("net_sched: red: Avoid illegal values")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3a87571f
    • virtchnl: Fix layout of RSS structures · 22f8b5df
      Committed by Norbert Ciosek
      Remove padding from the RSS structures. The previous layout could lead
      to unwanted compiler optimizations in loops when iterating over the
      key and lut arrays.
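
      A sketch of the fixed layout (field names follow the virtchnl header;
      the point is that the trailing pad byte is gone, so the length fields
      exactly bound the arrays):

        struct virtchnl_rss_key {
                u16 vsi_id;
                u16 key_len;
                u8 key[1];      /* RSS hash key, packed bytes */
        };

        struct virtchnl_rss_lut {
                u16 vsi_id;
                u16 lut_entries;
                u8 lut[1];      /* RSS lookup table */
        };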
      
      Fixes: 65ece6de ("virtchnl: Add missing explicit padding to structures")
      Signed-off-by: Norbert Ciosek <norbertx.ciosek@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      22f8b5df
    • mm: memblock: fix section mismatch warning again · a024b7c2
      Committed by Mike Rapoport
      Commit 34dc2efb ("memblock: fix section mismatch warning") marked
      memblock_bottom_up() and memblock_set_bottom_up() as __init, but they
      could be referenced from non-init functions like
      memblock_find_in_range_node() on architectures that enable
      CONFIG_ARCH_KEEP_MEMBLOCK.
      
      For such builds kernel test robot reports:
      
         WARNING: modpost: vmlinux.o(.text+0x74fea4): Section mismatch in reference from the function memblock_find_in_range_node() to the function .init.text:memblock_bottom_up()
         The function memblock_find_in_range_node() references the function __init memblock_bottom_up().
         This is often because memblock_find_in_range_node lacks a __init  annotation or the annotation of memblock_bottom_up is wrong.
      
      Replace __init annotations with __init_memblock annotations so that the
      appropriate section will be selected depending on
      CONFIG_ARCH_KEEP_MEMBLOCK.
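
      A sketch of the annotation scheme in <linux/memblock.h> (the exact
      macro expansion is our reading of the header and is an assumption):

        #ifdef CONFIG_ARCH_KEEP_MEMBLOCK
        #define __init_memblock                   /* kept in normal .text */
        #define __initdata_memblock
        #else
        #define __init_memblock __meminit         /* may be discarded */
        #define __initdata_memblock __meminitdata
        #endif

        static inline bool __init_memblock memblock_bottom_up(void)
        {
                return memblock.bottom_up;
        }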
      
      Link: https://lore.kernel.org/lkml/202103160133.UzhgY0wt-lkp@intel.com
      Link: https://lkml.kernel.org/r/20210316171347.14084-1-rppt@kernel.org
      Fixes: 34dc2efb ("memblock: fix section mismatch warning")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a024b7c2
    • mm/mmu_notifiers: ensure range_end() is paired with range_start() · c2655835
      Committed by Sean Christopherson
      If one or more notifiers fails .invalidate_range_start(), invoke
      .invalidate_range_end() for "all" notifiers.  If there are multiple
      notifiers, those that did not fail are expecting _start() and _end() to
      be paired, e.g.  KVM's mmu_notifier_count would become imbalanced.
      Disallow notifiers that can fail _start() from implementing _end() so
      that it's unnecessary to track either which notifiers rejected
      _start(), or which had already succeeded prior to a failed _start().
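
      In rough pseudocode the rebalancing looks like this (a hedged sketch,
      not the upstream implementation):

        int ret = 0;

        /* _start() is invoked on every subscribed notifier */
        hlist_for_each_entry(sub, &subscriptions->list, hlist) {
                if (sub->ops->invalidate_range_start(sub, range)) {
                        ret = -EAGAIN;  /* non-blockable and busy */
                        break;
                }
        }
        if (ret) {
                /* rebalance: notifiers that can fail _start() must not
                 * implement _end(), so calling _end() for all is safe */
                hlist_for_each_entry(sub, &subscriptions->list, hlist)
                        if (sub->ops->invalidate_range_end)
                                sub->ops->invalidate_range_end(sub, range);
        }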
      
      Note, the existing behavior of calling _start() on all notifiers even
      after a previous notifier failed _start() was an unintended "feature".
      Make it canon now that the behavior is depended on for correctness.
      
      As of today, the bug is likely benign:
      
        1. The only caller of the non-blocking notifier is OOM kill.
        2. The only notifiers that can fail _start() are the i915 and Nouveau
           drivers.
        3. The only notifiers that utilize _end() are the SGI UV GRU driver
           and KVM.
        4. The GRU driver will never coincide with the i915/Nouveau drivers.
        5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the
           _guest_, and the guest is already doomed due to being an OOM victim.
      
      Fix the bug now to play nice with future usage, e.g.  KVM has a
      potential use case for blocking memslot updates in KVM while an
      invalidation is in-progress, and failure to unblock would result in said
      updates being blocked indefinitely and hanging.
      
      Found by inspection.  Verified by adding a second notifier in KVM that
      periodically returns -EAGAIN on non-blockable ranges, triggering OOM,
      and observing that KVM exits with an elevated notifier count.
      
      Link: https://lkml.kernel.org/r/20210311180057.1582638-1-seanjc@google.com
      Fixes: 93065ac7 ("mm, oom: distinguish blockable mode for mmu notifiers")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2655835
    • kasan: fix per-page tags for non-page_alloc pages · cf10bd4c
      Committed by Andrey Konovalov
      To allow performing tag checks on page_alloc addresses obtained via
      page_address(), tag-based KASAN modes store tags for page_alloc
      allocations in page->flags.
      
      Currently, the default tag value stored in page->flags is 0x00.
      Therefore, page_address() returns a 0x00ffff...  address for pages that
      were not allocated via page_alloc.
      
      This might cause problems.  A particular case we encountered is a
      conflict with KFENCE.  If a KFENCE-allocated slab object is being freed
      via kfree(page_address(page) + offset), the address passed to kfree()
      will get tagged with 0x00 (as slab pages keep the default per-page
      tags).  This leads to is_kfence_address() check failing, and a KFENCE
      object ending up in normal slab freelist, which causes memory
      corruptions.
      
      This patch changes the way KASAN stores tags in page->flags: they are
      now stored xor'ed with 0xff. This way, KASAN doesn't need to initialize
      per-page flags for every created page, which might be slow.

      With this change, page_address() returns natively-tagged (with 0xff)
      pointers for pages that didn't have tags set explicitly.
      
      This patch fixes the encountered conflict with KFENCE and prevents more
      similar issues that can occur in the future.
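
      The accessors end up looking roughly like this (a sketch of the scheme
      described above; macro names follow include/linux/mm.h):

        static inline u8 page_kasan_tag(const struct page *page)
        {
                u8 tag = 0xff;          /* native tag for untouched pages */

                if (kasan_enabled()) {
                        tag = (page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK;
                        tag ^= 0xff;    /* stored 0x00 reads back as 0xff */
                }
                return tag;
        }

        static inline void page_kasan_tag_set(struct page *page, u8 tag)
        {
                if (kasan_enabled()) {
                        tag ^= 0xff;    /* store the xor'ed value */
                        page->flags &= ~(KASAN_TAG_MASK << KASAN_TAG_PGSHIFT);
                        page->flags |= (tag & KASAN_TAG_MASK) << KASAN_TAG_PGSHIFT;
                }
        }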
      
      Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com
      Fixes: 2813b9c0 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Branislav Rankov <Branislav.Rankov@arm.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf10bd4c
    • hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings · d85aecf2
      Committed by Miaohe Lin
      The current implementation of hugetlb_cgroup for shared mappings can
      behave inconsistently. Consider the following two scenarios:
      
       1. Assume initial css reference count of hugetlb_cgroup is 1:
        1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference
            count is 2 associated with 1 file_region.
        1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference
            count is 3 associated with 2 file_region.
        1.3 coalesce_file_region will coalesce these two file_regions into
            one. So css reference count is 3 associated with 1 file_region
            now.
      
       2. Assume initial css reference count of hugetlb_cgroup is 1 again:
        2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference
            count is 2 associated with 1 file_region.
      
      Therefore, we might have one file_region while holding one or more css
      reference counts. This inconsistency could lead to an imbalanced
      css_get() and css_put() pair. If we do css_put one by one (e.g. the
      hole punch case), scenario 2 would put one more css reference. If we
      do css_put all together (e.g. the truncate case), scenario 1 will leak
      one css reference.
      
      The imbalanced css_get() and css_put() pair would result in a non-zero
      reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup
      directory is removed, but the associated resource is not freed. This
      might ultimately result in OOM, or in being unable to create a new
      hugetlb cgroup in a busy workload.
      
      In order to fix this, we have to make sure that one file_region must
      hold exactly one css reference. So in coalesce_file_region case, we
      should release one css reference before coalescence. Also only put css
      reference when the entire file_region is removed.
      
      The last thing to note is that the caller of region_add() will only
      hold one reference to h_cg->css for the whole contiguous reservation
      region. But this area might be scattered when there are already some
      file_regions residing in it. As a result, many file_regions may share
      only one h_cg->css reference. In order to ensure that one file_region
      holds exactly one css reference, we should do css_get() for each
      file_region and release the reference held by the caller when they are
      done.
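
      A sketch of the coalesce side of the fix (helper names such as
      put_uncharge_info() follow our reading of the patch and are
      assumptions):

        static void coalesce_file_region(struct resv_map *resv,
                                         struct file_region *rg)
        {
                struct file_region *prg = list_prev_entry(rg, link);

                if (&prg->link != &resv->regions && prg->to == rg->from &&
                    has_same_uncharge_info(prg, rg)) {
                        prg->to = rg->to;

                        list_del(&rg->link);
                        put_uncharge_info(rg);  /* drop rg's own css ref
                                                 * before it disappears */
                        kfree(rg);
                }
                /* ... and symmetrically for the following region ... */
        }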
      
      [linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings]
        Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com
      Fixes: 075a61d0 ("hugetlb_cgroup: add accounting for shared mappings")
      Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR)
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d85aecf2
  8. 25 Mar 2021, 1 commit
  9. 24 Mar 2021, 3 commits
  10. 22 Mar 2021, 2 commits
    • net: xfrm: Use sequence counter with associated spinlock · bc8e0adf
      Committed by Ahmed S. Darwish
      A sequence counter write section must be serialized or its internal
      state can get corrupted. A plain seqcount_t does not contain the
      information of which lock must be held to guarantee write-side
      serialization.
      
      For xfrm_state_hash_generation, use seqcount_spinlock_t instead of a
      plain seqcount_t. This associates the spinlock used for write
      serialization with the sequence counter, and enables lockdep to
      verify that the write serialization lock is indeed held before
      entering the sequence counter write section.
      
      If lockdep is disabled, this lock association is compiled out and has
      neither storage size nor runtime overhead.
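
      A minimal usage sketch of the seqcount_spinlock_t API (the names here
      are illustrative, not the actual xfrm code):

        #include <linux/seqlock.h>
        #include <linux/spinlock.h>

        static DEFINE_SPINLOCK(state_lock);
        static seqcount_spinlock_t hash_generation =
                SEQCNT_SPINLOCK_ZERO(hash_generation, &state_lock);

        static void rehash(void)
        {
                spin_lock_bh(&state_lock);
                write_seqcount_begin(&hash_generation); /* lockdep checks
                                                         * state_lock */
                /* ... move entries to the new hash table ... */
                write_seqcount_end(&hash_generation);
                spin_unlock_bh(&state_lock);
        }
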
      Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      bc8e0adf
    • net: xfrm: Localize sequence counter per network namespace · e88add19
      Committed by Ahmed S. Darwish
      A sequence counter write section must be serialized or its internal
      state can get corrupted. The "xfrm_state_hash_generation" seqcount is
      global, but its write serialization lock (net->xfrm.xfrm_state_lock) is
      instantiated per network namespace. The write protection is thus
      insufficient.
      
      To provide full protection, localize the sequence counter per network
      namespace instead. This should be safe as both the seqcount read and
      write sections access data exclusively within the network namespace. It
      also lays the foundation for transforming "xfrm_state_hash_generation"
      data type from seqcount_t to seqcount_LOCKNAME_t in further commits.
      
      Fixes: b65e3d7b ("xfrm: state: add sequence count to detect hash resizes")
      Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      e88add19
  11. 20 Mar 2021, 2 commits
    • bpf: Fix umd memory leak in copy_process() · f60a85ca
      Committed by Zqiang
      Syzbot reported a memleak as follows:
      
      BUG: memory leak
      unreferenced object 0xffff888101b41d00 (size 120):
        comm "kworker/u4:0", pid 8, jiffies 4294944270 (age 12.780s)
        backtrace:
          [<ffffffff8125dc56>] alloc_pid+0x66/0x560
          [<ffffffff81226405>] copy_process+0x1465/0x25e0
          [<ffffffff81227943>] kernel_clone+0xf3/0x670
          [<ffffffff812281a1>] kernel_thread+0x61/0x80
          [<ffffffff81253464>] call_usermodehelper_exec_work
          [<ffffffff81253464>] call_usermodehelper_exec_work+0xc4/0x120
          [<ffffffff812591c9>] process_one_work+0x2c9/0x600
          [<ffffffff81259ab9>] worker_thread+0x59/0x5d0
          [<ffffffff812611c8>] kthread+0x178/0x1b0
          [<ffffffff8100227f>] ret_from_fork+0x1f/0x30
      
      unreferenced object 0xffff888110ef5c00 (size 232):
        comm "kworker/u4:0", pid 8414, jiffies 4294944270 (age 12.780s)
        backtrace:
          [<ffffffff8154a0cf>] kmem_cache_zalloc
          [<ffffffff8154a0cf>] __alloc_file+0x1f/0xf0
          [<ffffffff8154a809>] alloc_empty_file+0x69/0x120
          [<ffffffff8154a8f3>] alloc_file+0x33/0x1b0
          [<ffffffff8154ab22>] alloc_file_pseudo+0xb2/0x140
          [<ffffffff81559218>] create_pipe_files+0x138/0x2e0
          [<ffffffff8126c793>] umd_setup+0x33/0x220
          [<ffffffff81253574>] call_usermodehelper_exec_async+0xb4/0x1b0
          [<ffffffff8100227f>] ret_from_fork+0x1f/0x30
      
      After the UMD process exits, the pipe_to_umh/pipe_from_umh and
      tgid need to be released.
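
      The released resources, sketched (field names follow struct umd_info;
      treat the helper's exact shape as our reading of the fix):

        void umd_cleanup_helper(struct umd_info *info)
        {
                fput(info->pipe_to_umh);        /* close both pipe ends */
                fput(info->pipe_from_umh);
                put_pid(info->tgid);            /* drop the struct pid ref */
                info->tgid = NULL;
        }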
      
      Fixes: d71fa5c9 ("bpf: Add kernel module with user mode driver that populates bpffs.")
      Reported-by: syzbot+44908bb56d2bfe56b28e@syzkaller.appspotmail.com
      Signed-off-by: Zqiang <qiang.zhang@windriver.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210317030915.2865-1-qiang.zhang@windriver.com
      f60a85ca
    • sch_red: Fix a typo · 8a2dc6af
      Committed by Bhaskar Chowdhury
      s/recalcultion/recalculation/
      Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a2dc6af
  12. 19 Mar 2021, 2 commits
    • efi: use 32-bit alignment for efi_guid_t literals · fb98cc0b
      Committed by Ard Biesheuvel
      Commit 494c704f ("efi: Use 32-bit alignment for efi_guid_t") updated
      the type definition of efi_guid_t to ensure that it always appears
      sufficiently aligned (the UEFI spec is ambiguous about this, but given
      the fact that its EFI_GUID type is defined in terms of a struct carrying
      a uint32_t, the natural alignment is definitely >= 32 bits).
      
      However, we missed the EFI_GUID() macro which is used to instantiate
      efi_guid_t literals: that macro is still based on the guid_t type,
      which does not have a minimum alignment at all. This results in warnings
      such as
      
        In file included from drivers/firmware/efi/mokvar-table.c:35:
        include/linux/efi.h:1093:34: warning: passing 1-byte aligned argument to
            4-byte aligned parameter 2 of 'get_var' may result in an unaligned pointer
            access [-Walign-mismatch]
                status = get_var(L"SecureBoot", &EFI_GLOBAL_VARIABLE_GUID, NULL, &size,
                                                ^
        include/linux/efi.h:1101:24: warning: passing 1-byte aligned argument to
            4-byte aligned parameter 2 of 'get_var' may result in an unaligned pointer
            access [-Walign-mismatch]
                get_var(L"SetupMode", &EFI_GLOBAL_VARIABLE_GUID, NULL, &size, &setupmode);
      
      The distinction only matters on CPUs that do not fully support
      misaligned loads, but 32-bit ARM's load-multiple instructions fall
      into that category, and these are likely to be emitted by the compiler
      that built the firmware for loading word-aligned 128-bit GUIDs from
      memory.
      
      So re-implement the initializer in terms of our own efi_guid_t type, so that
      the alignment becomes a property of the literal's type.
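
      The reworked initializer, sketched (byte order follows the
      little-endian EFI GUID wire format; the compound literal now carries
      efi_guid_t's 32-bit alignment):

        #define EFI_GUID(a, b, c, d...) ((efi_guid_t){ {                \
                (a) & 0xff, ((a) >> 8) & 0xff,                          \
                ((a) >> 16) & 0xff, ((a) >> 24) & 0xff,                 \
                (b) & 0xff, ((b) >> 8) & 0xff,                          \
                (c) & 0xff, ((c) >> 8) & 0xff, d } })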
      
      Fixes: 494c704f ("efi: Use 32-bit alignment for efi_guid_t")
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Tested-by: Nathan Chancellor <nathan@kernel.org>
      Link: https://github.com/ClangBuiltLinux/linux/issues/1327
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      fb98cc0b
    • workqueue/tracing: Copy workqueue name to buffer in trace event · 83b62687
      Committed by Steven Rostedt (VMware)
      The trace event "workqueue_queue_work" references an unsafe string when
      dereferencing the name of the workqueue. As the name is allocated, it
      could later be freed while the pointer to that string stays in the
      tracing buffer. If the trace buffer is read after the string is freed,
      it will reference an unsafe pointer.

      I added a new verifier to make sure that all strings referenced in the
      output of the trace buffer are safe to read, and it triggered on the
      workqueue_queue_work trace event:
      
      workqueue_queue_work: work struct=00000000b2b235c7 function=gc_worker workqueue=(0xffff888100051160:events_power_efficient)[UNSAFE-MEMORY] req_cpu=256 cpu=1
      workqueue_queue_work: work struct=00000000c344caec function=flush_to_ldisc workqueue=(0xffff888100054d60:events_unbound)[UNSAFE-MEMORY] req_cpu=256 cpu=4294967295
      workqueue_queue_work: work struct=00000000b2b235c7 function=gc_worker workqueue=(0xffff888100051160:events_power_efficient)[UNSAFE-MEMORY] req_cpu=256 cpu=1
      workqueue_queue_work: work struct=000000000b238b3f function=vmstat_update workqueue=(0xffff8881000c3760:mm_percpu_wq)[UNSAFE-MEMORY] req_cpu=1 cpu=1
      
      Also, if this event is read via a user space application like perf or
      trace-cmd, the name would only be an address and useless information:
      
      workqueue_queue_work: work struct=0xffff953f80b4b918 function=disk_events_workfn workqueue=ffff953f8005d378 req_cpu=8192 cpu=5
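
      The fix follows the standard tracepoint pattern for strings: reserve
      space in the event with __string() and copy the name in with
      __assign_str() instead of recording a raw pointer. Sketched:

        /* in TP_STRUCT__entry() of workqueue_queue_work */
        __string(workqueue, pwq->wq->name)  /* was: __field(const char *, workqueue) */

        /* in TP_fast_assign() */
        __assign_str(workqueue, pwq->wq->name); /* copied into the ring buffer */

        /* in TP_printk(), read it back with __get_str(workqueue) */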
      
      Cc: Zqiang <qiang.zhang@windriver.com>
      Cc: Tejun Heo <tj@kernel.org>
      Fixes: 7bf9c4a8 ("workqueue: tracing the name of the workqueue instead of it's address")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      83b62687
  13. 18 Mar 2021, 6 commits
  14. 17 Mar 2021, 5 commits
  15. 16 Mar 2021, 5 commits
    • fuse: 32-bit user space ioctl compat for fuse device · f8425c93
      Committed by Alessio Balsini
      With a 64-bit kernel build the FUSE device cannot handle ioctl requests
      coming from 32-bit user space. This is due to the ioctl command
      translation generating different command identifiers, which therefore
      cannot be used for direct comparisons without proper manipulation.

      Explicitly extract the type and number from the ioctl command to enable
      32-bit user space compatibility on 64-bit kernel builds.
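
      The dispatch then keys on the command's type and number only, sketched
      (the FUSE_DEV_IOC_MAGIC constant name is an assumption based on the
      description):

        /* _IOC_SIZE/_IOC_DIR may differ between 32-bit and 64-bit user
         * space, so compare only the type and the number */
        if (_IOC_TYPE(cmd) != FUSE_DEV_IOC_MAGIC)
                return -ENOTTY;

        switch (_IOC_NR(cmd)) {
        case _IOC_NR(FUSE_DEV_IOC_CLONE):
                /* ... handle device clone ... */
                break;
        default:
                return -ENOTTY;
        }
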
      Signed-off-by: Alessio Balsini <balsini@android.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      f8425c93
    • can: dev: Move device back to init netns on owning netns delete · 3a5ca857
      Committed by Martin Willi
      When a non-initial netns is destroyed, the usual policy is to delete
      all virtual network interfaces contained, but move physical interfaces
      back to the initial netns. This keeps the physical interface visible
      on the system.
      
      CAN devices are somewhat special, as they define rtnl_link_ops even
      if they are physical devices. If a CAN interface is moved into a
      non-initial netns, destroying that netns lets the interface vanish
      instead of moving it back to the initial netns. default_device_exit()
      skips CAN interfaces due to having rtnl_link_ops set. Reproducer:
      
        ip netns add foo
        ip link set can0 netns foo
        ip netns delete foo
      
      WARNING: CPU: 1 PID: 84 at net/core/dev.c:11030 ops_exit_list+0x38/0x60
      CPU: 1 PID: 84 Comm: kworker/u4:2 Not tainted 5.10.19 #1
      Workqueue: netns cleanup_net
      [<c010e700>] (unwind_backtrace) from [<c010a1d8>] (show_stack+0x10/0x14)
      [<c010a1d8>] (show_stack) from [<c086dc10>] (dump_stack+0x94/0xa8)
      [<c086dc10>] (dump_stack) from [<c086b938>] (__warn+0xb8/0x114)
      [<c086b938>] (__warn) from [<c086ba10>] (warn_slowpath_fmt+0x7c/0xac)
      [<c086ba10>] (warn_slowpath_fmt) from [<c0629f20>] (ops_exit_list+0x38/0x60)
      [<c0629f20>] (ops_exit_list) from [<c062a5c4>] (cleanup_net+0x230/0x380)
      [<c062a5c4>] (cleanup_net) from [<c0142c20>] (process_one_work+0x1d8/0x438)
      [<c0142c20>] (process_one_work) from [<c0142ee4>] (worker_thread+0x64/0x5a8)
      [<c0142ee4>] (worker_thread) from [<c0148a98>] (kthread+0x148/0x14c)
      [<c0148a98>] (kthread) from [<c0100148>] (ret_from_fork+0x14/0x2c)
      
      To properly restore physical CAN devices to the initial netns on owning
      netns exit, introduce a flag on rtnl_link_ops that can be set by drivers.
      For CAN devices setting this flag, default_device_exit() considers them
      non-virtual, applying the usual namespace move.
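
      Sketched, with the flag name netns_refund taken from our reading of
      the patch:

        /* driver side: CAN devices opt back in to the physical policy */
        static struct rtnl_link_ops can_link_ops __read_mostly = {
                .kind         = "can",
                .netns_refund = true,   /* return to init_net on netns delete */
                /* ... */
        };

        /* default_device_exit(): devices without the flag are treated as
         * virtual and left to their rtnl_link_ops to unregister */
        if (dev->rtnl_link_ops && !dev->rtnl_link_ops->netns_refund)
                continue;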
      
      The issue was introduced in the commit mentioned below, as at that time
      CAN devices did not have a dellink() operation.
      
      Fixes: e008b5fc ("net: Simplfy default_device_exit and improve batching.")
      Link: https://lore.kernel.org/r/20210302122423.872326-1-martin@strongswan.org
      Signed-off-by: Martin Willi <martin@strongswan.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      3a5ca857
    • tcp: relookup sock for RST+ACK packets handled by obsolete req sock · 7233da86
      Committed by Alexander Ovechkin
      Currently tcp_check_req can be called with an obsolete req socket for
      which the full socket has already been created (because of a CPU race
      or early demux assigning the req socket to multiple packets in a gro
      batch).

      Commit e0f9759f ("tcp: try to keep packet if SYN_RCV race
      is lost") added a retry in case tcp_check_req is called for a PSH|ACK
      packet. But if the client sends RST+ACK immediately after the
      connection is established (it is performing a healthcheck, for
      example) the retry does not occur. In that case tcp_check_req tries to
      close the req socket, leaving the full socket active.
      
      Fixes: e0f9759f ("tcp: try to keep packet if SYN_RCV race is lost")
      Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru>
      Reported-by: Oleg Senin <olegsenin@yandex-team.ru>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7233da86
    • netfilter: x_tables: Use correct memory barriers. · 175e476b
      Committed by Mark Tomlinson
      When a new table value was assigned, it was followed by a write memory
      barrier. This ensured that all writes before this point would complete
      before any writes after this point. However, to determine whether the
      rules are unused, the sequence counter is read. To ensure that all
      writes have been done before these reads, a full memory barrier is
      needed, not just a write memory barrier. The same argument applies when
      incrementing the counter, before the rules are read.
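
      The ordering requirement in xt_replace_table(), sketched (simplified;
      local variable names are illustrative):

        table->private = newinfo;  /* publish the new ruleset */

        smp_mb();  /* was smp_wmb(): the store above must also be ordered
                    * against the *reads* of the per-cpu sequence counters
                    * below, which requires a full barrier */

        for_each_possible_cpu(cpu) {
                seqcount_t *s = &per_cpu(xt_recseq, cpu);
                u32 seq = raw_read_seqcount(s);

                if (seq & 1) {  /* odd: CPU still reads the old table */
                        do {
                                cond_resched();
                                cpu_relax();
                        } while (seq == raw_read_seqcount(s));
                }
        }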
      
      Changing to smp_mb() instead of smp_wmb() fixes the kernel panic
      reported in cc00bcaa (which is still present), while still
      maintaining the same speed of replacing tables.

      The smp_mb() barriers potentially slow the packet path, however
      testing has shown no measurable change in performance on a 4-core
      MIPS64 platform.
      
      Fixes: 7f5c6d4f ("netfilter: get rid of atomic ops in fast path")
      Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      175e476b
    • Revert "netfilter: x_tables: Switch synchronization to RCU" · d3d40f23
      Committed by Mark Tomlinson
      This reverts commit cc00bcaa.
      
      This (and the preceding) patch basically re-implemented the RCU
      mechanisms of patch 78454473. That patch was replaced because of the
      performance problems that it created when replacing tables. Now, we have
      the same issue: the call to synchronize_rcu() makes replacing tables
      slower by as much as an order of magnitude.
      
      Prior to using RCU a script calling "iptables" approx. 200 times was
      taking 1.16s. With RCU this increased to 11.59s.
      
      Revert these patches and fix the issue in a different way.
      Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      d3d40f23
  16. 15 Mar 2021, 1 commit