1. 07 Jul 2023: 21 commits
  2. 06 Jul 2023: 5 commits
  3. 05 Jul 2023: 10 commits
    • jbd2: fix checkpoint cleanup performance regression · cdbe929a
      Committed by Zhang Yi
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7IO1D
      CVE: NA
      
      --------------------------------
      
      journal_clean_one_cp_list() has been merged into
      journal_shrink_one_cp_list(), but checkpoint buffer cleanup from the
      committing process is only a best effort: it should stop scanning as
      soon as it meets a busy buffer, otherwise it performs a large number of
      useless buffer scans and checks. We caught a performance regression
      with the following fs_mark test.
      
      Test cmd:
       ./fs_mark  -d  scratch  -s  1024  -n  10000  -t  1  -D  100  -N  100
      
      Before merging checkpoint buffer cleanup:
       FSUse%        Count         Size    Files/sec     App Overhead
           95        10000         1024       8304.9            49033
      
      After merging checkpoint buffer cleanup:
       FSUse%        Count         Size    Files/sec     App Overhead
           95        10000         1024       7649.0            50012
       FSUse%        Count         Size    Files/sec     App Overhead
           95        10000         1024       2107.1            50871
      
      After merging checkpoint buffer cleanup, the total loop count in
      journal_shrink_one_cp_list() could reach 6,261,600+ (typically 50,000+
      to 100,000+), and most of those iterations are useless. This patch fixes
      it by passing 'shrink_type' into journal_shrink_one_cp_list() and adding
      a new 'SHRINK_BUSY_STOP' type to indicate that scanning should stop once
      it meets a busy buffer. After the fix, the loop count drops back to
      10,000+.
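      The difference between stopping at the first busy buffer and skipping
      over it can be shown with a minimal user-space sketch. This is not the
      jbd2 code: `struct cp_buffer`, `shrink_cp_list()` and the
      `SHRINK_BUSY_SKIP` mode are hypothetical names; only `SHRINK_BUSY_STOP`
      comes from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shrink modes; only SHRINK_BUSY_STOP is named in the patch. */
enum shrink_type { SHRINK_BUSY_SKIP, SHRINK_BUSY_STOP };

/* Hypothetical model of one entry on a transaction's checkpoint list. */
struct cp_buffer {
	bool busy;     /* still locked or dirty, cannot be released yet */
	bool released; /* already removed from the checkpoint list */
};

/*
 * Walk the list once. Returns how many entries were examined and stores
 * the number of entries released in *freed.
 */
static size_t shrink_cp_list(struct cp_buffer *list, size_t n,
			     enum shrink_type type, size_t *freed)
{
	size_t scanned = 0;

	*freed = 0;
	for (size_t i = 0; i < n; i++) {
		if (list[i].released)
			continue;
		scanned++;
		if (list[i].busy) {
			if (type == SHRINK_BUSY_STOP)
				break; /* best effort: give up at the first busy buffer */
			continue;      /* keep looking for clean buffers */
		}
		list[i].released = true;
		(*freed)++;
	}
	return scanned;
}
```

      With a list of [clean, busy, clean, clean, busy], SHRINK_BUSY_STOP
      examines 2 entries and releases 1, while SHRINK_BUSY_SKIP examines all
      5 and releases 3. On the commit path, which runs this repeatedly, the
      stop mode is what keeps the total loop count bounded.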
      
      After this fix:
       FSUse%        Count         Size    Files/sec     App Overhead
           95        10000         1024       8558.4            49109
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
    • jbd2: remove __journal_try_to_free_buffer() · 90185691
      Committed by Zhang Yi
      maillist inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I70WHL
      CVE: NA
      
      Reference: https://lore.kernel.org/linux-ext4/20230606135928.434610-1-yi.zhang@huaweicloud.com/T/#t
      
      --------------------------------
      
      __journal_try_to_free_buffer() has only one caller and its logic is
      much simpler now, so just remove it and open-code it in
      jbd2_journal_try_to_free_buffers().
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Conflicts:
      	fs/jbd2/transaction.c
      	[ 46417064("jbd2: Make state lock a spinlock") is not
      	  applied. ]
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
    • jbd2: fix a race when checking checkpoint buffer busy · aa5d953c
      Committed by Zhang Yi
      maillist inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I70WHL
      CVE: NA
      
      Reference: https://lore.kernel.org/linux-ext4/20230606135928.434610-1-yi.zhang@huaweicloud.com/T/#t
      
      --------------------------------
      
      Before removing a checkpoint buffer from the t_checkpoint_list, we have
      to check both the BH_Dirty and BH_Lock bits together to distinguish
      buffers that have not been written back from those being written back.
      But __cp_buffer_busy() checks them separately: it first checks the lock
      state and then checks dirty. The window between these two checks can be
      raced by the writeback procedure, which locks the buffer and clears the
      dirty bit before I/O completes. So it cannot guarantee that checkpointing
      buffers have been written back to disk if some error happens later.
      Eventually, it may clean checkpoint transactions and lead to an
      inconsistent filesystem.

      jbd2_journal_forget() and __journal_try_to_free_buffer() have the same
      problem (journal_unmap_buffer() escapes this issue since it runs under
      the buffer lock), so fix them by introducing a new helper that tries to
      take the buffer lock and removes only really clean buffers.
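      The "try to hold the buffer lock, then check dirty" idea can be modelled
      in user space with a pthread mutex standing in for BH_Lock. This is a
      sketch under that assumption; `struct model_bh`, `model_writeback()` and
      `model_try_remove_clean()` are hypothetical and are not the helper the
      patch adds.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model: the mutex stands in for BH_Lock, 'dirty' for BH_Dirty. */
struct model_bh {
	pthread_mutex_t lock;
	bool dirty;
};

/*
 * Writeback side, in the order the commit message describes: lock the
 * buffer, clear dirty, and unlock only after the I/O has completed.
 */
static void model_writeback(struct model_bh *bh,
			    void (*do_io)(struct model_bh *bh))
{
	pthread_mutex_lock(&bh->lock);
	bh->dirty = false;
	do_io(bh); /* I/O is in flight while the lock is still held */
	pthread_mutex_unlock(&bh->lock);
}

/*
 * Checkpoint side, in the spirit of the fix: report the buffer as removable
 * only if the lock can be taken and the buffer is clean while it is held.
 */
static bool model_try_remove_clean(struct model_bh *bh)
{
	bool clean;

	if (pthread_mutex_trylock(&bh->lock) != 0)
		return false; /* locked: writeback may still be running */
	clean = !bh->dirty;
	pthread_mutex_unlock(&bh->lock);
	return clean;
}
```

      Because the dirty bit is only read while the lock is held, a buffer whose
      dirty bit was cleared by writeback but whose I/O is still pending is
      reported as busy rather than clean. Build with -pthread on older glibc.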
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
      Cc: stable@vger.kernel.org
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
    • jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint · 87abd734
      Committed by Zhihao Cheng
      maillist inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I70WHL
      CVE: NA
      
      Reference: https://lore.kernel.org/linux-ext4/20230606135928.434610-1-yi.zhang@huaweicloud.com/T/#t
      
      --------------------------------
      
      Consider the following process:
      
      jbd2_journal_commit_transaction
      // there are several dirty buffer heads in transaction->t_checkpoint_list
                P1                   wb_workfn
      jbd2_log_do_checkpoint
       if (buffer_locked(bh)) // false
                                  __block_write_full_page
                                   trylock_buffer(bh)
                                   test_clear_buffer_dirty(bh)
       if (!buffer_dirty(bh))
        __jbd2_journal_remove_checkpoint(jh)
         if (buffer_write_io_error(bh)) // false
                                   >> bh IO error occurs <<
       jbd2_cleanup_journal_tail
        __jbd2_update_log_tail
         jbd2_write_superblock
         // The bh won't be replayed in next mount.
      This could corrupt the ext4 image; a reproducer is available at [Link].
      
      Since the writeback process clears the buffer dirty bit after locking
      the buffer head, we can fix it by trying to lock the buffer and checking
      its dirtiness while it is locked; the buffer head can be removed only if
      it is neither dirty nor locked.
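      The interleaving in the diagram can be replayed deterministically in a
      single thread. In this hypothetical model, plain booleans stand in for
      BH_Lock and BH_Dirty, and a hook function injects the wb_workfn steps at
      the point where P1 is between its two checks. None of these names exist
      in jbd2.

```c
#include <stdbool.h>

/* Hypothetical single-threaded model of one buffer head's state bits. */
struct sim_bh {
	bool locked; /* BH_Lock */
	bool dirty;  /* BH_Dirty */
};

/* wb_workfn in the diagram: trylock_buffer(), then test_clear_buffer_dirty(). */
static bool sim_writeback_start(struct sim_bh *bh)
{
	if (bh->locked)
		return false; /* trylock failed, writeback skips this buffer */
	bh->locked = true;
	bh->dirty = false; /* I/O is now pending and may still fail */
	return true;
}

/* Hook used to replay the writeback steps at a chosen point. */
static void sim_wb_window(struct sim_bh *bh)
{
	(void)sim_writeback_start(bh);
}

/* Old logic: two separate reads with a window in between. */
static bool racy_can_remove(struct sim_bh *bh, void (*window)(struct sim_bh *))
{
	if (bh->locked)
		return false;
	window(bh); /* writeback runs here, as in the diagram */
	return !bh->dirty;
}

/* Fixed logic: take the lock first, then test dirtiness while holding it. */
static bool fixed_can_remove(struct sim_bh *bh, void (*window)(struct sim_bh *))
{
	bool removable;

	if (bh->locked)
		return false;   /* trylock failed: buffer is busy */
	bh->locked = true;
	window(bh);             /* writeback's own trylock now fails */
	removable = !bh->dirty;
	bh->locked = false;     /* unlock */
	return removable;
}
```

      In the racy version the buffer is reported as removable even though it
      is locked with I/O still pending; in the fixed version the writeback
      attempt backs off and the still-dirty buffer stays on the list.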
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
      Fixes: 470decc6 ("[PATCH] jbd2: initial copy of files from jbd")
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
    • jbd2: remove journal_clean_one_cp_list() · 3d00750a
      Committed by Zhang Yi
      maillist inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I70WHL
      CVE: NA
      
      Reference: https://lore.kernel.org/linux-ext4/20230606135928.434610-1-yi.zhang@huaweicloud.com/T/#t
      
      --------------------------------
      
      journal_clean_one_cp_list() and journal_shrink_one_cp_list() are almost
      the same, so merge them into journal_shrink_one_cp_list(), remove the
      nr_to_scan parameter, always scan and try to free the whole checkpoint
      list.
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Conflicts:
      	fs/jbd2/checkpoint.c
      	[ inclusion patch
      	  9fb671cb("jbd2: fix kabi broken in struct journal_s")
      	  is applied. ]
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
    • nbd: fix null-ptr-dereference while accessing 'nbd->config' · d55e3cf0
      Committed by Yu Kuai
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7EENU
      CVE: NA
      
      ----------------------------------------
      
      The stores nbd->config = config and refcount_set(&nbd->config_refs, 1)
      in nbd_genl_connect may be reordered, so config_refs can be set to 1
      first; nbd_open then accesses nbd->config and triggers a NULL pointer
      dereference.
         T1                      T2
      vfs_open
        do_dentry_open
          blkdev_open
            blkdev_get
              __blkdev_get
                nbd_open
                 nbd_get_config_unlocked
                              genl_rcv_msg
                                genl_family_rcv_msg
                                  genl_family_rcv_msg_doit
                                    nbd_genl_connect
                                      nbd_alloc_and_init_config
                                        // out of order execution
                                        refcount_set(&nbd->config_refs, 1); // 2
                   nbd->config
                   // NULL pointer dereference
                                        nbd->config = config; // 1
      
      Fix it by adding a CPU memory barrier to enforce the ordering of these
      two stores.
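      The publish/consume ordering the barrier enforces can be sketched in user
      space. The kernel patch uses a CPU memory barrier; this sketch swaps in
      C11 release/acquire atomics as the analogous user-space technique. The
      types and functions are hypothetical models, not the nbd source: the
      writer stores the config pointer before making the refcount non-zero,
      and the reader only dereferences the pointer after a successful
      increment-if-not-zero.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space models of struct nbd_config / nbd_device. */
struct model_config {
	int num_connections;
};

struct model_nbd {
	struct model_config *config; /* plain pointer, published via refs */
	atomic_int config_refs;
};

/* Writer (nbd_genl_connect side): publish config, then make refs non-zero.
 * The release store keeps the config store from moving after it. */
static void model_publish_config(struct model_nbd *nbd, struct model_config *cfg)
{
	nbd->config = cfg;
	atomic_store_explicit(&nbd->config_refs, 1, memory_order_release);
}

/* Reader (nbd_open side): modelled on refcount_inc_not_zero(); the acquire
 * on the successful CAS makes the earlier config store visible. */
static struct model_config *model_get_config(struct model_nbd *nbd)
{
	int old = atomic_load_explicit(&nbd->config_refs, memory_order_relaxed);

	while (old != 0) {
		if (atomic_compare_exchange_weak_explicit(&nbd->config_refs, &old,
							  old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return nbd->config;
	}
	return NULL; /* no config attached yet */
}
```

      A failed compare-exchange reloads `old`, so the loop stops as soon as the
      count is seen as zero and never touches `nbd->config` in that case.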
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • nbd: factor out a helper to get nbd_config without holding 'config_lock' · e752a398
      Committed by Yu Kuai
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7EENU
      CVE: NA
      
      ----------------------------------------
      
      There are no functional changes, just to make code cleaner and prepare
      to fix null-ptr-dereference while accessing 'nbd->config'.
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      conflict:
      	drivers/block/nbd.c
      Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • nbd: fold nbd config initialization into nbd_alloc_config() · 9d0422bf
      Committed by Yu Kuai
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7EENU
      CVE: NA
      
      ----------------------------------------
      
      There are no functional changes, make the code cleaner and prepare to
      fix null-ptr-dereference while accessing 'nbd->config'.
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      conflict:
      	drivers/block/nbd.c
      Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • ext4: Stop trying writing pages if no free blocks generated · 6b84e9d9
      Committed by Zhihao Cheng
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7CBCS
      CVE: NA
      
      --------------------------------
      
      The following steps can make ext4_writepages fall into an infinite loop:
      
      1. Consume free_clusters until free_clusters > 2 * sbi->s_resv_clusters,
         and free_clusters > EXT4_FREECLUSTERS_WATERMARK.
         // eg. free_clusters = 1422, sbi->s_resv_clusters = 512
         // nr_cpus = 4, EXT4_FREECLUSTERS_WATERMARK = 512
      2. umount && mount.  // dirty_clusters = 0
      3. Run free_clusters tasks concurrently, each writing a different file;
         many tasks append 4K of data through the da_write (delayed
         allocation) path. Each inode will consume one data block and one
         extent block in map_block.
         // (free_clusters - EXT4_FREECLUSTERS_WATERMARK = 910) tasks choose
         // the da_write path and the remaining 512 tasks choose the
         // write_begin path. Suppose the da_write tasks run first:
         // dirty_clusters = 910, free_clusters = 1422
         // Tasks on the write_begin path will get ENOSPC:
         //  free_clusters < (nclusters + dirty_clusters + resv_clusters)
         //  1422 < (1 + 910 + 512)
      4. After a certain number of map_block iterations in ext4_writepages:
         // free_clusters = 0,
         // dirty_clusters = 910 - (1422 / 2) = 199
      5. Delete one 4K file.  // free_clusters = 1
      6. ext4_writepages falls into an infinite loop:
          mpage_map_and_submit_extent
           mpage_map_one_extent // ret = ENOSPC
             ext4_map_blocks -> ext4_ext_map_blocks -> ext4_mb_new_blocks ->
             ext4_claim_free_clusters:
               if (free_clusters >= (nclusters + dirty_clusters)) // false
           if (err == -ENOSPC && ext4_count_free_clusters(sb)) // true
             return err
           *give_up_on_write = true // won't be executed
      
      Fix it by terminating ext4_writepages if no free blocks are generated.
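      The numbers in the scenario can be checked directly: 1422 - 512 = 910
      da_write tasks, and the write_begin check fails because 1 + 910 + 512 =
      1423 > 1422. In step 6, 1 >= 1 + 199 is false, yet one free cluster
      still exists, so the caller keeps retrying. The sketch below models that
      retry loop. It is hypothetical: `model_claim()` mirrors only the
      simplified step-6 condition, and "no free blocks generated" is modelled
      as "free_clusters unchanged since the previous attempt", which may
      differ from how the actual patch detects it.

```c
#include <stdbool.h>

/* Hypothetical model of the two counters used in the scenario above. */
struct model_sb {
	long free_clusters;
	long dirty_clusters;
};

/* Simplified form of the step-6 check in ext4_claim_free_clusters(). */
static bool model_claim(const struct model_sb *sb, long nclusters)
{
	return sb->free_clusters >= nclusters + sb->dirty_clusters;
}

/*
 * Retry loop of a writepages-like caller. Returns the number of map
 * attempts; max_iters only bounds the demo. With stop_if_no_progress set,
 * the loop gives up once an attempt generated no new free clusters.
 */
static int model_writepages(const struct model_sb *sb, bool stop_if_no_progress,
			    int max_iters)
{
	long last_free = -1;
	int iters = 0;

	while (iters < max_iters) {
		iters++;
		if (model_claim(sb, 1))
			return iters; /* block allocated, write proceeds */
		if (sb->free_clusters == 0)
			return iters; /* genuinely full: give_up_on_write */
		if (stop_if_no_progress && sb->free_clusters == last_free)
			return iters; /* nothing was freed: stop retrying */
		last_free = sb->free_clusters; /* -ENOSPC but free != 0: retry */
	}
	return iters;
}
```

      With free_clusters = 1 and dirty_clusters = 199 the old behaviour only
      stops at the artificial max_iters bound, while the modelled fix gives
      up on the second attempt.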
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
    • ipvlan:Fix out-of-bounds caused by unclear skb->cb · 92e41d84
      Committed by t.feng
      stable inclusion
      from stable-v4.19.284
      commit b36dcf3ed547c103acef6f52bed000a0ac6c074f
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I7GVI1
      CVE: CVE-2023-3090
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b36dcf3ed547c103acef6f52bed000a0ac6c074f
      
      --------------------------------
      
      [ Upstream commit 90cbed52 ]
      
      If the skb is enqueued to a qdisc, fq_skb_cb(skb)->time_to_send is
      modified, and that field actually lives in skb->cb; IPCB(skb_in)->opt
      is later used in __ip_options_echo. The memcpy there can then go out of
      bounds and lead to a stack overflow.
      We should clear skb->cb before ip_local_out or ip6_local_out.
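      Why stale control-block bytes matter can be shown with a small
      user-space model in which a qdisc's private data is later reinterpreted
      as IP options. The structures are hypothetical simplifications (the
      real IP options block has more fields and a different layout); only the
      48-byte size of skb->cb matches the kernel.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model: skb->cb is a 48-byte scratch area shared by layers. */
struct model_skb {
	unsigned char cb[48];
};

/* Hypothetical stand-in for the IP options kept in IPCB(skb); only the
 * length field matters here, since it drives the later memcpy. */
struct model_ip_options {
	unsigned char optlen;
	unsigned char data[15];
};

/* A qdisc stashing private per-packet state (e.g. a send time) in cb. */
static void model_qdisc_enqueue(struct model_skb *skb, uint64_t time_to_send)
{
	memcpy(skb->cb, &time_to_send, sizeof(time_to_send));
}

/* How the IP layer later interprets the same bytes. */
static unsigned char model_ipcb_optlen(const struct model_skb *skb)
{
	struct model_ip_options opt;

	memcpy(&opt, skb->cb, sizeof(opt));
	return opt.optlen;
}

/* The fix in spirit: zero the IP control block before ip_local_out(). */
static void model_clear_ipcb(struct model_skb *skb)
{
	memset(skb->cb, 0, sizeof(struct model_ip_options));
}
```

      After the simulated enqueue, the options length read back is whatever
      byte of the timestamp happens to overlap it, so it is non-zero; after
      clearing, it is zero and no options would be echoed.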
      
      v2:
      1. clean the stack info
      2. use IPCB/IP6CB instead of skb->cb
      
      Crash on stable-5.10 (reproduced on a KASAN kernel).
      Stack info:
      [ 2203.651571] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x589/0x800
      [ 2203.653327] Write of size 4 at addr ffff88811a388f27 by task swapper/3/0
      [ 2203.655460] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Not tainted 5.10.0-60.18.0.50.h856.kasan.eulerosv2r11.x86_64 #1
      [ 2203.655466] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-20181220_000000-szxrtosci10000 04/01/2014
      [ 2203.655475] Call Trace:
      [ 2203.655481]  <IRQ>
      [ 2203.655501]  dump_stack+0x9c/0xd3
      [ 2203.655514]  print_address_description.constprop.0+0x19/0x170
      [ 2203.655530]  __kasan_report.cold+0x6c/0x84
      [ 2203.655586]  kasan_report+0x3a/0x50
      [ 2203.655594]  check_memory_region+0xfd/0x1f0
      [ 2203.655601]  memcpy+0x39/0x60
      [ 2203.655608]  __ip_options_echo+0x589/0x800
      [ 2203.655654]  __icmp_send+0x59a/0x960
      [ 2203.655755]  nf_send_unreach+0x129/0x3d0 [nf_reject_ipv4]
      [ 2203.655763]  reject_tg+0x77/0x1bf [ipt_REJECT]
      [ 2203.655772]  ipt_do_table+0x691/0xa40 [ip_tables]
      [ 2203.655821]  nf_hook_slow+0x69/0x100
      [ 2203.655828]  __ip_local_out+0x21e/0x2b0
      [ 2203.655857]  ip_local_out+0x28/0x90
      [ 2203.655868]  ipvlan_process_v4_outbound+0x21e/0x260 [ipvlan]
      [ 2203.655931]  ipvlan_xmit_mode_l3+0x3bd/0x400 [ipvlan]
      [ 2203.655967]  ipvlan_queue_xmit+0xb3/0x190 [ipvlan]
      [ 2203.655977]  ipvlan_start_xmit+0x2e/0xb0 [ipvlan]
      [ 2203.655984]  xmit_one.constprop.0+0xe1/0x280
      [ 2203.655992]  dev_hard_start_xmit+0x62/0x100
      [ 2203.656000]  sch_direct_xmit+0x215/0x640
      [ 2203.656028]  __qdisc_run+0x153/0x1f0
      [ 2203.656069]  __dev_queue_xmit+0x77f/0x1030
      [ 2203.656173]  ip_finish_output2+0x59b/0xc20
      [ 2203.656244]  __ip_finish_output.part.0+0x318/0x3d0
      [ 2203.656312]  ip_finish_output+0x168/0x190
      [ 2203.656320]  ip_output+0x12d/0x220
      [ 2203.656357]  __ip_queue_xmit+0x392/0x880
      [ 2203.656380]  __tcp_transmit_skb+0x1088/0x11c0
      [ 2203.656436]  __tcp_retransmit_skb+0x475/0xa30
      [ 2203.656505]  tcp_retransmit_skb+0x2d/0x190
      [ 2203.656512]  tcp_retransmit_timer+0x3af/0x9a0
      [ 2203.656519]  tcp_write_timer_handler+0x3ba/0x510
      [ 2203.656529]  tcp_write_timer+0x55/0x180
      [ 2203.656542]  call_timer_fn+0x3f/0x1d0
      [ 2203.656555]  expire_timers+0x160/0x200
      [ 2203.656562]  run_timer_softirq+0x1f4/0x480
      [ 2203.656606]  __do_softirq+0xfd/0x402
      [ 2203.656613]  asm_call_irq_on_stack+0x12/0x20
      [ 2203.656617]  </IRQ>
      [ 2203.656623]  do_softirq_own_stack+0x37/0x50
      [ 2203.656631]  irq_exit_rcu+0x134/0x1a0
      [ 2203.656639]  sysvec_apic_timer_interrupt+0x36/0x80
      [ 2203.656646]  asm_sysvec_apic_timer_interrupt+0x12/0x20
      [ 2203.656654] RIP: 0010:default_idle+0x13/0x20
      [ 2203.656663] Code: 89 f0 5d 41 5c 41 5d 41 5e c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 0f 1f 44 00 00 0f 00 2d 9f 32 57 00 fb f4 <c3> cc cc cc cc 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 be 08
      [ 2203.656668] RSP: 0018:ffff88810036fe78 EFLAGS: 00000256
      [ 2203.656676] RAX: ffffffffaf2a87f0 RBX: ffff888100360000 RCX: ffffffffaf290191
      [ 2203.656681] RDX: 0000000000098b5e RSI: 0000000000000004 RDI: ffff88811a3c4f60
      [ 2203.656686] RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88811a3c4f63
      [ 2203.656690] R10: ffffed10234789ec R11: 0000000000000001 R12: 0000000000000003
      [ 2203.656695] R13: ffff888100360000 R14: 0000000000000000 R15: 0000000000000000
      [ 2203.656729]  default_idle_call+0x5a/0x150
      [ 2203.656735]  cpuidle_idle_call+0x1c6/0x220
      [ 2203.656780]  do_idle+0xab/0x100
      [ 2203.656786]  cpu_startup_entry+0x19/0x20
      [ 2203.656793]  secondary_startup_64_no_verify+0xc2/0xcb
      
      [ 2203.657409] The buggy address belongs to the page:
      [ 2203.658648] page:0000000027a9842f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11a388
      [ 2203.658665] flags: 0x17ffffc0001000(reserved|node=0|zone=2|lastcpupid=0x1fffff)
      [ 2203.658675] raw: 0017ffffc0001000 ffffea000468e208 ffffea000468e208 0000000000000000
      [ 2203.658682] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
      [ 2203.658686] page dumped because: kasan: bad access detected
      
      To reproduce (ipvlan with IPVLAN_MODE_L3):
      Env setting:
      =======================================================
      modprobe ipvlan ipvlan_default_mode=1
      sysctl net.ipv4.conf.eth0.forwarding=1
      iptables -t nat -A POSTROUTING -s 20.0.0.0/255.255.255.0 -o eth0 -j MASQUERADE
      ip link add gw link eth0 type ipvlan
      ip -4 addr add 20.0.0.254/24 dev gw
      ip netns add net1
      ip link add ipv1 link eth0 type ipvlan
      ip link set ipv1 netns net1
      ip netns exec net1 ip link set ipv1 up
      ip netns exec net1 ip -4 addr add 20.0.0.4/24 dev ipv1
      ip netns exec net1 route add default gw 20.0.0.254
      ip netns exec net1 tc qdisc add dev ipv1 root netem loss 10%
      ifconfig gw up
      iptables -t filter -A OUTPUT -p tcp --dport 8888 -j REJECT --reject-with icmp-port-unreachable
      =======================================================
      Then execute the following shell loop (curl any address reachable through eth0):
      
      for((i=1;i<=100000;i++))
      do
              ip netns exec net1 curl x.x.x.x:8888
      done
      =======================================================
      
      Fixes: 2ad7bf36 ("ipvlan: Initial check-in of the IPVLAN driver.")
      Signed-off-by: "t.feng" <fengtao40@huawei.com>
      Suggested-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
  4. 30 Jun 2023: 1 commit
    • sched: Fix null pointer derefrence for sd->span · 70dc4628
      Committed by Hui Tang
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7HFZV
      CVE: NA
      
      ----------------------------------------
      
      A NULL pointer dereference may occur when CPU hotplug runs concurrently
      with task group creation.
      
      sched_autogroup_create_attach
        -> sched_create_group
          -> alloc_fair_sched_group
            -> init_auto_affinity
              -> init_affinity_domains
                 -> cpumask_copy(xx, sched_domain_span(tmp))
                    { tmp may be freed because the RCU read lock is not held }
      
      { hotplug will rebuild sched domain }
      sched_cpu_activate
        -> build_sched_domains
          -> cpuset_cpu_active
            -> partition_sched_domains
              -> build_sched_domains
                -> cpu_attach_domain
                  -> destroy_sched_domains
                    -> call_rcu(&sd->rcu, destroy_sched_domains_rcu)
      
      So sd must be protected by the RCU read lock for the entire critical
      section.
      
      [  599.811593] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      [  600.112821] pc : init_affinity_domains+0xf4/0x200
      [  600.125918] lr : init_affinity_domains+0xd4/0x200
      [  600.331355] Call trace:
      [  600.338734]  init_affinity_domains+0xf4/0x200
      [  600.347955]  init_auto_affinity+0x78/0xc0
      [  600.356622]  alloc_fair_sched_group+0xd8/0x210
      [  600.365594]  sched_create_group+0x48/0xc0
      [  600.373970]  sched_autogroup_create_attach+0x54/0x190
      [  600.383311]  ksys_setsid+0x110/0x130
      [  600.391014]  __arm64_sys_setsid+0x18/0x24
      [  600.399156]  el0_svc_common+0x118/0x170
      [  600.406818]  el0_svc_handler+0x3c/0x80
      [  600.414188]  el0_svc+0x8/0x640
      [  600.420719] Code: b40002c0 9104e002 f9402061 a9401444 (a9001424)
      [  600.430504] SMP: stopping secondary CPUs
      [  600.441751] Starting crashdump kernel...
      
      Fixes: 713cfd26 ("sched: Introduce smart grid scheduling strategy for cfs")
      Signed-off-by: Hui Tang <tanghui20@huawei.com>
      Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
  5. 29 Jun 2023: 3 commits