1. 27 12月, 2019 40 次提交
    • P
      userfaultfd: clear flag if remap event not enabled · 6d9c81dc
      Peter Xu 提交于
      [ Upstream commit 3cfd22be ]
      
      When the process being tracked does mremap() without
      UFFD_FEATURE_EVENT_REMAP on the corresponding tracking uffd file handle,
      we should not generate the remap event, and at the same time we should
      clear all the uffd flags on the new VMA.  Without this patch, we can still
      have the VM_UFFD_MISSING|VM_UFFD_WP flags on the new VMA even the fault
      handling process does not even know the existance of the VMA.
      
      Link: http://lkml.kernel.org/r/20181211053409.20317-1-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Pravin Shedge <pravin.shedge4linux@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      6d9c81dc
    • J
      ocfs2: fix panic due to unrecovered local alloc · b637f94d
      Junxiao Bi 提交于
      [ Upstream commit 532e1e54 ]
      
      mount.ocfs2 ignore the inconsistent error that journal is clean but
      local alloc is unrecovered.  After mount, local alloc not empty, then
      reserver cluster didn't alloc a new local alloc window, reserveration
      map is empty(ocfs2_reservation_map.m_bitmap_len = 0), that triggered the
      following panic.
      
      This issue was reported at
      
        https://oss.oracle.com/pipermail/ocfs2-devel/2015-May/010854.html
      
      and was advised to fixed during mount.  But this is a very unusual
      inconsistent state, usually journal dirty flag should be cleared at the
      last stage of umount until every other things go right.  We may need do
      further debug to check that.  Any way to avoid possible futher
      corruption, mount should be abort and fsck should be run.
      
        (mount.ocfs2,1765,1):ocfs2_load_local_alloc:353 ERROR: Local alloc hasn't been recovered!
        found = 6518, set = 6518, taken = 8192, off = 15912372
        ocfs2: Mounting device (202,64) on (node 0, slot 3) with ordered data mode.
        o2dlm: Joining domain 89CEAC63CC4F4D03AC185B44E0EE0F3F ( 0 1 2 3 4 5 6 8 ) 8 nodes
        ocfs2: Mounting device (202,80) on (node 0, slot 3) with ordered data mode.
        o2hb: Region 89CEAC63CC4F4D03AC185B44E0EE0F3F (xvdf) is now a quorum device
        o2net: Accepted connection from node yvwsoa17p (num 7) at 172.22.77.88:7777
        o2dlm: Node 7 joins domain 64FE421C8C984E6D96ED12C55FEE2435 ( 0 1 2 3 4 5 6 7 8 ) 9 nodes
        o2dlm: Node 7 joins domain 89CEAC63CC4F4D03AC185B44E0EE0F3F ( 0 1 2 3 4 5 6 7 8 ) 9 nodes
        ------------[ cut here ]------------
        kernel BUG at fs/ocfs2/reservations.c:507!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: ocfs2 rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs fscache lockd grace ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ovmapi ppdev parport_pc parport xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea acpi_cpufreq pcspkr i2c_piix4 i2c_core sg ext4 jbd2 mbcache2 sr_mod cdrom xen_blkfront pata_acpi ata_generic ata_piix floppy dm_mirror dm_region_hash dm_log dm_mod
        CPU: 0 PID: 4349 Comm: startWebLogic.s Not tainted 4.1.12-124.19.2.el6uek.x86_64 #2
        Hardware name: Xen HVM domU, BIOS 4.4.4OVM 09/06/2018
        task: ffff8803fb04e200 ti: ffff8800ea4d8000 task.ti: ffff8800ea4d8000
        RIP: 0010:[<ffffffffa05e96a8>]  [<ffffffffa05e96a8>] __ocfs2_resv_find_window+0x498/0x760 [ocfs2]
        Call Trace:
          ocfs2_resmap_resv_bits+0x10d/0x400 [ocfs2]
          ocfs2_claim_local_alloc_bits+0xd0/0x640 [ocfs2]
          __ocfs2_claim_clusters+0x178/0x360 [ocfs2]
          ocfs2_claim_clusters+0x1f/0x30 [ocfs2]
          ocfs2_convert_inline_data_to_extents+0x634/0xa60 [ocfs2]
          ocfs2_write_begin_nolock+0x1c6/0x1da0 [ocfs2]
          ocfs2_write_begin+0x13e/0x230 [ocfs2]
          generic_perform_write+0xbf/0x1c0
          __generic_file_write_iter+0x19c/0x1d0
          ocfs2_file_write_iter+0x589/0x1360 [ocfs2]
          __vfs_write+0xb8/0x110
          vfs_write+0xa9/0x1b0
          SyS_write+0x46/0xb0
          system_call_fastpath+0x18/0xd7
        Code: ff ff 8b 75 b8 39 75 b0 8b 45 c8 89 45 98 0f 84 e5 fe ff ff 45 8b 74 24 18 41 8b 54 24 1c e9 56 fc ff ff 85 c0 0f 85 48 ff ff ff <0f> 0b 48 8b 05 cf c3 de ff 48 ba 00 00 00 00 00 00 00 10 48 85
        RIP   __ocfs2_resv_find_window+0x498/0x760 [ocfs2]
         RSP <ffff8800ea4db668>
        ---[ end trace 566f07529f2edf3c ]---
        Kernel panic - not syncing: Fatal exception
        Kernel Offset: disabled
      
      Link: http://lkml.kernel.org/r/20181121020023.3034-2-junxiao.bi@oracle.comSigned-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: NYiwen Jiang <jiangyiwen@huawei.com>
      Acked-by: NJoseph Qi <jiangqi903@gmail.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      b637f94d
    • E
      iomap: don't search past page end in iomap_is_partially_uptodate · 0047db18
      Eric Sandeen 提交于
      [ Upstream commit 3cc31fa6 ]
      
      iomap_is_partially_uptodate() is intended to check wither blocks within
      the selected range of a not-uptodate page are uptodate; if the range we
      care about is up to date, it's an optimization.
      
      However, the iomap implementation continues to check all blocks up to
      from+count, which is beyond the page, and can even be well beyond the
      iop->uptodate bitmap.
      
      I think the worst that will happen is that we may eventually find a zero
      bit and return "not partially uptodate" when it would have otherwise
      returned true, and skip the optimization.  Still, it's clearly an invalid
      memory access that must be fixed.
      
      So: fix this by limiting the search to within the page as is done in the
      non-iomap variant, block_is_partially_uptodate().
      
      Zorro noticed thiswhen KASAN went off for 512 byte blocks on a 64k
      page system:
      
       BUG: KASAN: slab-out-of-bounds in iomap_is_partially_uptodate+0x1a0/0x1e0
       Read of size 8 at addr ffff800120c3a318 by task fsstress/22337
      Reported-by: NZorro Lang <zlang@redhat.com>
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NEric Sandeen <sandeen@sandeen.net>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      0047db18
    • J
      quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls. · 50c87aef
      Javier Barrio 提交于
      [ Upstream commit 41c4f85c ]
      
      Commit 1fa5efe3 (ext4: Use generic helpers for quotaon
      and quotaoff) made possible to call quotactl(Q_XQUOTAON/OFF) on ext4 filesystems
      with sysfile quota support. This leads to calling dquot_enable/disable without s_umount
      held in excl. mode, because quotactl_cmd_onoff checks only for Q_QUOTAON/OFF.
      
      The following WARN_ON_ONCE triggers (in this case for dquot_enable, ext4, latest Linus' tree):
      
      [  117.807056] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: quota,prjquota
      
      [...]
      
      [  155.036847] WARNING: CPU: 0 PID: 2343 at fs/quota/dquot.c:2469 dquot_enable+0x34/0xb9
      [  155.036851] Modules linked in: quota_v2 quota_tree ipv6 af_packet joydev mousedev psmouse serio_raw pcspkr i2c_piix4 intel_agp intel_gtt e1000 ttm drm_kms_helper drm agpgart fb_sys_fops syscopyarea sysfillrect sysimgblt i2c_core input_leds kvm_intel kvm irqbypass qemu_fw_cfg floppy evdev parport_pc parport button crc32c_generic dm_mod ata_generic pata_acpi ata_piix libata loop ext4 crc16 mbcache jbd2 usb_storage usbcore sd_mod scsi_mod
      [  155.036901] CPU: 0 PID: 2343 Comm: qctl Not tainted 4.20.0-rc6-00025-gf5d58277 #9
      [  155.036903] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [  155.036911] RIP: 0010:dquot_enable+0x34/0xb9
      [  155.036915] Code: 41 56 41 55 41 54 55 53 4c 8b 6f 28 74 02 0f 0b 4d 8d 7d 70 49 89 fc 89 cb 41 89 d6 89 f5 4c 89 ff e8 23 09 ea ff 85 c0 74 0a <0f> 0b 4c 89 ff e8 8b 09 ea ff 85 db 74 6a 41 8b b5 f8 00 00 00 0f
      [  155.036918] RSP: 0018:ffffb09b00493e08 EFLAGS: 00010202
      [  155.036922] RAX: 0000000000000001 RBX: 0000000000000008 RCX: 0000000000000008
      [  155.036924] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9781b67cd870
      [  155.036926] RBP: 0000000000000002 R08: 0000000000000000 R09: 61c8864680b583eb
      [  155.036929] R10: ffffb09b00493e48 R11: ffffffffff7ce7d4 R12: ffff9781b7ee8d78
      [  155.036932] R13: ffff9781b67cd800 R14: 0000000000000004 R15: ffff9781b67cd870
      [  155.036936] FS:  00007fd813250b88(0000) GS:ffff9781ba000000(0000) knlGS:0000000000000000
      [  155.036939] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  155.036942] CR2: 00007fd812ff61d6 CR3: 000000007c882000 CR4: 00000000000006b0
      [  155.036951] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  155.036953] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  155.036955] Call Trace:
      [  155.037004]  dquot_quota_enable+0x8b/0xd0
      [  155.037011]  kernel_quotactl+0x628/0x74e
      [  155.037027]  ? do_mprotect_pkey+0x2a6/0x2cd
      [  155.037034]  __x64_sys_quotactl+0x1a/0x1d
      [  155.037041]  do_syscall_64+0x55/0xe4
      [  155.037078]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  155.037105] RIP: 0033:0x7fd812fe1198
      [  155.037109] Code: 02 77 0d 48 89 c1 48 c1 e9 3f 75 04 48 8b 04 24 48 83 c4 50 5b c3 48 83 ec 08 49 89 ca 48 63 d2 48 63 ff b8 b3 00 00 00 0f 05 <48> 89 c7 e8 c1 eb ff ff 5a c3 48 63 ff b8 bb 00 00 00 0f 05 48 89
      [  155.037112] RSP: 002b:00007ffe8cd7b050 EFLAGS: 00000206 ORIG_RAX: 00000000000000b3
      [  155.037116] RAX: ffffffffffffffda RBX: 00007ffe8cd7b148 RCX: 00007fd812fe1198
      [  155.037119] RDX: 0000000000000000 RSI: 00007ffe8cd7cea9 RDI: 0000000000580102
      [  155.037121] RBP: 00007ffe8cd7b0f0 R08: 000055fc8eba8a9d R09: 0000000000000000
      [  155.037124] R10: 00007ffe8cd7b074 R11: 0000000000000206 R12: 00007ffe8cd7b168
      [  155.037126] R13: 000055fc8eba8897 R14: 0000000000000000 R15: 0000000000000000
      [  155.037131] ---[ end trace 210f864257175c51 ]---
      
      and then the syscall proceeds without s_umount locking.
      
      This patch locks the superblock ->s_umount sem. in exclusive mode for all Q_XQUOTAON/OFF
      quotactls too in addition to Q_QUOTAON/OFF.
      
      AFAICT, other than ext4, only xfs and ocfs2 are affected by this change.
      The VFS will now call in xfs_quota_* functions with s_umount held, which wasn't the case
      before. This looks good to me but I can not say for sure. Ext4 and ocfs2 where already
      beeing called with s_umount exclusive via quota_quotaon/off which is basically the same.
      Signed-off-by: NJavier Barrio <javier.barrio.mart@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      50c87aef
    • J
      btrfs: improve error handling of btrfs_add_link · d59679a9
      Johannes Thumshirn 提交于
      [ Upstream commit 1690dd41 ]
      
      In the error handling block, err holds the return value of either
      btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked
      since it's introduction with commit fe66a05a (Btrfs: improve error
      handling for btrfs_insert_dir_item callers) in 2012.
      
      If the error handling in the error handling fails, there's not much left
      to do and the abort either happened earlier in the callees or is
      necessary here.
      
      So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort
      the transaction, but still return the original code of the failure
      stored in 'ret' as this will be reported to the user.
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      d59679a9
    • A
      btrfs: fix use-after-free due to race between replace start and cancel · 55cf3ae3
      Anand Jain 提交于
      [ Upstream commit d189dd70 ]
      
      The device replace cancel thread can race with the replace start thread
      and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will
      fail to stop the scrub thread.
      
      The scrub thread continues with the scrub for replace which then will
      try to write to the target device and which is already freed by the
      cancel thread.
      
      scrub_setup_ctx() warns as tgtdev is NULL.
      
        struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
        {
        ...
      	  if (is_dev_replace) {
      		  WARN_ON(!fs_info->dev_replace.tgtdev);  <===
      		  sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
      		  sctx->wr_tgtdev = fs_info->dev_replace.tgtdev;
      		  sctx->flush_all_writes = false;
      	  }
      
        [ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started
        [ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled
        [ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.432970] Call Trace:
        [ 6852.433202]  btrfs_scrub_dev+0x19b/0x5c0 [btrfs]
        [ 6852.433471]  btrfs_dev_replace_start+0x48c/0x6a0 [btrfs]
        [ 6852.433800]  btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs]
        [ 6852.434097]  btrfs_ioctl+0x2476/0x2d20 [btrfs]
        [ 6852.434365]  ? do_sigaction+0x7d/0x1e0
        [ 6852.434623]  do_vfs_ioctl+0xa9/0x6c0
        [ 6852.434865]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435124]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435387]  ksys_ioctl+0x60/0x90
        [ 6852.435663]  __x64_sys_ioctl+0x16/0x20
        [ 6852.435907]  do_syscall_64+0x50/0x180
        [ 6852.436150]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Further, as the replace thread enters scrub_write_page_to_dev_replace()
      without the target device it panics:
      
        static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
      				      struct scrub_page *spage)
        {
        ...
      	bio_set_dev(bio, sbio->dev->bdev); <======
      
        [ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
        ..
        [ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs]
        [ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260
        [btrfs]
        ..
        [ 6929.721430] Call Trace:
        [ 6929.721663]  scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs]
        [ 6929.721975]  scrub_bio_end_io_worker+0x1af/0x490 [btrfs]
        [ 6929.722277]  normal_work_helper+0xf0/0x4c0 [btrfs]
        [ 6929.722552]  process_one_work+0x1f4/0x520
        [ 6929.722805]  ? process_one_work+0x16e/0x520
        [ 6929.723063]  worker_thread+0x46/0x3d0
        [ 6929.723313]  kthread+0xf8/0x130
        [ 6929.723544]  ? process_one_work+0x520/0x520
        [ 6929.723800]  ? kthread_delayed_work_timer_fn+0x80/0x80
        [ 6929.724081]  ret_from_fork+0x3a/0x50
      
      Fix this by letting the btrfs_dev_replace_finishing() to do the job of
      cleaning after the cancel, including freeing of the target device.
      btrfs_dev_replace_finishing() is called when btrfs_scub_dev() returns
      along with the scrub return status.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      55cf3ae3
    • H
      btrfs: alloc_chunk: fix more DUP stripe size handling · 256ba7ae
      Hans van Kranenburg 提交于
      [ Upstream commit baf92114 ]
      
      Commit 92e222df "btrfs: alloc_chunk: fix DUP stripe size handling"
      fixed calculating the stripe_size for a new DUP chunk.
      
      However, the same calculation reappears a bit later, and that one was
      not changed yet. The resulting bug that is exposed is that the newly
      allocated device extents ('stripes') can have a few MiB overlap with the
      next thing stored after them, which is another device extent or the end
      of the disk.
      
      The scenario in which this can happen is:
      * The block device for the filesystem is less than 10GiB in size.
      * The amount of contiguous free unallocated disk space chosen to use for
        chunk allocation is 20% of the total device size, or a few MiB more or
        less.
      
      An example:
      - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
      - There's 1578MiB unallocated raw disk space left in one contiguous
        piece.
      
      In this case stripe_size is first calculated as 789MiB, (half of
      1578MiB).
      
      Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
      enter the if block. Now stripe_size value is immediately overwritten
      while calculating an adjusted value based on max_chunk_size, which ends
      up as 788MiB.
      
      Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
      actually more than the value we had before. However, the last comparison
      fails to detect this, because it's comparing the value with the total
      amount of free space, which is about twice the size of stripe_size.
      
      In the example above, this means that the resulting raw disk space being
      allocated is 1600MiB, while only a gap of 1578MiB has been found. The
      second device extent object for this DUP chunk will overlap for 22MiB
      with whatever comes next.
      
      The underlying problem here is that the stripe_size is reused all the
      time for different things. So, when entering the code in the if block,
      stripe_size is immediately overwritten with something else. If later we
      decide we want to have the previous value back, then the logic to
      compute it was copy pasted in again.
      
      With this change, the value in stripe_size is not unnecessarily
      destroyed, so the duplicated calculation is not needed any more.
      Signed-off-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      256ba7ae
    • Q
      btrfs: volumes: Make sure there is no overlap of dev extents at mount time · f07f28a6
      Qu Wenruo 提交于
      [ Upstream commit 5eb19381 ]
      
      Enhance btrfs_verify_dev_extents() to remember previous checked dev
      extents, so it can verify no dev extents can overlap.
      
      Analysis from Hans:
      
      "Imagine allocating a DATA|DUP chunk.
      
       In the chunk allocator, we first set...
         max_stripe_size = SZ_1G;
         max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
       ... which is 10GiB.
      
       Then...
         /* we don't want a chunk larger than 10% of writeable space */
         max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
             		 max_chunk_size);
      
       Imagine we only have one 7880MiB block device in this filesystem. Now
       max_chunk_size is down to 788MiB.
      
       The next step in the code is to search for max_stripe_size * dev_stripes
       amount of free space on the device, which is in our example 1GiB * 2 =
       2GiB. Imagine the device has exactly 1578MiB free in one contiguous
       piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
      
       Next we recalculate the stripe_size (which is actually the device extent
       length), based on the actual maximum amount of available raw disk space:
         stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
      
       stripe_size is now 789MiB
      
       Next we do...
         data_stripes = num_stripes / ncopies
       ...where data_stripes ends up as 1, because num_stripes is 2 (the amount
       of device extents we're going to have), and DUP has ncopies 2.
      
       Next there's a check...
         if (stripe_size * data_stripes > max_chunk_size)
       ...which matches because 789MiB * 1 > 788MiB.
      
       We go into the if code, and next is...
         stripe_size = div_u64(max_chunk_size, data_stripes);
       ...which resets stripe_size to max_chunk_size: 788MiB
      
       Next is a fun one...
         /* bump the answer up to a 16MB boundary */
         stripe_size = round_up(stripe_size, SZ_16M);
       ...which changes stripe_size from 788MiB to 800MiB.
      
       We're not done changing stripe_size yet...
         /* But don't go higher than the limits we found while searching
          * for free extents
          */
         stripe_size = min(devices_info[ndevs - 1].max_avail,
             	      stripe_size);
      
       This is bad. max_avail is twice the stripe_size (we need to fit 2 device
       extents on the same device for DUP).
      
       The result here is that 800MiB < 1578MiB, so it's unchanged. However,
       the resulting DUP chunk will need 1600MiB disk space, which isn't there,
       and the second dev_extent might extend into the next thing (next
       dev_extent? end of device?) for 22MiB.
      
       The last shown line of code relies on a situation where there's twice
       the value of stripe_size present as value for the variable stripe_size
       when it's DUP. This was actually the case before commit 92e222df
       "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
         "[...] in the meantime there's a check to see if the stripe_size does
       not exceed max_chunk_size. Since during this check stripe_size is twice
       the amount as intended, the check will reduce the stripe_size to
       max_chunk_size if the actual correct to be used stripe_size is more than
       half the amount of max_chunk_size."
      
       In the previous version of the code, the 16MiB alignment (why is this
       done, by the way?) would result in a 50% chance that it would actually
       do an 8MiB alignment for the individual dev_extents, since it was
       operating on double the size. Does this matter?
      
       Does it matter that stripe_size can be set to anything which is not
       16MiB aligned because of the amount of remaining available disk space
       which is just taken?
      
       What is the main purpose of this round_up?
      
       The most straightforward thing to do seems something like...
         stripe_size = min(
             div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
             stripe_size
         )
       ..just putting half of the max_avail into stripe_size."
      
      Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/Reported-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      [ add analysis from report ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      f07f28a6
    • J
      pstore/ram: Do not treat empty buffers as valid · ec8dd15a
      Joel Fernandes (Google) 提交于
      [ Upstream commit 30696378 ]
      
      The ramoops backend currently calls persistent_ram_save_old() even
      if a buffer is empty. While this appears to work, it is does not seem
      like the right thing to do and could lead to future bugs so lets avoid
      that. It also prevents misleading prints in the logs which claim the
      buffer is valid.
      
      I got something like:
      
      	found existing buffer, size 0, start 0
      
      When I was expecting:
      
      	no valid data in buffer (sig = ...)
      
      This bails out early (and reports with pr_debug()), since it's an
      acceptable state.
      Signed-off-by: NJoel Fernandes (Google) <joel@joelfernandes.org>
      Co-developed-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      ec8dd15a
    • D
      jffs2: Fix use of uninitialized delayed_work, lockdep breakage · 19baa1b6
      Daniel Santos 提交于
      [ Upstream commit a788c527 ]
      
      jffs2_sync_fs makes the assumption that if CONFIG_JFFS2_FS_WRITEBUFFER
      is defined then a write buffer is available and has been initialized.
      However, this does is not the case when the mtd device has no
      out-of-band buffer:
      
      int jffs2_nand_flash_setup(struct jffs2_sb_info *c)
      {
              if (!c->mtd->oobsize)
                      return 0;
      ...
      
      The resulting call to cancel_delayed_work_sync passing a uninitialized
      (but zeroed) delayed_work struct forces lockdep to become disabled.
      
      [   90.050639] overlayfs: upper fs does not support tmpfile.
      [   90.652264] INFO: trying to register non-static key.
      [   90.662171] the code is fine but needs lockdep annotation.
      [   90.673090] turning off the locking correctness validator.
      [   90.684021] CPU: 0 PID: 1762 Comm: mount_root Not tainted 4.14.63 #0
      [   90.696672] Stack : 00000000 00000000 80d8f6a2 00000038 805f0000 80444600 8fe364f4 805dfbe7
      [   90.713349]         80563a30 000006e2 8068370c 00000001 00000000 00000001 8e2fdc48 ffffffff
      [   90.730020]         00000000 00000000 80d90000 00000000 00000106 00000000 6465746e 312e3420
      [   90.746690]         6b636f6c 03bf0000 f8000000 20676e69 00000000 80000000 00000000 8e2c2a90
      [   90.763362]         80d90000 00000001 00000000 8e2c2a90 00000003 80260dc0 08052098 80680000
      [   90.780033]         ...
      [   90.784902] Call Trace:
      [   90.789793] [<8000f0d8>] show_stack+0xb8/0x148
      [   90.798659] [<8005a000>] register_lock_class+0x270/0x55c
      [   90.809247] [<8005cb64>] __lock_acquire+0x13c/0xf7c
      [   90.818964] [<8005e314>] lock_acquire+0x194/0x1dc
      [   90.828345] [<8003f27c>] flush_work+0x200/0x24c
      [   90.837374] [<80041dfc>] __cancel_work_timer+0x158/0x210
      [   90.847958] [<801a8770>] jffs2_sync_fs+0x20/0x54
      [   90.857173] [<80125cf4>] iterate_supers+0xf4/0x120
      [   90.866729] [<80158fc4>] sys_sync+0x44/0x9c
      [   90.875067] [<80014424>] syscall_common+0x34/0x58
      Signed-off-by: NDaniel Santos <daniel.santos@pobox.com>
      Reviewed-by: NHou Tao <houtao1@huawei.com>
      Signed-off-by: NBoris Brezillon <boris.brezillon@bootlin.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      19baa1b6
    • J
      blockdev: Fix livelocks on loop device · d09d64b8
      Jan Kara 提交于
      commit 04906b2f upstream.
      
      bd_set_size() updates also block device's block size. This is somewhat
      unexpected from its name and at this point, only blkdev_open() uses this
      functionality. Furthermore, this can result in changing block size under
      a filesystem mounted on a loop device which leads to livelocks inside
      __getblk_gfp() like:
      
      Sending NMI from CPU 0 to CPUs 1:
      NMI backtrace for cpu 1
      CPU: 1 PID: 10863 Comm: syz-executor0 Not tainted 4.18.0-rc5+ #151
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
      01/01/2011
      RIP: 0010:__sanitizer_cov_trace_pc+0x3f/0x50 kernel/kcov.c:106
      ...
      Call Trace:
       init_page_buffers+0x3e2/0x530 fs/buffer.c:904
       grow_dev_page fs/buffer.c:947 [inline]
       grow_buffers fs/buffer.c:1009 [inline]
       __getblk_slow fs/buffer.c:1036 [inline]
       __getblk_gfp+0x906/0xb10 fs/buffer.c:1313
       __bread_gfp+0x2d/0x310 fs/buffer.c:1347
       sb_bread include/linux/buffer_head.h:307 [inline]
       fat12_ent_bread+0x14e/0x3d0 fs/fat/fatent.c:75
       fat_ent_read_block fs/fat/fatent.c:441 [inline]
       fat_alloc_clusters+0x8ce/0x16e0 fs/fat/fatent.c:489
       fat_add_cluster+0x7a/0x150 fs/fat/inode.c:101
       __fat_get_block fs/fat/inode.c:148 [inline]
      ...
      
      Trivial reproducer for the problem looks like:
      
      truncate -s 1G /tmp/image
      losetup /dev/loop0 /tmp/image
      mkfs.ext4 -b 1024 /dev/loop0
      mount -t ext4 /dev/loop0 /mnt
      losetup -c /dev/loop0
      l /mnt
      
      Fix the problem by moving initialization of a block device block size
      into a separate function and call it when needed.
      
      Thanks to Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> for help with
      debugging the problem.
      
      Reported-by: syzbot+9933e4476f365f5d5a1b@syzkaller.appspotmail.com
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      d09d64b8
    • K
      pstore/ram: Avoid allocation and leak of platform data · 57d5e806
      Kees Cook 提交于
      commit 5631e857 upstream.
      
      Yue Hu noticed that when parsing device tree the allocated platform data
      was never freed. Since it's not used beyond the function scope, this
      switches to using a stack variable instead.
      Reported-by: NYue Hu <huyue2@yulong.com>
      Fixes: 35da6094 ("pstore/ram: add Device Tree bindings")
      Cc: stable@vger.kernel.org
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      57d5e806
    • J
      btrfs: wait on ordered extents on abort cleanup · 8da162ce
      Josef Bacik 提交于
      commit 74d5d229 upstream.
      
      If we flip read-only before we initiate writeback on all dirty pages for
      ordered extents we've created then we'll have ordered extents left over
      on umount, which results in all sorts of bad things happening.  Fix this
      by making sure we wait on ordered extents if we have to do the aborted
      transaction cleanup stuff.
      
      generic/475 can produce this warning:
      
       [ 8531.177332] WARNING: CPU: 2 PID: 11997 at fs/btrfs/disk-io.c:3856 btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.183282] CPU: 2 PID: 11997 Comm: umount Tainted: G        W 5.0.0-rc1-default+ #394
       [ 8531.185164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
       [ 8531.187851] RIP: 0010:btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.193082] RSP: 0018:ffffb1ab86163d98 EFLAGS: 00010286
       [ 8531.194198] RAX: ffff9f3449494d18 RBX: ffff9f34a2695000 RCX:0000000000000000
       [ 8531.195629] RDX: 0000000000000002 RSI: 0000000000000001 RDI:0000000000000000
       [ 8531.197315] RBP: ffff9f344e930000 R08: 0000000000000001 R09:0000000000000000
       [ 8531.199095] R10: 0000000000000000 R11: ffff9f34494d4ff8 R12:ffffb1ab86163dc0
       [ 8531.200870] R13: ffff9f344e9300b0 R14: ffffb1ab86163db8 R15:0000000000000000
       [ 8531.202707] FS:  00007fc68e949fc0(0000) GS:ffff9f34bd800000(0000)knlGS:0000000000000000
       [ 8531.204851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [ 8531.205942] CR2: 00007ffde8114dd8 CR3: 000000002dfbd000 CR4:00000000000006e0
       [ 8531.207516] Call Trace:
       [ 8531.208175]  btrfs_free_fs_roots+0xdb/0x170 [btrfs]
       [ 8531.210209]  ? wait_for_completion+0x5b/0x190
       [ 8531.211303]  close_ctree+0x157/0x350 [btrfs]
       [ 8531.212412]  generic_shutdown_super+0x64/0x100
       [ 8531.213485]  kill_anon_super+0x14/0x30
       [ 8531.214430]  btrfs_kill_super+0x12/0xa0 [btrfs]
       [ 8531.215539]  deactivate_locked_super+0x29/0x60
       [ 8531.216633]  cleanup_mnt+0x3b/0x70
       [ 8531.217497]  task_work_run+0x98/0xc0
       [ 8531.218397]  exit_to_usermode_loop+0x83/0x90
       [ 8531.219324]  do_syscall_64+0x15b/0x180
       [ 8531.220192]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [ 8531.221286] RIP: 0033:0x7fc68e5e4d07
       [ 8531.225621] RSP: 002b:00007ffde8116608 EFLAGS: 00000246 ORIG_RAX:00000000000000a6
       [ 8531.227512] RAX: 0000000000000000 RBX: 00005580c2175970 RCX:00007fc68e5e4d07
       [ 8531.229098] RDX: 0000000000000001 RSI: 0000000000000000 RDI:00005580c2175b80
       [ 8531.230730] RBP: 0000000000000000 R08: 00005580c2175ba0 R09:00007ffde8114e80
       [ 8531.232269] R10: 0000000000000000 R11: 0000000000000246 R12:00005580c2175b80
       [ 8531.233839] R13: 00007fc68eac61c4 R14: 00005580c2175a68 R15:0000000000000000
      
      Leaving a tree in the rb-tree:
      
      3853 void btrfs_free_fs_root(struct btrfs_root *root)
      3854 {
      3855         iput(root->ino_cache_inode);
      3856         WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      
      CC: stable@vger.kernel.org
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add stacktrace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      8da162ce
    • D
      Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io" · 58afa74e
      David Sterba 提交于
      commit 77b7aad1 upstream.
      
      This reverts commit e73e81b6.
      
      This patch causes a few problems:
      
      - adds latency to btrfs_finish_ordered_io
      - as btrfs_finish_ordered_io is used for free space cache, generating
        more work from btrfs_btree_balance_dirty_nodelay could end up in the
        same workque, effectively deadlocking
      
      12260 kworker/u96:16+btrfs-freespace-write D
      [<0>] balance_dirty_pages+0x6e6/0x7ad
      [<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90
      [<0>] btrfs_finish_ordered_io+0x3da/0x770
      [<0>] normal_work_helper+0x1c5/0x5a0
      [<0>] process_one_work+0x1ee/0x5a0
      [<0>] worker_thread+0x46/0x3d0
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      Transaction commit will wait on the freespace cache:
      
      838 btrfs-transacti D
      [<0>] btrfs_start_ordered_extent+0x154/0x1e0
      [<0>] btrfs_wait_ordered_range+0xbd/0x110
      [<0>] __btrfs_wait_cache_io+0x49/0x1a0
      [<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0
      [<0>] commit_cowonly_roots+0x215/0x2b0
      [<0>] btrfs_commit_transaction+0x37e/0x910
      [<0>] transaction_kthread+0x14d/0x180
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      And then writepages ends up waiting on transaction commit:
      
      9520 kworker/u96:13+flush-btrfs-1 D
      [<0>] wait_current_trans+0xac/0xe0
      [<0>] start_transaction+0x21b/0x4b0
      [<0>] cow_file_range_inline+0x10b/0x6b0
      [<0>] cow_file_range.isra.69+0x329/0x4a0
      [<0>] run_delalloc_range+0x105/0x3c0
      [<0>] writepage_delalloc+0x119/0x180
      [<0>] __extent_writepage+0x10c/0x390
      [<0>] extent_write_cache_pages+0x26f/0x3d0
      [<0>] extent_writepages+0x4f/0x80
      [<0>] do_writepages+0x17/0x60
      [<0>] __writeback_single_inode+0x59/0x690
      [<0>] writeback_sb_inodes+0x291/0x4e0
      [<0>] __writeback_inodes_wb+0x87/0xb0
      [<0>] wb_writeback+0x3bb/0x500
      [<0>] wb_workfn+0x40d/0x610
      [<0>] process_one_work+0x1ee/0x5a0
      [<0>] worker_thread+0x1e0/0x3d0
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      Eventually, we have every process in the system waiting on
      balance_dirty_pages(), and nobody is able to make progress on page
      writeback.
      
      The original patch tried to fix an OOM condition, that happened on 4.4 but no
      success reproducing that on later kernels (4.19 and 4.20). This is more likely
      a problem in OOM itself.
      
      Link: https://lore.kernel.org/linux-btrfs/20180528054821.9092-1-ethanlien@synology.com/Reported-by: NChris Mason <clm@fb.com>
      CC: stable@vger.kernel.org # 4.18+
      CC: ethanlien <ethanlien@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      58afa74e
    • F
      Btrfs: use nofs context when initializing security xattrs to avoid deadlock · 153206a5
      Filipe Manana 提交于
      commit 827aa18e upstream.
      
      When initializing the security xattrs, we are holding a transaction handle
      therefore we need to use a GFP_NOFS context in order to avoid a deadlock
      with reclaim in case it's triggered.
      
      Fixes: 39a27ec1 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      153206a5
    • F
      Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation · be84e5ef
      Filipe Manana 提交于
      commit 9a6f209e upstream.
      
      If the quota enable and snapshot creation ioctls are called concurrently
      we can get into a deadlock where the task enabling quotas will deadlock
      on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
      twice, or the task creating a snapshot tries to commit the transaction
      while the task enabling quota waits for the former task to commit the
      transaction while holding the mutex. The following time diagrams show how
      both cases happen.
      
      First scenario:
      
                 CPU 0                                    CPU 1
      
       btrfs_ioctl()
        btrfs_ioctl_quota_ctl()
         btrfs_quota_enable()
          mutex_lock(fs_info->qgroup_ioctl_lock)
          btrfs_start_transaction()
      
                                                   btrfs_ioctl()
                                                    btrfs_ioctl_snap_create_v2
                                                     create_snapshot()
                                                      --> adds snapshot to the
                                                          list pending_snapshots
                                                          of the current
                                                          transaction
      
          btrfs_commit_transaction()
           create_pending_snapshots()
             create_pending_snapshot()
              qgroup_account_snapshot()
               btrfs_qgroup_inherit()
      	   mutex_lock(fs_info->qgroup_ioctl_lock)
      	    --> deadlock, mutex already locked
      	        by this task at
      		btrfs_quota_enable()
      
      Second scenario:
      
                 CPU 0                                    CPU 1
      
       btrfs_ioctl()
        btrfs_ioctl_quota_ctl()
         btrfs_quota_enable()
          mutex_lock(fs_info->qgroup_ioctl_lock)
          btrfs_start_transaction()
      
                                                   btrfs_ioctl()
                                                    btrfs_ioctl_snap_create_v2
                                                     create_snapshot()
                                                      --> adds snapshot to the
                                                          list pending_snapshots
                                                          of the current
                                                          transaction
      
                                                      btrfs_commit_transaction()
                                                       --> waits for task at
                                                           CPU 0 to release
                                                           its transaction
                                                           handle
      
          btrfs_commit_transaction()
           --> sees another task started
               the transaction commit first
           --> releases its transaction
               handle
           --> waits for the transaction
               commit to be completed by
               the task at CPU 1
      
                                                       create_pending_snapshot()
                                                        qgroup_account_snapshot()
                                                         btrfs_qgroup_inherit()
                                                          mutex_lock(fs_info->qgroup_ioctl_lock)
                                                           --> deadlock, task at CPU 0
                                                               has the mutex locked but
                                                               it is waiting for us to
                                                               finish the transaction
                                                               commit
      
      So fix this by setting the quota enabled flag in fs_info after committing
      the transaction at btrfs_quota_enable(). This ends up serializing quota
      enable and snapshot creation as if the snapshot creation happened just
      before the quota enable request. The quota rescan task, scheduled after
      committing the transaction in btrfs_quote_enable(), will do the accounting.
      
      Fixes: 6426c7ad ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      be84e5ef
    • F
      Btrfs: fix access to available allocation bits when starting balance · 3ce2bd6b
      Filipe Manana 提交于
      commit 5a8067c0 upstream.
      
      The available allocation bits members from struct btrfs_fs_info are
      protected by a sequence lock, and when starting balance we access them
      incorrectly in two different ways:
      
      1) In the read sequence lock loop at btrfs_balance() we use the values we
         read from fs_info->avail_*_alloc_bits and we can immediately do actions
         that have side effects and can not be undone (printing a message and
         jumping to a label). This is wrong because a retry might be needed, so
         our actions must not have side effects and must be repeatable as long
         as read_seqretry() returns a non-zero value. In other words, we were
         essentially ignoring the sequence lock;
      
      2) Right below the read sequence lock loop, we were reading the values
         from avail_metadata_alloc_bits and avail_data_alloc_bits without any
         protection from concurrent writers, that is, reading them outside of
         the read sequence lock critical section.
      
      So fix this by making sure we only read the available allocation bits
      while in a read sequence lock critical section and that what we do in the
      critical section is repeatable (has nothing that can not be undone) so
      that any eventual retry that is needed is handled properly.
      
      Fixes: de98ced9 ("Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits")
      Fixes: 14506127 ("btrfs: fix a bogus warning when converting only data or metadata")
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      3ce2bd6b
    • T
      ext4: fix special inode number checks in __ext4_iget() · c4ee480a
      Theodore Ts'o 提交于
      commit 191ce178 upstream.
      
      The check for special (reserved) inode number checks in __ext4_iget()
      was broken by commit 8a363970: ("ext4: avoid declaring fs
      inconsistent due to invalid file handles").  This was caused by a
      botched reversal of the sense of the flag now known as
      EXT4_IGET_SPECIAL (when it was previously named EXT4_IGET_NORMAL).
      Fix the logic appropriately.
      
      Fixes: 8a363970 ("ext4: avoid declaring fs inconsistent...")
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      c4ee480a
    • T
      ext4: track writeback errors using the generic tracking infrastructure · 5c64459e
      Theodore Ts'o 提交于
      commit 95cb6713 upstream.
      
      We already using mapping_set_error() in fs/ext4/page_io.c, so all we
      need to do is to use file_check_and_advance_wb_err() when handling
      fsync() requests in ext4_sync_file().
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      5c64459e
    • T
      ext4: use ext4_write_inode() when fsyncing w/o a journal · e5f41b40
      Theodore Ts'o 提交于
      commit ad211f3e upstream.
      
      In no-journal mode, we previously used __generic_file_fsync() in
      no-journal mode.  This triggers a lockdep warning, and in addition,
      it's not safe to depend on the inode writeback mechanism in the case
      ext4.  We can solve both problems by calling ext4_write_inode()
      directly.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      e5f41b40
    • T
      ext4: avoid kernel warning when writing the superblock to a dead device · 69e335a1
      Theodore Ts'o 提交于
      commit e8680786 upstream.
      
      The xfstests generic/475 test switches the underlying device with
      dm-error while running a stress test.  This results in a large number
      of file system errors, and since we can't lock the buffer head when
      marking the superblock dirty in the ext4_grp_locked_error() case, it's
      possible the superblock to be !buffer_uptodate() without
      buffer_write_io_error() being true.
      
      We need to set buffer_uptodate() before we call mark_buffer_dirty() or
      this will trigger a WARN_ON.  It's safe to do this since the
      superblock must have been properly read into memory or the mount would
      have been successful.  So if buffer_uptodate() is not set, we can
      safely assume that this happened due to a failed attempt to write the
      superblock.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      69e335a1
    • T
      ext4: fix a potential fiemap/page fault deadlock w/ inline_data · 2be75a5e
      Theodore Ts'o 提交于
      commit 2b08b1f1 upstream.
      
      The ext4_inline_data_fiemap() function calls fiemap_fill_next_extent()
      while still holding the xattr semaphore.  This is not necessary and it
      triggers a circular lockdep warning.  This is because
      fiemap_fill_next_extent() could trigger a page fault when it writes
      into page which triggers a page fault.  If that page is mmaped from
      the inline file in question, this could very well result in a
      deadlock.
      
      This problem can be reproduced using generic/519 with a file system
      configuration which has the inline_data feature enabled.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      2be75a5e
    • T
      ext4: make sure enough credits are reserved for dioread_nolock writes · 79fba104
      Theodore Ts'o 提交于
      commit 812c0cab upstream.
      
      There are enough credits reserved for most dioread_nolock writes;
      however, if the extent tree is sufficiently deep, and/or quota is
      enabled, the code was not allowing for all eventualities when
      reserving journal credits for the unwritten extent conversion.
      
      This problem can be seen using xfstests ext4/034:
      
         WARNING: CPU: 1 PID: 257 at fs/ext4/ext4_jbd2.c:271 __ext4_handle_dirty_metadata+0x10c/0x180
         Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
         RIP: 0010:__ext4_handle_dirty_metadata+0x10c/0x180
         	...
         EXT4-fs: ext4_free_blocks:4938: aborting transaction: error 28 in __ext4_handle_dirty_metadata
         EXT4: jbd2_journal_dirty_metadata failed: handle type 11 started at line 4921, credits 4/0, errcode -28
         EXT4-fs error (device dm-1) in ext4_free_blocks:4950: error 28
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      79fba104
    • R
      cifs: Fix potential OOB access of lock element array · 954a3722
      Ross Lagerwall 提交于
      commit b9a74cde upstream.
      
      If maxBuf is small but non-zero, it could result in a zero sized lock
      element array which we would then try and access OOB.
      Signed-off-by: NRoss Lagerwall <ross.lagerwall@citrix.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      954a3722
    • P
      CIFS: Fix credit computation for compounded requests · 1c2fc93f
      Pavel Shilovsky 提交于
      commit 8544f4aa upstream.
      
      In SMB3 protocol every part of the compound chain consumes credits
      individually, so we need to call wait_for_free_credits() for each
      of the PDUs in the chain. If an operation is interrupted, we must
      ensure we return all credits taken from the server structure back.
      
      Without this patch server can sometimes disconnect the session
      due to credit mismatches, especially when first operation(s)
      are large writes.
      Signed-off-by: NPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      1c2fc93f
    • P
      CIFS: Do not hide EINTR after sending network packets · b6420c40
      Pavel Shilovsky 提交于
      commit ee13919c upstream.
      
      Currently we hide EINTR code returned from sock_sendmsg()
      and return 0 instead. This makes a caller think that we
      successfully completed the network operation which is not
      true. Fix this by properly returning EINTR to callers.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NPavel Shilovsky <pshilov@microsoft.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      b6420c40
    • P
      CIFS: Do not set credits to 1 if the server didn't grant anything · 684fbf95
      Pavel Shilovsky 提交于
      commit 33fa5c8b upstream.
      
      Currently we reset the number of total credits granted by the server
      to 1 if the server didn't grant us anything int the response. This
      violates the SMB3 protocol - we need to trust the server and use
      the credit values from the response. Fix this by removing the
      corresponding code.
      Signed-off-by: NPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      684fbf95
    • P
      CIFS: Fix adjustment of credits for MTU requests · ec4a20fc
      Pavel Shilovsky 提交于
      commit b983f7e9 upstream.
      
      Currently for MTU requests we allocate maximum possible credits
      in advance and then adjust them according to the request size.
      While we were adjusting the number of credits belonging to the
      server, we were skipping adjustment of credits belonging to the
      request. This patch fixes it by setting request credits to
      CreditCharge field value of SMB2 packet header.
      
      Also ask 1 credit more for async read and write operations to
      increase parallelism and match the behavior of other operations.
      Signed-off-by: NPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      ec4a20fc
    • F
      Btrfs: fix deadlock when using free space tree due to block group creation · 173b71f6
      Filipe Manana 提交于
      commit a6d8654d upstream.
      
      When modifying the free space tree we can end up COWing one of its extent
      buffers which in turn might result in allocating a new chunk, which in
      turn can result in flushing (finish creation) of pending block groups. If
      that happens we can deadlock because creating a pending block group needs
      to update the free space tree, and if any of the updates tries to modify
      the same extent buffer that we are COWing, we end up in a deadlock since
      we try to write lock twice the same extent buffer.
      
      So fix this by skipping pending block group creation if we are COWing an
      extent buffer from the free space tree. This is a case missed by commit
      5ce55557 ("Btrfs: fix deadlock when writing out free space caches").
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202173
      Fixes: 5ce55557 ("Btrfs: fix deadlock when writing out free space caches")
      CC: stable@vger.kernel.org # 4.18+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      173b71f6
    • D
      fs: initialize resize_wait wait queue of init task · bb9fc8a8
      Daeho Jeong 提交于
      mainline inclusion
      from next
      commit: https://patchwork.kernel.org/patch/9785023/
      category: bugfix
      bugzilla: 5503
      CVE: NA
      
      -----------------------------------------
      
      We don't initialize resize_wait of init task now and all the kernel
      threads share this uninitialized resize_wait wait queue because they
      are sharing the file table of init task. Therefore, when expanding
      this file table shared by the kernel threads, we encounter kernel panic
      by accessing the NULL resize_wait wait queue.
      Signed-off-by: NDaeho Jeong <daeho.jeong@samsung.com>
      Tested-by: NYoungjin Gil <youngjin.gil@samsung.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Fixes: 8a81252b ("fs/file.c: don't acquire files->file_lock in
      fd_install()")
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: NWei Fang <fangwei1@huawei.com>
      bb9fc8a8
    • Y
      ceph: don't update importing cap's mseq when handing cap export · 9f443389
      Yan, Zheng 提交于
      commit 3c1392d4c49962a31874af14ae9ff289cb2b3851 upstream.
      
      Updating mseq makes client think importer mds has accepted all prior
      cap messages and importer mds knows what caps client wants. Actually
      some cap messages may have been dropped because of mseq mismatch.
      
      If mseq is left untouched, importing cap's mds_wanted later will get
      reset by cap import message.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      9f443389
    • J
      nfsd4: zero-length WRITE should succeed · 0e1f65c6
      J. Bruce Fields 提交于
      commit fdec6114 upstream.
      
      Zero-length writes are legal; from 5661 section 18.32.3: "If the count
      is zero, the WRITE will succeed and return a count of zero subject to
      permissions checking".
      
      This check is unnecessary and is causing zero-length reads to return
      EINVAL.
      
      Cc: stable@vger.kernel.org
      Fixes: 3fd9557a "NFSD: Refactor the generic write vector fill helper"
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      0e1f65c6
    • B
      lockd: Show pid of lockd for remote locks · 08e3c659
      Benjamin Coddington 提交于
      commit b8eee0e9 upstream.
      
      Commit 9d5b86ac ("fs/locks: Remove fl_nspid and use fs-specific l_pid
      for remote locks") specified that the l_pid returned for F_GETLK on a local
      file that has a remote lock should be the pid of the lock manager process.
      That commit, while updating other filesystems, failed to update lockd, such
      that locks created by lockd had their fl_pid set to that of the remote
      process holding the lock.  Fix that here to be the pid of lockd.
      
      Also, fix the client case so that the returned lock pid is negative, which
      indicates a remote lock on a remote file.
      
      Fixes: 9d5b86ac ("fs/locks: Remove fl_nspid and use fs-specific...")
      Cc: stable@vger.kernel.org
      Signed-off-by: NBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      08e3c659
    • A
      gfs2: Fix loop in gfs2_rbm_find · 6c80a5f6
      Andreas Gruenbacher 提交于
      commit 2d29f6b9 upstream.
      
      Fix the resource group wrap-around logic in gfs2_rbm_find that commit
      e579ed4f broke.  The bug can lead to unnecessary repeated scanning of the
      same bitmaps; there is a risk that future changes will turn this into an
      endless loop.
      
      Fixes: e579ed4f ("GFS2: Introduce rbm field bii")
      Cc: stable@vger.kernel.org # v3.13+
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      6c80a5f6
    • A
      gfs2: Get rid of potential double-freeing in gfs2_create_inode · b43fd451
      Andreas Gruenbacher 提交于
      commit 6ff9b09e upstream.
      
      In gfs2_create_inode, after setting and releasing the acl / default_acl, the
      acl / default_acl pointers are not set to NULL as they should be.  In that
      state, when the function reaches label fail_free_acls, gfs2_create_inode will
      try to release the same acls again.
      
      Fix that by setting the pointers to NULL after releasing the acls.  Slightly
      simplify the logic.  Also, posix_acl_release checks for NULL already, so
      there is no need to duplicate those checks here.
      
      Fixes: e01580bf ("gfs2: use generic posix ACL infrastructure")
      Reported-by: NPan Bian <bianpan2016@163.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org # v4.9+
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      b43fd451
    • V
      dlm: memory leaks on error path in dlm_user_request() · 3a16e271
      Vasily Averin 提交于
      commit d47b41ac upstream.
      
      According to comment in dlm_user_request() ua should be freed
      in dlm_free_lkb() after successful attach to lkb.
      
      However ua is attached to lkb not in set_lock_args() but later,
      inside request_lock().
      
      Fixes 597d0cae ("[DLM] dlm: user locks")
      Cc: stable@kernel.org # 2.6.19
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NDavid Teigland <teigland@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      3a16e271
    • V
      dlm: lost put_lkb on error path in receive_convert() and receive_unlock() · 21fa9eb2
      Vasily Averin 提交于
      commit c0174726 upstream.
      
      Fixes 6d40c4a7 ("dlm: improve error and debug messages")
      Cc: stable@kernel.org # 3.5
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NDavid Teigland <teigland@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      21fa9eb2
    • V
      dlm: possible memory leak on error path in create_lkb() · 127fb2fd
      Vasily Averin 提交于
      commit 23851e97 upstream.
      
      Fixes 3d6aa675 ("dlm: keep lkbs in idr")
      Cc: stable@kernel.org # 3.1
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NDavid Teigland <teigland@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      127fb2fd
    • V
      dlm: fixed memory leaks after failed ls_remove_names allocation · 50d6d0f7
      Vasily Averin 提交于
      commit b982896c upstream.
      
      If allocation fails on last elements of array need to free already
      allocated elements.
      
      v2: just move existing out_rsbtbl label to right place
      
      Fixes 789924ba635f ("dlm: fix race between remove and lookup")
      Cc: stable@kernel.org # 3.6
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NDavid Teigland <teigland@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      50d6d0f7
    • P
      smb3: fix large reads on encrypted connections · 47dc569a
      Paul Aurich 提交于
      commit 6d2f84eee098540ae857998fe32f29b9e2cd9613 upstream.
      
      When passing a large read to receive_encrypted_read(), ensure that the
      demultiplex_thread knows that a MID was processed.  Without this, those
      operations never complete.
      
      This is a similar issue/fix to lease break handling:
      commit 7af929d6
      ("smb3: fix lease break problem introduced by compounding")
      
      CC: Stable <stable@vger.kernel.org> # 4.19+
      Fixes: b24df3e3 ("cifs: update receive_encrypted_standard to handle compounded responses")
      Signed-off-by: NPaul Aurich <paul@darkrain42.org>
      Tested-by: NYves-Alexis Perez <corsac@corsac.net>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Reviewed-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      47dc569a