1. 31 Jul, 2021 3 commits
    • L
      pipe: make pipe writes always wake up readers · 3a34b13a
      Linus Torvalds committed
      Since commit 1b6b26ae ("pipe: fix and clarify pipe write wakeup
      logic") we have sanitized the pipe write logic, and would only try to
      wake up readers if they needed it.
      
      In particular, if the pipe already had data in it before the write,
      there was no point in trying to wake up a reader, since any existing
      readers must have been aware of the pre-existing data already.  Doing
      extraneous wakeups will only cause potential thundering herd problems.
      
      However, it turns out that some Android libraries have misused the EPOLL
      interface, and expected "edge triggered" to be "any new write will
      trigger it".  Even if there was no edge in sight.
      
      Quoting Sandeep Patil:
       "The commit 1b6b26ae ('pipe: fix and clarify pipe write wakeup
        logic') changed pipe write logic to wakeup readers only if the pipe
        was empty at the time of write. However, there are libraries that
        relied upon the older behavior for notification scheme similar to
        what's described in [1]
      
        One such library 'realm-core'[2] is used by numerous Android
        applications. The library uses a similar notification mechanism as GNU
        Make but it never drains the pipe until it is full. When Android moved
        to v5.10 kernel, all applications using this library stopped working.
      
        The library has since been fixed[3] but it will be a while before all
        applications incorporate the updated library"
      
      Our regression rule for the kernel is that if applications break from
      new behavior, it's a regression, even if it was because the application
      did something patently wrong.  Also note the original report [4] by
      Michael Kerrisk about a test for this epoll behavior - but at that point
      we didn't know of any actual broken use case.
      
      So add the extraneous wakeup, to approximate the old behavior.
      
      [ I say "approximate", because the exact old behavior was to do a wakeup
        not for each write(), but for each pipe buffer chunk that was filled
        in. The behavior introduced by this change is not that - this is just
        "every write will cause a wakeup, whether necessary or not", which
        seems to be sufficient for the broken library use. ]
      
      It's worth noting that this adds the extraneous wakeup only for the
      write side, while the read side still considers the "edge" to be purely
      about reading enough from the pipe to allow further writes.
      
      See commit f467a6a6 ("pipe: fix and clarify pipe read wakeup logic")
      for the pipe read case, which remains that "only wake up if the pipe was
      full, and we read something from it".
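The user-visible difference can be probed from user space. The following is a minimal sketch, not part of the commit: it registers the read end of a pipe with edge-triggered epoll, writes several times without ever reading, and counts how many writes are followed by a reported event. The helper name `count_edge_events` is made up for this illustration, and the script assumes Linux (Python's `select.epoll`). The first write always produces an event because the pipe was empty; whether later writes do depends on the kernel's wakeup policy described above, so no specific count is claimed here.

```python
import os
import select


def count_edge_events(num_writes=2):
    """Write to a pipe num_writes times without reading and count how many
    writes are followed by an edge-triggered epoll report."""
    read_fd, write_fd = os.pipe()
    ep = select.epoll()
    try:
        # Edge-triggered registration on the read end of the pipe.
        ep.register(read_fd, select.EPOLLIN | select.EPOLLET)
        reported = 0
        for _ in range(num_writes):
            os.write(write_fd, b"x")
            # Timeout 0: do not block, only collect events already pending.
            if ep.poll(0):
                reported += 1
        return reported
    finally:
        ep.close()
        os.close(read_fd)
        os.close(write_fd)
```

Calling `count_edge_events(2)` on two different kernels is a quick way to see which wakeup policy is in effect: a result equal to the number of writes means every write triggered a wakeup.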
      
      Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
      Link: https://github.com/realm/realm-core [2]
      Link: https://github.com/realm/realm-core/issues/4666 [3]
      Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
      Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
      Reported-by: Sandeep Patil <sspatil@android.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a34b13a
    • J
      ocfs2: issue zeroout to EOF blocks · 9449ad33
      Junxiao Bi committed
      When punching holes in EOF blocks, fallocate used a buffered write to
      zero the EOF blocks in the last cluster.  But since ->writepage ignores
      EOF pages, those zeros are never flushed.
      
      This "looks" ok, because commit 6bba4471 ("ocfs2: fix data corruption by
      fallocate") zeroes the EOF blocks when the file size is extended, but it
      isn't.  The problem lies in those EOF pages.  Before writeback, the
      pages had the DIRTY flag set, and so did every buffer_head in them.
      When writeback ran via write_cache_pages(), the DIRTY flag on the page
      was cleared, but the DIRTY flag on the buffer_head was not.
      
      When the next write hit those EOF pages, the buffer_head already had
      the DIRTY flag set, so the page was not marked DIRTY again.  That made
      writeback ignore them forever, which causes data corruption.  Even a
      direct I/O write cannot help: it fails when trying to drop the page
      cache before the direct I/O, because the buffer_head for those pages
      still has the DIRTY flag set, and it then falls back to buffered I/O
      mode.
      
      To summarize the issue: because writeback ignores EOF pages, once an
      EOF page is generated, any write to it only goes to the page cache.  It
      is never flushed to disk, even after the file size is extended and the
      page is no longer an EOF page.  The fix is to avoid zeroing EOF blocks
      with a buffered write.
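The flag mismatch described above can be reproduced with a toy model. This is only a sketch: the `Page` class and the three helper functions are invented for illustration and reduce a page to a single buffer_head. The one rule taken from the description is that a write marks the page dirty only when the buffer was not already dirty.

```python
class Page:
    """Toy page with a single buffer_head; each has its own dirty flag."""

    def __init__(self):
        self.page_dirty = False
        self.buffer_dirty = False


def buffered_write(page):
    # The page is only marked dirty when the buffer goes from clean to dirty.
    if not page.buffer_dirty:
        page.buffer_dirty = True
        page.page_dirty = True


def writeback_skipping_eof(page):
    """Writeback of a page beyond EOF: the page flag is cleared, nothing is
    written, and the buffer flag is left set."""
    page.page_dirty = False
    return False


def writeback(page):
    """Normal writeback: only pages flagged dirty are flushed."""
    if not page.page_dirty:
        return False
    page.page_dirty = False
    page.buffer_dirty = False
    return True


def demo():
    page = Page()
    buffered_write(page)          # zeroing of the EOF block
    writeback_skipping_eof(page)  # EOF page ignored by writeback
    buffered_write(page)          # later write, after the file has grown
    return writeback(page)        # was the later write flushed?
```

In this model `demo()` returns `False`: the second write finds the buffer already dirty, never re-dirties the page, and writeback skips it.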
      
      The following strace excerpt from qemu-img shows a sequence that could
      trigger the corruption.
      
        656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
        ...
        660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
        660   fallocate(11, 0, 2275868672, 327680) = 0
        658   pwrite64(11, "
      
      Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9449ad33
    • J
      ocfs2: fix zero out valid data · f267aeb6
      Junxiao Bi committed
      If the append-dio feature is enabled, a direct I/O write and fallocate
      can run in parallel to extend the file size.  fallocate uses
      "orig_isize" to record i_size before taking "ip_alloc_sem".  By the
      time ocfs2_zeroout_partial_cluster() zeroes out the EOF blocks, i_size
      may already have been extended by ocfs2_dio_end_io_write(), so valid
      data gets zeroed out.
      
      Link: https://lkml.kernel.org/r/20210722054923.24389-1-junxiao.bi@oracle.com
      Fixes: 6bba4471 ("ocfs2: fix data corruption by fallocate")
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f267aeb6
  2. 29 Jul, 2021 5 commits
    • D
      btrfs: calculate number of eb pages properly in csum_tree_block · 7280305e
      David Sterba committed
      Building with -Warray-bounds on systems with 64K pages there's a
      warning:
      
        fs/btrfs/disk-io.c: In function ‘csum_tree_block’:
        fs/btrfs/disk-io.c:226:34: warning: array subscript 1 is above array bounds of ‘struct page *[1]’ [-Warray-bounds]
          226 |   kaddr = page_address(buf->pages[i]);
              |                        ~~~~~~~~~~^~~
        ./include/linux/mm.h:1630:48: note: in definition of macro ‘page_address’
         1630 | #define page_address(page) lowmem_page_address(page)
              |                                                ^~~~
        In file included from fs/btrfs/ctree.h:32,
                         from fs/btrfs/disk-io.c:23:
        fs/btrfs/extent_io.h:98:15: note: while referencing ‘pages’
           98 |  struct page *pages[1];
              |               ^~~~~
      
      The compiler has no way to know that in that case the nodesize is exactly
      PAGE_SIZE, so the resulting number of pages will be correct (1).
      
      Let's use num_extent_pages that makes the case nodesize == PAGE_SIZE
      explicitly 1.
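As a rough illustration of the arithmetic, the page count can be written as a shift with a floor of one. This is a sketch of the idea only: the Python function below reuses the helper's name but is not the kernel code, and the page shifts in the usage note (12 for 4K pages, 16 for 64K pages) are example values.

```python
def num_extent_pages(eb_len, page_shift):
    """Number of pages spanned by an extent buffer of eb_len bytes,
    never less than one."""
    return (eb_len >> page_shift) or 1
```

With a 16K nodesize this gives 4 pages on a 4K-page system (`num_extent_pages(16384, 12)`) and 1 page on a 64K-page system (`num_extent_pages(16384, 16)` and `num_extent_pages(65536, 16)`), so an array of one page pointer is never indexed past element 0 in the 64K case.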
      Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7280305e
    • R
      cifs: add missing parsing of backupuid · b946dbcf
      Ronnie Sahlberg committed
      We lost parsing of backupuid in the switch to new mount API.
      Add it back.
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
      Cc: <stable@vger.kernel.org> # v5.11+
      Reported-by: Xiaoli Feng <xifeng@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      b946dbcf
    • D
      btrfs: fix rw device counting in __btrfs_free_extra_devids · b2a61667
      Desmond Cheong Zhi Xi committed
      When removing a writeable device in __btrfs_free_extra_devids, the rw
      device count should be decremented.
      
      This error was caught by Syzbot which reported a warning in
      close_fs_devices:
      
        WARNING: CPU: 1 PID: 9355 at fs/btrfs/volumes.c:1168 close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
        Modules linked in:
        CPU: 0 PID: 9355 Comm: syz-executor552 Not tainted 5.13.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
        RSP: 0018:ffffc9000333f2f0 EFLAGS: 00010293
        RAX: ffffffff8365f5c3 RBX: 0000000000000001 RCX: ffff888029afd4c0
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
        RBP: ffff88802846f508 R08: ffffffff8365f525 R09: ffffed100337d128
        R10: ffffed100337d128 R11: 0000000000000000 R12: dffffc0000000000
        R13: ffff888019be8868 R14: 1ffff1100337d10d R15: 1ffff1100337d10a
        FS:  00007f6f53828700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000000000047c410 CR3: 00000000302a6000 CR4: 00000000001506f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_close_devices+0xc9/0x450 fs/btrfs/volumes.c:1180
         open_ctree+0x8e1/0x3968 fs/btrfs/disk-io.c:3693
         btrfs_fill_super fs/btrfs/super.c:1382 [inline]
         btrfs_mount_root+0xac5/0xc60 fs/btrfs/super.c:1749
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x86/0x270 fs/super.c:1498
         fc_mount fs/namespace.c:993 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1023
         btrfs_mount+0x3d3/0xb50 fs/btrfs/super.c:1809
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x86/0x270 fs/super.c:1498
         do_new_mount fs/namespace.c:2905 [inline]
         path_mount+0x196f/0x2be0 fs/namespace.c:3235
         do_mount fs/namespace.c:3248 [inline]
         __do_sys_mount fs/namespace.c:3456 [inline]
         __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
         do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The warning fired because fs_devices->rw_devices was not 0 after
      closing all devices. Here is the call trace that was observed:
      
        btrfs_mount_root():
          btrfs_scan_one_device():
            device_list_add();   <---------------- device added
          btrfs_open_devices():
            open_fs_devices():
              btrfs_open_one_device();   <-------- writable device opened,
                                                   rw device count ++
          btrfs_fill_super():
            open_ctree():
              btrfs_free_extra_devids():
                __btrfs_free_extra_devids();  <--- writable device removed,
                                                   rw device count not decremented
                fail_tree_roots:
                  btrfs_close_devices():
                    close_fs_devices();   <------- rw device count off by 1
      
      As a note, prior to commit cf89af14 ("btrfs: dev-replace: fail
      mount if we don't have replace item with target device"), rw_devices
      was decremented on removing a writable device in
      __btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit
      was not set for the device. However, this check does not need to be
      reinstated as it is now redundant and incorrect.
      
      In __btrfs_free_extra_devids, we skip removing the device if it is the
      target for replacement. This is done by checking whether device->devid
      == BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set
      only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices
      should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check,
      and so it's redundant to test for that bit.
      
      Additionally, following commit 82372bc8 ("Btrfs: make
      the logic of source device removing more clear"), rw_devices is
      incremented whenever a writeable device is added to the alloc
      list (including the target device in btrfs_dev_replace_finishing), so
      all removals of writable devices from the alloc list should also be
      accompanied by a decrement to rw_devices.
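The accounting invariant can be shown with a small model. This is a sketch under stated assumptions: the `FsDevices` class, its methods, and the `fixed` switch are hypothetical and only mirror the counting described above, not the real btrfs structures.

```python
class FsDevices:
    """Toy device list that tracks how many writeable devices it holds."""

    def __init__(self):
        self.devices = []   # each entry: {"devid": int, "writeable": bool}
        self.rw_devices = 0

    def open_device(self, devid, writeable):
        self.devices.append({"devid": devid, "writeable": writeable})
        if writeable:
            self.rw_devices += 1

    def free_extra_devids(self, wanted, fixed=True):
        """Drop devices whose devid is not wanted. With fixed=False the
        counter is not decremented, which models the bug."""
        kept = []
        for dev in self.devices:
            if dev["devid"] in wanted:
                kept.append(dev)
            elif dev["writeable"] and fixed:
                self.rw_devices -= 1
        self.devices = kept

    def close_all(self):
        """Close the remaining devices and return the leftover rw count."""
        for dev in self.devices:
            if dev["writeable"]:
                self.rw_devices -= 1
        self.devices = []
        return self.rw_devices


def leftover_rw_count(fixed):
    fs = FsDevices()
    fs.open_device(1, writeable=True)
    fs.open_device(2, writeable=True)   # the extra device
    fs.free_extra_devids(wanted={1}, fixed=fixed)
    return fs.close_all()
```

`leftover_rw_count(fixed=False)` leaves the counter at 1 after all devices are closed, which is the off-by-one that the warning in close_fs_devices checks for, while `leftover_rw_count(fixed=True)` ends at 0.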
      
      Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
      Fixes: cf89af14 ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
      CC: stable@vger.kernel.org # 5.10+
      Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b2a61667
    • F
      btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction · ecc64fab
      Filipe Manana committed
      When checking if we need to log the new name of a renamed inode, we are
      checking if the inode and its parent inode have been logged before, and if
      not we don't log the new name. The check however is buggy, as it directly
      compares the logged_trans field of the inodes versus the ID of the current
      transaction. The problem is that logged_trans is a transient field, only
      stored in memory and never persisted in the inode item, so if an inode
      was logged before, evicted and reloaded, its logged_trans field is set to
      a value of 0, meaning the check will return false and the new name of the
      renamed inode is not logged. If the old parent directory was previously
      fsynced and we deleted the logged directory entries corresponding to the
      old name, we end up with a log that when replayed will delete the renamed
      inode.
      
      The following example triggers the problem:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ mkdir /mnt/A
        $ mkdir /mnt/B
        $ echo -n "hello world" > /mnt/A/foo
      
        $ sync
      
        # Add some new file to A and fsync directory A.
        $ touch /mnt/A/bar
        $ xfs_io -c "fsync" /mnt/A
      
        # Now trigger inode eviction. We are only interested in triggering
        # eviction for the inode of directory A.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # Move foo from directory A to directory B.
        # This deletes the directory entries for foo in A from the log, and
        # does not add the new name for foo in directory B to the log, because
        # logged_trans of A is 0, which is less than the current transaction ID.
        $ mv /mnt/A/foo /mnt/B/foo
      
        # Now make an fsync to anything except A, B or any file inside them,
        # like for example create a file at the root directory and fsync this
        # new file. This syncs the log that contains all the changes done by
        # previous rename operation.
        $ touch /mnt/baz
        $ xfs_io -c "fsync" /mnt/baz
      
        <power fail>
      
        # Mount the filesystem and replay the log.
        $ mount /dev/sdc /mnt
      
        # Check the filesystem content.
        $ ls -1R /mnt
        /mnt/:
        A
        B
        baz
      
        /mnt/A:
        bar
      
        /mnt/B:
        $
      
        # File foo is gone, it's neither in A/ nor in B/.
      
      Fix this by using the inode_logged() helper at btrfs_log_new_name(), which
      safely checks if an inode was logged before in the current transaction.
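The difference between the two checks can be modelled in a few lines. This is a hedged sketch: the `Inode` dataclass and both functions are illustrative stand-ins. It assumes that a reloaded inode still knows the transaction that last modified it (here `last_trans`), and that a safe check treats "logged_trans is 0 but the inode was modified in this transaction" as "possibly logged". Any further conditions the real helper may apply are left out.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    logged_trans: int = 0  # in memory only, back to 0 after a reload
    last_trans: int = 0    # transaction that last modified the inode


def naive_check(inode, transid):
    """Buggy check: compare the transient field directly."""
    return inode.logged_trans == transid


def inode_logged(inode, transid):
    """Conservative check: when unsure after an eviction, assume logged."""
    if inode.logged_trans == transid:
        return True
    return inode.logged_trans == 0 and inode.last_trans == transid


def evict_and_reload(inode):
    """Eviction loses logged_trans; last_trans is assumed to survive."""
    return Inode(logged_trans=0, last_trans=inode.last_trans)
```

For a directory logged in transaction 7 and then evicted, `naive_check(evict_and_reload(Inode(7, 7)), 7)` is `False` while `inode_logged(...)` on the same inode is `True`, so the new name would be logged.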
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ecc64fab
    • G
      btrfs: mark compressed range uptodate only if all bio succeed · 240246f6
      Goldwyn Rodrigues committed
      In compression write endio sequence, the range which the compressed_bio
      writes is marked as uptodate if the last bio of the compressed (sub)bios
      is completed successfully. There could be previous bio which may
      have failed which is recorded in cb->errors.
      
      Set the writeback range as uptodate only if cb->errors is zero, as opposed
      to checking only the last bio's status.
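A small sketch makes the difference between the two rules concrete. The function names are invented for this example, and each sub-bio is represented only by its completion status, with 0 meaning success.

```python
def range_uptodate_last_only(bio_statuses):
    """Old rule: trust the status of the final sub-bio only."""
    return bio_statuses[-1] == 0


def range_uptodate_all(bio_statuses):
    """New rule: count the failures across all sub-bios, as an error
    counter would, and require that count to be zero."""
    errors = sum(1 for status in bio_statuses if status != 0)
    return errors == 0
```

For statuses `[0, -5, 0]` the first function reports the range as uptodate even though the middle sub-bio failed, while the second does not.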
      
      Backporting notes: in all versions up to 4.4 the last argument is always
      replaced by "!cb->errors".
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      240246f6
  3. 28 Jul, 2021 4 commits
  4. 27 Jul, 2021 3 commits
  5. 26 Jul, 2021 2 commits
  6. 24 Jul, 2021 5 commits
    • M
      hugetlbfs: fix mount mode command line processing · e0f7e2b2
      Mike Kravetz committed
      In commit 32021982 ("hugetlbfs: Convert to fs_context") processing
      of the mount mode string was changed from match_octal() to fsparam_u32.
      
      This changed existing behavior as match_octal does not require octal
      values to have a '0' prefix, but fsparam_u32 does.
      
      Use fsparam_u32oct which provides the same behavior as match_octal.
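The behavior change is easy to see with two toy parsers. This is a sketch, not the kernel parsers: the function names are made up, and the second one assumes the usual C convention for prefix-based base detection (0x for hex, a leading 0 for octal, decimal otherwise).

```python
def parse_always_octal(text):
    """Treat the string as base 8, with or without a leading '0'."""
    return int(text, 8)


def parse_with_prefix_rules(text):
    """Pick the base from the prefix: 0x -> 16, leading 0 -> 8, else 10."""
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)
    if len(lowered) > 1 and lowered.startswith("0"):
        return int(lowered[1:], 8)
    return int(lowered, 10)
```

A mount option of `mode=755` parses to 0o755 (493) under the always-octal rule but to decimal 755 under the prefix rule; `mode=0755` gives 493 under both.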
      
      Link: https://lkml.kernel.org/r/20210721183326.102716-1-mike.kravetz@oracle.com
      Fixes: 32021982 ("hugetlbfs: Convert to fs_context")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Dennis Camera <bugs+kernel.org@dtnr.ch>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0f7e2b2
    • R
      writeback, cgroup: do not reparent dax inodes · 593311e8
      Roman Gushchin committed
      The inode switching code is not suited for dax inodes.  An attempt to
      switch a dax inode to a parent writeback structure (as a part of a
      writeback cleanup procedure) results in a panic like this:
      
        run fstests generic/270 at 2021-07-15 05:54:02
        XFS (pmem0p2): EXPERIMENTAL big timestamp feature in use.  Use at your own risk!
        XFS (pmem0p2): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
        XFS (pmem0p2): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
        XFS (pmem0p2): Mounting V5 Filesystem
        XFS (pmem0p2): Ending clean mount
        XFS (pmem0p2): Quotacheck needed: Please wait.
        XFS (pmem0p2): Quotacheck: Done.
        XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
        XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
        XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
        BUG: unable to handle page fault for address: 0000000005b0f669
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 13 PID: 10479 Comm: kworker/13:16 Not tainted 5.14.0-rc1-master-8096acd7+ #8
        Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016
        Workqueue: inode_switch_wbs inode_switch_wbs_work_fn
        RIP: 0010:inode_do_switch_wbs+0xaf/0x470
        Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
        RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
        RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
        RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
        RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
        R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
        R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
        FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
        Call Trace:
         inode_switch_wbs_work_fn+0xb6/0x2a0
         process_one_work+0x1e6/0x380
         worker_thread+0x53/0x3d0
         kthread+0x10f/0x130
         ret_from_fork+0x22/0x30
        Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge stp llc rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm mgag200 i2c_algo_bit iTCO_wdt irqbypass drm_kms_helper iTCO_vendor_support acpi_ipmi rapl syscopyarea sysfillrect intel_cstate ipmi_si sysimgblt ioatdma dax_pmem_compat fb_sys_fops ipmi_devintf device_dax i2c_i801 pcspkr intel_uncore hpilo nd_pmem cec dax_pmem_core dca i2c_smbus acpi_tad lpc_ich ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel tg3 ghash_clmulni_intel serio_raw hpsa hpwdt scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod
        CR2: 0000000005b0f669
        ---[ end trace ed2105faff8384f3 ]---
        RIP: 0010:inode_do_switch_wbs+0xaf/0x470
        Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
        RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
        RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
        RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
        RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
        R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
        R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
        FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
        Kernel panic - not syncing: Fatal exception
        Kernel Offset: 0x15200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
        ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      The crash happens on an attempt to iterate over attached pagecache pages
      and check the dirty flag: a dax inode's xarray contains pfn's instead of
      generic struct page pointers.
      
      This happens for DAX and not for other kinds of non-page entries in the
      inodes because it's a tagged iteration, and shadow/swap entries are
      never tagged; only DAX entries get tagged.
      
      Fix the problem by bailing out (with the false return value) of
      inode_prepare_wbs_switch() if a dax inode is passed.
      
      [willy@infradead.org: changelog addition]
      
      Link: https://lkml.kernel.org/r/20210719171350.3876830-1-guro@fb.com
      Fixes: c22d70a1 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
      Reported-by: Darrick J. Wong <djwong@kernel.org>
      Tested-by: Darrick J. Wong <djwong@kernel.org>
      Tested-by: Murphy Zhou <jencce.kernel@gmail.com>
      Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      593311e8
    • P
      userfaultfd: do not untag user pointers · e71e2ace
      Peter Collingbourne committed
      Patch series "userfaultfd: do not untag user pointers", v5.
      
      If a user program uses userfaultfd on ranges of heap memory, it may end
      up passing a tagged pointer to the kernel in the range.start field of
      the UFFDIO_REGISTER ioctl.  This can happen when using an MTE-capable
      allocator, or on Android if using the Tagged Pointers feature for MTE
      readiness [1].
      
      When a fault subsequently occurs, the tag is stripped from the fault
      address returned to the application in the fault.address field of struct
      uffd_msg.  However, from the application's perspective, the tagged
      address *is* the memory address, so if the application is unaware of
      memory tags, it may get confused by receiving an address that is, from
      its point of view, outside of the bounds of the allocation.  We observed
      this behavior in the kselftest for userfaultfd [2] but other
      applications could have the same problem.
      
      Address this by not untagging pointers passed to the userfaultfd ioctls.
      Instead, let the system call fail.  Also change the kselftest to use
      mmap so that it doesn't encounter this problem.
      
      [1] https://source.android.com/devices/tech/debug/tagged-pointers
      [2] tools/testing/selftests/vm/userfaultfd.c
      
      This patch (of 2):
      
      Do not untag pointers passed to the userfaultfd ioctls.  Instead, let
      the system call fail.  This will provide an early indication of problems
      with tag-unaware userspace code instead of letting the code get confused
      later, and is consistent with how we decided to handle brk/mmap/mremap
      in commit dcde2373 ("mm: Avoid creating virtual address aliases in
      brk()/mmap()/mremap()"), as well as being consistent with the existing
      tagged address ABI documentation relating to how ioctl arguments are
      handled.
      
      The code change is a revert of commit 7d032574 ("userfaultfd: untag
      user pointers") plus some fixups to some additional calls to
      validate_range that have appeared since then.
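To make the tagged-address problem concrete, here is a simplified model of a pointer with a tag in its top byte. All constants and helpers are assumptions for illustration: a 48-bit user address space, a tag in bits 56 to 63, and a `range_is_valid` function that only checks an upper bound. The kernel's own range validation checks more than this.

```python
TAG_SHIFT = 56                    # assumed: tag lives in the top byte
ADDR_MASK = (1 << TAG_SHIFT) - 1
TASK_SIZE = 1 << 48               # assumed user address space limit


def tag_pointer(addr, tag):
    """Place an 8-bit tag in the top byte of an address."""
    return ((tag & 0xFF) << TAG_SHIFT) | (addr & ADDR_MASK)


def untag_pointer(addr):
    """Strip the tag, leaving the plain address."""
    return addr & ADDR_MASK


def range_is_valid(start, length):
    """Hypothetical check: the whole range must lie below TASK_SIZE."""
    return length > 0 and start + length <= TASK_SIZE
```

With `heap = 0x7F0000001000` and `tagged = tag_pointer(heap, 0xB4)`, `range_is_valid(tagged, 4096)` is `False` while `range_is_valid(heap, 4096)` is `True`. Rejecting the tagged range up front corresponds to letting the system call fail, instead of registering the untagged range and later reporting fault addresses that the application does not recognize.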
      
      Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
      Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
      Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
      Fixes: 63f0c603 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Delva <adelva@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Martin <Dave.Martin@arm.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mitch Phillips <mitchp@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: William McVicker <willmcvicker@google.com>
      Cc: <stable@vger.kernel.org>	[5.4]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e71e2ace
    • J
      io_uring: explicitly catch any illegal async queue attempt · 991468dc
      Jens Axboe committed
      Catch an illegal case to queue async from an unrelated task that got
      the ring fd passed to it. This should not be possible to hit, but
      better be proactive and catch it explicitly. io-wq is extended to
      check for early IO_WQ_WORK_CANCEL being set on a work item as well,
      so it can run the request through the normal cancelation path.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      991468dc
    • J
      io_uring: never attempt iopoll reissue from release path · 3c30ef0f
      Jens Axboe committed
      There are two reasons why this shouldn't be done:
      
      1) Ring is exiting, and we're canceling requests anyway. Any request
         should be canceled anyway. In theory, this could iterate for a
         number of times if someone else is also driving the target block
         queue into request starvation, however the likelihood of this
         happening is minuscule.
      
      2) If the original task decided to pass the ring to another task, then
         we don't want to be reissuing from this context as it may be an
         unrelated task or context. No assumptions should be made about
         the context in which ->release() is run. This can only happen for pure
         read/write, and we'll get -EFAULT on them anyway.
      
      Link: https://lore.kernel.org/io-uring/YPr4OaHv0iv0KTOc@zeniv-ca.linux.org.uk/
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3c30ef0f
  7. 23 Jul, 2021 6 commits
  8. 22 Jul, 2021 5 commits
    • C
      btrfs: store a block_device in struct btrfs_ordered_extent · c7c3a6dc
      Christoph Hellwig committed
      Store the block device, rather than the gendisk, in the
      btrfs_ordered_extent structure instead of acquiring a reference to it
      later.
      
      Note: this is from the series removing bdgrab/bdput; btrfs is one of
      the last users.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c7c3a6dc
    • F
      btrfs: fix lock inversion problem when doing qgroup extent tracing · 8949b9a1
      Filipe Manana committed
      At btrfs_qgroup_trace_extent_post() we call btrfs_find_all_roots() with a
      NULL value as the transaction handle argument, which makes that function
      take the commit_root_sem semaphore, which is necessary when we don't hold
      a transaction handle or any other mechanism to prevent a transaction
      commit from wiping out commit roots.
      
      However btrfs_qgroup_trace_extent_post() can be called in a context where
      we are holding a write lock on an extent buffer from a subvolume tree,
      namely from btrfs_truncate_inode_items(), called either during truncate
      or unlink operations. In this case we end up with a lock inversion problem
      because the commit_root_sem is a higher level lock, always supposed to be
      acquired before locking any extent buffer.
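The ordering rule can be illustrated with a tiny order checker in the spirit of lockdep. This is a deliberately simplified sketch: the `LockOrderChecker` class is invented for this example, it only records pairwise acquisition order, and the lock names are plain strings.

```python
class LockOrderChecker:
    """Remember which lock was held when another was taken, and flag any
    later acquisition that uses the opposite order."""

    def __init__(self):
        self.seen_before = set()  # (held_first, taken_second) pairs

    def run(self, acquisition_order):
        """Record one task's acquisition order; return inverted pairs."""
        inversions = []
        held = []
        for lock in acquisition_order:
            for earlier in held:
                if (lock, earlier) in self.seen_before:
                    inversions.append((earlier, lock))
                self.seen_before.add((earlier, lock))
            held.append(lock)
        return inversions
```

Feeding it `["commit_root_sem", "extent_buffer"]` first and `["extent_buffer", "commit_root_sem"]` second reports the second sequence as an inversion, which is the situation described above: a task that already holds an extent buffer lock goes on to take commit_root_sem.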
      
      Lockdep detects this lock inversion problem since we switched the extent
      buffer locks from custom locks to semaphores, and when running btrfs/158
      from fstests, it reported the following trace:
      
      [ 9057.626435] ======================================================
      [ 9057.627541] WARNING: possible circular locking dependency detected
      [ 9057.628334] 5.14.0-rc2-btrfs-next-93 #1 Not tainted
      [ 9057.628961] ------------------------------------------------------
      [ 9057.629867] kworker/u16:4/30781 is trying to acquire lock:
      [ 9057.630824] ffff8e2590f58760 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.632542]
                     but task is already holding lock:
      [ 9057.633551] ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
      [ 9057.635255]
                     which lock already depends on the new lock.
      
      [ 9057.636292]
                     the existing dependency chain (in reverse order) is:
      [ 9057.637240]
                     -> #1 (&fs_info->commit_root_sem){++++}-{3:3}:
      [ 9057.638138]        down_read+0x46/0x140
      [ 9057.638648]        btrfs_find_all_roots+0x41/0x80 [btrfs]
      [ 9057.639398]        btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
      [ 9057.640283]        btrfs_add_delayed_data_ref+0x418/0x490 [btrfs]
      [ 9057.641114]        btrfs_free_extent+0x35/0xb0 [btrfs]
      [ 9057.641819]        btrfs_truncate_inode_items+0x424/0xf70 [btrfs]
      [ 9057.642643]        btrfs_evict_inode+0x454/0x4f0 [btrfs]
      [ 9057.643418]        evict+0xcf/0x1d0
      [ 9057.643895]        do_unlinkat+0x1e9/0x300
      [ 9057.644525]        do_syscall_64+0x3b/0xc0
      [ 9057.645110]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 9057.645835]
                     -> #0 (btrfs-tree-00){++++}-{3:3}:
      [ 9057.646600]        __lock_acquire+0x130e/0x2210
      [ 9057.647248]        lock_acquire+0xd7/0x310
      [ 9057.647773]        down_read_nested+0x4b/0x140
      [ 9057.648350]        __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.649175]        btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 9057.650010]        btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 9057.650849]        scrub_print_warning_inode+0x89/0x370 [btrfs]
      [ 9057.651733]        iterate_extent_inodes+0x1e3/0x280 [btrfs]
      [ 9057.652501]        scrub_print_warning+0x15d/0x2f0 [btrfs]
      [ 9057.653264]        scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
      [ 9057.654295]        scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
      [ 9057.655111]        btrfs_work_helper+0xf8/0x400 [btrfs]
      [ 9057.655831]        process_one_work+0x247/0x5a0
      [ 9057.656425]        worker_thread+0x55/0x3c0
      [ 9057.656993]        kthread+0x155/0x180
      [ 9057.657494]        ret_from_fork+0x22/0x30
      [ 9057.658030]
                     other info that might help us debug this:
      
      [ 9057.659064]  Possible unsafe locking scenario:
      
      [ 9057.659824]        CPU0                    CPU1
      [ 9057.660402]        ----                    ----
      [ 9057.660988]   lock(&fs_info->commit_root_sem);
      [ 9057.661581]                                lock(btrfs-tree-00);
      [ 9057.662348]                                lock(&fs_info->commit_root_sem);
      [ 9057.663254]   lock(btrfs-tree-00);
      [ 9057.663690]
                      *** DEADLOCK ***
      
      [ 9057.664437] 4 locks held by kworker/u16:4/30781:
      [ 9057.665023]  #0: ffff8e25922a1148 ((wq_completion)btrfs-scrub){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
      [ 9057.666260]  #1: ffffabb3451ffe70 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
      [ 9057.667639]  #2: ffff8e25922da198 (&ret->mutex){+.+.}-{3:3}, at: scrub_handle_errored_block.isra.0+0x5d2/0x1640 [btrfs]
      [ 9057.669017]  #3: ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
      [ 9057.670408]
                     stack backtrace:
      [ 9057.670976] CPU: 7 PID: 30781 Comm: kworker/u16:4 Not tainted 5.14.0-rc2-btrfs-next-93 #1
      [ 9057.672030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [ 9057.673492] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
      [ 9057.674258] Call Trace:
      [ 9057.674588]  dump_stack_lvl+0x57/0x72
      [ 9057.675083]  check_noncircular+0xf3/0x110
      [ 9057.675611]  __lock_acquire+0x130e/0x2210
      [ 9057.676132]  lock_acquire+0xd7/0x310
      [ 9057.676605]  ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.677313]  ? lock_is_held_type+0xe8/0x140
      [ 9057.677849]  down_read_nested+0x4b/0x140
      [ 9057.678349]  ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.679068]  __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.679760]  btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 9057.680458]  btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 9057.681083]  ? _raw_spin_unlock+0x29/0x40
      [ 9057.681594]  ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
      [ 9057.682336]  scrub_print_warning_inode+0x89/0x370 [btrfs]
      [ 9057.683058]  ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
      [ 9057.683834]  ? scrub_write_block_to_dev_replace+0xb0/0xb0 [btrfs]
      [ 9057.684632]  iterate_extent_inodes+0x1e3/0x280 [btrfs]
      [ 9057.685316]  scrub_print_warning+0x15d/0x2f0 [btrfs]
      [ 9057.685977]  ? ___ratelimit+0xa4/0x110
      [ 9057.686460]  scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
      [ 9057.687316]  scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
      [ 9057.688021]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [ 9057.688649]  ? lock_is_held_type+0xe8/0x140
      [ 9057.689180]  process_one_work+0x247/0x5a0
      [ 9057.689696]  worker_thread+0x55/0x3c0
      [ 9057.690175]  ? process_one_work+0x5a0/0x5a0
      [ 9057.690731]  kthread+0x155/0x180
      [ 9057.691158]  ? set_kthread_struct+0x40/0x40
      [ 9057.691697]  ret_from_fork+0x22/0x30
      
      Fix this by making btrfs_find_all_roots() never attempt to lock the
      commit_root_sem when it is called from btrfs_qgroup_trace_extent_post().
      
      We can't just pass a non-NULL transaction handle to btrfs_find_all_roots()
      from btrfs_qgroup_trace_extent_post(), because that would make backref
      lookup not use commit roots and acquire read locks on extent buffers, and
      therefore could deadlock when btrfs_qgroup_trace_extent_post() is called
      from the btrfs_truncate_inode_items() code path which has acquired a write
      lock on an extent buffer of the subvolume btree.
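      
      The ordering rule described above can be modelled in userspace. What
      follows is a minimal sketch, not the btrfs code: the lock ranks,
      struct lock_tracker and find_all_roots_model() with its
      skip_commit_root_sem flag are hypothetical names chosen for
      illustration. It shows why requesting the higher level lock while an
      extent buffer lock is held counts as an inversion, and how skipping that
      lock on the qgroup tracing path avoids it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ranks: a lower rank must always be acquired first. */
enum lock_rank {
    RANK_COMMIT_ROOT_SEM = 1, /* higher level lock */
    RANK_EXTENT_BUFFER = 2    /* per extent buffer lock */
};

#define MAX_HELD 8

struct lock_tracker {
    int held[MAX_HELD];
    size_t depth;
};

/* Record an acquisition; return false if it would invert the ordering. */
static bool tracker_acquire(struct lock_tracker *t, int rank)
{
    if (t->depth == MAX_HELD)
        return false;
    if (t->depth > 0 && t->held[t->depth - 1] > rank)
        return false; /* lower rank requested while a higher rank is held */
    t->held[t->depth++] = rank;
    return true;
}

static void tracker_release(struct lock_tracker *t)
{
    assert(t->depth > 0);
    t->depth--;
}

/* Toy stand-in for the backref walk: optionally skips the semaphore. */
static bool find_all_roots_model(struct lock_tracker *t,
                                 bool skip_commit_root_sem)
{
    if (!skip_commit_root_sem) {
        if (!tracker_acquire(t, RANK_COMMIT_ROOT_SEM))
            return false;
        /* ... walk commit roots here ... */
        tracker_release(t);
    }
    return true;
}
```

      In this model, a caller that already holds an extent buffer lock gets a
      failure when it does not pass skip_commit_root_sem, and succeeds when it
      does.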
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8949b9a1
    • A
      btrfs: check for missing device in btrfs_trim_fs · 16a200f6
      Anand Jain committed
      A fstrim on a degraded raid1 can trigger the following null pointer
      dereference:
      
        BTRFS info (device loop0): allowing degraded mounts
        BTRFS info (device loop0): disk space caching is enabled
        BTRFS info (device loop0): has skinny extents
        BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing
        BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing
        BTRFS info (device loop0): enabling ssd optimizations
        BUG: kernel NULL pointer dereference, address: 0000000000000620
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 4574 Comm: fstrim Not tainted 5.13.0-rc7+ #31
        Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
        RIP: 0010:btrfs_trim_fs+0x199/0x4a0 [btrfs]
        RSP: 0018:ffff959541797d28 EFLAGS: 00010293
        RAX: 0000000000000000 RBX: ffff946f84eca508 RCX: a7a67937adff8608
        RDX: ffff946e8122d000 RSI: 0000000000000000 RDI: ffffffffc02fdbf0
        RBP: ffff946ea4615000 R08: 0000000000000001 R09: 0000000000000000
        R10: 0000000000000000 R11: ffff946e8122d960 R12: 0000000000000000
        R13: ffff959541797db8 R14: ffff946e8122d000 R15: ffff959541797db8
        FS:  00007f55917a5080(0000) GS:ffff946f9bc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000620 CR3: 000000002d2c8001 CR4: 00000000000706f0
        Call Trace:
        btrfs_ioctl_fitrim+0x167/0x260 [btrfs]
        btrfs_ioctl+0x1c00/0x2fe0 [btrfs]
        ? selinux_file_ioctl+0x140/0x240
        ? syscall_trace_enter.constprop.0+0x188/0x240
        ? __x64_sys_ioctl+0x83/0xb0
        __x64_sys_ioctl+0x83/0xb0
      
      Reproducer:
      
        $ mkfs.btrfs -fq -d raid1 -m raid1 /dev/loop0 /dev/loop1
        $ mount /dev/loop0 /btrfs
        $ umount /btrfs
        $ btrfs dev scan --forget
        $ mount -o degraded /dev/loop0 /btrfs
      
        $ fstrim /btrfs
      
      The reason is we call btrfs_trim_free_extents() for the missing device,
      which uses device->bdev (NULL for missing device) to find if the device
      supports discard.
      
      Fix is to check if the device is missing before calling
      btrfs_trim_free_extents().
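      
      The shape of that check can be sketched in a small userspace model. This
      is an illustration under stated assumptions, not the kernel change:
      struct model_device, model_supports_discard() and model_trim_fs() are
      hypothetical names, and the string pointer only stands in for
      device->bdev.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; bdev is NULL when it is missing. */
struct model_device {
    const char *bdev;
    bool missing;
};

/* Models the discard query, which dereferences bdev. */
static bool model_supports_discard(const struct model_device *dev)
{
    assert(dev->bdev != NULL); /* the dereference that would crash */
    return dev->bdev[0] != '\0';
}

/* Trim every present device; skip missing ones before touching bdev. */
static size_t model_trim_fs(const struct model_device *devs, size_t n)
{
    size_t trimmed = 0;

    for (size_t i = 0; i < n; i++) {
        if (devs[i].missing)
            continue;
        if (model_supports_discard(&devs[i]))
            trimmed++;
    }
    return trimmed;
}
```

      With one present and one missing device, the loop in this model only
      visits the present one.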
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      16a200f6
    • F
      btrfs: fix unpersisted i_size on fsync after expanding truncate · 9acc8103
      Filipe Manana committed
      If we have an inode that does not have the full sync flag set, was changed
      in the current transaction, then it is logged while logging some other
      inode (like its parent directory for example), its i_size is increased by
      a truncate operation, the log is synced through an fsync of some other
      inode and then finally we explicitly call fsync on our inode, the new
      i_size is not persisted.
      
      The following example shows how to trigger it, with comments explaining
      how and why the issue happens:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ touch /mnt/foo
        $ xfs_io -f -c "pwrite -S 0xab 0 1M" /mnt/bar
      
        $ sync
      
        # Fsync bar, this will be a noop since the file has not yet been
        # modified in the current transaction. The goal here is to clear
        # BTRFS_INODE_NEEDS_FULL_SYNC from the inode's runtime flags.
        $ xfs_io -c "fsync" /mnt/bar
      
        # Now rename both files, without changing their parent directory.
        $ mv /mnt/bar /mnt/bar2
        $ mv /mnt/foo /mnt/foo2
      
        # Increase the size of bar2 with a truncate operation.
        $ xfs_io -c "truncate 2M" /mnt/bar2
      
        # Now fsync foo2, this results in logging its parent inode (the root
        # directory), and logging the parent results in logging the inode of
        # file bar2 (its inode item and the new name). The inode of file bar2
        # is logged with an i_size of 0 bytes since it's logged in
        # LOG_INODE_EXISTS mode, meaning we are only logging its names (and
        # xattrs if it had any) and the i_size of the inode will not be changed
        # when the log is replayed.
        $ xfs_io -c "fsync" /mnt/foo2
      
        # Now explicitly fsync bar2. This resulted in doing nothing, not
        # logging the inode with the new i_size of 2M and the hole from file
        # offset 1M to 2M. Because the inode did not have the flag
        # BTRFS_INODE_NEEDS_FULL_SYNC set, when it was logged through the
        # fsync of file foo2, its last_log_commit field was updated,
        # resulting in this explicit fsync of file bar2 not doing anything.
        $ xfs_io -c "fsync" /mnt/bar2
      
        # File bar2 content and size before a power failure.
        $ od -A d -t x1 /mnt/bar2
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        1048576 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        2097152
      
        <power failure>
      
        # Mount the filesystem to replay the log.
        $ mount /dev/sdc /mnt
      
        # Read the file again, should have the same content and size as before
        # the power failure happened, but it doesn't, i_size is still at 1M.
        $ od -A d -t x1 /mnt/bar2
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        1048576
      
      This started to happen after commit 209ecbb8 ("btrfs: remove stale
      comment and logic from btrfs_inode_in_log()"), since btrfs_inode_in_log()
      no longer checks if the inode's list of modified extents is not empty.
      However, checking that list is not the right way to address this case
      and the check was added a long time ago in commit 125c4cf9
      ("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync")
      for a different purpose, to address consecutive ranged fsyncs.
      
      The reason that checking for the list emptiness makes this test pass is
      because during an expanding truncate we create an extent map to represent
      a hole from the old i_size to the new i_size, and add that extent map to
      the list of modified extents in the inode. However if we are low on
      available memory and we can not allocate a new extent map, then we don't
      treat it as an error and just set the full sync flag on the inode, so that
      the next fsync does not rely on the list of modified extents - so checking
      for the emptiness of the list to decide if the inode needs to be logged is
      not reliable, and results in not logging the inode if it was not possible
      to allocate the extent map for the hole.
      
      Fix this by ensuring that if we are only logging that an inode exists
      (inode item, names/references and xattrs), we don't update the inode's
      last_log_commit even if it does not have the full sync runtime flag set.
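      
      The rule in the previous paragraph can be illustrated with a heavily
      simplified model. It is a sketch under assumptions, not the btrfs
      implementation: struct model_inode, the MODEL_LOG_* modes and both
      functions are hypothetical, and the "already logged" test is reduced to
      a single comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_log_mode { MODEL_LOG_INODE_ALL, MODEL_LOG_INODE_EXISTS };

struct model_inode {
    int last_sub_trans;  /* log sub-transaction of the last modification */
    int last_log_commit; /* last sub-transaction known to be fully logged */
};

/* True when a later fsync may treat the inode as already logged. */
static bool model_inode_in_log(const struct model_inode *inode)
{
    return inode->last_sub_trans <= inode->last_log_commit;
}

static void model_log_inode(struct model_inode *inode,
                            enum model_log_mode mode)
{
    assert(inode != NULL);
    /* ... log the inode item, names and xattrs here ... */

    /* Only a full log may mark the inode as up to date in the log. */
    if (mode != MODEL_LOG_INODE_EXISTS)
        inode->last_log_commit = inode->last_sub_trans;
}
```

      In this model, logging bar2 in the exists-only mode leaves it eligible
      for a later explicit fsync, which is the behaviour the fix aims for.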
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 5.13+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9acc8103
    • P
      cgroup1: fix leaked context root causing sporadic NULL deref in LTP · 1e7107c5
      Paul Gortmaker committed
      Richard reported sporadic (roughly one in 10 or so) null dereferences and
      other strange behaviour for a set of automated LTP tests.  Things like:
      
         BUG: kernel NULL pointer dereference, address: 0000000000000008
         #PF: supervisor read access in kernel mode
         #PF: error_code(0x0000) - not-present page
         PGD 0 P4D 0
         Oops: 0000 [#1] PREEMPT SMP PTI
         CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1
         Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
         RIP: 0010:kernfs_sop_show_path+0x1b/0x60
      
      ...or these others:
      
         RIP: 0010:do_mkdirat+0x6a/0xf0
         RIP: 0010:d_alloc_parallel+0x98/0x510
         RIP: 0010:do_readlinkat+0x86/0x120
      
      There were other less common instances of some kind of a general scribble
      but the common theme was mount and cgroup and a dubious dentry triggering
      the NULL dereference.  I was only able to reproduce it under qemu by
      replicating Richard's setup as closely as possible - I never did get it
      to happen on bare metal, even while keeping everything else the same.
      
      In commit 71d883c3 ("cgroup_do_mount(): massage calling conventions")
      we see this as a part of the overall change:
      
         --------------
                 struct cgroup_subsys *ss;
         -       struct dentry *dentry;
      
         [...]
      
         -       dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root,
         -                                CGROUP_SUPER_MAGIC, ns);
      
         [...]
      
         -       if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
         -               struct super_block *sb = dentry->d_sb;
         -               dput(dentry);
         +       ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns);
         +       if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
         +               struct super_block *sb = fc->root->d_sb;
         +               dput(fc->root);
                         deactivate_locked_super(sb);
                         msleep(10);
                         return restart_syscall();
                 }
         --------------
      
      In changing from the local "*dentry" variable to using fc->root, we now
      export/leave that dentry pointer in the file context after doing the dput()
      in the unlikely "is_dying" case.   With LTP doing a crazy amount of back to
      back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely
      becomes slightly likely and then bad things happen.
      
      A fix would be to not leave the stale reference in fc->root as follows:
      
         --------------
                        dput(fc->root);
        +               fc->root = NULL;
                        deactivate_locked_super(sb);
         --------------
      
      ...but then we are just open-coding a duplicate of fc_drop_locked() so we
      simply use that instead.
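      
      The drop-and-clear pattern can be shown with a toy userspace model. This
      is a sketch only: struct model_dentry, struct model_fs_context,
      model_dput() and model_fc_drop() are hypothetical, and the superblock
      deactivation step mentioned above is left out.

```c
#include <assert.h>
#include <stddef.h>

struct model_dentry {
    int refcount;
};

struct model_fs_context {
    struct model_dentry *root;
};

static void model_dput(struct model_dentry *d)
{
    assert(d->refcount > 0);
    d->refcount--;
}

/* Drop the reference and clear the pointer so nothing stale remains. */
static void model_fc_drop(struct model_fs_context *fc)
{
    model_dput(fc->root);
    fc->root = NULL;
}
```

      After model_fc_drop() returns, later code that inspects the context sees
      a NULL root rather than a pointer to a dentry it no longer owns.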
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: stable@vger.kernel.org      # v5.1+
      Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org>
      Fixes: 71d883c3 ("cgroup_do_mount(): massage calling conventions")
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      1e7107c5
  9. 21 Jul, 2021 4 commits
  10. 20 Jul, 2021 3 commits
    • L
      ceph: don't WARN if we're still opening a session to an MDS · cdb330f4
      Luis Henriques committed
      If MDSs aren't available while mounting a filesystem, the session state
      will transition from SESSION_OPENING to SESSION_CLOSING.  And in that
      scenario check_session_state() will be called from delayed_work() and
      trigger this WARN.
      
      Avoid this by only WARNing after a session has already been established
      (i.e., the s_ttl will be different from 0).
      
      Fixes: 62575e27 ("ceph: check session state after bumping session->s_seq")
      Signed-off-by: Luis Henriques <lhenriques@suse.de>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      cdb330f4
    • Y
      io_uring: fix memleak in io_init_wq_offload() · 362a9e65
      Yang Yingliang committed
      I got a memory leak report when doing a fuzz test:
      
      BUG: memory leak
      unreferenced object 0xffff888107310a80 (size 96):
      comm "syz-executor.6", pid 4610, jiffies 4295140240 (age 20.135s)
      hex dump (first 32 bytes):
      01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
      backtrace:
      [<000000001974933b>] kmalloc include/linux/slab.h:591 [inline]
      [<000000001974933b>] kzalloc include/linux/slab.h:721 [inline]
      [<000000001974933b>] io_init_wq_offload fs/io_uring.c:7920 [inline]
      [<000000001974933b>] io_uring_alloc_task_context+0x466/0x640 fs/io_uring.c:7955
      [<0000000039d0800d>] __io_uring_add_tctx_node+0x256/0x360 fs/io_uring.c:9016
      [<000000008482e78c>] io_uring_add_tctx_node fs/io_uring.c:9052 [inline]
      [<000000008482e78c>] __do_sys_io_uring_enter fs/io_uring.c:9354 [inline]
      [<000000008482e78c>] __se_sys_io_uring_enter fs/io_uring.c:9301 [inline]
      [<000000008482e78c>] __x64_sys_io_uring_enter+0xabc/0xc20 fs/io_uring.c:9301
      [<00000000b875f18f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      [<00000000b875f18f>] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
      [<000000006b0a8484>] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      CPU0                          CPU1
      io_uring_enter                io_uring_enter
      io_uring_add_tctx_node        io_uring_add_tctx_node
      __io_uring_add_tctx_node      __io_uring_add_tctx_node
      io_uring_alloc_task_context   io_uring_alloc_task_context
      io_init_wq_offload            io_init_wq_offload
      hash = kzalloc                hash = kzalloc
      ctx->hash_map = hash          ctx->hash_map = hash <- one of the hashes is leaked
      
      When io_uring_enter() is called in parallel, one of the 'hash_map'
      allocations is leaked. Add uring_lock to protect 'hash_map'.
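      
      The idea of doing the check and the assignment under one lock can be
      sketched in userspace. This is an assumption-laden model, not the
      io_uring code: struct model_ctx, its pthread mutex standing in for
      uring_lock, the allocations counter and model_init_wq_offload() are all
      hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

struct model_hash {
    int unused;
};

struct model_ctx {
    pthread_mutex_t lock;        /* stands in for uring_lock */
    struct model_hash *hash_map; /* shared, allocated at most once */
    int allocations;             /* bookkeeping for this example only */
};

/* Returns 0 on success, -1 if the allocation fails. */
static int model_init_wq_offload(struct model_ctx *ctx)
{
    int ret = 0;

    /* Check and assignment happen under the same lock. */
    pthread_mutex_lock(&ctx->lock);
    if (!ctx->hash_map) {
        ctx->hash_map = calloc(1, sizeof(*ctx->hash_map));
        if (ctx->hash_map)
            ctx->allocations++;
        else
            ret = -1;
    }
    pthread_mutex_unlock(&ctx->lock);
    return ret;
}
```

      Because the NULL check and the store are serialized, a second caller in
      this model finds the pointer already set and allocates nothing.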
      
      Fixes: e941894e ("io-wq: make buffered file write hashed work map per-ctx")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/20210720083805.3030730-1-yangyingliang@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      362a9e65
    • P
      io_uring: remove double poll entry on arm failure · 46fee9ab
      Pavel Begunkov committed
      __io_queue_proc() can enqueue both poll entries and still fail
      afterwards, so the callers trying to cancel it should also try to remove
      the second poll entry (if any).
      
      For example, it may leave the request alive referencing an io_uring
      context but not accessible for cancellation:
      
      [  282.599913][ T1620] task:iou-sqp-23145   state:D stack:28720 pid:23155 ppid:  8844 flags:0x00004004
      [  282.609927][ T1620] Call Trace:
      [  282.613711][ T1620]  __schedule+0x93a/0x26f0
      [  282.634647][ T1620]  schedule+0xd3/0x270
      [  282.638874][ T1620]  io_uring_cancel_generic+0x54d/0x890
      [  282.660346][ T1620]  io_sq_thread+0xaac/0x1250
      [  282.696394][ T1620]  ret_from_fork+0x1f/0x30
      
      Cc: stable@vger.kernel.org
      Fixes: 18bceab1 ("io_uring: allow POLL_ADD with double poll_wait() users")
      Reported-and-tested-by: syzbot+ac957324022b7132accf@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/0ec1228fc5eda4cb524eeda857da8efdc43c331c.1626774457.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      46fee9ab