1. 28 10月, 2019 2 次提交
  2. 26 10月, 2019 2 次提交
  3. 25 10月, 2019 6 次提交
    • P
      io_uring: Fix race for sqes with userspace · 935d1e45
      Pavel Begunkov 提交于
      io_ring_submit() finalises with
      1. io_commit_sqring(), which releases sqes to the userspace
      2. Then calls to io_queue_link_head(), accessing released head's sqe
      
      Reorder them.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      935d1e45
    • P
      io_uring: Fix broken links with offloading · fb5ccc98
      Pavel Begunkov 提交于
      io_sq_thread() processes sqes by 8 without considering links. As a
      result, links will be randomely subdivided.
      
      The easiest way to fix it is to call io_get_sqring() inside
      io_submit_sqes() as do io_ring_submit().
      
      Downsides:
      1. This removes optimisation of not grabbing mm_struct for fixed files
      2. It submitting all sqes in one go, without finer-grained sheduling
      with cq processing.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fb5ccc98
    • P
      io_uring: Fix corrupted user_data · 84d55dc5
      Pavel Begunkov 提交于
      There is a bug, where failed linked requests are returned not with
      specified @user_data, but with garbage from a kernel stack.
      
      The reason is that io_fail_links() uses req->user_data, which is
      uninitialised when called from io_queue_sqe() on fail path.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      84d55dc5
    • D
      cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occurs · d46b0da7
      Dave Wysochanski 提交于
      There's a deadlock that is possible and can easily be seen with
      a test where multiple readers open/read/close of the same file
      and a disruption occurs causing reconnect.  The deadlock is due
      a reader thread inside cifs_strict_readv calling down_read and
      obtaining lock_sem, and then after reconnect inside
      cifs_reopen_file calling down_read a second time.  If in
      between the two down_read calls, a down_write comes from
      another process, deadlock occurs.
      
              CPU0                    CPU1
              ----                    ----
      cifs_strict_readv()
       down_read(&cifsi->lock_sem);
                                     _cifsFileInfo_put
                                        OR
                                     cifs_new_fileinfo
                                      down_write(&cifsi->lock_sem);
      cifs_reopen_file()
       down_read(&cifsi->lock_sem);
      
      Fix the above by changing all down_write(lock_sem) calls to
      down_write_trylock(lock_sem)/msleep() loop, which in turn
      makes the second down_read call benign since it will never
      block behind the writer while holding lock_sem.
      Signed-off-by: NDave Wysochanski <dwysocha@redhat.com>
      Suggested-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Reviewed--by: NRonnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: NPavel Shilovsky <pshilov@microsoft.com>
      d46b0da7
    • P
      CIFS: Fix use after free of file info structures · 1a67c415
      Pavel Shilovsky 提交于
      Currently the code assumes that if a file info entry belongs
      to lists of open file handles of an inode and a tcon then
      it has non-zero reference. The recent changes broke that
      assumption when putting the last reference of the file info.
      There may be a situation when a file is being deleted but
      nothing prevents another thread to reference it again
      and start using it. This happens because we do not hold
      the inode list lock while checking the number of references
      of the file info structure. Fix this by doing the proper
      locking when doing the check.
      
      Fixes: 487317c9 ("cifs: add spinlock for the openFileList to cifsInodeInfo")
      Fixes: cb248819 ("cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panic")
      Cc: Stable <stable@vger.kernel.org>
      Reviewed-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: NPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      1a67c415
    • P
      CIFS: Fix retry mid list corruption on reconnects · abe57073
      Pavel Shilovsky 提交于
      When the client hits reconnect it iterates over the mid
      pending queue marking entries for retry and moving them
      to a temporary list to issue callbacks later without holding
      GlobalMid_Lock. In the same time there is no guarantee that
      mids can't be removed from the temporary list or even
      freed completely by another thread. It may cause a temporary
      list corruption:
      
      [  430.454897] list_del corruption. prev->next should be ffff98d3a8f316c0, but was 2e885cb266355469
      [  430.464668] ------------[ cut here ]------------
      [  430.466569] kernel BUG at lib/list_debug.c:51!
      [  430.468476] invalid opcode: 0000 [#1] SMP PTI
      [  430.470286] CPU: 0 PID: 13267 Comm: cifsd Kdump: loaded Not tainted 5.4.0-rc3+ #19
      [  430.473472] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  430.475872] RIP: 0010:__list_del_entry_valid.cold+0x31/0x55
      ...
      [  430.510426] Call Trace:
      [  430.511500]  cifs_reconnect+0x25e/0x610 [cifs]
      [  430.513350]  cifs_readv_from_socket+0x220/0x250 [cifs]
      [  430.515464]  cifs_read_from_socket+0x4a/0x70 [cifs]
      [  430.517452]  ? try_to_wake_up+0x212/0x650
      [  430.519122]  ? cifs_small_buf_get+0x16/0x30 [cifs]
      [  430.521086]  ? allocate_buffers+0x66/0x120 [cifs]
      [  430.523019]  cifs_demultiplex_thread+0xdc/0xc30 [cifs]
      [  430.525116]  kthread+0xfb/0x130
      [  430.526421]  ? cifs_handle_standard+0x190/0x190 [cifs]
      [  430.528514]  ? kthread_park+0x90/0x90
      [  430.530019]  ret_from_fork+0x35/0x40
      
      Fix this by obtaining extra references for mids being retried
      and marking them as MID_DELETED which indicates that such a mid
      has been dequeued from the pending list.
      
      Also move mid cleanup logic from DeleteMidQEntry to
      _cifs_mid_q_entry_release which is called when the last reference
      to a particular mid is put. This allows to avoid any use-after-free
      of response buffers.
      
      The patch needs to be backported to stable kernels. A stable tag
      is not mentioned below because the patch doesn't apply cleanly
      to any actively maintained stable kernel.
      Reviewed-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Reviewed-and-tested-by: NDavid Wysochanski <dwysocha@redhat.com>
      Signed-off-by: NPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      abe57073
  4. 24 10月, 2019 4 次提交
  5. 23 10月, 2019 1 次提交
  6. 21 10月, 2019 4 次提交
  7. 19 10月, 2019 5 次提交
    • K
      proc/meminfo: fix output alignment · 2be5fbf9
      Kirill A. Shutemov 提交于
      Patch series "Fixes for THP in page cache", v2.
      
      This patch (of 5):
      
      Add extra space for FileHugePages and FilePmdMapped, so the output is
      aligned with other rows.
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-2-songliubraving@fb.com
      Fixes: 60fbf0ab ("mm,thp: stats for file backed THP")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Tested-by: NSong Liu <songliubraving@fb.com>
      Acked-by: NYang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2be5fbf9
    • Y
      ocfs2: fix panic due to ocfs2_wq is null · b918c430
      Yi Li 提交于
      mount.ocfs2 failed when reading ocfs2 filesystem superblock encounters
      an error.  ocfs2_initialize_super() returns before allocating ocfs2_wq.
      ocfs2_dismount_volume() triggers the following panic.
      
        Oct 15 16:09:27 cnwarekv-205120 kernel: On-disk corruption discovered.Please run fsck.ocfs2 once the filesystem is unmounted.
        Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_read_locked_inode:537 ERROR: status = -30
        Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:458 ERROR: status = -30
        Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:491 ERROR: status = -30
        Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_initialize_super:2313 ERROR: status = -30
        Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_fill_super:1033 ERROR: status = -30
        ------------[ cut here ]------------
        Oops: 0002 [#1] SMP NOPTI
        CPU: 1 PID: 11753 Comm: mount.ocfs2 Tainted: G  E
              4.14.148-200.ckv.x86_64 #1
        Hardware name: Sugon H320-G30/35N16-US, BIOS 0SSDX017 12/21/2018
        task: ffff967af0520000 task.stack: ffffa5f05484000
        RIP: 0010:mutex_lock+0x19/0x20
        Call Trace:
          flush_workqueue+0x81/0x460
          ocfs2_shutdown_local_alloc+0x47/0x440 [ocfs2]
          ocfs2_dismount_volume+0x84/0x400 [ocfs2]
          ocfs2_fill_super+0xa4/0x1270 [ocfs2]
          ? ocfs2_initialize_super.isa.211+0xf20/0xf20 [ocfs2]
          mount_bdev+0x17f/0x1c0
          mount_fs+0x3a/0x160
      
      Link: http://lkml.kernel.org/r/1571139611-24107-1-git-send-email-yili@winhong.comSigned-off-by: NYi Li <yilikernel@gmail.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b918c430
    • C
      ocfs2: fix error handling in ocfs2_setattr() · ce750f43
      Chengguang Xu 提交于
      Should set transfer_to[USRQUOTA/GRPQUOTA] to NULL on error case before
      jumping to do dqput().
      
      Link: http://lkml.kernel.org/r/20191010082349.1134-1-cgxu519@mykernel.netSigned-off-by: NChengguang Xu <cgxu519@mykernel.net>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce750f43
    • D
      fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c · aad5f69b
      David Hildenbrand 提交于
      There are three places where we access uninitialized memmaps, namely:
      - /proc/kpagecount
      - /proc/kpageflags
      - /proc/kpagecgroup
      
      We have initialized memmaps either when the section is online or when the
      page was initialized to the ZONE_DEVICE.  Uninitialized memmaps contain
      garbage and in the worst case trigger kernel BUGs, especially with
      CONFIG_PAGE_POISONING.
      
      For example, not onlining a DIMM during boot and calling /proc/kpagecount
      with CONFIG_PAGE_POISONING:
      
        :/# cat /proc/kpagecount > tmp.test
        BUG: unable to handle page fault for address: fffffffffffffffe
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
        RIP: 0010:kpagecount_read+0xce/0x1e0
        Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
        RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
        RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
        RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
        R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
        FS:  00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
        Call Trace:
         proc_reg_read+0x3c/0x60
         vfs_read+0xc5/0x180
         ksys_read+0x68/0xe0
         do_syscall_64+0x5c/0xa0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      For now, let's drop support for ZONE_DEVICE from the three pseudo files
      in order to fix this.  To distinguish offline memory (with garbage
      memmap) from ZONE_DEVICE memory with properly initialized memmaps, we
      would have to check get_dev_pagemap() and pfn_zone_device_reserved()
      right now.  The usage of both (especially, special casing devmem) is
      frowned upon and needs to be reworked.
      
      The fundamental issue we have is:
      
      	if (pfn_to_online_page(pfn)) {
      		/* memmap initialized */
      	} else if (pfn_valid(pfn)) {
      		/*
      		 * ???
      		 * a) offline memory. memmap garbage.
      		 * b) devmem: memmap initialized to ZONE_DEVICE.
      		 * c) devmem: reserved for driver. memmap garbage.
      		 * (d) devmem: memmap currently initializing - garbage)
      		 */
      	}
      
      We'll leave the pfn_zone_device_reserved() check in stable_page_flags()
      in place as that function is also used from memory failure.  We now no
      longer dump information about pages that are not in use anymore -
      offline.
      
      Link: http://lkml.kernel.org/r/20191009142435.3975-2-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reported-by: NQian Cai <cai@lca.pw>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aad5f69b
    • L
      filldir[64]: remove WARN_ON_ONCE() for bad directory entries · b9959c7a
      Linus Torvalds 提交于
      This was always meant to be a temporary thing, just for testing and to
      see if it actually ever triggered.
      
      The only thing that reported it was syzbot doing disk image fuzzing, and
      then that warning is expected.  So let's just remove it before -rc4,
      because the extra sanity testing should probably go to -stable, but we
      don't want the warning to do so.
      
      Reported-by: syzbot+3031f712c7ad5dd4d926@syzkaller.appspotmail.com
      Fixes: 8a23eb80 ("Make filldir[64]() verify the directory entry filename is valid")
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9959c7a
  8. 18 10月, 2019 4 次提交
    • Y
      io_uring: fix logic error in io_timeout · 8b07a65a
      yangerkun 提交于
      If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not
      tmp_nxt.
      
      Fixes: 5da0fb1a ("io_uring: consider the overflow of sequence for timeout req")
      Signed-off-by: Nyangerkun <yangerkun@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8b07a65a
    • J
      io_uring: fix up O_NONBLOCK handling for sockets · 491381ce
      Jens Axboe 提交于
      We've got two issues with the non-regular file handling for non-blocking
      IO:
      
      1) We don't want to re-do a short read in full for a non-regular file,
         as we can't just read the data again.
      2) For non-regular files that don't support non-blocking IO attempts,
         we need to punt to async context even if the file is opened as
         non-blocking. Otherwise the caller always gets -EAGAIN.
      
      Add two new request flags to handle these cases. One is just a cache
      of the inode S_ISREG() status, the other tells io_uring that we always
      need to punt this request to async context, even if REQ_F_NOWAIT is set.
      
      Cc: stable@vger.kernel.org
      Reported-by: NHrvoje Zeba <zeba.hrvoje@gmail.com>
      Tested-by: NHrvoje Zeba <zeba.hrvoje@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      491381ce
    • F
      Btrfs: check for the full sync flag while holding the inode lock during fsync · ba0b084a
      Filipe Manana 提交于
      We were checking for the full fsync flag in the inode before locking the
      inode, which is racy, since at that that time it might not be set but
      after we acquire the inode lock some other task set it. One case where
      this can happen is on a system low on memory and some concurrent task
      failed to allocate an extent map and therefore set the full sync flag on
      the inode, to force the next fsync to work in full mode.
      
      A consequence of missing the full fsync flag set is hitting the problems
      fixed by commit 0c713cba ("Btrfs: fix race between ranged fsync and
      writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
      tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
      of weird inconsistencies after replaying a log due to file extents items
      representing ranges that overlap.
      
      So just move the check such that it's done after locking the inode and
      before starting writeback again.
      
      Fixes: 0c713cba ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ba0b084a
    • F
      Btrfs: fix qgroup double free after failure to reserve metadata for delalloc · c7967fc1
      Filipe Manana 提交于
      If we fail to reserve metadata for delalloc operations we end up releasing
      the previously reserved qgroup amount twice, once explicitly under the
      'out_qgroup' label by calling btrfs_qgroup_free_meta_prealloc() and once
      again, under label 'out_fail', by calling btrfs_inode_rsv_release() with a
      value of 'true' for its 'qgroup_free' argument, which results in
      btrfs_qgroup_free_meta_prealloc() being called again, so we end up having
      a double free.
      
      Also if we fail to reserve the necessary qgroup amount, we jump to the
      label 'out_fail', which calls btrfs_inode_rsv_release() and that in turns
      calls btrfs_qgroup_free_meta_prealloc(), even though we weren't able to
      reserve any qgroup amount. So we freed some amount we never reserved.
      
      So fix this by removing the call to btrfs_inode_rsv_release() in the
      failure path, since it's not necessary at all as we haven't changed the
      inode's block reserve in any way at this point.
      
      Fixes: c8eaeac7 ("btrfs: reserve delalloc metadata differently")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c7967fc1
  9. 17 10月, 2019 1 次提交
  10. 16 10月, 2019 1 次提交
    • Q
      btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() · 8702ba93
      Qu Wenruo 提交于
      [Background]
      Btrfs qgroup uses two types of reserved space for METADATA space,
      PERTRANS and PREALLOC.
      
      PERTRANS is metadata space reserved for each transaction started by
      btrfs_start_transaction().
      While PREALLOC is for delalloc, where we reserve space before joining a
      transaction, and finally it will be converted to PERTRANS after the
      writeback is done.
      
      [Inconsistency]
      However there is inconsistency in how we handle PREALLOC metadata space.
      
      The most obvious one is:
      In btrfs_buffered_write():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
      
      We always free qgroup PREALLOC meta space.
      
      While in btrfs_truncate_block():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
      
      We only free qgroup PREALLOC meta space when something went wrong.
      
      [The Correct Behavior]
      The correct behavior should be the one in btrfs_buffered_write(), we
      should always free PREALLOC metadata space.
      
      The reason is, the btrfs_delalloc_* mechanism works by:
      - Reserve metadata first, even it's not necessary
        In btrfs_delalloc_reserve_metadata()
      
      - Free the unused metadata space
        Normally in:
        btrfs_delalloc_release_extents()
        |- btrfs_inode_rsv_release()
           Here we do calculation on whether we should release or not.
      
      E.g. for 64K buffered write, the metadata rsv works like:
      
      /* The first page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=0
      total:		num_bytes=calc_inode_reservations()
      /* The first page caused one outstanding extent, thus needs metadata
         rsv */
      
      /* The 2nd page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed
      /* The 2nd page doesn't cause new outstanding extent, needs no new meta
         rsv, so we free what we have reserved */
      
      /* The 3rd~16th pages */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed (still space for one outstanding extent)
      
      This means, if btrfs_delalloc_release_extents() determines to free some
      space, then those space should be freed NOW.
      So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
      than btrfs_qgroup_convert_reserved_meta().
      
      The good news is:
      - The callers are not that hot
        The hottest caller is in btrfs_buffered_write(), which is already
        fixed by commit 336a8bb8 ("btrfs: Fix wrong
        btrfs_delalloc_release_extents parameter"). Thus it's not that
        easy to cause false EDQUOT.
      
      - The trans commit in advance for qgroup would hide the bug
        Since commit f5fef459 ("btrfs: qgroup: Make qgroup async transaction
        commit more aggressive"), when btrfs qgroup metadata free space is slow,
        it will try to commit transaction and free the wrongly converted
        PERTRANS space, so it's not that easy to hit such bug.
      
      [FIX]
      So to fix the problem, remove the @qgroup_free parameter for
      btrfs_delalloc_release_extents(), and always pass true to
      btrfs_inode_rsv_release().
      Reported-by: NFilipe Manana <fdmanana@suse.com>
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8702ba93
  11. 15 10月, 2019 7 次提交
  12. 13 10月, 2019 2 次提交
    • S
      tracing: Do not create tracefs files if tracefs lockdown is in effect · bf8e6021
      Steven Rostedt (VMware) 提交于
      If on boot up, lockdown is activated for tracefs, don't even bother creating
      the files. This can also prevent instances from being created if lockdown is
      in effect.
      
      Link: http://lkml.kernel.org/r/CAHk-=whC6Ji=fWnjh2+eS4b15TnbsS4VPVtvBOwCy1jjEG_JHQ@mail.gmail.comSuggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      bf8e6021
    • S
      tracefs: Revert ccbd54ff ("tracefs: Restrict tracefs when the kernel is locked down") · 3ed270b1
      Steven Rostedt (VMware) 提交于
      Running the latest kernel through my "make instances" stress tests, I
      triggered the following bug (with KASAN and kmemleak enabled):
      
      mkdir invoked oom-killer:
      gfp_mask=0x40cd0(GFP_KERNEL|__GFP_COMP|__GFP_RECLAIMABLE), order=0,
      oom_score_adj=0
      CPU: 1 PID: 2229 Comm: mkdir Not tainted 5.4.0-rc2-test #325
      Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
      Call Trace:
       dump_stack+0x64/0x8c
       dump_header+0x43/0x3b7
       ? trace_hardirqs_on+0x48/0x4a
       oom_kill_process+0x68/0x2d5
       out_of_memory+0x2aa/0x2d0
       __alloc_pages_nodemask+0x96d/0xb67
       __alloc_pages_node+0x19/0x1e
       alloc_slab_page+0x17/0x45
       new_slab+0xd0/0x234
       ___slab_alloc.constprop.86+0x18f/0x336
       ? alloc_inode+0x2c/0x74
       ? irq_trace+0x12/0x1e
       ? tracer_hardirqs_off+0x1d/0xd7
       ? __slab_alloc.constprop.85+0x21/0x53
       __slab_alloc.constprop.85+0x31/0x53
       ? __slab_alloc.constprop.85+0x31/0x53
       ? alloc_inode+0x2c/0x74
       kmem_cache_alloc+0x50/0x179
       ? alloc_inode+0x2c/0x74
       alloc_inode+0x2c/0x74
       new_inode_pseudo+0xf/0x48
       new_inode+0x15/0x25
       tracefs_get_inode+0x23/0x7c
       ? lookup_one_len+0x54/0x6c
       tracefs_create_file+0x53/0x11d
       trace_create_file+0x15/0x33
       event_create_dir+0x2a3/0x34b
       __trace_add_new_event+0x1c/0x26
       event_trace_add_tracer+0x56/0x86
       trace_array_create+0x13e/0x1e1
       instance_mkdir+0x8/0x17
       tracefs_syscall_mkdir+0x39/0x50
       ? get_dname+0x31/0x31
       vfs_mkdir+0x78/0xa3
       do_mkdirat+0x71/0xb0
       sys_mkdir+0x19/0x1b
       do_fast_syscall_32+0xb0/0xed
      
      I bisected this down to the addition of the proxy_ops into tracefs for
      lockdown. It appears that the allocation of the proxy_ops and then freeing
      it in the destroy_inode callback, is causing havoc with the memory system.
      Reading the documentation about destroy_inode and talking with Linus about
      this, this is buggy and wrong. When defining the destroy_inode() method, it
      is expected that the destroy_inode() will also free the inode, and not just
      the extra allocations done in the creation of the inode. The faulty commit
      causes a memory leak of the inode data structure when they are deleted.
      
      Instead of allocating the proxy_ops (and then having to free it) the checks
      should be done by the open functions themselves, and not hack into the
      tracefs directory. First revert the tracefs updates for locked_down and then
      later we can add the locked_down checks in the kernel/trace files.
      
      Link: http://lkml.kernel.org/r/20191011135458.7399da44@gandalf.local.home
      
      Fixes: ccbd54ff ("tracefs: Restrict tracefs when the kernel is locked down")
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      3ed270b1
  13. 12 10月, 2019 1 次提交