1. 10 Sep 2020, 2 commits
    • virtiofs: provide a helper function for virtqueue initialization · b43b7e81
      Vivek Goyal authored
      This reduces code duplication and makes the code a little easier to read.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • dax: Create a range version of dax_layout_busy_page() · 6bbdd563
      Vivek Goyal authored
      The virtiofs device has a range of memory which is mapped into file
      inodes using dax. This memory is mapped in qemu on the host and maps
      different sections of the real file on the host. The size of this
      memory is limited (determined by the administrator) and, depending on
      filesystem size, we will soon reach a situation where all the memory
      is in use and we need to reclaim some.
      
      As part of reclaim process, we will need to make sure that there are
      no active references to pages (taken by get_user_pages()) on the memory
      range we are trying to reclaim. I am planning to use
      dax_layout_busy_page() for this. But in current form this is per inode
      and scans through all the pages of the inode.
      
      We want to reclaim only a portion of memory (say a 2MB page). So we
      want to make sure that only that 2MB range of pages has no references
      (and we don't want to unmap all the pages of the inode).
      
      Hence, create a range version of this function, named
      dax_layout_busy_page_range(), which takes the range that needs to be
      unmapped.
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Cc: "Weiny, Ira" <ira.weiny@intel.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  2. 04 Sep 2020, 1 commit
  3. 29 Aug 2020, 1 commit
  4. 28 Aug 2020, 3 commits
  5. 27 Aug 2020, 2 commits
  6. 26 Aug 2020, 3 commits
  7. 25 Aug 2020, 2 commits
  8. 24 Aug 2020, 6 commits
    • ceph: fix inode number handling on arches with 32-bit ino_t · ebce3eb2
      Jeff Layton authored
      Tuan and Ulrich mentioned that they were hitting a problem on s390x,
      which has a 32-bit ino_t value, even though it's a 64-bit arch (for
      historical reasons).
      
      I think the current handling of inode numbers in the ceph driver is
      wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's
      actually not a problem. 32-bit arches can deal with 64-bit inode numbers
      just fine when userland code is compiled with LFS support (the common
      case these days).
      
      What we really want to do is just use 64-bit numbers everywhere, unless
      someone has mounted with the ino32 mount option. In that case, we want
      to ensure that we hash the inode number down to something that will fit
      in 32 bits before presenting the value to userland.
      
      Add new helper functions that do this, and only do the conversion before
      presenting these values to userland in getattr and readdir.
      
      The inode table hashvalue is changed to just cast the inode number to
      unsigned long, as low-order bits are the most likely to vary anyway.
      
      While it's not strictly required, we do want to put something in
      inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on
      the size of the ino_t type.
      
      NOTE: This is a user-visible change on 32-bit arches:
      
      1/ inode numbers will be seen to have changed between kernel versions.
         32-bit arches will see large inode numbers now instead of the hashed
         ones they saw before.
      
      2/ any really old software not built with LFS support may start failing
         stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we
         can do about these, but hopefully the intersection of people running
         such code on ceph will be very small.
      
      The workaround for both problems is to mount with "-o ino32".
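      The hashing idea can be sketched as follows; this is an illustrative
      userspace fold, not the exact kernel helper:

```python
def ino_to_ino32(ino64: int) -> int:
    """Hypothetical sketch of hashing a 64-bit inode number down to 32
    bits for the ino32 case; the real kernel helper may hash differently."""
    h = (ino64 & 0xffffffff) ^ (ino64 >> 32)
    return h or 1  # never hand userland inode number 0

# Small inode numbers pass through unchanged; large ones get folded.
```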
      
      [ idryomov: changelog tweak ]
      
      URL: https://tracker.ceph.com/issues/46828
      Reported-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
      Reported-and-Tested-by: Tuan Hoang1 <Tuan.Hoang1@ibm.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • gfs2: add some much needed cleanup for log flushes that fail · 462582b9
      Bob Peterson authored
      When a log flush fails due to io errors, it signals the failure but does
      not clean up after itself very well. This is because buffers are added to
      the transaction tr_buf and tr_databuf queue, but the io error causes
      gfs2_log_flush to bypass the "after_commit" functions responsible for
      dequeueing the bd elements. If the bd elements are added to the ail list
      before the error, function ail_drain takes care of dequeueing them.
      But if they haven't gotten that far, the elements are forgotten,
      leaving the transactions unable to be freed.
      
      This patch introduces new function trans_drain which drains the bd
      elements from the transaction so they can be freed properly.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • binfmt_flat: revert "binfmt_flat: don't offset the data start" · 2217b982
      Max Filippov authored
      binfmt_flat loader uses the gap between text and data to store data
      segment pointers for the libraries. Even in the absence of shared
      libraries it stores at least one pointer to the executable's own data
      segment. Text and data can go back to back in the flat binary image,
      and without offsetting the data segment, the last few instructions in
      the text segment may get corrupted by the data segment pointer.
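      The layout constraint can be sketched numerically (a hypothetical
      illustration, assuming 4-byte target pointers):

```python
PTR_SIZE = 4  # assume 32-bit target pointers

def data_start(text_end: int, nlibs: int) -> int:
    """Hypothetical sketch: offset the data segment past the pointer
    table so the slots (at least one, for the executable's own data
    segment) don't overlap the last bytes of text."""
    slots = max(1, nlibs)
    return text_end + slots * PTR_SIZE
```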
      
      Fix it by reverting commit a2357223 ("binfmt_flat: don't offset the
      data start").
      
      Cc: stable@vger.kernel.org
      Fixes: a2357223 ("binfmt_flat: don't offset the data start")
      Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
    • treewide: Use fallthrough pseudo-keyword · df561f66
      Gustavo A. R. Silva authored
      Replace the existing /* fall through */ comments and their variants
      with the new pseudo-keyword macro fallthrough [1]. Also remove
      unnecessary fall-through markings where applicable.
      
      [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
    • io-wq: fix hang after cancelling pending hashed work · 204361a7
      Pavel Begunkov authored
      Don't forget to update wqe->hash_tail after cancelling a pending work
      item, if it was hashed.
      
      Cc: stable@vger.kernel.org # 5.7+
      Reported-by: Dmitry Shulyak <yashulyak@gmail.com>
      Fixes: 86f3cd1b ("io-wq: handle hashed writes in chains")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't recurse on tsk->sighand->siglock with signalfd · fd7d6de2
      Jens Axboe authored
      If an application is doing reads on signalfd, and we arm the poll
      handler because there's no data available, then the wakeup can recurse
      on the task's sighand->siglock, as the signal delivery from
      task_work_add() will use TWA_SIGNAL, which attempts to lock it again.
      
      We can detect the signalfd case pretty easily by comparing the
      poll->head wait_queue_head_t with the target task's signalfd wait
      queue. Just use normal task wakeup for this case.
      
      Cc: stable@vger.kernel.org # v5.7+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 23 Aug 2020, 2 commits
  10. 22 Aug 2020, 3 commits
    • afs: Fix NULL deref in afs_dynroot_depopulate() · 5e0b17b0
      David Howells authored
      During construction of an afs superblock, an error can occur after the
      superblock has been created but before we've created the root dentry.
      If the superblock has a dynamic root (i.e. what's normally mounted on
      /afs), afs_kill_super() will call afs_dynroot_depopulate() to unpin
      any created dentries - but this will oops if the root hasn't been
      created yet.
      
      Fix this by skipping that bit of code if there is no root dentry.
      
      The bug leads to an oops looking like:
      
      	general protection fault, ...
      	KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f]
      	...
      	RIP: 0010:afs_dynroot_depopulate+0x25f/0x529 fs/afs/dynroot.c:385
      	...
      	Call Trace:
      	 afs_kill_super+0x13b/0x180 fs/afs/super.c:535
      	 deactivate_locked_super+0x94/0x160 fs/super.c:335
      	 afs_get_tree+0x1124/0x1460 fs/afs/super.c:598
      	 vfs_get_tree+0x89/0x2f0 fs/super.c:1547
      	 do_new_mount fs/namespace.c:2875 [inline]
      	 path_mount+0x1387/0x2070 fs/namespace.c:3192
      	 do_mount fs/namespace.c:3205 [inline]
      	 __do_sys_mount fs/namespace.c:3413 [inline]
      	 __se_sys_mount fs/namespace.c:3390 [inline]
      	 __x64_sys_mount+0x27f/0x300 fs/namespace.c:3390
      	 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      which is oopsing on this line:
      
      	inode_lock(root->d_inode);
      
      presumably because sb->s_root was NULL.
      
      Fixes: 0da0b7fd ("afs: Display manually added cells in dynamic root mount")
      Reported-by: syzbot+c1eff8205244ae7e11a6@syzkaller.appspotmail.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • squashfs: avoid bio_alloc() failure with 1Mbyte blocks · f26044c8
      Phillip Lougher authored
      This is a regression introduced by the patch "migrate from ll_rw_block
      usage to BIO".
      
      bio_alloc() is limited to 256 pages (1 Mbyte). This can cause a
      failure when reading 1 Mbyte block filesystems. The problem is that a
      datablock can be fully (or almost fully) uncompressed, requiring 256
      pages, but, because blocks are not aligned to page boundaries, it may
      require 257 pages to read.
      
      bio_kmalloc() can handle 1024 pages, so use it for this edge
      condition.
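      The page arithmetic behind the edge case can be sketched as:

```python
PAGE_SIZE = 4096

def pages_spanned(offset_in_page: int, length: int) -> int:
    """Number of pages touched by `length` bytes that start at byte
    `offset_in_page` within their first page."""
    return (offset_in_page + length + PAGE_SIZE - 1) // PAGE_SIZE

# A page-aligned 1 Mbyte block needs 256 pages, but the same block
# starting one byte into a page spills into a 257th page.
print(pages_spanned(0, 1 << 20), pages_spanned(1, 1 << 20))
```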
      
      Fixes: 93e72b3c ("squashfs: migrate from ll_rw_block usage to BIO")
      Reported-by: Nicolas Prochazka <nicolas.prochazka@gmail.com>
      Reported-by: Tomoatsu Shimada <shimada@walbrix.com>
      Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Guenter Roeck <groeck@chromium.org>
      Cc: Philippe Liard <pliard@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Adrien Schildknecht <adrien+dev@schischi.me>
      Cc: Daniel Rosenberg <drosen@google.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200815035637.15319-1-phillip@squashfs.org.uk
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • romfs: fix uninitialized memory leak in romfs_dev_read() · bcf85fce
      Jann Horn authored
      romfs has a superblock field that limits the size of the filesystem; data
      beyond that limit is never accessed.
      
      romfs_dev_read() fetches a caller-supplied number of bytes from the
      backing device.  It returns 0 on success or an error code on failure;
      therefore, its API can't represent short reads, it's all-or-nothing.
      
      However, when romfs_dev_read() detects that the requested operation
      would cross the filesystem size limit, it currently silently truncates
      the requested number of bytes. This means, for example, that when the
      content of a
      file with size 0x1000 starts one byte before the filesystem size limit,
      ->readpage() will only fill a single byte of the supplied page while
      leaving the rest uninitialized, leaking that uninitialized memory to
      userspace.
      
      Fix it by returning an error code instead of truncating the read when the
      requested read operation would go beyond the end of the filesystem.
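      The all-or-nothing behaviour can be sketched like this (an
      illustrative userspace model, not the kernel function):

```python
EIO = 5  # errno value for an I/O error

def dev_read(backing: bytes, pos: int, count: int, fs_limit: int):
    """Hypothetical sketch: return the requested bytes, or a negative
    errno if the read would cross the filesystem size limit - never a
    silently truncated result."""
    if pos >= fs_limit or count > fs_limit - pos:
        return -EIO
    return backing[pos:pos + count]
```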
      
      Fixes: da4458bd ("NOMMU: Make it possible for RomFS to use MTD devices directly")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200818013202.2246365-1-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 21 Aug 2020, 3 commits
    • btrfs: detect nocow for swap after snapshot delete · a84d5d42
      Boris Burkov authored
      can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic
      for detecting a must-cow condition which is not exactly accurate, but
      saves unnecessary tree traversal. The incorrect assumption is that if
      the
      extent was created in a generation smaller than the last snapshot
      generation, it must be referenced by that snapshot. That is true, except
      the snapshot could have since been deleted, without affecting the last
      snapshot generation.
      
      The original patch claimed a performance win from this check, but it
      also leads to a bug where you are unable to use a swapfile if you ever
      snapshotted the subvolume it's in. Make the check slower and more strict
      for the swapon case, without modifying the general cow checks as a
      compromise. Turning swap on does not seem to be a particularly
      performance sensitive operation, so incurring a possibly unnecessary
      btrfs_search_slot seems worthwhile for the added usability.
      
      Note: Until the snapshot is completely cleaned up after deletion,
      check_committed_refs will still cause the logic to think that cow is
      necessary, so the user must wait until 'btrfs subvolume sync' has
      finished before activating the swapfile with swapon.
      
      CC: stable@vger.kernel.org # 5.4+
      Suggested-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: check the right error variable in btrfs_del_dir_entries_in_log · fb2fecba
      Josef Bacik authored
      With my new locking code dbench is so much faster that I tripped over a
      transaction abort from ENOSPC.  This turned out to be because
      btrfs_del_dir_entries_in_log was checking for ret == -ENOSPC, but this
      function sets err on error, and returns err.  So instead of properly
      marking the inode as needing a full commit, we were returning -ENOSPC
      and aborting in __btrfs_unlink_inode.  Fix this by checking the proper
      variable so that we return the correct thing in the case of ENOSPC.
      
      The ENOENT needs to be checked, because btrfs_lookup_dir_item_index()
      can return -ENOENT if the dir item isn't in the tree log (which would
      happen if we hadn't fsync'ed this guy).  We actually handle that case in
      __btrfs_unlink_inode, so it's an expected error to get back.
      
      Fixes: 4a500fd1 ("Btrfs: Metadata ENOSPC handling for tree log")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note and comment about ENOENT ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • afs: Fix key ref leak in afs_put_operation() · ba8e4207
      David Howells authored
      The afs_put_operation() function needs to put the reference to the key
      that's authenticating the operation.
      
      Fixes: e49c7b2f ("afs: Build an abstraction around an "operation" concept")
      Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 20 Aug 2020, 12 commits
    • io_uring: kill extra iovec=NULL in import_iovec() · 867a23ea
      Pavel Begunkov authored
      If io_import_iovec() returns an error, the returned iovec is undefined
      and must not be used, so don't set it to NULL when failing.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: comment on kfree(iovec) checks · f261c168
      Pavel Begunkov authored
      kfree() handles NULL pointers well, but io_{read,write}() checks for
      NULL for performance reasons. Leave a comment there for those who are
      tempted to patch it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix racy req->flags modification · bb175342
      Pavel Begunkov authored
      Setting and clearing REQ_F_OVERFLOW in io_uring_cancel_files() and
      io_cqring_overflow_flush() are racy, because they might be called
      asynchronously.
      
      The REQ_F_OVERFLOW flag is only needed for file cancellation, so if it
      can be guaranteed that requests _currently_ marked inflight can't be
      overflown, the problem can be solved by removing the flag altogether.
      
      That's how the patch works: it removes the inflight status of a
      request in io_cqring_fill_event() whenever it should be thrown into
      the CQ-overflow list. That's OK to do, because no opcode-specific
      handling can be done after io_cqring_fill_event() - the same
      assumption as with the "struct io_completion" patches. And there is
      already a good place for such cleanups, which is io_clean_op(). A nice
      side effect of this is removing the inflight check from the hot path.
      
      Note on synchronisation: __io_cqring_fill_event() may now be taking
      two spinlocks simultaneously, completion_lock and inflight_lock. That
      is fine, because we never take them in the reverse order, and
      CQ-overflow of inflight requests shouldn't happen often.
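      The lock-ordering rule the note relies on can be illustrated with a
      tiny sketch: two locks cannot deadlock as long as every path acquires
      them in the same order.

```python
import threading

completion_lock = threading.Lock()
inflight_lock = threading.Lock()

def fill_event():
    # Every caller takes completion_lock first, then inflight_lock;
    # since no path takes them in the reverse order, this pair of locks
    # cannot deadlock against each other.
    with completion_lock:
        with inflight_lock:
            return "ok"
```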
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: use system_unbound_wq for ring exit work · fc666777
      Jens Axboe authored
      We currently use system_wq, which is unbounded in terms of number of
      workers. This means that if we're exiting tons of rings at the same
      time, then we'll briefly spawn tons of event kworkers just for a very
      short blocking time as the rings exit.
      
      Use system_unbound_wq instead, which has a sane cap on the concurrency
      level.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • btrfs: fix space cache memory leak after transaction abort · bbc37d6e
      Filipe Manana authored
      If a transaction aborts it can cause a memory leak of the pages array of
      a block group's io_ctl structure. The following steps explain how that can
      happen:
      
      1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
         and it's about to start writing out dirty extent buffers;
      
      2) Transaction N + 1 already started and another task, task A, just called
         btrfs_commit_transaction() on it;
      
      3) Block group B was dirtied (extents allocated from it) by transaction
         N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
         very beginning of the transaction commit, it starts writeback for the
         block group's space cache by calling btrfs_write_out_cache(), which
         allocates the pages array for the block group's io_ctl with a call to
         io_ctl_init(). Block group B is added to the io_list of transaction
         N + 1 by btrfs_start_dirty_block_groups();
      
      4) While transaction N's commit is writing out the extent buffers, it gets
         an IO error and aborts transaction N, also setting the file system to
         RO mode;
      
      5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
         btrfs_commit_transaction() and has set transaction N + 1 state to
         TRANS_STATE_COMMIT_START. Immediately after that it checks that the
         filesystem was turned to RO mode, due to transaction N's abort, and
         jumps to the "cleanup_transaction" label. After that we end up at
         btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
         That helper finds block group B in the transaction's io_list but it
         never releases the pages array of the block group's io_ctl, resulting in
         a memory leak.
      
      In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
      array points to pages that were already released by us at
      __btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
      up freeing the pages array only after waiting for the ordered extent to
      complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
      that. But in the transaction abort case we don't wait for the space cache's
      ordered extent to complete through a call to btrfs_wait_cache_io(), so
      that's why we end up with a memory leak - we wait for the ordered extent
      to complete indirectly by shutting down the work queues and waiting for
      any jobs in them to complete before returning from close_ctree().
      
      We can solve the leak simply by freeing the pages array right after
      releasing the pages (with the call to io_ctl_drop_pages()) at
      __btrfs_write_out_cache(), since we will never use it again after
      that. Leaving the array pointing at already released pages is not a
      problem today, since no one uses it after that point, but it is bad
      practice anyway, as it can easily lead to use-after-free issues.
      
      This issue can often be reproduced with test case generic/475 from fstests
      and kmemleak can detect it and reports it with the following trace:
      
      unreferenced object 0xffff9bbf009fa600 (size 512):
        comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
        hex dump (first 32 bytes):
          00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff  ..|M=...@.|M=...
          80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff  ..|M=.....|M=...
        backtrace:
          [<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
          [<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
          [<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
          [<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
          [<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
          [<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
          [<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
          [<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
          [<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
          [<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
          [<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
          [<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
          [<0000000043ace2c9>] do_syscall_64+0x33/0x80
          [<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use the correct const function attribute for btrfs_get_num_csums · 604997b4
      David Sterba authored
      The build robot reports
      
      compiler: h8300-linux-gcc (GCC) 9.3.0
         In file included from fs/btrfs/tests/extent-map-tests.c:8:
      >> fs/btrfs/tests/../ctree.h:2166:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
          2166 | size_t __const btrfs_get_num_csums(void);
               |        ^~~~~~~
      
      The function attribute for const does not follow the expected scheme and
      in this case is confused with a const type qualifier.
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reset compression level for lzo on remount · 282dd7d7
      Marcos Paulo de Souza authored
      Currently a user can mount with "-o compress", which will set the
      compression algorithm to zlib and use the default compression level
      for zlib (3):
      
        relatime,compress=zlib:3,space_cache
      
      If the user remounts the fs using "-o compress=lzo", then the old
      compress_level is used:
      
        relatime,compress=lzo:3,space_cache
      
      But lzo does not expose any tunable compression level. The same
      happens if we switch to any other compression algorithm with a
      different level, e.g. zstd.
      
      Fix this by resetting the compress_level when compress=lzo is
      specified.  With the fix applied, lzo is shown without compress level:
      
        relatime,compress=lzo,space_cache
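      The remount fix can be sketched as follows (hypothetical names; the
      real option handling lives in btrfs's mount-option parser):

```python
LEVELED_ALGOS = {"zlib", "zstd"}  # algorithms with a tunable level

class CompressOpts:
    """Hypothetical sketch of the remount behaviour: picking an
    algorithm without a tunable level resets the stored level."""

    def __init__(self) -> None:
        self.algo = None
        self.level = 0

    def set_compress(self, algo: str, level: int = 0) -> None:
        self.algo = algo
        # lzo has no levels, so don't let an old level leak through
        self.level = level if algo in LEVELED_ALGOS else 0

    def show(self) -> str:
        if self.level:
            return f"compress={self.algo}:{self.level}"
        return f"compress={self.algo}"
```

      Mounting with zlib level 3 and then remounting with lzo now shows
      plain "compress=lzo" instead of "compress=lzo:3".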
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle errors from async submission · c965d640
      Johannes Thumshirn authored
      Btrfs' async submit mechanism is able to handle errors in the submission
      path and the meta-data async submit function correctly passes the error
      code to the caller.
      
      In btrfs_submit_bio_start() and btrfs_submit_bio_start_direct_io(),
      though, we're not handling the errors returned by btrfs_csum_one_bio()
      correctly and simply call BUG_ON(). This is unnecessary, as the caller
      of these two functions - run_one_async_start - correctly checks the
      return values
      and sets the status of the async_submit_bio. The actual bio submission
      will be handled later on by run_one_async_done only if
      async_submit_bio::status is 0, so the data won't be written if we
      encountered an error in the checksum process.
      
      Simply return the error from btrfs_csum_one_bio() to the async submitters,
      like it's done in btree_submit_bio_start().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • ext4: limit the length of per-inode prealloc list · 27bc446e
      brookxu authored
      In the scenario of writing sparse files, the per-inode prealloc list may
      be very long, resulting in high overhead for ext4_mb_use_preallocated().
      To circumvent this problem, we limit the maximum length of per-inode
      prealloc list to 512 and allow users to modify it.
      
      After patching, we observed that the CPU sys ratio dropped and system
      throughput increased significantly. We created a process to write a
      sparse file; its running time on the fixed kernel was significantly
      reduced, as follows:
      
      Running time on unfixed kernel:
      [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
      real    0m2.051s
      user    0m0.008s
      sys     0m2.026s
      
      Running time on fixed kernel:
      [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
      real    0m0.471s
      user    0m0.004s
      sys     0m0.395s
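      The capping idea can be sketched as a bounded list (illustrative
      only; the class name and discard policy here are assumptions, not the
      ext4 implementation):

```python
from collections import deque

MAX_INODE_PREALLOCS = 512  # default cap described above

class InodePreallocList:
    """Hypothetical sketch: keep the per-inode prealloc list bounded so
    scanning it stays cheap; oldest entries are discarded first."""

    def __init__(self, limit: int = MAX_INODE_PREALLOCS) -> None:
        self.limit = limit
        self.entries = deque()

    def add(self, extent) -> None:
        self.entries.append(extent)
        while len(self.entries) > self.limit:
            self.entries.popleft()  # discard the oldest preallocation
```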
      Signed-off-by: Chunguang Xu <brookxu@tencent.com>
      Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: reorganize if statement of ext4_mb_release_context() · 66d5e027
      brookxu authored
      Reorganize the if statement of ext4_mb_release_context() to make it
      easier to read.
      Signed-off-by: Chunguang Xu <brookxu@tencent.com>
      Link: https://lore.kernel.org/r/5439ac6f-db79-ad68-76c1-a4dda9aa0cc3@gmail.com
      Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: add mb_debug logging when there are lost chunks · c55ee7d2
      brookxu authored
      Lost chunks occur when some other process races with the current
      thread to grab a particular block allocation. Add mb_debug logging for
      developers who want to see how often this happens for a particular
      workload.
      Signed-off-by: Chunguang Xu <brookxu@tencent.com>
      Link: https://lore.kernel.org/r/0a165ac0-1912-aebd-8a0d-b42e7cd1aea1@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: Fix comment typo "the the". · 7ca4fcba
      kyoungho koo authored
      I found doubled words ("the the") in comments, so I reduced them to a
      single "the".
      Signed-off-by: kyoungho koo <rnrudgh@gmail.com>
      Link: https://lore.kernel.org/r/20200424171620.GA11943@koo-Z370-HD3
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>