1. 26 11月, 2019 18 次提交
    • J
      io_uring: io_allocate_scq_urings() should return a sane state · eb065d30
      Jens Axboe 提交于
      We currently rely on the ring destroy on cleaning things up in case of
      failure, but io_allocate_scq_urings() can leave things half initialized
      if only parts of it fails.
      
      Be nice and return with either everything setup in success, or return an
      error with things nicely cleaned up.
      
      Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      eb065d30
    • P
      io_uring: Always REQ_F_FREE_SQE for allocated sqe · bbad27b2
      Pavel Begunkov 提交于
      Always mark requests with allocated sqe and deallocate it in
      __io_free_req(). It's easier to follow and doesn't add edge cases.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bbad27b2
    • J
      io_uring: io_fail_links() should only consider first linked timeout · 5d960724
      Jens Axboe 提交于
      We currently clear the linked timeout field if we cancel such a timeout,
      but we should only attempt to cancel if it's the first one we see.
      Others should simply be freed like other requests, as they haven't
      been started yet.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5d960724
    • P
      io_uring: Fix leaking linked timeouts · 09fbb0a8
      Pavel Begunkov 提交于
      let have a dependant link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT
      
      1. submission stage: submission references for REQ and LINK_TIMEOUT
      are dropped. So, references respectively (1,1,2)
      
      2. io_put(REQ) + FAIL_LINKS stage: calls io_fail_links(), which for all
      linked timeouts will call cancel_timeout() and drop 1 reference.
      So, references after: (0,0,1). That's a leak.
      
      Make it treat only the first linked timeout as such, and pass others
      through __io_double_put_req().
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      09fbb0a8
    • P
      io_uring: remove redundant check · f70193d6
      Pavel Begunkov 提交于
      Pass any IORING_OP_LINK_TIMEOUT request further, where it will
      eventually fail in io_issue_sqe().
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f70193d6
    • P
      io_uring: break links for failed defer · d3b35796
      Pavel Begunkov 提交于
      If io_req_defer() failed, it needs to cancel a dependant link.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d3b35796
    • D
      io-wq: remove extra space characters · b2e9c7d6
      Dan Carpenter 提交于
      These lines are indented an extra space character.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b2e9c7d6
    • J
      io-wq: wait for io_wq_create() to setup necessary workers · b60fda60
      Jens Axboe 提交于
      We currently have a race where if setup is really slow, we can be
      calling io_wq_destroy() before we're done setting up. This will cause
      the caller to get stuck waiting for the manager to set things up, but
      the manager already exited.
      
      Fix this by doing a sync setup of the manager. This also fixes the case
      where if we failed creating workers, we'd also get stuck.
      
      In practice this race window was really small, as we already wait for
      the manager to start. Hence someone would have to call io_wq_destroy()
      after the task has started, but before it started the first loop. The
      reported test case forked tons of these, which is why it became an
      issue.
      
      Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com
      Fixes: 771b53d0 ("io-wq: small threadpool implementation for io_uring")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b60fda60
    • J
      io_uring: request cancellations should break links · fba38c27
      Jens Axboe 提交于
      We currently don't explicitly break links if a request is cancelled, but
      we should. Add explicitly link breakage for all types of request
      cancellations that we support.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fba38c27
    • J
      io_uring: correct poll cancel and linked timeout expiration completion · b0dd8a41
      Jens Axboe 提交于
      Currently a poll request fills a completion entry of 0, even if it got
      cancelled. This is odd, and it makes it harder to support with chains.
      Ensure that it returns -ECANCELED in the completions events if it got
      cancelled, and furthermore ensure that the linked timeout that triggered
      it completes with -ETIME if we did indeed trigger the completions
      through a timeout.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b0dd8a41
    • J
      io_uring: remove dead REQ_F_SEQ_PREV flag · e0e328c4
      Jens Axboe 提交于
      With the conversion to io-wq, we no longer use that flag. Kill it.
      
      Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e0e328c4
    • J
      io_uring: fix sequencing issues with linked timeouts · 94ae5e77
      Jens Axboe 提交于
      We have an issue with timeout links that are deeper in the submit chain,
      because we only handle it upfront, not from later submissions. Move the
      prep + issue of the timeout link to the async work prep handler, and do
      it normally for non-async queue. If we validate and prepare the timeout
      links upfront when we first see them, there's nothing stopping us from
      supporting any sort of nesting.
      
      Fixes: 2665abfd ("io_uring: add support for linked SQE timeouts")
      Reported-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      94ae5e77
    • J
      io_uring: make req->timeout be dynamically allocated · ad8a48ac
      Jens Axboe 提交于
      There are a few reasons for this:
      
      - As a prep to improving the linked timeout logic
      - io_timeout is the biggest member in the io_kiocb opcode union
      
      This also enables a few cleanups, like unifying the timer setup between
      IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple
      arguments to the link/prep helpers.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ad8a48ac
    • J
      io_uring: make io_double_put_req() use normal completion path · 978db57e
      Jens Axboe 提交于
      If we don't use the normal completion path, we may skip killing links
      that should be errored and freed. Add __io_double_put_req() for use
      within the completion path itself, other calls should just use
      io_double_put_req().
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      978db57e
    • J
      io_uring: cleanup return values from the queueing functions · 0e0702da
      Jens Axboe 提交于
      __io_queue_sqe(), io_queue_sqe(), io_queue_link_head() all return 0/err,
      but the caller doesn't care since the errors are handled inline. Clean
      these up and just make them void.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0e0702da
    • J
      io_uring: io_async_cancel() should pass in 'nxt' request pointer · 95a5bbae
      Jens Axboe 提交于
      If we have a linked request, this enables us to pass it back directly
      without having to go through async context.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      95a5bbae
    • L
      vfs: properly and reliably lock f_pos in fdget_pos() · 0be0ee71
      Linus Torvalds 提交于
      fdget_pos() is used by file operations that will read and update f_pos:
      things like "read()", "write()" and "lseek()" (but not, for example,
      "pread()/pwrite" that get their file positions elsewhere).
      
      However, it had two separate escape clauses for this, because not
      everybody wants or needs serialization of the file position.
      
      The first and most obvious case is the "file descriptor doesn't have a
      position at all", ie a stream-like file.  Except we didn't actually use
      FMODE_STREAM, but instead used FMODE_ATOMIC_POS.  The reason for that
      was that FMODE_STREAM didn't exist back in the days, but also that we
      didn't want to mark all the special cases, so we only marked the ones
      that _required_ position atomicity according to POSIX - regular files
      and directories.
      
      The case one was intentionally lazy, but now that we _do_ have
      FMODE_STREAM we could and should just use it.  With the change to use
      FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
      the code to set it is deleted.
      
      Any cases where we don't want the serialization because the driver (or
      subsystem) doesn't use the file position should just be updated to do
      "stream_open()".  We've done that for all the obvious and common
      situations, we may need a few more.  Quoting Kirill Smelkov in the
      original FMODE_STREAM thread (see link below for full email):
      
       "And I appreciate if people could help at least somehow with "getting
        rid of mixed case entirely" (i.e. always lock f_pos_lock on
        !FMODE_STREAM), because this transition starts to diverge from my
        particular use-case too far. To me it makes sense to do that
        transition as follows:
      
         - convert nonseekable_open -> stream_open via stream_open.cocci;
         - audit other nonseekable_open calls and convert left users that
           truly don't depend on position to stream_open;
         - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
           will cover pipes and sockets), or maybe convert pipes and sockets
           to FMODE_STREAM manually;
         - extend stream_open.cocci to analyze file_operations that use
           no_llseek or noop_llseek, but do not use nonseekable_open or
           alloc_file_pseudo. This might find files that have stream semantic
           but are opened differently;
         - extend stream_open.cocci to analyze file_operations whose
           .read/.write do not use ppos at all (independently of how file was
           opened);
         - ...
         - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
           !FMODE_STREAM;
         - gather bug reports for deadlocked read/write and convert missed
           cases to FMODE_STREAM, probably extending stream_open.cocci along
           the road to catch similar cases
      
        i.e. always take f_pos_lock unless a file is explicitly marked as
        being stream, and try to find and cover all files that are streams"
      
      We have not done the "extend stream_open.cocci to analyze
      alloc_file_pseudo" as well, but the previous commit did manually handle
      the case of pipes and sockets.
      
      The other case where we can avoid locking f_pos is the "this file
      descriptor only has a single user and it is us, and thus there is no
      need to lock it".
      
      The second test was correct, although a bit subtle and worth just
      re-iterating here.  There are two kinds of other sources of references
      to the same file descriptor: file descriptors that have been explicitly
      shared across fork() or with dup(), and file tables having elevated
      reference counts due to threading (or explicit file sharing with
      clone()).
      
      The first case would have incremented the file count explicitly, and in
      the second case the previous __fdget() would have incremented it for us
      and set the FDPUT_FPUT flag.
      
      But in both cases the file count would be greater than one, so the
      "file_count(file) > 1" test catches both situations.  Also note that if
      file_count is 1, that also means that no other thread can have access to
      the file table, so there also cannot be races with concurrent calls to
      dup()/fork()/clone() that would increment the file count any other way.
      
      Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eic Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0be0ee71
    • L
      vfs: mark pipes and sockets as stream-like file descriptors · d8e464ec
      Linus Torvalds 提交于
      In commit 3975b097 ("convert stream-like files -> stream_open, even
      if they use noop_llseek") Kirill used a coccinelle script to change
      "nonseekable_open()" to "stream_open()", which changed the trivial cases
      of stream-like file descriptors to the new model with FMODE_STREAM.
      
      However, the two big cases - sockets and pipes - don't actually have
      that trivial pattern at all, and were thus never converted to
      FMODE_STREAM even though it makes lots of sense to do so.
      
      That's particularly true when looking forward to the next change:
      getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
      decide whether f_pos updates are needed or not.  And if they are, we'll
      always do them atomically.
      
      This came up because KCSAN (correctly) noted that the non-locked f_pos
      updates are data races: they are clearly benign for the case where we
      don't care, but it would be good to just not have that issue exist at
      all.
      
      Note that the reason we used FMODE_ATOMIC_POS originally is that only
      doing it for the minimal required case is "safer" in that it's possible
      that the f_pos locking can cause unnecessary serialization across the
      whole write() call.  And in the worst case, that kind of serialization
      can cause deadlock issues: think writers that need readers to empty the
      state using the same file descriptor.
      
      [ Note that the locking is per-file descriptor - because it protects
        "f_pos", which is obviously per-file descriptor - so it only affects
        cases where you literally use the same file descriptor to both read
        and write.
      
        So a regular pipe that has separate reading and writing file
        descriptors doesn't really have this situation even though it's the
        obvious case of "reader empties what a bit writer concurrently fills"
      
        But we want to make pipes as being stream-line anyway, because we
        don't want the unnecessary overhead of locking, and because a named
        pipe can be (ab-)used by reading and writing to the same file
        descriptor. ]
      
      There are likely a lot of other cases that might want FMODE_STREAM, and
      looking for ".llseek = no_llseek" users and other cases that don't have
      an lseek file operation at all and making them use "stream_open()" might
      be a good idea.  But pipes and sockets are likely to be the two main
      cases.
      
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eic Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8e464ec
  2. 24 11月, 2019 1 次提交
  3. 23 11月, 2019 3 次提交
  4. 20 11月, 2019 1 次提交
    • D
      afs: Fix missing timeout reset · c74386d5
      David Howells 提交于
      In afs_wait_for_call_to_complete(), rather than immediately aborting an
      operation if a signal occurs, the code attempts to wait for it to
      complete, using a schedule timeout of 2*RTT (or min 2 jiffies) and a
      check that we're still receiving relevant packets from the server before
      we consider aborting the call.  We may even ping the server to check on
      the status of the call.
      
      However, there's a missing timeout reset in the event that we do
      actually get a packet to process, such that if we then get a couple of
      short stalls, we then time out when progress is actually being made.
      
      Fix this by resetting the timeout any time we get something to process.
      If it's the failure of the call then the call state will get changed and
      we'll exit the loop shortly thereafter.
      
      A symptom of this is data fetches and stores failing with EINTR when
      they really shouldn't.
      
      Fixes: bc5e3a54 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-by: NMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c74386d5
  5. 19 11月, 2019 17 次提交
    • D
      btrfs: drop bdev argument from submit_extent_page · fa17ed06
      David Sterba 提交于
      After previous patches removing bdev being passed around to set it to
      bio, it has become unused in submit_extent_page. So it now has "only" 13
      parameters.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fa17ed06
    • D
      btrfs: remove extent_map::bdev · a019e9e1
      David Sterba 提交于
      We can now remove the bdev from extent_map. Previous patches made sure
      that bio_set_dev is correctly in all places and that we don't need to
      grab it from latest_bdev or pass it around inside the extent map.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a019e9e1
    • D
      btrfs: drop bio_set_dev where not needed · 1a418027
      David Sterba 提交于
      bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
      also changing some state bits if there was a different bdev set before.
      This is one thing that's not needed.
      
      Another thing is that setting a bdev at bio allocation time is too early
      and actually does not work with plain redundancy profiles, where each
      time we submit a bio to a device, the bdev is set correctly.
      
      In many places the bio bdev is set to latest_bdev that seems to serve as
      a stub pointer "just to put something to bio". But we don't have to do
      that.
      
      Where do we know which bdev to set:
      
      * for regular IO: submit_stripe_bio that's called by btrfs_map_bio
      
      * repair IO: repair_io_failure, read or write from specific device
      
      * super block write (using buffer_heads but uses raw bdev) and barriers
      
      * scrub: this does not use all regular IO paths as it needs to reach all
        copies, verify and fixup eventually, and for that all bdev management
        is independent
      
      * raid56: rbio_add_io_page, for the RMW write
      
      * integrity-checker: does it's own low-level block tracking
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1a418027
    • D
      btrfs: get bdev directly from fs_devices in submit_extent_page · 429aebc0
      David Sterba 提交于
      This is preparatory patch to remove @bdev parameter from
      submit_extent_page. It can't be removed completely, because the cgroups
      need it for wbc when initializing the bio
      
      wbc_init_bio
        bio_associate_blkg_from_css
          dereference bdev->bi_disk->queue
      
      The bdev pointer is the same as latest_bdev, thus no functional change.
      We can retrieve it from fs_devices that's reachable through several
      dereferences. The local variable shadows the parameter, but that's only
      temporary.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      429aebc0
    • J
      btrfs: record all roots for rename exchange on a subvol · 3e174099
      Josef Bacik 提交于
      Testing with the new fsstress support for subvolumes uncovered a pretty
      bad problem with rename exchange on subvolumes.  We're modifying two
      different subvolumes, but we only start the transaction on one of them,
      so the other one is not added to the dirty root list.  This is caught by
      btrfs_cow_block() with a warning because the root has not been updated,
      however if we do not modify this root again we'll end up pointing at an
      invalid root because the root item is never updated.
      
      Fix this by making sure we add the destination root to the trans list,
      the same as we do with normal renames.  This fixes the corruption.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3e174099
    • F
      Btrfs: fix block group remaining RO forever after error during device replace · 042528f8
      Filipe Manana 提交于
      When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
      set the block group to RO mode and then wait for any ongoing writes into
      extents of the block group to complete. While doing that wait we overwrite
      the value of the variable 'ret' and can break out of the loop if an error
      happens without turning the block group back into RW mode. So what happens
      is the following:
      
      1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
         to RO mode (its ->ro field set to 1 or incremented to some value > 1);
      
      2) Then btrfs_wait_ordered_roots() returns a value > 0;
      
      3) Then if either joining or committing the transaction fails, we break
         out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving
         the block group in RO mode forever.
      
      To fix this, just remove the code that waits for ongoing writes to extents
      of the block group, since it's not needed because in the initial setup
      phase of a device replace operation, before starting to find all chunks
      and their extents, we set the target device for replace while holding
      fs_info->dev_replace->rwsem, which ensures that after releasing that
      semaphore, any writes into the source device are made to the target device
      as well (__btrfs_map_block() guarantees that). So while at
      scrub_enumerate_chunks() we only need to worry about finding and copying
      extents (from the source device to the target device) that were written
      before we started the device replace operation.
      
      Fixes: f0e9b7d6 ("Btrfs: fix race setting block group readonly during device replace")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      042528f8
    • Q
      btrfs: scrub: Don't check free space before marking a block group RO · b12de528
      Qu Wenruo 提交于
      [BUG]
      When running btrfs/072 with only one online CPU, it has a pretty high
      chance to fail:
      
        btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
        - output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
            --- tests/btrfs/072.out     2019-10-22 15:18:14.008965340 +0800
            +++ /xfstests-dev/results//btrfs/072.out.bad      2019-11-14 15:56:45.877152240 +0800
            @@ -1,2 +1,3 @@
             QA output created by 072
             Silence is golden
            +Scrub find errors in "-m dup -d single" test
            ...
      
      And with the following call trace:
      
        BTRFS info (device dm-5): scrub: started on devid 1
        ------------[ cut here ]------------
        BTRFS: Transaction aborted (error -27)
        WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        CPU: 0 PID: 55087 Comm: btrfs Tainted: G        W  O      5.4.0-rc1-custom+ #13
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        Call Trace:
         __btrfs_end_transaction+0xdb/0x310 [btrfs]
         btrfs_end_transaction+0x10/0x20 [btrfs]
         btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
         scrub_enumerate_chunks+0x264/0x940 [btrfs]
         btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
         btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
         do_vfs_ioctl+0x636/0xaa0
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x43/0x50
         do_syscall_64+0x79/0xe0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        ---[ end trace 166c865cec7688e7 ]---
      
      [CAUSE]
      The error number -27 is -EFBIG, returned from the following call chain:
      btrfs_end_transaction()
      |- __btrfs_end_transaction()
         |- btrfs_create_pending_block_groups()
            |- btrfs_finish_chunk_alloc()
               |- btrfs_add_system_chunk()
      
      This happens because we have used up all space of
      btrfs_super_block::sys_chunk_array.
      
      The root cause is, we have the following bad loop of creating tons of
      system chunks:
      
      1. The only SYSTEM chunk is being scrubbed
         It's very common to have only one SYSTEM chunk.
      2. New SYSTEM bg will be allocated
         As btrfs_inc_block_group_ro() will check if we have enough space
         after marking current bg RO. If not, then allocate a new chunk.
      3. New SYSTEM bg is still empty, will be reclaimed
         During the reclaim, we will mark it RO again.
      4. That newly allocated empty SYSTEM bg get scrubbed
         We go back to step 2, as the bg is already mark RO but still not
         cleaned up yet.
      
      If the cleaner kthread doesn't get executed fast enough (e.g. only one
      CPU), then we will get more and more empty SYSTEM chunks, using up all
      the space of btrfs_super_block::sys_chunk_array.
      
      [FIX]
      Since scrub/dev-replace doesn't always need to allocate new extent,
      especially chunk tree extent, so we don't really need to do chunk
      pre-allocation.
      
      To break above spiral, here we introduce a new parameter to
      btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
      need extra chunk pre-allocation.
      
      For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
      @do_chunk_alloc=false.
      This should keep unnecessary empty chunks from popping up for scrub.
      
      Also, since there are two parameters for btrfs_inc_block_group_ro(),
      add more comment for it.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b12de528
    • J
      btrfs: change btrfs_fs_devices::rotating to bool · 7f0432d0
      Johannes Thumshirn 提交于
      struct btrfs_fs_devices::rotating currently is declared as an integer
      variable but only used as a boolean.
      
      Change the variable definition to bool and update to code touching it to
      set 'true' and 'false'.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7f0432d0
    • J
      btrfs: change btrfs_fs_devices::seeding to bool · 0395d84f
      Johannes Thumshirn 提交于
      struct btrfs_fs_devices::seeding currently is declared as an integer
      variable but only used as a boolean.
      
      Change the variable definition to bool and update to code touching it to
      set 'true' and 'false'.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0395d84f
    • D
      btrfs: rename btrfs_block_group_cache · 32da5386
      David Sterba 提交于
      The type name is misleading, a single entry is named 'cache' while this
      normally means a collection of objects. Rename that everywhere. Also the
      identifier was quite long, making function prototypes harder to format.
      Suggested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      32da5386
    • Q
      btrfs: block-group: Reuse the item key from caller of read_one_block_group() · d49a2ddb
      Qu Wenruo 提交于
      For read_one_block_group(), its only caller has already got the item key
      to search next block group item.
      
      So we can use that key directly without doing our own convertion on
      stack.
      
      Also, since that key used in btrfs_read_block_groups() is vital for
      block group item search, add 'const' keyword for that parameter to
      prevent read_one_block_group() to modify it.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d49a2ddb
    • Q
      btrfs: block-group: Refactor btrfs_read_block_groups() · ffb9e0f0
      Qu Wenruo 提交于
      Refactor the work inside the loop of btrfs_read_block_groups() into one
      separate function, read_one_block_group().
      
      This allows read_one_block_group to be reused for later BG_TREE feature.
      
      The refactor does the following extra fix:
      - Use btrfs_fs_incompat() to replace open-coded feature check
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ffb9e0f0
    • D
      btrfs: document extent buffer locking · d4e253bb
      David Sterba 提交于
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d4e253bb
    • D
      btrfs: access eb::blocking_writers according to ACCESS_ONCE policies · a4477988
      David Sterba 提交于
      A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
      once policies can be found here
      https://lwn.net/Articles/799218/#Access-Marking%20Policies .
      
      The locked and unlocked access to eb::blocking_writers should be
      annotated accordingly, following this:
      
      Writes:
      
      - locked write must use ONCE, may use plain read
      - unlocked write must use ONCE
      
      Reads:
      
      - unlocked read must use ONCE
      - locked read may use plain read iff not mixed with unlocked read
      - unlocked read then locked must use ONCE
      
      There's one difference on the assembly level, where
      btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
      value and did not reevaluate it after taking the lock. This could have
      missed some opportunities to take the lock in case blocking writers
      changed between the calls, but the window is just a few instructions
      long. As this is in try-lock, the callers handle that.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a4477988
    • D
      btrfs: set blocking_writers directly, no increment or decrement · 40d38f53
      David Sterba 提交于
      The increment and decrement was inherited from previous version that
      used atomics, switched in commit 06297d8c ("btrfs: switch
      extent_buffer blocking_writers from atomic to int"). The only possible
      values are 0 and 1 so we can set them directly.
      
      The generated assembly (gcc 9.x) did the direct value assignment in
      btrfs_set_lock_blocking_write (asm diff after change in 06297d8c):
      
           5d:   test   %eax,%eax
           5f:   je     62 <btrfs_set_lock_blocking_write+0x22>
           61:   retq
      
        -  62:   lock incl 0x44(%rdi)
        -  66:   add    $0x50,%rdi
        -  6a:   jmpq   6f <btrfs_set_lock_blocking_write+0x2f>
      
        +  62:   movl   $0x1,0x44(%rdi)
        +  69:   add    $0x50,%rdi
        +  6d:   jmpq   72 <btrfs_set_lock_blocking_write+0x32>
      
      The part in btrfs_tree_unlock did a decrement because
      BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
      otherwise the output looks safe:
      
        - lock decl 0x44(%rdi)
      
        + sub    $0x1,%eax
        + mov    %eax,0x44(%rdi)
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      40d38f53
    • D
      btrfs: merge blocking_writers branches in btrfs_tree_read_lock · f5c2a525
      David Sterba 提交于
      There are two ifs that use eb::blocking_writers. As this is a variable
      modified inside and outside of locks, we could minimize number of
      accesses to avoid problems with getting different results at different
      times.
      
      The access here is locked so this can only race with btrfs_tree_unlock
      that sets blocking_writers to 0 without lock and unsets the lock owner.
      
      The first branch is taken only if the same thread already holds the
      lock, the second if checks for blocking writers. Here we'd either unlock
      and wait, or proceed. Both are valid states of the locking protocol.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f5c2a525
    • D
      btrfs: drop incompat bit for raid1c34 after last block group is gone · 9c907446
      David Sterba 提交于
      When there are no raid1c3 or raid1c4 block groups left after balance
      (either convert or with other filters applied), remove the incompat bit.
      This is already done for RAID56, do the same for RAID1C34.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9c907446