1. November 26, 2019 (26 commits)
    • io_uring: pass only !null to io_req_find_next() · 944e58bf
      Committed by Pavel Begunkov
      Make io_req_find_next() and io_req_link_next() accept only a non-NULL
      nxt, and handle NULL in the callers.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      944e58bf
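      A minimal, hedged sketch of the pattern described above; the struct and
      function names are hypothetical, not the real io_uring code:

        struct req {
                struct req *link;        /* next request in the chain, or NULL */
        };

        /* Assumes nxt is non-NULL; the NULL check lives in the caller now. */
        static void req_find_next(struct req *req, struct req **nxt)
        {
                *nxt = req->link;
        }

        static void req_put(struct req *req, struct req **nxt)
        {
                if (nxt)
                        req_find_next(req, nxt);
        }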
    • io_uring: remove io_free_req_find_next() · 70cf9f32
      Committed by Pavel Begunkov
      There is only one one-liner user of io_free_req_find_next(). Inline it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      70cf9f32
    • io_uring: add likely/unlikely in io_get_sqring() · 9835d6fa
      Committed by Pavel Begunkov
      The number of SQEs to submit is specified by the user, so io_get_sqring()
      succeeds in most cases. Hint the compiler about that.
      
      Checking the ASM generated by gcc 9.2.0 for x64, there is one branch
      misprediction.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9835d6fa
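      A minimal sketch of the likely()/unlikely() hinting pattern referred to
      above; the function is hypothetical, but the macros mirror the kernel's
      definitions on top of __builtin_expect():

        #include <stdbool.h>

        #define likely(x)   __builtin_expect(!!(x), 1)
        #define unlikely(x) __builtin_expect(!!(x), 0)

        /* Tell the compiler that failing to fetch the next entry is rare. */
        static unsigned int submit_all(unsigned int to_submit, bool (*get_next)(void))
        {
                unsigned int done = 0;

                while (to_submit--) {
                        if (unlikely(!get_next()))
                                break;
                        done++;
                }
                return done;
        }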
    • io_uring: rename __io_submit_sqe() · d732447f
      Committed by Pavel Begunkov
      __io_submit_sqe() is issuing requests, so name it as such. Moreover, it
      ends by calling io_iopoll_req_issued().
      
      Rename it and make the terminology clearer.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d732447f
    • io_uring: improve trace_io_uring_defer() trace point · 915967f6
      Committed by Jens Axboe
      We don't have shadow requests anymore, so get rid of the shadow
      argument. Add the user_data argument, as that's often useful for easily
      matching up requests, instead of having to look at request pointers.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      915967f6
    • io_uring: drain next sqe instead of shadowing · 1b4a51b6
      Committed by Pavel Begunkov
      There's an issue with the shadow drain logic in that we drop the
      completion lock after deciding to defer a request, then re-grab it later
      and assume that the state is still the same. In the meantime, someone
      else completing a request could have found and issued it. This can cause
      a stall in the queue, by having a shadow request inserted that nobody is
      going to drain.
      
      Additionally, if we fail allocating the shadow request, we simply ignore
      the drain.
      
      Instead of using a shadow request, defer the next request/link instead.
      This also has the following advantages:
      
      - removes semi-duplicated code
      - doesn't allocate memory for shadows
      - works better if only the head is marked for drain
      - doesn't need complex synchronisation
      
      On the flip side, it removes the shadow->seq ==
      last_drain_in_in_link->seq optimization. That shouldn't be a common
      case, and can always be added back, if needed.
      
      Fixes: 4fe2c963 ("io_uring: add support for link with drain")
      Cc: Jackie Liu <liuyun01@kylinos.cn>
      Reported-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1b4a51b6
    • io_uring: close lookup gap for dependent next work · b76da70f
      Committed by Jens Axboe
      When we find new work to process within the work handler, we queue the
      linked timeout before we have issued the new work. This can be
      problematic for very short timeouts, as we have a window where the new
      work isn't visible.
      
      Allow the work handler to store a callback function for this in the work
      item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that
      is set, then io-wq will call the callback when it has set up the new
      work item.
      Reported-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b76da70f
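      A hedged sketch of the flagged-callback idea from the entry above; the
      flag and field names are hypothetical, not the real io-wq definitions:

        #define WORK_HAS_CB  (1u << 0)

        struct work_item {
                unsigned int flags;
                void (*cb)(struct work_item *);  /* run once the work is visible */
        };

        static void worker_install(struct work_item *w)
        {
                /* ... make the new work item visible to the queue here ... */
                if (w->flags & WORK_HAS_CB)
                        w->cb(w);        /* e.g. arm a linked timeout only now */
        }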
    • io_uring: allow finding next link independent of req reference count · 4d7dd462
      Committed by Jens Axboe
      We currently try to start the next link when we put the request, and
      only if we were going to free it. This means that the optimization to
      continue executing requests from the same context often fails, as we're
      not putting the final reference.
      
      Add REQ_F_LINK_NEXT to keep track of this, and allow io_uring to find
      the next request more efficiently.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4d7dd462
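      A small sketch of the "hand off the next link exactly once" flag
      described above, independent of the reference count (hypothetical
      names, not the real io_kiocb flags):

        #include <stddef.h>

        #define F_LINK_NEXT  (1u << 0)

        struct linked_req {
                unsigned int flags;
                struct linked_req *link;
        };

        /* Returns the next request the first time only; later puts see the
         * flag and skip it, so the final reference need not do the hand-off. */
        static struct linked_req *find_next_once(struct linked_req *req)
        {
                if (req->flags & F_LINK_NEXT)
                        return NULL;
                req->flags |= F_LINK_NEXT;
                return req->link;
        }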
    • io_uring: io_allocate_scq_urings() should return a sane state · eb065d30
      Committed by Jens Axboe
      We currently rely on ring destruction to clean things up in case of
      failure, but io_allocate_scq_urings() can leave things half initialized
      if only part of it fails.
      
      Be nice and either return with everything set up on success, or return
      an error with things nicely cleaned up.
      
      Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eb065d30
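      A generic sketch of the "all or nothing" setup pattern the entry above
      asks for (hypothetical helpers, not the real io_uring functions):

        #include <stdlib.h>

        struct rings {
                void *sq;
                void *cq;
        };

        static int rings_alloc(struct rings *r, size_t sq_size, size_t cq_size)
        {
                r->sq = malloc(sq_size);
                if (!r->sq)
                        return -1;

                r->cq = malloc(cq_size);
                if (!r->cq) {
                        free(r->sq);    /* undo the part that did succeed */
                        r->sq = NULL;
                        return -1;
                }
                return 0;               /* caller sees either everything or nothing */
        }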
    • io_uring: Always REQ_F_FREE_SQE for allocated sqe · bbad27b2
      Committed by Pavel Begunkov
      Always mark requests that have an allocated sqe, and deallocate it in
      __io_free_req(). It's easier to follow and doesn't add edge cases.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bbad27b2
    • io_uring: io_fail_links() should only consider first linked timeout · 5d960724
      Committed by Jens Axboe
      We currently clear the linked timeout field if we cancel such a timeout,
      but we should only attempt to cancel if it's the first one we see.
      Others should simply be freed like other requests, as they haven't
      been started yet.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5d960724
    • io_uring: Fix leaking linked timeouts · 09fbb0a8
      Committed by Pavel Begunkov
      Let's have a dependent link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT
      
      1. Submission stage: the submission references for REQ and LINK_TIMEOUT
      are dropped. So, the references are respectively (1,1,2).
      
      2. io_put(REQ) + FAIL_LINKS stage: this calls io_fail_links(), which for
      all linked timeouts will call cancel_timeout() and drop 1 reference.
      So, the references after are (0,0,1). That's a leak.
      
      Make it treat only the first linked timeout as such, and pass the others
      through __io_double_put_req().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      09fbb0a8
    • io_uring: remove redundant check · f70193d6
      Committed by Pavel Begunkov
      Pass any IORING_OP_LINK_TIMEOUT request further, where it will
      eventually fail in io_issue_sqe().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f70193d6
    • io_uring: break links for failed defer · d3b35796
      Committed by Pavel Begunkov
      If io_req_defer() failed, it needs to cancel the dependent link.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d3b35796
    • io-wq: remove extra space characters · b2e9c7d6
      Committed by Dan Carpenter
      These lines are indented an extra space character.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b2e9c7d6
    • io-wq: wait for io_wq_create() to setup necessary workers · b60fda60
      Committed by Jens Axboe
      We currently have a race where, if setup is really slow, we can be
      calling io_wq_destroy() before we're done setting up. This will cause
      the caller to get stuck waiting for the manager to set things up, but
      the manager has already exited.
      
      Fix this by doing a sync setup of the manager. This also fixes the case
      where if we failed creating workers, we'd also get stuck.
      
      In practice this race window was really small, as we already wait for
      the manager to start. Hence someone would have to call io_wq_destroy()
      after the task has started, but before it started the first loop. The
      reported test case forked tons of these, which is why it became an
      issue.
      
      Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com
      Fixes: 771b53d0 ("io-wq: small threadpool implementation for io_uring")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b60fda60
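      A hedged sketch of doing the manager setup synchronously with the kernel
      completion API; the struct and function names are illustrative, not the
      real io-wq code:

        #include <linux/completion.h>
        #include <linux/err.h>
        #include <linux/kthread.h>

        struct wq_example {
                struct completion started;
                int setup_err;
        };

        static int manager_fn(void *data)
        {
                struct wq_example *wq = data;

                wq->setup_err = 0;          /* ... create the initial workers ... */
                complete(&wq->started);     /* creator may now proceed or tear down */
                /* ... main manager loop ... */
                return 0;
        }

        static int wq_example_create(struct wq_example *wq)
        {
                struct task_struct *t;

                init_completion(&wq->started);
                t = kthread_create(manager_fn, wq, "wq-manager");
                if (IS_ERR(t))
                        return PTR_ERR(t);
                wake_up_process(t);
                wait_for_completion(&wq->started);  /* setup finished before return */
                return wq->setup_err;
        }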
    • io_uring: request cancellations should break links · fba38c27
      Committed by Jens Axboe
      We currently don't explicitly break links if a request is cancelled, but
      we should. Add explicit link breakage for all types of request
      cancellations that we support.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fba38c27
    • io_uring: correct poll cancel and linked timeout expiration completion · b0dd8a41
      Committed by Jens Axboe
      Currently a poll request fills a completion entry of 0, even if it got
      cancelled. This is odd, and it makes it harder to support with chains.
      Ensure that it returns -ECANCELED in the completion events if it got
      cancelled, and furthermore ensure that the linked timeout that triggered
      it completes with -ETIME if we did indeed trigger the completion
      through a timeout.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b0dd8a41
    • io_uring: remove dead REQ_F_SEQ_PREV flag · e0e328c4
      Committed by Jens Axboe
      With the conversion to io-wq, we no longer use that flag. Kill it.
      
      Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e0e328c4
    • io_uring: fix sequencing issues with linked timeouts · 94ae5e77
      Committed by Jens Axboe
      We have an issue with timeout links that are deeper in the submit chain,
      because we only handle them upfront, not from later submissions. Move
      the prep + issue of the timeout link to the async work prep handler, and
      do it normally for the non-async queue. If we validate and prepare the
      timeout links upfront when we first see them, there's nothing stopping
      us from supporting any sort of nesting.
      
      Fixes: 2665abfd ("io_uring: add support for linked SQE timeouts")
      Reported-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      94ae5e77
    • io_uring: make req->timeout be dynamically allocated · ad8a48ac
      Committed by Jens Axboe
      There are a few reasons for this:
      
      - It's a prep for improving the linked timeout logic
      - io_timeout is the biggest member in the io_kiocb opcode union
      
      This also enables a few cleanups, like unifying the timer setup between
      IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple
      arguments to the link/prep helpers.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ad8a48ac
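      A sketch of the idea of keeping only a pointer in the per-request union
      and allocating the timeout state on demand (illustrative names, not the
      real io_kiocb layout):

        #include <stdlib.h>

        struct timeout_data {
                long nsec;
                int flags;
                /* ... timer state no longer inflating the opcode union ... */
        };

        struct request_example {
                union {
                        struct timeout_data *timeout;  /* now just a pointer */
                        int other_op_state;
                };
        };

        static int prep_timeout(struct request_example *req, long nsec)
        {
                req->timeout = malloc(sizeof(*req->timeout));
                if (!req->timeout)
                        return -1;
                req->timeout->nsec = nsec;
                req->timeout->flags = 0;
                return 0;
        }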
    • io_uring: make io_double_put_req() use normal completion path · 978db57e
      Committed by Jens Axboe
      If we don't use the normal completion path, we may skip killing links
      that should be errored and freed. Add __io_double_put_req() for use
      within the completion path itself; other calls should just use
      io_double_put_req().
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      978db57e
    • io_uring: cleanup return values from the queueing functions · 0e0702da
      Committed by Jens Axboe
      __io_queue_sqe(), io_queue_sqe() and io_queue_link_head() all return
      0/err, but the caller doesn't care since the errors are handled inline.
      Clean these up and just make them void.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0e0702da
    • io_uring: io_async_cancel() should pass in 'nxt' request pointer · 95a5bbae
      Committed by Jens Axboe
      If we have a linked request, this enables us to pass it back directly
      without having to go through async context.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      95a5bbae
    • vfs: properly and reliably lock f_pos in fdget_pos() · 0be0ee71
      Committed by Linus Torvalds
      fdget_pos() is used by file operations that will read and update f_pos:
      things like "read()", "write()" and "lseek()" (but not, for example,
      "pread()/pwrite" that get their file positions elsewhere).
      
      However, it had two separate escape clauses for this, because not
      everybody wants or needs serialization of the file position.
      
      The first and most obvious case is the "file descriptor doesn't have a
      position at all", ie a stream-like file.  Except we didn't actually use
      FMODE_STREAM, but instead used FMODE_ATOMIC_POS.  The reason for that
      was that FMODE_STREAM didn't exist back in the days, but also that we
      didn't want to mark all the special cases, so we only marked the ones
      that _required_ position atomicity according to POSIX - regular files
      and directories.
      
      Case one was intentionally lazy, but now that we _do_ have
      FMODE_STREAM we could and should just use it.  With the change to use
      FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
      the code to set it is deleted.
      
      Any cases where we don't want the serialization because the driver (or
      subsystem) doesn't use the file position should just be updated to do
      "stream_open()".  We've done that for all the obvious and common
      situations, we may need a few more.  Quoting Kirill Smelkov in the
      original FMODE_STREAM thread (see link below for full email):
      
       "And I appreciate if people could help at least somehow with "getting
        rid of mixed case entirely" (i.e. always lock f_pos_lock on
        !FMODE_STREAM), because this transition starts to diverge from my
        particular use-case too far. To me it makes sense to do that
        transition as follows:
      
         - convert nonseekable_open -> stream_open via stream_open.cocci;
         - audit other nonseekable_open calls and convert left users that
           truly don't depend on position to stream_open;
         - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
           will cover pipes and sockets), or maybe convert pipes and sockets
           to FMODE_STREAM manually;
         - extend stream_open.cocci to analyze file_operations that use
           no_llseek or noop_llseek, but do not use nonseekable_open or
           alloc_file_pseudo. This might find files that have stream semantic
           but are opened differently;
         - extend stream_open.cocci to analyze file_operations whose
           .read/.write do not use ppos at all (independently of how file was
           opened);
         - ...
         - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
           !FMODE_STREAM;
         - gather bug reports for deadlocked read/write and convert missed
           cases to FMODE_STREAM, probably extending stream_open.cocci along
           the road to catch similar cases
      
        i.e. always take f_pos_lock unless a file is explicitly marked as
        being stream, and try to find and cover all files that are streams"
      
      We have not yet done the "extend stream_open.cocci to analyze
      alloc_file_pseudo" step, but the previous commit did manually handle
      the case of pipes and sockets.
      
      The other case where we can avoid locking f_pos is the "this file
      descriptor only has a single user and it is us, and thus there is no
      need to lock it".
      
      The second test was correct, although a bit subtle and worth just
      re-iterating here.  There are two kinds of other sources of references
      to the same file descriptor: file descriptors that have been explicitly
      shared across fork() or with dup(), and file tables having elevated
      reference counts due to threading (or explicit file sharing with
      clone()).
      
      The first case would have incremented the file count explicitly, and in
      the second case the previous __fdget() would have incremented it for us
      and set the FDPUT_FPUT flag.
      
      But in both cases the file count would be greater than one, so the
      "file_count(file) > 1" test catches both situations.  Also note that if
      file_count is 1, that also means that no other thread can have access to
      the file table, so there also cannot be races with concurrent calls to
      dup()/fork()/clone() that would increment the file count any other way.
      
      Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0be0ee71
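      A hedged restatement of the locking decision described above as a small
      C sketch; it mirrors the FMODE_STREAM and file_count(file) > 1 tests but
      uses hypothetical names rather than the real fs/file.c code:

        #define EX_FMODE_STREAM  (1u << 0)

        struct file_example {
                unsigned int f_mode;   /* EX_FMODE_STREAM set for stream-like files */
                int f_count;           /* references to this struct file */
        };

        static int needs_f_pos_lock(const struct file_example *file)
        {
                if (file->f_mode & EX_FMODE_STREAM)
                        return 0;      /* no file position to protect */
                if (file->f_count <= 1)
                        return 0;      /* we are the only possible user */
                return 1;              /* shared and positional: take f_pos_lock */
        }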
    • vfs: mark pipes and sockets as stream-like file descriptors · d8e464ec
      Committed by Linus Torvalds
      In commit 3975b097 ("convert stream-like files -> stream_open, even
      if they use noop_llseek") Kirill used a coccinelle script to change
      "nonseekable_open()" to "stream_open()", which changed the trivial cases
      of stream-like file descriptors to the new model with FMODE_STREAM.
      
      However, the two big cases - sockets and pipes - don't actually have
      that trivial pattern at all, and were thus never converted to
      FMODE_STREAM even though it makes lots of sense to do so.
      
      That's particularly true when looking forward to the next change:
      getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
      decide whether f_pos updates are needed or not.  And if they are, we'll
      always do them atomically.
      
      This came up because KCSAN (correctly) noted that the non-locked f_pos
      updates are data races: they are clearly benign for the case where we
      don't care, but it would be good to just not have that issue exist at
      all.
      
      Note that the reason we used FMODE_ATOMIC_POS originally is that only
      doing it for the minimal required case is "safer" in that it's possible
      that the f_pos locking can cause unnecessary serialization across the
      whole write() call.  And in the worst case, that kind of serialization
      can cause deadlock issues: think writers that need readers to empty the
      state using the same file descriptor.
      
      [ Note that the locking is per-file descriptor - because it protects
        "f_pos", which is obviously per-file descriptor - so it only affects
        cases where you literally use the same file descriptor to both read
        and write.
      
        So a regular pipe that has separate reading and writing file
        descriptors doesn't really have this situation even though it's the
        obvious case of "reader empties what a writer concurrently fills".
      
        But we want to mark pipes as stream-like anyway, because we
        don't want the unnecessary overhead of locking, and because a named
        pipe can be (ab-)used by reading and writing to the same file
        descriptor. ]
      
      There are likely a lot of other cases that might want FMODE_STREAM, and
      looking for ".llseek = no_llseek" users and other cases that don't have
      an lseek file operation at all and making them use "stream_open()" might
      be a good idea.  But pipes and sockets are likely to be the two main
      cases.
      
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8e464ec
  2. November 24, 2019 (1 commit)
  3. November 23, 2019 (3 commits)
  4. November 20, 2019 (1 commit)
    • afs: Fix missing timeout reset · c74386d5
      Committed by David Howells
      In afs_wait_for_call_to_complete(), rather than immediately aborting an
      operation if a signal occurs, the code attempts to wait for it to
      complete, using a schedule timeout of 2*RTT (or min 2 jiffies) and a
      check that we're still receiving relevant packets from the server before
      we consider aborting the call.  We may even ping the server to check on
      the status of the call.
      
      However, there's a missing timeout reset in the event that we do
      actually get a packet to process, so if we then get a couple of short
      stalls, we time out even though progress is actually being made.
      
      Fix this by resetting the timeout any time we get something to process.
      If it's the failure of the call then the call state will get changed and
      we'll exit the loop shortly thereafter.
      
      A symptom of this is data fetches and stores failing with EINTR when
      they really shouldn't.
      
      Fixes: bc5e3a54 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c74386d5
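      A generic sketch of the "reset the deadline whenever we make progress"
      loop the fix describes (illustrative only, not the real
      afs_wait_for_call_to_complete()):

        #include <stdbool.h>

        static int wait_for_call(bool (*poll_once)(bool *made_progress), int rtt2)
        {
                int timeout = rtt2;
                bool progress;

                while (timeout > 0) {
                        if (poll_once(&progress))
                                return 0;        /* call completed */
                        if (progress)
                                timeout = rtt2;  /* the missing reset: start over */
                        else
                                timeout--;       /* still stalled, count down */
                }
                return -1;                       /* give up and abort the call */
        }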
  5. November 19, 2019 (9 commits)
    • btrfs: drop bdev argument from submit_extent_page · fa17ed06
      Committed by David Sterba
      After the previous patches removed the bdev that was passed around only
      to be set on the bio, it has become unused in submit_extent_page. So it
      now has "only" 13 parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      fa17ed06
    • btrfs: remove extent_map::bdev · a019e9e1
      Committed by David Sterba
      We can now remove the bdev from extent_map. Previous patches made sure
      that bio_set_dev is called correctly in all places and that we don't
      need to grab it from latest_bdev or pass it around inside the extent
      map.
      Signed-off-by: David Sterba <dsterba@suse.com>
      a019e9e1
    • btrfs: drop bio_set_dev where not needed · 1a418027
      Committed by David Sterba
      bio_set_dev sets a bdev on a bio, and it not only sets a pointer but
      also changes some state bits if a different bdev was set before. This
      is one thing that's not needed.
      
      Another thing is that setting a bdev at bio allocation time is too early
      and actually does not work with plain redundancy profiles, where each
      time we submit a bio to a device, the bdev is set correctly.
      
      In many places the bio bdev is set to latest_bdev, which seems to serve
      as a stub pointer "just to put something to bio". But we don't have to
      do that.
      
      Where do we know which bdev to set:
      
      * for regular IO: submit_stripe_bio that's called by btrfs_map_bio
      
      * repair IO: repair_io_failure, read or write from specific device
      
      * super block write (using buffer_heads but uses raw bdev) and barriers
      
      * scrub: this does not use all regular IO paths as it needs to reach all
        copies, verify and fixup eventually, and for that all bdev management
        is independent
      
      * raid56: rbio_add_io_page, for the RMW write
      
      * integrity-checker: does its own low-level block tracking
      Signed-off-by: David Sterba <dsterba@suse.com>
      1a418027
    • btrfs: get bdev directly from fs_devices in submit_extent_page · 429aebc0
      Committed by David Sterba
      This is a preparatory patch to remove the @bdev parameter from
      submit_extent_page. It can't be removed completely, because the cgroups
      need it for wbc when initializing the bio:
      
      wbc_init_bio
        bio_associate_blkg_from_css
          dereference bdev->bi_disk->queue
      
      The bdev pointer is the same as latest_bdev, thus no functional change.
      We can retrieve it from fs_devices, which is reachable through several
      dereferences. The local variable shadows the parameter, but that's only
      temporary.
      Signed-off-by: David Sterba <dsterba@suse.com>
      429aebc0
    • btrfs: record all roots for rename exchange on a subvol · 3e174099
      Committed by Josef Bacik
      Testing with the new fsstress support for subvolumes uncovered a pretty
      bad problem with rename exchange on subvolumes.  We're modifying two
      different subvolumes, but we only start the transaction on one of them,
      so the other one is not added to the dirty root list.  This is caught by
      btrfs_cow_block() with a warning because the root has not been updated;
      however, if we do not modify this root again we'll end up pointing at an
      invalid root because the root item is never updated.
      
      Fix this by making sure we add the destination root to the trans list,
      the same as we do with normal renames.  This fixes the corruption.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3e174099
    • Btrfs: fix block group remaining RO forever after error during device replace · 042528f8
      Committed by Filipe Manana
      When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
      set the block group to RO mode and then wait for any ongoing writes into
      extents of the block group to complete. While doing that wait we overwrite
      the value of the variable 'ret' and can break out of the loop if an error
      happens without turning the block group back into RW mode. So what happens
      is the following:
      
      1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
         to RO mode (its ->ro field set to 1 or incremented to some value > 1);
      
      2) Then btrfs_wait_ordered_roots() returns a value > 0;
      
      3) Then if either joining or committing the transaction fails, we break
         out of the loop without calling btrfs_dec_block_group_ro(), leaving
         the block group in RO mode forever.
      
      To fix this, just remove the code that waits for ongoing writes to extents
      of the block group, since it's not needed because in the initial setup
      phase of a device replace operation, before starting to find all chunks
      and their extents, we set the target device for replace while holding
      fs_info->dev_replace->rwsem, which ensures that after releasing that
      semaphore, any writes into the source device are made to the target device
      as well (__btrfs_map_block() guarantees that). So while at
      scrub_enumerate_chunks() we only need to worry about finding and copying
      extents (from the source device to the target device) that were written
      before we started the device replace operation.
      
      Fixes: f0e9b7d6 ("Btrfs: fix race setting block group readonly during device replace")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      042528f8
    • btrfs: scrub: Don't check free space before marking a block group RO · b12de528
      Committed by Qu Wenruo
      [BUG]
      When running btrfs/072 with only one online CPU, it has a pretty high
      chance to fail:
      
        btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
        - output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
            --- tests/btrfs/072.out     2019-10-22 15:18:14.008965340 +0800
            +++ /xfstests-dev/results//btrfs/072.out.bad      2019-11-14 15:56:45.877152240 +0800
            @@ -1,2 +1,3 @@
             QA output created by 072
             Silence is golden
            +Scrub find errors in "-m dup -d single" test
            ...
      
      And with the following call trace:
      
        BTRFS info (device dm-5): scrub: started on devid 1
        ------------[ cut here ]------------
        BTRFS: Transaction aborted (error -27)
        WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        CPU: 0 PID: 55087 Comm: btrfs Tainted: G        W  O      5.4.0-rc1-custom+ #13
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        Call Trace:
         __btrfs_end_transaction+0xdb/0x310 [btrfs]
         btrfs_end_transaction+0x10/0x20 [btrfs]
         btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
         scrub_enumerate_chunks+0x264/0x940 [btrfs]
         btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
         btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
         do_vfs_ioctl+0x636/0xaa0
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x43/0x50
         do_syscall_64+0x79/0xe0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        ---[ end trace 166c865cec7688e7 ]---
      
      [CAUSE]
      The error number -27 is -EFBIG, returned from the following call chain:
      btrfs_end_transaction()
      |- __btrfs_end_transaction()
         |- btrfs_create_pending_block_groups()
            |- btrfs_finish_chunk_alloc()
               |- btrfs_add_system_chunk()
      
      This happens because we have used up all space of
      btrfs_super_block::sys_chunk_array.
      
      The root cause is, we have the following bad loop of creating tons of
      system chunks:
      
      1. The only SYSTEM chunk is being scrubbed
         It's very common to have only one SYSTEM chunk.
      2. A new SYSTEM bg will be allocated
         btrfs_inc_block_group_ro() checks if we have enough space after
         marking the current bg RO. If not, it allocates a new chunk.
      3. The new SYSTEM bg is still empty, so it will be reclaimed
         During the reclaim, we will mark it RO again.
      4. That newly allocated empty SYSTEM bg gets scrubbed
         We go back to step 2, as the bg is already marked RO but not
         cleaned up yet.
      
      If the cleaner kthread doesn't get executed fast enough (e.g. only one
      CPU), then we will get more and more empty SYSTEM chunks, using up all
      the space of btrfs_super_block::sys_chunk_array.
      
      [FIX]
      Since scrub/dev-replace doesn't always need to allocate new extents,
      especially chunk tree extents, we don't really need to do chunk
      pre-allocation.
      
      To break the above spiral, introduce a new parameter,
      @do_chunk_alloc, to btrfs_inc_block_group_ro(), which indicates whether
      we need extra chunk pre-allocation.
      
      For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
      @do_chunk_alloc=false.
      This should keep unnecessary empty chunks from popping up for scrub.
      
      Also, since there are two parameters for btrfs_inc_block_group_ro(),
      add more comments for it.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b12de528
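      A hedged sketch of how the new flag is meant to be used: relocation
      still asks for chunk pre-allocation, scrub does not (simplified,
      hypothetical prototype rather than the real btrfs function):

        #include <stdbool.h>

        struct block_group_example { int ro; };

        static int inc_block_group_ro(struct block_group_example *cache,
                                      bool do_chunk_alloc)
        {
                if (do_chunk_alloc) {
                        /* check free space after marking RO and allocate a new
                         * chunk if needed; this is what fed the SYSTEM chunk
                         * spiral when scrub used it */
                }
                cache->ro++;            /* mark the block group read-only */
                return 0;
        }

        /* relocation:  inc_block_group_ro(cache, true);
         * scrub:       inc_block_group_ro(cache, false); */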
    • btrfs: change btrfs_fs_devices::rotating to bool · 7f0432d0
      Committed by Johannes Thumshirn
      struct btrfs_fs_devices::rotating is currently declared as an integer
      variable but only used as a boolean.
      
      Change the variable definition to bool and update the code touching it
      to set 'true' and 'false'.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7f0432d0
    • btrfs: change btrfs_fs_devices::seeding to bool · 0395d84f
      Committed by Johannes Thumshirn
      struct btrfs_fs_devices::seeding is currently declared as an integer
      variable but only used as a boolean.
      
      Change the variable definition to bool and update the code touching it
      to set 'true' and 'false'.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0395d84f
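      A minimal sketch of the int-to-bool conversion described in the two
      entries above (hypothetical struct, not the real btrfs_fs_devices):

        #include <stdbool.h>

        struct fs_devices_example {
                bool seeding;           /* was: int seeding; */
                bool rotating;          /* was: int rotating; */
        };

        static void set_rotational(struct fs_devices_example *fs, bool rotational)
        {
                fs->rotating = rotational;   /* assign 'true'/'false', not 0/1 */
        }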