1. 26 11月, 2019 2 次提交
    • L
      vfs: properly and reliably lock f_pos in fdget_pos() · 0be0ee71
      Linus Torvalds 提交于
      fdget_pos() is used by file operations that will read and update f_pos:
      things like "read()", "write()" and "lseek()" (but not, for example,
      "pread()/pwrite" that get their file positions elsewhere).
      
      However, it had two separate escape clauses for this, because not
      everybody wants or needs serialization of the file position.
      
      The first and most obvious case is the "file descriptor doesn't have a
      position at all", ie a stream-like file.  Except we didn't actually use
      FMODE_STREAM, but instead used FMODE_ATOMIC_POS.  The reason for that
      was that FMODE_STREAM didn't exist back in the days, but also that we
      didn't want to mark all the special cases, so we only marked the ones
      that _required_ position atomicity according to POSIX - regular files
      and directories.
      
      The case one was intentionally lazy, but now that we _do_ have
      FMODE_STREAM we could and should just use it.  With the change to use
      FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
      the code to set it is deleted.
      
      Any cases where we don't want the serialization because the driver (or
      subsystem) doesn't use the file position should just be updated to do
      "stream_open()".  We've done that for all the obvious and common
      situations, we may need a few more.  Quoting Kirill Smelkov in the
      original FMODE_STREAM thread (see link below for full email):
      
       "And I appreciate if people could help at least somehow with "getting
        rid of mixed case entirely" (i.e. always lock f_pos_lock on
        !FMODE_STREAM), because this transition starts to diverge from my
        particular use-case too far. To me it makes sense to do that
        transition as follows:
      
         - convert nonseekable_open -> stream_open via stream_open.cocci;
         - audit other nonseekable_open calls and convert left users that
           truly don't depend on position to stream_open;
         - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
           will cover pipes and sockets), or maybe convert pipes and sockets
           to FMODE_STREAM manually;
         - extend stream_open.cocci to analyze file_operations that use
           no_llseek or noop_llseek, but do not use nonseekable_open or
           alloc_file_pseudo. This might find files that have stream semantic
           but are opened differently;
         - extend stream_open.cocci to analyze file_operations whose
           .read/.write do not use ppos at all (independently of how file was
           opened);
         - ...
         - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
           !FMODE_STREAM;
         - gather bug reports for deadlocked read/write and convert missed
           cases to FMODE_STREAM, probably extending stream_open.cocci along
           the road to catch similar cases
      
        i.e. always take f_pos_lock unless a file is explicitly marked as
        being stream, and try to find and cover all files that are streams"
      
      We have not done the "extend stream_open.cocci to analyze
      alloc_file_pseudo" as well, but the previous commit did manually handle
      the case of pipes and sockets.
      
      The other case where we can avoid locking f_pos is the "this file
      descriptor only has a single user and it is us, and thus there is no
      need to lock it".
      
      The second test was correct, although a bit subtle and worth just
      re-iterating here.  There are two kinds of other sources of references
      to the same file descriptor: file descriptors that have been explicitly
      shared across fork() or with dup(), and file tables having elevated
      reference counts due to threading (or explicit file sharing with
      clone()).
      
      The first case would have incremented the file count explicitly, and in
      the second case the previous __fdget() would have incremented it for us
      and set the FDPUT_FPUT flag.
      
      But in both cases the file count would be greater than one, so the
      "file_count(file) > 1" test catches both situations.  Also note that if
      file_count is 1, that also means that no other thread can have access to
      the file table, so there also cannot be races with concurrent calls to
      dup()/fork()/clone() that would increment the file count any other way.
      
      Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eic Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0be0ee71
    • L
      vfs: mark pipes and sockets as stream-like file descriptors · d8e464ec
      Linus Torvalds 提交于
      In commit 3975b097 ("convert stream-like files -> stream_open, even
      if they use noop_llseek") Kirill used a coccinelle script to change
      "nonseekable_open()" to "stream_open()", which changed the trivial cases
      of stream-like file descriptors to the new model with FMODE_STREAM.
      
      However, the two big cases - sockets and pipes - don't actually have
      that trivial pattern at all, and were thus never converted to
      FMODE_STREAM even though it makes lots of sense to do so.
      
      That's particularly true when looking forward to the next change:
      getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
      decide whether f_pos updates are needed or not.  And if they are, we'll
      always do them atomically.
      
      This came up because KCSAN (correctly) noted that the non-locked f_pos
      updates are data races: they are clearly benign for the case where we
      don't care, but it would be good to just not have that issue exist at
      all.
      
      Note that the reason we used FMODE_ATOMIC_POS originally is that only
      doing it for the minimal required case is "safer" in that it's possible
      that the f_pos locking can cause unnecessary serialization across the
      whole write() call.  And in the worst case, that kind of serialization
      can cause deadlock issues: think writers that need readers to empty the
      state using the same file descriptor.
      
      [ Note that the locking is per-file descriptor - because it protects
        "f_pos", which is obviously per-file descriptor - so it only affects
        cases where you literally use the same file descriptor to both read
        and write.
      
        So a regular pipe that has separate reading and writing file
        descriptors doesn't really have this situation even though it's the
        obvious case of "reader empties what a bit writer concurrently fills"
      
        But we want to make pipes as being stream-line anyway, because we
        don't want the unnecessary overhead of locking, and because a named
        pipe can be (ab-)used by reading and writing to the same file
        descriptor. ]
      
      There are likely a lot of other cases that might want FMODE_STREAM, and
      looking for ".llseek = no_llseek" users and other cases that don't have
      an lseek file operation at all and making them use "stream_open()" might
      be a good idea.  But pipes and sockets are likely to be the two main
      cases.
      
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eic Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8e464ec
  2. 24 11月, 2019 1 次提交
  3. 23 11月, 2019 3 次提交
  4. 20 11月, 2019 1 次提交
    • D
      afs: Fix missing timeout reset · c74386d5
      David Howells 提交于
      In afs_wait_for_call_to_complete(), rather than immediately aborting an
      operation if a signal occurs, the code attempts to wait for it to
      complete, using a schedule timeout of 2*RTT (or min 2 jiffies) and a
      check that we're still receiving relevant packets from the server before
      we consider aborting the call.  We may even ping the server to check on
      the status of the call.
      
      However, there's a missing timeout reset in the event that we do
      actually get a packet to process, such that if we then get a couple of
      short stalls, we then time out when progress is actually being made.
      
      Fix this by resetting the timeout any time we get something to process.
      If it's the failure of the call then the call state will get changed and
      we'll exit the loop shortly thereafter.
      
      A symptom of this is data fetches and stores failing with EINTR when
      they really shouldn't.
      
      Fixes: bc5e3a54 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-by: NMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c74386d5
  5. 19 11月, 2019 33 次提交
    • D
      btrfs: drop bdev argument from submit_extent_page · fa17ed06
      David Sterba 提交于
      After previous patches removing bdev being passed around to set it to
      bio, it has become unused in submit_extent_page. So it now has "only" 13
      parameters.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fa17ed06
    • D
      btrfs: remove extent_map::bdev · a019e9e1
      David Sterba 提交于
      We can now remove the bdev from extent_map. Previous patches made sure
      that bio_set_dev is correctly in all places and that we don't need to
      grab it from latest_bdev or pass it around inside the extent map.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a019e9e1
    • D
      btrfs: drop bio_set_dev where not needed · 1a418027
      David Sterba 提交于
      bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
      also changing some state bits if there was a different bdev set before.
      This is one thing that's not needed.
      
      Another thing is that setting a bdev at bio allocation time is too early
      and actually does not work with plain redundancy profiles, where each
      time we submit a bio to a device, the bdev is set correctly.
      
      In many places the bio bdev is set to latest_bdev that seems to serve as
      a stub pointer "just to put something to bio". But we don't have to do
      that.
      
      Where do we know which bdev to set:
      
      * for regular IO: submit_stripe_bio that's called by btrfs_map_bio
      
      * repair IO: repair_io_failure, read or write from specific device
      
      * super block write (using buffer_heads but uses raw bdev) and barriers
      
      * scrub: this does not use all regular IO paths as it needs to reach all
        copies, verify and fixup eventually, and for that all bdev management
        is independent
      
      * raid56: rbio_add_io_page, for the RMW write
      
      * integrity-checker: does it's own low-level block tracking
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1a418027
    • D
      btrfs: get bdev directly from fs_devices in submit_extent_page · 429aebc0
      David Sterba 提交于
      This is preparatory patch to remove @bdev parameter from
      submit_extent_page. It can't be removed completely, because the cgroups
      need it for wbc when initializing the bio
      
      wbc_init_bio
        bio_associate_blkg_from_css
          dereference bdev->bi_disk->queue
      
      The bdev pointer is the same as latest_bdev, thus no functional change.
      We can retrieve it from fs_devices that's reachable through several
      dereferences. The local variable shadows the parameter, but that's only
      temporary.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      429aebc0
    • J
      btrfs: record all roots for rename exchange on a subvol · 3e174099
      Josef Bacik 提交于
      Testing with the new fsstress support for subvolumes uncovered a pretty
      bad problem with rename exchange on subvolumes.  We're modifying two
      different subvolumes, but we only start the transaction on one of them,
      so the other one is not added to the dirty root list.  This is caught by
      btrfs_cow_block() with a warning because the root has not been updated,
      however if we do not modify this root again we'll end up pointing at an
      invalid root because the root item is never updated.
      
      Fix this by making sure we add the destination root to the trans list,
      the same as we do with normal renames.  This fixes the corruption.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3e174099
    • F
      Btrfs: fix block group remaining RO forever after error during device replace · 042528f8
      Filipe Manana 提交于
      When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
      set the block group to RO mode and then wait for any ongoing writes into
      extents of the block group to complete. While doing that wait we overwrite
      the value of the variable 'ret' and can break out of the loop if an error
      happens without turning the block group back into RW mode. So what happens
      is the following:
      
      1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
         to RO mode (its ->ro field set to 1 or incremented to some value > 1);
      
      2) Then btrfs_wait_ordered_roots() returns a value > 0;
      
      3) Then if either joining or committing the transaction fails, we break
         out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving
         the block group in RO mode forever.
      
      To fix this, just remove the code that waits for ongoing writes to extents
      of the block group, since it's not needed because in the initial setup
      phase of a device replace operation, before starting to find all chunks
      and their extents, we set the target device for replace while holding
      fs_info->dev_replace->rwsem, which ensures that after releasing that
      semaphore, any writes into the source device are made to the target device
      as well (__btrfs_map_block() guarantees that). So while at
      scrub_enumerate_chunks() we only need to worry about finding and copying
      extents (from the source device to the target device) that were written
      before we started the device replace operation.
      
      Fixes: f0e9b7d6 ("Btrfs: fix race setting block group readonly during device replace")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      042528f8
    • Q
      btrfs: scrub: Don't check free space before marking a block group RO · b12de528
      Qu Wenruo 提交于
      [BUG]
      When running btrfs/072 with only one online CPU, it has a pretty high
      chance to fail:
      
        btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
        - output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
            --- tests/btrfs/072.out     2019-10-22 15:18:14.008965340 +0800
            +++ /xfstests-dev/results//btrfs/072.out.bad      2019-11-14 15:56:45.877152240 +0800
            @@ -1,2 +1,3 @@
             QA output created by 072
             Silence is golden
            +Scrub find errors in "-m dup -d single" test
            ...
      
      And with the following call trace:
      
        BTRFS info (device dm-5): scrub: started on devid 1
        ------------[ cut here ]------------
        BTRFS: Transaction aborted (error -27)
        WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        CPU: 0 PID: 55087 Comm: btrfs Tainted: G        W  O      5.4.0-rc1-custom+ #13
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        Call Trace:
         __btrfs_end_transaction+0xdb/0x310 [btrfs]
         btrfs_end_transaction+0x10/0x20 [btrfs]
         btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
         scrub_enumerate_chunks+0x264/0x940 [btrfs]
         btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
         btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
         do_vfs_ioctl+0x636/0xaa0
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x43/0x50
         do_syscall_64+0x79/0xe0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        ---[ end trace 166c865cec7688e7 ]---
      
      [CAUSE]
      The error number -27 is -EFBIG, returned from the following call chain:
      btrfs_end_transaction()
      |- __btrfs_end_transaction()
         |- btrfs_create_pending_block_groups()
            |- btrfs_finish_chunk_alloc()
               |- btrfs_add_system_chunk()
      
      This happens because we have used up all space of
      btrfs_super_block::sys_chunk_array.
      
      The root cause is, we have the following bad loop of creating tons of
      system chunks:
      
      1. The only SYSTEM chunk is being scrubbed
         It's very common to have only one SYSTEM chunk.
      2. New SYSTEM bg will be allocated
         As btrfs_inc_block_group_ro() will check if we have enough space
         after marking current bg RO. If not, then allocate a new chunk.
      3. New SYSTEM bg is still empty, will be reclaimed
         During the reclaim, we will mark it RO again.
      4. That newly allocated empty SYSTEM bg get scrubbed
         We go back to step 2, as the bg is already mark RO but still not
         cleaned up yet.
      
      If the cleaner kthread doesn't get executed fast enough (e.g. only one
      CPU), then we will get more and more empty SYSTEM chunks, using up all
      the space of btrfs_super_block::sys_chunk_array.
      
      [FIX]
      Since scrub/dev-replace doesn't always need to allocate new extent,
      especially chunk tree extent, so we don't really need to do chunk
      pre-allocation.
      
      To break above spiral, here we introduce a new parameter to
      btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
      need extra chunk pre-allocation.
      
      For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
      @do_chunk_alloc=false.
      This should keep unnecessary empty chunks from popping up for scrub.
      
      Also, since there are two parameters for btrfs_inc_block_group_ro(),
      add more comment for it.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b12de528
    • J
      btrfs: change btrfs_fs_devices::rotating to bool · 7f0432d0
      Johannes Thumshirn 提交于
      struct btrfs_fs_devices::rotating currently is declared as an integer
      variable but only used as a boolean.
      
      Change the variable definition to bool and update to code touching it to
      set 'true' and 'false'.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7f0432d0
    • J
      btrfs: change btrfs_fs_devices::seeding to bool · 0395d84f
      Johannes Thumshirn 提交于
      struct btrfs_fs_devices::seeding currently is declared as an integer
      variable but only used as a boolean.
      
      Change the variable definition to bool and update to code touching it to
      set 'true' and 'false'.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0395d84f
    • D
      btrfs: rename btrfs_block_group_cache · 32da5386
      David Sterba 提交于
      The type name is misleading, a single entry is named 'cache' while this
      normally means a collection of objects. Rename that everywhere. Also the
      identifier was quite long, making function prototypes harder to format.
      Suggested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      32da5386
    • Q
      btrfs: block-group: Reuse the item key from caller of read_one_block_group() · d49a2ddb
      Qu Wenruo 提交于
      For read_one_block_group(), its only caller has already got the item key
      to search next block group item.
      
      So we can use that key directly without doing our own convertion on
      stack.
      
      Also, since that key used in btrfs_read_block_groups() is vital for
      block group item search, add 'const' keyword for that parameter to
      prevent read_one_block_group() to modify it.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d49a2ddb
    • Q
      btrfs: block-group: Refactor btrfs_read_block_groups() · ffb9e0f0
      Qu Wenruo 提交于
      Refactor the work inside the loop of btrfs_read_block_groups() into one
      separate function, read_one_block_group().
      
      This allows read_one_block_group to be reused for later BG_TREE feature.
      
      The refactor does the following extra fix:
      - Use btrfs_fs_incompat() to replace open-coded feature check
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ffb9e0f0
    • D
      btrfs: document extent buffer locking · d4e253bb
      David Sterba 提交于
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d4e253bb
    • D
      btrfs: access eb::blocking_writers according to ACCESS_ONCE policies · a4477988
      David Sterba 提交于
      A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
      once policies can be found here
      https://lwn.net/Articles/799218/#Access-Marking%20Policies .
      
      The locked and unlocked access to eb::blocking_writers should be
      annotated accordingly, following this:
      
      Writes:
      
      - locked write must use ONCE, may use plain read
      - unlocked write must use ONCE
      
      Reads:
      
      - unlocked read must use ONCE
      - locked read may use plain read iff not mixed with unlocked read
      - unlocked read then locked must use ONCE
      
      There's one difference on the assembly level, where
      btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
      value and did not reevaluate it after taking the lock. This could have
      missed some opportunities to take the lock in case blocking writers
      changed between the calls, but the window is just a few instructions
      long. As this is in try-lock, the callers handle that.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a4477988
    • D
      btrfs: set blocking_writers directly, no increment or decrement · 40d38f53
      David Sterba 提交于
      The increment and decrement was inherited from previous version that
      used atomics, switched in commit 06297d8c ("btrfs: switch
      extent_buffer blocking_writers from atomic to int"). The only possible
      values are 0 and 1 so we can set them directly.
      
      The generated assembly (gcc 9.x) did the direct value assignment in
      btrfs_set_lock_blocking_write (asm diff after change in 06297d8c):
      
           5d:   test   %eax,%eax
           5f:   je     62 <btrfs_set_lock_blocking_write+0x22>
           61:   retq
      
        -  62:   lock incl 0x44(%rdi)
        -  66:   add    $0x50,%rdi
        -  6a:   jmpq   6f <btrfs_set_lock_blocking_write+0x2f>
      
        +  62:   movl   $0x1,0x44(%rdi)
        +  69:   add    $0x50,%rdi
        +  6d:   jmpq   72 <btrfs_set_lock_blocking_write+0x32>
      
      The part in btrfs_tree_unlock did a decrement because
      BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
      otherwise the output looks safe:
      
        - lock decl 0x44(%rdi)
      
        + sub    $0x1,%eax
        + mov    %eax,0x44(%rdi)
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      40d38f53
    • D
      btrfs: merge blocking_writers branches in btrfs_tree_read_lock · f5c2a525
      David Sterba 提交于
      There are two ifs that use eb::blocking_writers. As this is a variable
      modified inside and outside of locks, we could minimize number of
      accesses to avoid problems with getting different results at different
      times.
      
      The access here is locked so this can only race with btrfs_tree_unlock
      that sets blocking_writers to 0 without lock and unsets the lock owner.
      
      The first branch is taken only if the same thread already holds the
      lock, the second if checks for blocking writers. Here we'd either unlock
      and wait, or proceed. Both are valid states of the locking protocol.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f5c2a525
    • D
      btrfs: drop incompat bit for raid1c34 after last block group is gone · 9c907446
      David Sterba 提交于
      When there are no raid1c3 or raid1c4 block groups left after balance
      (either convert or with other filters applied), remove the incompat bit.
      This is already done for RAID56, do the same for RAID1C34.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9c907446
    • D
      btrfs: add incompat for raid1 with 3, 4 copies · cfbb825c
      David Sterba 提交于
      The new raid1c3 and raid1c4 profiles are backward incompatible and the
      name shall be 'raid1c34', the status can be found in the global
      supported features in /sys/fs/btrfs/features or in the per-filesystem
      directory.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cfbb825c
    • D
      btrfs: add support for 4-copy replication (raid1c4) · 8d6fac00
      David Sterba 提交于
      Add new block group profile to store 4 copies in a simliar way that
      current RAID1 does.  The profile attributes and constraints are defined
      in the raid table and used by the same code that already handles the 2-
      and 3-copy RAID1.
      
      The minimum number of devices is 4, the maximum number of devices/chunks
      that can be lost/damaged is 3. There is no comparable traditional RAID
      level, the profile is added for future needs to accompany triple-parity
      and beyond.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8d6fac00
    • D
      btrfs: add support for 3-copy replication (raid1c3) · 47e6f742
      David Sterba 提交于
      Add new block group profile to store 3 copies in a simliar way that
      current RAID1 does. The profile attributes and constraints are defined
      in the raid table and used by the same code that already handles the
      2-copy RAID1.
      
      The minimum number of devices is 3, the maximum number of devices/chunks
      that can be lost/damaged is 2. Like RAID6 but with 33% space
      utilization.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      47e6f742
    • D
      btrfs: sink write flags to cow_file_range_async · fac07d2b
      David Sterba 提交于
      In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios",
      cow_file_range_async gained wbc as a parameter and this makes passing
      write flags redundant. Set it inside the function and remove the
      parameter.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fac07d2b
    • D
      btrfs: sink write_flags to __extent_writepage_io · 57e5ffeb
      David Sterba 提交于
      __extent_writepage reads write flags from wbc and passes both to
      __extent_writepage_io. This makes write_flags redundant and we can
      remove it.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      57e5ffeb
    • F
      Btrfs: send, skip backreference walking for extents with many references · fd0ddbe2
      Filipe Manana 提交于
      Backreference walking, which is used by send to figure if it can issue
      clone operations instead of write operations, can be very slow and use
      too much memory when extents have many references. This change simply
      skips backreference walking when an extent has more than 64 references,
      in which case we fallback to a write operation instead of a clone
      operation. This limit is conservative and in practice I observed no
      signicant slowdown with up to 100 references and still low memory usage
      up to that limit.
      
      This is a temporary workaround until there are speedups in the backref
      walking code, and as such it does not attempt to add extra interfaces or
      knobs to tweak the threshold.
      Reported-by: NAtemu <atemu.main@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fd0ddbe2
    • F
      Btrfs: send, allow clone operations within the same file · 11f2069c
      Filipe Manana 提交于
      For send we currently skip clone operations when the source and
      destination files are the same. This is so because clone didn't support
      this case in its early days, but support for it was added back in May
      2013 by commit a96fbc72 ("Btrfs: allow file data clone within a
      file"). This change adds support for it.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdd
        $ mount /dev/sdd /mnt/sdd
      
        $ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar
        $ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap
      
        $ mkfs.btrfs -f /dev/sde
        $ mount /dev/sde /mnt/sde
      
        $ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde
      
      Without this change file foobar at the destination has a single 128Kb
      extent:
      
        $ filefrag -v /mnt/sde/snap/foobar
        Filesystem type is: 9123683e
        File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
         ext:     logical_offset:        physical_offset: length:   expected: flags:
           0:        0..      31:          0..        31:     32:             last,unknown_loc,delalloc,eof
        /mnt/sde/snap/foobar: 1 extent found
      
      With this we get a single 64Kb extent that is shared at file offsets 0
      and 64K, just like in the source filesystem:
      
        $ filefrag -v /mnt/sde/snap/foobar
        Filesystem type is: 9123683e
        File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
         ext:     logical_offset:        physical_offset: length:   expected: flags:
           0:        0..      15:       3328..      3343:     16:             shared
           1:       16..      31:       3328..      3343:     16:       3344: last,shared,eof
        /mnt/sde/snap/foobar: 2 extents found
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      11f2069c
    • Q
      btrfs: Ensure we trim ranges across block group boundary · 6b7faadd
      Qu Wenruo 提交于
      [BUG]
      When deleting large files (which cross block group boundary) with
      discard mount option, we find some btrfs_discard_extent() calls only
      trimmed part of its space, not the whole range:
      
        btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
      
      type:		bbio->map_type, in above case, it's SINGLE DATA.
      start:		Logical address of this trim
      len:		Logical length of this trim
      trimmed:	Physically trimmed bytes
      ratio:		trimmed / len
      
      Thus leaving some unused space not discarded.
      
      [CAUSE]
      When discard mount option is specified, after a transaction is fully
      committed (super block written to disk), we begin to cleanup pinned
      extents in the following call chain:
      
      btrfs_commit_transaction()
      |- btrfs_finish_extent_commit()
         |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
         |- btrfs_discard_extent()
      
      However, pinned extents are recorded in an extent_io_tree, which can
      merge adjacent extent states.
      
      When a large file gets deleted and it has adjacent file extents across
      block group boundary, we will get a large merged range like this:
      
            |<---    BG1    --->|<---      BG2     --->|
            |//////|<--   Range to discard   --->|/////|
      
      To discard that range, we have the following calls:
      
        btrfs_discard_extent()
        |- btrfs_map_block()
        |  Returned bbio will end at BG1's end. As btrfs_map_block()
        |  never returns result across block group boundary.
        |- btrfs_issuse_discard()
           Issue discard for each stripe.
      
      So we will only discard the range in BG1, not the remaining part in BG2.
      
      Furthermore, this bug is not that reliably observed, for above case, if
      there is no other extent in BG2, BG2 will be empty and btrfs will trim
      all space of BG2, covering up the bug.
      
      [FIX]
      - Allow __btrfs_map_block_for_discard() to modify @length parameter
        btrfs_map_block() uses its @length paramter to notify the caller how
        many bytes are mapped in current call.
        With __btrfs_map_block_for_discard() also modifing the @length,
        btrfs_discard_extent() now understands when to do extra trim.
      
      - Call btrfs_map_block() in a loop until we hit the range end Since we
        now know how many bytes are mapped each time, we can iterate through
        each block group boundary and issue correct trim for each range.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6b7faadd
    • Q
      btrfs: volumes: Use more straightforward way to calculate map length · 2d974619
      Qu Wenruo 提交于
      The old code goes:
      
       	offset = logical - em->start;
      	length = min_t(u64, em->len - offset, length);
      
      Where @length calculation is dependent on offset, it can take reader
      several more seconds to find it's just the same code as:
      
       	offset = logical - em->start;
      	length = min_t(u64, em->start + em->len - logical, length);
      
      Use above code to make the length calculate independent from other
      variable, thus slightly increase the readability.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2d974619
    • Q
      btrfs: tree-checker: Check item size before reading file extent type · 153a6d29
      Qu Wenruo 提交于
      In check_extent_data_item(), we read file extent type without verifying
      if the item size is valid.
      
      Add such check to ensure the file extent type we read is correct.
      
      The check is not as accurate as we need to cover both inline and regular
      extents, so it only checks if the item size is larger or equal to inline
      header.
      So the existing size checks on inline/regular extents are still needed.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      153a6d29
    • D
      btrfs: clean up locking name in scrub_enumerate_chunks() · 3ec17a67
      Dan Carpenter 提交于
      The "&fs_info->dev_replace.rwsem" and "&dev_replace->rwsem" refer to
      the same lock but Smatch is not clever enough to figure that out so it
      leads to static checker warnings.  It's better to use it consistently
      anyway.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3ec17a67
    • N
      btrfs: Streamline btrfs_fs_info::backup_root_index semantics · 6ef108dd
      Nikolay Borisov 提交于
      The backup_root_index member stores the index at which the backup root
      should be saved upon next transaction commit. However, there is a
      small deviation from this behavior in the form of a check in
      backup_super_roots which checks if current root generation equals to the
      generation of the previous root. This can trigger in the following
      scenario:
      
      slot0: gen-2
      slot1: gen-1
      slot2: gen
      slot3: unused
      
      Now suppose slot3 (which is also the root specified in the super block)
      is corrupted hence init_tree_roots chooses to use the backup root at
      slot2, meaning read_backup_root will read slot2 and assign the
      superblock generation to gen-1. Despite this backup_root_index will
      point at slot3 because its init happens in init_backup_root_slot, long
      before any parsing of the backup roots occur. Then on next transaction
      start, gen-1 will be incremented by 1 making the root's generation
      equal gen. Subsequently, on transaction commit the following check
      triggers:
      
        if (btrfs_backup_tree_root_gen(root_backup) ==
                 btrfs_header_generation(info->tree_root->node))
      
      This causes the 'next_backup', which is the index at which the backup is
      going to be written to, to set to last_backup, which will be slot2.
      
      All of this is a very confusing way of expressing the following
      invariant:
      
       Always write a backup root at the index following the last used backup
       root.
      
      This commit streamlines this logic by setting backup_root_index to the
      next index after the one used for mount.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6ef108dd
    • N
      btrfs: Rename find_oldest_super_backup to init_backup_root_slot · 4ac039ad
      Nikolay Borisov 提交于
      The old name name was an awful misnomer because it didn't really find
      the oldest super backup per-se but rather its slot. For example if we
      have:
      
      slot0: gen - 2
      slot1: gen - 1
      slot2: gen
      slot3: empty
      
      init_backup_root_slot will return slot3 and not slot0.
      
      The new name is more appropriate since the function doesn't care whether
      there is a valid backup in the returned slot or not.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4ac039ad
    • N
      btrfs: Remove unused next_root_backup function · 260eb11b
      Nikolay Borisov 提交于
      This function has been superseded by previous commits and is no longer
      used so just remove it.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      260eb11b
    • N
      btrfs: Don't use objectid_mutex during mount · 336a0d8d
      Nikolay Borisov 提交于
      Since the filesystem is not well formed and no trees are loaded it's
      pointless holding the objectid_mutex. Just remove its usage.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      336a0d8d
    • N
      btrfs: Factor out tree roots initialization during mount · b8522a1e
      Nikolay Borisov 提交于
      The code responsible for reading and initializing tree roots is
      scattered in open_ctree among 2 labels, emulating a loop. This is rather
      confusing to reason about. Instead, factor the code to a new function,
      init_tree_roots which implements the same logical flow.
      
      There are a couple of notable differences, namely:
      
      * Instead of using next_backup_root it's using the newly introduced
        read_backup_root.
      
      * If read_backup_root returns an error init_tree_roots propagates the
        error and there is no special handling of that case e.g. the code jumps
        straight to 'fail_tree_roots' label. The old code, however, was
        (erroneously) jumping to 'fail_block_groups' label if next_backup_root
        did fail, this was unnecessary since the tree roots init logic doesn't
        modify the state of block groups.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b8522a1e