1. 27 October 2021, 5 commits
    • btrfs: cleanup for extent_write_locked_range() · 2bd0fc93
      By Qu Wenruo
      There are several cleanups for extent_write_locked_range(); most are
      pure cleanups, with some preparation for future subpage support.
      
      - Add a proper comment for which call sites are suitable
        Unlike regular synchronized extent write back, if async COW or zoned
        COW happens, we have all pages in the range still locked.
      
        Thus for those (only) two call sites, we need this function to put
        the page contents into bios and submit them.
      
      - Remove @mode parameter
        Both existing call sites pass WB_SYNC_ALL, so there is no need for
        the @mode parameter.
      
      - Better error handling
        Currently if we hit an error during the page iteration loop, we
        overwrite @ret, so only the last error gets recorded.

        Here we add @found_error and @first_error variables to record
        whether we hit any error and which error we hit first, so the first
        error won't get lost (see the sketch after this list).
      
      - Don't reuse @start as the cursor
        Reusing the parameter @start as the cursor to iterate the range is
        not a big problem, but since we're here, introduce a proper @cur as
        the cursor.
      
      - Remove impossible branch
        Since all pages are still locked after the ordered extent is
        inserted, there is no way a page can get its dirty bit cleared.
        Remove the branch where the page is not dirty and replace it with
        an ASSERT().
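
      As a rough illustration of the error tracking described in the list
      above, here is a minimal sketch in C; process_one_page() is a
      hypothetical stand-in for the real per-page submission work, not a
      function from this patch:

        bool found_error = false;
        int first_error = 0;
        int ret;
        u64 cur;

        for (cur = start; cur <= end; cur += PAGE_SIZE) {
                ret = process_one_page(inode, cur);  /* hypothetical helper */
                if (ret < 0 && !found_error) {
                        /* Remember only the first failure, keep iterating. */
                        found_error = true;
                        first_error = ret;
                }
        }
        return found_error ? first_error : 0;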
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
      By Qu Wenruo
      Previously we had "struct btrfs_bio", which records IO context for
      mirrored IO and RAID56, and "struct btrfs_io_bio", which records extra
      btrfs specific info for a logical bytenr bio.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
      The struct btrfs_bio changes meaning with this commit.  A name like
      btrfs_logical_bio was suggested, but it's a bit long and we prefer the
      shorter name.
      
      This could be a concern for backports to older kernels, where the
      different meaning could cause confusion or bugs.  Comparing the new
      and old structures, there's no overlap among the struct members, so a
      build would break in case of an incorrect backport.

      We haven't had many backports to the bio code anyway, so this is more
      of a theoretical source of bugs and a matter of precaution, but we'll
      need to keep the semantic change in mind.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove btrfs_bio_alloc() helper · cd8e0cca
      By Qu Wenruo
      The helper btrfs_bio_alloc() is almost the same as btrfs_io_bio_alloc(),
      except that it allocates using BIO_MAX_VECS as @nr_iovecs and
      initializes bio->bi_iter.bi_sector.
      
      However the name does not use "btrfs_io_bio" to indicate that its
      parameter is "struct btrfs_io_bio", so it can easily be confused with
      "struct btrfs_bio".
      
      Considering that assigning bio->bi_iter.bi_sector is such simple work
      and there are already plenty of call sites doing it manually, there is
      no need to do it in a helper.
      
      Remove the btrfs_bio_alloc() helper, enhance btrfs_io_bio_alloc() to
      provide a fail-safe value for its @nr_iovecs, and then replace all
      btrfs_bio_alloc() callers with btrfs_io_bio_alloc().
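
      A minimal sketch of the fail-safe @nr_iovecs handling described above,
      assuming BIO_MAX_VECS as the upper bound; the allocation call and the
      btrfs_bioset bio set are assumptions based on how the existing helpers
      work, not the verbatim patch:

        struct bio *btrfs_io_bio_alloc(unsigned int nr_iovecs)
        {
                /* Fall back to the maximum if the caller passes 0 or too many. */
                if (nr_iovecs == 0 || nr_iovecs > BIO_MAX_VECS)
                        nr_iovecs = BIO_MAX_VECS;
                return bio_alloc_bioset(GFP_NOFS, nr_iovecs, &btrfs_bioset);
        }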
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: unexport repair_io_failure() · 38d5e541
      By Qu Wenruo
      Function repair_io_failure() is no longer used outside of extent_io.c
      since commit 8b9b6f25 ("btrfs: scrub: cleanup the remaining nodatasum
      fixup code"), which removed the last external caller.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: finish fully written block group · be1a1d7a
      By Naohiro Aota
      If we have written to the zone capacity, the device automatically
      deactivates the zone. Sync up the block group side (the active BG list
      and the zone_is_active flag) with it.
      
      We need to do it for both data BGs and metadata BGs. On the data side,
      we add a hook to btrfs_finish_ordered_io(). On the metadata side, we
      use end_extent_buffer_writeback().
      
      To reduce excess lookups of a block group, we mark the last extent
      buffer in a block group with the EXTENT_BUFFER_ZONE_FINISH flag. This
      cannot be done for data (ordered_extent), because the address may
      change due to REQ_OP_ZONE_APPEND.
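
      A sketch of the marking described above; only the flag set on the last
      extent buffer is shown, and the code that actually finishes the zone
      on write-back completion is omitted:

        /* Tag the last extent buffer of the block group (sketch only). */
        set_bit(EXTENT_BUFFER_ZONE_FINISH, &eb->bflags);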
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 23 August 2021, 1 commit
  3. 21 June 2021, 4 commits
    • btrfs: rename PagePrivate2 to PageOrdered inside btrfs · f57ad937
      By Qu Wenruo
      Inside btrfs we use the Private2 page status to indicate that we have
      an ordered extent with pending IO for the sector.
      
      But the page status name, Private2, tells us nothing about the bit
      itself, so this patch renames it to Ordered and adds an extra comment
      about the bit, so that a reader who is still uncertain about the page
      Ordered status can find the explanation easily.
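
      The rename is effectively a btrfs-local alias over the existing
      Private2 page bit; a minimal sketch of what such wrappers look like
      (the exact macro names in the patch may differ):

        /* Sketch: btrfs-local wrappers over the generic Private2 page bit. */
        #define PageOrdered(page)       PagePrivate2(page)
        #define SetPageOrdered(page)    SetPagePrivate2(page)
        #define ClearPageOrdered(page)  ClearPagePrivate2(page)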
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor submit_extent_page() to make bio and its flag tracing easier · 390ed29b
      By Qu Wenruo
      There is a lot of code inside extent_io.c that needs both "struct bio
      **bio_ret" and "unsigned long prev_bio_flags", along with parameters
      like "unsigned long bio_flags".

      Such strange parameters exist for bio assembly.
      
      For example, consider the following inode page layout:
      
        0       4K      8K      12K
        |<-- Extent A-->|<- EB->|
      
      Then what we do is:
      
      - Page [0, 4K)
        *bio_ret = NULL
        So we allocate a new bio for *bio_ret and
        add page [0, 4K) to it.

      - Page [4K, 8K)
        *bio_ret != NULL
        This page is contiguous with *bio_ret, so if
        we're not at a stripe boundary, we add
        page [4K, 8K) to *bio_ret.

      - Page [8K, 12K)
        *bio_ret != NULL
        This page is not contiguous, so we submit
        *bio_ret, allocate a new bio, and add
        page [8K, 12K) to the new bio.
      
      This means we need to record both the bio and its bio_flags, but we
      record them manually using that strange parameter list rather than
      encapsulating them in their own structure.

      So this patch introduces a new structure, btrfs_bio_ctrl, to record
      both the bio and its bio_flags.
      
      Also, in the above case, for every page added to the bio we need to
      check whether the new page crosses a stripe boundary.  This check can
      be time consuming, and we don't really need to do it for each page.

      This patch therefore also integrates the stripe boundary check into
      btrfs_bio_ctrl.  When a new bio is allocated, the stripe and ordered
      extent boundaries are calculated as well, so no matter how large the
      bio grows, we only calculate the boundaries once, saving some CPU
      time.
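
      A minimal sketch of the control structure described above; the member
      names are illustrative and may not match the patch exactly:

        struct btrfs_bio_ctrl {
                struct bio *bio;              /* bio currently being assembled */
                unsigned long bio_flags;      /* flags of that bio */
                u64 len_to_stripe_boundary;   /* pre-calculated at bio allocation */
                u64 len_to_oe_boundary;       /* ordered extent boundary, ditto */
        };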
      
      The following functions/structures are affected:
      
      - struct extent_page_data
        Replace its bio pointer with structure btrfs_bio_ctrl (embedded
        structure, not pointer)
      
      - end_write_bio()
      - flush_write_bio()
        Just change how bio is fetched
      
      - btrfs_bio_add_page()
        Use pre-calculated boundaries instead of re-calculating them.
        And use @bio_ctrl to replace @bio and @prev_bio_flags.
      
      - calc_bio_boundaries()
        New function
      
      - submit_extent_page() callers
      - btrfs_do_readpage() callers
      - contiguous_readpages() callers
        Use @bio_ctrl to replace @bio and @prev_bio_flags, and change how
        the bio is grabbed.
      
      - btrfs_bio_fits_in_ordered_extent()
        Removed, as the ordered extent size limit is now applied at bio
        allocation time, so there is no need to check each page range.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove io_failure_record::in_validation · 1245835d
      By Qu Wenruo
      The io_failure_record::in_validation flag was introduced to handle
      failed bios that cross several sectors.  In such a case, we still need
      to verify which sectors are corrupted.
      
      But since we've changed the way we handle corrupted sectors, by
      submitting a repair for each corrupted sector only, there is no need
      for the extra validation any more.

      This patch cleans up all io_failure_record::in_validation related
      code.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: submit read time repair only for each corrupted sector · 150e4b05
      By Qu Wenruo
      Currently btrfs_submit_read_repair() has some extra checks on whether
      the failed bio needs extra validation for repair.  But we can avoid
      all these extra mechanisms if we submit the repair for each sector.

      With this, each read repair can be handled easily without the need to
      verify which sector is corrupted.

      This will also benefit subpage, as one subpage bvec can contain
      several sectors, which makes the extra verification more complex.
      
      So this patch will:
      
      - Introduce repair_one_sector()
        The main code submitting repair, which is more or less the same as old
        btrfs_submit_read_repair().
        But this time, it only repairs one sector.
      
      - Make btrfs_submit_read_repair() handle sectors differently
        There are 3 different cases:
      
        * Good sector
          We need to release the page and extent, set the range uptodate.
      
        * Bad sector and failed to submit repair bio
          We need to release the page and extent, but not set the range
          uptodate.
      
        * Bad sector but repair bio submitted
          The page and extent release will be handled by the submitted repair
          bio. Nothing needs to be done.
      
        Since btrfs_submit_read_repair() will handle the page and extent
        release now, we need to skip to the next bvec even if we hit an
        error.
      
      - Change the lifespan of @uptodate in end_bio_extent_readpage()
        Since btrfs_submit_read_repair() now handles the full bvec that
        contains any corruption, we don't need to bother updating the
        @uptodate bit anymore.
        Just let @uptodate be a local variable inside the main loop, so
        that an error from one bvec won't affect later bvecs.
      
      - Only export btrfs_repair_one_sector(), unexport
        btrfs_submit_read_repair()
        The only outside caller for read repair is DIO, which already submits
        its repair for just one sector.
        Only export btrfs_repair_one_sector() for DIO.
      
      This patch focuses on the change to the repair path; the extra
      validation code is kept as is and will be cleaned up later.
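
      A rough sketch of the per-sector handling described above; the helper
      names sector_is_good() and end_sector() are placeholders, not
      functions from the patch:

        for (offset = 0; offset < bvec_len; offset += sectorsize) {
                if (sector_is_good(bvec, offset)) {
                        /* Good sector: release page/extent, mark it uptodate. */
                        end_sector(bvec, offset, true);
                        continue;
                }
                ret = repair_one_sector(inode, bvec, offset);
                if (ret < 0) {
                        /* Repair submission failed: release, but not uptodate. */
                        end_sector(bvec, offset, false);
                }
                /* Otherwise the submitted repair bio handles the release. */
        }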
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 19 April 2021, 1 commit
  5. 09 February 2021, 3 commits
    • btrfs: zoned: redirty released extent buffers · d3575156
      By Naohiro Aota
      Tree manipulating operations like merging nodes often release
      once-allocated tree nodes. Such nodes are cleaned so that pages in the
      node are not uselessly written out. On zoned volumes, however, this
      optimization blocks the following IOs, as cancelling the write out of
      the freed blocks breaks the sequential write order expected by the
      device.
      
      Introduce a list of clean and unwritten extent buffers that have been
      released in a transaction. Redirty the buffers so that
      btree_write_cache_pages() can send proper bios to the devices.
      
      Besides that, it clears the entire content of the extent buffer so as
      not to confuse raw block scanners, e.g. 'btrfs check'. Because the
      content is cleared, csum_dirty_buffer() would complain about a bytenr
      mismatch, so skip the checking and checksumming using the newly
      introduced buffer flag EXTENT_BUFFER_NO_CHECK.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce btrfs_subpage for data inodes · 32443de3
      By Qu Wenruo
      To support subpage sector size, data also needs extra info to track
      which sectors in a page are uptodate/dirty/...
      
      This patch makes pages for data inodes get the btrfs_subpage structure
      attached, and detached when the page is freed.
      
      This patch also slightly changes the timing when
      set_page_extent_mapped() is called to make sure:
      
      - We have page->mapping set
        page->mapping->host is used to grab btrfs_fs_info, thus we can only
        call this function after the page is mapped to an inode.

        One call site attaches pages to the inode manually, thus we have to
        modify the timing of set_page_extent_mapped() a bit.
      
      - As soon as possible, before other operations
        Since memory allocation can fail, we have to do extra error
        handling.  Calling set_page_extent_mapped() as soon as possible can
        simplify the error handling for several call sites.
      
      The idea is pretty much the same as iomap_page, but with more bitmaps
      for btrfs specific cases.
      
      Currently the plan is to switch to iomap if iomap can provide sector
      aligned write back (only write back dirty sectors, not the full page;
      data balance requires this feature).

      So we will stick to the btrfs specific bitmaps for now.
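
      A minimal sketch of the timing rule described above, assuming a
      hypothetical attach_subpage_metadata() helper; not the exact
      implementation:

        int set_page_extent_mapped(struct page *page)
        {
                ASSERT(page->mapping);          /* needed to reach fs_info */
                if (PagePrivate(page))
                        return 0;               /* already attached */
                /* May fail with -ENOMEM, hence "as early as possible". */
                return attach_subpage_metadata(page);
        }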
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK · 6869b0a8
      By Qu Wenruo
      PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in
      __process_pages_contig() to let the function know to clear the page
      dirty bit and then set page writeback.
      
      However the page writeback and dirty bits are conflicting (at least
      for the sector size == PAGE_SIZE case), which means the two always
      have to be updated together.

      So we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into
      PAGE_START_WRITEBACK.
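
      A sketch of the merged flag handling inside __process_pages_contig();
      the surrounding loop and the flag value itself are omitted:

        if (page_ops & PAGE_START_WRITEBACK) {
                /* The two bits conflict, so always flip them together. */
                clear_page_dirty_for_io(page);
                set_page_writeback(page);
        }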
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 10 December 2020, 4 commits
  7. 08 December 2020, 7 commits
    • btrfs: use fixed width int type for extent_state::state · f97e27e9
      By Qu Wenruo
      Currently the type is unsigned int which could change its width
      depending on the architecture. We need up to 32 bits so make it
      explicit.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove extent_buffer::recursed · a55463c9
      By Josef Bacik
      It is unused everywhere now, it can be removed.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass the owner_root and level to alloc_extent_buffer · 3fbaf258
      By Josef Bacik
      Now that we've plumbed all of the callers to have the owner root and the
      level, plumb it down into alloc_extent_buffer().
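
      A sketch of the extended allocation interface; the parameter order and
      types are assumptions based on the description above:

        struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
                                                  u64 start, u64 owner_root,
                                                  int level);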
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: cleanup extent buffer readahead · bfb484d9
      By Josef Bacik
      We're going to pass around more information when we allocate extent
      buffers, in order to make that cleaner how we do readahead.  Most of the
      callers have the parent node that we're getting our blockptr from, with
      the sole exception of relocation which simply has the bytenr it wants to
      read.
      
      Add a helper that takes the current arguments that we need (bytenr and
      gen), and add another helper for simply reading the slot out of a node.
      In followup patches the helper that takes all the extra arguments will
      be expanded, and the simpler helper won't need to have its arguments
      adjusted.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reorder extent buffer members for better packing · dc516164
      By David Sterba
      After the rwsem replaced the tree lock implementation, the extent
      buffer got smaller but left some holes behind. By changing the
      log_index type and reordering, we can squeeze the size further to 240
      bytes, measured on a release config on x86_64. log_index spans only 3
      values and needs to be signed.
      
      Before:
      
      struct extent_buffer {
              u64                        start;                /*     0     8 */
              long unsigned int          len;                  /*     8     8 */
              long unsigned int          bflags;               /*    16     8 */
              struct btrfs_fs_info *     fs_info;              /*    24     8 */
              spinlock_t                 refs_lock;            /*    32     4 */
              atomic_t                   refs;                 /*    36     4 */
              atomic_t                   io_pages;             /*    40     4 */
              int                        read_mirror;          /*    44     4 */
              struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              pid_t                      lock_owner;           /*    64     4 */
              bool                       lock_recursed;        /*    68     1 */
      
              /* XXX 3 bytes hole, try to pack */
      
              struct rw_semaphore        lock;                 /*    72    40 */
              short int                  log_index;            /*   112     2 */
      
              /* XXX 6 bytes hole, try to pack */
      
              struct page *              pages[16];            /*   120   128 */
      
              /* size: 248, cachelines: 4, members: 14 */
              /* sum members: 239, holes: 2, sum holes: 9 */
              /* forced alignments: 1 */
              /* last cacheline: 56 bytes */
      } __attribute__((__aligned__(8)));
      
      After:
      
      struct extent_buffer {
              u64                        start;                /*     0     8 */
              long unsigned int          len;                  /*     8     8 */
              long unsigned int          bflags;               /*    16     8 */
              struct btrfs_fs_info *     fs_info;              /*    24     8 */
              spinlock_t                 refs_lock;            /*    32     4 */
              atomic_t                   refs;                 /*    36     4 */
              atomic_t                   io_pages;             /*    40     4 */
              int                        read_mirror;          /*    44     4 */
              struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              pid_t                      lock_owner;           /*    64     4 */
              bool                       lock_recursed;        /*    68     1 */
              s8                         log_index;            /*    69     1 */
      
              /* XXX 2 bytes hole, try to pack */
      
              struct rw_semaphore        lock;                 /*    72    40 */
              struct page *              pages[16];            /*   112   128 */
      
              /* size: 240, cachelines: 4, members: 14 */
              /* sum members: 238, holes: 1, sum holes: 2 */
              /* forced alignments: 1 */
              /* last cacheline: 48 bytes */
      } __attribute__((__aligned__(8)));
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace fs_info and private_data with inode in btrfs_wq_submit_bio · 8896a08d
      By Qu Wenruo
      All callers of btrfs_wq_submit_bio() pass struct inode as
      @private_data, so there is no need for it to be (void *); replace it
      with "struct inode *inode".

      Since we can extract fs_info from struct inode, also remove the
      @fs_info parameter.

      While we're here, also replace all the (void *private_data) with
      (struct inode *inode).
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch extent buffer tree lock to rw_semaphore · 196d59ab
      By Josef Bacik
      Historically we've implemented our own locking because we wanted to be
      able to selectively spin or sleep based on what we were doing in the
      tree.  For instance, if all of our nodes were in cache then there's
      rarely a reason to need to sleep waiting for node locks, as they'll
      likely become available soon.  At the time this code was written the
      rw_semaphore didn't do adaptive spinning, and thus was orders of
      magnitude slower than our home grown locking.
      
      However now the opposite is the case.  There are a few problems with how
      we implement blocking locks, namely that we use a normal waitqueue and
      simply wake everybody up in reverse sleep order.  This leads to some
      suboptimal performance behavior, and a lot of context switches in highly
      contended cases.  The rw_semaphores actually do this properly, and also
      have adaptive spinning that works relatively well.
      
      The locking code is also a bit of a bear to understand, and we lose the
      benefit of lockdep for the most part because the blocking states of the
      lock are simply ad-hoc and not mapped into lockdep.
      
      So rework the locking code to drop all of this custom locking stuff, and
      simply use a rw_semaphore for everything.  This makes the locking much
      simpler for everything, as we can now drop a lot of cruft and blocking
      transitions.  The performance numbers vary depending on the workload,
      because generally speaking there doesn't tend to be a lot of contention
      on the btree.  However, on my test system which is an 80 core single
      socket system with 256GiB of RAM and a 2TiB NVMe drive I get the
      following results (with all debug options off):
      
        dbench 200 baseline
        Throughput 216.056 MB/sec  200 clients  200 procs  max_latency=1471.197 ms
      
        dbench 200 with patch
        Throughput 737.188 MB/sec  200 clients  200 procs  max_latency=714.346 ms
      
      Previously we also used fs_mark to test this sort of contention, and
      those results are far less impressive, mostly because there aren't
      enough tasks to really stress the locking.
      
        fs_mark -d /d[0-15] -S 0 -L 20 -n 100000 -s 0 -t 16
      
        baseline
          Average Files/sec:     160166.7
          p50 Files/sec:         165832
          p90 Files/sec:         123886
          p99 Files/sec:         123495
      
          real    3m26.527s
          user    2m19.223s
          sys     48m21.856s
      
        patched
          Average Files/sec:     164135.7
          p50 Files/sec:         171095
          p90 Files/sec:         122889
          p99 Files/sec:         113819
      
          real    3m29.660s
          user    2m19.990s
          sys     44m12.259s
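
      For reference, a minimal sketch of what the lock wrappers reduce to
      once everything is backed by a plain rw_semaphore (not the verbatim
      patch; the lock_owner bookkeeping is simplified):

        void btrfs_tree_read_lock(struct extent_buffer *eb)
        {
                down_read(&eb->lock);
        }

        void btrfs_tree_read_unlock(struct extent_buffer *eb)
        {
                up_read(&eb->lock);
        }

        void btrfs_tree_lock(struct extent_buffer *eb)
        {
                down_write(&eb->lock);
                eb->lock_owner = current->pid;
        }

        void btrfs_tree_unlock(struct extent_buffer *eb)
        {
                eb->lock_owner = 0;
                up_write(&eb->lock);
        }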
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 07 October 2020, 10 commits
  9. 27 August 2020, 1 commit
    • btrfs: fix potential deadlock in the search ioctl · a48b73ec
      By Josef Bacik
      With the conversion of the tree locks to rwsem I got the following
      lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
        ------------------------------------------------------
        compsize/11122 is trying to acquire lock:
        ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90
      
        but task is already holding lock:
        ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-fs-00){++++}-{3:3}:
      	 down_write_nested+0x3b/0x70
      	 __btrfs_tree_lock+0x24/0x120
      	 btrfs_search_slot+0x756/0x990
      	 btrfs_lookup_inode+0x3a/0xb4
      	 __btrfs_update_delayed_inode+0x93/0x270
      	 btrfs_async_run_delayed_root+0x168/0x230
      	 btrfs_work_helper+0xd4/0x570
      	 process_one_work+0x2ad/0x5f0
      	 worker_thread+0x3a/0x3d0
      	 kthread+0x133/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_delayed_update_inode+0x50/0x440
      	 btrfs_update_inode+0x8a/0xf0
      	 btrfs_dirty_inode+0x5b/0xd0
      	 touch_atime+0xa1/0xd0
      	 btrfs_file_mmap+0x3f/0x60
      	 mmap_region+0x3a4/0x640
      	 do_mmap+0x376/0x580
      	 vm_mmap_pgoff+0xd5/0x120
      	 ksys_mmap_pgoff+0x193/0x230
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&mm->mmap_lock#2){++++}-{3:3}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 __might_fault+0x68/0x90
      	 _copy_to_user+0x1e/0x80
      	 copy_to_sk.isra.32+0x121/0x300
      	 search_ioctl+0x106/0x200
      	 btrfs_ioctl_tree_search_v2+0x7b/0xf0
      	 btrfs_ioctl+0x106f/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-fs-00);
      				 lock(&delayed_node->mutex);
      				 lock(btrfs-fs-00);
          lock(&mm->mmap_lock#2);
      
         *** DEADLOCK ***
      
        1 lock held by compsize/11122:
         #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        stack backtrace:
        CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? __might_fault+0x3e/0x90
         ? find_held_lock+0x72/0x90
         __might_fault+0x68/0x90
         ? __might_fault+0x3e/0x90
         _copy_to_user+0x1e/0x80
         copy_to_sk.isra.32+0x121/0x300
         ? btrfs_search_forward+0x2a6/0x360
         search_ioctl+0x106/0x200
         btrfs_ioctl_tree_search_v2+0x7b/0xf0
         btrfs_ioctl+0x106f/0x30a0
         ? __do_sys_newfstat+0x5a/0x70
         ? ksys_ioctl+0x83/0xc0
         ksys_ioctl+0x83/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The problem is we're doing a copy_to_user() while holding tree locks,
      which can deadlock if we have to do a page fault for the copy_to_user().
      This exists even without my locking changes, so it needs to be fixed.
      Rework the search ioctl to do the pre-fault and then
      copy_to_user_nofault for the copying.
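
      A sketch of the pattern described above; ubuf is the user buffer and
      sh the search header being copied, both simplified from the real ioctl
      code, and the function names are from the generic kernel API of that
      era:

        /* Pre-fault the user buffer before any tree locks are taken. */
        if (fault_in_pages_writeable(ubuf, *buf_size))
                return -EFAULT;

        /* Later, with tree locks held, never fault: */
        if (copy_to_user_nofault(ubuf + offset, &sh, sizeof(sh)))
                ret = -EFAULT;  /* caller unlocks, faults in and retries */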
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 27 July 2020, 2 commits
  11. 04 June 2020, 1 commit
  12. 03 June 2020, 1 commit