1. 25 7月, 2022 3 次提交
  2. 16 5月, 2022 2 次提交
  3. 14 3月, 2022 2 次提交
  4. 07 1月, 2022 1 次提交
    • Y
      btrfs: fix argument list that the kdoc format and script verified · be8d1a2a
      Yang Li 提交于
      The warnings were found by running scripts/kernel-doc, which is
      caused by using 'make W=1'.
      
      fs/btrfs/extent_io.c:3210: warning: Function parameter or member
      'bio_ctrl' not described in 'btrfs_bio_add_page'
      fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio'
      description in 'btrfs_bio_add_page'
      fs/btrfs/extent_io.c:3210: warning: Excess function parameter
      'prev_bio_flags' description in 'btrfs_bio_add_page'
      fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root'
      description in 'btrfs_reserve_metadata_bytes'
      fs/btrfs/space-info.c:1602: warning: Function parameter or member
      'fs_info' not described in 'btrfs_reserve_metadata_bytes'
      
      Note: this is fixing only the warnings regarding parameter list, the
      first line is not strictly conforming to the kdoc format as the btrfs
      codebase does not stick to that and keeps the first line more free form
      (because it's only for internal use).
      Reported-by: NAbaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: NYang Li <yang.lee@linux.alibaba.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      be8d1a2a
  5. 03 1月, 2022 7 次提交
  6. 27 10月, 2021 1 次提交
    • J
      btrfs: do not infinite loop in data reclaim if we aborted · 0e24f6d8
      Josef Bacik 提交于
      Error injection stressing uncovered a busy loop in our data reclaim
      loop.  There are two cases here, one where we loop creating block groups
      until space_info->full is set, or in the main loop we will skip erroring
      out any tickets if space_info->full == 0.  Unfortunately if we aborted
      the transaction then we will never allocate chunks or reclaim any space
      and thus never get ->full, and you'll see stack traces like this:
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/u4:4:139]
        CPU: 0 PID: 139 Comm: kworker/u4:4 Tainted: G        W         5.13.0-rc1+ #328
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Workqueue: events_unbound btrfs_async_reclaim_data_space
        RIP: 0010:btrfs_join_transaction+0x12/0x20
        RSP: 0018:ffffb2b780b77de0 EFLAGS: 00000246
        RAX: ffffb2b781863d58 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000801 RSI: ffff987952b57400 RDI: ffff987940aa3000
        RBP: ffff987954d55000 R08: 0000000000000001 R09: ffff98795539e8f0
        R10: 000000000000000f R11: 000000000000000f R12: ffffffffffffffff
        R13: ffff987952b574c8 R14: ffff987952b57400 R15: 0000000000000008
        FS:  0000000000000000(0000) GS:ffff9879bbc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f0703da4000 CR3: 0000000113398004 CR4: 0000000000370ef0
        Call Trace:
         flush_space+0x4a8/0x660
         btrfs_async_reclaim_data_space+0x55/0x130
         process_one_work+0x1e9/0x380
         worker_thread+0x53/0x3e0
         ? process_one_work+0x380/0x380
         kthread+0x118/0x140
         ? __kthread_bind_mask+0x60/0x60
         ret_from_fork+0x1f/0x30
      
      Fix this by checking to see if we have a btrfs fs error in either of the
      reclaim loops, and if so fail the tickets and bail.  In addition to
      this, fix maybe_fail_all_tickets() to not try to grant tickets if we've
      aborted, simply fail everything.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0e24f6d8
  7. 18 9月, 2021 1 次提交
  8. 23 8月, 2021 5 次提交
    • J
      btrfs: do not do preemptive flushing if the majority is global rsv · 11462397
      Josef Bacik 提交于
      A common characteristic of the bug report where preemptive flushing was
      going full tilt was the fact that the vast majority of the free metadata
      space was used up by the global reserve.  The hard 90% threshold would
      cover the majority of these cases, but to be even smarter we should take
      into account how much of the outstanding reservations are covered by the
      global block reserve.  If the global block reserve accounts for the vast
      majority of outstanding reservations, skip preemptive flushing, as it
      will likely just cause churn and pain.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      11462397
    • J
      btrfs: reduce the preemptive flushing threshold to 90% · 93c60b17
      Josef Bacik 提交于
      The preemptive flushing code was added in order to avoid needing to
      synchronously wait for ENOSPC flushing to recover space.  Once we're
      almost full however we can essentially flush constantly.  We were using
      98% as a threshold to determine if we were simply full, however in
      practice this is a really high bar to hit.  For example reports of
      systems running into this problem had around 94% usage and thus
      continued to flush.  Fix this by lowering the threshold to 90%, which is
      a more sane value, especially for smaller file systems.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
      CC: stable@vger.kernel.org # 5.12+
      Fixes: 576fa348 ("btrfs: improve preemptive background space flushing")
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      93c60b17
    • J
      btrfs: wait on async extents when flushing delalloc · e1646070
      Josef Bacik 提交于
      I've been debugging an early ENOSPC problem in production and finally
      root caused it to this problem.  When we switched to the per-inode in
      38d715f4 ("btrfs: use btrfs_start_delalloc_roots in
      shrink_delalloc") I pulled out the async extent handling, because we
      were doing the correct thing by calling filemap_flush() if we had async
      extents set.  This would properly wait on any async extents by locking
      the page in the second flush, thus making sure our ordered extents were
      properly set up.
      
      However when I switched us back to page based flushing, I used
      sync_inode(), which allows us to pass in our own wbc.  The problem here
      is that sync_inode() is smarter than the filemap_* helpers, it tries to
      avoid calling writepages at all.  This means that our second call could
      skip calling do_writepages altogether, and thus not wait on the pagelock
      for the async helpers.  This means we could come back before any ordered
      extents were created and then simply continue on in our flushing
      mechanisms and ENOSPC out when we have plenty of space to use.
      
      Fix this by putting back the async pages logic in shrink_delalloc.  This
      allows us to bulk write out everything that we need to, and then we can
      wait in one place for the async helpers to catch up, and then wait on
      any ordered extents that are created.
      
      Fixes: e076ab2a ("btrfs: shrink delalloc pages instead of full inodes")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e1646070
    • J
      btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc · 03fe78cc
      Josef Bacik 提交于
      We have been hitting some early ENOSPC issues in production with more
      recent kernels, and I tracked it down to us simply not flushing delalloc
      as aggressively as we should be.  With tracing I was seeing us failing
      all tickets with all of the block rsvs at or around 0, with very little
      pinned space, but still around 120MiB of outstanding bytes_may_used.
      Upon further investigation I saw that we were flushing around 14 pages
      per shrink call for delalloc, despite having around 2GiB of delalloc
      outstanding.
      
      Consider the example of a 8 way machine, all CPUs trying to create a
      file in parallel, which at the time of this commit requires 5 items to
      do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
      size waiting on reservations.  Now assume we have 128MiB of delalloc
      outstanding.  With our current math we would set items to 20, and then
      set to_reclaim to 20 * 256k, or 5MiB.
      
      Assuming that we went through this loop all 3 times, for both
      FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
      twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
      leave a fair bit of delalloc reservations still hanging around by the
      time we go to ENOSPC out all the remaining tickets.
      
      Fix this two ways.  First, change the calculations to be a fraction of
      the total delalloc bytes on the system.  Prior to this change we were
      calculating based on dirty inodes so our math made more sense, now it's
      just completely unrelated to what we're actually doing.
      
      Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
      gone through the flush states at least once.  This will empty the system
      of all delalloc so we're sure to be truly out of space when we start
      failing tickets.
      
      I'm tagging stable 5.10 and forward, because this is where we started
      using the page stuff heavily again.  This affects earlier kernel
      versions as well, but would be a pain to backport to them as the
      flushing mechanisms aren't the same.
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      03fe78cc
    • J
      btrfs: enable a tracepoint when we fail tickets · fcdef39c
      Josef Bacik 提交于
      When debugging early enospc problems it was useful to have a tracepoint
      where we failed all tickets so I could check the state of the enospc
      counters at failure time to validate my fixes.  This adds the tracpoint
      so you can easily get that information.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fcdef39c
  9. 22 6月, 2021 5 次提交
    • J
      btrfs: rip out btrfs_space_info::total_bytes_pinned · 138a12d8
      Josef Bacik 提交于
      We used this in may_commit_transaction() in order to determine if we
      needed to commit the transaction.  However we no longer have that logic
      and thus have no use of this counter anymore, so delete it.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      138a12d8
    • J
      btrfs: rip the first_ticket_bytes logic from fail_all_tickets · 3ffad696
      Josef Bacik 提交于
      This was a trick implemented to handle the case where we had a giant
      reservation in front of a bunch of little reservations in the ticket
      queue.  If the giant reservation was too large for the transaction
      commit to make a difference we'd ENOSPC everybody out instead of
      committing the transaction.  This logic was put in to force us to go
      back and re-try the transaction commit logic to see if we could make
      progress.
      
      Instead now we know we've committed the transaction, so any space that
      would have been recovered is now available, and would be caught by the
      btrfs_try_granting_tickets() in this loop, so we no longer need this
      code and can simply delete it.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3ffad696
    • J
      btrfs: remove FLUSH_DELAYED_REFS from data ENOSPC flushing · 04808553
      Josef Bacik 提交于
      Since we unconditionally commit the transaction now we no longer need to
      run the delayed refs to make sure our total_bytes_pinned value is
      uptodate, we can simply commit the transaction.  Remove this stage from
      the data flushing list.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      04808553
    • J
      btrfs: rip out may_commit_transaction · c416a30c
      Josef Bacik 提交于
      may_commit_transaction was introduced before the ticketing
      infrastructure existed.  There was a problem where we'd legitimately be
      out of space, but every reservation would trigger a transaction commit
      and then fail.  Thus if you had 1000 things trying to make a
      reservation, they'd all do the flushing loop and thus commit the
      transaction 1000 times before they'd get their ENOSPC.
      
      This helper was introduced to short circuit this, if there wasn't space
      that could be reclaimed by committing the transaction then simply ENOSPC
      out.  This made true ENOSPC tests much faster as we didn't waste a bunch
      of time.
      
      However many of our bugs over the years have been from cases where we
      didn't account for some space that would be reclaimed by committing a
      transaction.  The delayed refs rsv space, delayed rsv, many pinned bytes
      miscalculations, etc.  And in the meantime the original problem has been
      solved with ticketing.  We no longer will commit the transaction 1000
      times.  Instead we'll get 1000 waiters, we will go through the flushing
      mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
      out.  The ticketing infrastructure gives us a deterministic way to see
      if we're making progress or not, thus we avoid a lot of extra work.
      
      So simplify this step by simply unconditionally committing the
      transaction.  This removes what is arguably our most common source of
      early ENOSPC bugs and will allow us to drastically simplify many of the
      things we track because we simply won't need them with this stuff gone.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c416a30c
    • D
      btrfs: fix typos in comments · 1a9fd417
      David Sterba 提交于
      Fix typos that have snuck in since the last round. Found by codespell.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1a9fd417
  10. 21 6月, 2021 7 次提交
  11. 19 4月, 2021 1 次提交
    • J
      btrfs: use percpu_read_positive instead of sum_positive for need_preempt · 2cdb3909
      Josef Bacik 提交于
      Looking at perf data for a fio workload I noticed that we were spending
      a pretty large chunk of time (around 5%) doing percpu_counter_sum() in
      need_preemptive_reclaim.  This is silly, as we only want to know if we
      have more ordered than delalloc to see if we should be counting the
      delayed items in our threshold calculation.  Change this to
      percpu_read_positive() to avoid the overhead.
      
      I ran this through fsperf to validate the changes, obviously the latency
      numbers in dbench and fio are quite jittery, so take them as you wish,
      but overall the improvements on throughput, iops, and bw are all
      positive.  Each test was run two times, the given value is the average
      of both runs for their respective column.
      
        btrfs ssd normal test results
      
        bufferedrandwrite16g results
             metric         baseline   current          diff
        ==========================================================
        write_io_kbytes     16777216   16777216     0.00%
        read_clat_ns_p99           0          0     0.00%
        write_bw_bytes      1.04e+08   1.05e+08     1.12%
        read_iops                  0          0     0.00%
        write_clat_ns_p50      13888      11840   -14.75%
        read_io_kbytes             0          0     0.00%
        read_io_bytes              0          0     0.00%
        write_clat_ns_p99      35008      29312   -16.27%
        read_bw_bytes              0          0     0.00%
        elapsed                  170        167    -1.76%
        write_lat_ns_min     4221.50    3762.50   -10.87%
        sys_cpu                39.65      35.37   -10.79%
        write_lat_ns_max    2.67e+10   2.50e+10    -6.63%
        read_lat_ns_min            0          0     0.00%
        write_iops          25270.10   25553.43     1.12%
        read_lat_ns_max            0          0     0.00%
        read_clat_ns_p50           0          0     0.00%
      
        dbench60 results
          metric     baseline   current         diff
        ==================================================
        qpathinfo       11.12     12.73    14.52%
        throughput     416.09    445.66     7.11%
        flush         3485.63   1887.55   -45.85%
        qfileinfo        0.70      1.92   173.86%
        ntcreatex      992.60    695.76   -29.91%
        qfsinfo          2.43      3.71    52.48%
        close            1.67      3.14    88.09%
        sfileinfo       66.54    105.20    58.10%
        rename         809.23    619.59   -23.43%
        find            16.88     15.46    -8.41%
        unlink         820.54    670.86   -18.24%
        writex        3375.20   2637.91   -21.84%
        deltree        386.33    449.98    16.48%
        readx            3.43      3.41    -0.60%
        mkdir            0.05      0.03   -38.46%
        lockx            0.26      0.26    -0.76%
        unlockx          0.81      0.32   -60.33%
      
        dio4kbs16threads results
             metric          baseline       current           diff
        ================================================================
        write_io_kbytes         5249676       3357150   -36.05%
        read_clat_ns_p99              0             0     0.00%
        write_bw_bytes      89583501.50   57291192.50   -36.05%
        read_iops                     0             0     0.00%
        write_clat_ns_p50        242688        263680     8.65%
        read_io_kbytes                0             0     0.00%
        read_io_bytes                 0             0     0.00%
        write_clat_ns_p99      15826944      36732928   132.09%
        read_bw_bytes                 0             0     0.00%
        elapsed                      61            61     0.00%
        write_lat_ns_min          42704         42095    -1.43%
        sys_cpu                    5.27          3.45   -34.52%
        write_lat_ns_max       7.43e+08      9.27e+08    24.71%
        read_lat_ns_min               0             0     0.00%
        write_iops             21870.97      13987.11   -36.05%
        read_lat_ns_max               0             0     0.00%
        read_clat_ns_p50              0             0     0.00%
      
        randwrite2xram results
             metric          baseline       current           diff
        ================================================================
        write_io_kbytes        24831972      28876262    16.29%
        read_clat_ns_p99              0             0     0.00%
        write_bw_bytes      83745273.50   92182192.50    10.07%
        read_iops                     0             0     0.00%
        write_clat_ns_p50         13952         11648   -16.51%
        read_io_kbytes                0             0     0.00%
        read_io_bytes                 0             0     0.00%
        write_clat_ns_p99         50176         52992     5.61%
        read_bw_bytes                 0             0     0.00%
        elapsed                     314           332     5.73%
        write_lat_ns_min        5920.50          5127   -13.40%
        sys_cpu                    7.82          7.35    -6.07%
        write_lat_ns_max       5.27e+10      3.88e+10   -26.44%
        read_lat_ns_min               0             0     0.00%
        write_iops             20445.62      22505.42    10.07%
        read_lat_ns_max               0             0     0.00%
        read_clat_ns_p50              0             0     0.00%
      
        untarfirefox results
        metric    baseline   current        diff
        ==============================================
        elapsed      47.41     47.40   -0.03%
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2cdb3909
  12. 09 2月, 2021 5 次提交