1. 03 1月, 2022 1 次提交
  2. 27 10月, 2021 1 次提交
    • J
      btrfs: do not infinite loop in data reclaim if we aborted · 0e24f6d8
      Josef Bacik 提交于
      Error injection stressing uncovered a busy loop in our data reclaim
      loop.  There are two cases here, one where we loop creating block groups
      until space_info->full is set, or in the main loop we will skip erroring
      out any tickets if space_info->full == 0.  Unfortunately if we aborted
      the transaction then we will never allocate chunks or reclaim any space
      and thus never get ->full, and you'll see stack traces like this:
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/u4:4:139]
        CPU: 0 PID: 139 Comm: kworker/u4:4 Tainted: G        W         5.13.0-rc1+ #328
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Workqueue: events_unbound btrfs_async_reclaim_data_space
        RIP: 0010:btrfs_join_transaction+0x12/0x20
        RSP: 0018:ffffb2b780b77de0 EFLAGS: 00000246
        RAX: ffffb2b781863d58 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000801 RSI: ffff987952b57400 RDI: ffff987940aa3000
        RBP: ffff987954d55000 R08: 0000000000000001 R09: ffff98795539e8f0
        R10: 000000000000000f R11: 000000000000000f R12: ffffffffffffffff
        R13: ffff987952b574c8 R14: ffff987952b57400 R15: 0000000000000008
        FS:  0000000000000000(0000) GS:ffff9879bbc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f0703da4000 CR3: 0000000113398004 CR4: 0000000000370ef0
        Call Trace:
         flush_space+0x4a8/0x660
         btrfs_async_reclaim_data_space+0x55/0x130
         process_one_work+0x1e9/0x380
         worker_thread+0x53/0x3e0
         ? process_one_work+0x380/0x380
         kthread+0x118/0x140
         ? __kthread_bind_mask+0x60/0x60
         ret_from_fork+0x1f/0x30
      
      Fix this by checking to see if we have a btrfs fs error in either of the
      reclaim loops, and if so fail the tickets and bail.  In addition to
      this, fix maybe_fail_all_tickets() to not try to grant tickets if we've
      aborted, simply fail everything.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0e24f6d8
  3. 18 9月, 2021 1 次提交
  4. 23 8月, 2021 5 次提交
    • J
      btrfs: do not do preemptive flushing if the majority is global rsv · 11462397
      Josef Bacik 提交于
      A common characteristic of the bug report where preemptive flushing was
      going full tilt was the fact that the vast majority of the free metadata
      space was used up by the global reserve.  The hard 90% threshold would
      cover the majority of these cases, but to be even smarter we should take
      into account how much of the outstanding reservations are covered by the
      global block reserve.  If the global block reserve accounts for the vast
      majority of outstanding reservations, skip preemptive flushing, as it
      will likely just cause churn and pain.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      11462397
    • J
      btrfs: reduce the preemptive flushing threshold to 90% · 93c60b17
      Josef Bacik 提交于
      The preemptive flushing code was added in order to avoid needing to
      synchronously wait for ENOSPC flushing to recover space.  Once we're
      almost full however we can essentially flush constantly.  We were using
      98% as a threshold to determine if we were simply full, however in
      practice this is a really high bar to hit.  For example reports of
      systems running into this problem had around 94% usage and thus
      continued to flush.  Fix this by lowering the threshold to 90%, which is
      a more sane value, especially for smaller file systems.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
      CC: stable@vger.kernel.org # 5.12+
      Fixes: 576fa348 ("btrfs: improve preemptive background space flushing")
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      93c60b17
    • J
      btrfs: wait on async extents when flushing delalloc · e1646070
      Josef Bacik 提交于
      I've been debugging an early ENOSPC problem in production and finally
      root caused it to this problem.  When we switched to the per-inode in
      38d715f4 ("btrfs: use btrfs_start_delalloc_roots in
      shrink_delalloc") I pulled out the async extent handling, because we
      were doing the correct thing by calling filemap_flush() if we had async
      extents set.  This would properly wait on any async extents by locking
      the page in the second flush, thus making sure our ordered extents were
      properly set up.
      
      However when I switched us back to page based flushing, I used
      sync_inode(), which allows us to pass in our own wbc.  The problem here
      is that sync_inode() is smarter than the filemap_* helpers, it tries to
      avoid calling writepages at all.  This means that our second call could
      skip calling do_writepages altogether, and thus not wait on the pagelock
      for the async helpers.  This means we could come back before any ordered
      extents were created and then simply continue on in our flushing
      mechanisms and ENOSPC out when we have plenty of space to use.
      
      Fix this by putting back the async pages logic in shrink_delalloc.  This
      allows us to bulk write out everything that we need to, and then we can
      wait in one place for the async helpers to catch up, and then wait on
      any ordered extents that are created.
      
      Fixes: e076ab2a ("btrfs: shrink delalloc pages instead of full inodes")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e1646070
    • J
      btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc · 03fe78cc
      Josef Bacik 提交于
      We have been hitting some early ENOSPC issues in production with more
      recent kernels, and I tracked it down to us simply not flushing delalloc
      as aggressively as we should be.  With tracing I was seeing us failing
      all tickets with all of the block rsvs at or around 0, with very little
      pinned space, but still around 120MiB of outstanding bytes_may_used.
      Upon further investigation I saw that we were flushing around 14 pages
      per shrink call for delalloc, despite having around 2GiB of delalloc
      outstanding.
      
      Consider the example of a 8 way machine, all CPUs trying to create a
      file in parallel, which at the time of this commit requires 5 items to
      do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
      size waiting on reservations.  Now assume we have 128MiB of delalloc
      outstanding.  With our current math we would set items to 20, and then
      set to_reclaim to 20 * 256k, or 5MiB.
      
      Assuming that we went through this loop all 3 times, for both
      FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
      twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
      leave a fair bit of delalloc reservations still hanging around by the
      time we go to ENOSPC out all the remaining tickets.
      
      Fix this two ways.  First, change the calculations to be a fraction of
      the total delalloc bytes on the system.  Prior to this change we were
      calculating based on dirty inodes so our math made more sense, now it's
      just completely unrelated to what we're actually doing.
      
      Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
      gone through the flush states at least once.  This will empty the system
      of all delalloc so we're sure to be truly out of space when we start
      failing tickets.
      
      I'm tagging stable 5.10 and forward, because this is where we started
      using the page stuff heavily again.  This affects earlier kernel
      versions as well, but would be a pain to backport to them as the
      flushing mechanisms aren't the same.
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      03fe78cc
    • J
      btrfs: enable a tracepoint when we fail tickets · fcdef39c
      Josef Bacik 提交于
      When debugging early enospc problems it was useful to have a tracepoint
      where we failed all tickets so I could check the state of the enospc
      counters at failure time to validate my fixes.  This adds the tracpoint
      so you can easily get that information.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fcdef39c
  5. 22 6月, 2021 5 次提交
    • J
      btrfs: rip out btrfs_space_info::total_bytes_pinned · 138a12d8
      Josef Bacik 提交于
      We used this in may_commit_transaction() in order to determine if we
      needed to commit the transaction.  However we no longer have that logic
      and thus have no use of this counter anymore, so delete it.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      138a12d8
    • J
      btrfs: rip the first_ticket_bytes logic from fail_all_tickets · 3ffad696
      Josef Bacik 提交于
      This was a trick implemented to handle the case where we had a giant
      reservation in front of a bunch of little reservations in the ticket
      queue.  If the giant reservation was too large for the transaction
      commit to make a difference we'd ENOSPC everybody out instead of
      committing the transaction.  This logic was put in to force us to go
      back and re-try the transaction commit logic to see if we could make
      progress.
      
      Instead now we know we've committed the transaction, so any space that
      would have been recovered is now available, and would be caught by the
      btrfs_try_granting_tickets() in this loop, so we no longer need this
      code and can simply delete it.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3ffad696
    • J
      btrfs: remove FLUSH_DELAYED_REFS from data ENOSPC flushing · 04808553
      Josef Bacik 提交于
      Since we unconditionally commit the transaction now we no longer need to
      run the delayed refs to make sure our total_bytes_pinned value is
      uptodate, we can simply commit the transaction.  Remove this stage from
      the data flushing list.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      04808553
    • J
      btrfs: rip out may_commit_transaction · c416a30c
      Josef Bacik 提交于
      may_commit_transaction was introduced before the ticketing
      infrastructure existed.  There was a problem where we'd legitimately be
      out of space, but every reservation would trigger a transaction commit
      and then fail.  Thus if you had 1000 things trying to make a
      reservation, they'd all do the flushing loop and thus commit the
      transaction 1000 times before they'd get their ENOSPC.
      
      This helper was introduced to short circuit this, if there wasn't space
      that could be reclaimed by committing the transaction then simply ENOSPC
      out.  This made true ENOSPC tests much faster as we didn't waste a bunch
      of time.
      
      However many of our bugs over the years have been from cases where we
      didn't account for some space that would be reclaimed by committing a
      transaction.  The delayed refs rsv space, delayed rsv, many pinned bytes
      miscalculations, etc.  And in the meantime the original problem has been
      solved with ticketing.  We no longer will commit the transaction 1000
      times.  Instead we'll get 1000 waiters, we will go through the flushing
      mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
      out.  The ticketing infrastructure gives us a deterministic way to see
      if we're making progress or not, thus we avoid a lot of extra work.
      
      So simplify this step by simply unconditionally committing the
      transaction.  This removes what is arguably our most common source of
      early ENOSPC bugs and will allow us to drastically simplify many of the
      things we track because we simply won't need them with this stuff gone.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c416a30c
    • D
      btrfs: fix typos in comments · 1a9fd417
      David Sterba 提交于
      Fix typos that have snuck in since the last round. Found by codespell.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1a9fd417
  6. 21 6月, 2021 7 次提交
  7. 19 4月, 2021 1 次提交
    • J
      btrfs: use percpu_read_positive instead of sum_positive for need_preempt · 2cdb3909
      Josef Bacik 提交于
      Looking at perf data for a fio workload I noticed that we were spending
      a pretty large chunk of time (around 5%) doing percpu_counter_sum() in
      need_preemptive_reclaim.  This is silly, as we only want to know if we
      have more ordered than delalloc to see if we should be counting the
      delayed items in our threshold calculation.  Change this to
      percpu_read_positive() to avoid the overhead.
      
      I ran this through fsperf to validate the changes, obviously the latency
      numbers in dbench and fio are quite jittery, so take them as you wish,
      but overall the improvements on throughput, iops, and bw are all
      positive.  Each test was run two times, the given value is the average
      of both runs for their respective column.
      
        btrfs ssd normal test results
      
        bufferedrandwrite16g results
             metric         baseline   current          diff
        ==========================================================
        write_io_kbytes     16777216   16777216     0.00%
        read_clat_ns_p99           0          0     0.00%
        write_bw_bytes      1.04e+08   1.05e+08     1.12%
        read_iops                  0          0     0.00%
        write_clat_ns_p50      13888      11840   -14.75%
        read_io_kbytes             0          0     0.00%
        read_io_bytes              0          0     0.00%
        write_clat_ns_p99      35008      29312   -16.27%
        read_bw_bytes              0          0     0.00%
        elapsed                  170        167    -1.76%
        write_lat_ns_min     4221.50    3762.50   -10.87%
        sys_cpu                39.65      35.37   -10.79%
        write_lat_ns_max    2.67e+10   2.50e+10    -6.63%
        read_lat_ns_min            0          0     0.00%
        write_iops          25270.10   25553.43     1.12%
        read_lat_ns_max            0          0     0.00%
        read_clat_ns_p50           0          0     0.00%
      
        dbench60 results
          metric     baseline   current         diff
        ==================================================
        qpathinfo       11.12     12.73    14.52%
        throughput     416.09    445.66     7.11%
        flush         3485.63   1887.55   -45.85%
        qfileinfo        0.70      1.92   173.86%
        ntcreatex      992.60    695.76   -29.91%
        qfsinfo          2.43      3.71    52.48%
        close            1.67      3.14    88.09%
        sfileinfo       66.54    105.20    58.10%
        rename         809.23    619.59   -23.43%
        find            16.88     15.46    -8.41%
        unlink         820.54    670.86   -18.24%
        writex        3375.20   2637.91   -21.84%
        deltree        386.33    449.98    16.48%
        readx            3.43      3.41    -0.60%
        mkdir            0.05      0.03   -38.46%
        lockx            0.26      0.26    -0.76%
        unlockx          0.81      0.32   -60.33%
      
        dio4kbs16threads results
             metric          baseline       current           diff
        ================================================================
        write_io_kbytes         5249676       3357150   -36.05%
        read_clat_ns_p99              0             0     0.00%
        write_bw_bytes      89583501.50   57291192.50   -36.05%
        read_iops                     0             0     0.00%
        write_clat_ns_p50        242688        263680     8.65%
        read_io_kbytes                0             0     0.00%
        read_io_bytes                 0             0     0.00%
        write_clat_ns_p99      15826944      36732928   132.09%
        read_bw_bytes                 0             0     0.00%
        elapsed                      61            61     0.00%
        write_lat_ns_min          42704         42095    -1.43%
        sys_cpu                    5.27          3.45   -34.52%
        write_lat_ns_max       7.43e+08      9.27e+08    24.71%
        read_lat_ns_min               0             0     0.00%
        write_iops             21870.97      13987.11   -36.05%
        read_lat_ns_max               0             0     0.00%
        read_clat_ns_p50              0             0     0.00%
      
        randwrite2xram results
             metric          baseline       current           diff
        ================================================================
        write_io_kbytes        24831972      28876262    16.29%
        read_clat_ns_p99              0             0     0.00%
        write_bw_bytes      83745273.50   92182192.50    10.07%
        read_iops                     0             0     0.00%
        write_clat_ns_p50         13952         11648   -16.51%
        read_io_kbytes                0             0     0.00%
        read_io_bytes                 0             0     0.00%
        write_clat_ns_p99         50176         52992     5.61%
        read_bw_bytes                 0             0     0.00%
        elapsed                     314           332     5.73%
        write_lat_ns_min        5920.50          5127   -13.40%
        sys_cpu                    7.82          7.35    -6.07%
        write_lat_ns_max       5.27e+10      3.88e+10   -26.44%
        read_lat_ns_min               0             0     0.00%
        write_iops             20445.62      22505.42    10.07%
        read_lat_ns_max               0             0     0.00%
        read_clat_ns_p50              0             0     0.00%
      
        untarfirefox results
        metric    baseline   current        diff
        ==============================================
        elapsed      47.41     47.40   -0.03%
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2cdb3909
  8. 09 2月, 2021 15 次提交
    • N
      btrfs: zoned: track unusable bytes for zones · 169e0da9
      Naohiro Aota 提交于
      In a zoned filesystem a once written then freed region is not usable
      until the underlying zone has been reset. So we need to distinguish such
      unusable space from usable free space.
      
      Therefore we need to introduce the "zone_unusable" field to the block
      group structure, and "bytes_zone_unusable" to the space_info structure
      to track the unusable space.
      
      Pinned bytes are always reclaimed to the unusable space. But, when an
      allocated region is returned before using e.g., the block group becomes
      read-only between allocation time and reservation time, we can safely
      return the region to the block group. For the situation, this commit
      introduces "btrfs_add_free_space_unused". This behaves the same as
      btrfs_add_free_space() on regular filesystem. On zoned filesystems, it
      rewinds the allocation offset.
      
      Because the read-only bytes tracks free but unusable bytes when the block
      group is read-only, we need to migrate the zone_unusable bytes to
      read-only bytes when a block group is marked read-only.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      169e0da9
    • J
      btrfs: add a trace class for dumping the current ENOSPC state · e5ad49e2
      Josef Bacik 提交于
      Often when I'm debugging ENOSPC related issues I have to resort to
      printing the entire ENOSPC state with trace_printk() in different spots.
      This gets pretty annoying, so add a trace state that does this for us.
      Then add a trace point at the end of preemptive flushing so you can see
      the state of the space_info when we decide to exit preemptive flushing.
      This helped me figure out we weren't kicking in the preemptive flushing
      soon enough.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e5ad49e2
    • J
      btrfs: adjust the flush trace point to include the source · 4b02b00f
      Josef Bacik 提交于
      Since we have normal ticketed flushing and preemptive flushing, adjust
      the tracepoint so that we know the source of the flushing action to make
      it easier to debug problems.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4b02b00f
    • J
      btrfs: implement space clamping for preemptive flushing · 88a777a6
      Josef Bacik 提交于
      Starting preemptive flushing at 50% of available free space is a good
      start, but some workloads are particularly abusive and can quickly
      overwhelm the preemptive flushing code and drive us into using tickets.
      
      Handle this by clamping down on our threshold for starting and
      continuing to run preemptive flushing.  This is particularly important
      for our overcommit case, as we can really drive the file system into
      overages and then it's more difficult to pull it back as we start to
      actually fill up the file system.
      
      The clamping is essentially 2^CLAMP, but we start at 1 so whatever we
      calculate for overcommit is the baseline.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      88a777a6
    • J
      btrfs: simplify the logic in need_preemptive_flushing · 2e294c60
      Josef Bacik 提交于
      A lot of this was added all in one go with no explanation, and is a bit
      unwieldy and confusing.  Simplify the logic to start preemptive flushing
      if we've reserved more than half of our available free space.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2e294c60
    • J
      btrfs: rework btrfs_calc_reclaim_metadata_size · 9f42d377
      Josef Bacik 提交于
      Currently btrfs_calc_reclaim_metadata_size does two things, it returns
      the space currently required for flushing by the tickets, and if there
      are no tickets it calculates a value for the preemptive flushing.
      
      However for the normal ticketed flushing we really only care about the
      space required for tickets.  We will accidentally come in and flush one
      time, but as soon as we see there are no tickets we bail out of our
      flushing.
      
      Fix this by making btrfs_calc_reclaim_metadata_size really only tell us
      what is required for flushing if we have people waiting on space.  Then
      move the preemptive flushing logic into need_preemptive_reclaim().  We
      ignore btrfs_calc_reclaim_metadata_size() in need_preemptive_reclaim()
      because if we are in this path then we made our reservation and there
      are not pending tickets currently, so we do not need to check it, simply
      do the fuzzy logic to check if we're getting low on space.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9f42d377
    • J
      btrfs: check reclaim_size in need_preemptive_reclaim · f205edf7
      Josef Bacik 提交于
      If we're flushing space for tickets then we have
      space_info->reclaim_size set and we do not need to do background
      reclaim.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f205edf7
    • J
      btrfs: rename need_do_async_reclaim · ae7913ba
      Josef Bacik 提交于
      All of our normal flushing is asynchronous reclaim, so this helper is
      poorly named.  This is more checking if we need to preemptively flush
      space, so rename it to need_preemptive_reclaim.
      
      Also switch it to bool and make it plain static as followup patches will
      move more code here.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ae7913ba
    • J
      btrfs: improve preemptive background space flushing · 576fa348
      Josef Bacik 提交于
      Currently if we ever have to flush space because we do not have enough
      we allocate a ticket and attach it to the space_info, and then
      systematically flush things in the filesystem that hold space
      reservations until our space is reclaimed.
      
      However this has a latency cost, we must go to sleep and wait for the
      flushing to make progress before we are woken up and allowed to continue
      doing our work.
      
      In order to address that we used to kick off the async worker to flush
      space preemptively, so that we could be reclaiming space hopefully
      before any tasks needed to stop and wait for space to reclaim.
      
      When I introduced the ticketed ENOSPC stuff this broke slightly in the
      fact that we were using tickets to indicate if we were done flushing.
      No tickets, no more flushing.  However this meant that we essentially
      never preemptively flushed.  This caused a write performance regression
      that Nikolay noticed in an unrelated patch that removed the committing
      of the transaction during btrfs_end_transaction.
      
      The behavior that happened pre that patch was btrfs_end_transaction()
      would see that we were low on space, and it would commit the
      transaction.  This was bad because in this particular case you could end
      up with thousands and thousands of transactions being committed during
      the 5 minute reproducer.  With the patch to remove this behavior we got
      much more sane transaction commits, but we ended up slower because we
      would write for a while, flush, write for a while, flush again.
      
      To address this we need to reinstate a preemptive flushing mechanism.
      However it is distinctly different from our ticketing flushing in that
      it doesn't have tickets to base it's decisions on.  Instead of bolting
      this logic into our existing flushing work, add another worker to handle
      this preemptive flushing.  Here we will attempt to be slightly
      intelligent about the things that we flushing, attempting to balance
      between whichever pool is taking up the most space.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      576fa348
    • J
      btrfs: introduce a FORCE_COMMIT_TRANS flush operation · f00c42dd
      Josef Bacik 提交于
      Solely for preemptive flushing, we want to be able to force the
      transaction commit without any of the ambiguity of
      may_commit_transaction().  This is because may_commit_transaction()
      checks tickets and such, and in preemptive flushing we already know
      it'll be helpful, so use this to keep the code nice and clean and
      straightforward.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add comment ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f00c42dd
    • J
      btrfs: track ordered bytes instead of just dio ordered bytes · 5deb17e1
      Josef Bacik 提交于
      We track dio_bytes because the shrink delalloc code needs to know if we
      have more DIO in flight than we have normal buffered IO.  The reason for
      this is because we can't "flush" DIO, we have to just wait on the
      ordered extents to finish.
      
      However this is true of all ordered extents.  If we have more ordered
      space outstanding than dirty pages we should be waiting on ordered
      extents.  We already are ok on this front technically, because we always
      do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in
      the preemptive flushing code as well, so change this to count all
      ordered bytes instead of just DIO ordered bytes.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5deb17e1
    • J
      btrfs: add a trace point for reserve tickets · ac1ea10e
      Josef Bacik 提交于
      While debugging a ENOSPC related performance problem I needed to see the
      time difference between start and end of a reserve ticket, so add a
      trace point to report when we handle a reserve ticket.
      
      I opted to spit out start_ns itself without calculating the difference
      because there could be a gap between enabling the tracepoint and setting
      start_ns.  Doing it this way allows us to filter on 0 start_ns so we
      don't get bogus entries, and we can easily calculate the time difference
      with bpftrace or something else.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ac1ea10e
    • J
      btrfs: make flush_space take a enum btrfs_flush_state instead of int · 91e79a83
      Josef Bacik 提交于
      I got a automated message from somebody who runs clang against our
      kernels and it's because I used the wrong enum type for what I passed
      into flush_space, caught by -Wenum-conversion.  Change the argument to
      be explicitly the enum we're expecting to make everything consistent.
      Maybe eventually gcc will catch errors like this.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      91e79a83
    • N
      btrfs: fix parameter description in space-info.c · d98b188e
      Nikolay Borisov 提交于
      With these fixes space-info.c is clear for W=1 warnings, namely the
      following ones are fixed:
      
      fs/btrfs/space-info.c:575: warning: Function parameter or member 'fs_info' not described in 'may_commit_transaction'
      fs/btrfs/space-info.c:575: warning: Function parameter or member 'space_info' not described in 'may_commit_transaction'
      fs/btrfs/space-info.c:1231: warning: Function parameter or member 'fs_info' not described in 'handle_reserve_ticket'
      fs/btrfs/space-info.c:1231: warning: Function parameter or member 'space_info' not described in 'handle_reserve_ticket'
      fs/btrfs/space-info.c:1231: warning: Function parameter or member 'ticket' not described in 'handle_reserve_ticket'
      fs/btrfs/space-info.c:1231: warning: Function parameter or member 'flush' not described in 'handle_reserve_ticket'
      fs/btrfs/space-info.c:1315: warning: Function parameter or member 'fs_info' not described in '__reserve_bytes'
      fs/btrfs/space-info.c:1315: warning: Function parameter or member 'space_info' not described in '__reserve_bytes'
      fs/btrfs/space-info.c:1315: warning: Function parameter or member 'orig_bytes' not described in '__reserve_bytes'
      fs/btrfs/space-info.c:1315: warning: Function parameter or member 'flush' not described in '__reserve_bytes'
      fs/btrfs/space-info.c:1427: warning: Function parameter or member 'root' not described in 'btrfs_reserve_metadata_bytes'
      fs/btrfs/space-info.c:1427: warning: Function parameter or member 'block_rsv' not described in 'btrfs_reserve_metadata_bytes'
      fs/btrfs/space-info.c:1427: warning: Function parameter or member 'orig_bytes' not described in 'btrfs_reserve_metadata_bytes'
      fs/btrfs/space-info.c:1427: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_metadata_bytes'
      fs/btrfs/space-info.c:1462: warning: Function parameter or member 'fs_info' not described in 'btrfs_reserve_data_bytes'
      fs/btrfs/space-info.c:1462: warning: Function parameter or member 'bytes' not described in 'btrfs_reserve_data_bytes'
      fs/btrfs/space-info.c:1462: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_data_bytes'
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d98b188e
    • N
      btrfs: make btrfs_start_delalloc_root's nr argument a long · 9db4dc24
      Nikolay Borisov 提交于
      It's currently u64 which gets instantly translated either to LONG_MAX
      (if U64_MAX is passed) or cast to an unsigned long (which is in fact,
      wrong because writeback_control::nr_to_write is a signed, long type).
      
      Just convert the function's argument to be long time which obviates the
      need to manually convert u64 value to a long. Adjust all call sites
      which pass U64_MAX to pass LONG_MAX. Finally ensure that in
      shrink_delalloc the u64 is converted to a long without overflowing,
      resulting in a negative number.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9db4dc24
  9. 08 1月, 2021 1 次提交
    • J
      btrfs: shrink delalloc pages instead of full inodes · e076ab2a
      Josef Bacik 提交于
      Commit 38d715f4 ("btrfs: use btrfs_start_delalloc_roots in
      shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
      some infrastructure we have in place to flush inodes that we use for
      device replace and snapshot.  However this introduced a pretty serious
      performance regression.  To reproduce the user untarred the source
      tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
      see it take anywhere from 5 to 20 times as long to untar in 5.10
      compared to 5.9. This was observed on fast devices (SSD and better) and
      not on HDD.
      
      The root cause is because before we would generally use the normal
      writeback path to reclaim delalloc space, and for this we would provide
      it with the number of pages we wanted to flush.  The referenced commit
      changed this to flush that many inodes, which drastically increased the
      amount of space we were flushing in certain cases, which severely
      affected performance.
      
      We cannot revert this patch unfortunately because of 3d45f221
      ("btrfs: fix deadlock when cloning inline extent and low on free
      metadata space") which requires the ability to skip flushing inodes that
      are being cloned in certain scenarios, which means we need to keep using
      our flushing infrastructure or risk re-introducing the deadlock.
      
      Instead to fix this problem we can go back to providing
      btrfs_start_delalloc_roots with a number of pages to flush, and then set
      up a writeback_control and utilize sync_inode() to handle the flushing
      for us.  This gives us the same behavior we had prior to the fix, while
      still allowing us to avoid the deadlock that was fixed by Filipe.  I
      redid the users original test and got the following results on one of
      our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive)
      
        5.9		0m54.258s
        5.10		1m26.212s
        5.10+patch	0m38.800s
      
      5.10+patch is significantly faster than plain 5.9 because of my patch
      series "Change data reservations to use the ticketing infra" which
      contained the patch that introduced the regression, but generally
      improved the overall ENOSPC flushing mechanisms.
      
      Additional testing on consumer-grade SSD (8GiB ram, 8 CPU) confirm
      the results:
      
        5.10.5            4m00s
        5.10.5+patch      1m08s
        5.11-rc2	    5m14s
        5.11-rc2+patch    1m30s
      Reported-by: NRené Rebe <rene@exactcode.de>
      Fixes: 38d715f4 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
      CC: stable@vger.kernel.org # 5.10
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Tested-by: NDavid Sterba <dsterba@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add my test results ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e076ab2a
  10. 18 12月, 2020 1 次提交
    • F
      btrfs: fix deadlock when cloning inline extent and low on free metadata space · 3d45f221
      Filipe Manana 提交于
      When cloning an inline extent there are cases where we can not just copy
      the inline extent from the source range to the target range (e.g. when the
      target range starts at an offset greater than zero). In such cases we copy
      the inline extent's data into a page of the destination inode and then
      dirty that page. However, after that we will need to start a transaction
      for each processed extent and, if we are ever low on available metadata
      space, we may need to flush existing delalloc for all dirty inodes in an
      attempt to release metadata space - if that happens we may deadlock:
      
      * the async reclaim task queued a delalloc work to flush delalloc for
        the destination inode of the clone operation;
      
      * the task executing that delalloc work gets blocked waiting for the
        range with the dirty page to be unlocked, which is currently locked
        by the task doing the clone operation;
      
      * the async reclaim task blocks waiting for the delalloc work to complete;
      
      * the cloning task is waiting on the waitqueue of its reservation ticket
        while holding the range with the dirty page locked in the inode's
        io_tree;
      
      * if metadata space is not released by some other task (like delalloc for
        some other inode completing for example), the clone task waits forever
        and as a consequence the delalloc work and async reclaim tasks will hang
        forever as well. Releasing more space on the other hand may require
        starting a transaction, which will hang as well when trying to reserve
        metadata space, resulting in a deadlock between all these tasks.
      
      When this happens, traces like the following show up in dmesg/syslog:
      
        [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
        [87452.323644]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [87452.324852] task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
        [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [87452.326136] Call Trace:
        [87452.326737]  __schedule+0x5d1/0xcf0
        [87452.327390]  schedule+0x45/0xe0
        [87452.328174]  lock_extent_bits+0x1e6/0x2d0 [btrfs]
        [87452.328894]  ? finish_wait+0x90/0x90
        [87452.329474]  btrfs_invalidatepage+0x32c/0x390 [btrfs]
        [87452.330133]  ? __mod_memcg_state+0x8e/0x160
        [87452.330738]  __extent_writepage+0x2d4/0x400 [btrfs]
        [87452.331405]  extent_write_cache_pages+0x2b2/0x500 [btrfs]
        [87452.332007]  ? lock_release+0x20e/0x4c0
        [87452.332557]  ? trace_hardirqs_on+0x1b/0xf0
        [87452.333127]  extent_writepages+0x43/0x90 [btrfs]
        [87452.333653]  ? lock_acquire+0x1a3/0x490
        [87452.334177]  do_writepages+0x43/0xe0
        [87452.334699]  ? __filemap_fdatawrite_range+0xa4/0x100
        [87452.335720]  __filemap_fdatawrite_range+0xc5/0x100
        [87452.336500]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [87452.337216]  btrfs_work_helper+0xf1/0x600 [btrfs]
        [87452.337838]  process_one_work+0x24e/0x5e0
        [87452.338437]  worker_thread+0x50/0x3b0
        [87452.339137]  ? process_one_work+0x5e0/0x5e0
        [87452.339884]  kthread+0x153/0x170
        [87452.340507]  ? kthread_mod_delayed_work+0xc0/0xc0
        [87452.341153]  ret_from_fork+0x22/0x30
        [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
        [87452.342487]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [87452.344049] task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
        [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
        [87452.345655] Call Trace:
        [87452.346305]  __schedule+0x5d1/0xcf0
        [87452.346947]  ? kvm_clock_read+0x14/0x30
        [87452.347676]  ? wait_for_completion+0x81/0x110
        [87452.348389]  schedule+0x45/0xe0
        [87452.349077]  schedule_timeout+0x30c/0x580
        [87452.349718]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [87452.350340]  ? lock_acquire+0x1a3/0x490
        [87452.351006]  ? try_to_wake_up+0x7a/0xa20
        [87452.351541]  ? lock_release+0x20e/0x4c0
        [87452.352040]  ? lock_acquired+0x199/0x490
        [87452.352517]  ? wait_for_completion+0x81/0x110
        [87452.353000]  wait_for_completion+0xab/0x110
        [87452.353490]  start_delalloc_inodes+0x2af/0x390 [btrfs]
        [87452.353973]  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
        [87452.354455]  flush_space+0x24f/0x660 [btrfs]
        [87452.355063]  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
        [87452.355565]  process_one_work+0x24e/0x5e0
        [87452.356024]  worker_thread+0x20f/0x3b0
        [87452.356487]  ? process_one_work+0x5e0/0x5e0
        [87452.356973]  kthread+0x153/0x170
        [87452.357434]  ? kthread_mod_delayed_work+0xc0/0xc0
        [87452.357880]  ret_from_fork+0x22/0x30
        (...)
        < stack traces of several tasks waiting for the locks of the inodes of the
          clone operation >
        (...)
        [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
        [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
        [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
        [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
        [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
        [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
        [92867.447361] task:fsstress        state:D stack:    0 pid:2508238 ppid:2508153 flags:0x00004000
        [92867.447920] Call Trace:
        [92867.448435]  __schedule+0x5d1/0xcf0
        [92867.448934]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [92867.449423]  schedule+0x45/0xe0
        [92867.449916]  __reserve_bytes+0x4a4/0xb10 [btrfs]
        [92867.450576]  ? finish_wait+0x90/0x90
        [92867.451202]  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
        [92867.451815]  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
        [92867.452412]  start_transaction+0x2d1/0x760 [btrfs]
        [92867.453216]  clone_copy_inline_extent+0x333/0x490 [btrfs]
        [92867.453848]  ? lock_release+0x20e/0x4c0
        [92867.454539]  ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
        [92867.455218]  btrfs_clone+0x569/0x7e0 [btrfs]
        [92867.455952]  btrfs_clone_files+0xf6/0x150 [btrfs]
        [92867.456588]  btrfs_remap_file_range+0x324/0x3d0 [btrfs]
        [92867.457213]  do_clone_file_range+0xd4/0x1f0
        [92867.457828]  vfs_clone_file_range+0x4d/0x230
        [92867.458355]  ? lock_release+0x20e/0x4c0
        [92867.458890]  ioctl_file_clone+0x8f/0xc0
        [92867.459377]  do_vfs_ioctl+0x342/0x750
        [92867.459913]  __x64_sys_ioctl+0x62/0xb0
        [92867.460377]  do_syscall_64+0x33/0x80
        [92867.460842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        (...)
        < stack traces of more tasks blocked on metadata reservation like the clone
          task above, because the async reclaim task has deadlocked >
        (...)
      
      Another thing to notice is that the worker task that is deadlocked when
      trying to flush the destination inode of the clone operation is at
      btrfs_invalidatepage(). This is simply because the clone operation has a
      destination offset greater than the i_size and we only update the i_size
      of the destination file after cloning an extent (just like we do in the
      buffered write path).
      
      Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
      the flushing of delalloc for all inodes that have delalloc, add a runtime
      flag to an inode to signal it should not be flushed, and for inodes with
      that flag set, start_delalloc_inodes() will simply skip them. When the
      cloning code needs to dirty a page to copy an inline extent, set that flag
      on the inode and then clear it when the clone operation finishes.
      
      This could be sporadically triggered with test case generic/269 from
      fstests, which exercises many fsstress processes running in parallel with
      several dd processes filling up the entire filesystem.
      
      CC: stable@vger.kernel.org # 5.9+
      Fixes: 05a5a762 ("Btrfs: implement full reflink support for inline extents")
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3d45f221
  11. 07 10月, 2020 2 次提交