    btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc · 03fe78cc
    Committed by Josef Bacik
    We have been hitting some early ENOSPC issues in production with more
    recent kernels, and I tracked it down to us simply not flushing delalloc
    as aggressively as we should be.  With tracing I was seeing us failing
    all tickets with all of the block rsvs at or around 0, with very little
    pinned space, but still around 120MiB of outstanding bytes_may_use.
    Upon further investigation I saw that we were flushing around 14 pages
    per shrink call for delalloc, despite having around 2GiB of delalloc
    outstanding.
    
    Consider the example of an 8-way machine, all CPUs trying to create a
    file in parallel, which at the time of this commit requires 5 items to
    do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
    size waiting on reservations.  Now assume we have 128MiB of delalloc
    outstanding.  With our current math we would set items to 20, and then
    set to_reclaim to 20 * 256k, or 5MiB.
    
    Assuming that we went through this loop all 3 times, for both
    FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
    twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
    leave a fair bit of delalloc reservations still hanging around by the
    time we go to ENOSPC out all the remaining tickets.
    
    Fix this in two ways.  First, change the calculations to be a fraction
    of the total delalloc bytes on the system.  Prior to this change we
    were calculating based on dirty inodes, so our math made more sense;
    now it's just completely unrelated to what we're actually doing.
    
    Second, add a FLUSH_DELALLOC_FULL state, which we hold off until we've
    gone through the flush states at least once.  This will empty the system
    of all delalloc so we're sure to be truly out of space when we start
    failing tickets.
    
    I'm tagging stable 5.10 and forward, because this is where we started
    using the page stuff heavily again.  This affects earlier kernel
    versions as well, but would be a pain to backport to them as the
    flushing mechanisms aren't the same.
    
    CC: stable@vger.kernel.org # 5.10+
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>