1. 04 Sep, 2021 1 commit
  2. 27 Aug, 2021 3 commits
  3. 25 Aug, 2021 1 commit
  4. 24 Aug, 2021 3 commits
    • C
      block: add an explicit ->disk backpointer to the request_queue · d152c682
      Christoph Hellwig committed
      Replace the magic lookup through the kobject tree with an explicit
      backpointer, given that the device model links are set up and torn
      down at times when I/O is still possible, leading to potential
      NULL or invalid pointer dereferences.
      
      Fixes: edb0872f ("block: move the bdi from the request_queue to the gendisk")
      Reported-by: syzbot <syzbot+aa0801b6b32dca9dda82@syzkaller.appspotmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Sven Schnelle <svens@linux.ibm.com>
      Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d152c682
    • D
      f2fs: introduce periodic iostat io latency traces · a4b68176
      Daeho Jeong committed
      Whenever we notice sluggishness on our machines, we want to know how
      well each type of I/O in the f2fs filesystem is being handled, but this
      kind of real data is hard to get. First, we need to reproduce the issue
      with a profiling tool such as blktrace turned on, and the issue does not
      recur easily. Second, the intervention of any tool slightly changes the
      overall timing of the issue, which sometimes makes it hard to figure
      out.

      So I added a feature that prints I/O latency statistics as tracepoint
      events, the minimum needed to understand the filesystem's I/O-related
      behavior, under the F2FS_IOSTAT kernel config. With the "iostat_enable"
      sysfs node turned on, we get these statistics periodically and with
      the least overhead.
      
      [samples]
       f2fs_ckpt-254:1-507     [003] ....  2842.439683: f2fs_iostat_latency:
      dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
      rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
      wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
      wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
      wr_async_node [0/0/0], wr_async_meta [0/0/0]
      
       f2fs_ckpt-254:1-507     [002] ....  2845.450514: f2fs_iostat_latency:
      dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
      rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
      wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
      wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
      wr_async_node [0/0/0], wr_async_meta [0/0/0]
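
The sample lines above follow a regular `name [peak/avg/count]` layout, so they are easy to post-process. The following is a minimal sketch, not part of the commit, that parses the payload of one `f2fs_iostat_latency` event into a dictionary. The function name `parse_iostat_latency` and the key names `peak_ms`, `avg_ms`, and `count` are assumptions chosen for illustration.

```python
import re

# Payload of the first sample event above, joined onto one line.
SAMPLE = (
    "dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count], "
    "rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4], "
    "wr_sync_data [164/16/3331], wr_sync_node [152/3/648], "
    "wr_sync_meta [160/2/4243], wr_async_data [24/13/15], "
    "wr_async_node [0/0/0], wr_async_meta [0/0/0]"
)

# Matches "name [peak/avg/count]" with numeric fields only, so the
# "iotype [...]" legend is skipped because it contains no digits.
FIELD = re.compile(r"(\w+) \[(\d+)/(\d+)/(\d+)\]")


def parse_iostat_latency(payload):
    """Return {iotype: {"peak_ms", "avg_ms", "count"}} for one event payload."""
    return {
        name: {"peak_ms": int(peak), "avg_ms": int(avg), "count": int(count)}
        for name, peak, avg, count in FIELD.findall(payload)
    }


if __name__ == "__main__":
    for iotype, stats in parse_iostat_latency(SAMPLE).items():
        print(iotype, stats)
```

Keying the result by I/O type makes it straightforward to compare the same counter across consecutive periodic events.
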
      Signed-off-by: Daeho Jeong <daehojeong@google.com>
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      a4b68176
    • D
      f2fs: separate out iostat feature · 52118743
      Daeho Jeong committed
      Added the F2FS_IOSTAT config option to support getting I/O statistics
      through sysfs and printing periodic I/O statistics tracepoint events,
      and moved the I/O statistics code into separate files for easier
      maintenance.
      Signed-off-by: Daeho Jeong <daehojeong@google.com>
      Reviewed-by: Chao Yu <chao@kernel.org>
      [Jaegeuk Kim: set default=y]
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      52118743
  5. 23 Aug, 2021 3 commits
    • J
      btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc · 03fe78cc
      Josef Bacik committed
      We have been hitting some early ENOSPC issues in production with more
      recent kernels, and I tracked it down to us simply not flushing delalloc
      as aggressively as we should be.  With tracing I was seeing us failing
      all tickets with all of the block rsvs at or around 0, with very little
      pinned space, but still around 120MiB of outstanding bytes_may_use.
      Upon further investigation I saw that we were flushing around 14 pages
      per shrink call for delalloc, despite having around 2GiB of delalloc
      outstanding.
      
      Consider the example of an 8-way machine, with all CPUs trying to create a
      file in parallel, which at the time of this commit requires 5 items to
      do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
      size waiting on reservations.  Now assume we have 128MiB of delalloc
      outstanding.  With our current math we would set items to 20, and then
      set to_reclaim to 20 * 256k, or 5MiB.
      
      Assuming that we went through this loop all 3 times, for both
      FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
      twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
      leave a fair bit of delalloc reservations still hanging around by the
      time we go to ENOSPC out all the remaining tickets.
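
The arithmetic in the two paragraphs above can be checked with a short worked example. This is only a sketch of the numbers quoted in the commit message (20 items, 256k per item, 3 loops, 2 flush states, 2 full passes, 128MiB of delalloc); the variable names are made up for illustration and do not correspond to kernel identifiers.

```python
KIB = 1024
MIB = 1024 * KIB

# Values quoted in the commit message.
items = 20
bytes_per_item = 256 * KIB
to_reclaim = items * bytes_per_item          # amount requested per shrink call

loops_per_state = 3                          # "this loop all 3 times"
flush_states = 2                             # FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT
full_passes = 2                              # "did the full loop twice"
delalloc_outstanding = 128 * MIB

total_flushed = to_reclaim * loops_per_state * flush_states * full_passes
left_over = delalloc_outstanding - total_flushed

if __name__ == "__main__":
    print(f"to_reclaim per call: {to_reclaim // MIB} MiB")
    print(f"flushed in total:    {total_flushed // MIB} MiB")
    print(f"still outstanding:   {left_over // MIB} MiB")
```

Under these assumptions more than half of the outstanding delalloc is never flushed before tickets start failing, which is the gap the commit sets out to close.
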
      
      Fix this in two ways.  First, change the calculations to be a fraction
      of the total delalloc bytes on the system.  Prior to this change we were
      calculating based on dirty inodes so our math made more sense, now it's
      just completely unrelated to what we're actually doing.
      
      Second, add a FLUSH_DELALLOC_FULL state that we hold off on until we've
      gone through the flush states at least once.  This will empty the system
      of all delalloc so we're sure to be truly out of space when we start
      failing tickets.
      
      I'm tagging stable 5.10 and forward, because this is where we started
      using the page stuff heavily again.  This affects earlier kernel
      versions as well, but would be a pain to backport to them as the
      flushing mechanisms aren't the same.
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      03fe78cc
    • J
      btrfs: enable a tracepoint when we fail tickets · fcdef39c
      Josef Bacik committed
      When debugging early ENOSPC problems it was useful to have a tracepoint
      where we failed all tickets so I could check the state of the ENOSPC
      counters at failure time to validate my fixes.  This adds the tracepoint
      so you can easily get that information.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fcdef39c
    • J
      btrfs: include delalloc related info in dump space info tracepoint · 8197766d
      Josef Bacik committed
      In order to debug delalloc flushing issues I added delalloc_bytes and
      ordered_bytes to this tracepoint to see if they were non-zero when we
      were going ENOSPC. This was valuable for me and showed me cases where we
      weren't waiting on ordered extents properly. In order to add this to the
      tracepoint we need to take away the const modifier for fs_info, as
      percpu_counter_sum_positive() will change the counter when it adds up
      the percpu buckets.  This is needed to make sure we're getting accurate
      information at these tracepoints, as the wrong information could send us
      down the wrong path when debugging problems.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8197766d
  6. 21 Aug, 2021 1 commit
  7. 17 Aug, 2021 1 commit
  8. 11 Aug, 2021 2 commits
  9. 10 Aug, 2021 6 commits
  10. 27 Jul, 2021 1 commit
  11. 21 Jul, 2021 1 commit
  12. 16 Jul, 2021 3 commits
  13. 01 Jul, 2021 1 commit
  14. 30 Jun, 2021 3 commits
  15. 29 Jun, 2021 1 commit
  16. 26 Jun, 2021 1 commit
    • D
      trace: Add osnoise tracer · bce29ac9
      Daniel Bristot de Oliveira committed
      In the context of high-performance computing (HPC), the Operating System
      Noise (*osnoise*) refers to the interference experienced by an application
      due to activities inside the operating system. In the context of Linux,
      NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
      system. Moreover, hardware-related jobs can also cause noise, for example,
      via SMIs.
      
      The osnoise tracer leverages the hwlat_detector by running a similar
      loop with preemption, SoftIRQs and IRQs enabled, thus allowing all
      the sources of *osnoise* during its execution. Using the same approach
      of hwlat, osnoise takes note of the entry and exit point of any
      source of interferences, increasing a per-cpu interference counter. The
      osnoise tracer also saves an interference counter for each source of
      interference. The interference counter for NMI, IRQs, SoftIRQs, and
      threads is increased anytime the tool observes these interferences' entry
      events. When a noise happens without any interference from the operating
      system level, the hardware noise counter increases, pointing to a
      hardware-related noise. In this way, osnoise can account for any
      source of interference. At the end of the period, the osnoise tracer
      prints the sum of all noise, the max single noise, the percentage of CPU
      available for the thread, and the counters for the noise sources.
      
      Usage
      
      Write the ASCII text "osnoise" into the current_tracer file of the
      tracing system (generally mounted at /sys/kernel/tracing).
      
      For example::
      
              [root@f32 ~]# cd /sys/kernel/tracing/
              [root@f32 tracing]# echo osnoise > current_tracer
      
      It is possible to follow the trace by reading the trace file::
      
              [root@f32 tracing]# cat trace
              # tracer: osnoise
              #
              #                                _-----=> irqs-off
              #                               / _----=> need-resched
              #                              | / _---=> hardirq/softirq
              #                              || / _--=> preempt-depth                            MAX
              #                              || /                                             SINGLE     Interference counters:
              #                              ||||               RUNTIME      NOISE   % OF CPU  NOISE    +-----------------------------+
              #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
              #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
                         <...>-859     [000] ....    81.637220: 1000000        190  99.98100       9     18      0   1007     18      1
                         <...>-860     [001] ....    81.638154: 1000000        656  99.93440      74     23      0   1006     16      3
                         <...>-861     [002] ....    81.638193: 1000000       5675  99.43250     202      6      0   1013     25     21
                         <...>-862     [003] ....    81.638242: 1000000        125  99.98750      45      1      0   1011     23      0
                         <...>-863     [004] ....    81.638260: 1000000       1721  99.82790     168      7      0   1002     49     41
                         <...>-864     [005] ....    81.638286: 1000000        263  99.97370      57      6      0   1006     26      2
                         <...>-865     [006] ....    81.638302: 1000000        109  99.98910      21      3      0   1006     18      1
                         <...>-866     [007] ....    81.638326: 1000000       7816  99.21840     107      8      0   1016     39     19
      
      In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
      tracer prints a message at the end of each period for each CPU that is
      running an osnoise/CPU thread. The osnoise specific fields report:
      
       - The RUNTIME IN US reports the amount of time in microseconds that
         the osnoise thread kept looping reading the time.
       - The NOISE IN US reports the sum of noise in microseconds observed
         by the osnoise tracer during the associated runtime.
       - The % OF CPU AVAILABLE reports the percentage of CPU available for
         the osnoise thread during the runtime window.
       - The MAX SINGLE NOISE IN US reports the maximum single noise observed
         during the runtime window.
       - The Interference counters display how many times each of the
         respective interferences happened during the runtime window.
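
The relationship between the RUNTIME, NOISE, and % OF CPU AVAILABLE columns can be reproduced from the sample output. The sketch below assumes the percentage is the share of the runtime not consumed by noise; this formula is inferred from the rows shown above, not taken from the tracer source, and the function name `cpu_available_percent` is hypothetical.

```python
# (runtime_us, noise_us, "% OF CPU AVAILABLE" as printed) from the trace above.
ROWS = [
    (1000000, 190, "99.98100"),
    (1000000, 656, "99.93440"),
    (1000000, 5675, "99.43250"),
    (1000000, 7816, "99.21840"),
]


def cpu_available_percent(runtime_us, noise_us):
    """Share of the runtime window that was not consumed by noise, in percent."""
    return 100.0 * (runtime_us - noise_us) / runtime_us


if __name__ == "__main__":
    for runtime_us, noise_us, printed in ROWS:
        computed = f"{cpu_available_percent(runtime_us, noise_us):.5f}"
        print(f"noise={noise_us:>5} us  computed={computed}  printed={printed}")
```
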
      
      Note that the example above shows a high number of HW noise samples.
      The reason is that this sample was taken on a virtual machine,
      and the host interference is detected as a hardware interference.
      
      Tracer options
      
      The tracer has a set of options inside the osnoise directory:
      
       - osnoise/cpus: CPUs on which an osnoise thread will execute.
       - osnoise/period_us: the period of the osnoise thread.
       - osnoise/runtime_us: how long an osnoise thread will look for noise.
       - osnoise/stop_tracing_us: stop the system tracing if a single noise
         higher than the configured value happens. Writing 0 disables this
         option.
       - osnoise/stop_tracing_total_us: stop the system tracing if total noise
         higher than the configured value happens. Writing 0 disables this
         option.
       - tracing_threshold: the minimum delta between two time() reads to be
         considered as noise, in us. When set to 0, the default value will
         be used, which is currently 5 us.
      
      Additional Tracing
      
      In addition to the tracer, a set of tracepoints were added to
      facilitate the identification of the osnoise source.
      
       - osnoise:sample_threshold: printed anytime a noise is higher than
         the configurable tolerance_ns.
       - osnoise:nmi_noise: noise from NMI, including the duration.
       - osnoise:irq_noise: noise from an IRQ, including the duration.
       - osnoise:softirq_noise: noise from a SoftIRQ, including the
         duration.
       - osnoise:thread_noise: noise from a thread, including the duration.
      
      Note that all the values are *net values*. For example, if while osnoise
      is running, another thread preempts the osnoise thread, it will start a
      thread_noise duration at the start. Then, an IRQ takes place, preempting
      the thread_noise, starting an irq_noise. When the IRQ ends its execution,
      it will compute its duration, and this duration will be subtracted from
      the thread_noise, in such a way as to avoid the double accounting of the
      IRQ execution. This logic is valid for all sources of noise.
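
The net-value rule described above can be illustrated with a small model. This is a sketch only, not the tracer's implementation: it assumes intervals are either disjoint or properly nested, uses made-up timestamps, and the function name `net_durations` is hypothetical. Each interval's net duration is its gross duration minus the gross duration of the interferences nested directly inside it.

```python
def net_durations(intervals):
    """Compute net durations for properly nested or disjoint intervals.

    intervals: list of (name, start_ns, end_ns) with unique names.
    Returns {name: net_ns}, where time spent in directly nested
    intervals is subtracted from the enclosing interval.
    """
    # Outer intervals sort before the intervals nested inside them.
    ordered = sorted(intervals, key=lambda iv: (iv[1], -iv[2]))
    net = {}
    stack = []  # currently open (enclosing) intervals
    for name, start, end in ordered:
        # Drop enclosing intervals that ended before this one started.
        while stack and stack[-1][2] <= start:
            stack.pop()
        net[name] = end - start
        if stack:
            # Avoid double accounting: charge this time to the inner source only.
            net[stack[-1][0]] -= end - start
        stack.append((name, start, end))
    return net


if __name__ == "__main__":
    # Hypothetical example: a thread preempts osnoise for 10000 ns and is
    # itself interrupted by an IRQ that runs for 2000 ns.
    example = [("thread_noise", 0, 10000), ("irq_noise", 3000, 5000)]
    print(net_durations(example))
```
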
      
      Here is one example of the usage of these tracepoints::
      
             osnoise/8-961     [008] d.h.  5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
             osnoise/8-961     [008] dNh.  5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
           migration/8-54      [008] d...  5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
             osnoise/8-961     [008] ....  5789.858413: sample_threshold: start 5789.858404555 duration 8723 ns interferences 2
      
      In this example, a noise sample of 8 microseconds was reported in the last
      line, pointing to two interferences. Looking backward in the trace, the
      two previous entries were about the migration thread running after a
      timer IRQ execution. The first event is not part of the noise because
      it took place one millisecond before.
      
      It is worth noticing that the sum of the durations reported in the
      tracepoints is smaller than the eight us reported in the sample_threshold.
      The reason lies in the overhead of the entry and exit code that runs
      before and after any interference execution. This justifies the dual
      approach: measuring thread and tracing.
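
The gap mentioned above can be computed directly from the example trace. The sketch below uses only the durations shown in the last three events (the first irq_noise event is excluded because, as noted, it is not part of the sample); the variable names are chosen for illustration.

```python
# Durations in nanoseconds, taken from the example trace above.
irq_noise_ns = 2848          # local_timer:236, second irq_noise event
thread_noise_ns = 3068       # migration/8:54
sample_threshold_ns = 8723   # total reported by sample_threshold

# Sum of the two interferences counted by the sample.
accounted_ns = irq_noise_ns + thread_noise_ns

# The remainder is attributed in the text to entry and exit code overhead.
unaccounted_ns = sample_threshold_ns - accounted_ns

if __name__ == "__main__":
    print(f"accounted by tracepoints: {accounted_ns} ns")
    print(f"not covered by them:      {unaccounted_ns} ns")
```
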
      
      Link: https://lkml.kernel.org/r/e649467042d60e7b62714c9c6751a56299d15119.1624372313.git.bristot@redhat.com
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      [
        Made the following functions static:
         trace_irqentry_callback()
         trace_irqexit_callback()
         trace_intel_irqentry_callback()
         trace_intel_irqexit_callback()
      
        Added to include/trace.h:
         osnoise_arch_register()
         osnoise_arch_unregister()
      
        Fixed define logic for LATENCY_FS_NOTIFY
      Reported-by: kernel test robot <lkp@intel.com>
      ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      bce29ac9
  17. 24 Jun, 2021 1 commit
    • Z
      jbd2,ext4: add a shrinker to release checkpointed buffers · 4ba3fcdd
      Zhang Yi committed
      The current metadata buffer release logic in bdev_try_to_free_page() has
      many use-after-free issues when the filesystem is unmounted
      concurrently, and it is difficult to fix directly because ext4 is the
      only user of the s_op->bdev_try_to_free_page callback, and we might have
      to add more special refcounts or locks used only by ext4 to the common
      vfs layer, which is unacceptable.
      
      A better solution is to remove the bdev_try_to_free_page callback, but
      the real problem is that we cannot easily release the journal_head on a
      checkpointed buffer, so try_to_free_buffers() cannot release buffers and
      pages under memory pressure, which makes out-of-memory more likely. So
      we cannot remove the callback until we find another way to release the
      journal_head.
      
      This patch introduces a shrinker to free the journal_head on
      checkpointed transactions. Once the journal_head is freed,
      try_to_free_buffers() can free the buffer properly.
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210610112440.3438139-6-yi.zhang@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      4ba3fcdd
  18. 22 Jun, 2021 1 commit
    • J
      btrfs: rip out may_commit_transaction · c416a30c
      Josef Bacik committed
      may_commit_transaction was introduced before the ticketing
      infrastructure existed.  There was a problem where we'd legitimately be
      out of space, but every reservation would trigger a transaction commit
      and then fail.  Thus if you had 1000 things trying to make a
      reservation, they'd all do the flushing loop and thus commit the
      transaction 1000 times before they'd get their ENOSPC.
      
      This helper was introduced to short circuit this, if there wasn't space
      that could be reclaimed by committing the transaction then simply ENOSPC
      out.  This made true ENOSPC tests much faster as we didn't waste a bunch
      of time.
      
      However many of our bugs over the years have been from cases where we
      didn't account for some space that would be reclaimed by committing a
      transaction.  The delayed refs rsv space, delayed rsv, many pinned bytes
      miscalculations, etc.  And in the meantime the original problem has been
      solved with ticketing.  We no longer will commit the transaction 1000
      times.  Instead we'll get 1000 waiters, we will go through the flushing
      mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
      out.  The ticketing infrastructure gives us a deterministic way to see
      if we're making progress or not, thus we avoid a lot of extra work.
      
      So simplify this step by simply unconditionally committing the
      transaction.  This removes what is arguably our most common source of
      early ENOSPC bugs and will allow us to drastically simplify many of the
      things we track because we simply won't need them with this stuff gone.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c416a30c
  19. 21 Jun, 2021 1 commit
    • Q
      btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered() · 38a39ac7
      Qu Wenruo committed
      There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
      end_compressed_bio_write().
      
      It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
      which is only supposed to accept inode pages.
      
      Thankfully the important info here is the inode, so let's pass
      btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
      make @page parameter optional.
      
      By this, end_compressed_bio_write() can happily pass page=NULL while
      still getting everything done properly.
      
      Also, to cooperate with such modification, replace @page parameter for
      trace_btrfs_writepage_end_io_hook() with btrfs_inode.
      Although this removes page_index info, the existing start/len should be
      enough for most usage.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      38a39ac7
  20. 19 Jun, 2021 1 commit
  21. 16 Jun, 2021 2 commits
  22. 12 Jun, 2021 1 commit
  23. 10 Jun, 2021 1 commit