1. 15 Jan, 2020 19 commits
  2. 02 Jan, 2020 10 commits
  3. 27 Dec, 2019 11 commits
    • Y
      mm: swap: check if swap backing device is congested or not · 0c22f660
      Yang Shi authored
      commit 8fd2e0b505d124bbb046ab15de0ff6f8d4babf56 upstream.
      Change SWP_FS to SWP_FILE.
      
      Swap readahead reads in a few pages regardless of whether the
      underlying device is busy.  This may incur long waits if the device
      is congested, and it may also exacerbate the congestion.
      
      Use inode_read_congested() to check whether the underlying device is
      busy, as file page readahead does.  Get the inode from
      swap_info_struct.
      
      Although we could add inode information to swap_address_space
      (address_space->host), that may lead to unexpected side effects, e.g.
      it may break mapping_cap_account_dirty().  Using the inode from
      swap_info_struct seems simple and good enough.
      
      Do the check only in swap_cluster_readahead(), since
      swap_vma_readahead() is used only for non-rotational devices, which
      are much less likely to be congested than a traditional HDD.
      
      Although swap slots may be consecutive on a swap partition, they may
      still be fragmented in a swap file.  This check helps reduce excessive
      stalls in that case.
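
The decision described above can be modeled in a few lines. This is a minimal Python sketch, not the kernel code: `SwapDevice`, `pages_to_read` and the `window` parameter are hypothetical names, and the aligned-window readahead is an assumption made only to have something to skip.

```python
from dataclasses import dataclass


@dataclass
class SwapDevice:
    """Hypothetical stand-in for the backing device found via swap_info_struct."""
    congested: bool


def pages_to_read(fault_offset, window, dev):
    """Return the swap offsets to read for a fault at fault_offset.

    If the device reports congestion, skip readahead and read only the
    faulting page; otherwise read the aligned window around it.
    """
    if window <= 1 or dev.congested:
        return [fault_offset]
    start = fault_offset - (fault_offset % window)  # align down to the window
    return list(range(start, start + window))
```

With a congested device the fault costs one read instead of a whole window, which is the tail-latency effect the traces below illustrate.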
      
      The test uses page_fault1 from will-it-scale (tracing may sometimes
      show runtest.py, the wrapper script for page_fault1), which launches
      NR_CPU threads that each generate 128MB of anonymous pages.  On my
      virtual machine with a congested HDD, long tail latency is reduced
      significantly.
      
      Without the patch
       page_fault1_thr-1490  [023]   129.311706: funcgraph_entry:      #57377.796 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369103: funcgraph_entry:        5.642us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369119: funcgraph_entry:      #1289.592 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370411: funcgraph_entry:        4.957us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370419: funcgraph_entry:        1.940us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.378847: funcgraph_entry:      #1411.385 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380262: funcgraph_entry:        3.916us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380275: funcgraph_entry:      #4287.751 us |  do_swap_page();
      
      With the patch
            runtest.py-1417  [020]   301.925911: funcgraph_entry:      #9870.146 us |  do_swap_page();
            runtest.py-1417  [020]   301.935785: funcgraph_entry:        9.802us   |  do_swap_page();
            runtest.py-1417  [020]   301.935799: funcgraph_entry:        3.551us   |  do_swap_page();
            runtest.py-1417  [020]   301.935806: funcgraph_entry:        2.142us   |  do_swap_page();
            runtest.py-1417  [020]   301.935853: funcgraph_entry:        6.938us   |  do_swap_page();
            runtest.py-1417  [020]   301.935864: funcgraph_entry:        3.765us   |  do_swap_page();
            runtest.py-1417  [020]   301.935871: funcgraph_entry:        3.600us   |  do_swap_page();
            runtest.py-1417  [020]   301.935878: funcgraph_entry:        7.202us   |  do_swap_page();
      
      [akpm@linux-foundation.org: code cleanup]
      [yang.shi@linux.alibaba.com: add comment]
        Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Tim Chen <tim.c.chen@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
    • J
      iocost: check active_list of all the ancestors in iocg_activate() · 9fe84dc5
      Jiufei Xue authored
      commit 8b37bc277fb459fa100808880a9d4e0641fff444 upstream.
      
      iocg_activate() has a bug: it checks the same active_list over and
      over again.  The intent of the code was to check whether the iocg
      itself and all of its ancestors have already been activated.  Fix it.
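
The class of bug described here, a loop that walks a hierarchy but tests the starting node on every iteration, can be shown with a small model. This is an illustrative Python sketch only: `Node`, `active` (standing in for a non-empty active_list) and both function names are hypothetical, and what the kernel does with the result is not modeled.

```python
class Node:
    """Hypothetical cgroup-like node; `active` stands in for a non-empty active_list."""

    def __init__(self, name, parent=None, active=False):
        self.name = name
        self.parent = parent
        self.active = active


def path_has_active_buggy(node):
    """Walks up the tree but tests the starting node every time."""
    cur = node
    while cur is not None:
        if node.active:          # bug: should inspect `cur`, not `node`
            return True
        cur = cur.parent
    return False


def path_has_active_fixed(node):
    """Tests the node itself and then each ancestor in turn."""
    cur = node
    while cur is not None:
        if cur.active:
            return True
        cur = cur.parent
    return False
```

With an active root and an inactive leaf, the buggy walk never notices the ancestor while the fixed walk does.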
      
      Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • X
      jbd2: fix deadlock while checkpoint thread waits commit thread to finish · 1b483bfe
      Xiaoguang Wang authored
      commit 53cf978457325d8fb2cdecd7981b31a8229e446e upstream.
      
      This issue was found when I tried to put checkpoint work in a separate
      thread; the deadlock below happened:
               Thread1                                |   Thread2
      __jbd2_log_wait_for_space                       |
      jbd2_log_do_checkpoint (hold j_checkpoint_mutex)|
        if (jh->b_transaction != NULL)                |
          ...                                         |
          jbd2_log_start_commit(journal, tid);        |jbd2_update_log_tail
                                                      |  will lock j_checkpoint_mutex,
                                                      |  but will be blocked here.
                                                      |
          jbd2_log_wait_commit(journal, tid);         |
          wait_event(journal->j_wait_done_commit,     |
           !tid_gt(tid, journal->j_commit_sequence)); |
           ...                                        |wake_up(j_wait_done_commit)
        }                                             |
      
      The deadlock then occurs: Thread1 will never be woken up.
      
      To fix this issue, drop j_checkpoint_mutex in jbd2_log_do_checkpoint()
      when we are going to wait for transaction commit.
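
The lock ordering in the diagram and the fix can be reproduced in user space. The following Python sketch is a model under stated assumptions, not jbd2 code: `run_checkpoint` and its parameters are hypothetical, `checkpoint_mutex` stands in for j_checkpoint_mutex, `commit_done` for j_wait_done_commit, and timeouts are used so the "deadlock" case returns instead of hanging.

```python
import threading


def run_checkpoint(drop_mutex_before_wait, timeout=0.5):
    """Return True if the checkpoint side observed the commit finishing."""
    checkpoint_mutex = threading.Lock()   # stands in for j_checkpoint_mutex
    commit_requested = threading.Event()
    commit_done = threading.Event()       # stands in for j_wait_done_commit

    def commit_thread():
        commit_requested.wait()
        # The commit side needs the checkpoint mutex before it can finish
        # and wake the waiter.
        if checkpoint_mutex.acquire(timeout=timeout * 2):
            checkpoint_mutex.release()
            commit_done.set()

    worker = threading.Thread(target=commit_thread)
    worker.start()

    checkpoint_mutex.acquire()
    commit_requested.set()                # "start commit"
    if drop_mutex_before_wait:
        checkpoint_mutex.release()        # the fix: do not wait with the mutex held
        finished = commit_done.wait(timeout)
        checkpoint_mutex.acquire()
    else:
        # Waiting with the mutex held: the commit side cannot make progress.
        finished = commit_done.wait(timeout)
    checkpoint_mutex.release()
    worker.join()
    return finished
```

Calling `run_checkpoint(False)` times out because the waiter holds the mutex the other thread needs, while `run_checkpoint(True)` completes.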
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • D
      iocost: don't nest spin_lock_irq in ioc_weight_write() · 779d625e
      Dan Carpenter authored
      commit 41591a51f00d2dc7bb9dc6e9bedf56c5cf6f2392 upstream.
      
      This code causes a static analysis warning:
      
          block/blk-iocost.c:2113 ioc_weight_write() error: double lock 'irq'
      
      We disable IRQs in blkg_conf_prep() and re-enable them in
      blkg_conf_finish().  IRQ disable/enable should not be nested because
      that means the IRQs will be enabled at the first unlock instead of the
      second one.
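
Why nesting is a problem follows from the `_irq` lock variants disabling and enabling interrupts unconditionally, with no saved state. The Python sketch below is a toy model of that behavior: the `Cpu` class and both helper functions are hypothetical, and using a plain lock for the inner section is shown as one way to avoid the nesting, not as a statement of what the patch does.

```python
class Cpu:
    """Toy model: one IRQ-enabled flag and a count of held locks."""

    def __init__(self):
        self.irqs_enabled = True
        self.locks_held = 0

    def spin_lock_irq(self):
        self.irqs_enabled = False   # unconditional disable, nothing is saved
        self.locks_held += 1

    def spin_unlock_irq(self):
        self.locks_held -= 1
        self.irqs_enabled = True    # unconditional enable

    def spin_lock(self):
        self.locks_held += 1        # leaves the IRQ state untouched

    def spin_unlock(self):
        self.locks_held -= 1


def state_after_inner_unlock_nested():
    """Outer and inner sections both use the _irq variants."""
    cpu = Cpu()
    cpu.spin_lock_irq()             # outer section
    cpu.spin_lock_irq()             # inner section, nested
    cpu.spin_unlock_irq()           # first unlock re-enables IRQs too early
    return cpu.irqs_enabled, cpu.locks_held


def state_after_inner_unlock_plain():
    """Inner section uses a plain lock while IRQs are already off."""
    cpu = Cpu()
    cpu.spin_lock_irq()
    cpu.spin_lock()
    cpu.spin_unlock()
    return cpu.irqs_enabled, cpu.locks_held
```

In the nested case IRQs are back on while the outer lock is still held; in the plain-lock case they stay off until the outer unlock.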
      
      Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • J
      alinux: iocost: fix a deadlock in ioc_rqos_throttle() · 573ddb46
      Jiufei Xue authored
      ioc_rqos_throttle() may be called with queue_lock held.
      We should release queue_lock before going to sleep.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • T
      blkcg: Fix multiple bugs in blkcg_activate_policy() · c6417941
      Tejun Heo authored
      commit 9d179b865449b351ad5cb76dbea480c9170d4a27 upstream.
      
      blkcg_activate_policy() has the following bugs.
      
      * cf09a8ee19ad ("blkcg: pass @q and @blkcg into
        blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however,
        blkcg_activate_policy() ends up using pd's allocated for the root
        blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs
        can be passed in pd's which are allocated for the root blkcg.
      
        For blk-iocost, this means that ->pd_init_fn() can write beyond the
        end of the allocated object as it determines the length of the flex
        array at the end based on the blkcg's nesting level.
      
      * Each pd is initialized as it gets allocated.  If alloc fails, the
        policy will get freed with pd's initialized on it.
      
      * After the above partial failure, the partial pds are not freed.
      
      This patch fixes all the above issues by
      
      * Restructuring blkcg_activate_policy() so that alloc and init passes
        are separate.  Init takes place only after all allocs succeeded and
        on failure all allocated pds are freed.
      
      * Unifying and fixing the cleanup of the remaining pd_prealloc.
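
The restructuring described above, separate alloc and init passes with full cleanup on failure, can be sketched generically. This Python model is an illustration under assumptions: `activate_policy`, `run_demo` and the callback signatures are hypothetical and do not mirror the kernel's interfaces. Each pd is allocated for the group it will belong to, which is the property the first bug violated.

```python
def activate_policy(groups, alloc, init, free):
    """Allocate a pd for every group first; initialize only if all succeed."""
    allocated = []
    for group in groups:                 # pass 1: allocation only
        pd = alloc(group)
        if pd is None:                   # failure: free everything, init nothing
            for _, earlier in allocated:
                free(earlier)
            return False
        allocated.append((group, pd))
    for group, pd in allocated:          # pass 2: every alloc succeeded
        init(group, pd)
    return True


def run_demo(groups, fail_on=None):
    """Drive activate_policy with recording callbacks (all hypothetical)."""
    inited, freed = [], []

    def alloc(group):
        # A pd remembers which group it was allocated for.
        return None if group == fail_on else {"owner": group}

    ok = activate_policy(
        groups,
        alloc,
        init=lambda group, pd: inited.append(group),
        free=lambda pd: freed.append(pd["owner"]),
    )
    return ok, inited, freed
```

When allocation fails partway through, nothing has been initialized and every pd allocated so far is freed.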
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • T
      blkcg: blkcg_activate_policy() should initialize ancestors first · 94e9f8d4
      Tejun Heo authored
      commit 71c814077de60b2e7415dac6f5c4e98f59d521fd upstream.
      
      When blkcg_activate_policy() is creating blkg_policy_data for existing
      blkgs, it did so in the wrong order - descendants first.  Fix it.  None
      of the existing controllers seem affected by this.
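
"Ancestors first" means every group is visited before any of its descendants. A pre-order walk gives that ordering, as in the Python sketch below. The `children` mapping and the function name are hypothetical; how the kernel actually iterates its blkg list is not shown here.

```python
def ancestors_first(root, children):
    """Return nodes in pre-order, so each parent precedes its descendants.

    `children` maps a node name to the list of its child names.
    """
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so they are visited in their listed order.
        stack.extend(reversed(children.get(node, [])))
    return order
```

Any per-node setup that reads state from the parent can then rely on the parent having been set up already.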
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • T
      iocost: don't let vrate run wild while there's no saturation signal · c08d3e4b
      Tejun Heo authored
      When the QoS targets are met and nothing is being throttled, there's
      no way to tell how saturated the underlying device is - it could be
      almost entirely idle, at the cusp of saturation or anywhere in between.
      Given that there's no information, it's best to keep vrate as-is in
      this state.  Before 7cd806a9a953 ("iocost: improve nr_lagging
      handling"), this was the case - if the device isn't missing QoS
      targets and nothing is being throttled, busy_level was reset to zero.
      
      While fixing nr_lagging handling, 7cd806a9a953 ("iocost: improve
      nr_lagging handling") broke this.  Now, while the device is hitting
      QoS targets and nothing is being throttled, vrate keeps getting
      adjusted according to the existing busy_level.
      
      This led to vrate climbing until it hit max when there's an IO
      issuer with limited request concurrency and vrate started low.
      vrate gets adjusted upwards until the issuer can issue IOs
      w/o being throttled.  From then on, QoS targets keep getting met,
      nothing on the system needs throttling, and vrate keeps getting
      increased due to the existing busy_level.
      
      This patch makes the following changes to the busy_level logic.
      
      * Reset busy_level if nr_shortages is zero to avoid the above
        scenario.
      
      * Make non-zero nr_lagging block lowering busy_level but still clear
        positive busy_level if there's clear non-saturation signal - QoS
        targets are met and nr_shortages is non-zero.  nr_lagging's role is
        preventing adjusting vrate upwards while there are long-running
        commands and it shouldn't keep busy_level positive while there's
        clear non-saturation signal.
      
      * Restructure code for clarity and add comments.
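
The rules listed above can be written as a small state-update function. This is a simplified Python model built only from the text of this message, not the kernel's code: the function name and the two boolean inputs are hypothetical, a negative busy_level is taken to mean "raise vrate" and a positive one "lower vrate", and resetting to zero when neither QoS condition clearly holds is an assumption.

```python
def next_busy_level(busy_level, missing_qos, meeting_qos, nr_shortages, nr_lagging):
    """Return the busy_level for the next period (simplified model)."""
    if missing_qos:
        # Saturation signal: drop any negative level and move toward slowing vrate.
        return max(busy_level, 0) + 1
    if meeting_qos:
        if nr_shortages == 0:
            # Nothing is being throttled: no saturation information, so coast.
            return 0
        # Clear non-saturation signal: never keep a positive level here.
        level = min(busy_level, 0)
        if nr_lagging == 0:
            level -= 1          # lagging IOs block lowering busy_level further
        return level
    # Neither clearly missing nor clearly meeting targets (assumed: reset).
    return 0
```

For example, a positive level with targets met, shortages present and lagging IOs drops to zero but no lower, while the same inputs without lagging IOs go to -1.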
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Andy Newell <newella@fb.com>
      Fixes: 7cd806a9a953 ("iocost: improve nr_lagging handling")
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • T
      iocost: bump up default latency targets for hard disks · 54c73cd5
      Tejun Heo authored
      commit 7afcccafa59fb63b58f863a6c5e603a34625955b upstream.
      
      The default hard disk param sets latency targets at 50ms.  As the
      default target percentiles are zero, these don't directly regulate
      vrate; however, they're still used to calculate the period length -
      100ms in this case.
      
      This is excessively low.  A SATA drive with QD32 saturated with random
      IOs can easily reach avg completion latency of several hundred msecs.
      A period duration which is substantially lower than avg completion
      latency can lead to wildly fluctuating vrate.
      
      Let's bump up the default latency targets to 250ms so that the period
      duration is sufficiently long.
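
The arithmetic behind this change is short. The Python sketch below assumes the period is a fixed multiple of the larger latency target; the multiplier of 2 is inferred from the 50ms to 100ms figures in this message and applies only to the zero-percentile case described here, and `period_ms` is a hypothetical name.

```python
def period_ms(rlat_ms, wlat_ms, multiplier=2):
    """Period length derived from the larger of the read/write latency targets."""
    return multiplier * max(rlat_ms, wlat_ms)


old_period = period_ms(50, 50)      # old hard disk defaults
new_period = period_ms(250, 250)    # bumped defaults
```

Under these assumptions the old defaults give a 100ms period, shorter than the several-hundred-millisecond completion latencies mentioned above, while the new defaults give 500ms.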
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • T
      iocost: improve nr_lagging handling · f63e7224
      Tejun Heo authored
      commit 7cd806a9a953f234b9865c30028f47fd738ce375 upstream.
      
      Some IOs may span multiple periods.  As latencies are collected on
      completion, the in-between periods won't register them and may
      incorrectly decide to increase vrate.  nr_lagging tracks these IOs to
      avoid those situations.  Currently, whenever there are IOs which are
      spanning from the previous period, busy_level is reset to 0 if
      negative thus suppressing vrate increase.
      
      This has the following two problems.
      
      * When latency target percentiles aren't set, vrate adjustment should
        only be governed by queue depth depletion; however, the current code
        keeps nr_lagging active which pulls in latency results and can keep
        down vrate unexpectedly.
      
      * When lagging condition is detected, it resets the entire negative
        busy_level.  This turned out to be way too aggressive on some
        devices which sometimes experience extended latencies on a small
        subset of commands.  In addition, a lagging IO will be accounted as
        latency target miss on completion anyway and resetting busy_level
        amplifies its impact unnecessarily.
      
      This patch fixes the above two problems by disabling nr_lagging
      counting when latency target percentiles aren't set and blocking vrate
      increases when there are lagging IOs while leaving busy_level as-is.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • T
      iocost: better trace vrate changes · c017dc0a
      Tejun Heo authored
      commit 25d41e4aadb0788b4fae8a8fca90f437b9ebd727 upstream.
      
      vrate_adj tracepoint traces vrate changes; however, it does so only
      when busy_level is non-zero.  busy_level turning to zero can sometimes
      be as interesting an event.  This patch also enables vrate_adj
      tracepoint on other vrate related events - busy_level changes and
      non-zero nr_lagging.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>