1. 08 Dec 2019, 3 commits
    • alios: mm: memcontrol: make distance between wmark_low and wmark_high configurable · 33ef4784
      Committed by Yang Shi
      Introduce a new interface, wmark_scale_factor, which defines the
      distance between wmark_high and wmark_low.  The unit is in fractions of
      10,000. The default value of 50 means the distance between wmark_high
      and wmark_low is 0.5% of the max limit of the cgroup.  The maximum value
      is 1000, or 10% of the max limit.
      
      The distance between wmark_low and wmark_high has an impact on how hard
      memcg kswapd reclaims.
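
      As a rough illustration of the arithmetic this implies (not the actual
      memcontrol.c code; the helper name is made up):

      /* Illustrative only: the gap between wmark_high and wmark_low for a
       * given max limit, with wmark_scale_factor in units of 1/10000. */
      static unsigned long wmark_gap(unsigned long max, unsigned int wmark_scale_factor)
      {
              return max / 10000 * wmark_scale_factor;  /* 50 -> 0.5%, 1000 -> 10% */
      }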
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      33ef4784
    • alios: mm: vmscan: make memcg kswapd set memcg state to dirty or writeback · e10c247b
      Committed by Yang Shi
      The global kswapd can mark a memory node as dirty or writeback if the
      current scan finds that all pages are unqueued dirty or under writeback;
      kswapd then writes out the dirty pages or waits for writeback to finish.
      The memcg kswapd behaves like the global kswapd, so it should set the
      dirty or writeback state on the memcg too when the same condition is met.

      Since direct reclaim can't write out page cache, the system depends on
      kswapd to write out dirty pages when a scan finds too many of them, in
      order to avoid premature OOM.  But if the page cache is dirtied too
      fast, writeback can't keep up with the dirtying; it is the
      responsibility of dirty page balancing to throttle the dirtiers.
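
      Schematically, the memcg-side check mirrors what the global kswapd does
      for the node; the flag and helper names below are placeholders for
      illustration only, not the patch's actual identifiers:

      /* Sketch: if an entire memcg LRU scan was unqueued dirty or under
       * writeback, record that state so memcg kswapd can write out pages or
       * wait, like the global kswapd does for the pgdat. */
      if (current_is_kswapd() && !global_reclaim(sc)) {
              if (nr_unqueued_dirty == nr_scanned)
                      memcg_set_reclaim_state(memcg, MEMCG_DIRTY);      /* hypothetical */
              if (nr_writeback == nr_scanned)
                      memcg_set_reclaim_state(memcg, MEMCG_WRITEBACK);  /* hypothetical */
      }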
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      e10c247b
    • alios: mm: memcontrol: support background async page reclaim · 6b2ef082
      Committed by Yang Shi
      Currently, when memory usage exceeds the memory cgroup limit, the memory
      cgroup can only do synchronous direct reclaim.  This may incur
      unexpected stalls in applications that are sensitive to latency.
      Introduce a background asynchronous page reclaim mechanism, similar to
      what kswapd does.

      Define the memcg memory usage watermark by introducing the wmark_ratio
      interface, which ranges from 0 to 100 and represents a percentage of the
      max limit.  wmark_high is calculated as (max * wmark_ratio / 100), and
      wmark_low is (wmark_high - (wmark_high >> 8)), which is an empirical
      value.  If wmark_ratio is 0 (the default), the watermark is disabled and
      both wmark_low and wmark_high are set to max.

      If wmark_ratio is set, then when charging a page, if usage is greater
      than wmark_high (which means the available memory of the memcg is low),
      a work item is scheduled to do background page reclaim until usage is
      reduced to wmark_low if possible.

      Define a dedicated unbound workqueue for scheduling watermark reclaim
      work items.
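
      The watermark arithmetic described above, as a standalone sketch
      (illustrative only, not the actual memcontrol.c code):

      static void calc_wmarks(unsigned long max, unsigned int wmark_ratio,
                              unsigned long *wmark_low, unsigned long *wmark_high)
      {
              if (!wmark_ratio) {                     /* 0 (default): watermarks disabled */
                      *wmark_low = *wmark_high = max;
                      return;
              }
              *wmark_high = max / 100 * wmark_ratio;  /* percentage of the max limit */
              *wmark_low = *wmark_high - (*wmark_high >> 8);  /* empirical gap */
      }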
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      6b2ef082
  2. 05 Dec 2019, 1 commit
  3. 28 Nov 2019, 7 commits
    • alios: blk-throttle: limit bios to fix amount of pages entering writeback prematurely · 06a67773
      Committed by Xiaoguang Wang
      Currently in blk_throtl_bio(), if a bio exceeds its throtl_grp's bps or
      iops limit, the bio is queued on the throtl_grp's throtl_service_queue.
      The mm subsystem then keeps submitting more pages even though the
      underlying device cannot handle the io requests, which makes a large
      number of pages enter writeback prematurely; if some process later
      writes one of these pages, it will wait for a long time.
      
      I have done some tests: one process does buffered writes to a 1GB file
      with its blkcg max bps limit set to 10MB/s, and I observe this:
      	#cat /proc/meminfo  | grep -i back
      	Writeback:        900024 kB
      	WritebackTmp:          0 kB
      
      This Writeback value is just too big: many bios have been queued in the
      throtl_grp's throtl_service_queue, and if a process tries to write the
      page of the last bio in this queue, it will call
      wait_on_page_writeback(page), which must wait for all the previous bios
      to finish and can take a long time.  We have also seen 120s hung task
      warnings on our servers.
      
       INFO: task kworker/u128:0:30072 blocked for more than 120 seconds.
             Tainted: G            E 4.9.147-013.ali3000_015_test.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       kworker/u128:0  D    0 30072      2 0x00000000
       Workqueue: writeback wb_workfn (flush-8:16)
        ffff882ddd066b40 0000000000000000 ffff882e5cad3400 ffff882fbe959e80
        ffff882fa50b1a00 ffffc9003a5a3768 ffffffff8173325d ffffc9003a5a3780
        00ff882e5cad3400 ffff882fbe959e80 ffffffff81360b49 ffff882e5cad3400
       Call Trace:
        [<ffffffff8173325d>] ? __schedule+0x23d/0x6d0
        [<ffffffff81360b49>] ? alloc_request_struct+0x19/0x20
        [<ffffffff81733726>] schedule+0x36/0x80
        [<ffffffff81736c56>] schedule_timeout+0x206/0x4b0
        [<ffffffff81036c69>] ? sched_clock+0x9/0x10
        [<ffffffff81363073>] ? get_request+0x403/0x810
        [<ffffffff8110ca10>] ? ktime_get+0x40/0xb0
        [<ffffffff81732f8a>] io_schedule_timeout+0xda/0x170
        [<ffffffff81733f90>] ? bit_wait+0x60/0x60
        [<ffffffff81733fab>] bit_wait_io+0x1b/0x60
        [<ffffffff81733b28>] __wait_on_bit+0x58/0x90
        [<ffffffff811b0d91>] ? find_get_pages_tag+0x161/0x2e0
        [<ffffffff811aff62>] wait_on_page_bit+0x82/0xa0
        [<ffffffff810d47f0>] ? wake_atomic_t_function+0x60/0x60
        [<ffffffffa02fc181>] mpage_prepare_extent_to_map+0x2d1/0x310 [ext4]
        [<ffffffff8121ff65>] ? kmem_cache_alloc+0x185/0x1a0
        [<ffffffffa0305a2f>] ? ext4_init_io_end+0x1f/0x40 [ext4]
        [<ffffffffa0300294>] ext4_writepages+0x404/0xef0 [ext4]
        [<ffffffff81508c64>] ? scsi_init_io+0x44/0x200
        [<ffffffff81398a0f>] ? fprop_fraction_percpu+0x2f/0x80
        [<ffffffff811c139e>] do_writepages+0x1e/0x30
        [<ffffffff8127c0f5>] __writeback_single_inode+0x45/0x320
        [<ffffffff8127c942>] writeback_sb_inodes+0x272/0x600
        [<ffffffff8127cf6b>] wb_writeback+0x10b/0x300
        [<ffffffff8127d884>] wb_workfn+0xb4/0x380
        [<ffffffff810b85e9>] ? try_to_wake_up+0x59/0x3e0
        [<ffffffff810a5759>] process_one_work+0x189/0x420
        [<ffffffff810a5a3e>] worker_thread+0x4e/0x4b0
        [<ffffffff810a59f0>] ? process_one_work+0x420/0x420
        [<ffffffff810ac026>] kthread+0xe6/0x100
        [<ffffffff810abf40>] ? kthread_park+0x60/0x60
        [<ffffffff81738499>] ret_from_fork+0x39/0x50
      
      To fix this issue, we can simply limit the throtl_service_queue's
      maximum number of queued bios.  Currently we limit it to the
      throtl_grp's bps limit or iops limit; if the limit is still exceeded, we
      just sleep for a while.
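
      The idea, in rough pseudocode (the helpers below are hypothetical, not
      the patch's actual code):

      /* Cap the number of bios parked on the service queue at roughly the
       * group's per-second bps/iops budget; if the cap is still exceeded,
       * let the submitter sleep until the queue drains. */
      while (tg_queued_bios(tg) >= tg_queued_limit(tg))
              msleep(10);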
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      06a67773
    • alios: blk-throttle: fix tg NULL pointer dereference · 4667e926
      Committed by Joseph Qi
      The io throttle stats code calls blkg_get at the beginning of throttling
      and blkg_put in the newly introduced bi_tg_end_io.  This causes the blkg
      to be freed if end_io is called twice, as dm-thin does: it saves the
      original end_io first, calls its own overriding end_io, and then the
      saved end_io.  After that, accessing the blkg is invalid and finally
      triggers a BUG:
      
      [ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
      [ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
      [ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
      [ 4417.239232] Oops: 0000 [#1] SMP
      ......
      [ 4417.274070] Call Trace:
      [ 4417.275407]  [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
      [ 4417.276760]  [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
      [ 4417.278079]  [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
      [ 4417.279387]  [<ffffffff81095772>] ? insert_work+0x62/0xa0
      [ 4417.280697]  [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
      [ 4417.282019]  [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
      [ 4417.283326]  [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
      [ 4417.284637]  [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
      [ 4417.285951]  [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
      [ 4417.287240]  [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
      [ 4417.288503]  [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
      [ 4417.289778]  [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
      [ 4417.291062]  [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
      [ 4417.292344]  [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
      [ 4417.293626]  [<ffffffff812c9e61>] submit_bio+0x71/0x150
      [ 4417.294909]  [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
      [ 4417.296195]  [<ffffffff81215acb>] _submit_bh+0x14b/0x220
      [ 4417.297484]  [<ffffffff81215bb0>] submit_bh+0x10/0x20
      [ 4417.298744]  [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
      [ 4417.300014]  [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
      [ 4417.301268]  [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
      [ 4417.302524]  [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
      [ 4417.303753]  [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
      [ 4417.304950]  [<ffffffff8109ffef>] kthread+0xcf/0xe0
      [ 4417.306107]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
      [ 4417.307255]  [<ffffffff81647f18>] ret_from_fork+0x58/0x90
      [ 4417.308349]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
      ......
      
      Now we introduce a new bio flag, BIO_THROTL_STATED, to make sure
      blkg_get/blkg_put are only called once for the same bio.
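
      The intent, roughly (simplified; the end_io handler name is assumed for
      illustration):

      /* get side: take the blkg reference and hook end_io only once per bio */
      if (!bio_flagged(bio, BIO_THROTL_STATED)) {
              blkg_get(tg_to_blkg(tg));
              bio_set_flag(bio, BIO_THROTL_STATED);
              bio->bi_tg_end_io = bio->bi_end_io;
              bio->bi_end_io = blk_throtl_bio_endio;   /* name assumed */
      }

      /* put side: drop the reference only once, even if end_io runs twice */
      if (bio_flagged(bio, BIO_THROTL_STATED)) {
              blkg_put(tg_to_blkg(tg));
              bio_clear_flag(bio, BIO_THROTL_STATED);
      }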
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      4667e926
    • alios: blk-throttle: support io delay stats · 65e6966a
      Committed by Joseph Qi
      Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to
      provide per-cgroup io delay statistics.
      io_service_time represents the time spent from io throttle dispatch to
      io completion, while io_wait_time represents the time spent on the
      throttle queue.
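
      In other words (an illustrative definition only, not the patch's code):

      /* How the two metrics relate to a bio's timeline. */
      struct bio_times { unsigned long long submit_ns, dispatch_ns, complete_ns; };

      static void account_delays(const struct bio_times *t,
                                 unsigned long long *io_wait_time,
                                 unsigned long long *io_service_time)
      {
              *io_wait_time    += t->dispatch_ns - t->submit_ns;   /* time on throttle queue */
              *io_service_time += t->complete_ns - t->dispatch_ns; /* dispatch to completion */
      }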
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      65e6966a
    • alios: block: add counter to track io request's d2c time · 07232d74
      Committed by Xiaoguang Wang
      The await metric reported by the iostat tool is not good enough: it is
      somewhat coarse and cannot show a request's latency on the device
      driver's side.

      Here we add a new counter to track an io request's d2c
      (dispatch-to-complete) time; with this patch, we can also extend iostat
      to show this value easily.

      Note:
      I have checked how iostat is implemented; it just reads the fields it
      needs, so iostat won't be affected by this change, and neither will
      tsar.
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      07232d74
    • alios: jbd2: add proc entry to control whether doing buffer copy-out · ac452d09
      Committed by Xiaoguang Wang
      When jbd2 tries to get write access to a buffer that is under writeback
      with the BH_Shadow flag set, it waits until the buffer has been written
      to disk, but sometimes that wait can be very long, especially when the
      disk is almost full.

      Here we add a proc entry "force_copy"; if its value is non-zero, jbd2
      will always do a metadata buffer copy-out, eliminating the unnecessary
      waiting and reducing long-tail latency for buffered writes.
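
      Schematically, the change in the jbd2 write-access path looks like this
      (a sketch only; the j_force_copy field name is assumed for
      illustration):

      /* When force_copy is enabled, take the copy-out path instead of
       * sleeping until the shadow buffer has hit the disk. */
      if (buffer_shadow(bh)) {
              if (journal->j_force_copy)      /* field name assumed */
                      need_copy_out = true;   /* copy into a frozen buffer */
              else
                      wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
      }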
      
      I constructed the following test case:
      
      $cat offline.fio
      ; fio-rand-RW.job for fiotest
      
      [global]
      name=fio-rand-RW
      filename=fio-rand-RW
      rw=randrw
      rwmixread=60
      rwmixwrite=40
      bs=4K
      direct=0
      numjobs=4
      time_based=1
      runtime=900
      
      [file1]
      size=60G
      ioengine=sync
      iodepth=16
      
      $cat online.fio
      ; fio-seq-write.job for fiotest
      
      [global]
      name=fio-seq-write
      filename=fio-seq-write
      rw=write
      bs=256K
      direct=0
      numjobs=1
      time_based=1
      runtime=60
      
      [file1]
      rate=50m
      size=10G
      ioengine=sync
      iodepth=16
      
      With this patch:
      $cat /proc/fs/jbd2/sda5-8/force_copy
      0
      
      online fio almost always gets long tail latency like this:
      
      Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=17855: Thu Nov 15 09:45:57 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=135, max=4086.6k, avg=867.21, stdev=50338.22
           lat (usec): min=139, max=4086.6k, avg=871.16, stdev=50338.22
          clat percentiles (usec):
           |  1.00th=[    141],  5.00th=[    143], 10.00th=[    145],
           | 20.00th=[    147], 30.00th=[    147], 40.00th=[    149],
           | 50.00th=[    149], 60.00th=[    151], 70.00th=[    153],
           | 80.00th=[    155], 90.00th=[    159], 95.00th=[    163],
           | 99.00th=[    255], 99.50th=[    273], 99.90th=[    429],
           | 99.95th=[    441], 99.99th=[3640656]
      
      $cat /proc/fs/jbd2/sda5-8/force_copy
      1
      
      online fio latency is much better.
      
      Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=8084: Thu Nov 15 09:31:15 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=137, max=545, avg=151.35, stdev=16.22
           lat (usec): min=140, max=548, avg=155.31, stdev=16.65
          clat percentiles (usec):
            |  1.00th=[  143],  5.00th=[  145], 10.00th=[  145], 20.00th=[  147],
            | 30.00th=[  147], 40.00th=[  147], 50.00th=[  149], 60.00th=[  149],
            | 70.00th=[  151], 80.00th=[  155], 90.00th=[  157], 95.00th=[  161],
            | 99.00th=[  239], 99.50th=[  269], 99.90th=[  420], 99.95th=[  429],
            | 99.99th=[  537]
      
      As to the cost: because we always need to copy the metadata buffer, this
      consumes a little extra cpu time and some memory (at most 32MB for a
      128MB journal size).
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      ac452d09
    • alios: fs,ext4: remove projid limit when create hard link · 28df06b3
      Committed by zhangliguang
      This is a temporary workaround to avoid the limitation when creating a
      hard link across two projids.
      Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      28df06b3
    • alios: jbd2: create jbd2-ckpt thread for journal checkpoint · c31b17e5
      Committed by Joseph Qi
      This does the jbd2 checkpoint in a dedicated kernel thread, so that
      checkpointing is not under io throttle control.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed-by: Baoyou Xie <baoyou.xie@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      c31b17e5
  4. 20 Nov 2019, 22 commits
  5. 19 Nov 2019, 3 commits
    • block: fix .bi_size overflow · a74e2556
      Committed by Ming Lei
      commit 79d08f89bb1b5c2c1ff90d9bb95497ab9e8aa7e0 upstream
      
      'bio->bi_iter.bi_size' is 'unsigned int', which can hold at most 4G - 1
      bytes.

      Before 07173c3ec276 ("block: enable multipage bvecs"), one bio could
      include only a limited number of pages, usually at most 256, so an fs
      bio would not be bigger than 1M bytes most of the time.

      Since we support multi-page bvecs, in theory more than 1M pages really
      can be added to one fs bio, especially with hugepages or big writeback
      with too many dirty pages, so there is a chance that .bi_size
      overflows.

      Fix this issue by using bio_full() to check whether the added segment
      may overflow .bi_size.
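
      The upstream helper is roughly:

      static inline bool bio_full(struct bio *bio, unsigned len)
      {
              if (bio->bi_vcnt >= bio->bi_max_vecs)
                      return true;
              if (bio->bi_iter.bi_size > UINT_MAX - len)
                      return true;    /* adding len more bytes would overflow bi_size */
              return false;
      }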
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com>
      Cc: kernel test robot <rong.a.chen@intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: linux-xfs@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 07173c3ec276 ("block: enable multipage bvecs")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      a74e2556
    • mm, swap: fix race between swapoff and some swap operations · 73c29467
      Committed by Huang Ying
      commit eb085574a7526c4375965c5fbf7e5b0c19cdd336 upstream.
      Change SWP_VALID to (1 << 12).
      
      When swapin is performed, after getting the swap entry information from
      the page table, the system will swap in the entry without holding any
      lock that prevents the swap device from being swapped off.  This may
      cause a race like the one below:
      
      CPU 1				CPU 2
      -----				-----
      				do_swap_page
      				  swapin_readahead
      				    __read_swap_cache_async
      swapoff				      swapcache_prepare
        p->swap_map = NULL		        __swap_duplicate
      					  p->swap_map[?] /* !!! NULL pointer access */
      
      Because swapoff is usually done only at system shutdown, the race may
      not hit many people in practice, but it is still a race that needs to be
      fixed.

      To fix the race, get_swap_device() is added to check whether the
      specified swap entry is valid in its swap device.  If so, it will keep
      the swap entry valid by preventing the swap device from being swapped
      off, until put_swap_device() is called.

      Because swapoff() is a very rare code path, to make the normal path run
      as fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are
      used instead of a reference count to implement get/put_swap_device().
      From get_swap_device() to put_swap_device(), the RCU read side is
      locked, so synchronize_rcu() in swapoff() will wait until
      put_swap_device() is called.
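
      The resulting caller pattern is roughly:

      struct swap_info_struct *si;

      si = get_swap_device(entry);            /* enters the RCU read side, validates entry */
      if (si) {
              /* safe to access si->swap_map, the swap cache, etc. here */
              put_swap_device(si);            /* leaves the RCU read side */
      } else {
              /* raced with swapoff; the entry is no longer valid */
      }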
      
      In addition to the swap_map, cluster_info, etc. data structures in
      struct swap_info_struct, the swap cache radix tree is freed after
      swapoff, so this patch also fixes the race between swap cache lookup and
      swapoff.

      Races between some other swap cache usages and swapoff are fixed too,
      by calling synchronize_rcu() between clearing PageSwapCache() and
      freeing the swap cache data structure.
      
      Another possible method to fix this is to use preempt_off() +
      stop_machine() to prevent the swap device from being swapped off while
      its data structures are being accessed.  The hot-path overhead of both
      methods is similar.  The advantages of the RCU-based method are:
      
      1. stop_machine() may disturb the normal execution code path on other
         CPUs.
      
      2. File cache uses RCU to protect its radix tree.  If the similar
         mechanism is used for swap cache too, it is easier to share code
         between them.
      
      3. RCU is used to protect swap cache in total_swapcache_pages() and
         exit_swap_address_space() already.  The two mechanisms can be
         merged to simplify the logic.
      
      Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
      Fixes: 235b6217 ("mm/swap: add cluster lock")
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Not-nacked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      73c29467
    • vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n · e431b612
      Committed by Wei Yang
      commit 8b09549c2bfd9f3f8f4cdad74107ef4f4ff9cdd7 upstream.
      
      Commit fa5e084e ("vmscan: do not unconditionally treat zones that
      fail zone_reclaim() as full") changed the return value of
      node_reclaim().  After that commit, the former return value 0 means
      NODE_RECLAIM_SOME.

      However, the return value of node_reclaim() when CONFIG_NUMA is n was
      not changed accordingly, which leads to zone_watermark_ok() being
      called again.

      This patch fixes the return value by adjusting it to NODE_RECLAIM_NOSCAN.
      Since node_reclaim() is only called from page_alloc.c, move its
      declaration to mm/internal.h.
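
      After the move, the !CONFIG_NUMA stub in mm/internal.h is roughly:

      #ifdef CONFIG_NUMA
      extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
      #else
      static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
                                     unsigned int order)
      {
              return NODE_RECLAIM_NOSCAN;     /* was 0, i.e. NODE_RECLAIM_SOME */
      }
      #endif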
      
      Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      e431b612
  6. 07 Nov 2019, 4 commits
    • blkcg: implement blk-iocost · abd942b2
      Committed by Tejun Heo
      commit 7caa47151ab2e644dd221f741ec7578d9532c9a3 upstream.
      
      This patchset implements an IO-cost-model-based, work-conserving
      proportional controller.
      
      While io.latency provides the capability to comprehensively prioritize
      and protect IOs depending on the cgroups, its protection is binary -
      the lowest latency target cgroup which is suffering is protected at
      the cost of all others.  In many use cases including stacking multiple
      workload containers in a single system, it's necessary to distribute
      IO capacity with better granularity.
      
      One challenge of controlling IO resources is the lack of trivially
      observable cost metric.  The most common metrics - bandwidth and iops
      - can be off by orders of magnitude depending on the device type and
      IO pattern.  However, the cost isn't a complete mystery.  Given
      several key attributes, we can make fairly reliable predictions on how
      expensive a given stream of IOs would be, at least compared to other
      IO patterns.
      
      The function which determines the cost of a given IO is the IO cost
      model for the device.  This controller distributes IO capacity based
      on the costs estimated by such model.  The more accurate the cost
      model the better but the controller adapts based on IO completion
      latency and as long as the relative costs across different IO
      patterns are consistent and sensible, it'll adapt to the actual
      performance of the device.
      
      Currently, the only implemented cost model is a simple linear one with
      a few sets of default parameters for different classes of device.
      This covers most common devices reasonably well.  All the
      infrastructure to tune and add different cost models is already in
      place and a later patch will also allow using bpf progs for cost
      models.
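
      For intuition, a linear model of this shape can be sketched as follows
      (the structure and coefficients here are hypothetical, not blk-iocost's
      actual parameters or code):

      struct linear_cost { unsigned long long seq_base, rand_base, per_page; };

      static unsigned long long io_abs_cost(const struct linear_cost *c,
                                            int is_random, unsigned long long pages)
      {
              unsigned long long base = is_random ? c->rand_base : c->seq_base;
              return base + c->per_page * pages;   /* per-IO base cost plus size-proportional cost */
      }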
      
      Please see the top comment in blk-iocost.c and documentation for
      more details.
      
      v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
          for a divide-by-zero bug in current_hweight() triggered by zero
          inuse_sum.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      [Joseph: fix conflicts with ioc_rqos_throttle()]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      abd942b2
    • cgroup: add cgroup_parse_float() · cb2f6e75
      Committed by Tejun Heo
      commit a5e112e6424adb77d953eac20e6936b952fd6b32 upstream.
      
      cgroup already uses floating point for percent[ile] numbers and there
      are several controllers which want to take them as input.  Add a
      generic parse helper to handle inputs.
      
      Update the interface convention documentation about the use of
      percentage numbers.  While at it, also clarify the default time unit.
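
      A usage sketch, assuming the helper takes the input string, the number
      of fractional digits to keep, and an output value scaled accordingly:

      s64 v;

      /* parse "12.34" keeping two fractional digits: v becomes 1234 */
      if (cgroup_parse_float("12.34", 2, &v))
              return -EINVAL;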
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      cb2f6e75
    • blk-mq: add optional request->alloc_time_ns · 4edb756c
      Committed by Tejun Heo
      commit 6f816b4b746c2241540e537682d30d8e9997d674 upstream.
      
      There are currently two start time timestamps - start_time_ns and
      io_start_time_ns.  The former marks the request allocation and the
      latter the issue-to-device time.  The planned io.weight controller needs
      to measure the total time bios take to execute after it leaves rq_qos
      including the time spent waiting for request to become available,
      which can easily dominate on saturated devices.
      
      This patch adds request->alloc_time_ns which records when the request
      allocation attempt started.  As it isn't used for the usual stats,
      make it optional behind CONFIG_BLK_RQ_ALLOC_TIME and
      QUEUE_FLAG_RQ_ALLOC_TIME so that it can be compiled out when there are
      no users and it's active only on queues which need it even when
      compiled in.
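
      Schematically, the gating looks like this (simplified from the patch):

      #ifdef CONFIG_BLK_RQ_ALLOC_TIME
              /* only pay for the extra timestamp on queues that asked for it */
              if (blk_queue_rq_alloc_time(rq->q))
                      rq->alloc_time_ns = alloc_time_ns;
      #endif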
      
      v2: s/pre_start_time/alloc_time/ and add CONFIG_BLK_RQ_ALLOC_TIME
          gating as suggested by Jens.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      4edb756c
    • blkcg: separate blkcg_conf_get_disk() out of blkg_conf_prep() · 5525bb81
      Committed by Tejun Heo
      commit 015d254cb02b6d8eec4b3366274bf4672f9e0b64 upstream.
      
      Separate out blkcg_conf_get_disk() so that it can be used by blkcg
      policy interface file input parsers before the policy is actually
      enabled.  This doesn't introduce any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      5525bb81