1. 08 Dec, 2019 3 commits
    • alios: mm: memcontrol: make distance between wmark_low and wmark_high configurable · 33ef4784
      Yang Shi committed
      Introduce a new interface, wmark_scale_factor, which defines the
      distance between wmark_high and wmark_low.  The unit is in fractions of
      10,000. The default value of 50 means the distance between wmark_high
      and wmark_low is 0.5% of the max limit of the cgroup.  The maximum value
      is 1000, or 10% of the max limit.
      
      The distance between wmark_low and wmark_high affects how aggressively
      memcg kswapd reclaims.
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
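The arithmetic in this commit message can be illustrated with a small user-space sketch. The function name `wmark_gap` and the split multiply/divide (used here only to avoid overflow for large limits) are assumptions for illustration, not the kernel's actual code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: distance between wmark_high and wmark_low, in the
 * same unit as max_limit, for a factor given in fractions of 10,000.
 */
uint64_t wmark_gap(uint64_t max_limit, unsigned int wmark_scale_factor)
{
    /* The commit message states a maximum value of 1000 (10%). */
    assert(wmark_scale_factor <= 1000);

    /* floor(max_limit * factor / 10000) without overflowing 64 bits. */
    return max_limit / 10000 * wmark_scale_factor
         + max_limit % 10000 * wmark_scale_factor / 10000;
}
```

With the default factor of 50, a 1 GiB limit (1073741824 bytes) gives a gap of 5368709 bytes, about 0.5% of the limit.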
    • alios: mm: vmscan: make memcg kswapd set memcg state to dirty or writeback · e10c247b
      Yang Shi committed
      The global kswapd marks a memory node as dirty or writeback when the
      current scan finds that all pages are unqueued dirty or under writeback.
      kswapd then writes out dirty pages or waits for writeback to complete.
      The memcg kswapd behaves like the global kswapd, so it should set the
      dirty or writeback state on the memcg too when the same condition is met.
      
      Since direct reclaim can't write out page cache, the system depends on
      kswapd to write out dirty pages when a scan finds too many of them, in
      order to avoid premature OOM.  But if page cache is dirtied too fast,
      writeback cannot keep up with the dirtying.  Throttling the dirtying of
      pages is the responsibility of dirty page balancing.
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
    • alios: mm: memcontrol: support background async page reclaim · 6b2ef082
      Yang Shi committed
      Currently, when memory usage exceeds the memory cgroup limit, the memory
      cgroup can only do synchronous direct reclaim.  This may cause unexpected
      stalls in latency-sensitive applications.  Introduce a background
      async page reclaim mechanism, similar to what kswapd does.
      
      Define memcg memory usage watermarks by introducing the wmark_ratio
      interface, which ranges from 0 to 100 and represents a percentage of the
      max limit.  wmark_high is calculated as (max * wmark_ratio / 100), and
      wmark_low is (wmark_high - (wmark_high >> 8)), which is an empirical
      value.  If wmark_ratio is 0, which is the default, the watermark is
      disabled and both wmark_low and wmark_high are set to max.
      
      If wmark_ratio is set, then when a page is charged and usage is greater
      than wmark_high, which means the available memory of the memcg is low, a
      work item is scheduled to do background page reclaim until memory usage
      is reduced to wmark_low, if possible.
      
      Define a dedicated unbound workqueue for scheduling watermark reclaim
      work.
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
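The watermark formulas in this commit message can be sketched in user-space C as follows. The names `struct wmarks` and `memcg_wmarks` are hypothetical, and the sketch only mirrors the formulas as described in the message, not the actual kernel implementation. The shift is parenthesized because in C the binary `-` binds tighter than `>>`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two watermarks. */
struct wmarks {
    uint64_t low;
    uint64_t high;
};

struct wmarks memcg_wmarks(uint64_t max, unsigned int wmark_ratio)
{
    struct wmarks w;

    assert(wmark_ratio <= 100);

    /* A ratio of 0 disables the watermark: both marks sit at max. */
    if (wmark_ratio == 0) {
        w.low = max;
        w.high = max;
        return w;
    }

    /* wmark_high = floor(max * wmark_ratio / 100), overflow-safe. */
    w.high = max / 100 * wmark_ratio + max % 100 * wmark_ratio / 100;

    /* wmark_low sits 1/256 of wmark_high below wmark_high. */
    w.low = w.high - (w.high >> 8);
    return w;
}
```

For example, with max = 25600 pages and wmark_ratio = 50, wmark_high is 12800 and wmark_low is 12750, so background reclaim would start above 12800 and aim for 12750.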
  2. 05 Dec, 2019 1 commit
  3. 28 Nov, 2019 7 commits
    • alios: blk-throttle: limit bios to fix amount of pages entering writeback prematurely · 06a67773
      Xiaoguang Wang committed
      Currently in blk_throtl_bio(), if a bio exceeds its throtl_grp's bps
      or iops limit, the bio is queued on the throtl_grp's throtl_service_queue.
      The mm subsystem then keeps submitting more pages even though the
      underlying device cannot handle these io requests, which makes a large
      number of pages enter writeback prematurely.  If a process later writes
      to some of these pages, it will wait for a long time.
      
      I have done some tests: one process does buffered writes to a 1GB file
      with its blkcg max bps limit set to 10MB/s, and I observe this:
      	#cat /proc/meminfo  | grep -i back
      	Writeback:        900024 kB
      	WritebackTmp:          0 kB
      
      I think this Writeback value is far too big: many bios have indeed been
      queued in the throtl_grp's throtl_service_queue.  If a process tries to
      write the page of the last bio in this queue, it calls
      wait_on_page_writeback(page), which must wait for the previous bios to
      finish and takes a long time.  We have also seen 120s hung task warnings
      on our servers.
      
       INFO: task kworker/u128:0:30072 blocked for more than 120 seconds.
             Tainted: G            E 4.9.147-013.ali3000_015_test.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       kworker/u128:0  D    0 30072      2 0x00000000
       Workqueue: writeback wb_workfn (flush-8:16)
        ffff882ddd066b40 0000000000000000 ffff882e5cad3400 ffff882fbe959e80
        ffff882fa50b1a00 ffffc9003a5a3768 ffffffff8173325d ffffc9003a5a3780
        00ff882e5cad3400 ffff882fbe959e80 ffffffff81360b49 ffff882e5cad3400
       Call Trace:
        [<ffffffff8173325d>] ? __schedule+0x23d/0x6d0
        [<ffffffff81360b49>] ? alloc_request_struct+0x19/0x20
        [<ffffffff81733726>] schedule+0x36/0x80
        [<ffffffff81736c56>] schedule_timeout+0x206/0x4b0
        [<ffffffff81036c69>] ? sched_clock+0x9/0x10
        [<ffffffff81363073>] ? get_request+0x403/0x810
        [<ffffffff8110ca10>] ? ktime_get+0x40/0xb0
        [<ffffffff81732f8a>] io_schedule_timeout+0xda/0x170
        [<ffffffff81733f90>] ? bit_wait+0x60/0x60
        [<ffffffff81733fab>] bit_wait_io+0x1b/0x60
        [<ffffffff81733b28>] __wait_on_bit+0x58/0x90
        [<ffffffff811b0d91>] ? find_get_pages_tag+0x161/0x2e0
        [<ffffffff811aff62>] wait_on_page_bit+0x82/0xa0
        [<ffffffff810d47f0>] ? wake_atomic_t_function+0x60/0x60
        [<ffffffffa02fc181>] mpage_prepare_extent_to_map+0x2d1/0x310 [ext4]
        [<ffffffff8121ff65>] ? kmem_cache_alloc+0x185/0x1a0
        [<ffffffffa0305a2f>] ? ext4_init_io_end+0x1f/0x40 [ext4]
        [<ffffffffa0300294>] ext4_writepages+0x404/0xef0 [ext4]
        [<ffffffff81508c64>] ? scsi_init_io+0x44/0x200
        [<ffffffff81398a0f>] ? fprop_fraction_percpu+0x2f/0x80
        [<ffffffff811c139e>] do_writepages+0x1e/0x30
        [<ffffffff8127c0f5>] __writeback_single_inode+0x45/0x320
        [<ffffffff8127c942>] writeback_sb_inodes+0x272/0x600
        [<ffffffff8127cf6b>] wb_writeback+0x10b/0x300
        [<ffffffff8127d884>] wb_workfn+0xb4/0x380
        [<ffffffff810b85e9>] ? try_to_wake_up+0x59/0x3e0
        [<ffffffff810a5759>] process_one_work+0x189/0x420
        [<ffffffff810a5a3e>] worker_thread+0x4e/0x4b0
        [<ffffffff810a59f0>] ? process_one_work+0x420/0x420
        [<ffffffff810ac026>] kthread+0xe6/0x100
        [<ffffffff810abf40>] ? kthread_park+0x60/0x60
        [<ffffffff81738499>] ret_from_fork+0x39/0x50
      
      To fix this issue, we can simply limit the number of bios queued in
      throtl_service_queue.  Currently we limit it to the throtl_grp's bps or
      iops limit; if it is still exceeded, we just sleep for a while.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
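A rough back-of-the-envelope check, using only the figures quoted in this commit message, shows why the queued writeback is a problem. The helper name `drain_seconds` is hypothetical, and the sketch assumes 1 MB = 1024 kB and that the throttle limit is the only bottleneck.

```c
#include <assert.h>

/*
 * Hypothetical helper: whole seconds needed to drain writeback_kb of
 * queued writeback at a throttle limit of bps_limit_mb MB/s.
 */
unsigned long drain_seconds(unsigned long writeback_kb,
                            unsigned long bps_limit_mb)
{
    assert(bps_limit_mb > 0);
    return writeback_kb / (bps_limit_mb * 1024UL);
}
```

With the observed Writeback of 900024 kB and a 10MB/s limit, this gives roughly 87 seconds, the same order of magnitude as the 120-second hung task timeout in the trace.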
    • alios: blk-throttle: fix tg NULL pointer dereference · 4667e926
      Joseph Qi committed
      io throtl stats call blkg_get at the beginning of throttling and then
      blkg_put in the newly introduced bi_tg_end_io.  This causes the blkg to
      be freed if end_io is called twice, as in dm-thin, which saves the
      original end_io first, then calls its overriding end_io followed by the
      saved end_io.  After that, accessing the blkg is invalid and finally
      triggers a BUG:
      
      [ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
      [ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
      [ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
      [ 4417.239232] Oops: 0000 [#1] SMP
      ......
      [ 4417.274070] Call Trace:
      [ 4417.275407]  [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
      [ 4417.276760]  [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
      [ 4417.278079]  [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
      [ 4417.279387]  [<ffffffff81095772>] ? insert_work+0x62/0xa0
      [ 4417.280697]  [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
      [ 4417.282019]  [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
      [ 4417.283326]  [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
      [ 4417.284637]  [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
      [ 4417.285951]  [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
      [ 4417.287240]  [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
      [ 4417.288503]  [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
      [ 4417.289778]  [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
      [ 4417.291062]  [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
      [ 4417.292344]  [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
      [ 4417.293626]  [<ffffffff812c9e61>] submit_bio+0x71/0x150
      [ 4417.294909]  [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
      [ 4417.296195]  [<ffffffff81215acb>] _submit_bh+0x14b/0x220
      [ 4417.297484]  [<ffffffff81215bb0>] submit_bh+0x10/0x20
      [ 4417.298744]  [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
      [ 4417.300014]  [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
      [ 4417.301268]  [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
      [ 4417.302524]  [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
      [ 4417.303753]  [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
      [ 4417.304950]  [<ffffffff8109ffef>] kthread+0xcf/0xe0
      [ 4417.306107]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
      [ 4417.307255]  [<ffffffff81647f18>] ret_from_fork+0x58/0x90
      [ 4417.308349]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
      ......
      
      Now we introduce a new bio flag BIO_THROTL_STATED to make sure
      blkg_get/put only get called once for the same bio.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
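The idea of guarding a get/put pair with a per-bio flag can be modeled in user space as below. This is only one way such a flag could work: the struct and function names are hypothetical stand-ins, the boolean field stands in for BIO_THROTL_STATED, and the actual patch may set and clear the flag at different points.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a blkg with a reference count. */
struct blkg_model {
    int refcnt;
};

/* Hypothetical stand-in for a bio carrying the "stated" flag. */
struct bio_model {
    struct blkg_model *blkg;
    bool throtl_stated;   /* models BIO_THROTL_STATED */
};

/* Take the reference only the first time this bio is accounted. */
void throtl_stat_start(struct bio_model *bio)
{
    if (bio->throtl_stated)
        return;
    bio->throtl_stated = true;
    bio->blkg->refcnt++;
}

/* Drop the reference at most once, even if end_io runs twice. */
void throtl_stat_end(struct bio_model *bio)
{
    if (!bio->throtl_stated)
        return;
    bio->throtl_stated = false;
    assert(bio->blkg->refcnt > 0);
    bio->blkg->refcnt--;
}
```

In this model a second call to `throtl_stat_end()` for the same bio is a no-op, so the reference count never drops below what the owner still holds.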
    • alios: blk-throttle: support io delay stats · 65e6966a
      Joseph Qi committed
      Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to
      get per-cgroup io delay statistics.
      io_service_time represents the time from when an io leaves the throttle
      to its completion, while io_wait_time represents the time spent on the
      throttle queue.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
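The two statistics can be pictured as differences between three timestamps per io. The sketch below is a user-space illustration only; the struct, its field names, and the nanosecond unit are assumptions, not the layout used by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-io timestamps, in nanoseconds. */
struct io_times_ns {
    uint64_t enter_throttle;  /* io arrives at the throttle layer */
    uint64_t leave_throttle;  /* io is dispatched past the throttle */
    uint64_t complete;        /* io completion */
};

/* Time spent waiting on the throttle queue. */
uint64_t io_wait_time(const struct io_times_ns *t)
{
    assert(t->leave_throttle >= t->enter_throttle);
    return t->leave_throttle - t->enter_throttle;
}

/* Time from leaving the throttle to io completion. */
uint64_t io_service_time(const struct io_times_ns *t)
{
    assert(t->complete >= t->leave_throttle);
    return t->complete - t->leave_throttle;
}
```

An io that enters at 100, is dispatched at 350, and completes at 1000 would contribute 250 to the wait time and 650 to the service time.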
    • alios: block: add counter to track io request's d2c time · 07232d74
      Xiaoguang Wang committed
      iostat's await is not good enough: it is somewhat rough and cannot
      show a request's latency on the device driver's side.
      
      Here we add a new counter to track an io request's d2c time.  With this
      patch, we can also easily extend iostat to show this value.
      
      Note:
      I have checked how iostat is implemented: it just reads the fields it
      needs, so iostat won't be affected by this change, and neither will tsar.
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • alios: jbd2: add proc entry to control whether doing buffer copy-out · ac452d09
      Xiaoguang Wang committed
      When jbd2 tries to get write access to a buffer that is under
      writeback with the BH_Shadow flag set, jbd2 waits until the buffer
      has been written to disk.  Sometimes this wait can be very long,
      especially when the disk is almost full.
      
      Add a proc entry "force_copy".  If its value is not zero, jbd2 will
      always do meta buffer copy-out, which eliminates the unnecessary
      waiting time here and reduces long tail latency for buffered writes.
      
      I constructed the following test case:
      
      $cat offline.fio
      ; fio-rand-RW.job for fiotest
      
      [global]
      name=fio-rand-RW
      filename=fio-rand-RW
      rw=randrw
      rwmixread=60
      rwmixwrite=40
      bs=4K
      direct=0
      numjobs=4
      time_based=1
      runtime=900
      
      [file1]
      size=60G
      ioengine=sync
      iodepth=16
      
      $cat online.fio
      ; fio-seq-write.job for fiotest
      
      [global]
      name=fio-seq-write
      filename=fio-seq-write
      rw=write
      bs=256K
      direct=0
      numjobs=1
      time_based=1
      runtime=60
      
      [file1]
      rate=50m
      size=10G
      ioengine=sync
      iodepth=16
      
      With this patch:
      $cat /proc/fs/jbd2/sda5-8/force_copy
      0
      
      The online fio job almost always shows long tail latency like this:
      
      Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=17855: Thu Nov 15 09:45:57 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=135, max=4086.6k, avg=867.21, stdev=50338.22
           lat (usec): min=139, max=4086.6k, avg=871.16, stdev=50338.22
          clat percentiles (usec):
           |  1.00th=[    141],  5.00th=[    143], 10.00th=[    145],
           | 20.00th=[    147], 30.00th=[    147], 40.00th=[    149],
           | 50.00th=[    149], 60.00th=[    151], 70.00th=[    153],
           | 80.00th=[    155], 90.00th=[    159], 95.00th=[    163],
           | 99.00th=[    255], 99.50th=[    273], 99.90th=[    429],
           | 99.95th=[    441], 99.99th=[3640656]
      
      $cat /proc/fs/jbd2/sda5-8/force_copy
      1
      
      online fio latency is much better.
      
      Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=8084: Thu Nov 15 09:31:15 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=137, max=545, avg=151.35, stdev=16.22
           lat (usec): min=140, max=548, avg=155.31, stdev=16.65
          clat percentiles (usec):
           |  1.00th=[  143],  5.00th=[  145], 10.00th=[  145], 20.00th=[  147],
           | 30.00th=[  147], 40.00th=[  147], 50.00th=[  149], 60.00th=[  149],
           | 70.00th=[  151], 80.00th=[  155], 90.00th=[  157], 95.00th=[  161],
           | 99.00th=[  239], 99.50th=[  269], 99.90th=[  420], 99.95th=[  429],
           | 99.99th=[  537]
      
      As to the cost: because we always need to copy the meta buffer, this
      consumes a little cpu time and some memory (at most 32MB for a 128MB
      journal).
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • alios: fs,ext4: remove projid limit when create hard link · 28df06b3
      zhangliguang committed
      This is a temporary workaround to avoid the limitation on creating
      a hard link across two projids.
      Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • alios: jbd2: create jbd2-ckpt thread for journal checkpoint · c31b17e5
      Joseph Qi committed
      Do jbd2 checkpoint in a dedicated kernel thread, so that
      checkpoint won't be under io throttle control.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed-by: Baoyou Xie <baoyou.xie@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
  4. 20 Nov, 2019 28 commits
  5. 19 Nov, 2019 1 commit
    • block: fix .bi_size overflow · a74e2556
      Ming Lei committed
      commit 79d08f89bb1b5c2c1ff90d9bb95497ab9e8aa7e0 upstream
      
      'bio->bi_iter.bi_size' is 'unsigned int', which can hold at most
      4G - 1 bytes.
      
      Before 07173c3ec276 ("block: enable multipage bvecs"), one bio could
      include only a limited number of pages, usually at most 256, so the fs
      bio size wouldn't be bigger than 1M bytes most of the time.
      
      Since we support multi-page bvecs, in theory more than 1M pages can be
      added to one fs bio, especially in the case of hugepages or big
      writeback with many dirty pages.  Then there is a chance that .bi_size
      overflows.
      
      Fix this issue by using bio_full() to check whether the added segment
      may overflow .bi_size.
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com>
      Cc: kernel test robot <rong.a.chen@intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: linux-xfs@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 07173c3ec276 ("block: enable multipage bvecs")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
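The overflow check described in this commit can be modeled in user space as follows. The struct and function names are hypothetical stand-ins and the sketch is simplified from the description above rather than copied from the kernel. The key point is that the comparison is written as `bi_size > UINT_MAX - len`, so the check itself cannot wrap around.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the relevant bio fields. */
struct bio_model {
    unsigned int bi_size;        /* bytes already in the bio */
    unsigned short bi_vcnt;      /* bvecs in use */
    unsigned short bi_max_vecs;  /* bvec capacity */
};

/* Return true if adding len more bytes must be refused. */
bool bio_model_full(const struct bio_model *bio, unsigned int len)
{
    assert(bio);

    /* No free bvec slot left. */
    if (bio->bi_vcnt >= bio->bi_max_vecs)
        return true;

    /* Adding len would push bi_size past UINT_MAX. */
    if (bio->bi_size > UINT_MAX - len)
        return true;

    return false;
}
```

For scale: 1M pages of 4 KiB each is exactly 4 GiB, one byte more than an `unsigned int` can represent when it is 32 bits wide, which is the case the commit message describes.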