1. 17 Jan 2020 (19 commits)
2. 15 Jan 2020 (8 commits)
• alinux: ovl: implement async IO routines · 3e5dd02b
  Jiufei Xue committed
A performance regression has been observed since Linux v4.19 when
running aio tests with fio at iodepth 128 on overlayfs: the queue
depth of the underlying device stays at 1, which is unexpected.

Investigation shows that commit 16914e6f ("ovl: add ovl_read_iter()")
and commit 2a92e07e ("ovl: add ovl_write_iter()") use
do_iter_readv_writev() to submit requests to the underlying
filesystem; async IOs are converted to sync IOs there, causing the
regression.
      
      So implement async IO for stacked reading and writing.
      
      Changes since v1:
        - add a cleanup helper for completion/error handling
        - handle the case when aio_req allocation failed
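
The shape of the fix, as a hedged sketch (the ovl_aio_req container and
cleanup helper are named after this changelog, but the exact fields and
callbacks are assumptions, not the literal patch): wrap the caller's
kiocb in a per-request container, submit a cloned kiocb to the real
underlying file, and complete the original iocb from a ki_complete
callback instead of waiting synchronously.

/* Hedged sketch; field layout and helper names are assumptions. */
struct ovl_aio_req {
        struct kiocb iocb;           /* clone submitted to the real fs */
        struct kiocb *orig_iocb;     /* caller's iocb on the overlay file */
        struct fd fd;                /* pins the underlying real file */
};

static void ovl_aio_cleanup_handler(struct ovl_aio_req *aio_req)
{
        /* drop the pinned real file and free the request */
        fdput(aio_req->fd);
        kfree(aio_req);
}

static void ovl_aio_rw_complete(struct kiocb *iocb, long res, long res2)
{
        struct ovl_aio_req *aio_req =
                container_of(iocb, struct ovl_aio_req, iocb);
        struct kiocb *orig_iocb = aio_req->orig_iocb;

        ovl_aio_cleanup_handler(aio_req);
        /* propagate completion to the iocb the application submitted */
        orig_iocb->ki_complete(orig_iocb, res, res2);
}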
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
• alinux: vfs: add vfs_iocb_iter_[read|write] helper functions · 6011bef7
  Jiufei Xue committed
This doesn't cause any behavior change; it will be used by the
overlayfs async IO implementation.
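
A hedged sketch of the likely helper shape (the exact signature in this
tree is an assumption based on the name): it mirrors vfs_iter_read(),
but takes a caller-supplied kiocb, so a stacked filesystem can pass
down an iocb carrying its own ki_complete callback and get
-EIOCBQUEUED back for async submission.

/* Sketch only: like vfs_iter_read(), but the caller owns the kiocb. */
ssize_t vfs_iocb_iter_read(struct file *file, struct kiocb *iocb,
                           struct iov_iter *iter)
{
        ssize_t ret;

        if (!file->f_op->read_iter)
                return -EINVAL;

        ret = rw_verify_area(READ, file, &iocb->ki_pos,
                             iov_iter_count(iter));
        if (ret < 0)
                return ret;

        ret = file->f_op->read_iter(iocb, iter);
        if (ret > 0)
                fsnotify_access(file);
        return ret;
}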
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
• alinux: fuse: add sysfs api to flush processing queue requests · fc0a9b55
  Ma Jie Yue committed
The failover of a fuse userspace daemon reuses the existing fuse
connection, without unmounting it, during the crash-and-recovery
procedure. But when the crash happens, some requests may still be in
process inside the daemon with no reply sent out yet. This hangs the
application, since it will never get a reply after the failover.

Add a sysfs API to flush these requests after the daemon crashes and
before it recovers. The issue is easy to reproduce in a fuse userspace
daemon: simply exit after receiving a request and before sending the
reply back. The application then hangs in a read/write operation until
echo 1 > /sys/fs/fuse/connection/xxx/flush is issued. The flush
operation fails the in-flight IO and returns the error to the
application.
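
A plausible shape for the flush hook, modeled on how fuse's abort path
ends requests (a hedged sketch: fuse_flush_processing, the queue
fields and the locking here are assumptions, not the literal patch):

/* Fail every request already handed to the crashed daemon so that
 * blocked applications get -EIO instead of waiting forever. */
static void fuse_flush_processing(struct fuse_conn *fc)
{
        struct fuse_req *req, *next;
        LIST_HEAD(to_end);

        spin_lock(&fc->lock);
        list_for_each_entry_safe(req, next, &fc->processing, list) {
                req->out.h.error = -EIO;        /* returned to the app */
                list_move(&req->list, &to_end);
        }
        spin_unlock(&fc->lock);

        while (!list_empty(&to_end)) {
                req = list_first_entry(&to_end, struct fuse_req, list);
                list_del_init(&req->list);
                request_end(fc, req);   /* wakes the waiting application */
        }
}

The sysfs store handler behind echo 1 > .../flush would then just look
up the matching connection and call fuse_flush_processing() on it.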
Signed-off-by: Ma Jie Yue <majieyue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
• alinux: jbd2: add proc entry to control whether doing buffer copy-out · 1ced8a5c
  Xiaoguang Wang committed
When jbd2 tries to get write access to a buffer that is under
writeback with the BH_Shadow flag set, it waits until that buffer has
been written to disk, and this wait can be quite long, especially when
the disk is almost full.

Add a proc entry "force-copy": when its value is non-zero, jbd2 will
always do a metadata buffer copy-out, eliminating the unnecessary
waiting and reducing long-tail latency for buffered writes.
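
In do_get_write_access() terms, the knob turns the "wait for BH_Shadow
to clear" path into an immediate copy-out, roughly as below (a hedged
sketch: j_force_copy and its wiring to the proc entry are assumptions):

/*
 * Buffer is being written out by the committing transaction
 * (BH_Shadow set): either sleep until the write-out finishes, or,
 * with force-copy set, make a frozen copy now and proceed.
 */
if (buffer_shadow(bh)) {
        if (journal->j_force_copy)
                need_copy_out = true;   /* copy instead of waiting */
        else
                wait_on_bit_io(&bh->b_state, BH_Shadow,
                               TASK_UNINTERRUPTIBLE);
}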
      
The test case below was used:
      
      $cat offline.fio
      ; fio-rand-RW.job for fiotest
      
      [global]
      name=fio-rand-RW
      filename=fio-rand-RW
      rw=randrw
      rwmixread=60
      rwmixwrite=40
      bs=4K
      direct=0
      numjobs=4
      time_based=1
      runtime=900
      
      [file1]
      size=60G
      ioengine=sync
      iodepth=16
      
      $cat online.fio
      ; fio-seq-write.job for fiotest
      
      [global]
      name=fio-seq-write
      filename=fio-seq-write
      rw=write
      bs=256K
      direct=0
      numjobs=1
      time_based=1
      runtime=60
      
      [file1]
      rate=50m
      size=10G
      ioengine=sync
      iodepth=16
      
      With this patch:
      $cat /proc/fs/jbd2/sda5-8/force_copy
      0
      
The online fio job almost always sees such long tail latency:
      
Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=17855: Thu Nov 15 09:45:57 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=135, max=4086.6k, avg=867.21, stdev=50338.22
           lat (usec): min=139, max=4086.6k, avg=871.16, stdev=50338.22
          clat percentiles (usec):
           |  1.00th=[    141],  5.00th=[    143], 10.00th=[    145],
           | 20.00th=[    147], 30.00th=[    147], 40.00th=[    149],
           | 50.00th=[    149], 60.00th=[    151], 70.00th=[    153],
           | 80.00th=[    155], 90.00th=[    159], 95.00th=[    163],
           | 99.00th=[    255], 99.50th=[    273], 99.90th=[    429],
           | 99.95th=[    441], 99.99th=[3640656]
      
      $cat /proc/fs/jbd2/sda5-8/force_copy
      1
      
The online fio latency is much better:
      
Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=8084: Thu Nov 15 09:31:15 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=137, max=545, avg=151.35, stdev=16.22
           lat (usec): min=140, max=548, avg=155.31, stdev=16.65
          clat percentiles (usec):
     |  1.00th=[  143],  5.00th=[  145], 10.00th=[  145], 20.00th=[  147],
     | 30.00th=[  147], 40.00th=[  147], 50.00th=[  149], 60.00th=[  149],
     | 70.00th=[  151], 80.00th=[  155], 90.00th=[  157], 95.00th=[  161],
     | 99.00th=[  239], 99.50th=[  269], 99.90th=[  420], 99.95th=[  429],
     | 99.99th=[  537]
      
As to the cost: since the metadata buffer must always be copied, this
consumes minor CPU time and some memory (at most 32MB for a 128MB
journal size).
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
• alinux: ext4: don't submit unwritten extent while holding active jbd2 handle · c7c8cb0e
  Xiaoguang Wang committed
In ext4_writepages(), every iteration of mpage_prepare_extent_to_map()
tries to find 2048 pages to map, and one bio can normally contain at
most 256 pages. If 2048 pages really are found, that means 4 bios and
4 ext4_io_submit() calls, issued both in ext4_writepages() and
mpage_map_and_submit_extent().
      
But note that mpage_map_and_submit_extent() runs while holding a valid
jbd2 handle. When dioread_nolock is enabled and the extent is
unwritten, the jbd2 commit thread will wait for this handle to finish,
which in turn means waiting until the unwritten extent has been
written to disk. This introduces unnecessary stall time, which grows
even longer when the writeback IO is throttled, so it needs fixing.
      
For this case, accumulate the bios in ext4_io_submit's io_bio and only
submit them after dropping the jbd2 handle.
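
In ext4_writepages() terms, the reordering looks roughly like this (a
hedged sketch of the ordering only, not the literal patch):

ret = mpage_map_and_submit_extent(handle, &mpd, &give_up_on_write);
/*
 * Stop the handle before flushing the bios accumulated in
 * mpd.io_submit, so a committing transaction waiting on this handle
 * is never blocked behind throttled unwritten-extent writeback.
 */
ext4_journal_stop(handle);
ext4_io_submit(&mpd.io_submit);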
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
• alinux: fs,ext4: remove projid limit when create hard link · 08e6d768
  zhangliguang committed
This is a temporary workaround to avoid the limitation when creating
hard links across two project IDs.
Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
• alinux: jbd2: add new "stats" proc file · 7e2e7b9a
  Xiaoguang Wang committed
/proc/fs/jbd2/${device}/info only shows average statistics over
jbd2's whole life cycle; it cannot show jbd2 info for a specified time
interval, and that capability is very useful for troubleshooting. For
example, we cannot see how rs_locked and rs_flushing grow within a
given interval, even though these two metrics can explain some
application behaviour.

Add a new "stats" proc file similar to /proc/diskstats; with it, a
simple tool, jbd2_stats, can display detailed jbd2 info for a
specified time interval, like below (5s interval):
      
      [lege@localhost ~]$ cat /proc/fs/jbd2/vdb1-8/stats
      51 30 8192 0 1 241616 0 0 22 0 47158 891 942 1000 1000
      
      [lege@localhost ~]$ gcc -o jbd2_stat jbd2_stat.c ; ./jbd2_stat
      
Device              tid     trans   handles    locked  flushing   logging
vdb1-8             1861       158       359     13.00      0.00      2.00

Device              tid     trans   handles    locked  flushing   logging
vdb1-8             1974       113       389     26.00      0.00      5.00

Device              tid     trans   handles    locked  flushing   logging
vdb1-8             2188       214       308     10.00      0.00      7.00

Device              tid     trans   handles    locked  flushing   logging
vdb1-8             2344       156       332     19.00      0.00      4.00
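
A minimal userspace sketch of such a jbd2_stat tool (the meaning of
each raw field isn't spelled out above, so this version just prints
per-field deltas over a 5s window; mapping the columns to tid, trans,
handles and so on is left against the real field layout):

#include <stdio.h>
#include <unistd.h>

#define MAX_FIELDS 32

static int read_stats(const char *path, long long v[])
{
        FILE *f = fopen(path, "r");
        int n = 0;

        if (!f)
                return -1;
        while (n < MAX_FIELDS && fscanf(f, "%lld", &v[n]) == 1)
                n++;
        fclose(f);
        return n;
}

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1]
                                    : "/proc/fs/jbd2/vdb1-8/stats";
        long long a[MAX_FIELDS], b[MAX_FIELDS];
        int i, n;

        for (;;) {
                n = read_stats(path, a);
                if (n <= 0) {
                        perror(path);
                        return 1;
                }
                sleep(5);       /* sample interval, as in the example */
                if (read_stats(path, b) != n)
                        return 1;
                for (i = 0; i < n; i++)
                        printf("%lld%c", b[i] - a[i],
                               i == n - 1 ? '\n' : ' ');
        }
}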
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
• alinux: jbd2: create jbd2-ckpt thread for journal checkpoint · 3999cdd9
  Joseph Qi committed
Do the jbd2 checkpoint in a dedicated kernel thread, so that
checkpointing is not subject to IO throttle control.
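
A hedged sketch of the thread's shape (jbd2_checkpoint_thread, the
wait queue and the wakeup condition are assumptions; only
jbd2_log_do_checkpoint() and j_checkpoint_mutex are existing jbd2
names):

static int jbd2_checkpoint_thread(void *arg)
{
        journal_t *journal = arg;

        while (!kthread_should_stop()) {
                /* assumed wait queue, kicked when the log fills up */
                wait_event_interruptible(journal->j_wait_ckpt,
                                jbd2_ckpt_needed(journal) ||
                                kthread_should_stop());

                mutex_lock(&journal->j_checkpoint_mutex);
                jbd2_log_do_checkpoint(journal);
                mutex_unlock(&journal->j_checkpoint_mutex);
        }
        return 0;
}

Started once per journal, e.g. via kthread_run(jbd2_checkpoint_thread,
journal, "jbd2-ckpt/%s", journal->j_devname), so the checkpoint IO is
issued from this thread's context rather than from the throttled task.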
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: Baoyou Xie <baoyou.xie@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
3. 02 Jan 2020 (4 commits)
4. 27 Dec 2019 (9 commits)