1. 17 1月, 2020 1 次提交
  2. 15 1月, 2020 8 次提交
    • J
      alinux: ovl: implement async IO routines · 3e5dd02b
      Jiufei Xue 提交于
      A performance regression is observed since linux v4.19 when we do aio
      test using fio with iodepth 128 on overlayfs. And we found that queue
      depth of the device is always 1 which is unexpected.
      
      After investigation, it is found that commit 16914e6f
      ("ovl: add ovl_read_iter()") and commit 2a92e07e
      ("ovl: add ovl_write_iter()") use do_iter_readv_writev() to submit
      requests to real filesystem. Async IOs are converted to sync IOs here
      and cause performance regression.
      
      So implement async IO for stacked reading and writing.
      
      Changes since v1:
        - add a cleanup helper for completion/error handling
        - handle the case when aio_req allocation failed
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      3e5dd02b
    • J
      alinux: vfs: add vfs_iocb_iter_[read|write] helper functions · 6011bef7
      Jiufei Xue 提交于
      This isn't cause any behavior changes and will be used by overlay
      async IO implementation.
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6011bef7
    • M
      alinux: fuse: add sysfs api to flush processing queue requests · fc0a9b55
      Ma Jie Yue 提交于
      The failover of fuse userspace daemon will reuse the existing fuse conn,
      without unmounting it, during daemon crashing and recovery procedure.
      But some requests might be in process in the daemon before sending out reply,
      when the crash happens. This will stuck the application since it will
      never get the reply after the failover.
      
      We add the sysfs api to flush these requests, after the daemon crash, before
      recovery. It is easy to reproduce the issue in the fuse userspace daemon,
      just exit after receiving the request and before sending the reply back.
      The application will hang up in some read/write operation, before
      echo 1 > /sys/fs/fuse/connection/xxx/flush. The flush operation will make
      the io fail and return the error to the application.
      Signed-off-by: NMa Jie Yue <majieyue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      fc0a9b55
    • X
      alinux: jbd2: add proc entry to control whether doing buffer copy-out · 1ced8a5c
      Xiaoguang Wang 提交于
      When jbd2 tries to get write access to one buffer, and if this buffer
      is under writeback with BH_Shadow flag, jbd2 will wait until this buffer
      has been written to disk, but sometimes the time taken to wait may be
      much long, especially disk capacity is almost full.
      
      Here add a proc entry "force-copy", if its value is not zero, jbd2 will
      always do meta buffer copy-cout, then we can eliminate the unnecessary
      wating time here, and reduce long tail latency for buffered-write.
      
      I construct such test case below:
      
      $cat offline.fio
      ; fio-rand-RW.job for fiotest
      
      [global]
      name=fio-rand-RW
      filename=fio-rand-RW
      rw=randrw
      rwmixread=60
      rwmixwrite=40
      bs=4K
      direct=0
      numjobs=4
      time_based=1
      runtime=900
      
      [file1]
      size=60G
      ioengine=sync
      iodepth=16
      
      $cat online.fio
      ; fio-seq-write.job for fiotest
      
      [global]
      name=fio-seq-write
      filename=fio-seq-write
      rw=write
      bs=256K
      direct=0
      numjobs=1
      time_based=1
      runtime=60
      
      [file1]
      rate=50m
      size=10G
      ioengine=sync
      iodepth=16
      
      With this patch:
      $cat /proc/fs/jbd2/sda5-8/force_copy
      0
      
      online fio almost always get such long tail latency:
      
      Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta
      00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=17855: Thu Nov 15 09:45:57 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=135, max=4086.6k, avg=867.21, stdev=50338.22
           lat (usec): min=139, max=4086.6k, avg=871.16, stdev=50338.22
          clat percentiles (usec):
           |  1.00th=[    141],  5.00th=[    143], 10.00th=[    145],
           | 20.00th=[    147], 30.00th=[    147], 40.00th=[    149],
           | 50.00th=[    149], 60.00th=[    151], 70.00th=[    153],
           | 80.00th=[    155], 90.00th=[    159], 95.00th=[    163],
           | 99.00th=[    255], 99.50th=[    273], 99.90th=[    429],
           | 99.95th=[    441], 99.99th=[3640656]
      
      $cat /proc/fs/jbd2/sda5-8/force_copy
      1
      
      online fio latency is much better.
      
      Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta
      00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=8084: Thu Nov 15 09:31:15 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=137, max=545, avg=151.35, stdev=16.22
           lat (usec): min=140, max=548, avg=155.31, stdev=16.65
          clat percentiles (usec):
           |  1.00th=[  143],  5.00th=[  145], 10.00th=[  145], 20.00th=[
      147],
           | 30.00th=[  147], 40.00th=[  147], 50.00th=[  149], 60.00th=[
      149],
           | 70.00th=[  151], 80.00th=[  155], 90.00th=[  157], 95.00th=[
      161],
           | 99.00th=[  239], 99.50th=[  269], 99.90th=[  420], 99.95th=[
      429],
           | 99.99th=[  537]
      
      As to the cost: because we'll always need to copy meta buffer, will
      consume minor cpu time and some memory(at most 32MB for 128MB journal
      size).
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      1ced8a5c
    • X
      alinux: ext4: don't submit unwritten extent while holding active jbd2 handle · c7c8cb0e
      Xiaoguang Wang 提交于
      In ext4_writepages(), for every iteration, mpage_prepare_extent_to_map()
      will try to find 2048 pages to map and normally one bio can contain 256
      pages at most. If we really found 2048 pages to map, there will be 4 bios
      and 4 ext4_io_submit() calls which are called both in ext4_writepages()
      and mpage_map_and_submit_extent().
      
      But note that in mpage_map_and_submit_extent(), we hold a valid jbd2 handle,
      when dioread_nolock is enabled and extent is unwritten, jbd2 commit thread
      will wait this handle to finish, so wait the unwritten extent is written to
      disk, this will introduce unnecessary stall time, especially longer when
      the writeback operation is io throttled, need to fix this issue.
      
      Here for this scene, we accumulate bios in ext4_io_submit's io_bio, and
      only submit these bios after dropping the jbd2 handle.
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      c7c8cb0e
    • Z
      alinux: fs,ext4: remove projid limit when create hard link · 08e6d768
      zhangliguang 提交于
      This is a temporary workaround plan to avoid the limitation when
      creating hard link cross two projids.
      Signed-off-by: Nzhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      08e6d768
    • X
      alinux: jbd2: add new "stats" proc file · 7e2e7b9a
      Xiaoguang Wang 提交于
      /proc/fs/jbd2/${device}/info only shows whole average statistical
      info about jbd2's life cycle, but it can not show jbd2 info in
      specified time interval and sometimes this capability is very useful
      for trouble shooting. For example, we can not see how rs_locked and
      rs_flushing grows in specified time interval, but these two indexes
      can explain some reasons for app's behaviours.
      
      Here we add a new "stats" proc file like /proc/diskstats, then we can
      implement a simple tool jbd2_stats which'll display detailed jbd2 info
      in specified time interval. Like below(time interval 5s):
      
      [lege@localhost ~]$ cat /proc/fs/jbd2/vdb1-8/stats
      51 30 8192 0 1 241616 0 0 22 0 47158 891 942 1000 1000
      
      [lege@localhost ~]$ gcc -o jbd2_stat jbd2_stat.c ; ./jbd2_stat
      
      Device              tid     trans   handles    locked  flushing
      logging
      vdb1-8             1861       158       359     13.00      0.00
      2.00
      
      Device              tid     trans   handles    locked  flushing
      logging
      vdb1-8             1974       113       389     26.00      0.00
      5.00
      
      Device              tid     trans   handles    locked  flushing
      logging
      vdb1-8             2188       214       308     10.00      0.00
      7.00
      
      Device              tid     trans   handles    locked  flushing
      logging
      vdb1-8             2344       156       332     19.00      0.00
      4.00
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      7e2e7b9a
    • J
      alinux: jbd2: create jbd2-ckpt thread for journal checkpoint · 3999cdd9
      Joseph Qi 提交于
      This is trying to do jbd2 checkpoint in a specific kernel thread, then
      checkpoint won't be under io throttle control.
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Nzhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed by: Baoyou Xie <baoyou.xie@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      3999cdd9
  3. 02 1月, 2020 4 次提交
  4. 27 12月, 2019 14 次提交
    • X
      jbd2: fix deadlock while checkpoint thread waits commit thread to finish · 1b483bfe
      Xiaoguang Wang 提交于
      commit 53cf978457325d8fb2cdecd7981b31a8229e446e upstream.
      
      This issue was found when I tried to put checkpoint work in a separate thread,
      the deadlock below happened:
               Thread1                                |   Thread2
      __jbd2_log_wait_for_space                       |
      jbd2_log_do_checkpoint (hold j_checkpoint_mutex)|
        if (jh->b_transaction != NULL)                |
          ...                                         |
          jbd2_log_start_commit(journal, tid);        |jbd2_update_log_tail
                                                      |  will lock j_checkpoint_mutex,
                                                      |  but will be blocked here.
                                                      |
          jbd2_log_wait_commit(journal, tid);         |
          wait_event(journal->j_wait_done_commit,     |
           !tid_gt(tid, journal->j_commit_sequence)); |
           ...                                        |wake_up(j_wait_done_commit)
        }                                             |
      
      then deadlock occurs, Thread1 will never be waken up.
      
      To fix this issue, drop j_checkpoint_mutex in jbd2_log_do_checkpoint()
      when we are going to wait for transaction commit.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      1b483bfe
    • E
      ext4: fix bigalloc cluster freeing when hole punching under load · 9e24ae5c
      Eric Whitney 提交于
      commit 7bd75230b43727b258a4f7a59d62114cffe1b6c8 upstream.
      
      Ext4 may not free clusters correctly when punching holes in bigalloc
      file systems under high load conditions.  If it's not possible to
      extend and restart the journal in ext4_ext_rm_leaf() when preparing to
      remove blocks from a punched region, a retry of the entire punch
      operation is triggered in ext4_ext_remove_space().  This causes a
      partial cluster to be set to the first cluster in the extent found to
      the right of the punched region.  However, if the punch operation
      prior to the retry had made enough progress to delete one or more
      extents and a partial cluster candidate for freeing had already been
      recorded, the retry would overwrite the partial cluster.  The loss of
      this information makes it impossible to correctly free the original
      partial cluster in all cases.
      
      This bug can cause generic/476 to fail when run as part of
      xfstests-bld's bigalloc and bigalloc_1k test cases.  The failure is
      reported when e2fsck detects bad iblocks counts greater than expected
      in units of whole clusters and also detects a number of negative block
      bitmap differences equal to the iblocks discrepancy in cluster units.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      9e24ae5c
    • X
      ext4: unlock unused_pages timely when doing writeback · 507547e8
      Xiaoguang Wang 提交于
      commit a297b2fcee461e40df763e179cbbfba5a9e572d2 upstream.
      
      In mpage_add_bh_to_extent(), when accumulated extents length is greater
      than MAX_WRITEPAGES_EXTENT_LEN or buffer head's b_stat is not equal, we
      will not continue to search unmapped area for this page, but note this
      page is locked, and will only be unlocked in mpage_release_unused_pages()
      after ext4_io_submit, if io also is throttled by blk-throttle or similar
      io qos, we will hold this page locked for a while, it's unnecessary.
      
      I think the best fix is to refactor mpage_add_bh_to_extent() to let it
      return some hints whether to unlock this page, but given that we will
      improve dioread_nolock later, we can let it done later, so currently
      the simple fix would just call mpage_release_unused_pages() before
      ext4_io_submit().
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      507547e8
    • J
      fs: kernfs: add poll file operation · 12528a20
      Johannes Weiner 提交于
      commit 147e1a97c4a0bdd43f55a582a9416bb9092563a9 upstream.
      
      Patch series "psi: pressure stall monitors", v3.
      
      Android is adopting psi to detect and remedy memory pressure that
      results in stuttering and decreased responsiveness on mobile devices.
      
      Psi gives us the stall information, but because we're dealing with
      latencies in the millisecond range, periodically reading the pressure
      files to detect stalls in a timely fashion is not feasible.  Psi also
      doesn't aggregate its averages at a high enough frequency right now.
      
      This patch series extends the psi interface such that users can
      configure sensitive latency thresholds and use poll() and friends to be
      notified when these are breached.
      
      As high-frequency aggregation is costly, it implements an aggregation
      method that is optimized for fast, short-interval averaging, and makes
      the aggregation frequency adaptive, such that high-frequency updates
      only happen while monitored stall events are actively occurring.
      
      With these patches applied, Android can monitor for, and ward off,
      mounting memory shortages before they cause problems for the user.  For
      example, using memory stall monitors in userspace low memory killer
      daemon (lmkd) we can detect mounting pressure and kill less important
      processes before device becomes visibly sluggish.
      
      In our memory stress testing psi memory monitors produce roughly 10x
      less false positives compared to vmpressure signals.  Having ability to
      specify multiple triggers for the same psi metric allows other parts of
      Android framework to monitor memory state of the device and act
      accordingly.
      
      The new interface is straightforward.  The user opens one of the
      pressure files for writing and writes a trigger description into the
      file descriptor that defines the stall state - some or full, and the
      maximum stall time over a given window of time.  E.g.:
      
              /* Signal when stall time exceeds 100ms of a 1s window */
              char trigger[] = "full 100000 1000000";
              fd = open("/proc/pressure/memory");
              write(fd, trigger, sizeof(trigger));
              while (poll() >= 0) {
                      ...
              }
              close(fd);
      
      When the monitored stall state is entered, psi adapts its aggregation
      frequency according to what the configured time window requires in order
      to emit event signals in a timely fashion.  Once the stalling subsides,
      aggregation reverts back to normal.
      
      The trigger is associated with the open file descriptor.  To stop
      monitoring, the user only needs to close the file descriptor and the
      trigger is discarded.
      
      Patches 1-4 prepare the psi code for polling support.  Patch 5
      implements the adaptive polling logic, the pressure growth detection
      optimized for short intervals, and hooks up write() and poll() on the
      pressure files.
      
      The patches were developed in collaboration with Johannes Weiner.
      
      This patch (of 5):
      
      Kernfs has a standardized poll/notification mechanism for waking all
      pollers on all fds when a filesystem node changes.  To allow polling for
      custom events, add a .poll callback that can override the default.
      
      This is in preparation for pollable cgroup pressure files which have
      per-fd trigger configurations.
      
      Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.comSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      12528a20
    • J
      sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD · 4ca637b4
      Johannes Weiner 提交于
      commit 8508cf3ffad4defa202b303e5b6379efc4cd9054 upstream.
      
      There are several definitions of those functions/macros in places that
      mess with fixed-point load averages.  Provide an official version.
      
      [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
      Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      [Joseph: use stat.mean instead of stat->rqs.mean to solve conflict]
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      
      Conflicts:
          block/blk-iolatency.c
      4ca637b4
    • G
      alinux: net/tcp: Support tunable tcp timeout value in TIME-WAIT state · 0e7b6b7f
      George Zhang 提交于
      By default the tcp_tw_timeout value is 60 seconds. The minimum is
      1 second and the maximum is 600. This setting is useful on system under
      heavy tcp load.
      
      NOTE: set the tcp_tw_timeout below 60 seconds voilates the "quiet time"
      restriction, and make your system into the risk of causing some old data
      to be accepted as new or new data rejected as old duplicated by some
      receivers.
      
      Link: http://web.archive.org/web/20150102003320/http://tools.ietf.org/html/rfc793Signed-off-by: NGeorge Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      0e7b6b7f
    • L
      alinux: fs/writeback: Attach inode's wb to root if needed · 67dc261c
      luanshi 提交于
      There might have tons of files queued in the writeback, awaiting for
      writing back. Unfortunately, the writeback's cgroup has been dead. In
      this case, we reassociate the inode with another writeback, but we
      possibly can't because the writeback associated with the dead cgroup is
      the only valid one. In this case, the new writeback is allocated,
      initialized and associated with the inode in the non-stopping fashion
      until all data resident in the inode's page cache are flushed to disk.
      It causes unnecessary high system load.
      
      This fixes the issue by enforce moving the inode to root cgroup when the
      previous binding cgroup becomes dead. With it, no more unnecessary
      writebacks are created, populated and the system load decreased by about
      6x in the test case we carried out:
          Without the patch: 30% system load
          With the patch:    5%  system load
      Signed-off-by: Nluanshi <zhangliguang@linux.alibaba.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      67dc261c
    • J
    • E
      ext4: fix reserved cluster accounting at page invalidation time · c94af26e
      Eric Whitney 提交于
      commit f456767d3391e9f7d9d25a2e7241d75676dc19da upstream.
      
      Add new code to count canceled pending cluster reservations on bigalloc
      file systems and to reduce the cluster reservation count on all file
      systems using delayed allocation.  This replaces old code in
      ext4_da_page_release_reservations that was incorrect.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      c94af26e
    • E
      ext4: adjust reserved cluster count when removing extents · 60a5bf34
      Eric Whitney 提交于
      commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream.
      
      Modify ext4_ext_remove_space() and the code it calls to correct the
      reserved cluster count for pending reservations (delayed allocated
      clusters shared with allocated blocks) when a block range is removed
      from the extent tree.  Pending reservations may be found for the clusters
      at the ends of written or unwritten extents when a block range is removed.
      If a physical cluster at the end of an extent is freed, it's necessary
      to increment the reserved cluster count to maintain correct accounting
      if the corresponding logical cluster is shared with at least one
      delayed and unwritten extent as found in the extents status tree.
      
      Add a new function, ext4_rereserve_cluster(), to reapply a reservation
      on a delayed allocated cluster sharing blocks with a freed allocated
      cluster.  To avoid ENOSPC on reservation, a flag is applied to
      ext4_free_blocks() to briefly defer updating the freeclusters counter
      when an allocated cluster is freed.  This prevents another thread
      from allocating the freed block before the reservation can be reapplied.
      
      Redefine the partial cluster object as a struct to carry more state
      information and to clarify the code using it.
      
      Adjust the conditional code structure in ext4_ext_remove_space to
      reduce the indentation level in the main body of the code to improve
      readability.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      60a5bf34
    • E
      ext4: reduce reserved cluster count by number of allocated clusters · d49d8b35
      Eric Whitney 提交于
      commit b6bf9171ef5c37b66d446378ba63af5339a56a97 upstream.
      
      Ext4 does not always reduce the reserved cluster count by the number
      of clusters allocated when mapping a delayed extent.  It sometimes
      adds back one or more clusters after allocation if delalloc blocks
      adjacent to the range allocated by ext4_ext_map_blocks() share the
      clusters newly allocated for that range.  However, this overcounts
      the number of clusters needed to satisfy future mapping requests
      (holding one or more reservations for clusters that have already been
      allocated) and premature ENOSPC and quota failures, etc., result.
      
      Ext4 also does not reduce the reserved cluster count when allocating
      clusters for non-delayed allocated writes that have previously been
      reserved for delayed writes.  This also results in overcounts.
      
      To make it possible to handle reserved cluster accounting for
      fallocated regions in the same manner as used for other non-delayed
      writes, do the reserved cluster accounting for them at the time of
      allocation.  In the current code, this is only done later when a
      delayed extent sharing the fallocated region is finally mapped.
      
      Address comment correcting handling of unsigned long long constant
      from Jan Kara's review of RFC version of this patch.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      d49d8b35
    • E
      ext4: fix reserved cluster accounting at delayed write time · f683c7e6
      Eric Whitney 提交于
      commit 0b02f4c0d6d9e2c611dfbdd4317193e9dca740e6 upstream.
      
      The code in ext4_da_map_blocks sometimes reserves space for more
      delayed allocated clusters than it should, resulting in premature
      ENOSPC, exceeded quota, and inaccurate free space reporting.
      
      Fix this by checking for written and unwritten blocks shared in the
      same cluster with the newly delayed allocated block.  A cluster
      reservation should not be made for a cluster for which physical space
      has already been allocated.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      f683c7e6
    • E
      ext4: add new pending reservation mechanism · a993adbe
      Eric Whitney 提交于
      commit 1dc0aa46e74a3366e12f426b7caaca477853e9c3 upstream.
      
      Add new pending reservation mechanism to help manage reserved cluster
      accounting.  Its primary function is to avoid the need to read extents
      from the disk when invalidating pages as a result of a truncate, punch
      hole, or collapse range operation.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      a993adbe
    • E
      ext4: generalize extents status tree search functions · fccb6f6e
      Eric Whitney 提交于
      commit ad431025aecda85d3ebef5e4a3aca5c1c681d0c7 upstream.
      
      Ext4 contains a few functions that are used to search for delayed
      extents or blocks in the extents status tree.  Rather than duplicate
      code to add new functions to search for extents with different status
      values, such as written or a combination of delayed and unwritten,
      generalize the existing code to search for caller-specified extents
      status values.  Also, move this code into extents_status.c where it
      is better associated with the data structures it operates upon, and
      where it can be more readily used to implement new extents status tree
      functions that might want a broader scope for i_es_lock.
      
      Three missing static specifiers in RFC version of patch reported and
      fixed by Fengguang Wu <fengguang.wu@intel.com>.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      fccb6f6e
  5. 21 12月, 2019 6 次提交
  6. 18 12月, 2019 7 次提交
    • P
      cifs: Fix potential softlockups while refreshing DFS cache · 14cb20ad
      Paulo Alcantara (SUSE) 提交于
      [ Upstream commit 84a1f5b1cc6fd7f6cd99fc5630c36f631b19fa60 ]
      
      We used to skip reconnects on all SMB2_IOCTL commands due to SMB3+
      FSCTL_VALIDATE_NEGOTIATE_INFO - which made sense since we're still
      establishing a SMB session.
      
      However, when refresh_cache_worker() calls smb2_get_dfs_refer() and
      we're under reconnect, SMB2_ioctl() will not be able to get a proper
      status error (e.g. -EHOSTDOWN in case we failed to reconnect) but an
      -EAGAIN from cifs_send_recv() thus looping forever in
      refresh_cache_worker().
      
      Fixes: e99c63e4d86d ("SMB3: Fix deadlock in validate negotiate hits reconnect")
      Signed-off-by: NPaulo Alcantara (SUSE) <pc@cjr.nz>
      Suggested-by: NAurelien Aptel <aaptel@suse.com>
      Reviewed-by: NAurelien Aptel <aaptel@suse.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      14cb20ad
    • B
      gfs2: fix glock reference problem in gfs2_trans_remove_revoke · 0809e108
      Bob Peterson 提交于
      [ Upstream commit fe5e7ba11fcf1d75af8173836309e8562aefedef ]
      
      Commit 9287c6452d2b fixed a situation in which gfs2 could use a glock
      after it had been freed. To do that, it temporarily added a new glock
      reference by calling gfs2_glock_hold in function gfs2_add_revoke.
      However, if the bd element was removed by gfs2_trans_remove_revoke, it
      failed to drop the additional reference.
      
      This patch adds logic to gfs2_trans_remove_revoke to properly drop the
      additional glock reference.
      
      Fixes: 9287c6452d2b ("gfs2: Fix occasional glock use-after-free")
      Cc: stable@vger.kernel.org # v5.2+
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      0809e108
    • M
      mm, thp, proc: report THP eligibility for each vma · c76adee3
      Michal Hocko 提交于
      [ Upstream commit 7635d9cbe8327e131a1d3d8517dc186c2796ce2e ]
      
      Userspace falls short when trying to find out whether a specific memory
      range is eligible for THP.  There are usecases that would like to know
      that
      http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
      : This is used to identify heap mappings that should be able to fault thp
      : but do not, and they normally point to a low-on-memory or fragmentation
      : issue.
      
      The only way to deduce this now is to query for hg resp.  nh flags and
      confronting the state with the global setting.  Except that there is also
      PR_SET_THP_DISABLE that might change the picture.  So the final logic is
      not trivial.  Moreover the eligibility of the vma depends on the type of
      VMA as well.  In the past we have supported only anononymous memory VMAs
      but things have changed and shmem based vmas are supported as well these
      days and the query logic gets even more complicated because the
      eligibility depends on the mount option and another global configuration
      knob.
      
      Simplify the current state and report the THP eligibility in
      /proc/<pid>/smaps for each existing vma.  Reuse
      transparent_hugepage_enabled for this purpose.  The original
      implementation of this function assumes that the caller knows that the vma
      itself is supported for THP so make the core checks into
      __transparent_hugepage_enabled and use it for existing callers.
      __show_smap just use the new transparent_hugepage_enabled which also
      checks the vma support status (please note that this one has to be out of
      line due to include dependency issues).
      
      [mhocko@kernel.org: fix oops with NULL ->f_mapping]
        Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      c76adee3
    • Y
      ext4: fix a bug in ext4_wait_for_tail_page_commit · b1ec93dd
      yangerkun 提交于
      commit 565333a1554d704789e74205989305c811fd9c7a upstream.
      
      No need to wait for any commit once the page is fully truncated.
      Besides, it may confuse e.g. concurrent ext4_writepage() with the page
      still be dirty (will be cleared by truncate_pagecache() in
      ext4_setattr()) but buffers has been freed; and then trigger a bug
      show as below:
      
      [   26.057508] ------------[ cut here ]------------
      [   26.058531] kernel BUG at fs/ext4/inode.c:2134!
      ...
      [   26.088130] Call trace:
      [   26.088695]  ext4_writepage+0x914/0xb28
      [   26.089541]  writeout.isra.4+0x1b4/0x2b8
      [   26.090409]  move_to_new_page+0x3b0/0x568
      [   26.091338]  __unmap_and_move+0x648/0x988
      [   26.092241]  unmap_and_move+0x48c/0xbb8
      [   26.093096]  migrate_pages+0x220/0xb28
      [   26.093945]  kernel_mbind+0x828/0xa18
      [   26.094791]  __arm64_sys_mbind+0xc8/0x138
      [   26.095716]  el0_svc_common+0x190/0x490
      [   26.096571]  el0_svc_handler+0x60/0xd0
      [   26.097423]  el0_svc+0x8/0xc
      
      Run the procedure (generate by syzkaller) parallel with ext3.
      
      void main()
      {
      	int fd, fd1, ret;
      	void *addr;
      	size_t length = 4096;
      	int flags;
      	off_t offset = 0;
      	char *str = "12345";
      
      	fd = open("a", O_RDWR | O_CREAT);
      	assert(fd >= 0);
      
      	/* Truncate to 4k */
      	ret = ftruncate(fd, length);
      	assert(ret == 0);
      
      	/* Journal data mode */
      	flags = 0xc00f;
      	ret = ioctl(fd, _IOW('f', 2, long), &flags);
      	assert(ret == 0);
      
      	/* Truncate to 0 */
      	fd1 = open("a", O_TRUNC | O_NOATIME);
      	assert(fd1 >= 0);
      
      	addr = mmap(NULL, length, PROT_WRITE | PROT_READ,
      					MAP_SHARED, fd, offset);
      	assert(addr != (void *)-1);
      
      	memcpy(addr, str, 5);
      	mbind(addr, length, 0, 0, 0, MPOL_MF_MOVE);
      }
      
      And the bug will be triggered once we seen the below order.
      
      reproduce1                         reproduce2
      
      ...                            |   ...
      truncate to 4k                 |
      change to journal data mode    |
                                     |   memcpy(set page dirty)
      truncate to 0:                 |
      ext4_setattr:                  |
      ...                            |
      ext4_wait_for_tail_page_commit |
                                     |   mbind(trigger bug)
      truncate_pagecache(clean dirty)|   ...
      ...                            |
      
      mbind will call ext4_writepage() since the page still be dirty, and then
      report the bug since the buffers has been free. Fix it by return
      directly once offset equals to 0 which means the page has been fully
      truncated.
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: Nyangerkun <yangerkun@huawei.com>
      Link: https://lore.kernel.org/r/20190919063508.1045-1-yangerkun@huawei.comReviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b1ec93dd
    • D
      splice: only read in as much information as there is pipe buffer space · 326ba910
      Darrick J. Wong 提交于
      commit 3253d9d093376d62b4a56e609f15d2ec5085ac73 upstream.
      
      Andreas Grünbacher reports that on the two filesystems that support
      iomap directio, it's possible for splice() to return -EAGAIN (instead of
      a short splice) if the pipe being written to has less space available in
      its pipe buffers than the length supplied by the calling process.
      
      Months ago we fixed splice_direct_to_actor to clamp the length of the
      read request to the size of the splice pipe.  Do the same to do_splice.
      
      Fixes: 17614445576b6 ("splice: don't read more than available pipe space")
      Reported-by: syzbot+3c01db6025f26530cf8d@syzkaller.appspotmail.com
      Reported-by: NAndreas Grünbacher <andreas.gruenbacher@gmail.com>
      Reviewed-by: NAndreas Grünbacher <andreas.gruenbacher@gmail.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      326ba910
    • T
      ext4: work around deleting a file with i_nlink == 0 safely · 8e7a8653
      Theodore Ts'o 提交于
      commit c7df4a1ecb8579838ec8c56b2bb6a6716e974f37 upstream.
      
      If the file system is corrupted such that a file's i_links_count is
      too small, then it's possible that when unlinking that file, i_nlink
      will already be zero.  Previously we were working around this kind of
      corruption by forcing i_nlink to one; but we were doing this before
      trying to delete the directory entry --- and if the file system is
      corrupted enough that ext4_delete_entry() fails, then we exit with
      i_nlink elevated, and this causes the orphan inode list handling to be
      FUBAR'ed, such that when we unmount the file system, the orphan inode
      list can get corrupted.
      
      A better way to fix this is to simply skip trying to call drop_nlink()
      if i_nlink is already zero, thus moving the check to the place where
      it makes the most sense.
      
      https://bugzilla.kernel.org/show_bug.cgi?id=205433
      
      Link: https://lore.kernel.org/r/20191112032903.8828-1-tytso@mit.eduSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8e7a8653
    • J
      reiserfs: fix extended attributes on the root directory · 7b3ea9bb
      Jeff Mahoney 提交于
      commit 60e4cf67a582d64f07713eda5fcc8ccdaf7833e6 upstream.
      
      Since commit d0a5b995 (vfs: Add IOP_XATTR inode operations flag)
      extended attributes haven't worked on the root directory in reiserfs.
      
      This is due to reiserfs conditionally setting the sb->s_xattrs handler
      array depending on whether it located or create the internal privroot
      directory.  It necessarily does this after the root inode is already
      read in.  The IOP_XATTR flag is set during inode initialization, so
      it never gets set on the root directory.
      
      This commit unconditionally assigns sb->s_xattrs and clears IOP_XATTR on
      internal inodes.  The old return values due to the conditional assignment
      are handled via open_xa_root, which now returns EOPNOTSUPP as the VFS
      would have done.
      
      Link: https://lore.kernel.org/r/20191024143127.17509-1-jeffm@suse.com
      CC: stable@vger.kernel.org
      Fixes: d0a5b995 ("vfs: Add IOP_XATTR inode operations flag")
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7b3ea9bb