1. 06 8月, 2019 4 次提交
    • E
      ext4: fix bigalloc cluster freeing when hole punching under load · 46de2e5e
      Eric Whitney 提交于
      commit 7bd75230b43727b258a4f7a59d62114cffe1b6c8 upstream.
      
      Ext4 may not free clusters correctly when punching holes in bigalloc
      file systems under high load conditions.  If it's not possible to
      extend and restart the journal in ext4_ext_rm_leaf() when preparing to
      remove blocks from a punched region, a retry of the entire punch
      operation is triggered in ext4_ext_remove_space().  This causes a
      partial cluster to be set to the first cluster in the extent found to
      the right of the punched region.  However, if the punch operation
      prior to the retry had made enough progress to delete one or more
      extents and a partial cluster candidate for freeing had already been
      recorded, the retry would overwrite the partial cluster.  The loss of
      this information makes it impossible to correctly free the original
      partial cluster in all cases.
      
      This bug can cause generic/476 to fail when run as part of
      xfstests-bld's bigalloc and bigalloc_1k test cases.  The failure is
      reported when e2fsck detects bad iblocks counts greater than expected
      in units of whole clusters and also detects a number of negative block
      bitmap differences equal to the iblocks discrepancy in cluster units.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      46de2e5e
    • G
      ext4: fix build error when DX_DEBUG is defined · f2d52aff
      Gabriel Krisman Bertazi 提交于
      commit 799578ab16e86b074c184ec5abbda0bc698c7b0b upstream.
      
      Enabling DX_DEBUG triggers the build error below.  info is an attribute
      of  the dxroot structure.
      
      linux/fs/ext4/namei.c:2264:12: error: ‘info’
      undeclared (first use in this function); did you mean ‘insl’?
      	   	  info->indirect_levels));
      
      Fixes: e08ac99f ("ext4: add largedir feature")
      Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.co.uk>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      f2d52aff
    • O
      NFSv4.x: fix lock recovery during delegation recall · d50ef120
      Olga Kornievskaia 提交于
      commit 44f411c353bf6d98d5a34f8f1b8605d43b2e50b8 upstream.
      
      Running "./nfstest_delegation --runtest recall26" uncovers that
      client doesn't recover the lock when we have an appending open,
      where the initial open got a write delegation.
      
      Instead of checking for the passed in open context against
      the file lock's open context. Check that the state is the same.
      Signed-off-by: NOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      d50ef120
    • D
      xfs: fix use-after-free race in xfs_buf_rele · ac911480
      Dave Chinner 提交于
      commit 37fd1678245f7a5898c1b05128bc481fb403c290 upstream.
      
      When looking at a 4.18 based KASAN use after free report, I noticed
      that racing xfs_buf_rele() may race on dropping the last reference
      to the buffer and taking the buffer lock. This was the symptom
      displayed by the KASAN report, but the actual issue that was
      reported had already been fixed in 4.19-rc1 by commit e339dd8d
      ("xfs: use sync buffer I/O for sync delwri queue submission").
      
      Despite this, I think there is still an issue with xfs_buf_rele()
      in this code:
      
              release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
              spin_lock(&bp->b_lock);
              if (!release) {
      .....
      
      If two threads race on the b_lock after both dropping a reference
      and one getting dropping the last reference so release = true, we
      end up with:
      
      CPU 0				CPU 1
      atomic_dec_and_lock()
      				atomic_dec_and_lock()
      				spin_lock(&bp->b_lock)
      spin_lock(&bp->b_lock)
      <spins>
      				<release = true bp->b_lru_ref = 0>
      				<remove from lists>
      				freebuf = true
      				spin_unlock(&bp->b_lock)
      				xfs_buf_free(bp)
      <gets lock, reading and writing freed memory>
      <accesses freed memory>
      spin_unlock(&bp->b_lock) <reads/writes freed memory>
      
      IOWs, we can't safely take bp->b_lock after dropping the hold
      reference because the buffer may go away at any time after we
      drop that reference. However, this can be fixed simply by taking the
      bp->b_lock before we drop the reference.
      
      It is safe to nest the pag_buf_lock inside bp->b_lock as the
      pag_buf_lock is only used to serialise against lookup in
      xfs_buf_find() and no other locks are held over or under the
      pag_buf_lock there. Make this clear by documenting the buffer lock
      orders at the top of the file.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      ac911480
  2. 20 7月, 2019 1 次提交
  3. 18 7月, 2019 1 次提交
    • X
      ext4: unlock unused_pages timely when doing writeback · 404ed43a
      Xiaoguang Wang 提交于
      commit a297b2fcee461e40df763e179cbbfba5a9e572d2 upstream.
      
      In mpage_add_bh_to_extent(), when accumulated extents length is greater
      than MAX_WRITEPAGES_EXTENT_LEN or buffer head's b_stat is not equal, we
      will not continue to search unmapped area for this page, but note this
      page is locked, and will only be unlocked in mpage_release_unused_pages()
      after ext4_io_submit, if io also is throttled by blk-throttle or similar
      io qos, we will hold this page locked for a while, it's unnecessary.
      
      I think the best fix is to refactor mpage_add_bh_to_extent() to let it
      return some hints whether to unlock this page, but given that we will
      improve dioread_nolock later, we can let it done later, so currently
      the simple fix would just call mpage_release_unused_pages() before
      ext4_io_submit().
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      404ed43a
  4. 05 7月, 2019 14 次提交
    • B
      xfs: serialize unaligned dio writes against all other dio writes · 6fb16296
      Brian Foster 提交于
      commit 2032a8a27b5cc0f578d37fa16fa2494b80a0d00a upstream.
      
      XFS applies more strict serialization constraints to unaligned
      direct writes to accommodate things like direct I/O layer zeroing,
      unwritten extent conversion, etc. Unaligned submissions acquire the
      exclusive iolock and wait for in-flight dio to complete to ensure
      multiple submissions do not race on the same block and cause data
      corruption.
      
      This generally works in the case of an aligned dio followed by an
      unaligned dio, but the serialization is lost if I/Os occur in the
      opposite order. If an unaligned write is submitted first and
      immediately followed by an overlapping, aligned write, the latter
      submits without the typical unaligned serialization barriers because
      there is no indication of an unaligned dio still in-flight. This can
      lead to unpredictable results.
      
      To provide proper unaligned dio serialization, require that such
      direct writes are always the only dio allowed in-flight at one time
      for a particular inode. We already acquire the exclusive iolock and
      drain pending dio before submitting the unaligned dio. Wait once
      more after the dio submission to hold the iolock across the I/O and
      prevent further submissions until the unaligned I/O completes. This
      is heavy handed, but consistent with the current pre-submission
      serialization for unaligned direct writes.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NAlvin Zheng <Alvin@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6fb16296
    • J
      fs: kernfs: add poll file operation · ba79ed2d
      Johannes Weiner 提交于
      commit 147e1a97c4a0bdd43f55a582a9416bb9092563a9 upstream.
      
      Patch series "psi: pressure stall monitors", v3.
      
      Android is adopting psi to detect and remedy memory pressure that
      results in stuttering and decreased responsiveness on mobile devices.
      
      Psi gives us the stall information, but because we're dealing with
      latencies in the millisecond range, periodically reading the pressure
      files to detect stalls in a timely fashion is not feasible.  Psi also
      doesn't aggregate its averages at a high enough frequency right now.
      
      This patch series extends the psi interface such that users can
      configure sensitive latency thresholds and use poll() and friends to be
      notified when these are breached.
      
      As high-frequency aggregation is costly, it implements an aggregation
      method that is optimized for fast, short-interval averaging, and makes
      the aggregation frequency adaptive, such that high-frequency updates
      only happen while monitored stall events are actively occurring.
      
      With these patches applied, Android can monitor for, and ward off,
      mounting memory shortages before they cause problems for the user.  For
      example, using memory stall monitors in userspace low memory killer
      daemon (lmkd) we can detect mounting pressure and kill less important
      processes before device becomes visibly sluggish.
      
      In our memory stress testing psi memory monitors produce roughly 10x
      less false positives compared to vmpressure signals.  Having ability to
      specify multiple triggers for the same psi metric allows other parts of
      Android framework to monitor memory state of the device and act
      accordingly.
      
      The new interface is straightforward.  The user opens one of the
      pressure files for writing and writes a trigger description into the
      file descriptor that defines the stall state - some or full, and the
      maximum stall time over a given window of time.  E.g.:
      
              /* Signal when stall time exceeds 100ms of a 1s window */
              char trigger[] = "full 100000 1000000";
              fd = open("/proc/pressure/memory");
              write(fd, trigger, sizeof(trigger));
              while (poll() >= 0) {
                      ...
              }
              close(fd);
      
      When the monitored stall state is entered, psi adapts its aggregation
      frequency according to what the configured time window requires in order
      to emit event signals in a timely fashion.  Once the stalling subsides,
      aggregation reverts back to normal.
      
      The trigger is associated with the open file descriptor.  To stop
      monitoring, the user only needs to close the file descriptor and the
      trigger is discarded.
      
      Patches 1-4 prepare the psi code for polling support.  Patch 5
      implements the adaptive polling logic, the pressure growth detection
      optimized for short intervals, and hooks up write() and poll() on the
      pressure files.
      
      The patches were developed in collaboration with Johannes Weiner.
      
      This patch (of 5):
      
      Kernfs has a standardized poll/notification mechanism for waking all
      pollers on all fds when a filesystem node changes.  To allow polling for
      custom events, add a .poll callback that can override the default.
      
      This is in preparation for pollable cgroup pressure files which have
      per-fd trigger configurations.
      
      Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.comSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      ba79ed2d
    • J
      sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD · 63c14678
      Johannes Weiner 提交于
      commit 8508cf3ffad4defa202b303e5b6379efc4cd9054 upstream.
      
      There are several definitions of those functions/macros in places that
      mess with fixed-point load averages.  Provide an official version.
      
      [Joseph] use stat.mean instead of stat->rqs.mean to fix conflicts in
      iolatency_check_latencies();
      
      [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
      Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      63c14678
    • D
      splice: don't read more than available pipe space · 34dd6d05
      Darrick J. Wong 提交于
      commit 17614445576b6af24e9cf36607c6448164719c96 upstream.
      
      In commit 4721a601099, we tried to fix a problem wherein directio reads
      into a splice pipe will bounce EFAULT/EAGAIN all the way out to
      userspace by simulating a zero-byte short read.  This happens because
      some directio read implementations (xfs) will call
      bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
      reads, but as soon as we run out of pipe buffers that _get_pages call
      returns EFAULT, which the splice code translates to EAGAIN and bounces
      out to userspace.
      
      In that commit, the iomap code catches the EFAULT and simulates a
      zero-byte read, but that causes assertion errors on regular splice reads
      because xfs doesn't allow short directio reads.
      
      The brokenness is compounded by splice_direct_to_actor immediately
      bailing on do_splice_to returning <= 0 without ever calling ->actor
      (which empties out the pipe), so if userspace calls back we'll EFAULT
      again on the full pipe, and nothing ever gets copied.
      
      Therefore, teach splice_direct_to_actor to clamp its requests to the
      amount of free space in the pipe and remove the simulated short read
      from the iomap directio code.
      
      Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
      Reported-by: NMurphy Zhou <jencce.kernel@gmail.com>
      Ranted-by: NAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      34dd6d05
    • D
      xfs: don't overflow xattr listent buffer · 9802c6bd
      Darrick J. Wong 提交于
      commit 3b50086f0c0d78c144d9483fa292c1509c931b70 upstream.
      
      For VFS listxattr calls, xfs_xattr_put_listent calls
      __xfs_xattr_put_listent twice if it sees an attribute
      "trusted.SGI_ACL_FILE": once for that name, and again for
      "system.posix_acl_access".  Unfortunately, if we happen to run out of
      buffer space while emitting the first name, we set count to -1 (so that
      we can feed ERANGE to the caller).  The second invocation doesn't check that
      the context parameters make sense and overwrites the byte before the
      buffer, triggering a KASAN report:
      
      ==================================================================
      BUG: KASAN: slab-out-of-bounds in strncpy+0xb3/0xd0
      Write of size 1 at addr ffff88807fbd317f by task syz/1113
      
      CPU: 3 PID: 1113 Comm: syz Not tainted 5.0.0-rc6-xfsx #rc6
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       dump_stack+0xcc/0x180
       print_address_description+0x6c/0x23c
       kasan_report.cold.3+0x1c/0x35
       strncpy+0xb3/0xd0
       __xfs_xattr_put_listent+0x1a9/0x2c0 [xfs]
       xfs_attr_list_int_ilocked+0x11af/0x1800 [xfs]
       xfs_attr_list_int+0x20c/0x2e0 [xfs]
       xfs_vn_listxattr+0x225/0x320 [xfs]
       listxattr+0x11f/0x1b0
       path_listxattr+0xbd/0x130
       do_syscall_64+0x139/0x560
      
      While we're at it we add an assert to the other put_listent to avoid
      this sort of thing ever happening to the attrlist_by_handle code.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      9802c6bd
    • G
      net/tcp: Support tunable tcp timeout value in TIME-WAIT state · cf8fdcc2
      George Zhang 提交于
      By default the tcp_tw_timeout value is 60 seconds. The minimum is
      1 second and the maximum is 600. This setting is useful on system under
      heavy tcp load.
      
      NOTE: set the tcp_tw_timeout below 60 seconds voilates the "quiet time"
      restriction, and make your system into the risk of causing some old data
      to be accepted as new or new data rejected as old duplicated by some
      receivers.
      
      Link: http://web.archive.org/web/20150102003320/http://tools.ietf.org/html/rfc793Signed-off-by: NGeorge Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      cf8fdcc2
    • L
      fs/writeback: Attach inode's wb to root if needed · 9fa0f25c
      luanshi 提交于
      There might have tons of files queued in the writeback, awaiting for
      writing back. Unfortunately, the writeback's cgroup has been dead. In
      this case, we reassociate the inode with another writeback, but we
      possibly can't because the writeback associated with the dead cgroup is
      the only valid one. In this case, the new writeback is allocated,
      initialized and associated with the inode in the non-stopping fashion
      until all data resident in the inode's page cache are flushed to disk.
      It causes unnecessary high system load.
      
      This fixes the issue by enforce moving the inode to root cgroup when the
      previous binding cgroup becomes dead. With it, no more unnecessary
      writebacks are created, populated and the system load decreased by about
      6x in the test case we carried out:
          Without the patch: 30% system load
          With the patch:    5%  system load
      Signed-off-by: Nluanshi <zhangliguang@linux.alibaba.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      9fa0f25c
    • J
    • E
      ext4: fix reserved cluster accounting at page invalidation time · ec1f21f8
      Eric Whitney 提交于
      commit f456767d3391e9f7d9d25a2e7241d75676dc19da upstream.
      
      Add new code to count canceled pending cluster reservations on bigalloc
      file systems and to reduce the cluster reservation count on all file
      systems using delayed allocation.  This replaces old code in
      ext4_da_page_release_reservations that was incorrect.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      ec1f21f8
    • E
      ext4: adjust reserved cluster count when removing extents · 52737423
      Eric Whitney 提交于
      commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream.
      
      Modify ext4_ext_remove_space() and the code it calls to correct the
      reserved cluster count for pending reservations (delayed allocated
      clusters shared with allocated blocks) when a block range is removed
      from the extent tree.  Pending reservations may be found for the clusters
      at the ends of written or unwritten extents when a block range is removed.
      If a physical cluster at the end of an extent is freed, it's necessary
      to increment the reserved cluster count to maintain correct accounting
      if the corresponding logical cluster is shared with at least one
      delayed and unwritten extent as found in the extents status tree.
      
      Add a new function, ext4_rereserve_cluster(), to reapply a reservation
      on a delayed allocated cluster sharing blocks with a freed allocated
      cluster.  To avoid ENOSPC on reservation, a flag is applied to
      ext4_free_blocks() to briefly defer updating the freeclusters counter
      when an allocated cluster is freed.  This prevents another thread
      from allocating the freed block before the reservation can be reapplied.
      
      Redefine the partial cluster object as a struct to carry more state
      information and to clarify the code using it.
      
      Adjust the conditional code structure in ext4_ext_remove_space to
      reduce the indentation level in the main body of the code to improve
      readability.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      52737423
    • E
      ext4: reduce reserved cluster count by number of allocated clusters · fffa8584
      Eric Whitney 提交于
      commit b6bf9171ef5c37b66d446378ba63af5339a56a97 upstream.
      
      Ext4 does not always reduce the reserved cluster count by the number
      of clusters allocated when mapping a delayed extent.  It sometimes
      adds back one or more clusters after allocation if delalloc blocks
      adjacent to the range allocated by ext4_ext_map_blocks() share the
      clusters newly allocated for that range.  However, this overcounts
      the number of clusters needed to satisfy future mapping requests
      (holding one or more reservations for clusters that have already been
      allocated) and premature ENOSPC and quota failures, etc., result.
      
      Ext4 also does not reduce the reserved cluster count when allocating
      clusters for non-delayed allocated writes that have previously been
      reserved for delayed writes.  This also results in overcounts.
      
      To make it possible to handle reserved cluster accounting for
      fallocated regions in the same manner as used for other non-delayed
      writes, do the reserved cluster accounting for them at the time of
      allocation.  In the current code, this is only done later when a
      delayed extent sharing the fallocated region is finally mapped.
      
      Address comment correcting handling of unsigned long long constant
      from Jan Kara's review of RFC version of this patch.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      fffa8584
    • E
      ext4: fix reserved cluster accounting at delayed write time · fcd1ffa4
      Eric Whitney 提交于
      commit 0b02f4c0d6d9e2c611dfbdd4317193e9dca740e6 upstream.
      
      The code in ext4_da_map_blocks sometimes reserves space for more
      delayed allocated clusters than it should, resulting in premature
      ENOSPC, exceeded quota, and inaccurate free space reporting.
      
      Fix this by checking for written and unwritten blocks shared in the
      same cluster with the newly delayed allocated block.  A cluster
      reservation should not be made for a cluster for which physical space
      has already been allocated.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      fcd1ffa4
    • E
      ext4: add new pending reservation mechanism · 21cf5841
      Eric Whitney 提交于
      commit 1dc0aa46e74a3366e12f426b7caaca477853e9c3 upstream.
      
      Add new pending reservation mechanism to help manage reserved cluster
      accounting.  Its primary function is to avoid the need to read extents
      from the disk when invalidating pages as a result of a truncate, punch
      hole, or collapse range operation.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      21cf5841
    • E
      ext4: generalize extents status tree search functions · 8e39d1ef
      Eric Whitney 提交于
      commit ad431025aecda85d3ebef5e4a3aca5c1c681d0c7 upstream.
      
      Ext4 contains a few functions that are used to search for delayed
      extents or blocks in the extents status tree.  Rather than duplicate
      code to add new functions to search for extents with different status
      values, such as written or a combination of delayed and unwritten,
      generalize the existing code to search for caller-specified extents
      status values.  Also, move this code into extents_status.c where it
      is better associated with the data structures it operates upon, and
      where it can be more readily used to implement new extents status tree
      functions that might want a broader scope for i_es_lock.
      
      Three missing static specifiers in RFC version of patch reported and
      fixed by Fengguang Wu <fengguang.wu@intel.com>.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      8e39d1ef
  5. 03 7月, 2019 4 次提交
  6. 25 6月, 2019 8 次提交
    • S
      SMB3: retry on STATUS_INSUFFICIENT_RESOURCES instead of failing write · 5293c79c
      Steve French 提交于
      commit 8d526d62db907e786fd88948c75d1833d82bd80e upstream.
      
      Some servers such as Windows 10 will return STATUS_INSUFFICIENT_RESOURCES
      as the number of simultaneous SMB3 requests grows (even though the client
      has sufficient credits).  Return EAGAIN on STATUS_INSUFFICIENT_RESOURCES
      so that we can retry writes which fail with this status code.
      
      This (for example) fixes large file copies to Windows 10 on fast networks.
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Reviewed-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: NPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5293c79c
    • N
      btrfs: start readahead also in seed devices · c592b1c3
      Naohiro Aota 提交于
      commit c4e0540d0ad49c8ceab06cceed1de27c4fe29f6e upstream.
      
      Currently, btrfs does not consult seed devices to start readahead. As a
      result, if readahead zone is added to the seed devices, btrfs_reada_wait()
      indefinitely wait for the reada_ctl to finish.
      
      You can reproduce the hung by modifying btrfs/163 to have larger initial
      file size (e.g. xfs_io pwrite 4M instead of current 256K).
      
      Fixes: 7414a03f ("btrfs: initial readahead code and prototypes")
      Cc: stable@vger.kernel.org # 3.2+: ce7791ff: Btrfs: fix race between readahead and device replace/removal
      Cc: stable@vger.kernel.org # 3.2+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c592b1c3
    • A
      ovl: fix bogus -Wmaybe-unitialized warning · 0319ef1d
      Arnd Bergmann 提交于
      [ Upstream commit 1dac6f5b0ed2601be21bb4e27a44b0c3e667b7f4 ]
      
      gcc gets a bit confused by the logic in ovl_setup_trap() and
      can't figure out whether the local 'trap' variable in the caller
      was initialized or not:
      
      fs/overlayfs/super.c: In function 'ovl_fill_super':
      fs/overlayfs/super.c:1333:4: error: 'trap' may be used uninitialized in this function [-Werror=maybe-uninitialized]
          iput(trap);
          ^~~~~~~~~~
      fs/overlayfs/super.c:1312:17: note: 'trap' was declared here
      
      Reword slightly to make it easier for the compiler to understand.
      
      Fixes: 146d62e5a586 ("ovl: detect overlapping layers")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      0319ef1d
    • M
      ovl: don't fail with disconnected lower NFS · 639e8c2f
      Miklos Szeredi 提交于
      [ Upstream commit 9179c21dc6ed1c993caa5fe4da876a6765c26af7 ]
      
      NFS mounts can be disconnected from fs root.  Don't fail the overlapping
      layer check because of this.
      
      The check is not authoritative anyway, since topology can change during or
      after the check.
      Reported-by: NAntti Antinoja <antti@fennosys.fi>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Fixes: 146d62e5a586 ("ovl: detect overlapping layers")
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      639e8c2f
    • A
      ovl: detect overlapping layers · f1c5aa5e
      Amir Goldstein 提交于
      [ Upstream commit 146d62e5a5867fbf84490d82455718bfb10fe824 ]
      
      Overlapping overlay layers are not supported and can cause unexpected
      behavior, but overlayfs does not currently check or warn about these
      configurations.
      
      User is not supposed to specify the same directory for upper and
      lower dirs or for different lower layers and user is not supposed to
      specify directories that are descendants of each other for overlay
      layers, but that is exactly what this zysbot repro did:
      
          https://syzkaller.appspot.com/x/repro.syz?x=12c7a94f400000
      
      Moving layer root directories into other layers while overlayfs
      is mounted could also result in unexpected behavior.
      
      This commit places "traps" in the overlay inode hash table.
      Those traps are dummy overlay inodes that are hashed by the layers
      root inodes.
      
      On mount, the hash table trap entries are used to verify that overlay
      layers are not overlapping.  While at it, we also verify that overlay
      layers are not overlapping with directories "in-use" by other overlay
      instances as upperdir/workdir.
      
      On lookup, the trap entries are used to verify that overlay layers
      root inodes have not been moved into other layers after mount.
      
      Some examples:
      
      $ ./run --ov --samefs -s
      ...
      ( mkdir -p base/upper/0/u base/upper/0/w base/lower lower upper mnt
        mount -o bind base/lower lower
        mount -o bind base/upper upper
        mount -t overlay none mnt ...
              -o lowerdir=lower,upperdir=upper/0/u,workdir=upper/0/w)
      
      $ umount mnt
      $ mount -t overlay none mnt ...
              -o lowerdir=base,upperdir=upper/0/u,workdir=upper/0/w
      
        [   94.434900] overlayfs: overlapping upperdir path
        mount: mount overlay on mnt failed: Too many levels of symbolic links
      
      $ mount -t overlay none mnt ...
              -o lowerdir=upper/0/u,upperdir=upper/0/u,workdir=upper/0/w
      
        [  151.350132] overlayfs: conflicting lowerdir path
        mount: none is already mounted or mnt busy
      
      $ mount -t overlay none mnt ...
              -o lowerdir=lower:lower/a,upperdir=upper/0/u,workdir=upper/0/w
      
        [  201.205045] overlayfs: overlapping lowerdir path
        mount: mount overlay on mnt failed: Too many levels of symbolic links
      
      $ mount -t overlay none mnt ...
              -o lowerdir=lower,upperdir=upper/0/u,workdir=upper/0/w
      $ mv base/upper/0/ base/lower/
      $ find mnt/0
        mnt/0
        mnt/0/w
        find: 'mnt/0/w/work': Too many levels of symbolic links
        find: 'mnt/0/u': Too many levels of symbolic links
      
      Reported-by: syzbot+9c69c282adc4edd2b540@syzkaller.appspotmail.com
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f1c5aa5e
    • A
      ovl: make i_ino consistent with st_ino in more cases · a00f405e
      Amir Goldstein 提交于
      [ Upstream commit 6dde1e42f497b2d4e22466f23019016775607947 ]
      
      Relax the condition that overlayfs supports nfs export, to require
      that i_ino is consistent with st_ino/d_ino.
      
      It is enough to require that st_ino and d_ino are consistent.
      
      This fixes the failure of xfstest generic/504, due to mismatch of
      st_ino to inode number in the output of /proc/locks.
      
      Fixes: 12574a9f ("ovl: consistent i_ino for non-samefs with xino")
      Cc: <stable@vger.kernel.org> # v4.19
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      a00f405e
    • A
      ovl: fix wrong flags check in FS_IOC_FS[SG]ETXATTR ioctls · d6623379
      Amir Goldstein 提交于
      [ Upstream commit 941d935ac7636911a3fd8fa80e758e52b0b11e20 ]
      
      The ioctl argument was parsed as the wrong type.
      
      Fixes: b21d9c435f93 ("ovl: support the FS_IOC_FS[SG]ETXATTR ioctls")
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d6623379
    • A
      ovl: support the FS_IOC_FS[SG]ETXATTR ioctls · 3cb5d7fa
      Amir Goldstein 提交于
      [ Upstream commit b21d9c435f935014d3e3fa6914f2e4fbabb0e94d ]
      
      They are the extended version of FS_IOC_FS[SG]ETFLAGS ioctls.
      xfs_io -c "chattr <flags>" uses the new ioctls for setting flags.
      
      This used to work in kernel pre v4.19, before stacked file ops
      introduced the ovl_ioctl whitelist.
      Reported-by: NDave Chinner <david@fromorbit.com>
      Fixes: d1d04ef8 ("ovl: stack file ops")
      Cc: <stable@vger.kernel.org> # v4.19
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      3cb5d7fa
  7. 22 6月, 2019 3 次提交
  8. 19 6月, 2019 2 次提交
    • R
      f2fs: fix to avoid accessing xattr across the boundary · ae3787d4
      Randall Huang 提交于
      [ Upstream commit 2777e654371dd4207a3a7f4fb5fa39550053a080 ]
      
      When we traverse xattr entries via __find_xattr(),
      if the raw filesystem content is faked or any hardware failure occurs,
      out-of-bound error can be detected by KASAN.
      Fix the issue by introducing boundary check.
      
      [   38.402878] c7   1827 BUG: KASAN: slab-out-of-bounds in f2fs_getxattr+0x518/0x68c
      [   38.402891] c7   1827 Read of size 4 at addr ffffffc0b6fb35dc by task
      [   38.402935] c7   1827 Call trace:
      [   38.402952] c7   1827 [<ffffff900809003c>] dump_backtrace+0x0/0x6bc
      [   38.402966] c7   1827 [<ffffff9008090030>] show_stack+0x20/0x2c
      [   38.402981] c7   1827 [<ffffff900871ab10>] dump_stack+0xfc/0x140
      [   38.402995] c7   1827 [<ffffff9008325c40>] print_address_description+0x80/0x2d8
      [   38.403009] c7   1827 [<ffffff900832629c>] kasan_report_error+0x198/0x1fc
      [   38.403022] c7   1827 [<ffffff9008326104>] kasan_report_error+0x0/0x1fc
      [   38.403037] c7   1827 [<ffffff9008325000>] __asan_load4+0x1b0/0x1b8
      [   38.403051] c7   1827 [<ffffff90085fcc44>] f2fs_getxattr+0x518/0x68c
      [   38.403066] c7   1827 [<ffffff90085fc508>] f2fs_xattr_generic_get+0xb0/0xd0
      [   38.403080] c7   1827 [<ffffff9008395708>] __vfs_getxattr+0x1f4/0x1fc
      [   38.403096] c7   1827 [<ffffff9008621bd0>] inode_doinit_with_dentry+0x360/0x938
      [   38.403109] c7   1827 [<ffffff900862d6cc>] selinux_d_instantiate+0x2c/0x38
      [   38.403123] c7   1827 [<ffffff900861b018>] security_d_instantiate+0x68/0x98
      [   38.403136] c7   1827 [<ffffff9008377db8>] d_splice_alias+0x58/0x348
      [   38.403149] c7   1827 [<ffffff900858d16c>] f2fs_lookup+0x608/0x774
      [   38.403163] c7   1827 [<ffffff900835eacc>] lookup_slow+0x1e0/0x2cc
      [   38.403177] c7   1827 [<ffffff9008367fe0>] walk_component+0x160/0x520
      [   38.403190] c7   1827 [<ffffff9008369ef4>] path_lookupat+0x110/0x2b4
      [   38.403203] c7   1827 [<ffffff900835dd38>] filename_lookup+0x1d8/0x3a8
      [   38.403216] c7   1827 [<ffffff900835eeb0>] user_path_at_empty+0x54/0x68
      [   38.403229] c7   1827 [<ffffff9008395f44>] SyS_getxattr+0xb4/0x18c
      [   38.403241] c7   1827 [<ffffff9008084200>] el0_svc_naked+0x34/0x38
      Signed-off-by: NRandall Huang <huangrandall@google.com>
      [Jaegeuk Kim: Fix wrong ending boundary]
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ae3787d4
    • W
      fs/ocfs2: fix race in ocfs2_dentry_attach_lock() · 6b9aa7ac
      Wengang Wang 提交于
      commit be99ca2716972a712cde46092c54dee5e6192bf8 upstream.
      
      ocfs2_dentry_attach_lock() can be executed in parallel threads against the
      same dentry.  Make that race safe.  The race is like this:
      
                  thread A                               thread B
      
      (A1) enter ocfs2_dentry_attach_lock,
      seeing dentry->d_fsdata is NULL,
      and no alias found by
      ocfs2_find_local_alias, so kmalloc
      a new ocfs2_dentry_lock structure
      to local variable "dl", dl1
      
                     .....
      
                                          (B1) enter ocfs2_dentry_attach_lock,
                                          seeing dentry->d_fsdata is NULL,
                                          and no alias found by
                                          ocfs2_find_local_alias so kmalloc
                                          a new ocfs2_dentry_lock structure
                                          to local variable "dl", dl2.
      
                                                         ......
      
      (A2) set dentry->d_fsdata with dl1,
      call ocfs2_dentry_lock() and increase
      dl1->dl_lockres.l_ro_holders to 1 on
      success.
                    ......
      
                                          (B2) set dentry->d_fsdata with dl2
                                          call ocfs2_dentry_lock() and increase
      				    dl2->dl_lockres.l_ro_holders to 1 on
      				    success.
      
                                                        ......
      
      (A3) call ocfs2_dentry_unlock()
      and decrease
      dl2->dl_lockres.l_ro_holders to 0
      on success.
                   ....
      
                                          (B3) call ocfs2_dentry_unlock(),
                                          decreasing
      				    dl2->dl_lockres.l_ro_holders, but
      				    see it's zero now, panic
      
      Link: http://lkml.kernel.org/r/20190529174636.22364-1-wen.gang.wang@oracle.comSigned-off-by: NWengang Wang <wen.gang.wang@oracle.com>
      Reported-by: NDaniel Sobe <daniel.sobe@nxp.com>
      Tested-by: NDaniel Sobe <daniel.sobe@nxp.com>
      Reviewed-by: NChangwei Ge <gechangwei@live.cn>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6b9aa7ac
  9. 15 6月, 2019 3 次提交
    • A
      ovl: support stacked SEEK_HOLE/SEEK_DATA · afec7068
      Amir Goldstein 提交于
      commit 9e46b840c7053b5f7a245e98cd239b60d189a96c upstream.
      
      Overlay file f_pos is the master copy that is preserved
      through copy up and modified on read/write, but only real
      fs knows how to SEEK_HOLE/SEEK_DATA and real fs may impose
      limitations that are more strict than ->s_maxbytes for specific
      files, so we use the real file to perform seeks.
      
      We do not call real fs for SEEK_CUR:0 query and for SEEK_SET:0
      requests.
      
      Fixes: d1d04ef8 ("ovl: stack file ops")
      Reported-by: NEddie Horng <eddiehorng.tw@gmail.com>
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      afec7068
    • J
      ovl: check the capability before cred overridden · 22dac6cc
      Jiufei Xue 提交于
      commit 98487de318a6f33312471ae1e2afa16fbf8361fe upstream.
      
      We found that it return success when we set IMMUTABLE_FL flag to a file in
      docker even though the docker didn't have the capability
      CAP_LINUX_IMMUTABLE.
      
      The commit d1d04ef8 ("ovl: stack file ops") and dab5ca8f ("ovl: add
      lsattr/chattr support") implemented chattr operations on a regular overlay
      file. ovl_real_ioctl() overridden the current process's subjective
      credentials with ofs->creator_cred which have the capability
      CAP_LINUX_IMMUTABLE so that it will return success in
      vfs_ioctl()->cap_capable().
      
      Fix this by checking the capability before cred overridden. And here we
      only care about APPEND_FL and IMMUTABLE_FL, so get these information from
      inode.
      
      [SzM: move check and call to underlying fs inside inode locked region to
      prevent two such calls from racing with each other]
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      22dac6cc
    • A
      nfsd: avoid uninitialized variable warning · 806e8395
      Arnd Bergmann 提交于
      [ Upstream commit 0ab88ca4bcf18ba21058d8f19220f60afe0d34d8 ]
      
      clang warns that 'contextlen' may be accessed without an initialization:
      
      fs/nfsd/nfs4xdr.c:2911:9: error: variable 'contextlen' is uninitialized when used here [-Werror,-Wuninitialized]
                                                                      contextlen);
                                                                      ^~~~~~~~~~
      fs/nfsd/nfs4xdr.c:2424:16: note: initialize the variable 'contextlen' to silence this warning
              int contextlen;
                            ^
                             = 0
      
      Presumably this cannot happen, as FATTR4_WORD2_SECURITY_LABEL is
      set if CONFIG_NFSD_V4_SECURITY_LABEL is enabled.
      Adding another #ifdef like the other two in this function
      avoids the warning.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      806e8395