1. 12 Jan, 2016 2 commits
    • xfs: handle dquot buffer readahead in log recovery correctly · 7d6a13f0
      Committed by Dave Chinner
      When we do dquot readahead in log recovery, we do not use a verifier
      as the underlying buffer may not have dquots in it. e.g. the
      allocation operation hasn't yet been replayed. Hence we do not want
      to fail recovery because we detect an operation to be replayed has
      not been run yet. This problem was addressed for inodes in commit
      d8914002 ("xfs: inode buffers may not be valid during recovery
      readahead") but the problem was not recognised to exist for dquots
      and their buffers as the dquot readahead did not have a verifier.
      
      The result of not using a verifier is that when the buffer is then
      next read to replay a dquot modification, the dquot buffer verifier
      will only be attached to the buffer if *readahead is not complete*.
      Hence we can read the buffer, replay the dquot changes and then add
      it to the delwri submission list without it having a verifier
      attached to it. This then generates warnings in xfs_buf_ioapply(),
      which catches and warns about this case.
      
      Fix this and make it handle the same readahead verifier error cases
      as for inode buffers by adding a new readahead verifier that has a
      write operation as well as a read operation that marks the buffer as
      not done if any corruption is detected.  Also make sure we don't run
      readahead if the dquot buffer has been marked as cancelled by
      recovery.
      
      This will result in readahead either succeeding and the buffer
      having a valid write verifier, or readahead failing and the buffer
      state requiring the subsequent read to resubmit the IO with the new
      verifier.  In either case, this will result in the buffer always
      ending up with a valid write verifier on it.
      
      Note: we also need to fix the inode buffer readahead error handling
      to mark the buffer with EIO. Brian noticed the code I copied from
      there was wrong during review, so fix it at the same time. Add comments
      linking the two functions that handle readahead verifier errors
      together so we don't forget this behavioural link in future.
      
      cc: <stable@vger.kernel.org> # 3.12 - current
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
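The readahead behaviour described in this commit can be modelled in userspace. This is only a sketch under stated assumptions: `buf_model`, `buf_ops_model` and every function name below are hypothetical stand-ins, not the kernel's structures. It shows a readahead verifier pair whose read side flags a bad buffer (not done, -EIO) without failing recovery, and whose write side is always attached.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct buf_model;

/* Hypothetical verifier table: one read hook and one write hook. */
struct buf_ops_model {
	void (*verify_read)(struct buf_model *bp);
	void (*verify_write)(struct buf_model *bp);
};

/* Hypothetical buffer: just the state this commit message talks about. */
struct buf_model {
	bool valid_dquots;	/* stands in for "contents pass dquot checks" */
	bool done;		/* models the buffer done flag */
	int error;		/* models bp->b_error */
	const struct buf_ops_model *ops;
};

/* Readahead read verifier: flag the buffer instead of failing recovery. */
static void dquot_ra_verify_read(struct buf_model *bp)
{
	if (!bp->valid_dquots) {
		bp->done = false;
		bp->error = -EIO;
	}
}

/* Write verifier: placeholder for the real pre-write checks. */
static void dquot_verify_write(struct buf_model *bp)
{
	(void)bp;
}

static const struct buf_ops_model dquot_ra_ops = {
	.verify_read = dquot_ra_verify_read,
	.verify_write = dquot_verify_write,
};

/* Issue readahead unless recovery cancelled the buffer; true if IO ran. */
static bool dquot_readahead(struct buf_model *bp, bool cancelled)
{
	if (cancelled)
		return false;
	bp->ops = &dquot_ra_ops;
	bp->done = true;
	bp->error = 0;
	bp->ops->verify_read(bp);	/* models IO completion */
	return true;
}
```

Whether readahead succeeds or fails in this model, `bp->ops->verify_write` is set afterwards, which is the property the commit message is after.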
    • xfs: inode recovery readahead can race with inode buffer creation · b79f4a1c
      Committed by Dave Chinner
      When we do inode readahead in log recovery, we can do the
      readahead before we've replayed the icreate transaction that stamps
      the buffer with inode cores. The inode readahead verifier catches
      this and marks the buffer as !done to indicate that it doesn't yet
      contain valid inodes.
      
      In adding buffer error notification (i.e. setting b_error = -EIO at
      the same time as we clear the done flag) to such a readahead
      verifier failure, we can then get subsequent inode recovery failing
      with this error:
      
      XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32
      
      This occurs when readahead completion races with icreate item replay
      such as:
      
      	inode readahead
      		find buffer
      		lock buffer
      		submit RA io
      	....
      	icreate recovery
      	    xfs_trans_get_buffer
      		find buffer
      		lock buffer
      		<blocks on RA completion>
      	.....
      	<ra completion>
      		fails verifier
      		clear XBF_DONE
      		set bp->b_error = -EIO
      		release and unlock buffer
      	<icreate gains lock>
      	icreate initialises buffer
      	marks buffer as done
      	adds buffer to delayed write queue
      	releases buffer
      
      At this point, we have an initialised inode buffer that is up to
      date but has an -EIO state registered against it. When we finally
      get to recovering an inode in that buffer:
      
      	inode item recovery
      	    xfs_trans_read_buffer
      		find buffer
      		lock buffer
      		sees XBF_DONE is set, returns buffer
      	    sees bp->b_error is set
      		fail log recovery!
      
      Essentially, we need xfs_trans_get_buf_map() to clear the error status of
      the buffer when doing a lookup. This function returns uninitialised
      buffers, so the buffer returned can not be in an error state and
      none of the code that uses this function expects b_error to be set
      on return. Indeed, there is an ASSERT(!bp->b_error); in the
      transaction case in xfs_trans_get_buf_map() that would have caught
      this if log recovery used transactions....
      
      This patch firstly changes the inode readahead failure to set -EIO
      on the buffer, and secondly changes xfs_buf_get_map() to never
      return a buffer with an error state set so this first change doesn't
      cause unexpected log recovery failures.
      
      cc: <stable@vger.kernel.org> # 3.12 - current
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
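The race in this commit can be replayed step by step in a small userspace model. This is a sketch, assuming a hypothetical `buf_model` with only a done flag and an error field; the function names are illustrative and are not the kernel's. It shows why the "get" lookup has to clear a stale error before icreate replay initialises the buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the buffer: only the two fields the race touches. */
struct buf_model {
	bool done;	/* models XBF_DONE */
	int error;	/* models bp->b_error */
};

/* Readahead completion when the verifier fails: not done, error recorded. */
static void ra_verifier_failed(struct buf_model *bp)
{
	bp->done = false;
	bp->error = -EIO;
}

/* Models the "get" lookup after the fix: never hand back a stale error. */
static struct buf_model *buf_get(struct buf_model *bp)
{
	bp->error = 0;
	return bp;
}

/* Models icreate replay: look the buffer up, initialise it, mark it done. */
static void icreate_replay(struct buf_model *bp)
{
	bp = buf_get(bp);
	bp->done = true;
}

/* Models the later read lookup: a done buffer is returned without new IO. */
static int buf_read(struct buf_model *bp)
{
	if (bp->done)
		return bp->error;	/* cached buffer, error passed through */
	bp->error = 0;			/* the IO would be resubmitted here */
	bp->done = true;
	return 0;
}
```

Without the `bp->error = 0` line in `buf_get()`, the final `buf_read()` in this model would return -EIO for a buffer that is fully initialised, which is the failure the log message shows.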
  2. 11 Jan, 2016 1 commit
    • xfs: eliminate committed arg from xfs_bmap_finish · f6106efa
      Committed by Eric Sandeen
      Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
      associated comments were replicated several times across
      the attribute code, all dealing with what to do if the
      transaction was or wasn't committed.
      
      And in that replicated code, an ASSERT() test of an
      uninitialized variable occurs in several locations:
      
      	error = xfs_attr_thing(&args);
      	if (!error) {
      		error = xfs_bmap_finish(&args.trans, args.flist,
      					&committed);
      	}
      	if (error) {
      		ASSERT(committed);
      
      If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
      never set "committed", and then test it in the ASSERT.
      
      Fix this up by moving the committed state internal to xfs_bmap_finish,
      and add a new inode argument.  If an inode is passed in, it is passed
      through to __xfs_trans_roll() and joined to the transaction there if
      the transaction was committed.
      
      xfs_qm_dqalloc() was a little unique in that it called bjoin rather
      than ijoin, but as Dave points out we can detect the committed state
      by checking whether (*tpp != tp).
      
      Addresses-Coverity-Id: 102360
      Addresses-Coverity-Id: 102361
      Addresses-Coverity-Id: 102363
      Addresses-Coverity-Id: 102364
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
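The (*tpp != tp) observation from this commit can be illustrated with a minimal userspace sketch. `trans_model`, `finish_model()` and `was_committed()` are hypothetical names invented for the example. The point is that comparing the transaction pointer before and after the call needs no out-parameter, so there is nothing left uninitialised when an earlier step fails.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a transaction handle. */
struct trans_model {
	int id;
};

/*
 * Models a "finish" step that may roll the transaction: when it commits,
 * *tpp is switched to a fresh transaction; otherwise it is left alone.
 */
static int finish_model(struct trans_model **tpp, struct trans_model *next,
			bool needs_roll)
{
	if (needs_roll)
		*tpp = next;
	return 0;
}

/* Caller-side check: the committed state falls out of a pointer compare. */
static bool was_committed(const struct trans_model *before,
			  const struct trans_model *after)
{
	return before != after;
}
```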
  3. 08 Jan, 2016 2 commits
    • xfs: bmapbt checking on debug kernels too expensive · e3543819
      Committed by Dave Chinner
      For large sparse or fragmented files, checking every single entry in
      the bmapbt on every operation is prohibitively expensive, especially
      as such checks rarely discover problems during normal operations on
      high extent count files. Our regression tests don't tend to exercise
      files with hundreds of thousands to millions of extents, so mostly
      this isn't noticed.
      
      However, trying to run things like xfs_mdrestore of large filesystem
      dumps on a debug kernel quickly becomes impossible as the CPU is
      completely burnt up repeatedly walking the sparse file bmapbt that
      is generated for every allocation that is made.
      
      Hence, if the file has more than 10,000 extents, just don't bother
      with walking the tree to check it exhaustively. The btree code has
      checks that ensure that the newly inserted/removed/modified record
      is correctly ordered, so the entire tree walk in these cases has
      limited additional value.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
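The threshold described in this commit amounts to a guard in front of the exhaustive walk. A minimal sketch, assuming a flat array of extent start offsets in place of the real btree; the macro and function names are hypothetical, and only the 10,000 figure comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The 10,000 figure is from the commit message; the macro name is made up. */
#define CHECK_MAX_EXTENTS 10000

/* Walk every record and confirm strictly ascending start offsets. */
static bool walk_and_check(const unsigned long *startoff, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (startoff[i - 1] >= startoff[i])
			return false;
	return true;
}

/* Debug check: skip the exhaustive walk for very large extent counts. */
static bool debug_check_extents(const unsigned long *startoff, size_t n)
{
	if (n > CHECK_MAX_EXTENTS)
		return true;	/* too expensive; rely on per-record checks */
	return walk_and_check(startoff, n);
}
```

The cost of the guard is that a misordered tree above the threshold goes unnoticed by this check, which the commit accepts because the per-record btree checks still run.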
    • xfs: add tracepoints to readpage calls · 121e213e
      Committed by Dave Chinner
      This allows us to see page cache driven readahead in action as it
      passes through XFS. This helps to understand buffered read
      throughput problems such as readahead IO sizes being too small
      for the underlying device to reach max throughput.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  4. 05 Jan, 2016 2 commits
    • xfs: debug mode log record crc error injection · 609adfc2
      Committed by Brian Foster
      XFS now uses CRC verification over a limited section of the log to
      detect torn writes prior to a crash. This is difficult to test directly
      due to the timing and hardware requirements to cause a short write.
      
      Add a mechanism to inject CRC errors into log records to facilitate
      testing torn write detection during log recovery. This mechanism is
      dangerous and can result in filesystem corruption. Thus, it is only
      available in DEBUG mode for testing/development purposes. Set a non-zero
      value to the following sysfs entry to enable error injection:
      
      	/sys/fs/xfs/<dev>/log/log_badcrc_factor
      
      Once enabled, XFS intentionally writes an invalid CRC to a log record at
      some random point in the future based on the provided frequency. The
      filesystem immediately shuts down once the record has been written to
      the physical log to prevent metadata writeback (e.g., AIL insertion)
      once the log write completes. This helps reasonably simulate a torn
      write to the log as the affected record must be safe to discard. The
      next mount after the intentional shutdown requires log recovery and
      should detect and recover from the torn write.
      
      Note again that this _will_ result in data loss or worse. For testing
      and development purposes only!
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
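One way to read "at some random point in the future based on the provided frequency" is a one-in-factor chance per log record. The sketch below assumes that reading; the function name, the caller-supplied random value and the XOR mask are all illustrative choices, not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>

/*
 * factor == 0 disables injection. Otherwise a record is corrupted when the
 * caller-supplied random value is a multiple of factor, i.e. roughly one
 * record in every `factor`. The XOR mask is an arbitrary illustrative value.
 * *shutdown is set when the caller must shut the filesystem down.
 */
static uint32_t maybe_corrupt_crc(uint32_t crc, uint32_t factor, uint32_t rnd,
				  int *shutdown)
{
	*shutdown = 0;
	if (factor != 0 && rnd % factor == 0) {
		*shutdown = 1;
		return crc ^ 0xAAAAAAAAu;
	}
	return crc;
}
```

Passing the random value in, instead of calling a generator inside the function, keeps the sketch deterministic and easy to test.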
    • xfs: detect and trim torn writes during log recovery · 7088c413
      Committed by Brian Foster
      Certain types of storage, such as persistent memory, do not provide
      sector atomicity for writes. This means that if a crash occurs while XFS
      is writing log records, only part of those records might make it to the
      storage. This is problematic because log recovery uses the cycle value
      packed at the top of each log block to locate the head/tail of the log.
      This can lead to CRC verification failures during log recovery and an
      unmountable fs for a filesystem that is otherwise consistent.
      
      Update log recovery to incorporate log record CRC verification as part
      of the head/tail discovery process. Once the head is located via the
      traditional algorithm, run a CRC-only pass over the records up to the
      head of the log. If CRC verification fails, assume that the records are
      torn as a matter of policy and trim the head block back to the start of
      the first bad record.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
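The CRC-only pass and head trim from this commit can be sketched over an in-memory list of records. Everything here is an assumption made for illustration: `rec_model` is a hypothetical record layout, and `toy_sum()` is a simple stand-in for the real log record CRC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log record: start block, payload and its stored checksum. */
struct rec_model {
	unsigned int start_blk;
	const unsigned char *data;
	size_t len;
	uint32_t stored_sum;
};

/* Toy checksum standing in for the real log record CRC (an assumption). */
static uint32_t toy_sum(const unsigned char *p, size_t len)
{
	uint32_t s = 0;

	for (size_t i = 0; i < len; i++)
		s = s * 31u + p[i];
	return s;
}

/*
 * Checksum-only pass over the records leading up to the head. Returns the
 * start block of the first bad record, or the original head block if every
 * record verifies.
 */
static unsigned int trim_torn_head(const struct rec_model *recs, size_t n,
				   unsigned int head_blk)
{
	for (size_t i = 0; i < n; i++)
		if (toy_sum(recs[i].data, recs[i].len) != recs[i].stored_sum)
			return recs[i].start_blk;
	return head_blk;
}
```

Records after the first bad one are discarded along with it, matching the policy in the commit of treating a checksum failure near the head as a torn write.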
  5. 04 Jan, 2016 19 commits
  6. 14 Nov, 2015 1 commit
  7. 11 Nov, 2015 1 commit
    • vfs: remove unused wrapper block_page_mkwrite() · 5c500029
      Committed by Ross Zwisler
      The function currently called "__block_page_mkwrite()" used to be called
      "block_page_mkwrite()" until a wrapper for this function was added by:
      
      commit 24da4fab ("vfs: Create __block_page_mkwrite() helper passing
      	error values back")
      
      This wrapper, the current "block_page_mkwrite()", is currently unused.
      __block_page_mkwrite() is used directly by ext4, nilfs2 and xfs.
      
      Remove the unused wrapper, rename __block_page_mkwrite() back to
      block_page_mkwrite() and update the comment above block_page_mkwrite().
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  8. 10 Nov, 2015 3 commits
    • xfs: give all workqueues rescuer threads · 7a29ac47
      Committed by Chris Mason
      We're consistently hitting deadlocks here with XFS on recent kernels.
      After some digging through the crash files, it looks like everyone in
      the system is waiting for XFS to reclaim memory.
      
      Something like this:
      
      PID: 2733434  TASK: ffff8808cd242800  CPU: 19  COMMAND: "java"
       #0 [ffff880019c53588] __schedule at ffffffff818c4df2
       #1 [ffff880019c535d8] schedule at ffffffff818c5517
       #2 [ffff880019c535f8] _xfs_log_force_lsn at ffffffff81316348
       #3 [ffff880019c53688] xfs_log_force_lsn at ffffffff813164fb
       #4 [ffff880019c536b8] xfs_iunpin_wait at ffffffff8130835e
       #5 [ffff880019c53728] xfs_reclaim_inode at ffffffff812fd453
       #6 [ffff880019c53778] xfs_reclaim_inodes_ag at ffffffff812fd8c7
       #7 [ffff880019c53928] xfs_reclaim_inodes_nr at ffffffff812fe433
       #8 [ffff880019c53958] xfs_fs_free_cached_objects at ffffffff8130d3b9
       #9 [ffff880019c53968] super_cache_scan at ffffffff811a6f73
      #10 [ffff880019c539c8] shrink_slab at ffffffff811460e6
      #11 [ffff880019c53aa8] shrink_zone at ffffffff8114a53f
      #12 [ffff880019c53b48] do_try_to_free_pages at ffffffff8114a8ba
      #13 [ffff880019c53be8] try_to_free_pages at ffffffff8114ad5a
      #14 [ffff880019c53c78] __alloc_pages_nodemask at ffffffff8113e1b8
      #15 [ffff880019c53d88] alloc_kmem_pages_node at ffffffff8113e671
      #16 [ffff880019c53dd8] copy_process at ffffffff8104f781
      #17 [ffff880019c53ec8] do_fork at ffffffff8105129c
      #18 [ffff880019c53f38] sys_clone at ffffffff810515b6
      #19 [ffff880019c53f48] stub_clone at ffffffff818c8e4d
      
      xfs_log_force_lsn is waiting for logs to get cleaned, which is waiting
      for IO, which is waiting for workers to complete the IO which is waiting
      for worker threads that don't exist yet:
      
      PID: 2752451  TASK: ffff880bd6bdda00  CPU: 37  COMMAND: "kworker/37:1"
       #0 [ffff8808d20abbb0] __schedule at ffffffff818c4df2
       #1 [ffff8808d20abc00] schedule at ffffffff818c5517
       #2 [ffff8808d20abc20] schedule_timeout at ffffffff818c7c6c
       #3 [ffff8808d20abcc0] wait_for_completion_killable at ffffffff818c6495
       #4 [ffff8808d20abd30] kthread_create_on_node at ffffffff8106ec82
       #5 [ffff8808d20abdf0] create_worker at ffffffff8106752f
       #6 [ffff8808d20abe40] worker_thread at ffffffff810699be
       #7 [ffff8808d20abec0] kthread at ffffffff8106ef59
       #8 [ffff8808d20abf50] ret_from_fork at ffffffff818c8ac8
      
      I think we should be using WQ_MEM_RECLAIM to make sure this thread
      pool makes progress when we're not able to allocate new workers.
      
      [dchinner: make all workqueues WQ_MEM_RECLAIM]
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: fix log recovery op header validation assert · 848ccfc8
      Committed by Brian Foster
      Commit 89cebc84 ("xfs: validate transaction header length on log
      recovery") added additional validation of the on-disk op header length
      to protect from buffer overflow during log recovery. It accounts for the
      fact that the transaction header can be split across multiple op
      headers. It added an assert for when this occurs that verifies the
      length of the second part of a split transaction header is less than a
      full transaction header. In other words, it expects that the first op
      header of a split transaction header includes at least some portion of
      the transaction header.
      
      This expectation is not always valid as a zero-length op header can
      exist for the first op header of a split transaction header (see
      xlog_recover_add_to_trans() for details). This means that the second op
      header can have a valid, full length transaction header and thus the
      full header is copied in xlog_recover_add_to_cont_trans(). Fix the
      assert in xlog_recover_add_to_cont_trans() to handle this case correctly
      and require that the op header length is less than or equal to a full
      transaction header.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
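The assert change in this commit comes down to one comparison operator. A small sketch, assuming a made-up header size (`TRANS_HDR_SIZE` is hypothetical; the kernel uses the size of its transaction header structure):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical size of a full transaction header, for illustration only. */
#define TRANS_HDR_SIZE 16u

/* Old check: the continuation must hold strictly less than a full header. */
static bool cont_len_ok_old(size_t len)
{
	return len < TRANS_HDR_SIZE;
}

/* Fixed check: a full header is valid when the first op header was empty. */
static bool cont_len_ok_new(size_t len)
{
	return len <= TRANS_HDR_SIZE;
}
```

The only input on which the two checks disagree is a continuation of exactly one full header, which is the zero-length first op header case the commit describes.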
    • xfs: Fix error path in xfs_get_acl · edfb8ebc
      Committed by Andreas Gruenbacher
      Error codes from xfs_attr_get other than -ENOATTR were not properly
      reported.  Fix that.
      
      In addition, the declaration of struct xfs_inode in xfs_acl.h isn't needed.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
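The error-path rule from this commit can be written as a small mapping function. This is a sketch with a hypothetical name, `acl_lookup_result()`, and it mirrors the kernel's definition of ENOATTR as ENODATA for userspace builds that lack it.

```c
#include <assert.h>
#include <errno.h>

/* The kernel defines ENOATTR as ENODATA; mirrored here for userspace. */
#ifndef ENOATTR
#define ENOATTR ENODATA
#endif

/*
 * Map an attribute-lookup result to what an ACL getter should report:
 *   -ENOATTR    -> 0 with *has_acl = 0 (no ACL is not an error)
 *   other error -> that error, passed through to the caller
 *   0           -> 0 with *has_acl = 1
 */
static int acl_lookup_result(int attr_error, int *has_acl)
{
	*has_acl = 0;
	if (attr_error == -ENOATTR)
		return 0;
	if (attr_error)
		return attr_error;
	*has_acl = 1;
	return 0;
}
```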
  9. 07 Nov, 2015 1 commit
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Committed by Mel Gorman
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
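The flag split described in this commit can be modelled with plain bitmasks. In this sketch the bit values and the `M_` prefixed names are hypothetical; only the flag semantics come from the commit message. It shows why testing for any __GFP_WAIT bit now gives a false positive for a mempool-backed caller, and why the blocking helper tests the direct reclaim bit alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values are hypothetical; only the flag meanings come from the commit. */
#define M_GFP_DIRECT_RECLAIM	0x04u
#define M_GFP_KSWAPD_RECLAIM	0x08u

/* Per the commit: __GFP_WAIT now means direct reclaim plus waking kswapd. */
#define M_GFP_WAIT	(M_GFP_DIRECT_RECLAIM | M_GFP_KSWAPD_RECLAIM)

/* Model of the helper: a caller may block only if it allows direct reclaim. */
static bool allow_blocking(unsigned int gfp)
{
	return (gfp & M_GFP_DIRECT_RECLAIM) != 0;
}

/* Mempool-backed caller: keep kswapd wakeups, drop direct reclaim. */
static unsigned int mempool_mask(unsigned int gfp)
{
	return (gfp & ~M_GFP_DIRECT_RECLAIM) | M_GFP_KSWAPD_RECLAIM;
}

/* The historical test: any __GFP_WAIT bit set. Kept to show the problem. */
static bool old_wait_test(unsigned int gfp)
{
	return (gfp & M_GFP_WAIT) != 0;
}
```

For a mask produced by `mempool_mask()`, `old_wait_test()` still returns true because the kswapd bit is set, while `allow_blocking()` correctly returns false.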
  10. 03 Nov, 2015 8 commits
    • xfs: optimise away log forces on timestamp updates for fdatasync · fc0561ce
      Committed by Dave Chinner
      xfs: timestamp updates cause excessive fdatasync log traffic
      
      Sage Weil reported that a ceph test workload was writing to the
      log on every fdatasync during an overwrite workload. Event tracing
      showed that the only metadata modification being made was the
      timestamp updates during the write(2) syscall, but fdatasync(2)
      is supposed to ignore them. The key observation was that the
      transactions in the log all looked like this:
      
      INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32
      
      And contained a flags field of 0x45 or 0x85, and had data and
      attribute forks following the inode core. This means that the
      timestamp updates were triggering dirty relogging of previously
      logged parts of the inode that hadn't yet been flushed back to
      disk.
      
      There are two parts to this problem. The first is that XFS relogs
      dirty regions in subsequent transactions, so it carries around the
      fields that have been dirtied since the last time the inode was
      written back to disk, not since the last time the inode was forced
      into the log.
      
      The second part is that on v5 filesystems, the inode change count
      update during inode dirtying also sets the XFS_ILOG_CORE flag, so
      on v5 filesystems this makes a timestamp update dirty the entire
      inode.
      
      As a result when fdatasync is run, it looks at the dirty fields in
      the inode, and sees more than just the timestamp flag, even though
      the only metadata change since the last fdatasync was just the
      timestamps. Hence we force the log on every subsequent fdatasync
      even though it is not needed.
      
      To fix this, add a new field to the inode log item that tracks
      changes since the last time fsync/fdatasync forced the log to flush
      the changes to the journal. This flag is updated when we dirty the
      inode, but we do it before updating the change count so it does not
      carry the "core dirty" flag from timestamp updates. The fields are
      zeroed when the inode is marked clean (due to writeback/freeing) or
      when an fsync/fdatasync forces the log. Hence if we only dirty the
      timestamps on the inode between fsync/fdatasync calls, the fdatasync
      will not trigger another log force.
      
      Over 100 runs of the test program:
      
      Ext4 baseline:
      	runtime: 1.63s +/- 0.24s
      	avg lat: 1.59ms +/- 0.24ms
      	iops: ~2000
      
      XFS, vanilla kernel:
      	runtime: 2.45s +/- 0.18s
      	avg lat: 2.39ms +/- 0.18ms
      	log forces: ~400/s
      	iops: ~1000
      
      XFS, patched kernel:
      	runtime: 1.49s +/- 0.26s
      	avg lat: 1.46ms +/- 0.25ms
      	log forces: ~30/s
      	iops: ~1500
      Reported-by: Sage Weil <sage@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
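The extra tracking field described in this commit can be modelled with two bitmasks. This is a sketch under assumptions: `ili_model`, the flag bits and the function names are hypothetical, and the model keeps only the ordering detail the commit calls out, namely that the fsync mask is updated before the change count update adds the core flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for the model. */
#define F_TIMESTAMP	0x1u
#define F_CORE		0x2u

struct ili_model {
	unsigned int fields;		/* dirty since last writeback */
	unsigned int fsync_fields;	/* dirty since last fsync log force */
};

/*
 * Dirty the inode. The fsync mask records the caller's flags first, so the
 * core flag added by the v5 change count update never reaches it.
 */
static void log_inode(struct ili_model *ili, unsigned int flags, bool v5)
{
	ili->fsync_fields |= flags;
	if (v5)
		flags |= F_CORE;
	ili->fields |= flags;
}

/* fdatasync: force the log only if more than timestamps changed. */
static bool fdatasync_model(struct ili_model *ili)
{
	bool force = (ili->fsync_fields & ~F_TIMESTAMP) != 0;

	if (force)
		ili->fsync_fields = 0;	/* zeroed when the log is forced */
	return force;
}
```

In this model a timestamp-only update on a v5 filesystem still sets the core bit in `fields`, but `fsync_fields` holds only the timestamp bit, so the next `fdatasync_model()` call skips the log force.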
    • xfs: don't leak uuid table on rmmod · af3b6382
      Committed by Darrick J. Wong
      Don't leak the UUID table when the module is unloaded.
      (Found with kmemleak.)
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: invalidate cached acl if set via ioctl · 47e1bf64
      Committed by Andreas Gruenbacher
      Setting or removing the "SGI_ACL_[FILE|DEFAULT]" attributes via the
      XFS_IOC_ATTRMULTI_BY_HANDLE ioctl completely bypasses the POSIX ACL
      infrastructure, like setting the "trusted.SGI_ACL_[FILE|DEFAULT]" xattrs
      did until commit 6caa1056.  Similar to that commit, invalidate cached
      acls when setting/removing them via the ioctl as well.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: Plug memory leak in xfs_attrmulti_attr_set · 09cb22d2
      Committed by Andreas Gruenbacher
      When setting attributes via XFS_IOC_ATTRMULTI_BY_HANDLE, the user-space
      buffer is copied into a new kernel-space buffer via memdup_user; that
      buffer then isn't freed.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: Validate the length of on-disk ACLs · 86a21c79
      Committed by Andreas Gruenbacher
      In xfs_acl_from_disk, instead of trusting that xfs_acl.acl_cnt is correct,
      make sure that the length of the attributes is correct as well.  Also, turn
      the aclp parameter into a const pointer.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
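Checking the attribute length against the entry count, as this commit describes, might look like the sketch below. The layout is an assumption for illustration: `acl_hdr_model` and `acl_entry_model` are hypothetical structures, and the count is read in host byte order, not converted from the on-disk format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: a count followed by fixed-size entries. */
struct acl_hdr_model {
	uint32_t cnt;
};

struct acl_entry_model {
	uint32_t tag;
	uint32_t id;
	uint16_t perm;
	uint16_t pad;
};

/*
 * Accept the blob only if it is big enough to hold the header and its
 * length matches exactly what the stored count implies.
 */
static bool acl_blob_valid(const void *blob, size_t len)
{
	struct acl_hdr_model hdr;

	if (len < sizeof(hdr))
		return false;
	memcpy(&hdr, blob, sizeof(hdr));
	return len == sizeof(hdr) +
		      (size_t)hdr.cnt * sizeof(struct acl_entry_model);
}
```

A count that claims more entries than the attribute actually holds is rejected before any entry is read, so the parser never walks past the end of the buffer.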
    • xfs: invalidate cached acl if set directly via xattr · 67d8e04e
      Committed by Brian Foster
      ACLs are stored as extended attributes of the inode to which they apply.
      XFS converts the standard "system.posix_acl_[access|default]" attribute
      names used to control ACLs to "trusted.SGI_ACL_[FILE|DEFAULT]" as stored
      on-disk. These xattrs are directly exposed in on-disk format via
      getxattr/setxattr, without any ACL aware code in the path to perform
      validation, etc. This is partly historical and supports backup/restore
      applications such as xfsdump to back up and restore the binary blob that
      represents ACLs as-is.
      
      Andreas reports that the ACLs observed via the getfacl interface are not
      consistent when ACLs are set directly via the setxattr path. This occurs
      because the ACLs are cached in-core against the inode and the xattr path
      has no knowledge that the operation relates to ACLs.
      
      Update the xattr set codepath to trap writes of the special XFS ACL
      attributes and invalidate the associated cached ACL when this occurs.
      This ensures that the correct ACLs are used on a subsequent operation
      through the actual ACL interface.
      
      Note that this does not update or add support for setting the ACL xattrs
      directly beyond the restore use case that requires a correctly formatted
      binary blob and to restore a consistent i_mode at the same time. It is
      still possible for a root user to set an invalid or inconsistent (with
      i_mode) ACL blob on-disk and potentially cause corruption.
      
      [ With fixes from Andreas Gruenbacher. ]
      Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: xfs_filemap_pmd_fault treats read faults as write faults · 13ad4fe3
      Committed by Dave Chinner
      The code initially committed didn't have the same checks for write
      faults as the dax_pmd_fault code and hence treats all faults as
      write faults. We can get read faults through this path because there
      is no pmd_mkwrite path for write faults similar to the normal page
      fault path. Hence we need to ensure that we only do c/mtime updates
      on write faults, and freeze protection is unnecessary for read
      faults.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: add ->pfn_mkwrite support for DAX · 3af49285
      Committed by Dave Chinner
      ->pfn_mkwrite support is needed so that when a page with allocated
      backing store takes a write fault we can check that the fault has
      not raced with a truncate and is pointing to a region beyond the
      current end of file.
      
      This also allows us to update the timestamp on the inode, too, which
      fixes a generic/080 failure.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>