1. 03 Nov, 2015 5 commits
    • D
      xfs: add ->pfn_mkwrite support for DAX · 3af49285
      Dave Chinner committed
      ->pfn_mkwrite support is needed so that when a page with allocated
      backing store takes a write fault we can check that the fault has
      not raced with a truncate and is pointing to a region beyond the
      current end of file.
      
      This also allows us to update the timestamp on the inode, too, which
      fixes a generic/080 failure.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      3af49285
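The truncate check described in 3af49285 can be modelled in user space. This is a minimal sketch, not the kernel code: `PAGE_SIZE`, `pfn_mkwrite_check` and the string return values are hypothetical names chosen for the illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this model


def pfn_mkwrite_check(page_index: int, file_size: int) -> str:
    """Return "SIGBUS" if the faulting page lies beyond EOF, else "OK".

    page_index is the page offset of the write fault within the file;
    file_size is the current (possibly just truncated) size in bytes.
    """
    # Number of pages needed to cover the file, rounding up.
    pages_in_file = (file_size + PAGE_SIZE - 1) // PAGE_SIZE
    if page_index >= pages_in_file:
        return "SIGBUS"  # the fault raced with a truncate
    return "OK"          # safe to proceed and update the timestamps
```

A fault on page 1 of a file that was just truncated to 4096 bytes would be rejected, while the same fault on a 4097-byte file would go ahead.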
    • D
      xfs: DAX does not use IO completion callbacks · 01a155e6
      Dave Chinner committed
      For DAX, we are now doing block zeroing during allocation. This
      means we no longer need a special DAX fault IO completion callback
      to do unwritten extent conversion. Because mmap never extends the
      file size (it SEGVs the process) we don't need a callback to update
      the file size, either. Hence we can remove the completion callbacks
      from the __dax_fault and __dax_mkwrite calls.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      01a155e6
    • D
      xfs: Don't use unwritten extents for DAX · 1ca19157
      Dave Chinner committed
      DAX has a page fault serialisation problem with block allocation.
      Because it allows concurrent page faults and does not have a page
      lock to serialise faults to the same page, it can get two concurrent
      faults to the page that race.
      
      When two read faults race, this isn't a huge problem as the data
      underlying the page is not changing and so "detect and drop" works
      just fine. The issues are to do with write faults.
      
      When two write faults occur, we serialise block allocation in
      get_blocks() so only one fault will allocate the extent. It will,
      however, be marked as an unwritten extent, and that is where the
      problem lies - the DAX fault code cannot differentiate between a
      block that was just allocated and a block that was preallocated and
      needs zeroing. The result is that both write faults end up zeroing
      the block and attempting to convert it back to written.
      
      The problem is that the first fault can zero and convert before the
      second fault starts zeroing, resulting in the zeroing for the second
      fault overwriting the data that the first fault wrote with zeros.
      The second fault then attempts to convert the unwritten extent,
      which is then a no-op because it's already written. Data loss occurs
      as a result of this race.
      
      Because there is no sane locking construct in the page fault code
      that we can use for serialisation across the page faults, we need to
      ensure block allocation and zeroing occurs atomically in the
      filesystem. This means we can still take concurrent page faults and
      the only time they will serialise is in the filesystem
      mapping/allocation callback. The page fault code will always see
      written, initialised extents, so we will be able to remove the
      unwritten extent handling from the DAX code when all filesystems are
      converted.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      1ca19157
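The losing interleaving described in 1ca19157 can be replayed deterministically. The sketch below is a toy model under stated assumptions: `Block`, `racy_faults` and `atomic_alloc_faults` are hypothetical names, and strings stand in for block contents.

```python
class Block:
    """One DAX-mapped block: its contents plus the unwritten-extent flag."""

    def __init__(self):
        self.data = "stale"
        self.unwritten = True


def racy_faults() -> str:
    """Replay two write faults that both see the extent as unwritten."""
    blk = Block()
    # Both faults look up the mapping before either has converted it.
    fault1_sees_unwritten = blk.unwritten
    fault2_sees_unwritten = blk.unwritten
    # Fault 1 zeroes and converts, then the application writes its data.
    if fault1_sees_unwritten:
        blk.data = "zero"
        blk.unwritten = False
    blk.data = "user data"
    # Fault 2 acts on its earlier view and zeroes the block again.
    if fault2_sees_unwritten:
        blk.data = "zero"      # overwrites "user data"
        blk.unwritten = False  # conversion is a no-op by now
    return blk.data


def atomic_alloc_faults() -> str:
    """Allocation and zeroing happen together inside the filesystem."""
    blk = Block()
    blk.data, blk.unwritten = "zero", False  # one step, under the fs lock
    fault2_sees_unwritten = blk.unwritten    # faults only see written extents
    blk.data = "user data"                   # write through fault 1
    if fault2_sees_unwritten:                # skipped: nothing left to zero
        blk.data = "zero"
    return blk.data
```

In the first function the data written through the first fault is lost; in the second it survives because no fault ever observes an unwritten extent.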
    • D
      xfs: introduce BMAPI_ZERO for allocating zeroed extents · 3fbbbea3
      Dave Chinner committed
      To enable DAX to do atomic allocation of zeroed extents, we need to
      drive the block zeroing deep into the allocator. Because
      xfs_bmapi_write() can return merged extents on allocation that were
      only partially allocated (i.e. requested range spans allocated and
      hole regions, allocation into the hole was contiguous), we cannot
      zero the extent returned from xfs_bmapi_write() as that can
      overwrite existing data with zeros.
      
      Hence we have to drive the extent zeroing into the allocation code,
      prior to where we merge the extents into the BMBT and return the
      resultant map. This means we need to propagate this need down to
      the xfs_alloc_vextent() and issue the block zeroing at this point.
      
      While this functionality is being introduced for DAX, there is no
      reason why it is specific to DAX - we can pre-zero blocks during the
      allocation transaction on any type of device. It's just slow (and
      usually slower than unwritten allocation and conversion) on
      traditional block devices so doesn't tend to get used. We can,
      however, hook hardware zeroing optimisations via sb_issue_zeroout()
      to this operation, so it may be useful in future and hence the
      "allocate zeroed blocks" API needs to be implementation neutral.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      3fbbbea3
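The merged-extent hazard that 3fbbbea3 describes can be shown with a list standing in for a file's blocks. This is a simplified sketch: `allocate`, `zero_returned_extent` and `zero_during_allocation` are hypothetical helpers, not XFS functions, and `None` marks a hole.

```python
def allocate(blocks, start, length):
    """Fill holes (None) in blocks[start:start+length] with stale contents.

    Returns the merged extent (start, length) and the indices of the
    blocks that this call actually allocated.
    """
    new = [i for i in range(start, start + length) if blocks[i] is None]
    for i in new:
        blocks[i] = "stale"
    return (start, length), new


def zero_returned_extent(blocks, start, length):
    """Zero whatever the allocator returned: clobbers existing data."""
    extent, _ = allocate(blocks, start, length)
    for i in range(extent[0], extent[0] + extent[1]):
        blocks[i] = "zero"
    return blocks


def zero_during_allocation(blocks, start, length):
    """Zero only the newly allocated blocks, before the extents merge."""
    _, new = allocate(blocks, start, length)
    for i in new:
        blocks[i] = "zero"
    return blocks
```

With `["data", None, None]` as input, the first approach returns three zeroed blocks while the second keeps `"data"` intact, which is why the zeroing has to happen inside the allocation path.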
    • D
      xfs: fix inode size update overflow in xfs_map_direct() · 3e12dbbd
      Dave Chinner committed
      Both direct IO and DAX pass an offset and count into get_blocks that
      will overflow a s64 variable when an IO goes into the last supported
      block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen
      from the tracing:
      
      xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096
      xfs_gbmap_direct:     [...] offset 0x7ffffffffffff000 count 4096
      xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096
      
      0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that
      overflows the s64 offset and we fail to detect the need for a
      filesize update and an ioend is not allocated.
      
      This is *mostly* avoided for direct IO because such extending IOs
      occur with full block allocation, and so the "IS_UNWRITTEN()" check
      still evaluates as true and we get an ioend that way. However, doing
      single sector extending IOs to this last block will expose the fact
      that file size updates will not occur after the first allocating
      direct IO as the overflow will then be exposed.
      
      There is one further complexity: the DAX page fault path also
      exposes the same issue in block allocation. However, page faults
      cannot extend the file size, so in this case we want to allocate the
      block but do not want to allocate an ioend to enable file size
      update at IO completion. Hence we now need to distinguish between
      the direct IO path allocation and DAX fault path allocation to
      avoid leaking ioend structures.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      3e12dbbd
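The arithmetic in 3e12dbbd can be checked with a small model of signed 64-bit wraparound. Python integers do not overflow, so `to_s64` emulates two's-complement wrapping; the two `extends_file_*` helpers are hypothetical illustrations of the comparison and are not taken from the patch.

```python
S64_MAX = 2**63 - 1


def to_s64(value: int) -> int:
    """Wrap an integer to a signed 64-bit value (two's complement)."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value > S64_MAX else value


def extends_file_naive(offset: int, count: int, isize: int) -> bool:
    """Compare the wrapped end offset against the file size only."""
    return to_s64(offset + count) > isize


def extends_file_checked(offset: int, count: int, isize: int) -> bool:
    """Also treat a negative (wrapped) end offset as an extension."""
    end = to_s64(offset + count)
    return end > isize or end < 0
```

For the traced IO (`offset=0x7ffffffffffff000`, `count=4096`) the end offset wraps to `-2**63`, so the naive comparison reports no size update while the checked variant still detects one.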
  2. 09 Sep, 2015 2 commits
  3. 05 Sep, 2015 1 commit
    • K
      fs: create and use seq_show_option for escaping · a068acf2
      Kees Cook committed
      Many file systems that implement the show_options hook fail to correctly
      escape their output which could lead to unescaped characters (e.g.  new
      lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
      could lead to confusion, spoofed entries (resulting in things like
      systemd issuing false d-bus "mount" notifications), and who knows what
      else.  This looks like it would only be the root user stepping on
      themselves, but it's possible weird things could happen in containers or
      in other situations with delegated mount privileges.
      
      Here's an example using overlay with setuid fusermount trusting the
      contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
      of "sudo" is something more sneaky:
      
        $ BASE="ovl"
        $ MNT="$BASE/mnt"
        $ LOW="$BASE/lower"
        $ UP="$BASE/upper"
        $ WORK="$BASE/work/ 0 0
        none /proc fuse.pwn user_id=1000"
        $ mkdir -p "$LOW" "$UP" "$WORK"
        $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
        $ cat /proc/mounts
        none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
        none /proc fuse.pwn user_id=1000 0 0
        $ fusermount -u /proc
        $ cat /proc/mounts
        cat: /proc/mounts: No such file or directory
      
      This fixes the problem by adding new seq_show_option and
      seq_show_option_n helpers, and updating the vulnerable show_option
      handlers to use them as needed.  Some, like SELinux, need to be open
      coded due to unusual existing escape mechanisms.
      
      [akpm@linux-foundation.org: add lost chunk, per Kees]
      [keescook@chromium.org: seq_show_option should be using const parameters]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Acked-by: Jan Kara <jack@suse.com>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Cc: J. R. Okajima <hooanon05g@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a068acf2
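The escaping idea behind a068acf2 can be sketched as follows. This is a user-space approximation: `escape_octal` and `show_option` are hypothetical names, and the set of escaped characters is an assumption made for the example rather than a statement of what the kernel helpers escape.

```python
def escape_octal(text: str, special: str = ",= \t\n\\") -> str:
    """Replace each character in `special` with a backslash and 3 octal digits."""
    return "".join(
        "\\%03o" % ord(ch) if ch in special else ch
        for ch in text
    )


def show_option(name: str, value: str) -> str:
    """Format one mount option as it would be appended to an options string."""
    return "," + escape_octal(name) + "=" + escape_octal(value)
```

Applied to the malicious `workdir` value from the example above, the embedded spaces and newline come out as `\040` and `\012`, so the option can no longer start a forged line in `/proc/mounts`.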
  4. 28 Aug, 2015 4 commits
  5. 25 Aug, 2015 5 commits
  6. 20 Aug, 2015 1 commit
  7. 19 Aug, 2015 22 commits
    • B
      xfs: flush entire file on dio read/write to cached file · 3d751af2
      Brian Foster committed
      Filesystems are responsible for managing file coherency between the page
      cache and direct I/O. The generic dio code flushes dirty pages over the
      range of a dio to ensure that the dio read or a future buffered read
      returns the correct data. XFS has generally followed this pattern,
      though traditionally has flushed and invalidated the range from the
      start of the I/O all the way to the end of the file. This changed after
      the following commit:
      
      	7d4ea3ce xfs: use ranged writeback and invalidation for direct IO
      
      ... as the full file flush was no longer necessary to deal with the
      strange post-eof delalloc issues that were since fixed. Unfortunately,
      we have since received complaints about performance degradation due to
      the increased exclusive iolock cycles (which locks out parallel dio
      submission) that occur when a file has cached pages. This does not occur
      on filesystems that use the generic code as it also does not incorporate
      locking.
      
      The exclusive iolock is acquired any time the inode mapping has cached
      pages, regardless of whether they reside in the range of the I/O or not.
      If not, the flush/inval calls do no work and the lock was cycled for no
      reason.
      
      Under consideration of the cost of the exclusive iolock, update the dio
      read and write handlers to flush and invalidate the entire mapping when
      cached pages exist. In most cases, this increases the cost of the
      initial flush sequence but eliminates the need for further lock cycles
      and flushes so long as the workload does not actively mix direct and
      buffered I/O. This also more closely matches historical behavior and
      performance characteristics that users have come to expect.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      3d751af2
    • J
      xfs: Fix xfs_attr_leafblock definition · ffeecc52
      Jan Kara committed
      struct xfs_attr_leafblock contains 'entries' array which is declared
      with size 1 although it can in fact contain many more entries. Since this
      array is followed by further struct members, gcc (at least in version
      4.8.3) thinks that the array has the fixed size of 1 element and thus
      may optimize away all accesses beyond the end of array resulting in
      non-working code. This problem was only observed with userspace code in
      xfsprogs, however it's better to be safe in kernel as well and have
      matching kernel and xfsprogs definitions.
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Jan Kara <jack@suse.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      ffeecc52
    • D
      libxfs: readahead of dir3 data blocks should use the read verifier · 2f123bce
      Darrick J. Wong committed
      In the dir3 data block readahead function, use the regular read
      verifier to check the block's CRC and spot-check the block contents
      instead of directly calling only the spot-checking routine.  This
      prevents corrupted directory data blocks from being read into the
      kernel, which can lead to garbage ls output and directory loops (if
      say one of the entries contains slashes and other junk).
      
      cc: <stable@vger.kernel.org> # 3.12 - 4.2
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      2f123bce
    • D
      xfs: stop holding ILOCK over filldir callbacks · dbad7c99
      Dave Chinner committed
      The recent change to the readdir locking made in 40194ecc ("xfs:
      reinstate the ilock in xfs_readdir") for CXFS directory sanity was
      probably the wrong thing to do. Deep in the readdir code we
      can take page faults in the filldir callback, and so taking a page
      fault while holding an inode ilock creates a new set of locking
      issues that lockdep warns all over the place about.
      
      The locking order for regular inodes w.r.t. page faults is io_lock
      -> pagefault -> mmap_sem -> ilock. The directory readdir code now
      triggers ilock -> page fault -> mmap_sem. While we cannot deadlock
      at this point, it inverts all the locking patterns that lockdep
      normally sees on XFS inodes, and so triggers lockdep. We worked
      around this with commit 93a8614e ("xfs: fix directory inode iolock
      lockdep false positive"), but that then just moved the lockdep
      warning to deeper in the page fault path and triggered on security
      inode locks. Fixing the shmem issue there just moved the lockdep
      reports somewhere else, and now we are getting false positives from
      filesystem freezing annotations getting confused.
      
      Further, if we enter memory reclaim in a readdir path, we now get
      lockdep warning about potential deadlocks because the ilock is held
      when we enter reclaim. This, again, is different to a regular file
      in that we never allow memory reclaim to run while holding the ilock
      for regular files. Hence lockdep now throws
      ilock->kmalloc->reclaim->ilock warnings.
      
      Basically, the problem is that the ilock is being used to protect
      the directory data and the inode metadata, whereas for a regular
      file the iolock protects the data and the ilock protects the
      metadata. From the VFS perspective, the i_mutex serialises all
      accesses to the directory data, and so not holding the ilock for
      readdir doesn't matter. The issue is that CXFS doesn't access
      directory data via the VFS, so it has no "data serialisation"
      mechanism. Hence we need to hold the IOLOCK in the correct places to
      provide this low level directory data access serialisation.
      
      The ilock can then be used just when the extent list needs to be
      read, just like we do for regular files. The directory modification
      code can take the iolock exclusive when the ilock is also taken,
      and this then ensures that readdir is correctly excluded while
      modifications are in progress.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      dbad7c99
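The ordering inversion described in dbad7c99 is the kind of pattern a lock-order checker flags. The sketch below is a toy checker, far simpler than lockdep: `find_inversions` is a hypothetical helper, and the chains list only the locks named in the message (the page fault itself is an event, not a lock, so it is left out).

```python
def find_inversions(chains):
    """Return sorted (a, b) pairs taken in both orders across the chains.

    Each chain is a list of lock names in acquisition order.
    """
    seen = set()
    for chain in chains:
        for i, first in enumerate(chain):
            for later in chain[i + 1:]:
                seen.add((first, later))
    return sorted((a, b) for (a, b) in seen if (b, a) in seen and a < b)


regular_file = ["iolock", "mmap_sem", "ilock"]  # io_lock -> fault -> mmap_sem -> ilock
old_readdir = ["ilock", "mmap_sem"]             # filldir faults under the ilock
new_readdir = ["iolock", "mmap_sem"]            # ilock dropped before filldir
```

Checking `regular_file` together with `old_readdir` reports the `ilock`/`mmap_sem` pair as inverted, while pairing it with `new_readdir` reports nothing.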
    • D
      xfs: clean up inode lockdep annotations · 0952c818
      Dave Chinner committed
      Lockdep annotations are a maintenance nightmare. Locking has to be
      modified to suit the limitations of the annotations, and we're
      always having to fix the annotations because they are unable to
      express the complexity of locking hierarchies correctly.
      
      So, next up, we've got more issues with lockdep annotations for
      inode locking w.r.t. XFS_LOCK_PARENT:
      
      	- lockdep classes are exclusive and can't be ORed together
      	  to form new classes.
      	- IOLOCK needs multiple PARENT subclasses to express the
      	  changes needed for the readdir locking rework needed to
      	  stop the endless flow of lockdep false positives involving
      	  readdir calling filldir under the ILOCK.
      	- there are only 8 unique lockdep subclasses available,
      	  so we can't create a generic solution.
      
      IOWs we need to treat the 3-bit space available to each lock type
      differently:
      
      	- IOLOCK uses xfs_lock_two_inodes(), so needs:
      		- at least 2 IOLOCK subclasses
      		- at least 2 IOLOCK_PARENT subclasses
      	- MMAPLOCK uses xfs_lock_two_inodes(), so needs:
      		- at least 2 MMAPLOCK subclasses
      	- ILOCK uses xfs_lock_inodes with up to 5 inodes, so needs:
      		- at least 5 ILOCK subclasses
      		- one ILOCK_PARENT subclass
      		- one RTBITMAP subclass
      		- one RTSUM subclass
      
      For the IOLOCK, split the space into two sets of subclasses.
      For the MMAPLOCK, just use half the space for the one subclass to
      match the non-parent lock classes of the IOLOCK.
      For the ILOCK, use 0-4 as the ILOCK subclasses, 5-7 for the
      remaining individual subclasses.
      
      Because they are now all different, modify xfs_lock_inumorder() to
      handle the nested subclasses, and to assert fail if passed an
      invalid subclass. Further, annotate xfs_lock_inodes() to assert fail
      if an invalid combination of lock primitives and inode counts is
      passed that would result in a lockdep subclass annotation overflow.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      0952c818
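One way to picture the subclass split from 0952c818 is a table that maps each lock type onto the eight available values. The layout below is an assumption that follows the description (two halves for IOLOCK, half the space for MMAPLOCK, 0-4 plus 5-7 for ILOCK); `LAYOUT` and `subclass_for` are hypothetical names, not kernel definitions.

```python
MAX_SUBCLASSES = 8  # 3 bits of lockdep subclass space per lock type

# Hypothetical layout following the split described in the message.
LAYOUT = {
    "IOLOCK": {"inode": range(0, 4), "parent": range(4, 8)},
    "MMAPLOCK": {"inode": range(0, 4)},
    "ILOCK": {"inode": range(0, 5), "parent": [5], "rtbitmap": [6], "rtsum": [7]},
}


def subclass_for(lock_type: str, kind: str, index: int = 0) -> int:
    """Map (lock type, kind, nesting index) to a subclass, or raise."""
    slots = LAYOUT[lock_type][kind]
    if index >= len(slots):
        raise ValueError(f"{lock_type}/{kind}: nesting index {index} overflows")
    subclass = slots[index]
    assert 0 <= subclass < MAX_SUBCLASSES
    return subclass
```

Asking for a sixth nested ILOCK (index 5) raises an error, which mirrors the assert-fail behaviour the message describes for invalid combinations.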
    • B
      xfs: swap leaf buffer into path struct atomically during path shift · 7df1c170
      Brian Foster committed
      The node directory lookup code uses a state structure that tracks the
      path of buffers used to search for the hash of a filename through the
      leaf blocks. When the lookup encounters a block that ends with the
      requested hash, but the entry has not yet been found, it must shift over
      to the next block and continue looking for the entry (i.e., duplicate
      hashes could continue over into the next block). This shift mechanism
      involves walking back up and down the state structure, replacing buffers
      at the appropriate btree levels as necessary.
      
      When a buffer is replaced, the old buffer is released and the new buffer
      read into the active slot in the path structure. Because the buffer is
      read directly into the path slot, a buffer read failure can result in
      setting a NULL buffer pointer in an active slot. This throws off the
      state cleanup code in xfs_dir2_node_lookup(), which expects to release a
      buffer from each active slot. Instead, a BUG occurs due to a NULL
      pointer dereference:
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
        IP: [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
        ...
        RIP: 0010:[<ffffffffa0585063>]  [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
        ...
        Call Trace:
         [<ffffffffa05250c6>] xfs_dir2_node_lookup+0xa6/0x2c0 [xfs]
         [<ffffffffa0519f7c>] xfs_dir_lookup+0x1ac/0x1c0 [xfs]
         [<ffffffffa055d0e1>] xfs_lookup+0x91/0x290 [xfs]
         [<ffffffffa05580b3>] xfs_vn_lookup+0x73/0xb0 [xfs]
         [<ffffffff8122de8d>] lookup_real+0x1d/0x50
         [<ffffffff8123330e>] path_openat+0x91e/0x1490
         [<ffffffff81235079>] do_filp_open+0x89/0x100
         ...
      
      This has been reproduced via a parallel fsstress and filesystem shutdown
      workload in a loop. The shutdown triggers the read error in the
      aforementioned codepath and causes the BUG in xfs_dir2_node_lookup().
      
      Update xfs_da3_path_shift() to update the active path slot atomically
      with respect to the caller when a buffer is replaced. This ensures that
      the caller always sees the old or new buffer in the slot and prevents
      the NULL pointer dereference.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      7df1c170
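The difference between reading directly into the path slot and swapping atomically, as described in 7df1c170, can be sketched with a list standing in for the path structure. `ReadError`, `shift_unsafe` and `shift_atomic` are hypothetical names for this illustration only.

```python
class ReadError(Exception):
    """Stand-in for a buffer read failure (e.g. after a shutdown)."""


def shift_unsafe(path, level, read_buf, blkno):
    """Old pattern: release the old buffer, then read into the slot."""
    path[level] = None             # old buffer released
    path[level] = read_buf(blkno)  # on failure the slot stays None


def shift_atomic(path, level, read_buf, blkno):
    """Read into a local first; swap the slot only after success."""
    new_buf = read_buf(blkno)      # may raise; slot untouched so far
    path[level] = new_buf          # the old buffer would be released here


def failing_read(blkno):
    raise ReadError(f"cannot read block {blkno}")
```

After a failed read, `shift_unsafe` leaves `None` in the active slot for the cleanup code to trip over, whereas `shift_atomic` leaves the old buffer in place.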
    • B
      xfs: relocate sparse inode mount warning · 1b867d3a
      Brian Foster committed
      The sparse inodes feature is currently considered experimental. We warn
      at mount time from xfs_mount_validate_sb(). This function is part of the
      superblock verifier codepath, however, which means it could be invoked
      repeatedly on superblock reads or writes. This is currently only
      noticeable from userspace, where mkfs produces multiple warnings at
      format time.
      
      As mkfs warnings were not the intent of this change, relocate the mount
      time warning to xfs_fs_fill_super(), which is only invoked once and only
      in kernel space.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      1b867d3a
    • D
      xfs: dquots should be stamped with sb_meta_uuid · 92863451
      Dave Chinner committed
      Once the sb_uuid is changed, the wrong uuid is stamped into new
      dquots on disk. Found by inspection, verified by generic/219.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      92863451
    • D
      xfs: log recovery needs to validate against sb_meta_uuid · fcfbe2c4
      Dave Chinner committed
      Now that sb_uuid can be changed by the user, we cannot use this to
      validate the metadata blocks being recovered belong to this
      filesystem. We must check against the sb_meta_uuid as that will
      remain unchanged.
      
      There is a complication in this code - the superblock itself. We can
      not check the sb_meta_uuid unconditionally, as that may not be set
      on disk. Hence we must verify the superblock sb_uuid matches between
      the log record and the in-core superblock.
      
      Found by inspection after the previous two problems were found.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      fcfbe2c4
    • D
      xfs: growfs not aware of sb_meta_uuid · ac383de2
      Dave Chinner committed
      Adding this simple change to xfstests:common/rc::_scratch_mkfs_xfs:
      
      +       if [ $mkfs_status -eq 0 ]; then
      +               xfs_admin -U generate $SCRATCH_DEV > /dev/null
      +       fi
      
      triggers all sorts of errors in xfstests. xfs/104 is an example,
      where growfs fails with a UUID mismatch corruption detected by
      xfs_agf_write_verify() when trying to write the first new AG
      headers.
      
      Fix this problem by making sure we copy the sb_meta_uuid into new
      metadata written by growfs.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      ac383de2
    • D
      xfs: fix sb_meta_uuid usage · bbf155ad
      Dave Chinner committed
      After changing the UUID on a v5 filesystem, xfstests fails
      immediately on a debug kernel with:
      
      XFS: Assertion failed: uuid_equal(&ip->i_d.di_uuid, &mp->m_sb.sb_uuid), file: fs/xfs/xfs_inode.c, line: 799
      
      This needs to check against the sb_meta_uuid, not the user visible
      UUID that was changed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      bbf155ad
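The sb_uuid versus sb_meta_uuid distinction that runs through the last few commits can be modelled in a few lines. This is a sketch under stated assumptions: the `Superblock` dataclass and the two helpers are hypothetical, and real verifiers check far more than a UUID.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Superblock:
    sb_uuid: uuid.UUID       # user-visible, may be changed by the admin
    sb_meta_uuid: uuid.UUID  # stamped into metadata, stays fixed


def stamp_metadata(sb: Superblock) -> uuid.UUID:
    """UUID written into a new metadata block (inode, dquot, AG header)."""
    return sb.sb_meta_uuid


def verify_metadata(sb: Superblock, block_uuid: uuid.UUID) -> bool:
    """Check a metadata block against the UUID that never changes."""
    return block_uuid == sb.sb_meta_uuid
```

Once `sb_uuid` is regenerated, blocks stamped earlier no longer match it, but they still verify against `sb_meta_uuid`, and so do blocks stamped afterwards.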
    • E
      xfs: set XFS_DA_OP_OKNOENT in xfs_attr_get · c400ee3e
      Eric Sandeen committed
      It's entirely possible for userspace to ask for an xattr which
      does not exist.
      
      Normally, there is no problem whatsoever when we ask for such
      a thing, but when we look at an obfuscated metadump image
      on a debug kernel with selinux, we trip over this ASSERT in
      xfs_da3_path_shift():
      
              *result = -ENOENT;      /* we're out of our tree */
              ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
      
      It (more or less) only shows up in the above scenario, because
      xfs_metadump obfuscates attr names, but chooses names which
      keep the same hash value - and xfs_da3_node_lookup_int does:
      
              if (((retval == -ENOENT) || (retval == -ENOATTR)) &&
                  (blk->hashval == args->hashval)) {
                      error = xfs_da3_path_shift(state, &state->path, 1, 1,
                                                       &retval);
      
      IOWs, we only get down to the xfs_da3_path_shift() ASSERT
      if we are looking for an xattr which doesn't exist, but we
      find xattrs on disk which have the same hash, and so might be
      a hash collision, so we try the path shift.  When *that*
      fails to find what we're looking for, we hit the assert about
      XFS_DA_OP_OKNOENT.
      
      Simply setting XFS_DA_OP_OKNOENT in xfs_attr_get solves this
      rather corner-case problem with no ill side effects.  It's
      fine for an attr name lookup to fail.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      c400ee3e
    • B
      xfs: add missing bmap cancel calls in error paths · d4a97a04
      Brian Foster committed
      If a failure occurs after the bmap free list is populated and before
      xfs_bmap_finish() completes successfully (which returns a partial
      list on failure), the bmap free list must be cancelled. Otherwise,
      the extent items on the list are never freed and a memory leak
      occurs.
      
      Several random error paths throughout the code suffer this problem.
      Fix these up such that xfs_bmap_cancel() is always called on error.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      d4a97a04
    • B
      xfs: add helper to conditionally remove items from the AIL · 146e54b7
      Brian Foster committed
      Several areas of code duplicate a pattern where we take the AIL lock,
      check whether an item is in the AIL and remove it if so. Create a new
      helper for this pattern and use it where appropriate.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      146e54b7
    • B
      xfs: fix btree cursor error cleanups · f307080a
      Brian Foster committed
      The btree cursor cleanup function takes an error parameter that
      affects how buffers are released from the cursor. All buffers are
      released in the event of error. Several callers do not specify the
      XFS_BTREE_ERROR flag in the event of error, however. This can cause
      buffers to hang around locked or with an elevated hold count and
      thus lead to umount hangs in the event of errors.
      
      Fix up the xfs_btree_del_cursor() callers to pass XFS_BTREE_ERROR if
      the cursor is being torn down due to error.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      f307080a
    • B
      xfs: clean up root inode properly on mount failure · 0ae120f8
      Brian Foster committed
      The root inode is read as part of the xfs_mountfs() sequence and the
      reference is dropped in the event of failure after we grab the
      inode.  The reference drop doesn't necessarily free the inode,
      however. It marks it for reclaim and potentially kicks off the
      reclaim workqueue.  The workqueue is destroyed further up the error
      path, which means we are subject to crash if the workqueue job runs
      after this point or a memory leak which is identified if the
      xfs_inode_zone is destroyed (e.g., on module removal). Both of these
      outcomes are reproducible via manual instrumentation of a mount
      error after the root inode xfs_iget() call in xfs_mountfs().
      
      Update the xfs_mountfs() error path to cancel any potential reclaim
      work items and to run a synchronous inode reclaim if the root inode
      is marked for reclaim. This ensures that no jobs remain on the queue
      before it is destroyed and that the root inode is freed before the
      reclaim mechanism is torn down.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      0ae120f8
    • B
      xfs: checksum log record ext headers based on record size · a3f20014
      Brian Foster committed
      The first 4 bytes of every basic block in the physical log are stamped
      with the current lsn. To support this mechanism, the log record header
      (first block of each new log record) contains space for the original
      first byte of each log record block before it is replaced with the lsn.
      The log record header has space for 32k worth of blocks. The version 2
      log adds new extended record headers for each additional 32k worth of
      blocks beyond what is supported by the record header.
      
      The log record checksum incorporates the log record header, the extended
      headers and the record payload. xlog_cksum() checksums the extended
      headers based on log->l_iclog_heads, which specifies the number of
      extended headers in a log record based on the log buffer size mount
      option. The log buffer size is variable, however, and thus means the
      checksum can be calculated differently based on how a filesystem is
      mounted. This is problematic if a filesystem crashes and recovery occurs
      on a subsequent mount using a different log buffer size. For example,
      crash an active filesystem that is mounted with the default (32k)
      logbsize, attempt remount/recovery using '-o logbsize=64k' and the mount
      fails on or warns about log checksum failures.
      
      To avoid this problem, update xlog_cksum() to calculate the checksum
      based on the size of the log buffer according to the log record. The
      size is already included in the h_size field of the log record header
      and thus is available at log recovery time. Extended log record headers
      are also only written when the log record is large enough to require
      them. This makes checksum calculation of log records consistent with the
      extended record header mechanism as well as how on-disk records are
      checksummed with various log buffer size mount options.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      a3f20014
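The mismatch described in a3f20014 comes down to which size is used to count extended headers. The sketch below assumes one header per 32k of log data, as the message states; `HEADER_CYCLE_SIZE` and the three helpers are hypothetical names, and no actual checksum is computed.

```python
HEADER_CYCLE_SIZE = 32 * 1024  # bytes of log data covered by one header


def header_count(size_bytes: int) -> int:
    """Headers (record header plus extended headers) for size_bytes."""
    return max(1, -(-size_bytes // HEADER_CYCLE_SIZE))  # ceiling division


def ext_headers_from_mount(logbsize: int) -> int:
    """Old behaviour: derive the count from the current mount option."""
    return header_count(logbsize) - 1


def ext_headers_from_record(h_size: int) -> int:
    """Fixed behaviour: derive the count from the record's own h_size."""
    return header_count(h_size) - 1
```

A record written with a 32k buffer has no extended headers, yet a recovery mount with `logbsize=64k` would, under the old rule, try to checksum one. Using the record's `h_size` gives the same answer on every mount.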
    • B
      xfs: fix broken icreate log item cancellation · fc0d1656
      Brian Foster committed
      Inode cluster buffers are invalidated and cancelled when inode chunks
      are freed to notify log recovery that previous logged updates to the
      metadata buffer should be skipped. This ensures that log recovery does
      not overwrite buffers that might have already been reused.
      
      On v4 filesystems, inode chunk allocation and inode updates are logged
      via the cluster buffers and thus cancellation is easily detected via
      buffer cancellation items. v5 filesystems use the new icreate
      transaction, which uses logical logging and ordered buffers to log a
      full inode chunk allocation at once. The resulting icreate item often
      spans multiple inode cluster buffers.
      
      Log recovery checks for cancelled buffers when processing icreate log
      items, but it has two problems. First, it uses the full length of
      the inode chunk rather than the cluster size. Second, it uses the length
      in FSB units rather than BB units. Either of these problems prevents
      icreate recovery from identifying cancelled buffers and thus inode
      initialization proceeds unconditionally.
      
      Update xlog_recover_do_icreate_pass2() to iterate the icreate range in
      cluster sized increments and check each increment for cancellation.
      Since icreate is currently only used for the minimum atomic inode chunk
      allocation, we expect that either all or none of the buffers will be
      cancelled. Cancel the icreate if at least one buffer is cancelled to
      avoid making a bad situation worse by initializing a partial inode
      chunk, but detect such anomalies and warn the user.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      fc0d1656
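The per-cluster cancellation check described in this commit might look roughly like the sketch below. It is a simplified model, not the body of xlog_recover_do_icreate_pass2(): `struct cancel_table`, `buffer_cancelled()`, `icreate_check()` and the `ICREATE_*` values are hypothetical, and the toy lookup matches on start address only, whereas a real cancellation lookup would also consider the buffer length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the recovery buffer cancellation table. */
struct cancel_table {
	const uint64_t *daddrs;	/* start of each cancelled buffer, in BBs */
	size_t count;
};

enum icreate_action {
	ICREATE_RECOVER,	/* no buffer cancelled: initialize the chunk */
	ICREATE_CANCEL,		/* every buffer cancelled: skip the item */
	ICREATE_CANCEL_PARTIAL,	/* some cancelled: skip and warn the user */
};

static bool buffer_cancelled(const struct cancel_table *t, uint64_t daddr)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->daddrs[i] == daddr)
			return true;
	return false;
}

/*
 * Walk the chunk one cluster at a time. Lengths arrive in filesystem
 * blocks (FSB) and are converted to basic blocks (BB) before lookup.
 */
static enum icreate_action
icreate_check(const struct cancel_table *t, uint64_t start_daddr,
	      uint32_t chunk_fsb, uint32_t cluster_fsb, uint32_t bb_per_fsb)
{
	uint32_t nbufs, cancelled = 0;
	uint64_t bb_per_cluster;

	assert(cluster_fsb > 0 && chunk_fsb % cluster_fsb == 0);
	nbufs = chunk_fsb / cluster_fsb;
	bb_per_cluster = (uint64_t)cluster_fsb * bb_per_fsb;

	for (uint32_t i = 0; i < nbufs; i++)
		if (buffer_cancelled(t, start_daddr + i * bb_per_cluster))
			cancelled++;

	if (cancelled == 0)
		return ICREATE_RECOVER;
	return cancelled == nbufs ? ICREATE_CANCEL : ICREATE_CANCEL_PARTIAL;
}
```

Both non-recover outcomes skip inode initialization; the partial case exists only so the caller can warn about the anomaly.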
    • B
      xfs: icreate log item recovery and cancellation tracepoints · 78d57e45
      Brian Foster committed
      Various log items have recovery tracepoints to identify whether a
      particular log item is recovered or cancelled. Add the equivalent
      tracepoints for the icreate transaction.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      78d57e45
    • B
      xfs: don't leave EFIs on AIL on mount failure · f0b2efad
      Brian Foster committed
      Log recovery occurs in two phases at mount time. In the first phase,
      EFIs and EFDs are processed and potentially cancelled out. EFIs without
      EFD objects are inserted into the AIL for processing and recovery in the
      second phase. xfs_mountfs() runs various other operations between the
      phases and is thus subject to failure. If failure occurs after the first
      phase but before the second, pending EFIs sit on the AIL, pin it and
      cause the mount to hang.
      
      Update the mount sequence to ensure that pending EFIs are cancelled in
      the event of failure. Add a recovery cancellation mechanism to iterate
      the AIL and cancel all EFI items when requested. Plumb cancellation
      support through the log mount finish helper and update xfs_mountfs() to
      invoke cancellation in the event of failure after recovery has started.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      f0b2efad
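The cancellation pass over the AIL can be pictured with a toy model like the one below. This is only a sketch under assumptions: the AIL is modelled as a plain singly linked list, and `struct log_item`, `struct ail` and `ail_cancel_efis()` are hypothetical names rather than the kernel's structures or helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum item_type { ITEM_EFI, ITEM_OTHER };

/* Hypothetical log item sitting on the toy AIL. */
struct log_item {
	enum item_type type;
	struct log_item *next;
	bool released;
};

struct ail {
	struct log_item *head;
};

/*
 * Walk the list and unlink every pending EFI so that nothing is left
 * behind to pin the AIL after a failed mount. Returns the number of
 * EFIs that were cancelled.
 */
static int ail_cancel_efis(struct ail *ail)
{
	struct log_item **link;
	int cancelled = 0;

	assert(ail != NULL);
	link = &ail->head;
	while (*link) {
		struct log_item *lip = *link;

		if (lip->type == ITEM_EFI) {
			*link = lip->next;	/* unlink from the AIL */
			lip->next = NULL;
			lip->released = true;
			cancelled++;
		} else {
			link = &lip->next;
		}
	}
	return cancelled;
}
```

In this model the mount error path would call `ail_cancel_efis()` once recovery has started but before the second phase has run.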
    • B
      xfs: use EFI refcount consistently in log recovery · e32a1d1f
      Brian Foster committed
      The EFI is initialized with a reference count of 2: one for the EFI to
      ensure the item makes it to the AIL, and one for the subsequently created
      EFD to release the EFI once the EFD is committed. Log recovery uses the
      EFI in a similar manner, but implements a hack to remove both references
      in one call once the EFD is handled.
      
      Update log recovery to use EFI reference counting in a manner consistent
      with the log. When an EFI is encountered during recovery, an EFI item is
      allocated and inserted into the AIL directly. Since the EFI reference is
      typically dropped when the EFI is unpinned and this is analogous to
      AIL insertion, drop the EFI reference at this point.
      
      When a corresponding EFD is encountered in the log, this indicates that
      the extents were freed, no processing is required and the EFI can be
      dropped. Update xlog_recover_efd_pass2() to simply drop the EFD
      reference at this point rather than open code the AIL removal and EFI
      free.
      
      Remaining EFIs (i.e., with no corresponding EFD) are processed in
      xlog_recover_finish(). An EFD transaction is allocated and the extents
      are freed, which transfers ownership of the EFI reference to the EFD
      item in the log.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      e32a1d1f
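The two-reference lifecycle described in this commit can be modelled in a few lines. This is a sketch with hypothetical names (`struct efi_item`, `efi_release()`, `recover_efi()`, `recover_efd()`) and a plain integer counter; it makes no claim about the kernel's actual types or locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an EFI and its lifecycle state. */
struct efi_item {
	int refcount;
	bool in_ail;
	bool freed;
};

static void efi_init(struct efi_item *efi)
{
	efi->refcount = 2;	/* one for the EFI, one for the future EFD */
	efi->in_ail = false;
	efi->freed = false;
}

/* Drop one reference; the last one removes the EFI from the AIL. */
static void efi_release(struct efi_item *efi)
{
	assert(efi->refcount > 0);
	if (--efi->refcount == 0) {
		efi->in_ail = false;
		efi->freed = true;
	}
}

/* Recovery finds an EFI: insert it and drop the EFI's own reference. */
static void recover_efi(struct efi_item *efi)
{
	efi_init(efi);
	efi->in_ail = true;
	efi_release(efi);
}

/* Recovery finds the matching EFD: drop the remaining reference. */
static void recover_efd(struct efi_item *efi)
{
	efi_release(efi);
}
```

An EFI with no matching EFD keeps its last reference and stays in the AIL until the later extent-free processing hands that reference to a new EFD.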
    • B
      xfs: ensure EFD trans aborts on log recovery extent free failure · 6bc43af3
      Brian Foster committed
      Log recovery attempts to free extents with leftover EFIs in the AIL
      after initial processing. If the extent free fails (e.g., due to
      unrelated fs corruption), the transaction is cancelled, though it
      might not be dirtied at the time. If this is the case, the EFD does
      not abort and thus does not release the EFI. This can lead to hangs
      as the EFI pins the AIL.
      
      Update xlog_recover_process_efi() to log the EFD in the transaction
      before xfs_free_extent() errors are handled to ensure the
      transaction is dirty, aborts the EFD and releases the EFI on error.
      Since this is a requirement for EFD processing (and consistent with
      xfs_bmap_finish()), update the EFD logging helper to do the extent
      free and unconditionally log the EFD. This encodes the required EFD
      logging behavior into the helper and reduces the likelihood of
      errors down the road.
      
      [dchinner: re-add xfs_alloc.h to xfs_log_recover.c to fix build
       failure.]
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      6bc43af3
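Why logging the EFD before error handling matters can be shown with a toy transaction model. Everything here is an assumption made for illustration: `struct trans`, `free_extent()`, `free_extent_and_log_efd()`, `trans_cancel()` and `recover_process_efi()` are hypothetical, and cancellation is reduced to a single rule: only a dirty transaction aborts its EFD and thereby releases the EFI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct efi_item {
	bool released;
};

struct trans {
	bool dirty;
	struct efi_item *efi;	/* EFI that the attached EFD would release */
};

/* Hypothetical extent free that can be forced to fail. */
static int free_extent(bool fail)
{
	return fail ? -1 : 0;
}

/* Free the extent and log the EFD unconditionally, even on error. */
static int free_extent_and_log_efd(struct trans *tp, bool fail)
{
	int error = free_extent(fail);

	tp->dirty = true;	/* logging the EFD dirties the transaction */
	return error;
}

/* A dirty transaction aborts its EFD on cancel; a clean one does not. */
static void trans_cancel(struct trans *tp)
{
	assert(tp->efi != NULL);
	if (tp->dirty)
		tp->efi->released = true;
}

static int recover_process_efi(struct efi_item *efi, bool fail)
{
	struct trans tp = { .dirty = false, .efi = efi };
	int error = free_extent_and_log_efd(&tp, fail);

	if (error) {
		trans_cancel(&tp);
		return error;
	}
	efi->released = true;	/* stands in for a successful commit */
	return 0;
}
```

If the helper set `dirty` only on success, the error path would cancel a clean transaction and leave the EFI pinning the AIL, which is the hang this commit describes.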