1. 03 Nov, 2015 6 commits
    • D
      xfs: optimise away log forces on timestamp updates for fdatasync · fc0561ce
      Committed by Dave Chinner
      xfs: timestamp updates cause excessive fdatasync log traffic
      
      Sage Weil reported that a ceph test workload was writing to the
      log on every fdatasync during an overwrite workload. Event tracing
      showed that the only metadata modification being made was the
      timestamp updates during the write(2) syscall, but fdatasync(2)
      is supposed to ignore them. The key observation was that the
      transactions in the log all looked like this:
      
      INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32
      
      Each contained a flags field of 0x45 or 0x85, and had data and
      attribute forks following the inode core. This means that the
      timestamp updates were triggering dirty relogging of previously
      logged parts of the inode that hadn't yet been flushed back to
      disk.
      
      There are two parts to this problem. The first is that XFS relogs
      dirty regions in subsequent transactions, so it carries around the
      fields that have been dirtied since the last time the inode was
      written back to disk, not since the last time the inode was forced
      into the log.
      
      The second part is that on v5 filesystems, the inode change count
      update during inode dirtying also sets the XFS_ILOG_CORE flag, so
      on v5 filesystems this makes a timestamp update dirty the entire
      inode.
      
      As a result when fdatasync is run, it looks at the dirty fields in
      the inode, and sees more than just the timestamp flag, even though
      the only metadata change since the last fdatasync was just the
      timestamps. Hence we force the log on every subsequent fdatasync
      even though it is not needed.
      
      To fix this, add a new field to the inode log item that tracks
      changes since the last time fsync/fdatasync forced the log to flush
      the changes to the journal. This flag is updated when we dirty the
      inode, but we do it before updating the change count so it does not
      carry the "core dirty" flag from timestamp updates. The fields are
      zeroed when the inode is marked clean (due to writeback/freeing) or
      when an fsync/fdatasync forces the log. Hence if we only dirty the
      timestamps on the inode between fsync/fdatasync calls, the fdatasync
      will not trigger another log force.
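
The tracking scheme described above can be sketched in user space roughly as follows. This is only an illustrative model: the flag values, the struct layout, and the function names are assumptions made for the example, not the kernel's actual definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical dirty-field flags, loosely modelled on the XFS_ILOG_* idea. */
#define ILOG_CORE      0x01u  /* inode core dirtied */
#define ILOG_DDATA     0x02u  /* data fork dirtied */
#define ILOG_TIMESTAMP 0x04u  /* only timestamps changed */

struct inode_log_item {
    uint32_t fields;        /* dirtied since the last writeback to disk */
    uint32_t fsync_fields;  /* dirtied since the last fsync/fdatasync log force */
};

/*
 * Dirty the inode. The fsync mask is updated before the change count
 * bump, so the ILOG_CORE flag added by that bump never reaches it.
 */
static void inode_log(struct inode_log_item *ili, uint32_t flags, bool v5)
{
    ili->fsync_fields |= flags;
    if (v5)
        flags |= ILOG_CORE;  /* change count update dirties the core */
    ili->fields |= flags;
}

/*
 * fdatasync only needs a log force if something other than timestamps
 * changed since the last force. Returns true if a force was needed.
 */
static bool inode_fdatasync(struct inode_log_item *ili)
{
    bool force = (ili->fsync_fields & ~ILOG_TIMESTAMP) != 0;

    if (force)
        ili->fsync_fields = 0;
    return force;
}

/* Writeback or freeing marks the inode clean and zeroes both masks. */
static void inode_mark_clean(struct inode_log_item *ili)
{
    ili->fields = 0;
    ili->fsync_fields = 0;
}
```

In this model a timestamp-only update on a v5 filesystem still sets the core flag in `fields`, but `inode_fdatasync()` looks only at `fsync_fields` and therefore skips the log force.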
      
      Over 100 runs of the test program:
      
      Ext4 baseline:
              runtime: 1.63s +/- 0.24s
              avg lat: 1.59ms +/- 0.24ms
              iops: ~2000
      
      XFS, vanilla kernel:
              runtime: 2.45s +/- 0.18s
              avg lat: 2.39ms +/- 0.18ms
              log forces: ~400/s
              iops: ~1000
      
      XFS, patched kernel:
              runtime: 1.49s +/- 0.26s
              avg lat: 1.46ms +/- 0.25ms
              log forces: ~30/s
              iops: ~1500
      Reported-by: Sage Weil <sage@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: don't leak uuid table on rmmod · af3b6382
      Committed by Darrick J. Wong
      Don't leak the UUID table when the module is unloaded.
      (Found with kmemleak.)
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • A
      xfs: invalidate cached acl if set via ioctl · 47e1bf64
      Committed by Andreas Gruenbacher
      Setting or removing the "SGI_ACL_[FILE|DEFAULT]" attributes via the
      XFS_IOC_ATTRMULTI_BY_HANDLE ioctl completely bypasses the POSIX ACL
      infrastructure, like setting the "trusted.SGI_ACL_[FILE|DEFAULT]" xattrs
      did until commit 6caa1056.  Similar to that commit, invalidate cached
      acls when setting/removing them via the ioctl as well.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • A
      xfs: Plug memory leak in xfs_attrmulti_attr_set · 09cb22d2
      Committed by Andreas Gruenbacher
      When setting attributes via XFS_IOC_ATTRMULTI_BY_HANDLE, the user-space
      buffer is copied into a new kernel-space buffer via memdup_user; that
      buffer then isn't freed.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • A
      xfs: Validate the length of on-disk ACLs · 86a21c79
      Committed by Andreas Gruenbacher
      In xfs_acl_from_disk, instead of trusting that xfs_acl.acl_cnt is correct,
      make sure that the length of the attributes is correct as well.  Also, turn
      the aclp parameter into a const pointer.
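
A minimal sketch of this kind of length validation, written as a stand-alone model. The struct names, the `ACL_MAX_ENTRIES` cap, and the `ACL_SIZE()` macro are hypothetical, and the count is treated as host-endian here for simplicity, whereas real on-disk fields need endian conversion.

```c
#include <stddef.h>
#include <stdint.h>

#define ACL_MAX_ENTRIES 25  /* hypothetical cap chosen for this sketch */

/* Simplified stand-ins for the on-disk ACL layout. */
struct disk_acl_entry {
    uint32_t tag;
    uint32_t id;
    uint16_t perm;
    uint16_t pad;
};

struct disk_acl_hdr {
    uint32_t cnt;  /* number of entries that follow the header */
};

/* Size in bytes of an ACL blob holding cnt entries. */
#define ACL_SIZE(cnt) \
    (sizeof(struct disk_acl_hdr) + (size_t)(cnt) * sizeof(struct disk_acl_entry))

/*
 * Return 0 if the blob length agrees with the entry count, -1 otherwise.
 * The count is only trusted after the header is known to fit, and the
 * pointer is const because validation never modifies the blob.
 */
static int acl_validate(const struct disk_acl_hdr *aclp, size_t len)
{
    if (aclp == NULL || len < sizeof(*aclp))
        return -1;
    if (aclp->cnt > ACL_MAX_ENTRIES)
        return -1;
    if (len != ACL_SIZE(aclp->cnt))
        return -1;
    return 0;
}
```

The point of the check is that a corrupt count can no longer cause the reader to walk past the end of the attribute value.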
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: invalidate cached acl if set directly via xattr · 67d8e04e
      Committed by Brian Foster
      ACLs are stored as extended attributes of the inode to which they apply.
      XFS converts the standard "system.posix_acl_[access|default]" attribute
      names used to control ACLs to "trusted.SGI_ACL_[FILE|DEFAULT]" as stored
      on-disk. These xattrs are directly exposed in on-disk format via
      getxattr/setxattr, without any ACL aware code in the path to perform
      validation, etc. This is partly historical and supports backup/restore
      applications such as xfsdump to back up and restore the binary blob that
      represents ACLs as-is.
      
      Andreas reports that the ACLs observed via the getfacl interface are not
      consistent when ACLs are set directly via the setxattr path. This occurs
      because the ACLs are cached in-core against the inode and the xattr path
      has no knowledge that the operation relates to ACLs.
      
      Update the xattr set codepath to trap writes of the special XFS ACL
      attributes and invalidate the associated cached ACL when this occurs.
      This ensures that the correct ACLs are used on a subsequent operation
      through the actual ACL interface.
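
The idea of trapping writes to the special attribute names can be modelled in a few lines. This is a hedged user-space sketch: `struct inode_model`, its fields, and `xattr_set()` are hypothetical names, and the actual storing of the attribute is omitted.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical in-core inode with two cached ACLs. */
struct inode_model {
    bool acl_access_cached;   /* cached access ACL is valid */
    bool acl_default_cached;  /* cached default ACL is valid */
};

/*
 * Set an attribute by name. Storing the value is omitted; the sketch
 * only shows the invalidation step for the two special ACL names.
 */
static void xattr_set(struct inode_model *ip, const char *name)
{
    if (strcmp(name, "SGI_ACL_FILE") == 0)
        ip->acl_access_cached = false;
    else if (strcmp(name, "SGI_ACL_DEFAULT") == 0)
        ip->acl_default_cached = false;
}
```

After such a write, the next lookup through the ACL interface misses the cache and rereads the attribute, so it sees the new value.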
      
      Note that this does not update or add support for setting the ACL xattrs
      directly beyond the restore use case that requires a correctly formatted
      binary blob and to restore a consistent i_mode at the same time. It is
      still possible for a root user to set an invalid or inconsistent (with
      i_mode) ACL blob on-disk and potentially cause corruption.
      
      [ With fixes from Andreas Gruenbacher. ]
      Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 02 Nov, 2015 1 commit
  3. 19 Oct, 2015 2 commits
  4. 12 Oct, 2015 21 commits
    • E
      xfs: simplify /proc teardown & error handling · 9e92054e
      Committed by Eric Sandeen
      remove_proc_subtree() was added in 3.9, and can be
      used to simplify our procfile creation error handling
      and cleanup, removing the nested gotos.  It simply
      removes fs/xfs and everything created under it.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: per-filesystem stats counter implementation · ff6d6af2
      Committed by Bill O'Donnell
      This patch modifies the stats counting macros and the callers
      to those macros to properly increment, decrement, and add-to
      the xfs stats counts. The counts for global and per-fs stats
      are correctly advanced, and cleared by writing a "1" to the
      corresponding clear file.
      
      global counts: /sys/fs/xfs/stats/stats
      per-fs counts: /sys/fs/xfs/sda*/stats/stats
      
      global clear:  /sys/fs/xfs/stats/stats_clear
      per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear
      
      [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
       stats structures and macros. ]
      Signed-off-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: per-filesystem stats in sysfs · 225e4635
      Committed by Bill O'Donnell
      This patch implements per-filesystem stats objects in sysfs. It
      depends on the application of the previous patch series that
      develops the infrastructure to support both xfs global stats and
      xfs per-fs stats in sysfs.
      
      Stats objects are instantiated when an xfs filesystem is mounted
      and deleted on unmount. With this patch, the stats directory is
      created and populated with the familiar stats and stats_clear files.
      Example:
              /sys/fs/xfs/sda9/stats/stats
              /sys/fs/xfs/sda9/stats/stats_clear
      
      With this patch, the individual counts within the new per-fs
      stats file(s) remain at zero. Functions that use the macros
      to increment, decrement, and add-to the per-fs stats counts will
      be covered in a separate new patch to follow this one. Note that
      the counts within the global stats file (/sys/fs/xfs/stats/stats)
      advance normally and can be cleared as it was prior to this patch.
      
      [dchinner: move setup/teardown to xfs_fs_{fill|put}_super() so
      it is done before/after any path that uses the per-mount stats. ]
      Signed-off-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • E
      xfs: more info from kmem deadlocks and high-level error msgs · 847f9f68
      Committed by Eric Sandeen
      In an effort to get more useful information out of "possible memory
      allocation deadlock" messages, print the size of the
      requested allocation, and dump the stack if the xfs error
      level is tuned high.
      
      The stack dump is implemented in define_xfs_printk_level()
      for error levels >= LOGLEVEL_ERR, partly because it
      seems generically useful, and also because kmem.c has
      no knowledge of xfs error level tunables or other such bits;
      it's very kmem-specific.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • E
      xfs: avoid null *src in memcpy call in xlog_write · 91f9f5fe
      Committed by Eric Sandeen
      The gcc undefined behavior sanitizer caught this; surely
      any sane memcpy implementation will no-op if size == 0,
      but behavior with a *src of NULL is technically undefined
      (declared nonnull), so avoid it here.
      
      We are actually in this situation frequently via
      xlog_commit_record(), because:
      
              struct xfs_log_iovec reg = {
                      .i_addr = NULL,
                      .i_len = 0,
                      .i_type = XLOG_REG_TYPE_COMMIT,
              };
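
One way to avoid the undefined call is to skip the copy for empty regions. The following is a stand-alone sketch under that assumption; `struct log_iovec` and `copy_region()` are hypothetical names, not the kernel's structures or the actual change made in xlog_write().

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a log region descriptor. */
struct log_iovec {
    const void *addr;  /* may be NULL when len is 0 */
    size_t      len;
};

/*
 * Copy one region into dst and return the number of bytes copied.
 * memcpy() is skipped entirely for empty regions, so a NULL source
 * pointer is never passed to it.
 */
static size_t copy_region(char *dst, const struct log_iovec *reg)
{
    if (reg->len > 0)
        memcpy(dst, reg->addr, reg->len);
    return reg->len;
}
```

An empty commit-record style region (`addr == NULL`, `len == 0`) then passes through without touching memcpy() at all.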
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: pass total block res. as total xfs_bmapi_write() parameter · dbd5c8c9
      Committed by Brian Foster
      The total field from struct xfs_alloc_arg is a bit of an unknown
      commodity. It is documented as the total block requirement for the
      transaction and is used in this manner from most call sites by virtue of
      passing the total block reservation of the transaction associated with
      an allocation. Several xfs_bmapi_write() callers pass hardcoded values
      of 0 or 1 for the total block requirement, which is a historical oddity
      without any clear reasoning.
      
      The xfs_iomap_write_direct() caller, for example, passes 0 for the total
      block requirement. This has been determined to cause problems in the
      form of ABBA deadlocks of AGF buffers due to incorrect AG selection in
      the block allocator. Specifically, the xfs_alloc_space_available()
      function incorrectly selects an AG that doesn't actually have sufficient
      space for the allocation. This occurs because the args.total field is 0
      and thus the remaining free space check on the AG doesn't actually
      consider the size of the allocation request. This locks the AGF buffer,
      the allocation attempt proceeds and ultimately fails (in
      xfs_alloc_fix_minleft()), and xfs_alloc_vexent() moves on to the next
      AG. In turn, this can lead to incorrect AG locking order (if the
      allocator wraps around, attempting to lock AG 0 after acquiring AG N)
      and thus deadlock if racing with another operation. This problem has
      been reproduced via generic/299 on smallish (1GB) ramdisk test devices.
      
      To avoid this problem, replace the undocumented hardcoded total
      parameters from the iomap and utility callers to pass the block
      reservation used for the associated transaction. This is consistent with
      other xfs_bmapi_write() callers throughout XFS. The assumption is that
      the total field allows the selection of an AG that can handle the entire
      operation rather than simply the allocation/range being requested (e.g.,
      resulting btree splits, etc.). This addresses the aforementioned
      generic/299 hang by ensuring AG selection only occurs when the
      allocation can be satisfied by the AG.
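
The effect of a zero total on AG selection can be illustrated with a deliberately simplified model of the free-space check. The real check in the allocator has more terms; `struct alloc_arg` and `ag_space_available()` below are hypothetical stand-ins that keep only the parts discussed above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced allocation arguments. */
struct alloc_arg {
    int64_t minleft;  /* blocks that must remain free after the allocation */
    int64_t total;    /* total blocks the whole transaction may need */
};

/*
 * An AG qualifies when the space left after reserving `total` blocks
 * still covers `minleft`. With total == 0 the size of the request is
 * not considered at all, so nearly any AG passes.
 */
static bool ag_space_available(int64_t ag_free, const struct alloc_arg *args)
{
    return ag_free - args->total >= args->minleft;
}
```

With `total` set to the transaction's block reservation, an AG holding 5 free blocks is rejected for a 16-block reservation instead of being locked and failing later.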
      Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • J
      xfs: avoid dependency on Linux XATTR_SIZE_MAX · 51fcbfe7
      Committed by Jan Tulak
      Currently, we depend on the Linux XATTR value for the on-disk
      definition, which causes trouble on other platforms and
      possibly also if this value were to change.
      
      Fix it by creating a custom definition independent from
      those in Linux (although with the same values), so it is OK
      with the be16 fields used for holding these attributes.
      
      This patch reflects a change in xfsprogs.
      Signed-off-by: Jan Tulak <jtulak@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • J
      xfs: prefix XATTR_LIST_MAX with XFS_ · 4e247614
      Committed by Jan Tulak
      Remove a hard dependency on the Linux XATTR_LIST_MAX value by using
      a prefixed version. This patch reflects the same change in xfsprogs.
      Signed-off-by: Jan Tulak <jtulak@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • G
      libxfs: fix two comment typos · fef4ded8
      Committed by Geliang Tang
      Just fix two typos in code comments.
      Signed-off-by: Geliang Tang <geliangtang@163.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: add an xfs_zero_eof() tracepoint · 0a50f162
      Committed by Brian Foster
      Add a tracepoint in xfs_zero_eof() to facilitate tracking and debugging
      EOF zeroing events. This has proven useful in the context of other
      direct I/O tracepoints to ensure EOF zeroing occurs within appropriate
      file ranges.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: always drain dio before extending aio write submission · 3136e8bb
      Committed by Brian Foster
      XFS supports and typically allows concurrent asynchronous direct I/O
      submission to a single file. One exception to the rule is that file
      extending dio writes that start beyond the current EOF (e.g.,
      potentially create a hole at EOF) require exclusive I/O access to the
      file. This is because such writes must zero any pre-existing blocks
      beyond EOF that are exposed by virtue of now residing within EOF as a
      result of the write about to be submitted.
      
      Before EOF zeroing can occur, the current file i_size must be stabilized
      to avoid data corruption. In this scenario, XFS upgrades the iolock to
      exclude any further I/O submission, waits on in-flight I/O to complete
      to ensure i_size is up to date (i_size is updated on dio write
      completion) and restarts the various checks against the state of the
      file. The problem is that this protection sequence is triggered only
      when the iolock is currently held shared. While this is true for async
      dio in most cases, the caller may upgrade the lock in advance based on
      arbitrary circumstances with respect to EOF zeroing. For example, the
      iolock is always acquired exclusively if the start offset is not block
      aligned. This means that even though the iolock is already held
      exclusive for such I/Os, pending I/O is not drained and thus EOF zeroing
      can occur based on an unstable i_size.
      
      This problem has been reproduced as guest data corruption in virtual
      machines with file-backed qcow2 virtual disks hosted on an XFS
      filesystem. The virtual disks must be configured with aio=native mode
      and must not be truncated out to the maximum file size (as some virt
      managers will do).
      
      Update xfs_file_aio_write_checks() to unconditionally drain in-flight
      dio before EOF zeroing can occur. Rather than trigger the wait based on
      iolock state, use a new flag and upgrade the iolock when necessary. Note
      that this results in a full restart of the inode checks even when the
      iolock was already held exclusive when technically it is only required
      to recheck i_size. This should be a rare enough occurrence that it is
      preferable to keep the code simple rather than create an alternate
      restart jump target.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: validate metadata LSNs against log on v5 superblocks · a45086e2
      Committed by Brian Foster
      Since the onset of v5 superblocks, the LSN of the last modification has
      been included in a variety of on-disk data structures. This LSN is used
      to provide log recovery ordering guarantees (e.g., to ensure an older
      log recovery item is not replayed over a newer target data structure).
      
      While this works correctly from the point a filesystem is formatted and
      mounted, userspace tools have some problematic behaviors that defeat
      this mechanism. For example, xfs_repair historically zeroes out the log
      unconditionally (regardless of whether corruption is detected). If this
      occurs, the LSN of the filesystem is reset and the log is now in a
      problematic state with respect to on-disk metadata structures that might
      have a larger LSN. Until either the log catches up to the highest
      previously used metadata LSN or each affected data structure is modified
      and written out without incident (which resets the metadata LSN), log
      recovery is susceptible to filesystem corruption.
      
      This problem is ultimately addressed and repaired in the associated
      userspace tools. The kernel is still responsible to detect the problem
      and notify the user that something is wrong. Check the superblock LSN at
      mount time and fail the mount if it is invalid. From that point on,
      trigger verifier failure on any metadata I/O where an invalid LSN is
      detected. This results in a filesystem shutdown and guarantees that we
      do not log metadata changes with invalid LSNs on disk. Since this is a
      known issue with a known recovery path, present a warning to instruct
      the user how to recover.
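
The comparison behind such a check can be sketched as below. This assumes an LSN packs a cycle number in its upper 32 bits and a block number in its lower 32 bits, and that a metadata LSN counts as invalid when it lies ahead of the current log head; the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed LSN layout: cycle in the upper 32 bits, block in the lower 32. */
static inline uint32_t lsn_cycle(uint64_t lsn) { return (uint32_t)(lsn >> 32); }
static inline uint32_t lsn_block(uint64_t lsn) { return (uint32_t)lsn; }

static inline uint64_t make_lsn(uint32_t cycle, uint32_t block)
{
    return ((uint64_t)cycle << 32) | block;
}

/*
 * A metadata LSN is treated as valid only if it is not ahead of the
 * current head of the log, comparing the cycle first and then the block.
 */
static bool metadata_lsn_valid(uint64_t meta_lsn, uint64_t head_lsn)
{
    if (lsn_cycle(meta_lsn) > lsn_cycle(head_lsn))
        return false;
    if (lsn_cycle(meta_lsn) == lsn_cycle(head_lsn) &&
        lsn_block(meta_lsn) > lsn_block(head_lsn))
        return false;
    return true;
}
```

In this model, a log that was zeroed and restarted at a low cycle would report every structure stamped with a higher cycle as invalid, which is the condition the mount-time and verifier checks are meant to catch.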
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • T
      xfs: Print name and pid when memory allocation loops · 5bf97b1c
      Committed by Tetsuo Handa
      This patch adds comm name and pid to warning messages printed by
      kmem_alloc(), kmem_zone_alloc() and xfs_buf_allocate_memory().
      This will help tell which memory allocations (e.g. kernel worker
      threads, OOM victim tasks, neither) are stalling, because these functions
      pass __GFP_NOWARN, which suppresses not only the backtrace but also the
      comm name and pid.
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: log local to remote symlink conversions correctly on v5 supers · b7cdc66b
      Committed by Brian Foster
      A local format symlink inode is converted to extent format when an
      extended attribute is set on an inode as part of the attribute fork
      creation. This means a block is allocated, the local symlink target name
      is copied to the block and the block is logged. Currently,
      xfs_bmap_local_to_extents() handles logging the remote block data based
      on the size of the data fork prior to the conversion. This is not
      correct on v5 superblock filesystems, which add an additional header to
      remote symlink blocks that is nonexistent in local format inodes.
      
      As a result, the full length of the remote symlink block content is not
      logged. This can lead to corruption should a crash occur and log
      recovery replay this transaction.
      
      Since a callout is already used to initialize the new remote symlink
      block, update the local-to-extents conversion mechanism to make the
      callout also responsible for logging the block. It is already required
      to set the log buffer type and format the block appropriately based on
      the superblock version. This ensures the remote symlink is always logged
      correctly. Note that xfs_bmap_local_to_extents() is only called for
      symlinks so there are no other callouts that require modification.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: add missing ilock around dio write last extent alignment · 009c6e87
      Committed by Brian Foster
      The iomap codepath (via get_blocks()) acquires and releases the inode
      lock in the case of a direct write that requires block allocation. This
      is because xfs_iomap_write_direct() allocates a transaction, which means
      the ilock must be dropped and reacquired after the transaction is
      allocated and reserved.
      
      xfs_iomap_write_direct() invokes xfs_iomap_eof_align_last_fsb() before
      the transaction is created and thus before the ilock is reacquired. This
      can lead to calls to xfs_iread_extents() and reads of the in-core extent
      list without any synchronization (via xfs_bmap_eof() and
      xfs_bmap_last_extent()). xfs_iread_extents() assert fails if the ilock
      is not held, but this is not currently seen in practice as the current
      callers had already invoked xfs_bmapi_read().
      
      What has been seen in practice are reports of crashes down in the
      xfs_bmap_eof() codepath on direct writes due to seemingly bogus pointer
      references from xfs_iext_get_ext(). While an explicit reproducer is not
      currently available to confirm the cause of the problem, crash analysis
      and code inspection from David Jeffery had identified the insufficient
      locking.
      
      xfs_iomap_eof_align_last_fsb() is called from other contexts with the
      inode lock already held, so we cannot acquire it therein.
      __xfs_get_blocks() acquires and drops the ilock with variable flags to
      cover the event that the extent list must be read in. The common case is
      that __xfs_get_blocks() acquires the shared ilock. To provide locking
      around the last extent alignment call without adding more lock cycles to
      the dio path, update xfs_iomap_write_direct() to expect the shared ilock
      held on entry and do the extent alignment under its protection. Demote
      the lock, if necessary, from __xfs_get_blocks() and push the
      xfs_qm_dqattach() call outside of the shared lock critical section.
      Also, add an assert to document that the extent list is always expected
      to be present in this path. Otherwise, we risk a call to
      xfs_iread_extents() while under the shared ilock. This is safe as all
      current callers have executed an xfs_bmapi_read() call under the current
      iolock context.
      Reported-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • Z
      cancel the setfilesize transaction when io error happen · 5cb13dcd
      Committed by Zhaohongjiang
      When I ran the xfstest/073 case, the remount process was blocked waiting
      for the transaction count to drop to zero. I found that an I/O error had
      occurred and the setfilesize transaction was not released properly. We
      should cancel the transaction when an I/O error occurs in this case.
      
      Reproduction steps:
      1. dd if=/dev/zero of=xfs1.img bs=1M count=2048
      2. mkfs.xfs xfs1.img
      3. losetup -f ./xfs1.img /dev/loop0
      4. mount -t xfs /dev/loop0 /home/test_dir/
      5. mkdir /home/test_dir/test
      6. mkfs.xfs -dfile,name=image,size=2g
      7. mount -t xfs -o loop image /home/test_dir/test
      8. cp a file bigger than 2g to /home/test_dir/test
      9. mount -t xfs -o remount,ro /home/test_dir/test
      
      [ dchinner: moved io error detection to xfs_setfilesize_ioend() after
        transaction context restoration. ]
      Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: pass xfsstats structures to handlers and macros · 80529c45
      Committed by Bill O'Donnell
      This patch is the next step toward per-fs xfs stats. The patch makes
      the show and clear routines able to handle any stats structure
      associated with a kobject.
      
      Instead of a single global xfsstats structure, add kobject and a pointer
      to a per-cpu struct xfsstats. Modify the macros that manipulate the stats
      accordingly: XFS_STATS_INC, XFS_STATS_DEC, and XFS_STATS_ADD now access
      xfsstats->xs_stats.
      
      The sysfs functions need to get from the kobject back to the xfsstats
      structure which contains it, and pass the pointer to the ->xs_stats
      percpu structure into the show & clear routines.
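
Getting from an embedded kobject back to its containing stats object is the usual container_of pattern. The sketch below models it in user space: `struct kobj`, `struct stats`, `struct stats_obj`, the `xs_kobj` member name, and the helper functions are all stand-ins invented for illustration, and the per-cpu aspect is left out.

```c
#include <stddef.h>
#include <string.h>

/* User-space stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct kobj  { const char *name; };          /* stand-in for struct kobject */
struct stats { unsigned long counters[4]; }; /* stand-in for the counters */

/* Stats object: a pointer to the counters plus an embedded kobject. */
struct stats_obj {
    struct stats *xs_stats;
    struct kobj   xs_kobj;
};

/* Counting macro that goes through the object instead of a global. */
#define STATS_INC(obj, idx) ((obj)->xs_stats->counters[(idx)]++)

/* sysfs-style helper: map the kobject back to its counters. */
static struct stats *kobj_to_stats(struct kobj *kobj)
{
    return container_of(kobj, struct stats_obj, xs_kobj)->xs_stats;
}

/* Clear routine that works for any stats object, global or per-fs. */
static void stats_clear(struct kobj *kobj)
{
    memset(kobj_to_stats(kobj), 0, sizeof(struct stats));
}
```

Because the show and clear routines receive only the counters pointer, the same code can serve both the global object and any per-filesystem object.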
      Signed-off-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: consolidate sysfs ops · a27c2640
      Committed by Bill O'Donnell
      As a part of the series to move xfs global stats from procfs to sysfs,
      this patch consolidates the sysfs ops functions and removes redundancy.
      Signed-off-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: remove unused procfs code · 50cf5b74
      Committed by Bill O'Donnell
      As a part of the work to move xfs global stats from procfs to sysfs,
      this patch removes the now unused procfs code that was xfs stat specific.
      Signed-off-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: create symlink proc/fs/xfs/stat to sys/fs/xfs/stats · 32f0ea05
      Committed by Bill O'Donnell
      As a part of the work to move xfs global stats from procfs to sysfs,
      this patch creates the symlink from proc/fs/xfs/stat to sys/fs/xfs/stats.
      Signed-off-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: create global stats and stats_clear in sysfs · bb230c12
      Committed by Bill O'Donnell
      Currently, xfs global stats are in procfs. This patch introduces
      (replicates) the global stats in sysfs. Additionally a stats_clear file
      is introduced in sysfs.
      Signed-off-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  5. 20 Sep, 2015 1 commit
    • C
      fs-writeback: unplug before cond_resched in writeback_sb_inodes · 590dca3a
      Committed by Chris Mason
      Commit 505a666e ("writeback: plug writeback in wb_writeback() and
      writeback_inodes_wb()") has us holding a plug during writeback_sb_inodes,
      which increases the merge rate when relatively contiguous small files
      are written by the filesystem.  It helps both on flash and spindles.
      
      For an fs_mark workload creating 4K files in parallel across 8 drives,
      this commit improves performance ~9% more by unplugging before calling
      cond_resched().  cond_resched() doesn't trigger an implicit unplug, so
      explicitly getting the IO down to the device before scheduling reduces
      latencies for anyone waiting on clean pages.
      
      It also cuts down on how often we use kblockd to unplug, which means
      less work bouncing from one workqueue to another.
      
      Many more details about how we got here:
      
        https://lkml.org/lkml/2015/9/11/570
      
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 18 Sep, 2015 1 commit
  7. 16 Sep, 2015 2 commits
  8. 13 Sep, 2015 1 commit
    • L
      writeback: plug writeback in wb_writeback() and writeback_inodes_wb() · 505a666e
      Committed by Linus Torvalds
      We had to revert the plugging in writeback_sb_inodes() because the
      wb->list_lock is held, but we could easily plug at a higher level before
      taking that lock, and unplug after releasing it.  This does that.
      
      Chris will run performance numbers, just to verify that this approach is
      comparable to the alternative (we could just drop and re-take the lock
      around the blk_finish_plug() rather than these two commits).
      
      I'd have preferred waiting for actual performance numbers before picking
      one approach over the other, but I don't want to release rc1 with the
      known "sleeping function called from invalid context" issue, so I'll
      pick this cleanup version for now.  But if the numbers show that we
      really want to plug just at the writeback_sb_inodes() level, and we
      should just play ugly games with the spinlock, we'll switch to that.
      
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      505a666e
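The ordering described in the commit above (plug before taking the lock, unplug after releasing it) can be sketched in plain userspace C. This is only a toy model under stated assumptions: every `toy_` name is hypothetical, the lock is a boolean flag rather than a spinlock, and the plug is a pair of counters standing in for the I/O that `blk_start_plug()` and `blk_finish_plug()` batch in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for wb->list_lock: just records whether it is held. */
struct toy_lock { bool held; };

/* Hypothetical stand-in for a block-layer plug: counts batched requests. */
struct toy_plug { int pending; int submitted; };

void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = true; }
void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = false; }

/* Starting a plug only sets up batching; it never needs to sleep. */
void toy_start_plug(struct toy_plug *p) { p->pending = 0; }

/* Queue one request on the plug instead of sending it to the device. */
void toy_queue_io(struct toy_plug *p) { p->pending++; }

/*
 * Finishing a plug submits the batch, which may sleep. The model refuses
 * to do so while the lock is held and reports that by returning false.
 */
bool toy_finish_plug(struct toy_plug *p, const struct toy_lock *l)
{
    if (l->held)
        return false;
    p->submitted += p->pending;
    p->pending = 0;
    return true;
}

/* Plug at the higher level: start before locking, finish after unlocking. */
bool toy_writeback(struct toy_lock *list_lock, struct toy_plug *plug, int nr_pages)
{
    int i;

    toy_start_plug(plug);
    toy_lock_acquire(list_lock);
    for (i = 0; i < nr_pages; i++)
        toy_queue_io(plug);            /* work done under the lock */
    toy_lock_release(list_lock);
    return toy_finish_plug(plug, list_lock);
}
```

In this model, calling `toy_finish_plug()` between `toy_lock_acquire()` and `toy_lock_release()` would return false, which mirrors the "sleeping function called from invalid context" problem the commit message mentions.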
  9. 12 Sep, 2015 4 commits
    • S
      [CIFS] mount option sec=none not displayed properly in /proc/mounts · eda2116f
      Committed by Steve French
      When the user specifies "sec=none" in a cifs mount, we set sectype
      to unspecified (and set a flag, leaving the username null) rather
      than setting sectype to "none", so cifs_show_security() did not
      display the option properly in cifs /proc/mounts entries.
      
      Signed-off-by: Steve French <steve.french@primarydata.com>
      Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
      eda2116f
    • A
      revert "ocfs2/dlm: use list_for_each_entry instead of list_for_each" · e527b22c
      Committed by Andrew Morton
      Revert commit f83c7b5e ("ocfs2/dlm: use list_for_each_entry instead
      of list_for_each").
      
      list_for_each_entry() will dereference its `pos' argument, which can be
      NULL in dlm_process_recovery_data().
      
      Reported-by: Julia Lawall <julia.lawall@lip6.fr>
      Reported-by: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e527b22c
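To illustrate why the revert above matters, here is a minimal userspace sketch with simplified stand-ins for the kernel list macros. The `struct lock` type, `find_lock()` and the loop body are assumptions made for illustration, not code taken from dlm_process_recovery_data(). The sketch shows a lookup loop that resets its result pointer to NULL on a mismatch: this is harmless with list_for_each(), which advances through a separate iterator, whereas an entry-based loop computes its next step from the entry pointer itself.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace copies of the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)
#define list_for_each(it, head) \
    for ((it) = (head)->next; (it) != (head); (it) = (it)->next)

void list_init(struct list_head *head) { head->next = head->prev = head; }

void list_add_tail(struct list_head *node, struct list_head *head)
{
    assert(node != NULL && head != NULL);
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Hypothetical element type for this sketch. */
struct lock {
    unsigned int cookie;
    struct list_head list;
};

/* Return the lock with the given cookie, or NULL if none matches. */
struct lock *find_lock(struct list_head *queue, unsigned int cookie)
{
    struct list_head *iter;
    struct lock *lock = NULL;

    list_for_each(iter, queue) {
        lock = list_entry(iter, struct lock, list);
        if (lock->cookie == cookie)
            break;
        /*
         * Safe here: the loop advances through iter, not lock.
         * With list_for_each_entry(lock, queue, list) the step
         * expression reads lock->list.next, so this reset would
         * make the next iteration dereference a NULL pointer.
         */
        lock = NULL;
    }
    return lock;
}
```

A caller can then test the result against NULL to tell "found" from "not found", which is the property that an entry-based cursor does not give for free.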
    • J
      fs/seq_file: convert int seq_vprintf/seq_printf/etc... returns to void · 6798a8ca
      Committed by Joe Perches
      The seq_<foo> function return values were frequently misused.
      
      See: commit 1f33c41c ("seq_file: Rename seq_overflow() to
           seq_has_overflowed() and make public")
      
      All uses of these return values have been removed, so convert the
      return types to void.
      
      Miscellanea:
      
      o Move seq_put_decimal_<type> and seq_escape prototypes closer to the
        other seq_vprintf prototypes
      o Reorder seq_putc and seq_puts to return early on overflow
      o Add argument names to seq_vprintf and seq_printf
      o Update the seq_escape kernel-doc
      o Convert a couple of leading spaces to tabs in seq_escape
      
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6798a8ca
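The calling pattern the commit above moves toward (call a void output helper, then ask the buffer whether it overflowed) can be sketched in userspace as follows. The `sbuf` type and its functions are hypothetical stand-ins, not the kernel's seq_file API; only the names seq_printf() and seq_has_overflowed() come from the commit text. Marking overflow by setting the fill count equal to the buffer size is an assumption of this sketch.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical output buffer: count bytes of buf are in use, size is capacity. */
struct sbuf {
    char *buf;
    size_t size;
    size_t count;
};

/* Assumption of this sketch: a full buffer means output was truncated. */
bool sbuf_has_overflowed(const struct sbuf *m)
{
    return m->count == m->size;
}

/*
 * Append formatted text. Returns nothing, so callers cannot misread a
 * return value; they check sbuf_has_overflowed() instead.
 */
void sbuf_printf(struct sbuf *m, const char *fmt, ...)
{
    va_list args;
    int len;

    assert(m->count <= m->size);
    if (m->count < m->size) {
        size_t avail = m->size - m->count;

        va_start(args, fmt);
        len = vsnprintf(m->buf + m->count, avail, fmt, args);
        va_end(args);
        if (len >= 0 && (size_t)len < avail) {
            m->count += (size_t)len;   /* text fit, including its NUL */
            return;
        }
    }
    m->count = m->size;                /* record the overflow */
}
```

A caller would issue any number of `sbuf_printf()` calls and check `sbuf_has_overflowed()` once at the end, retrying with a larger buffer if it reports true.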
    • L
      Revert "writeback: plug writeback at a high level" · 0ba13fd1
      Committed by Linus Torvalds
      This reverts commit d353d758.
      
      Doing the block layer plug/unplug inside writeback_sb_inodes() is
      broken, because that function is actually called with a spinlock held:
      wb->list_lock, as pointed out by Chris Mason.
      
      Chris suggested just dropping and re-taking the spinlock around the
      blk_finish_plug() call (the plugging itself can happen under the
      spinlock), and that would technically work, but is just disgusting.
      
      We do something fairly similar - but not quite as disgusting because we
      at least have a better reason for it - in writeback_single_inode(), so
      it's not like the caller can depend on the lock being held over the
      call, but in this case there just isn't any good reason for that
      "release and re-take the lock" pattern.
      
      [ In general, we should really strive to avoid the "release and retake"
        pattern for locks, because in the general case it can easily cause
        subtle bugs when the caller caches any state around the call that
        might be invalidated by dropping the lock even just temporarily. ]
      
      But in this case, the plugging should be easy to just move up to the
      callers before the spinlock is taken, which should even improve the
      effectiveness of the plug.  So there is really no good reason to play
      games with locking here.
      
      I'll send off a test-patch so that Dave Chinner can verify that that
      plug movement works.  In the meantime this just reverts the problematic
      commit and adds a comment to the function so that we hopefully don't
      make this mistake again.
      
      Reported-by: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ba13fd1
  10. 11 Sep, 2015 1 commit
    • J
      CIFS: fix type confusion in copy offload ioctl · 4c17a6d5
      Committed by Jann Horn
      This might lead to local privilege escalation (code execution as
      kernel) for systems where the following conditions are met:
      
       - CONFIG_CIFS_SMB2 and CONFIG_CIFS_POSIX are enabled
       - a cifs filesystem is mounted where:
        - the mount option "vers" was used and set to a value >=2.0
        - the attacker has write access to at least one file on the filesystem
      
      To attack this, an attacker would have to guess the target_tcon
      pointer (but guessing wrong doesn't cause a crash, it just returns an
      error code) and win a narrow race.
      
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: Jann Horn <jann@thejh.net>
      Signed-off-by: Steve French <smfrench@gmail.com>
      4c17a6d5