1. 15 July 2014, 5 commits
    • xfs: add a sysfs kset · 3d871226
      Committed by Brian Foster
      Create a sysfs kset to contain all sub-objects associated with the XFS
      module. The kset is created and removed on module initialization and
      removal respectively. The kset uses fs_obj as a parent. This leads to
      the creation of a /sys/fs/xfs directory when the kset exists.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      3d871226
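      A rough sketch of the pattern described above, assuming the standard
      kobject API: kset_create_and_add() and fs_kobj are real kernel
      interfaces, while the module and kset names here are placeholders
      rather than the actual XFS code.

      #include <linux/fs.h>
      #include <linux/kobject.h>
      #include <linux/module.h>

      static struct kset *example_kset;       /* module-wide sysfs kset */

      static int __init example_init(void)
      {
              /* parented to fs_kobj, so /sys/fs/<name> exists while the
               * kset does */
              example_kset = kset_create_and_add("example", NULL, fs_kobj);
              if (!example_kset)
                      return -ENOMEM;
              return 0;
      }

      static void __exit example_exit(void)
      {
              kset_unregister(example_kset);
      }

      module_init(example_init);
      module_exit(example_exit);
      MODULE_LICENSE("GPL");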
    • xfs: fix a couple error sequence jumps in xfs_mountfs() · a70a4fa5
      Committed by Brian Foster
      xfs_mountfs() has a couple of failure conditions that do not jump to
      the correct labels. Specifically:
      
      - xfs_initialize_perag_data() failure does not deallocate the log even
        though it occurs after log initialization
      - xfs_mount_reset_sbqflags() failure returns the error directly rather
        than jumping to the error sequence
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      a70a4fa5
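      A minimal userspace sketch of the unwind ordering the fix restores:
      every failure after a resource is set up must jump to the label that
      tears that resource down. All helper names are hypothetical stand-ins
      for the xfs_mountfs() setup steps, not the real code.

      #include <stdio.h>

      static int setup_log(void)       { return 0; }
      static int init_perag_data(void) { return -1; }  /* pretend failure */
      static int reset_sbqflags(void)  { return 0; }
      static void teardown_log(void)   { puts("log deallocated"); }
      static void free_perag(void)     { puts("perag freed"); }

      static int example_mount(void)
      {
              int error;

              error = setup_log();
              if (error)
                      goto out_free_perag;

              error = init_perag_data();   /* fails after the log exists... */
              if (error)
                      goto out_log_dealloc; /* ...so it must unwind the log too */

              error = reset_sbqflags();
              if (error)
                      goto out_log_dealloc; /* jump, rather than returning directly */

              return 0;

       out_log_dealloc:
              teardown_log();
       out_free_perag:
              free_perag();
              return error;
      }

      int main(void)
      {
              return example_mount() ? 1 : 0;
      }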
    • xfs: null unused quota inodes when quota is on · 03e01349
      Committed by Dave Chinner
      When quota is on, it is expected that unused quota inodes have a
      value of NULLFSINO. The changes to support a separate project quota
      in 3.12 broke this rule for filesystems without project quota
      inodes enabled, as the code now refuses to write the group quota
      inode if neither group nor project quotas are enabled. This
      regression was introduced by commit d892d586 ("xfs: Start using
      pquotaino from the superblock").
      
      In this case, we should be writing NULLFSINO rather than nothing to
      ensure that we leave the group quota inode in a valid state while
      quotas are enabled.
      
      Failure to do so doesn't cause a current kernel to break - the
      separate project quota inodes introduced translation code to always
      treat a zero inode as NULLFSINO. This was introduced by commit
      01026297 ("xfs: Initialize all quota inodes to be NULLFSINO"), which
      is also in 3.12, but older kernels do not do this and hence taking a
      filesystem back to an older kernel can result in quotas failing
      initialisation at mount time. When that happens, we see this in
      dmesg:
      
      [ 1649.215390] XFS (sdb): Mounting Filesystem
      [ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
      [ 1649.316902] XFS (sdb): Ending clean mount
      
      By ensuring that we write NULLFSINO to quota inodes that aren't
      active, we avoid this problem. We have to be really careful when
      determining if the quota inodes are active or not, because we don't
      want to write a NULLFSINO if the quota inodes are active and we
      simply aren't updating them.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      03e01349
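      A hedged sketch of the rule above: when quotas are enabled but a
      particular quota inode is not active, the on-disk field is written as
      NULLFSINO rather than left alone or zeroed. The structure and field
      names are illustrative, not the real xfs_sb_quota_to_disk() code.

      #include <stdint.h>

      #define EXAMPLE_NULLFSINO  ((uint64_t)-1)   /* stand-in for NULLFSINO */

      struct example_dsb {
              uint64_t sb_gquotino;   /* on-disk group quota inode number */
      };

      static void example_quota_to_disk(struct example_dsb *to,
                                        uint64_t gquotino, int gquota_active)
      {
              if (gquota_active)
                      to->sb_gquotino = gquotino;
              else
                      to->sb_gquotino = EXAMPLE_NULLFSINO; /* never leave 0 behind */
      }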
    • xfs: refine the allocation stack switch · cf11da9c
      Committed by Dave Chinner
      The allocation stack switch at xfs_bmapi_allocate() has served its
      purpose, but is no longer a sufficient solution to the stack usage
      problem we have in the XFS allocation path.
      
      Whilst the kernel stack size is now 16k, that is not a valid reason
      for undoing all our "keep stack usage down" modifications. What it
      does allow us to do is have the freedom to refine and perfect the
      modifications knowing that if we get it wrong it won't blow up in
      our faces - we have a safety net now.
      
      This is important because we still have the issue of older kernels
      having smaller stacks and that they are still supported and are
      demonstrating a wide range of different stack overflows.  Red Hat
      has several open bugs for allocation based stack overflows from
      directory modifications and direct IO block allocation and these
      problems still need to be solved. If we can solve them upstream,
      then distros won't need to bake their own unique solutions.
      
      To that end, I've observed that every allocation based stack
      overflow report has had a specific characteristic - it has happened
      during or directly after a bmap btree block split. That event
      requires a new block to be allocated to the tree, and so we
      effectively stack one allocation stack on top of another, and that's
      when we get into trouble.
      
      A further observation is that bmap btree block splits are much rarer
      than writeback allocation - over a range of different workloads I've
      observed the ratio of bmap btree inserts to splits ranges from 100:1
      (xfstests run) to 10000:1 (local VM image server with sparse files
      that range in the hundreds of thousands to millions of extents).
      Either way, bmap btree split events are much, much rarer than
      allocation events.
      
      Finally, we have to move the kswapd state to the allocation workqueue
      work when allocation is done on behalf of kswapd. This is proving to
      cause significant perturbation in performance under memory pressure
      and appears to be generating allocation deadlock warnings under some
      workloads, so avoiding the use of a workqueue for the majority of
      kswapd writeback allocation will minimise the impact of such
      behaviour.
      
      Hence it makes sense to move the stack switch to xfs_btree_split()
      and only do it for bmap btree splits. Stack switches during
      allocation will be much rarer, so there won't be significant
      performance overhead caused by switching stacks. The worst case
      stack from all allocation paths will be split, not just writeback.
      And the majority of memory allocations will be done in the correct
      context (e.g. kswapd) without causing additional latency, and so we
      simplify the memory reclaim interactions between processes,
      workqueues and kswapd.
      
      The worst stack I've been able to generate with this patch in place
      is 5600 bytes deep. It's very revealing because we exit XFS at:
      
      37)     1768      64   kmem_cache_alloc+0x13b/0x170
      
      about 1800 bytes of stack consumed, and the remaining 3800 bytes
      (and 36 functions) is memory reclaim, swap and the IO stack. And
      this occurs in the inode allocation from an open(O_CREAT) syscall,
      not writeback.
      
      The amount of stack being used is much less than I've previously been
      able to generate - fs_mark testing has been able to generate stack
      usage of around 7k without too much trouble; with this patch it's
      only just getting to 5.5k. This is primarily because the metadata
      allocation paths (e.g. directory blocks) are no longer causing
      double splits on the same stack, and hence now stack tracing is
      showing swapping being the worst stack consumer rather than XFS.
      
      Performance of fs_mark inode create workloads is unchanged.
      Performance of fs_mark async fsync workloads is consistently good
      with context switches reduced by around 150,000/s (30%).
      Performance of dbench, streaming IO and postmark is unchanged.
      Allocation deadlock warnings have not been seen on the workloads
      that generated them since adding this patch.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      cf11da9c
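      A hedged sketch of the stack-switch mechanism the commit moves into
      xfs_btree_split(): the deep part of the work is handed to a workqueue
      thread, which starts on a fresh stack, and the caller waits for it to
      complete. All names beginning with example_ are placeholders, not the
      real XFS implementation; the workqueue and completion calls are the
      standard kernel APIs.

      #include <linux/workqueue.h>
      #include <linux/completion.h>

      static struct workqueue_struct *example_split_wq;  /* created elsewhere */

      struct example_split_args {
              struct work_struct      work;
              struct completion       done;
              void                    *cur;      /* e.g. a btree cursor */
              int                     result;
      };

      static int example_do_split(void *cur)
      {
              /* the deep allocation path would live here */
              return 0;
      }

      static void example_split_worker(struct work_struct *work)
      {
              struct example_split_args *args =
                      container_of(work, struct example_split_args, work);

              /* runs on the worker thread's (empty) stack */
              args->result = example_do_split(args->cur);
              complete(&args->done);
      }

      static int example_split(void *cur)
      {
              struct example_split_args args = { .cur = cur };

              INIT_WORK_ONSTACK(&args.work, example_split_worker);
              init_completion(&args.done);
              queue_work(example_split_wq, &args.work);
              wait_for_completion(&args.done);
              destroy_work_on_stack(&args.work);
              return args.result;
      }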
    • Revert "xfs: block allocation work needs to be kswapd aware" · aa182e64
      Committed by Dave Chinner
      This reverts commit 1f6d6482.
      
      This commit resulted in regressions in performance in low
      memory situations where kswapd was doing writeback of delayed
      allocation blocks. It resulted in significant parallelism of the
      kswapd work, and with the special kswapd flags meant that hundreds of
      active allocations could dip into kswapd-specific memory reserves and
      avoid being throttled. This caused a large amount of performance
      variation, as well as random OOM-killer invocations that didn't
      previously exist.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      aa182e64
  2. 25 June 2014, 4 commits
    • xfs: global error sign conversion · 2451337d
      Committed by Dave Chinner
      Convert all the errors in the core XFS code to negative error signs
      like the rest of the kernel and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      2451337d
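      An illustration of the conversion with hypothetical functions, not
      actual XFS code: core routines now return negative errnos directly,
      so the interface layer no longer flips the sign.

      #include <errno.h>

      /* Before: core code returned positive errnos, negated at the boundary. */
      static int core_op_old(int fail)
      {
              return fail ? ENOMEM : 0;     /* positive error */
      }

      static int vfs_op_old(int fail)
      {
              return -core_op_old(fail);    /* sign conversion in the interface layer */
      }

      /* After: core code returns negative errnos like the rest of the kernel. */
      static int core_op_new(int fail)
      {
              return fail ? -ENOMEM : 0;    /* negative error */
      }

      static int vfs_op_new(int fail)
      {
              return core_op_new(fail);     /* passed straight through */
      }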
    • libxfs: move source files · 30f712c9
      Committed by Dave Chinner
      Move all the source files that are shared with userspace into
      libxfs/. This is done as one big chunk simply to get it done
      quickly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      30f712c9
    • libxfs: move header files · 84be0ffc
      Committed by Dave Chinner
      Move all the header files that are shared with userspace into
      libxfs. This is done as one big chunk simply to get it done quickly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      84be0ffc
    • xfs: create libxfs infrastructure · 69116a13
      Committed by Dave Chinner
      To minimise the differences between kernel and userspace code,
      split the kernel code into the same structure as the userspace code.
      That is, the generic core functionality of XFS is moved to a libxfs/
      directory and treated as a layering barrier in the XFS code.
      
      This patch introduces the libxfs directory, the build infrastructure
      and an initial source and header file to build. The libxfs directory
      will contain the header files that are needed to build libxfs - most
      of userspace does not care about the location of these header files
      as they are accessed indirectly. Hence keeping them inside libxfs
      makes it easy to track the changes and script the sync process as
      the directory structure will be identical.
      
      To allow this changeover to occur in the kernel code, there is some
      temporary infrastructure in the makefiles to grab the header files
      from both locations. Once all the files are moved,
      modifications will be made in the source code that will make the
      need for these include directives go away.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      69116a13
  3. 22 June 2014, 2 commits
  4. 12 June 2014, 1 commit
    • ->splice_write() via ->write_iter() · 8d020765
      Committed by Al Viro
      iter_file_splice_write() - a ->splice_write() instance that gathers the
      pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
      it to ->write_iter().  A bunch of simple cases converted to that...
      
      [AV: fixed the braino spotted by Cyrill]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      8d020765
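      A minimal sketch of how a filesystem adopts the new helper;
      iter_file_splice_write and the file_operations fields are the real
      kernel interfaces, while the example_ method is a placeholder for the
      filesystem's own ->write_iter().

      #include <linux/fs.h>
      #include <linux/uio.h>

      static ssize_t example_write_iter(struct kiocb *iocb, struct iov_iter *from)
      {
              return -EIO;    /* the filesystem's real ->write_iter() goes here */
      }

      static const struct file_operations example_fops = {
              .write_iter     = example_write_iter,
              /* pipe buffers are gathered into a bio_vec-based iov_iter and
               * fed to ->write_iter() by the generic helper */
              .splice_write   = iter_file_splice_write,
      };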
  5. 11 June 2014, 1 commit
    • fs,userns: Change inode_capable to capable_wrt_inode_uidgid · 23adbe12
      Committed by Andy Lutomirski
      The kernel has no concept of capabilities with respect to inodes; inodes
      exist independently of namespaces.  For example, inode_capable(inode,
      CAP_LINUX_IMMUTABLE) would be nonsense.
      
      This patch changes inode_capable to check for uid and gid mappings and
      renames it to capable_wrt_inode_uidgid, which should make it more
      obvious what it does.
      
      Fixes CVE-2014-4014.
      
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23adbe12
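      A hedged sketch of the kind of check the renamed helper enables: the
      surrounding function is hypothetical, while capable_wrt_inode_uidgid(),
      current_fsuid() and CAP_FOWNER are the real kernel interfaces.

      #include <linux/capability.h>
      #include <linux/cred.h>
      #include <linux/fs.h>

      static bool example_owner_or_privileged(const struct inode *inode)
      {
              /* the file's owner is always allowed */
              if (uid_eq(current_fsuid(), inode->i_uid))
                      return true;

              /* otherwise require CAP_FOWNER, granted only if the caller's
               * user namespace maps this inode's uid and gid */
              return capable_wrt_inode_uidgid(inode, CAP_FOWNER);
      }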
  6. 10 June 2014, 1 commit
  7. 06 June 2014, 23 commits
  8. 20 May 2014, 3 commits
    • xfs: fix compile error when libxfs header used in C++ code · 376c2f3a
      Committed by Roger Willcocks
      xfs_ialloc.h:102: error: expected ',' or '...' before 'delete'
      
      Simple parameter rename, no changes to behaviour.
      Signed-off-by: Roger Willcocks <roger@filmlight.ltd.uk>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      376c2f3a
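      To illustrate the class of failure (placeholder prototype, not the
      exact xfs_ialloc.h declaration): 'delete' is a legal identifier in C
      but a reserved keyword in C++, so a shared header using it as a
      parameter name cannot be parsed by a C++ compiler. Renaming the
      parameter fixes the build without changing behaviour or ABI.

      /* Before: fails to parse when the header is included from C++ */
      int example_difree(void *tp, unsigned long long inode, int *delete);

      /* After: identical type and behaviour, C++-safe parameter name */
      int example_difree(void *tp, unsigned long long inode, int *deleted);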
    • xfs: fix infinite loop at xfs_vm_writepage on 32bit system · 8695d27e
      Committed by Jie Liu
      Writing to a file at an offset greater than 16TB on a 32-bit system
      and then triggering page write-back via sync(1) will cause the task
      to hang.
      
      # block_size=4096
      # offset=$(((2**32 - 1) * $block_size))
      # xfs_io -f -c "pwrite $offset $block_size" /storage/test_file
      # sync
      
      INFO: task sync:2590 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      sync            D c1064a28     0  2590   2097 0x00000000
      .....
      Call Trace:
      [<c1064a28>] ? ttwu_do_wakeup+0x18/0x130
      [<c1066d0e>] ? try_to_wake_up+0x1ce/0x220
      [<c1066dbf>] ? wake_up_process+0x1f/0x40
      [<c104fc2e>] ? wake_up_worker+0x1e/0x30
      [<c15b6083>] schedule+0x23/0x60
      [<c15b3c2d>] schedule_timeout+0x18d/0x1f0
      [<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90
      [<c10515f1>] ? __queue_delayed_work+0x91/0x150
      [<c12a12ef>] ? do_raw_spin_lock+0x3f/0x100
      [<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90
      [<c15b5b5d>] wait_for_completion+0x7d/0xc0
      [<c1066d60>] ? try_to_wake_up+0x220/0x220
      [<c116a4d2>] sync_inodes_sb+0x92/0x180
      [<c116fb05>] sync_inodes_one_sb+0x15/0x20
      [<c114a8f8>] iterate_supers+0xb8/0xc0
      [<c116faf0>] ? fdatawrite_one_bdev+0x20/0x20
      [<c116fc21>] sys_sync+0x31/0x80
      [<c15be18d>] sysenter_do_call+0x12/0x28
      
      This issue can be triggered via xfstests/generic/308.
      
      The reason is that end_index is an unsigned long with maximum value
      '2^32-1=4294967295' on a 32-bit platform, and the given offset causes
      it to wrap to 0, so that the following code repeats again and again
      until the hung-task timeout fires:
      
      end_index = offset >> PAGE_CACHE_SHIFT;
      last_index = (offset - 1) >> PAGE_CACHE_SHIFT;
      if (page->index >= end_index) {
      	unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1);
              /*
               * Just skip the page if it is fully outside i_size, e.g. due
               * to a truncate operation that is in progress.
               */
              if (page->index >= end_index + 1 || offset_into_page == 0) {
      	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      		unlock_page(page);
      		return 0;
      	}
      
      In order to check whether a page is fully outside i_size or not, we
      can fix the code logic as below:
      	if (page->index > end_index ||
      	    (page->index == end_index && offset_into_page == 0))
      
      Secondly, there is another similar issue when calculating the end
      offset for mapping the filesystem blocks to the file blocks for
      delalloc.  With the same test as above, running umount(8) will cause
      a kernel panic if CONFIG_XFS_DEBUG is enabled:
      
      XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || \
      	ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 964
      
      kernel BUG at fs/xfs/xfs_message.c:108!
      invalid opcode: 0000 [#1] SMP
      task: edddc100 ti: ec6ee000 task.ti: ec6ee000
      EIP: 0060:[<f83d87cb>] EFLAGS: 00010296 CPU: 1
      EIP is at assfail+0x2b/0x30 [xfs]
      ..............
      Call Trace:
      [<f83d9cd4>] xfs_fs_destroy_inode+0x74/0x120 [xfs]
      [<c115ddf1>] destroy_inode+0x31/0x50
      [<c115deff>] evict+0xef/0x170
      [<c115dfb2>] dispose_list+0x32/0x40
      [<c115ea3a>] evict_inodes+0xca/0xe0
      [<c1149706>] generic_shutdown_super+0x46/0xd0
      [<c11497b9>] kill_block_super+0x29/0x70
      [<c1149a14>] deactivate_locked_super+0x44/0x70
      [<c114a427>] deactivate_super+0x47/0x60
      [<c1161c3d>] mntput_no_expire+0xcd/0x120
      [<c1162ae8>] SyS_umount+0xa8/0x370
      [<c1162dce>] SyS_oldumount+0x1e/0x20
      [<c15be18d>] sysenter_do_call+0x12/0x28
      
      That is because the end_offset evaluates to 0 for the same reason as
      above, hence the mapping and conversion of delalloc file blocks to
      filesystem blocks does not happen.
      
      This patch fixes both issues.
      Reported-by: Michael L. Semon <mlsemon35@gmail.com>
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      8695d27e
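      A standalone userspace demonstration of the wraparound described
      above, assuming 4k pages and modelling the 32-bit unsigned long
      end_index with uint32_t (everything here is illustrative, not kernel
      code):

      #include <stdint.h>
      #include <stdio.h>

      #define PAGE_CACHE_SHIFT 12             /* 4096-byte pages */

      int main(void)
      {
              uint64_t block_size = 4096;
              uint64_t offset = ((UINT64_C(1) << 32) - 1) * block_size;
              uint64_t isize  = offset + block_size;  /* i_size after the pwrite */

              /* what a 32-bit unsigned long end_index would hold: */
              uint32_t end_index = (uint32_t)(isize >> PAGE_CACHE_SHIFT);

              printf("i_size           = %llu\n", (unsigned long long)isize);
              printf("i_size >> 12     = %llu\n",
                     (unsigned long long)(isize >> PAGE_CACHE_SHIFT));
              printf("32-bit end_index = %u (wrapped to zero)\n", end_index);
              return 0;
      }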
    • xfs: remove redundant checks from xfs_da_read_buf · 7c166350
      Committed by Dave Chinner
      All of the verification checks of magic numbers are now done by
      verifiers, so there is no need to check them again once the buffer
      has been successfully read. If the magic number is bad, it won't
      even get to that code to verify it so it really serves no purpose at
      all anymore. Remove it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      7c166350
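      A hedged sketch of the verifier idea (structure and names are
      illustrative, not the actual xfs_buf_ops code): the magic number is
      validated once, when the buffer is read, so callers holding a
      successfully read buffer never need to recheck it.

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      #define EXAMPLE_MAGIC   0x3ebeU

      struct example_block {
              uint16_t magic;
              /* ... remainder of the on-disk header ... */
      };

      /* runs once, as part of the read path */
      static bool example_verify_read(const struct example_block *blk)
      {
              return blk->magic == EXAMPLE_MAGIC;
      }

      static const struct example_block *example_read_buf(const void *disk_buf)
      {
              const struct example_block *blk = disk_buf;

              if (!example_verify_read(blk))
                      return NULL;    /* corrupt buffers never reach callers */

              /* no redundant magic number check needed here or in callers */
              return blk;
      }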