1. 15 7月, 2014 5 次提交
    • B
      xfs: add a sysfs kset · 3d871226
      Brian Foster 提交于
      Create a sysfs kset to contain all sub-objects associated with the XFS
      module. The kset is created and removed on module initialization and
      removal respectively. The kset uses fs_obj as a parent. This leads to
      the creation of a /sys/fs/xfs directory when the kset exists.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      3d871226
    • B
      xfs: fix a couple error sequence jumps in xfs_mountfs() · a70a4fa5
      Brian Foster 提交于
      xfs_mountfs() has a couple failure conditions that do not jump to the
      correct labels. Specifically:
      
      - xfs_initialize_perag_data() failure does not deallocate the log even
        though it occurs after log initialization
      - xfs_mount_reset_sbqflags() failure returns the error directly rather
        than jump to the error sequence
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      a70a4fa5
    • D
      xfs: null unused quota inodes when quota is on · 03e01349
      Dave Chinner 提交于
      When quota is on, it is expected that unused quota inodes have a
      value of NULLFSINO. The changes to support a separate project quota
      in 3.12 broken this rule for non-project quota inode enabled
      filesystem, as the code now refuses to write the group quota inode
      if neither group or project quotas are enabled. This regression was
      introduced by commit d892d586 ("xfs: Start using pquotaino from the
      superblock").
      
      In this case, we should be writing NULLFSINO rather than nothing to
      ensure that we leave the group quota inode in a valid state while
      quotas are enabled.
      
      Failure to do so doesn't cause a current kernel to break - the
      separate project quota inodes introduced translation code to always
      treat a zero inode as NULLFSINO. This was introduced by commit
      01026297 ("xfs: Initialize all quota inodes to be NULLFSINO") with is
      also in 3.12 but older kernels do not do this and hence taking a
      filesystem back to an older kernel can result in quotas failing
      initialisation at mount time. When that happens, we see this in
      dmesg:
      
      [ 1649.215390] XFS (sdb): Mounting Filesystem
      [ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
      [ 1649.316902] XFS (sdb): Ending clean mount
      
      By ensuring that we write NULLFSINO to quota inodes that aren't
      active, we avoid this problem. We have to be really careful when
      determining if the quota inodes are active or not, because we don't
      want to write a NULLFSINO if the quota inodes are active and we
      simply aren't updating them.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      03e01349
    • D
      xfs: refine the allocation stack switch · cf11da9c
      Dave Chinner 提交于
      The allocation stack switch at xfs_bmapi_allocate() has served it's
      purpose, but is no longer a sufficient solution to the stack usage
      problem we have in the XFS allocation path.
      
      Whilst the kernel stack size is now 16k, that is not a valid reason
      for undoing all our "keep stack usage down" modifications. What it
      does allow us to do is have the freedom to refine and perfect the
      modifications knowing that if we get it wrong it won't blow up in
      our faces - we have a safety net now.
      
      This is important because we still have the issue of older kernels
      having smaller stacks and that they are still supported and are
      demonstrating a wide range of different stack overflows.  Red Hat
      has several open bugs for allocation based stack overflows from
      directory modifications and direct IO block allocation and these
      problems still need to be solved. If we can solve them upstream,
      then distro's won't need to bake their own unique solutions.
      
      To that end, I've observed that every allocation based stack
      overflow report has had a specific characteristic - it has happened
      during or directly after a bmap btree block split. That event
      requires a new block to be allocated to the tree, and so we
      effectively stack one allocation stack on top of another, and that's
      when we get into trouble.
      
      A further observation is that bmap btree block splits are much rarer
      than writeback allocation - over a range of different workloads I've
      observed the ratio of bmap btree inserts to splits ranges from 100:1
      (xfstests run) to 10000:1 (local VM image server with sparse files
      that range in the hundreds of thousands to millions of extents).
      Either way, bmap btree split events are much, much rarer than
      allocation events.
      
      Finally, we have to move the kswapd state to the allocation workqueue
      work when allocation is done on behalf of kswapd. This is proving to
      cause significant perturbation in performance under memory pressure
      and appears to be generating allocation deadlock warnings under some
      workloads, so avoiding the use of a workqueue for the majority of
      kswapd writeback allocation will minimise the impact of such
      behaviour.
      
      Hence it makes sense to move the stack switch to xfs_btree_split()
      and only do it for bmap btree splits. Stack switches during
      allocation will be much rarer, so there won't be significant
      performacne overhead caused by switching stacks. The worse case
      stack from all allocation paths will be split, not just writeback.
      And the majority of memory allocations will be done in the correct
      context (e.g. kswapd) without causing additional latency, and so we
      simplify the memory reclaim interactions between processes,
      workqueues and kswapd.
      
      The worst stack I've been able to generate with this patch in place
      is 5600 bytes deep. It's very revealing because we exit XFS at:
      
      37)     1768      64   kmem_cache_alloc+0x13b/0x170
      
      about 1800 bytes of stack consumed, and the remaining 3800 bytes
      (and 36 functions) is memory reclaim, swap and the IO stack. And
      this occurs in the inode allocation from an open(O_CREAT) syscall,
      not writeback.
      
      The amount of stack being used is much less than I've previously be
      able to generate - fs_mark testing has been able to generate stack
      usage of around 7k without too much trouble; with this patch it's
      only just getting to 5.5k. This is primarily because the metadata
      allocation paths (e.g. directory blocks) are no longer causing
      double splits on the same stack, and hence now stack tracing is
      showing swapping being the worst stack consumer rather than XFS.
      
      Performance of fs_mark inode create workloads is unchanged.
      Performance of fs_mark async fsync workloads is consistently good
      with context switches reduced by around 150,000/s (30%).
      Performance of dbench, streaming IO and postmark is unchanged.
      Allocation deadlock warnings have not been seen on the workloads
      that generated them since adding this patch.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      cf11da9c
    • D
      Revert "xfs: block allocation work needs to be kswapd aware" · aa182e64
      Dave Chinner 提交于
      This reverts commit 1f6d6482.
      
      This commit resulted in regressions in performance in low
      memory situations where kswapd was doing writeback of delayed
      allocation blocks. It resulted in significant parallelism of the
      kswapd work and with the special kswapd flags meant that hundreds of
      active allocation could dip into kswapd specific memory reserves and
      avoid being throttled. This cause a large amount of performance
      variation, as well as random OOM-killer invocations that didn't
      previously exist.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      aa182e64
  2. 25 6月, 2014 4 次提交
    • D
      xfs: global error sign conversion · 2451337d
      Dave Chinner 提交于
      Convert all the errors the core XFs code to negative error signs
      like the rest of the kernel and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      2451337d
    • D
      libxfs: move source files · 30f712c9
      Dave Chinner 提交于
      Move all the source files that are shared with userspace into
      libxfs/. This is done as one big chunk simpy to get it done
      quickly
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      30f712c9
    • D
      libxfs: move header files · 84be0ffc
      Dave Chinner 提交于
      Move all the header files that are shared with userspace into
      libxfs. This is done as one big chunk simpy to get it done quickly.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      84be0ffc
    • D
      xfs: create libxfs infrastructure · 69116a13
      Dave Chinner 提交于
      To minimise the differences between kernel and userspace code,
      split the kernel code into the same structure as the userspace code.
      That is, the gneric core functionality of XFS is moved to a libxfs/
      directory and treat it as a layering barrier in the XFS code.
      
      This patch introduces the libxfs directory, the build infrastructure
      and an initial source and header file to build. The libxfs directory
      will contain the header files that are needed to build libxfs - most
      of userspace does not care about the location of these header files
      as they are accessed indirectly. Hence keeping them inside libxfs
      makes it easy to track the changes and script the sync process as
      the directory structure will be identical.
      
      To allow this changeover to occur in the kernel code, there are some
      temporary infrastructure in the makefiles to grab the header
      filesystem from both locations. Once all the files are moved,
      modifications will be made in the source code that will make the
      need for these include directives go away.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      69116a13
  3. 22 6月, 2014 2 次提交
  4. 20 6月, 2014 9 次提交
    • M
      Btrfs: fix wrong error handle when the device is missing or is not writeable · 8408c716
      Miao Xie 提交于
      The original bio might be submitted, so we shoud increase bi_remaining to
      account for it when we deal with the error that the device is missing or
      is not writeable, or we would skip the endio handle.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8408c716
    • M
      Btrfs: fix deadlock when mounting a degraded fs · c55f1396
      Miao Xie 提交于
      The deadlock happened when we mount degraded filesystem, the reproduced
      steps are following:
       # mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1>
       # echo 1 > /sys/block/`basename <dev0>`/device/delete
       # mount -o degraded <dev1> <mnt>
      
      The reason was that the counter -- bi_remaining was wrong. If the missing
      or unwriteable device was the last device in the mapping array, we would
      not submit the original bio, so we shouldn't increase bi_remaining of it
      in btrfs_end_bio(), or we would skip the final endio handle.
      
      Fix this problem by adding a flag into btrfs bio structure. If we submit
      the original bio, we will set the flag, and we increase bi_remaining counter,
      or we don't.
      
      Though there is another way to fix it -- decrease bi_remaining counter of the
      original bio when we make sure the original bio is not submitted, this method
      need add more check and is easy to make mistake.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c55f1396
    • M
      Btrfs: use bio_endio_nodec instead of open code · e990f167
      Miao Xie 提交于
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e990f167
    • W
      Btrfs: fix NULL pointer crash when running balance and scrub concurrently · 298a8f9c
      Wang Shilong 提交于
      While running balance, scrub, fsstress concurrently we hit the
      following kernel crash:
      
      [56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132
      [56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
      [56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
      [56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0
      [56561.524325] Oops: 0000 [#1] SMP
      [....]
      [56561.527237] Call Trace:
      [56561.527309]  [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs]
      [56561.527392]  [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0
      [56561.527476]  [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs]
      [56561.527561]  [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs]
      [56561.527639]  [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0
      [56561.527712]  [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60
      [56561.527788]  [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0
      [56561.527870]  [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0
      [56561.527941]  [<ffffffff815707f7>] tracesys+0xdd/0xe2
      [...]
      [56561.528304] RIP  [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
      [56561.528395]  RSP <ffff88004c0f5be8>
      [56561.528454] CR2: 0000000000000078
      
      This is because in btrfs_relocate_chunk(), we will free @bdev directly while
      scrub may still hold extent mapping, and may access freed memory.
      
      Fix this problem by wrapping freeing @bdev work into free_extent_map() which
      is based on reference count.
      Reported-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      298a8f9c
    • Q
      btrfs: Skip scrubbing removed chunks to avoid -ENOENT. · ced96edc
      Qu Wenruo 提交于
      When run scrub with balance, sometimes -ENOENT will be returned, since
      in scrub_enumerate_chunks() will search dev_extent in *COMMIT_ROOT*, but
      btrfs_lookup_block_group() will search block group in *MEMORY*, so if a
      chunk is removed but not committed, -ENOENT will be returned.
      
      However, there is no need to stop scrubbing since other chunks may be
      scrubbed without problem.
      
      So this patch changes the behavior to skip removed chunks and continue
      to scrub the rest.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ced96edc
    • M
      Btrfs: fix broken free space cache after the system crashed · e570fd27
      Miao Xie 提交于
      When we mounted the filesystem after the crash, we got the following
      message:
        BTRFS error (device xxx): block group xxxx has wrong amount of free space
        BTRFS error (device xxx): failed to load free space cache for block group xxx
      
      It is because we didn't update the metadata of the allocated space (in extent
      tree) until the file data was written into the disk. During this time, there was
      no information about the allocated spaces in either the extent tree nor the
      free space cache. when we wrote out the free space cache at this time (commit
      transaction), those spaces were lost. In fact, only the free space that is
      used to store the file data had this problem, the others didn't because
      the metadata of them is updated in the same transaction context.
      
      There are many methods which can fix the above problem
      - track the allocated space, and write it out when we write out the free
        space cache
      - account the size of the allocated space that is used to store the file
        data, if the size is not zero, don't write out the free space cache.
      
      The first one is complex and may make the performance drop down.
      This patch chose the second method, we use a per-block-group variant to
      account the size of that allocated space. Besides that, we also introduce
      a per-block-group read-write semaphore to avoid the race between
      the allocation and the free space cache write out.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e570fd27
    • M
      Btrfs: make free space cache write out functions more readable · 5349d6c3
      Miao Xie 提交于
      This patch makes the free space cache write out functions more readable,
      and beisdes that, it also reduces the stack space that the function --
      __btrfs_write_out_cache uses from 194bytes to 144bytes.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5349d6c3
    • F
      Btrfs: remove unused wait queue in struct extent_buffer · 46fefe41
      Filipe Manana 提交于
      The lock_wq wait queue is not used anywhere, therefore just remove it.
      On a x86_64 system, this reduced sizeof(struct extent_buffer) from 320
      bytes down to 296 bytes, which means a 4Kb page can now be used for
      13 extent buffers instead of 12.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      46fefe41
    • C
      Btrfs: fix deadlocks with trylock on tree nodes · ea4ebde0
      Chris Mason 提交于
      The Btrfs tree trylock function is poorly named.  It always takes
      the spinlock and backs off if the blocking lock is held.  This
      can lead to surprising lockups because people expect it to really be a
      trylock.
      
      This commit makes it a pure trylock, both for the spinlock and the
      blocking lock.  It also reworks the nested lock handling slightly to
      avoid taking the read lock while a spinning write lock might be held.
      Signed-off-by: NChris Mason <clm@fb.com>
      ea4ebde0
  5. 18 6月, 2014 2 次提交
    • K
      NFSD: fix bug for readdir of pseudofs · f41c5ad2
      Kinglong Mee 提交于
      Commit 561f0ed4 (nfsd4: allow large readdirs) introduces a bug
      about readdir the root of pseudofs.
      
      Call xdr_truncate_encode() revert encoded name when skipping.
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      f41c5ad2
    • N
      NFSD: Don't hand out delegations for 30 seconds after recalling them. · 6282cd56
      NeilBrown 提交于
      If nfsd needs to recall a delegation for some reason it implies that there is
      contention on the file, so further delegations should not be handed out.
      
      The current code fails to do so, and the result is effectively a
      live-lock under some workloads: a client attempting a conflicting
      operation on a read-delegated file receives NFS4ERR_DELAY and retries
      the operation, but by the time it retries the server may already have
      given out another delegation.
      
      We could simply avoid delegations for (say) 30 seconds after any recall, but
      this is probably too heavy handed.
      
      We could keep a list of inodes (or inode numbers or filehandles) for recalled
      delegations, but that requires memory allocation and searching.
      
      The approach taken here is to use a bloom filter to record the filehandles
      which are currently blocked from delegation, and to accept the cost of a few
      false positives.
      
      We have 2 bloom filters, each of which is valid for 30 seconds.   When a
      delegation is recalled the filehandle is added to one filter and will remain
      disabled for between 30 and 60 seconds.
      
      We keep a count of the number of filehandles that have been added, so when
      that count is zero we can bypass all other tests.
      
      The bloom filters have 256 bits and 3 hash functions.  This should allow a
      couple of dozen blocked  filehandles with minimal false positives.  If many
      more filehandles are all blocked at once, behaviour will degrade towards
      rejecting all delegations for between 30 and 60 seconds, then resetting and
      allowing new delegations.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      6282cd56
  6. 17 6月, 2014 1 次提交
  7. 14 6月, 2014 7 次提交
    • E
      btrfs: fix error handling in create_pending_snapshot · 47a306a7
      Eric Sandeen 提交于
      fcebe456 cut and pasted some code to a later point
      in create_pending_snapshot(), but didn't switch
      to the appropriate error handling for this stage
      of the function.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      47a306a7
    • E
      btrfs: fix use of uninit "ret" in end_extent_writepage() · 3e2426bd
      Eric Sandeen 提交于
      If this condition in end_extent_writepage() is false:
      
      	if (tree->ops && tree->ops->writepage_end_io_hook)
      
      we will then test an uninitialized "ret" at:
      
      	ret = ret < 0 ? ret : -EIO;
      
      The test for ret is for the case where ->writepage_end_io_hook
      failed, and we'd choose that ret as the error; but if
      there is no ->writepage_end_io_hook, nothing sets ret.
      
      Initializing ret to 0 should be sufficient; if
      writepage_end_io_hook wasn't set, (!uptodate) means
      non-zero err was passed in, so we choose -EIO in that case.
      Signed-of-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3e2426bd
    • E
      btrfs: free ulist in qgroup_shared_accounting() error path · d7372780
      Eric Sandeen 提交于
      If tmp = ulist_alloc(GFP_NOFS) fails, we return without
      freeing the previously allocated qgroups = ulist_alloc(GFP_NOFS)
      and cause a memory leak.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d7372780
    • F
      Btrfs: fix qgroups sanity test crash or hang · b050f9f6
      Filipe Manana 提交于
      Often when running the qgroups sanity test, a crash or a hang happened.
      This is because the extent buffer the test uses for the root node doesn't
      have an header level explicitly set, making it have a random level value.
      This is a problem when it's not zero for the btrfs_search_slot() calls
      the test ends up doing, resulting in crashes or hangs such as the following:
      
      [ 6454.127192] Btrfs loaded, debug=on, assert=on, integrity-checker=on
      (...)
      [ 6454.127760] BTRFS: selftest: Running qgroup tests
      [ 6454.127964] BTRFS: selftest: Running test_test_no_shared_qgroup
      [ 6454.127966] BTRFS: selftest: Qgroup basic add
      [ 6480.152005] BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:5383]
      [ 6480.152005] Modules linked in: btrfs(+) xor raid6_pq binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc i2c_piix4 i2c_core pcspkr evbug psmouse serio_raw e1000 [last unloaded: btrfs]
      [ 6480.152005] irq event stamp: 188448
      [ 6480.152005] hardirqs last  enabled at (188447): [<ffffffff8168ef5c>] restore_args+0x0/0x30
      [ 6480.152005] hardirqs last disabled at (188448): [<ffffffff81698e6a>] apic_timer_interrupt+0x6a/0x80
      [ 6480.152005] softirqs last  enabled at (188446): [<ffffffff810516cf>] __do_softirq+0x1cf/0x450
      [ 6480.152005] softirqs last disabled at (188441): [<ffffffff81051c25>] irq_exit+0xb5/0xc0
      [ 6480.152005] CPU: 0 PID: 5383 Comm: modprobe Not tainted 3.15.0-rc8-fdm-btrfs-next-33+ #4
      [ 6480.152005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [ 6480.152005] task: ffff8802146125a0 ti: ffff8800d0d00000 task.ti: ffff8800d0d00000
      [ 6480.152005] RIP: 0010:[<ffffffff81349a63>]  [<ffffffff81349a63>] __write_lock_failed+0x13/0x20
      [ 6480.152005] RSP: 0018:ffff8800d0d038e8  EFLAGS: 00000287
      [ 6480.152005] RAX: 0000000000000000 RBX: ffffffff8168ef5c RCX: 000005deb8525852
      [ 6480.152005] RDX: 0000000000000000 RSI: 0000000000001d45 RDI: ffff8802105000b8
      [ 6480.152005] RBP: ffff8800d0d038e8 R08: fffffe12710f63db R09: ffffffffa03196fb
      [ 6480.152005] R10: ffff8802146125a0 R11: ffff880214612e28 R12: ffff8800d0d03858
      [ 6480.152005] R13: 0000000000000000 R14: ffff8800d0d00000 R15: ffff8802146125a0
      [ 6480.152005] FS:  00007f14ff804700(0000) GS:ffff880215e00000(0000) knlGS:0000000000000000
      [ 6480.152005] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 6480.152005] CR2: 00007fff4df0dac8 CR3: 00000000d1796000 CR4: 00000000000006f0
      [ 6480.152005] Stack:
      [ 6480.152005]  ffff8800d0d03908 ffffffff810ae967 0000000000000001 ffff8802105000b8
      [ 6480.152005]  ffff8800d0d03938 ffffffff8168e57e ffffffffa0319c16 0000000000000007
      [ 6480.152005]  ffff880210500000 ffff880210500100 ffff8800d0d039b8 ffffffffa0319c16
      [ 6480.152005] Call Trace:
      [ 6480.152005]  [<ffffffff810ae967>] do_raw_write_lock+0x47/0xa0
      [ 6480.152005]  [<ffffffff8168e57e>] _raw_write_lock+0x5e/0x80
      [ 6480.152005]  [<ffffffffa0319c16>] ? btrfs_tree_lock+0x116/0x270 [btrfs]
      [ 6480.152005]  [<ffffffffa0319c16>] btrfs_tree_lock+0x116/0x270 [btrfs]
      [ 6480.152005]  [<ffffffffa02b2acb>] btrfs_lock_root_node+0x3b/0x50 [btrfs]
      [ 6480.152005]  [<ffffffffa02b81a6>] btrfs_search_slot+0x916/0xa20 [btrfs]
      [ 6480.152005]  [<ffffffff811a727f>] ? create_object+0x23f/0x300
      [ 6480.152005]  [<ffffffffa02b9958>] btrfs_insert_empty_items+0x78/0xd0 [btrfs]
      [ 6480.152005]  [<ffffffffa036041a>] insert_normal_tree_ref.constprop.4+0xa2/0x19a [btrfs]
      [ 6480.152005]  [<ffffffffa03605c3>] test_no_shared_qgroup+0xb1/0x1ca [btrfs]
      [ 6480.152005]  [<ffffffff8108cad6>] ? local_clock+0x16/0x30
      [ 6480.152005]  [<ffffffffa035ef8e>] btrfs_test_qgroups+0x1ae/0x1d7 [btrfs]
      [ 6480.152005]  [<ffffffffa03a69d2>] ? ftrace_define_fields_btrfs_space_reservation+0xfd/0xfd [btrfs]
      [ 6480.152005]  [<ffffffffa03a6a86>] init_btrfs_fs+0xb4/0x153 [btrfs]
      [ 6480.152005]  [<ffffffff81000352>] do_one_initcall+0x102/0x150
      [ 6480.152005]  [<ffffffff8103d223>] ? set_memory_nx+0x43/0x50
      [ 6480.152005]  [<ffffffff81682668>] ? set_section_ro_nx+0x6d/0x74
      [ 6480.152005]  [<ffffffff810d91cc>] load_module+0x1cdc/0x2630
      (...)
      
      Therefore initialize the extent buffer as an empty leaf (level 0).
      
      Issue easy to reproduce when btrfs is built as a module via:
      
          $ for ((i = 1; i <= 1000000; i++)); do rmmod btrfs; modprobe btrfs; done
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b050f9f6
    • S
      btrfs: prevent RCU warning when dereferencing radix tree slot · f1e3c289
      Sasha Levin 提交于
      Mark the dereference as protected by lock. Not doing so triggers
      an RCU warning since the radix tree assumed that RCU is in use.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f1e3c289
    • W
      Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting · 5fbc7c59
      Wang Shilong 提交于
      Steps to reproduce:
      
       # mkfs.btrfs -f /dev/sd[b-f] -m raid5 -d raid5
       # mkfs.ext4 /dev/sdc --->corrupt one of btrfs device
       # mount /dev/sdb /mnt -o degraded
       # btrfs scrub start -BRd /mnt
      
      This is because readahead would skip missing device, this is not true
      for RAID5/6, because REQ_GET_READ_MIRRORS return 1 for RAID5/6 block
      mapping. If expected data locates in missing device, readahead thread
      would not call __readahead_hook() which makes event @rc->elems=0
      wait forever.
      
      Fix this problem by checking return value of btrfs_map_block(),we
      can only skip missing device safely if there are several mirrors.
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5fbc7c59
    • G
      btrfs: new ioctl TREE_SEARCH_V2 · cc68a8a5
      Gerhard Heift 提交于
      This new ioctl call allows the user to supply a buffer of varying size in which
      a tree search can store its results. This is much more flexible if you want to
      receive items which are larger than the current fixed buffer of 3992 bytes or
      if you want to fetch more items at once. Items larger than this buffer are for
      example some of the type EXTENT_CSUM.
      Signed-off-by: NGerhard Heift <Gerhard@Heift.Name>
      Signed-off-by: NChris Mason <clm@fb.com>
      Acked-by: NDavid Sterba <dsterba@suse.cz>
      cc68a8a5
  8. 13 6月, 2014 6 次提交
  9. 12 6月, 2014 4 次提交