1. 19 Jul 2014, 3 commits
  2. 16 Jul 2014, 1 commit
  3. 15 Jul 2014, 4 commits
    • xfs: null unused quota inodes when quota is on · 03e01349
      Committed by Dave Chinner
      When quota is on, it is expected that unused quota inodes have a
      value of NULLFSINO. The changes to support a separate project quota
      in 3.12 broke this rule for filesystems without a project quota
      inode enabled, as the code now refuses to write the group quota
      inode if neither group nor project quotas are enabled. This
      regression was introduced by commit d892d586 ("xfs: Start using
      pquotaino from the superblock").
      
      In this case, we should be writing NULLFSINO rather than nothing to
      ensure that we leave the group quota inode in a valid state while
      quotas are enabled.
      
      Failure to do so doesn't cause a current kernel to break - the
      separate project quota inodes introduced translation code to always
      treat a zero inode as NULLFSINO. This was introduced by commit
      01026297 ("xfs: Initialize all quota inodes to be NULLFSINO"), which
      is also in 3.12, but older kernels do not do this and hence taking a
      filesystem back to an older kernel can result in quotas failing
      initialisation at mount time. When that happens, we see this in
      dmesg:
      
      [ 1649.215390] XFS (sdb): Mounting Filesystem
      [ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
      [ 1649.316902] XFS (sdb): Ending clean mount
      
      By ensuring that we write NULLFSINO to quota inodes that aren't
      active, we avoid this problem. We have to be really careful when
      determining if the quota inodes are active or not, because we don't
      want to write a NULLFSINO if the quota inodes are active and we
      simply aren't updating them. (A sketch of this rule follows below.)
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      03e01349
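
      A minimal sketch of the rule above (illustrative names, not the
      actual xfs_qm code): an inactive quota inode slot is written as
      NULLFSINO, while an active slot that merely isn't being updated
      keeps its current on-disk value.

        #include <linux/types.h>

        #define SKETCH_NULLFSINO ((u64)-1)   /* stand-in for XFS's NULLFSINO */

        static void sketch_write_quota_ino(u64 *sb_ino, bool active,
                                           bool updating, u64 new_ino)
        {
                if (!active)
                        *sb_ino = SKETCH_NULLFSINO; /* unused slot stays valid */
                else if (updating)
                        *sb_ino = new_ino;          /* active and being changed */
                /* active but not updated: leave the existing value alone */
        }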
    • xfs: refine the allocation stack switch · cf11da9c
      Committed by Dave Chinner
      The allocation stack switch at xfs_bmapi_allocate() has served its
      purpose, but is no longer a sufficient solution to the stack usage
      problem we have in the XFS allocation path.
      
      Whilst the kernel stack size is now 16k, that is not a valid reason
      for undoing all our "keep stack usage down" modifications. What it
      does allow us to do is have the freedom to refine and perfect the
      modifications knowing that if we get it wrong it won't blow up in
      our faces - we have a safety net now.
      
      This is important because we still have the issue of older kernels
      having smaller stacks and that they are still supported and are
      demonstrating a wide range of different stack overflows.  Red Hat
      has several open bugs for allocation based stack overflows from
      directory modifications and direct IO block allocation and these
      problems still need to be solved. If we can solve them upstream,
      then distros won't need to bake their own unique solutions.
      
      To that end, I've observed that every allocation based stack
      overflow report has had a specific characteristic - it has happened
      during or directly after a bmap btree block split. That event
      requires a new block to be allocated to the tree, and so we
      effectively stack one allocation stack on top of another, and that's
      when we get into trouble.
      
      A further observation is that bmap btree block splits are much rarer
      than writeback allocation - over a range of different workloads I've
      observed the ratio of bmap btree inserts to splits ranges from 100:1
      (xfstests run) to 10000:1 (local VM image server with sparse files
      that range in the hundreds of thousands to millions of extents).
      Either way, bmap btree split events are much, much rarer than
      allocation events.
      
      Finally, we have to move the kswapd state to the allocation workqueue
      work when allocation is done on behalf of kswapd. This is proving to
      cause significant perturbation in performance under memory pressure
      and appears to be generating allocation deadlock warnings under some
      workloads, so avoiding the use of a workqueue for the majority of
      kswapd writeback allocation will minimise the impact of such
      behaviour.
      
      Hence it makes sense to move the stack switch to xfs_btree_split()
      and only do it for bmap btree splits. Stack switches during
      allocation will be much rarer, so there won't be significant
      performance overhead caused by switching stacks. The worst case
      stack from all allocation paths will be split, not just writeback.
      And the majority of memory allocations will be done in the correct
      context (e.g. kswapd) without causing additional latency, and so we
      simplify the memory reclaim interactions between processes,
      workqueues and kswapd.
      
      The worst stack I've been able to generate with this patch in place
      is 5600 bytes deep. It's very revealing because we exit XFS at:
      
      37)     1768      64   kmem_cache_alloc+0x13b/0x170
      
      about 1800 bytes of stack consumed, and the remaining 3800 bytes
      (and 36 functions) is memory reclaim, swap and the IO stack. And
      this occurs in the inode allocation from an open(O_CREAT) syscall,
      not writeback.
      
      The amount of stack being used is much less than I've previously
      been able to generate - fs_mark testing has been able to generate stack
      usage of around 7k without too much trouble; with this patch it's
      only just getting to 5.5k. This is primarily because the metadata
      allocation paths (e.g. directory blocks) are no longer causing
      double splits on the same stack, and hence now stack tracing is
      showing swapping being the worst stack consumer rather than XFS.
      
      Performance of fs_mark inode create workloads is unchanged.
      Performance of fs_mark async fsync workloads is consistently good
      with context switches reduced by around 150,000/s (30%).
      Performance of dbench, streaming IO and postmark is unchanged.
      Allocation deadlock warnings have not been seen on the workloads
      that generated them since adding this patch.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      cf11da9c
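
      A hedged sketch of the stack-switch pattern this commit relocates
      into xfs_btree_split() (illustrative names, not the real XFS
      symbols): run the deep part of the work on a kworker's fresh stack
      and sleep until it completes.

        #include <linux/workqueue.h>
        #include <linux/completion.h>

        struct split_args {
                struct work_struct work;
                struct completion done;
                int result;
        };

        /* runs on a kworker thread, i.e. on a fresh, nearly empty stack */
        static void split_worker(struct work_struct *work)
        {
                struct split_args *args =
                        container_of(work, struct split_args, work);

                args->result = 0;       /* stand-in for the deep btree split */
                complete(&args->done);
        }

        static int split_with_fresh_stack(void)
        {
                struct split_args args;

                INIT_WORK_ONSTACK(&args.work, split_worker);
                init_completion(&args.done);
                queue_work(system_unbound_wq, &args.work);
                wait_for_completion(&args.done); /* caller's stack stays flat */
                destroy_work_on_stack(&args.work);
                return args.result;
        }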
    • Revert "xfs: block allocation work needs to be kswapd aware" · aa182e64
      Committed by Dave Chinner
      This reverts commit 1f6d6482.
      
      This commit resulted in performance regressions in low memory
      situations where kswapd was doing writeback of delayed allocation
      blocks. It resulted in significant parallelism of the kswapd work,
      and with the special kswapd flags it meant that hundreds of active
      allocations could dip into kswapd-specific memory reserves and
      avoid being throttled. This caused a large amount of performance
      variation, as well as random OOM-killer invocations that didn't
      previously exist.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      aa182e64
    • aio: protect reqs_available updates from changes in interrupt handlers · 263782c1
      Committed by Benjamin LaHaise
      As of commit f8567a38 it is now possible to have put_reqs_available()
      called from irq context.  While put_reqs_available() is per cpu, it
      did not protect itself from interrupts on the same CPU.  This led
      to aio_complete() corrupting the available io requests count when
      run under heavy O_DIRECT workloads, as reported by Robert Elliott.
      Fix this by disabling irqs around the per-cpu batch updates of
      reqs_available (see the sketch below).
      
      Many thanks to Robert and folks for testing and tracking this down.
      Reported-by: Robert Elliott <Elliott@hp.com>
      Tested-by: Robert Elliott <Elliott@hp.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org>
      Cc: stable@vger.kernel.org
      263782c1
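
      A hedged sketch of the fix pattern (not the literal fs/aio.c diff):
      a per-cpu counter that interrupt handlers can also modify needs
      local interrupts disabled around its read-modify-write; per-cpu
      access alone only protects against other CPUs and preemption.

        #include <linux/percpu.h>
        #include <linux/irqflags.h>

        struct reqs_cpu {
                unsigned int reqs_available;
        };
        static DEFINE_PER_CPU(struct reqs_cpu, reqs_pcpu);

        static void put_reqs_available_sketch(unsigned int nr)
        {
                struct reqs_cpu *kcpu;
                unsigned long flags;

                local_irq_save(flags);      /* block irq handlers on this CPU */
                kcpu = this_cpu_ptr(&reqs_pcpu);
                kcpu->reqs_available += nr; /* plain RMW, now irq-safe */
                local_irq_restore(flags);
        }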
  4. 14 Jul 2014, 2 commits
  5. 13 Jul 2014, 2 commits
    • ext4: fix potential null pointer dereference in ext4_free_inode · bf40c926
      Committed by Namjae Jeon
      Fix a potential null pointer dereference caused by commit e43bb4e6
      ("ext4: decrement free clusters/inodes counters when block group declared bad").
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Lukas Czerner <lczerner@redhat.com>
      bf40c926
    • ext4: fix a potential deadlock in __ext4_es_shrink() · 3f1f9b85
      Committed by Theodore Ts'o
      This fixes the following lockdep complaint:
      
      [ INFO: possible circular locking dependency detected ]
      3.16.0-rc2-mm1+ #7 Tainted: G           O  
      -------------------------------------------------------
      kworker/u24:0/4356 is trying to acquire lock:
       (&(&sbi->s_es_lru_lock)->rlock){+.+.-.}, at: [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0
      
      but task is already holding lock:
       (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180
      
      which lock already depends on the new lock.
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&ei->i_es_lock);
                                     lock(&(&sbi->s_es_lru_lock)->rlock);
                                     lock(&ei->i_es_lock);
        lock(&(&sbi->s_es_lru_lock)->rlock);
      
       *** DEADLOCK ***
      
      6 locks held by kworker/u24:0/4356:
       #0:  ("writeback"){.+.+.+}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560
       #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560
       #2:  (&type->s_umount_key#22){++++++}, at: [<ffffffff811a9c74>] grab_super_passive+0x44/0x90
       #3:  (jbd2_handle){+.+...}, at: [<ffffffff812979f9>] start_this_handle+0x189/0x5f0
       #4:  (&ei->i_data_sem){++++..}, at: [<ffffffff81247062>] ext4_map_blocks+0x132/0x550
       #5:  (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180
      
      stack backtrace:
      CPU: 0 PID: 4356 Comm: kworker/u24:0 Tainted: G           O   3.16.0-rc2-mm1+ #7
      Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      Workqueue: writeback bdi_writeback_workfn (flush-253:0)
       ffffffff8213dce0 ffff880014b07538 ffffffff815df0bb 0000000000000007
       ffffffff8213e040 ffff880014b07588 ffffffff815db3dd ffff880014b07568
       ffff880014b07610 ffff88003b868930 ffff88003b868908 ffff88003b868930
      Call Trace:
       [<ffffffff815df0bb>] dump_stack+0x4e/0x68
       [<ffffffff815db3dd>] print_circular_bug+0x1fb/0x20c
       [<ffffffff810a7a3e>] __lock_acquire+0x163e/0x1d00
       [<ffffffff815e89dc>] ? retint_restore_args+0xe/0xe
       [<ffffffff815ddc7b>] ? __slab_alloc+0x4a8/0x4ce
       [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
       [<ffffffff810a8707>] lock_acquire+0x87/0x120
       [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
       [<ffffffff8128592d>] ? ext4_es_free_extent+0x5d/0x70
       [<ffffffff815e6f09>] _raw_spin_lock+0x39/0x50
       [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
       [<ffffffff8119760b>] ? kmem_cache_alloc+0x18b/0x1a0
       [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0
       [<ffffffff812869b8>] ext4_es_insert_extent+0xc8/0x180
       [<ffffffff812470f4>] ext4_map_blocks+0x1c4/0x550
       [<ffffffff8124c4c4>] ext4_writepages+0x6d4/0xd00
      	...
      Reported-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      Cc: Zheng Liu <gnehzuil.liu@gmail.com>
      3f1f9b85
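
      Not the actual ext4 fix, just an illustration of the A->B versus
      B->A cycle lockdep is reporting, and of one standard way to break
      such a cycle (trylock the inner lock and back off on failure):

        #include <linux/spinlock.h>

        static DEFINE_SPINLOCK(a_lock); /* stands in for ei->i_es_lock */
        static DEFINE_SPINLOCK(b_lock); /* stands in for sbi->s_es_lru_lock */

        /* called with a_lock held; blocking on b_lock here would
         * recreate the cycle against a path that takes b then a */
        static bool shrink_under_a_sketch(void)
        {
                if (!spin_trylock(&b_lock))
                        return false;   /* back off instead of waiting */
                /* ... reclaim extent-status entries ... */
                spin_unlock(&b_lock);
                return true;
        }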
  6. 12 Jul 2014, 1 commit
  7. 10 Jul 2014, 1 commit
  8. 09 Jul 2014, 7 commits
  9. 08 Jul 2014, 1 commit
  10. 07 Jul 2014, 5 commits
    • fuse: avoid scheduling while atomic · c55a01d3
      Committed by Miklos Szeredi
      As reported by Richard Sharpe, an attempt to use fuse_notify_inval_entry()
      triggers complaints about scheduling while atomic:
      
        BUG: scheduling while atomic: fuse.hf/13976/0x10000001
      
      This happens because fuse_notify_inval_entry() attempts to allocate memory
      with GFP_KERNEL, holding "struct fuse_copy_state" mapped by kmap_atomic().
      
      Introduced by commit 58bda1da ("fuse/dev: use atomic maps").
      
      Fix by moving the map/unmap to just cover the actual memcpy
      operation (sketched below).
      
      Original patch from Maxim Patlasov <mpatlasov@parallels.com>
      Reported-by: Richard Sharpe <realrichardsharpe@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: <stable@vger.kernel.org> # v3.15+
      c55a01d3
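
      A minimal sketch of that change (illustrative, not the fuse_copy_*
      code): the atomic mapping is narrowed so it covers only the memcpy,
      and any GFP_KERNEL allocation happens outside it.

        #include <linux/highmem.h>
        #include <linux/string.h>

        static void copy_into_page_sketch(struct page *page,
                                          unsigned int offset,
                                          const void *src, size_t len)
        {
                void *mapaddr = kmap_atomic(page);

                memcpy(mapaddr + offset, src, len); /* only the copy is atomic */
                kunmap_atomic(mapaddr);
                /* sleeping allocations belong before or after this window */
        }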
    • fuse: handle large user and group ID · 233a01fa
      Committed by Miklos Szeredi
      If the number in "user_id=N" or "group_id=N" mount options was larger than
      INT_MAX then fuse returned EINVAL.
      
      Fix this to handle all valid uid/gid values.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
      233a01fa
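
      A hedged sketch of handling the full id range (assumed helper name;
      the real mount-option parser differs): parse the number as unsigned
      so values above INT_MAX are accepted, then map it into the user
      namespace.

        #include <linux/kernel.h>
        #include <linux/uidgid.h>
        #include <linux/cred.h>

        static int parse_user_id_sketch(const char *val, kuid_t *uid)
        {
                unsigned int n;

                if (kstrtouint(val, 0, &n))     /* unsigned: > INT_MAX is fine */
                        return -EINVAL;
                *uid = make_kuid(current_user_ns(), n);
                return uid_valid(*uid) ? 0 : -EINVAL;
        }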
    • fuse: inode: drop cast · 7b3d8bf7
      Committed by Himangi Saraogi
      This patch removes the cast on data of type void * as it is not needed.
      The following Coccinelle semantic patch was used for making the change:
      
      @r@
      expression x;
      void* e;
      type T;
      identifier f;
      @@
      
      (
        *((T *)e)
      |
        ((T *)x)[...]
      |
        ((T *)x)->f
      |
      - (T *)
        e
      )
      Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
      Acked-by: Julia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      7b3d8bf7
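
      The class of change the semantic patch makes, on a made-up example:
      in C, void * converts implicitly to any object pointer type, so the
      explicit cast carries no information.

        struct fuse_conn;       /* hypothetical type, for illustration only */

        static struct fuse_conn *get_conn_sketch(void *data)
        {
                /* before: return (struct fuse_conn *)data;  -- redundant */
                return data;    /* after: implicit conversion from void * */
        }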
    • fuse: ignore entry-timeout on LOOKUP_REVAL · 154210cc
      Committed by Anand Avati
      The following test case demonstrates the bug:
      
        sh# mount -t glusterfs localhost:meta-test /mnt/one
      
        sh# mount -t glusterfs localhost:meta-test /mnt/two
      
        sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; echo stuff > /mnt/one/file
        bash: /mnt/one/file: Stale file handle
      
        sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; sleep 1; echo stuff > /mnt/one/file
      
      On the second open() on /mnt/one, FUSE would have used the old
      nodeid (file handle) trying to re-open it. Gluster is returning
      -ESTALE. The ESTALE propagates back to namei.c:filename_lookup()
      where lookup is re-attempted with LOOKUP_REVAL. The right
      behavior now would be for FUSE to ignore the entry-timeout and
      do the up-call revalidation. Instead FUSE is ignoring
      LOOKUP_REVAL, succeeding the revalidation (because entry-timeout
      has not passed), and open() is again retried on the old file
      handle and finally the ESTALE is going back to the application.
      
      Fix: if revalidation is happening with LOOKUP_REVAL, then ignore
      entry-timeout and always do the up-call (sketched below).
      Signed-off-by: Anand Avati <avati@redhat.com>
      Reviewed-by: Niels de Vos <ndevos@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
      154210cc
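
      A hedged sketch of the fix's shape (illustrative, not the actual
      fuse_dentry_revalidate()): with LOOKUP_REVAL set, the timeout
      shortcut is skipped and the up-call always happens.

        #include <linux/namei.h>
        #include <linux/jiffies.h>

        /* returns 1 if the dentry may be trusted without asking the server */
        static int revalidate_sketch(unsigned int flags, u64 entry_timeout)
        {
                if (!(flags & LOOKUP_REVAL) &&
                    time_before64(get_jiffies_64(), entry_timeout))
                        return 1; /* timeout shortcut only without LOOKUP_REVAL */

                /* ... do the up-call revalidation against userspace ... */
                return 0;
        }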
    • fuse: timeout comparison fix · 126b9d43
      Committed by Miklos Szeredi
      As suggested by checkpatch.pl, use time_before64() instead of direct
      comparison of jiffies64 values.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: <stable@vger.kernel.org>
      126b9d43
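
      The point in one line, as a minimal sketch: 64-bit jiffies deadlines
      are compared through the wrap-safe helper rather than a bare '<'.

        #include <linux/jiffies.h>

        static bool attr_expired_sketch(u64 timeout) /* absolute, jiffies64 */
        {
                return !time_before64(get_jiffies_64(), timeout);
        }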
  11. 06 Jul 2014, 4 commits
  12. 04 Jul 2014, 3 commits
    • fs/seq_file: fallback to vmalloc allocation · 058504ed
      Committed by Heiko Carstens
      There are a couple of seq_files which use the single_open() interface.
      This interface requires that the whole output must fit into a single
      buffer.
      
      E.g.  for /proc/stat allocation failures have been observed because an
      order-4 memory allocation failed due to memory fragmentation.  In such
      situations reading /proc/stat is not possible anymore.
      
      Therefore change the seq_file code to fallback to vmalloc allocations
      which will usually result in a couple of order-0 allocations and hence
      also work if memory is fragmented.
      
      For reference a call trace where reading from /proc/stat failed:
      
        sadc: page allocation failure: order:4, mode:0x1040d0
        CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
        [...]
        Call Trace:
          show_stack+0x6c/0xe8
          warn_alloc_failed+0xd6/0x138
          __alloc_pages_nodemask+0x9da/0xb68
          __get_free_pages+0x2e/0x58
          kmalloc_order_trace+0x44/0xc0
          stat_open+0x5a/0xd8
          proc_reg_open+0x8a/0x140
          do_dentry_open+0x1bc/0x2c8
          finish_open+0x46/0x60
          do_last+0x382/0x10d0
          path_openat+0xc8/0x4f8
          do_filp_open+0x46/0xa8
          do_sys_open+0x114/0x1f0
          sysc_tracego+0x14/0x1a
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Tested-by: David Rientjes <rientjes@google.com>
      Cc: Ian Kent <raven@themaw.net>
      Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
      Cc: Andrea Righi <andrea@betterlinux.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Stefan Bader <stefan.bader@canonical.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      058504ed
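
      The fallback pattern as a minimal sketch (essentially what the
      patch's buffer allocator does; later kernels package the same idea
      as kvmalloc()):

        #include <linux/slab.h>
        #include <linux/vmalloc.h>

        static void *seq_buf_alloc_sketch(size_t size)
        {
                void *buf;

                /* a big buffer may need order-4 pages; don't warn on failure */
                buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
                if (!buf)
                        buf = vmalloc(size); /* order-0 pages work when fragmented */
                return buf;                  /* free with kvfree() */
        }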
    • /proc/stat: convert to single_open_size() · f74373a5
      Committed by Heiko Carstens
      These two patches are supposed to "fix" failed order-4 memory
      allocations which have been observed when reading /proc/stat.  The
      problem has been observed on s390 as well as on x86.
      
      To address the problem change the seq_file memory allocations to
      fallback to use vmalloc, so that allocations also work if memory is
      fragmented.
      
      This approach seems to be simpler and less intrusive than changing
      /proc/stat to use an iterator.  Also it "fixes" other users as well,
      which use seq_file's single_open() interface.
      
      This patch (of 2):
      
      Use seq_file's single_open_size() to preallocate a buffer that is large
      enough to hold the whole output, instead of open coding it.  Also
      calculate the requested size using the number of online cpus instead of
      possible cpus, since the size of the output only depends on the number
      of online cpus.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Ian Kent <raven@themaw.net>
      Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
      Cc: Andrea Righi <andrea@betterlinux.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Stefan Bader <stefan.bader@canonical.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f74373a5
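
      A hedged sketch of the converted open routine (the size formula is
      illustrative; the real one also accounts for interrupt counters):

        #include <linux/seq_file.h>
        #include <linux/cpumask.h>

        static int show_stat(struct seq_file *m, void *v); /* existing show fn */

        static int stat_open_sketch(struct inode *inode, struct file *file)
        {
                /* scale the preallocated buffer by online, not possible, cpus */
                size_t size = 1024 + 128 * num_online_cpus();

                return single_open_size(file, show_stat, NULL, size);
        }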
    • autofs4: fix false positive compile error · 571ff473
      Committed by Ian Kent
      On strict build environments we can see:
      
        fs/autofs4/inode.c: In function 'autofs4_fill_super':
        fs/autofs4/inode.c:312: error: 'pgrp' may be used uninitialized in this function
        make[2]: *** [fs/autofs4/inode.o] Error 1
        make[1]: *** [fs/autofs4] Error 2
        make: *** [fs] Error 2
        make: *** Waiting for unfinished jobs....
      
      This is due to pgrp_set being used to indicate that pgrp has been
      set, rather than initializing pgrp itself.
      Signed-off-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      571ff473
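
      A minimal sketch of the warning's shape and the fix (illustrative,
      not the exact autofs4 diff): the compiler cannot prove that pgrp is
      assigned whenever pgrp_set is true, so pgrp gets an initializer of
      its own.

        #include <linux/types.h>

        static int fill_super_sketch(bool option_present, int option_value)
        {
                int pgrp = -1;          /* initialized: silences the warning */
                bool pgrp_set = false;

                if (option_present) {
                        pgrp = option_value;
                        pgrp_set = true;
                }
                if (!pgrp_set)
                        pgrp = 0;       /* stand-in for the runtime default */
                return pgrp;
        }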
  13. 03 Jul 2014, 6 commits
    • Btrfs: fix crash when starting transaction · abdd2e80
      Committed by Filipe Manana
      Often when starting a transaction we commit the currently running
      transaction, which can end up writing block group caches when the
      current process has its journal_info set to NULL (and not to a
      transaction). This makes our assertion at btrfs_check_data_free_space()
      (current->journal_info != NULL) fail, resulting in a crash/hang.
      Therefore fix it by setting journal_info (sketched below).
      
      Two different traces of this issue follow below.
      
      1)
      
          [51502.241936] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670
          [51502.242213] ------------[ cut here ]------------
          [51502.242493] kernel BUG at fs/btrfs/ctree.h:3964!
          [51502.242669] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
          (...)
          [51502.244010] Call Trace:
          [51502.244010]  [<ffffffffa02bc025>] btrfs_check_data_free_space+0x395/0x3a0 [btrfs]
          [51502.244010]  [<ffffffffa02c3bdc>] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs]
          [51502.244010]  [<ffffffffa0357a6a>] commit_cowonly_roots+0x164/0x226 [btrfs]
          [51502.244010]  [<ffffffffa02d53cd>] btrfs_commit_transaction+0x4ed/0xab0 [btrfs]
          [51502.244010]  [<ffffffff8168ec7b>] ? _raw_spin_unlock+0x2b/0x40
          [51502.244010]  [<ffffffffa02d6259>] start_transaction+0x459/0x620 [btrfs]
          [51502.244010]  [<ffffffffa02d67ab>] btrfs_start_transaction+0x1b/0x20 [btrfs]
          [51502.244010]  [<ffffffffa02d73e1>] __unlink_start_trans+0x31/0xe0 [btrfs]
          [51502.244010]  [<ffffffffa02dea67>] btrfs_unlink+0x37/0xc0 [btrfs]
          [51502.244010]  [<ffffffff811bb054>] ? do_unlinkat+0x114/0x2a0
          [51502.244010]  [<ffffffff811baebc>] vfs_unlink+0xcc/0x150
          [51502.244010]  [<ffffffff811bb1a0>] do_unlinkat+0x260/0x2a0
          [51502.244010]  [<ffffffff811a9ef4>] ? filp_close+0x64/0x90
          [51502.244010]  [<ffffffff810aaea6>] ? trace_hardirqs_on_caller+0x16/0x1e0
          [51502.244010]  [<ffffffff81349cab>] ? trace_hardirqs_on_thunk+0x3a/0x3f
          [51502.244010]  [<ffffffff811be9eb>] SyS_unlinkat+0x1b/0x40
          [51502.244010]  [<ffffffff81698452>] system_call_fastpath+0x16/0x1b
          [51502.244010] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 71 13 36 a0 48 89 fe 31 c0 48 c7 c7 b8 43 36 a0 48 89 e5 e8 5d b0 32 e1 <0f> 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5
          [51502.244010] RIP  [<ffffffffa03575da>] assfail.constprop.88+0x1e/0x20 [btrfs]
      
      2)
      
          [25405.097230] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670
          [25405.097488] ------------[ cut here ]------------
          [25405.097767] kernel BUG at fs/btrfs/ctree.h:3964!
          [25405.097940] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
          (...)
          [25405.100008] Call Trace:
          [25405.100008]  [<ffffffffa02bc025>] btrfs_check_data_free_space+0x395/0x3a0 [btrfs]
          [25405.100008]  [<ffffffffa02c3bdc>] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs]
          [25405.100008]  [<ffffffffa035755a>] commit_cowonly_roots+0x164/0x226 [btrfs]
          [25405.100008]  [<ffffffffa02d53cd>] btrfs_commit_transaction+0x4ed/0xab0 [btrfs]
          [25405.100008]  [<ffffffff8109c170>] ? bit_waitqueue+0xc0/0xc0
          [25405.100008]  [<ffffffffa02d6259>] start_transaction+0x459/0x620 [btrfs]
          [25405.100008]  [<ffffffffa02d67ab>] btrfs_start_transaction+0x1b/0x20 [btrfs]
          [25405.100008]  [<ffffffffa02e3407>] btrfs_create+0x47/0x210 [btrfs]
          [25405.100008]  [<ffffffffa02d74cc>] ? btrfs_permission+0x3c/0x80 [btrfs]
          [25405.100008]  [<ffffffff811bc63b>] vfs_create+0x9b/0x130
          [25405.100008]  [<ffffffff811bcf19>] do_last+0x849/0xe20
          [25405.100008]  [<ffffffff811b9409>] ? link_path_walk+0x79/0x820
          [25405.100008]  [<ffffffff811bd5b5>] path_openat+0xc5/0x690
          [25405.100008]  [<ffffffff810ab07d>] ? trace_hardirqs_on+0xd/0x10
          [25405.100008]  [<ffffffff811cdcd2>] ? __alloc_fd+0x32/0x1d0
          [25405.100008]  [<ffffffff811be2a3>] do_filp_open+0x43/0xa0
          [25405.100008]  [<ffffffff811cddf1>] ? __alloc_fd+0x151/0x1d0
          [25405.100008]  [<ffffffff811abcfc>] do_sys_open+0x13c/0x230
          [25405.100008]  [<ffffffff810aaea6>] ? trace_hardirqs_on_caller+0x16/0x1e0
          [25405.100008]  [<ffffffff811abe12>] SyS_open+0x22/0x30
          [25405.100008]  [<ffffffff81698452>] system_call_fastpath+0x16/0x1b
          [25405.100008] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 51 13 36 a0 48 89 fe 31 c0 48 c7 c7 d0 43 36 a0 48 89 e5 e8 6d b5 32 e1 <0f> 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5
          [25405.100008] RIP  [<ffffffffa03570ca>] assfail.constprop.88+0x1e/0x20 [btrfs]
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      abdd2e80
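
      A hedged sketch of the fix's shape (the wrapper name is made up;
      btrfs_write_dirty_block_groups() is the real callee seen in the
      traces): point current->journal_info at the running transaction for
      the duration of the flush, then restore it.

        #include <linux/sched.h>

        struct btrfs_trans_handle;
        struct btrfs_root;
        int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
                                           struct btrfs_root *root);

        static int flush_block_groups_sketch(struct btrfs_trans_handle *trans,
                                             struct btrfs_root *root)
        {
                void *saved = current->journal_info;
                int ret;

                current->journal_info = trans; /* the assertion now holds */
                ret = btrfs_write_dirty_block_groups(trans, root);
                current->journal_info = saved;
                return ret;
        }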
    • Btrfs: fix btrfs_print_leaf for skinny metadata · be2c765d
      Committed by Josef Bacik
      We wouldn't actually print the extent information if we had a skinny
      metadata item; this fixes that.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      be2c765d
    • Btrfs: fix race of using total_bytes_pinned · d288db5d
      Committed by Liu Bo
      The percpu counter @total_bytes_pinned was introduced to skip
      unnecessary 'commit transaction' operations; it accounts for the
      space we may free but that is stuck in delayed refs.
      
      And we zero out @space_info->total_bytes_pinned every transaction
      period so we have a better idea of how much space we'll actually
      free up by committing this transaction.  However, we do the 'zero
      out' part a little earlier, before we actually unpin space, so we
      end up returning ENOSPC when we actually have free space that's
      just been unpinned by committing the transaction.
      
      xfstests generic/074 complained about this.
      
      This fixes it by accounting the percpu pinned number at 'unpin'
      time; since that is protected by space_info->lock, the race is
      gone now.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d288db5d
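
      A hedged sketch of the move (minimal local struct, not the real
      btrfs space_info): the decrement happens at unpin time, under
      space_info->lock, so the counter is never zeroed before the space
      actually becomes free.

        #include <linux/types.h>
        #include <linux/spinlock.h>
        #include <linux/percpu_counter.h>

        struct space_info_sketch {
                spinlock_t lock;
                struct percpu_counter total_bytes_pinned;
        };

        static void unpin_extent_sketch(struct space_info_sketch *sinfo, u64 len)
        {
                spin_lock(&sinfo->lock);
                /* account at the moment the space really becomes free */
                percpu_counter_add(&sinfo->total_bytes_pinned, -(s64)len);
                spin_unlock(&sinfo->lock);
        }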
    • btrfs: use E2BIG instead of EIO if compression does not help · 130d5b41
      Committed by David Sterba
      Return codes were updated in commit 60e1975a ("btrfs: return errno
      instead of -1 from compression").  The lzo wrapper returns E2BIG in
      this case; do the same for zlib.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      130d5b41
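
      A one-look sketch of the convention (illustrative function; the
      real change is inside the zlib wrapper): incompressible input
      reports -E2BIG, leaving -EIO for genuine I/O errors.

        #include <linux/errno.h>
        #include <linux/types.h>

        static int compress_result_sketch(size_t bytes_in, size_t bytes_out)
        {
                if (bytes_out >= bytes_in)
                        return -E2BIG;  /* "didn't shrink" is not an I/O error */
                return 0;
        }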
    • btrfs: remove stale comment from btrfs_flush_all_pending_stuffs · 0a4eaea8
      Committed by David Sterba
      Commit fcebe456 ("Btrfs: rework qgroup accounting") removed the
      qgroup accounting after delayed refs.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      0a4eaea8
    • Btrfs: fix use-after-free when cloning a trailing file hole · 14f59796
      Committed by Filipe Manana
      The transaction handle was being used after being freed.
      
      Cc: Chris Mason <clm@fb.com>
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      14f59796