1. 30 6月, 2017 2 次提交
    • Q
      btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges · bc42bda2
      Qu Wenruo 提交于
      [BUG]
      For the following case, btrfs can underflow qgroup reserved space
      at an error path:
      (Page size 4K, function name without "btrfs_" prefix)
      
               Task A                  |             Task B
      ----------------------------------------------------------------------
      Buffered_write [0, 2K)           |
      |- check_data_free_space()       |
      |  |- qgroup_reserve_data()      |
      |     Range aligned to page      |
      |     range [0, 4K)          <<< |
      |     4K bytes reserved      <<< |
      |- copy pages to page cache      |
                                       | Buffered_write [2K, 4K)
                                       | |- check_data_free_space()
                                       | |  |- qgroup_reserved_data()
                                       | |     Range alinged to page
                                       | |     range [0, 4K)
                                       | |     Already reserved by A <<<
                                       | |     0 bytes reserved      <<<
                                       | |- delalloc_reserve_metadata()
                                       | |  And it *FAILED* (Maybe EQUOTA)
                                       | |- free_reserved_data_space()
                                            |- qgroup_free_data()
                                               Range aligned to page range
                                               [0, 4K)
                                               Freeing 4K
      (Special thanks to Chandan for the detailed report and analyse)
      
      [CAUSE]
      Above Task B is freeing reserved data range [0, 4K) which is actually
      reserved by Task A.
      
      And at writeback time, page dirty by Task A will go through writeback
      routine, which will free 4K reserved data space at file extent insert
      time, causing the qgroup underflow.
      
      [FIX]
      For btrfs_qgroup_free_data(), add @reserved parameter to only free
      data ranges reserved by previous btrfs_qgroup_reserve_data().
      So in above case, Task B will try to free 0 byte, so no underflow.
      Reported-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Tested-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bc42bda2
    • Q
      btrfs: qgroup: Introduce extent changeset for qgroup reserve functions · 364ecf36
      Qu Wenruo 提交于
      Introduce a new parameter, struct extent_changeset for
      btrfs_qgroup_reserved_data() and its callers.
      
      Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
      which range it reserved in current reserve, so it can free it in error
      paths.
      
      The reason we need to export it to callers is, at buffered write error
      path, without knowing what exactly which range we reserved in current
      allocation, we can free space which is not reserved by us.
      
      This will lead to qgroup reserved space underflow.
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      364ecf36
  2. 20 6月, 2017 2 次提交
  3. 09 5月, 2017 1 次提交
    • M
      treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Michal Hocko 提交于
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference to kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) are basically never failing
      and invoke OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand those requests might fallback to vmalloc even when the
      memory allocator would succeed after several more reclaim/compaction
      attempts previously.  There is no guarantee something like that happens
      though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      752ade68
  4. 18 4月, 2017 1 次提交
  5. 28 2月, 2017 8 次提交
  6. 23 2月, 2017 1 次提交
    • F
      Btrfs: fix deadlock between dedup on same file and starting writeback · b1517622
      Filipe Manana 提交于
      If we are deduping two ranges of the same file we need to make sure that
      we lock all pages in ascending order, that is, lock first the pages from
      the range with lower offset and then the pages from the other range, as
      otherwise we can deadlock with a concurrent task that is starting delalloc
      (writeback). Example trace:
      
      [74073.052218] INFO: task kworker/u32:10:17997 blocked for more than 120 seconds.
      [74073.053889]       Tainted: G        W       4.9.0-rc7-btrfs-next-36+ #1
      [74073.055071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [74073.056696] kworker/u32:10  D    0 17997      2 0x00000000
      [74073.058606] Workqueue: writeback wb_workfn (flush-btrfs-53176)
      [74073.061370]  ffff880031e79858 ffff8802159d2580 ffff880237004580 ffff880031e79240
      [74073.064784]  ffff88023f4978c0 ffffc9000817b638 ffffffff814c15e1 0000000000000000
      [74073.068386]  ffff88023f4978d8 ffff88023f4978c0 000000000017b620 ffff880031e79240
      [74073.071712] Call Trace:
      [74073.072884]  [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4
      [74073.075395]  [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f
      [74073.077511]  [<ffffffff814c18d2>] schedule+0x8c/0xa0
      [74073.079440]  [<ffffffff814c4b36>] schedule_timeout+0x43/0xff
      [74073.081637]  [<ffffffff8110953e>] ? time_hardirqs_on+0x9/0x14
      [74073.083809]  [<ffffffff81095c67>] ? trace_hardirqs_on_caller+0x16/0x197
      [74073.086314]  [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32
      [74073.100654]  [<ffffffff810be048>] ? ktime_get+0x41/0x52
      [74073.102619]  [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102
      [74073.104771]  [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102
      [74073.106969]  [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39
      [74073.108954]  [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99
      [74073.110981]  [<ffffffff8112b692>] __lock_page+0x6b/0x6d
      [74073.112833]  [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a
      [74073.115010]  [<ffffffffa031178b>] lock_page+0x2f/0x32 [btrfs]
      [74073.116999]  [<ffffffffa0311d9f>] lock_delalloc_pages+0xc7/0x1a0 [btrfs]
      [74073.119243]  [<ffffffffa0313d15>] find_lock_delalloc_range+0xc3/0x1a4 [btrfs]
      [74073.121636]  [<ffffffffa0313e81>] writepage_delalloc.isra.31+0x8b/0x134 [btrfs]
      [74073.124229]  [<ffffffffa0315d69>] __extent_writepage+0x1c1/0x2bf [btrfs]
      [74073.126372]  [<ffffffffa03160f2>] extent_write_cache_pages.isra.30.constprop.49+0x28b/0x36c [btrfs]
      [74073.129371]  [<ffffffffa03165b9>] extent_writepages+0x4b/0x5c [btrfs]
      [74073.131440]  [<ffffffffa02fcb59>] ? insert_reserved_file_extent.constprop.42+0x261/0x261 [btrfs]
      [74073.134303]  [<ffffffff811b4ce4>] ? writeback_sb_inodes+0xe0/0x4a1
      [74073.136298]  [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs]
      [74073.138248]  [<ffffffff81138200>] do_writepages+0x23/0x2c
      [74073.139910]  [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2
      [74073.142003]  [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1
      [74073.136298]  [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs]
      [74073.138248]  [<ffffffff81138200>] do_writepages+0x23/0x2c
      [74073.139910]  [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2
      [74073.142003]  [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1
      [74073.143911]  [<ffffffff811b511b>] __writeback_inodes_wb+0x76/0xae
      [74073.145787]  [<ffffffff811b53ca>] wb_writeback+0x1cc/0x4d7
      [74073.147452]  [<ffffffff811b60cd>] wb_workfn+0x194/0x37d
      [74073.149084]  [<ffffffff811b60cd>] ? wb_workfn+0x194/0x37d
      [74073.150726]  [<ffffffff8106ce77>] ? process_one_work+0x154/0x4e4
      [74073.152694]  [<ffffffff8106cf96>] process_one_work+0x273/0x4e4
      [74073.154452]  [<ffffffff8106d6db>] worker_thread+0x1eb/0x2ca
      [74073.156138]  [<ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6
      [74073.157837]  [<ffffffff81072a81>] kthread+0xd5/0xdd
      [74073.159339]  [<ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a
      [74073.161088]  [<ffffffff814c6257>] ret_from_fork+0x27/0x40
      [74073.162680] INFO: lockdep is turned off.
      [74073.163855] INFO: task do-dedup:30264 blocked for more than 120 seconds.
      [74073.181180]       Tainted: G        W       4.9.0-rc7-btrfs-next-36+ #1
      [74073.181180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [74073.185296] fdm-stress      D    0 30264  29974 0x00000000
      [74073.186810]  ffff880089595118 ffff880211b8eac0 ffff880237030380 ffff880089594b00
      [74073.188998]  ffff88023f2978c0 ffffc900063abb68 ffffffff814c15e1 0000000000000000
      [74073.191070]  ffff88023f2978d8 ffff88023f2978c0 00000000003abb50 ffff880089594b00
      [74073.193286] Call Trace:
      [74073.193990]  [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4
      [74073.195418]  [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f
      [74073.196796]  [<ffffffff814c18d2>] schedule+0x8c/0xa0
      [74073.198163]  [<ffffffff814c4b36>] schedule_timeout+0x43/0xff
      [74073.199621]  [<ffffffff81095df5>] ? trace_hardirqs_on+0xd/0xf
      [74073.201100]  [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32
      [74073.202686]  [<ffffffff810be048>] ? ktime_get+0x41/0x52
      [74073.204051]  [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102
      [74073.205585]  [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102
      [74073.207123]  [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39
      [74073.208238]  [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99
      [74073.208871]  [<ffffffff8112b692>] __lock_page+0x6b/0x6d
      [74073.209430]  [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a
      [74073.210101]  [<ffffffff8112b800>] lock_page+0x2f/0x32
      [74073.210636]  [<ffffffff8112c502>] pagecache_get_page+0x5e/0x153
      [74073.211270]  [<ffffffffa03257eb>] gather_extent_pages+0x4e/0x109 [btrfs]
      [74073.212166]  [<ffffffffa032a04c>] btrfs_dedupe_file_range+0x1e1/0x4dd [btrfs]
      [74073.213257]  [<ffffffff8118d9b5>] vfs_dedupe_file_range+0x1c1/0x221
      [74073.214086]  [<ffffffff8119e0c4>] do_vfs_ioctl+0x442/0x600
      [74073.214767]  [<ffffffff811a7874>] ? rcu_read_unlock+0x5b/0x5d
      [74073.215619]  [<ffffffff811a7953>] ? __fget+0x6b/0x77
      [74073.216338]  [<ffffffff8119e2d9>] SyS_ioctl+0x57/0x79
      [74073.217149]  [<ffffffff814c5fea>] entry_SYSCALL_64_fastpath+0x18/0xad
      [74073.218102]  [<ffffffff81109552>] ? time_hardirqs_off+0x9/0x14
      [74073.218968]  [<ffffffff810938ce>] ? trace_hardirqs_off_caller+0x1f/0xaa
      [74073.219938] INFO: lockdep is turned off.
      
      What happened was the following:
      
            CPU 1                                       CPU 2
      
                                                   btrfs_dedupe_file_range()
                                                     --> using same inode as source
                                                         and target
                                                     --> src range is [768K, 1Mb[
                                                     --> dst range is [0, 256K[
                                                    btrfs_cmp_data_prepare()
                                                     --> calls gather_extent_pages()
                                                         for range [768K, 1Mb[ and
                                                         locks all pages in that range
      
       do_writepages()
        btrfs_writepages()
         extent_writepages()
          extent_write_cache_pages()
           __extent_writepage()
            writepage_delalloc()
             find_lock_delalloc_range()
               --> finds range [0, 1Mb[
               lock_delalloc_pages()
                --> locks all pages in the
                    range [0, 768K[
                --> tries to lock page at
                    offset 768K
                      --> deadlock
      
                                                     --> calls gather_extent_pages()
                                                         to lock pages in the range
                                                         [0, 256K[
                                                          --> deadlock, task at CPU 1
                                                              already locked that
                                                              range and it's trying
                                                              to lock the range we
                                                              locked previously
      
      So fix this by making sure that during a dedup we always lock first the
      pages from the range with lower offset.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b1517622
  7. 17 2月, 2017 5 次提交
  8. 14 2月, 2017 5 次提交
  9. 09 2月, 2017 1 次提交
  10. 10 12月, 2016 1 次提交
    • C
      fs: try to clone files first in vfs_copy_file_range · a76b5b04
      Christoph Hellwig 提交于
      A clone is a perfectly fine implementation of a file copy, so most
      file systems just implement the copy that way.  Instead of duplicating
      this logic move it to the VFS.  Currently btrfs and XFS implement copies
      the same way as clones and there is no behavior change for them, cifs
      only implements clones and grow support for copy_file_range with this
      patch.  NFS implements both, so this will allow copy_file_range to work
      on servers that only implement CLONE and be lot more efficient on servers
      that implements CLONE and COPY.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      a76b5b04
  11. 06 12月, 2016 7 次提交
  12. 30 11月, 2016 4 次提交
  13. 25 10月, 2016 1 次提交
  14. 28 9月, 2016 1 次提交