1. 02 June 2015, 6 commits
    • writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback · f0054bb1
      Committed by Tejun Heo
      Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->wb_lock and ->worklist into wb.
      
      * The lock protects bdi->worklist and bdi->wb.dwork scheduling.  While
        moving, rename it to wb->work_lock as wb->wb_lock is confusing.
        Also, move wb->dwork downwards so that it's colocated with the new
        ->work_lock and ->work_list fields.
      
      * bdi_writeback_workfn()		-> wb_workfn()
        bdi_wakeup_thread_delayed(bdi)	-> wb_wakeup_delayed(wb)
        bdi_wakeup_thread(bdi)		-> wb_wakeup(wb)
        bdi_queue_work(bdi, ...)		-> wb_queue_work(wb, ...)
        __bdi_start_writeback(bdi, ...)	-> __wb_start_writeback(wb, ...)
        get_next_work_item(bdi)		-> get_next_work_item(wb)
      
      * bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
        The function contained parts which belong to the containing bdi
        rather than the wb itself - testing cap_writeback_dirty and
        bdi_remove_from_list() invocation.  Those are moved to
        bdi_unregister().
      
      * bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
        Initializations of the moved bdi->wb_lock and ->work_list are
        relocated from bdi_init() to wb_init().
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->state are mechanically replaced with bdi->wb.state
        introducing no behavior changes.
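      
      After the move, queueing work on a wb looks roughly like the sketch
      below. This is a simplified illustration of the renamed helper, not
      the exact upstream body (the registration check and tracepoint are
      omitted):
      
          static void wb_queue_work(struct bdi_writeback *wb,
                                    struct wb_writeback_work *work)
          {
                  spin_lock_bh(&wb->work_lock);   /* was bdi->wb_lock */
                  list_add_tail(&work->list, &wb->work_list);
                  mod_delayed_work(bdi_wq, &wb->dwork, 0);
                  spin_unlock_bh(&wb->work_lock);
          }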
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f0054bb1
    • writeback: move bandwidth related fields from backing_dev_info into bdi_writeback · a88a341a
      Committed by Tejun Heo
      Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bandwidth related fields from backing_dev_info into
      bdi_writeback.
      
      * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
        write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
        balanced_dirty_ratelimit, completions and dirty_exceeded.
      
      * writeback_chunk_size() and over_bground_thresh() now take @wb
        instead of @bdi.
      
      * bdi_writeout_fraction(bdi, ...)	-> wb_writeout_fraction(wb, ...)
        bdi_dirty_limit(bdi, ...)		-> wb_dirty_limit(wb, ...)
        bdi_position_ratio(bdi, ...)		-> wb_position_ratio(wb, ...)
        bdi_update_write_bandwidth(bdi, ...)	-> wb_update_write_bandwidth(wb, ...)
        [__]bdi_update_bandwidth(bdi, ...)	-> [__]wb_update_bandwidth(wb, ...)
        bdi_{max|min}_pause(bdi, ...)		-> wb_{max|min}_pause(wb, ...)
        bdi_dirty_limits(bdi, ...)		-> wb_dirty_limits(wb, ...)
      
      * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
        respectively.  Note that explicit zeroing is dropped in the process
        as wb's are cleared in entirety anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
      
      v2: Typo in description fixed as suggested by Jan.
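      
      For orientation, the relocated fields end up looking roughly like this
      inside struct bdi_writeback (types as they were in backing_dev_info
      before the move; comments are illustrative):
      
          struct bdi_writeback {
                  ...
                  unsigned long bw_time_stamp;     /* last time write bw was updated */
                  unsigned long dirtied_stamp;
                  unsigned long written_stamp;     /* pages written at bw_time_stamp */
                  unsigned long write_bandwidth;   /* estimated write bandwidth */
                  unsigned long avg_write_bandwidth;
                  unsigned long dirty_ratelimit;
                  unsigned long balanced_dirty_ratelimit;
                  struct fprop_local_percpu completions;
                  int dirty_exceeded;
                  ...
          };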
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a88a341a
    • writeback: move backing_dev_info->bdi_stat[] into bdi_writeback · 93f78d88
      Committed by Tejun Heo
      Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->bdi_stat[] into wb.
      
      * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
        enums is changed from BDI_ to WB_.
      
      * BDI_STAT_BATCH() -> WB_STAT_BATCH()
      
      * [__]{add|inc|dec|sum}_bdi_stat(bdi, ...) -> [__]{add|inc|dec|sum}_wb_stat(wb, ...)
      
      * bdi_stat[_error]() -> wb_stat[_error]()
      
      * bdi_writeout_inc() -> wb_writeout_inc()
      
      * stat init is moved to bdi_wb_init() and bdi_wb_exit() is added to
        free the stats.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
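      
      A rough sketch of the resulting naming (shown for illustration; the
      backing-dev header carries the authoritative enum definition):
      
          enum wb_stat_item {             /* was enum bdi_stat_item, BDI_* */
                  WB_RECLAIMABLE,
                  WB_WRITEBACK,
                  WB_DIRTIED,
                  WB_WRITTEN,
                  NR_WB_STAT_ITEMS
          };
      
          /* e.g. a caller that used inc_bdi_stat(bdi, BDI_WRITEBACK) becomes: */
          inc_wb_stat(&bdi->wb, WB_WRITEBACK);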
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      93f78d88
    • writeback: move backing_dev_info->state into bdi_writeback · 4452226e
      Committed by Tejun Heo
      Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->state into wb.
      
      * enum bdi_state is renamed to wb_state and the prefix of all enums is
        changed from BDI_ to WB_.
      
      * Explicit zeroing of bdi->state is removed without adding zeroing of
        wb->state as the whole data structure is zeroed on init anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->state are mechanically replaced with bdi->wb.state
        introducing no behavior changes.
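      
      In other words (illustrative before/after, using one of the renamed
      state bits as an example):
      
          /* before */
          if (test_bit(BDI_writeback_running, &bdi->state))
                  ...
          /* after */
          if (test_bit(WB_writeback_running, &bdi->wb.state))
                  ...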
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: drbd-dev@lists.linbit.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4452226e
    • memcg: add per cgroup dirty page accounting · c4843a75
      Committed by Greg Thelen
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (!TestSetPageDirty(page)) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
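      
      Put together, the pattern in the affected paths looks roughly like the
      sketch below (simplified from delete_from_page_cache(); error handling
      omitted):
      
          struct mem_cgroup *memcg;
          unsigned long flags;
      
          memcg = mem_cgroup_begin_page_stat(page);        /* may disable IRQs */
          spin_lock_irqsave(&mapping->tree_lock, flags);
          __delete_from_page_cache(page, NULL, memcg);     /* new memcg argument */
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          mem_cgroup_end_page_stat(memcg);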
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c4843a75
    • page_writeback: revive cancel_dirty_page() in a restricted form · 11f81bec
      Committed by Tejun Heo
      cancel_dirty_page() had some issues and b9ea2515 ("page_writeback:
      clean up mess around cancel_dirty_page()") replaced it with
      account_page_cleaned() which makes the caller responsible for clearing
      the dirty bit; unfortunately, the planned changes for cgroup writeback
      support require synchronization between dirty bit manipulation and
      stat updates.  While we can open-code such synchronization in each
      account_page_cleaned() callsite, that would be unnecessarily awkward
      and verbose.
      
      This patch revives cancel_dirty_page() but in a more restricted form.
      All it does is TestClearPageDirty() followed by account_page_cleaned()
      invocation if the page was dirty.  This helper covers all
      account_page_cleaned() usages except for __delete_from_page_cache()
      which is a special case anyway and left alone.  As this leaves no
      module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
      from it.
      
      This patch just revives cancel_dirty_page() as a trivial wrapper to
      replace equivalent usages and doesn't introduce any functional
      changes.
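      
      A minimal sketch of the revived helper as described above (not the
      exact upstream definition, which also has to cope with pages that have
      no mapping):
      
          void cancel_dirty_page(struct page *page)
          {
                  if (TestClearPageDirty(page))
                          account_page_cleaned(page, page_mapping(page));
          }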
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      11f81bec
  2. 27 May 2015, 1 commit
  3. 22 May 2015, 1 commit
    • block: remove management of bi_remaining when restoring original bi_end_io · 326e1dbb
      Committed by Mike Snitzer
      Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
      non-chains") regressed all existing callers that followed this pattern:
       1) saving a bio's original bi_end_io
       2) wiring up an intermediate bi_end_io
       3) restoring the original bi_end_io from intermediate bi_end_io
       4) calling bio_endio() to execute the restored original bi_end_io
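      
      Sketched with hypothetical names (and the bi_end_io signature of this
      kernel), the pattern above is roughly:
      
          static bio_end_io_t *saved_end_io;              /* 1) saved original */
      
          static void intermediate_end_io(struct bio *bio, int error)
          {
                  /* ... intermediate work ... */
                  bio->bi_end_io = saved_end_io;          /* 3) restore original */
                  bio_endio(bio, error);                  /* 4) run the original */
          }
      
          /* 2) wiring up the intermediate handler */
          saved_end_io = bio->bi_end_io;
          bio->bi_end_io = intermediate_end_io;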
      
      The regression was due to BIO_CHAIN only ever getting set if
      bio_inc_remaining() is called.  For the above pattern it isn't set until
      step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
      the first bio_endio(), in step 2 above, never decremented __bi_remaining
      before calling the intermediate bi_end_io -- leaving __bi_remaining with
      the value 1 instead of 0.  When bio_inc_remaining() occurred during step
      3 it brought it to a value of 2.  When the second bio_endio() was
      called, in step 4 above, it should've called the original bi_end_io but
      it didn't because there was an extra reference that wasn't dropped (due
      to atomic operations being optimized away since BIO_CHAIN wasn't set
      upfront).
      
      Fix this issue by removing the __bi_remaining management complexity for
      all callers that use the above pattern -- bio_chain() is the only
      interface that _needs_ to be concerned with __bi_remaining.  For the
      above pattern callers just expect the bi_end_io they set to get called!
      Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
      that aren't associated with the bio_chain() interface.
      
      Also, the bio_inc_remaining() interface has been moved local to bio.c.
      
      Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      326e1dbb
  4. 19 May 2015, 1 commit
  5. 06 May 2015, 1 commit
    • bio: skip atomic inc/dec of ->bi_cnt for most use cases · dac56212
      Committed by Jens Axboe
      Struct bio has a reference count that controls when it can be freed.
      The most common use case is allocating the bio, which then returns
      with a single reference to it, doing IO, and then dropping that single
      reference. We can remove this atomic_dec_and_test() in the completion
      path, if nobody else is holding a reference to the bio.
      
      If someone does call bio_get() on the bio, then we flag the bio as
      now having a valid reference count, and we must properly honor that
      count when the bio is put.
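      
      The idea, as a rough sketch (the flag name here is illustrative, not
      the exact one merged; bi_cnt and bi_flags as in this kernel):
      
          /* bio_get(): mark that the refcount is now meaningful, then bump it */
          bio->bi_flags |= (1 << BIO_SKETCH_REFFED);      /* illustrative flag name */
          smp_mb__before_atomic();
          atomic_inc(&bio->bi_cnt);
      
          /* bio_put(): only pay for the atomic when the flag was ever set */
          if (!(bio->bi_flags & (1 << BIO_SKETCH_REFFED)))
                  bio_free(bio);                          /* sole owner, free directly */
          else if (atomic_dec_and_test(&bio->bi_cnt))
                  bio_free(bio);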
      Tested-by: Robert Elliott <elliott@hp.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      dac56212
  6. 03 May 2015, 3 commits
    • ext4: fix growing of tiny filesystems · 2c869b26
      Committed by Jan Kara
      The estimate of necessary transaction credits in ext4_flex_group_add()
      is too pessimistic. It reserves credits for the sb, the resize inode, and
      the resize inode's dindirect block for each group added in a flex group,
      although these are always the same blocks and it is therefore enough to
      account for them only once. Also the number of modified GDT blocks is
      overestimated, since we fit EXT4_DESC_PER_BLOCK(sb) descriptors in one
      block.
      
      Make the estimation more precise. That reduces the number of requested
      credits enough that we can grow a 20 MB filesystem (which has a 1 MB
      journal, 79 reserved GDT blocks, and flex group size 16 by default).
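      
      Schematically, the accounting idea is the following (an illustration
      only, not the actual formula used by ext4_flex_group_add(); the
      per-group term is a hypothetical placeholder): fixed metadata blocks
      are charged once per operation, while GDT credits scale with how many
      descriptors fit in a block.
      
          /* illustrative only */
          credits = 3                                      /* sb + resize inode + its dindirect, once */
                  + DIV_ROUND_UP(ngroups, EXT4_DESC_PER_BLOCK(sb)) /* GDT blocks touched */
                  + per_group_cost * ngroups;              /* hypothetical remaining per-group blocks */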
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      2c869b26
    • ext4: move check under lock scope to close a race. · 280227a7
      Committed by Davide Italiano
      fallocate() checks that the file is extent-based and returns
      EOPNOTSUPP in case it is not. Other tasks can convert the file between
      indirect and extent format, so it is only safe to perform the check
      after grabbing the inode mutex, as sketched below.
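      
      A minimal sketch of the resulting ordering (the label and surrounding
      code are illustrative):
      
          mutex_lock(&inode->i_mutex);
          if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
                  ret = -EOPNOTSUPP;
                  goto out;               /* unlocks i_mutex and returns */
          }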
      Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      280227a7
    • ext4: fix data corruption caused by unwritten and delayed extents · d2dc317d
      Committed by Lukas Czerner
      Currently it is possible to lose a whole file system block worth of data
      when we hit a specific interaction between unwritten and delayed extents
      in the extent status tree.
      
      The problem is that when we insert a delayed extent into the extent
      status tree, the only way to get rid of it is to write out the delayed
      buffer. However, there is a limitation in the extent status tree
      implementation: when inserting an unwritten extent, if there is even a
      single delayed block, the whole unwritten extent is marked as delayed.
      
      At this point there is no way to get rid of the delayed extents,
      because there are no delayed buffers to write out. So when we write
      into said unwritten extent we will convert it to written, but it still
      remains delayed.
      
      When we later try to write into that block, ext4_da_map_blocks() will set
      the buffer new and delayed and map it to an invalid block, which causes
      the rest of the block to be zeroed, losing already written data.
      
      For now we can fix this by simply not allowing delayed status to be set
      on a written extent in the extent status tree. Also add a WARN_ON()
      (sketched below) to make sure that we notice if this happens in the future.
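      
      In spirit, the added sanity check looks like this (illustrative; the
      actual patch also adjusts how the status bits are computed so the
      combination cannot arise):
      
          WARN_ON((status & EXTENT_STATUS_DELAYED) &&
                  (status & EXTENT_STATUS_WRITTEN));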
      
      This problem can be easily reproduced by running the following xfs_io.
      
      xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
                -c "falloc 0 131072" \
                -c "pwrite -S 0xbb 65536 2048" \
                -c "fsync" /mnt/test/fff
      
      echo 3 > /proc/sys/vm/drop_caches
      xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
      
      This can theoretically also be reproduced at random by running fsx,
      though it's not very reliable; on machines with a bigger page size
      (like ppc) it can be seen more often (especially with xfstest generic/127).
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      d2dc317d
  7. 02 May 2015, 4 commits
  8. 30 April 2015, 1 commit
  9. 26 April 2015, 9 commits
    • Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. · 6e17d30b
      Committed by Yang Dongsheng
      We need to fill the inode when we find a node for it in delayed_nodes_tree.
      But we currently do not fill ->last_trans, which causes the test
      xfstests generic/311 to fail. The scenario of generic/311 is shown below:
      
      Problem:
      	(1). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(2). pwrite(test_fd, buf, 4096, 0)
      	(3). close(test_fd)
      	(4). drop_all_caches()	<-------- "echo 3 > /proc/sys/vm/drop_caches"
      	(5). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(6). fsync(test_fd);
      				<-------- we did not get the correct log entry for the file
      Reason:
      	When we re-open this file in (5), we find a node
      in delayed_nodes_tree and fill the inode we are looking up with its
      information. But ->last_trans is not filled, so fsync()
      checks ->last_trans, finds it is 0, and concludes this inode
      is already in our tree and committed, not recording the extents
      for it.
      
      Fix:
      	This patch fills ->last_trans properly and sets the
      runtime_flags if needed in this situation. Then we get the
      log entries we expect after (6) and generic/311 passes.
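      
      The essence of the fill, as a sketch (simplified from btrfs_fill_inode();
      the stack accessor name is assumed from the setget helpers for delayed
      inode items):
      
          BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);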
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Reviewed-by: Miao Xie <miaoxie@huawei.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      6e17d30b
    • btrfs: unlock i_mutex after attempting to delete subvolume during send · 909e26dc
      Committed by Omar Sandoval
      Whenever the check for a send in progress introduced in commit
      521e0546 (btrfs: protect snapshots from deleting during send) is
      hit, we return without unlocking inode->i_mutex. This is easy to see
      with lockdep enabled:
      
      [  +0.000059] ================================================
      [  +0.000028] [ BUG: lock held when returning to user space! ]
      [  +0.000029] 4.0.0-rc5-00096-g3c435c1e #93 Not tainted
      [  +0.000026] ------------------------------------------------
      [  +0.000029] btrfs/211 is leaving the kernel with locks still held!
      [  +0.000029] 1 lock held by btrfs/211:
      [  +0.000023]  #0:  (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0
      
      Make sure we unlock it in the error path.
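      
      The shape of the fix, sketched (label name is illustrative):
      
          if (dest->send_in_progress) {
                  /* warn about deletion during send ... */
                  err = -EPERM;
                  goto out_unlock_inode;  /* was a plain return before */
          }
          ...
      out_unlock_inode:
          mutex_unlock(&inode->i_mutex);
          return err;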
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      909e26dc
    • btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache · b8605454
      Committed by Omar Sandoval
      If io_ctl_prepare_pages fails, the pages in io_ctl.pages are not valid.
      When we try to access them later, things will blow up in various ways.
      
      Also fix the comment about the return value, which is an errno on error,
      not -1, and update the cases where it was not.
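      
      The added check amounts to (illustrative sketch):
      
          ret = io_ctl_prepare_pages(io_ctl, inode, 0);
          if (ret)
                  goto out;       /* don't touch io_ctl.pages on failure */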
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      b8605454
    • btrfs: fix race on ENOMEM in alloc_extent_buffer · 5ca64f45
      Committed by Omar Sandoval
      Consider the following interleaving of overlapping calls to
      alloc_extent_buffer:
      
      Call 1:
      
      - Successfully allocates a few pages with find_or_create_page
      - find_or_create_page fails, goto free_eb
      - Unlocks the allocated pages
      
      Call 2:
      - Calls find_or_create_page and gets a page in call 1's extent_buffer
      - Finds that the page is already associated with an extent_buffer
      - Grabs a reference to the half-written extent_buffer and calls
        mark_extent_buffer_accessed on it
      
      mark_extent_buffer_accessed will then try to call mark_page_accessed on
      a null page and panic.
      
      The fix is to decrement the reference count on the half-written
      extent_buffer before unlocking the pages so call 2 won't use it. We
      should also set exists = NULL in the case that we don't use exists to
      avoid accidentally returning a freed extent_buffer in an error case.
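      
      A simplified sketch of the reordered error path (close to, but not
      exactly, the resulting free_eb label in alloc_extent_buffer()):
      
          free_eb:
                  WARN_ON(!atomic_dec_and_test(&eb->refs));       /* drop our ref first */
                  for (i = 0; i < num_pages; i++) {
                          if (eb->pages[i])
                                  unlock_page(eb->pages[i]);      /* now safe to unlock */
                  }
                  btrfs_release_extent_buffer(eb);
                  return exists;                                  /* NULL unless set above */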
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      5ca64f45
    • btrfs: handle ENOMEM in btrfs_alloc_tree_block · 67b7859e
      Committed by Omar Sandoval
      This is one of the first places to give out when memory is tight. Handle
      it properly rather than with a BUG_ON.
      
      Also fix the comment about the return value, which is an ERR_PTR, not
      NULL, on error.
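      
      Schematically, the change is from a BUG_ON() to normal error
      propagation (arguments elided; the cleanup label is hypothetical):
      
          buf = btrfs_init_new_buffer(trans, root, ...);
          if (IS_ERR(buf)) {
                  ret = PTR_ERR(buf);
                  goto out_free_reserved; /* undo the reserved extent, return ERR_PTR */
          }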
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      67b7859e
    • Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole · 1b984508
      Committed by Forrest Liu
      If the device tree has a hole, find_free_dev_extent() cannot find an
      available address properly.
      
      The problem can be reproduced by the following script.
      
          mntpath=/btrfs
          loopdev=/dev/loop0
          filepath=/home/forrest/image
      
          umount $mntpath
          losetup -d $loopdev
          truncate --size 100g $filepath
          losetup $loopdev $filepath
          mkfs.btrfs -f $loopdev
          mount $loopdev $mntpath
      
          # make device tree with one big hole
          for i in `seq 1 1 100`; do
              fallocate -l 1g $mntpath/$i
          done
          sync
          for i in `seq 1 1 95`; do
              rm $mntpath/$i
          done
          sync
      
          # wait for the cleaner thread to remove the unused block groups
          sleep 300
      
          fallocate -l 1g $mntpath/aaa
      
          # failed to allocate new chunk
          fallocate -l 1g $mntpath/bbb
      
      The above script makes a device tree with one big hole, and only one chunk
      can be allocated per transaction, so allocating a new chunk for $mntpath/bbb fails.
      
          item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48
              dev extent chunk_tree 3
              chunk objectid 256 chunk offset 106292051968 length 1073741824
          item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48
              dev extent chunk_tree 3
              chunk objectid 256 chunk offset 103108575232 length 1073741824
      Signed-off-by: Forrest Liu <forrestl@synology.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      1b984508
    • Btrfs: don't check for delalloc_bytes in cache_save_setup · e4c88f00
      Committed by Chris Mason
      Now that we're doing free space cache writeback outside the critical
      section in the commit, there is a bigger window for delalloc_bytes to
      be added after a cache has been written.  find_free_extent may do this
      without putting the block group back into the dirty list, and also
      without a transaction running.
      
      Checking for delalloc_bytes in cache_save_setup means we might leave the
      cache marked as written without invalidating it.  Consistency checks
      during mount will toss the cache, but it's better to get rid of the
      check in cache_save_setup and let it get invalidated by the checks
      already done during cache write out.
      Signed-off-by: Chris Mason <clm@fb.com>
      e4c88f00
    • Btrfs: fix deadlock when starting writeback of bg caches · 24b89d08
      Committed by Filipe Manana
      While starting the writes of the dirty block group caches, if we did not
      find a block group item in the extent tree we were leaving without
      releasing our path, running delayed references and then looping again to
      process any new dirty block groups. However, this second iteration of the
      loop could cause a deadlock: it tries to lock some other extent tree
      node/leaf which another task has already locked, and that task is blocked
      waiting for a lock on some node/leaf in our path that was not released
      before.
      We could also deadlock when running the delayed references - as we could
      end up trying to lock the same nodes/leafs that we have in our local path
      (with a different lock type).
      
      Got into such case when running xfstests:
      
      [20892.242791] ------------[ cut here ]------------
      [20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [20892.245874] BTRFS: Transaction aborted (error -2)
      (...)
      [20892.269378] Call Trace:
      [20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
      [20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
      [20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      (...)
      [20892.291316] ---[ end trace 597f77e664245373 ]---
      [20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
      [20892.297390] BTRFS info (device sdg): forced readonly
      [20892.298222] ------------[ cut here ]------------
      [20892.299190] WARNING: CPU: 0 PID: 13299 at fs/btrfs/ctree.c:2683 btrfs_search_slot+0x7e/0x7d2 [btrfs]()
      (...)
      [20892.326253] Call Trace:
      [20892.326904]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [20892.329503]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [20892.330815]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [20892.332556]  [<ffffffffa0510b73>] ? btrfs_search_slot+0x7e/0x7d2 [btrfs]
      [20892.333955]  [<ffffffff81045f62>] warn_slowpath_null+0x1a/0x1c
      [20892.335562]  [<ffffffffa0510b73>] btrfs_search_slot+0x7e/0x7d2 [btrfs]
      [20892.336849]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
      [20892.338222]  [<ffffffffa051ad52>] ? cache_save_setup+0x43/0x2a5 [btrfs]
      [20892.339823]  [<ffffffffa051ad66>] ? cache_save_setup+0x57/0x2a5 [btrfs]
      [20892.341275]  [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
      [20892.342810]  [<ffffffffa0515de7>] write_one_cache_group+0x3f/0xaf [btrfs]
      [20892.344184]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
      [20892.347162]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      (...)
      [20892.361015] ---[ end trace 597f77e664245374 ]---
      [21120.688097] INFO: task kworker/u8:17:29854 blocked for more than 120 seconds.
      [21120.689881]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [21120.691384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      (...)
      [21120.703696] Call Trace:
      [21120.704310]  [<ffffffff8143107e>] schedule+0x74/0x83
      [21120.705490]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
      [21120.706757]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [21120.708156]  [<ffffffffa054ac1e>] lock_extent_buffer_for_io+0x3e/0x194 [btrfs]
      [21120.709892]  [<ffffffffa054bb86>] ? btree_write_cache_pages+0x273/0x385 [btrfs]
      [21120.711605]  [<ffffffffa054bc42>] btree_write_cache_pages+0x32f/0x385 [btrfs]
      [21120.723440]  [<ffffffffa0527552>] btree_writepages+0x23/0x5c [btrfs]
      [21120.724943]  [<ffffffff8110c4c8>] do_writepages+0x23/0x2c
      [21120.726008]  [<ffffffff81176dde>] __writeback_single_inode+0x73/0x2fa
      [21120.727230]  [<ffffffff8117714a>] ? writeback_sb_inodes+0xe5/0x38b
      [21120.728526]  [<ffffffff811771fb>] ? writeback_sb_inodes+0x196/0x38b
      [21120.729701]  [<ffffffff8117726a>] writeback_sb_inodes+0x205/0x38b
      (...)
      [21120.747853] INFO: task btrfs:13282 blocked for more than 120 seconds.
      [21120.749459]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [21120.751137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      (...)
      [21120.768457] Call Trace:
      [21120.769039]  [<ffffffff8143107e>] schedule+0x74/0x83
      [21120.770107]  [<ffffffffa052f25c>] btrfs_commit_transaction+0x315/0x9c9 [btrfs]
      [21120.771558]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [21120.773659]  [<ffffffffa056fd8c>] prepare_to_relocate+0xcb/0xd2 [btrfs]
      [21120.776257]  [<ffffffffa05741da>] relocate_block_group+0x44/0x4a9 [btrfs]
      [21120.777755]  [<ffffffffa05747a0>] ? btrfs_relocate_block_group+0x161/0x288 [btrfs]
      [21120.779459]  [<ffffffffa05747a8>] btrfs_relocate_block_group+0x169/0x288 [btrfs]
      [21120.781153]  [<ffffffffa0550403>] btrfs_relocate_chunk.isra.29+0x3e/0xa7 [btrfs]
      [21120.783918]  [<ffffffffa05518fd>] btrfs_balance+0xaa4/0xc52 [btrfs]
      [21120.785436]  [<ffffffff8114306e>] ? cpu_cache_get.isra.39+0xe/0x1f
      [21120.786434]  [<ffffffffa0559252>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
      (...)
      [21120.889251] INFO: task fsstress:13288 blocked for more than 120 seconds.
      [21120.890526]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [21120.891773] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      (...)
      [21120.899960] Call Trace:
      [21120.900743]  [<ffffffff8143107e>] schedule+0x74/0x83
      [21120.903004]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
      [21120.904383]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [21120.905608]  [<ffffffffa051125b>] btrfs_search_slot+0x766/0x7d2 [btrfs]
      [21120.906812]  [<ffffffff8114290e>] ? virt_to_head_page+0x9/0x2c
      [21120.907874]  [<ffffffff81144b7f>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb
      [21120.909551]  [<ffffffffa05124e0>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
      [21120.910914]  [<ffffffffa0512585>] btrfs_insert_item+0x5a/0xa5 [btrfs]
      [21120.912181]  [<ffffffffa0520271>] ? btrfs_create_pending_block_groups+0x96/0x130 [btrfs]
      [21120.913784]  [<ffffffffa052028a>] btrfs_create_pending_block_groups+0xaf/0x130 [btrfs]
      [21120.915374]  [<ffffffffa052ffc2>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [21120.916735]  [<ffffffffa05302b4>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [21120.917996]  [<ffffffffa051ab26>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [21120.919478]  [<ffffffffa051ba25>] btrfs_delalloc_reserve_space+0x1e/0x51 [btrfs]
      [21120.921226]  [<ffffffffa05382f2>] btrfs_truncate_page+0x85/0x2c4 [btrfs]
      [21120.923121]  [<ffffffffa0538572>] btrfs_cont_expand+0x41/0x3ef [btrfs]
      [21120.924449]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
      [21120.926602]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
      [21120.927769]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
      [21120.929324]  [<ffffffffa05410a0>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
      [21120.930723]  [<ffffffffa05410d9>] btrfs_file_write_iter+0x1e2/0x431 [btrfs]
      [21120.931897]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
      [21120.934446]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
      [21120.935528]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
      (...)
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      24b89d08
    • Btrfs: fix race between start dirty bg cache writeout and bg deletion · b58d1a9e
      Committed by Filipe Manana
      While running xfstests I ran into the following:
      
      [20892.242791] ------------[ cut here ]------------
      [20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [20892.245874] BTRFS: Transaction aborted (error -2)
      [20892.247329] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse$
      [20892.258488] CPU: 0 PID: 13299 Comm: fsstress Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [20892.262011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [20892.264738]  0000000000000009 ffff880427f8bc18 ffffffff8142fa46 ffffffff8108b6a2
      [20892.266244]  ffff880427f8bc68 ffff880427f8bc58 ffffffff81045ea5 ffff880427f8bc48
      [20892.267761]  ffffffffa0509a6d 00000000fffffffe ffff8803545d6f40 ffffffffa05a15a0
      [20892.269378] Call Trace:
      [20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
      [20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
      [20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      [20892.281781]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [20892.282873]  [<ffffffffa054163b>] btrfs_sync_file+0x313/0x387 [btrfs]
      [20892.284111]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
      [20892.285203]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
      [20892.286290]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [20892.287469]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
      [20892.288412]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
      [20892.289348]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
      [20892.290255]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [20892.291316] ---[ end trace 597f77e664245373 ]---
      [20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
      [20892.297390] BTRFS info (device sdg): forced readonly
      
      This happens because in btrfs_start_dirty_block_groups() we splice the
      transaction's list of dirty block groups into a local list and then we
      keep extracting the first element of the list without holding the
      cache_write_mutex mutex. This means that before we acquire that mutex
      the first block group on the list might be removed by a concurrent task
      running btrfs_remove_block_group(). So make sure we extract the first
      element (and test the list emptiness) while holding that mutex.
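      
      A simplified sketch of the corrected locking (the real loop also drops
      and retakes the mutex while writing each cache out):
      
          mutex_lock(&trans->transaction->cache_write_mutex);
          while (!list_empty(&dirty)) {
                  cache = list_first_entry(&dirty,
                                           struct btrfs_block_group_cache,
                                           dirty_list);
                  /* ... write out this block group's cache ... */
          }
          mutex_unlock(&trans->transaction->cache_write_mutex);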
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      b58d1a9e
  10. 25 April 2015, 4 commits
    • RCU pathwalk breakage when running into a symlink overmounting something · 3cab989a
      Committed by Al Viro
      Calling unlazy_walk() in walk_component() and do_last() when we find
      a symlink that needs to be followed doesn't acquire a reference to vfsmount.
      That's fine when the symlink is on the same vfsmount as the parent directory
      (which is almost always the case), but it's not always true - one _can_
      manage to bind a symlink on top of something.  And in such cases we end up
      with excessive mntput().
      
      Cc: stable@vger.kernel.org # since 2.6.39
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3cab989a
    • direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Committed by Jens Axboe
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
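      
      Inside do_blockdev_direct_IO() the effect is roughly (simplified
      sketch):
      
          if (!(dio->flags & DIO_SKIP_DIO_COUNT))
                  atomic_inc(&inode->i_dio_count);        /* previously unconditional */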
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      fe0f07d0
    • fs/9p: fix readdir() · 8e3c5005
      Committed by Johannes Berg
      Al Viro's IOV changes broke 9p readdir() because the new code
      didn't abort the read when it returned nothing. The original
      code checked if the combined error/length was <= 0 but in the
      new code that accidentally got changed to just an error check.
      
      Add back the return from the function when nothing is read.
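      
      The fix amounts to (simplified sketch from v9fs_dir_readdir(); variable
      names assumed):
      
          n = p9_client_read(file->private_data, ctx->pos, &to, &err);
          if (err)
                  return err;
          if (n == 0)             /* nothing read: stop, as the old <= 0 check did */
                  return 0;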
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: e1200fe6 ("9p: switch p9_client_read() to passing struct iov_iter *")
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      8e3c5005
    • Btrfs: prevent list corruption during free space cache processing · a3bdccc4
      Committed by Chris Mason
      __btrfs_write_out_cache is holding the ctl->tree_lock while it prepares
      a list of bitmaps to record in the free space cache.  It was dropping
      the lock while it worked on other components, which made a window for
      free_bitmap() to free the bitmap struct without removing it from the
      list.
      
      This changes things to hold the lock the whole time, and also makes sure
      we hold the lock during enospc cleanup.
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a3bdccc4
  11. 24 April 2015, 9 commits