1. 23 Feb, 2016 28 commits
    • C
      f2fs: speed up handling holes in fiemap · da85985c
      Chao Yu committed
      This patch makes f2fs_map_blocks support returning the next potential
      page offset, which skips the hole region in the indirect tree of the
      inode, and uses it to speed up fiemap when handling a big hole.
      
      Test method:
      xfs_io -f /mnt/f2fs/file  -c "pwrite 1099511627776 4096"
      time xfs_io -f /mnt/f2fs/file -c "fiemap -v"
      
      Before:
      time xfs_io -f /mnt/f2fs/file -c "fiemap -v"
      /mnt/f2fs/file:
       EXT: FILE-OFFSET              BLOCK-RANGE      TOTAL FLAGS
         0: [0..2147483647]:         hole             2147483648
         1: [2147483648..2147483655]: 81920..81927         8   0x1
      
      real    3m3.518s
      user    0m0.000s
      sys     3m3.456s
      
      After:
      time xfs_io -f /mnt/f2fs/file -c "fiemap -v"
      /mnt/f2fs/file:
       EXT: FILE-OFFSET              BLOCK-RANGE      TOTAL FLAGS
         0: [0..2147483647]:         hole             2147483648
         1: [2147483648..2147483655]: 81920..81927         8   0x1
      
      real    0m0.008s
      user    0m0.000s
      sys     0m0.008s
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
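The extent ranges in the fiemap output above can be checked with a little arithmetic. This is a minimal sketch, assuming the ranges printed by `fiemap -v` are in 512-byte units (which matches the 4096-byte write showing up as 8 units); the helper name `to_sector_range` is made up for illustration.

```python
SECTOR_SIZE = 512  # assumption: fiemap -v ranges are in 512-byte units


def to_sector_range(byte_offset, byte_length):
    """Return the inclusive (first, last) 512-byte unit range of an extent."""
    first = byte_offset // SECTOR_SIZE
    last = (byte_offset + byte_length) // SECTOR_SIZE - 1
    return first, last


# The test writes 4096 bytes at offset 1099511627776 (1 TiB).
hole = to_sector_range(0, 1099511627776)
data = to_sector_range(1099511627776, 4096)
print(hole)  # (0, 2147483647)
print(data)  # (2147483648, 2147483655)
```

The hole therefore spans 2147483648 units, which is the region the patch lets fiemap skip instead of walking it page by page.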
    • C
      f2fs: introduce get_next_page_offset to speed up SEEK_DATA · 3cf45747
      Chao Yu committed
      When seeking data in ->llseek, if we encounter a big hole which covers
      several dnode pages, we try to seek data starting from the index of the
      first page of the next dnode page, so at most we can skip searching
      (ADDRS_PER_BLOCK - 1) pages.

      However, this is still not efficient: if our indirect/double-indirect
      pointers are NULL, no dnode pages are located in the tree that the
      indirect/double-indirect pointers point to, so it is not necessary to
      search the whole region.

      This patch introduces get_next_page_offset to calculate the next page
      offset based on the current searching level and the max searching level
      returned from get_dnode_of_data. With this, we can skip searching the
      entire area whose indirect or double-indirect node block does not exist.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
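The idea of jumping to the next subtree boundary can be sketched as follows. This is not the kernel implementation: the function name `next_page_offset`, the `base` argument and the constant values are assumptions chosen only to show how a larger skip unit applies when a higher-level node block is missing.

```python
def next_page_offset(pgofs, base, skipped_unit):
    """Return the first page index after the missing subtree containing pgofs.

    base: page index where the searched tree level starts (hypothetical).
    skipped_unit: number of data pages covered by the missing node block.
    """
    return ((pgofs - base) // skipped_unit + 1) * skipped_unit + base


ADDRS_PER_BLOCK = 1018  # assumed value, for illustration only
NIDS_PER_BLOCK = 1018   # assumed value, for illustration only

# Missing direct node: skip at most one dnode's worth of pages.
direct_next = next_page_offset(1000, 923, ADDRS_PER_BLOCK)

# Missing indirect node: skip ADDRS_PER_BLOCK * NIDS_PER_BLOCK pages at once.
indirect_next = next_page_offset(5000, 2959, ADDRS_PER_BLOCK * NIDS_PER_BLOCK)

print(direct_next, indirect_next)
```

With these assumed values a missing indirect node lets the search jump over roughly a million pages in one step instead of one dnode at a time.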
    • C
      f2fs: remove unneeded pointer conversion · 81ca7350
      Chao Yu committed
      There is a redundant pointer conversion in the following call stack:
       - at position a, inode is converted to f2fs_file_info.
       - at position b, f2fs_file_info is converted back to inode.
      
       - truncate_blocks(inode,..)
        - fi = F2FS_I(inode)		---a
        - ADDRS_PER_PAGE(node_page, fi)
         - addrs_per_inode(fi)
          - inode = &fi->vfs_inode	---b
          - f2fs_has_inline_xattr(inode)
           - fi = F2FS_I(inode)
           - is_inode_flag_set(fi,..)
      
      In order to avoid the unneeded conversion, alter ADDRS_PER_PAGE and
      addrs_per_inode to accept a parameter of type inode pointer.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • C
      f2fs: simplify __allocate_data_blocks · 5b8db7fa
      Chao Yu committed
      This patch uses the existing function f2fs_map_blocks to simplify the
      implementation of __allocate_data_blocks.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • C
      f2fs: simplify f2fs_map_blocks · 4fe71e88
      Chao Yu committed
      In f2fs_map_blocks, we use duplicated code to handle the first block mapping
      and the following block mappings, which is unnecessary. This patch simplifies
      f2fs_map_blocks to avoid the copied code.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • S
      f2fs: introduce lifetime write IO statistics · 8f1dbbbb
      Shuoran Liu committed
      This patch introduces lifetime write IO statistics exposed through the sysfs interface.
      The write IO amount is obtained from the block layer, accumulated in the file system and
      stored in the hot node summary of the checkpoint.
      Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
      Signed-off-by: Pengyang Hou <houpengyang@huawei.com>
      [Jaegeuk Kim: add sysfs documentation]
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • J
      f2fs: give scheduling point in shrinking path · 6fe2bc95
      Jaegeuk Kim committed
      We need to give the task a chance to be rescheduled while shrinking slab entries.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • H
      f2fs: improve shrink performance of extent nodes · 201ef5e0
      Hou Pengyang committed
      In the worst case, we need to scan the whole radix tree and every rb-tree to
      free the victim extent_nodes when shrinking.

      Pengyang initially introduced a victim_list to record the victim extent_nodes,
      and freed these extent_nodes by just scanning a list.

      Later, Chao Yu enhanced the original patch to improve the memory footprint by
      removing the victim list.
      
      The policy of lru list shrinking becomes:
      1) lock lru list's lock
      2) trylock extent tree's lock
      3) remove extent node from lru list
      4) unlock lru list's lock
      5) do shrink
      6) repeat 1) to 5)
      Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
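The numbered locking policy above can be modelled in user space as a rough sketch. Everything here is illustrative: the `ExtentTree` class, `shrink_lru`, and the choice to rotate a busy node to the tail when the trylock fails are assumptions, not the kernel code.

```python
import threading
from collections import deque


class ExtentTree:
    """Hypothetical stand-in for an extent tree with its own lock."""

    def __init__(self, nodes):
        self.lock = threading.Lock()
        self.nodes = set(nodes)


def shrink_lru(lru, lru_lock, nr_to_shrink):
    """Free up to nr_to_shrink nodes, following steps 1) to 6)."""
    freed = 0
    with lru_lock:
        attempts = len(lru)  # bound the scan so busy trees cannot spin forever
    while freed < nr_to_shrink and attempts > 0:
        attempts -= 1
        with lru_lock:                                 # 1) lock the lru list
            if not lru:
                break
            tree, node = lru[0]
            if not tree.lock.acquire(blocking=False):  # 2) trylock the tree
                lru.rotate(-1)  # assumption: move the busy node to the tail
                continue
            lru.popleft()                              # 3) remove from lru list
        # 4) the lru lock is released on leaving the with-block
        try:
            tree.nodes.discard(node)                   # 5) do the shrink
            freed += 1
        finally:
            tree.lock.release()
    return freed                                       # 6) the loop repeats 1) to 5)


lru_lock = threading.Lock()
a, b = ExtentTree({1, 2}), ExtentTree({3})
lru = deque([(a, 1), (b, 3), (a, 2)])
b.lock.acquire()  # simulate another context holding tree b
freed = shrink_lru(lru, lru_lock, 3)
b.lock.release()
print(freed, [node for _, node in lru])
```

Because the tree lock is only tried while the list lock is held, the shrinker never sleeps on a tree lock with the list lock taken, and no separate victim list is needed.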
    • J
      f2fs: don't set cached_en if it will be freed · 42926744
      Jaegeuk Kim committed
      If en has an empty list pointer, it will be freed soon, so we don't need to
      set cached_en with it.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • J
      f2fs: move extent_node list operations being coupled with rbtree operation · 43a2fa18
      Jaegeuk Kim committed
      This patch moves extent_node list operations to be handled together with
      its rbtree operations.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • H
      f2fs: reconstruct the code to free an extent_node · a03f01f2
      Hou Pengyang committed
      There are three steps to free an extent node:
      1) list_del_init, 2) __detach_extent_node, 3) kmem_cache_free

      In the path f2fs_destroy_extent_tree, the order is 1->2->3 to free a node,
      but in the path f2fs_update_extent_tree_range, it is 2->1->3.

      This patch makes the order 1->2->3 everywhere.
      It makes sense, since in the next patch we introduce a victim list in the
      path shrink_extent_tree, and we can check whether the extent_node is in the
      victim list by checking list_empty(). So it is necessary to put 1) first.
      Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • J
      f2fs: use wq_has_sleeper for cp_wait wait_queue · 7c506896
      Jaegeuk Kim committed
      We need to use wq_has_sleeper, which includes smp_mb, to account for cp_wait concurrency.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • F
      f2fs: avoid unnecessary search while finding victim in gc · 688159b6
      Fan Li committed
      The variable nsearched in get_victim_by_default() indicates the number of
      dirty segments we have already checked. There are 2 problems with the way
      it is updated:
      1. When p.ofs_unit is greater than 1, the victim we find consists
         of multiple segments, possibly more than 1 dirty segment.
         But nsearched always increases by 1.
      2. If segments have been found but not chosen, nsearched won't
         increase. So even if we have checked all dirty segments, nsearched
         may still be less than p.max_search.
      Both problems could cause unnecessary searching after all dirty
      segments have already been checked.
      Signed-off-by: Fan li <fanofcode.li@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • Y
      f2fs: delete unnecessary wait for page writeback · 85ead818
      Yunlei He committed
      There is no need to wait for writeback of an inline file page, because
      no one uses it, so this patch deletes the unnecessary wait.
      Signed-off-by: Yunlei He <heyunlei@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • J
      f2fs: use wait_for_stable_page to avoid contention · fec1d657
      Jaegeuk Kim committed
      In write_begin, if the storage supports stable_page, we don't need to wait for
      writeback to update its contents.
      This patch uses wait_for_stable_page instead of
      wait_on_page_writeback.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • C
      f2fs: enhance foreground GC · 718e53fa
      Chao Yu committed
      If a section is configured to consist of multiple segments, foreground GC
      does the garbage collection with the following approach:
      
      	for each segment in victim section
      		blk_start_plug
      		for each valid block in segment
      			write out by OPU method
      		submit bio cache   <---
      		blk_finish_plug   <---
      
      There are two issues:
      1) most of the time, 'submit bio cache' prevents writes of the next
      segments from merging into the current bio buffer, so a smaller bio is
      submitted.
      2) the block plug only covers IO submission within one segment, which
      reduces the opportunity to merge IOs in the plug across multiple segments.

      So refactor the code into the structure below to maximize the
      opportunity for merging IOs:
      
      	blk_start_plug
      	for each segment in victim section
      		for each valid block in segment
      			write out by OPU method
      	submit bio cache
      	blk_finish_plug
      
      Test method:
      1. mkfs.f2fs -s 8 /dev/sdX
      2. touch 32 files
      3. write 2M data into each file
      4. punch 1.5M data from offset 0 for each file
      5. trigger foreground gc through ioctl
      
      Before the patch, a total of 40 bios are submitted.
      f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65536, size = 122880
      f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65776, size = 122880
      f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66016, size = 122880
      f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66256, size = 122880
      f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66496, size = 32768
      ----repeat for 8 times
      
      After the patch, a total of 35 bios are submitted.
      f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65536, size = 122880
      ----repeat 34 times
      f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 73696, size = 16384
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
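The bio counts in the traces above (40 before, 35 after) can be reproduced with a small calculation. This is a sketch under stated assumptions: a 4096-byte block size, perfectly contiguous and mergeable writes, and per-segment figures inferred from the trace sizes rather than taken from the patch itself.

```python
import math

BLOCK_SIZE = 4096                      # assumed f2fs block size
MAX_BIO_BLOCKS = 122880 // BLOCK_SIZE  # 30 blocks, the largest bio in the trace
SEGS_PER_SEC = 8                       # from "mkfs.f2fs -s 8"
# Valid blocks moved per segment, inferred from the "before" trace:
# four bios of 122880 bytes plus one of 32768 bytes.
VALID_BLOCKS_PER_SEG = (4 * 122880 + 32768) // BLOCK_SIZE  # 128


def bios_flush_per_segment():
    """Old flow: the bio cache is submitted after every segment."""
    return SEGS_PER_SEC * math.ceil(VALID_BLOCKS_PER_SEG / MAX_BIO_BLOCKS)


def bios_flush_per_section():
    """New flow: one submit after all segments in the section are written."""
    return math.ceil(SEGS_PER_SEC * VALID_BLOCKS_PER_SEG / MAX_BIO_BLOCKS)


print(bios_flush_per_segment())  # 40
print(bios_flush_per_section())  # 35
```

The saving comes from no longer cutting a partially filled bio at each segment boundary; only the last bio of the whole section is short (4 blocks, 16384 bytes).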
    • J
      f2fs: don't need to call set_page_dirty for io error · e3ef1876
      Jaegeuk Kim committed
      If end_io gets an error, we don't need to set the page as dirty, since we
      already set f2fs_stop_checkpoint which will not flush any data.
      
      This will resolve the following warning.
      
      ======================================================
      [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
      4.4.0+ #9 Tainted: G           O
      ------------------------------------------------------
      xfs_io/26773 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
       (&(&sbi->inode_lock[i])->rlock){+.+...}, at: [<ffffffffc025483f>] update_dirty_page+0x6f/0xd0 [f2fs]
      
      and this task is already holding:
       (&(&q->__queue_lock)->rlock){-.-.-.}, at: [<ffffffff81396ea2>] blk_queue_bio+0x422/0x490
      which would create a new lock dependency:
       (&(&q->__queue_lock)->rlock){-.-.-.} -> (&(&sbi->inode_lock[i])->rlock){+.+...}
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • J
      f2fs: avoid needless sync_inode_page when reading inline_data · ae96e7bd
      Jaegeuk Kim committed
      In write_begin, if there is inline_data, f2fs loads it into the 0th data page.
      Since it's the read path, we don't need to sync its inode page.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • J
      f2fs: don't need to sync node page at every time · 52f80337
      Jaegeuk Kim committed
      In write_end, we don't need to sync the inode page every time.
      Instead, we can expect that f2fs_write_inode will update it later.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • J
      f2fs: avoid multiple node page writes due to inline_data · 2049d4fc
      Jaegeuk Kim committed
      The scenario is:
      1. create fully node blocks
      2. flush node blocks
      3. write inline_data for all the node blocks again
      4. flush node blocks redundantly
      
      So, this patch tries to flush inline_data when flushing node blocks.
      Reviewed-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • J
      f2fs: do f2fs_balance_fs when block is allocated · 3c082b7b
      Jaegeuk Kim committed
      We should consider data block allocation to trigger f2fs_balance_fs.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • J
      f2fs: fix to overcome inline_data floods · 6e17bfbc
      Jaegeuk Kim committed
      The scenario is:
      1. create lots of node blocks
      2. sync
      3. write lots of inline_data
      -> got panic due to no free space
      
      In that case, we should flush node blocks when writing inline_data in #3,
      and trigger gc as well.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • J
      f2fs: use writepages->lock for WB_SYNC_ALL · 25c13551
      Jaegeuk Kim committed
      If there are many writepages calls by multiple threads in the background, we don't
      need to serialize them to merge all the bios, since it's background work.
      In such a case, it'd be better to run writepages concurrently.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • J
      f2fs: remove needless condition check · b483fadf
      Jaegeuk Kim committed
      This patch removes a needless condition variable.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • C
      f2fs: correct search area in get_new_segment · 0ab14356
      Chao Yu committed
      get_new_segment starts from the current segment position and tries to find a
      free segment among its right neighbors located in the same section.

      But previously our search area was set as [current segment, max segment],
      which means we have to search more bits in the free_segmap bitmap in some
      worse cases. So here we correct the search area to [current segment, last
      segment in section] to avoid unnecessary searching.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
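The two search areas described above can be compared with a short sketch. The function `search_area` and its parameters are hypothetical names used only to show how the upper bound changes from the last segment of the device to the last segment of the current section.

```python
def search_area(segno, segs_per_sec, total_segs, limit_to_section=True):
    """Return the inclusive (first, last) segment numbers to scan."""
    if not limit_to_section:
        # Old behaviour: [current segment, max segment]
        return segno, total_segs - 1
    # New behaviour: [current segment, last segment in section]
    secno = segno // segs_per_sec
    return segno, (secno + 1) * segs_per_sec - 1


old = search_area(10, 8, 1024, limit_to_section=False)
new = search_area(10, 8, 1024)
print(old)  # (10, 1023)
print(new)  # (10, 15)
```

In this example the bitmap scan shrinks from 1014 bits to 6 bits in the case where no free segment is left in the section.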
    • C
      f2fs: export dirty_nats_ratio in sysfs · 2304cb0c
      Chao Yu committed
      This patch exports a new sysfs entry 'dirty_nats_ratio' to control the threshold
      of dirty nat entries. If the current ratio exceeds the configured threshold,
      a checkpoint will be triggered in f2fs_balance_fs_bg to flush dirty nats.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • C
      f2fs: flush dirty nat entries when exceeding threshold · 7d768d2c
      Chao Yu committed
      When testing f2fs with xfstests, generic/251 is stuck for a long time.
      The case uses the sequence below to obtain freshly released space on the
      device, in order to prepare for the following fstrim test.
      
      1. rm -rf /mnt/dir
      2. mkdir /mnt/dir/
      3. cp -axT `pwd`/ /mnt/dir/
      4. goto 1
      
      During the preparation step, all nat entries will be cached in the nat cache.
      Most of them are dirty entries with an invalid blkaddr, which means the
      nodes related to these entries have been truncated, and they can be
      reused after the dirty entries have been checkpointed.

      However, no checkpoint was triggered, so nid allocators
      (e.g. mkdir, creat) will run into a long journey of iterating all NAT
      pages, looking for free nids in alloc_nid->build_free_nids.

      Here, in f2fs_balance_fs_bg we give another chance to do a checkpoint
      to flush nat entries for reusing them in the free nid cache when the dirty
      entry count exceeds 10% of the max count.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
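The 10% trigger described above amounts to a simple ratio check. This is a minimal sketch: the names `DIRTY_NAT_RATIO_PERCENT` and `excess_dirty_nats` are made up for illustration, and the strict "greater than" boundary is an assumption based on the word "exceeds".

```python
DIRTY_NAT_RATIO_PERCENT = 10  # threshold stated in the commit message


def excess_dirty_nats(dirty_nat_cnt, max_nat_cnt, ratio=DIRTY_NAT_RATIO_PERCENT):
    """Return True when dirty nat entries exceed ratio percent of the maximum."""
    # Integer arithmetic avoids floating point rounding at the boundary.
    return dirty_nat_cnt * 100 > max_nat_cnt * ratio


print(excess_dirty_nats(50, 1000))   # False
print(excess_dirty_nats(101, 1000))  # True
```

A background balance routine would call such a check periodically and start a checkpoint when it returns True, so that truncated nids become reusable without scanning all NAT pages.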
    • C
      f2fs: relocate is_merged_page · 0fd785eb
      Chao Yu committed
      Operations in is_merged_page are related to the inner bio cache, so move it to
      data.c.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  2. 19 Feb, 2016 4 commits
    • J
      ext4: fix crashes in dioread_nolock mode · 74dae427
      Jan Kara committed
      Competing overwrite DIO in dioread_nolock mode will just overwrite
      pointer to io_end in the inode. This may result in data corruption or
      extent conversion happening from IO completion interrupt because we
      don't properly set buffer_defer_completion() when unlocked DIO races
      with locked DIO to unwritten extent.
      
      Since unlocked DIO doesn't need io_end for anything, just avoid
      allocating it and corrupting pointer from inode for locked DIO.
      A cleaner fix would be to avoid these games with io_end pointer from the
      inode but that requires more intrusive changes so we leave that for
      later.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • J
      ext4: fix bh->b_state corruption · ed8ad838
      Jan Kara committed
      ext4 can update bh->b_state non-atomically in _ext4_get_block() and
      ext4_da_get_block_prep(). Usually this is fine since bh is just a
      temporary storage for mapping information on stack but in some cases it
      can be fully living bh attached to a page. In such case non-atomic
      update of bh->b_state can race with an atomic update which then gets
      lost. Usually when we are mapping bh and thus updating bh->b_state
      non-atomically, nobody else touches the bh and so things work out fine
      but there is one case to especially worry about: ext4_finish_bio() uses
      BH_Uptodate_Lock on the first bh in the page to synchronize handling of
      PageWriteback state. So when blocksize < pagesize, we can be atomically
      modifying bh->b_state of a buffer that actually isn't under IO and thus
      can race e.g. with delalloc trying to map that buffer. The result is
      that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
      corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.
      
      Fix the problem by always updating bh->b_state bits atomically.
      
      CC: stable@vger.kernel.org
      Reported-by: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
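The lost update described above can be shown with a deterministic model of the interleaving. This is only an illustration: the bit positions and function names are made up, and plain Python integers stand in for bh->b_state and for the atomic bit operations the fix relies on.

```python
BH_MAPPED = 1 << 0         # illustrative bit positions, not the kernel values
BH_UPTODATE_LOCK = 1 << 1


def racy_update(state, new_flags, interleaved_set):
    """Non-atomic read-modify-write with another context setting a bit mid-way."""
    snapshot = state              # reader takes a copy of the state
    state |= interleaved_set      # other context atomically sets its bit
    state = snapshot | new_flags  # stale copy is written back, bit is lost
    return state


def atomic_style_update(state, new_flags, interleaved_set):
    """Model of the fix: only the changed bits are applied to the live value."""
    state |= interleaved_set
    state |= new_flags
    return state


lost = racy_update(0, BH_MAPPED, BH_UPTODATE_LOCK)
kept = atomic_style_update(0, BH_MAPPED, BH_UPTODATE_LOCK)
print(bool(lost & BH_UPTODATE_LOCK))  # False: the lock bit was overwritten
print(bool(kept & BH_UPTODATE_LOCK))  # True
```

The racy variant shows how a set BH_Uptodate_Lock bit can vanish when the mapping code writes back a stale copy of the whole word.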
    • J
      fsnotify: turn fsnotify reaper thread into a workqueue job · 0918f1c3
      Jeff Layton committed
      We don't require a dedicated thread for fsnotify cleanup.  Switch it
      over to a workqueue job instead that runs on the system_unbound_wq.
      
      In the interest of not thrashing the queued job too often when there are
      a lot of marks being removed, we delay the reaper job slightly when
      queueing it, to allow several to gather on the list.
      Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
      Tested-by: Eryu Guan <guaneryu@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread" · 13d34ac6
      Jeff Layton committed
      This reverts commit c510eff6 ("fsnotify: destroy marks with
      call_srcu instead of dedicated thread").
      
      Eryu reported that he was seeing some OOM kills kick in when running a
      testcase that adds and removes inotify marks on a file in a tight loop.
      
      The above commit changed the code to use call_srcu to clean up the
      marks.  While that does (in principle) work, the srcu callback job is
      limited to cleaning up entries in small batches and only once per jiffy.
      It's easily possible to overwhelm that machinery with too many call_srcu
      callbacks, and Eryu's reproducer did just that.
      
      There's also another potential problem with using call_srcu here.  While
      you can obviously sleep while holding the srcu_read_lock, the callbacks
      run under local_bh_disable, so you can't sleep there.
      
      It's possible when putting the last reference to the fsnotify_mark that
      we'll end up putting a chain of references including the fsnotify_group,
      uid, and associated keys.  While I don't see any obvious ways that that
      could occur, it's probably still best to avoid using call_srcu here
      after all.

      This patch reverts the above patch.  A later patch will take a different
      approach to eliminate the dedicated thread here.
      Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
      Reported-by: Eryu Guan <guaneryu@gmail.com>
      Tested-by: Eryu Guan <guaneryu@gmail.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 17 Feb, 2016 2 commits
    • T
      writeback: initialize inode members that track writeback history · 3d65ae46
      Tahsin Erdogan committed
      inode struct members that track cgroup writeback information
      should be reinitialized when inode gets allocated from
      kmem_cache. Otherwise, their values remain and get used by the
      new inode.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Fixes: d10c8095 ("writeback: implement foreign cgroup inode bdi_writeback switching")
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • T
      writeback: keep superblock pinned during cgroup writeback association switches · 5ff8eaac
      Tejun Heo committed
      If cgroup writeback is in use, an inode is associated with a cgroup
      for writeback.  If the inode's main dirtier changes to another cgroup,
      the association gets updated asynchronously.  Nothing was pinning the
      superblock while such switches are in progress and superblock could go
      away while async switching is pending or in progress leading to
      crashes like the following.
      
       kernel BUG at fs/jbd2/transaction.c:319!
       invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
       CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
       Hardware name: Google Google, BIOS Google 01/01/2011
       Workqueue: events inode_switch_wbs_work_fn
       task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
       RIP: 0010:[<ffffffff803e6922>]  [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
       RSP: 0018:ffff880209267c30  EFLAGS: 00010202
       ...
       Call Trace:
        [<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
        [<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
        [<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
        [<ffffffff8035338b>] evict+0xbb/0x190
        [<ffffffff80354190>] iput+0x130/0x190
        [<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
        [<ffffffff80279819>] process_one_work+0x129/0x300
        [<ffffffff80279b16>] worker_thread+0x126/0x480
        [<ffffffff8027ed14>] kthread+0xc4/0xe0
        [<ffffffff809771df>] ret_from_fork+0x3f/0x70
      
      Fix it by bumping s_active while cgroup association switching is in
      flight.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Tahsin Erdogan <tahsin@google.com>
      Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
      Fixes: d10c8095 ("writeback: implement foreign cgroup inode bdi_writeback switching")
      Cc: stable@vger.kernel.org #v4.5+
      Signed-off-by: Jens Axboe <axboe@fb.com>
  4. 16 Feb, 2016 2 commits
    • K
      ext4: fix memleak in ext4_readdir() · c906f38e
      Kirill Tkhai committed
      When ext4_bread() fails, fname_crypto_str remains
      allocated after return. Fix that.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      CC: Dmitry Monakhov <dmonakhov@virtuozzo.com>
    • F
      Btrfs: fix direct IO requests not reporting IO error to user space · 1636d1d7
      Filipe Manana committed
      If a bio for a direct IO request fails, we were not setting the error in
      the parent bio (the main DIO bio), making us not return the error to
      user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO()
      return the number of bytes issued for IO and not the error a bio created
      and submitted by btrfs_submit_direct() got from the block layer.
      This essentially happens because when we call:
      
         dio_end_io(dio_bio, bio->bi_error);
      
      It does not set dio_bio->bi_error to the value of the second argument.
      So just add this missing assignment in endio callbacks, just as we do in
      the error path at btrfs_submit_direct() when we fail to clone the dio bio
      or allocate its private object. This follows the convention of what is
      done with other similar APIs such as bio_endio() where the caller is
      responsible for setting the bi_error field in the bio it passes as an
      argument to bio_endio().
      
      This was detected by the new generic test cases in xfstests: 271, 272,
      276 and 278, which essentially set up a dm error target, then load the
      error table, do a direct IO write and unload the error table. They
      expect the write to fail with -EIO, which was not getting reported
      when testing against btrfs.
      
      Cc: stable@vger.kernel.org  # 4.3+
      Fixes: 4246a0b6 ("block: add a bi_error field to struct bio")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  5. 12 Feb, 2016 4 commits
    • E
      ext4: remove unused parameter "newblock" in convert_initialized_extent() · 56263b4c
      Eryu Guan committed
      The "newblock" parameter is not used in convert_initialized_extent(),
      so remove it.
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • E
      ext4: don't read blocks from disk after extents being swapped · bcff2488
      Eryu Guan committed
      I notice ext4/307 fails occasionally on ppc64 host, reporting md5
      checksum mismatch after moving data from original file to donor file.
      
      The reason is that move_extent_per_page() calls __block_write_begin()
      and block_commit_write() to write saved data from original inode blocks
      to donor inode blocks, but __block_write_begin() not only maps buffer
      heads but also reads block content from disk if the size is not block
      size aligned.  At this time the physical block number in mapped buffer
      head is pointing to the donor file not the original file, and that
      results in reading wrong data to page, which get written to disk in
      following block_commit_write call.
      
      This also can be reproduced by the following script on 1k block size ext4
      on x86_64 host:
      
          mnt=/mnt/ext4
          donorfile=$mnt/donor
          testfile=$mnt/testfile
          e4compact=~/xfstests/src/e4compact
      
          rm -f $donorfile $testfile
      
          # reserve space for donor file, written by 0xaa and sync to disk to
          # avoid EBUSY on EXT4_IOC_MOVE_EXT
          xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile
      
          # create test file written by 0xbb
          xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile
      
          # compute initial md5sum
          md5sum $testfile | tee md5sum.txt
          # drop cache, force e4compact to read data from disk
          echo 3 > /proc/sys/vm/drop_caches
      
          # test defrag
          echo "$testfile" | $e4compact -i -v -f $donorfile
          # check md5sum
          md5sum -c md5sum.txt
      
      Fix it by creating & mapping buffer heads only but not reading blocks
      from disk, because all the data in page is guaranteed to be up-to-date
      in mext_page_mkuptodate().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • I
      ext4: fix potential integer overflow · 46901760
      Insu Yun committed
      Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data),
      an integer overflow could happen.
      Therefore, the integer overflow sanitization needs to be fixed.

      Cc: stable@vger.kernel.org
      Signed-off-by: Insu Yun <wuninsu@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • H
      ext4: add a line break for proc mb_groups display · 802cf1f9
      Huaitong Han committed
      This patch adds a line break for proc mb_groups display.
      Signed-off-by: Huaitong Han <huaitong.han@intel.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>