1. 08 5月, 2016 1 次提交
  2. 28 4月, 2016 1 次提交
    • C
      f2fs: move node pages only in victim section during GC · da011cc0
      Chao Yu 提交于
      For foreground GC, we cache node blocks in victim section and set them
      dirty, then we call sync_node_pages to flush these node pages, but
      meanwhile, those node pages which does not locate in victim section
      will be flushed together, so more bandwidth and continuous free space
      would be occupied.
      
      So for this condition, it's better to leave those unrelated node page
      in cache for further write hit, and let CP or VM to flush them afterward.
      Signed-off-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      da011cc0
  3. 27 4月, 2016 4 次提交
  4. 15 4月, 2016 2 次提交
  5. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  6. 18 3月, 2016 3 次提交
  7. 27 2月, 2016 1 次提交
    • C
      f2fs: fix to avoid deadlock when merging inline data · 19c7377b
      Chao Yu 提交于
      When testing with fsstress, kworker and user threads were both blocked:
      
      INFO: task kworker/u16:1:16580 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kworker/u16:1   D ffff8803f2595390     0 16580      2 0x00000000
      Workqueue: writeback bdi_writeback_workfn (flush-251:0)
       ffff8802730e5760 0000000000000046 ffff880274729fc0 0000000000012440
       ffff8802730e5fd8 ffff8802730e4010 0000000000012440 0000000000012440
       ffff8802730e5fd8 0000000000012440 ffff880274729fc0 ffff88026eb50000
      Call Trace:
       [<ffffffff816fe9d9>] schedule+0x29/0x70
       [<ffffffff816ff895>] rwsem_down_read_failed+0xa5/0xf9
       [<ffffffff81378584>] call_rwsem_down_read_failed+0x14/0x30
       [<ffffffffa0694feb>] f2fs_write_data_page+0x31b/0x420 [f2fs]
       [<ffffffffa0690f1a>] __f2fs_writepage+0x1a/0x50 [f2fs]
       [<ffffffffa06922a0>] f2fs_write_data_pages+0xe0/0x290 [f2fs]
       [<ffffffff811473b3>] do_writepages+0x23/0x40
       [<ffffffff811cc3ee>] __writeback_single_inode+0x4e/0x250
       [<ffffffff811cd4f1>] writeback_sb_inodes+0x2c1/0x470
       [<ffffffff811cd73e>] __writeback_inodes_wb+0x9e/0xd0
       [<ffffffff811cda0b>] wb_writeback+0x1fb/0x2d0
       [<ffffffff811cdb7c>] wb_do_writeback+0x9c/0x220
       [<ffffffff811ce232>] bdi_writeback_workfn+0x72/0x1c0
       [<ffffffff8106b74e>] process_one_work+0x1de/0x5b0
       [<ffffffff8106e78f>] worker_thread+0x11f/0x3e0
       [<ffffffff810750ce>] kthread+0xde/0xf0
       [<ffffffff817093f8>] ret_from_fork+0x58/0x90
      
      fsstress thread stack:
       [<ffffffff81139f0e>] sleep_on_page+0xe/0x20
       [<ffffffff81139ef7>] __lock_page+0x67/0x70
       [<ffffffff8113b100>] find_lock_page+0x50/0x80
       [<ffffffff8113b24f>] find_or_create_page+0x3f/0xb0
       [<ffffffffa06983a9>] sync_node_pages+0x259/0x810 [f2fs]
       [<ffffffffa068d874>] write_checkpoint+0x1a4/0xce0 [f2fs]
       [<ffffffffa0686b0c>] f2fs_sync_fs+0x7c/0xd0 [f2fs]
       [<ffffffffa067c813>] f2fs_sync_file+0x143/0x5f0 [f2fs]
       [<ffffffff811d301b>] vfs_fsync_range+0x2b/0x40
       [<ffffffff811d304c>] vfs_fsync+0x1c/0x20
       [<ffffffff811d3291>] do_fsync+0x41/0x70
       [<ffffffff811d32d3>] SyS_fdatasync+0x13/0x20
       [<ffffffff817094a2>] system_call_fastpath+0x16/0x1b
       [<ffffffffffffffff>] 0xffffffffffffffff
      
      The reason of this issue is:
      CPU0:					CPU1:
       - f2fs_write_data_pages
      					 - f2fs_sync_fs
      					  - write_checkpoint
      					   - block_operations
      					    - f2fs_lock_all
      					     - down_write(sbi->cp_rwsem)
        - lock_page(page)
        - f2fs_write_data_page
      					    - sync_node_pages
      					     - flush_inline_data
      					      - pagecache_get_page(page, GFP_LOCK)
         - f2fs_lock_op
          - down_read(sbi->cp_rwsem)
      
      This patch alters to use trylock_page in flush_inline_data to fix this ABBA
      deadlock issue.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      19c7377b
  8. 26 2月, 2016 1 次提交
    • C
      f2fs: fix incorrect upper bound when iterating inode mapping tree · 80dd9c0e
      Chao Yu 提交于
      1. Inode mapping tree can index page in range of [0, ULONG_MAX], however,
      in some places, f2fs only search or iterate page in ragne of [0, LONG_MAX],
      result in miss hitting in page cache.
      
      2. filemap_fdatawait_range accepts range parameters in unit of bytes, so
      the max range it covers should be [0, LLONG_MAX], if we use [0, LONG_MAX]
      as range for waiting on writeback, big number of pages will not be covered.
      
      This patch corrects above two issues.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      80dd9c0e
  9. 23 2月, 2016 13 次提交
    • C
      f2fs: trace old block address for CoWed page · 7a9d7548
      Chao Yu 提交于
      This patch enables to trace old block address of CoWed page for better
      debugging.
      
      f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f0, oldaddr = 0xfe8ab, newaddr = 0xfee90 rw = WRITE_SYNC, type = NODE
      f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f8, oldaddr = 0xfe8b0, newaddr = 0xfee91 rw = WRITE_SYNC, type = NODE
      f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4fa, oldaddr = 0xfe8ae, newaddr = 0xfee92 rw = WRITE_SYNC, type = NODE
      
      f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x96, oldaddr = 0xf049b, newaddr = 0x2bbe rw = WRITE, type = DATA
      f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x97, oldaddr = 0xf049c, newaddr = 0x2bbf rw = WRITE, type = DATA
      f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x98, oldaddr = 0xf049d, newaddr = 0x2bc0 rw = WRITE, type = DATA
      
      f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x47, oldaddr = 0xffffffff, newaddr = 0xf2631 rw = WRITE, type = DATA
      f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x48, oldaddr = 0xffffffff, newaddr = 0xf2632 rw = WRITE, type = DATA
      f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x49, oldaddr = 0xffffffff, newaddr = 0xf2633 rw = WRITE, type = DATA
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      7a9d7548
    • C
      f2fs: try to flush inode after merging inline data · 9a4cbc9e
      Chao Yu 提交于
      When flushing node pages, if current node page is an inline inode page, we
      will try to merge inline data from data page into inline inode page, then
      skip flushing current node page, it will decrease the number of nodes to
      be flushed in batch in this round, which may lead to worse performance.
      
      This patch gives a chance to flush just merged inline inode pages for
      performance.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      9a4cbc9e
    • C
      f2fs: reorder nat cache lock in cache_nat_entry · 1515aef0
      Chao Yu 提交于
      When lookuping nat entry in cache_nat_entry, if we fail to hit nat cache,
      we try to load nat entries a) from journal of current segment cache or b)
      from NAT pages for updating, during the process, write lock of
      nat_tree_lock will be held to avoid inconsistent condition in between
      nid cache and nat cache caused by racing among nat entry shrinker,
      checkpointer, nat entry updater.
      
      But this way may cause low efficient when updating nat cache, because it
      serializes accessing in journal cache or reading NAT pages.
      
      Here, we reorder lock and update flow as below to enhance accessing
      concurrency:
      
       - get_node_info
        - down_read(nat_tree_lock)
        - lookup nat cache --- hit -> unlock & return
        - lookup journal cache --- hit -> unlock & goto update
        - up_read(nat_tree_lock)
      update:
        - down_write(nat_tree_lock)
        - cache_nat_entry
         - lookup nat cache --- nohit -> update
        - up_write(nat_tree_lock)
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      1515aef0
    • C
      f2fs: split journal cache from curseg cache · b7ad7512
      Chao Yu 提交于
      In curseg cache, f2fs caches two different parts:
       - datas of current summay block, i.e. summary entries, footer info.
       - journal info, i.e. sparse nat/sit entries or io stat info.
      
      With this approach, 1) it may cause higher lock contention when we access
      or update both of the parts of cache since we use the same mutex lock
      curseg_mutex to protect the cache. 2) current summary block with last
      journal info will be writebacked into device as a normal summary block
      when flushing, however, we treat journal info as valid one only in current
      summary, so most normal summary blocks contain junk journal data, it wastes
      remaining space of summary block.
      
      So, in order to fix above issues, we split curseg cache into two parts:
      a) current summary block, protected by original mutex lock curseg_mutex
      b) journal cache, protected by newly introduced r/w semaphore journal_rwsem
      
      When loading curseg cache during ->mount, we store summary info and
      journal info into different caches; When doing checkpoint, we combine
      datas of two cache into current summary block for persisting.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      b7ad7512
    • C
      f2fs: introduce f2fs_journal struct to wrap journal info · dfc08a12
      Chao Yu 提交于
      Introduce a new structure f2fs_journal to wrap journal info in struct
      f2fs_summary_block for readability.
      
      struct f2fs_journal {
      	union {
      		__le16 n_nats;
      		__le16 n_sits;
      	};
      	union {
      		struct nat_journal nat_j;
      		struct sit_journal sit_j;
      		struct f2fs_extra_info info;
      	};
      } __packed;
      
      struct f2fs_summary_block {
      	struct f2fs_summary entries[ENTRIES_IN_SUM];
      	struct f2fs_journal journal;
      	struct summary_footer footer;
      } __packed;
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      dfc08a12
    • Y
      f2fs: fix missing skip pages info · d31c7c3f
      Yunlei He 提交于
      fix missing skip pages info in f2fs_writepages trace event.
      Signed-off-by: NYunlei He <heyunlei@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      d31c7c3f
    • C
      f2fs: introduce f2fs_submit_merged_bio_cond · 0c3a5797
      Chao Yu 提交于
      f2fs use single bio buffer per type data (META/NODE/DATA) for caching
      writes locating in continuous block address as many as possible, after
      submitting, these writes may be still cached in bio buffer, so we have
      to flush cached writes in bio buffer by calling f2fs_submit_merged_bio.
      
      Unfortunately, in the scenario of high concurrency, bio buffer could be
      flushed by someone else before we submit it as below reasons:
      a) there is no space in bio buffer.
      b) add a request of different type (SYNC, ASYNC).
      c) add a discontinuous block address.
      
      For this condition, f2fs_submit_merged_bio will be devastating, because
      it could break the following merging of writes in bio buffer, split one
      big bio into two smaller one.
      
      This patch introduces f2fs_submit_merged_bio_cond which can do a
      conditional submitting with bio buffer, before submitting it will judge
      whether:
       - page in DATA type bio buffer is matching with specified page;
       - page in DATA type bio buffer is belong to specified inode;
       - page in NODE type bio buffer is belong to specified inode;
      If there is no eligible page in bio buffer, we will skip submitting step,
      result in gaining more chance to merge consecutive block IOs in bio cache.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      0c3a5797
    • J
      f2fs: wait on page's writeback in writepages path · fa3d2bdf
      Jaegeuk Kim 提交于
      Likewise f2fs_write_cache_pages, let's do for node and meta pages too.
      Especially, for node blocks, we should do this before marking its fsync
      and dentry flags.
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      fa3d2bdf
    • C
      f2fs: introduce get_next_page_offset to speed up SEEK_DATA · 3cf45747
      Chao Yu 提交于
      When seeking data in ->llseek, if we encounter a big hole which covers
      several dnode pages, we will try to seek data from index of page which
      is the first page of next dnode page, at most we could skip searching
      (ADDRS_PER_BLOCK - 1) pages.
      
      However it's still not efficient, because if our indirect/double-indirect
      pointer are NULL, there are no dnode page locate in the tree indirect/
      double-indirect pointer point to, it's not necessary to search the whole
      region.
      
      This patch introduces get_next_page_offset to calculate next page offset
      based on current searching level and max searching level returned from
      get_dnode_of_data, with this, we could skip searching the entire area
      indirect or double-indirect node block is not exist.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      3cf45747
    • C
      f2fs: remove unneeded pointer conversion · 81ca7350
      Chao Yu 提交于
      There are redundant pointer conversion in following call stack:
       - at position a, inode was been converted to f2fs_file_info.
       - at position b, f2fs_file_info was been converted to inode again.
      
       - truncate_blocks(inode,..)
        - fi = F2FS_I(inode)		---a
        - ADDRS_PER_PAGE(node_page, fi)
         - addrs_per_inode(fi)
          - inode = &fi->vfs_inode	---b
          - f2fs_has_inline_xattr(inode)
           - fi = F2FS_I(inode)
           - is_inode_flag_set(fi,..)
      
      In order to avoid unneeded conversion, alter ADDRS_PER_PAGE and
      addrs_per_inode to acept parameter with type of inode pointer.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      81ca7350
    • J
      f2fs: use wait_for_stable_page to avoid contention · fec1d657
      Jaegeuk Kim 提交于
      In write_begin, if storage supports stable_page, we don't need to wait for
      writeback to update its contents.
      This patch introduces to use wait_for_stable_page instead of
      wait_on_page_writeback.
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      fec1d657
    • J
      f2fs: avoid multiple node page writes due to inline_data · 2049d4fc
      Jaegeuk Kim 提交于
      The sceanrio is:
      1. create fully node blocks
      2. flush node blocks
      3. write inline_data for all the node blocks again
      4. flush node blocks redundantly
      
      So, this patch tries to flush inline_data when flushing node blocks.
      Reviewed-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      2049d4fc
    • C
      f2fs: export dirty_nats_ratio in sysfs · 2304cb0c
      Chao Yu 提交于
      This patch exports a new sysfs entry 'dirty_nat_ratio' to control threshold
      of dirty nat entries, if current ratio exceeds configured threshold,
      checkpoint will be triggered in f2fs_balance_fs_bg for flushing dirty nats.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      2304cb0c
  10. 12 1月, 2016 1 次提交
  11. 09 1月, 2016 3 次提交
  12. 07 1月, 2016 2 次提交
  13. 01 1月, 2016 1 次提交
    • J
      f2fs: write pending bios when cp_error is set · 8d4ea29b
      Jaegeuk Kim 提交于
      When testing ioc_shutdown, put_super is able to be hanged by waiting for
      writebacking pages as follows.
      
      INFO: task umount:2723 blocked for more than 120 seconds.
            Tainted: G           O    4.4.0-rc3+ #8
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      umount          D ffff88000859f9d8     0  2723   2110 0x00000000
       ffff88000859f9d8 0000000000000000 0000000000000000 ffffffff81e11540
       ffff880078c225c0 ffff8800085a0000 ffff88007fc17440 7fffffffffffffff
       ffffffff818239f0 ffff88000859fb48 ffff88000859f9f0 ffffffff8182310c
      Call Trace:
       [<ffffffff818239f0>] ? bit_wait+0x50/0x50
       [<ffffffff8182310c>] schedule+0x3c/0x90
       [<ffffffff81827fb9>] schedule_timeout+0x2d9/0x430
       [<ffffffff810e0f8f>] ? mark_held_locks+0x6f/0xa0
       [<ffffffff8111614d>] ? ktime_get+0x7d/0x140
       [<ffffffff818239f0>] ? bit_wait+0x50/0x50
       [<ffffffff8106a655>] ? kvm_clock_get_cycles+0x25/0x30
       [<ffffffff8111617c>] ? ktime_get+0xac/0x140
       [<ffffffff818239f0>] ? bit_wait+0x50/0x50
       [<ffffffff81822564>] io_schedule_timeout+0xa4/0x110
       [<ffffffff81823a25>] bit_wait_io+0x35/0x50
       [<ffffffff818235bd>] __wait_on_bit+0x5d/0x90
       [<ffffffff811b9e8b>] wait_on_page_bit+0xcb/0xf0
       [<ffffffff810d5f90>] ? autoremove_wake_function+0x40/0x40
       [<ffffffff811cf84c>] truncate_inode_pages_range+0x4bc/0x840
       [<ffffffff811cfc3d>] truncate_inode_pages_final+0x4d/0x60
       [<ffffffffc023ced5>] f2fs_evict_inode+0x75/0x400 [f2fs]
       [<ffffffff812639bc>] evict+0xbc/0x190
       [<ffffffff81263d19>] iput+0x229/0x2c0
       [<ffffffffc0241885>] f2fs_put_super+0x105/0x1a0 [f2fs]
       [<ffffffff8124756a>] generic_shutdown_super+0x6a/0xf0
       [<ffffffff812478f7>] kill_block_super+0x27/0x70
       [<ffffffffc0241290>] kill_f2fs_super+0x20/0x30 [f2fs]
       [<ffffffff81247b03>] deactivate_locked_super+0x43/0x70
       [<ffffffff81247f4c>] deactivate_super+0x5c/0x60
       [<ffffffff81268d2f>] cleanup_mnt+0x3f/0x90
       [<ffffffff81268dc2>] __cleanup_mnt+0x12/0x20
       [<ffffffff810ac463>] task_work_run+0x73/0xa0
       [<ffffffff810032ac>] exit_to_usermode_loop+0xcc/0xd0
       [<ffffffff81003e7c>] syscall_return_slowpath+0xcc/0xe0
       [<ffffffff81829ea2>] int_ret_from_sys_call+0x25/0x9f
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      8d4ea29b
  14. 31 12月, 2015 3 次提交
  15. 23 12月, 2015 1 次提交
  16. 15 12月, 2015 1 次提交
  17. 13 10月, 2015 1 次提交
    • C
      f2fs: export ra_nid_pages to sysfs · ea1a29a0
      Chao Yu 提交于
      After finishing building free nid cache, we will try to readahead
      asynchronously 4 more pages for the next reloading, the count of
      readahead nid pages is fixed.
      
      In some case, like SMR drive, read less sectors with fixed count
      each time we trigger RA may be low efficient, since we will face
      high seeking overhead, so we'd better let user to configure this
      parameter from sysfs in specific workload.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      ea1a29a0