  1. 10 Jan, 2015 5 commits
  2. 09 Dec, 2014 1 commit
  3. 06 Dec, 2014 1 commit
    • f2fs: call radix_tree_preload before radix_tree_insert · 769ec6e5
      Jaegeuk Kim committed
      This patch tries to fix:
      
       BUG: using smp_processor_id() in preemptible [00000000] code: f2fs_gc-254:0/384
        (radix_tree_node_alloc+0x14/0x74) from [<c033d8a0>] (radix_tree_insert+0x110/0x200)
        (radix_tree_insert+0x110/0x200) from [<c02e8264>] (gc_data_segment+0x340/0x52c)
        (gc_data_segment+0x340/0x52c) from [<c02e8658>] (f2fs_gc+0x208/0x400)
        (f2fs_gc+0x208/0x400) from [<c02e8a98>] (gc_thread_func+0x248/0x28c)
        (gc_thread_func+0x248/0x28c) from [<c0139944>] (kthread+0xa0/0xac)
        (kthread+0xa0/0xac) from [<c0105ef8>] (ret_from_fork+0x14/0x3c)
      
      The reason is that f2fs calls radix_tree_insert with preemption enabled.
      So, before calling it, we need to call radix_tree_preload.

      Otherwise, we should use __GFP_WAIT for the radix tree, and use a mutex or
      semaphore to cover the radix tree operations.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  4. 04 Dec, 2014 2 commits
  5. 26 Nov, 2014 3 commits
  6. 20 Nov, 2014 2 commits
    • f2fs: submit bio for node blocks in the reclaim path · 27c6bd60
      Jaegeuk Kim committed
      If a node page is requested to be written during the reclaim path, we should
      submit the bio so that reclaiming the page is not left pending.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: introduce struct inode_management to wrap inner fields · 67298804
      Chao Yu committed
      f2fs currently has three inode caches: ORPHAN_INO, APPEND_INO, and UPDATE_INO,
      and the fields for each cache type are managed separately in struct
      f2fs_sb_info.
      This makes the code a bit messy, so this patch introduces a new struct
      inode_management that wraps those fields, as follows, to make the code neater.
      
      /* for inner inode cache management */
      struct inode_management {
      	struct radix_tree_root ino_root;	/* ino entry array */
      	spinlock_t ino_lock;			/* for ino entry lock */
      	struct list_head ino_list;		/* inode list head */
      	unsigned long ino_num;			/* number of entries */
      };
      
      struct f2fs_sb_info {
      	...
      	struct inode_management im[MAX_INO_ENTRY];      /* manage inode cache */
      	...
      };
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
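As a rough illustration of why the wrapper helps, the userspace sketch below keeps one bookkeeping struct per cache type in an array, so a single loop or helper serves all three types. It is not the kernel code: `ino_mgmt_sketch`, `sbi_sketch`, and both helper names are hypothetical, the enum layout is an assumption, and the radix tree, spinlock, and list are left out so that only the entry counter remains.

```c
#include <assert.h>

/* Cache types named in the commit message; the ordering and the
 * MAX_INO_ENTRY terminator placement are assumptions of this sketch. */
enum { ORPHAN_INO, APPEND_INO, UPDATE_INO, MAX_INO_ENTRY };

/* Hypothetical reduced stand-in for struct inode_management:
 * only the entry counter is modelled. */
struct ino_mgmt_sketch {
	unsigned long ino_num;	/* number of entries */
};

/* Hypothetical stand-in for the relevant part of struct f2fs_sb_info. */
struct sbi_sketch {
	struct ino_mgmt_sketch im[MAX_INO_ENTRY];
};

/* One loop initialises every cache type. */
static void init_ino_mgmt_sketch(struct sbi_sketch *sbi)
{
	int i;

	for (i = 0; i < MAX_INO_ENTRY; i++)
		sbi->im[i].ino_num = 0;
}

/* Record one more entry of the given type and return the new count. */
static unsigned long add_ino_sketch(struct sbi_sketch *sbi, int type)
{
	assert(type >= 0 && type < MAX_INO_ENTRY);
	return ++sbi->im[type].ino_num;
}
```

Because the type is just an array index, adding a fourth cache type in this model would only require a new enum value before `MAX_INO_ENTRY`.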
  7. 10 Nov, 2014 1 commit
  8. 07 Nov, 2014 1 commit
  9. 01 Oct, 2014 1 commit
    • f2fs: refactor flush_nat_entries to remove costly reorganizing ops · 309cc2b6
      Jaegeuk Kim committed
      Previously, f2fs reorganized the dirty nat entries into multiple sets
      according to their nid ranges. This can improve the flushing of nat pages;
      however, if there are a lot of cached nat entries, it becomes a bottleneck.
      
      This patch introduces a new set management flow by removing dirty nat list and
      adding a series of set operations when the nat entry becomes dirty.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  10. 24 Sep, 2014 3 commits
    • f2fs: use MAX_BIO_BLOCKS(sbi) · 90a893c7
      Jaegeuk Kim committed
      This patch cleans up a simple macro.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: fix conditions to remain recovery information in f2fs_sync_file · 88bd02c9
      Jaegeuk Kim committed
      This patch revisits all of the recovery information handled in f2fs_sync_file.

      Three pieces of information are used to make a decision:
      
      a) IS_CHECKPOINTED,	/* is it checkpointed before? */
      b) HAS_FSYNCED_INODE,	/* is the inode fsynced before? */
      c) HAS_LAST_FSYNC,	/* has the latest node fsync mark? */
      
      And, the scenarios for our rule are based on:
      
      [Term] F: fsync_mark, D: dentry_mark
      
      1. inode(x) | CP | inode(x) | dnode(F)
      2. inode(x) | CP | inode(F) | dnode(F)
      3. inode(x) | CP | dnode(F) | inode(x) | inode(F)
      4. inode(x) | CP | dnode(F) | inode(F)
      5. CP | inode(x) | dnode(F) | inode(DF)
      6. CP | inode(DF) | dnode(F)
      7. CP | dnode(F) | inode(DF)
      8. CP | dnode(F) | inode(x) | inode(DF)
      
      For example, #3, the three conditions should be changed as follows.
      
         inode(x) | CP | dnode(F) | inode(x) | inode(F)
      a)    x       o      o          o          o
      b)    x       x      x          x          o
      c)    x       o      o          x          o
      
      If f2fs_sync_file stops   ------^,
       it should write inode(F)    --------------^
      
      So, the need_inode_block_update should return true, since
       c) get_nat_flag(e, HAS_LAST_FSYNC), is false.
      
      For example, #8,
            CP | alloc | dnode(F) | inode(x) | inode(DF)
      a)    o      x        x          x          x
      b)    x               x          x          o
      c)    o               o          x          o
      
      If f2fs_sync_file stops   -------^,
       it should write inode(DF)    --------------^
      
      Note that the roll-forward policy should follow this rule, which means that
      if any blocks are missing, we don't need to recover that inode.
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: introduce a flag to represent each nat entry information · 7ef35e3b
      Jaegeuk Kim committed
      This patch introduces a flag in the nat entry structure to merge various
      information such as checkpointed and fsync_done marks.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
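One common way to merge several boolean marks into a single field is a bitmask with small set/get helpers. The userspace sketch below assumes that approach and reuses the three state names listed in the 88bd02c9 entry above; `nat_entry_sketch`, `set_flag_sketch`, and `get_flag_sketch` are hypothetical names, and the layout of the real nat entry structure is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit positions, named after the three states listed in 88bd02c9. */
enum nat_state_sketch {
	IS_CHECKPOINTED,	/* is it checkpointed before? */
	HAS_FSYNCED_INODE,	/* is the inode fsynced before? */
	HAS_LAST_FSYNC,		/* has the latest node fsync mark? */
};

/* Hypothetical reduced nat entry: only the flag byte is modelled. */
struct nat_entry_sketch {
	unsigned char flag;
};

/* Set or clear the bit that corresponds to one state. */
static void set_flag_sketch(struct nat_entry_sketch *ne,
			    unsigned int type, bool set)
{
	unsigned char mask;

	assert(type <= HAS_LAST_FSYNC);
	mask = (unsigned char)(1u << type);
	if (set)
		ne->flag |= mask;
	else
		ne->flag &= (unsigned char)~mask;
}

/* Report whether the bit for one state is currently set. */
static bool get_flag_sketch(const struct nat_entry_sketch *ne,
			    unsigned int type)
{
	assert(type <= HAS_LAST_FSYNC);
	return ((ne->flag >> type) & 1u) != 0;
}
```

With this layout, a further state costs one more enum value rather than a new struct member.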
  11. 10 Sep, 2014 2 commits
    • f2fs: refactor flush_sit_entries codes for reducing SIT writes · 184a5cd2
      Chao Yu committed
      In commit aec71382 ("f2fs: refactor flush_nat_entries codes for reducing NAT
      writes"), we described the issue as below:

      "Although building the NAT journal in cursum reduces the read/write work for NAT
      blocks, the previous design leaves us with lower performance when checkpoints are
      written frequently, in these cases:
      1. if the journal in cursum is already full, it is a bit of a waste that we flush
         all nat entries to pages for persistence without caching any entries.
      2. if the journal in cursum is not full, we fill nat entries into the journal until
         it is full, then flush the remaining dirty entries to disk without merging the
         journaled entries, so these journaled entries may be flushed to disk at the next
         checkpoint, having lost the chance to be flushed last time."
      
      Actually, we have the same problem in using SIT journal area.
      
      In this patch, we first update the sit journal with as many dirty entries as
      possible. Second, if there is no space left in the sit journal, we remove all
      entries from the journal and walk through the whole dirty entry bitmap of sit,
      grouping dirty sit entries located in the same SIT block into a sit entry set.
      All entry sets are linked into the list sit_entry_set in sm_info, sorted in
      ascending order by the number of entries in each set. Later we flush the sets
      with the fewest entries into the journal, as many as we can, and then flush
      the dense sets with merged entries to disk.

      In this way we use the sit journal area more effectively and reduce SIT
      updates, which improves performance and extends the lifetime of the flash
      device.

      In my testing environment, this patch noticeably reduces SIT block updates.
      
      virtual machine + hard disk:
      fsstress -p 20 -n 400 -l 5
      		sit page num	cp count	sit pages/cp
      based		2006.50		1349.75		1.486
      patched		1566.25		1463.25		1.070
      
      Our latency of merging op is small when handling a great number of dirty SIT
      entries in flush_sit_entries:
      latency(ns)	dirty sit count
      36038		2151
      49168		2123
      37174		2232
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
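The flush ordering described in this commit (sort the per-block sets by entry count, place the sparsest sets in the journal while space remains, write the dense sets to disk) can be sketched in userspace C as follows. This is a simplified model under stated assumptions, not the kernel implementation: `set_sketch`, `plan_flush_sketch`, and the idea of expressing journal space as a plain entry count are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of one entry set: all dirty entries that
 * live in the same on-disk block. */
struct set_sketch {
	unsigned int block;	/* which block the entries belong to */
	unsigned int count;	/* number of dirty entries in the set */
	int to_journal;		/* 1 = placed in journal, 0 = written to disk */
};

/* Order sets by ascending entry count. */
static int cmp_count_sketch(const void *a, const void *b)
{
	const struct set_sketch *x = a;
	const struct set_sketch *y = b;

	return (x->count > y->count) - (x->count < y->count);
}

/* Sort the sets, fill the journal with the sparsest ones, and
 * return how many block writes the remaining (dense) sets need. */
static unsigned int plan_flush_sketch(struct set_sketch *sets, size_t n,
				      unsigned int journal_free)
{
	unsigned int block_writes = 0;
	size_t i;

	assert(sets != NULL || n == 0);
	qsort(sets, n, sizeof(*sets), cmp_count_sketch);
	for (i = 0; i < n; i++) {
		if (sets[i].count <= journal_free) {
			sets[i].to_journal = 1;
			journal_free -= sets[i].count;
		} else {
			sets[i].to_journal = 0;
			block_writes++;	/* one write covers the whole set */
		}
	}
	return block_writes;
}
```

Sorting first means each journal slot spent avoids as many block writes as possible: many small sets fit into the journal, while a dense set costs one block write regardless of how many entries it merges.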
    • f2fs: need fsck.f2fs when f2fs_bug_on is triggered · 9850cf4a
      Jaegeuk Kim committed
      If any f2fs_bug_on is triggered, fsck.f2fs is needed.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  12. 04 Sep, 2014 1 commit
  13. 26 Aug, 2014 1 commit
  14. 22 Aug, 2014 4 commits
    • f2fs: fix incorrect calculation with total/free inode num · c200b1aa
      Chao Yu committed
      Theoretically, our total inode number is the same as the total node number, but
      three node ids are reserved in f2fs: 0, 1 (node nid), and 2 (meta nid). They
      should never be used by users, so the total/free inode numbers calculated in
      ->statfs are wrong.

      This patch introduces F2FS_RESERVED_NODE_NUM and fixes this issue by
      recalculating the total/free inode numbers with the macro.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
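The arithmetic behind this fix can be sketched as below. The value 3 follows from the three reserved node ids named in the commit message; the function names, the `_SKETCH` macro name, and the exact formula for the free count (total usable inodes minus inodes in use) are assumptions of this sketch rather than the patch itself.

```c
#include <assert.h>

/* Node ids 0, 1 (node nid) and 2 (meta nid) are reserved, per the
 * commit message; the macro name here is a hypothetical stand-in. */
#define RESERVED_NODE_NUM_SKETCH 3UL

/* Inodes a user could ever have: every node id except the reserved ones. */
static unsigned long total_inodes_sketch(unsigned long total_node_count)
{
	assert(total_node_count >= RESERVED_NODE_NUM_SKETCH);
	return total_node_count - RESERVED_NODE_NUM_SKETCH;
}

/* Assumed formula: usable inodes minus the inodes already in use. */
static unsigned long free_inodes_sketch(unsigned long total_node_count,
					unsigned long valid_inode_count)
{
	unsigned long total = total_inodes_sketch(total_node_count);

	assert(valid_inode_count <= total);
	return total - valid_inode_count;
}
```

For example, with 1000 node ids and 10 inodes in use, this model reports 997 total and 987 free inodes, where the uncorrected calculation would have reported 1000 and 990.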
    • f2fs: remove rewrite_node_page · 202095a7
      Jaegeuk Kim committed
      I think we need to let the dirty node pages remain in the page cache instead
      of rewriting them in their places.
      So, after done with successful recovery, write_checkpoint will flush all of them
      through the normal write path.
      Through this, we can avoid potential error cases in terms of block allocation.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: handle EIO not to break fs consistency · cf779cab
      Jaegeuk Kim committed
      Two rules apply when EIO occurs:
      1. don't write any checkpoint data to preserve the previous checkpoint
      2. don't lose the cached dentry/node/meta pages

      So, at first, this patch adds set_page_dirty in the failure path of
      f2fs_write_end_io. Then, writing checkpoint/dentry/node blocks is not allowed.
      
      Note that, for the data pages, we can't just throw away by redirtying them.
      Otherwise, kworker can fall into infinite loop to flush them.
      (Ref. xfstests/019)
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: unlock_page when node page is redirtied out · 52746519
      Jaegeuk Kim committed
      This patch fixes a missing unlock_page when a node page is redirtied out.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  15. 20 Aug, 2014 4 commits
  16. 02 Aug, 2014 1 commit
  17. 31 Jul, 2014 1 commit
  18. 10 Jul, 2014 2 commits
    • f2fs: refactor flush_nat_entries codes for reducing NAT writes · aec71382
      Chao Yu committed
      Although building the NAT journal in cursum reduces the read/write work for NAT
      blocks, the previous design leaves us with lower performance when checkpoints are
      written frequently, in these cases:
      1. if the journal in cursum is already full, it is a bit of a waste that we flush
         all nat entries to pages for persistence without caching any entries.
      2. if the journal in cursum is not full, we fill nat entries into the journal until
         it is full, then flush the remaining dirty entries to disk without merging the
         journaled entries, so these journaled entries may be flushed to disk at the next
         checkpoint, having lost the chance to be flushed last time.

      In this patch we merge dirty entries located in the same NAT block into a nat
      entry set, and link all sets into a list, sorted in ascending order by the
      number of entries in each set. Later we flush entries in sparse sets into the
      journal, as many as we can, and then flush the merged entries to disk. In this
      way we not only gain performance but also extend the lifetime of the flash device.

      In my testing environment, this patch noticeably reduces NAT block writes. In
      the hard disk test case, the run time of fsstress is consistently reduced by
      about 5%.
      
      1. virtual machine + hard disk:
      fsstress -p 20 -n 200 -l 5
      		node num	cp count	nodes/cp
      based		4599.6		1803.0		2.551
      patched		2714.6		1829.6		1.483
      
      2. virtual machine + 32g micro SD card:
      fsstress -p 20 -n 200 -l 1 -w -f chown=0 -f creat=4 -f dwrite=0
      -f fdatasync=4 -f fsync=4 -f link=0 -f mkdir=4 -f mknod=4 -f rename=5
      -f rmdir=5 -f symlink=0 -f truncate=4 -f unlink=5 -f write=0 -S
      
      		node num	cp count	nodes/cp
      based		84.5		43.7		1.933
      patched		49.2		40.0		1.23
      
      The latency of the merging operation remains acceptable even in extreme cases,
      such as merging a great number of dirty nats:
      latency(ns)	dirty nat count
      3089219		24922
      5129423		27422
      4000250		24523
      
      change log from v1:
       o fix wrong logic in add_nat_entry when grab a new nat entry set.
       o switch to creating the slab cache in create_node_manager_caches.
       o use GFP_ATOMIC instead of GFP_NOFS to avoid potential long latency.
      
      change log from v2:
       o make comment position more appropriate suggested by Jaegeuk Kim.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: clean up an unused parameter and assignment · a014e037
      Jaegeuk Kim committed
      This patch removes some simple unnecessary code.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  19. 09 Jul, 2014 1 commit
  20. 05 Jun, 2014 1 commit
    • mm: non-atomically mark page accessed during page cache allocation where possible · 2457aec6
      Mel Gorman committed
      aops->write_begin may allocate a new page and make it visible only to have
      mark_page_accessed called almost immediately after.  Once the page is
      visible the atomic operations are necessary which is noticeable overhead
      when writing to an in-memory filesystem like tmpfs but should also be
      noticeable with fast storage.  The objective of the patch is to initialise
      the accessed information with non-atomic operations before the page is
      visible.
      
      The bulk of filesystems directly or indirectly use
      grab_cache_page_write_begin or find_or_create_page for the initial
      allocation of a page cache page.  This patch adds an init_page_accessed()
      helper which behaves like the first call to mark_page_accessed() but may
      be called before the page is visible and can be done non-atomically.

      The primary APIs of concern in this case are the following and are used
      by most filesystems.
      
      	find_get_page
      	find_lock_page
      	find_or_create_page
      	grab_cache_page_nowait
      	grab_cache_page_write_begin
      
      All of them are very similar in detail so the patch creates a core helper
      pagecache_get_page() which takes a flags parameter that affects its
      behavior such as whether the page should be marked accessed or not.  The
      old API is preserved but is basically a thin wrapper around this core
      function.

      Each of the filesystems is then updated to avoid calling
      mark_page_accessed when it is known that the VM interfaces have already
      done the job.  There is a slight snag in that the timing of the
      mark_page_accessed() has now changed so in rare cases it's possible a page
      gets to the end of the LRU as PageReferenced whereas previously it might
      have been repromoted.  This is expected to be rare but it's worth the
      filesystem people thinking about it in case they see a problem with the
      timing change.  It is also the case that some filesystems may be marking
      pages accessed that previously did not but it makes sense that filesystems
      have consistent behaviour in this regard.
      
      The test case used to evaluate this is a simple dd of a large file done
      multiple times with the file deleted on each iteration.  The size of the
      file is 1/10th physical memory to avoid dirty page balancing.  In the
      async case it will be possible that the workload completes without even
      hitting the disk and will have variable results but highlight the impact
      of mark_page_accessed for async IO.  The sync results are expected to be
      more stable.  The exception is tmpfs where the normal case is for the "IO"
      to not hit the disk.
      
      The test machine was single socket and UMA to avoid any scheduling or NUMA
      artifacts.  Throughput and wall times are presented for sync IO, only wall
      times are shown for async as the granularity reported by dd and the
      variability is unsuitable for comparison.  As async results were variable
      due to writeback timings, I'm only reporting the maximum figures.  The sync
      results were stable enough to make the mean and stddev uninteresting.
      
      The performance results are reported based on a run with no profiling.
      Profile data is based on a separate run with oprofile running.
      
      async dd
                                          3.15.0-rc3            3.15.0-rc3
                                             vanilla           accessed-v2
      ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
      tmpfs   Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
      btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
      ext4    Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
      xfs     Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)
      
      The XFS figure is a bit strange as it managed to avoid a worst case by
      sheer luck but the average figures looked reasonable.
      
              samples percentage
      ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      
      [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
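The structure described in this commit, one core lookup helper driven by a flags parameter with the older entry points kept as thin wrappers, can be illustrated with the toy userspace model below. Nothing here is the kernel API: `toy_cache`, `toy_get_page`, the `TOY_*` flag names, and the wrapper names are hypothetical, and locking and atomic operations are not modelled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical behaviour flags for the core helper. */
#define TOY_ACCESSED	0x1u	/* mark the page accessed */
#define TOY_CREAT	0x2u	/* create the page if it is missing */

#define TOY_CACHE_SLOTS	8

struct toy_page {
	int present;	/* slot is visible in the cache */
	int accessed;	/* accessed mark */
};

struct toy_cache {
	struct toy_page slots[TOY_CACHE_SLOTS];
};

static void toy_cache_init(struct toy_cache *c)
{
	assert(c != NULL);
	memset(c, 0, sizeof(*c));
}

/* Core helper: every lookup variant funnels through here. */
static struct toy_page *toy_get_page(struct toy_cache *c, size_t index,
				     unsigned int flags)
{
	struct toy_page *p;

	assert(c != NULL);
	if (index >= TOY_CACHE_SLOTS)
		return NULL;
	p = &c->slots[index];
	if (!p->present) {
		if (!(flags & TOY_CREAT))
			return NULL;
		/* New slot: set the accessed state first, then publish it. */
		p->accessed = (flags & TOY_ACCESSED) != 0;
		p->present = 1;
		return p;
	}
	if (flags & TOY_ACCESSED)
		p->accessed = 1;
	return p;
}

/* Thin wrappers that preserve the older, flag-free call style. */
static struct toy_page *toy_find_get_page(struct toy_cache *c, size_t index)
{
	return toy_get_page(c, index, 0);
}

static struct toy_page *toy_find_or_create_page(struct toy_cache *c,
						size_t index)
{
	return toy_get_page(c, index, TOY_CREAT | TOY_ACCESSED);
}
```

In this model the create path sets the accessed mark before the slot becomes visible, which mirrors the ordering the commit aims for, although the toy gives no concurrency guarantee of its own.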
  21. 04 Jun, 2014 2 commits
    • f2fs: fix to recover data written by dio · b6fe5873
      Jaegeuk Kim committed
      If data are overwritten through dio, f2fs previously did not retain the fsync
      mark because no additional node writes occur.

      Note that this patch should resolve xfstests:311.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: avoid crash when trace f2fs_submit_page_mbio event in ra_sum_pages · bac4eef6
      Chao Yu committed
      Previously we allocated pages with no mapping in ra_sum_pages(), so we could
      encounter a crash in the event trace of f2fs_submit_page_mbio, where we access
      the mapping data of the page.

      We'd better allocate pages in the bd_inode mapping and invalidate these pages
      after we restore data from them. This avoids the crash in the above scenario.
      
      Changes from V1
       o remove redundant code in ra_sum_pages() suggested by Jaegeuk Kim.
      
      Call Trace:
       [<f1031630>] ? ftrace_raw_event_f2fs_write_checkpoint+0x80/0x80 [f2fs]
       [<f10377bb>] f2fs_submit_page_mbio+0x1cb/0x200 [f2fs]
       [<f103c5da>] restore_node_summary+0x13a/0x280 [f2fs]
       [<f103e22d>] build_curseg+0x2bd/0x620 [f2fs]
       [<f104043b>] build_segment_manager+0x1cb/0x920 [f2fs]
       [<f1032c85>] f2fs_fill_super+0x535/0x8e0 [f2fs]
       [<c115b66a>] mount_bdev+0x16a/0x1a0
       [<f102f63f>] f2fs_mount+0x1f/0x30 [f2fs]
       [<c115c096>] mount_fs+0x36/0x170
       [<c1173635>] vfs_kern_mount+0x55/0xe0
       [<c1175388>] do_mount+0x1e8/0x900
       [<c1175d72>] SyS_mount+0x82/0xc0
       [<c16059cc>] sysenter_do_call+0x12/0x22
      Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>