1. 11 4月, 2015 1 次提交
  2. 04 3月, 2015 2 次提交
    • C
      f2fs: enable rb-tree extent cache · 1dcc336b
      Chao Yu 提交于
      This patch enables rb-tree based extent cache in f2fs.
      
      When we mount with "-o extent_cache", f2fs will try to add recently accessed
      page-block mappings into rb-tree based extent cache as much as possible, instead
      of original one extent info cache.
      
      By this way, f2fs can support more effective cache between dnode page cache and
      disk. It will supply high hit ratio in the cache with fewer memory when dnode
      page cache are reclaimed in environment of low memory.
      
      Storage: Sandisk sd card 64g
      1.append write file (offset: 0, size: 128M);
      2.override write file (offset: 2M, size: 1M);
      3.override write file (offset: 4M, size: 1M);
      ...
      4.override write file (offset: 48M, size: 1M);
      ...
      5.override write file (offset: 112M, size: 1M);
      6.sync
      7.echo 3 > /proc/sys/vm/drop_caches
      8.read file (size:128M, unit: 4k, count: 32768)
      (time dd if=/mnt/f2fs/128m bs=4k count=32768)
      
      Extent Hit Ratio:
      		before		patched
      Hit Ratio	121 / 1071	1071 / 1071
      
      Performance:
      		before		patched
      real    	0m37.051s	0m35.556s
      user    	0m0.040s	0m0.026s
      sys     	0m2.990s	0m2.251s
      
      Memory Cost:
      		before		patched
      Tree Count:	0		1 (size: 24 bytes)
      Node Count:	0		45 (size: 1440 bytes)
      
      v3:
       o retest and given more details of test result.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      1dcc336b
    • C
      f2fs: move ext_lock out of struct extent_info · 0c872e2d
      Chao Yu 提交于
      Move ext_lock out of struct extent_info, then in the following patches we can
      use variables with struct extent_info type as a parameter to pass pure data.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      0c872e2d
  3. 10 1月, 2015 2 次提交
    • C
      f2fs: get rid of kzalloc in __recover_inline_status · 9e5ba77f
      Chao Yu 提交于
      We use kzalloc to allocate memory in __recover_inline_status, and use this
      all-zero memory to check the inline date content of inode page by comparing
      them. This is low effective and not needed, let's check inline date content
      directly.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      [Jaegeuk Kim: make the code more neat]
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      9e5ba77f
    • J
      f2fs: change atomic and volatile write policies · 1e84371f
      Jaegeuk Kim 提交于
      This patch adds two new ioctls to release inmemory pages grabbed by atomic
      writes.
       o f2fs_ioc_abort_volatile_write
        - If transaction was failed, all the grabbed pages and data should be written.
       o f2fs_ioc_release_volatile_write
        - This is to enhance the performance of PERSIST mode in sqlite.
      
      In order to avoid huge memory consumption which causes OOM, this patch changes
      volatile writes to use normal dirty pages, instead blocked flushing to the disk
      as long as system does not suffer from memory pressure.
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      1e84371f
  4. 09 12月, 2014 1 次提交
  5. 05 11月, 2014 1 次提交
    • J
      f2fs: revisit inline_data to avoid data races and potential bugs · b3d208f9
      Jaegeuk Kim 提交于
      This patch simplifies the inline_data usage with the following rule.
      1. inline_data is set during the file creation.
      2. If new data is requested to be written ranges out of inline_data,
       f2fs converts that inode permanently.
      3. There is no cases which converts non-inline_data inode to inline_data.
      4. The inline_data flag should be changed under inode page lock.
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      b3d208f9
  6. 04 11月, 2014 3 次提交
  7. 08 10月, 2014 1 次提交
    • J
      f2fs: support volatile operations for transient data · 02a1335f
      Jaegeuk Kim 提交于
      This patch adds support for volatile writes which keep data pages in memory
      until f2fs_evict_inode is called by iput.
      
      For instance, we can use this feature for the sqlite database as follows.
      While supporting atomic writes for main database file, we can keep its journal
      data temporarily in the page cache by the following sequence.
      
      1. open
       -> ioctl(F2FS_IOC_START_VOLATILE_WRITE);
      2. writes
       : keep all the data in the page cache.
      3. flush to the database file with atomic writes
        a. ioctl(F2FS_IOC_START_ATOMIC_WRITE);
        b. writes
        c. ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
      4. close
       -> drop the cached data
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      02a1335f
  8. 07 10月, 2014 1 次提交
    • J
      f2fs: support atomic writes · 88b88a66
      Jaegeuk Kim 提交于
      This patch introduces a very limited functionality for atomic write support.
      In order to support atomic write, this patch adds two ioctls:
       o F2FS_IOC_START_ATOMIC_WRITE
       o F2FS_IOC_COMMIT_ATOMIC_WRITE
      
      The database engine should be aware of the following sequence.
      1. open
       -> ioctl(F2FS_IOC_START_ATOMIC_WRITE);
      2. writes
        : all the written data will be treated as atomic pages.
      3. commit
       -> ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
        : this flushes all the data blocks to the disk, which will be shown all or
        nothing by f2fs recovery procedure.
      4. repeat to #2.
      
      The IO pattens should be:
      
        ,- START_ATOMIC_WRITE                  ,- COMMIT_ATOMIC_WRITE
       CP | D D D D D D | FSYNC | D D D D | FSYNC ...
                            `- COMMIT_ATOMIC_WRITE
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      88b88a66
  9. 01 10月, 2014 1 次提交
  10. 16 9月, 2014 1 次提交
  11. 10 9月, 2014 1 次提交
  12. 04 9月, 2014 1 次提交
  13. 05 8月, 2014 1 次提交
  14. 29 7月, 2014 1 次提交
  15. 25 7月, 2014 1 次提交
    • C
      f2fs: avoid use invalid mapping of node_inode when evict meta inode · dbf20cb2
      Chao Yu 提交于
      Andrey Tsyvarev reported:
      "Using memory error detector reveals the following use-after-free error
      in 3.15.0:
      
      AddressSanitizer: heap-use-after-free in f2fs_evict_inode
      Read of size 8 by thread T22279:
        [<ffffffffa02d8702>] f2fs_evict_inode+0x102/0x2e0 [f2fs]
        [<ffffffff812359af>] evict+0x15f/0x290
        [<     inlined    >] iput+0x196/0x280 iput_final
        [<ffffffff812369a6>] iput+0x196/0x280
        [<ffffffffa02dc416>] f2fs_put_super+0xd6/0x170 [f2fs]
        [<ffffffff81210095>] generic_shutdown_super+0xc5/0x1b0
        [<ffffffff812105fd>] kill_block_super+0x4d/0xb0
        [<ffffffff81210a86>] deactivate_locked_super+0x66/0x80
        [<ffffffff81211c98>] deactivate_super+0x68/0x80
        [<ffffffff8123cc88>] mntput_no_expire+0x198/0x250
        [<     inlined    >] SyS_umount+0xe9/0x1a0 SYSC_umount
        [<ffffffff8123f1c9>] SyS_umount+0xe9/0x1a0
        [<ffffffff81cc8df9>] system_call_fastpath+0x16/0x1b
      
      Freed by thread T3:
        [<ffffffffa02dc337>] f2fs_i_callback+0x27/0x30 [f2fs]
        [<     inlined    >] rcu_process_callbacks+0x2d6/0x930 __rcu_reclaim
        [<     inlined    >] rcu_process_callbacks+0x2d6/0x930 rcu_do_batch
        [<     inlined    >] rcu_process_callbacks+0x2d6/0x930 invoke_rcu_callbacks
        [<     inlined    >] rcu_process_callbacks+0x2d6/0x930 __rcu_process_callbacks
        [<ffffffff810fd266>] rcu_process_callbacks+0x2d6/0x930
        [<ffffffff8107cce2>] __do_softirq+0x142/0x380
        [<ffffffff8107cf50>] run_ksoftirqd+0x30/0x50
        [<ffffffff810b2a87>] smpboot_thread_fn+0x197/0x280
        [<ffffffff810a8238>] kthread+0x148/0x160
        [<ffffffff81cc8d4c>] ret_from_fork+0x7c/0xb0
      
      Allocated by thread T22276:
        [<ffffffffa02dc7dd>] f2fs_alloc_inode+0x2d/0x170 [f2fs]
        [<ffffffff81235e2a>] iget_locked+0x10a/0x230
        [<ffffffffa02d7495>] f2fs_iget+0x35/0xa80 [f2fs]
        [<ffffffffa02e2393>] f2fs_fill_super+0xb53/0xff0 [f2fs]
        [<ffffffff81211bce>] mount_bdev+0x1de/0x240
        [<ffffffffa02dbce0>] f2fs_mount+0x10/0x20 [f2fs]
        [<ffffffff81212a85>] mount_fs+0x55/0x220
        [<ffffffff8123c026>] vfs_kern_mount+0x66/0x200
        [<     inlined    >] do_mount+0x2b4/0x1120 do_new_mount
        [<ffffffff812400d4>] do_mount+0x2b4/0x1120
        [<     inlined    >] SyS_mount+0xb2/0x110 SYSC_mount
        [<ffffffff812414a2>] SyS_mount+0xb2/0x110
        [<ffffffff81cc8df9>] system_call_fastpath+0x16/0x1b
      
      The buggy address ffff8800587866c8 is located 48 bytes inside
        of 680-byte region [ffff880058786698, ffff880058786940)
      
      Memory state around the buggy address:
        ffff880058786100: ffffffff ffffffff ffffffff ffffffff
        ffff880058786200: ffffffff ffffffff ffffffrr rrrrrrrr
        ffff880058786300: rrrrrrrr rrffffff ffffffff ffffffff
        ffff880058786400: ffffffff ffffffff ffffffff ffffffff
        ffff880058786500: ffffffff ffffffff ffffffff fffffffr
       >ffff880058786600: rrrrrrrr rrrrrrrr rrrfffff ffffffff
                                                      ^
        ffff880058786700: ffffffff ffffffff ffffffff ffffffff
        ffff880058786800: ffffffff ffffffff ffffffff ffffffff
        ffff880058786900: ffffffff rrrrrrrr rrrrrrrr rrrr....
        ffff880058786a00: ........ ........ ........ ........
        ffff880058786b00: ........ ........ ........ ........
      Legend:
        f - 8 freed bytes
        r - 8 redzone bytes
        . - 8 allocated bytes
        x=1..7 - x allocated bytes + (8-x) redzone bytes
      
      Investigation shows, that f2fs_evict_inode, when called for
      'meta_inode', uses invalidate_mapping_pages() for 'node_inode'.
      But 'node_inode' is deleted before 'meta_inode' in f2fs_put_super via
      iput().
      
      It seems that in common usage scenario this use-after-free is benign,
      because 'node_inode' remains partially valid data even after
      kmem_cache_free().
      But things may change if, while 'meta_inode' is evicted in one f2fs
      filesystem, another (mounted) f2fs filesystem requests inode from cache,
      and formely
      'node_inode' of the first filesystem is returned."
      
      Nids for both meta_inode and node_inode are reservation, so it's not necessary
      for us to invalidate pages which will never be allocated.
      To fix this issue, let's skipping needlessly invalidating pages for
      {meta,node}_inode in f2fs_evict_inode.
      Reported-by: NAndrey Tsyvarev <tsyvarev@ispras.ru>
      Tested-by: NAndrey Tsyvarev <tsyvarev@ispras.ru>
      Signed-off-by: NGu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      dbf20cb2
  16. 09 7月, 2014 1 次提交
  17. 07 5月, 2014 2 次提交
  18. 04 4月, 2014 1 次提交
    • J
      mm + fs: store shadow entries in page cache · 91b0abe3
      Johannes Weiner 提交于
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91b0abe3
  19. 18 3月, 2014 1 次提交
  20. 27 2月, 2014 1 次提交
    • J
      f2fs: introduce large directory support · 38431545
      Jaegeuk Kim 提交于
      This patch introduces an i_dir_level field to support large directory.
      
      Previously, f2fs maintains multi-level hash tables to find a dentry quickly
      from a bunch of chiild dentries in a directory, and the hash tables consist of
      the following tree structure as below.
      
      In Documentation/filesystems/f2fs.txt,
      
      ----------------------
      A : bucket
      B : block
      N : MAX_DIR_HASH_DEPTH
      ----------------------
      
      level #0   | A(2B)
                 |
      level #1   | A(2B) - A(2B)
                 |
      level #2   | A(2B) - A(2B) - A(2B) - A(2B)
           .     |   .       .       .       .
      level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
           .     |   .       .       .       .
      level #N   | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
      
      But, if we can guess that a directory will handle a number of child files,
      we don't need to traverse the tree from level #0 to #N all the time.
      Since the lower level tables contain relatively small number of dentries,
      the miss ratio of the target dentry is likely to be high.
      
      In order to avoid that, we can configure the hash tables sparsely from level #0
      like this.
      
      level #0   | A(2B) - A(2B) - A(2B) - A(2B)
      
      level #1   | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
           .     |   .       .       .       .
      level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
           .     |   .       .       .       .
      level #N   | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
      
      With this structure, we can skip the ineffective tree searches in lower level
      hash tables.
      
      This patch adds just a facility for this by introducing i_dir_level in
      f2fs_inode.
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      38431545
  21. 17 2月, 2014 1 次提交
  22. 20 1月, 2014 1 次提交
  23. 14 1月, 2014 1 次提交
  24. 06 1月, 2014 1 次提交
  25. 26 12月, 2013 1 次提交
  26. 29 10月, 2013 1 次提交
  27. 18 10月, 2013 1 次提交
  28. 07 10月, 2013 1 次提交
    • G
      f2fs: use rw_sem instead of fs_lock(locks mutex) · e479556b
      Gu Zheng 提交于
      The fs_locks is used to block other ops(ex, recovery) when doing checkpoint.
      And each other operate routine(besides checkpoint) needs to acquire a fs_lock,
      there is a terrible problem here, if these are too many concurrency threads acquiring
      fs_lock, so that they will block each other and may lead to some performance problem,
      but this is not the phenomenon we want to see.
      Though there are some optimization patches introduced to enhance the usage of fs_lock,
      but the thorough solution is using a *rw_sem* to replace the fs_lock.
      Checkpoint routine takes write_sem, and other ops take read_sem, so that we can block
      other ops(ex, recovery) when doing checkpoint, and other ops will not disturb each other,
      this can avoid the problem described above completely.
      Because of the weakness of rw_sem, the above change may introduce a potential problem
      that the checkpoint thread might get starved if other threads are intensively locking
      the read semaphore for I/O.(Pointed out by Xu Jin)
      In order to avoid this, a wait_list is introduced, the appending read semaphore ops
      will be dropped into the wait_list if checkpoint thread is waiting for write semaphore,
      and will be waked up when checkpoint thread gives up write semaphore.
      Thanks to Kim's previous review and test, and will be very glad to see other guys'
      performance tests about this patch.
      
      V2:
        -fix the potential starvation problem.
        -use more suitable func name suggested by Xu Jin.
      Signed-off-by: NGu Zheng <guz.fnst@cn.fujitsu.com>
      [Jaegeuk Kim: adjust minor coding standard]
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      e479556b
  29. 26 8月, 2013 1 次提交
  30. 19 8月, 2013 1 次提交
    • J
      f2fs: avoid writing inode redundantly when creating a file · 92c4342f
      Jin Xu 提交于
      In f2fs_write_inode, updating inode after f2fs_balance_fs is not
      a optimized way in the case that f2fs_gc is performed ahead. The
      inode page will be unnecessarily written out twice, one of which
      is in f2fs_gc->...->sync_node_pages and the other is in
      update_inode_page.
      
      Let's update the inode page in prior to f2fs_balance_fs to avoid
      this.
      
      To reproduce it,
      $ touch file (before this step, should make the device need f2fs_gc)
      $ sync (or wait the bdi to write dirty inode)
      Signed-off-by: NJin Xu <jinuxstyle@gmail.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      92c4342f
  31. 06 8月, 2013 1 次提交
    • J
      f2fs: fix a deadlock in fsync · a569469e
      Jin Xu 提交于
      This patch fixes a deadlock bug that occurs quite often when there are
      concurrent write and fsync on a same file.
      
      Following is the simplified call trace when tasks get hung.
      
      fsync thread:
      - f2fs_sync_file
       ...
       - f2fs_write_data_pages
       ...
        - update_extent_cache
        ...
         - update_inode
          - wait_on_page_writeback
      
      bdi writeback thread
      - __writeback_single_inode
       - f2fs_write_data_pages
        - mutex_lock(sbi->writepages)
      
      The deadlock happens when the fsync thread waits on a inode page that has
      been added to the f2fs' cached bio sbi->bio[NODE], and unfortunately,
      no one else could be able to submit the cached bio to block layer for
      writeback. This is because the fsync thread already hold a sbi->fs_lock and
      the sbi->writepages lock, causing the bdi thread being blocked when attempt
      to write data pages for the same inode. At the same time, f2fs_gc thread
      does not notice the situation and could not help. Even the sync syscall
      gets blocked.
      
      To fix it, we could submit the cached bio first before waiting on a inode page
      that is being written back.
      Signed-off-by: NJin Xu <jinuxstyle@gmail.com>
      [Jaegeuk Kim: add more cases to use f2fs_wait_on_page_writeback]
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      a569469e
  32. 30 7月, 2013 1 次提交
  33. 14 6月, 2013 1 次提交
  34. 28 5月, 2013 2 次提交
    • J
      f2fs: fix wrong condition check · b638f0c4
      Jaegeuk Kim 提交于
      While an orphan inode has zero link_count, f2fs_gc is able to select the inode
      for foreground gc.
      
      - f2fs_gc
       - do_garbage_collect
         - gc_data_segment
           : f2fs_iget is failed
           : get_valid_blocks() != 0, so that retry
      --> here we got the infinite loop.
      
      This patch resolved this issue.
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      b638f0c4
    • J
      f2fs: avoid RECLAIM_FS-ON-W: deadlock · 6f85b352
      Jaegeuk Kim 提交于
      This patch tries to avoid the following deadlock condition of which the reclaim
      path can trigger f2fs_balance_fs again.
      
      =================================
      [ INFO: inconsistent lock state ]
      ---------------------------------
      inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      kswapd0/41 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (&sbi->gc_mutex){+.+.?.}, at: f2fs_balance_fs+0xe6/0x100 [f2fs]
      {RECLAIM_FS-ON-W} state was registered at:
        [<ffffffff810aa5a9>] mark_held_locks+0xb9/0x140
        [<ffffffff810aae85>] lockdep_trace_alloc+0x85/0xf0
        [<ffffffff8113ab2c>] __alloc_pages_nodemask+0x7c/0x9b0
        [<ffffffff81175aa8>] alloc_pages_current+0xb8/0x180
        [<ffffffff811319cf>] __page_cache_alloc+0xaf/0xd0
        [<ffffffff8113225c>] find_or_create_page+0x4c/0xb0
        [<ffffffffa021359e>] find_data_page+0x14e/0x210 [f2fs]
        [<ffffffffa021161b>] f2fs_gc+0x9eb/0xd90 [f2fs]
        [<ffffffffa0218fae>] f2fs_balance_fs+0xee/0x100 [f2fs]
        [<ffffffffa020848c>] f2fs_setattr+0x6c/0x200 [f2fs]
        [<ffffffff811ae51b>] notify_change+0x1db/0x3a0
        [<ffffffff8118fbd0>] do_truncate+0x60/0xa0
        [<ffffffff8118fd95>] vfs_truncate+0x185/0x1b0
        [<ffffffff8118fe1c>] do_sys_truncate+0x5c/0xa0
        [<ffffffff8118ffee>] SyS_truncate+0xe/0x10
        [<ffffffff816e2b42>] system_call_fastpath+0x16/0x1b
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      6f85b352