1. 09 Jul, 2014 1 commit
  2. 07 May, 2014 2 commits
  3. 04 Apr, 2014 1 commit
    • mm + fs: store shadow entries in page cache · 91b0abe3
      Committed by Johannes Weiner
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
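
The check described above can be modeled in a few lines of userspace C. This is only a sketch: `mapping_model`, `mapping_set_exiting()`, and `reclaim_evict_page()` are hypothetical names, the bit position of `AS_EXITING` is assumed, and the tree lock is reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's data structures. */
#define AS_EXITING (1UL << 0)   /* assumed bit position */

struct mapping_model {
    unsigned long flags;   /* in the kernel, changed under the tree lock */
    size_t nr_shadows;     /* shadow entries currently left in the tree */
};

/* Inode-freeing side: mark the mapping before the final truncate. */
static void mapping_set_exiting(struct mapping_model *m)
{
    assert(m != NULL);
    m->flags |= AS_EXITING;
}

/* Reclaim side: returns true if a shadow entry was installed. */
static bool reclaim_evict_page(struct mapping_model *m)
{
    assert(m != NULL);
    if (m->flags & AS_EXITING)
        return false;      /* inode is going away: leave the slot empty */
    m->nr_shadows++;
    return true;
}
```

Once the flag is set, reclaim stops adding shadow entries, so the final truncate can rely on the tree ending up empty.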
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 18 Mar, 2014 1 commit
  5. 27 Feb, 2014 1 commit
    • f2fs: introduce large directory support · 38431545
      Committed by Jaegeuk Kim
      This patch introduces an i_dir_level field to support large directories.
      
      Previously, f2fs maintained multi-level hash tables to find a dentry quickly
      among the many child dentries in a directory, and the hash tables form
      the tree structure shown below.
      
      In Documentation/filesystems/f2fs.txt,
      
      ----------------------
      A : bucket
      B : block
      N : MAX_DIR_HASH_DEPTH
      ----------------------
      
      level #0   | A(2B)
                 |
      level #1   | A(2B) - A(2B)
                 |
      level #2   | A(2B) - A(2B) - A(2B) - A(2B)
           .     |   .       .       .       .
      level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
           .     |   .       .       .       .
      level #N   | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
      
      But if we can guess that a directory will hold a large number of child files,
      we don't need to traverse the tree from level #0 to #N every time.
      Since the lower-level tables contain a relatively small number of dentries,
      the miss ratio for the target dentry is likely to be high.
      
      In order to avoid that, we can configure the hash tables sparsely from level #0
      like this.
      
      level #0   | A(2B) - A(2B) - A(2B) - A(2B)
      
      level #1   | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
           .     |   .       .       .       .
      level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
           .     |   .       .       .       .
      level #N   | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
      
      With this structure, we can skip the ineffective tree searches in lower level
      hash tables.
      
      This patch adds just a facility for this by introducing i_dir_level in
      f2fs_inode.
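
As a rough illustration of the two diagrams, the bucket count per level can be written as a function of the level and a directory-level offset. This is a sketch under assumptions: the value 63 for `MAX_DIR_HASH_DEPTH` and the cap `MAX_DIR_BUCKETS` are not stated in this message, and `dir_buckets()` is an illustrative helper rather than a quote of the patch.

```c
#include <assert.h>

#define MAX_DIR_HASH_DEPTH 63                                   /* N, assumed value */
#define MAX_DIR_BUCKETS (1U << ((MAX_DIR_HASH_DEPTH / 2) - 1))  /* assumed cap */

/*
 * Number of buckets at a given hash level. With dir_level == 0 the count
 * doubles per level (1, 2, 4, ...); a non-zero dir_level shifts the whole
 * table so that level #0 already starts with 2^dir_level buckets.
 */
static unsigned int dir_buckets(unsigned int level, unsigned int dir_level)
{
    assert(level < MAX_DIR_HASH_DEPTH);
    if (level + dir_level < MAX_DIR_HASH_DEPTH / 2)
        return 1U << (level + dir_level);
    return MAX_DIR_BUCKETS;
}
```

With `dir_level` set to 2, level #0 has four buckets, which matches the second diagram.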
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
  6. 17 Feb, 2014 1 commit
  7. 20 Jan, 2014 1 commit
  8. 14 Jan, 2014 1 commit
  9. 06 Jan, 2014 1 commit
  10. 26 Dec, 2013 1 commit
  11. 29 Oct, 2013 1 commit
  12. 18 Oct, 2013 1 commit
  13. 07 Oct, 2013 1 commit
    • f2fs: use rw_sem instead of fs_lock(locks mutex) · e479556b
      Committed by Gu Zheng
      The fs_lock is used to block other ops (e.g., recovery) while a checkpoint is running.
      Every other operation (besides checkpoint) also needs to acquire an fs_lock, so
      when too many threads try to acquire fs_lock concurrently, they block each
      other, which can cause performance problems.
      Although some optimization patches have improved the usage of fs_lock, the
      thorough solution is to replace fs_lock with an *rw_sem*.
      The checkpoint routine takes write_sem and other ops take read_sem, so checkpoint
      still blocks other ops (e.g., recovery), while the other ops no longer disturb
      each other. This avoids the problem described above completely.
      Because of a weakness of rw_sem, this change may introduce a potential problem:
      the checkpoint thread might get starved if other threads are intensively locking
      the read semaphore for I/O. (Pointed out by Xu Jin)
      To avoid this, a wait_list is introduced: pending read semaphore ops are put on
      the wait_list if the checkpoint thread is waiting for the write semaphore,
      and are woken up when the checkpoint thread releases the write semaphore.
      Thanks to Kim for the previous review and testing; performance test results
      for this patch from others would be very welcome.
      
      V2:
        -fix the potential starvation problem.
        -use more suitable func name suggested by Xu Jin.
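
The starvation fix described above can be illustrated with a small single-threaded model of the gating logic. This is a sketch, not the f2fs code: `cp_gate` and its functions are hypothetical names, and real blocking and wake-ups are replaced by counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of the gating logic. */
struct cp_gate {
    unsigned int readers;   /* ops currently holding the read side */
    unsigned int queued;    /* ops parked on the wait list */
    bool writer_waiting;    /* checkpoint wants the write side */
    bool writer_active;     /* checkpoint holds the write side */
};

/* An ordinary op tries to enter; returns false if it was queued instead. */
static bool op_enter(struct cp_gate *g)
{
    if (g->writer_waiting || g->writer_active) {
        g->queued++;        /* do not starve the checkpoint */
        return false;
    }
    g->readers++;
    return true;
}

/* An op leaves; the last one out hands over to a waiting checkpoint. */
static void op_exit(struct cp_gate *g)
{
    assert(g->readers > 0);
    g->readers--;
    if (g->readers == 0 && g->writer_waiting) {
        g->writer_waiting = false;
        g->writer_active = true;
    }
}

/* Checkpoint asks for exclusive access; true if granted immediately. */
static bool checkpoint_begin(struct cp_gate *g)
{
    if (g->readers == 0) {
        g->writer_active = true;
        return true;
    }
    g->writer_waiting = true;
    return false;
}

/* Checkpoint finishes and admits every queued op; returns how many. */
static unsigned int checkpoint_end(struct cp_gate *g)
{
    unsigned int woken = g->queued;

    assert(g->writer_active);
    g->writer_active = false;
    g->readers += woken;
    g->queued = 0;
    return woken;
}
```

The key point is in `op_enter()`: a new reader is parked as soon as the checkpoint is waiting, not only while it is running.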
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      [Jaegeuk Kim: adjust minor coding standard]
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
  14. 26 Aug, 2013 1 commit
  15. 19 Aug, 2013 1 commit
    • f2fs: avoid writing inode redundantly when creating a file · 92c4342f
      Committed by Jin Xu
      In f2fs_write_inode, updating the inode after f2fs_balance_fs is not
      optimal when f2fs_gc has been performed first. The
      inode page will be unnecessarily written out twice: once
      in f2fs_gc->...->sync_node_pages and once in
      update_inode_page.

      Let's update the inode page prior to f2fs_balance_fs to avoid
      this.
      
      To reproduce it,
      $ touch file (before this step, put the device in a state that needs f2fs_gc)
      $ sync (or wait for the bdi to write the dirty inode)
      Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
  16. 06 Aug, 2013 1 commit
    • f2fs: fix a deadlock in fsync · a569469e
      Committed by Jin Xu
      This patch fixes a deadlock bug that occurs quite often when there are
      concurrent write and fsync calls on the same file.
      
      Following is the simplified call trace when tasks get hung.
      
      fsync thread:
      - f2fs_sync_file
       ...
       - f2fs_write_data_pages
       ...
        - update_extent_cache
        ...
         - update_inode
          - wait_on_page_writeback
      
      bdi writeback thread
      - __writeback_single_inode
       - f2fs_write_data_pages
        - mutex_lock(sbi->writepages)
      
      The deadlock happens when the fsync thread waits on an inode page that has
      been added to f2fs's cached bio, sbi->bio[NODE], and no one else is able
      to submit the cached bio to the block layer for writeback. This is because
      the fsync thread already holds sbi->fs_lock and the sbi->writepages lock,
      so the bdi thread is blocked when it attempts to write data pages for the
      same inode. At the same time, the f2fs_gc thread does not notice the
      situation and cannot help. Even the sync syscall gets blocked.

      To fix it, submit the cached bio first before waiting on an inode page
      that is being written back.
      Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
      [Jaegeuk Kim: add more cases to use f2fs_wait_on_page_writeback]
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
  17. 30 Jul, 2013 1 commit
  18. 14 Jun, 2013 1 commit
  19. 28 May, 2013 2 commits
    • f2fs: fix wrong condition check · b638f0c4
      Committed by Jaegeuk Kim
      Even though an orphan inode has a zero link_count, f2fs_gc can still select
      the inode for foreground gc.
      
      - f2fs_gc
       - do_garbage_collect
         - gc_data_segment
           : f2fs_iget fails
           : get_valid_blocks() != 0, so it retries
      --> here we get an infinite loop.

      This patch resolves this issue.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: avoid RECLAIM_FS-ON-W: deadlock · 6f85b352
      Committed by Jaegeuk Kim
      This patch tries to avoid the following deadlock condition, in which the reclaim
      path can trigger f2fs_balance_fs again.
      
      =================================
      [ INFO: inconsistent lock state ]
      ---------------------------------
      inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      kswapd0/41 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (&sbi->gc_mutex){+.+.?.}, at: f2fs_balance_fs+0xe6/0x100 [f2fs]
      {RECLAIM_FS-ON-W} state was registered at:
        [<ffffffff810aa5a9>] mark_held_locks+0xb9/0x140
        [<ffffffff810aae85>] lockdep_trace_alloc+0x85/0xf0
        [<ffffffff8113ab2c>] __alloc_pages_nodemask+0x7c/0x9b0
        [<ffffffff81175aa8>] alloc_pages_current+0xb8/0x180
        [<ffffffff811319cf>] __page_cache_alloc+0xaf/0xd0
        [<ffffffff8113225c>] find_or_create_page+0x4c/0xb0
        [<ffffffffa021359e>] find_data_page+0x14e/0x210 [f2fs]
        [<ffffffffa021161b>] f2fs_gc+0x9eb/0xd90 [f2fs]
        [<ffffffffa0218fae>] f2fs_balance_fs+0xee/0x100 [f2fs]
        [<ffffffffa020848c>] f2fs_setattr+0x6c/0x200 [f2fs]
        [<ffffffff811ae51b>] notify_change+0x1db/0x3a0
        [<ffffffff8118fbd0>] do_truncate+0x60/0xa0
        [<ffffffff8118fd95>] vfs_truncate+0x185/0x1b0
        [<ffffffff8118fe1c>] do_sys_truncate+0x5c/0xa0
        [<ffffffff8118ffee>] SyS_truncate+0xe/0x10
        [<ffffffff816e2b42>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
  20. 23 Apr, 2013 1 commit
  21. 09 Apr, 2013 1 commit
    • f2fs: introduce a new global lock scheme · 39936837
      Committed by Jaegeuk Kim
      In the previous version, f2fs used global locks according to the usage types,
      such as directory operations, block allocation, block write, and so on.

      See the following lock types in f2fs.h.
      enum lock_type {
      	RENAME,		/* for renaming operations */
      	DENTRY_OPS,	/* for directory operations */
      	DATA_WRITE,	/* for data write */
      	DATA_NEW,	/* for data allocation */
      	DATA_TRUNC,	/* for data truncate */
      	NODE_NEW,	/* for node allocation */
      	NODE_TRUNC,	/* for node truncate */
      	NODE_WRITE,	/* for node write */
      	NR_LOCK_TYPE,
      };
      
      In that case, we lose performance in a multi-threaded environment,
      since each type of operation must be conducted one at a time.

      In order to address the problem, let's share the locks globally with a mutex
      array regardless of type.
      So, let users grab a mutex and perform their jobs in parallel as much as
      possible.
      
      For this, I propose a new global lock scheme as follows.
      
      0. Data structure
       - f2fs_sb_info -> mutex_lock[NR_GLOBAL_LOCKS]
       - f2fs_sb_info -> node_write
      
      1. mutex_lock_op(sbi)
       - try to get an available lock from the array.
       - returns the index of the acquired lock variable.
      
      2. mutex_unlock_op(sbi, index of the lock)
       - unlock the given index of the lock.
      
      3. mutex_lock_all(sbi)
       - grab all the locks in the array before the checkpoint.
      
      4. mutex_unlock_all(sbi)
       - release all the locks in the array after checkpoint.
      
      5. block_operations()
       - call mutex_lock_all()
       - sync_dirty_dir_inodes()
       - grab node_write
       - sync_node_pages()
      
      Note that,
       the pairs of mutex_lock_op()/mutex_unlock_op() and
       mutex_lock_all()/mutex_unlock_all() should be used together.
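
The scheme listed above can be sketched in userspace C with an array of C11 `atomic_flag` spin flags standing in for kernel mutexes. This is an illustration under assumptions: `NR_GLOBAL_LOCKS` is given an arbitrary value of 8, the function names are shortened stand-ins for the ones in the list, `try_lock_op()` is an extra helper added for the example, and spinning replaces sleeping.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_GLOBAL_LOCKS 8   /* assumed size; the commit does not state it */

struct sb_locks {
    atomic_flag lock[NR_GLOBAL_LOCKS];
};

static void locks_init(struct sb_locks *s)
{
    for (int i = 0; i < NR_GLOBAL_LOCKS; i++)
        atomic_flag_clear(&s->lock[i]);
}

/* One pass over the array; returns the index taken, or -1 if all are busy. */
static int try_lock_op(struct sb_locks *s)
{
    for (int i = 0; i < NR_GLOBAL_LOCKS; i++)
        if (!atomic_flag_test_and_set(&s->lock[i]))
            return i;
    return -1;
}

/* Spin until some slot is free and return its index. */
static int lock_op(struct sb_locks *s)
{
    int i;

    while ((i = try_lock_op(s)) < 0)
        ;
    return i;
}

/* Release the slot returned by lock_op(). */
static void unlock_op(struct sb_locks *s, int i)
{
    assert(i >= 0 && i < NR_GLOBAL_LOCKS);
    atomic_flag_clear(&s->lock[i]);
}

/* Checkpoint side: take every slot, always in index order. */
static void lock_all(struct sb_locks *s)
{
    for (int i = 0; i < NR_GLOBAL_LOCKS; i++)
        while (atomic_flag_test_and_set(&s->lock[i]))
            ;
}

static void unlock_all(struct sb_locks *s)
{
    for (int i = 0; i < NR_GLOBAL_LOCKS; i++)
        atomic_flag_clear(&s->lock[i]);
}
```

Ordinary operations each hold one slot and run in parallel, while the checkpoint path holds all slots and therefore excludes every one of them.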
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
  22. 27 Mar, 2013 1 commit
    • f2fs: do not skip writing file meta during fsync · 0ff153a2
      Committed by Jaegeuk Kim
      This patch removes the data_version check during the fsync call.
      The original purpose of data_version was to avoid writing inode
      pages redundantly when fsync is called repeatedly.
      However, a user can modify file metadata and then call fsync, so we should not
      skip the fsync procedure.
      So, let's remove this condition check and rely on the user to call fsync
      appropriately.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
  23. 20 Mar, 2013 1 commit
  24. 12 Feb, 2013 3 commits
    • f2fs: avoid balanc_fs during evict_inode · d4686d56
      Committed by Jaegeuk Kim
      1. Background
      
      Previously, if f2fs tried to move data blocks of an *evicting* inode during the
      cleaning process, it stopped the process midway and then restarted the whole
      process, since it needs a locked inode to grab victim data pages in its address
      space. In order to get a locked inode, iget_locked() via f2fs_iget() is normally
      used, but it waits if the inode is being freed.
      
      So, here is a deadlock scenario.
      1. f2fs_evict_inode()       <- inode "A"
        2. f2fs_balance_fs()
          3. f2fs_gc()
            4. gc_data_segment()
              5. f2fs_iget()      <- inode "A" too!
      
      If steps #1 and #5 handle the same inode "A", step #5 would deadlock since
      inode "A" is being freed. To resolve this, f2fs_iget_nowait(), which
      skips __wait_on_freeing_inode(), was introduced in step #5; it stops f2fs_gc()
      so that f2fs_evict_inode() can complete.
      
      1. f2fs_evict_inode()           <- inode "A"
        2. f2fs_balance_fs()
          3. f2fs_gc()
            4. gc_data_segment()
              5. f2fs_iget_nowait()   <- inode "A", then stop f2fs_gc() w/ -ENOENT
      
      2. Problem and Solution
      
      In the above scenario, however, f2fs cannot finish f2fs_evict_inode() if:
       o there are not enough free sections, and
       o f2fs_gc() tries to move data blocks of the *evicting* inode repeatedly.

      So, the final solution is to use f2fs_iget() and remove f2fs_balance_fs() from
      f2fs_evict_inode().
      f2fs_evict_inode() actually truncates all the data and node blocks, which
      means that it doesn't produce any dirty node pages.
      So, we don't need to call f2fs_balance_fs() in practice.
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: add un/freeze_fs into super_operations · d6212a5f
      Committed by Changman Lee
      This patch supports the FIFREEZE and FITHAW ioctls for snapshotting the filesystem.
      Before f2fs_freeze is called, all writers are suspended and sync_fs
      has completed, so f2fs itself has nothing to do.
      Only the background gc operation should be skipped until unfreeze, since it
      generates dirty nodes and data.
      Signed-off-by: Changman Lee <cm224.lee@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: save device node number into f2fs_inode · 7d79e75f
      Committed by Changman Lee
      This patch stores inode->i_rdev into the on-disk inode structure.
      
      Alun reported that:
       aspire tmp # mount -t f2fs /dev/sdb mnt
       aspire tmp # mknod mnt/sda1 b 8 1
       aspire tmp # mknod mnt/null c 1 3
       aspire tmp # mknod mnt/console c 5 1
       aspire tmp # ls -l mnt
       total 2
       crw-r--r-- 1 root root 5, 1 Jan 22 18:44 console
       crw-r--r-- 1 root root 1, 3 Jan 22 18:44 null
       brw-r--r-- 1 root root 8, 1 Jan 22 18:44 sda1
       aspire tmp # umount mnt
       aspire tmp # mount -t f2fs /dev/sdb mnt
       aspire tmp # ls -l mnt
       total 2
       crw-r--r-- 1 root root 0, 0 Jan 22 18:44 console
       crw-r--r-- 1 root root 0, 0 Jan 22 18:44 null
       brw-r--r-- 1 root root 0, 0 Jan 22 18:44 sda1
      
      In this report, f2fs lost the major/minor numbers of device files after umount.
      The reason is that f2fs does not store inode->i_rdev in the
      on-disk inode data structure.
      
      So, as the other file systems do, f2fs also stores i_rdev into the i_addr fields
      in on-disk inode structure without any on-disk layout changes.
      Note that, this bug is limited to device files made by mknod().
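
To make the fix concrete, a major/minor pair can be packed into a single 32-bit value that fits an on-disk address slot. The sketch below follows the bit layout of the kernel's `new_encode_dev()` helper as I understand it; `encode_dev()`, `decode_major()`, and `decode_minor()` are illustrative names, and which i_addr slot f2fs uses is not stated in this message.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 12-bit major and a 20-bit minor into one 32-bit on-disk value. */
static uint32_t encode_dev(unsigned int major, unsigned int minor)
{
    assert(major < (1U << 12) && minor < (1U << 20));
    return (minor & 0xff) | (major << 8) | ((minor & ~0xffU) << 12);
}

/* Recover the major number from the packed value. */
static unsigned int decode_major(uint32_t dev)
{
    return (dev & 0xfff00) >> 8;
}

/* Recover the minor number (low 8 bits plus the high 12 bits). */
static unsigned int decode_minor(uint32_t dev)
{
    return (dev & 0xff) | ((dev >> 12) & 0xfff00);
}
```

For the `sda1` node in the report (block device 8, 1), the packed value is 0x801, and decoding it after a remount yields 8 and 1 again instead of 0, 0.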
      Reported-and-Tested-by: Alun Jones <alun.linux@ty-penguin.org.uk>
      Signed-off-by: Changman Lee <cm224.lee@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
  25. 11 Jan, 2013 1 commit
    • f2fs: add f2fs_balance_fs in several interfaces · 7d82db83
      Committed by Jaegeuk Kim
      f2fs_balance_fs() checks the number of free sections and decides whether
      cleaning is needed. If there are not enough free sections, the
      cleaning job should be started.

      In order to control the number of free sections even under high utilization, f2fs
      should call f2fs_balance_fs() in all the VFS interfaces that can produce
      dirty pages.
      This patch adds the function calls to the missing interfaces as follows.

      1. f2fs_setxattr()
      f2fs_setxattr() produces dirty node pages, so we should call
      f2fs_balance_fs() there too, as is done in other VFS interfaces such as
      f2fs_lookup(), f2fs_mkdir(), and so on.

      2. f2fs_sync_file()
      We should guarantee that free sections are available for syncing metadata during fsync.
      Previously, there was no space check before triggering checkpoint and
      sync_node_pages.
      Therefore, if a bunch of fsync calls are triggered at 100% FS utilization,
      f2fs can run out of free sections, resulting in BUG_ON().
      
      3. f2fs_sync_fs()
      Before calling write_checkpoint(), we should guarantee that there are minimum
      free sections.
      
      4. f2fs_write_inode()
      f2fs_write_inode() is also able to produce dirty node pages.
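
The role of f2fs_balance_fs() described above can be modeled as a threshold check followed by cleaning. This is a sketch with hypothetical names and fields (`fs_model`, `reserved_sections`, `gc_once()`); the real threshold and cleaner are not described in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the free-section accounting. */
struct fs_model {
    unsigned int free_sections;
    unsigned int reserved_sections; /* assumed threshold */
    unsigned int reclaimable;       /* sections cleaning could still free */
};

static bool has_not_enough_free_secs(const struct fs_model *fs)
{
    return fs->free_sections <= fs->reserved_sections;
}

/* Stand-in for the cleaner: frees one section per call if possible. */
static bool gc_once(struct fs_model *fs)
{
    if (fs->reclaimable == 0)
        return false;
    fs->reclaimable--;
    fs->free_sections++;
    return true;
}

/* Called before producing dirty pages; returns how many sections were cleaned. */
static unsigned int balance_fs(struct fs_model *fs)
{
    unsigned int cleaned = 0;

    assert(fs != NULL);
    while (has_not_enough_free_secs(fs) && gc_once(fs))
        cleaned++;
    return cleaned;
}
```

An interface that skips this call can keep dirtying pages while free sections run out, which is the situation the four additions above are meant to prevent.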
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
  26. 26 Dec, 2012 1 commit
    • f2fs: fix handling errors got by f2fs_write_inode · 398b1ac5
      Committed by Jaegeuk Kim
      Ruslan reported that f2fs hangs with an infinite loop in f2fs_sync_file():
      
      	while (sync_node_pages(sbi, inode->i_ino, &wbc) == 0)
      		f2fs_write_inode(inode, NULL);
      
      The reason is that the cold flag is not set even though this inode is
      a normal file. Therefore, sync_node_pages() skips writing node blocks, since it
      only writes cold node blocks.

      The cold flag is stored in the node_footer of the node block, and whenever a new
      node page is allocated, it is set according to its file type, file or directory.

      But after a sudden power-off, when recovering the inode page, f2fs doesn't recover
      its cold flag.

      So, let's assign the cold flag in the right places.
      
      One more thing:
      If f2fs_write_inode() returns an error due to whatever situations, there would
      be no dirty node pages so that sync_node_pages() returns zero.
      (i.e., zero means nothing was written.)
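
A loop of this shape only terminates safely if the error from the inode write is checked. The following is a minimal model of that idea, using stub functions with hypothetical names in place of sync_node_pages() and f2fs_write_inode(); it is not the patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one inode's state. */
struct inode_model {
    int dirty_node_pages;  /* pages the sync stub can still write */
    int write_inode_err;   /* error the write stub reports, 0 for success */
};

/* Writes all dirty node pages and returns how many were written. */
static int sync_node_pages_stub(struct inode_model *in)
{
    int written = in->dirty_node_pages;

    in->dirty_node_pages = 0;
    return written;
}

/* On success the inode page becomes dirty; on failure nothing changes. */
static int write_inode_stub(struct inode_model *in)
{
    if (in->write_inode_err)
        return in->write_inode_err;
    in->dirty_node_pages = 1;
    return 0;
}

/* Retry until something was written, but give up on a write error. */
static int sync_file_model(struct inode_model *in)
{
    assert(in != NULL);
    while (sync_node_pages_stub(in) == 0) {
        int err = write_inode_stub(in);

        if (err)
            return err;    /* without this check the loop never ends */
    }
    return 0;
}
```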
      Reported-by: Ruslan N. Marchenko <me@ruff.mobi>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
  27. 11 Dec, 2012 3 commits
    • f2fs: fix tracking parent inode number · 6666e6aa
      Committed by Jaegeuk Kim
      Previously, f2fs didn't correctly track the parent inode number, which is stored
      in each f2fs_inode. In the following scenario, a bug can occur.
      
      Let's suppose there are one directory, "/b", and two files, "/a" and "/b/a".
       - pino of "/a" is ROOT_INO.
       - pino of "/b/a" is DIR_B_INO.
      
      Then,
       # sync
        : The inode pages of "/a" and "/b/a" contain the parent inode numbers as
          ROOT_INO and DIR_B_INO respectively.
       # mv /a /b/a
        : The parent inode number of "/a" should be changed to DIR_B_INO, but f2fs
          didn't do that. Ref. f2fs_set_link().
      
      In order to fix this clearly, I added i_pino in f2fs_inode_info, and whenever
      it needs to be changed like in f2fs_add_link() and f2fs_set_link(), it is
      updated temporarily in f2fs_inode_info.
      
      Later, f2fs_write_inode() stores the latest information to the inode pages.
      For power-off recovery, f2fs_sync_file() simply triggers f2fs_write_inode().
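
The two-step update described above (change the in-memory value first, persist it later) can be sketched as follows. The struct and function names are hypothetical stand-ins for f2fs_inode_info, the inode page, f2fs_set_link(), and f2fs_write_inode(), and the inode numbers in the usage note are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the on-disk inode kept in the inode page. */
struct disk_inode_model {
    uint32_t i_pino;
};

/* Stand-in for the in-memory inode info. */
struct inode_info_model {
    uint32_t i_pino;                /* latest parent inode number */
    struct disk_inode_model *page;  /* backing inode page */
};

/* Rename/link path: only the in-memory parent number changes. */
static void set_link_model(struct inode_info_model *fi, uint32_t new_parent)
{
    fi->i_pino = new_parent;
}

/* Write path: copy the latest parent number into the inode page. */
static void write_inode_model(struct inode_info_model *fi)
{
    assert(fi->page != NULL);
    fi->page->i_pino = fi->i_pino;
}
```

After a rename from parent 3 to parent 7, the inode page still says 3 until the write step runs, which is why the sync path has to trigger it.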
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: adjust kernel coding style · 0a8165d7
      Committed by Jaegeuk Kim
      As pointed out by Randy Dunlap, this patch removes all uses of "/**" for comment
      blocks. Instead, just use "/*".
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: add core inode operations · 19f99cee
      Committed by Jaegeuk Kim
      This adds core functions to get, read, write, and evict an inode.
      Signed-off-by: Changman Lee <cm224.lee@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>