- 24 9月, 2014 9 次提交
-
-
由 Chao Yu 提交于
By using FALLOC_FL_KEEP_SIZE in ->fallocate of f2fs, we can fallocate block past EOF without changing i_size of inode. These blocks past EOF will not be truncated in ->setattr as we truncate them only when change the file size. We should give a chance to truncate blocks out of filesize in setattr(). Signed-off-by: NChao Yu <chao2.yu@samsung.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
The f2fs_direct_IO uses __allocate_data_block, but inside the allocation path, we should update i_size at the changed time to update its inode page. Otherwise, we can get wrong i_size after roll-forward recovery. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
This patch cleans up a simple macro. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
If same data is updated multiple times, we don't need to redo whole the operations. Let's just update the lastest one. Reviewed-by: NChao Yu <chao2.yu@samsung.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
In f2fs_sync_file, if there is no written appended writes, it skips to write its node blocks. But, if there is up-to-date inode page, we should write it to update its metadata during the roll-forward recovery. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
We can summarize the roll forward recovery scenarios as follows. [Term] F: fsync_mark, D: dentry_mark 1. inode(x) | CP | inode(x) | dnode(F) -> Update the latest inode(x). 2. inode(x) | CP | inode(F) | dnode(F) -> No problem. 3. inode(x) | CP | dnode(F) | inode(x) -> Recover to the latest dnode(F), and drop the last inode(x) 4. inode(x) | CP | dnode(F) | inode(F) -> No problem. 5. CP | inode(x) | dnode(F) -> The inode(DF) was missing. Should drop this dnode(F). 6. CP | inode(DF) | dnode(F) -> No problem. 7. CP | dnode(F) | inode(DF) -> If f2fs_iget fails, then goto next to find inode(DF). 8. CP | dnode(F) | inode(x) -> If f2fs_iget fails, then goto next to find inode(DF). But it will fail due to no inode(DF). So, this patch adds some missing points such as #1, #5, #7, and #8. Signed-off-by: NHuang Ying <ying.huang@intel.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
This patch revisited whole the recovery information during the f2fs_sync_file. In this patch, there are three information to make a decision. a) IS_CHECKPOINTED, /* is it checkpointed before? */ b) HAS_FSYNCED_INODE, /* is the inode fsynced before? */ c) HAS_LAST_FSYNC, /* has the latest node fsync mark? */ And, the scenarios for our rule are based on: [Term] F: fsync_mark, D: dentry_mark 1. inode(x) | CP | inode(x) | dnode(F) 2. inode(x) | CP | inode(F) | dnode(F) 3. inode(x) | CP | dnode(F) | inode(x) | inode(F) 4. inode(x) | CP | dnode(F) | inode(F) 5. CP | inode(x) | dnode(F) | inode(DF) 6. CP | inode(DF) | dnode(F) 7. CP | dnode(F) | inode(DF) 8. CP | dnode(F) | inode(x) | inode(DF) For example, #3, the three conditions should be changed as follows. inode(x) | CP | dnode(F) | inode(x) | inode(F) a) x o o o o b) x x x x o c) x o o x o If f2fs_sync_file stops ------^, it should write inode(F) --------------^ So, the need_inode_block_update should return true, since c) get_nat_flag(e, HAS_LAST_FSYNC), is false. For example, #8, CP | alloc | dnode(F) | inode(x) | inode(DF) a) o x x x x b) x x x o c) o o x o If f2fs_sync_file stops -------^, it should write inode(DF) --------------^ Note that, the roll-forward policy should follow this rule, which means, if there are any missing blocks, we doesn't need to recover that inode. Signed-off-by: NHuang Ying <ying.huang@intel.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
This patch introduces a flag in the nat entry structure to merge various information such as checkpointed and fsync_done marks. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
Previously, all the dnode pages should be read during the roll-forward recovery. Even worsely, whole the chain was traversed twice. This patch removes that redundant and costly read operations by using page cache of meta_inode and readahead function as well. Reviewed-by: NChao Yu <chao2.yu@samsung.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
- 16 9月, 2014 5 次提交
-
-
由 Jaegeuk Kim 提交于
If the inode is same and its data index are needed to truncate, we can fall into double lock for its inode page via get_dnode_of_data. Error case is like this. 1. write data 1, 2, 3, 4, 5 in inode #4. 2. write data 100, 102, 103, 104, 105 in dnode #6 of inode #4. 3. sync 4. update data 100->106 in dnode #6. 5. fsync inode #4. 6. power-cut -> Then, 1. go back to #3's checkpoint 2. in do_recover_data, get_dnode_of_data() gets inode #4. 3. detect 100->106 in dnode #6. 4. check_index_in_prev_nodes tries to truncate 100 in dnode #6. 5. to trigger truncate_hole, get_dnode_of_data should grab inode #4. 6. detect *kernel hang* This patch should resolve that bug. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Huang Ying 提交于
The nm_i->fcnt checking is executed before spin_lock, so if another thread delete the last free_nid from the list, the wrong nid may be gotten. So fix the race condition by moving the nm_i->fnct checking into spin_lock. Signed-off-by: NHuang, Ying <ying.huang@intel.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Huang Ying 提交于
Now, if there is no free nid in nm_i->free_nid_list, 0 may be saved into next_free_nid of checkpoint, this may cause useless scanning for next mount. nm_i->next_scan_nid should be a better default value than 0. Signed-off-by: NHuang, Ying <ying.huang@intel.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
If user wrote F2FS_IPU_FSYNC:4 in /sys/fs/f2fs/ipu_policy, f2fs_sync_file only starts to try in-place-updates. And, if the number of dirty pages is over /sys/fs/f2fs/min_fsync_blocks, it keeps out-of-order manner. Otherwise, it triggers in-place-updates. This may be used by storage showing very high random write performance. For example, it can be used when, Seq. writes (Data) + wait + Seq. writes (Node) is pretty much slower than, Rand. writes (Data) Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
Previously f2fs only counts dirty dentry pages, but there is no reason not to expand the scope. This patch changes the names on the management of dirty pages and to count dirty pages in each inode info as well. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
- 11 9月, 2014 1 次提交
-
-
由 Jaegeuk Kim 提交于
This patch is to remove lengthy name by adding a new variable. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
- 10 9月, 2014 10 次提交
-
-
由 Jaegeuk Kim 提交于
If application throws negative value of lseek with SEEK_DATA|SEEK_HOLE, previous f2fs went into BUG_ON in get_dnode_of_data, which was reported by Tommi Rantala. He could make a simple code to detect this having: lseek(fd, -17595150933902LL, SEEK_DATA); This patch should resolve that bug. Reported-by: NTommi Rentala <tt.rantala@gmail.com> [Jaegeuk Kim: relocate the condition as suggested by Chao] Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Huang Ying 提交于
In gc_node_segment, if node page gc is run concurrently with node page writeback, and check_valid_map and get_node_page run after page locked and before cur_valid_map is updated as below, it is possible for the page to be written twice unnecessarily. sync_node_pages try_lock_page ... check_valid_map f2fs_write_node_page ... write_node_page do_write_page allocate_data_block ... refresh_sit_entry /* update cur_valid_map */ ... ... unlock_page get_node_page ... set_page_dirty ... f2fs_put_page unlock_page This can be solved via calling check_valid_map after get_node_page again. Signed-off-by: NHuang, Ying <ying.huang@intel.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Gu Zheng 提交于
We use flush cmd control to collect many flush cmds, and flush them together. In this case, we use two list to manage the flush cmds (collect and dispatch), and one spin lock is used to protect this. In fact, the lock-less list(llist) is very suitable to this case, and we use simplify this routine. - v2: -use llist_for_each_entry_safe to fix possible use-after-free issue. -remove the unused field from struct flush_cmd. Thanks for Yu's suggestion. - Signed-off-by: NGu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Chao Yu 提交于
In commit aec71382 ("f2fs: refactor flush_nat_entries codes for reducing NAT writes"), we descripte the issue as below: "Although building NAT journal in cursum reduce the read/write work for NAT block, but previous design leave us lower performance when write checkpoint frequently for these cases: 1. if journal in cursum has already full, it's a bit of waste that we flush all nat entries to page for persistence, but not to cache any entries. 2. if journal in cursum is not full, we fill nat entries to journal util journal is full, then flush the left dirty entries to disk without merge journaled entries, so these journaled entries may be flushed to disk at next checkpoint but lost chance to flushed last time." Actually, we have the same problem in using SIT journal area. In this patch, firstly we will update sit journal with dirty entries as many as possible. Secondly if there is no space in sit journal, we will remove all entries in journal and walk through the whole dirty entry bitmap of sit, accounting dirty sit entries located in same SIT block to sit entry set. All entry sets are linked to list sit_entry_set in sm_info, sorted ascending order by count of entries in set. Later we flush entries in set which have fewest entries into journal as many as we can, and then flush dense set with merged entries to disk. In this way we can use sit journal area more effectively, also we will reduce SIT update, result in gaining in performance and saving lifetime of flash device. In my testing environment, it shows this patch can help to reduce SIT block update obviously. virtual machine + hard disk: fsstress -p 20 -n 400 -l 5 sit page num cp count sit pages/cp based 2006.50 1349.75 1.486 patched 1566.25 1463.25 1.070 Our latency of merging op is small when handling a great number of dirty SIT entries in flush_sit_entries: latency(ns) dirty sit count 36038 2151 49168 2123 37174 2232 Signed-off-by: NChao Yu <chao2.yu@samsung.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Chao Yu 提交于
sit_i in macro SIT_BLOCK_OFFSET/START_SEGNO is not used, remove it. Signed-off-by: NChao Yu <chao2.yu@samsung.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
If the roll-forward recovery was failed, we'd better conduct fsck.f2fs. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
This patch adds to handle corner buggy cases for fsck.f2fs. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
This patch replaces BUG cases with f2fs_bug_on to remain fsck.f2fs information. And it implements some void functions to initiate fsck.f2fs too. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
If any f2fs_bug_on is triggered, fsck.f2fs is needed. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
This patch adds sbi->need_fsck to conduct fsck.f2fs later. This flag can only be removed by fsck.f2fs. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
- 04 9月, 2014 1 次提交
-
-
由 Jaegeuk Kim 提交于
This patch adds three inline functions to clean up dirty casting codes. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
- 02 9月, 2014 1 次提交
-
-
由 Chao Yu 提交于
As the race condition on the inode cache, following scenario can appear: [Thread a] [Thread b] ->f2fs_mkdir ->f2fs_add_link ->__f2fs_add_link ->init_inode_metadata failed here ->gc_thread_func ->f2fs_gc ->do_garbage_collect ->gc_data_segment ->f2fs_iget ->iget_locked ->wait_on_inode ->unlock_new_inode ->move_data_page ->make_bad_inode ->iput When we fail in create/symlink/mkdir/mknod/tmpfile, the new allocated inode should be set as bad to avoid being accessed by other thread. But in above scenario, it allows f2fs to access the invalid inode before this inode was set as bad. This patch fix the potential problem, and this issue was found by code review. change log from v1: o Add condition judgment in gc_data_segment() suggested by Changman Lee. o use iget_failed to simplify code. Signed-off-by: NChao Yu <chao2.yu@samsung.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
- 30 8月, 2014 4 次提交
-
-
由 Junxiao Bi 提交于
For debug use, we can see from the log whether the fence decision is made and why it is not fenced. Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com> Reviewed-by: NSrinivas Eeda <srinivas.eeda@oracle.com> Reviewed-by: NMark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Junxiao Bi 提交于
When tcp retransmit timeout(15mins), the connection will be closed. Pending messages may be lost during this time. So we set tcp user timeout to override the retransmit timeout to the max value. This is OK for ocfs2 since we have disk heartbeat, if peer crash, the disk heartbeat will timeout and it will be evicted, if disk heartbeat not timeout and connection idle for a long time, then this means the cluster enters split-brain state, since fence can't happen, we'd better keep the connection and wait network recover. Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com> Reviewed-by: NSrinivas Eeda <srinivas.eeda@oracle.com> Reviewed-by: NMark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Junxiao Bi 提交于
This patch series is to fix a possible message lost bug in ocfs2 when network go bad. This bug will cause ocfs2 hung forever even network become good again. The messages may lost in this case. After the tcp connection is established between two nodes, an idle timer will be set to check its state periodically, if no messages are received during this time, idle timer will timeout, it will shutdown the connection and try to reconnect, so pending messages in tcp queues will be lost. This messages may be from dlm. Dlm may get hung in this case. This may cause the whole ocfs2 cluster hung. This is very possible to happen when network state goes bad. Do the reconnect is useless, it will fail if network state is still bad. Just waiting there for network recovering may be a good idea, it will not lost messages and some node will be fenced until cluster goes into split-brain state, for this case, Tcp user timeout is used to override the tcp retransmit timeout. It will timeout after 25 days, user should have notice this through the provided log and fix the network, if they don't, ocfs2 will fall back to original reconnect way. This patch (of 3): Some messages in the tcp queue maybe lost if we shutdown the connection and reconnect when idle timeout. If packets lost and reconnect success, then the ocfs2 cluster maybe hung. To fix this, we can leave the connection there and do the fence decision when idle timeout, if network recover before fence dicision is made, the connection survive without lost any messages. This bug can be saw when network state go bad. It may cause ocfs2 hung forever if some packets lost. With this fix, ocfs2 will recover from hung if network becomes good again. Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com> Reviewed-by: NSrinivas Eeda <srinivas.eeda@oracle.com> Reviewed-by: NMark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ben Hutchings 提交于
If we failed to copy from the structure, writing back the flags leaks 31 bits of kernel memory (the rest of the ir_flags field). In any case, if we cannot copy from/to the structure, why should we expect putting just the flags to work? Also make sure ocfs2_info_handle_freeinode() returns the right error code if the copy_to_user() fails. Fixes: ddee5cdb ('Ocfs2: Add new OCFS2_IOC_INFO ioctl for ocfs2 v8.') Signed-off-by: NBen Hutchings <ben@decadent.org.uk> Cc: Joel Becker <jlbec@evilplan.org> Acked-by: NMark Fasheh <mfasheh@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 8月, 2014 6 次提交
-
-
由 Jaegeuk Kim 提交于
The dentry name type is unsigned char *. If we don't match this type, some character codes can be changed by signed bit. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Darrick J. Wong 提交于
When performing a same-directory rename, it's possible that adding or setting the new directory entry will cause the directory to overflow the inline data area, which causes the directory to be converted to an extent-based directory. Under this circumstance it is necessary to re-read the directory when deleting the old dirent because the "old directory" context still points to i_block in the inode table, which is now an extent tree root! The delete fails with an FS error, and the subsequent fsck complains about incorrect link counts and hardlinked directories. Test case (originally found with flat_dir_test in the metadata_csum test program): # mkfs.ext4 -O inline_data /dev/sda # mount /dev/sda /mnt # mkdir /mnt/x # touch /mnt/x/changelog.gz /mnt/x/copyright /mnt/x/README.Debian # sync # for i in /mnt/x/*; do mv $i $i.longer; done # ls -la /mnt/x/ total 0 -rw-r--r-- 1 root root 0 Aug 25 12:03 changelog.gz.longer -rw-r--r-- 1 root root 0 Aug 25 12:03 copyright -rw-r--r-- 1 root root 0 Aug 25 12:03 copyright.longer -rw-r--r-- 1 root root 0 Aug 25 12:03 README.Debian.longer (Hey! Why are there four files now??) Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
-
由 Darrick J. Wong 提交于
It turns out that there are some serious problems with the on-disk format of journal checksum v2. The foremost is that the function to calculate descriptor tag size returns sizes that are too big. This causes alignment issues on some architectures and is compounded by the fact that some parts of jbd2 use the structure size (incorrectly) to determine the presence of a 64bit journal instead of checking the feature flags. Therefore, introduce journal checksum v3, which enlarges the descriptor block tag format to allow for full 32-bit checksums of journal blocks, fix the journal tag function to return the correct sizes, and fix the jbd2 recovery code to use feature flags to determine 64bitness. Add a few function helpers so we don't have to open-code quite so many pieces. Switching to a 16-byte block size was found to increase journal size overhead by a maximum of 0.1%, to convert a 32-bit journal with no checksumming to a 32-bit journal with checksum v3 enabled. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reported-by: NTR Reardon <thomas_reardon@hotmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
-
由 Darrick J. Wong 提交于
When recovering the journal, don't fall into an infinite loop if we encounter a corrupt journal block. Instead, just skip the block and return an error, which fails the mount and thus forces the user to run a full filesystem fsck. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
-
由 Dmitry Monakhov 提交于
In case of delalloc block i_disksize may be less than i_size. So we have to update i_disksize each time we allocated and submitted some blocks beyond i_disksize. We weren't doing this on the error paths, so fix this. testcase: xfstest generic/019 Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
-
由 Dan Carpenter 提交于
We can make the code a bit simpler because we know that "!retry" is zero. Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
- 28 8月, 2014 2 次提交
-
-
由 Dmitry Monakhov 提交于
After commit f282ac19 we use different transactions for preallocation and i_disksize update which result in complain from fsck after power-failure. spotted by generic/019. IMHO this is regression because fs becomes inconsistent, even more 'e2fsck -p' will no longer works (which drives admins go crazy) Same transaction requirement applies ctime,mtime updates testcase: xfstest generic/019 Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
-
由 Dmitry Monakhov 提交于
Currently we reserve only 4 blocks but in worst case scenario ext4_zero_partial_blocks() may want to zeroout and convert two non adjacent blocks. Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
-
- 27 8月, 2014 1 次提交
-
-
由 Trond Myklebust 提交于
When creating a new object on the NFS server, we should not be sending posix setacl requests unless the preceding posix_acl_create returned a non-trivial acl. Doing so, causes Solaris servers in particular to return an EINVAL. Fixes: 013cdf10 (nfs: use generic posix ACL infrastructure,,,) Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1132786 Cc: stable@vger.kernel.org # 3.14+ Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-