- 18 10月, 2007 17 次提交
-
-
由 Andreas Dilger 提交于
In pass1 of e2fsck, every inode table in the fileystem is scanned and checked, regardless of whether it is in use. This is this the most time consuming part of the filesystem check. The unintialized block group feature can greatly reduce e2fsck time by eliminating checking of uninitialized inodes. With this feature, there is a a high water mark of used inodes for each block group. Block and inode bitmaps can be uninitialized on disk via a flag in the group descriptor to avoid reading or scanning them at e2fsck time. A checksum of each group descriptor is used to ensure that corruption in the group descriptor's bit flags does not cause incorrect operation. The feature is enabled through a mkfs option mke2fs /dev/ -O uninit_groups A patch adding support for uninitialized block groups to e2fsprogs tools has been posted to the linux-ext4 mailing list. The patches have been stress tested with fsstress and fsx. In performance tests testing e2fsck time, we have seen that e2fsck time on ext3 grows linearly with the total number of inodes in the filesytem. In ext4 with the uninitialized block groups feature, the e2fsck time is constant, based solely on the number of used inodes rather than the total inode count. Since typical ext4 filesystems only use 1-10% of their inodes, this feature can greatly reduce e2fsck time for users. With performance improvement of 2-20 times, depending on how full the filesystem is. The attached graph shows the major improvements in e2fsck times in filesystems with a large total inode count, but few inodes in use. In each group descriptor if we have EXT4_BG_INODE_UNINIT set in bg_flags: Inode table is not initialized/used in this group. So we can skip the consistency check during fsck. EXT4_BG_BLOCK_UNINIT set in bg_flags: No block in the group is used. So we can skip the block bitmap verification for this group. We also add two new fields to group descriptor as a part of uninitialized group patch. __le16 bg_itable_unused; /* Unused inodes count */ __le16 bg_checksum; /* crc16(sb_uuid+group+desc) */ bg_itable_unused: If we have EXT4_BG_INODE_UNINIT not set in bg_flags then bg_itable_unused will give the offset within the inode table till the inodes are used. This can be used by fsck to skip list of inodes that are marked unused. bg_checksum: Now that we depend on bg_flags and bg_itable_unused to determine the block and inode usage, we need to make sure group descriptor is not corrupt. We add checksum to group descriptor to detect corruption. If the descriptor is found to be corrupt, we mark all the blocks and inodes in the group used. Signed-off-by: NAvantika Mathur <mathur@us.ibm.com> Signed-off-by: NAndreas Dilger <adilger@clusterfs.com> Signed-off-by: NMingming Cao <cmm@us.ibm.com> Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
-
由 Eric Sandeen 提交于
CONFIG_EXT4_INDEX is not an exposed config option in the kernel, and it is unconditionally defined in ext4_fs.h. tune2fs is already able to turn off dir indexing, so at this point it's just cluttering up the code. Remove it. Signed-off-by: NEric Sandeen <sandeen@redhat.com> Signed-off-by: NMingming Cao <cmm@us.ibm.com> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Coly Li 提交于
Fragment support in ext2/3/4 was never implemented, and it probably will never be implemented. So remove it from ext4. Signed-off-by: NColy Li <coyli@suse.de> Acked-by: NAndreas Dilger <adilger@clusterfs.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Jose R. Santos 提交于
Mostly stolen from akpm's JBD cleanup patch. - use `#ifdef foo' instead of `#if defined(foo)' - Make journal_enable_debug __read_mostly just for the heck of it - Make jbd_debugfs_dir and jbd_debug static - debugfs_remove(NULL) is legal: remove unneeded tests - remove unnecessary empty loops Signed-off-by: NJose R. Santos <jrs@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Jan Kara 提交于
We should really call journal_abort() and not __journal_abort_hard() in case of errors. The latter call does not record the error in the journal superblock and thus filesystem won't be marked as with errors later (and user could happily mount it without any warning). Signed-off-by: NJan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Mingming Cao 提交于
change JBD_XXX macros to JBD2_XXX in JBD2/Ext4 Signed-off-by: NMingming Cao <cmm@us.ibm.com> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Mingming Cao 提交于
Convert kmalloc to kzalloc() and get rid of the memset(). Signed-off-by: NMingming Cao <cmm@us.ibm.com>
-
由 Mingming Cao 提交于
This patch cleans up jbd_kmalloc and replace it with kmalloc directly Signed-off-by: NMingming Cao <cmm@us.ibm.com>
-
由 Mingming Cao 提交于
This patch cleans up jbd_kmalloc and replace it with kmalloc directly Signed-off-by: NMingming Cao <cmm@us.ibm.com>
-
由 Mingming Cao 提交于
JBD2: Replace slab allocations with page allocations JBD2 allocate memory for committed_data and frozen_data from slab. However JBD2 should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NMingming Cao <cmm@us.ibm.com>
-
由 Mingming Cao 提交于
JBD: Replace slab allocations with page allocations JBD allocate memory for committed_data and frozen_data from slab. However JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NMingming Cao <cmm@us.ibm.com>
-
由 Eric Van Hensbergen 提交于
This patch moves transport dynamic registration and matching to the net module to prevent a bad Kconfig dependency between the net and fs 9p modules. Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>
-
由 Eric Van Hensbergen 提交于
Loose mode in 9p utilizes the page cache without respecting coherency with the server. Any writes previously invaldiated the entire mapping for a file. This patch softens the behavior to only invalidate the region of the actual write. Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>
-
由 Latchesar Ionkov 提交于
The 9P2000 protocol requires the authentication and permission checks to be done in the file server. For that reason every user that accesses the file server tree has to authenticate and attach to the server separately. Multiple users can share the same connection to the server. Currently v9fs does a single attach and executes all I/O operations as a single user. This makes using v9fs in multiuser environment unsafe as it depends on the client doing the permission checking. This patch improves the 9P2000 support by allowing every user to attach separately. The patch defines three modes of access (new mount option 'access'): - attach-per-user (access=user) (default mode for 9P2000.u) If a user tries to access a file served by v9fs for the first time, v9fs sends an attach command to the server (Tattach) specifying the user. If the attach succeeds, the user can access the v9fs tree. As there is no uname->uid (string->integer) mapping yet, this mode works only with the 9P2000.u dialect. - allow only one user to access the tree (access=<uid>) Only the user with uid can access the v9fs tree. Other users that attempt to access it will get EPERM error. - do all operations as a single user (access=any) (default for 9P2000) V9fs does a single attach and all operations are done as a single user. If this mode is selected, the v9fs behavior is identical with the current one. Signed-off-by: NLatchesar Ionkov <lucho@ionkov.net> Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>
-
由 Latchesar Ionkov 提交于
Change the names of 'uid' and 'gid' parameters to the more appropriate 'dfltuid' and 'dfltgid'. This also sets the default uid/gid to -2 (aka nfsnobody) Signed-off-by: NLatchesar Ionkov <lucho@ionkov.net> Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>
-
由 Latchesar Ionkov 提交于
Create more general flags field in the v9fs_session_info struct and move the 'extended' flag as a bit in the flags. Signed-off-by: NLatchesar Ionkov <lucho@ionkov.net> Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>
-
由 Eric Van Hensbergen 提交于
This patch abstracts out the interfaces to underlying transports so that new transports can be added as modules. This should also allow kernel configuration of transports without ifdef-hell. Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>
-
- 17 10月, 2007 23 次提交
-
-
由 Adrian Bunk 提交于
This patch contains the following cleanups that are now possible: - remove the unused security_operations->inode_xattr_getsuffix - remove the no longer used security_operations->unregister_security - remove some no longer required exit code - remove a bunch of no longer used exports Signed-off-by: NAdrian Bunk <bunk@kernel.org> Acked-by: NJames Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Serge E. Hallyn 提交于
Implement file posix capabilities. This allows programs to be given a subset of root's powers regardless of who runs them, without having to use setuid and giving the binary all of root's powers. This version works with Kaigai Kohei's userspace tools, found at http://www.kaigai.gr.jp/index.php. For more information on how to use this patch, Chris Friedhoff has posted a nice page at http://www.friedhoff.org/fscaps.html. Changelog: Nov 27: Incorporate fixes from Andrew Morton (security-introduce-file-caps-tweaks and security-introduce-file-caps-warning-fix) Fix Kconfig dependency. Fix change signaling behavior when file caps are not compiled in. Nov 13: Integrate comments from Alexey: Remove CONFIG_ ifdef from capability.h, and use %zd for printing a size_t. Nov 13: Fix endianness warnings by sparse as suggested by Alexey Dobriyan. Nov 09: Address warnings of unused variables at cap_bprm_set_security when file capabilities are disabled, and simultaneously clean up the code a little, by pulling the new code into a helper function. Nov 08: For pointers to required userspace tools and how to use them, see http://www.friedhoff.org/fscaps.html. Nov 07: Fix the calculation of the highest bit checked in check_cap_sanity(). Nov 07: Allow file caps to be enabled without CONFIG_SECURITY, since capabilities are the default. Hook cap_task_setscheduler when !CONFIG_SECURITY. Move capable(TASK_KILL) to end of cap_task_kill to reduce audit messages. Nov 05: Add secondary calls in selinux/hooks.c to task_setioprio and task_setscheduler so that selinux and capabilities with file cap support can be stacked. Sep 05: As Seth Arnold points out, uid checks are out of place for capability code. Sep 01: Define task_setscheduler, task_setioprio, cap_task_kill, and task_setnice to make sure a user cannot affect a process in which they called a program with some fscaps. One remaining question is the note under task_setscheduler: are we ok with CAP_SYS_NICE being sufficient to confine a process to a cpuset? It is a semantic change, as without fsccaps, attach_task doesn't allow CAP_SYS_NICE to override the uid equivalence check. But since it uses security_task_setscheduler, which elsewhere is used where CAP_SYS_NICE can be used to override the uid equivalence check, fixing it might be tough. task_setscheduler note: this also controls cpuset:attach_task. Are we ok with CAP_SYS_NICE being used to confine to a cpuset? task_setioprio task_setnice sys_setpriority uses this (through set_one_prio) for another process. Need same checks as setrlimit Aug 21: Updated secureexec implementation to reflect the fact that euid and uid might be the same and nonzero, but the process might still have elevated caps. Aug 15: Handle endianness of xattrs. Enforce capability version match between kernel and disk. Enforce that no bits beyond the known max capability are set, else return -EPERM. With this extra processing, it may be worth reconsidering doing all the work at bprm_set_security rather than d_instantiate. Aug 10: Always call getxattr at bprm_set_security, rather than caching it at d_instantiate. [morgan@kernel.org: file-caps clean up for linux/capability.h] [bunk@kernel.org: unexport cap_inode_killpriv] Signed-off-by: NSerge E. Hallyn <serue@us.ibm.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: NAndrew Morgan <morgan@kernel.org> Signed-off-by: NAdrian Bunk <bunk@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Hansen 提交于
I'm going to be modifying nfsd_rename() shortly to support read-only bind mounts. This #ifdef is around the area I'm patching, and it starts to get really ugly if I just try to add my new code by itself. Using this little helper makes things a lot cleaner to use. Signed-off-by: NDave Hansen <haveblue@us.ibm.com> Acked-by: NChristoph Hellwig <hch@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Hansen 提交于
First of all, this makes the structure jumping look a little bit cleaner. So, this stands alone as a tiny cleanup. But, we also need 'mnt' by itself a few more times later in this series, so this isn't _just_ a cleanup. Signed-off-by: NDave Hansen <haveblue@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Hansen 提交于
may_open() calls vfs_permission() before it does checks for IS_RDONLY(inode). It checks _again_ inside of vfs_permission(). The check inside of vfs_permission() is going away eventually. With the mnt_want/drop_write() functions, all of the r/o checks (except for this one) are consistently done before calling permission(). Because of this, I'd like to use permission() to hold a debugging check to make sure that the mnt_want/drop_write() calls are actually being made. So, to do this: 1. remove the IS_RDONLY() check from permission() 2. enforce that you must mnt_want_write() before even calling permission() 3. actually add the debugging check to permission() We need to rearrange may_open() to do r/o checks before calling permission(). Here's the patch. Signed-off-by: NDave Hansen <haveblue@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Hansen 提交于
Why do we need r/o bind mounts? This feature allows a read-only view into a read-write filesystem. In the process of doing that, it also provides infrastructure for keeping track of the number of writers to any given mount. This has a number of uses. It allows chroots to have parts of filesystems writable. It will be useful for containers in the future because users may have root inside a container, but should not be allowed to write to somefilesystems. This also replaces patches that vserver has had out of the tree for several years. It allows security enhancement by making sure that parts of your filesystem read-only (such as when you don't trust your FTP server), when you don't want to have entire new filesystems mounted, or when you want atime selectively updated. I've been using the following script to test that the feature is working as desired. It takes a directory and makes a regular bind and a r/o bind mount of it. It then performs some normal filesystem operations on the three directories, including ones that are expected to fail, like creating a file on the r/o mount. This patch: Some filesystems forego the vfs and may_open() and create their own 'struct file's. This patch creates a couple of helper functions which can be used by these filesystems, and will provide a unified place which the r/o bind mount code may patch. Also, rename an existing, static-scope init_file() to a less generic name. Signed-off-by: NDave Hansen <haveblue@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
Define a new function fuse_refresh_attributes() that conditionally refreshes the attributes based on the validity timeout. In fuse_permission() only refresh the attributes for checking the execute bits if necessary. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
Don't return -ENOENT for a read() on the fuse device when the request was aborted. Instead return -ENODEV, meaning the filesystem has been force-umounted or aborted. Previously ENOENT meant that the request was interrupted, but now the 'aborted' flag is not set in case of interrupts. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
Don't set 'aborted' flag on a request if it's interrupted. We have to wait for the answer anyway, and this would only a very little time while copying the reply. This means, that write() on the fuse device will not return -ENOENT during normal operation, only if the filesystem is aborted by a forced umount or through the fusectl interface. This could simplify userspace code somewhat when backward compatibility with earlier kernel versions is not required. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
Move dput/mntput pair from request_end() to fuse_release_end(), because there's no other place they are used. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
The VFS checks sticky bits on the parent directory even if the filesystem defines it's own ->permission(). In some situations (sshfs, mountlo, etc) the user does have permission to delete a file even if the attribute based checking would not allow it. So work around this by storing the permission bits separately and returning them in stat(), but cutting the permission bits off from inode->i_mode. This is slightly hackish, but it's probably not worth it to add new infrastructure in VFS and a slight performance penalty for all filesystems, just for the sake of fuse. [Jan Engelhardt] cosmetic fixes Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Cc: Jan Engelhardt <jengelh@linux01.gwdg.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
fuse_permission() didn't refresh inode attributes before using them, even if the validity has already expired. Thanks to Junjiro Okajima for spotting this. Also remove some old code to unconditionally refresh the attributes on the root inode. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
Aufs seems to depend on a positive i_nlink value. So fill in a dummy but sane value for the root inode at mount time. The inode attributes are refreshed with the correct values at the first opportunity. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
Other than truncate, there are two cases, when fuse tries to get rid of cached pages: a) in open, if KEEP_CACHE flag is not set b) in getattr, if file size changed spontaneously Until now invalidate_mapping_pages() were used, which didn't get rid of mapped pages. This is wrong, and becomes more wrong as dirty pages are introduced. So instead properly invalidate all pages with invalidate_inode_pages2(). Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
Memory mappings were only truncated on an explicit truncate, but not when the file size was changed externally. Fix this by moving the truncation code from fuse_setattr to fuse_change_attributes. Yes, there are races between write and and external truncation, but we can't really do anything about them. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
Make lifetime of 'struct fuse_file' independent from 'struct file' by adding a reference counter and destructor. This will enable asynchronous page writeback, where it cannot be guaranteed, that the file is not released while a request with this file handle is being served. The actual RELEASE request is only sent when there are no more references to the fuse_file. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
Use wake_up_all instead of wake_up in put_reserved_req(), otherwise it is possible that the right task is not woken up. Also create a separate reserved_req_waitq in addition to the blocked_waitq, since they fulfill totally separate functions. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
Set the read and write congestion state if the request queue is close to blocking, and clear it when it's not. This prevents unnecessary blocking in readahead and (when writable mmaps are allowed) writeback. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Martin J. Bligh 提交于
Val's cross-port of the ext3 reservations code into ext2. [mbligh@mbligh.org: Small type error for printk [akpm@linux-foundation.org: fix types, sync with ext3] [mbligh@mbligh.org: Bring ext2 reservations code in line with latest ext3] [akpm@linux-foundation.org: kill noisy printk] [akpm@linux-foundation.org: remember to dirty the gdp's block] [akpm@linux-foundation.org: cross-port the missed 5dea5176] [akpm@linux-foundation.org: cross-port e6022603] [akpm@linux-foundation.org: Port the omitted 08fb306f] [akpm@linux-foundation.org: Backport the missed 20acaa18] [akpm@linux-foundation.org: fixes] [cmm@us.ibm.com: fix reservation extension] [bunk@stusta.de: make ext2_get_blocks() static] [hugh@veritas.com: fix hang] [hugh@veritas.com: ext2_new_blocks should reset the reservation window size] [hugh@veritas.com: ext2 balloc: fix off-by-one against rsv_end] [hugh@veritas.com: grp_goal 0 is a genuine goal (unlike -1), so ext2_try_to_allocate_with_rsv should treat it as such] [hugh@veritas.com: rbtree usage cleanup] [pbadari@us.ibm.com: Fix for ext2 reservation] [bunk@kernel.org: remove fs/ext2/balloc.c:reserve_blocks()] [hugh@veritas.com: ext2 balloc: use io_error label] Cc: "Martin J. Bligh" <mbligh@mbligh.org> Cc: Valerie Henson <val_henson@linux.intel.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: NAdrian Bunk <bunk@stusta.de> Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NBadari Pulavarty <pbadari@us.ibm.com> Signed-off-by: NAdrian Bunk <bunk@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joern Engel 提交于
I_LOCK was used for several unrelated purposes, which caused deadlock situations in certain filesystems as a side effect. One of the purposes now uses the new I_SYNC bit. Also document the various bits and change their order from historical to logical. [bunk@stusta.de: make fs/inode.c:wake_up_inode() static] Signed-off-by: NJoern Engel <joern@wohnheim.fh-wedel.de> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: David Chinner <dgc@sgi.com> Cc: Anton Altaparmakov <aia21@cam.ac.uk> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: NAdrian Bunk <bunk@stusta.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
After making dirty a 100M file, the normal behavior is to start the writeback for all data after 30s delays. But sometimes the following happens instead: - after 30s: ~4M - after 5s: ~4M - after 5s: all remaining 92M Some analyze shows that the internal io dispatch queues goes like this: s_io s_more_io ------------------------- 1) 100M,1K 0 2) 1K 96M 3) 0 96M 1) initial state with a 100M file and a 1K file 2) 4M written, nr_to_write <= 0, so write more 3) 1K written, nr_to_write > 0, no more writes(BUG) nr_to_write > 0 in (3) fools the upper layer to think that data have all been written out. The big dirty file is actually still sitting in s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and let the loop in generic_sync_sb_inodes() continue: this may starve newly expired inodes in s_dirty. It is also not an option to draw inodes from both s_more_io and s_dirty, an let the loop go on: this might lead to live locks, and might also starve other superblocks in sync time(well kupdate may still starve some superblocks, that's another bug). We have to return when a full scan of s_io completes. So nr_to_write > 0 does not necessarily mean that "all data are written". This patch introduces a flag writeback_control.more_io to indicate this situation. With it the big dirty file no longer has to wait for the next kupdate invocation 5s later. Cc: David Chinner <dgc@sgi.com> Cc: Ken Chen <kenchen@google.com> Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug: > The following strange behavior can be observed: > > 1. large file is written > 2. after 30 seconds, nr_dirty goes down by 1024 > 3. then for some time (< 30 sec) nothing happens (disk idle) > 4. then nr_dirty again goes down by 1024 > 5. repeat from 3. until whole file is written > > So basically a 4Mbyte chunk of the file is written every 30 seconds. > I'm quite sure this is not the intended behavior. It can be produced by the following test scheme: # cat bin/test-writeback.sh grep nr_dirty /proc/vmstat echo 1 > /proc/sys/fs/inode_debug dd if=/dev/zero of=/var/x bs=1K count=204800& while true; do grep nr_dirty /proc/vmstat; sleep 1; done # bin/test-writeback.sh nr_dirty 19207 nr_dirty 19207 nr_dirty 30924 204800+0 records in 204800+0 records out 209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s nr_dirty 47150 nr_dirty 47141 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47205 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47215 nr_dirty 47216 nr_dirty 47216 nr_dirty 47216 nr_dirty 47154 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47134 nr_dirty 47134 nr_dirty 47135 nr_dirty 47135 nr_dirty 47135 nr_dirty 46097 <== -1038 nr_dirty 46098 nr_dirty 46098 nr_dirty 46098 [...] nr_dirty 46091 nr_dirty 46092 nr_dirty 46092 nr_dirty 45069 <== -1023 nr_dirty 45056 nr_dirty 45056 nr_dirty 45056 [...] nr_dirty 37822 nr_dirty 36799 <== -1023 [...] nr_dirty 36781 nr_dirty 35758 <== -1023 [...] nr_dirty 34708 nr_dirty 33672 <== -1024 [...] nr_dirty 33692 nr_dirty 32669 <== -1023 % ls -li /var/x 847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x % dmesg|grep 847824 # generated by a debug printk [ 529.263184] redirtied inode 847824 line 548 [ 564.250872] redirtied inode 847824 line 548 [ 594.272797] redirtied inode 847824 line 548 [ 629.231330] redirtied inode 847824 line 548 [ 659.224674] redirtied inode 847824 line 548 [ 689.219890] redirtied inode 847824 line 548 [ 724.226655] redirtied inode 847824 line 548 [ 759.198568] redirtied inode 847824 line 548 # line 548 in fs/fs-writeback.c: 543 if (wbc->pages_skipped != pages_skipped) { 544 /* 545 * writeback is not making progress due to locked 546 * buffers. Skip this inode for now. 547 */ 548 redirty_tail(inode); 549 } More debug efforts show that __block_write_full_page() never has the chance to call submit_bh() for that big dirty file: the buffer head is *clean*. So basicly no page io is issued by __block_write_full_page(), hence pages_skipped goes up. Also the comment in generic_sync_sb_inodes(): 544 /* 545 * writeback is not making progress due to locked 546 * buffers. Skip this inode for now. 547 */ and the comment in __block_write_full_page(): 1713 /* 1714 * The page was marked dirty, but the buffers were 1715 * clean. Someone wrote them back by hand with 1716 * ll_rw_block/submit_bh. A rare case. 1717 */ do not quite agree with each other. The page writeback should be skipped for 'locked buffer', but here it is 'clean buffer'! This patch fixes this bug. Though I'm not sure why __block_write_full_page() is called only to do nothing and who actually issued the writeback for us. This is the two possible new behaviors after the patch: 1) pretty nice: wait 30s and write ALL:) 2) not so good: - during the dd: ~16M - after 30s: ~4M - after 5s: ~4M - after 5s: ~176M The next patch will fix case (2). Cc: David Chinner <dgc@sgi.com> Cc: Ken Chen <kenchen@google.com> Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: NDavid Chinner <dgc@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
NTFS's if-condition on dirty inodes is not complete. Fix it with sb_has_dirty_inodes(). Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Ken Chen <kenchen@google.com> Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-