- 30 4月, 2017 6 次提交
-
-
由 Eric Biggers 提交于
Currently the case of writing via mmap to a file with inline data is not handled. This is maybe a rare case since it requires a writable memory map of a very small file, but it is trivial to trigger with on inline_data filesystem, and it causes the 'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in ext4_writepages() to be hit: mkfs.ext4 -O inline_data /dev/vdb mount /dev/vdb /mnt xfs_io -f /mnt/file \ -c 'pwrite 0 1' \ -c 'mmap -w 0 1m' \ -c 'mwrite 0 1' \ -c 'fsync' kernel BUG at fs/ext4/inode.c:2723! invalid opcode: 0000 [#1] SMP CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 task: ffff88003d3a8040 task.stack: ffffc90000300000 RIP: 0010:ext4_writepages+0xc89/0xf8a RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283 RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246 RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5 R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698 R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff FS: 00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0 Call Trace: ? _raw_spin_unlock+0x27/0x2a ? kvm_clock_read+0x1e/0x20 do_writepages+0x23/0x2c ? do_writepages+0x23/0x2c __filemap_fdatawrite_range+0x80/0x87 filemap_write_and_wait_range+0x67/0x8c ext4_sync_file+0x20e/0x472 vfs_fsync_range+0x8e/0x9f ? syscall_trace_enter+0x25b/0x2d0 vfs_fsync+0x1c/0x1e do_fsync+0x31/0x4a SyS_fsync+0x10/0x14 do_syscall_64+0x69/0x131 entry_SYSCALL64_slow_path+0x25/0x25 We could try to be smart and keep the inline data in this case, or at least support delayed allocation when allocating the block, but these solutions would be more complicated and don't seem worthwhile given how rare this case seems to be. So just fix the bug by calling ext4_convert_inline_data() when we're asked to make a page writable, so that any inline data gets evicted, with the block allocated immediately. Reported-by: NNick Alcock <nick.alcock@oracle.com> Cc: stable@vger.kernel.org Reviewed-by: NAndreas Dilger <adilger@dilger.ca> Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Eric Biggers 提交于
ext4_xattr_check_entry() was redundant with validation of the full xattr entries list in ext4_xattr_check_entries(), which all callers also did. ext4_xattr_check_entry() also didn't actually do correct validation; specifically, it never checked that the value doesn't overlap the xattr names, nor did it account for padding when checking whether the xattr value overflows the available space. So remove it to eliminate any potential confusion. Signed-off-by: NEric Biggers <ebiggers@google.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Eric Biggers 提交于
ext4_xattr_check_names() actually validates both the xattr names and values, not just the names. So rename it to ext4_xattr_check_entries() to avoid confusion. Signed-off-by: NEric Biggers <ebiggers@google.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Eric Biggers 提交于
There's no difference between ext4_xattr_list() and ext4_listxattr(), so merge them together and just have ext4_listxattr(). Some years ago they took different arguments, but that's no longer the case. Signed-off-by: NEric Biggers <ebiggers@google.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Eric Biggers 提交于
Constify static data in ext4 that is never (intentionally) modified so that it is placed in .rodata and benefits from memory protection. Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Eric Biggers 提交于
In the initial implementation of ext4 encryption, the filename was encrypted in ext4_insert_dentry(), which could fail and also required access to the 'dir' inode. Since then ext4 filename encryption has been changed to encrypt the filename earlier, so we can revert the additions to ext4_insert_dentry(). Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 03 4月, 2017 2 次提交
-
-
由 David Howells 提交于
Include a mask in struct stat to indicate which bits of stx_attributes the filesystem actually supports. This would also be useful if we add another system call that allows you to do a 'bulk attribute set' and pass in a statx struct with the masks appropriately set to say what you want to set. Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 David Howells 提交于
Return enhanced file attributes from the Ext4 filesystem. This includes the following: (1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME. (2) Certain FS_xxx_FL flags are mapped to stx_attribute flags. This requires that all ext4 inodes have a getattr call, not just some of them, so to this end, split the ext4_getattr() function and only call part of it where appropriate. Example output: [root@andromeda ~]# touch foo [root@andromeda ~]# chattr +ai foo [root@andromeda ~]# /tmp/test-statx foo statx(foo) = 0 results=fff Size: 0 Blocks: 0 IO Block: 4096 regular file Device: 08:12 Inode: 2101950 Links: 1 Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 Access: 2016-02-11 17:08:29.031795451+0000 Modify: 2016-02-11 17:08:29.031795451+0000 Change: 2016-02-11 17:11:11.987790114+0000 Birth: 2016-02-11 17:08:29.031795451+0000 Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----) Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 26 3月, 2017 2 次提交
-
-
由 Theodore Ts'o 提交于
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Theodore Ts'o 提交于
We must lock the xattr block before calculating or verifying the checksum in order to avoid spurious checksum failures. https://bugzilla.kernel.org/show_bug.cgi?id=193661Reported-by: NColin Ian King <colin.king@canonical.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
-
- 16 3月, 2017 2 次提交
-
-
由 Eric Biggers 提交于
If ext4_convert_inline_data() was called on a directory with inline data, the filesystem was left in an inconsistent state (as considered by e2fsck) because the file size was not increased to cover the new block. This happened because the inode was not marked dirty after i_disksize was updated. Fix this by marking the inode dirty at the end of ext4_finish_convert_inline_dir(). This bug was probably not noticed before because most users mark the inode dirty afterwards for other reasons. But if userspace executed FS_IOC_SET_ENCRYPTION_POLICY with invalid parameters, as exercised by 'kvm-xfstests -c adv generic/396', then the inode was never marked dirty after updating i_disksize. Cc: stable@vger.kernel.org # 3.10+ Fixes: 3c47d541Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Eric Biggers 提交于
The only use of the ->prepare_context() fscrypt operation was to allow ext4 to evict inline data from the inode before ->set_context(). However, there is no reason why this cannot be done as simply the first step in ->set_context(), and in fact it makes more sense to do it that way because then the policy modes and flags get validated before any real work is done. Therefore, merge ext4_prepare_context() into ext4_set_context(), and remove ->prepare_context(). Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 03 3月, 2017 1 次提交
-
-
由 David Howells 提交于
Add a system call to make extended file information available, including file creation and some attribute flags where available through the underlying filesystem. The getattr inode operation is altered to take two additional arguments: a u32 request_mask and an unsigned int flags that indicate the synchronisation mode. This change is propagated to the vfs_getattr*() function. Functions like vfs_stat() are now inline wrappers around new functions vfs_statx() and vfs_statx_fd() to reduce stack usage. ======== OVERVIEW ======== The idea was initially proposed as a set of xattrs that could be retrieved with getxattr(), but the general preference proved to be for a new syscall with an extended stat structure. A number of requests were gathered for features to be included. The following have been included: (1) Make the fields a consistent size on all arches and make them large. (2) Spare space, request flags and information flags are provided for future expansion. (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an __s64). (4) Creation time: The SMB protocol carries the creation time, which could be exported by Samba, which will in turn help CIFS make use of FS-Cache as that can be used for coherency data (stx_btime). This is also specified in NFSv4 as a recommended attribute and could be exported by NFSD [Steve French]. (5) Lightweight stat: Ask for just those details of interest, and allow a netfs (such as NFS) to approximate anything not of interest, possibly without going to the server [Trond Myklebust, Ulrich Drepper, Andreas Dilger] (AT_STATX_DONT_SYNC). (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks its cached attributes are up to date [Trond Myklebust] (AT_STATX_FORCE_SYNC). And the following have been left out for future extension: (7) Data version number: Could be used by userspace NFS servers [Aneesh Kumar]. Can also be used to modify fill_post_wcc() in NFSD which retrieves i_version directly, but has just called vfs_getattr(). It could get it from the kstat struct if it used vfs_xgetattr() instead. (There's disagreement on the exact semantics of a single field, since not all filesystems do this the same way). (8) BSD stat compatibility: Including more fields from the BSD stat such as creation time (st_btime) and inode generation number (st_gen) [Jeremy Allison, Bernd Schubert]. (9) Inode generation number: Useful for FUSE and userspace NFS servers [Bernd Schubert]. (This was asked for but later deemed unnecessary with the open-by-handle capability available and caused disagreement as to whether it's a security hole or not). (10) Extra coherency data may be useful in making backups [Andreas Dilger]. (No particular data were offered, but things like last backup timestamp, the data version number and the DOS archive bit would come into this category). (11) Allow the filesystem to indicate what it can/cannot provide: A filesystem can now say it doesn't support a standard stat feature if that isn't available, so if, for instance, inode numbers or UIDs don't exist or are fabricated locally... (This requires a separate system call - I have an fsinfo() call idea for this). (12) Store a 16-byte volume ID in the superblock that can be returned in struct xstat [Steve French]. (Deferred to fsinfo). (13) Include granularity fields in the time data to indicate the granularity of each of the times (NFSv4 time_delta) [Steve French]. (Deferred to fsinfo). (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags. Note that the Linux IOC flags are a mess and filesystems such as Ext4 define flags that aren't in linux/fs.h, so translation in the kernel may be a necessity (or, possibly, we provide the filesystem type too). (Some attributes are made available in stx_attributes, but the general feeling was that the IOC flags were to ext[234]-specific and shouldn't be exposed through statx this way). (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer, Michael Kerrisk]. (Deferred, probably to fsinfo. Finding out if there's an ACL or seclabal might require extra filesystem operations). (16) Femtosecond-resolution timestamps [Dave Chinner]. (A __reserved field has been left in the statx_timestamp struct for this - if there proves to be a need). (17) A set multiple attributes syscall to go with this. =============== NEW SYSTEM CALL =============== The new system call is: int ret = statx(int dfd, const char *filename, unsigned int flags, unsigned int mask, struct statx *buffer); The dfd, filename and flags parameters indicate the file to query, in a similar way to fstatat(). There is no equivalent of lstat() as that can be emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is also no equivalent of fstat() as that can be emulated by passing a NULL filename to statx() with the fd of interest in dfd. Whether or not statx() synchronises the attributes with the backing store can be controlled by OR'ing a value into the flags argument (this typically only affects network filesystems): (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this respect. (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise its attributes with the server - which might require data writeback to occur to get the timestamps correct. (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a network filesystem. The resulting values should be considered approximate. mask is a bitmask indicating the fields in struct statx that are of interest to the caller. The user should set this to STATX_BASIC_STATS to get the basic set returned by stat(). It should be noted that asking for more information may entail extra I/O operations. buffer points to the destination for the data. This must be 256 bytes in size. ====================== MAIN ATTRIBUTES RECORD ====================== The following structures are defined in which to return the main attribute set: struct statx_timestamp { __s64 tv_sec; __s32 tv_nsec; __s32 __reserved; }; struct statx { __u32 stx_mask; __u32 stx_blksize; __u64 stx_attributes; __u32 stx_nlink; __u32 stx_uid; __u32 stx_gid; __u16 stx_mode; __u16 __spare0[1]; __u64 stx_ino; __u64 stx_size; __u64 stx_blocks; __u64 __spare1[1]; struct statx_timestamp stx_atime; struct statx_timestamp stx_btime; struct statx_timestamp stx_ctime; struct statx_timestamp stx_mtime; __u32 stx_rdev_major; __u32 stx_rdev_minor; __u32 stx_dev_major; __u32 stx_dev_minor; __u64 __spare2[14]; }; The defined bits in request_mask and stx_mask are: STATX_TYPE Want/got stx_mode & S_IFMT STATX_MODE Want/got stx_mode & ~S_IFMT STATX_NLINK Want/got stx_nlink STATX_UID Want/got stx_uid STATX_GID Want/got stx_gid STATX_ATIME Want/got stx_atime{,_ns} STATX_MTIME Want/got stx_mtime{,_ns} STATX_CTIME Want/got stx_ctime{,_ns} STATX_INO Want/got stx_ino STATX_SIZE Want/got stx_size STATX_BLOCKS Want/got stx_blocks STATX_BASIC_STATS [The stuff in the normal stat struct] STATX_BTIME Want/got stx_btime{,_ns} STATX_ALL [All currently available stuff] stx_btime is the file creation time, stx_mask is a bitmask indicating the data provided and __spares*[] are where as-yet undefined fields can be placed. Time fields are structures with separate seconds and nanoseconds fields plus a reserved field in case we want to add even finer resolution. Note that times will be negative if before 1970; in such a case, the nanosecond fields will also be negative if not zero. The bits defined in the stx_attributes field convey information about a file, how it is accessed, where it is and what it does. The following attributes map to FS_*_FL flags and are the same numerical value: STATX_ATTR_COMPRESSED File is compressed by the fs STATX_ATTR_IMMUTABLE File is marked immutable STATX_ATTR_APPEND File is append-only STATX_ATTR_NODUMP File is not to be dumped STATX_ATTR_ENCRYPTED File requires key to decrypt in fs Within the kernel, the supported flags are listed by: KSTAT_ATTR_FS_IOC_FLAGS [Are any other IOC flags of sufficient general interest to be exposed through this interface?] New flags include: STATX_ATTR_AUTOMOUNT Object is an automount trigger These are for the use of GUI tools that might want to mark files specially, depending on what they are. Fields in struct statx come in a number of classes: (0) stx_dev_*, stx_blksize. These are local system information and are always available. (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino, stx_size, stx_blocks. These will be returned whether the caller asks for them or not. The corresponding bits in stx_mask will be set to indicate whether they actually have valid values. If the caller didn't ask for them, then they may be approximated. For example, NFS won't waste any time updating them from the server, unless as a byproduct of updating something requested. If the values don't actually exist for the underlying object (such as UID or GID on a DOS file), then the bit won't be set in the stx_mask, even if the caller asked for the value. In such a case, the returned value will be a fabrication. Note that there are instances where the type might not be valid, for instance Windows reparse points. (2) stx_rdev_*. This will be set only if stx_mode indicates we're looking at a blockdev or a chardev, otherwise will be 0. (3) stx_btime. Similar to (1), except this will be set to 0 if it doesn't exist. ======= TESTING ======= The following test program can be used to test the statx system call: samples/statx/test-statx.c Just compile and run, passing it paths to the files you want to examine. The file is built automatically if CONFIG_SAMPLES is enabled. Here's some example output. Firstly, an NFS directory that crosses to another FSID. Note that the AUTOMOUNT attribute is set because transiting this directory will cause d_automount to be invoked by the VFS. [root@andromeda ~]# /tmp/test-statx -A /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:26 Inode: 1703937 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------) Secondly, the result of automounting on that directory. [root@andromeda ~]# /tmp/test-statx /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:27 Inode: 2 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 02 3月, 2017 2 次提交
-
-
由 Ingo Molnar 提交于
sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Ingo Molnar 提交于
Add #include <linux/cred.h> dependencies to all .c files rely on sched.h doing that for them. Note that even if the count where we need to add extra headers seems high, it's still a net win, because <linux/sched.h> is included in over 2,200 files ... Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 28 2月, 2017 2 次提交
-
-
由 Masahiro Yamada 提交于
Fix typos and add the following to the scripts/spelling.txt: comsume||consume comsumer||consumer comsuming||consuming I see some variable names with this pattern, but this commit is only touching comment blocks to avoid unexpected impact. Link: http://lkml.kernel.org/r/1481573103-11329-19-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fabian Frederick 提交于
Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs branch. This patch also fixes multiple checkpatch warnings: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' Thanks to Andrew Morton for suggesting more appropriate function instead of macro. [geliangtang@gmail.com: truncate: use i_blocksize()] Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.beSigned-off-by: NFabian Frederick <fabf@skynet.be> Signed-off-by: NGeliang Tang <geliangtang@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 2月, 2017 3 次提交
-
-
由 Dave Jiang 提交于
Since the introduction of FAULT_FLAG_SIZE to the vm_fault flag, it has been somewhat painful with getting the flags set and removed at the correct locations. More than one kernel oops was introduced due to difficulties of getting the placement correctly. Remove the flag values and introduce an input parameter to huge_fault that indicates the size of the page entry. This makes the code easier to trace and should avoid the issues we see with the fault flags where removal of the flag was necessary in the fallback paths. Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com> Tested-by: NDan Williams <dan.j.williams@intel.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Jiang 提交于
Patch series "1G transparent hugepage support for device dax", v2. The following series implements support for 1G trasparent hugepage on x86 for device dax. The bulk of the code was written by Mathew Wilcox a while back supporting transparent 1G hugepage for fs DAX. I have forward ported the relevant bits to 4.10-rc. The current submission has only the necessary code to support device DAX. Comments from Dan Williams: So the motivation and intended user of this functionality mirrors the motivation and users of 1GB page support in hugetlbfs. Given expected capacities of persistent memory devices an in-memory database may want to reduce tlb pressure beyond what they can already achieve with 2MB mappings of a device-dax file. We have customer feedback to that effect as Willy mentioned in his previous version of these patches [1]. [1]: https://lkml.org/lkml/2016/1/31/52 Comments from Nilesh @ Oracle: There are applications which have a process model; and if you assume 10,000 processes attempting to mmap all the 6TB memory available on a server; we are looking at the following: processes : 10,000 memory : 6TB pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB pmd @ 2M page size: 120,000 / 512 = ~240GB pud @ 1G page size: 240GB / 512 = ~480MB As you can see with 2M pages, this system will use up an exorbitant amount of DRAM to hold the page tables; but the 1G pages finally brings it down to a reasonable level. Memory sizes will keep increasing; so this number will keep increasing. An argument can be made to convert the applications from process model to thread model, but in the real world that may not be always practical. Hopefully this helps explain the use case where this is valuable. This patch (of 3): In preparation for adding the ability to handle PUD pages, convert vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The vm_fault structure is extended to include a union of the different page table pointers that may be needed, and three flag bits are reserved to indicate which type of pointer is in the union. [ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()] Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com [dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path] Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com> Signed-off-by: NDave Jiang <dave.jiang@intel.com> Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Jiang 提交于
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to take a vma and vmf parameter when the vma already resides in vmf. Remove the vma parameter to simplify things. [arnd@arndb.de: fix ARM build] Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 2月, 2017 2 次提交
-
-
由 Dave Jiang 提交于
pmd_fault() and related functions really only need the vmf parameter since the additional parameters are all included in the vmf struct. Remove the additional parameter and simplify pmd_fault() and friends. Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Jiang 提交于
Instead of passing in multiple parameters in the pmd_fault() handler, a vmf can be passed in just like a fault() handler. This will simplify code and remove the need for the actual pmd fault handlers to allocate a vmf. Related functions are also modified to do the same. [dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off] Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 2月, 2017 1 次提交
-
-
由 Theodore Ts'o 提交于
It's very likely the file system independent ioctl name will be FS_IOC_SHUTDOWN, so let's use the same name for the ext4 ioctl name. Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 15 2月, 2017 2 次提交
-
-
由 Theodore Ts'o 提交于
It is OK for s_first_meta_bg to be equal to the number of block group descriptor blocks. (It rarely happens, but it shouldn't cause any problems.) https://bugzilla.kernel.org/show_bug.cgi?id=194567 Fixes: 3a4b77cdSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
-
由 Theodore Ts'o 提交于
Fix a BUG when the kernel tries to mount a file system constructed as follows: echo foo > foo.txt mke2fs -Fq -t ext4 -O encrypt foo.img 100 debugfs -w foo.img << EOF write foo.txt a set_inode_field a i_flags 0x80800 set_super_value s_last_orphan 12 quit EOF root@kvm-xfstests:~# mount -o loop foo.img /mnt [ 160.238770] ------------[ cut here ]------------ [ 160.240106] kernel BUG at /usr/projects/linux/ext4/fs/ext4/inode.c:3874! [ 160.240106] invalid opcode: 0000 [#1] SMP [ 160.240106] Modules linked in: [ 160.240106] CPU: 0 PID: 2547 Comm: mount Tainted: G W 4.10.0-rc3-00034-gcdd33b941b67 #227 [ 160.240106] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014 [ 160.240106] task: f4518000 task.stack: f47b6000 [ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 [ 160.240106] EFLAGS: 00010246 CPU: 0 [ 160.240106] EAX: 00000001 EBX: f7be4b50 ECX: f47b7dc0 EDX: 00000007 [ 160.240106] ESI: f43b05a8 EDI: f43babec EBP: f47b7dd0 ESP: f47b7dac [ 160.240106] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 160.240106] CR0: 80050033 CR2: bfd85b08 CR3: 34a00680 CR4: 000006f0 [ 160.240106] Call Trace: [ 160.240106] ext4_truncate+0x1e9/0x3e5 [ 160.240106] ext4_fill_super+0x286f/0x2b1e [ 160.240106] ? set_blocksize+0x2e/0x7e [ 160.240106] mount_bdev+0x114/0x15f [ 160.240106] ext4_mount+0x15/0x17 [ 160.240106] ? ext4_calculate_overhead+0x39d/0x39d [ 160.240106] mount_fs+0x58/0x115 [ 160.240106] vfs_kern_mount+0x4b/0xae [ 160.240106] do_mount+0x671/0x8c3 [ 160.240106] ? _copy_from_user+0x70/0x83 [ 160.240106] ? strndup_user+0x31/0x46 [ 160.240106] SyS_mount+0x57/0x7b [ 160.240106] do_int80_syscall_32+0x4f/0x61 [ 160.240106] entry_INT80_32+0x2f/0x2f [ 160.240106] EIP: 0xb76b919e [ 160.240106] EFLAGS: 00000246 CPU: 0 [ 160.240106] EAX: ffffffda EBX: 08053838 ECX: 08052188 EDX: 080537e8 [ 160.240106] ESI: c0ed0000 EDI: 00000000 EBP: 080537e8 ESP: bfa13660 [ 160.240106] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b [ 160.240106] Code: 59 8b 00 a8 01 0f 84 09 01 00 00 8b 07 66 25 00 f0 66 3d 00 80 75 61 89 f8 e8 3e e2 ff ff 84 c0 74 56 83 bf 48 02 00 00 00 75 02 <0f> 0b 81 7d e8 00 10 00 00 74 02 0f 0b 8b 43 04 8b 53 08 31 c9 [ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 SS:ESP: 0068:f47b7dac [ 160.317241] ---[ end trace d6a773a375c810a5 ]--- The problem is that when the kernel tries to truncate an inode in ext4_truncate(), it tries to clear any on-disk data beyond i_size. Without the encryption key, it can't do that, and so it triggers a BUG. E2fsck does *not* provide this service, and in practice most file systems have their orphan list processed by e2fsck, so to avoid crashing, this patch skips this step if we don't have access to the encryption key (which is the case when processing the orphan list; in all other cases, we will have the encryption key, or the kernel wouldn't have allowed the file to be opened). An open question is whether the fact that e2fsck isn't clearing the bytes beyond i_size causing problems --- and if we've lived with it not doing it for so long, can we drop this from the kernel replay of the orphan list in all cases (not just when we don't have the key for encrypted inodes). Addresses-Google-Bug: #35209576 Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 10 2月, 2017 2 次提交
-
-
由 Jan Kara 提交于
Avoid using stripe_width for sbi->s_stripe value if it is not actually set. It prevents using the stride for sbi->s_stripe. Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Jan Kara 提交于
When a filesystem is created using: mkfs.ext4 -b 4096 -E stride=512 <dev> and we try to allocate 64MB extent, we will end up directly in ext4_mb_complex_scan_group(). This is because the request is detected as power-of-two allocation (so we start in ext4_mb_regular_allocator() with ac_criteria == 0) however the check before ext4_mb_simple_scan_group() refuses the direct buddy scan because the allocation request is too large. Since cr == 0, the check whether we should use ext4_mb_scan_aligned() fails as well and we fall back to ext4_mb_complex_scan_group(). Fix the problem by checking for upper limit on power-of-two requests directly when detecting them. Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 09 2月, 2017 1 次提交
-
-
由 Christoph Hellwig 提交于
Unlike O_DIRECT DAX is not an optional opt-in feature selected by the application, so we'll have to provide the traditional synchronіzation of overlapping writes as we do for buffered writes. This was broken historically for DAX, but got fixed for ext2 and XFS as part of the iomap conversion. Fix up ext4 as well. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NJan Kara <jack@suse.cz>
-
- 08 2月, 2017 1 次提交
-
-
由 Eric Biggers 提交于
Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NRichard Weinberger <richard@nod.at>
-
- 07 2月, 2017 1 次提交
-
-
由 Eric Biggers 提交于
Previously, each filesystem configured without encryption support would define all the public fscrypt functions to their notsupp_* stubs. This list of #defines had to be updated in every filesystem whenever a change was made to the public fscrypt functions. To make things more maintainable now that we have three filesystems using fscrypt, split the old header fscrypto.h into several new headers. fscrypt_supp.h contains the real declarations and is included by filesystems when configured with encryption support, whereas fscrypt_notsupp.h contains the inline stubs and is included by filesystems when configured without encryption support. fscrypt_common.h contains common declarations needed by both. Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 06 2月, 2017 1 次提交
-
-
由 Theodore Ts'o 提交于
This ioctl is modeled after the xfs's XFS_IOC_GOINGDOWN ioctl. (In fact, it uses the same code points.) Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 05 2月, 2017 5 次提交
-
-
由 Theodore Ts'o 提交于
Add a shutdown bit that will cause ext4 processing to fail immediately with EIO. Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Theodore Ts'o 提交于
We are currently using one bit in s_resize_flags; rename it in order to allow more of the bits in that unsigned long for other purposes. Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Theodore Ts'o 提交于
If the file system requires journal recovery, and the device is read-ony, return EROFS to the mount system call. This allows xfstests generic/050 to pass. Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
-
由 Theodore Ts'o 提交于
If the journal is aborted, the needs_recovery feature flag should not be removed. Otherwise, it's the journal might not get replayed and this could lead to more data getting lost. Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
-
由 Theodore Ts'o 提交于
The write_end() function must always unlock the page and drop its ref count, even on an error. Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
-
- 03 2月, 2017 1 次提交
-
-
由 Jason A. Donenfeld 提交于
The "half md4" transform should not be used by any new code. And fortunately, it's only used now by ext4. Since ext4 supports several hashing methods, at some point it might be desirable to move to something like SipHash. As an intermediate step, remove half md4 from cryptohash.h and lib, and make it just a local function in ext4's hash.c. There's precedent for doing this; the other function ext can use for its hashes -- TEA -- is also implemented in the same place. Also, by being a local function, this might allow gcc to perform some additional optimizations. Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: NAndreas Dilger <adilger@dilger.ca> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 02 2月, 2017 1 次提交
-
-
由 Eric Biggers 提交于
In the case where the child's encryption context was inconsistent with its parent directory, we were using inode->i_sb and inode->i_ino after the inode had already been iput(). Fix this by doing the iput() in the correct places. Note: only ext4 had this bug, not f2fs and ubifs. Fixes: d9cdc903 ("ext4 crypto: enforce context consistency") Cc: stable@vger.kernel.org Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 31 1月, 2017 1 次提交
-
-
由 Christoph Hellwig 提交于
Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 28 1月, 2017 1 次提交
-
-
由 Jan Kara 提交于
ext4_journalled_write_end() did not propely handle all the cases when generic_perform_write() did not copy all the data into the target page and could mark buffers with uninitialized contents as uptodate and dirty leading to possible data corruption (which would be quickly fixed by generic_perform_write() retrying the write but still). Fix the problem by carefully handling the case when the page that is written to is not uptodate. CC: stable@vger.kernel.org Reported-by: NAl Viro <viro@ZenIV.linux.org.uk> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-