- 01 9月, 2017 1 次提交
-
-
由 Dan Williams 提交于
The ->iomap_begin() operation is a hot path, so cache the fs_dax_get_by_host() result at mount time to avoid the incurring the hash lookup overhead on a per-i/o basis. Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Reviewed-by: NJan Kara <jack@suse.cz> Reported-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 06 8月, 2017 4 次提交
-
-
由 Miao Xie 提交于
When upgrading from old format, try to set project id to old file first time, it will return EOVERFLOW, but if that file is dirtied(touch etc), changing project id will be allowed, this might be confusing for users, we could try to expand @i_extra_isize here too. Reported-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NMiao Xie <miaoxie@huawei.com> Signed-off-by: NWang Shilong <wshilong@ddn.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Miao Xie 提交于
Current ext4_expand_extra_isize just tries to expand extra isize, if someone is holding xattr lock or some check fails, it will give up. So rename its name to ext4_try_to_expand_extra_isize. Besides that, we clean up unnecessary check and move some relative checks into it. Signed-off-by: NMiao Xie <miaoxie@huawei.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NWang Shilong <wshilong@ddn.com>
-
由 Miao Xie 提交于
We should avoid the contention between the i_extra_isize update and the inline data insertion, so move the xattr trylock in front of i_extra_isize update. Signed-off-by: NMiao Xie <miaoxie@huawei.com> Reviewed-by: NWang Shilong <wshilong@ddn.com>
-
由 Tahsin Erdogan 提交于
ext4_xattr_inode_read() currently reads each block sequentially while waiting for io operation to complete before moving on to the next block. This prevents request merging in block layer. Add a ext4_bread_batch() function that starts reads for all blocks then optionally waits for them to complete. A similar logic is used in ext4_find_entry(), so update that code to use the new function. Signed-off-by: NTahsin Erdogan <tahsin@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 31 7月, 2017 1 次提交
-
-
由 Eric Whitney 提交于
Commit 914f82a3 "ext4: refactor direct IO code" deleted ext4_ext_direct_IO(), but references to that function remain in comments. Update them to refer to ext4_direct_IO_write(). Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NAndreas Dilger <adilger@dilger.ca> Reviewed-by: NJan Kara <jack@suse.cz>
-
- 04 7月, 2017 1 次提交
-
-
由 Tahsin Erdogan 提交于
ext4_inode_info->i_data is the storage area for 4 types of data: a) Extents data b) Inline data c) Block map d) Fast symlink data (symlink length < 60) Extents data case is positively identified by EXT4_INODE_EXTENTS flag. Inline data case is also obvious because of EXT4_INODE_INLINE_DATA flag. Distinguishing c) and d) however requires additional logic. This currently relies on i_blocks count. After subtracting external xattr block from i_blocks, if it is greater than 0 then we know that some data blocks exist, so there must be a block map. This logic got broken after ea_inode feature was added. That feature charges the data blocks of external xattr inodes to the referencing inode and so adds them to the i_blocks. To fix this, we could subtract ea_inode blocks by iterating through all xattr entries and then check whether remaining i_blocks count is zero. Besides being complicated, this won't change the fact that the current way of distinguishing between c) and d) is fragile. The alternative solution is to test whether i_size is less than 60 to determine fast symlink case. ext4_symlink() uses the same test to decide whether to store the symlink in i_data. There is one caveat to address before this can work though. If an inode's i_nlink is zero during eviction, its i_size is set to zero and its data is truncated. If system crashes before inode is removed from the orphan list, next boot orphan cleanup may find the inode with zero i_size. So, a symlink that had its data stored in a block may now appear to be a fast symlink. The solution used in this patch is to treat i_size = 0 as a non-fast symlink case. A zero sized symlink is not legal so the only time this can happen is the mentioned scenario. This is also logically correct because a i_size = 0 symlink has no data stored in i_data. Suggested-by: NAndreas Dilger <adilger@dilger.ca> Signed-off-by: NTahsin Erdogan <tahsin@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
-
- 24 6月, 2017 1 次提交
-
-
由 Eric Biggers 提交于
Currently, filesystems allow truncate(2) on an encrypted file without the encryption key. However, it's impossible to correctly handle the case where the size being truncated to is not a multiple of the filesystem block size, because that would require decrypting the final block, zeroing the part beyond i_size, then encrypting the block. As other modifications to encrypted file contents are prohibited without the key, just prohibit truncate(2) as well, making it fail with ENOKEY. Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 23 6月, 2017 1 次提交
-
-
由 Jan Kara 提交于
These days inode reclaim calls evict_inode() only when it has no pages in the mapping. In that case it is not necessary to wait for transaction commit in ext4_evict_inode() as there can be no pages waiting to be committed. So avoid unnecessary transaction waiting in that case. We still have to keep the check for the case where ext4_evict_inode() gets called from other paths (e.g. umount) where inode still can have some page cache pages. Reported-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 22 6月, 2017 8 次提交
-
-
由 Tahsin Erdogan 提交于
Ext4 ea_inode feature allows storing xattr values in external inodes to be able to store values that are bigger than a block in size. Ext4 also has deduplication support for these type of inodes. With deduplication, the actual storage waste is eliminated but the users of such inodes are still charged full quota for the inodes as if there was no sharing happening in the background. This design requires ext4 to manually charge the users because the inodes are shared. An implication of this is that, if someone calls chown on a file that has such references we need to transfer the quota for the file and xattr inodes. Current dquot_transfer() function implicitly transfers one inode charge. With ea_inode feature, we would like to transfer multiple inode charges. Add get_inode_usage callback which can interrogate the total number of inodes that were charged for a given inode. [ Applied fix from Colin King to make sure the 'ret' variable is initialized on the successful return path. Detected by CoverityScan, CID#1446616 ("Uninitialized scalar variable") --tytso] Signed-off-by: NTahsin Erdogan <tahsin@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NColin Ian King <colin.king@canonical.com> Acked-by: NJan Kara <jack@suse.cz>
-
由 Tahsin Erdogan 提交于
Ext4 now supports xattr values that are up to 64k in size (vfs limit). Large xattr values are stored in external inodes each one holding a single value. Once written the data blocks of these inodes are immutable. The real world use cases are expected to have a lot of value duplication such as inherited acls etc. To reduce data duplication on disk, this patch implements a deduplicator that allows sharing of xattr inodes. The deduplication is based on an in-memory hash lookup that is a best effort sharing scheme. When a xattr inode is read from disk (i.e. getxattr() call), its crc32c hash is added to a hash table. Before creating a new xattr inode for a value being set, the hash table is checked to see if an existing inode holds an identical value. If such an inode is found, the ref count on that inode is incremented. On value removal the ref count is decremented and if it reaches zero the inode is deleted. The quota charging for such inodes is manually managed. Every reference holder is charged the full size as if there was no sharing happening. This is consistent with how xattr blocks are also charged. [ Fixed up journal credits calculation to handle inline data and the rare case where an shared xattr block can get freed when two thread race on breaking the xattr block sharing. --tytso ] Signed-off-by: NTahsin Erdogan <tahsin@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Tahsin Erdogan 提交于
During inode deletion, the number of journal credits that will be needed is hard to determine. For that reason we have journal extend/restart calls in several places. Whenever a transaction is restarted, filesystem must be in a consistent state because there is no atomicity guarantee beyond a restart call. Add ext4_xattr_ensure_credits() helper function which takes care of journal extend/restart logic. It also handles getting jbd2 write access and dirty metadata calls. This function is called at every iteration of handling an ea_inode reference. Signed-off-by: NTahsin Erdogan <tahsin@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Tahsin Erdogan 提交于
IS_NOQUOTA() indicates whether quota is disabled for an inode. Ext4 also uses it to check whether an inode is for a quota file. The distinction currently doesn't matter because quota is disabled only for the quota files. When we start disabling quota for other inodes in the future, we will want to make the distinction clear. Replace IS_NOQUOTA() call with ext4_is_quota_file() at places where we are checking for quota files. Signed-off-by: NTahsin Erdogan <tahsin@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Tahsin Erdogan 提交于
Tracking struct inode * rather than the inode number eliminates the repeated ext4_xattr_inode_iget() call later. The second call cannot fail in practice but still requires explanation when it wants to ignore the return value. Avoid the trouble and make things simple. Signed-off-by: NTahsin Erdogan <tahsin@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Tahsin Erdogan 提交于
Setting a large xattr value may require writing the attribute contents to an external inode. In this case we may need to lock the xattr inode along with the parent inode. This doesn't pose a deadlock risk because xattr inodes are not directly visible to the user and their access is restricted. Assign a lockdep subclass to xattr inode's lock. ============================================ WARNING: possible recursive locking detected 4.12.0-rc1+ #740 Not tainted -------------------------------------------- python/1822 is trying to acquire lock: (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff804912ca>] ext4_xattr_set_entry+0x65a/0x7b0 but task is already holding lock: (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&sb->s_type->i_mutex_key#15); lock(&sb->s_type->i_mutex_key#15); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by python/1822: #0: (sb_writers#10){.+.+.+}, at: [<ffffffff803d0eef>] mnt_want_write+0x1f/0x50 #1: (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0 #2: (jbd2_handle){.+.+..}, at: [<ffffffff80493f40>] start_this_handle+0xf0/0x420 #3: (&ei->xattr_sem){++++..}, at: [<ffffffff804920ba>] ext4_xattr_set_handle+0x9a/0x4f0 stack backtrace: CPU: 0 PID: 1822 Comm: python Not tainted 4.12.0-rc1+ #740 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x67/0x9e __lock_acquire+0x5f3/0x1750 lock_acquire+0xb5/0x1d0 down_write+0x2c/0x60 ext4_xattr_set_entry+0x65a/0x7b0 ext4_xattr_block_set+0x1b2/0x9b0 ext4_xattr_set_handle+0x322/0x4f0 ext4_xattr_set+0x144/0x1a0 ext4_xattr_user_set+0x34/0x40 __vfs_setxattr+0x66/0x80 __vfs_setxattr_noperm+0x69/0x1c0 vfs_setxattr+0xa2/0xb0 setxattr+0x12e/0x150 path_setxattr+0x87/0xb0 SyS_setxattr+0xf/0x20 entry_SYSCALL_64_fastpath+0x18/0xad Signed-off-by: NTahsin Erdogan <tahsin@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Andreas Dilger 提交于
Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE. If the size of an xattr value is larger than will fit in a single external block, then the xattr value will be saved into the body of an external xattr inode. The also helps support a larger number of xattr, since only the headers will be stored in the in-inode space or the single external block. The inode is referenced from the xattr header via "e_value_inum", which was formerly "e_value_block", but that field was never used. The e_value_size still contains the xattr size so that listing xattrs does not need to look up the inode if the data is not accessed. struct ext4_xattr_entry { __u8 e_name_len; /* length of name */ __u8 e_name_index; /* attribute name index */ __le16 e_value_offs; /* offset in disk block of value */ __le32 e_value_inum; /* inode in which value is stored */ __le32 e_value_size; /* size of attribute value */ __le32 e_hash; /* hash value of name and value */ char e_name[0]; /* attribute name */ }; The xattr inode is marked with the EXT4_EA_INODE_FL flag and also holds a back-reference to the owning inode in its i_mtime field, allowing the ext4/e2fsck to verify the correct inode is accessed. [ Applied fix by Dan Carpenter to avoid freeing an ERR_PTR. ] Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80 Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424Signed-off-by: NKalpak Shah <kalpak.shah@sun.com> Signed-off-by: NJames Simmons <uja.ornl@gmail.com> Signed-off-by: NAndreas Dilger <andreas.dilger@intel.com> Signed-off-by: NTahsin Erdogan <tahsin@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
-
由 Artem Blagodarenko 提交于
This INCOMPAT_LARGEDIR feature allows larger directories to be created in ldiskfs, both with directory sizes over 2GB and and a maximum htree depth of 3 instead of the current limit of 2. These features are needed in order to exceed the current limit of approximately 10M entries in a single directory. This patch was originally written by Yang Sheng to support the Lustre server. [ Bumped the credits needed to update an indexed directory -- tytso ] Signed-off-by: NLiang Zhen <liang.zhen@intel.com> Signed-off-by: NYang Sheng <yang.sheng@intel.com> Signed-off-by: NArtem Blagodarenko <artem.blagodarenko@seagate.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NAndreas Dilger <andreas.dilger@intel.com>
-
- 30 5月, 2017 1 次提交
-
-
由 Jan Kara 提交于
Currently, extent manipulation operations such as hole punch, range zeroing, or extent shifting do not record the fact that file data has changed and thus fdatasync(2) has a work to do. As a result if we crash e.g. after a punch hole and fdatasync, user can still possibly see the punched out data after journal replay. Test generic/392 fails due to these problems. Fix the problem by properly marking that file data has changed in these operations. CC: stable@vger.kernel.org Fixes: a4bb6b64Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 27 5月, 2017 1 次提交
-
-
由 Jan Kara 提交于
mpage_submit_page() can race with another process growing i_size and writing data via mmap to the written-back page. As mpage_submit_page() samples i_size too early, it may happen that ext4_bio_write_page() zeroes out too large tail of the page and thus corrupts user data. Fix the problem by sampling i_size only after the page has been write-protected in page tables by clear_page_dirty_for_io() call. Reported-by: NMichael Zimmer <michael@swarm64.com> CC: stable@vger.kernel.org Fixes: cb20d518Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 25 5月, 2017 2 次提交
-
-
由 Eric Biggers 提交于
Currently we don't allow direct I/O on encrypted regular files, so in such cases we return 0 early in ext4_direct_IO(). There was also an additional BUG_ON() check in ext4_direct_IO_write(), but it can never be hit because of the earlier check for the exact same condition in ext4_direct_IO(). There was also no matching check on the read path, which made the write path specific check seem very ad-hoc. Just remove the unnecessary BUG_ON(). Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NDavid Gstir <david@sigma-star.at> Reviewed-by: NJan Kara <jack@suse.cz>
-
由 Eric Biggers 提交于
The 'lend' argument of filemap_write_and_wait_range() is inclusive, so we need to subtract 1 from pos + count. Note that 'count' is guaranteed to be nonzero since ext4_file_read_iter() returns early when given a 0 count. Fixes: 16c54688 ("ext4: Allow parallel DIO reads") Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NJan Kara <jack@suse.cz>
-
- 22 5月, 2017 1 次提交
-
-
由 Konstantin Khlebnikov 提交于
ext4_expand_extra_isize() should clear only space between old and new size. Fixes: 6dd4ee7c # v2.6.23 Cc: stable@vger.kernel.org Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 14 5月, 2017 1 次提交
-
-
由 Dan Williams 提交于
Tetsuo reports: fs/built-in.o: In function `xfs_file_iomap_end': xfs_iomap.c:(.text+0xe0ef9): undefined reference to `put_dax' fs/built-in.o: In function `xfs_file_iomap_begin': xfs_iomap.c:(.text+0xe1a7f): undefined reference to `dax_get_by_host' make: *** [vmlinux] Error 1 $ grep DAX .config CONFIG_DAX=m # CONFIG_DEV_DAX is not set # CONFIG_FS_DAX is not set When FS_DAX=n we can/must throw away the dax code in filesystems. Implement 'fs_' versions of dax_get_by_host() and put_dax() that are nops in the FS_DAX=n case. Cc: <linux-xfs@vger.kernel.org> Cc: <linux-ext4@vger.kernel.org> Cc: Jan Kara <jack@suse.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Tested-by: NTony Luck <tony.luck@intel.com> Fixes: ef510424 ("block, dax: move 'select DAX' from BLOCK to FS_DAX") Reported-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 01 5月, 2017 1 次提交
-
-
由 Jan Kara 提交于
Currently ext4_writepages() submits all pages with transaction started. When no page needs block allocation or extent conversion we can submit all dirty pages in the inode while holding a single transaction handle and when device is congested this can take significant amount of time. Thus ext4_writepages() can block transaction commits for extended periods of time. Take for example a simple benchmark simulating PostgreSQL database (pgioperf in mmtest). The benchmark runs 16 processes doing random reads from a huge file, one process doing random writes to the huge file, and one process doing sequential writes to a small files and frequently running fsync. With unpatched kernel transaction commits take on average ~18s with standard deviation of ~41s, top 5 commit times are: 274.466639s, 126.467347s, 86.992429s, 34.351563s, 31.517653s. After this patch transaction commits take on average 0.1s with standard deviation of 0.15s, top 5 commit times are: 0.563792s, 0.519980s, 0.509841s, 0.471700s, 0.469899s [ Modified so we use an explicit do_map flag instead of relying on io_end not being allocated, the since io_end->inode is needed for I/O error handling. -- tytso ] Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 30 4月, 2017 1 次提交
-
-
由 Eric Biggers 提交于
Currently the case of writing via mmap to a file with inline data is not handled. This is maybe a rare case since it requires a writable memory map of a very small file, but it is trivial to trigger with on inline_data filesystem, and it causes the 'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in ext4_writepages() to be hit: mkfs.ext4 -O inline_data /dev/vdb mount /dev/vdb /mnt xfs_io -f /mnt/file \ -c 'pwrite 0 1' \ -c 'mmap -w 0 1m' \ -c 'mwrite 0 1' \ -c 'fsync' kernel BUG at fs/ext4/inode.c:2723! invalid opcode: 0000 [#1] SMP CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 task: ffff88003d3a8040 task.stack: ffffc90000300000 RIP: 0010:ext4_writepages+0xc89/0xf8a RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283 RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246 RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5 R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698 R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff FS: 00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0 Call Trace: ? _raw_spin_unlock+0x27/0x2a ? kvm_clock_read+0x1e/0x20 do_writepages+0x23/0x2c ? do_writepages+0x23/0x2c __filemap_fdatawrite_range+0x80/0x87 filemap_write_and_wait_range+0x67/0x8c ext4_sync_file+0x20e/0x472 vfs_fsync_range+0x8e/0x9f ? syscall_trace_enter+0x25b/0x2d0 vfs_fsync+0x1c/0x1e do_fsync+0x31/0x4a SyS_fsync+0x10/0x14 do_syscall_64+0x69/0x131 entry_SYSCALL64_slow_path+0x25/0x25 We could try to be smart and keep the inline data in this case, or at least support delayed allocation when allocating the block, but these solutions would be more complicated and don't seem worthwhile given how rare this case seems to be. So just fix the bug by calling ext4_convert_inline_data() when we're asked to make a page writable, so that any inline data gets evicted, with the block allocated immediately. Reported-by: NNick Alcock <nick.alcock@oracle.com> Cc: stable@vger.kernel.org Reviewed-by: NAndreas Dilger <adilger@dilger.ca> Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 26 4月, 2017 1 次提交
-
-
由 Dan Williams 提交于
In preparation for converting fs/dax.c to use dax_direct_access() instead of bdev_direct_access(), add the plumbing to retrieve the dax_device associated with a given block_device. Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 19 4月, 2017 1 次提交
-
-
由 Jan Kara 提交于
Now that all places setting inode->i_flags that should be reflected in on-disk flags are gone, we can remove ext4_get_inode_flags() call. Reviewed-by: NAndreas Dilger <adilger@dilger.ca> Signed-off-by: NJan Kara <jack@suse.cz>
-
- 03 4月, 2017 2 次提交
-
-
由 David Howells 提交于
Include a mask in struct stat to indicate which bits of stx_attributes the filesystem actually supports. This would also be useful if we add another system call that allows you to do a 'bulk attribute set' and pass in a statx struct with the masks appropriately set to say what you want to set. Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 David Howells 提交于
Return enhanced file attributes from the Ext4 filesystem. This includes the following: (1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME. (2) Certain FS_xxx_FL flags are mapped to stx_attribute flags. This requires that all ext4 inodes have a getattr call, not just some of them, so to this end, split the ext4_getattr() function and only call part of it where appropriate. Example output: [root@andromeda ~]# touch foo [root@andromeda ~]# chattr +ai foo [root@andromeda ~]# /tmp/test-statx foo statx(foo) = 0 results=fff Size: 0 Blocks: 0 IO Block: 4096 regular file Device: 08:12 Inode: 2101950 Links: 1 Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 Access: 2016-02-11 17:08:29.031795451+0000 Modify: 2016-02-11 17:08:29.031795451+0000 Change: 2016-02-11 17:11:11.987790114+0000 Birth: 2016-02-11 17:08:29.031795451+0000 Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----) Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 26 3月, 2017 1 次提交
-
-
由 Theodore Ts'o 提交于
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 03 3月, 2017 1 次提交
-
-
由 David Howells 提交于
Add a system call to make extended file information available, including file creation and some attribute flags where available through the underlying filesystem. The getattr inode operation is altered to take two additional arguments: a u32 request_mask and an unsigned int flags that indicate the synchronisation mode. This change is propagated to the vfs_getattr*() function. Functions like vfs_stat() are now inline wrappers around new functions vfs_statx() and vfs_statx_fd() to reduce stack usage. ======== OVERVIEW ======== The idea was initially proposed as a set of xattrs that could be retrieved with getxattr(), but the general preference proved to be for a new syscall with an extended stat structure. A number of requests were gathered for features to be included. The following have been included: (1) Make the fields a consistent size on all arches and make them large. (2) Spare space, request flags and information flags are provided for future expansion. (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an __s64). (4) Creation time: The SMB protocol carries the creation time, which could be exported by Samba, which will in turn help CIFS make use of FS-Cache as that can be used for coherency data (stx_btime). This is also specified in NFSv4 as a recommended attribute and could be exported by NFSD [Steve French]. (5) Lightweight stat: Ask for just those details of interest, and allow a netfs (such as NFS) to approximate anything not of interest, possibly without going to the server [Trond Myklebust, Ulrich Drepper, Andreas Dilger] (AT_STATX_DONT_SYNC). (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks its cached attributes are up to date [Trond Myklebust] (AT_STATX_FORCE_SYNC). And the following have been left out for future extension: (7) Data version number: Could be used by userspace NFS servers [Aneesh Kumar]. Can also be used to modify fill_post_wcc() in NFSD which retrieves i_version directly, but has just called vfs_getattr(). It could get it from the kstat struct if it used vfs_xgetattr() instead. (There's disagreement on the exact semantics of a single field, since not all filesystems do this the same way). (8) BSD stat compatibility: Including more fields from the BSD stat such as creation time (st_btime) and inode generation number (st_gen) [Jeremy Allison, Bernd Schubert]. (9) Inode generation number: Useful for FUSE and userspace NFS servers [Bernd Schubert]. (This was asked for but later deemed unnecessary with the open-by-handle capability available and caused disagreement as to whether it's a security hole or not). (10) Extra coherency data may be useful in making backups [Andreas Dilger]. (No particular data were offered, but things like last backup timestamp, the data version number and the DOS archive bit would come into this category). (11) Allow the filesystem to indicate what it can/cannot provide: A filesystem can now say it doesn't support a standard stat feature if that isn't available, so if, for instance, inode numbers or UIDs don't exist or are fabricated locally... (This requires a separate system call - I have an fsinfo() call idea for this). (12) Store a 16-byte volume ID in the superblock that can be returned in struct xstat [Steve French]. (Deferred to fsinfo). (13) Include granularity fields in the time data to indicate the granularity of each of the times (NFSv4 time_delta) [Steve French]. (Deferred to fsinfo). (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags. Note that the Linux IOC flags are a mess and filesystems such as Ext4 define flags that aren't in linux/fs.h, so translation in the kernel may be a necessity (or, possibly, we provide the filesystem type too). (Some attributes are made available in stx_attributes, but the general feeling was that the IOC flags were to ext[234]-specific and shouldn't be exposed through statx this way). (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer, Michael Kerrisk]. (Deferred, probably to fsinfo. Finding out if there's an ACL or seclabal might require extra filesystem operations). (16) Femtosecond-resolution timestamps [Dave Chinner]. (A __reserved field has been left in the statx_timestamp struct for this - if there proves to be a need). (17) A set multiple attributes syscall to go with this. =============== NEW SYSTEM CALL =============== The new system call is: int ret = statx(int dfd, const char *filename, unsigned int flags, unsigned int mask, struct statx *buffer); The dfd, filename and flags parameters indicate the file to query, in a similar way to fstatat(). There is no equivalent of lstat() as that can be emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is also no equivalent of fstat() as that can be emulated by passing a NULL filename to statx() with the fd of interest in dfd. Whether or not statx() synchronises the attributes with the backing store can be controlled by OR'ing a value into the flags argument (this typically only affects network filesystems): (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this respect. (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise its attributes with the server - which might require data writeback to occur to get the timestamps correct. (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a network filesystem. The resulting values should be considered approximate. mask is a bitmask indicating the fields in struct statx that are of interest to the caller. The user should set this to STATX_BASIC_STATS to get the basic set returned by stat(). It should be noted that asking for more information may entail extra I/O operations. buffer points to the destination for the data. This must be 256 bytes in size. ====================== MAIN ATTRIBUTES RECORD ====================== The following structures are defined in which to return the main attribute set: struct statx_timestamp { __s64 tv_sec; __s32 tv_nsec; __s32 __reserved; }; struct statx { __u32 stx_mask; __u32 stx_blksize; __u64 stx_attributes; __u32 stx_nlink; __u32 stx_uid; __u32 stx_gid; __u16 stx_mode; __u16 __spare0[1]; __u64 stx_ino; __u64 stx_size; __u64 stx_blocks; __u64 __spare1[1]; struct statx_timestamp stx_atime; struct statx_timestamp stx_btime; struct statx_timestamp stx_ctime; struct statx_timestamp stx_mtime; __u32 stx_rdev_major; __u32 stx_rdev_minor; __u32 stx_dev_major; __u32 stx_dev_minor; __u64 __spare2[14]; }; The defined bits in request_mask and stx_mask are: STATX_TYPE Want/got stx_mode & S_IFMT STATX_MODE Want/got stx_mode & ~S_IFMT STATX_NLINK Want/got stx_nlink STATX_UID Want/got stx_uid STATX_GID Want/got stx_gid STATX_ATIME Want/got stx_atime{,_ns} STATX_MTIME Want/got stx_mtime{,_ns} STATX_CTIME Want/got stx_ctime{,_ns} STATX_INO Want/got stx_ino STATX_SIZE Want/got stx_size STATX_BLOCKS Want/got stx_blocks STATX_BASIC_STATS [The stuff in the normal stat struct] STATX_BTIME Want/got stx_btime{,_ns} STATX_ALL [All currently available stuff] stx_btime is the file creation time, stx_mask is a bitmask indicating the data provided and __spares*[] are where as-yet undefined fields can be placed. Time fields are structures with separate seconds and nanoseconds fields plus a reserved field in case we want to add even finer resolution. Note that times will be negative if before 1970; in such a case, the nanosecond fields will also be negative if not zero. The bits defined in the stx_attributes field convey information about a file, how it is accessed, where it is and what it does. The following attributes map to FS_*_FL flags and are the same numerical value: STATX_ATTR_COMPRESSED File is compressed by the fs STATX_ATTR_IMMUTABLE File is marked immutable STATX_ATTR_APPEND File is append-only STATX_ATTR_NODUMP File is not to be dumped STATX_ATTR_ENCRYPTED File requires key to decrypt in fs Within the kernel, the supported flags are listed by: KSTAT_ATTR_FS_IOC_FLAGS [Are any other IOC flags of sufficient general interest to be exposed through this interface?] New flags include: STATX_ATTR_AUTOMOUNT Object is an automount trigger These are for the use of GUI tools that might want to mark files specially, depending on what they are. Fields in struct statx come in a number of classes: (0) stx_dev_*, stx_blksize. These are local system information and are always available. (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino, stx_size, stx_blocks. These will be returned whether the caller asks for them or not. The corresponding bits in stx_mask will be set to indicate whether they actually have valid values. If the caller didn't ask for them, then they may be approximated. For example, NFS won't waste any time updating them from the server, unless as a byproduct of updating something requested. If the values don't actually exist for the underlying object (such as UID or GID on a DOS file), then the bit won't be set in the stx_mask, even if the caller asked for the value. In such a case, the returned value will be a fabrication. Note that there are instances where the type might not be valid, for instance Windows reparse points. (2) stx_rdev_*. This will be set only if stx_mode indicates we're looking at a blockdev or a chardev, otherwise will be 0. (3) stx_btime. Similar to (1), except this will be set to 0 if it doesn't exist. ======= TESTING ======= The following test program can be used to test the statx system call: samples/statx/test-statx.c Just compile and run, passing it paths to the files you want to examine. The file is built automatically if CONFIG_SAMPLES is enabled. Here's some example output. Firstly, an NFS directory that crosses to another FSID. Note that the AUTOMOUNT attribute is set because transiting this directory will cause d_automount to be invoked by the VFS. [root@andromeda ~]# /tmp/test-statx -A /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:26 Inode: 1703937 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------) Secondly, the result of automounting on that directory. [root@andromeda ~]# /tmp/test-statx /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:27 Inode: 2 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 28 2月, 2017 1 次提交
-
-
由 Fabian Frederick 提交于
Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs branch. This patch also fixes multiple checkpatch warnings: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' Thanks to Andrew Morton for suggesting more appropriate function instead of macro. [geliangtang@gmail.com: truncate: use i_blocksize()] Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.beSigned-off-by: NFabian Frederick <fabf@skynet.be> Signed-off-by: NGeliang Tang <geliangtang@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 2月, 2017 1 次提交
-
-
由 Dave Jiang 提交于
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to take a vma and vmf parameter when the vma already resides in vmf. Remove the vma parameter to simplify things. [arnd@arndb.de: fix ARM build] Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 2月, 2017 1 次提交
-
-
由 Theodore Ts'o 提交于
Fix a BUG when the kernel tries to mount a file system constructed as follows: echo foo > foo.txt mke2fs -Fq -t ext4 -O encrypt foo.img 100 debugfs -w foo.img << EOF write foo.txt a set_inode_field a i_flags 0x80800 set_super_value s_last_orphan 12 quit EOF root@kvm-xfstests:~# mount -o loop foo.img /mnt [ 160.238770] ------------[ cut here ]------------ [ 160.240106] kernel BUG at /usr/projects/linux/ext4/fs/ext4/inode.c:3874! [ 160.240106] invalid opcode: 0000 [#1] SMP [ 160.240106] Modules linked in: [ 160.240106] CPU: 0 PID: 2547 Comm: mount Tainted: G W 4.10.0-rc3-00034-gcdd33b941b67 #227 [ 160.240106] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014 [ 160.240106] task: f4518000 task.stack: f47b6000 [ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 [ 160.240106] EFLAGS: 00010246 CPU: 0 [ 160.240106] EAX: 00000001 EBX: f7be4b50 ECX: f47b7dc0 EDX: 00000007 [ 160.240106] ESI: f43b05a8 EDI: f43babec EBP: f47b7dd0 ESP: f47b7dac [ 160.240106] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 160.240106] CR0: 80050033 CR2: bfd85b08 CR3: 34a00680 CR4: 000006f0 [ 160.240106] Call Trace: [ 160.240106] ext4_truncate+0x1e9/0x3e5 [ 160.240106] ext4_fill_super+0x286f/0x2b1e [ 160.240106] ? set_blocksize+0x2e/0x7e [ 160.240106] mount_bdev+0x114/0x15f [ 160.240106] ext4_mount+0x15/0x17 [ 160.240106] ? ext4_calculate_overhead+0x39d/0x39d [ 160.240106] mount_fs+0x58/0x115 [ 160.240106] vfs_kern_mount+0x4b/0xae [ 160.240106] do_mount+0x671/0x8c3 [ 160.240106] ? _copy_from_user+0x70/0x83 [ 160.240106] ? strndup_user+0x31/0x46 [ 160.240106] SyS_mount+0x57/0x7b [ 160.240106] do_int80_syscall_32+0x4f/0x61 [ 160.240106] entry_INT80_32+0x2f/0x2f [ 160.240106] EIP: 0xb76b919e [ 160.240106] EFLAGS: 00000246 CPU: 0 [ 160.240106] EAX: ffffffda EBX: 08053838 ECX: 08052188 EDX: 080537e8 [ 160.240106] ESI: c0ed0000 EDI: 00000000 EBP: 080537e8 ESP: bfa13660 [ 160.240106] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b [ 160.240106] Code: 59 8b 00 a8 01 0f 84 09 01 00 00 8b 07 66 25 00 f0 66 3d 00 80 75 61 89 f8 e8 3e e2 ff ff 84 c0 74 56 83 bf 48 02 00 00 00 75 02 <0f> 0b 81 7d e8 00 10 00 00 74 02 0f 0b 8b 43 04 8b 53 08 31 c9 [ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 SS:ESP: 0068:f47b7dac [ 160.317241] ---[ end trace d6a773a375c810a5 ]--- The problem is that when the kernel tries to truncate an inode in ext4_truncate(), it tries to clear any on-disk data beyond i_size. Without the encryption key, it can't do that, and so it triggers a BUG. E2fsck does *not* provide this service, and in practice most file systems have their orphan list processed by e2fsck, so to avoid crashing, this patch skips this step if we don't have access to the encryption key (which is the case when processing the orphan list; in all other cases, we will have the encryption key, or the kernel wouldn't have allowed the file to be opened). An open question is whether the fact that e2fsck isn't clearing the bytes beyond i_size causing problems --- and if we've lived with it not doing it for so long, can we drop this from the kernel replay of the orphan list in all cases (not just when we don't have the key for encrypted inodes). Addresses-Google-Bug: #35209576 Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 05 2月, 2017 2 次提交
-
-
由 Theodore Ts'o 提交于
Add a shutdown bit that will cause ext4 processing to fail immediately with EIO. Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Theodore Ts'o 提交于
The write_end() function must always unlock the page and drop its ref count, even on an error. Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
-
- 31 1月, 2017 1 次提交
-
-
由 Christoph Hellwig 提交于
Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 28 1月, 2017 1 次提交
-
-
由 Jan Kara 提交于
ext4_journalled_write_end() did not propely handle all the cases when generic_perform_write() did not copy all the data into the target page and could mark buffers with uninitialized contents as uptodate and dirty leading to possible data corruption (which would be quickly fixed by generic_perform_write() retrying the write but still). Fix the problem by carefully handling the case when the page that is written to is not uptodate. CC: stable@vger.kernel.org Reported-by: NAl Viro <viro@ZenIV.linux.org.uk> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 23 1月, 2017 1 次提交
-
-
由 Theodore Ts'o 提交于
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 12 1月, 2017 1 次提交
-
-
由 Theodore Ts'o 提交于
There is no need to call ext4_mark_inode_dirty while holding xattr_sem or i_data_sem, so where it's easy to avoid it, move it out from the critical region. Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-