1. 06 8月, 2017 1 次提交
    • T
      ext4: make xattr inode reads faster · 9699d4f9
      Tahsin Erdogan 提交于
      ext4_xattr_inode_read() currently reads each block sequentially while
      waiting for io operation to complete before moving on to the next
      block. This prevents request merging in block layer.
      
      Add a ext4_bread_batch() function that starts reads for all blocks
      then optionally waits for them to complete. A similar logic is used
      in ext4_find_entry(), so update that code to use the new function.
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      9699d4f9
  2. 31 7月, 2017 1 次提交
  3. 04 7月, 2017 1 次提交
    • T
      ext4: change fast symlink test to not rely on i_blocks · 407cd7fb
      Tahsin Erdogan 提交于
      ext4_inode_info->i_data is the storage area for 4 types of data:
      
        a) Extents data
        b) Inline data
        c) Block map
        d) Fast symlink data (symlink length < 60)
      
      Extents data case is positively identified by EXT4_INODE_EXTENTS flag.
      Inline data case is also obvious because of EXT4_INODE_INLINE_DATA
      flag.
      
      Distinguishing c) and d) however requires additional logic. This
      currently relies on i_blocks count. After subtracting external xattr
      block from i_blocks, if it is greater than 0 then we know that some
      data blocks exist, so there must be a block map.
      
      This logic got broken after ea_inode feature was added. That feature
      charges the data blocks of external xattr inodes to the referencing
      inode and so adds them to the i_blocks. To fix this, we could subtract
      ea_inode blocks by iterating through all xattr entries and then check
      whether remaining i_blocks count is zero. Besides being complicated,
      this won't change the fact that the current way of distinguishing
      between c) and d) is fragile.
      
      The alternative solution is to test whether i_size is less than 60 to
      determine fast symlink case. ext4_symlink() uses the same test to decide
      whether to store the symlink in i_data. There is one caveat to address
      before this can work though.
      
      If an inode's i_nlink is zero during eviction, its i_size is set to
      zero and its data is truncated. If system crashes before inode is removed
      from the orphan list, next boot orphan cleanup may find the inode with
      zero i_size. So, a symlink that had its data stored in a block may now
      appear to be a fast symlink. The solution used in this patch is to treat
      i_size = 0 as a non-fast symlink case. A zero sized symlink is not legal
      so the only time this can happen is the mentioned scenario. This is also
      logically correct because a i_size = 0 symlink has no data stored in
      i_data.
      Suggested-by: NAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      407cd7fb
  4. 24 6月, 2017 1 次提交
    • E
      ext4: require key for truncate(2) of encrypted file · 63136858
      Eric Biggers 提交于
      Currently, filesystems allow truncate(2) on an encrypted file without
      the encryption key.  However, it's impossible to correctly handle the
      case where the size being truncated to is not a multiple of the
      filesystem block size, because that would require decrypting the final
      block, zeroing the part beyond i_size, then encrypting the block.
      
      As other modifications to encrypted file contents are prohibited without
      the key, just prohibit truncate(2) as well, making it fail with ENOKEY.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      63136858
  5. 23 6月, 2017 1 次提交
  6. 22 6月, 2017 8 次提交
    • T
      quota: add get_inode_usage callback to transfer multi-inode charges · 7a9ca53a
      Tahsin Erdogan 提交于
      Ext4 ea_inode feature allows storing xattr values in external inodes to
      be able to store values that are bigger than a block in size. Ext4 also
      has deduplication support for these type of inodes. With deduplication,
      the actual storage waste is eliminated but the users of such inodes are
      still charged full quota for the inodes as if there was no sharing
      happening in the background.
      
      This design requires ext4 to manually charge the users because the
      inodes are shared.
      
      An implication of this is that, if someone calls chown on a file that
      has such references we need to transfer the quota for the file and xattr
      inodes. Current dquot_transfer() function implicitly transfers one inode
      charge. With ea_inode feature, we would like to transfer multiple inode
      charges.
      
      Add get_inode_usage callback which can interrogate the total number of
      inodes that were charged for a given inode.
      
      [ Applied fix from Colin King to make sure the 'ret' variable is
        initialized on the successful return path.  Detected by
        CoverityScan, CID#1446616 ("Uninitialized scalar variable") --tytso]
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Acked-by: NJan Kara <jack@suse.cz>
      7a9ca53a
    • T
      ext4: xattr inode deduplication · dec214d0
      Tahsin Erdogan 提交于
      Ext4 now supports xattr values that are up to 64k in size (vfs limit).
      Large xattr values are stored in external inodes each one holding a
      single value. Once written the data blocks of these inodes are immutable.
      
      The real world use cases are expected to have a lot of value duplication
      such as inherited acls etc. To reduce data duplication on disk, this patch
      implements a deduplicator that allows sharing of xattr inodes.
      
      The deduplication is based on an in-memory hash lookup that is a best
      effort sharing scheme. When a xattr inode is read from disk (i.e.
      getxattr() call), its crc32c hash is added to a hash table. Before
      creating a new xattr inode for a value being set, the hash table is
      checked to see if an existing inode holds an identical value. If such an
      inode is found, the ref count on that inode is incremented. On value
      removal the ref count is decremented and if it reaches zero the inode is
      deleted.
      
      The quota charging for such inodes is manually managed. Every reference
      holder is charged the full size as if there was no sharing happening.
      This is consistent with how xattr blocks are also charged.
      
      [ Fixed up journal credits calculation to handle inline data and the
        rare case where an shared xattr block can get freed when two thread
        race on breaking the xattr block sharing. --tytso ]
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      dec214d0
    • T
      ext4: cleanup transaction restarts during inode deletion · 30a7eb97
      Tahsin Erdogan 提交于
      During inode deletion, the number of journal credits that will be
      needed is hard to determine.  For that reason we have journal
      extend/restart calls in several places.  Whenever a transaction is
      restarted, filesystem must be in a consistent state because there is
      no atomicity guarantee beyond a restart call.
      
      Add ext4_xattr_ensure_credits() helper function which takes care of
      journal extend/restart logic.  It also handles getting jbd2 write
      access and dirty metadata calls.  This function is called at every
      iteration of handling an ea_inode reference.
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      30a7eb97
    • T
      ext4: add ext4_is_quota_file() · 02749a4c
      Tahsin Erdogan 提交于
      IS_NOQUOTA() indicates whether quota is disabled for an inode. Ext4
      also uses it to check whether an inode is for a quota file. The
      distinction currently doesn't matter because quota is disabled only
      for the quota files. When we start disabling quota for other inodes
      in the future, we will want to make the distinction clear.
      
      Replace IS_NOQUOTA() call with ext4_is_quota_file() at places where
      we are checking for quota files.
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      02749a4c
    • T
      ext4: modify ext4_xattr_ino_array to hold struct inode * · 0421a189
      Tahsin Erdogan 提交于
      Tracking struct inode * rather than the inode number eliminates the
      repeated ext4_xattr_inode_iget() call later. The second call cannot
      fail in practice but still requires explanation when it wants to ignore
      the return value. Avoid the trouble and make things simple.
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      0421a189
    • T
      ext4: fix lockdep warning about recursive inode locking · 33d201e0
      Tahsin Erdogan 提交于
      Setting a large xattr value may require writing the attribute contents
      to an external inode. In this case we may need to lock the xattr inode
      along with the parent inode. This doesn't pose a deadlock risk because
      xattr inodes are not directly visible to the user and their access is
      restricted.
      
      Assign a lockdep subclass to xattr inode's lock.
      
       ============================================
       WARNING: possible recursive locking detected
       4.12.0-rc1+ #740 Not tainted
       --------------------------------------------
       python/1822 is trying to acquire lock:
        (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff804912ca>] ext4_xattr_set_entry+0x65a/0x7b0
      
       but task is already holding lock:
        (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&sb->s_type->i_mutex_key#15);
         lock(&sb->s_type->i_mutex_key#15);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       4 locks held by python/1822:
        #0:  (sb_writers#10){.+.+.+}, at: [<ffffffff803d0eef>] mnt_want_write+0x1f/0x50
        #1:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0
        #2:  (jbd2_handle){.+.+..}, at: [<ffffffff80493f40>] start_this_handle+0xf0/0x420
        #3:  (&ei->xattr_sem){++++..}, at: [<ffffffff804920ba>] ext4_xattr_set_handle+0x9a/0x4f0
      
       stack backtrace:
       CPU: 0 PID: 1822 Comm: python Not tainted 4.12.0-rc1+ #740
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       Call Trace:
        dump_stack+0x67/0x9e
        __lock_acquire+0x5f3/0x1750
        lock_acquire+0xb5/0x1d0
        down_write+0x2c/0x60
        ext4_xattr_set_entry+0x65a/0x7b0
        ext4_xattr_block_set+0x1b2/0x9b0
        ext4_xattr_set_handle+0x322/0x4f0
        ext4_xattr_set+0x144/0x1a0
        ext4_xattr_user_set+0x34/0x40
        __vfs_setxattr+0x66/0x80
        __vfs_setxattr_noperm+0x69/0x1c0
        vfs_setxattr+0xa2/0xb0
        setxattr+0x12e/0x150
        path_setxattr+0x87/0xb0
        SyS_setxattr+0xf/0x20
        entry_SYSCALL_64_fastpath+0x18/0xad
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      33d201e0
    • A
      ext4: xattr-in-inode support · e50e5129
      Andreas Dilger 提交于
      Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.
      
      If the size of an xattr value is larger than will fit in a single
      external block, then the xattr value will be saved into the body
      of an external xattr inode.
      
      The also helps support a larger number of xattr, since only the headers
      will be stored in the in-inode space or the single external block.
      
      The inode is referenced from the xattr header via "e_value_inum",
      which was formerly "e_value_block", but that field was never used.
      The e_value_size still contains the xattr size so that listing
      xattrs does not need to look up the inode if the data is not accessed.
      
      struct ext4_xattr_entry {
              __u8    e_name_len;     /* length of name */
              __u8    e_name_index;   /* attribute name index */
              __le16  e_value_offs;   /* offset in disk block of value */
              __le32  e_value_inum;   /* inode in which value is stored */
              __le32  e_value_size;   /* size of attribute value */
              __le32  e_hash;         /* hash value of name and value */
              char    e_name[0];      /* attribute name */
      };
      
      The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
      holds a back-reference to the owning inode in its i_mtime field,
      allowing the ext4/e2fsck to verify the correct inode is accessed.
      
      [ Applied fix by Dan Carpenter to avoid freeing an ERR_PTR. ]
      
      Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80
      Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424Signed-off-by: NKalpak Shah <kalpak.shah@sun.com>
      Signed-off-by: NJames Simmons <uja.ornl@gmail.com>
      Signed-off-by: NAndreas Dilger <andreas.dilger@intel.com>
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      e50e5129
    • A
      ext4: add largedir feature · e08ac99f
      Artem Blagodarenko 提交于
      This INCOMPAT_LARGEDIR feature allows larger directories to be created
      in ldiskfs, both with directory sizes over 2GB and and a maximum htree
      depth of 3 instead of the current limit of 2. These features are needed
      in order to exceed the current limit of approximately 10M entries in a
      single directory.
      
      This patch was originally written by Yang Sheng to support the Lustre server.
      
      [ Bumped the credits needed to update an indexed directory -- tytso ]
      Signed-off-by: NLiang Zhen <liang.zhen@intel.com>
      Signed-off-by: NYang Sheng <yang.sheng@intel.com>
      Signed-off-by: NArtem Blagodarenko <artem.blagodarenko@seagate.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NAndreas Dilger <andreas.dilger@intel.com>
      e08ac99f
  7. 30 5月, 2017 1 次提交
    • J
      ext4: fix fdatasync(2) after extent manipulation operations · 67a7d5f5
      Jan Kara 提交于
      Currently, extent manipulation operations such as hole punch, range
      zeroing, or extent shifting do not record the fact that file data has
      changed and thus fdatasync(2) has a work to do. As a result if we crash
      e.g. after a punch hole and fdatasync, user can still possibly see the
      punched out data after journal replay. Test generic/392 fails due to
      these problems.
      
      Fix the problem by properly marking that file data has changed in these
      operations.
      
      CC: stable@vger.kernel.org
      Fixes: a4bb6b64Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      67a7d5f5
  8. 27 5月, 2017 1 次提交
  9. 25 5月, 2017 2 次提交
  10. 22 5月, 2017 1 次提交
  11. 14 5月, 2017 1 次提交
    • D
      dax, xfs, ext4: compile out iomap-dax paths in the FS_DAX=n case · f5705aa8
      Dan Williams 提交于
      Tetsuo reports:
      
        fs/built-in.o: In function `xfs_file_iomap_end':
        xfs_iomap.c:(.text+0xe0ef9): undefined reference to `put_dax'
        fs/built-in.o: In function `xfs_file_iomap_begin':
        xfs_iomap.c:(.text+0xe1a7f): undefined reference to `dax_get_by_host'
        make: *** [vmlinux] Error 1
        $ grep DAX .config
        CONFIG_DAX=m
        # CONFIG_DEV_DAX is not set
        # CONFIG_FS_DAX is not set
      
      When FS_DAX=n we can/must throw away the dax code in filesystems.
      Implement 'fs_' versions of dax_get_by_host() and put_dax() that are
      nops in the FS_DAX=n case.
      
      Cc: <linux-xfs@vger.kernel.org>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Jan Kara <jack@suse.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Tested-by: NTony Luck <tony.luck@intel.com>
      Fixes: ef510424 ("block, dax: move 'select DAX' from BLOCK to FS_DAX")
      Reported-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      f5705aa8
  12. 01 5月, 2017 1 次提交
    • J
      ext4: avoid unnecessary transaction stalls during writeback · dddbd6ac
      Jan Kara 提交于
      Currently ext4_writepages() submits all pages with transaction started.
      When no page needs block allocation or extent conversion we can submit
      all dirty pages in the inode while holding a single transaction handle
      and when device is congested this can take significant amount of time.
      Thus ext4_writepages() can block transaction commits for extended
      periods of time.
      
      Take for example a simple benchmark simulating PostgreSQL database
      (pgioperf in mmtest). The benchmark runs 16 processes doing random reads
      from a huge file, one process doing random writes to the huge file, and
      one process doing sequential writes to a small files and frequently
      running fsync. With unpatched kernel transaction commits take on average
      ~18s with standard deviation of ~41s, top 5 commit times are:
      
      274.466639s, 126.467347s, 86.992429s, 34.351563s, 31.517653s.
      
      After this patch transaction commits take on average 0.1s with standard
      deviation of 0.15s, top 5 commit times are:
      
      0.563792s, 0.519980s, 0.509841s, 0.471700s, 0.469899s
      
      [ Modified so we use an explicit do_map flag instead of relying on
        io_end not being allocated, the since io_end->inode is needed for I/O
        error handling. -- tytso ]
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      dddbd6ac
  13. 30 4月, 2017 1 次提交
    • E
      ext4: evict inline data when writing to memory map · 7b4cc978
      Eric Biggers 提交于
      Currently the case of writing via mmap to a file with inline data is not
      handled.  This is maybe a rare case since it requires a writable memory
      map of a very small file, but it is trivial to trigger with on
      inline_data filesystem, and it causes the
      'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
      ext4_writepages() to be hit:
      
          mkfs.ext4 -O inline_data /dev/vdb
          mount /dev/vdb /mnt
          xfs_io -f /mnt/file \
      	-c 'pwrite 0 1' \
      	-c 'mmap -w 0 1m' \
      	-c 'mwrite 0 1' \
      	-c 'fsync'
      
      	kernel BUG at fs/ext4/inode.c:2723!
      	invalid opcode: 0000 [#1] SMP
      	CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
      	task: ffff88003d3a8040 task.stack: ffffc90000300000
      	RIP: 0010:ext4_writepages+0xc89/0xf8a
      	RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
      	RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
      	RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
      	RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
      	R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
      	R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
      	FS:  00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
      	Call Trace:
      	 ? _raw_spin_unlock+0x27/0x2a
      	 ? kvm_clock_read+0x1e/0x20
      	 do_writepages+0x23/0x2c
      	 ? do_writepages+0x23/0x2c
      	 __filemap_fdatawrite_range+0x80/0x87
      	 filemap_write_and_wait_range+0x67/0x8c
      	 ext4_sync_file+0x20e/0x472
      	 vfs_fsync_range+0x8e/0x9f
      	 ? syscall_trace_enter+0x25b/0x2d0
      	 vfs_fsync+0x1c/0x1e
      	 do_fsync+0x31/0x4a
      	 SyS_fsync+0x10/0x14
      	 do_syscall_64+0x69/0x131
      	 entry_SYSCALL64_slow_path+0x25/0x25
      
      We could try to be smart and keep the inline data in this case, or at
      least support delayed allocation when allocating the block, but these
      solutions would be more complicated and don't seem worthwhile given how
      rare this case seems to be.  So just fix the bug by calling
      ext4_convert_inline_data() when we're asked to make a page writable, so
      that any inline data gets evicted, with the block allocated immediately.
      Reported-by: NNick Alcock <nick.alcock@oracle.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      7b4cc978
  14. 26 4月, 2017 1 次提交
  15. 19 4月, 2017 1 次提交
  16. 03 4月, 2017 2 次提交
    • D
      statx: Include a mask for stx_attributes in struct statx · 3209f68b
      David Howells 提交于
      Include a mask in struct stat to indicate which bits of stx_attributes the
      filesystem actually supports.
      
      This would also be useful if we add another system call that allows you to
      do a 'bulk attribute set' and pass in a statx struct with the masks
      appropriately set to say what you want to set.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3209f68b
    • D
      ext4: Add statx support · 99652ea5
      David Howells 提交于
      Return enhanced file attributes from the Ext4 filesystem.  This includes
      the following:
      
       (1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME.
      
       (2) Certain FS_xxx_FL flags are mapped to stx_attribute flags.
      
      This requires that all ext4 inodes have a getattr call, not just some of
      them, so to this end, split the ext4_getattr() function and only call part
      of it where appropriate.
      
      Example output:
      
      	[root@andromeda ~]# touch foo
      	[root@andromeda ~]# chattr +ai foo
      	[root@andromeda ~]# /tmp/test-statx foo
      	statx(foo) = 0
      	results=fff
      	  Size: 0               Blocks: 0          IO Block: 4096    regular file
      	Device: 08:12           Inode: 2101950     Links: 1
      	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
      	Access: 2016-02-11 17:08:29.031795451+0000
      	Modify: 2016-02-11 17:08:29.031795451+0000
      	Change: 2016-02-11 17:11:11.987790114+0000
      	 Birth: 2016-02-11 17:08:29.031795451+0000
      	Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----)
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      99652ea5
  17. 26 3月, 2017 1 次提交
  18. 03 3月, 2017 1 次提交
    • D
      statx: Add a system call to make enhanced file info available · a528d35e
      David Howells 提交于
      Add a system call to make extended file information available, including
      file creation and some attribute flags where available through the
      underlying filesystem.
      
      The getattr inode operation is altered to take two additional arguments: a
      u32 request_mask and an unsigned int flags that indicate the
      synchronisation mode.  This change is propagated to the vfs_getattr*()
      function.
      
      Functions like vfs_stat() are now inline wrappers around new functions
      vfs_statx() and vfs_statx_fd() to reduce stack usage.
      
      ========
      OVERVIEW
      ========
      
      The idea was initially proposed as a set of xattrs that could be retrieved
      with getxattr(), but the general preference proved to be for a new syscall
      with an extended stat structure.
      
      A number of requests were gathered for features to be included.  The
      following have been included:
      
       (1) Make the fields a consistent size on all arches and make them large.
      
       (2) Spare space, request flags and information flags are provided for
           future expansion.
      
       (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
           __s64).
      
       (4) Creation time: The SMB protocol carries the creation time, which could
           be exported by Samba, which will in turn help CIFS make use of
           FS-Cache as that can be used for coherency data (stx_btime).
      
           This is also specified in NFSv4 as a recommended attribute and could
           be exported by NFSD [Steve French].
      
       (5) Lightweight stat: Ask for just those details of interest, and allow a
           netfs (such as NFS) to approximate anything not of interest, possibly
           without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
           Dilger] (AT_STATX_DONT_SYNC).
      
       (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
           its cached attributes are up to date [Trond Myklebust]
           (AT_STATX_FORCE_SYNC).
      
      And the following have been left out for future extension:
      
       (7) Data version number: Could be used by userspace NFS servers [Aneesh
           Kumar].
      
           Can also be used to modify fill_post_wcc() in NFSD which retrieves
           i_version directly, but has just called vfs_getattr().  It could get
           it from the kstat struct if it used vfs_xgetattr() instead.
      
           (There's disagreement on the exact semantics of a single field, since
           not all filesystems do this the same way).
      
       (8) BSD stat compatibility: Including more fields from the BSD stat such
           as creation time (st_btime) and inode generation number (st_gen)
           [Jeremy Allison, Bernd Schubert].
      
       (9) Inode generation number: Useful for FUSE and userspace NFS servers
           [Bernd Schubert].
      
           (This was asked for but later deemed unnecessary with the
           open-by-handle capability available and caused disagreement as to
           whether it's a security hole or not).
      
      (10) Extra coherency data may be useful in making backups [Andreas Dilger].
      
           (No particular data were offered, but things like last backup
           timestamp, the data version number and the DOS archive bit would come
           into this category).
      
      (11) Allow the filesystem to indicate what it can/cannot provide: A
           filesystem can now say it doesn't support a standard stat feature if
           that isn't available, so if, for instance, inode numbers or UIDs don't
           exist or are fabricated locally...
      
           (This requires a separate system call - I have an fsinfo() call idea
           for this).
      
      (12) Store a 16-byte volume ID in the superblock that can be returned in
           struct xstat [Steve French].
      
           (Deferred to fsinfo).
      
      (13) Include granularity fields in the time data to indicate the
           granularity of each of the times (NFSv4 time_delta) [Steve French].
      
           (Deferred to fsinfo).
      
      (14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
           Note that the Linux IOC flags are a mess and filesystems such as Ext4
           define flags that aren't in linux/fs.h, so translation in the kernel
           may be a necessity (or, possibly, we provide the filesystem type too).
      
           (Some attributes are made available in stx_attributes, but the general
           feeling was that the IOC flags were to ext[234]-specific and shouldn't
           be exposed through statx this way).
      
      (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
           Michael Kerrisk].
      
           (Deferred, probably to fsinfo.  Finding out if there's an ACL or
           seclabal might require extra filesystem operations).
      
      (16) Femtosecond-resolution timestamps [Dave Chinner].
      
           (A __reserved field has been left in the statx_timestamp struct for
           this - if there proves to be a need).
      
      (17) A set multiple attributes syscall to go with this.
      
      ===============
      NEW SYSTEM CALL
      ===============
      
      The new system call is:
      
      	int ret = statx(int dfd,
      			const char *filename,
      			unsigned int flags,
      			unsigned int mask,
      			struct statx *buffer);
      
      The dfd, filename and flags parameters indicate the file to query, in a
      similar way to fstatat().  There is no equivalent of lstat() as that can be
      emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
      also no equivalent of fstat() as that can be emulated by passing a NULL
      filename to statx() with the fd of interest in dfd.
      
      Whether or not statx() synchronises the attributes with the backing store
      can be controlled by OR'ing a value into the flags argument (this typically
      only affects network filesystems):
      
       (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
           respect.
      
       (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
           its attributes with the server - which might require data writeback to
           occur to get the timestamps correct.
      
       (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
           network filesystem.  The resulting values should be considered
           approximate.
      
      mask is a bitmask indicating the fields in struct statx that are of
      interest to the caller.  The user should set this to STATX_BASIC_STATS to
      get the basic set returned by stat().  It should be noted that asking for
      more information may entail extra I/O operations.
      
      buffer points to the destination for the data.  This must be 256 bytes in
      size.
      
      ======================
      MAIN ATTRIBUTES RECORD
      ======================
      
      The following structures are defined in which to return the main attribute
      set:
      
      	struct statx_timestamp {
      		__s64	tv_sec;
      		__s32	tv_nsec;
      		__s32	__reserved;
      	};
      
      	struct statx {
      		__u32	stx_mask;
      		__u32	stx_blksize;
      		__u64	stx_attributes;
      		__u32	stx_nlink;
      		__u32	stx_uid;
      		__u32	stx_gid;
      		__u16	stx_mode;
      		__u16	__spare0[1];
      		__u64	stx_ino;
      		__u64	stx_size;
      		__u64	stx_blocks;
      		__u64	__spare1[1];
      		struct statx_timestamp	stx_atime;
      		struct statx_timestamp	stx_btime;
      		struct statx_timestamp	stx_ctime;
      		struct statx_timestamp	stx_mtime;
      		__u32	stx_rdev_major;
      		__u32	stx_rdev_minor;
      		__u32	stx_dev_major;
      		__u32	stx_dev_minor;
      		__u64	__spare2[14];
      	};
      
      The defined bits in request_mask and stx_mask are:
      
      	STATX_TYPE		Want/got stx_mode & S_IFMT
      	STATX_MODE		Want/got stx_mode & ~S_IFMT
      	STATX_NLINK		Want/got stx_nlink
      	STATX_UID		Want/got stx_uid
      	STATX_GID		Want/got stx_gid
      	STATX_ATIME		Want/got stx_atime{,_ns}
      	STATX_MTIME		Want/got stx_mtime{,_ns}
      	STATX_CTIME		Want/got stx_ctime{,_ns}
      	STATX_INO		Want/got stx_ino
      	STATX_SIZE		Want/got stx_size
      	STATX_BLOCKS		Want/got stx_blocks
      	STATX_BASIC_STATS	[The stuff in the normal stat struct]
      	STATX_BTIME		Want/got stx_btime{,_ns}
      	STATX_ALL		[All currently available stuff]
      
      stx_btime is the file creation time, stx_mask is a bitmask indicating the
      data provided and __spares*[] are where as-yet undefined fields can be
      placed.
      
      Time fields are structures with separate seconds and nanoseconds fields
      plus a reserved field in case we want to add even finer resolution.  Note
      that times will be negative if before 1970; in such a case, the nanosecond
      fields will also be negative if not zero.
      
      The bits defined in the stx_attributes field convey information about a
      file, how it is accessed, where it is and what it does.  The following
      attributes map to FS_*_FL flags and are the same numerical value:
      
      	STATX_ATTR_COMPRESSED		File is compressed by the fs
      	STATX_ATTR_IMMUTABLE		File is marked immutable
      	STATX_ATTR_APPEND		File is append-only
      	STATX_ATTR_NODUMP		File is not to be dumped
      	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs
      
      Within the kernel, the supported flags are listed by:
      
      	KSTAT_ATTR_FS_IOC_FLAGS
      
      [Are any other IOC flags of sufficient general interest to be exposed
      through this interface?]
      
      New flags include:
      
      	STATX_ATTR_AUTOMOUNT		Object is an automount trigger
      
      These are for the use of GUI tools that might want to mark files specially,
      depending on what they are.
      
      Fields in struct statx come in a number of classes:
      
       (0) stx_dev_*, stx_blksize.
      
           These are local system information and are always available.
      
       (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
           stx_size, stx_blocks.
      
           These will be returned whether the caller asks for them or not.  The
           corresponding bits in stx_mask will be set to indicate whether they
           actually have valid values.
      
           If the caller didn't ask for them, then they may be approximated.  For
           example, NFS won't waste any time updating them from the server,
           unless as a byproduct of updating something requested.
      
           If the values don't actually exist for the underlying object (such as
           UID or GID on a DOS file), then the bit won't be set in the stx_mask,
           even if the caller asked for the value.  In such a case, the returned
           value will be a fabrication.
      
           Note that there are instances where the type might not be valid, for
           instance Windows reparse points.
      
       (2) stx_rdev_*.
      
           This will be set only if stx_mode indicates we're looking at a
           blockdev or a chardev, otherwise will be 0.
      
       (3) stx_btime.
      
           Similar to (1), except this will be set to 0 if it doesn't exist.
      
      =======
      TESTING
      =======
      
      The following test program can be used to test the statx system call:
      
      	samples/statx/test-statx.c
      
      Just compile and run, passing it paths to the files you want to examine.
      The file is built automatically if CONFIG_SAMPLES is enabled.
      
      Here's some example output.  Firstly, an NFS directory that crosses to
      another FSID.  Note that the AUTOMOUNT attribute is set because transiting
      this directory will cause d_automount to be invoked by the VFS.
      
      	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:26           Inode: 1703937     Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
      
      Secondly, the result of automounting on that directory.
      
      	[root@andromeda ~]# /tmp/test-statx /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:27           Inode: 2           Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a528d35e
  19. 28 2月, 2017 1 次提交
  20. 25 2月, 2017 1 次提交
  21. 15 2月, 2017 1 次提交
    • T
      ext4: don't BUG when truncating encrypted inodes on the orphan list · 0d06863f
      Theodore Ts'o 提交于
      Fix a BUG when the kernel tries to mount a file system constructed as
      follows:
      
      echo foo > foo.txt
      mke2fs -Fq -t ext4 -O encrypt foo.img 100
      debugfs -w foo.img << EOF
      write foo.txt a
      set_inode_field a i_flags 0x80800
      set_super_value s_last_orphan 12
      quit
      EOF
      
      root@kvm-xfstests:~# mount -o loop foo.img /mnt
      [  160.238770] ------------[ cut here ]------------
      [  160.240106] kernel BUG at /usr/projects/linux/ext4/fs/ext4/inode.c:3874!
      [  160.240106] invalid opcode: 0000 [#1] SMP
      [  160.240106] Modules linked in:
      [  160.240106] CPU: 0 PID: 2547 Comm: mount Tainted: G        W       4.10.0-rc3-00034-gcdd33b941b67 #227
      [  160.240106] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
      [  160.240106] task: f4518000 task.stack: f47b6000
      [  160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4
      [  160.240106] EFLAGS: 00010246 CPU: 0
      [  160.240106] EAX: 00000001 EBX: f7be4b50 ECX: f47b7dc0 EDX: 00000007
      [  160.240106] ESI: f43b05a8 EDI: f43babec EBP: f47b7dd0 ESP: f47b7dac
      [  160.240106]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
      [  160.240106] CR0: 80050033 CR2: bfd85b08 CR3: 34a00680 CR4: 000006f0
      [  160.240106] Call Trace:
      [  160.240106]  ext4_truncate+0x1e9/0x3e5
      [  160.240106]  ext4_fill_super+0x286f/0x2b1e
      [  160.240106]  ? set_blocksize+0x2e/0x7e
      [  160.240106]  mount_bdev+0x114/0x15f
      [  160.240106]  ext4_mount+0x15/0x17
      [  160.240106]  ? ext4_calculate_overhead+0x39d/0x39d
      [  160.240106]  mount_fs+0x58/0x115
      [  160.240106]  vfs_kern_mount+0x4b/0xae
      [  160.240106]  do_mount+0x671/0x8c3
      [  160.240106]  ? _copy_from_user+0x70/0x83
      [  160.240106]  ? strndup_user+0x31/0x46
      [  160.240106]  SyS_mount+0x57/0x7b
      [  160.240106]  do_int80_syscall_32+0x4f/0x61
      [  160.240106]  entry_INT80_32+0x2f/0x2f
      [  160.240106] EIP: 0xb76b919e
      [  160.240106] EFLAGS: 00000246 CPU: 0
      [  160.240106] EAX: ffffffda EBX: 08053838 ECX: 08052188 EDX: 080537e8
      [  160.240106] ESI: c0ed0000 EDI: 00000000 EBP: 080537e8 ESP: bfa13660
      [  160.240106]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
      [  160.240106] Code: 59 8b 00 a8 01 0f 84 09 01 00 00 8b 07 66 25 00 f0 66 3d 00 80 75 61 89 f8 e8 3e e2 ff ff 84 c0 74 56 83 bf 48 02 00 00 00 75 02 <0f> 0b 81 7d e8 00 10 00 00 74 02 0f 0b 8b 43 04 8b 53 08 31 c9
      [  160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 SS:ESP: 0068:f47b7dac
      [  160.317241] ---[ end trace d6a773a375c810a5 ]---
      
      The problem is that when the kernel tries to truncate an inode in
      ext4_truncate(), it tries to clear any on-disk data beyond i_size.
      Without the encryption key, it can't do that, and so it triggers a
      BUG.
      
      E2fsck does *not* provide this service, and in practice most file
      systems have their orphan list processed by e2fsck, so to avoid
      crashing, this patch skips this step if we don't have access to the
      encryption key (which is the case when processing the orphan list; in
      all other cases, we will have the encryption key, or the kernel
      wouldn't have allowed the file to be opened).
      
      An open question is whether the fact that e2fsck isn't clearing the
      bytes beyond i_size causing problems --- and if we've lived with it
      not doing it for so long, can we drop this from the kernel replay of
      the orphan list in all cases (not just when we don't have the key for
      encrypted inodes).
      
      Addresses-Google-Bug: #35209576
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      0d06863f
  22. 05 2月, 2017 2 次提交
  23. 31 1月, 2017 1 次提交
  24. 28 1月, 2017 1 次提交
    • J
      ext4: fix data corruption in data=journal mode · 3b136499
      Jan Kara 提交于
      ext4_journalled_write_end() did not propely handle all the cases when
      generic_perform_write() did not copy all the data into the target page
      and could mark buffers with uninitialized contents as uptodate and dirty
      leading to possible data corruption (which would be quickly fixed by
      generic_perform_write() retrying the write but still). Fix the problem
      by carefully handling the case when the page that is written to is not
      uptodate.
      
      CC: stable@vger.kernel.org
      Reported-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      3b136499
  25. 23 1月, 2017 1 次提交
  26. 12 1月, 2017 1 次提交
  27. 12 12月, 2016 1 次提交
  28. 10 12月, 2016 1 次提交
  29. 03 12月, 2016 1 次提交
    • T
      ext4: fix reading new encrypted symlinks on no-journal file systems · 4db0d88e
      Theodore Ts'o 提交于
      On a filesystem with no journal, a symlink longer than about 32
      characters (exact length depending on padding for encryption) could not
      be followed or read immediately after being created in an encrypted
      directory.  This happened because when the symlink data went through the
      delayed allocation path instead of the journaling path, the symlink was
      incorrectly detected as a "fast" symlink rather than a "slow" symlink
      until its data was written out.
      
      To fix this, disable delayed allocation for symlinks, since there is
      no benefit for delayed allocation anyway.
      Reported-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      4db0d88e
  30. 02 12月, 2016 1 次提交