1. 09 Nov, 2017 1 commit
  2. 19 Oct, 2017 6 commits
  3. 12 Oct, 2017 2 commits
    • ext4: add sanity check for encryption + DAX · 7d3e06a8
      Ross Zwisler committed
      We prevent DAX from being used on inodes which are using ext4's built in
      encryption via a check in ext4_set_inode_flags().  We do have what appears
      to be an unsafe transition of S_DAX in ext4_set_context(), though, where
      S_DAX can get disabled without us doing a proper writeback + invalidate.
      
      There are also issues with mm-level races when changing the value of S_DAX,
      as well as issues with the VM_MIXEDMAP flag:
      
      https://www.spinics.net/lists/linux-xfs/msg09859.html
      
      I actually think we are safe in this case because of the following:
      
      1) You can't encrypt an existing file.  Encryption can only be set on an
      empty directory, with new inodes in that directory being created with
      encryption turned on, so I don't think it's possible to turn encryption on
      for a file that has open DAX mmaps or outstanding I/Os.
      
      2) There is no way to turn encryption off on a given file.  Once an inode
      is encrypted, it stays encrypted for the life of that inode, so we don't
      have to worry about the case where we turn encryption off and S_DAX
      suddenly turns on.
      
      3) The only way we end up in ext4_set_context() to turn on encryption is
      when we are creating a new file in the encrypted directory.  This happens
      as part of ext4_create() before the inode has been allowed to do any I/O.
      Here's the call tree:
      
       ext4_create()
         __ext4_new_inode()
      	 ext4_set_inode_flags() // sets S_DAX
      	 fscrypt_inherit_context()
      		fscrypt_get_encryption_info();
      		ext4_set_context() // sets EXT4_INODE_ENCRYPT, clears S_DAX
      
      So, I actually think it's safe to transition S_DAX in ext4_set_context()
      without any locking, writebacks or invalidations.  I've added a
      WARN_ON_ONCE() sanity check to make sure that we are notified if we ever
      encounter a case where we are encrypting an inode that already has data,
      in which case we need to add code to safely transition S_DAX.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
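
      The sanity check described in this commit can be modelled in userspace
      roughly as follows. This is only a sketch: struct demo_inode, the DEMO_*
      flag values, and demo_set_context() are hypothetical stand-ins for the
      kernel's inode, S_DAX, EXT4_INODE_ENCRYPT, and ext4_set_context(), and
      the error return on the warning path is an assumption, not something
      the commit message states.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for kernel inode state (not the real values). */
#define DEMO_S_DAX         0x1u /* models S_DAX */
#define DEMO_INODE_ENCRYPT 0x2u /* models EXT4_INODE_ENCRYPT */

struct demo_inode {
    unsigned int flags;
    long long size; /* models i_size */
};

static bool demo_warned; /* models the "once" part of WARN_ON_ONCE() */

/*
 * Turn encryption on for an inode. A DAX inode that already holds data
 * would need writeback + invalidation before S_DAX can be cleared, so
 * that case is reported instead of being handled silently.
 * Returns 0 on success, -1 on the (assumed) error path.
 */
static int demo_set_context(struct demo_inode *inode)
{
    if ((inode->flags & DEMO_S_DAX) && inode->size > 0) {
        if (!demo_warned) {
            fprintf(stderr, "warning: encrypting a DAX inode with data\n");
            demo_warned = true;
        }
        return -1;
    }
    inode->flags |= DEMO_INODE_ENCRYPT;
    inode->flags &= ~DEMO_S_DAX; /* encryption and DAX are mutually exclusive */
    return 0;
}
```

      A freshly created, empty inode passes through and loses S_DAX; an inode
      with a non-zero size triggers the one-time warning.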
    • ext4: prevent data corruption with inline data + DAX · 559db4c6
      Ross Zwisler committed
      If an inode has inline data it is currently prevented from using DAX by a
      check in ext4_set_inode_flags().  When the inode grows inline data via
      ext4_create_inline_data() or removes its inline data via
      ext4_destroy_inline_data_nolock(), the value of S_DAX can change.
      
      Currently these changes are unsafe because we don't hold off page faults
      and I/O, write back dirty radix tree entries and invalidate all mappings.
      There are also issues with mm-level races when changing the value of S_DAX,
      as well as issues with the VM_MIXEDMAP flag:
      
      https://www.spinics.net/lists/linux-xfs/msg09859.html
      
      The unsafe transition of S_DAX can reliably cause data corruption, as shown
      by the following fstest:
      
      https://patchwork.kernel.org/patch/9948381/
      
      Fix this issue by preventing the DAX mount option from being used on
      filesystems that were created to support inline data.  Inline data is an
      option given to mkfs.ext4.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      CC: stable@vger.kernel.org
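
      The fix amounts to a mount-time compatibility check. A minimal sketch,
      assuming a hypothetical struct demo_sb that records whether the
      filesystem was created with the inline data feature; the function name
      and the choice of EINVAL as the error code are assumptions made for
      illustration:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical superblock model: only the feature bit relevant here. */
struct demo_sb {
    bool feature_inline_data; /* chosen at mkfs.ext4 time */
};

/*
 * Refuse the DAX mount option on filesystems that support inline data,
 * so S_DAX can never change when an inode gains or loses inline data.
 * Returns 0 if the mount may proceed, a negative errno otherwise.
 */
static int demo_check_dax_mount(const struct demo_sb *sb, bool dax_requested)
{
    if (dax_requested && sb->feature_inline_data)
        return -EINVAL; /* assumed error code */
    return 0;
}
```

      Rejecting the combination up front avoids having to make the S_DAX
      transition itself safe.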
  4. 06 Sep, 2017 1 commit
  5. 01 Sep, 2017 1 commit
  6. 25 Aug, 2017 2 commits
  7. 18 Aug, 2017 3 commits
    • quota: Reduce contention on dq_data_lock · 7b9ca4c6
      Jan Kara committed
      dq_data_lock is currently used to protect all modifications of quota
      accounting information, consistency of quota accounting on the inode,
      and dquot pointers from inode. As a result contention on the lock can be
      pretty heavy.
      
      Reduce the contention on the lock by protecting quota accounting
      information by a new dquot->dq_dqb_lock and consistency of quota
      accounting with inode usage by inode->i_lock.
      
      This change reduces time to create 500000 files on ext4 on ramdisk by 50
      different processes in separate directories by 6% when user quota is
      turned on. When those 50 processes belong to 50 different users, the
      improvement is about 9%.
      Signed-off-by: Jan Kara <jack@suse.cz>
    • ext4: Disable dirty list tracking of dquots when journalling quotas · 91389240
      Jan Kara committed
      When journalling quotas, we write back all dquots immediately after
      changing them as part of the current transaction. Thus there's no need to
      write anything in dquot_writeback_dquots() and so we can avoid updating
      the list of dirty dquots to reduce dq_list_lock contention.
      
      This change reduces time to create 500000 files on ext4 on ramdisk by 50
      different processes in separate directories by 15% when user quota is
      turned on.
      Signed-off-by: Jan Kara <jack@suse.cz>
    • quota: Convert dqio_mutex to rwsem · bc8230ee
      Jan Kara committed
      Convert dqio_mutex to rwsem and call it dqio_sem. No functional changes
      yet.
      Signed-off-by: Jan Kara <jack@suse.cz>
  8. 31 Jul, 2017 1 commit
  9. 17 Jul, 2017 1 commit
    • VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) · bc98a42c
      David Howells committed
      Firstly by applying the following with coccinelle's spatch:
      
      	@@ expression SB; @@
      	-SB->s_flags & MS_RDONLY
      	+sb_rdonly(SB)
      
      to effect the conversion to sb_rdonly(sb), then by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(!sb_rdonly(SB)) && A
      	+!sb_rdonly(SB) && A
      	|
      	-A != (sb_rdonly(SB))
      	+A != sb_rdonly(SB)
      	|
      	-A == (sb_rdonly(SB))
      	+A == sb_rdonly(SB)
      	|
      	-!(sb_rdonly(SB))
      	+!sb_rdonly(SB)
      	|
      	-A && (sb_rdonly(SB))
      	+A && sb_rdonly(SB)
      	|
      	-A || (sb_rdonly(SB))
      	+A || sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) != A
      	+sb_rdonly(SB) != A
      	|
      	-(sb_rdonly(SB)) == A
      	+sb_rdonly(SB) == A
      	|
      	-(sb_rdonly(SB)) && A
      	+sb_rdonly(SB) && A
      	|
      	-(sb_rdonly(SB)) || A
      	+sb_rdonly(SB) || A
      	)
      
      	@@ expression A, B, SB; @@
      	(
      	-(sb_rdonly(SB)) ? 1 : 0
      	+sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) ? A : B
      	+sb_rdonly(SB) ? A : B
      	)
      
      to remove left over excess bracketage and finally by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(A & MS_RDONLY) != sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
      	|
      	-(A & MS_RDONLY) == sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
      	)
      
      to make comparisons against the result of sb_rdonly() (which is a bool)
      work correctly.
      Signed-off-by: David Howells <dhowells@redhat.com>
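
      The last semantic patch exists because a masked flag value and a bool
      do not compare the way one might expect. The sketch below illustrates
      this in plain C. MS_RDONLY occupies bit 0, so there the raw and bool
      values happen to coincide; to make the pitfall visible, the sketch uses
      a hypothetical flag DEMO_FLAG on a higher bit and a hypothetical helper
      demo_sb_flag() with the same shape as sb_rdonly().

```c
#include <stdbool.h>

#define DEMO_FLAG 4UL /* hypothetical flag on bit 2 */

struct demo_super_block {
    unsigned long s_flags;
};

/* Same shape as sb_rdonly(): the bool return collapses the mask to 0 or 1. */
static inline bool demo_sb_flag(const struct demo_super_block *sb)
{
    return sb->s_flags & DEMO_FLAG;
}

/* Naive form: compares the raw masked value (0 or 4) with a bool (0 or 1). */
static bool demo_naive_equal(unsigned long a, const struct demo_super_block *sb)
{
    return (a & DEMO_FLAG) == demo_sb_flag(sb);
}

/* Fixed form, as in the semantic patch: cast to bool before comparing. */
static bool demo_fixed_equal(unsigned long a, const struct demo_super_block *sb)
{
    return (bool)(a & DEMO_FLAG) == demo_sb_flag(sb);
}
```

      With the flag set on both sides, the naive form compares 4 with 1 and
      reports a mismatch, while the cast form compares true with true.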
  10. 06 Jul, 2017 1 commit
    • ext4: fix __ext4_new_inode() journal credits calculation · af65207c
      Tahsin Erdogan committed
      ea_inode feature allows creating extended attributes that are up to
      64k in size. Update __ext4_new_inode() to pick increased credit limits.
      
      To avoid overallocating too many journal credits, update
      __ext4_xattr_set_credits() to make a distinction between xattr create
      vs update. This helps __ext4_new_inode() because all attributes are
      known to be new, so we can save credits that are normally needed to
      delete old values.
      
      Also, have fscrypt specify its maximum context size so that we don't
      end up allocating credits for 64k size.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  11. 24 Jun, 2017 1 commit
  12. 23 Jun, 2017 2 commits
    • ext4: forbid encrypting root directory · 9ce0151a
      Eric Biggers committed
      Currently it's possible to encrypt all files and directories on an ext4
      filesystem by deleting everything, including lost+found, then setting an
      encryption policy on the root directory.  However, this is incompatible
      with e2fsck because e2fsck expects to find, create, and/or write to
      lost+found and does not have access to any encryption keys.  Especially
      problematic is that if e2fsck can't find lost+found, it will create it
      without regard for whether the root directory is encrypted.  This is
      wrong for obvious reasons, and it causes a later run of e2fsck to
      consider the lost+found directory entry to be corrupted.
      
      Encrypting the root directory may also be of limited use because it is
      the "all-or-nothing" use case, for which dm-crypt can be used instead.
      (By design, encryption policies are inherited and cannot be overridden;
      so the root directory having an encryption policy implies that all files
      and directories on the filesystem have that same encryption policy.)
      
      In any case, encrypting the root directory is broken currently and must
      not be allowed; so start returning an error if userspace requests it.
      For now only do this in ext4, because f2fs and ubifs do not appear to
      have the lost+found requirement.  We could move it into
      fscrypt_ioctl_set_policy() later if desired, though.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
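
      The rule this commit introduces can be sketched as a single guard on
      the inode number. ext4's root directory is inode 2 (EXT4_ROOT_INO);
      demo_set_policy() is a hypothetical stand-in for the policy-setting
      path, and the EPERM error code is an assumption.

```c
#include <errno.h>

#define DEMO_ROOT_INO 2UL /* ext4's root directory inode number */

/*
 * Reject an encryption policy on the root directory, since e2fsck must be
 * able to find and write lost+found without any encryption keys.
 * Returns 0 if the policy may be set, a negative errno otherwise.
 */
static int demo_set_policy(unsigned long ino)
{
    if (ino == DEMO_ROOT_INO)
        return -EPERM; /* assumed error code */
    return 0;
}
```

      Any other empty directory is still allowed to take a policy.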
    • ext4: send parallel discards on commit completions · a0154344
      Daeho Jeong committed
      Currently, when an ext4 filesystem is mounted with the '-o discard'
      option, we have to issue discard commands for all the blocks being
      deallocated and wait for those commands to complete during the commit
      completion phase. Because this procedure can involve many sequential
      rounds of issuing a discard command and waiting for it, the delay can
      become very long, up to 17.0s in our test, which results in long commit
      delays and degraded fsync() performance.

      To reduce this delay, instead of adding a callback for each extent and
      handling all of them sequentially in the commit phase, we add a
      separate list of extents to free to the superblock and process this
      list in one go after the transaction commits, so that we can issue all
      the discard commands in parallel, as the XFS filesystem does.

      With this change, the worst-case delay of a single commit dropped from
      17.0s to 4.8s.
      Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Tested-by: Hobin Woo <hobin.woo@samsung.com>
      Tested-by: Kitae Lee <kitae87.lee@samsung.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
  13. 22 Jun, 2017 5 commits
    • ext4: add nombcache mount option · cdb7ee4c
      Tahsin Erdogan committed
      The main purpose of mb cache is to achieve deduplication in
      extended attributes. In use cases where opportunity for deduplication
      is unlikely, it only adds overhead.
      
      Add a mount option to explicitly turn off mb cache.
      Suggested-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • quota: add get_inode_usage callback to transfer multi-inode charges · 7a9ca53a
      Tahsin Erdogan committed
      The ext4 ea_inode feature allows storing xattr values in external inodes
      so that values bigger than a block can be stored. Ext4 also has
      deduplication support for this type of inode. With deduplication,
      the actual storage waste is eliminated but the users of such inodes are
      still charged full quota for the inodes as if there was no sharing
      happening in the background.
      
      This design requires ext4 to manually charge the users because the
      inodes are shared.
      
      An implication of this is that, if someone calls chown on a file that
      has such references we need to transfer the quota for the file and xattr
      inodes. Current dquot_transfer() function implicitly transfers one inode
      charge. With ea_inode feature, we would like to transfer multiple inode
      charges.
      
      Add get_inode_usage callback which can interrogate the total number of
      inodes that were charged for a given inode.
      
      [ Applied fix from Colin King to make sure the 'ret' variable is
        initialized on the successful return path.  Detected by
        CoverityScan, CID#1446616 ("Uninitialized scalar variable") --tytso]
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Jan Kara <jack@suse.cz>
    • ext4: xattr inode deduplication · dec214d0
      Tahsin Erdogan committed
      Ext4 now supports xattr values that are up to 64k in size (vfs limit).
      Large xattr values are stored in external inodes each one holding a
      single value. Once written the data blocks of these inodes are immutable.
      
      The real world use cases are expected to have a lot of value duplication
      such as inherited acls etc. To reduce data duplication on disk, this patch
      implements a deduplicator that allows sharing of xattr inodes.
      
      The deduplication is based on an in-memory hash lookup that is a best
      effort sharing scheme. When an xattr inode is read from disk (i.e. on a
      getxattr() call), its crc32c hash is added to a hash table. Before
      creating a new xattr inode for a value being set, the hash table is
      checked to see if an existing inode holds an identical value. If such an
      inode is found, the ref count on that inode is incremented. On value
      removal the ref count is decremented and if it reaches zero the inode is
      deleted.
      
      The quota charging for such inodes is manually managed. Every reference
      holder is charged the full size as if there was no sharing happening.
      This is consistent with how xattr blocks are also charged.
      
      [ Fixed up journal credits calculation to handle inline data and the
        rare case where a shared xattr block can get freed when two threads
        race on breaking the xattr block sharing. --tytso ]
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
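
      The lookup-then-share scheme described in this commit can be
      illustrated with a small userspace model. Everything here is a
      simplification made for illustration: the fixed-size demo_table, the
      demo_ea_get()/demo_ea_put() names, and the use of FNV-1a in place of
      crc32c are all assumptions, and unlike the kernel's best-effort cache
      this table tracks every value it has created.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_SLOTS 16

/* Models one xattr inode: an immutable value plus a reference count. */
struct demo_ea_inode {
    uint32_t hash;
    size_t len;
    unsigned char *value;
    unsigned int refcount; /* 0 means the slot is free */
};

static struct demo_ea_inode demo_table[DEMO_SLOTS];

/* FNV-1a, used here only as a stand-in for crc32c. */
static uint32_t demo_hash(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns the slot holding the value, or -1 if no slot or memory is left. */
static int demo_ea_get(const void *value, size_t len)
{
    uint32_t h = demo_hash(value, len);
    int free_slot = -1;

    for (int i = 0; i < DEMO_SLOTS; i++) {
        struct demo_ea_inode *e = &demo_table[i];

        if (e->refcount == 0) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        /* A hash match is only a hint; confirm by comparing the bytes. */
        if (e->hash == h && e->len == len &&
            memcmp(e->value, value, len) == 0) {
            e->refcount++; /* share the existing inode */
            return i;
        }
    }
    if (free_slot < 0)
        return -1;

    struct demo_ea_inode *e = &demo_table[free_slot];
    e->value = malloc(len ? len : 1);
    if (!e->value)
        return -1;
    memcpy(e->value, value, len);
    e->hash = h;
    e->len = len;
    e->refcount = 1;
    return free_slot;
}

/* Drops one reference; the value is freed when the count reaches zero. */
static void demo_ea_put(int slot)
{
    struct demo_ea_inode *e = &demo_table[slot];

    if (e->refcount > 0 && --e->refcount == 0) {
        free(e->value);
        e->value = NULL;
    }
}
```

      Setting the same value twice returns the same slot with a reference
      count of two; the storage is released only after both holders drop it.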
    • ext2, ext4: make mb block cache names more explicit · 47387409
      Tahsin Erdogan committed
      There will be a second mb_cache instance that tracks ea_inodes. Make
      existing names more explicit so that it is clear that they refer to
      xattr block cache.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: improve journal credit handling in set xattr paths · c1a5d5f6
      Tahsin Erdogan committed
      Both ext4_set_acl() and ext4_set_context() need to be made aware of
      ea_inode feature when it comes to credits calculation.
      
      Also add a sufficient credits check in ext4_xattr_set_handle() right
      after the xattr write lock is grabbed. The original credits calculation
      is done outside the lock, so there is a possibility that the initially
      calculated credits are not sufficient anymore.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  14. 05 Jun, 2017 1 commit
  15. 25 May, 2017 1 commit
  16. 22 May, 2017 1 commit
  17. 09 May, 2017 2 commits
    • mm: introduce kv[mz]alloc helpers · a7c3e901
      Michal Hocko committed
      Patch series "kvmalloc", v5.
      
      There are many open coded kmalloc with vmalloc fallback instances in the
      tree.  Most of them are not careful enough or simply do not care about
      the underlying semantic of the kmalloc/page allocator which means that
      a) some vmalloc fallbacks are basically unreachable because the kmalloc
      part will keep retrying until it succeeds b) the page allocator can
      invoke really disruptive steps like the OOM killer to move forward
      which doesn't sound appropriate when we consider that the vmalloc
      fallback is available.

      As can be seen, implementing kvmalloc requires quite an intimate
      knowledge of the page allocator and the memory reclaim internals, which
      strongly suggests that a helper should be implemented in the memory
      subsystem proper.

      Most callers I could find have been converted to use the helper
      instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
      in the networking stack which I have converted as well; Eric Dumazet
      was not opposed [2] to converting them.
      
      [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com
      
      This patch (of 9):
      
      Using kmalloc with the vmalloc fallback for larger allocations is a
      common pattern in the kernel code.  Yet we do not have any common helper
      for that and so users have invented their own helpers.  Some of them are
      really creative when doing so.  Let's just add kv[mz]alloc and make sure
      it is implemented properly.  This implementation makes sure not to create
      large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
      not to warn about allocation failures.  This also rules out the OOM
      killer as the vmalloc is a more appropriate fallback than a disruptive
      user visible action.

      This patch also changes some existing users and removes helpers which
      are specific for them.  In some cases this is not possible (e.g.
      ext4_kvmalloc, libcfs_kvzalloc) because those seem to be broken and
      require GFP_NO{FS,IO} context which is not vmalloc compatible in general
      (note that the page table allocation is GFP_KERNEL).  Those need to be
      fixed separately.

      While we are at it, document the unsupported gfp masks for
      __vmalloc{_node} because there seems to be a lot of confusion out there.
      kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
      superset) flags to catch new abusers.  Existing ones would have to die
      slowly.
      
      [sfr@canb.auug.org.au: f2fs fixup]
        Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>	[ext4 part]
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
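
      The allocation policy described in this commit (try the contiguous
      allocator first, do not push hard for requests above a page, then fall
      back) can be modelled in userspace as below. This is only a sketch:
      demo_contig_alloc() and demo_virt_alloc() are hypothetical stand-ins
      for kmalloc() and vmalloc(), both backed by malloc(), and the failure
      switch exists purely so the fallback path can be exercised.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096u

/* Demo hooks: force large contiguous allocations to fail, count fallbacks. */
static bool demo_contig_fails_for_large;
static int demo_fallback_count;

/* Stand-in for kmalloc(); may_retry_hard models the absence of __GFP_NORETRY. */
static void *demo_contig_alloc(size_t size, bool may_retry_hard)
{
    (void)may_retry_hard; /* a real allocator would reclaim harder when set */
    if (size > DEMO_PAGE_SIZE && demo_contig_fails_for_large)
        return NULL;
    return malloc(size);
}

/* Stand-in for vmalloc(): virtually contiguous memory. */
static void *demo_virt_alloc(size_t size)
{
    demo_fallback_count++;
    return malloc(size);
}

/*
 * Model of the kvmalloc idea: requests up to a page use the contiguous
 * allocator normally and never fall back; larger requests make one
 * low-effort contiguous attempt and then use the fallback allocator.
 */
static void *demo_kvmalloc(size_t size)
{
    bool may_retry_hard = size <= DEMO_PAGE_SIZE;
    void *p = demo_contig_alloc(size, may_retry_hard);

    if (p || size <= DEMO_PAGE_SIZE)
        return p;
    return demo_virt_alloc(size);
}
```

      In this model both paths return malloc() memory, so free() plays the
      role of kvfree().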
    • block, dax: move "select DAX" from BLOCK to FS_DAX · ef510424
      Dan Williams committed
      For configurations that do not enable DAX filesystems or drivers, do not
      require the DAX core to be built.
      
      Given that the 'direct_access' method has been removed from
      'block_device_operations', we can also go ahead and move the
      block-related dax helper functions from fs/block_dev.c to
      drivers/dax/super.c. This keeps dax details out of the block layer and
      lets the DAX core be built as a module in the FS_DAX=n case.
      
      Filesystems need to include dax.h to call bdev_dax_supported().
      
      Cc: linux-xfs@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  18. 04 May, 2017 1 commit
    • ext4: mark superblock writes synchronous for nobarrier mounts · 00473374
      Jan Kara committed
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_FUA implementation.
      generic_make_request_checks(), however, strips the REQ_FUA flag from a
      bio when the storage doesn't report a volatile write cache, and thus the
      write effectively becomes asynchronous, which can lead to performance
      regressions. This affects superblock writes for ext4. Fix the problem
      by marking superblock writes always as synchronous.
      
      Fixes: b685d3d6
      CC: linux-ext4@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
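
      The regression and its fix can be illustrated with a small flag model.
      The DEMO_REQ_* bit values and both helper names are hypothetical; the
      real flags live in the kernel's block layer headers. The sketch only
      models the behaviour the commit message describes: REQ_FUA implies a
      synchronous write, but is stripped when the device reports no volatile
      write cache.

```c
#include <stdbool.h>

/* Hypothetical bit values for illustration only. */
#define DEMO_REQ_SYNC     0x1u
#define DEMO_REQ_FUA      0x2u
#define DEMO_REQ_PREFLUSH 0x4u

/* A write counts as synchronous if any of these flags is set. */
static bool demo_is_sync(unsigned int flags)
{
    return flags & (DEMO_REQ_SYNC | DEMO_REQ_FUA | DEMO_REQ_PREFLUSH);
}

/* Models the submission check that strips FUA when no write cache exists. */
static unsigned int demo_submit_checks(unsigned int flags,
                                       bool has_volatile_cache)
{
    if (!has_volatile_cache)
        flags &= ~DEMO_REQ_FUA;
    return flags;
}
```

      A write flagged only with FUA loses its synchronous status on a device
      without a volatile write cache; adding the sync flag explicitly, as
      this commit does for superblock writes, keeps it synchronous.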
  19. 30 Apr, 2017 3 commits
  20. 24 Apr, 2017 1 commit
  21. 19 Apr, 2017 1 commit
    • ext4: Set flags on quota files directly · 957153fc
      Jan Kara committed
      Currently immutable and noatime flags on quota files are set by quota
      code which requires us to copy inode->i_flags to our on disk version of
      quota flags in GETFLAGS ioctl and ext4_do_update_inode(). Move to
      setting / clearing these on-disk flags directly to save that copying.
      Signed-off-by: Jan Kara <jack@suse.cz>
  22. 16 Mar, 2017 1 commit
    • fscrypt: eliminate ->prepare_context() operation · 94840e3c
      Eric Biggers committed
      The only use of the ->prepare_context() fscrypt operation was to allow
      ext4 to evict inline data from the inode before ->set_context().
      However, there is no reason why this cannot be done as simply the first
      step in ->set_context(), and in fact it makes more sense to do it that
      way because then the policy modes and flags get validated before any
      real work is done.  Therefore, merge ext4_prepare_context() into
      ext4_set_context(), and remove ->prepare_context().
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  23. 15 Feb, 2017 1 commit