1. 31 5月, 2018 1 次提交
    • D
      fs: clear writeback errors in inode_init_always · 829bc787
      Darrick J. Wong 提交于
      In inode_init_always(), we clear the inode mapping flags, which clears
      any retained error (AS_EIO, AS_ENOSPC) bits.  Unfortunately, we do not
      also clear wb_err, which means that old mapping errors can leak through
      to new inodes.
      
      This is crucial for the XFS inode allocation path because we recycle old
      in-core inodes and we do not want error state from an old file to leak
      into the new file.  This bug was discovered by running generic/036 and
      generic/047 in a loop and noticing that the EIOs generated by the
      collision of direct and buffered writes in generic/036 would survive the
      remount between 036 and 047, and get reported to the fsyncs (on
      different files!) in generic/047.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      829bc787
  2. 12 4月, 2018 1 次提交
  3. 12 3月, 2018 2 次提交
  4. 07 2月, 2018 1 次提交
  5. 29 1月, 2018 2 次提交
    • J
      fs: only set S_VERSION when updating times if necessary · e38cf302
      Jeff Layton 提交于
      We only really need to update i_version if someone has queried for it
      since we last incremented it. By doing that, we can avoid having to
      update the inode if the times haven't changed.
      
      If the times have changed, then we go ahead and forcibly increment the
      counter, under the assumption that we'll be going to the storage
      anyway, and the increment itself is relatively cheap.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      e38cf302
    • J
      fs: new API for handling inode->i_version · ae5e165d
      Jeff Layton 提交于
      Add a documentation blob that explains what the i_version field is, how
      it is expected to work, and how it is currently implemented by various
      filesystems.
      
      We already have inode_inc_iversion. Add several other functions for
      manipulating and accessing the i_version counter. For now, the
      implementation is trivial and basically works the way that all of the
      open-coded i_version accesses work today.
      
      Future patches will convert existing users of i_version to use the new
      API, and then convert the backend implementation to do things more
      efficiently.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      ae5e165d
  6. 28 11月, 2017 1 次提交
    • L
      Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds 提交于
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
                I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
                ACTIVE NOUSER"
      
          SED_PROG=
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
      
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
      
          for f in $L; do sed -i $f $SED_PROG; done
      Requested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1751e8a6
  7. 25 10月, 2017 1 次提交
    • M
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland 提交于
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6aa7de05
  8. 09 9月, 2017 1 次提交
  9. 05 9月, 2017 1 次提交
  10. 02 9月, 2017 1 次提交
    • D
      xfs: evict all inodes involved with log redo item · 799ea9e9
      Darrick J. Wong 提交于
      When we introduced the bmap redo log items, we set MS_ACTIVE on the
      mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
      from being truncated prematurely during log recovery.  This also had the
      effect of putting linked inodes on the lru instead of evicting them.
      
      Unfortunately, we neglected to find all those unreferenced lru inodes
      and evict them after finishing log recovery, which means that we leak
      them if anything goes wrong in the rest of xfs_mountfs, because the lru
      is only cleaned out on unmount.
      
      Therefore, evict unreferenced inodes in the lru list immediately
      after clearing MS_ACTIVE.
      
      Fixes: 17c12bcd ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Cc: viro@ZenIV.linux.org.uk
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      799ea9e9
  11. 07 7月, 2017 1 次提交
    • P
      mm: update callers to use HASH_ZERO flag · 3d375d78
      Pavel Tatashin 提交于
      Update dcache, inode, pid, mountpoint, and mount hash tables to use
      HASH_ZERO, and remove initialization after allocations.  In case of
      places where HASH_EARLY was used such as in __pv_init_lock_hash the
      zeroed hash table was already assumed, because memblock zeroes the
      memory.
      
      CPU: SPARC M6, Memory: 7T
      Before fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 11.798s
      
      After fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 3.198s
      
      CPU: Intel Xeon E5-2630, Memory: 2.2T:
      Before fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.245s
      
      After fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.244s
      
      Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NBabu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d375d78
  12. 30 6月, 2017 1 次提交
  13. 28 6月, 2017 1 次提交
    • J
      fs: add fcntl() interface for setting/getting write life time hints · c75b1d94
      Jens Axboe 提交于
      Define a set of write life time hints:
      
      RWH_WRITE_LIFE_NOT_SET	No hint information set
      RWH_WRITE_LIFE_NONE	No hints about write life time
      RWH_WRITE_LIFE_SHORT	Data written has a short life time
      RWH_WRITE_LIFE_MEDIUM	Data written has a medium life time
      RWH_WRITE_LIFE_LONG	Data written has a long life time
      RWH_WRITE_LIFE_EXTREME	Data written has an extremely long life time
      
      The intent is for these values to be relative to each other, no
      absolute meaning should be attached to these flag names.
      
      Add an fcntl interface for querying these flags, and also for
      setting them as well:
      
      F_GET_RW_HINT		Returns the read/write hint set on the
      			underlying inode.
      
      F_SET_RW_HINT		Set one of the above write hints on the
      			underlying inode.
      
      F_GET_FILE_RW_HINT	Returns the read/write hint set on the
      			file descriptor.
      
      F_SET_FILE_RW_HINT	Set one of the above write hints on the
      			file descriptor.
      
      The user passes in a 64-bit pointer to get/set these values, and
      the interface returns 0/-1 on success/error.
      
      Sample program testing/implementing basic setting/getting of write
      hints is below.
      
      Add support for storing the write life time hint in the inode flags
      and in struct file as well, and pass them to the kiocb flags. If
      both a file and its corresponding inode has a write hint, then we
      use the one in the file, if available. The file hint can be used
      for sync/direct IO, for buffered writeback only the inode hint
      is available.
      
      This is in preparation for utilizing these hints in the block layer,
      to guide on-media data placement.
      
      /*
       * writehint.c: get or set an inode write hint
       */
       #include <stdio.h>
       #include <fcntl.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <stdbool.h>
       #include <inttypes.h>
      
       #ifndef F_GET_RW_HINT
       #define F_LINUX_SPECIFIC_BASE	1024
       #define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
       #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
       #endif
      
      static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
      			"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
      			"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };
      
      int main(int argc, char *argv[])
      {
      	uint64_t hint;
      	int fd, ret;
      
      	if (argc < 2) {
      		fprintf(stderr, "%s: file <hint>\n", argv[0]);
      		return 1;
      	}
      
      	fd = open(argv[1], O_RDONLY);
      	if (fd < 0) {
      		perror("open");
      		return 2;
      	}
      
      	if (argc > 2) {
      		hint = atoi(argv[2]);
      		ret = fcntl(fd, F_SET_RW_HINT, &hint);
      		if (ret < 0) {
      			perror("fcntl: F_SET_RW_HINT");
      			return 4;
      		}
      	}
      
      	ret = fcntl(fd, F_GET_RW_HINT, &hint);
      	if (ret < 0) {
      		perror("fcntl: F_GET_RW_HINT");
      		return 3;
      	}
      
      	printf("%s: hint %s\n", argv[1], str[hint]);
      	close(fd);
      	return 0;
      }
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c75b1d94
  14. 20 6月, 2017 1 次提交
  15. 09 5月, 2017 1 次提交
  16. 03 5月, 2017 1 次提交
    • J
      fs: don't set *REFERENCED on single use objects · 563f4001
      Josef Bacik 提交于
      By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
      inode we create.  This is problematic as this means that it takes two
      trips through the LRU for any of these objects to be reclaimed,
      regardless of their actual lifetime.  With enough pressure from these
      caches we can easily evict our working set from page cache with single
      use objects.  So instead only set *REFERENCED if we've already been
      added to the LRU list.  This means that we've been touched since the
      first time we were accessed, and so more likely to need to hang out in
      cache.
      
      To illustrate this issue I wrote the following scripts
      
      https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure
      
      on my test box.  It is a single socket 4 core CPU with 16gib of RAM and
      I tested on an Intel 2tib NVME drive.  The cache-pressure.sh script
      creates a new file system and creates 2 6.5gib files in order to take up
      13gib of the 16gib of ram with pagecache.  Then it runs a test program
      that reads these 2 files in a loop, and keeps track of how often it has
      to read bytes for each loop.  On an ideal system with no pressure we
      should have to read 0 bytes indefinitely.  The second thing this script
      does is start a fs_mark job that creates a ton of 0 length files,
      putting pressure on the system with slab only allocations.  On exit the
      script prints out how many bytes were read by the read-file program.
      The results are as follows
      
      Without patch:
      /mnt/btrfs-test/reads/file1: total read during loops 27262988288
      /mnt/btrfs-test/reads/file2: total read during loops 27262976000
      
      With patch:
      /mnt/btrfs-test/reads/file2: total read during loops 18640457728
      /mnt/btrfs-test/reads/file1: total read during loops 9565376512
      
      This patch results in a 50% reduction of the amount of pages evicted
      from our working set.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      563f4001
  17. 10 4月, 2017 2 次提交
    • J
      fsnotify: Free fsnotify_mark_connector when there is no mark attached · 08991e83
      Jan Kara 提交于
      Currently we free fsnotify_mark_connector structure only when inode /
      vfsmount is getting freed. This can however impose noticeable memory
      overhead when marks get attached to inodes only temporarily. So free the
      connector structure once the last mark is detached from the object.
      Since notification infrastructure can be working with the connector
      under the protection of fsnotify_mark_srcu, we have to be careful and
      free the fsnotify_mark_connector only after SRCU period passes.
      Reviewed-by: NMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      08991e83
    • J
      fsnotify: Move mark list head from object into dedicated structure · 9dd813c1
      Jan Kara 提交于
      Currently notification marks are attached to object (inode or vfsmnt) by
      a hlist_head in the object. The list is also protected by a spinlock in
      the object. So while there is any mark attached to the list of marks,
      the object must be pinned in memory (and thus e.g. last iput() deleting
      inode cannot happen). Also for list iteration in fsnotify() to work, we
      must hold fsnotify_mark_srcu lock so that mark itself and
      mark->obj_list.next cannot get freed. Thus we are required to wait for
      response to fanotify events from userspace process with
      fsnotify_mark_srcu lock held. That causes issues when userspace process
      is buggy and does not reply to some event - basically the whole
      notification subsystem gets eventually stuck.
      
      So to be able to drop fsnotify_mark_srcu lock while waiting for
      response, we have to pin the mark in memory and make sure it stays in
      the object list (as removing the mark waiting for response could lead to
      lost notification events for groups later in the list). However we don't
      want inode reclaim to block on such mark as that would lead to system
      just locking up elsewhere.
      
      This commit is the first in the series that paves way towards solving
      these conflicting lifetime needs. Instead of anchoring the list of marks
      directly in the object, we anchor it in a dedicated structure
      (fsnotify_mark_connector) and just point to that structure from the
      object. The following commits will also add spinlock protecting the list
      and object pointer to the structure.
      Reviewed-by: NMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      9dd813c1
  18. 08 10月, 2016 1 次提交
  19. 28 9月, 2016 2 次提交
    • D
      fs: Replace current_fs_time() with current_time() · c2050a45
      Deepa Dinamani 提交于
      current_fs_time() uses struct super_block* as an argument.
      As per Linus's suggestion, this is changed to take struct
      inode* as a parameter instead. This is because the function
      is primarily meant for vfs inode timestamps.
      Also the function was renamed as per Arnd's suggestion.
      
      Change all calls to current_fs_time() to use the new
      current_time() function instead. current_fs_time() will be
      deleted.
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c2050a45
    • D
      vfs: Add current_time() api · 3cd88666
      Deepa Dinamani 提交于
      current_fs_time() is used for inode timestamps.
      
      Change the signature of the function to take inode pointer
      instead of superblock as per Linus's suggestion.
      
      Also, move the api under vfs as per the discussion on the
      thread: https://lkml.org/lkml/2016/6/9/36 . As per Arnd's
      suggestion on the thread, changing the function name.
      
      current_fs_time() will be deleted after all the references
      to it are replaced by current_time().
      
      There was a bug reported by kbuild test bot with the change
      as some of the calls to current_time() were made before the
      super_block was initialized. Catch these accidental assignments
      as timespec_trunc() does for wrong granularities. This allows
      for the function to work right even in these circumstances.
      But, adds a warning to make the user aware of the bug.
      
      A coccinelle script was used to identify all the current
      .alloc_inode super_block callbacks that updated inode timestamps.
      proc filesystem was the only one that was modifying inode times
      as part of this callback. The series includes a patch to fix that.
      
      Note that timespec_trunc() will also be moved to fs/inode.c
      in a separate patch when this will need to be revamped for
      bounds checking purposes.
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3cd88666
  20. 16 9月, 2016 1 次提交
    • M
      vfs: update ovl inode before relatime check · 598e3c8f
      Miklos Szeredi 提交于
      On overlayfs relatime_need_update() needs inode times to be correct on
      overlay inode.  But i_mtime and i_ctime are updated by filesystem code on
      underlying inode only, so they will be out-of-date on the overlay inode.
      
      This patch copies the times from the underlying inode if needed.  This
      can't be done if called from RCU lookup (link following) but link m/ctime
      are not updated by fs, so this is all right.
      
      This patch doesn't change functionality for anything but overlayfs.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      598e3c8f
  21. 03 8月, 2016 3 次提交
  22. 27 7月, 2016 1 次提交
  23. 06 7月, 2016 1 次提交
    • E
      vfs: Don't modify inodes with a uid or gid unknown to the vfs · 0bd23d09
      Eric W. Biederman 提交于
      When a filesystem outside of init_user_ns is mounted it could have
      uids and gids stored in it that do not map to init_user_ns.
      
      The plan is to allow those filesystems to set i_uid to INVALID_UID and
      i_gid to INVALID_GID for unmapped uids and gids and then to handle
      that strange case in the vfs to ensure there is consistent robust
      handling of the weirdness.
      
      Upon a careful review of the vfs and filesystems about the only case
      where there is any possibility of confusion or trouble is when the
      inode is written back to disk.  In that case filesystems typically
      read the inode->i_uid and inode->i_gid and write them to disk even
      when just an inode timestamp is being updated.
      
      Which leads to a rule that is very simple to implement and understand
      inodes whose i_uid or i_gid is not valid may not be written.
      
      In dealing with access times this means treat those inodes as if the
      inode flag S_NOATIME was set.  Reads of the inodes appear safe and
      useful, but any write or modification is disallowed.  The only inode
      write that is allowed is a chown that sets the uid and gid on the
      inode to valid values.  After such a chown the inode is normal and may
      be treated as such.
      
      Denying all writes to inodes with uids or gids unknown to the vfs also
      prevents several oddball cases where corruption would have occurred
      because the vfs does not have complete information.
      
      One problem case that is prevented is attempting to use the gid of a
      directory for new inodes where the directories sgid bit is set but the
      directories gid is not mapped.
      
      Another problem case avoided is attempting to update the evm hash
      after setxattr, removexattr, and setattr.  As the evm hash includeds
      the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
      a correct evm hash from being computed.  evm hash verification also
      fails when i_uid or i_gid is unknown but that is essentially harmless
      as it does not cause filesystem corruption.
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      0bd23d09
  24. 04 7月, 2016 1 次提交
    • A
      iget_locked et.al.: make sure we don't return bad inodes · 2864f301
      Al Viro 提交于
      If one thread does iget_locked(), proceeds to try and set
      the new inode up and fails, inode will be unhashed and dropped.
      However, another thread doing ilookup/iget_locked in the middle
      of that would end up finding a half-set-up inode, grabbing
      a reference, waiting for it to come unlocked and getting the
      resulting bad inode.  It's a race (if that ilookup had been
      called just after the failure of setup attempt it wouldn't
      have found the sucker at all), particularly unpleasant in
      cases when failure is transient/caller-dependent/etc.
      
      While it can be dealt with in the callers, there's no reason
      not to handle it in fs/inode.c primitives, especially since
      the cost is trivial.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2864f301
  25. 03 5月, 2016 2 次提交
    • A
      parallel lookups: actual switch to rwsem · 9902af79
      Al Viro 提交于
      ta-da!
      
      The main issue is the lack of down_write_killable(), so the places
      like readdir.c switched to plain inode_lock(); once killable
      variants of rwsem primitives appear, that'll be dealt with.
      
      lockdep side also might need more work
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9902af79
    • A
      parallel lookups machinery, part 2 · 84e710da
      Al Viro 提交于
      We'll need to verify that there's neither a hashed nor in-lookup
      dentry with desired parent/name before adding to in-lookup set.
      
      One possible solution would be to hold the parent's ->d_lock through
      both checks, but while the in-lookup set is relatively small at any
      time, dcache is not.  And holding the parent's ->d_lock through
      something like __d_lookup_rcu() would suck too badly.
      
      So we leave the parent's ->d_lock alone, which means that we watch
      out for the following scenario:
      	* we verify that there's no hashed match
      	* existing in-lookup match gets hashed by another process
      	* we verify that there's no in-lookup matches and decide
      that everything's fine.
      
      Solution: per-directory kinda-sorta seqlock, bumped around the times
      we hash something that used to be in-lookup or move (and hash)
      something in place of in-lookup.  Then the above would turn into
      	* read the counter
      	* do dcache lookup
      	* if no matches found, check for in-lookup matches
      	* if there had been none of those either, check if the
      counter has changed; repeat if it has.
      
      The "kinda-sorta" part is due to the fact that we don't have much spare
      space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
      so the counter part is not a problem, but spinlock is a different story.
      
      We could use the parent's ->d_lock, and it would be less painful in
      terms of contention, for __d_add() it would be rather inconvenient to
      grab; we could do that (using lock_parent()), but...
      
      Fortunately, we can get serialization on the counter itself, and it
      might be a good idea in general; we can use cmpxchg() in a loop to
      get from even to odd and smp_store_release() from odd to even.
      
      This commit adds the counter and updating logics; the readers will be
      added in the next commit.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      84e710da
  26. 31 3月, 2016 1 次提交
    • A
      posix_acl: Inode acl caching fixes · b8a7a3a6
      Andreas Gruenbacher 提交于
      When get_acl() is called for an inode whose ACL is not cached yet, the
      get_acl inode operation is called to fetch the ACL from the filesystem.
      The inode operation is responsible for updating the cached acl with
      set_cached_acl().  This is done without locking at the VFS level, so
      another task can call set_cached_acl() or forget_cached_acl() before the
      get_acl inode operation gets to calling set_cached_acl(), and then
      get_acl's call to set_cached_acl() results in caching an outdate ACL.
      
      Prevent this from happening by setting the cached ACL pointer to a
      task-specific sentinel value before calling the get_acl inode operation.
      Move the responsibility for updating the cached ACL from the get_acl
      inode operations to get_acl().  There, only set the cached ACL if the
      sentinel value hasn't changed.
      
      The sentinel values are chosen to have odd values.  Likewise, the value
      of ACL_NOT_CACHED is odd.  In contrast, ACL object pointers always have
      an even value (ACLs are aligned in memory).  This allows to distinguish
      uncached ACLs values from ACL objects.
      
      In addition, switch from guarding inode->i_acl and inode->i_default_acl
      upates by the inode->i_lock spinlock to using xchg() and cmpxchg().
      
      Filesystems that do not want ACLs returned from their get_acl inode
      operations to be cached must call forget_cached_acl() to prevent the VFS
      from doing so.
      
      (Patch written by Al Viro and Andreas Gruenbacher.)
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b8a7a3a6
  27. 17 2月, 2016 1 次提交
  28. 23 1月, 2016 2 次提交
    • R
      dax: support dirty DAX entries in radix tree · f9fe48be
      Ross Zwisler 提交于
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9fe48be
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro 提交于
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  29. 15 1月, 2016 1 次提交
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov 提交于
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  30. 09 1月, 2016 1 次提交
  31. 09 12月, 2015 1 次提交
    • A
      don't put symlink bodies in pagecache into highmem · 21fc61c7
      Al Viro 提交于
      kmap() in page_follow_link_light() needed to go - allowing to hold
      an arbitrary number of kmaps for long is a great way to deadlocking
      the system.
      
      new helper (inode_nohighmem(inode)) needs to be used for pagecache
      symlinks inodes; done for all in-tree cases.  page_follow_link_light()
      instrumented to yell about anything missed.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      21fc61c7
  32. 11 11月, 2015 1 次提交