1. 02 9月, 2017 1 次提交
    • D
      xfs: evict all inodes involved with log redo item · 799ea9e9
      Darrick J. Wong 提交于
      When we introduced the bmap redo log items, we set MS_ACTIVE on the
      mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
      from being truncated prematurely during log recovery.  This also had the
      effect of putting linked inodes on the lru instead of evicting them.
      
      Unfortunately, we neglected to find all those unreferenced lru inodes
      and evict them after finishing log recovery, which means that we leak
      them if anything goes wrong in the rest of xfs_mountfs, because the lru
      is only cleaned out on unmount.
      
      Therefore, evict unreferenced inodes in the lru list immediately
      after clearing MS_ACTIVE.
      
      Fixes: 17c12bcd ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Cc: viro@ZenIV.linux.org.uk
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      799ea9e9
  2. 07 7月, 2017 1 次提交
    • P
      mm: update callers to use HASH_ZERO flag · 3d375d78
      Pavel Tatashin 提交于
      Update dcache, inode, pid, mountpoint, and mount hash tables to use
      HASH_ZERO, and remove initialization after allocations.  In case of
      places where HASH_EARLY was used such as in __pv_init_lock_hash the
      zeroed hash table was already assumed, because memblock zeroes the
      memory.
      
      CPU: SPARC M6, Memory: 7T
      Before fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 11.798s
      
      After fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 3.198s
      
      CPU: Intel Xeon E5-2630, Memory: 2.2T:
      Before fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.245s
      
      After fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.244s
      
      Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NBabu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d375d78
  3. 30 6月, 2017 1 次提交
  4. 28 6月, 2017 1 次提交
    • J
      fs: add fcntl() interface for setting/getting write life time hints · c75b1d94
      Jens Axboe 提交于
      Define a set of write life time hints:
      
      RWH_WRITE_LIFE_NOT_SET	No hint information set
      RWH_WRITE_LIFE_NONE	No hints about write life time
      RWH_WRITE_LIFE_SHORT	Data written has a short life time
      RWH_WRITE_LIFE_MEDIUM	Data written has a medium life time
      RWH_WRITE_LIFE_LONG	Data written has a long life time
      RWH_WRITE_LIFE_EXTREME	Data written has an extremely long life time
      
      The intent is for these values to be relative to each other, no
      absolute meaning should be attached to these flag names.
      
      Add an fcntl interface for querying these flags, and also for
      setting them as well:
      
      F_GET_RW_HINT		Returns the read/write hint set on the
      			underlying inode.
      
      F_SET_RW_HINT		Set one of the above write hints on the
      			underlying inode.
      
      F_GET_FILE_RW_HINT	Returns the read/write hint set on the
      			file descriptor.
      
      F_SET_FILE_RW_HINT	Set one of the above write hints on the
      			file descriptor.
      
      The user passes in a 64-bit pointer to get/set these values, and
      the interface returns 0/-1 on success/error.
      
      Sample program testing/implementing basic setting/getting of write
      hints is below.
      
      Add support for storing the write life time hint in the inode flags
      and in struct file as well, and pass them to the kiocb flags. If
      both a file and its corresponding inode has a write hint, then we
      use the one in the file, if available. The file hint can be used
      for sync/direct IO, for buffered writeback only the inode hint
      is available.
      
      This is in preparation for utilizing these hints in the block layer,
      to guide on-media data placement.
      
      /*
       * writehint.c: get or set an inode write hint
       */
       #include <stdio.h>
       #include <fcntl.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <stdbool.h>
       #include <inttypes.h>
      
       #ifndef F_GET_RW_HINT
       #define F_LINUX_SPECIFIC_BASE	1024
       #define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
       #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
       #endif
      
      static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
      			"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
      			"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };
      
      int main(int argc, char *argv[])
      {
      	uint64_t hint;
      	int fd, ret;
      
      	if (argc < 2) {
      		fprintf(stderr, "%s: file <hint>\n", argv[0]);
      		return 1;
      	}
      
      	fd = open(argv[1], O_RDONLY);
      	if (fd < 0) {
      		perror("open");
      		return 2;
      	}
      
      	if (argc > 2) {
      		hint = atoi(argv[2]);
      		ret = fcntl(fd, F_SET_RW_HINT, &hint);
      		if (ret < 0) {
      			perror("fcntl: F_SET_RW_HINT");
      			return 4;
      		}
      	}
      
      	ret = fcntl(fd, F_GET_RW_HINT, &hint);
      	if (ret < 0) {
      		perror("fcntl: F_GET_RW_HINT");
      		return 3;
      	}
      
      	printf("%s: hint %s\n", argv[1], str[hint]);
      	close(fd);
      	return 0;
      }
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c75b1d94
  5. 20 6月, 2017 1 次提交
  6. 09 5月, 2017 1 次提交
  7. 03 5月, 2017 1 次提交
    • J
      fs: don't set *REFERENCED on single use objects · 563f4001
      Josef Bacik 提交于
      By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
      inode we create.  This is problematic as this means that it takes two
      trips through the LRU for any of these objects to be reclaimed,
      regardless of their actual lifetime.  With enough pressure from these
      caches we can easily evict our working set from page cache with single
      use objects.  So instead only set *REFERENCED if we've already been
      added to the LRU list.  This means that we've been touched since the
      first time we were accessed, and so more likely to need to hang out in
      cache.
      
      To illustrate this issue I wrote the following scripts
      
      https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure
      
      on my test box.  It is a single socket 4 core CPU with 16gib of RAM and
      I tested on an Intel 2tib NVME drive.  The cache-pressure.sh script
      creates a new file system and creates 2 6.5gib files in order to take up
      13gib of the 16gib of ram with pagecache.  Then it runs a test program
      that reads these 2 files in a loop, and keeps track of how often it has
      to read bytes for each loop.  On an ideal system with no pressure we
      should have to read 0 bytes indefinitely.  The second thing this script
      does is start a fs_mark job that creates a ton of 0 length files,
      putting pressure on the system with slab only allocations.  On exit the
      script prints out how many bytes were read by the read-file program.
      The results are as follows
      
      Without patch:
      /mnt/btrfs-test/reads/file1: total read during loops 27262988288
      /mnt/btrfs-test/reads/file2: total read during loops 27262976000
      
      With patch:
      /mnt/btrfs-test/reads/file2: total read during loops 18640457728
      /mnt/btrfs-test/reads/file1: total read during loops 9565376512
      
      This patch results in a 50% reduction of the amount of pages evicted
      from our working set.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      563f4001
  8. 10 4月, 2017 2 次提交
    • J
      fsnotify: Free fsnotify_mark_connector when there is no mark attached · 08991e83
      Jan Kara 提交于
      Currently we free fsnotify_mark_connector structure only when inode /
      vfsmount is getting freed. This can however impose noticeable memory
      overhead when marks get attached to inodes only temporarily. So free the
      connector structure once the last mark is detached from the object.
      Since notification infrastructure can be working with the connector
      under the protection of fsnotify_mark_srcu, we have to be careful and
      free the fsnotify_mark_connector only after SRCU period passes.
      Reviewed-by: NMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      08991e83
    • J
      fsnotify: Move mark list head from object into dedicated structure · 9dd813c1
      Jan Kara 提交于
      Currently notification marks are attached to object (inode or vfsmnt) by
      a hlist_head in the object. The list is also protected by a spinlock in
      the object. So while there is any mark attached to the list of marks,
      the object must be pinned in memory (and thus e.g. last iput() deleting
      inode cannot happen). Also for list iteration in fsnotify() to work, we
      must hold fsnotify_mark_srcu lock so that mark itself and
      mark->obj_list.next cannot get freed. Thus we are required to wait for
      response to fanotify events from userspace process with
      fsnotify_mark_srcu lock held. That causes issues when userspace process
      is buggy and does not reply to some event - basically the whole
      notification subsystem gets eventually stuck.
      
      So to be able to drop fsnotify_mark_srcu lock while waiting for
      response, we have to pin the mark in memory and make sure it stays in
      the object list (as removing the mark waiting for response could lead to
      lost notification events for groups later in the list). However we don't
      want inode reclaim to block on such mark as that would lead to system
      just locking up elsewhere.
      
      This commit is the first in the series that paves way towards solving
      these conflicting lifetime needs. Instead of anchoring the list of marks
      directly in the object, we anchor it in a dedicated structure
      (fsnotify_mark_connector) and just point to that structure from the
      object. The following commits will also add spinlock protecting the list
      and object pointer to the structure.
      Reviewed-by: NMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      9dd813c1
  9. 08 10月, 2016 1 次提交
  10. 28 9月, 2016 2 次提交
    • D
      fs: Replace current_fs_time() with current_time() · c2050a45
      Deepa Dinamani 提交于
      current_fs_time() uses struct super_block* as an argument.
      As per Linus's suggestion, this is changed to take struct
      inode* as a parameter instead. This is because the function
      is primarily meant for vfs inode timestamps.
      Also the function was renamed as per Arnd's suggestion.
      
      Change all calls to current_fs_time() to use the new
      current_time() function instead. current_fs_time() will be
      deleted.
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c2050a45
    • D
      vfs: Add current_time() api · 3cd88666
      Deepa Dinamani 提交于
      current_fs_time() is used for inode timestamps.
      
      Change the signature of the function to take inode pointer
      instead of superblock as per Linus's suggestion.
      
      Also, move the api under vfs as per the discussion on the
      thread: https://lkml.org/lkml/2016/6/9/36 . As per Arnd's
      suggestion on the thread, changing the function name.
      
      current_fs_time() will be deleted after all the references
      to it are replaced by current_time().
      
      There was a bug reported by kbuild test bot with the change
      as some of the calls to current_time() were made before the
      super_block was initialized. Catch these accidental assignments
      as timespec_trunc() does for wrong granularities. This allows
      for the function to work right even in these circumstances.
      But, adds a warning to make the user aware of the bug.
      
      A coccinelle script was used to identify all the current
      .alloc_inode super_block callbacks that updated inode timestamps.
      proc filesystem was the only one that was modifying inode times
      as part of this callback. The series includes a patch to fix that.
      
      Note that timespec_trunc() will also be moved to fs/inode.c
      in a separate patch when this will need to be revamped for
      bounds checking purposes.
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3cd88666
  11. 16 9月, 2016 1 次提交
    • M
      vfs: update ovl inode before relatime check · 598e3c8f
      Miklos Szeredi 提交于
      On overlayfs relatime_need_update() needs inode times to be correct on
      overlay inode.  But i_mtime and i_ctime are updated by filesystem code on
      underlying inode only, so they will be out-of-date on the overlay inode.
      
      This patch copies the times from the underlying inode if needed.  This
      can't be done if called from RCU lookup (link following) but link m/ctime
      are not updated by fs, so this is all right.
      
      This patch doesn't change functionality for anything but overlayfs.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      598e3c8f
  12. 03 8月, 2016 3 次提交
  13. 27 7月, 2016 1 次提交
  14. 06 7月, 2016 1 次提交
    • E
      vfs: Don't modify inodes with a uid or gid unknown to the vfs · 0bd23d09
      Eric W. Biederman 提交于
      When a filesystem outside of init_user_ns is mounted it could have
      uids and gids stored in it that do not map to init_user_ns.
      
      The plan is to allow those filesystems to set i_uid to INVALID_UID and
      i_gid to INVALID_GID for unmapped uids and gids and then to handle
      that strange case in the vfs to ensure there is consistent robust
      handling of the weirdness.
      
      Upon a careful review of the vfs and filesystems about the only case
      where there is any possibility of confusion or trouble is when the
      inode is written back to disk.  In that case filesystems typically
      read the inode->i_uid and inode->i_gid and write them to disk even
      when just an inode timestamp is being updated.
      
      Which leads to a rule that is very simple to implement and understand
      inodes whose i_uid or i_gid is not valid may not be written.
      
      In dealing with access times this means treat those inodes as if the
      inode flag S_NOATIME was set.  Reads of the inodes appear safe and
      useful, but any write or modification is disallowed.  The only inode
      write that is allowed is a chown that sets the uid and gid on the
      inode to valid values.  After such a chown the inode is normal and may
      be treated as such.
      
      Denying all writes to inodes with uids or gids unknown to the vfs also
      prevents several oddball cases where corruption would have occurred
      because the vfs does not have complete information.
      
      One problem case that is prevented is attempting to use the gid of a
      directory for new inodes where the directories sgid bit is set but the
      directories gid is not mapped.
      
      Another problem case avoided is attempting to update the evm hash
      after setxattr, removexattr, and setattr.  As the evm hash includeds
      the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
      a correct evm hash from being computed.  evm hash verification also
      fails when i_uid or i_gid is unknown but that is essentially harmless
      as it does not cause filesystem corruption.
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      0bd23d09
  15. 04 7月, 2016 1 次提交
    • A
      iget_locked et.al.: make sure we don't return bad inodes · 2864f301
      Al Viro 提交于
      If one thread does iget_locked(), proceeds to try and set
      the new inode up and fails, inode will be unhashed and dropped.
      However, another thread doing ilookup/iget_locked in the middle
      of that would end up finding a half-set-up inode, grabbing
      a reference, waiting for it to come unlocked and getting the
      resulting bad inode.  It's a race (if that ilookup had been
      called just after the failure of setup attempt it wouldn't
      have found the sucker at all), particularly unpleasant in
      cases when failure is transient/caller-dependent/etc.
      
      While it can be dealt with in the callers, there's no reason
      not to handle it in fs/inode.c primitives, especially since
      the cost is trivial.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2864f301
  16. 03 5月, 2016 2 次提交
    • A
      parallel lookups: actual switch to rwsem · 9902af79
      Al Viro 提交于
      ta-da!
      
      The main issue is the lack of down_write_killable(), so the places
      like readdir.c switched to plain inode_lock(); once killable
      variants of rwsem primitives appear, that'll be dealt with.
      
      lockdep side also might need more work
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9902af79
    • A
      parallel lookups machinery, part 2 · 84e710da
      Al Viro 提交于
      We'll need to verify that there's neither a hashed nor in-lookup
      dentry with desired parent/name before adding to in-lookup set.
      
      One possible solution would be to hold the parent's ->d_lock through
      both checks, but while the in-lookup set is relatively small at any
      time, dcache is not.  And holding the parent's ->d_lock through
      something like __d_lookup_rcu() would suck too badly.
      
      So we leave the parent's ->d_lock alone, which means that we watch
      out for the following scenario:
      	* we verify that there's no hashed match
      	* existing in-lookup match gets hashed by another process
      	* we verify that there's no in-lookup matches and decide
      that everything's fine.
      
      Solution: per-directory kinda-sorta seqlock, bumped around the times
      we hash something that used to be in-lookup or move (and hash)
      something in place of in-lookup.  Then the above would turn into
      	* read the counter
      	* do dcache lookup
      	* if no matches found, check for in-lookup matches
      	* if there had been none of those either, check if the
      counter has changed; repeat if it has.
      
      The "kinda-sorta" part is due to the fact that we don't have much spare
      space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
      so the counter part is not a problem, but spinlock is a different story.
      
      We could use the parent's ->d_lock, and it would be less painful in
      terms of contention, for __d_add() it would be rather inconvenient to
      grab; we could do that (using lock_parent()), but...
      
      Fortunately, we can get serialization on the counter itself, and it
      might be a good idea in general; we can use cmpxchg() in a loop to
      get from even to odd and smp_store_release() from odd to even.
      
      This commit adds the counter and updating logics; the readers will be
      added in the next commit.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      84e710da
  17. 31 3月, 2016 1 次提交
    • A
      posix_acl: Inode acl caching fixes · b8a7a3a6
      Andreas Gruenbacher 提交于
      When get_acl() is called for an inode whose ACL is not cached yet, the
      get_acl inode operation is called to fetch the ACL from the filesystem.
      The inode operation is responsible for updating the cached acl with
      set_cached_acl().  This is done without locking at the VFS level, so
      another task can call set_cached_acl() or forget_cached_acl() before the
      get_acl inode operation gets to calling set_cached_acl(), and then
      get_acl's call to set_cached_acl() results in caching an outdate ACL.
      
      Prevent this from happening by setting the cached ACL pointer to a
      task-specific sentinel value before calling the get_acl inode operation.
      Move the responsibility for updating the cached ACL from the get_acl
      inode operations to get_acl().  There, only set the cached ACL if the
      sentinel value hasn't changed.
      
      The sentinel values are chosen to have odd values.  Likewise, the value
      of ACL_NOT_CACHED is odd.  In contrast, ACL object pointers always have
      an even value (ACLs are aligned in memory).  This allows to distinguish
      uncached ACLs values from ACL objects.
      
      In addition, switch from guarding inode->i_acl and inode->i_default_acl
      upates by the inode->i_lock spinlock to using xchg() and cmpxchg().
      
      Filesystems that do not want ACLs returned from their get_acl inode
      operations to be cached must call forget_cached_acl() to prevent the VFS
      from doing so.
      
      (Patch written by Al Viro and Andreas Gruenbacher.)
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b8a7a3a6
  18. 17 2月, 2016 1 次提交
  19. 23 1月, 2016 2 次提交
    • R
      dax: support dirty DAX entries in radix tree · f9fe48be
      Ross Zwisler 提交于
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9fe48be
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro 提交于
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  20. 15 1月, 2016 1 次提交
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov 提交于
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  21. 09 1月, 2016 1 次提交
  22. 09 12月, 2015 1 次提交
    • A
      don't put symlink bodies in pagecache into highmem · 21fc61c7
      Al Viro 提交于
      kmap() in page_follow_link_light() needed to go - allowing to hold
      an arbitrary number of kmaps for long is a great way to deadlocking
      the system.
      
      new helper (inode_nohighmem(inode)) needs to be used for pagecache
      symlinks inodes; done for all in-tree cases.  page_follow_link_light()
      instrumented to yell about anything missed.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      21fc61c7
  23. 11 11月, 2015 1 次提交
  24. 10 11月, 2015 1 次提交
  25. 19 8月, 2015 1 次提交
    • J
      inode: don't softlockup when evicting inodes · ac05fbb4
      Josef Bacik 提交于
      On a box with a lot of ram (148gb) I can make the box softlockup after running
      an fs_mark job that creates hundreds of millions of empty files.  This is
      because we never generate enough memory pressure to keep the number of inodes on
      our unused list low, so when we go to unmount we have to evict ~100 million
      inodes.  This makes one processor a very unhappy person, so add a cond_resched()
      in dispose_list() and if we need a resched when processing the s_inodes list do
      that and run dispose_list() on what we've currently culled.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      ac05fbb4
  26. 18 8月, 2015 2 次提交
  27. 01 7月, 2015 1 次提交
    • C
      vfs: avoid creation of inode number 0 in get_next_ino · 2adc376c
      Carlos Maiolino 提交于
      currently, get_next_ino() is able to create inodes with inode number = 0.
      This have a bad impact in the filesystems relying in this function to generate
      inode numbers.
      
      While there is no problem at all in having inodes with number 0, userspace tools
      which handle file management tasks can have problems handling these files, like
      for example, the impossiblity of users to delete these files, since glibc will
      ignore them. So, I believe the best way is kernel to avoid creating them.
      
      This problem has been raised previously, but the old thread didn't have any
      other update for a year+, and I've seen too many users hitting the same issue
      regarding the impossibility to delete files while using filesystems relying on
      this function. So, I'm starting the thread again, with the same patch
      that I believe is enough to address this problem.
      Signed-off-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2adc376c
  28. 24 6月, 2015 4 次提交
  29. 02 6月, 2015 1 次提交
    • T
      writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Tejun Heo 提交于
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
      
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
      
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      52ebea74
  30. 15 5月, 2015 1 次提交