1. 08 Oct 2016, 1 commit
  2. 03 Aug 2016, 3 commits
  3. 27 Jul 2016, 1 commit
  4. 06 Jul 2016, 1 commit
    • vfs: Don't modify inodes with a uid or gid unknown to the vfs · 0bd23d09
      Authored by Eric W. Biederman
      When a filesystem outside of init_user_ns is mounted it could have
      uids and gids stored in it that do not map to init_user_ns.
      
      The plan is to allow those filesystems to set i_uid to INVALID_UID and
      i_gid to INVALID_GID for unmapped uids and gids and then to handle
      that strange case in the vfs to ensure there is consistent robust
      handling of the weirdness.
      
      Upon a careful review of the vfs and filesystems about the only case
      where there is any possibility of confusion or trouble is when the
      inode is written back to disk.  In that case filesystems typically
      read the inode->i_uid and inode->i_gid and write them to disk even
      when just an inode timestamp is being updated.
      
      That leads to a rule that is very simple to implement and understand:
      inodes whose i_uid or i_gid is not valid may not be written.
      
      In dealing with access times, this means treating those inodes as if the
      inode flag S_NOATIME were set.  Reads of the inodes appear safe and
      useful, but any write or modification is disallowed.  The only inode
      write that is allowed is a chown that sets the uid and gid on the
      inode to valid values.  After such a chown the inode is normal and may
      be treated as such.
      
      Denying all writes to inodes with uids or gids unknown to the vfs also
      prevents several oddball cases where corruption would have occurred
      because the vfs does not have complete information.
      
      One problem case that is prevented is attempting to use the gid of a
      directory for new inodes where the directory's sgid bit is set but the
      directory's gid is not mapped.
      
      Another problem case avoided is attempting to update the evm hash
      after setxattr, removexattr, and setattr.  As the evm hash includes
      the inode->i_uid and inode->i_gid, not knowing the uid or gid prevents
      a correct evm hash from being computed.  evm hash verification also
      fails when i_uid or i_gid is unknown, but that is essentially harmless
      as it does not cause filesystem corruption.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
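      A minimal sketch of the rule described above, assuming the existing
      uid_valid()/gid_valid() helpers; HAS_UNMAPPED_ID mirrors the helper this
      commit introduces, while may_write_to_inode() is a hypothetical caller
      for illustration only:

        static inline bool HAS_UNMAPPED_ID(struct inode *inode)
        {
                return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
        }

        /* hypothetical permission-check fragment: deny write-like access */
        static int may_write_to_inode(struct inode *inode, int mask)
        {
                if ((mask & MAY_WRITE) && HAS_UNMAPPED_ID(inode))
                        return -EACCES; /* only a chown to valid ids may proceed */
                return 0;
        }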
  5. 03 May 2016, 2 commits
    • parallel lookups: actual switch to rwsem · 9902af79
      Authored by Al Viro
      ta-da!
      
      The main issue is the lack of down_write_killable(), so the places
      like readdir.c switched to plain inode_lock(); once killable
      variants of rwsem primitives appear, that'll be dealt with.
      
      The lockdep side also might need more work.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
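      After the switch, the inode_lock() wrappers (introduced a few commits
      earlier, see below) map onto a rwsem instead of a mutex. A hedged sketch
      of the exclusive/shared pair, with the field named ->i_rwsem:

        static inline void inode_lock(struct inode *inode)
        {
                down_write(&inode->i_rwsem);    /* exclusive: create, unlink, rename */
        }

        static inline void inode_lock_shared(struct inode *inode)
        {
                down_read(&inode->i_rwsem);     /* shared: parallel ->lookup() */
        }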
    • parallel lookups machinery, part 2 · 84e710da
      Authored by Al Viro
      We'll need to verify that there's neither a hashed nor in-lookup
      dentry with the desired parent/name before adding to the in-lookup set.
      
      One possible solution would be to hold the parent's ->d_lock through
      both checks, but while the in-lookup set is relatively small at any
      time, dcache is not.  And holding the parent's ->d_lock through
      something like __d_lookup_rcu() would suck too badly.
      
      So we leave the parent's ->d_lock alone, which means that we watch
      out for the following scenario:
      	* we verify that there's no hashed match
      	* existing in-lookup match gets hashed by another process
      	* we verify that there are no in-lookup matches and decide
      that everything's fine.
      
      Solution: per-directory kinda-sorta seqlock, bumped around the times
      we hash something that used to be in-lookup or move (and hash)
      something in place of in-lookup.  Then the above would turn into
      	* read the counter
      	* do dcache lookup
      	* if no matches found, check for in-lookup matches
      	* if there had been none of those either, check if the
      counter has changed; repeat if it has.
      
      The "kinda-sorta" part is due to the fact that we don't have much spare
      space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
      so the counter part is not a problem, but spinlock is a different story.
      
      We could use the parent's ->d_lock, and it would be less painful in
      terms of contention, but for __d_add() it would be rather inconvenient
      to grab; we could do that (using lock_parent()), but...
      
      Fortunately, we can get serialization on the counter itself, and it
      might be a good idea in general; we can use cmpxchg() in a loop to
      get from even to odd and smp_store_release() from odd to even.
      
      This commit adds the counter and updating logics; the readers will be
      added in the next commit.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
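      A hedged sketch of the update side of the "kinda-sorta seqlock"
      described above (the counter landed as i_dir_seq; the reader side
      follows in the next commit):

        static unsigned start_dir_add(struct inode *dir)
        {
                for (;;) {
                        unsigned n = dir->i_dir_seq;
                        /* cmpxchg in a loop: even -> odd claims the "lock" */
                        if (!(n & 1) && cmpxchg(&dir->i_dir_seq, n, n + 1) == n)
                                return n;
                        cpu_relax();
                }
        }

        static void end_dir_add(struct inode *dir, unsigned n)
        {
                /* odd -> next even; the release orders the hashing before it */
                smp_store_release(&dir->i_dir_seq, n + 2);
        }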
  6. 31 Mar 2016, 1 commit
    • posix_acl: Inode acl caching fixes · b8a7a3a6
      Authored by Andreas Gruenbacher
      When get_acl() is called for an inode whose ACL is not cached yet, the
      get_acl inode operation is called to fetch the ACL from the filesystem.
      The inode operation is responsible for updating the cached acl with
      set_cached_acl().  This is done without locking at the VFS level, so
      another task can call set_cached_acl() or forget_cached_acl() before the
      get_acl inode operation gets to calling set_cached_acl(), and then
      get_acl's call to set_cached_acl() results in caching an outdated ACL.
      
      Prevent this from happening by setting the cached ACL pointer to a
      task-specific sentinel value before calling the get_acl inode operation.
      Move the responsibility for updating the cached ACL from the get_acl
      inode operations to get_acl().  There, only set the cached ACL if the
      sentinel value hasn't changed.
      
      The sentinel values are chosen to be odd, as is the value of
      ACL_NOT_CACHED.  In contrast, ACL object pointers always have an even
      value (ACLs are aligned in memory).  This makes it possible to
      distinguish uncached ACL values from ACL objects.
      
      In addition, switch from guarding inode->i_acl and inode->i_default_acl
      updates by the inode->i_lock spinlock to using xchg() and cmpxchg().
      
      Filesystems that do not want ACLs returned from their get_acl inode
      operations to be cached must call forget_cached_acl() to prevent the VFS
      from doing so.
      
      (Patch written by Al Viro and Andreas Gruenbacher.)
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
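      A hedged sketch of the sentinel trick: task_struct pointers are even, so
      adding 1 yields an odd, per-task value that can never collide with a real
      ACL pointer, and cmpxchg() detects whether anyone raced with the fetch.
      The get_acl() flow is heavily condensed here (reference counting, error
      paths and the cache re-check are elided):

        static void *uncached_acl_sentinel(struct task_struct *task)
        {
                return (void *)task + 1;        /* odd, unique to this task */
        }

        struct posix_acl *get_acl(struct inode *inode, int type)
        {
                struct posix_acl **p = acl_by_type(inode, type);
                void *sentinel = uncached_acl_sentinel(current);
                struct posix_acl *acl;

                /* mark "fetch in progress" with our task-specific odd value */
                cmpxchg(p, ACL_NOT_CACHED, sentinel);
                acl = inode->i_op->get_acl(inode, type);
                /* cache only if nobody called set/forget_cached_acl() meanwhile */
                cmpxchg(p, sentinel, acl);
                return acl;
        }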
  7. 17 Feb 2016, 1 commit
  8. 23 Jan 2016, 2 commits
    • dax: support dirty DAX entries in radix tree · f9fe48be
      Authored by Ross Zwisler
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
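      For reference, a hedged sketch of the pre-existing mechanism this builds
      on: exceptional entries are distinguished from page pointers by a low
      tag bit in the entry value, so a single test tells the two apart:

        /* from the radix-tree API of that era, hedged from memory */
        static inline int radix_tree_exceptional_entry(void *arg)
        {
                /* RADIX_TREE_EXCEPTIONAL_ENTRY is bit 1 of the entry value */
                return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
        }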
    • wrappers for ->i_mutex access · 5955102c
      Authored by Al Viro
      Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}:
      inode_foo(inode) is mutex_foo(&inode->i_mutex).
      
      Please use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become a rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
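      A hedged sketch of the wrappers as introduced, still mutex-based at
      this point (compare with the rwsem versions after the switch, above):

        static inline void inode_lock(struct inode *inode)
        {
                mutex_lock(&inode->i_mutex);
        }

        static inline void inode_unlock(struct inode *inode)
        {
                mutex_unlock(&inode->i_mutex);
        }

        static inline int inode_trylock(struct inode *inode)
        {
                return mutex_trylock(&inode->i_mutex);
        }

        static inline int inode_is_locked(struct inode *inode)
        {
                return mutex_is_locked(&inode->i_mutex);
        }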
  9. 15 Jan 2016, 1 commit
    • kmemcg: account certain kmem allocations to memcg · 5d097056
      Authored by Vladimir Davydov
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems override the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to the "account everything" approach
      and keep most workloads within bounds.  Malevolent users will be able
      to breach the limit, but this was possible even with the former
      "account everything" approach (simply because it did not, in fact,
      account everything).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
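      For the inode part, the change mostly amounts to adding SLAB_ACCOUNT
      when each filesystem creates its inode cache. A hedged sketch based on
      the generic inode cache (hash-table setup elided):

        void __init inode_init(void)
        {
                /* SLAB_ACCOUNT charges inode allocations to the allocating memcg */
                inode_cachep = kmem_cache_create("inode_cache",
                                                 sizeof(struct inode), 0,
                                                 (SLAB_RECLAIM_ACCOUNT | SLAB_PANIC |
                                                  SLAB_MEM_SPREAD | SLAB_ACCOUNT),
                                                 init_once);
        }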
  10. 09 Jan 2016, 1 commit
  11. 09 Dec 2015, 1 commit
    • don't put symlink bodies in pagecache into highmem · 21fc61c7
      Authored by Al Viro
      kmap() in page_follow_link_light() needed to go: allowing an arbitrary
      number of kmaps to be held for a long time is a great way to deadlock
      the system.
      
      The new helper, inode_nohighmem(inode), needs to be used for pagecache
      symlink inodes; this is done for all in-tree cases, and
      page_follow_link_light() is instrumented to yell about anything missed.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
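      A hedged sketch of the helper: it simply drops __GFP_HIGHMEM from the
      mapping's allocation mask, so symlink pagecache pages never need kmap():

        static inline void inode_nohighmem(struct inode *inode)
        {
                /* GFP_USER lacks __GFP_HIGHMEM, unlike the default GFP_HIGHUSER */
                mapping_set_gfp_mask(inode->i_mapping, GFP_USER);
        }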
  12. 11 Nov 2015, 1 commit
  13. 10 Nov 2015, 1 commit
  14. 19 Aug 2015, 1 commit
    • inode: don't softlockup when evicting inodes · ac05fbb4
      Authored by Josef Bacik
      On a box with a lot of RAM (148GB) I can make the box soft-lock up after
      running an fs_mark job that creates hundreds of millions of empty files.
      This is because we never generate enough memory pressure to keep the
      number of inodes on our unused list low, so when we go to unmount we
      have to evict ~100 million inodes.  This makes one processor a very
      unhappy person, so add a cond_resched() in dispose_list(), and if we
      need a resched while processing the s_inodes list, do that and run
      dispose_list() on what we've currently culled.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
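      A hedged sketch of the dispose_list() side of the fix; the added
      cond_resched() lets the CPU breathe between evictions:

        static void dispose_list(struct list_head *head)
        {
                while (!list_empty(head)) {
                        struct inode *inode;

                        inode = list_first_entry(head, struct inode, i_lru);
                        list_del_init(&inode->i_lru);

                        evict(inode);
                        cond_resched();         /* the new reschedule point */
                }
        }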
  15. 18 Aug 2015, 2 commits
  16. 01 Jul 2015, 1 commit
    • vfs: avoid creation of inode number 0 in get_next_ino · 2adc376c
      Authored by Carlos Maiolino
      Currently, get_next_ino() is able to create inodes with inode number 0.
      This has a bad impact on filesystems relying on this function to
      generate inode numbers.
      
      While there is no problem at all in having inodes with number 0,
      userspace tools which handle file management tasks can have problems
      handling these files; for example, users may find it impossible to
      delete them, since glibc will ignore them. So, I believe the best way
      is for the kernel to avoid creating them.
      
      This problem has been raised previously, but the old thread didn't have
      any update for over a year, and I've seen too many users hitting the
      same issue regarding the impossibility of deleting files while using
      filesystems relying on this function. So, I'm starting the thread
      again, with the same patch that I believe is enough to address this
      problem.
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
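      A hedged sketch of the fix (the per-CPU batching used on SMP is elided
      here); the point is simply to skip 0 on counter wraparound:

        unsigned int get_next_ino(void)
        {
                unsigned int *p = &get_cpu_var(last_ino);
                unsigned int res = *p;

                res++;
                /* get_next_ino should never hand out a 0 inode number */
                if (unlikely(!res))
                        res++;
                *p = res;
                put_cpu_var(last_ino);
                return res;
        }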
  17. 24 Jun 2015, 4 commits
  18. 02 Jun 2015, 1 commit
    • writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Authored by Tejun Heo
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
      
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
      
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
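      A hedged sketch of the lookup mentioned above: once inode_attach_wb()
      has associated an inode with its wb, retrieval is a plain dereference:

        static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
        {
                return inode->i_wb;     /* set once by inode_attach_wb() */
        }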
  19. 15 May 2015, 1 commit
  20. 11 May 2015, 1 commit
  21. 25 Apr 2015, 1 commit
    • direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Authored by Jens Axboe
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
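      A hedged sketch of how the flag is consumed; inode_dio_begin() wraps
      the atomic count, and dio_count_begin() is a hypothetical wrapper
      standing in for the conditional inside do_blockdev_direct_IO():

        static inline void inode_dio_begin(struct inode *inode)
        {
                atomic_inc(&inode->i_dio_count);
        }

        /* hypothetical; block devices pass DIO_SKIP_DIO_COUNT and skip this */
        static void dio_count_begin(struct dio *dio, struct inode *inode)
        {
                if (!(dio->flags & DIO_SKIP_DIO_COUNT))
                        inode_dio_begin(inode);
        }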
  22. 16 Apr 2015, 1 commit
  23. 13 Feb 2015, 2 commits
    • list_lru: add helpers to isolate items · 3f97b163
      Authored by Vladimir Davydov
      Currently, the isolate callback passed to the list_lru_walk family of
      functions is supposed to just delete an item from the list upon returning
      LRU_REMOVED or LRU_REMOVED_RETRY, while the nr_items counter is fixed up by
      __list_lru_walk_one after the callback returns.  Since the callback is
      allowed to drop the lock after removing an item (it has to return
      LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
      of elements on the list even if we check them under the lock.  This makes
      it difficult to move items from one list_lru_one to another, which is
      required for per-memcg list_lru reparenting - we can't just splice the
      lists, we have to move entries one by one.
      
      This patch therefore introduces helpers that must be used by callback
      functions to isolate items instead of raw list_del/list_move.  These are
      list_lru_isolate and list_lru_isolate_move.  They not only remove the
      entry from the list, but also fix the nr_items counter, making sure
      nr_items always reflects the actual number of elements on the list if
      checked under the appropriate lock.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
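      A hedged sketch of the two helpers: both adjust nr_items at the moment
      of removal, under the list lock, so the counter always matches the list:

        void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
        {
                list_del_init(item);
                list->nr_items--;
        }

        void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item,
                                   struct list_head *head)
        {
                list_move(item, head);
                list->nr_items--;
        }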
    • list_lru: introduce list_lru_shrink_{count,walk} · 503c358c
      Authored by Vladimir Davydov
      Kmem accounting of memcg is unusable now, because it lacks slab shrinker
      support.  That means when we hit the limit we will get ENOMEM w/o any
      chance to recover.  What we should do then is to call shrink_slab, which
      would reclaim old inode/dentry caches from this cgroup.  This is what
      this patch set is intended to do.
      
      Basically, it does two things.  First, it introduces the notion of
      per-memcg slab shrinker.  A shrinker that wants to reclaim objects per
      cgroup should mark itself as SHRINKER_MEMCG_AWARE.  Then it will be
      passed the memory cgroup to scan from in shrink_control->memcg.  For
      such shrinkers shrink_slab iterates over the whole cgroup subtree under
      the target cgroup and calls the shrinker for each kmem-active memory
      cgroup.
      
      Secondly, this patch set makes the list_lru structure per-memcg.  It's
      done transparently to list_lru users - everything they have to do is to
      tell list_lru_init that they want memcg-aware list_lru.  Then the
      list_lru will automatically distribute objects among per-memcg lists
      based on which cgroup the object is accounted to.  This way, to make FS
      shrinkers (icache, dcache) memcg-aware, we only need to make them use a
      memcg-aware list_lru, and this is what this patch set does.
      
      As before, this patch set only enables per-memcg kmem reclaim when the
      pressure goes from memory.limit, not from memory.kmem.limit.  Handling
      memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
      it is still unclear whether we will have this knob in the unified
      hierarchy.
      
      This patch (of 9):
      
      NUMA aware slab shrinkers use the list_lru structure to distribute
      objects coming from different NUMA nodes to different lists.  Whenever
      such a shrinker needs to count or scan objects from a particular node,
      it issues commands like this:
      
              count = list_lru_count_node(lru, sc->nid);
              freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                         isolate_arg, &sc->nr_to_scan);
      
      where sc is an instance of the shrink_control structure passed to it
      from vmscan.
      
      To simplify this, let's add special list_lru functions to be used by
      shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
      consolidate the nid and nr_to_scan arguments in the shrink_control
      structure.
      
      This will also allow us to avoid patching shrinkers that use list_lru
      when we make shrink_slab() per-memcg - all we will have to do is extend
      the shrink_control structure to include the target memcg and make
      list_lru_shrink_{count,walk} handle this appropriately.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Suggested-by: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
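      A hedged sketch of the new wrappers: they merely pull nid and nr_to_scan
      out of shrink_control, which is what later lets the target memcg ride
      along in the same structure without touching every shrinker:

        static inline unsigned long
        list_lru_shrink_count(struct list_lru *lru, struct shrink_control *sc)
        {
                return list_lru_count_node(lru, sc->nid);
        }

        static inline unsigned long
        list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
                             list_lru_walk_cb isolate, void *cb_arg)
        {
                return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
                                          &sc->nr_to_scan);
        }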
  24. 11 Feb 2015, 1 commit
  25. 05 Feb 2015, 2 commits
    • vfs: add find_inode_nowait() function · fe032c42
      Authored by Theodore Ts'o
      Add a new function find_inode_nowait() which is an even more general
      version of ilookup5_nowait().  It is designed for callers which need
      very fine-grained control over when the function is allowed to block
      or increment the inode's reference count.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
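      The signature, hedged from memory; the match callback runs with the
      inode hash lock held and decides for itself whether to grab a reference:

        struct inode *find_inode_nowait(struct super_block *sb,
                                        unsigned long hashval,
                                        int (*match)(struct inode *inode,
                                                     unsigned long hashval,
                                                     void *data),
                                        void *data);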
    • vfs: add support for a lazytime mount option · 0ae45f63
      Authored by Theodore Ts'o
      Add a new mount option which enables a new "lazytime" mode.  This mode
      causes atime, mtime, and ctime updates to only be made to the
      in-memory version of the inode.  The on-disk times will only get
      updated when (a) the inode needs to be updated for some non-time-related
      change, (b) userspace calls fsync(), syncfs() or sync(), or
      (c) just before an undeleted inode is evicted from memory.
      
      This is OK according to POSIX because there are no guarantees after a
      crash unless userspace explicitly requests them via an fsync(2) call.
      
      For workloads which feature a large number of random writes to a
      preallocated file, the lazytime mount option significantly reduces
      writes to the inode table.  The repeated 4k writes to a single block
      will result in undesirable stress on flash devices and SMR disk
      drives.  Even on conventional HDDs, the repeated writes to the inode
      table block will trigger Adjacent Track Interference (ATI) remediation
      latencies, which very negatively impact long-tail latencies; that is a
      very big deal for web serving tiers (for example).
      
      Google-Bug-Id: 18297052
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
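      A hedged sketch of generic_update_time() under lazytime: pure timestamp
      updates dirty the inode with only I_DIRTY_TIME, so no inode-table
      writeback is queued for them:

        static int generic_update_time(struct inode *inode, struct timespec *time,
                                       int flags)
        {
                int iflags = I_DIRTY_TIME;

                if (flags & S_ATIME)
                        inode->i_atime = *time;
                if (flags & S_VERSION)
                        inode_inc_iversion(inode);
                if (flags & S_CTIME)
                        inode->i_ctime = *time;
                if (flags & S_MTIME)
                        inode->i_mtime = *time;

                /* without lazytime (or on version bumps), keep the old behaviour */
                if (!(inode->i_sb->s_flags & MS_LAZYTIME) || (flags & S_VERSION))
                        iflags |= I_DIRTY_SYNC;
                __mark_inode_dirty(inode, iflags);
                return 0;
        }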
  26. 21 Jan 2015, 1 commit
  27. 17 Jan 2015, 1 commit
    • locks: add a new struct file_locking_context pointer to struct inode · 4a075e39
      Authored by Jeff Layton
      The current scheme of using the i_flock list is really difficult to
      manage. There is also a legitimate desire for a per-inode spinlock to
      manage these lists that isn't the i_lock.
      
      Start conversion to a new scheme to eventually replace the old i_flock
      list with a new "file_lock_context" object.
      
      We start by adding a new i_flctx field to struct inode. For now, it
      lives in parallel with the i_flock list, but will eventually replace
      it. The idea is to allocate a structure to sit in that pointer and act
      as a locus for all things file locking.
      
      We allocate a file_lock_context for an inode when the first lock is
      added to it, and it's only freed when the inode is freed. We use the
      i_lock to protect the assignment, but afterward it should mostly be
      accessed locklessly.
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
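      A hedged sketch of the new object: roughly, it splits the old single
      i_flock list into per-type lists (the series also adds a dedicated
      spinlock for these lists as it progresses):

        struct file_lock_context {
                struct list_head        flc_flock;      /* BSD-style flock locks */
                struct list_head        flc_posix;      /* POSIX byte-range locks */
                struct list_head        flc_lease;      /* leases */
        };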
  28. 14 Dec 2014, 1 commit
  29. 11 Dec 2014, 1 commit
    • make default ->i_fop have ->open() fail with ENXIO · bd9b51e7
      Authored by Al Viro
      As it is, default ->i_fop has NULL ->open() (along with all other methods).
      The only case where it matters is reopening (via procfs symlink) a file that
      didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned
      to something sane (default would fail on read/write/ioctl/etc.).
      
      	Unfortunately, such a case exists - alloc_file() users, especially
      anon_inode_getfile() ones.  There we have tons of opened files of very
      different kinds sharing the same inode.  As a result, attempts to reopen
      those via procfs succeed and you get a descriptor you can't do anything with.
      
      	Moreover, in case of sockets we set ->i_fop that will only be used
      on such reopen attempts - and put a failing ->open() into it to make sure
      those do not succeed.
      
      	It would be simpler to put such ->open() into default ->i_fop and leave
      it unchanged both for anon inode (as we do anyway) and for socket ones.  Result:
      	* everything going through do_dentry_open() works as it used to
      	* sock_no_open() kludge is gone
      	* attempts to reopen anon-inode files fail as they really ought to
      	* ditto for aio_private_file()
      	* ditto for perfmon - this one actually tried to imitate the sock_no_open()
      trick, but failed to set ->i_fop, so in the current tree reopens succeed and
      yield a completely useless descriptor.  Intent clearly had been to fail with
      -ENXIO on such reopens; now it actually does.
      	* everything else that used alloc_file() keeps working - it has ->i_fop
      set for its inodes anyway
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
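      A hedged sketch of the change in fs/inode.c: the default ->i_fop gains
      an ->open() that fails, so stray reopens via procfs get -ENXIO:

        static int no_open(struct inode *inode, struct file *file)
        {
                return -ENXIO;
        }

        const struct file_operations no_open_fops = {
                .open = no_open,
        };

        /* in inode_init_always(): inode->i_fop = &no_open_fops; */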
  30. 10 Nov 2014, 1 commit