1. 17 5月, 2016 1 次提交
    • T
      ext2: Add alignment check for DAX mount · 284854be
      Toshi Kani 提交于
      When a partition is not aligned by 4KB, mount -o dax succeeds,
      but any read/write access to the filesystem fails, except for
      metadata update.
      
      Call bdev_dax_supported() to perform proper precondition checks
      which includes this partition alignment check.
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      284854be
  2. 23 2月, 2016 1 次提交
    • J
      ext2: convert to mbcache2 · be0726d3
      Jan Kara 提交于
      The conversion is generally straightforward. We convert filesystem from
      a global cache to per-fs one. Similarly to ext4 the tricky part is that
      xattr block corresponding to found mbcache entry can get freed before we
      get buffer lock for that block. So we have to check whether the entry is
      still valid after getting the buffer lock.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      be0726d3
  3. 15 1月, 2016 1 次提交
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov 提交于
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  4. 17 11月, 2015 1 次提交
    • D
      ext2, ext4: warn when mounting with dax enabled · ef83b6e8
      Dan Williams 提交于
      Similar to XFS warn when mounting DAX while it is still considered under
      development.  Also, aspects of the DAX implementation, for example
      synchronization against multiple faults and faults causing block
      allocation, depend on the correct implementation in the filesystem.  The
      maturity of a given DAX implementation is filesystem specific.
      
      Cc: <stable@vger.kernel.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: linux-ext4@vger.kernel.org
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Acked-by: NJan Kara <jack@suse.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      ef83b6e8
  5. 19 10月, 2015 1 次提交
    • R
      ext2: Add locking for DAX faults · 5726b27b
      Ross Zwisler 提交于
      Add locking to ensure that DAX faults are isolated from ext2 operations
      that modify the data blocks allocation for an inode.  This is intended to
      be analogous to the work being done in XFS by Dave Chinner:
      
      http://www.spinics.net/lists/linux-fsdevel/msg90260.html
      
      Compared with XFS the ext2 case is greatly simplified by the fact that ext2
      already allocates and zeros new blocks before they are returned as part of
      ext2_get_block(), so DAX doesn't need to worry about getting unmapped or
      unwritten buffer heads.
      
      This means that the only work we need to do in ext2 is to isolate the DAX
      faults from inode block allocation changes.  I believe this just means that
      we need to isolate the DAX faults from truncate operations.
      
      The newly introduced dax_sem is intended to replicate the protection
      offered by i_mmaplock in XFS.  In addition to truncate the i_mmaplock also
      protects XFS operations like hole punching, fallocate down, extent
      manipulation IOCTLS like xfs_ioc_space() and extent swapping.  Truncate is
      the only one of these operations supported by ext2.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NJan Kara <jack@suse.com>
      5726b27b
  6. 18 6月, 2015 1 次提交
    • T
      vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB · 46b15caa
      Tejun Heo 提交于
      FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
      cgroup writeback; however, different super_blocks of the same
      file_system_type may or may not support cgroup writeback depending on
      filesystem options.  This patch replaces FS_CGROUP_WRITEBACK with a
      per-super_block flag.
      
      super_block->s_flags carries some internal flags in the high bits but
      it's exposd to userland through uapi header and running out of space
      anyway.  This patch adds a new field super_block->s_iflags to carry
      kernel-internal flags.  It is currently only used by the new
      SB_I_CGROUPWB flag whose concatenated and abbreviated name is for
      consistency with other super_block flags.
      
      ext2_fill_super() is updated to set SB_I_CGROUPWB.
      
      v2: Added super_block->s_iflags instead of stealing another high bit
          from sb->s_flags as suggested by Christoph and Jan.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: linux-ext4@vger.kernel.org
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      46b15caa
  7. 02 6月, 2015 1 次提交
    • T
      ext2: enable cgroup writeback support · 108dad65
      Tejun Heo 提交于
      Writeback now supports cgroup writeback and the generic writeback,
      buffer, libfs, and mpage helpers that ext2 uses are all updated to
      work with cgroup writeback.
      
      This patch enables cgroup writeback for ext2 by adding
      FS_CGROUP_WRITEBACK to its ->fs_flags.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: linux-ext4@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@fb.com>
      108dad65
  8. 17 2月, 2015 4 次提交
  9. 10 11月, 2014 1 次提交
  10. 08 9月, 2014 1 次提交
    • T
      percpu_counter: add @gfp to percpu_counter_init() · 908c7f19
      Tejun Heo 提交于
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_counters too.
      
      We could have left percpu_counter_init() alone and added
      percpu_counter_init_gfp(); however, the number of users isn't that
      high and introducing _gfp variants to all percpu data structures would
      be quite ugly, so let's just do the conversion.  This is the one with
      the most users.  Other percpu data structures are a lot easier to
      convert.
      
      This patch doesn't make any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Cc: x86@kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      908c7f19
  11. 16 7月, 2014 1 次提交
  12. 13 3月, 2014 1 次提交
    • T
      fs: push sync_filesystem() down to the file system's remount_fs() · 02b9984d
      Theodore Ts'o 提交于
      Previously, the no-op "mount -o mount /dev/xxx" operation when the
      file system is already mounted read-write causes an implied,
      unconditional syncfs().  This seems pretty stupid, and it's certainly
      documented or guaraunteed to do this, nor is it particularly useful,
      except in the case where the file system was mounted rw and is getting
      remounted read-only.
      
      However, it's possible that there might be some file systems that are
      actually depending on this behavior.  In most file systems, it's
      probably fine to only call sync_filesystem() when transitioning from
      read-write to read-only, and there are some file systems where this is
      not needed at all (for example, for a pseudo-filesystem or something
      like romfs).
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Evgeniy Dushistov <dushistov@mail.ru>
      Cc: Jan Kara <jack@suse.cz>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Anders Larsen <al@alarsen.net>
      Cc: Phillip Lougher <phillip@squashfs.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      Cc: Petr Vandrovec <petr@vandrovec.name>
      Cc: xfs@oss.sgi.com
      Cc: linux-btrfs@vger.kernel.org
      Cc: linux-cifs@vger.kernel.org
      Cc: samba-technical@lists.samba.org
      Cc: codalist@coda.cs.cmu.edu
      Cc: linux-ext4@vger.kernel.org
      Cc: linux-f2fs-devel@lists.sourceforge.net
      Cc: fuse-devel@lists.sourceforge.net
      Cc: cluster-devel@redhat.com
      Cc: linux-mtd@lists.infradead.org
      Cc: jfs-discussion@lists.sourceforge.net
      Cc: linux-nfs@vger.kernel.org
      Cc: linux-nilfs@vger.kernel.org
      Cc: linux-ntfs-dev@lists.sourceforge.net
      Cc: ocfs2-devel@oss.oracle.com
      Cc: reiserfs-devel@vger.kernel.org
      02b9984d
  13. 03 3月, 2014 1 次提交
  14. 04 12月, 2013 1 次提交
  15. 04 3月, 2013 1 次提交
    • E
      fs: Limit sys_mount to only request filesystem modules. · 7f78e035
      Eric W. Biederman 提交于
      Modify the request_module to prefix the file system type with "fs-"
      and add aliases to all of the filesystems that can be built as modules
      to match.
      
      A common practice is to build all of the kernel code and leave code
      that is not commonly needed as modules, with the result that many
      users are exposed to any bug anywhere in the kernel.
      
      Looking for filesystems with a fs- prefix limits the pool of possible
      modules that can be loaded by mount to just filesystems trivially
      making things safer with no real cost.
      
      Using aliases means user space can control the policy of which
      filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
      with blacklist and alias directives.  Allowing simple, safe,
      well understood work-arounds to known problematic software.
      
      This also addresses a rare but unfortunate problem where the filesystem
      name is not the same as it's module name and module auto-loading
      would not work.  While writing this patch I saw a handful of such
      cases.  The most significant being autofs that lives in the module
      autofs4.
      
      This is relevant to user namespaces because we can reach the request
      module in get_fs_type() without having any special permissions, and
      people get uncomfortable when a user specified string (in this case
      the filesystem type) goes all of the way to request_module.
      
      After having looked at this issue I don't think there is any
      particular reason to perform any filtering or permission checks beyond
      making it clear in the module request that we want a filesystem
      module.  The common pattern in the kernel is to call request_module()
      without regards to the users permissions.  In general all a filesystem
      module does once loaded is call register_filesystem() and go to sleep.
      Which means there is not much attack surface exposed by loading a
      filesytem module unless the filesystem is mounted.  In a user
      namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
      which most filesystems do not set today.
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Reported-by: NKees Cook <keescook@google.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      7f78e035
  16. 21 1月, 2013 1 次提交
  17. 10 10月, 2012 1 次提交
  18. 03 10月, 2012 1 次提交
  19. 31 7月, 2012 1 次提交
    • J
      ext2: Implement freezing · 1e8b212f
      Jan Kara 提交于
      The only missing piece to make freezing work reliably with ext2 is to
      stop iput() of unlinked inode from deleting the inode on frozen filesystem.
      So add a necessary protection to ext2_evict_inode().
      
      We also provide appropriate ->freeze_fs and ->unfreeze_fs functions.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1e8b212f
  20. 23 7月, 2012 1 次提交
    • J
      quota: Move quota syncing to ->sync_fs method · a1177825
      Jan Kara 提交于
      Since the moment writes to quota files are using block device page cache and
      space for quota structures is reserved at the moment they are first accessed we
      have no reason to sync quota before inode writeback. In fact this order is now
      only harmful since quota information can easily change during inode writeback
      (either because conversion of delayed-allocated extents or simply because of
      allocation of new blocks for simple filesystems not using page_mkwrite).
      
      So move syncing of quota information after writeback of inodes into ->sync_fs
      method. This way we do not have to use ->quota_sync callback which is primarily
      intended for use by quotactl syscall anyway and we get rid of calling
      ->sync_fs() twice unnecessarily. We skip quota syncing for OCFS2 since it does
      proper quota journalling in all cases (unlike ext3, ext4, and reiserfs which
      also support legacy non-journalled quotas) and thus there are no dirty quota
      structures.
      
      CC: "Theodore Ts'o" <tytso@mit.edu>
      CC: Joel Becker <jlbec@evilplan.org>
      CC: reiserfs-devel@vger.kernel.org
      Acked-by: NSteven Whitehouse <swhiteho@redhat.com>
      Acked-by: NDave Kleikamp <shaggy@kernel.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a1177825
  21. 09 7月, 2012 1 次提交
  22. 16 5月, 2012 3 次提交
  23. 11 4月, 2012 3 次提交
    • A
      ext2: do not register write_super within VFS · f72cf5e2
      Artem Bityutskiy 提交于
      Jan Kara removed 'sb->s_dirt' VFS flag references, so we do not need to
      register the ext2 'ext2_write_super()' method in the VFS superblock operations,
      because 'sb->s_dirt' won't be ever set to 1 and VFS won't ever call
      '->write_super()' anyway. Thus, remove the method.
      
      Tested using xfstests.
      Signed-off-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      f72cf5e2
    • J
      ext2: Remove s_dirt handling · b838ec22
      Jan Kara 提交于
      Places which modify superblock feature / state fields mark the superblock
      buffer dirty so it is written out by flusher thread. Thus there's no need to
      set s_dirt there.
      
      The only other fields changing in the superblock are the numbers of free
      blocks, free inodes and s_wtime. There's no real need to write (or even
      compute) these periodically. Free blocks / inodes counters are recomputed on
      every mount from group counters anyway and value of s_wtime is only
      informational and imprecise anyway. So it should be enough to write these
      opportunistically on mount, remount, umount, and sync_fs times.
      Signed-off-by: NJan Kara <jack@suse.cz>
      b838ec22
    • A
      ext2: write superblock only once on unmount · f2b22420
      Artem Bityutskiy 提交于
      Currently on unmount if we are mounted R/W, we first write the superblock to
      the media if it is dirty, and then write it again, which is not optimal. This
      patch makes ext2 write the superblock on unmount less times.
      Signed-off-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      f2b22420
  24. 21 3月, 2012 2 次提交
  25. 09 1月, 2012 1 次提交
  26. 07 1月, 2012 1 次提交
  27. 04 1月, 2012 1 次提交
    • A
      vfs: fix the stupidity with i_dentry in inode destructors · 6b520e05
      Al Viro 提交于
      Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
      it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
      the cost of taking it into inode_init_always() will be negligible for pipes
      and sockets and negative for everything else.  Not to mention the removal of
      boilerplate code from ->destroy_inode() instances...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6b520e05
  28. 30 8月, 2011 1 次提交
  29. 17 5月, 2011 1 次提交
  30. 31 3月, 2011 1 次提交
  31. 07 1月, 2011 1 次提交
    • N
      fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin 提交于
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downsides of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fa0d7e3d
  32. 06 1月, 2011 1 次提交