1. 10 3月, 2016 1 次提交
  2. 09 3月, 2016 6 次提交
    • J
      ext4: remove i_ioend_count · 600be30a
      Jan Kara 提交于
      Remove counter of pending io ends as it is unused.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      600be30a
    • J
      ext4: simplify io_end handling for AIO DIO · 109811c2
      Jan Kara 提交于
      When mapping blocks for direct IO, we allocate io_end structure before
      mapping blocks and store pointer to it in the inode. This creates a
      requirement that any AIO DIO using io_end must be protected by i_mutex.
      This created problems in the past with dioread_nolock mode which was
      corrupting io_end pointers. Also io_end is allocated unnecessarily in
      case where we don't need to convert any extents (which is a common case
      for example when overwriting file).
      
      We fix the problem by allocating io_end only once we return unwritten
      extent from block mapping function for AIO DIO (so we can save some
      pointless io_end allocations) and we pass pointer to it in bh->b_private
      which generic DIO code later passes to our end IO callback. That way we
      remove any need for global pointer to io_end structure and thus fix the
      races.
      
      The downside of this change is that the checking for unwritten IO in
      flight in ext4_extents_can_be_merged() is more racy since we now
      increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after dropping
      i_data_sem. However the check has been racy already before because
      ext4_writepages() already increment i_unwritten after dropping
      i_data_sem and reserved blocks save us from hitting ENOSPC in the worst
      case.
      Signed-off-by: NJan Kara <jack@suse.cz>
      109811c2
    • J
      ext4: move trans handling and completion deferal out of _ext4_get_block · efe70c29
      Jan Kara 提交于
      There is no need to handle starting of a transaction and deferal of DIO
      completion in _ext4_get_block() function. We can move this out to get
      block functions for direct IO that need it. That way we can add stricter
      checks verifying things work as we expect.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      efe70c29
    • J
      ext4: rename and split get blocks functions · 705965bd
      Jan Kara 提交于
      Rename ext4_get_blocks_write() to ext4_get_blocks_unwritten() to better
      describe what it does. Also split out get blocks functions for direct
      IO. Later we move functionality from _ext4_get_blocks() there. There's no
      functional change in this patch.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      705965bd
    • J
      ext4: use i_mutex to serialize unaligned AIO DIO · e142d052
      Jan Kara 提交于
      Currently we've used hashed aio_mutex to serialize unaligned AIO DIO.
      However the code cleanups that happened after 2011 when the lock was
      introduced made aio_mutex acquired at almost the same places where we
      already have exclusion using i_mutex. So just use i_mutex for the
      exclusion of unaligned AIO DIO.
      
      The change moves waiting for pending unwritten extent conversion under
      i_mutex. That makes special handling of O_APPEND writes unnecessary and
      also avoids possible livelocking of unaligned AIO DIO with aligned one
      (nothing was preventing contiguous stream of aligned AIO DIOs to let
      unaligned AIO DIO wait forever).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      e142d052
    • J
      ext4: pack ioend structure better · 3bd6ad7b
      Jan Kara 提交于
      On 64-bit architectures we have two 4-byte holes in struct ext4_io_end.
      Order entries better to avoid this and thus make the structure occupy
      64 instead of 72 bytes for 64-bit architectures.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      3bd6ad7b
  3. 23 2月, 2016 6 次提交
  4. 22 2月, 2016 2 次提交
  5. 19 2月, 2016 2 次提交
    • J
      ext4: fix crashes in dioread_nolock mode · 74dae427
      Jan Kara 提交于
      Competing overwrite DIO in dioread_nolock mode will just overwrite
      pointer to io_end in the inode. This may result in data corruption or
      extent conversion happening from IO completion interrupt because we
      don't properly set buffer_defer_completion() when unlocked DIO races
      with locked DIO to unwritten extent.
      
      Since unlocked DIO doesn't need io_end for anything, just avoid
      allocating it and corrupting pointer from inode for locked DIO.
      A cleaner fix would be to avoid these games with io_end pointer from the
      inode but that requires more intrusive changes so we leave that for
      later.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      74dae427
    • J
      ext4: fix bh->b_state corruption · ed8ad838
      Jan Kara 提交于
      ext4 can update bh->b_state non-atomically in _ext4_get_block() and
      ext4_da_get_block_prep(). Usually this is fine since bh is just a
      temporary storage for mapping information on stack but in some cases it
      can be fully living bh attached to a page. In such case non-atomic
      update of bh->b_state can race with an atomic update which then gets
      lost. Usually when we are mapping bh and thus updating bh->b_state
      non-atomically, nobody else touches the bh and so things work out fine
      but there is one case to especially worry about: ext4_finish_bio() uses
      BH_Uptodate_Lock on the first bh in the page to synchronize handling of
      PageWriteback state. So when blocksize < pagesize, we can be atomically
      modifying bh->b_state of a buffer that actually isn't under IO and thus
      can race e.g. with delalloc trying to map that buffer. The result is
      that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
      corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.
      
      Fix the problem by always updating bh->b_state bits atomically.
      
      CC: stable@vger.kernel.org
      Reported-by: NNikolay Borisov <kernel@kyup.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ed8ad838
  6. 16 2月, 2016 1 次提交
  7. 12 2月, 2016 6 次提交
    • E
      ext4: remove unused parameter "newblock" in convert_initialized_extent() · 56263b4c
      Eryu Guan 提交于
      The "newblock" parameter is not used in convert_initialized_extent(),
      remove it.
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      56263b4c
    • E
      ext4: don't read blocks from disk after extents being swapped · bcff2488
      Eryu Guan 提交于
      I notice ext4/307 fails occasionally on ppc64 host, reporting md5
      checksum mismatch after moving data from original file to donor file.
      
      The reason is that move_extent_per_page() calls __block_write_begin()
      and block_commit_write() to write saved data from original inode blocks
      to donor inode blocks, but __block_write_begin() not only maps buffer
      heads but also reads block content from disk if the size is not block
      size aligned.  At this time the physical block number in mapped buffer
      head is pointing to the donor file not the original file, and that
      results in reading wrong data to page, which get written to disk in
      following block_commit_write call.
      
      This also can be reproduced by the following script on 1k block size ext4
      on x86_64 host:
      
          mnt=/mnt/ext4
          donorfile=$mnt/donor
          testfile=$mnt/testfile
          e4compact=~/xfstests/src/e4compact
      
          rm -f $donorfile $testfile
      
          # reserve space for donor file, written by 0xaa and sync to disk to
          # avoid EBUSY on EXT4_IOC_MOVE_EXT
          xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile
      
          # create test file written by 0xbb
          xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile
      
          # compute initial md5sum
          md5sum $testfile | tee md5sum.txt
          # drop cache, force e4compact to read data from disk
          echo 3 > /proc/sys/vm/drop_caches
      
          # test defrag
          echo "$testfile" | $e4compact -i -v -f $donorfile
          # check md5sum
          md5sum -c md5sum.txt
      
      Fix it by creating & mapping buffer heads only but not reading blocks
      from disk, because all the data in page is guaranteed to be up-to-date
      in mext_page_mkuptodate().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      bcff2488
    • I
      ext4: fix potential integer overflow · 46901760
      Insu Yun 提交于
      Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data),
      integer overflow could be happened.
      Therefore, need to fix integer overflow sanitization.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NInsu Yun <wuninsu@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      46901760
    • H
      ext4: add a line break for proc mb_groups display · 802cf1f9
      Huaitong Han 提交于
      This patch adds a line break for proc mb_groups display.
      Signed-off-by: NHuaitong Han <huaitong.han@intel.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      802cf1f9
    • A
      ext4: ioctl: fix erroneous return value · fdde368e
      Anton Protopopov 提交于
      The ext4_ioctl_setflags() function which is used in the ioctls
      EXT4_IOC_SETFLAGS and EXT4_IOC_FSSETXATTR may return the positive value
      EPERM instead of -EPERM in case of error. This bug was introduced by a
      recent commit 9b7365fc.
      
      The following program can be used to illustrate the wrong behavior:
      
          #include <sys/types.h>
          #include <sys/ioctl.h>
          #include <sys/stat.h>
          #include <fcntl.h>
          #include <err.h>
      
          #define FS_IOC_GETFLAGS _IOR('f', 1, long)
          #define FS_IOC_SETFLAGS _IOW('f', 2, long)
          #define FS_IMMUTABLE_FL 0x00000010
      
          int main(void)
          {
              int fd;
              long flags;
      
              fd = open("file", O_RDWR|O_CREAT, 0600);
              if (fd < 0)
                  err(1, "open");
      
              if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
                  err(1, "ioctl: FS_IOC_GETFLAGS");
      
              flags |= FS_IMMUTABLE_FL;
      
              if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
                  err(1, "ioctl: FS_IOC_SETFLAGS");
      
              warnx("ioctl returned no error");
      
              return 0;
          }
      
      Running it gives the following result:
      
          $ strace -e ioctl ./test
          ioctl(3, FS_IOC_GETFLAGS, 0x7ffdbd8bfd38) = 0
          ioctl(3, FS_IOC_SETFLAGS, 0x7ffdbd8bfd38) = 1
          test: ioctl returned no error
          +++ exited with 0 +++
      
      Running the program on a kernel with the bug fixed gives the proper result:
      
          $ strace -e ioctl ./test
          ioctl(3, FS_IOC_GETFLAGS, 0x7ffdd2768258) = 0
          ioctl(3, FS_IOC_SETFLAGS, 0x7ffdd2768258) = -1 EPERM (Operation not permitted)
          test: ioctl: FS_IOC_SETFLAGS: Operation not permitted
          +++ exited with 1 +++
      Signed-off-by: NAnton Protopopov <a.s.protopopov@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      fdde368e
    • J
      ext4: fix scheduling in atomic on group checksum failure · 05145bd7
      Jan Kara 提交于
      When block group checksum is wrong, we call ext4_error() while holding
      group spinlock from ext4_init_block_bitmap() or
      ext4_init_inode_bitmap() which results in scheduling while in atomic.
      Fix the issue by calling ext4_error() later after dropping the spinlock.
      
      CC: stable@vger.kernel.org
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      05145bd7
  8. 08 2月, 2016 2 次提交
  9. 23 1月, 2016 2 次提交
    • R
      ext4: call dax_pfn_mkwrite() for DAX fsync/msync · d5be7a03
      Ross Zwisler 提交于
      To properly support the new DAX fsync/msync infrastructure filesystems
      need to call dax_pfn_mkwrite() so that DAX can track when user pages are
      dirtied.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5be7a03
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro 提交于
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  10. 15 1月, 2016 1 次提交
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov 提交于
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  11. 09 1月, 2016 4 次提交
  12. 07 1月, 2016 1 次提交
  13. 31 12月, 2015 1 次提交
  14. 14 12月, 2015 1 次提交
  15. 10 12月, 2015 1 次提交
  16. 09 12月, 2015 2 次提交
    • A
      replace ->follow_link() with new method that could stay in RCU mode · 6b255391
      Al Viro 提交于
      new method: ->get_link(); replacement of ->follow_link().  The differences
      are:
      	* inode and dentry are passed separately
      	* might be called both in RCU and non-RCU mode;
      the former is indicated by passing it a NULL dentry.
      	* when called that way it isn't allowed to block
      and should return ERR_PTR(-ECHILD) if it needs to be called
      in non-RCU mode.
      
      It's a flagday change - the old method is gone, all in-tree instances
      converted.  Conversion isn't hard; said that, so far very few instances
      do not immediately bail out when called in RCU mode.  That'll change
      in the next commits.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6b255391
    • A
      don't put symlink bodies in pagecache into highmem · 21fc61c7
      Al Viro 提交于
      kmap() in page_follow_link_light() needed to go - allowing to hold
      an arbitrary number of kmaps for long is a great way to deadlocking
      the system.
      
      new helper (inode_nohighmem(inode)) needs to be used for pagecache
      symlinks inodes; done for all in-tree cases.  page_follow_link_light()
      instrumented to yell about anything missed.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      21fc61c7
  17. 08 12月, 2015 1 次提交