1. 05 9月, 2014 2 次提交
  2. 03 9月, 2014 1 次提交
    • J
      aio: add missing smp_rmb() in read_events_ring · 2ff396be
      Jeff Moyer 提交于
      We ran into a case on ppc64 running mariadb where io_getevents would
      return zeroed out I/O events.  After adding instrumentation, it became
      clear that there was some missing synchronization between reading the
      tail pointer and the events themselves.  This small patch fixes the
      problem in testing.
      
      Thanks to Zach for helping to look into this, and suggesting the fix.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      2ff396be
  3. 02 9月, 2014 8 次提交
    • C
      f2fs: reposition unlock_new_inode to prevent accessing invalid inode · b73e5282
      Chao Yu 提交于
      As the race condition on the inode cache, following scenario can appear:
      [Thread a]				[Thread b]
      					->f2fs_mkdir
      					  ->f2fs_add_link
      					    ->__f2fs_add_link
      					      ->init_inode_metadata failed here
      ->gc_thread_func
        ->f2fs_gc
          ->do_garbage_collect
            ->gc_data_segment
              ->f2fs_iget
                ->iget_locked
                  ->wait_on_inode
      					  ->unlock_new_inode
              ->move_data_page
      					  ->make_bad_inode
      					  ->iput
      
      When we fail in create/symlink/mkdir/mknod/tmpfile, the new allocated inode
      should be set as bad to avoid being accessed by other thread. But in above
      scenario, it allows f2fs to access the invalid inode before this inode was set
      as bad.
      This patch fix the potential problem, and this issue was found by code review.
      
      change log from v1:
       o Add condition judgment in gc_data_segment() suggested by Changman Lee.
       o use iget_failed to simplify code.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      b73e5282
    • B
      xfs: trim eofblocks before collapse range · 41b9d726
      Brian Foster 提交于
      xfs_collapse_file_space() currently writes back the entire file
      undergoing collapse range to settle things down for the extent shift
      algorithm. While this prevents changes to the extent list during the
      collapse operation, the writeback itself is not enough to prevent
      unnecessary collapse failures.
      
      The current shift algorithm uses the extent index to iterate the in-core
      extent list. If a post-eof delalloc extent persists after the writeback
      (e.g., a prior zero range op where the end of the range aligns with eof
      can separate the post-eof blocks such that they are not written back and
      converted), xfs_bmap_shift_extents() becomes confused over the encoded
      br_startblock value and fails the collapse.
      
      As with the full writeback, this is a temporary fix until the algorithm
      is improved to cope with a volatile extent list and avoid attempts to
      shift post-eof extents.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      41b9d726
    • D
      xfs: xfs_file_collapse_range is delalloc challenged · 1669a8ca
      Dave Chinner 提交于
      If we have delalloc extents on a file before we run a collapse range
      opertaion, we sync the range that we are going to collapse to
      convert delalloc extents in that region to real extents to simplify
      the shift operation.
      
      However, the shift operation then assumes that the extent list is
      not going to change as it iterates over the extent list moving
      things about. Unfortunately, this isn't true because we can't hold
      the ILOCK over all the operations. We can prevent new IO from
      modifying the extent list by holding the IOLOCK, but that doesn't
      prevent writeback from running....
      
      And when writeback runs, it can convert delalloc extents is the
      range of the file prior to the region being collapsed, and this
      changes the indexes of all the extents in the file. That causes the
      collapse range operation to Go Bad.
      
      The right fix is to rewrite the extent shift operation not to be
      dependent on the extent list not changing across the entire
      operation, but this is a fairly significant piece of work to do.
      Hence, as a short-term workaround for the problem, sync the entire
      file before starting a collapse operation to remove all delalloc
      ranges from the file and so avoid the problem of concurrent
      writeback changing the extent list.
      Diagnosed-and-Reported-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      
      1669a8ca
    • B
      xfs: don't log inode unless extent shift makes extent modifications · ca446d88
      Brian Foster 提交于
      The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
      all subsequent extents down into the specified, previously punched out,
      region. This function performs some validation, such as whether a
      sufficient hole exists in the target region of the collapse, then shifts
      the remaining exents downward.
      
      The exit path of the function currently logs the inode unconditionally.
      While we must log the inode (and abort) if an error occurs and the
      transaction is dirty, the initial validation paths can generate errors
      before the transaction has been dirtied. This creates an unnecessary
      filesystem shutdown scenario, as the caller will cancel a transaction
      that has been marked dirty.
      
      Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
      are made to the inode bmap. Only log the inode in the exit path if
      logflags has been set. This ensures we only have to cancel a dirty
      transaction if modifications have been made and prevents an unnecessary
      filesystem shutdown otherwise.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      ca446d88
    • D
      xfs: use ranged writeback and invalidation for direct IO · 7d4ea3ce
      Dave Chinner 提交于
      Now we are not doing silly things with dirtying buffers beyond EOF
      and using invalidation correctly, we can finally reduce the ranges of
      writeback and invalidation used by direct IO to match that of the IO
      being issued.
      
      Bring the writeback and invalidation ranges back to match the
      generic direct IO code - this will greatly reduce the perturbation
      of cached data when direct IO and buffered IO are mixed, but still
      provide the same buffered vs direct IO coherency behaviour we
      currently have.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      7d4ea3ce
    • D
      xfs: don't zero partial page cache pages during O_DIRECT writes · 834ffca6
      Dave Chinner 提交于
      Similar to direct IO reads, direct IO writes are using 
      truncate_pagecache_range to invalidate the page cache. This is
      incorrect due to the sub-block zeroing in the page cache that
      truncate_pagecache_range() triggers.
      
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      
      cc: stable@vger.kernel.org
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      834ffca6
    • C
      xfs: don't zero partial page cache pages during O_DIRECT writes · 85e584da
      Chris Mason 提交于
      xfs is using truncate_pagecache_range to invalidate the page cache
      during DIO reads.  This is different from the other filesystems who
      only invalidate pages during DIO writes.
      
      truncate_pagecache_range is meant to be used when we are freeing the
      underlying data structs from disk, so it will zero any partial
      ranges in the page.  This means a DIO read can zero out part of the
      page cache page, and it is possible the page will stay in cache.
      
      buffered reads will find an up to date page with zeros instead of
      the data actually on disk.
      
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      
      [dchinner: catch error and warn if it fails. Comment.]
      
      cc: stable@vger.kernel.org
      Signed-off-by: NChris Mason <clm@fb.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      85e584da
    • D
      xfs: don't dirty buffers beyond EOF · 22e757a4
      Dave Chinner 提交于
      generic/263 is failing fsx at this point with a page spanning
      EOF that cannot be invalidated. The operations are:
      
      1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
      1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
      1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
      
      where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
      write attempts to invalidate the cached page over this range, it
      fails with -EBUSY and so any attempt to do page invalidation fails.
      
      The real question is this: Why can't that page be invalidated after
      it has been written to disk and cleaned?
      
      Well, there's data on the first two buffers in the page (1k block
      size, 4k page), but the third buffer on the page (i.e. beyond EOF)
      is failing drop_buffers because it's bh->b_state == 0x3, which is
      BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
      what?
      
      OK, set_buffer_dirty() is called on all buffers from
      __set_page_buffers_dirty(), regardless of whether the buffer is
      beyond EOF or not, which means that when we get to ->writepage,
      we have buffers marked dirty beyond EOF that we need to clean.
      So, we need to implement our own .set_page_dirty method that
      doesn't dirty buffers beyond EOF.
      
      This is messy because the buffer code is not meant to be shared
      and it has interesting locking issues on the buffer dirty bits.
      So just copy and paste it and then modify it to suit what we need.
      
      Note: the solutions the other filesystems and generic block code use
      of marking the buffers clean in ->writepage does not work for XFS.
      It still leaves dirty buffers beyond EOF and invalidations still
      fail. Hence rather than play whack-a-mole, this patch simply
      prevents those buffers from being dirtied in the first place.
      
      cc: <stable@kernel.org>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      22e757a4
  4. 30 8月, 2014 4 次提交
  5. 29 8月, 2014 6 次提交
    • J
      f2fs: fix wrong casting for dentry name · 3304b564
      Jaegeuk Kim 提交于
      The dentry name type is unsigned char *.
      If we don't match this type, some character codes can be changed by signed bit.
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      3304b564
    • D
      ext4: fix same-dir rename when inline data directory overflows · d80d448c
      Darrick J. Wong 提交于
      When performing a same-directory rename, it's possible that adding or
      setting the new directory entry will cause the directory to overflow
      the inline data area, which causes the directory to be converted to an
      extent-based directory.  Under this circumstance it is necessary to
      re-read the directory when deleting the old dirent because the "old
      directory" context still points to i_block in the inode table, which
      is now an extent tree root!  The delete fails with an FS error, and
      the subsequent fsck complains about incorrect link counts and
      hardlinked directories.
      
      Test case (originally found with flat_dir_test in the metadata_csum
      test program):
      
      # mkfs.ext4 -O inline_data /dev/sda
      # mount /dev/sda /mnt
      # mkdir /mnt/x
      # touch /mnt/x/changelog.gz /mnt/x/copyright /mnt/x/README.Debian
      # sync
      # for i in /mnt/x/*; do mv $i $i.longer; done
      # ls -la /mnt/x/
      total 0
      -rw-r--r-- 1 root root 0 Aug 25 12:03 changelog.gz.longer
      -rw-r--r-- 1 root root 0 Aug 25 12:03 copyright
      -rw-r--r-- 1 root root 0 Aug 25 12:03 copyright.longer
      -rw-r--r-- 1 root root 0 Aug 25 12:03 README.Debian.longer
      
      (Hey!  Why are there four files now??)
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      d80d448c
    • D
      jbd2: fix descriptor block size handling errors with journal_csum · db9ee220
      Darrick J. Wong 提交于
      It turns out that there are some serious problems with the on-disk
      format of journal checksum v2.  The foremost is that the function to
      calculate descriptor tag size returns sizes that are too big.  This
      causes alignment issues on some architectures and is compounded by the
      fact that some parts of jbd2 use the structure size (incorrectly) to
      determine the presence of a 64bit journal instead of checking the
      feature flags.
      
      Therefore, introduce journal checksum v3, which enlarges the
      descriptor block tag format to allow for full 32-bit checksums of
      journal blocks, fix the journal tag function to return the correct
      sizes, and fix the jbd2 recovery code to use feature flags to
      determine 64bitness.
      
      Add a few function helpers so we don't have to open-code quite so
      many pieces.
      
      Switching to a 16-byte block size was found to increase journal size
      overhead by a maximum of 0.1%, to convert a 32-bit journal with no
      checksumming to a 32-bit journal with checksum v3 enabled.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reported-by: NTR Reardon <thomas_reardon@hotmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      db9ee220
    • D
      jbd2: fix infinite loop when recovering corrupt journal blocks · 022eaa75
      Darrick J. Wong 提交于
      When recovering the journal, don't fall into an infinite loop if we
      encounter a corrupt journal block.  Instead, just skip the block and
      return an error, which fails the mount and thus forces the user to run
      a full filesystem fsck.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      022eaa75
    • D
      ext4: update i_disksize coherently with block allocation on error path · 6603120e
      Dmitry Monakhov 提交于
      In case of delalloc block i_disksize may be less than i_size. So we
      have to update i_disksize each time we allocated and submitted some
      blocks beyond i_disksize.  We weren't doing this on the error paths,
      so fix this.
      
      testcase: xfstest generic/019
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      6603120e
    • D
      f2fs: simplify by using a literal · 922cedbd
      Dan Carpenter 提交于
      We can make the code a bit simpler because we know that "!retry" is
      zero.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      922cedbd
  6. 28 8月, 2014 2 次提交
  7. 27 8月, 2014 3 次提交
  8. 26 8月, 2014 1 次提交
  9. 25 8月, 2014 1 次提交
    • B
      aio: fix reqs_available handling · d856f32a
      Benjamin LaHaise 提交于
      As reported by Dan Aloni, commit f8567a38 ("aio: fix aio request
      leak when events are reaped by userspace") introduces a regression when
      user code attempts to perform io_submit() with more events than are
      available in the ring buffer.  Reverting that commit would reintroduce a
      regression when user space event reaping is used.
      
      Fixing this bug is a bit more involved than the previous attempts to fix
      this regression.  Since we do not have a single point at which we can
      count events as being reaped by user space and io_getevents(), we have
      to track event completion by looking at the number of events left in the
      event ring.  So long as there are as many events in the ring buffer as
      there have been completion events generate, we cannot call
      put_reqs_available().  The code to check for this is now placed in
      refill_reqs_available().
      
      A test program from Dan and modified by me for verifying this bug is available
      at http://www.kvack.org/~bcrl/20140824-aio_bug.c .
      Reported-by: NDan Aloni <dan@kernelim.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      Acked-by: NDan Aloni <dan@kernelim.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: stable@vger.kernel.org      # v3.16 and anything that f8567a38 was backported to
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d856f32a
  10. 24 8月, 2014 4 次提交
    • L
      Btrfs: fix task hang under heavy compressed write · 9e0af237
      Liu Bo 提交于
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.
      
      Btrfs has a kind of work queued as an ordered way, which means that its
      ordered_func() must be processed in the way of FIFO, so it usually looks like --
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case, first when we find free space, we get an uncached block
      group, then we go to read its free space cache inode for free space information,
      so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happens
      to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readhead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      As if work A has a high priority in wq->ordered_list and there are more ordered
      works queued after it, such as B->ordered_func(), its memory could have been
      freed before normal_work_helper() returns, which means that kernel workqueue
      code worker_thread() still has worker->current_work pointer to be work
      A->normal_work's, ie. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, work C->normal_work
      and work A->normal_work are likely to share the same address(I confirmed this
      with ftrace output, so I'm not just guessing, it's rare though).
      
      When another kthread picks up work C->normal_work to process, and finds our
      kthread is processing it(see find_worker_executing_work()), it'll think
      work C as a collision and skip then, which ends up nobody processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there're other cases that can lead to deadlock, but the real problem
      is that all btrfs workqueue shares one work->func, -- normal_work_helper,
      so this makes each workqueue to have its own helper function, but only a
      wraper pf normal_work_helper.
      
      With this patch, I no long hit the above hang.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9e0af237
    • D
      ext4: move i_size,i_disksize update routines to helper function · 4631dbf6
      Dmitry Monakhov 提交于
      Cc: stable@vger.kernel.org # needed for bug fix patches
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      4631dbf6
    • T
      ext4: fix BUG_ON in mb_free_blocks() · c99d1e6e
      Theodore Ts'o 提交于
      If we suffer a block allocation failure (for example due to a memory
      allocation failure), it's possible that we will call
      ext4_discard_allocated_blocks() before we've actually allocated any
      blocks.  In that case, fe_len and fe_start in ac->ac_f_ex will still
      be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0)
      triggering the BUG_ON on mb_free_blocks():
      
      	BUG_ON(last >= (sb->s_blocksize << 3));
      
      Fix this by bailing out of ext4_discard_allocated_blocks() if fs_len
      is zero.
      
      Also fix a missing ext4_mb_unload_buddy() call in
      ext4_discard_allocated_blocks().
      
      Google-Bug-Id: 16844242
      
      Fixes: 86f0afd4Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      c99d1e6e
    • T
      ext4: propagate errors up to ext4_find_entry()'s callers · 36de9286
      Theodore Ts'o 提交于
      If we run into some kind of error, such as ENOMEM, while calling
      ext4_getblk() or ext4_dx_find_entry(), we need to make sure this error
      gets propagated up to ext4_find_entry() and then to its callers.  This
      way, transient errors such as ENOMEM can get propagated to the VFS.
      This is important so that the system calls return the appropriate
      error, and also so that in the case of ext4_lookup(), we return an
      error instead of a NULL inode, since that will result in a negative
      dentry cache entry that will stick around long past the OOM condition
      which caused a transient ENOMEM error.
      
      Google-Bug-Id: #17142205
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      36de9286
  11. 23 8月, 2014 8 次提交