1. 17 Apr 2014, 4 commits
    • xfs: wrong error sign conversion during failed DIO writes · 07d5035a
      Authored by Dave Chinner
      We incorrectly negate the error value returned from a generic
      function. The code path it runs in already returns negative errors,
      so there is no need to negate the value to get the correct error
      sign here.
      
      This was uncovered by generic/019.
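The sign bug reduces to a one-line mistake. A minimal sketch (names are made up, not the XFS ones), where a helper already returns negative errnos and the extra negation flips the sign the wrong way:

```c
#include <assert.h>
#include <errno.h>

/* The generic code path returns negative errors, as described above. */
static int helper_returns_negative_errno(void)
{
	return -EIO;
}

static int buggy_conversion(void)
{
	return -helper_returns_negative_errno();	/* wrong: yields +EIO */
}

static int fixed_conversion(void)
{
	return helper_returns_negative_errno();		/* correct: keeps -EIO */
}
```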
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: unmount does not wait for shutdown during unmount · 9c23eccc
      Authored by Dave Chinner
      An interesting situation can occur if a log IO error occurs during
      the unmount of a filesystem. The cases reported have the same
      signature - the update of the superblock counters fails due to a log
      write IO error:
      
      XFS (dm-16): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa08a44a1
      XFS (dm-16): Log I/O Error Detected.  Shutting down filesystem
      XFS (dm-16): Unable to update superblock counters. Freespace may not be correct on next mount.
      XFS (dm-16): xfs_log_force: error 5 returned.
      XFS (¿-¿¿¿): Please umount the filesystem and rectify the problem(s)
      
      It can be seen that the last line of output contains a corrupt
      device name - this is because the log and xfs_mount structures have
      already been freed by the time this message is printed. A kernel
      oops closely follows.
      
      The issue is that the shutdown is occurring in a separate IO
      completion thread to the unmount. Once the shutdown processing has
      started and all the iclogs are marked with XLOG_STATE_IOERROR, the
      log shutdown code wakes anyone waiting on a log force so they can
      process the shutdown error. This wakes up the unmount code that
      is doing a synchronous transaction to update the superblock
      counters.
      
      The unmount path now sees all the iclogs are marked with
      XLOG_STATE_IOERROR and so never waits on them again, knowing that if
      it does, there will not be a wakeup trigger for it and we will hang
      the unmount if we do. Hence the unmount runs through all the
      remaining code and frees all the filesystem structures while the
      xlog_iodone() is still processing the shutdown. When the log
      shutdown processing completes, xfs_do_force_shutdown() emits the
      "Please umount the filesystem and rectify the problem(s)" message,
      and xlog_iodone() then aborts all the objects attached to the iclog.
      An iclog that has already been freed....
      
      The real issue here is that there is no serialisation point between
      the log IO and the unmount. We have serialisations points for log
      writes, log forces, reservations, etc, but we don't actually have
      any code that wakes for log IO to fully complete. We do that for all
      other types of object, so why not iclogbufs?
      
      Well, it turns out that we can easily do this. We've got xfs_buf
      handles, and that's what everyone else uses for IO serialisation.
      i.e. bp->b_sema. So, lets hold iclogbufs locked over IO, and only
      release the lock in xlog_iodone() when we are finished with the
      buffer. That way before we tear down the iclog, we can lock and
      unlock the buffer to ensure IO completion has finished completely
      before we tear it down.
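The lock-over-IO idea can be shown in a single-threaded sketch with hypothetical structures (a mutex standing in for bp->b_sema): the lock is taken at submission, released only in the completion handler, so tear-down can lock/unlock the buffer to wait for completion:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct iclog_buf {
	pthread_mutex_t	b_sema;		/* stands in for bp->b_sema */
	bool		io_done;
};

static void submit_io(struct iclog_buf *bp)
{
	pthread_mutex_lock(&bp->b_sema);	/* lock held over the IO */
	bp->io_done = false;
}

static void iodone(struct iclog_buf *bp)
{
	bp->io_done = true;
	pthread_mutex_unlock(&bp->b_sema);	/* released only when finished */
}

static void teardown_wait(struct iclog_buf *bp)
{
	/* a lock/unlock cycle cannot complete until iodone() has run */
	pthread_mutex_lock(&bp->b_sema);
	pthread_mutex_unlock(&bp->b_sema);
}
```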
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Mike Snitzer <snitzer@redhat.com>
      Tested-by: Bob Mastors <bob.mastors@solidfire.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: collapse range is delalloc challenged · d39a2ced
      Authored by Dave Chinner
      FSX has been detecting data corruption after collapse range calls.
      The key observation is that the offset of the last extent in the
      file was not being shifted, and hence when the file size was
      adjusted it was truncating away data because the extents had not
      been correctly shifted.
      
      Tracing indicated that before the collapse, the extent list looked
      like:
      
      ....
      ino 0x5788 state  idx 6 offset 26 block 195904 count 10 flag 0
      ino 0x5788 state  idx 7 offset 39 block 195917 count 35 flag 0
      ino 0x5788 state  idx 8 offset 86 block 195964 count 32 flag 0
      
      and after the shift of 2 blocks:
      
      ino 0x5788 state  idx 6 offset 24 block 195904 count 10 flag 0
      ino 0x5788 state  idx 7 offset 37 block 195917 count 35 flag 0
      ino 0x5788 state  idx 8 offset 86 block 195964 count 32 flag 0
      
      Note that the last extent did not change offset. After the file
      size change:
      
      ino 0x5788 state  idx 6 offset 24 block 195904 count 10 flag 0
      ino 0x5788 state  idx 7 offset 37 block 195917 count 35 flag 0
      ino 0x5788 state  idx 8 offset 86 block 195964 count 30 flag 0
      
      You can see that the last extent had its length truncated,
      indicating that we've lost data.
      
      The reason for this is that the xfs_bmap_shift_extents() loop uses
      XFS_IFORK_NEXTENTS() to determine how many extents are in the inode.
      This, unfortunately, doesn't take into account delayed allocation
      extents - it's a count of physically allocated extents - and hence
      when the file being collapsed has a delalloc extent like this one
      does prior to the range being collapsed:
      
      ....
      ino 0x5788 state  idx 4 offset 11 block 4503599627239429 count 1 flag 0
      ....
      
      it gets the count wrong and terminates the shift loop early.
      
      Fix it by using the in-memory extent array size that includes
      delayed allocation extents to determine the number of extents on the
      inode.
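The loop-bound bug can be reduced to a few lines with made-up structures: counting only physically allocated extents (what XFS_IFORK_NEXTENTS effectively reports) misses delalloc extents, so a shift bounded by that count stops early; the fix bounds the loop by the in-memory extent count:

```c
#include <assert.h>

struct ext {
	long long	offset;
	int		delalloc;
};

static int count_allocated_only(const struct ext *e, int n)
{
	int c = 0, i;

	for (i = 0; i < n; i++)
		if (!e[i].delalloc)
			c++;
	return c;	/* too small when delalloc extents exist */
}

static void shift_all(struct ext *e, int n, int shift)
{
	int i;

	/* the fix: bound the loop by the in-memory extent count n */
	for (i = 0; i < n; i++)
		e[i].offset -= shift;
}
```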
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: don't map ranges that span EOF for direct IO · 0e1f789d
      Authored by Dave Chinner
      Al Viro tracked down the problem that has caused generic/263 to fail
      on XFS since the test was introduced. It is caused by
      xfs_get_blocks() mapping a single extent that spans EOF without
      marking it as buffer_new(), so the direct IO code does not zero
      the tail of the block at the new EOF. This is a long-standing bug
      that has been around for many, many years.
      
      Because xfs_get_blocks() starts the map before EOF, it can't set
      buffer_new(), because that causes the direct IO code to also zero
      unaligned sectors at the head of the IO. This would overwrite valid
      data with zeros, and hence we cannot validly return a single extent
      that spans EOF to direct IO.
      
      Fix this by detecting a mapping that spans EOF and truncating it
      down to EOF. This results in the direct IO code doing the right
      thing for unaligned data blocks before EOF, and then returning to
      get another mapping for the region beyond EOF, which XFS treats
      correctly by setting buffer_new() on it. This makes direct IO
      behave correctly w.r.t. tail block zeroing beyond EOF, and fsx is
      happy about that.
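The clamping step itself is simple; a sketch with illustrative types (not the XFS mapping structures): truncate any mapping that straddles EOF so the caller comes back separately for the range beyond EOF:

```c
#include <assert.h>

struct map {
	long long	offset;
	long long	len;
};

static void clamp_map_to_eof(struct map *m, long long eof)
{
	/* only a mapping that starts before and ends after EOF is cut */
	if (m->offset < eof && m->offset + m->len > eof)
		m->len = eof - m->offset;
}
```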
      
      Again, thanks to Al Viro for finding what I couldn't.
      
      [ dchinner: Fix for __divdi3 build error:
      Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      ]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 14 Apr 2014, 4 commits
  3. 08 Apr 2014, 1 commit
  4. 04 Apr 2014, 3 commits
    • mm + fs: store shadow entries in page cache · 91b0abe3
      Authored by Johannes Weiner
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
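The handshake can be shown in a simplified, single-threaded sketch (the real code sets the flag under the page tree lock; names here are illustrative): the freeing side sets AS_EXITING before the final truncate, and reclaim checks it before installing a shadow entry:

```c
#include <assert.h>
#include <stdbool.h>

#define AS_EXITING	(1u << 0)

struct address_space_sketch {
	unsigned	flags;
	int		nr_shadows;
};

static void inode_begin_freeing(struct address_space_sketch *as)
{
	as->flags |= AS_EXITING;	/* set before the final truncate */
}

static bool install_shadow(struct address_space_sketch *as)
{
	if (as->flags & AS_EXITING)	/* inode going away: keep tree empty */
		return false;
	as->nr_shadows++;
	return true;
}
```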
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • xfs: fix directory hash ordering bug · c88547a8
      Authored by Mark Tinguely
      Commit f5ea1100 ("xfs: add CRCs to dir2/da node blocks") introduced
      in 3.10 incorrectly converted the btree hash index array pointer in
      xfs_da3_fixhashpath(). It resulted in the current hash always
      being compared against the first entry in the btree rather than the
      current block index into the btree block's hash entry array. As a
      result, it was comparing the wrong hashes, and so could misorder the
      entries in the btree.
      
      For most cases, this doesn't cause any problems as it requires hash
      collisions to expose the ordering problem. However, when there are
      hash collisions within a directory there is a very good probability
      that the entries will be ordered incorrectly and that actually
      matters when duplicate hashes are placed into or removed from the
      btree block hash entry array.
      
      This bug results in on-disk directory corruption, and that results
      in directory verifier functions throwing corruption warnings into
      the logs. While no data or directory entries are lost, access to
      them may be compromised, and attempts to remove entries from a
      directory that has suffered this corruption may result in a
      filesystem shutdown.  xfs_repair will fix the directory hash
      ordering without data loss.
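The incorrect pointer conversion boils down to indexing; an illustrative reduction (not the actual xfs_da3_fixhashpath() code): the fix-up must read the hash at the current block index, not always the first entry of the array:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t buggy_hash(const uint32_t *btree_hashes, int index)
{
	(void)index;
	return btree_hashes[0];		/* always compares the first entry */
}

static uint32_t fixed_hash(const uint32_t *btree_hashes, int index)
{
	return btree_hashes[index];	/* current index into the hash array */
}
```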
      
      [dchinner: wrote a useful commit message]
      
      cc: <stable@vger.kernel.org>
      Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: extra semi-colon breaks a condition · 805eeb8e
      Authored by Dan Carpenter
      There were some extra semi-colons here which mean that we return true
      unintentionally.
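The class of bug in miniature: the stray semicolon is an empty statement that terminates the `if`, so the return executes unconditionally:

```c
#include <assert.h>
#include <stdbool.h>

static bool buggy_check(int x)
{
	if (x > 0);		/* the condition is silently discarded */
		return true;	/* always executes */
}

static bool fixed_check(int x)
{
	if (x > 0)
		return true;
	return false;
}
```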
      
      Fixes: a49935f2 ('xfs: xfs_check_page_type buffer checks need help')
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  5. 02 Apr 2014, 4 commits
  6. 13 Mar 2014, 2 commits
    • fs: push sync_filesystem() down to the file system's remount_fs() · 02b9984d
      Authored by Theodore Ts'o
      Previously, the no-op "mount -o remount /dev/xxx" operation when the
      file system is already mounted read-write caused an implied,
      unconditional syncfs().  This seems pretty stupid, and it's certainly
      not documented or guaranteed to do this, nor is it particularly
      useful, except in the case where the file system was mounted rw and
      is getting remounted read-only.
      
      However, it's possible that there might be some file systems that are
      actually depending on this behavior.  In most file systems, it's
      probably fine to only call sync_filesystem() when transitioning from
      read-write to read-only, and there are some file systems where this is
      not needed at all (for example, for a pseudo-filesystem or something
      like romfs).
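A hypothetical helper capturing the per-filesystem policy described above: most filesystems only need the implied sync on a read-write to read-only transition:

```c
#include <assert.h>
#include <stdbool.h>

static bool remount_needs_sync(bool was_ro, bool now_ro)
{
	return !was_ro && now_ro;	/* rw -> ro transition only */
}
```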
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Evgeniy Dushistov <dushistov@mail.ru>
      Cc: Jan Kara <jack@suse.cz>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Anders Larsen <al@alarsen.net>
      Cc: Phillip Lougher <phillip@squashfs.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      Cc: Petr Vandrovec <petr@vandrovec.name>
      Cc: xfs@oss.sgi.com
      Cc: linux-btrfs@vger.kernel.org
      Cc: linux-cifs@vger.kernel.org
      Cc: samba-technical@lists.samba.org
      Cc: codalist@coda.cs.cmu.edu
      Cc: linux-ext4@vger.kernel.org
      Cc: linux-f2fs-devel@lists.sourceforge.net
      Cc: fuse-devel@lists.sourceforge.net
      Cc: cluster-devel@redhat.com
      Cc: linux-mtd@lists.infradead.org
      Cc: jfs-discussion@lists.sourceforge.net
      Cc: linux-nfs@vger.kernel.org
      Cc: linux-nilfs@vger.kernel.org
      Cc: linux-ntfs-dev@lists.sourceforge.net
      Cc: ocfs2-devel@oss.oracle.com
      Cc: reiserfs-devel@vger.kernel.org
    • xfs: Add support for FALLOC_FL_ZERO_RANGE · 376ba313
      Authored by Lukas Czerner
      Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
      functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
      
      We can also preallocate blocks past EOF in the same way as with
      fallocate. The FALLOC_FL_KEEP_SIZE flag will cause the inode size to
      remain the same even if we preallocate blocks past EOF.
      
      It uses the same code to zero the range as is used by the
      XFS_IOC_ZERO_RANGE ioctl.
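From userspace, the new flag is used through fallocate(2) (Linux-only, and the underlying filesystem must support it, so errors like EOPNOTSUPP remain possible): zero a byte range in place without changing the file size:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <string.h>
#include <unistd.h>

/* Returns 0 on success, an errno value on failure, -1 on bad contents. */
static int zero_range_demo(const char *path)
{
	char buf[8];
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return errno;
	if (write(fd, "AAAAAAAA", 8) != 8)
		return errno;
	/* zero bytes 2..5 in place; the file size stays at 8 bytes */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 2, 4) != 0) {
		int err = errno;
		close(fd);
		return err;	/* e.g. EOPNOTSUPP on filesystems without support */
	}
	if (pread(fd, buf, 8, 0) != 8)
		return errno;
	close(fd);
	return memcmp(buf, "AA\0\0\0\0AA", 8) == 0 ? 0 : -1;
}
```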
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  7. 07 Mar 2014, 5 commits
    • xfs: inode log reservations are still too small · fe4c224a
      Authored by Dave Chinner
      Back in commit 23956703 ("xfs: inode log reservations are too
      small"), the reservation size was increased to take into account the
      difference in size between the in-memory BMBT block headers and the
      on-disk BMDR headers. This solved a transaction overrun when logging
      the inode size.
      
      Recently, however, we've seen a number of these same overruns on
      kernels with the above fix in them. All of them have been overruns
      of 4 bytes, so we must still not be accounting for something
      correctly.
      
      Through inspection it turns out the above commit didn't take into
      account everything it should have. That is, it only accounts for a
      single log op_hdr structure, when it can actually require up to four
      op_hdrs - one for each region (log iovec) that is formatted. These
      regions are the inode log format header, the inode core, and the two
      forks that can be held in the literal area of the inode.
      
      This means we are not accounting for 36 bytes of log space that the
      transaction can use, and hence when we get inodes in certain formats
      with particular fragmentation patterns we can overrun the
      transaction. Fix this by adding the correct accounting for log
      op_headers in the transaction.
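The arithmetic checks out as a back-of-envelope sketch: with a 12-byte op header (consistent with the 36-byte shortfall for the 3 extra headers noted above) and up to four regions per inode (format header, inode core, two forks):

```c
#include <assert.h>

#define OP_HDR_SIZE		12	/* bytes per log op header */
#define MAX_INODE_REGIONS	4	/* format header, core, two forks */

static int op_hdr_space(int nregions)
{
	return nregions * OP_HDR_SIZE;
}
```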
      Tested-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: xfs_check_page_type buffer checks need help · a49935f2
      Authored by Dave Chinner
      xfs_aops_discard_page() was introduced in the following commit:
      
        xfs: truncate delalloc extents when IO fails in writeback
      
      ... to clean up left over delalloc ranges after I/O failure in
      ->writepage(). generic/224 tests for this scenario and occasionally
      reproduces panics on sub-4k blocksize filesystems.
      
      The cause of this is failure to clean up the delalloc range on a
      page where the first buffer does not match one of the expected
      states of xfs_check_page_type(). If a buffer is not unwritten,
      delayed or dirty&mapped, xfs_check_page_type() stops and
      immediately returns 0.
      
      The stress test of generic/224 creates a scenario where the first
      several buffers of a page with delayed buffers are mapped & uptodate
      and some subsequent buffer is delayed. If the ->writepage() happens
      to fail for this page, xfs_aops_discard_page() incorrectly skips
      the entire page.
      
      This then causes later failures either when direct IO maps the range
      and finds the stale delayed buffer, or we evict the inode and find
      that the inode still has a delayed block reservation accounted to
      it.
      
      We can easily fix this xfs_aops_discard_page() failure by making
      xfs_check_page_type() check all buffers, but this breaks
      xfs_convert_page() more than it is already broken. Indeed,
      xfs_convert_page() wants xfs_check_page_type() to tell it if the
      first buffers on the pages are of a type that can be aggregated into
      the contiguous IO that is already being built.
      
      xfs_convert_page() should not be writing random buffers out of a
      page, but the current behaviour will cause it to do so if there are
      buffers that don't match the current specification on the page.
      Hence for xfs_convert_page() we need to:
      
      	a) return "not ok" if the first buffer on the page does not
      	match the specification provided so we don't write anything;
      	and
      	b) abort its buffer-add-to-io loop the moment we come
      	across a buffer that does not match the specification.
      
      Hence we need to fix both xfs_check_page_type() and
      xfs_convert_page() to work correctly with pages that have mixed
      buffer types, whilst allowing xfs_aops_discard_page() to scan all
      buffers on the page for a type match.
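The two behaviours the fix separates can be reduced to one predicate with made-up types: discard wants "does ANY buffer on the page match?", while convert wants "does the page start with a matching buffer?":

```c
#include <assert.h>
#include <stdbool.h>

enum btype { B_MAPPED, B_DELAYED, B_UNWRITTEN };

static bool page_has_type(const enum btype *bufs, int n, enum btype want,
			  bool check_all)
{
	int i;

	for (i = 0; i < n; i++) {
		if (bufs[i] == want)
			return true;
		if (!check_all)
			return false;	/* convert: the first buffer decides */
	}
	return false;
}
```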
      Reported-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation · e480a723
      Authored by Brian Foster
      The inode chunk allocation path can lead to deadlock conditions if
      a transaction is dirtied with an AGF (to fix up the freelist) for
      an AG that cannot satisfy the actual allocation request. This code
      path is written to try and avoid this scenario, but it can be
      reproduced by running xfstests generic/270 in a loop on a 512b fs.
      
      An example situation is:
      - process A attempts an inode allocation on AG 3, modifies
        the freelist, fails the allocation and ultimately moves on to
        AG 0 with the AG 3 AGF held
      - process B is doing a free space operation (i.e., truncate) and
        acquires the AG 0 AGF, waits on the AG 3 AGF
      - process A acquires the AG 0 AGI, waits on the AG 0 AGF (deadlock)
      
      The problem here is that process A acquired the AG 3 AGF while
      moving on to AG 0 (and releasing the AG 3 AGI with the AG 3 AGF
      held). xfs_dialloc() makes one pass through each of the AGs when
      attempting to allocate an inode chunk. The expectation is a clean
      transaction if a particular AG cannot satisfy the allocation
      request. xfs_ialloc_ag_alloc() is written to support this through
      use of the minalignslop allocation args field.
      
      When using the agi->agi_newino optimization, we attempt an exact
      bno allocation request based on the location of the previously
      allocated chunk. minalignslop is set to inform the allocator that
      we will require alignment on this chunk, and thus to not allow the
      request for this AG if the extra space is not available. Suppose
      that the AG in question has just enough space for this request, but
      not at the requested bno. xfs_alloc_fix_freelist() will proceed as
      normal as it determines the request should succeed, and thus it is
      allowed to modify the agf. xfs_alloc_ag_vextent() ultimately fails
      because the requested bno is not available. In response, the caller
      moves on to a NEAR_BNO allocation request for the same AG. The
      alignment is set, but the minalignslop field is never reset. This
      increases the overall requirement of the request from the first
      attempt. If this delta is the difference between allocation success
      and failure for the AG, xfs_alloc_fix_freelist() rejects this
      request outright the second time around and causes the allocation
      request to unnecessarily fail for this AG.
      
      To address this situation, reset the minalignslop field immediately
      after use and prevent it from leaking into subsequent requests.
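A minimal sketch of the leak with an illustrative structure: minalignslop is set for the exact-bno attempt and must be reset before the near-bno fallback, or the second attempt's space requirement stays inflated:

```c
#include <assert.h>

struct alloc_args_sketch {
	int	minalignslop;	/* extra space demanded for alignment */
};

static int exact_bno_attempt(struct alloc_args_sketch *args)
{
	args->minalignslop = 8;	/* demand alignment headroom for this try */
	/* ... the exact-bno allocation fails ... */
	args->minalignslop = 0;	/* the fix: reset immediately after use */
	return -1;
}
```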
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: use NOIO contexts for vm_map_ram · ae687e58
      Authored by Dave Chinner
      When we map pages in the buffer cache, we can do so in GFP_NOFS
      contexts. However, the vmap interfaces do not provide any method of
      communicating this information to memory reclaim, and hence we get
      lockdep complaining about it regularly and occasionally see hangs
      that may be vmap related reclaim deadlocks. We can also see these
      same problems from anywhere where we use vmalloc for a large buffer
      (e.g. attribute code) inside a transaction context.
      
      A typical lockdep report shows up as a reclaim state warning like so:
      
      [14046.101458] =================================
      [14046.102850] [ INFO: inconsistent lock state ]
      [14046.102850] 3.14.0-rc4+ #2 Not tainted
      [14046.102850] ---------------------------------
      [14046.102850] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      [14046.102850] kswapd0/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
      [14046.102850]  (&xfs_dir_ilock_class){++++?+}, at: [<791a04bb>] xfs_ilock+0xff/0x16a
      [14046.102850] {RECLAIM_FS-ON-W} state was registered at:
      [14046.102850]   [<7904cdb1>] mark_held_locks+0x81/0xe7
      [14046.102850]   [<7904d390>] lockdep_trace_alloc+0x5c/0xb4
      [14046.102850]   [<790c2c28>] kmem_cache_alloc_trace+0x2b/0x11e
      [14046.102850]   [<790ba7f4>] vm_map_ram+0x119/0x3e6
      [14046.102850]   [<7914e124>] _xfs_buf_map_pages+0x5b/0xcf
      [14046.102850]   [<7914ed74>] xfs_buf_get_map+0x67/0x13f
      [14046.102850]   [<7917506f>] xfs_attr_rmtval_set+0x396/0x4d5
      [14046.102850]   [<7916e8bb>] xfs_attr_leaf_addname+0x18f/0x37d
      [14046.102850]   [<7916ed9e>] xfs_attr_set_int+0x2f5/0x3e8
      [14046.102850]   [<7916eefc>] xfs_attr_set+0x6b/0x74
      [14046.102850]   [<79168355>] xfs_xattr_set+0x61/0x81
      [14046.102850]   [<790e5b10>] generic_setxattr+0x59/0x68
      [14046.102850]   [<790e4c06>] __vfs_setxattr_noperm+0x58/0xce
      [14046.102850]   [<790e4d0a>] vfs_setxattr+0x8e/0x92
      [14046.102850]   [<790e4ddd>] setxattr+0xcf/0x159
      [14046.102850]   [<790e5423>] SyS_lsetxattr+0x88/0xbb
      [14046.102850]   [<79268438>] sysenter_do_call+0x12/0x36
      
      Now, we can't completely remove these traces - mainly because
      vm_map_ram() will do GFP_KERNEL allocation and that generates the
      above warning before we get into the reclaim code, but we can turn
      them all into false positive warnings.
      
      To do that, use the method that DM and other IO context code uses to
      avoid this problem: there is a process flag to tell memory reclaim
      not to do IO that we can set appropriately. That prevents GFP_KERNEL
      context reclaim being done from deep inside the vmalloc code in
      places we can't directly pass a GFP_NOFS context to. That interface
      has a pair of wrapper functions: memalloc_noio_save() and
      memalloc_noio_restore().
      
      Adding them around vm_map_ram and the vzalloc call in
      kmem_alloc_large() will prevent deadlocks and most lockdep reports
      for this issue. Also, convert the vzalloc() call in
      kmem_alloc_large() to use __vmalloc() so that we can pass the
      correct gfp context to the data page allocation routine inside
      __vmalloc() so that it is clear that GFP_NOFS context is important
      to this vmalloc call.
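The save/restore discipline looks like this in a sketch, with the per-process flag modelled as a plain global: save the old value, force the flag on around the vmap call, and restore the saved value afterwards so nested users work correctly:

```c
#include <assert.h>
#include <stdbool.h>

static bool pf_memalloc_noio;	/* stands in for the process flag */

static bool noio_save(void)
{
	bool old = pf_memalloc_noio;

	pf_memalloc_noio = true;	/* reclaim must not do IO from here */
	return old;
}

static void noio_restore(bool old)
{
	pf_memalloc_noio = old;		/* restore, don't just clear */
}
```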
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: don't leak EFSBADCRC to userspace · ac75a1f7
      Authored by Dave Chinner
      While the verifier routines may return EFSBADCRC when a buffer has
      a bad CRC, we need to translate that to EFSCORRUPTED so that the
      higher layers treat the error appropriately and we return a
      consistent error to userspace. This fixes a xfs/005 regression.
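The translation is a one-branch filter; a sketch with placeholder error values (XFS maps these onto existing errnos internally, the numbers below are illustrative, not the real ones):

```c
#include <assert.h>

#define SK_EFSBADCRC	1001	/* illustrative placeholder value */
#define SK_EFSCORRUPTED	1002	/* illustrative placeholder value */

static int sanitise_verifier_error(int error)
{
	if (error == -SK_EFSBADCRC)
		return -SK_EFSCORRUPTED;	/* consistent error for callers */
	return error;
}
```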
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  8. 27 Feb 2014, 10 commits
  9. 24 Feb 2014, 1 commit
  10. 22 Feb 2014, 1 commit
    • Revert "writeback: do not sync data dirtied after sync start" · 0dc83bd3
      Authored by Jan Kara
      This reverts commit c4a391b5. Dave
      Chinner <david@fromorbit.com> has reported the commit may cause some
      inodes to be left out from sync(2). This is because we can call
      redirty_tail() for some inode (which sets i_dirtied_when to current time)
      after sync(2) has started or similarly requeue_inode() can set
      i_dirtied_when to current time if writeback had to skip some pages. The
      real problem is in the functions clobbering i_dirtied_when but fixing
      that isn't trivial so revert is a safer choice for now.
      
      CC: stable@vger.kernel.org # >= 3.13
      Signed-off-by: Jan Kara <jack@suse.cz>
  11. 19 Feb 2014, 3 commits
    • xfs: limit superblock corruption errors to actual corruption · 5ef11eb0
      Authored by Eric Sandeen
      Today, if
      
      xfs_sb_read_verify
        xfs_sb_verify
          xfs_mount_validate_sb
      
      detects superblock corruption, it'll be extremely noisy, dumping
      2 stacks, 2 hexdumps, etc.
      
      This is because we call XFS_CORRUPTION_ERROR in xfs_mount_validate_sb
      as well as in xfs_sb_read_verify.
      
      Also, *any* error in xfs_mount_validate_sb which is not corruption
      per se - things like too-big blocksize, bad version, bad magic, v1
      dirs, rw-incompat, etc., which do not return EFSCORRUPTED - will
      still trigger the whole XFS_CORRUPTION_ERROR spew when
      xfs_sb_read_verify sees any error at all.  And it suggests to the
      user that they should run xfs_repair, even if the root cause of the
      mount failure is a simple incompatibility.
      
      I'll submit that the probably-not-corrupted errors don't warrant
      this much noise, so this patch removes the warning for anything
      other than EFSCORRUPTED returns, and replaces the lower-level
      XFS_CORRUPTION_ERROR with an xfs_notice().
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: skip verification on initial "guess" superblock read · daba5427
      Authored by Eric Sandeen
      When xfs_readsb() does the very first read of the superblock,
      it makes a guess at the length of the buffer, based on the
      sector size of the underlying storage.  This may or may
      not match the filesystem sector size in sb_sectsize, so
      we can't, for example, do a CRC check on it; it might be too short.
      
      In fact, mounting a filesystem with sb_sectsize larger
      than the device sector size will cause a mount failure
      if CRCs are enabled, because we are checksumming a length
      which exceeds the buffer passed to it.
      
      So always read twice; the first time we read with NULL
      buffer ops to skip verification; then set the proper
      read length, hook up the proper verifier, and give it
      another go.
      
      Once we are sure that we've got the right buffer length,
      we can also use bp->b_length in the xfs_sb_read_verify,
      rather than the less-trusted on-disk sectorsize for
      secondary superblocks.  Before this we ran the risk of
      passing junk to the crc32c routines, which didn't always
      handle extreme values.
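The essence of the two-pass scheme in a reduced sketch (illustrative structure): pass one reads with a guessed length and no verifier, and pass two re-reads using the sector size the superblock itself reports, which is the length the CRC must cover:

```c
#include <assert.h>

struct sb_sketch {
	int	sb_sectsize;	/* sector size the superblock reports */
};

static int choose_verify_len(const struct sb_sketch *sb, int device_guess)
{
	(void)device_guess;	/* pass 1 only trusts this enough to read */
	return sb->sb_sectsize;	/* pass 2: the length the verifier checks */
}
```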
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: xfs_sb_read_verify() doesn't flag bad crcs on primary sb · 7a01e707
      Authored by Eric Sandeen
      My earlier commit 10e6e65d deserves a layer or two of brown paper
      bags.  The logic in that commit means that a CRC failure on the
      primary superblock will *never* result in an error return.
      
      Hopefully this fixes it, so that we always return the error
      if it's a primary superblock, otherwise only if the filesystem
      has CRCs enabled.
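The corrected predicate in miniature: a CRC failure on the primary superblock is always an error, while on a secondary it only counts when the filesystem has CRCs enabled:

```c
#include <assert.h>
#include <stdbool.h>

static bool crc_failure_is_error(bool is_primary, bool crcs_enabled)
{
	return is_primary || crcs_enabled;
}
```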
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  12. 10 Feb 2014, 2 commits