1. 14 11月, 2014 1 次提交
  2. 07 11月, 2014 6 次提交
    • D
      xfs: track bulkstat progress by agino · 00275899
      Dave Chinner 提交于
      The bulkstat main loop progress is tracked by the "lastino"
      variable, which is a full 64 bit inode. However, the loop actually
      works on agno/agino pairs, and so there's a significant disconnect
      between the rest of the loop and the main cursor. Convert this to
      use the agino, and pass the agino into the chunk formatting function
      and convert it too.
      
      This gets rid of the inconsistency in the loop processing, and
      finally makes it simple for us to skip inodes at any point in the
      loop simply by incrementing the agino cursor.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      00275899
    • D
      xfs: bulkstat error handling is broken · febe3cbe
      Dave Chinner 提交于
      The error propagation is a horror - xfs_bulkstat() returns
      a rval variable which is only set if there are formatter errors. Any
      sort of btree walk error or corruption will cause the bulkstat walk
      to terminate but will not pass an error back to userspace. Worse
      is the fact that formatter errors will also be ignored if any inodes
      were correctly formatted into the user buffer.
      
      Hence bulkstat can fail badly yet still report success to userspace.
      This causes significant issues with xfsdump not dumping everything
      in the filesystem yet reporting success. It's not until a restore
      fails that there is any indication that the dump was bad and tha
      bulkstat failed. This patch now triggers xfsdump to fail with
      bulkstat errors rather than silently missing files in the dump.
      
      This now causes bulkstat to fail when the lastino cookie does not
      fall inside an existing inode chunk. The pre-3.17 code tolerated
      that error by allowing the code to move to the next inode chunk
      as the agino target is guaranteed to fall into the next btree
      record.
      
      With the fixes up to this point in the series, xfsdump now passes on
      the troublesome filesystem image that exposes all these bugs.
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      febe3cbe
    • D
      xfs: bulkstat main loop logic is a mess · 6e57c542
      Dave Chinner 提交于
      There are a bunch of variables tha tare more wildy scoped than they
      need to be, obfuscated user buffer checks and tortured "next inode"
      tracking. This all needs cleaning up to expose the real issues that
      need fixing.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      6e57c542
    • D
      xfs: bulkstat chunk-formatter has issues · 2b831ac6
      Dave Chinner 提交于
      The loop construct has issues:
      	- clustidx is completely unused, so remove it.
      	- the loop tries to be smart by terminating when the
      	  "freecount" tells it that all inodes are free. Just drop
      	  it as in most cases we have to scan all inodes in the
      	  chunk anyway.
      	- move the "user buffer left" condition check to the only
      	  point where we consume space int eh user buffer.
      	- move the initialisation of agino out of the loop, leaving
      	  just a simple loop control logic using the clusteridx.
      
      Also, double handling of the user buffer variables leads to problems
      tracking the current state - use the cursor variables directly
      rather than keeping local copies and then having to update the
      cursor before returning.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      2b831ac6
    • D
      xfs: bulkstat chunk formatting cursor is broken · bf4a5af2
      Dave Chinner 提交于
      The xfs_bulkstat_agichunk formatting cursor takes buffer values from
      the main loop and passes them via the structure to the chunk
      formatter, and the writes the changed values back into the main loop
      local variables. Unfortunately, this complex dance is full of corner
      cases that aren't handled correctly.
      
      The biggest problem is that it is double handling the information in
      both the main loop and the chunk formatting function, leading to
      inconsistent updates and endless loops where progress is not made.
      
      To fix this, push the struct xfs_bulkstat_agichunk outwards to be
      the primary holder of user buffer information. this removes the
      double handling in the main loop.
      
      Also, pass the last inode processed by the chunk formatter as a
      separate parameter as it purely an output variable and is not
      related to the user buffer consumption cursor.
      
      Finally, the chunk formatting code is not shared by anyone, so make
      it local to xfs_itable.c.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      bf4a5af2
    • D
      xfs: bulkstat btree walk doesn't terminate · afa947cb
      Dave Chinner 提交于
      The bulkstat code has several different ways of detecting the end of
      an AG when doing a walk. They are not consistently detected, and the
      code that checks for the end of AG conditions is not consistently
      coded. Hence the are conditions where the walk code can get stuck in
      an endless loop making no progress and not triggering any
      termination conditions.
      
      Convert all the "tmp/i" status return codes from btree operations
      to a common name (stat) and apply end-of-ag detection to these
      operations consistently.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      afa947cb
  3. 06 11月, 2014 1 次提交
  4. 05 11月, 2014 1 次提交
  5. 04 11月, 2014 1 次提交
  6. 01 11月, 2014 1 次提交
  7. 31 10月, 2014 3 次提交
    • D
      Return short read or 0 at end of a raw device, not EIO · b2de525f
      David Jeffery 提交于
      Author: David Jeffery <djeffery@redhat.com>
      Changes to the basic direct I/O code have broken the raw driver when reading
      to the end of a raw device.  Instead of returning a short read for a read that
      extends partially beyond the device's end or 0 when at the end of the device,
      these reads now return EIO.
      
      The raw driver needs the same end of device handling as was added for normal
      block devices.  Using blkdev_read_iter, which has the needed size checks,
      prevents the EIO conditions at the end of the device.
      Signed-off-by: NDavid Jeffery <djeffery@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b2de525f
    • A
      isofs: don't bother with ->d_op for normal case · b0afd8e5
      Al Viro 提交于
      we only need it for joliet and case-insensitive mounts
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b0afd8e5
    • E
      fs: allow open(dir, O_TMPFILE|..., 0) with mode 0 · 69a91c23
      Eric Rannaud 提交于
      The man page for open(2) indicates that when O_CREAT is specified, the
      'mode' argument applies only to future accesses to the file:
      
      	Note that this mode applies only to future accesses of the newly
      	created file; the open() call that creates a read-only file
      	may well return a read/write file descriptor.
      
      The man page for open(2) implies that 'mode' is treated identically by
      O_CREAT and O_TMPFILE.
      
      O_TMPFILE, however, behaves differently:
      
      	int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
      	assert(fd == -1);
      	assert(errno == EACCES);
      
      	int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
      	assert(fd > 0);
      
      For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:
      
      	if (*opened & FILE_CREATED) {
      		/* Don't check for write permission, don't truncate */
      		open_flag &= ~O_TRUNC;
      		will_truncate = false;
      		acc_mode = MAY_OPEN;
      		path_to_nameidata(path, nd);
      		goto finish_open_created;
      	}
      
      But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to
      may_open().
      
      This patch lines up the behavior of O_TMPFILE with O_CREAT. After the
      inode is created, may_open() is called with acc_mode = MAY_OPEN, in
      do_tmpfile().
      
      A different, but related glibc bug revealed the discrepancy:
      https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      
      The glibc lazily loads the 'mode' argument of open() and openat() using
      va_arg() only if O_CREAT is present in 'flags' (to support both the 2
      argument and the 3 argument forms of open; same idea for openat()).
      However, the glibc ignores the 'mode' argument if O_TMPFILE is in
      'flags'.
      
      On x86_64, for open(), it magically works anyway, as 'mode' is in
      RDX when entering open(), and is still in RDX on SYSCALL, which is where
      the kernel looks for the 3rd argument of a syscall.
      
      But openat() is not quite so lucky: 'mode' is in RCX when entering the
      glibc wrapper for openat(), while the kernel looks for the 4th argument
      of a syscall in R10. Indeed, the syscall calling convention differs from
      the regular calling convention in this respect on x86_64. So the kernel
      sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and
      fails with EACCES.
      Signed-off-by: NEric Rannaud <e@nanocritical.com>
      Acked-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69a91c23
  8. 30 10月, 2014 14 次提交
    • J
      ext4: make ext4_ext_convert_to_initialized() return proper number of blocks · ae9e9c6a
      Jan Kara 提交于
      ext4_ext_convert_to_initialized() can return more blocks than are
      actually allocated from map->m_lblk in case where initial part of the
      on-disk extent is zeroed out. Luckily this doesn't have serious
      consequences because the caller currently uses the return value
      only to unmap metadata buffers. Anyway this is a data
      corruption/exposure problem waiting to happen so fix it.
      
      Coverity-id: 1226848
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ae9e9c6a
    • J
      ext4: bail early when clearing inode journal flag fails · 4f879ca6
      Jan Kara 提交于
      When clearing inode journal flag, we call jbd2_journal_flush() to force
      all the journalled data to their final locations. Currently we ignore
      when this fails and continue clearing inode journal flag. This isn't a
      big problem because when jbd2_journal_flush() fails, journal is likely
      aborted anyway. But it can still lead to somewhat confusing results so
      rather bail out early.
      
      Coverity-id: 989044
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      4f879ca6
    • J
      ext4: bail out from make_indexed_dir() on first error · 6050d47a
      Jan Kara 提交于
      When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node()
      fail, there's really something wrong with the fs and there's no point in
      continuing further. Just return error from make_indexed_dir() in that
      case. Also initialize frames array so that if we return early due to
      error, dx_release() doesn't try to dereference uninitialized memory
      (which could happen also due to error in do_split()).
      
      Coverity-id: 741300
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      6050d47a
    • T
      jbd2: use a better hash function for the revoke table · d48458d4
      Theodore Ts'o 提交于
      The old hash function didn't work well for 64-bit block numbers, and
      used undefined (negative) shift right behavior.  Use the generic
      64-bit hash function instead.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reported-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      d48458d4
    • D
      ext4: prevent bugon on race between write/fcntl · a41537e6
      Dmitry Monakhov 提交于
      O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked
      twice inside ext4_file_write_iter() and __generic_file_write() which
      result in BUG_ON inside ext4_direct_IO.
      
      Let's initialize iocb->private unconditionally.
      
      TESTCASE: xfstest:generic/036  https://patchwork.ozlabs.org/patch/402445/
      
      #TYPICAL STACK TRACE:
      kernel BUG at fs/ext4/inode.c:2960!
      invalid opcode: 0000 [#1] SMP
      Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
      CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161
      Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
      task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000
      RIP: 0010:[<ffffffff811fabf2>]  [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0
      RSP: 0018:ffff88080f90bb58  EFLAGS: 00010246
      RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818
      RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001
      RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80
      R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28
      FS:  00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0
      Stack:
       ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30
       0000000000000200 0000000000000200 0000000000000001 0000000000000200
       ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200
      Call Trace:
       [<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180
       [<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370
       [<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400
       [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410
       [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410
       [<ffffffff810990e5>] ? local_clock+0x25/0x30
       [<ffffffff810abd94>] ? __lock_acquire+0x274/0x700
       [<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0
       [<ffffffff811bd756>] aio_run_iocb+0x286/0x410
       [<ffffffff810990e5>] ? local_clock+0x25/0x30
       [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190
       [<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0
       [<ffffffff811bde3b>] do_io_submit+0x55b/0x740
       [<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740
       [<ffffffff811be030>] SyS_io_submit+0x10/0x20
       [<ffffffff815ce192>] system_call_fastpath+0x16/0x1b
      Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c
      24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89
      RIP  [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0
       RSP <ffff88080f90bb58>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Cc: stable@vger.kernel.org
      a41537e6
    • D
      ext4: remove extent status procfs files if journal load fails · 50460fe8
      Darrick J. Wong 提交于
      If we can't load the journal, remove the procfs files for the extent
      status information file to avoid leaking resources.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      50460fe8
    • D
      ext4: disallow changing journal_csum option during remount · 6b992ff2
      Darrick J. Wong 提交于
      ext4 does not permit changing the metadata or journal checksum feature
      flag while mounted.  Until we decide to support that, don't allow a
      remount to change the journal_csum flag (right now we silently fail to
      change anything).
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6b992ff2
    • D
      ext4: enable journal checksum when metadata checksum feature enabled · 98c1a759
      Darrick J. Wong 提交于
      If metadata checksumming is turned on for the FS, we need to tell the
      journal to use checksumming too.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      98c1a759
    • J
      ext4: fix oops when loading block bitmap failed · 599a9b77
      Jan Kara 提交于
      When we fail to load block bitmap in __ext4_new_inode() we will
      dereference NULL pointer in ext4_journal_get_write_access(). So check
      for error from ext4_read_block_bitmap().
      
      Coverity-id: 989065
      Cc: stable@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      599a9b77
    • J
      ext4: fix overflow when updating superblock backups after resize · 9378c676
      Jan Kara 提交于
      When there are no meta block groups update_backups() will compute the
      backup block in 32-bit arithmetics thus possibly overflowing the block
      number and corrupting the filesystem. OTOH filesystems without meta
      block groups larger than 16 TB should be rare. Fix the problem by doing
      the counting in 64-bit arithmetics.
      
      Coverity-id: 741252
      CC: stable@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      9378c676
    • B
      xfs: rework zero range to prevent invalid i_size updates · 5d11fb4b
      Brian Foster 提交于
      The zero range operation is analogous to fallocate with the exception of
      converting the range to zeroes. E.g., it attempts to allocate zeroed
      blocks over the range specified by the caller. The XFS implementation
      kills all delalloc blocks currently over the aligned range, converts the
      range to allocated zero blocks (unwritten extents) and handles the
      partial pages at the ends of the range by sending writes through the
      pagecache.
      
      The current implementation suffers from several problems associated with
      inode size. If the aligned range covers an extending I/O, said I/O is
      discarded and an inode size update from a previous write never makes it
      to disk. Further, if an unaligned zero range extends beyond eof, the
      page write induced for the partial end page can itself increase the
      inode size, even if the zero range request is not supposed to update
      i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
      
      The latter behavior not only incorrectly increases the inode size, but
      can lead to stray delalloc blocks on the inode. Typically, post-eof
      preallocation blocks are either truncated on release or inode eviction
      or explicitly written to by xfs_zero_eof() on natural file size
      extension. If the inode size increases due to zero range, however,
      associated blocks leak into the address space having never been
      converted or mapped to pagecache pages. A direct I/O to such an
      uncovered range cannot convert the extent via writeback and will BUG().
      For example:
      
      $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
      ...
      $ xfs_io -d -c "pread 128k 128k" <file>
      <BUG>
      
      If the entire delalloc extent happens to not have page coverage
      whatsoever (e.g., delalloc conversion couldn't find a large enough free
      space extent), even a full file writeback won't convert what's left of
      the extent and we'll assert on inode eviction.
      
      Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
      Use the existing hole punch and prealloc mechanisms as primitives for
      zero range. This implementation is not efficient nor ideal as we
      writeback dirty data over the range and remove existing extents rather
      than convert to unwrittern. The former writeback, however, is currently
      the only mechanism available to ensure consistency between pagecache and
      extent state. Even a pagecache truncate/delalloc punch prior to hole
      punch has lead to inconsistencies due to racing with writeback.
      
      This provides a consistent, correct implementation of zero range that
      survives fsstress/fsx testing without assert failures. The
      implementation can be optimized from this point forward once the
      fundamental issue of pagecache and delalloc extent state consistency is
      addressed.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      5d11fb4b
    • J
      xfs: Check error during inode btree iteration in xfs_bulkstat() · 7a19dee1
      Jan Kara 提交于
      xfs_bulkstat() doesn't check error return from xfs_btree_increment(). In
      case of specific fs corruption that could result in xfs_bulkstat()
      entering an infinite loop because we would be looping over the same
      chunk over and over again. Fix the problem by checking the return value
      and terminating the loop properly.
      
      Coverity-id: 1231338
      cc: <stable@vger.kernel.org>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJie Liu <jeff.u.liu@gmail.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      7a19dee1
    • R
      ocfs2: fix d_splice_alias() return code checking · d3556bab
      Richard Weinberger 提交于
      d_splice_alias() can return a valid dentry, NULL or an ERR_PTR.
      Currently the code checks not for ERR_PTR and will cuase an oops in
      ocfs2_dentry_attach_lock().  Fix this by using IS_ERR_OR_NULL().
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3556bab
    • J
      fsnotify: next_i is freed during fsnotify_unmount_inodes. · 6424babf
      Jerry Hoemann 提交于
      During file system stress testing on 3.10 and 3.12 based kernels, the
      umount command occasionally hung in fsnotify_unmount_inodes in the
      section of code:
      
                      spin_lock(&inode->i_lock);
                      if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
                              spin_unlock(&inode->i_lock);
                              continue;
                      }
      
      As this section of code holds the global inode_sb_list_lock, eventually
      the system hangs trying to acquire the lock.
      
      Multiple crash dumps showed:
      
      The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point
      back at itself.  As this is not the value of list upon entry to the
      function, the kernel never exits the loop.
      
      To help narrow down problem, the call to list_del_init in
      inode_sb_list_del was changed to list_del.  This poisons the pointers in
      the i_sb_list and causes a kernel to panic if it transverse a freed
      inode.
      
      Subsequent stress testing paniced in fsnotify_unmount_inodes at the
      bottom of the list_for_each_entry_safe loop showing next_i had become
      free.
      
      We believe the root cause of the problem is that next_i is being freed
      during the window of time that the list_for_each_entry_safe loop
      temporarily releases inode_sb_list_lock to call fsnotify and
      fsnotify_inode_delete.
      
      The code in fsnotify_unmount_inodes attempts to prevent the freeing of
      inode and next_i by calling __iget.  However, the code doesn't do the
      __iget call on next_i
      
      	if i_count == 0 or
      	if i_state & (I_FREEING | I_WILL_FREE)
      
      The patch addresses this issue by advancing next_i in the above two cases
      until we either find a next_i which we can __iget or we reach the end of
      the list.  This makes the handling of next_i more closely match the
      handling of the variable "inode."
      
      The time to reproduce the hang is highly variable (from hours to days.) We
      ran the stress test on a 3.10 kernel with the proposed patch for a week
      without failure.
      
      During list_for_each_entry_safe, next_i is becoming free causing
      the loop to never terminate.  Advance next_i in those cases where
      __iget is not done.
      Signed-off-by: NJerry Hoemann <jerry.hoemann@hp.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Ken Helias <kenhelias@firemail.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6424babf
  9. 29 10月, 2014 6 次提交
  10. 28 10月, 2014 3 次提交
  11. 25 10月, 2014 3 次提交