1. 19 12月, 2014 1 次提交
    • J
      ocfs2: fix journal commit deadlock · 136f49b9
      Junxiao Bi 提交于
      For buffer write, page lock will be got in write_begin and released in
      write_end, in ocfs2_write_end_nolock(), before it unlock the page in
      ocfs2_free_write_ctxt(), it calls ocfs2_run_deallocs(), this will ask
      for the read lock of journal->j_trans_barrier.  Holding page lock and
      ask for journal->j_trans_barrier breaks the locking order.
      
      This will cause a deadlock with journal commit threads, ocfs2cmt will
      get write lock of journal->j_trans_barrier first, then it wakes up
      kjournald2 to do the commit work, at last it waits until done.  To
      commit journal, kjournald2 needs flushing data first, it needs get the
      cache page lock.
      
      Since some ocfs2 cluster locks are holding by write process, this
      deadlock may hung the whole cluster.
      
      unlock pages before ocfs2_run_deallocs() can fix the locking order, also
      put unlock before ocfs2_commit_trans() to make page lock is unlocked
      before j_trans_barrier to preserve unlocking order.
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: NWengang Wang <wen.gang.wang@oracle.com>
      Cc: <stable@vger.kernel.org>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      136f49b9
  2. 11 12月, 2014 1 次提交
  3. 10 10月, 2014 1 次提交
    • J
      ocfs2: fix deadlock due to wrong locking order · f775da2f
      Junxiao Bi 提交于
      For commit ocfs2 journal, ocfs2 journal thread will acquire the mutex
      osb->journal->j_trans_barrier and wake up jbd2 commit thread, then it
      will wait until jbd2 commit thread done. In order journal mode, jbd2
      needs flushing dirty data pages first, and this needs get page lock.
      So osb->journal->j_trans_barrier should be got before page lock.
      
      But ocfs2_write_zero_page() and ocfs2_write_begin_inline() obey this
      locking order, and this will cause deadlock and hung the whole cluster.
      
      One deadlock catched is the following:
      
      PID: 13449  TASK: ffff8802e2f08180  CPU: 31  COMMAND: "oracle"
       #0 [ffff8802ee3f79b0] __schedule at ffffffff8150a524
       #1 [ffff8802ee3f7a58] schedule at ffffffff8150acbf
       #2 [ffff8802ee3f7a68] rwsem_down_failed_common at ffffffff8150cb85
       #3 [ffff8802ee3f7ad8] rwsem_down_read_failed at ffffffff8150cc55
       #4 [ffff8802ee3f7ae8] call_rwsem_down_read_failed at ffffffff812617a4
       #5 [ffff8802ee3f7b50] ocfs2_start_trans at ffffffffa0498919 [ocfs2]
       #6 [ffff8802ee3f7ba0] ocfs2_zero_start_ordered_transaction at ffffffffa048b2b8 [ocfs2]
       #7 [ffff8802ee3f7bf0] ocfs2_write_zero_page at ffffffffa048e9bd [ocfs2]
       #8 [ffff8802ee3f7c80] ocfs2_zero_extend_range at ffffffffa048ec83 [ocfs2]
       #9 [ffff8802ee3f7ce0] ocfs2_zero_extend at ffffffffa048edfd [ocfs2]
       #10 [ffff8802ee3f7d50] ocfs2_extend_file at ffffffffa049079e [ocfs2]
       #11 [ffff8802ee3f7da0] ocfs2_setattr at ffffffffa04910ed [ocfs2]
       #12 [ffff8802ee3f7e70] notify_change at ffffffff81187d29
       #13 [ffff8802ee3f7ee0] do_truncate at ffffffff8116bbc1
       #14 [ffff8802ee3f7f50] sys_ftruncate at ffffffff8116bcbd
       #15 [ffff8802ee3f7f80] system_call_fastpath at ffffffff81515142
          RIP: 00007f8de750c6f7  RSP: 00007fffe786e478  RFLAGS: 00000206
          RAX: 000000000000004d  RBX: ffffffff81515142  RCX: 0000000000000000
          RDX: 0000000000000200  RSI: 0000000000028400  RDI: 000000000000000d
          RBP: 00007fffe786e040   R8: 0000000000000000   R9: 000000000000000d
          R10: 0000000000000000  R11: 0000000000000206  R12: 000000000000000d
          R13: 00007fffe786e710  R14: 00007f8de70f8340  R15: 0000000000028400
          ORIG_RAX: 000000000000004d  CS: 0033  SS: 002b
      
      crash64> bt
      PID: 7610   TASK: ffff88100fd56140  CPU: 1   COMMAND: "ocfs2cmt"
       #0 [ffff88100f4d1c50] __schedule at ffffffff8150a524
       #1 [ffff88100f4d1cf8] schedule at ffffffff8150acbf
       #2 [ffff88100f4d1d08] jbd2_log_wait_commit at ffffffffa01274fd [jbd2]
       #3 [ffff88100f4d1d98] jbd2_journal_flush at ffffffffa01280b4 [jbd2]
       #4 [ffff88100f4d1dd8] ocfs2_commit_cache at ffffffffa0499b14 [ocfs2]
       #5 [ffff88100f4d1e38] ocfs2_commit_thread at ffffffffa0499d38 [ocfs2]
       #6 [ffff88100f4d1ee8] kthread at ffffffff81090db6
       #7 [ffff88100f4d1f48] kernel_thread_helper at ffffffff81516284
      
      crash64> bt
      PID: 7609   TASK: ffff88100f2d4480  CPU: 0   COMMAND: "jbd2/dm-20-86"
       #0 [ffff88100def3920] __schedule at ffffffff8150a524
       #1 [ffff88100def39c8] schedule at ffffffff8150acbf
       #2 [ffff88100def39d8] io_schedule at ffffffff8150ad6c
       #3 [ffff88100def39f8] sleep_on_page at ffffffff8111069e
       #4 [ffff88100def3a08] __wait_on_bit_lock at ffffffff8150b30a
       #5 [ffff88100def3a58] __lock_page at ffffffff81110687
       #6 [ffff88100def3ab8] write_cache_pages at ffffffff8111b752
       #7 [ffff88100def3be8] generic_writepages at ffffffff8111b901
       #8 [ffff88100def3c48] journal_submit_data_buffers at ffffffffa0120f67 [jbd2]
       #9 [ffff88100def3cf8] jbd2_journal_commit_transaction at ffffffffa0121372[jbd2]
       #10 [ffff88100def3e68] kjournald2 at ffffffffa0127a86 [jbd2]
       #11 [ffff88100def3ee8] kthread at ffffffff81090db6
       #12 [ffff88100def3f48] kernel_thread_helper at ffffffff81516284
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Alex Chen <alex.chen@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f775da2f
  4. 07 5月, 2014 2 次提交
  5. 04 4月, 2014 2 次提交
  6. 13 11月, 2013 4 次提交
  7. 12 9月, 2013 1 次提交
  8. 04 9月, 2013 1 次提交
    • C
      direct-io: Implement generic deferred AIO completions · 7b7a8665
      Christoph Hellwig 提交于
      Add support to the core direct-io code to defer AIO completions to user
      context using a workqueue.  This replaces opencoded and less efficient
      code in XFS and ext4 (we save a memory allocation for each direct IO)
      and will be needed to properly support O_(D)SYNC for AIO.
      
      The communication between the filesystem and the direct I/O code requires
      a new buffer head flag, which is a bit ugly but not avoidable until the
      direct I/O code stops abusing the buffer_head structure for communicating
      with the filesystems.
      
      Currently this creates a per-superblock unbound workqueue for these
      completions, which is taken from an earlier patch by Jan Kara.  I'm
      not really convinced about this use and would prefer a "normal" global
      workqueue with a high concurrency limit, but this needs further discussion.
      
      JK: Fixed ext4 part, dynamic allocation of the workqueue.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7b7a8665
  9. 14 8月, 2013 1 次提交
  10. 22 5月, 2013 3 次提交
    • L
      ocfs2: use ->invalidatepage() length argument · e5f8d30d
      Lukas Czerner 提交于
      ->invalidatepage() aop now accepts range to invalidate so we can make
      use of it in ocfs2_invalidatepage().
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: NJoel Becker <jlbec@evilplan.org>
      e5f8d30d
    • L
      jbd2: change jbd2_journal_invalidatepage to accept length · 259709b0
      Lukas Czerner 提交于
      invalidatepage now accepts range to invalidate and there are two file
      system using jbd2 also implementing punch hole feature which can benefit
      from this. We need to implement the same thing for jbd2 layer in order to
      allow those file system take benefit of this functionality.
      
      This commit adds length argument to the jbd2_journal_invalidatepage()
      and updates all instances in ext4 and ocfs2.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      259709b0
    • L
      mm: change invalidatepage prototype to accept length · d47992f8
      Lukas Czerner 提交于
      Currently there is no way to truncate partial page where the end
      truncate point is not at the end of the page. This is because it was not
      needed and the functionality was enough for file system truncate
      operation to work properly. However more file systems now support punch
      hole feature and it can benefit from mm supporting truncating page just
      up to the certain point.
      
      Specifically, with this functionality truncate_inode_pages_range() can
      be changed so it supports truncating partial page at the end of the
      range (currently it will BUG_ON() if 'end' is not at the end of the
      page).
      
      This commit changes the invalidatepage() address space operation
      prototype to accept range to be invalidated and update all the instances
      for it.
      
      We also change the block_invalidatepage() in the same way and actually
      make a use of the new length argument implementing range invalidation.
      
      Actual file system implementations will follow except the file systems
      where the changes are really simple and should not change the behaviour
      in any way .Implementation for truncate_page_range() which will be able
      to accept page unaligned ranges will follow as well.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      d47992f8
  11. 26 2月, 2013 1 次提交
  12. 23 2月, 2013 1 次提交
  13. 22 2月, 2013 1 次提交
  14. 20 3月, 2012 1 次提交
  15. 28 7月, 2011 2 次提交
    • J
      ocfs2: Avoid livelock in ocfs2_readpage() · c7e25e6e
      Jan Kara 提交于
      When someone writes to an inode, readers accessing the same inode via
      ocfs2_readpage() just busyloop trying to get ip_alloc_sem because
      do_generic_file_read() looks up the page again and retries ->readpage()
      when previous attempt failed with AOP_TRUNCATED_PAGE. When there are enough
      readers, they can occupy all CPUs and in non-preempt kernel the system is
      deadlocked because writer holding ip_alloc_sem is never run to release the
      semaphore. Fix the problem by making reader block on ip_alloc_sem to break
      the busy loop.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJoel Becker <jlbec@evilplan.org>
      c7e25e6e
    • M
      ocfs2: serialize unaligned aio · a11f7e63
      Mark Fasheh 提交于
      Fix a corruption that can happen when we have (two or more) outstanding
      aio's to an overlapping unaligned region.  Ext4
      (e9e3bcec) and xfs recently had to fix
      similar issues.
      
      In our case what happens is that we can have an outstanding aio on a region
      and if a write comes in with some bytes overlapping the original aio we may
      decide to read that region into a page before continuing (typically because
      of buffered-io fallback).  Since we have no ordering guarantees with the
      aio, we can read stale or bad data into the page and then write it back out.
      
      If the i/o is page and block aligned, then we avoid this issue as there
      won't be any need to read data from disk.
      
      I took the same approach as Eric in the ext4 patch and introduced some
      serialization of unaligned async direct i/o.  I don't expect this to have an
      effect on the most common cases of AIO.  Unaligned aio will be slower
      though, but that's far more acceptable than data corruption.
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      Signed-off-by: NJoel Becker <jlbec@evilplan.org>
      a11f7e63
  16. 25 7月, 2011 1 次提交
  17. 21 7月, 2011 3 次提交
    • C
      fs: move inode_dio_done to the end_io handler · 72c5052d
      Christoph Hellwig 提交于
      For filesystems that delay their end_io processing we should keep our
      i_dio_count until the the processing is done.  Enable this by moving
      the inode_dio_done call to the end_io handler if one exist.  Note that
      the actual move to the workqueue for ext4 and XFS is not done in
      this patch yet, but left to the filesystem maintainers.  At least
      for XFS it's not needed yet either as XFS has an internal equivalent
      to i_dio_count.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      72c5052d
    • C
      fs: always maintain i_dio_count · df2d6f26
      Christoph Hellwig 提交于
      Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
      This these filesystems to also protect truncate against direct I/O requests
      by using common code.  Right now the only non-DIO_LOCKING filesystem that
      appears to do so is XFS, which uses an opencoded variant of the i_dio_count
      scheme.
      
      Behaviour doesn't change for filesystems never calling inode_dio_wait.
      For ext4 behaviour changes when using the dioread_nonlock option, which
      previously was missing any protection between truncate and direct I/O reads.
      For ocfs2 that handcrafted i_dio_count manipulations are replaced with
      the common code now enable.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      df2d6f26
    • C
      fs: kill i_alloc_sem · bd5fe6c5
      Christoph Hellwig 提交于
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
      be released by a non-owner, and it's write side is always mirrored by
      real exclusion.  It's intended use it to wait for all pending direct I/O
      requests to finish before starting a truncate.
      
      Replace it with a hand-grown construct:
      
       - exclusion for truncates is already guaranteed by i_mutex, so it can
         simply fall way
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
       - when i_dio_count reaches non-zero we wake up a pending truncate using
         wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to read zero because the direct I/O count always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
      
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
      system).
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      bd5fe6c5
  18. 29 3月, 2011 1 次提交
  19. 10 3月, 2011 1 次提交
  20. 22 2月, 2011 1 次提交
  21. 07 3月, 2011 1 次提交
    • T
      ocfs2: Remove EXIT from masklog. · c1e8d35e
      Tao Ma 提交于
      mlog_exit is used to record the exit status of a function.
      But because it is added in so many functions, if we enable it,
      the system logs get filled up quickly and cause too much I/O.
      So actually no one can open it for a production system or even
      for a test.
      
      This patch just try to remove it or change it. So:
      1. if all the error paths already use mlog_errno, it is just removed.
         Otherwise, it will be replaced by mlog_errno.
      2. if it is used to print some return value, it is replaced with
         mlog(0,...).
      mlog_exit_ptr is changed to mlog(0.
      All those mlog(0,...) will be replaced with trace events later.
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      c1e8d35e
  22. 21 2月, 2011 1 次提交
    • T
      ocfs2: Remove ENTRY from masklog. · ef6b689b
      Tao Ma 提交于
      ENTRY is used to record the entry of a function.
      But because it is added in so many functions, if we enable it,
      the system logs get filled up quickly and cause too much I/O.
      So actually no one can open it for a production system or even
      for a test.
      
      So for mlog_entry_void, we just remove it.
      for mlog_entry(...), we replace it with mlog(0,...), and they
      will be replace by trace event later.
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      ef6b689b
  23. 16 12月, 2010 1 次提交
    • T
      ocfs2: Try to free truncate log when meeting ENOSPC in write. · 50308d81
      Tao Ma 提交于
      Recently, one of our colleagues meet with a problem that if we
      write/delete a 32mb files repeatly, we will get an ENOSPC in
      the end. And the corresponding bug is 1288.
      http://oss.oracle.com/bugzilla/show_bug.cgi?id=1288
      
      The real problem is that although we have freed the clusters,
      they are in truncate log and they will be summed up so that
      we can free them once in a whole.
      
      So this patch just try to resolve it. In case we see -ENOSPC
      in ocfs2_write_begin_no_lock, we will check whether the truncate
      log has enough clusters for our need, if yes, we will try to
      flush the truncate log at that point and try again. This method
      is inspired by Mark Fasheh <mfasheh@suse.com>. Thanks.
      
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      50308d81
  24. 10 12月, 2010 1 次提交
  25. 26 10月, 2010 1 次提交
  26. 10 9月, 2010 1 次提交
  27. 12 8月, 2010 2 次提交
  28. 10 8月, 2010 1 次提交
    • C
      sort out blockdev_direct_IO variants · eafdc7d1
      Christoph Hellwig 提交于
      Move the call to vmtruncate to get rid of accessive blocks to the callers
      in prepearation of the new truncate calling sequence.  This was only done
      for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
      was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
      its _newtrunc variant while at it as just opencoding the two additional
      paramters is shorted than the name suffix.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      eafdc7d1
  29. 27 7月, 2010 1 次提交
    • C
      direct-io: move aio_complete into ->end_io · 552ef802
      Christoph Hellwig 提交于
      Filesystems with unwritten extent support must not complete an AIO request
      until the transaction to convert the extent has been commited.  That means
      the aio_complete calls needs to be moved into the ->end_io callback so
      that the filesystem can control when to call it exactly.
      
      This makes a bit of a mess out of dio_complete and the ->end_io callback
      prototype even more complicated. 
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz> 
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      552ef802