1. 10 Jun, 2009 1 commit
    • jbd: fix race in buffer processing in commit code · a61d90d7
      Jan Kara committed
      In commit code, we scan buffers attached to a transaction.  During this
      scan, we sometimes have to drop j_list_lock and then we recheck whether
      the journal buffer head didn't get freed by journal_try_to_free_buffers().
       But checking for buffer_jbd(bh) isn't enough because a new journal head
      could get attached to our buffer head.  So add a check whether the journal
      head remained the same and whether it's still at the same transaction and
      list.
      
      This is a nasty bug and can cause problems like memory corruption (use after
      free) or trigger various assertions in JBD code (observed).
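
The recheck described above can be sketched in user-space C as follows. This is only an illustration under stated assumptions: the structures are simplified stand-ins for the kernel ones, and jh_still_valid() is a hypothetical helper name, not a function from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumptions, not the real layouts). */
struct transaction { int id; };

struct journal_head {
    struct transaction *b_transaction;  /* transaction that owns the buffer */
    int b_jlist;                        /* which per-transaction list it sits on */
};

struct buffer_head {
    struct journal_head *b_private;     /* NULL models !buffer_jbd(bh) */
};

/*
 * Hypothetical helper, meant to be called after j_list_lock has been retaken.
 * jh, t and jlist are the values sampled before the lock was dropped.
 */
static bool jh_still_valid(const struct buffer_head *bh,
                           const struct journal_head *jh,
                           const struct transaction *t, int jlist)
{
    assert(bh != NULL);
    if (bh->b_private == NULL)   /* journal head was freed meanwhile */
        return false;
    if (bh->b_private != jh)     /* a new journal head got attached; the old */
        return false;            /* jh may be freed, so never dereference it */
    /* Same journal head: it must still be on the same transaction and list. */
    return jh->b_transaction == t && jh->b_jlist == jlist;
}
```

The pointer comparison comes before any access to jh on purpose: if the buffer head now points at a different journal head, the old one may already be freed memory.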
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: <stable@kernel.org>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 14 Apr, 2009 1 commit
  3. 06 Apr, 2009 1 commit
  4. 28 Mar, 2009 1 commit
  5. 09 Jan, 2009 1 commit
    • jbd: improve fsync batching · f420d4dc
      Josef Bacik committed
      There is a flaw with the way jbd handles fsync batching.  If we fsync() a
      file and we were not the last person to run fsync() on this fs then we
      automatically sleep for 1 jiffie in order to wait for new writers to join
      into the transaction before forcing the commit.  The problem with this is
      that with really fast storage (ie a Clariion) the time it takes to commit
      a transaction to disk is way faster than 1 jiffie in most cases, so
      sleeping means waiting longer with nothing to do than if we just committed
      the transaction and kept going.  Ric Wheeler noticed this when using
      fs_mark with more than 1 thread, the throughput would plummet as he added
      more threads.
      
      This patch attempts to fix this problem by recording the average time in
      nanoseconds that it takes to commit a transaction to disk, and what time
      we started the transaction.  If we run an fsync() and we have been running
      for less time than it takes to commit the transaction to disk, we sleep
      for the delta amount of time and then commit to disk.  We achieve
      sub-jiffie sleeping using schedule_hrtimeout.  This means that the wait
      time is auto-tuned to the speed of the underlying disk, instead of having
      this static timeout.  I weighted the average according to somebody's
      comments (Andreas Dilger I think) in order to help normalize random
      outliers where we take way longer or way less time to commit than the
      average.  I also have a min() check in there to make sure we don't sleep
      longer than a jiffie in case our storage is super slow, this was requested
      by Andrew.
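
The arithmetic described in this paragraph can be sketched in plain C. This is a rough illustration, not the patch itself: the function names are hypothetical, and the 3:1 weighting toward the newest sample is an assumption, since the message only says the average is weighted.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fold a new commit-time sample into the running average (nanoseconds).
 * The 3:1 weighting is an assumed value chosen for illustration.
 */
static uint64_t update_avg_commit_ns(uint64_t avg_ns, uint64_t sample_ns)
{
    if (avg_ns == 0)            /* first sample: nothing to average against */
        return sample_ns;
    return (3 * sample_ns + avg_ns) / 4;
}

/*
 * How long an fsync() caller should wait for other writers to join:
 * the part of the (capped) average commit time that the transaction
 * has not yet been running for.  Never more than one jiffy.
 */
static uint64_t fsync_sleep_ns(uint64_t running_ns, uint64_t avg_commit_ns,
                               uint64_t jiffy_ns)
{
    assert(jiffy_ns > 0);
    uint64_t target = avg_commit_ns < jiffy_ns ? avg_commit_ns : jiffy_ns;
    return running_ns < target ? target - running_ns : 0;
}
```

With a jiffy of 4,000,000 ns, a transaction that has run for 100 ns on storage averaging 500 ns per commit would wait 400 ns, while very slow storage is capped at one jiffy.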
      
      I unfortunately do not have access to a Clariion, so I had to use a
      ramdisk to represent a super fast array.  I tested with a SATA drive with
      barrier=1 to make sure there was no regression with local disks, I tested
      with a 4 way multipathed Apple Xserve RAID array and of course the
      ramdisk.  I ran the following command
      
      fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i
      
      where $i was 2, 4, 8, 16 and 32.  I mkfs'ed the fs each time.  Here are my
      results
      
      type	threads		with patch	without patch
      sata	2		24.6		26.3
      sata	4		49.2		48.1
      sata	8		70.1		67.0
      sata	16		104.0		94.1
      sata	32		153.6		142.7
      
      xserve	2		246.4		222.0
      xserve	4		480.0		440.8
      xserve	8		829.5		730.8
      xserve	16		1172.7		1026.9
      xserve	32		1816.3		1650.5
      
      ramdisk	2		2538.3		1745.6
      ramdisk	4		2942.3		661.9
      ramdisk	8		2882.5		999.8
      ramdisk	16		2738.7		1801.9
      ramdisk	32		2541.9		2394.0
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Cc: Andreas Dilger <adilger@sun.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Ric Wheeler <rwheeler@redhat.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 20 Oct, 2008 3 commits
  7. 05 Aug, 2008 2 commits
  8. 26 Jul, 2008 2 commits
    • jbd: don't abort if flushing file data failed · cbe5f466
      Hidehiro Kawai committed
      In ordered mode, the current jbd aborts the journal if a file data buffer
      has an error.  But this behavior is unintended, and we found that it has
      been adopted accidentally.
      
      This patch undoes it and just calls printk() instead of aborting the
      journal.  Additionally, set AS_EIO into the address_space object of the
      failed buffer which is submitted by journal_do_submit_data() so that
      fsync() can get -EIO.
      
      Missing error checks are also added so that errors on file data buffers
      are reported to the user.  The following buffers are covered:
      
        (a) the buffer which has already been written out by pdflush
        (b) the buffer which has been unlocked before scanned in the
            t_locked_list loop
      
      [akpm@linux-foundation.org: improve grammar in a printk]
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • jbd: positively dispose the unmapped data buffers in journal_commit_transaction() · fc80c442
      Toshiyuki Okajima committed
      After ext3-ordered files are truncated, pages that cannot be accounted
      for may still remain.  The remaining pages can be released when the
      system is really low on memory, so this is not a memory leak.  But
      resource management software etc. may not work correctly.
      
      journal_unmap_buffer() may be unable to release the buffers, and the
      pages to which they belong, because they are attached to a committing
      transaction.  To release such buffers and pages later,
      journal_unmap_buffer() leaves the work to journal_commit_transaction().
      (journal_unmap_buffer() marks the buffers 'BH_Freed' so that
      journal_commit_transaction() can identify whether they can be released
      or not.)
      
      In journalled mode and writeback mode, jbd deals only with metadata
      buffers.  But in ordered mode, jbd deals with metadata buffers and also
      data buffers.
      
      Actually, journal_commit_transaction() releases only the metadata
      buffers whose release is requested by journal_unmap_buffer(), and also
      releases the pages to which they belong if possible.
      
      As a result, the data buffers whose release is requested by
      journal_unmap_buffer() remain after a transaction commits, and so do the
      pages to which they belong.
      
      Such remaining pages no longer have a mapping.  This is why pages that
      cannot be accounted for may remain.
      
      The metadata buffers marked 'BH_Freed' and the pages to which
      they belong can be released at 'JBD: commit phase 7'.
      
      Therefore, by applying the same code in 'JBD: commit phase 2' (where the
      data buffers are dealt with), journal_commit_transaction() can also
      release the data buffers marked 'BH_Freed' and the pages to which they
      belong.
      
      As a result, all the buffers marked 'BH_Freed' can be released, and all
      the pages to which these buffers belong can be released as well, in
      journal_commit_transaction().  So the pages that cannot be accounted for
      go away.
      
      <<Excerpt of code at 'JBD: commit phase 7'>>
       >         spin_lock(&journal->j_list_lock);
       >         while (commit_transaction->t_forget) {
       >                 transaction_t *cp_transaction;
       >                 struct buffer_head *bh;
       >
       >                 jh = commit_transaction->t_forget;
       >...
       >                 if (buffer_freed(bh)) {
       >                 ^^^^^^^^^^^^^^^^^^^^^^^^
       >                         clear_buffer_freed(bh);
       >                        ^^^^^^^^^^^^^^^^^^^^^^^^
       >                         clear_buffer_jbddirty(bh);
       >                 }
       >
       >                 if (buffer_jbddirty(bh)) {
       >                         JBUFFER_TRACE(jh, "add to new checkpointing trans");
       >                         __journal_insert_checkpoint(jh, commit_transaction);
       >                         JBUFFER_TRACE(jh, "refile for checkpoint writeback");
       >                         __journal_refile_buffer(jh);
       >                         jbd_unlock_bh_state(bh);
       >                 } else {
       >                         J_ASSERT_BH(bh, !buffer_dirty(bh));
       > ...
       >                         JBUFFER_TRACE(jh, "refile or unfile freed buffer");
       >                         __journal_refile_buffer(jh);
       >                         if (!jh->b_transaction) {
       >                                 jbd_unlock_bh_state(bh);
       >                                  /* needs a brelse */
       >                                 journal_remove_journal_head(bh);
       >                                 release_buffer_page(bh);
       >                                 ^^^^^^^^^^^^^^^^^^^^^^^^
       >                         } else
       >                 }
      ****************************************************************
      * Apply the code of "^^^^^^" lines into 'JBD: commit phase 2' *
      ****************************************************************
      
      At journal_commit_transaction() code, there is one extra message in the
      series of jbd debug messages.  ("JBD: commit phase 2") This patch fixes
      it, too.
      Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 15 May, 2008 1 commit
  10. 28 Apr, 2008 2 commits
    • jbd: fix possible journal overflow issues · 5b9a499d
      Josef Bacik committed
      There are several cases where the running transaction can get buffers added to
      its BJ_Metadata list which it never dirtied, which makes its t_nr_buffers
      counter end up larger than its t_outstanding_credits counter.
      
      This will cause issues when starting new transactions: while we are logging
      buffers we decrement t_outstanding_credits, so when t_outstanding_credits goes
      negative, we will report that we need less space in the journal than we
      actually need, so transactions will be started even though there may not be
      enough room for them.  In the worst case scenario (which admittedly is almost
      impossible to reproduce) this will result in the journal running out of space.
      
      The fix is to only refile buffers from the committing transaction to the
      running transaction's BJ_Metadata list when b_modified is set on that journal
      head, which is the only way to be sure that the running transaction has
      modified that buffer.
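
A minimal user-space sketch of this refile rule follows. The structures are simplified stand-ins, refile_to_running() is a hypothetical name, and sending unmodified buffers to a "reserved" list is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum jlist { BJ_NONE, BJ_METADATA, BJ_RESERVED };   /* simplified list ids */

struct transaction { int t_nr_buffers; };           /* metadata buffers filed */

struct journal_head {
    struct transaction *b_transaction;       /* current owner */
    struct transaction *b_next_transaction;  /* running transaction waiting for it */
    bool b_modified;                         /* did the running transaction dirty it? */
    enum jlist b_jlist;
};

/* Hand a buffer from the committing transaction over to the running one. */
static void refile_to_running(struct journal_head *jh)
{
    assert(jh->b_next_transaction != NULL);
    jh->b_transaction = jh->b_next_transaction;
    jh->b_next_transaction = NULL;
    if (jh->b_modified) {
        /* Really modified: counts against the running transaction. */
        jh->b_jlist = BJ_METADATA;
        jh->b_transaction->t_nr_buffers++;
    } else {
        /* Only write access was taken: do not inflate t_nr_buffers. */
        jh->b_jlist = BJ_RESERVED;
    }
}
```

Only buffers with b_modified set raise t_nr_buffers, so the count cannot grow past what the running transaction actually dirtied.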
      
      This patch also fixes an accounting error in journal_forget, it is possible
      that we can call journal_forget on a buffer without having modified it, only
      gotten write access to it, so instead of freeing a credit, we only do so if
      the buffer was modified.  The assert will help catch if this problem occurs.
      Without these two patches I could hit this assert within minutes of running
      postmark, with them this issue no longer arises.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Cc: <linux-ext4@vger.kernel.org>
      Acked-by: Jan Kara <jack@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • jbd: fix the way the b_modified flag is cleared · 5bc833fe
      Josef Bacik committed
      Currently at the start of a journal commit we loop through all of the buffers
      on the committing transaction and clear the b_modified flag (the flag that is
      set when a transaction modifies the buffer) under the j_list_lock.
      
      The problem is that everywhere else this flag is modified only under the jbd
      lock buffer flag, so it will race with a running transaction who could
      potentially set it, and have it unset by the committing transaction.
      
      This is also a big waste, you can have several thousands of buffers that you
      are clearing the modified flag on when you may not need to.  This patch
      removes this code and instead clears the b_modified flag upon entering
      do_get_write_access/journal_get_create_access, so if that transaction does
      indeed use the buffer then it will be accounted for properly, and if it does
      not then we know we didn't use it.
      
      That will be important for the next patch in this series.  Tested thoroughly
      by myself using postmark/iozone/bonnie++.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Cc: <linux-ext4@vger.kernel.org>
      Acked-by: Jan Kara <jack@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 09 Feb, 2008 1 commit
    • ext3 can fail badly when device stops accepting BIO_RW_BARRIER requests · 28ae094c
      Neil Brown committed
      Some devices - notably dm and md - can change their behaviour in response
      to BIO_RW_BARRIER requests.  They might start out accepting such requests
      but on reconfiguration, they find out that they cannot any more.
      
      ext3 (and other filesystems) deal with this by always testing if
      BIO_RW_BARRIER requests fail with EOPNOTSUPP, and retrying the write
      requests without the barrier (probably after waiting for any pending writes
      to complete).
      
      However there is a bug in the handling for this for ext3.
      
      When ext3 (jbd actually) decides to submit a BIO_RW_BARRIER request, it
      sets the buffer_ordered flag on the buffer head.  If the request completes
      successfully, the flag STAYS SET.
      
      Other code might then write the same buffer_head after the device has been
      reconfigured to not accept barriers.  This write will then fail, but the
      "other code" is not ready to handle EOPNOTSUPP errors and the error will be
      treated as fatal.
      
      This can be seen without having to reconfigure a device at exactly the
      wrong time by putting:
      
      		if (buffer_ordered(bh))
      			printk("OH DEAR, and ordered buffer\n");
      
      in the while loop in "commit phase 5" of journal_commit_transaction.
      
      If it ever prints the "OH DEAR ..." message (as it does sometimes for
      me), then that request could (in different circumstances) have failed
      with EOPNOTSUPP, but that isn't tested for.
      
      My proposed fix is to clear the buffer_ordered flag after it has been
      used, as in the following patch.
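
The failure mode and the proposed fix can be modelled in user-space C roughly as below. Everything here is a simplified stand-in: struct buf, struct dev, submit() and write_commit_record() are hypothetical names for illustration, not kernel interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct buf { bool ordered; };            /* models the buffer_ordered flag */
struct dev { bool accepts_barriers; };   /* may change on reconfiguration */

/* Model of a write: a barrier request fails if the device refuses barriers. */
static int submit(const struct dev *d, const struct buf *b)
{
    return (b->ordered && !d->accepts_barriers) ? -EOPNOTSUPP : 0;
}

/* Barrier-aware write of the commit record, with fallback. */
static int write_commit_record(const struct dev *d, struct buf *b,
                               bool *use_barrier)
{
    assert(d != NULL && b != NULL && use_barrier != NULL);
    bool barrier_done = false;
    if (*use_barrier) {
        b->ordered = true;
        barrier_done = true;
    }
    int ret = submit(d, b);
    if (barrier_done)
        b->ordered = false;      /* the fix: never leave the flag set */
    if (ret == -EOPNOTSUPP && barrier_done) {
        *use_barrier = false;    /* stop asking for barriers */
        ret = submit(d, b);      /* retry as a plain write */
    }
    return ret;
}
```

Because the flag is cleared right after use, a later plain write of the same buffer by code that does not handle EOPNOTSUPP cannot be turned into a barrier request by accident.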
      Signed-off-by: Neil Brown <neilb@suse.de>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 01 Feb, 2008 1 commit
  13. 30 Jan, 2008 1 commit
    • spinlock: lockbreak cleanup · 95c354fe
      Nick Piggin committed
      The break_lock data structure and code for spinlocks is quite nasty.
      Not only does it double the size of a spinlock but it changes locking to
      a potentially less optimal trylock.
      
      Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a
      __raw_spin_is_contended that uses the lock data itself to determine whether
      there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is
      not set.
      
      Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
      decouple it from the spinlock implementation, and make it typesafe (rwlocks
      do not have any need_lockbreak sites -- why do they even get bloated up
      with that break_lock then?).
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  14. 06 Dec, 2007 1 commit
    • jbd: Fix assertion failure in fs/jbd/checkpoint.c · d4beaf4a
      Jan Kara committed
      Before we start committing a transaction, we call
      __journal_clean_checkpoint_list() to cleanup transaction's written-back
      buffers.
      
      If this call happens to remove all of them (and there were already some
      buffers), __journal_remove_checkpoint() will decide to free the transaction
      because it isn't (yet) a committing transaction and soon we fail some
      assertion - the transaction really isn't ready to be freed :).
      
      We change the check in __journal_remove_checkpoint() to free only a
      transaction in T_FINISHED state.  The locking there is subtle though (as
      everywhere in JBD ;().  We use j_list_lock to protect the check and a
      subsequent call to __journal_drop_transaction() and do the same in the end
      of journal_commit_transaction() which is the only place where a transaction
      can get to T_FINISHED state.
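
The changed check can be sketched like this. It is a simplified model only: the state enum is reduced to three values, the checkpoint list is reduced to a counter, and remove_checkpoint() is a hypothetical stand-in that ignores locking.

```c
#include <assert.h>
#include <stdbool.h>

enum t_state { T_RUNNING, T_COMMIT, T_FINISHED };   /* simplified state set */

struct transaction {
    enum t_state t_state;
    int nr_checkpoint_bufs;   /* buffers left on the checkpoint list */
    bool dropped;             /* models the transaction being freed */
};

/*
 * Remove one buffer from the transaction's checkpoint list.
 * Returns true only if the transaction itself was dropped.
 */
static bool remove_checkpoint(struct transaction *t)
{
    assert(t->nr_checkpoint_bufs > 0);
    t->nr_checkpoint_bufs--;
    if (t->nr_checkpoint_bufs != 0)
        return false;             /* still has buffers to checkpoint */
    if (t->t_state != T_FINISHED)
        return false;             /* empty list, but commit not finished */
    t->dropped = true;
    return true;
}
```

A transaction whose checkpoint list drains before it has even started committing is left alone; only one in T_FINISHED state is dropped.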
      
      Probably I'm too paranoid here and such locking is not really necessary -
      checkpoint lists are processed only from log_do_checkpoint() where a
      transaction must be already committed to be processed or from
      __journal_clean_checkpoint_list() where kjournald itself calls it and thus
      transaction cannot change state either.  Better be safe if something
      changes in future...
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 20 Oct, 2007 1 commit
  16. 18 Oct, 2007 1 commit
  17. 17 Jul, 2007 1 commit
  18. 09 May, 2007 1 commit
  19. 23 Dec, 2006 1 commit
  20. 04 Oct, 2006 1 commit
  21. 26 Sep, 2006 1 commit
    • [PATCH] jbd: fix commit of ordered data buffers · 3998b930
      Jan Kara committed
      The original commit code assumes that when a buffer on the BJ_SyncData list
      is locked, it is being written to disk.  But this is not true and hence it
      can lead to potential data loss on crash.  Also the code didn't account for
      the fact that journal_dirty_data() can steal buffers from the committing
      transaction and hence could write buffers that no longer belong to the
      committing transaction.  Finally, it could happen that we tried writing out
      one buffer several times.
      
      The patch below tries to solve these problems by a complete rewrite of the
      data commit code.  We go through buffers on t_sync_datalist, lock buffers
      needing write out and store them in an array.  Buffers are also immediately
      refiled to BJ_Locked list or unfiled (if the write out is completed).  When
      the array is full or we have to block on buffer lock, we submit all
      accumulated buffers for IO.
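
The accumulate-then-submit pattern described here can be sketched as follows. This is only a rough model: buffers are plain integers, BATCH_SIZE is an assumed value picked for illustration, and queue_buffer() and submit_batch() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_SIZE 4   /* assumed array size, kept small for illustration */

struct wbuf_batch {
    int bufs[BATCH_SIZE];     /* locked buffers waiting to be submitted */
    size_t count;             /* how many slots are in use */
    size_t submissions;       /* number of batches sent for IO */
    size_t buffers_written;   /* total buffers sent for IO */
};

/* Send everything accumulated so far for IO and empty the array. */
static void submit_batch(struct wbuf_batch *b)
{
    if (b->count == 0)
        return;
    b->buffers_written += b->count;
    b->submissions++;
    b->count = 0;
}

/* Store one locked buffer; submit the whole array once it is full. */
static void queue_buffer(struct wbuf_batch *b, int buf_id)
{
    assert(b->count < BATCH_SIZE);
    b->bufs[b->count++] = buf_id;
    if (b->count == BATCH_SIZE)
        submit_batch(b);
}
```

A caller that is about to block on a buffer lock would call submit_batch() first, so that buffers it has already locked are not held back while it sleeps.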
      
      [suitable for 2.6.18.x around the 2.6.19-rc2 timeframe]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  22. 28 Aug, 2006 1 commit
  23. 23 Jun, 2006 1 commit
    • [PATCH] jbd: fix BUG in journal_commit_transaction() · 9ada7340
      Jan Kara committed
      Fix possible assertion failure in journal_commit_transaction() on
      jh->b_next_transaction == NULL (when we are processing BJ_Forget list and
      buffer is not jbddirty).
      
      !jbddirty buffers can be placed on BJ_Forget list for example by
      journal_forget() or by __dispose_buffer() - generally such buffer means
      that it has been freed by this transaction.
      
      Freed buffers should not be reallocated until the transaction has committed
      (that's why we have the assertion there) but they *can* be reallocated when
      the transaction has already been committed to disk and we are just
      processing the BJ_Forget list (as soon as we remove b_committed_data from
      the bitmap bh, ext3 will be able to reallocate buffers freed by the
      committing transaction).  So we also have to account for the case where the
      buffer has been reallocated and b_next_transaction has already been set.
      
      And one more subtle point: it can happen that we manage to reallocate the
      buffer and also mark it jbddirty.  Then we also add the freed buffer to the
      checkpoint list of the committing transaction.  But that should do no harm.
      
      Non-jbddirty buffers should be filed to BJ_Reserved and not BJ_Metadata
      list.  It can actually happen that we refile such buffers during the commit
      phase when we reallocate in the running transaction blocks deleted in
      committing transaction (and that can happen if the committing transaction
      already wrote all the data and is just cleaning up BJ_Forget list).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: "Stephen C. Tweedie" <sct@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  24. 15 Feb, 2006 1 commit
  25. 19 Jan, 2006 1 commit
  26. 07 Nov, 2005 1 commit
  27. 08 Sep, 2005 2 commits
  28. 17 Apr, 2005 1 commit
    • Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds committed
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!