1. 16 May 2012, 3 commits
    • jbd: Write journal superblock with WRITE_FUA after checkpointing · fd2cbd4d
      Authored by Jan Kara
      If the journal superblock is written only to the disk's caches and another
      transaction starts reusing space of the transaction cleaned from the log, it
      can happen that blocks of the new transaction reach the disk before the
      journal superblock. If a power failure happens in that case, a subsequent
      journal replay would still try to replay the old transaction, but some of its
      blocks may already have been overwritten by the new transaction. For this
      reason we must use WRITE_FUA when updating the log tail, and we must first
      write the new log tail to disk and update the in-memory information only
      after that.
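
      The ordering requirement can be pictured with a small standalone sketch (a
      userspace model with hypothetical names, not the jbd code): the new tail is
      made durable first, and only then is the in-memory tail, and therefore space
      reuse, allowed to move forward.

      #include <stdio.h>

      /* Hypothetical stand-ins for the journal and its on-disk superblock. */
      struct journal_model {
              unsigned long mem_tail;  /* in-memory tail; space before it is reusable */
              unsigned long disk_tail; /* tail recorded in the on-disk superblock */
      };

      /* Stand-in for writing the superblock with WRITE_FUA and waiting for it:
       * when this returns, the new tail is on stable storage rather than only
       * in the disk's volatile cache. */
      static void write_superblock_fua(struct journal_model *j, unsigned long tail)
      {
              j->disk_tail = tail;
              printf("superblock tail=%lu is durable\n", tail);
      }

      static void advance_log_tail(struct journal_model *j, unsigned long new_tail)
      {
              /* 1. Persist the new tail first (FUA). */
              write_superblock_fua(j, new_tail);
              /* 2. Only now update the in-memory tail; from here on the space freed
               *    from the log may be reused by new transactions. */
              j->mem_tail = new_tail;
      }

      int main(void)
      {
              struct journal_model j = { .mem_tail = 100, .disk_tail = 100 };

              advance_log_tail(&j, 250);
              return 0;
      }
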
      Signed-off-by: Jan Kara <jack@suse.cz>
    • jbd: protect all log tail updates with j_checkpoint_mutex · 1ce8486d
      Authored by Jan Kara
      There are some log tail updates that are not protected by j_checkpoint_mutex.
      Some of these are harmless because they happen during startup or shutdown, but
      updates in journal_commit_transaction() and journal_flush() can really race
      with other log tail updates (e.g. someone doing journal_flush() with someone
      running cleanup_journal_tail()). So protect all log tail updates with
      j_checkpoint_mutex.
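
      As a rough illustration of the rule (a userspace pthread model, not the jbd
      code; checkpoint_mutex and log_tail stand in for j_checkpoint_mutex and the
      journal's tail state), every path that moves the tail takes the same mutex:

      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t checkpoint_mutex = PTHREAD_MUTEX_INITIALIZER;
      static unsigned long log_tail = 100;

      /* Every updater, whether it models journal_flush(), commit, or
       * cleanup_journal_tail(), goes through the same lock, so two racing
       * updates can no longer interleave and move the tail backwards. */
      static void update_log_tail(unsigned long new_tail, const char *who)
      {
              pthread_mutex_lock(&checkpoint_mutex);
              if (new_tail > log_tail) {
                      log_tail = new_tail;
                      printf("%s: tail -> %lu\n", who, new_tail);
              }
              pthread_mutex_unlock(&checkpoint_mutex);
      }

      int main(void)
      {
              update_log_tail(150, "cleanup_journal_tail");
              update_log_tail(140, "journal_flush"); /* stale update, ignored */
              return 0;
      }
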
      Signed-off-by: Jan Kara <jack@suse.cz>
    • jbd: Split updating of journal superblock and marking journal empty · 9754e39c
      Authored by Jan Kara
      There are three cases of updating the journal superblock. In the first case,
      we want to mark the journal as empty (setting s_sequence to 0), in the second
      case we want to update the log tail, and in the third case we want to update
      s_errno. Split these cases into separate functions. It makes the code slightly
      more straightforward, and later patches will make the distinction even more
      important.
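
      Schematically (hypothetical names and a simplified superblock model; the real
      helpers introduced by the patch differ), the split gives each case its own
      function:

      #include <stdio.h>

      /* Simplified model of the superblock fields involved (not the real layout). */
      struct journal_sb_model {
              unsigned int s_sequence; /* per the message, 0 marks the journal empty */
              unsigned int s_start;    /* log tail block */
              int s_errno;             /* recorded error, if any */
      };

      static void sb_mark_empty(struct journal_sb_model *sb)
      {
              sb->s_sequence = 0;                   /* case 1: journal is empty */
      }

      static void sb_update_log_tail(struct journal_sb_model *sb,
                                     unsigned int seq, unsigned int block)
      {
              sb->s_sequence = seq;                 /* case 2: advance the log tail */
              sb->s_start = block;
      }

      static void sb_update_errno(struct journal_sb_model *sb, int err)
      {
              sb->s_errno = err;                    /* case 3: record an error */
      }

      int main(void)
      {
              struct journal_sb_model sb = { 42, 100, 0 };

              sb_update_log_tail(&sb, 43, 120);
              sb_update_errno(&sb, -5);
              sb_mark_empty(&sb);
              printf("seq=%u start=%u errno=%d\n", sb.s_sequence, sb.s_start, sb.s_errno);
              return 0;
      }
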
      Signed-off-by: Jan Kara <jack@suse.cz>
  2. 11 Apr 2012, 1 commit
    • jbd: Refine commit writeout logic · 2db938be
      Authored by Jan Kara
      Currently we write out all journal buffers in WRITE_SYNC mode. This improves
      performance for fsync-heavy workloads but hinders performance when writes are
      mostly asynchronous; most noticeably, it slows down readers, and users
      complain about slow desktop response, etc.
      
      So submit writes as asynchronous in the normal case and only submit writes as
      WRITE_SYNC if we detect someone is waiting for current transaction commit.
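
      The decision itself is simple; as a sketch of the idea (hypothetical names,
      not the jbd submission path): default to a plain asynchronous write and
      upgrade to a synchronous one only when a waiter for the commit is known to
      exist.

      #include <stdbool.h>
      #include <stdio.h>

      enum write_mode { WRITE_ASYNC, WRITE_SYNC_MODE };

      /* Stand-in for "is someone waiting for this transaction commit?",
       * e.g. a task blocked in fsync() on this transaction. */
      static bool commit_has_waiters(void)
      {
              return false; /* normal background commit */
      }

      static enum write_mode choose_write_mode(void)
      {
              /* Only pay the WRITE_SYNC cost when somebody actually waits for the
               * commit; otherwise let the writes go out asynchronously so they do
               * not starve readers. */
              return commit_has_waiters() ? WRITE_SYNC_MODE : WRITE_ASYNC;
      }

      int main(void)
      {
              printf("write mode: %s\n",
                     choose_write_mode() == WRITE_SYNC_MODE ? "WRITE_SYNC" : "async");
              return 0;
      }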
      
      I've gathered some numbers to back this change. The first is a read latency
      test: it measures the time to read 1 MB after several seconds of sleeping, in
      the presence of streaming writes.
      
      Top 10 times (out of 90) in us:
      Before		After
      2131586		697473
      1709932		557487
      1564598		535642
      1480462		347573
      1478579		323153
      1408496		222181
      1388960		181273
      1329565		181070
      1252486		172832
      1223265		172278
      
      Average:
      619377		82180
      
      So the improvement in both maximum and average latency is massive.
      
      I've measured fsync throughput by:
      fs_mark -n 100 -t 1 -s 16384 -d /mnt/fsync/ -S 1 -L 4
      
      in the presence of a streaming reader. The numbers (fsyncs/s) are:
      Before		After
      9.9		6.3
      6.8		6.0
      6.3		6.2
      5.8		6.1
      
      So fsync performance seems unharmed by this change.
      Signed-off-by: Jan Kara <jack@suse.cz>
  3. 20 Mar 2012, 1 commit
  4. 14 Mar 2012, 1 commit
  5. 09 Jan 2012, 1 commit
    • jbd: Remove j_barrier mutex · 00482785
      Authored by Jan Kara
      The j_barrier mutex is used for serializing different journal lock operations.
      The problem with it is that e.g. the FIFREEZE ioctl results in a process
      leaving the kernel with the j_barrier mutex held, which makes lockdep freak
      out. Also, the hibernation code wants to freeze the filesystem but then cannot
      hibernate the system because the mutex is still held.
      
      So we remove the j_barrier mutex and use a direct wait on j_barrier_count
      instead. Since locking the journal is a rare operation, we don't have to care
      about fairness or such things.
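
      The replacement can be pictured with a small pthread model (hypothetical
      names; the kernel waits on its own wait queue rather than a condition
      variable): instead of holding a mutex across the whole freeze, the freezer
      bumps a counter, and everyone else simply waits for it to drop back to zero.

      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  thawed = PTHREAD_COND_INITIALIZER;
      static int barrier_count;               /* models j_barrier_count */

      static void freeze_journal(void)
      {
              pthread_mutex_lock(&lock);
              barrier_count++;                /* freeze in effect... */
              pthread_mutex_unlock(&lock);    /* ...but no mutex stays held */
      }

      static void thaw_journal(void)
      {
              pthread_mutex_lock(&lock);
              if (--barrier_count == 0)
                      pthread_cond_broadcast(&thawed);
              pthread_mutex_unlock(&lock);
      }

      /* Anyone who needs the journal unbarriered just waits for the count. */
      static void wait_until_thawed(void)
      {
              pthread_mutex_lock(&lock);
              while (barrier_count > 0)
                      pthread_cond_wait(&thawed, &lock);
              pthread_mutex_unlock(&lock);
      }

      int main(void)
      {
              freeze_journal();               /* e.g. FIFREEZE returns to userspace */
              thaw_journal();                 /* later: FITHAW */
              wait_until_thawed();
              printf("journal usable again\n");
              return 0;
      }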
      
      CC: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
  6. 22 Nov 2011, 1 commit
    • freezer: unexport refrigerator() and update try_to_freeze() slightly · a0acae0e
      Authored by Tejun Heo
      There is no reason to export two functions for entering the
      refrigerator.  Calling refrigerator() instead of try_to_freeze()
      doesn't save anything noticeable, nor does it remove any race condition
      (a sketch of the resulting relationship follows the list below).
      
      * Rename refrigerator() to __refrigerator() and make it return bool
        indicating whether it scheduled out for freezing.
      
      * Update try_to_freeze() to return bool and relay the return value of
        __refrigerator() if freezing().
      
      * Convert all refrigerator() users to try_to_freeze().
      
      * Update documentation accordingly.
      
      * While at it, add might_sleep() to try_to_freeze().
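
      A minimal userspace model of the relationship described above (freezing()
      and __refrigerator() are modelled by stand-ins here, and the kernel's
      might_sleep() annotation is only noted in a comment):

      #include <stdbool.h>
      #include <stdio.h>

      static bool freeze_requested;           /* models freezing(current) */

      /* Models __refrigerator(): in the kernel this schedules the task out until
       * it is thawed and reports whether it actually froze. */
      static bool refrigerator_model(void)
      {
              printf("frozen, waiting to be thawed\n");
              return true;
      }

      static bool try_to_freeze_model(void)
      {
              /* the kernel version starts with might_sleep() */
              if (!freeze_requested)
                      return false;
              return refrigerator_model();    /* relay whether we froze */
      }

      int main(void)
      {
              printf("-> %d\n", try_to_freeze_model());
              freeze_requested = true;
              printf("-> %d\n", try_to_freeze_model());
              return 0;
      }
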
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Samuel Ortiz <samuel@sortiz.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Christoph Hellwig <hch@infradead.org>
  7. 02 Nov 2011, 1 commit
    • jbd/jbd2: validate sb->s_first in journal_get_superblock() · 8762202d
      Authored by Eryu Guan
      I hit a J_ASSERT(blocknr != 0) failure in cleanup_journal_tail() when mounting
      a fsfuzzed ext3 image. It turns out that the corrupted ext3 image has
      s_first = 0 in the journal superblock, and the 0 is passed to journal->j_head
      in journal_reset(), then to blocknr in cleanup_journal_tail(), where the
      J_ASSERT finally fails.

      So validate s_first after reading the journal superblock from disk in
      journal_get_superblock() to ensure s_first is valid.
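
      The check itself amounts to rejecting an obviously bogus value while reading
      the superblock; schematically (a simplified model, not the actual
      journal_get_superblock() code):

      #include <stdio.h>

      /* Simplified journal superblock with only the field of interest. */
      struct journal_sb_model {
              unsigned int s_first; /* first usable block of the log; 0 is invalid */
      };

      /* Returns 0 on success, -1 on a corrupted superblock. */
      static int validate_superblock(const struct journal_sb_model *sb)
      {
              if (sb->s_first == 0) {
                      fprintf(stderr, "journal: invalid start block %u\n", sb->s_first);
                      return -1;
              }
              return 0;
      }

      int main(void)
      {
              struct journal_sb_model corrupted = { .s_first = 0 };

              return validate_superblock(&corrupted) ? 1 : 0;
      }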
      
      The following script could reproduce it:
      
      fstype=ext3
      blocksize=1024
      img=$fstype.img
      offset=0
      found=0
      magic="c0 3b 39 98"
      
      dd if=/dev/zero of=$img bs=1M count=8
      mkfs -t $fstype -b $blocksize -F $img
      filesize=`stat -c %s $img`
      while [ $offset -lt $filesize ]
      do
              if od -j $offset -N 4 -t x1 $img | grep -i "$magic";then
                      echo "Found journal: $offset"
                      found=1
                      break
              fi
              offset=`echo "$offset+$blocksize" | bc`
      done
      
      if [ $found -ne 1 ];then
              echo "Magic \"$magic\" not found"
              exit 1
      fi
      
      dd if=/dev/zero of=$img seek=$(($offset+23)) conv=notrunc bs=1 count=1
      
      mkdir -p ./mnt
      mount -o loop $img ./mnt
      
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  8. 27 Jun 2011, 1 commit
    • jbd: Fix oops in journal_remove_journal_head() · bb189247
      Authored by Jan Kara
      journal_remove_journal_head() can oops when trying to access the journal_head
      returned by bh2jh(). This is caused, for example, by the following race:
      
      	TASK1					TASK2
        journal_commit_transaction()
          ...
          processing t_forget list
            __journal_refile_buffer(jh);
            if (!jh->b_transaction) {
              jbd_unlock_bh_state(bh);
      					journal_try_to_free_buffers()
      					  journal_grab_journal_head(bh)
      					  jbd_lock_bh_state(bh)
      					  __journal_try_to_free_buffer()
      					  journal_put_journal_head(jh)
              journal_remove_journal_head(bh);
      
      journal_put_journal_head() in TASK2 sees that b_jcount == 0 and that the
      buffer is not part of any transaction, and thus frees the journal_head before
      TASK1 gets to do so. Note that even the buffer_head can be released by
      try_to_free_buffers() after journal_put_journal_head(), which creates an even
      larger window for an oops (but I didn't see that happen in reality).

      Fix the problem by making transactions hold their own journal_head reference
      (in b_jcount). That way we don't have to remove the journal_head explicitly
      via journal_remove_journal_head() and instead just remove the journal_head
      when b_jcount drops to zero. The result of this is that
      [__]journal_refile_buffer(), [__]journal_unfile_buffer(), and
      __journal_remove_checkpoint() can free the journal_head, which needs
      modification of a few callers. Also we have to be careful because once the
      journal_head is removed, the buffer_head might be freed as well. So we have
      to get our own buffer_head reference where it matters.
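
      The essence of the fix is plain reference counting; as a rough standalone
      model (hypothetical names, not the jbd data structures), whoever uses a
      journal_head holds a count in b_jcount and the structure is freed only when
      the last reference is dropped, so neither task can free it under the other:

      #include <stdio.h>
      #include <stdlib.h>

      /* Minimal model of a journal_head whose lifetime is governed by b_jcount. */
      struct journal_head_model {
              int b_jcount;
      };

      static struct journal_head_model *jh_get(struct journal_head_model *jh)
      {
              jh->b_jcount++;
              return jh;
      }

      static void jh_put(struct journal_head_model *jh)
      {
              /* Freeing happens only when the final reference goes away, instead of
               * via an explicit journal_remove_journal_head() call that could race
               * with another user of the same journal_head. */
              if (--jh->b_jcount == 0) {
                      printf("last reference dropped, freeing journal_head\n");
                      free(jh);
              }
      }

      int main(void)
      {
              struct journal_head_model *jh = calloc(1, sizeof(*jh));

              if (!jh)
                      return 1;
              jh_get(jh);  /* reference held by the transaction (this patch) */
              jh_get(jh);  /* reference held by journal_try_to_free_buffers() */
              jh_put(jh);  /* either side may drop first ... */
              jh_put(jh);  /* ... the object lives until the last put */
              return 0;
      }
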
      Signed-off-by: Jan Kara <jack@suse.cz>
  9. 25 Jun 2011, 1 commit
    • jbd: Add fixed tracepoints · 99cb1a31
      Authored by Lukas Czerner
      This commit adds fixed tracepoints for jbd. They are based on the fixed
      tracepoints for jbd2; however, the ones for collecting statistics are missing,
      since adding them would require a more intrusive patch and should get its own
      commit if someone decides it is needed. Also, there are new tracepoints in
      __journal_drop_transaction() and journal_update_superblock().
      
      The list of jbd tracepoints:
      
      jbd_checkpoint
      jbd_start_commit
      jbd_commit_locking
      jbd_commit_flushing
      jbd_commit_logging
      jbd_drop_transaction
      jbd_end_commit
      jbd_do_submit_data
      jbd_cleanup_journal_tail
      jbd_update_superblock_end
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jan Kara <jack@suse.cz>
  10. 17 May 2011, 1 commit
    • jbd: fix fsync() tid wraparound bug · d9b01934
      Authored by Ted Ts'o
      If an application program does not make any changes to the indirect
      blocks or extent tree, i_datasync_tid will not get updated.  If there
      are enough commits (i.e., 2**31) such that tid_geq()'s calculations
      wrap, and there isn't a currently active transaction at the time of
      the fdatasync() call, this can end up triggering a BUG_ON in
      fs/jbd/commit.c:
      
      	J_ASSERT(journal->j_running_transaction != NULL);
      
      It's pretty rare that this can happen, since it requires the use of
      fdatasync() plus *very* frequent and excessive use of fsync().  But
      with the right workload, it can.
      
      We fix this by replacing the use of tid_geq() with an equality test, since
      there is only one transaction id that is valid for us to start: namely, the
      currently running transaction (if it exists).
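
      The wraparound itself is easy to demonstrate in isolation; the helper below
      mirrors the signed-difference style of jbd's tid_geq() comparison (a
      userspace sketch, not the kernel header):

      #include <stdio.h>

      typedef unsigned int tid_t;

      /* Signed difference of unsigned sequence numbers, as in jbd's tid_geq(). */
      static int tid_geq(tid_t x, tid_t y)
      {
              int difference = (int)(x - y);
              return difference >= 0;
      }

      int main(void)
      {
              tid_t stale = 10;                     /* old i_datasync_tid */
              tid_t seq   = stale + (1u << 31) + 5; /* more than 2^31 commits later */

              /* The stale tid now compares as "current or newer" again, so the
               * fsync path can be fooled into demanding a commit that no longer
               * exists; an equality test has no such failure mode. */
              printf("tid_geq(stale, seq) = %d\n", tid_geq(stale, seq));
              printf("stale == seq        = %d\n", stale == seq);
              return 0;
      }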
      
      CC: stable@kernel.org
      Reported-by: Martin_Zielinski@McAfee.com
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NJan Kara <jack@suse.cz>
  11. 31 Mar 2011, 1 commit
  12. 01 Mar 2011, 1 commit
  13. 28 Oct 2010, 4 commits
  14. 18 Aug 2010, 1 commit
    • remove SWRITE* I/O types · 9cb569d6
      Authored by Christoph Hellwig
      These flags aren't real I/O types, but tell ll_rw_block to always
      lock the buffer instead of giving up on a failed trylock.
      
      Instead add a new write_dirty_buffer helper that implements this semantic
      and use it from the existing SWRITE* callers.  Note that the ll_rw_block
      code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
      this patch fixes.
      
      In the ufs code clean up the helper that used to call ll_rw_block
      to mirror sync_dirty_buffer, which is the function it implements for
      compound buffers.
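
      The behavioural difference is between skipping a buffer whose lock cannot be
      taken immediately and insisting on the lock; a small pthread model of the two
      policies (hypothetical names, not the buffer-cache code):

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* Minimal model of a buffer with a lock and a dirty bit. */
      struct buffer_model {
              pthread_mutex_t lock;
              bool dirty;
      };

      static struct buffer_model b = { PTHREAD_MUTEX_INITIALIZER, true };

      /* Old ll_rw_block()-style behaviour: give up if the trylock fails, leaving
       * the buffer dirty and unwritten. */
      static void write_if_trylock(struct buffer_model *buf)
      {
              if (pthread_mutex_trylock(&buf->lock) != 0)
                      return;                    /* somebody else holds it: skip */
              if (buf->dirty) {
                      buf->dirty = false;
                      printf("wrote buffer (trylock path)\n");
              }
              pthread_mutex_unlock(&buf->lock);
      }

      /* SWRITE / write_dirty_buffer-style behaviour: always take the lock, so a
       * dirty buffer is guaranteed to be submitted. */
      static void write_always_lock(struct buffer_model *buf)
      {
              pthread_mutex_lock(&buf->lock);    /* wait as long as needed */
              if (buf->dirty) {
                      buf->dirty = false;
                      printf("wrote buffer (locked path)\n");
              }
              pthread_mutex_unlock(&buf->lock);
      }

      int main(void)
      {
              /* Simulate another task holding the buffer lock. */
              pthread_mutex_lock(&b.lock);
              write_if_trylock(&b);              /* gives up; buffer stays dirty */
              pthread_mutex_unlock(&b.lock);

              write_always_lock(&b);             /* waits for the lock and writes */
              return 0;
      }
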
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  15. 21 Jul 2010, 1 commit
  16. 22 May 2010, 2 commits
  17. 23 Dec 2009, 1 commit
  18. 12 Nov 2009, 1 commit
  19. 11 Nov 2009, 1 commit
  20. 16 Sep 2009, 1 commit
  21. 21 Jul 2009, 1 commit
  22. 16 Jul 2009, 1 commit
  23. 03 Apr 2009, 1 commit
  24. 12 Feb 2009, 1 commit
    • jbd: fix return value of journal_start_commit() · 8fe4cd0d
      Authored by Jan Kara
      journal_start_commit() returns 1 if either a transaction is committing or
      the function has queued a transaction commit.  But it returns 0 if we
      raced with somebody queueing the transaction commit as well.  This
      resulted in ext3_sync_fs() not functioning correctly (description from
      Arthur Jones): In the case of a data=ordered umount with pending long
      symlinks which are delayed due to a long list of other I/O on the backing
      block device, this causes the buffer associated with the long symlinks to
      not be moved to the inode dirty list in the second phase of fsync_super.
      Then, before they can be dirtied again, kjournald exits on seeing the UMOUNT
      flag, and the dirty pages are never written to the backing block device,
      causing long symlink corruption and exposing new or previously freed block
      data to userspace.
      
      This can be reproduced with a script created by Eric Sandeen
      <sandeen@redhat.com>:
      
              #!/bin/bash
      
              umount /mnt/test2
              mount /dev/sdb4 /mnt/test2
              rm -f /mnt/test2/*
              dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
              touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
              ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename \
                      /mnt/test2/link
              umount /mnt/test2
              mount /dev/sdb4 /mnt/test2
              ls /mnt/test2/
      
      This patch fixes journal_start_commit() to always return 1 when there's
      a transaction committing or queued for commit.
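
      Stated as code, the corrected return-value rule is a simple disjunction
      (schematic, with hypothetical field names rather than the real journal_t
      layout):

      #include <stdbool.h>
      #include <stdio.h>

      /* Schematic journal state: is a transaction currently committing, and is
       * one queued (requested) for commit? */
      struct journal_state {
              bool committing_transaction;
              bool commit_requested;
      };

      /* After the fix: report 1 whenever a commit is in progress *or* already
       * queued, even if some other task queued it first. */
      static int journal_start_commit_model(const struct journal_state *j)
      {
              return (j->committing_transaction || j->commit_requested) ? 1 : 0;
      }

      int main(void)
      {
              struct journal_state raced = { .committing_transaction = false,
                                             .commit_requested = true };

              printf("%d\n", journal_start_commit_model(&raced)); /* now 1, not 0 */
              return 0;
      }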
      
      Cc: Eric Sandeen <sandeen@redhat.com>
      Cc: Mike Snitzer <snitzer@gmail.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 23 Oct 2008, 1 commit
    • jbd: fix error handling for checkpoint io · 4afe9785
      Authored by Hidehiro Kawai
      When a checkpointing IO fails, the current JBD code doesn't check the error
      and continues journaling. This means the latest metadata can be lost from both
      the journal and the filesystem.
      
      This patch leaves the failed metadata blocks in the journal space and
      aborts journaling in the case of log_do_checkpoint().  To achieve this, we
      need to do:
      
      1. don't remove the failed buffer from the checkpoint list in the case of
         __try_to_free_cp_buf(), because it may be released or overwritten by a
         later transaction
      2. log_do_checkpoint() is the last chance; remove the failed buffer from
         the checkpoint list and abort the journal
      3. when checkpointing fails, don't update the journal superblock, to
         prevent the journaled contents from being cleaned.  For safety, don't
         update j_tail and j_tail_sequence either
      4. when checkpointing fails, notify this error to the ext3 layer so that
         ext3 doesn't clear the needs_recovery flag; otherwise the journaled
         contents are ignored and cleaned in the recovery phase
      5. if the recovery fails, keep the needs_recovery flag
      6. prevent cleanup_journal_tail() from being called between
         __journal_drop_transaction() and journal_abort() (a race issue between
         journal_flush() and __log_wait_for_space())
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
  26. 26 Jul 2008, 2 commits
  27. 28 Apr 2008, 1 commit
  28. 31 Mar 2008, 1 commit
  29. 20 Mar 2008, 1 commit
  30. 07 Feb 2008, 1 commit
  31. 20 Oct 2007, 2 commits
  32. 19 Oct 2007, 1 commit