1. 22 5月, 2019 1 次提交
  2. 06 4月, 2019 1 次提交
    • T
      jbd2: fix race when writing superblock · 91544201
      Theodore Ts'o 提交于
      [ Upstream commit 538bcaa6261b77e71d37f5596c33127c1a3ec3f7 ]
      
      The jbd2 superblock is lockless now, so there is probably a race
      condition between writing it so disk and modifing contents of it, which
      may lead to checksum error. The following race is the one case that we
      have captured.
      
      jbd2                                fsstress
      jbd2_journal_commit_transaction
       jbd2_journal_update_sb_log_tail
        jbd2_write_superblock
         jbd2_superblock_csum_set         jbd2_journal_revoke
                                           jbd2_journal_set_features(revork)
                                           modify superblock
         submit_bh(checksum incorrect)
      
      Fix this by locking the buffer head before modifing it.  We always
      write the jbd2 superblock after we modify it, so this just means
      calling the lock_buffer() a little earlier.
      
      This checksum corruption problem can be reproduced by xfstests
      generic/475.
      Reported-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Suggested-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      91544201
  3. 21 5月, 2018 2 次提交
  4. 20 2月, 2018 1 次提交
  5. 19 2月, 2018 1 次提交
    • T
      ext4: pass -ESHUTDOWN code to jbd2 layer · fb7c0244
      Theodore Ts'o 提交于
      Previously the jbd2 layer assumed that a file system check would be
      required after a journal abort.  In the case of the deliberate file
      system shutdown, this should not be necessary.  Allow the jbd2 layer
      to distinguish between these two cases by using the ESHUTDOWN errno.
      
      Also add proper locking to __journal_abort_soft().
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      fb7c0244
  6. 18 12月, 2017 1 次提交
    • T
      ext4: fix up remaining files with SPDX cleanups · f5166768
      Theodore Ts'o 提交于
      A number of ext4 source files were skipped due because their copyright
      permission statements didn't match the expected text used by the
      automated conversion utilities.  I've added SPDX tags for the rest.
      
      While looking at some of these files, I've noticed that we have quite
      a bit of variation on the licenses that were used --- in particular
      some of the Red Hat licenses on the jbd2 files use a GPL2+ license,
      and we have some files that have a LGPL-2.1 license (which was quite
      surprising).
      
      I've not attempted to do any license changes.  Even if it is perfectly
      legal to relicense to GPL 2.0-only for consistency's sake, that should
      be done with ext4 developer community discussion.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      
      f5166768
  7. 03 11月, 2017 1 次提交
    • J
      ext4: Support for synchronous DAX faults · b8a6176c
      Jan Kara 提交于
      We return IOMAP_F_DIRTY flag from ext4_iomap_begin() when asked to
      prepare blocks for writing and the inode has some uncommitted metadata
      changes. In the fault handler ext4_dax_fault() we then detect this case
      (through VM_FAULT_NEEDDSYNC return value) and call helper
      dax_finish_sync_fault() to flush metadata changes and insert page table
      entry. Note that this will also dirty corresponding radix tree entry
      which is what we want - fsync(2) will still provide data integrity
      guarantees for applications not using userspace flushing. And
      applications using userspace flushing can avoid calling fsync(2) and
      thus avoid the performance overhead.
      Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      b8a6176c
  8. 19 10月, 2017 1 次提交
  9. 20 6月, 2017 1 次提交
  10. 04 5月, 2017 2 次提交
  11. 30 4月, 2017 2 次提交
    • J
      jbd2: fix dbench4 performance regression for 'nobarrier' mounts · 5052b069
      Jan Kara 提交于
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. Since
      JBD2 strips REQ_FUA and REQ_FLUSH flags from submitted IO when the
      filesystem is mounted with nobarrier mount option, journal superblock
      writes ended up being async writes after this patch and that caused
      heavy performance regression for dbench4 benchmark with high number of
      processes. In my test setup with HP RAID array with non-volatile write
      cache and 32 GB ram, dbench4 runs with 8 processes regressed by ~25%.
      
      Fix the problem by making sure journal superblock writes are always
      treated as synchronous since they generally block progress of the
      journalling machinery and thus the whole filesystem.
      
      Fixes: b685d3d6
      CC: stable@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      5052b069
    • J
      jbd2: Fix lockdep splat with generic/270 test · c52c47e4
      Jan Kara 提交于
      I've hit a lockdep splat with generic/270 test complaining that:
      
      3216.fsstress.b/3533 is trying to acquire lock:
       (jbd2_handle){++++..}, at: [<ffffffff813152e0>] jbd2_log_wait_commit+0x0/0x150
      
      but task is already holding lock:
       (jbd2_handle){++++..}, at: [<ffffffff8130bd3b>] start_this_handle+0x35b/0x850
      
      The underlying problem is that jbd2_journal_force_commit_nested()
      (called from ext4_should_retry_alloc()) may get called while a
      transaction handle is started. In such case it takes care to not wait
      for commit of the running transaction (which would deadlock) but only
      for a commit of a transaction that is already committing (which is safe
      as that doesn't wait for any filesystem locks).
      
      In fact there are also other callers of jbd2_log_wait_commit() that take
      care to pass tid of a transaction that is already committing and for
      those cases, the lockdep instrumentation is too restrictive and leading
      to false positive reports. Fix the problem by calling
      jbd2_might_wait_for_commit() from jbd2_log_wait_commit() only if the
      transaction isn't already committing.
      
      Fixes: 1eaa566dSigned-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      c52c47e4
  12. 19 4月, 2017 1 次提交
    • P
      mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU · 5f0d5a3a
      Paul E. McKenney 提交于
      A group of Linux kernel hackers reported chasing a bug that resulted
      from their assumption that SLAB_DESTROY_BY_RCU provided an existence
      guarantee, that is, that no block from such a slab would be reallocated
      during an RCU read-side critical section.  Of course, that is not the
      case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
      slab of blocks.
      
      However, there is a phrase for this, namely "type safety".  This commit
      therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
      to avoid future instances of this sort of confusion.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      [ paulmck: Add comments mentioning the old name, as requested by Eric
        Dumazet, in order to help people familiar with the old name find
        the new one. ]
      Acked-by: NDavid Rientjes <rientjes@google.com>
      5f0d5a3a
  13. 16 3月, 2017 1 次提交
  14. 02 2月, 2017 1 次提交
    • S
      jbd2: fix use after free in kjournald2() · dbfcef6b
      Sahitya Tummala 提交于
      Below is the synchronization issue between unmount and kjournald2
      contexts, which results into use after free issue in kjournald2().
      Fix this issue by using journal->j_state_lock to synchronize the
      wait_event() done in journal_kill_thread() and the wake_up() done
      in kjournald2().
      
      TASK 1:
      umount cmd:
         |--jbd2_journal_destroy() {
             |--journal_kill_thread() {
                  write_lock(&journal->j_state_lock);
      	    journal->j_flags |= JBD2_UNMOUNT;
      	    ...
      	    write_unlock(&journal->j_state_lock);
      	    wake_up(&journal->j_wait_commit);	   TASK 2 wakes up here:
      	    					   kjournald2() {
      						     ...
      						     checks JBD2_UNMOUNT flag and calls goto end-loop;
      						     ...
      						     end_loop:
      						       write_unlock(&journal->j_state_lock);
      						       journal->j_task = NULL; --> If this thread gets
      						       pre-empted here, then TASK 1 wait_event will
      						       exit even before this thread is completely
      						       done.
      	    wait_event(journal->j_wait_done_commit, journal->j_task == NULL);
      	    ...
      	    write_lock(&journal->j_state_lock);
      	    write_unlock(&journal->j_state_lock);
      	  }
             |--kfree(journal);
           }
      }
      						       wake_up(&journal->j_wait_done_commit); --> this step
      						       now results into use after free issue.
      						   }
      Signed-off-by: NSahitya Tummala <stummala@codeaurora.org>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      dbfcef6b
  15. 14 1月, 2017 1 次提交
    • T
      fs/jbd2, locking/mutex, sched/wait: Use mutex_lock_io() for journal->j_checkpoint_mutex · 6fa7aa50
      Tejun Heo 提交于
      When an ext4 fs is bogged down by a lot of metadata IOs (in the
      reported case, it was deletion of millions of files, but any massive
      amount of journal writes would do), after the journal is filled up,
      tasks which try to access the filesystem and aren't currently
      performing the journal writes end up waiting in
      __jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.
      
      Because those mutex sleeps aren't marked as iowait, this condition can
      lead to misleadingly low iowait and /proc/stat:procs_blocked.  While
      iowait propagation is far from strict, this condition can be triggered
      fairly easily and annotating these sleeps correctly helps initial
      diagnosis quite a bit.
      
      Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
      these sleeps are properly marked as iowait.
      Reported-by: NMingbo Wan <mingbo@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team@fb.com
      Link: http://lkml.kernel.org/r/1477673892-28940-5-git-send-email-tj@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6fa7aa50
  16. 25 12月, 2016 1 次提交
  17. 01 11月, 2016 1 次提交
  18. 16 9月, 2016 1 次提交
  19. 30 6月, 2016 2 次提交
    • J
      jbd2: track more dependencies on transaction commit · 1eaa566d
      Jan Kara 提交于
      So far we were tracking only dependency on transaction commit due to
      starting a new handle (which may require commit to start a new
      transaction). Now add tracking also for other cases where we wait for
      transaction commit. This way lockdep can catch deadlocks e. g. because we
      call jbd2_journal_stop() for a synchronous handle with some locks held
      which rank below transaction start.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      1eaa566d
    • J
      jbd2: move lockdep tracking to journal_s · ab714aff
      Jan Kara 提交于
      Currently lockdep map is tracked in each journal handle. To be able to
      expand lockdep support to cover also other cases where we depend on
      transaction commit and where handle is not available, move lockdep map
      into struct journal_s. Since this makes the lockdep map shared for all
      handles, we have to use rwsem_acquire_read() for acquisitions now.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ab714aff
  20. 25 6月, 2016 1 次提交
    • M
      jbd2: get rid of superfluous __GFP_REPEAT · f2db1971
      Michal Hocko 提交于
      jbd2_alloc is explicit about its allocation preferences wrt.  the
      allocation size.  Sub page allocations go to the slab allocator and
      larger are using either the page allocator or vmalloc.  This is all good
      but the logic is unnecessarily complex.
      
      1) as per Ted, the vmalloc fallback is a left-over:
      
       : jbd2_alloc is only passed in the bh->b_size, which can't be PAGE_SIZE, so
       : the code path that calls vmalloc() should never get called.  When we
       : conveted jbd2_alloc() to suppor sub-page size allocations in commit
       : d2eecb03, there was an assumption that it could be called with a size
       : greater than PAGE_SIZE, but that's certaily not true today.
      
      Moreover vmalloc allocation might even lead to a deadlock because the
      callers expect GFP_NOFS context while vmalloc is GFP_KERNEL.
      
      2) __GFP_REPEAT for requests <= PAGE_ALLOC_COSTLY_ORDER is ignored
         since the flag was introduced.
      
      Let's simplify the code flow and use the slab allocator for sub-page
      requests and the page allocator for others.  Even though order > 0 is
      not currently used as per above leave that option open.
      
      Link: http://lkml.kernel.org/r/1464599699-30131-18-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2db1971
  21. 08 6月, 2016 3 次提交
  22. 24 4月, 2016 1 次提交
    • J
      jbd2: add support for avoiding data writes during transaction commits · 41617e1a
      Jan Kara 提交于
      Currently when filesystem needs to make sure data is on permanent
      storage before committing a transaction it adds inode to transaction's
      inode list. During transaction commit, jbd2 writes back all dirty
      buffers that have allocated underlying blocks and waits for the IO to
      finish. However when doing writeback for delayed allocated data, we
      allocate blocks and immediately submit the data. Thus asking jbd2 to
      write dirty pages just unnecessarily adds more work to jbd2 possibly
      writing back other redirtied blocks.
      
      Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
      outstanding data writes before committing a transaction and thus avoid
      unnecessary writes.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      41617e1a
  23. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  24. 10 3月, 2016 1 次提交
    • O
      jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path · c0a2ad9b
      OGAWA Hirofumi 提交于
      On umount path, jbd2_journal_destroy() writes latest transaction ID
      (->j_tail_sequence) to be used at next mount.
      
      The bug is that ->j_tail_sequence is not holding latest transaction ID
      in some cases. So, at next mount, there is chance to conflict with
      remaining (not overwritten yet) transactions.
      
      	mount (id=10)
      	write transaction (id=11)
      	write transaction (id=12)
      	umount (id=10) <= the bug doesn't write latest ID
      
      	mount (id=10)
      	write transaction (id=11)
      	crash
      
      	mount
      	[recovery process]
      		transaction (id=11)
      		transaction (id=12) <= valid transaction ID, but old commit
                                             must not replay
      
      Like above, this bug become the cause of recovery failure, or FS
      corruption.
      
      So why ->j_tail_sequence doesn't point latest ID?
      
      Because if checkpoint transactions was reclaimed by memory pressure
      (i.e. bdev_try_to_free_page()), then ->j_tail_sequence is not updated.
      (And another case is, __jbd2_journal_clean_checkpoint_list() is called
      with empty transaction.)
      
      So in above cases, ->j_tail_sequence is not pointing latest
      transaction ID at umount path. Plus, REQ_FLUSH for checkpoint is not
      done too.
      
      So, to fix this problem with minimum changes, this patch updates
      ->j_tail_sequence, and issue REQ_FLUSH.  (With more complex changes,
      some optimizations would be possible to avoid unnecessary REQ_FLUSH
      for example though.)
      
      BTW,
      
      	journal->j_tail_sequence =
      		++journal->j_transaction_sequence;
      
      Increment of ->j_transaction_sequence seems to be unnecessary, but
      ext3 does this.
      Signed-off-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      c0a2ad9b
  25. 23 2月, 2016 3 次提交
  26. 19 10月, 2015 1 次提交
    • D
      ext4, jbd2: ensure entering into panic after recording an error in superblock · 4327ba52
      Daeho Jeong 提交于
      If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
      journaling will be aborted first and the error number will be recorded
      into JBD2 superblock and, finally, the system will enter into the
      panic state in "errors=panic" option.  But, in the rare case, this
      sequence is little twisted like the below figure and it will happen
      that the system enters into panic state, which means the system reset
      in mobile environment, before completion of recording an error in the
      journal superblock. In this case, e2fsck cannot recognize that the
      filesystem failure occurred in the previous run and the corruption
      wouldn't be fixed.
      
      Task A                        Task B
      ext4_handle_error()
      -> jbd2_journal_abort()
        -> __journal_abort_soft()
          -> __jbd2_journal_abort_hard()
          | -> journal->j_flags |= JBD2_ABORT;
          |
          |                         __ext4_abort()
          |                         -> jbd2_journal_abort()
          |                         | -> __journal_abort_soft()
          |                         |   -> if (journal->j_flags & JBD2_ABORT)
          |                         |           return;
          |                         -> panic()
          |
          -> jbd2_journal_update_sb_errno()
      Tested-by: NHobin Woo <hobin.woo@samsung.com>
      Signed-off-by: NDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      4327ba52
  27. 18 10月, 2015 2 次提交
  28. 15 10月, 2015 1 次提交
  29. 29 7月, 2015 1 次提交
  30. 23 7月, 2015 1 次提交
    • D
      ext4, jbd2: add REQ_FUA flag when recording an error in the superblock · 564bc402
      Daeho Jeong 提交于
      When an error condition is detected, an error status should be recorded into
      superblocks of EXT4 or JBD2. However, the write request is submitted now
      without REQ_FUA flag, even in "barrier=1" mode, which is followed by
      panic() function in "errors=panic" mode. On mobile devices which make
      whole system reset as soon as kernel panic occurs, this write request
      containing an error flag will disappear just from storage cache without
      written to the physical cells. Therefore, when next start, even forever,
      the error flag cannot be shown in both superblocks, and e2fsck cannot fix
      the filesystem problems automatically, unless e2fsck is executed in
      force checking mode.
      
      [ Changed use test_opt(sb, BARRIER) of checking the journal flags -- TYT ]
      Signed-off-by: NDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      564bc402
  31. 26 6月, 2015 1 次提交