1. 06 Nov, 2019 3 commits
  2. 21 Oct, 2019 2 commits
    • jbd2: Free journal head outside of locked region · 7855a57d
      Committed by Thomas Gleixner
      On PREEMPT_RT bit-spinlocks have the same semantics as on PREEMPT_RT=n,
      i.e. they disable preemption. That means functions which are not safe to be
      called in preempt disabled context on RT trigger a might_sleep() assert.
      
      The journal head bit spinlock is mostly held for short code sequences with
      trivial RT safe functionality, except for one place:
      
      jbd2_journal_put_journal_head() invokes __journal_remove_journal_head()
      with the journal head bit spinlock held. __journal_remove_journal_head()
      invokes kmem_cache_free() which must not be called with preemption disabled
      on RT.
      
      Jan suggested to rework the removal function so the actual free happens
      outside the bit-spinlocked region.
      
      Split it into two parts:
      
        - Do the sanity checks and the buffer head detach under the lock
      
        - Do the actual free after dropping the lock
      
      There is error case handling in the free part which needs to dereference
      the b_size field of the now detached buffer head. Due to paranoia (caused
      by ignorance) the size is retrieved in the detach function and handed into
      the free function. Might be over-engineered, but better safe than sorry.
      
      This makes the journal head bit-spinlock usage RT compliant and also avoids
      nested locking which is not covered by lockdep.
      Suggested-by: Jan Kara <jack@suse.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-ext4@vger.kernel.org
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20190809124233.13277-8-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
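The two-part split described in this commit can be sketched in user-space C. This is only an illustration under stated assumptions: `struct bhead`, `struct jhead`, `detach_locked()`, `free_detached()` and `put_jhead()` are hypothetical names, and a pthread mutex stands in for the journal head bit spinlock. It is not the kernel implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a journal head and its buffer head. */
struct jhead {
    int refcount;
};

struct bhead {
    pthread_mutex_t lock;   /* stands in for the journal head bit spinlock */
    size_t size;            /* analogous to b_size */
    struct jhead *jh;
};

/* Part 1: runs under the lock. Sanity checks and detach only, no free. */
static struct jhead *detach_locked(struct bhead *bh, size_t *size_out)
{
    struct jhead *jh = bh->jh;

    if (jh == NULL)
        return NULL;
    assert(jh->refcount > 0);   /* sanity check done under the lock */
    if (--jh->refcount > 0)
        return NULL;            /* still referenced, nothing to free */
    *size_out = bh->size;       /* capture the size while still attached */
    bh->jh = NULL;
    return jh;
}

/* Part 2: runs with no lock held, so it is allowed to sleep. */
static void free_detached(struct jhead *jh, size_t size)
{
    (void)size;                 /* only needed for error handling */
    free(jh);
}

/* Drop one reference; returns true if the journal head was freed. */
static bool put_jhead(struct bhead *bh)
{
    size_t size = 0;
    struct jhead *victim;

    pthread_mutex_lock(&bh->lock);
    victim = detach_locked(bh, &size);
    pthread_mutex_unlock(&bh->lock);

    if (victim == NULL)
        return false;
    free_detached(victim, size);
    return true;
}
```

The size is copied out in the detach step so that the free step never has to touch the buffer head again once the lock is dropped.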
    • jbd2: Make state lock a spinlock · 46417064
      Committed by Thomas Gleixner
      Bit-spinlocks are problematic on PREEMPT_RT if functions which might sleep
      on RT, e.g. spin_lock(), alloc/free(), are invoked inside the lock held
      region because bit spinlocks disable preemption even on RT.
      
      A first attempt was to replace state lock with a spinlock placed in struct
      buffer_head and make the locking conditional on PREEMPT_RT and
      DEBUG_BIT_SPINLOCKS.
      
      Jan pointed out that there is a 4 byte hole in struct journal_head where a
      regular spinlock fits in and he would not object to converting the state
      lock to a spinlock unconditionally.
      
      Aside from solving the RT problem, this also gains lockdep coverage for the
      journal head state lock (bit-spinlocks are not covered by lockdep as it's
      hard to fit a lockdep map into a single bit).
      
      The trivial change would have been to convert the jbd_*lock_bh_state()
      inlines, but that comes with the downside that these functions take a
      buffer head pointer which needs to be converted to a journal head pointer
      which adds another level of indirection.
      
      As almost all functions which use this lock have a journal head pointer
      readily available, it makes more sense to remove the lock helper inlines
      and write out spin_*lock() at all call sites.
      
      Fixup all locking comments as well.
      Suggested-by: Jan Kara <jack@suse.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Jan Kara <jack@suse.com>
      Cc: linux-ext4@vger.kernel.org
      Link: https://lore.kernel.org/r/20190809124233.13277-7-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
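The "4 byte hole" argument can be illustrated with a small layout sketch. The structs below are hypothetical and are not `struct journal_head`; `lock` is a plain 32-bit word standing in for a 4-byte spinlock, and the 8-byte alignment is forced explicitly so the padding is the same on every ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with a 4-byte padding hole after `count`. */
struct head_before {
    int32_t count;
    /* 4 bytes of padding here because `owner` is 8-byte aligned */
    _Alignas(8) uint64_t owner;
};

/* Same layout with a 4-byte lock word placed into the hole. */
struct head_after {
    int32_t count;
    uint32_t lock;              /* stands in for a 4-byte spinlock */
    _Alignas(8) uint64_t owner;
};

/* Bytes of padding between `count` and `owner` in head_before. */
static size_t hole_size(void)
{
    return offsetof(struct head_before, owner)
         - (offsetof(struct head_before, count) + sizeof(int32_t));
}

/* Filling the hole must not grow the structure. */
static_assert(sizeof(struct head_before) == sizeof(struct head_after),
              "lock word should fit into the existing padding");
```

Because the new member only occupies bytes that were padding before, the structure size stays the same.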
  3. 25 Sep, 2019 1 commit
  4. 21 Jun, 2019 2 commits
    • jbd2: drop declaration of journal_sync_buffer() · 9382cde8
      Committed by Theodore Ts'o
      The journal_sync_buffer() function was never carried over from jbd to
      jbd2.  So get rid of the vestigial declaration of this (non-existent)
      function.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    • jbd2: introduce jbd2_inode dirty range scoping · 6ba0e7dc
      Committed by Ross Zwisler
      Currently both journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() operate on the entire address space
      of each of the inodes associated with a given journal entry.  The
      consequence of this is that if we have an inode where we are constantly
      appending dirty pages we can end up waiting for an indefinite amount of
      time in journal_finish_inode_data_buffers() while we wait for all the
      pages under writeback to be written out.
      
      The easiest way to cause this type of workload is to just dd from
      /dev/zero to a file until it fills the entire filesystem.  This can
      cause journal_finish_inode_data_buffers() to wait for the duration of
      the entire dd operation.
      
      We can improve this situation by scoping each of the inode dirty ranges
      associated with a given transaction.  We do this via the jbd2_inode
      structure so that the scoping is contained within jbd2 and so that it
      follows the lifetime and locking rules for that structure.
      
      This allows us to limit the writeback & wait in
      journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() respectively to the dirty range for
      a given struct jbd2_inode, keeping us from waiting forever if the inode
      in question is still being appended to.
      Signed-off-by: Ross Zwisler <zwisler@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
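A minimal sketch of the dirty range scoping idea, assuming a per-inode record that remembers only the lowest and highest dirty byte added for a transaction. `struct inode_range`, `range_add()` and `range_covers()` are hypothetical names for illustration, not the fields or helpers of `struct jbd2_inode`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode dirty range tracked for one transaction. */
struct inode_range {
    bool     dirty;     /* false until the first range is added */
    uint64_t start;     /* first dirty byte */
    uint64_t end;       /* last dirty byte, inclusive */
};

/* Widen the tracked range so that it also covers [start, end]. */
static void range_add(struct inode_range *r, uint64_t start, uint64_t end)
{
    assert(start <= end);
    if (!r->dirty) {
        r->dirty = true;
        r->start = start;
        r->end = end;
        return;
    }
    if (start < r->start)
        r->start = start;
    if (end > r->end)
        r->end = end;
}

/* True if the byte at `pos` falls inside the range this commit waits on. */
static bool range_covers(const struct inode_range *r, uint64_t pos)
{
    return r->dirty && pos >= r->start && pos <= r->end;
}
```

Pages appended beyond `end` after the range was recorded fall outside it, so a commit that limits its wait to this range does not keep waiting on a file that is still growing.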
  5. 31 May, 2019 1 commit
  6. 11 May, 2019 1 commit
  7. 07 Apr, 2019 1 commit
  8. 15 Feb, 2019 2 commits
    • jbd2: fold jbd2_superblock_csum_{verify,set} into their callers · a58ca992
      Committed by Theodore Ts'o
      The functions jbd2_superblock_csum_verify() and
      jbd2_superblock_csum_set() only get called from one location, so to
      simplify things, fold them into their callers.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • jbd2: fix race when writing superblock · 538bcaa6
      Committed by Theodore Ts'o
      The jbd2 superblock is lockless now, so there is probably a race
      condition between writing it to disk and modifying its contents, which
      may lead to a checksum error. The following race is the one case that we
      have captured.
      
      jbd2                                fsstress
      jbd2_journal_commit_transaction
       jbd2_journal_update_sb_log_tail
        jbd2_write_superblock
         jbd2_superblock_csum_set         jbd2_journal_revoke
                                           jbd2_journal_set_features(revoke)
                                           modify superblock
         submit_bh(checksum incorrect)
      
      Fix this by locking the buffer head before modifying it.  We always
      write the jbd2 superblock after we modify it, so this just means
      calling the lock_buffer() a little earlier.
      
      This checksum corruption problem can be reproduced by xfstests
      generic/475.
      Reported-by: zhangyi (F) <yi.zhang@huawei.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
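The ordering that this fix relies on (lock, modify, checksum, write, unlock) can be sketched as follows. Everything here is an assumption made for illustration: `struct sb_buffer`, `struct sb_image`, `toy_csum()` and `sb_set_feature_and_write()` are hypothetical, a pthread mutex stands in for the buffer lock, a struct copy stands in for submitting the buffer, and the checksum is a toy function rather than the real algorithm.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical superblock contents, in memory and as written to disk. */
struct sb_image {
    uint32_t features;
    uint32_t csum;
};

struct sb_buffer {
    pthread_mutex_t lock;       /* stands in for the buffer lock */
    struct sb_image data;
};

/* Toy checksum for illustration only. */
static uint32_t toy_csum(uint32_t features)
{
    return features * 2654435761u + 1u;
}

/*
 * Take the lock BEFORE modifying the superblock, so that no other writer
 * can change the contents between the checksum update and the write.
 */
static void sb_set_feature_and_write(struct sb_buffer *sb, uint32_t feature,
                                     struct sb_image *disk)
{
    pthread_mutex_lock(&sb->lock);
    sb->data.features |= feature;
    sb->data.csum = toy_csum(sb->data.features);
    *disk = sb->data;           /* stands in for submitting the write */
    assert(disk->csum == toy_csum(disk->features));
    pthread_mutex_unlock(&sb->lock);
}
```

In the racy version of this flow the modification happens outside the locked region, so the image that reaches disk can carry a checksum computed over older contents.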
  9. 01 Feb, 2019 1 commit
    • jbd2: fix deadlock while checkpoint thread waits commit thread to finish · 53cf9784
      Committed by Xiaoguang Wang
      This issue was found when I tried to put checkpoint work in a separate
      thread; the deadlock below happened:
               Thread1                                |   Thread2
      __jbd2_log_wait_for_space                       |
      jbd2_log_do_checkpoint (hold j_checkpoint_mutex)|
        if (jh->b_transaction != NULL)                |
          ...                                         |
          jbd2_log_start_commit(journal, tid);        |jbd2_update_log_tail
                                                      |  will lock j_checkpoint_mutex,
                                                      |  but will be blocked here.
                                                      |
          jbd2_log_wait_commit(journal, tid);         |
          wait_event(journal->j_wait_done_commit,     |
           !tid_gt(tid, journal->j_commit_sequence)); |
           ...                                        |wake_up(j_wait_done_commit)
        }                                             |
      
      Then a deadlock occurs: Thread1 will never be woken up.
      
      To fix this issue, drop j_checkpoint_mutex in jbd2_log_do_checkpoint()
      when we are going to wait for transaction commit.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
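The wait condition in the trace above, `!tid_gt(tid, journal->j_commit_sequence)`, depends on comparing transaction IDs in a way that survives counter wraparound. A minimal sketch of such a comparison, assuming 32-bit IDs; `tid_sketch_t`, `tid_after()` and `commit_done()` are hypothetical names, not the jbd2 helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: transaction IDs are 32-bit counters that may wrap around. */
typedef uint32_t tid_sketch_t;

static_assert(sizeof(tid_sketch_t) == 4, "sketch assumes 32-bit IDs");

/*
 * True if x is later than y. The unsigned difference is reinterpreted as
 * signed, so IDs less than 2^31 apart compare correctly across a wrap.
 * The conversion to int32_t relies on two's complement behaviour, which
 * GCC and Clang provide.
 */
static bool tid_after(tid_sketch_t x, tid_sketch_t y)
{
    return (int32_t)(x - y) > 0;
}

/* The waiter may stop once the commit sequence has reached `tid`. */
static bool commit_done(tid_sketch_t tid, tid_sketch_t commit_sequence)
{
    return !tid_after(tid, commit_sequence);
}
```

The deadlock in this commit is unrelated to the comparison itself: the waiter holds `j_checkpoint_mutex`, so the thread that would advance the commit sequence and issue the wakeup can never get that far.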
  10. 21 May, 2018 2 commits
  11. 20 Feb, 2018 1 commit
  12. 19 Feb, 2018 1 commit
    • ext4: pass -ESHUTDOWN code to jbd2 layer · fb7c0244
      Committed by Theodore Ts'o
      Previously the jbd2 layer assumed that a file system check would be
      required after a journal abort.  In the case of the deliberate file
      system shutdown, this should not be necessary.  Allow the jbd2 layer
      to distinguish between these two cases by using the ESHUTDOWN errno.
      
      Also add proper locking to __journal_abort_soft().
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  13. 18 Dec, 2017 1 commit
    • ext4: fix up remaining files with SPDX cleanups · f5166768
      Committed by Theodore Ts'o
      A number of ext4 source files were skipped because their copyright
      permission statements didn't match the expected text used by the
      automated conversion utilities.  I've added SPDX tags for the rest.
      
      While looking at some of these files, I've noticed that we have quite
      a bit of variation on the licenses that were used --- in particular
      some of the Red Hat licenses on the jbd2 files use a GPL2+ license,
      and we have some files that have a LGPL-2.1 license (which was quite
      surprising).
      
      I've not attempted to do any license changes.  Even if it is perfectly
      legal to relicense to GPL 2.0-only for consistency's sake, that should
      be done with ext4 developer community discussion.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  14. 03 Nov, 2017 1 commit
    • ext4: Support for synchronous DAX faults · b8a6176c
      Committed by Jan Kara
      We return IOMAP_F_DIRTY flag from ext4_iomap_begin() when asked to
      prepare blocks for writing and the inode has some uncommitted metadata
      changes. In the fault handler ext4_dax_fault() we then detect this case
      (through VM_FAULT_NEEDDSYNC return value) and call helper
      dax_finish_sync_fault() to flush metadata changes and insert page table
      entry. Note that this will also dirty corresponding radix tree entry
      which is what we want - fsync(2) will still provide data integrity
      guarantees for applications not using userspace flushing. And
      applications using userspace flushing can avoid calling fsync(2) and
      thus avoid the performance overhead.
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  15. 19 Oct, 2017 1 commit
  16. 20 Jun, 2017 1 commit
  17. 04 May, 2017 2 commits
  18. 30 Apr, 2017 2 commits
    • jbd2: fix dbench4 performance regression for 'nobarrier' mounts · 5052b069
      Committed by Jan Kara
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. Since
      JBD2 strips REQ_FUA and REQ_FLUSH flags from submitted IO when the
      filesystem is mounted with nobarrier mount option, journal superblock
      writes ended up being async writes after this patch and that caused
      heavy performance regression for dbench4 benchmark with high number of
      processes. In my test setup with HP RAID array with non-volatile write
      cache and 32 GB ram, dbench4 runs with 8 processes regressed by ~25%.
      
      Fix the problem by making sure journal superblock writes are always
      treated as synchronous since they generally block progress of the
      journalling machinery and thus the whole filesystem.
      
      Fixes: b685d3d6
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • jbd2: Fix lockdep splat with generic/270 test · c52c47e4
      Committed by Jan Kara
      I've hit a lockdep splat with generic/270 test complaining that:
      
      3216.fsstress.b/3533 is trying to acquire lock:
       (jbd2_handle){++++..}, at: [<ffffffff813152e0>] jbd2_log_wait_commit+0x0/0x150
      
      but task is already holding lock:
       (jbd2_handle){++++..}, at: [<ffffffff8130bd3b>] start_this_handle+0x35b/0x850
      
      The underlying problem is that jbd2_journal_force_commit_nested()
      (called from ext4_should_retry_alloc()) may get called while a
      transaction handle is started. In such case it takes care to not wait
      for commit of the running transaction (which would deadlock) but only
      for a commit of a transaction that is already committing (which is safe
      as that doesn't wait for any filesystem locks).
      
      In fact there are also other callers of jbd2_log_wait_commit() that take
      care to pass tid of a transaction that is already committing and for
      those cases, the lockdep instrumentation is too restrictive and leads
      to false positive reports. Fix the problem by calling
      jbd2_might_wait_for_commit() from jbd2_log_wait_commit() only if the
      transaction isn't already committing.
      
      Fixes: 1eaa566d
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  19. 19 Apr, 2017 1 commit
    • mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU · 5f0d5a3a
      Committed by Paul E. McKenney
      A group of Linux kernel hackers reported chasing a bug that resulted
      from their assumption that SLAB_DESTROY_BY_RCU provided an existence
      guarantee, that is, that no block from such a slab would be reallocated
      during an RCU read-side critical section.  Of course, that is not the
      case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
      slab of blocks.
      
      However, there is a phrase for this, namely "type safety".  This commit
      therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
      to avoid future instances of this sort of confusion.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      [ paulmck: Add comments mentioning the old name, as requested by Eric
        Dumazet, in order to help people familiar with the old name find
        the new one. ]
      Acked-by: David Rientjes <rientjes@google.com>
  20. 16 Mar, 2017 1 commit
  21. 02 Feb, 2017 1 commit
    • jbd2: fix use after free in kjournald2() · dbfcef6b
      Committed by Sahitya Tummala
      Below is the synchronization issue between the unmount and kjournald2
      contexts, which results in a use-after-free in kjournald2().
      Fix this issue by using journal->j_state_lock to synchronize the
      wait_event() done in journal_kill_thread() and the wake_up() done
      in kjournald2().
      
      TASK 1:
      umount cmd:
         |--jbd2_journal_destroy() {
             |--journal_kill_thread() {
                  write_lock(&journal->j_state_lock);
      	    journal->j_flags |= JBD2_UNMOUNT;
      	    ...
      	    write_unlock(&journal->j_state_lock);
      	    wake_up(&journal->j_wait_commit);	   TASK 2 wakes up here:
      	    					   kjournald2() {
      						     ...
      						     checks JBD2_UNMOUNT flag and calls goto end-loop;
      						     ...
      						     end_loop:
      						       write_unlock(&journal->j_state_lock);
      						       journal->j_task = NULL; --> If this thread gets
      						       pre-empted here, then TASK 1 wait_event will
      						       exit even before this thread is completely
      						       done.
      	    wait_event(journal->j_wait_done_commit, journal->j_task == NULL);
      	    ...
      	    write_lock(&journal->j_state_lock);
      	    write_unlock(&journal->j_state_lock);
      	  }
             |--kfree(journal);
           }
      }
      						       wake_up(&journal->j_wait_done_commit); --> this step
      						       now results into use after free issue.
      						   }
      Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  22. 14 Jan, 2017 1 commit
    • fs/jbd2, locking/mutex, sched/wait: Use mutex_lock_io() for journal->j_checkpoint_mutex · 6fa7aa50
      Committed by Tejun Heo
      When an ext4 fs is bogged down by a lot of metadata IOs (in the
      reported case, it was deletion of millions of files, but any massive
      amount of journal writes would do), after the journal is filled up,
      tasks which try to access the filesystem and aren't currently
      performing the journal writes end up waiting in
      __jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.
      
      Because those mutex sleeps aren't marked as iowait, this condition can
      lead to misleadingly low iowait and /proc/stat:procs_blocked.  While
      iowait propagation is far from strict, this condition can be triggered
      fairly easily and annotating these sleeps correctly helps initial
      diagnosis quite a bit.
      
      Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
      these sleeps are properly marked as iowait.
      Reported-by: Mingbo Wan <mingbo@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team@fb.com
      Link: http://lkml.kernel.org/r/1477673892-28940-5-git-send-email-tj@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  23. 25 Dec, 2016 1 commit
  24. 01 Nov, 2016 1 commit
  25. 16 Sep, 2016 1 commit
  26. 30 Jun, 2016 2 commits
    • jbd2: track more dependencies on transaction commit · 1eaa566d
      Committed by Jan Kara
      So far we were tracking only dependency on transaction commit due to
      starting a new handle (which may require commit to start a new
      transaction). Now add tracking also for other cases where we wait for
      transaction commit. This way lockdep can catch deadlocks e.g. because we
      call jbd2_journal_stop() for a synchronous handle with some locks held
      which rank below transaction start.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • jbd2: move lockdep tracking to journal_s · ab714aff
      Committed by Jan Kara
      Currently lockdep map is tracked in each journal handle. To be able to
      expand lockdep support to cover also other cases where we depend on
      transaction commit and where handle is not available, move lockdep map
      into struct journal_s. Since this makes the lockdep map shared for all
      handles, we have to use rwsem_acquire_read() for acquisitions now.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  27. 25 Jun, 2016 1 commit
    • jbd2: get rid of superfluous __GFP_REPEAT · f2db1971
      Committed by Michal Hocko
      jbd2_alloc is explicit about its allocation preferences wrt.  the
      allocation size.  Sub page allocations go to the slab allocator and
      larger are using either the page allocator or vmalloc.  This is all good
      but the logic is unnecessarily complex.
      
      1) as per Ted, the vmalloc fallback is a left-over:
      
       : jbd2_alloc is only passed in the bh->b_size, which can't be PAGE_SIZE, so
       : the code path that calls vmalloc() should never get called.  When we
       : converted jbd2_alloc() to support sub-page size allocations in commit
       : d2eecb03, there was an assumption that it could be called with a size
       : greater than PAGE_SIZE, but that's certainly not true today.
      
      Moreover vmalloc allocation might even lead to a deadlock because the
      callers expect GFP_NOFS context while vmalloc is GFP_KERNEL.
      
      2) __GFP_REPEAT for requests <= PAGE_ALLOC_COSTLY_ORDER is ignored
         since the flag was introduced.
      
      Let's simplify the code flow and use the slab allocator for sub-page
      requests and the page allocator for others.  Even though order > 0 is
      not currently used as per above leave that option open.
      
      Link: http://lkml.kernel.org/r/1464599699-30131-18-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
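The simplified decision described in this commit (slab allocator for sub-page requests, page allocator for everything else) can be sketched as a pure function. `SKETCH_PAGE_SIZE`, `enum alloc_path`, `struct alloc_plan`, `order_for()` and `plan_alloc()` are hypothetical names, and the 4096-byte page size is an assumption; no real allocation is performed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size for this sketch */

enum alloc_path { ALLOC_SLAB, ALLOC_PAGES };

struct alloc_plan {
    enum alloc_path path;
    unsigned order;              /* page order, meaningful for ALLOC_PAGES */
};

/* Smallest order such that (page size << order) >= size. */
static unsigned order_for(size_t size)
{
    unsigned order = 0;
    size_t span = SKETCH_PAGE_SIZE;

    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Sub-page requests go to the slab path, everything else to pages. */
static struct alloc_plan plan_alloc(size_t size)
{
    struct alloc_plan plan = { ALLOC_SLAB, 0 };

    assert(size > 0);
    if (size >= SKETCH_PAGE_SIZE) {
        plan.path = ALLOC_PAGES;
        plan.order = order_for(size);
    }
    return plan;
}
```

With only two paths there is no vmalloc fallback left, which matches the point that the fallback could never be reached and could violate the callers' GFP_NOFS expectations.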
  28. 08 Jun, 2016 3 commits
  29. 24 Apr, 2016 1 commit
    • jbd2: add support for avoiding data writes during transaction commits · 41617e1a
      Committed by Jan Kara
      Currently when filesystem needs to make sure data is on permanent
      storage before committing a transaction it adds inode to transaction's
      inode list. During transaction commit, jbd2 writes back all dirty
      buffers that have allocated underlying blocks and waits for the IO to
      finish. However when doing writeback for delayed allocated data, we
      allocate blocks and immediately submit the data. Thus asking jbd2 to
      write dirty pages just unnecessarily adds more work to jbd2 possibly
      writing back other redirtied blocks.
      
      Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
      outstanding data writes before committing a transaction and thus avoid
      unnecessary writes.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>