1. 18 3月, 2020 2 次提交
    • X
      alinux: jbd2: track slow handle which is preventing transaction committing · 83cd9d23
      Xiaoguang Wang 提交于
      While transaction is going to commit, it first sets its state to be
      T_LOCKED and waits all outstanding handles to complete, and the
      committing transaction will always be in locked state so long as it
      has outstanding handles, also the whole fs will be locked and all later
      fs modification operations will be stucked in wait_transaction_locked().
      
      It's hard to tell why handles are that slow, so here we add a new staic
      tracepoint to track such slow handle, and show io wait time and sched
      wait time, output likes below:
        fsstress-20347 [024] ....  1570.305454: jbd2_slow_handle_stats: dev 254,17
      tid 15853 type 4 line_no 3101 interval 126 sync 0 requested_blocks 24
      dirtied_blocks 0 trans_wait 122 space_wait 0 sched_wait 0 io_wait 126
      
      "trans_wait 122" means that this current committing transaction has been
      locked for 122ms, due to this handle is not completed quickly.
      
      From "io_wait 126", we can see that io is the major reason.
      
      In this patch, we also add a per fs control file used to determine
      whether a handle can be considered to be slow.
          /proc/fs/jbd2/vdb1-8/stall_thresh
      default value is 100ms, users can set new threshold by echoing new value
      to this file.
      
      Later I also plan to add a proc file fs per fs to record these info.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      83cd9d23
    • X
      alinux: fs: record page or bio info while process is waitting on it · 2864376f
      Xiaoguang Wang 提交于
      If one process context is stucked in wait_on_buffer(), lock_buffer(),
      lock_page() and wait_on_page_writeback() and wait_on_bit_io(), it's
      hard to tell ture reason, for example, whether this page is under io,
      or this page is just locked too long by other process context.
      
      Normally io request has multiple bios, and every bio contains multiple
      pages which will hold data to be read from or written to device, so here
      we record page info or bio info in task_struct while process calls
      lock_page(), lock_buffer(), wait_on_page_writeback(), wait_on_buffer()
      and wait_on_bit_io(), we add a new proce interface:
      [lege@localhost linux]$ cat /proc/4516/wait_res
      1 ffffd0969f95d3c0 4295369599 4295381596
      
      Above info means that thread 4516 is waitting on a page, address is
      ffffd0969f95d3c0, and has waited for 11997ms.
      
      First field denotes the page address process is waitting on.
      Second field denotes the wait moment and the third denotes current moment.
      
      In practice, if we found a process waitting on one page for too long time,
      we can get page's address by reading /proc/$pid/wait_page, and search this
      page address in all block devices' /sys/kernel/debug/block/${devname}/rq_hang,
      if search operation hits one, we can get the request and know why this io
      request hangs that long.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2864376f
  2. 17 1月, 2020 1 次提交
    • Z
      jbd2: discard dirty data when forgetting an un-journalled buffer · 3aeefb45
      zhangyi (F) 提交于
      commit 597599268e3b91cac71faf48743f4783dec682fc upstream.
      
      We do not unmap and clear dirty flag when forgetting a buffer without
      journal or does not belongs to any transaction, so the invalid dirty
      data may still be written to the disk later. It's fine if the
      corresponding block is never used before the next mount, and it's also
      fine that we invoke clean_bdev_aliases() related functions to unmap
      the block device mapping when re-allocating such freed block as data
      block. But this logic is somewhat fragile and risky that may lead to
      data corruption if we forget to clean bdev aliases. So, It's better to
      discard dirty data during forget time.
      
      We have been already handled all the cases of forgetting journalled
      buffer, this patch deal with the remaining two cases.
      
      - buffer is not journalled yet,
      - buffer is journalled but doesn't belongs to any transaction.
      
      We invoke __bforget() instead of __brelese() when forgetting an
      un-journalled buffer in jbd2_journal_forget(). After this patch we can
      remove all clean_bdev_aliases() related calls in ext4.
      Suggested-by: NJan Kara <jack@suse.cz>
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      3aeefb45
  3. 28 7月, 2019 1 次提交
    • R
      jbd2: introduce jbd2_inode dirty range scoping · af3812b6
      Ross Zwisler 提交于
      commit 6ba0e7dc64a5adcda2fbe65adc466891795d639e upstream.
      
      Currently both journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() operate on the entire address space
      of each of the inodes associated with a given journal entry.  The
      consequence of this is that if we have an inode where we are constantly
      appending dirty pages we can end up waiting for an indefinite amount of
      time in journal_finish_inode_data_buffers() while we wait for all the
      pages under writeback to be written out.
      
      The easiest way to cause this type of workload is do just dd from
      /dev/zero to a file until it fills the entire filesystem.  This can
      cause journal_finish_inode_data_buffers() to wait for the duration of
      the entire dd operation.
      
      We can improve this situation by scoping each of the inode dirty ranges
      associated with a given transaction.  We do this via the jbd2_inode
      structure so that the scoping is contained within jbd2 and so that it
      follows the lifetime and locking rules for that structure.
      
      This allows us to limit the writeback & wait in
      journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() respectively to the dirty range for
      a given struct jdb2_inode, keeping us from waiting forever if the inode
      in question is still being appended to.
      Signed-off-by: NRoss Zwisler <zwisler@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      af3812b6
  4. 22 5月, 2019 1 次提交
  5. 24 3月, 2019 2 次提交
    • Z
      jbd2: fix compile warning when using JBUFFER_TRACE · 584f390d
      zhangyi (F) 提交于
      commit 01215d3edb0f384ddeaa5e4a22c1ae5ff634149f upstream.
      
      The jh pointer may be used uninitialized in the two cases below and the
      compiler complain about it when enabling JBUFFER_TRACE macro, fix them.
      
      In file included from fs/jbd2/transaction.c:19:0:
      fs/jbd2/transaction.c: In function ‘jbd2_journal_get_undo_access’:
      ./include/linux/jbd2.h:1637:38: warning: ‘jh’ is used uninitialized in this function [-Wuninitialized]
       #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
                                            ^
      fs/jbd2/transaction.c:1219:23: note: ‘jh’ was declared here
        struct journal_head *jh;
                             ^
      In file included from fs/jbd2/transaction.c:19:0:
      fs/jbd2/transaction.c: In function ‘jbd2_journal_dirty_metadata’:
      ./include/linux/jbd2.h:1637:38: warning: ‘jh’ may be used uninitialized in this function [-Wmaybe-uninitialized]
       #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
                                            ^
      fs/jbd2/transaction.c:1332:23: note: ‘jh’ was declared here
        struct journal_head *jh;
                             ^
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      584f390d
    • Z
      jbd2: clear dirty flag when revoking a buffer from an older transaction · dbe4bc99
      zhangyi (F) 提交于
      commit 904cdbd41d749a476863a0ca41f6f396774f26e4 upstream.
      
      Now, we capture a data corruption problem on ext4 while we're truncating
      an extent index block. Imaging that if we are revoking a buffer which
      has been journaled by the committing transaction, the buffer's jbddirty
      flag will not be cleared in jbd2_journal_forget(), so the commit code
      will set the buffer dirty flag again after refile the buffer.
      
      fsx                               kjournald2
                                        jbd2_journal_commit_transaction
      jbd2_journal_revoke                commit phase 1~5...
       jbd2_journal_forget
         belongs to older transaction    commit phase 6
         jbddirty not clear               __jbd2_journal_refile_buffer
                                           __jbd2_journal_unfile_buffer
                                            test_clear_buffer_jbddirty
                                             mark_buffer_dirty
      
      Finally, if the freed extent index block was allocated again as data
      block by some other files, it may corrupt the file data after writing
      cached pages later, such as during unmount time. (In general,
      clean_bdev_aliases() related helpers should be invoked after
      re-allocation to prevent the above corruption, but unfortunately we
      missed it when zeroout the head of extra extent blocks in
      ext4_ext_handle_unwritten_extents()).
      
      This patch mark buffer as freed and set j_next_transaction to the new
      transaction when it already belongs to the committing transaction in
      jbd2_journal_forget(), so that commit code knows it should clear dirty
      bits when it is done with the buffer.
      
      This problem can be reproduced by xfstests generic/455 easily with
      seeds (3246 3247 3248 3249).
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dbe4bc99
  6. 17 6月, 2018 1 次提交
  7. 21 5月, 2018 1 次提交
  8. 18 4月, 2018 1 次提交
    • T
      ext4: set h_journal if there is a failure starting a reserved handle · b2569260
      Theodore Ts'o 提交于
      If ext4 tries to start a reserved handle via
      jbd2_journal_start_reserved(), and the journal has been aborted, this
      can result in a NULL pointer dereference.  This is because the fields
      h_journal and h_transaction in the handle structure share the same
      memory, via a union, so jbd2_journal_start_reserved() will clear
      h_journal before calling start_this_handle().  If this function fails
      due to an aborted handle, h_journal will still be NULL, and the call
      to jbd2_journal_free_reserved() will pass a NULL journal to
      sub_reserve_credits().
      
      This can be reproduced by running "kvm-xfstests -c dioread_nolock
      generic/475".
      
      Cc: stable@kernel.org # 3.11
      Fixes: 8f7d89f3 ("jbd2: transaction reservation support")
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Reviewed-by: NJan Kara <jack@suse.cz>
      b2569260
  9. 10 1月, 2018 1 次提交
    • T
      jbd2: fix sphinx kernel-doc build warnings · f69120ce
      Tobin C. Harding 提交于
      Sphinx emits various (26) warnings when building make target 'htmldocs'.
      Currently struct definitions contain duplicate documentation, some as
      kernel-docs and some as standard c89 comments.  We can reduce
      duplication while cleaning up the kernel docs.
      
      Move all kernel-docs to right above each struct member.  Use the set of
      all existing comments (kernel-doc and c89).  Add documentation for
      missing struct members and function arguments.
      Signed-off-by: NTobin C. Harding <me@tobin.cc>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      f69120ce
  10. 18 12月, 2017 1 次提交
    • T
      ext4: fix up remaining files with SPDX cleanups · f5166768
      Theodore Ts'o 提交于
      A number of ext4 source files were skipped due because their copyright
      permission statements didn't match the expected text used by the
      automated conversion utilities.  I've added SPDX tags for the rest.
      
      While looking at some of these files, I've noticed that we have quite
      a bit of variation on the licenses that were used --- in particular
      some of the Red Hat licenses on the jbd2 files use a GPL2+ license,
      and we have some files that have a LGPL-2.1 license (which was quite
      surprising).
      
      I've not attempted to do any license changes.  Even if it is perfectly
      legal to relicense to GPL 2.0-only for consistency's sake, that should
      be done with ext4 developer community discussion.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      
      f5166768
  11. 22 5月, 2017 1 次提交
    • T
      jbd2: preserve original nofs flag during journal restart · b4709067
      Tahsin Erdogan 提交于
      When a transaction starts, start_this_handle() saves current
      PF_MEMALLOC_NOFS value so that it can be restored at journal stop time.
      Journal restart is a special case that calls start_this_handle() without
      stopping the transaction. start_this_handle() isn't aware that the
      original value is already stored so it overwrites it with current value.
      
      For instance, a call sequence like below leaves PF_MEMALLOC_NOFS flag set
      at the end:
      
        jbd2_journal_start()
        jbd2__journal_restart()
        jbd2_journal_stop()
      
      Make jbd2__journal_restart() restore the original value before calling
      start_this_handle().
      
      Fixes: 81378da6 ("jbd2: mark the transaction context with the scope GFP_NOFS context")
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      b4709067
  12. 16 5月, 2017 2 次提交
  13. 04 5月, 2017 1 次提交
  14. 05 2月, 2017 1 次提交
  15. 13 10月, 2016 1 次提交
  16. 22 9月, 2016 1 次提交
  17. 30 6月, 2016 3 次提交
    • J
      jbd2: track more dependencies on transaction commit · 1eaa566d
      Jan Kara 提交于
      So far we were tracking only dependency on transaction commit due to
      starting a new handle (which may require commit to start a new
      transaction). Now add tracking also for other cases where we wait for
      transaction commit. This way lockdep can catch deadlocks e. g. because we
      call jbd2_journal_stop() for a synchronous handle with some locks held
      which rank below transaction start.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      1eaa566d
    • J
      jbd2: move lockdep tracking to journal_s · ab714aff
      Jan Kara 提交于
      Currently lockdep map is tracked in each journal handle. To be able to
      expand lockdep support to cover also other cases where we depend on
      transaction commit and where handle is not available, move lockdep map
      into struct journal_s. Since this makes the lockdep map shared for all
      handles, we have to use rwsem_acquire_read() for acquisitions now.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ab714aff
    • J
      jbd2: move lockdep instrumentation for jbd2 handles · 7a4b188f
      Jan Kara 提交于
      The transaction the handle references is free to commit once we've
      decremented t_updates counter. Move the lockdep instrumentation to that
      place. Currently it was a bit later which did not really matter but
      subsequent improvements to lockdep instrumentation would cause false
      positives with it.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      7a4b188f
  18. 24 4月, 2016 1 次提交
    • J
      jbd2: add support for avoiding data writes during transaction commits · 41617e1a
      Jan Kara 提交于
      Currently when filesystem needs to make sure data is on permanent
      storage before committing a transaction it adds inode to transaction's
      inode list. During transaction commit, jbd2 writes back all dirty
      buffers that have allocated underlying blocks and waits for the IO to
      finish. However when doing writeback for delayed allocated data, we
      allocate blocks and immediately submit the data. Thus asking jbd2 to
      write dirty pages just unnecessarily adds more work to jbd2 possibly
      writing back other redirtied blocks.
      
      Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
      outstanding data writes before committing a transaction and thus avoid
      unnecessary writes.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      41617e1a
  19. 18 4月, 2016 1 次提交
  20. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  21. 14 3月, 2016 1 次提交
    • M
      jbd2: do not fail journal because of frozen_buffer allocation failure · 490c1b44
      Michal Hocko 提交于
      Journal transaction might fail prematurely because the frozen_buffer
      is allocated by GFP_NOFS request:
      [   72.440013] do_get_write_access: OOM for frozen_buffer
      [   72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
      [   72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
      (...snipped....)
      [   72.495559] do_get_write_access: OOM for frozen_buffer
      [   72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
      [   72.496839] do_get_write_access: OOM for frozen_buffer
      [   72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
      [   72.505766] Aborting journal on device sda1-8.
      [   72.505851] EXT4-fs (sda1): Remounting filesystem read-only
      
      This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
      allocations upon OOM" because small GPF_NOFS allocations never failed.
      This allocation seems essential for the journal and GFP_NOFS is too
      restrictive to the memory allocator so let's use __GFP_NOFAIL here to
      emulate the previous behavior.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      490c1b44
  22. 07 1月, 2016 1 次提交
  23. 05 12月, 2015 1 次提交
    • J
      jbd2: fix null committed data return in undo_access · 087ffd4e
      Junxiao Bi 提交于
      introduced jbd2_write_access_granted() to improve write|undo_access
      speed, but missed to check the status of b_committed_data which caused
      a kernel panic on ocfs2.
      
      [ 6538.405938] ------------[ cut here ]------------
      [ 6538.406686] kernel BUG at fs/ocfs2/suballoc.c:2400!
      [ 6538.406686] invalid opcode: 0000 [#1] SMP
      [ 6538.406686] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront xen_fbfront parport_pc parport pcspkr i2c_piix4 acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix cirrus ttm drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect i2c_core syscopyarea dm_mirror dm_region_hash dm_log dm_mod
      [ 6538.406686] CPU: 1 PID: 16265 Comm: mmap_truncate Not tainted 4.3.0 #1
      [ 6538.406686] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
      [ 6538.406686] task: ffff88007c2bab00 ti: ffff880075b78000 task.ti: ffff880075b78000
      [ 6538.406686] RIP: 0010:[<ffffffffa06a286b>]  [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
      [ 6538.406686] RSP: 0018:ffff880075b7b7f8  EFLAGS: 00010246
      [ 6538.406686] RAX: ffff8800760c5b40 RBX: ffff88006c06a000 RCX: ffffffffa06e6df0
      [ 6538.406686] RDX: 0000000000000000 RSI: ffff88007a6f6ea0 RDI: ffff88007a760430
      [ 6538.406686] RBP: ffff880075b7b878 R08: 0000000000000002 R09: 0000000000000001
      [ 6538.406686] R10: ffffffffa06769be R11: 0000000000000000 R12: 0000000000000001
      [ 6538.406686] R13: ffffffffa06a1750 R14: 0000000000000001 R15: ffff88007a6f6ea0
      [ 6538.406686] FS:  00007f17fde30720(0000) GS:ffff88007f040000(0000) knlGS:0000000000000000
      [ 6538.406686] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6538.406686] CR2: 0000000000601730 CR3: 000000007aea0000 CR4: 00000000000406e0
      [ 6538.406686] Stack:
      [ 6538.406686]  ffff88007c2bb5b0 ffff880075b7b8e0 ffff88007a7604b0 ffff88006c640800
      [ 6538.406686]  ffff88007a7604b0 ffff880075d77390 0000000075b7b878 ffffffffa06a309d
      [ 6538.406686]  ffff880075d752d8 ffff880075b7b990 ffff880075b7b898 0000000000000000
      [ 6538.406686] Call Trace:
      [ 6538.406686]  [<ffffffffa06a309d>] ? ocfs2_read_group_descriptor+0x6d/0xa0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a3654>] _ocfs2_free_suballoc_bits+0xe4/0x320 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a397e>] _ocfs2_free_clusters+0xee/0x210 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0682d50>] ? ocfs2_extend_trans+0x50/0x1a0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a3ad5>] ocfs2_free_clusters+0x15/0x20 [ocfs2]
      [ 6538.406686]  [<ffffffffa065072c>] ocfs2_replay_truncate_records+0xfc/0x290 [ocfs2]
      [ 6538.406686]  [<ffffffffa06843ac>] ? ocfs2_start_trans+0xec/0x1d0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0654600>] __ocfs2_flush_truncate_log+0x140/0x2d0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0654394>] ? ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x170 [ocfs2]
      [ 6538.406686]  [<ffffffffa065acd4>] ocfs2_remove_btree_range+0x374/0x630 [ocfs2]
      [ 6538.406686]  [<ffffffffa017486b>] ? jbd2_journal_stop+0x25b/0x470 [jbd2]
      [ 6538.406686]  [<ffffffffa065d5b5>] ocfs2_commit_truncate+0x305/0x670 [ocfs2]
      [ 6538.406686]  [<ffffffffa0683430>] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2]
      [ 6538.406686]  [<ffffffffa067adb7>] ocfs2_truncate_file+0x297/0x380 [ocfs2]
      [ 6538.406686]  [<ffffffffa01759e4>] ? jbd2_journal_begin_ordered_truncate+0x64/0xc0 [jbd2]
      [ 6538.406686]  [<ffffffffa067c7a2>] ocfs2_setattr+0x572/0x860 [ocfs2]
      [ 6538.406686]  [<ffffffff810e4a3f>] ? current_fs_time+0x3f/0x50
      [ 6538.406686]  [<ffffffff812124b7>] notify_change+0x1d7/0x340
      [ 6538.406686]  [<ffffffff8121abf9>] ? generic_getxattr+0x79/0x80
      [ 6538.406686]  [<ffffffff811f5876>] do_truncate+0x66/0x90
      [ 6538.406686]  [<ffffffff81120e30>] ? __audit_syscall_entry+0xb0/0x110
      [ 6538.406686]  [<ffffffff811f5bb3>] do_sys_ftruncate.clone.0+0xf3/0x120
      [ 6538.406686]  [<ffffffff811f5bee>] SyS_ftruncate+0xe/0x10
      [ 6538.406686]  [<ffffffff816aa2ae>] entry_SYSCALL_64_fastpath+0x12/0x71
      [ 6538.406686] Code: 28 48 81 ee b0 04 00 00 48 8b 92 50 fb ff ff 48 8b 80 b0 03 00 00 48 39 90 88 00 00 00 0f 84 30 fe ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
      [ 6538.406686] RIP  [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
      [ 6538.406686]  RSP <ffff880075b7b7f8>
      [ 6538.691128] ---[ end trace 31cd7011d6770d7e ]---
      [ 6538.694492] Kernel panic - not syncing: Fatal exception
      [ 6538.695484] Kernel Offset: disabled
      
      Fixes: de92c8ca("jbd2: speedup jbd2_journal_get_[write|undo]_access()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      087ffd4e
  24. 25 11月, 2015 1 次提交
    • J
      jbd2: Fix unreclaimed pages after truncate in data=journal mode · bc23f0c8
      Jan Kara 提交于
      Ted and Namjae have reported that truncated pages don't get timely
      reclaimed after being truncated in data=journal mode. The following test
      triggers the issue easily:
      
      for (i = 0; i < 1000; i++) {
      	pwrite(fd, buf, 1024*1024, 0);
      	fsync(fd);
      	fsync(fd);
      	ftruncate(fd, 0);
      }
      
      The reason is that journal_unmap_buffer() finds that truncated buffers
      are not journalled (jh->b_transaction == NULL), they are part of
      checkpoint list of a transaction (jh->b_cp_transaction != NULL) and have
      been already written out (!buffer_dirty(bh)). We clean such buffers but
      we leave them in the checkpoint list. Since checkpoint transaction holds
      a reference to the journal head, these buffers cannot be released until
      the checkpoint transaction is cleaned up. And at that point we don't
      call release_buffer_page() anymore so pages detached from mapping are
      lingering in the system waiting for reclaim to find them and free them.
      
      Fix the problem by removing buffers from transaction checkpoint lists
      when journal_unmap_buffer() finds out they don't have to be there
      anymore.
      Reported-and-tested-by: NNamjae Jeon <namjae.jeon@samsung.com>
      Fixes: de1b7941Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      bc23f0c8
  25. 07 11月, 2015 1 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  26. 04 8月, 2015 1 次提交
    • L
      jbd2: limit number of reserved credits · 6d3ec14d
      Lukas Czerner 提交于
      Currently there is no limitation on number of reserved credits we can
      ask for. If we ask for more reserved credits than 1/2 of maximum
      transaction size, or if total number of credits exceeds the maximum
      transaction size per operation (which is currently only possible with
      the former) we will spin forever in start_this_handle().
      
      Fix this by adding this limitation at the start of start_this_handle().
      
      This patch also removes the credit limitation 1/2 of maximum transaction
      size, since we really only want to limit the number of reserved credits.
      There is not much point to limit the credits if there is still space in
      the journal.
      
      This accidentally also fixes the online resize, where due to the
      limitation of the journal credits we're unable to grow file systems with
      1k block size and size between 16M and 32M. It has been partially fixed
      by 2c869b26, but not entirely.
      
      Thanks Jan Kara for helping me getting the correct fix.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6d3ec14d
  27. 13 7月, 2015 1 次提交
    • J
      jbd2: speedup jbd2_journal_dirty_metadata() · 6e06ae88
      Jan Kara 提交于
      It is often the case that we mark buffer as having dirty metadata when
      the buffer is already in that state (frequent for bitmaps, inode table
      blocks, superblock). Thus it is unnecessary to contend on grabbing
      journal head reference and bh_state lock. Avoid that by checking whether
      any modification to the buffer is needed before grabbing any locks or
      references.
      
      [ Note: this is a fixed version of commit 2143c196, which was
        reverted in ebeaa8dd due to a false positive triggering of an
        assertion check. -- Ted ]
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6e06ae88
  28. 28 6月, 2015 1 次提交
    • L
      Revert "jbd2: speedup jbd2_journal_dirty_metadata()" · ebeaa8dd
      Linus Torvalds 提交于
      This reverts commit 2143c196.
      
      This commit seems to be the cause of the following jbd2 assertion
      failure:
      
         ------------[ cut here ]------------
         kernel BUG at fs/jbd2/transaction.c:1325!
         invalid opcode: 0000 [#1] SMP
         Modules linked in: bnep bluetooth fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 ...
         CPU: 7 PID: 5509 Comm: gcc Not tainted 4.1.0-10944-g2a298679 #1
         Hardware name:                  /DH87RL, BIOS RLH8710H.86A.0327.2014.0924.1645 09/24/2014
         task: ffff8803bf866040 ti: ffff880308528000 task.ti: ffff880308528000
         RIP: jbd2_journal_dirty_metadata+0x237/0x290
         Call Trace:
           __ext4_handle_dirty_metadata+0x43/0x1f0
           ext4_handle_dirty_dirent_node+0xde/0x160
           ? jbd2_journal_get_write_access+0x36/0x50
           ext4_delete_entry+0x112/0x160
           ? __ext4_journal_start_sb+0x52/0xb0
           ext4_unlink+0xfa/0x260
           vfs_unlink+0xec/0x190
           do_unlinkat+0x24a/0x270
           SyS_unlink+0x11/0x20
           entry_SYSCALL_64_fastpath+0x12/0x6a
         ---[ end trace ae033ebde8d080b4 ]---
      
      which is not easily reproducible (I've seen it just once, and then Ted
      was able to reproduce it once).  Revert it while Ted and Jan try to
      figure out what is wrong.
      
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebeaa8dd
  29. 21 6月, 2015 1 次提交
    • J
      jbd2: speedup jbd2_journal_dirty_metadata() · 2143c196
      Jan Kara 提交于
      It is often the case that we mark buffer as having dirty metadata when
      the buffer is already in that state (frequent for bitmaps, inode table
      blocks, superblock). Thus it is unnecessary to contend on grabbing
      journal head reference and bh_state lock. Avoid that by checking whether
      any modification to the buffer is needed before grabbing any locks or
      references.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      2143c196
  30. 09 6月, 2015 4 次提交
    • J
      jbd2: speedup jbd2_journal_get_[write|undo]_access() · de92c8ca
      Jan Kara 提交于
      jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
      frequently called for buffers that are already part of the running
      transaction - most frequently it is the case for bitmaps, inode table
      blocks, and superblock. Since in such cases we have nothing to do, it is
      unfortunate we still grab reference to journal head, lock the bh, lock
      bh_state only to find out there's nothing to do.
      
      Improving this is a bit subtle though since until we find out journal
      head is attached to the running transaction, it can disappear from under
      us because checkpointing / commit decided it's no longer needed. We deal
      with this by protecting journal_head slab with RCU. We still have to be
      careful about journal head being freed & reallocated within slab and
      about exposing journal head in consistent state (in particular
      b_modified and b_frozen_data must be in correct state before we allow
      user to touch the buffer).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      de92c8ca
    • J
      jbd2: more simplifications in do_get_write_access() · 8b00f400
      Jan Kara 提交于
      Check for the simple case of unjournaled buffer first, handle it and
      bail out. This allows us to remove one if and unindent the difficult case
      by one tab. The result is easier to read.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      8b00f400
    • J
      jbd2: simplify error path on allocation failure in do_get_write_access() · d012aa59
      Jan Kara 提交于
      We were acquiring bh_state_lock when allocation of buffer failed in
      do_get_write_access() only to be able to jump to a label that releases
      the lock and does all other checks that don't make sense for this error
      path. Just jump into the right label instead.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      d012aa59
    • J
      jbd2: simplify code flow in do_get_write_access() · ee57aba1
      Jan Kara 提交于
      needs_copy is set only in one place in do_get_write_access(), just move
      the frozen buffer copying into that place and factor it out to a
      separate function to make do_get_write_access() slightly more readable.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ee57aba1
  31. 08 6月, 2015 1 次提交
    • M
      jbd2: revert must-not-fail allocation loops back to GFP_NOFAIL · 6ccaf3e2
      Michal Hocko 提交于
      This basically reverts 47def826 (jbd2: Remove __GFP_NOFAIL from jbd2
      layer). The deprecation of __GFP_NOFAIL was a bad choice because it led
      to open coding the endless loop around the allocator rather than
      removing the dependency on the non failing allocation. So the
      deprecation was a clear failure and the reality tells us that
      __GFP_NOFAIL is not even close to go away.
      
      It is still true that __GFP_NOFAIL allocations are generally discouraged
      and new uses should be evaluated and an alternative (pre-allocations or
      reservations) should be considered but it doesn't make any sense to lie
      the allocator about the requirements. Allocator can take steps to help
      making a progress if it knows the requirements.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      6ccaf3e2
  32. 15 5月, 2015 1 次提交
    • L
      ext4: fix NULL pointer dereference when journal restart fails · 9d506594
      Lukas Czerner 提交于
      Currently when journal restart fails, we'll have the h_transaction of
      the handle set to NULL to indicate that the handle has been effectively
      aborted. We handle this situation quietly in the jbd2_journal_stop() and just
      free the handle and exit because everything else has been done before we
      attempted (and failed) to restart the journal.
      
      Unfortunately there are a number of problems with that approach
      introduced with commit
      
      41a5b913 "jbd2: invalidate handle if jbd2_journal_restart()
      fails"
      
      First of all in ext4 jbd2_journal_stop() will be called through
      __ext4_journal_stop() where we would try to get a hold of the superblock
      by dereferencing h_transaction which in this case would lead to NULL
      pointer dereference and crash.
      
      In addition we're going to free the handle regardless of the refcount
      which is bad as well, because others up the call chain will still
      reference the handle so we might potentially reference already freed
      memory.
      
      Moreover it's expected that we'll get aborted handle as well as detached
      handle in some of the journalling function as the error propagates up
      the stack, so it's unnecessary to call WARN_ON every time we get
      detached handle.
      
      And finally we might leak some memory by forgetting to free reserved
      handle in jbd2_journal_stop() in the case where handle was detached from
      the transaction (h_transaction is NULL).
      
      Fix the NULL pointer dereference in __ext4_journal_stop() by just
      calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
      the potential memory leak in jbd2_journal_stop() and use proper
      handle refcounting before we attempt to free it to avoid use-after-free
      issues.
      
      And finally remove all WARN_ON(!transaction) from the code so that we do
      not get random traces when something goes wrong because when journal
      restart fails we will get to some of those functions.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      9d506594