1. 14 1月, 2017 1 次提交
    • T
      fs/jbd2, locking/mutex, sched/wait: Use mutex_lock_io() for journal->j_checkpoint_mutex · 6fa7aa50
      Tejun Heo 提交于
      When an ext4 fs is bogged down by a lot of metadata IOs (in the
      reported case, it was deletion of millions of files, but any massive
      amount of journal writes would do), after the journal is filled up,
      tasks which try to access the filesystem and aren't currently
      performing the journal writes end up waiting in
      __jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.
      
      Because those mutex sleeps aren't marked as iowait, this condition can
      lead to misleadingly low iowait and /proc/stat:procs_blocked.  While
      iowait propagation is far from strict, this condition can be triggered
      fairly easily and annotating these sleeps correctly helps initial
      diagnosis quite a bit.
      
      Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
      these sleeps are properly marked as iowait.
      Reported-by: NMingbo Wan <mingbo@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team@fb.com
      Link: http://lkml.kernel.org/r/1477673892-28940-5-git-send-email-tj@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6fa7aa50
  2. 25 12月, 2016 1 次提交
  3. 01 11月, 2016 1 次提交
  4. 13 10月, 2016 1 次提交
  5. 12 10月, 2016 1 次提交
  6. 22 9月, 2016 1 次提交
  7. 16 9月, 2016 1 次提交
  8. 30 6月, 2016 4 次提交
    • A
      jbd2: make journal y2038 safe · abcfb5d9
      Arnd Bergmann 提交于
      The jbd2 journal stores the commit time in 64-bit seconds and 32-bit
      nanoseconds, which avoids an overflow in 2038, but it gets the numbers
      from current_kernel_time(), which uses 'long' seconds on 32-bit
      architectures.
      
      This simply changes the code to call current_kernel_time64() so
      we use 64-bit seconds consistently.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      abcfb5d9
    • J
      jbd2: track more dependencies on transaction commit · 1eaa566d
      Jan Kara 提交于
      So far we were tracking only dependency on transaction commit due to
      starting a new handle (which may require commit to start a new
      transaction). Now add tracking also for other cases where we wait for
      transaction commit. This way lockdep can catch deadlocks e. g. because we
      call jbd2_journal_stop() for a synchronous handle with some locks held
      which rank below transaction start.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      1eaa566d
    • J
      jbd2: move lockdep tracking to journal_s · ab714aff
      Jan Kara 提交于
      Currently lockdep map is tracked in each journal handle. To be able to
      expand lockdep support to cover also other cases where we depend on
      transaction commit and where handle is not available, move lockdep map
      into struct journal_s. Since this makes the lockdep map shared for all
      handles, we have to use rwsem_acquire_read() for acquisitions now.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ab714aff
    • J
      jbd2: move lockdep instrumentation for jbd2 handles · 7a4b188f
      Jan Kara 提交于
      The transaction the handle references is free to commit once we've
      decremented t_updates counter. Move the lockdep instrumentation to that
      place. Currently it was a bit later which did not really matter but
      subsequent improvements to lockdep instrumentation would cause false
      positives with it.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      7a4b188f
  9. 25 6月, 2016 1 次提交
    • M
      jbd2: get rid of superfluous __GFP_REPEAT · f2db1971
      Michal Hocko 提交于
      jbd2_alloc is explicit about its allocation preferences wrt.  the
      allocation size.  Sub page allocations go to the slab allocator and
      larger are using either the page allocator or vmalloc.  This is all good
      but the logic is unnecessarily complex.
      
      1) as per Ted, the vmalloc fallback is a left-over:
      
       : jbd2_alloc is only passed in the bh->b_size, which can't be PAGE_SIZE, so
       : the code path that calls vmalloc() should never get called.  When we
       : conveted jbd2_alloc() to suppor sub-page size allocations in commit
       : d2eecb03, there was an assumption that it could be called with a size
       : greater than PAGE_SIZE, but that's certaily not true today.
      
      Moreover vmalloc allocation might even lead to a deadlock because the
      callers expect GFP_NOFS context while vmalloc is GFP_KERNEL.
      
      2) __GFP_REPEAT for requests <= PAGE_ALLOC_COSTLY_ORDER is ignored
         since the flag was introduced.
      
      Let's simplify the code flow and use the slab allocator for sub-page
      requests and the page allocator for others.  Even though order > 0 is
      not currently used as per above leave that option open.
      
      Link: http://lkml.kernel.org/r/1464599699-30131-18-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2db1971
  10. 08 6月, 2016 3 次提交
  11. 24 4月, 2016 1 次提交
    • J
      jbd2: add support for avoiding data writes during transaction commits · 41617e1a
      Jan Kara 提交于
      Currently when filesystem needs to make sure data is on permanent
      storage before committing a transaction it adds inode to transaction's
      inode list. During transaction commit, jbd2 writes back all dirty
      buffers that have allocated underlying blocks and waits for the IO to
      finish. However when doing writeback for delayed allocated data, we
      allocate blocks and immediately submit the data. Thus asking jbd2 to
      write dirty pages just unnecessarily adds more work to jbd2 possibly
      writing back other redirtied blocks.
      
      Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
      outstanding data writes before committing a transaction and thus avoid
      unnecessary writes.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      41617e1a
  12. 18 4月, 2016 1 次提交
  13. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  14. 14 3月, 2016 1 次提交
    • M
      jbd2: do not fail journal because of frozen_buffer allocation failure · 490c1b44
      Michal Hocko 提交于
      Journal transaction might fail prematurely because the frozen_buffer
      is allocated by GFP_NOFS request:
      [   72.440013] do_get_write_access: OOM for frozen_buffer
      [   72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
      [   72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
      (...snipped....)
      [   72.495559] do_get_write_access: OOM for frozen_buffer
      [   72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
      [   72.496839] do_get_write_access: OOM for frozen_buffer
      [   72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
      [   72.505766] Aborting journal on device sda1-8.
      [   72.505851] EXT4-fs (sda1): Remounting filesystem read-only
      
      This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
      allocations upon OOM" because small GPF_NOFS allocations never failed.
      This allocation seems essential for the journal and GFP_NOFS is too
      restrictive to the memory allocator so let's use __GFP_NOFAIL here to
      emulate the previous behavior.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      490c1b44
  15. 10 3月, 2016 1 次提交
    • O
      jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path · c0a2ad9b
      OGAWA Hirofumi 提交于
      On umount path, jbd2_journal_destroy() writes latest transaction ID
      (->j_tail_sequence) to be used at next mount.
      
      The bug is that ->j_tail_sequence is not holding latest transaction ID
      in some cases. So, at next mount, there is chance to conflict with
      remaining (not overwritten yet) transactions.
      
      	mount (id=10)
      	write transaction (id=11)
      	write transaction (id=12)
      	umount (id=10) <= the bug doesn't write latest ID
      
      	mount (id=10)
      	write transaction (id=11)
      	crash
      
      	mount
      	[recovery process]
      		transaction (id=11)
      		transaction (id=12) <= valid transaction ID, but old commit
                                             must not replay
      
      Like above, this bug become the cause of recovery failure, or FS
      corruption.
      
      So why ->j_tail_sequence doesn't point latest ID?
      
      Because if checkpoint transactions was reclaimed by memory pressure
      (i.e. bdev_try_to_free_page()), then ->j_tail_sequence is not updated.
      (And another case is, __jbd2_journal_clean_checkpoint_list() is called
      with empty transaction.)
      
      So in above cases, ->j_tail_sequence is not pointing latest
      transaction ID at umount path. Plus, REQ_FLUSH for checkpoint is not
      done too.
      
      So, to fix this problem with minimum changes, this patch updates
      ->j_tail_sequence, and issue REQ_FLUSH.  (With more complex changes,
      some optimizations would be possible to avoid unnecessary REQ_FLUSH
      for example though.)
      
      BTW,
      
      	journal->j_tail_sequence =
      		++journal->j_transaction_sequence;
      
      Increment of ->j_transaction_sequence seems to be unnecessary, but
      ext3 does this.
      Signed-off-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      c0a2ad9b
  16. 23 2月, 2016 4 次提交
  17. 07 1月, 2016 1 次提交
  18. 05 12月, 2015 1 次提交
    • J
      jbd2: fix null committed data return in undo_access · 087ffd4e
      Junxiao Bi 提交于
      introduced jbd2_write_access_granted() to improve write|undo_access
      speed, but missed to check the status of b_committed_data which caused
      a kernel panic on ocfs2.
      
      [ 6538.405938] ------------[ cut here ]------------
      [ 6538.406686] kernel BUG at fs/ocfs2/suballoc.c:2400!
      [ 6538.406686] invalid opcode: 0000 [#1] SMP
      [ 6538.406686] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront xen_fbfront parport_pc parport pcspkr i2c_piix4 acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix cirrus ttm drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect i2c_core syscopyarea dm_mirror dm_region_hash dm_log dm_mod
      [ 6538.406686] CPU: 1 PID: 16265 Comm: mmap_truncate Not tainted 4.3.0 #1
      [ 6538.406686] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
      [ 6538.406686] task: ffff88007c2bab00 ti: ffff880075b78000 task.ti: ffff880075b78000
      [ 6538.406686] RIP: 0010:[<ffffffffa06a286b>]  [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
      [ 6538.406686] RSP: 0018:ffff880075b7b7f8  EFLAGS: 00010246
      [ 6538.406686] RAX: ffff8800760c5b40 RBX: ffff88006c06a000 RCX: ffffffffa06e6df0
      [ 6538.406686] RDX: 0000000000000000 RSI: ffff88007a6f6ea0 RDI: ffff88007a760430
      [ 6538.406686] RBP: ffff880075b7b878 R08: 0000000000000002 R09: 0000000000000001
      [ 6538.406686] R10: ffffffffa06769be R11: 0000000000000000 R12: 0000000000000001
      [ 6538.406686] R13: ffffffffa06a1750 R14: 0000000000000001 R15: ffff88007a6f6ea0
      [ 6538.406686] FS:  00007f17fde30720(0000) GS:ffff88007f040000(0000) knlGS:0000000000000000
      [ 6538.406686] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6538.406686] CR2: 0000000000601730 CR3: 000000007aea0000 CR4: 00000000000406e0
      [ 6538.406686] Stack:
      [ 6538.406686]  ffff88007c2bb5b0 ffff880075b7b8e0 ffff88007a7604b0 ffff88006c640800
      [ 6538.406686]  ffff88007a7604b0 ffff880075d77390 0000000075b7b878 ffffffffa06a309d
      [ 6538.406686]  ffff880075d752d8 ffff880075b7b990 ffff880075b7b898 0000000000000000
      [ 6538.406686] Call Trace:
      [ 6538.406686]  [<ffffffffa06a309d>] ? ocfs2_read_group_descriptor+0x6d/0xa0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a3654>] _ocfs2_free_suballoc_bits+0xe4/0x320 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a397e>] _ocfs2_free_clusters+0xee/0x210 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0682d50>] ? ocfs2_extend_trans+0x50/0x1a0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a3ad5>] ocfs2_free_clusters+0x15/0x20 [ocfs2]
      [ 6538.406686]  [<ffffffffa065072c>] ocfs2_replay_truncate_records+0xfc/0x290 [ocfs2]
      [ 6538.406686]  [<ffffffffa06843ac>] ? ocfs2_start_trans+0xec/0x1d0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0654600>] __ocfs2_flush_truncate_log+0x140/0x2d0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0654394>] ? ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x170 [ocfs2]
      [ 6538.406686]  [<ffffffffa065acd4>] ocfs2_remove_btree_range+0x374/0x630 [ocfs2]
      [ 6538.406686]  [<ffffffffa017486b>] ? jbd2_journal_stop+0x25b/0x470 [jbd2]
      [ 6538.406686]  [<ffffffffa065d5b5>] ocfs2_commit_truncate+0x305/0x670 [ocfs2]
      [ 6538.406686]  [<ffffffffa0683430>] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2]
      [ 6538.406686]  [<ffffffffa067adb7>] ocfs2_truncate_file+0x297/0x380 [ocfs2]
      [ 6538.406686]  [<ffffffffa01759e4>] ? jbd2_journal_begin_ordered_truncate+0x64/0xc0 [jbd2]
      [ 6538.406686]  [<ffffffffa067c7a2>] ocfs2_setattr+0x572/0x860 [ocfs2]
      [ 6538.406686]  [<ffffffff810e4a3f>] ? current_fs_time+0x3f/0x50
      [ 6538.406686]  [<ffffffff812124b7>] notify_change+0x1d7/0x340
      [ 6538.406686]  [<ffffffff8121abf9>] ? generic_getxattr+0x79/0x80
      [ 6538.406686]  [<ffffffff811f5876>] do_truncate+0x66/0x90
      [ 6538.406686]  [<ffffffff81120e30>] ? __audit_syscall_entry+0xb0/0x110
      [ 6538.406686]  [<ffffffff811f5bb3>] do_sys_ftruncate.clone.0+0xf3/0x120
      [ 6538.406686]  [<ffffffff811f5bee>] SyS_ftruncate+0xe/0x10
      [ 6538.406686]  [<ffffffff816aa2ae>] entry_SYSCALL_64_fastpath+0x12/0x71
      [ 6538.406686] Code: 28 48 81 ee b0 04 00 00 48 8b 92 50 fb ff ff 48 8b 80 b0 03 00 00 48 39 90 88 00 00 00 0f 84 30 fe ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
      [ 6538.406686] RIP  [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
      [ 6538.406686]  RSP <ffff880075b7b7f8>
      [ 6538.691128] ---[ end trace 31cd7011d6770d7e ]---
      [ 6538.694492] Kernel panic - not syncing: Fatal exception
      [ 6538.695484] Kernel Offset: disabled
      
      Fixes: de92c8ca("jbd2: speedup jbd2_journal_get_[write|undo]_access()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      087ffd4e
  19. 25 11月, 2015 1 次提交
    • J
      jbd2: Fix unreclaimed pages after truncate in data=journal mode · bc23f0c8
      Jan Kara 提交于
      Ted and Namjae have reported that truncated pages don't get timely
      reclaimed after being truncated in data=journal mode. The following test
      triggers the issue easily:
      
      for (i = 0; i < 1000; i++) {
      	pwrite(fd, buf, 1024*1024, 0);
      	fsync(fd);
      	fsync(fd);
      	ftruncate(fd, 0);
      }
      
      The reason is that journal_unmap_buffer() finds that truncated buffers
      are not journalled (jh->b_transaction == NULL), they are part of
      checkpoint list of a transaction (jh->b_cp_transaction != NULL) and have
      been already written out (!buffer_dirty(bh)). We clean such buffers but
      we leave them in the checkpoint list. Since checkpoint transaction holds
      a reference to the journal head, these buffers cannot be released until
      the checkpoint transaction is cleaned up. And at that point we don't
      call release_buffer_page() anymore so pages detached from mapping are
      lingering in the system waiting for reclaim to find them and free them.
      
      Fix the problem by removing buffers from transaction checkpoint lists
      when journal_unmap_buffer() finds out they don't have to be there
      anymore.
      Reported-and-tested-by: NNamjae Jeon <namjae.jeon@samsung.com>
      Fixes: de1b7941Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      bc23f0c8
  20. 07 11月, 2015 1 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  21. 19 10月, 2015 1 次提交
    • D
      ext4, jbd2: ensure entering into panic after recording an error in superblock · 4327ba52
      Daeho Jeong 提交于
      If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
      journaling will be aborted first and the error number will be recorded
      into JBD2 superblock and, finally, the system will enter into the
      panic state in "errors=panic" option.  But, in the rare case, this
      sequence is little twisted like the below figure and it will happen
      that the system enters into panic state, which means the system reset
      in mobile environment, before completion of recording an error in the
      journal superblock. In this case, e2fsck cannot recognize that the
      filesystem failure occurred in the previous run and the corruption
      wouldn't be fixed.
      
      Task A                        Task B
      ext4_handle_error()
      -> jbd2_journal_abort()
        -> __journal_abort_soft()
          -> __jbd2_journal_abort_hard()
          | -> journal->j_flags |= JBD2_ABORT;
          |
          |                         __ext4_abort()
          |                         -> jbd2_journal_abort()
          |                         | -> __journal_abort_soft()
          |                         |   -> if (journal->j_flags & JBD2_ABORT)
          |                         |           return;
          |                         -> panic()
          |
          -> jbd2_journal_update_sb_errno()
      Tested-by: NHobin Woo <hobin.woo@samsung.com>
      Signed-off-by: NDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      4327ba52
  22. 18 10月, 2015 3 次提交
  23. 15 10月, 2015 1 次提交
  24. 04 8月, 2015 1 次提交
    • L
      jbd2: limit number of reserved credits · 6d3ec14d
      Lukas Czerner 提交于
      Currently there is no limitation on number of reserved credits we can
      ask for. If we ask for more reserved credits than 1/2 of maximum
      transaction size, or if total number of credits exceeds the maximum
      transaction size per operation (which is currently only possible with
      the former) we will spin forever in start_this_handle().
      
      Fix this by adding this limitation at the start of start_this_handle().
      
      This patch also removes the credit limitation 1/2 of maximum transaction
      size, since we really only want to limit the number of reserved credits.
      There is not much point to limit the credits if there is still space in
      the journal.
      
      This accidentally also fixes the online resize, where due to the
      limitation of the journal credits we're unable to grow file systems with
      1k block size and size between 16M and 32M. It has been partially fixed
      by 2c869b26, but not entirely.
      
      Thanks Jan Kara for helping me getting the correct fix.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6d3ec14d
  25. 29 7月, 2015 1 次提交
  26. 23 7月, 2015 1 次提交
    • D
      ext4, jbd2: add REQ_FUA flag when recording an error in the superblock · 564bc402
      Daeho Jeong 提交于
      When an error condition is detected, an error status should be recorded into
      superblocks of EXT4 or JBD2. However, the write request is submitted now
      without REQ_FUA flag, even in "barrier=1" mode, which is followed by
      panic() function in "errors=panic" mode. On mobile devices which make
      whole system reset as soon as kernel panic occurs, this write request
      containing an error flag will disappear just from storage cache without
      written to the physical cells. Therefore, when next start, even forever,
      the error flag cannot be shown in both superblocks, and e2fsck cannot fix
      the filesystem problems automatically, unless e2fsck is executed in
      force checking mode.
      
      [ Changed use test_opt(sb, BARRIER) of checking the journal flags -- TYT ]
      Signed-off-by: NDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      564bc402
  27. 13 7月, 2015 1 次提交
    • J
      jbd2: speedup jbd2_journal_dirty_metadata() · 6e06ae88
      Jan Kara 提交于
      It is often the case that we mark buffer as having dirty metadata when
      the buffer is already in that state (frequent for bitmaps, inode table
      blocks, superblock). Thus it is unnecessary to contend on grabbing
      journal head reference and bh_state lock. Avoid that by checking whether
      any modification to the buffer is needed before grabbing any locks or
      references.
      
      [ Note: this is a fixed version of commit 2143c196, which was
        reverted in ebeaa8dd due to a false positive triggering of an
        assertion check. -- Ted ]
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6e06ae88
  28. 28 6月, 2015 1 次提交
    • L
      Revert "jbd2: speedup jbd2_journal_dirty_metadata()" · ebeaa8dd
      Linus Torvalds 提交于
      This reverts commit 2143c196.
      
      This commit seems to be the cause of the following jbd2 assertion
      failure:
      
         ------------[ cut here ]------------
         kernel BUG at fs/jbd2/transaction.c:1325!
         invalid opcode: 0000 [#1] SMP
         Modules linked in: bnep bluetooth fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 ...
         CPU: 7 PID: 5509 Comm: gcc Not tainted 4.1.0-10944-g2a298679 #1
         Hardware name:                  /DH87RL, BIOS RLH8710H.86A.0327.2014.0924.1645 09/24/2014
         task: ffff8803bf866040 ti: ffff880308528000 task.ti: ffff880308528000
         RIP: jbd2_journal_dirty_metadata+0x237/0x290
         Call Trace:
           __ext4_handle_dirty_metadata+0x43/0x1f0
           ext4_handle_dirty_dirent_node+0xde/0x160
           ? jbd2_journal_get_write_access+0x36/0x50
           ext4_delete_entry+0x112/0x160
           ? __ext4_journal_start_sb+0x52/0xb0
           ext4_unlink+0xfa/0x260
           vfs_unlink+0xec/0x190
           do_unlinkat+0x24a/0x270
           SyS_unlink+0x11/0x20
           entry_SYSCALL_64_fastpath+0x12/0x6a
         ---[ end trace ae033ebde8d080b4 ]---
      
      which is not easily reproducible (I've seen it just once, and then Ted
      was able to reproduce it once).  Revert it while Ted and Jan try to
      figure out what is wrong.
      
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebeaa8dd
  29. 26 6月, 2015 1 次提交
  30. 21 6月, 2015 1 次提交
    • J
      jbd2: speedup jbd2_journal_dirty_metadata() · 2143c196
      Jan Kara 提交于
      It is often the case that we mark buffer as having dirty metadata when
      the buffer is already in that state (frequent for bitmaps, inode table
      blocks, superblock). Thus it is unnecessary to contend on grabbing
      journal head reference and bh_state lock. Avoid that by checking whether
      any modification to the buffer is needed before grabbing any locks or
      references.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      2143c196