1. 15 Jun, 2015 4 commits
    • ext4: mballoc: avoid 20-argument function call · 97b4af2f
      Rasmus Villemoes committed
      Making a function call with 20 arguments is rather expensive in both
      stack and .text. In this case, doing the formatting manually doesn't
      make it any less readable, so we might as well save 155 bytes of .text
      and 112 bytes of stack.
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
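As a rough illustration of the idea in this commit, the user-space C sketch below builds one output row by appending counters in a loop, instead of passing each counter as a separate argument to a single printf-style call. The function name `format_group_row`, the constant `NCOUNTERS`, and the row layout are assumptions made for this example; they are not taken from the ext4 patch.

```c
#include <stdio.h>
#include <stddef.h>

#define NCOUNTERS 14  /* hypothetical number of counters per row */

/*
 * Format one row by appending each counter in a loop.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if the buffer is too small.
 */
int format_group_row(char *buf, size_t len, unsigned group,
                     const unsigned counters[NCOUNTERS])
{
    size_t used;
    int n = snprintf(buf, len, "#%-5u: [", group);

    if (n < 0 || (size_t)n >= len)
        return -1;
    used = (size_t)n;

    for (int i = 0; i < NCOUNTERS; i++) {
        n = snprintf(buf + used, len - used, " %-5u", counters[i]);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, len - used, " ]");
    if (n < 0 || (size_t)n >= len - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

The loop keeps the call sites small: each `snprintf` call takes one value, so the compiler does not have to marshal a long argument list for a single call.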
    • ext4: wait for existing dio workers in ext4_alloc_file_blocks() · 0d306dcf
      Lukas Czerner committed
      Currently existing dio workers can jump in and potentially increase
      extent tree depth while we're allocating blocks in
      ext4_alloc_file_blocks().  This may cause us to underestimate the
      number of credits needed for the transaction because the extent tree
      depth can change after our estimation.
      
      Fix this by waiting for all the existing dio workers in the same way
      as we do it in ext4_punch_hole.  We've seen errors caused by this in
      xfstest generic/299, however it's really hard to reproduce.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: recalculate journal credits as inode depth changes · 4134f5c8
      Lukas Czerner committed
      Currently in ext4_alloc_file_blocks() the number of credits is
      calculated only once, before we enter the allocation loop. However,
      within the allocation loop the extent tree depth can change, so the
      number of credits needed can increase, potentially exceeding the number
      of credits reserved in the handle, which can cause journal failures.
      
      Fix this by recalculating the number of credits when the inode depth
      changes. Note that even though ext4_alloc_file_blocks() is currently
      used only by extent-based inodes, we will avoid recalculating the number
      of credits unnecessarily in the case of indirect-based inodes.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
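The recalculation pattern described in this commit can be sketched in plain C as follows. Everything here is a hypothetical stand-in: `struct inode_sketch`, `credits_for_depth()`, and the cost formula are assumptions for illustration and are not the ext4 or jbd2 API. The point is only that the estimate is refreshed inside the loop when, and only when, the tree depth differs from the depth used for the last estimate.

```c
/* Hypothetical stand-in for an inode; not the ext4 structure. */
struct inode_sketch {
    int depth;   /* current extent tree depth */
};

/* Assumed cost model: deeper trees need more journal credits. */
static int credits_for_depth(int depth)
{
    return 8 + 4 * depth;
}

/*
 * Allocate `chunks` pieces. `grow_at` is the chunk index after which the
 * simulated tree becomes one level deeper (-1 for never).
 * Stores the last credit estimate in *final_credits and returns how many
 * times the estimate had to be recomputed.
 */
int alloc_chunks(struct inode_sketch *inode, int chunks, int grow_at,
                 int *final_credits)
{
    int depth = inode->depth;
    int credits = credits_for_depth(depth);
    int recalcs = 0;

    for (int i = 0; i < chunks; i++) {
        /* Recompute only if the depth changed since the last estimate. */
        if (inode->depth != depth) {
            depth = inode->depth;
            credits = credits_for_depth(depth);
            recalcs++;
        }

        /* ... start a transaction with `credits`, allocate one chunk ... */

        if (i == grow_at)
            inode->depth++;   /* simulated growth during allocation */
    }

    *final_credits = credits;
    return recalcs;
}
```

Comparing against the cached depth avoids redoing the estimate on every iteration, which matches the commit's stated aim of not recalculating unnecessarily.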
    • jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail() · b4f1afcd
      Dmitry Monakhov committed
      jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start(),
      so allocations should be done with GFP_NOFS.
      
      [Full stack trace snipped from 3.10-rh7]
      [<ffffffff815c4bd4>] dump_stack+0x19/0x1b
      [<ffffffff8105dba1>] warn_slowpath_common+0x61/0x80
      [<ffffffff8105dcca>] warn_slowpath_null+0x1a/0x20
      [<ffffffff815c2142>] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17
      [<ffffffff8119c045>] kmem_cache_alloc+0x55/0x210
      [<ffffffff811477f5>] ? mempool_alloc_slab+0x15/0x20
      [<ffffffff811477f5>] mempool_alloc_slab+0x15/0x20
      [<ffffffff81147939>] mempool_alloc+0x69/0x170
      [<ffffffff815cb69e>] ? _raw_spin_unlock_irq+0xe/0x20
      [<ffffffff8109160d>] ? finish_task_switch+0x5d/0x150
      [<ffffffff811f1a8e>] bio_alloc_bioset+0x1be/0x2e0
      [<ffffffff8127ee49>] blkdev_issue_flush+0x99/0x120
      [<ffffffffa019a733>] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL
      [<ffffffffa019aca1>] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2]
      [<ffffffffa019afc7>] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2]
      [<ffffffffa01952d8>] start_this_handle+0x2d8/0x550 [jbd2]
      [<ffffffff811b02a9>] ? __memcg_kmem_put_cache+0x29/0x30
      [<ffffffff8119c120>] ? kmem_cache_alloc+0x130/0x210
      [<ffffffffa019573a>] jbd2__journal_start+0xba/0x190 [jbd2]
      [<ffffffff811532ce>] ? lru_cache_add+0xe/0x10
      [<ffffffffa01c9549>] ? ext4_da_write_begin+0xf9/0x330 [ext4]
      [<ffffffffa01f2c77>] __ext4_journal_start_sb+0x77/0x160 [ext4]
      [<ffffffffa01c9549>] ext4_da_write_begin+0xf9/0x330 [ext4]
      [<ffffffff811446ec>] generic_file_buffered_write_iter+0x10c/0x270
      [<ffffffff81146918>] __generic_file_write_iter+0x178/0x390
      [<ffffffff81146c6b>] __generic_file_aio_write+0x8b/0xb0
      [<ffffffff81146ced>] generic_file_aio_write+0x5d/0xc0
      [<ffffffffa01bf289>] ext4_file_write+0xa9/0x450 [ext4]
      [<ffffffff811c31d9>] ? pipe_read+0x379/0x4f0
      [<ffffffff811b93f0>] do_sync_write+0x90/0xe0
      [<ffffffff811b9b6d>] vfs_write+0xbd/0x1e0
      [<ffffffff811ba5b8>] SyS_write+0x58/0xb0
      [<ffffffff815d4799>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  2. 13 Jun, 2015 4 commits
    • ext4: use swap() in mext_page_double_lock() · bf865467
      Fabian Frederick committed
      Use kernel.h macro definition.
      
      Thanks to Julia Lawall for Coccinelle scripting support.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: use swap() in memswap() · 4b7e2db5
      Fabian Frederick committed
      Use kernel.h macro definition.
      
      Thanks to Julia Lawall for Coccinelle scripting support.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
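For readers unfamiliar with the macro these two commits switch to, here is a user-space approximation of a type-generic swap. The name `swap_sketch` and the helper `order_pair` are made up for this example, and the macro relies on the `__typeof__` extension available in GCC and Clang; it is a sketch of the technique, not a copy of the kernel header.

```c
/* Type-generic swap; assumes a compiler that supports __typeof__. */
#define swap_sketch(a, b) \
    do { __typeof__(a) tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/*
 * Hypothetical helper: order two values so that *lo <= *hi, as code might
 * do before taking two locks in a fixed order.
 */
void order_pair(unsigned long *lo, unsigned long *hi)
{
    if (*lo > *hi)
        swap_sketch(*lo, *hi);
}
```

Using one shared macro replaces hand-written three-line exchanges through a temporary, which is the kind of pattern a Coccinelle script can find and rewrite mechanically.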
    • ext4: fix race between truncate and __ext4_journalled_writepage() · bdf96838
      Theodore Ts'o committed
      The commit cf108bca: "ext4: Invert the locking order of page_lock
      and transaction start" caused __ext4_journalled_writepage() to drop
      the page lock before the page was written back, as part of changing
      the locking order to jbd2_journal_start -> page_lock.  However, this
      introduced a potential race if there was a truncate racing with the
      data=journalled writeback mode.
      
      Fix this by grabbing the page lock after starting the journal handle,
      and then checking to see if page had gotten truncated out from under
      us.
      
      This fixes a number of different warnings or BUG_ON's when running
      xfstests generic/086 in data=journalled mode, including:
      
      jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7c0, 164),
      jh->b_transaction (  (null), 0), jh->b_next_transaction (  (null), 0), jlist 0
      
      	      	      	  - and -
      
      kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
          ...
      Call Trace:
       [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
       [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
       [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
       [<c027d883>] ? lock_buffer+0x36/0x36
       [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
       [<c0229139>] do_invalidatepage+0x22/0x26
       [<c0229198>] truncate_inode_page+0x5b/0x85
       [<c022934b>] truncate_inode_pages_range+0x156/0x38c
       [<c0229592>] truncate_inode_pages+0x11/0x15
       [<c022962d>] truncate_pagecache+0x55/0x71
       [<c02b913b>] ext4_setattr+0x4a9/0x560
       [<c01ca542>] ? current_kernel_time+0x10/0x44
       [<c026c4d8>] notify_change+0x1c7/0x2be
       [<c0256a00>] do_truncate+0x65/0x85
       [<c0226f31>] ? file_ra_state_init+0x12/0x29
      
      	      	      	  - and -
      
      WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
      irty_metadata+0x14a/0x1ae()
          ...
      Call Trace:
       [<c01b879f>] ? console_unlock+0x3a1/0x3ce
       [<c082cbb4>] dump_stack+0x48/0x60
       [<c0178b65>] warn_slowpath_common+0x89/0xa0
       [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
       [<c0178bef>] warn_slowpath_null+0x14/0x18
       [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
       [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
       [<c02b2f44>] write_end_fn+0x40/0x53
       [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
       [<c02b59e7>] ext4_writepage+0x354/0x3b8
       [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
       [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
       [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
       [<c02b5a5b>] __writepage+0x10/0x2e
       [<c0225956>] write_cache_pages+0x22d/0x32c
       [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
       [<c02b6ee8>] ext4_writepages+0x102/0x607
       [<c019adfe>] ? sched_clock_local+0x10/0x10e
       [<c01a8a7c>] ? __lock_is_held+0x2e/0x44
       [<c01a8ad5>] ? lock_is_held+0x43/0x51
       [<c0226dff>] do_writepages+0x1c/0x29
       [<c0276bed>] __writeback_single_inode+0xc3/0x545
       [<c0277c07>] writeback_sb_inodes+0x21f/0x36d
          ...
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4 crypto: fail the mount if blocksize != pagesize · 1cb767cd
      Theodore Ts'o committed
      We currently don't correctly handle the case where blocksize !=
      pagesize, so disallow the mount in those cases.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  3. 09 Jun, 2015 6 commits
  4. 08 Jun, 2015 6 commits
    • ext4: BUG_ON assertion repeated for inode1, not done for inode2 · 8bc3b1e6
      David Moore committed
      During a source code review of fs/ext4/extents.c I noted identical
      consecutive lines. An assertion is repeated for inode1 and never done
      for inode2. This is not in keeping with the rest of the code in the
      ext4_swap_extents function and appears to be a bug.
      
      Assert that the inode2 mutex is not locked.
      Signed-off-by: David Moore <dmoorefo@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
    • T
    • ext4: return error code from ext4_mb_good_group() · 42ac1848
      Lukas Czerner committed
      Currently ext4_mb_good_group() only returns 0 or 1 depending on whether
      the allocation group is suitable for use or not. However we might get
      various errors and fail while initializing new group including -EIO
      which would never get propagated up the call chain. This might lead to
      an endless loop at writeback when we're trying to find a good group to
      allocate from and we fail to initialize new group (read error for
      example).
      
      Fix this by returning a proper error code from ext4_mb_good_group() and
      using it in ext4_mb_regular_allocator(). In ext4_mb_regular_allocator()
      we will always return only the first error that occurred in
      ext4_mb_good_group(), and we only propagate it back to the caller if we
      do not get any other errors and we fail to allocate any blocks.
      
      Note that with modes other than errors=continue, we will fail
      immediately in ext4_mb_good_group() in case of error. However, with
      errors=continue we should try to continue using the file system, which
      is why we do not fail immediately when we see an error from
      ext4_mb_good_group(), but rather when we fail to find a suitable block
      group to allocate from due to a problem in group initialization.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
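The "remember the first error, report it only if nothing could be allocated" behaviour described above can be sketched as below. The function `pick_group`, the status encoding, and the choice of `-ENOSPC` when no group is usable and no error occurred are all assumptions for this illustration; the real allocator is considerably more involved.

```c
#include <errno.h>

/*
 * Hypothetical per-group status: 1 = usable, 0 = not suitable,
 * negative = -errno from a failed group initialization.
 *
 * Scans all groups. On success stores the chosen index in *chosen and
 * returns 0. If no group is usable, returns the first error seen, or
 * -ENOSPC (an assumption of this sketch) if there was no error at all.
 */
int pick_group(const int *status, int ngroups, int *chosen)
{
    int first_err = 0;

    for (int g = 0; g < ngroups; g++) {
        int ret = status[g];

        if (ret < 0) {
            /* Keep only the first error and keep looking. */
            if (!first_err)
                first_err = ret;
            continue;
        }
        if (ret > 0) {
            *chosen = g;
            return 0;   /* success hides any earlier error */
        }
    }
    return first_err ? first_err : -ENOSPC;
}
```

Because the scan continues past a failing group, a single unreadable bitmap does not stop allocation from a healthy group, yet the caller still gets a real error code instead of looping forever when every candidate fails.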
    • ext4: try to initialize all groups we can in case of failure on ppc64 · bbdc322f
      Lukas Czerner committed
      Currently on the machines with page size > block size when initializing
      block group buddy cache we initialize it for all the block group bitmaps
      in the page. However in the case of read error, checksum error, or if
      a single bitmap is in any way corrupted we would fail to initialize all
      of the bitmaps. This is problematic because we will not have access to
      the other allocation groups even though those might be perfectly fine
      and usable.
      
      Fix this by reading all the bitmaps instead of erroring out on the first
      problem, and simply skipping the bitmaps which were either not read
      properly or are not valid.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: verify block bitmap even after fresh initialization · 41e5b7ed
      Lukas Czerner committed
      If we want to rely on the buffer_verified() flag of the block bitmap
      buffer, we have to set it consistently. However currently if we're
      initializing uninitialized block bitmap in
      ext4_read_block_bitmap_nowait() we're not going to set buffer verified
      at all.
      
      We can do this by simply setting the flag on the buffer, but I think
      it's actually better to run ext4_validate_block_bitmap() to make sure
      that what we did in the ext4_init_block_bitmap() is right.
      
      So run ext4_validate_block_bitmap() even after the block bitmap
      initialization. Also bail out early from ext4_validate_block_bitmap() if
      we see corrupt bitmap, since we already know it's corrupt and we do not
      need to verify that.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • jbd2: revert must-not-fail allocation loops back to GFP_NOFAIL · 6ccaf3e2
      Michal Hocko committed
      This basically reverts 47def826 (jbd2: Remove __GFP_NOFAIL from jbd2
      layer). The deprecation of __GFP_NOFAIL was a bad choice because it led
      to open coding the endless loop around the allocator rather than
      removing the dependency on the non failing allocation. So the
      deprecation was a clear failure and the reality tells us that
      __GFP_NOFAIL is not even close to going away.
      
      It is still true that __GFP_NOFAIL allocations are generally discouraged
      and new uses should be evaluated and an alternative (pre-allocations or
      reservations) should be considered, but it doesn't make any sense to lie
      to the allocator about the requirements. The allocator can take steps to
      help make progress if it knows the requirements.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Acked-by: David Rientjes <rientjes@google.com>
  5. 03 Jun, 2015 1 commit
    • ext4 crypto: allocate bounce pages using GFP_NOWAIT · 3dbb5eb9
      Theodore Ts'o committed
      Previously we allocated bounce pages using a combination of
      alloc_page() and mempool_alloc() with the __GFP_WAIT bit set.
      Instead, use mempool_alloc() with GFP_NOWAIT.  The mempool_alloc()
      function will try using alloc_pages() initially, and then only use the
      mempool reserve of pages if alloc_pages() is unable to fulfill the
      request.
      
      This minimizes the impact on the mm layer when we need to do a
      large amount of writeback of encrypted files, as Jaeguk Kim had
      reported that under a heavy fio workload on a system with restricted
      amounts of memory (which, unfortunately, includes many mobile handsets),
      he had observed the OOM killer getting triggered several times.
      Using GFP_NOWAIT
      
      If the mempool_alloc() function fails, we will retry the page
      writeback at a later time; the function of the mempool is to ensure
      that we can writeback at least 32 pages at a time, so we can more
      efficiently dispatch I/O under high memory pressure situations.  In
      the future we should make this be a tunable so we can determine the
      best tradeoff between permanently sequestering memory and the ability
      to quickly launder pages so we can free up memory quickly when
      necessary.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
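The allocation order described in this commit (try the normal allocator without waiting, fall back to a fixed reserve, and otherwise fail so the caller can retry writeback later) can be modelled in user space roughly as follows. This is not the kernel mempool API: `struct page_pool`, `pool_init`, `pool_alloc`, `pool_free`, `always_fail`, and the 4096-byte page size are hypothetical names and values chosen for the sketch. Only the 32-page reserve size comes from the text above.

```c
#include <stdlib.h>
#include <stddef.h>

#define RESERVE_PAGES 32    /* reserve size mentioned in the commit text */
#define PAGE_BYTES    4096  /* assumed page size for this sketch */

typedef void *(*alloc_fn)(size_t);

struct page_pool {
    void *reserve[RESERVE_PAGES];
    int   nfree;            /* pages currently held in the reserve */
};

/* Simulates an allocator under memory pressure: never succeeds. */
void *always_fail(size_t n)
{
    (void)n;
    return NULL;
}

/* Fill the reserve up front. Returns 0 on success, -1 on failure. */
int pool_init(struct page_pool *p)
{
    p->nfree = 0;
    for (int i = 0; i < RESERVE_PAGES; i++) {
        void *page = malloc(PAGE_BYTES);
        if (!page) {
            while (p->nfree > 0)
                free(p->reserve[--p->nfree]);
            return -1;
        }
        p->reserve[p->nfree++] = page;
    }
    return 0;
}

/*
 * Try the ordinary allocator first; use the reserve only if that fails.
 * Returns NULL when both are exhausted, so the caller can retry later.
 */
void *pool_alloc(struct page_pool *p, alloc_fn try_alloc)
{
    void *page = try_alloc(PAGE_BYTES);

    if (page)
        return page;
    if (p->nfree > 0)
        return p->reserve[--p->nfree];
    return NULL;
}

/* Refill the reserve first; release the page only when it is full. */
void pool_free(struct page_pool *p, void *page)
{
    if (p->nfree < RESERVE_PAGES)
        p->reserve[p->nfree++] = page;
    else
        free(page);
}
```

The design trade-off noted in the commit shows up directly in `RESERVE_PAGES`: a larger reserve guarantees more forward progress under pressure but permanently sets aside more memory.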
  6. 01 Jun, 2015 13 commits
  7. 19 May, 2015 6 commits
    • ext4 crypto: get rid of ci_mode from struct ext4_crypt_info · 1aaa6e8b
      Theodore Ts'o committed
      The ci_mode field was superfluous, and getting rid of it gets rid of
      an unused hole in the structure.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4 crypto: use slab caches · 8ee03714
      Theodore Ts'o committed
      Use slab caches for the ext4_crypto_ctx and ext4_crypt_info structures
      for slightly better memory efficiency and debuggability.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: clean up superblock encryption mode fields · f5aed2c2
      Theodore Ts'o committed
      The superblock fields s_file_encryption_mode and s_dir_encryption_mode
      are vestigial, so remove them as a cleanup.  While we're at it, allow
      file systems with both encryption and inline_data enabled at the same
      time to work correctly.  We can't have encrypted inodes with inline
      data, but there's no reason to prohibit unencrypted inodes from using
      the inline data feature.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4 crypto: reorganize how we store keys in the inode · b7236e21
      Theodore Ts'o committed
      This is a pretty massive patch which does a number of different things:
      
      1) The per-inode encryption information is now stored in an allocated
         data structure, ext4_crypt_info, instead of directly in the inode.
         This reduces the size of an in-memory inode when it is not
         using encryption.
      
      2) We drop the ext4_fname_crypto_ctx entirely, and use the per-inode
         encryption structure instead.  This removes an unnecessary memory
         allocation and free for the fname_crypto_ctx, and allows us
         to reuse the ctfm in a directory for multiple lookups and file
         creations.
      
      3) We also cache the inode's policy information in the ext4_crypt_info
         structure so we don't have to continually read it out of the
         extended attributes.
      
      4) We now keep the keyring key in the inode's encryption structure
         instead of releasing it after we are done using it to derive the
         per-inode key.  This allows us to test to see if the key has been
         revoked; if it has, we prevent the use of the derived key and free
         it.
      
      5) When an inode is released (or when the derived key is freed), we
         will use memset_explicit() to zero out the derived key, so it's not
         left hanging around in memory.  This implies that when a user logs
         out, it is important to first revoke the key, and then unlink it,
         and then finally, to use "echo 3 > /proc/sys/vm/drop_caches" to
         release any decrypted pages and dcache entries from the system
         caches.
      
      6) All this, and we also shrink the number of lines of code by around
         100.  :-)
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4 crypto: separate kernel and userspace structure for the key · e2881b1b
      Theodore Ts'o committed
      Use struct ext4_encryption_key only for the master key passed via the
      kernel keyring.
      
      For internal kernel space users, we now use struct ext4_crypt_info.
      This will allow us to put information from the policy structure there so
      we can cache it and avoid needing to constantly look up the extended
      attribute.  We will do this in a separate patch.  This patch is mostly
      mechanical to make it easier for patch review.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • T