1. 09 6月, 2015 4 次提交
  2. 08 6月, 2015 6 次提交
    • D
      ext4: BUG_ON assertion repeated for inode1, not done for inode2 · 8bc3b1e6
      David Moore 提交于
      During a source code review of fs/ext4/extents.c I noted identical
      consecutive lines. An assertion is repeated for inode1 and never done
      for inode2. This is not in keeping with the rest of the code in the
      ext4_swap_extents function and appears to be a bug.
      
      Assert that the inode2 mutex is not locked.
      Signed-off-by: NDavid Moore <dmoorefo@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      8bc3b1e6
    • T
    • L
      ext4: return error code from ext4_mb_good_group() · 42ac1848
      Lukas Czerner 提交于
      Currently ext4_mb_good_group() only returns 0 or 1 depending on whether
      the allocation group is suitable for use or not. However we might get
      various errors and fail while initializing new group including -EIO
      which would never get propagated up the call chain. This might lead to
      an endless loop at writeback when we're trying to find a good group to
      allocate from and we fail to initialize new group (read error for
      example).
      
      Fix this by returning proper error code from ext4_mb_good_group() and
      using it in ext4_mb_regular_allocator(). In ext4_mb_regular_allocator()
      we will always return only the first occurred error from
      ext4_mb_good_group() and we only propagate it back  to the caller if we
      do not get any other errors and we fail to allocate any blocks.
      
      Note that with other modes than errors=continue, we will fail
      immediately in ext4_mb_good_group() in case of error, however with
      errors=continue we should try to continue using the file system, that's
      why we're not going to fail immediately when we see an error from
      ext4_mb_good_group(), but rather when we fail to find a suitable block
      group to allocate from due to an problem in group initialization.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      42ac1848
    • L
      ext4: try to initialize all groups we can in case of failure on ppc64 · bbdc322f
      Lukas Czerner 提交于
      Currently on the machines with page size > block size when initializing
      block group buddy cache we initialize it for all the block group bitmaps
      in the page. However in the case of read error, checksum error, or if
      a single bitmap is in any way corrupted we would fail to initialize all
      of the bitmaps. This is problematic because we will not have access to
      the other allocation groups even though those might be perfectly fine
      and usable.
      
      Fix this by reading all the bitmaps instead of error out on the first
      problem and simply skip the bitmaps which were either not read properly,
      or are not valid.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      bbdc322f
    • L
      ext4: verify block bitmap even after fresh initialization · 41e5b7ed
      Lukas Czerner 提交于
      If we want to rely on the buffer_verified() flag of the block bitmap
      buffer, we have to set it consistently. However currently if we're
      initializing uninitialized block bitmap in
      ext4_read_block_bitmap_nowait() we're not going to set buffer verified
      at all.
      
      We can do this by simply setting the flag on the buffer, but I think
      it's actually better to run ext4_validate_block_bitmap() to make sure
      that what we did in the ext4_init_block_bitmap() is right.
      
      So run ext4_validate_block_bitmap() even after the block bitmap
      initialization. Also bail out early from ext4_validate_block_bitmap() if
      we see corrupt bitmap, since we already know it's corrupt and we do not
      need to verify that.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      41e5b7ed
    • M
      jbd2: revert must-not-fail allocation loops back to GFP_NOFAIL · 6ccaf3e2
      Michal Hocko 提交于
      This basically reverts 47def826 (jbd2: Remove __GFP_NOFAIL from jbd2
      layer). The deprecation of __GFP_NOFAIL was a bad choice because it led
      to open coding the endless loop around the allocator rather than
      removing the dependency on the non failing allocation. So the
      deprecation was a clear failure and the reality tells us that
      __GFP_NOFAIL is not even close to go away.
      
      It is still true that __GFP_NOFAIL allocations are generally discouraged
      and new uses should be evaluated and an alternative (pre-allocations or
      reservations) should be considered but it doesn't make any sense to lie
      the allocator about the requirements. Allocator can take steps to help
      making a progress if it knows the requirements.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      6ccaf3e2
  3. 03 6月, 2015 1 次提交
    • T
      ext4 crypto: allocate bounce pages using GFP_NOWAIT · 3dbb5eb9
      Theodore Ts'o 提交于
      Previously we allocated bounce pages using a combination of
      alloc_page() and mempool_alloc() with the __GFP_WAIT bit set.
      Instead, use mempool_alloc() with GFP_NOWAIT.  The mempool_alloc()
      function will try using alloc_pages() initially, and then only use the
      mempool reserve of pages if alloc_pages() is unable to fulfill the
      request.
      
      This minimizes the the impact on the mm layer when we need to do a
      large amount of writeback of encrypted files, as Jaeguk Kim had
      reported that under a heavy fio workload on a system with restricted
      amounts memory (which unfortunately, includes many mobile handsets),
      he had observed the the OOM killer getting triggered several times.
      Using GFP_NOWAIT
      
      If the mempool_alloc() function fails, we will retry the page
      writeback at a later time; the function of the mempool is to ensure
      that we can writeback at least 32 pages at a time, so we can more
      efficiently dispatch I/O under high memory pressure situations.  In
      the future we should make this be a tunable so we can determine the
      best tradeoff between permanently sequestering memory and the ability
      to quickly launder pages so we can free up memory quickly when
      necessary.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      3dbb5eb9
  4. 01 6月, 2015 13 次提交
  5. 19 5月, 2015 7 次提交
    • T
      ext4 crypto: get rid of ci_mode from struct ext4_crypt_info · 1aaa6e8b
      Theodore Ts'o 提交于
      The ci_mode field was superfluous, and getting rid of it gets rid of
      an unused hole in the structure.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      1aaa6e8b
    • T
      ext4 crypto: use slab caches · 8ee03714
      Theodore Ts'o 提交于
      Use slab caches the ext4_crypto_ctx and ext4_crypt_info structures for
      slighly better memory efficiency and debuggability.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      8ee03714
    • T
      ext4: clean up superblock encryption mode fields · f5aed2c2
      Theodore Ts'o 提交于
      The superblock fields s_file_encryption_mode and s_dir_encryption_mode
      are vestigal, so remove them as a cleanup.  While we're at it, allow
      file systems with both encryption and inline_data enabled at the same
      time to work correctly.  We can't have encrypted inodes with inline
      data, but there's no reason to prohibit unencrypted inodes from using
      the inline data feature.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      f5aed2c2
    • T
      ext4 crypto: reorganize how we store keys in the inode · b7236e21
      Theodore Ts'o 提交于
      This is a pretty massive patch which does a number of different things:
      
      1) The per-inode encryption information is now stored in an allocated
         data structure, ext4_crypt_info, instead of directly in the node.
         This reduces the size usage of an in-memory inode when it is not
         using encryption.
      
      2) We drop the ext4_fname_crypto_ctx entirely, and use the per-inode
         encryption structure instead.  This remove an unnecessary memory
         allocation and free for the fname_crypto_ctx as well as allowing us
         to reuse the ctfm in a directory for multiple lookups and file
         creations.
      
      3) We also cache the inode's policy information in the ext4_crypt_info
         structure so we don't have to continually read it out of the
         extended attributes.
      
      4) We now keep the keyring key in the inode's encryption structure
         instead of releasing it after we are done using it to derive the
         per-inode key.  This allows us to test to see if the key has been
         revoked; if it has, we prevent the use of the derived key and free
         it.
      
      5) When an inode is released (or when the derived key is freed), we
         will use memset_explicit() to zero out the derived key, so it's not
         left hanging around in memory.  This implies that when a user logs
         out, it is important to first revoke the key, and then unlink it,
         and then finally, to use "echo 3 > /proc/sys/vm/drop_caches" to
         release any decrypted pages and dcache entries from the system
         caches.
      
      6) All this, and we also shrink the number of lines of code by around
         100.  :-)
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      b7236e21
    • T
      ext4 crypto: separate kernel and userspace structure for the key · e2881b1b
      Theodore Ts'o 提交于
      Use struct ext4_encryption_key only for the master key passed via the
      kernel keyring.
      
      For internal kernel space users, we now use struct ext4_crypt_info.
      This will allow us to put information from the policy structure so we
      can cache it and avoid needing to constantly looking up the extended
      attribute.  We will do this in a spearate patch.  This patch is mostly
      mechnical to make it easier for patch review.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      e2881b1b
    • T
    • T
      ext4 crypto: optimize filename encryption · 5b643f9c
      Theodore Ts'o 提交于
      Encrypt the filename as soon it is passed in by the user.  This avoids
      our needing to encrypt the filename 2 or 3 times while in the process
      of creating a filename.
      
      Similarly, when looking up a directory entry, encrypt the filename
      early, or if the encryption key is not available, base-64 decode the
      file syystem so that the hash value and the last 16 bytes of the
      encrypted filename is available in the new struct ext4_filename data
      structure.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      5b643f9c
  6. 15 5月, 2015 8 次提交
    • T
      ext4: fix an ext3 collapse range regression in xfstests · b9576fc3
      Theodore Ts'o 提交于
      The xfstests test suite assumes that an attempt to collapse range on
      the range (0, 1) will return EOPNOTSUPP if the file system does not
      support collapse range.  Commit 280227a7: "ext4: move check under
      lock scope to close a race" broke this, and this caused xfstests to
      fail when run when testing file systems that did not have the extents
      feature enabled.
      Reported-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      b9576fc3
    • V
      kernfs: do not account ino_ida allocations to memcg · 499611ed
      Vladimir Davydov 提交于
      root->ino_ida is used for kernfs inode number allocations. Since IDA has
      a layered structure, different IDs can reside on the same layer, which
      is currently accounted to some memory cgroup. The problem is that each
      kmem cache of a memory cgroup has its own directory on sysfs (under
      /sys/fs/kernel/<cache-name>/cgroup). If the inode number of such a
      directory or any file in it gets allocated from a layer accounted to the
      cgroup which the cache is created for, the cgroup will get pinned for
      good, because one has to free all kmem allocations accounted to a cgroup
      in order to release it and destroy all its kmem caches. That said we
      must not account layers of ino_ida to any memory cgroup.
      
      Since per net init operations may create new sysfs entries directly
      (e.g. lo device) or indirectly (nf_conntrack creates a new kmem cache
      per each namespace, which, in turn, creates new sysfs entries), an easy
      way to reproduce this issue is by creating network namespace(s) from
      inside a kmem-active memory cgroup.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[4.0.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      499611ed
    • D
      jbd2: fix r_count overflows leading to buffer overflow in journal recovery · e531d0bc
      Darrick J. Wong 提交于
      The journal revoke block recovery code does not check r_count for
      sanity, which means that an evil value of r_count could result in
      the kernel reading off the end of the revoke table and into whatever
      garbage lies beyond.  This could crash the kernel, so fix that.
      
      However, in testing this fix, I discovered that the code to write
      out the revoke tables also was not correctly checking to see if the
      block was full -- the current offset check is fine so long as the
      revoke table space size is a multiple of the record size, but this
      is not true when either journal_csum_v[23] are set.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      e531d0bc
    • E
      ext4: check for zero length extent explicitly · 2f974865
      Eryu Guan 提交于
      The following commit introduced a bug when checking for zero length extent
      
      5946d089 ext4: check for overlapping extents in ext4_valid_extent_entries()
      
      Zero length extent could pass the check if lblock is zero.
      
      Adding the explicit check for zero length back.
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      2f974865
    • L
      ext4: fix NULL pointer dereference when journal restart fails · 9d506594
      Lukas Czerner 提交于
      Currently when journal restart fails, we'll have the h_transaction of
      the handle set to NULL to indicate that the handle has been effectively
      aborted. We handle this situation quietly in the jbd2_journal_stop() and just
      free the handle and exit because everything else has been done before we
      attempted (and failed) to restart the journal.
      
      Unfortunately there are a number of problems with that approach
      introduced with commit
      
      41a5b913 "jbd2: invalidate handle if jbd2_journal_restart()
      fails"
      
      First of all in ext4 jbd2_journal_stop() will be called through
      __ext4_journal_stop() where we would try to get a hold of the superblock
      by dereferencing h_transaction which in this case would lead to NULL
      pointer dereference and crash.
      
      In addition we're going to free the handle regardless of the refcount
      which is bad as well, because others up the call chain will still
      reference the handle so we might potentially reference already freed
      memory.
      
      Moreover it's expected that we'll get aborted handle as well as detached
      handle in some of the journalling function as the error propagates up
      the stack, so it's unnecessary to call WARN_ON every time we get
      detached handle.
      
      And finally we might leak some memory by forgetting to free reserved
      handle in jbd2_journal_stop() in the case where handle was detached from
      the transaction (h_transaction is NULL).
      
      Fix the NULL pointer dereference in __ext4_journal_stop() by just
      calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
      the potential memory leak in jbd2_journal_stop() and use proper
      handle refcounting before we attempt to free it to avoid use-after-free
      issues.
      
      And finally remove all WARN_ON(!transaction) from the code so that we do
      not get random traces when something goes wrong because when journal
      restart fails we will get to some of those functions.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      9d506594
    • T
      ext4: remove unused function prototype from ext4.h · 92c82639
      Theodore Ts'o 提交于
      The ext4_extent_tree_init() function hasn't been in the ext4 code for
      a long time ago, except in an unused function prototype in ext4.h
      
      Google-Bug-Id: 4530137
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      92c82639
    • T
      ext4: don't save the error information if the block device is read-only · 1b46617b
      Theodore Ts'o 提交于
      Google-Bug-Id: 20939131
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      1b46617b
    • T
      ext4: fix lazytime optimization · 8f4d8558
      Theodore Ts'o 提交于
      We had a fencepost error in the lazytime optimization which means that
      timestamp would get written to the wrong inode.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      8f4d8558
  7. 13 5月, 2015 1 次提交
    • H
      parisc,metag: Fix crashes due to stack randomization on stack-grows-upwards architectures · d045c77c
      Helge Deller 提交于
      On architectures where the stack grows upwards (CONFIG_STACK_GROWSUP=y,
      currently parisc and metag only) stack randomization sometimes leads to crashes
      when the stack ulimit is set to lower values than STACK_RND_MASK (which is 8 MB
      by default if not defined in arch-specific headers).
      
      The problem is, that when the stack vm_area_struct is set up in fs/exec.c, the
      additional space needed for the stack randomization (as defined by the value of
      STACK_RND_MASK) was not taken into account yet and as such, when the stack
      randomization code added a random offset to the stack start, the stack
      effectively got smaller than what the user defined via rlimit_max(RLIMIT_STACK)
      which then sometimes leads to out-of-stack situations and crashes.
      
      This patch fixes it by adding the maximum possible amount of memory (based on
      STACK_RND_MASK) which theoretically could be added by the stack randomization
      code to the initial stack size. That way, the user-defined stack size is always
      guaranteed to be at minimum what is defined via rlimit_max(RLIMIT_STACK).
      
      This bug is currently not visible on the metag architecture, because on metag
      STACK_RND_MASK is defined to 0 which effectively disables stack randomization.
      
      The changes to fs/exec.c are inside an "#ifdef CONFIG_STACK_GROWSUP"
      section, so it does not affect other platformws beside those where the
      stack grows upwards (parisc and metag).
      Signed-off-by: NHelge Deller <deller@gmx.de>
      Cc: linux-parisc@vger.kernel.org
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: linux-metag@vger.kernel.org
      Cc: stable@vger.kernel.org # v3.16+
      d045c77c