1. 19 11月, 2016 3 次提交
  2. 15 11月, 2016 1 次提交
    • D
      ext4: use current_time() for inode timestamps · eeca7ea1
      Deepa Dinamani 提交于
      CURRENT_TIME_SEC and CURRENT_TIME are not y2038 safe.
      current_time() will be transitioned to be y2038 safe
      along with vfs.
      
      current_time() returns timestamps according to the
      granularities set in the super_block.
      The granularity check in ext4_current_time() to call
      current_time() or CURRENT_TIME_SEC is not required.
      Use current_time() directly to obtain timestamps
      unconditionally, and remove ext4_current_time().
      
      Quota files are assumed to be on the same filesystem.
      Hence, use current_time() for these files as well.
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NArnd Bergmann <arnd@arndb.de>
      eeca7ea1
  3. 14 11月, 2016 2 次提交
    • T
      ext4: don't lock buffer in ext4_commit_super if holding spinlock · 1566a48a
      Theodore Ts'o 提交于
      If there is an error reported in mballoc via ext4_grp_locked_error(),
      the code is holding a spinlock, so ext4_commit_super() must not try to
      lock the buffer head, or else it will trigger a BUG:
      
        BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:358
        in_atomic(): 1, irqs_disabled(): 0, pid: 993, name: mount
        CPU: 0 PID: 993 Comm: mount Not tainted 4.9.0-rc1-clouder1 #62
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
         ffff880006423548 ffffffff81318c89 ffffffff819ecdd0 0000000000000166
         ffff880006423558 ffffffff810810b0 ffff880006423580 ffffffff81081153
         ffff880006e5a1a0 ffff88000690e400 0000000000000000 ffff8800064235c0
        Call Trace:
          [<ffffffff81318c89>] dump_stack+0x67/0x9e
          [<ffffffff810810b0>] ___might_sleep+0xf0/0x140
          [<ffffffff81081153>] __might_sleep+0x53/0xb0
          [<ffffffff8126c1dc>] ext4_commit_super+0x19c/0x290
          [<ffffffff8126e61a>] __ext4_grp_locked_error+0x14a/0x230
          [<ffffffff81081153>] ? __might_sleep+0x53/0xb0
          [<ffffffff812822be>] ext4_mb_generate_buddy+0x1de/0x320
      
      Since ext4_grp_locked_error() calls ext4_commit_super with sync == 0
      (and it is the only caller which does so), avoid locking and unlocking
      the buffer in this case.
      
      This can result in races with ext4_commit_super() if there are other
      problems (which is what commit 4743f839 was trying to address),
      but a Warning is better than BUG.
      
      Fixes: 4743f839
      Cc: stable@vger.kernel.org # 4.9
      Reported-by: NNikolay Borisov <kernel@kyup.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      1566a48a
    • T
      ext4: allow ext4_truncate() to return an error · 2c98eb5e
      Theodore Ts'o 提交于
      This allows us to properly propagate errors back up to
      ext4_truncate()'s callers.  This also means we no longer have to
      silently ignore some errors (e.g., when trying to add the inode to the
      orphan inode list).
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      2c98eb5e
  4. 13 10月, 2016 1 次提交
  5. 30 9月, 2016 2 次提交
  6. 06 9月, 2016 2 次提交
    • D
      ext4: improve ext4lazyinit scalability · e22834f0
      Dmitry Monakhov 提交于
      ext4lazyinit is a global thread. This thread performs itable
      initalization under li_list_mtx mutex.
      
      It basically does the following:
      ext4_lazyinit_thread
        ->mutex_lock(&eli->li_list_mtx);
        ->ext4_run_li_request(elr)
          ->ext4_init_inode_table-> Do a lot of IO if the list is large
      
      And when new mount/umount arrive they have to block on ->li_list_mtx
      because  lazy_thread holds it during full walk procedure.
      ext4_fill_super
       ->ext4_register_li_request
         ->mutex_lock(&ext4_li_info->li_list_mtx);
         ->list_add(&elr->lr_request, &ext4_li_info >li_request_list);
      In my case mount takes 40minutes on server with 36 * 4Tb HDD.
      Common user may face this in case of very slow dev ( /dev/mmcblkXXX)
      Even more. If one of filesystems was frozen lazyinit_thread will simply
      block on sb_start_write() so other mount/umount will be stuck forever.
      
      This patch changes logic like follows:
      - grab ->s_umount read sem before processing new li_request.
        After that it is safe to drop li_list_mtx because all callers of
        li_remove_request are holding ->s_umount for write.
      - li_thread skips frozen SB's
      
      Locking order:
      Mh KOrder is asserted by umount path like follows: s_umount ->li_list_mtx so
      the only way to to grab ->s_mount inside li_thread is via down_read_trylock
      
      xfstests:ext4/023
      #PSBM-49658
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      e22834f0
    • J
      ext4: enable quota enforcement based on mount options · 49da9392
      Jan Kara 提交于
      When quota information is stored in quota files, we enable only quota
      accounting on mount and enforcement is enabled only in response to
      Q_QUOTAON quotactl. To make ext4 behavior consistent with XFS, we add a
      possibility to enable quota enforcement on mount by specifying
      corresponding quota mount option (usrquota, grpquota, prjquota).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      49da9392
  7. 01 8月, 2016 1 次提交
  8. 15 7月, 2016 1 次提交
    • V
      ext4: short-cut orphan cleanup on error · c65d5c6c
      Vegard Nossum 提交于
      If we encounter a filesystem error during orphan cleanup, we should stop.
      Otherwise, we may end up in an infinite loop where the same inode is
      processed again and again.
      
          EXT4-fs (loop0): warning: checktime reached, running e2fsck is recommended
          EXT4-fs error (device loop0): ext4_mb_generate_buddy:758: group 2, block bitmap and bg descriptor inconsistent: 6117 vs 0 free clusters
          Aborting journal on device loop0-8.
          EXT4-fs (loop0): Remounting filesystem read-only
          EXT4-fs error (device loop0) in ext4_free_blocks:4895: Journal has aborted
          EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
          EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
          EXT4-fs error (device loop0) in ext4_ext_remove_space:3068: IO failure
          EXT4-fs error (device loop0) in ext4_ext_truncate:4667: Journal has aborted
          EXT4-fs error (device loop0) in ext4_orphan_del:2927: Journal has aborted
          EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
          EXT4-fs (loop0): Inode 16 (00000000618192a0): orphan list check failed!
          [...]
          EXT4-fs (loop0): Inode 16 (0000000061819748): orphan list check failed!
          [...]
          EXT4-fs (loop0): Inode 16 (0000000061819bf0): orphan list check failed!
          [...]
      
      See-also: c9eb13a9 ("ext4: fix hang when processing corrupted orphaned inode list")
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      c65d5c6c
  9. 11 7月, 2016 1 次提交
  10. 06 7月, 2016 1 次提交
    • T
      ext4: validate s_reserved_gdt_blocks on mount · 5b9554dc
      Theodore Ts'o 提交于
      If s_reserved_gdt_blocks is extremely large, it's possible for
      ext4_init_block_bitmap(), which is called when ext4 sets up an
      uninitialized block bitmap, to corrupt random kernel memory.  Add the
      same checks which e2fsck has --- it must never be larger than
      blocksize / sizeof(__u32) --- and then add a backup check in
      ext4_init_block_bitmap() in case the superblock gets modified after
      the file system is mounted.
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      5b9554dc
  11. 04 7月, 2016 2 次提交
  12. 08 6月, 2016 1 次提交
  13. 17 5月, 2016 1 次提交
  14. 26 4月, 2016 1 次提交
  15. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  16. 04 4月, 2016 1 次提交
    • T
      ext4: ignore quota mount options if the quota feature is enabled · c325a67c
      Theodore Ts'o 提交于
      Previously, ext4 would fail the mount if the file system had the quota
      feature enabled and quota mount options (used for the older quota
      setups) were present.  This broke xfstests, since xfs silently ignores
      the usrquote and grpquota mount options if they are specified.  This
      commit changes things so that we are consistent with xfs; having the
      mount options specified is harmless, so no sense break users by
      forbidding them.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      c325a67c
  17. 02 4月, 2016 1 次提交
  18. 01 4月, 2016 1 次提交
    • T
      ext4: add lockdep annotations for i_data_sem · daf647d2
      Theodore Ts'o 提交于
      With the internal Quota feature, mke2fs creates empty quota inodes and
      quota usage tracking is enabled as soon as the file system is mounted.
      Since quotacheck is no longer preallocating all of the blocks in the
      quota inode that are likely needed to be written to, we are now seeing
      a lockdep false positive caused by needing to allocate a quota block
      from inside ext4_map_blocks(), while holding i_data_sem for a data
      inode.  This results in this complaint:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&ei->i_data_sem);
                                      lock(&s->s_dquot.dqio_mutex);
                                      lock(&ei->i_data_sem);
         lock(&s->s_dquot.dqio_mutex);
      
      Google-Bug-Id: 27907753
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      daf647d2
  19. 13 3月, 2016 1 次提交
  20. 09 3月, 2016 2 次提交
    • J
      ext4: remove i_ioend_count · 600be30a
      Jan Kara 提交于
      Remove counter of pending io ends as it is unused.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      600be30a
    • J
      ext4: use i_mutex to serialize unaligned AIO DIO · e142d052
      Jan Kara 提交于
      Currently we've used hashed aio_mutex to serialize unaligned AIO DIO.
      However the code cleanups that happened after 2011 when the lock was
      introduced made aio_mutex acquired at almost the same places where we
      already have exclusion using i_mutex. So just use i_mutex for the
      exclusion of unaligned AIO DIO.
      
      The change moves waiting for pending unwritten extent conversion under
      i_mutex. That makes special handling of O_APPEND writes unnecessary and
      also avoids possible livelocking of unaligned AIO DIO with aligned one
      (nothing was preventing contiguous stream of aligned AIO DIOs to let
      unaligned AIO DIO wait forever).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      e142d052
  21. 23 2月, 2016 2 次提交
  22. 20 2月, 2016 1 次提交
  23. 09 2月, 2016 1 次提交
  24. 23 1月, 2016 1 次提交
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro 提交于
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  25. 15 1月, 2016 1 次提交
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov 提交于
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  26. 09 1月, 2016 2 次提交
  27. 08 12月, 2015 2 次提交
    • J
      ext4: document lock ordering · e74031fd
      Jan Kara 提交于
      We have enough locks that it's probably worth documenting the lock
      ordering rules we have in ext4.
      Signed-off-by: NJan Kara <jack@suse.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      e74031fd
    • J
      ext4: fix races between page faults and hole punching · ea3d7209
      Jan Kara 提交于
      Currently, page faults and hole punching are completely unsynchronized.
      This can result in page fault faulting in a page into a range that we
      are punching after truncate_pagecache_range() has been called and thus
      we can end up with a page mapped to disk blocks that will be shortly
      freed. Filesystem corruption will shortly follow. Note that the same
      race is avoided for truncate by checking page fault offset against
      i_size but there isn't similar mechanism available for punching holes.
      
      Fix the problem by creating new rw semaphore i_mmap_sem in inode and
      grab it for writing over truncate, hole punching, and other functions
      removing blocks from extent tree and for read over page faults. We
      cannot easily use i_data_sem for this since that ranks below transaction
      start and we need something ranking above it so that it can be held over
      the whole truncate / hole punching operation. Also remove various
      workarounds we had in the code to reduce race window when page fault
      could have created pages with stale mapping information.
      Signed-off-by: NJan Kara <jack@suse.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ea3d7209
  28. 17 11月, 2015 1 次提交
    • D
      ext2, ext4: warn when mounting with dax enabled · ef83b6e8
      Dan Williams 提交于
      Similar to XFS warn when mounting DAX while it is still considered under
      development.  Also, aspects of the DAX implementation, for example
      synchronization against multiple faults and faults causing block
      allocation, depend on the correct implementation in the filesystem.  The
      maturity of a given DAX implementation is filesystem specific.
      
      Cc: <stable@vger.kernel.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: linux-ext4@vger.kernel.org
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Acked-by: NJan Kara <jack@suse.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      ef83b6e8
  29. 07 11月, 2015 1 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  30. 19 10月, 2015 1 次提交
    • D
      ext4: do not allow journal_opts for fs w/o journal · 1e381f60
      Dmitry Monakhov 提交于
      It is appeared that we can pass journal related mount options and such options
      be shown in /proc/mounts
      
      Example:
      #mkfs.ext4 -F /dev/vdb
      #tune2fs -O ^has_journal /dev/vdb
      #mount /dev/vdb /mnt/  -ocommit=20,journal_async_commit
      #cat /proc/mounts  | grep /mnt
       /dev/vdb /mnt ext4 rw,relatime,journal_checksum,journal_async_commit,commit=20,data=ordered 0 0
      
      But options:"journal_checksum,journal_async_commit,commit=20,data=ordered" has
      nothing with reality because there is no journal at all.
      
      This patch disallow following options for journalless configurations:
       - journal_checksum
       - journal_async_commit
       - commit=%ld
       - data={writeback,ordered,journal}
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      1e381f60