1. 15 9月, 2016 3 次提交
    • F
      ext4: remove unneeded test in ext4_alloc_file_blocks() · c3fe493c
      Fabian Frederick 提交于
      ext4_alloc_file_blocks() is called from ext4_zero_range() and
      ext4_fallocate() both already testing EXT4_INODE_EXTENTS
      We can call ext_depth(inode) unconditionnally.
      
      [ Added BUG_ON check to make sure ext4_alloc_file_blocks() won't get
        called for a indirect-mapped inode in the future.  -- tytso ]
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      c3fe493c
    • F
      ext4: fix memory leak in ext4_insert_range() · edf15aa1
      Fabian Frederick 提交于
      Running xfstests generic/013 with kmemleak gives the following:
      
      unreferenced object 0xffff8801d3d27de0 (size 96):
        comm "fsstress", pid 4941, jiffies 4294860168 (age 53.485s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff818eaaf3>] kmemleak_alloc+0x23/0x40
          [<ffffffff81179805>] __kmalloc+0xf5/0x1d0
          [<ffffffff8122ef5c>] ext4_find_extent+0x1ec/0x2f0
          [<ffffffff8123530c>] ext4_insert_range+0x34c/0x4a0
          [<ffffffff81235942>] ext4_fallocate+0x4e2/0x8b0
          [<ffffffff81181334>] vfs_fallocate+0x134/0x210
          [<ffffffff8118203f>] SyS_fallocate+0x3f/0x60
          [<ffffffff818efa9b>] entry_SYSCALL_64_fastpath+0x13/0x8f
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Problem seems mitigated by dropping refs and freeing path
      when there's no path[depth].p_ext
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      edf15aa1
    • W
      ext4: bugfix for mmaped pages in mpage_release_unused_pages() · 4e800c03
      wangguang 提交于
      Pages clear buffers after ext4 delayed block allocation failed,
      However, it does not clean its pte_dirty flag.
      if the pages unmap ,in cording to the pte_dirty ,
      unmap_page_range may try to call __set_page_dirty,
      
      which may lead to the bugon at 
      mpage_prepare_extent_to_map:head = page_buffers(page);.
      
      This patch just call clear_page_dirty_for_io to clean pte_dirty 
      at mpage_release_unused_pages for pages mmaped. 
      
      Steps to reproduce the bug:
      
      (1) mmap a file in ext4
      	addr = (char *)mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
      	       	            fd, 0);
      	memset(addr, 'i', 4096);
      
      (2) return EIO at 
      
      	ext4_writepages->mpage_map_and_submit_extent->mpage_map_one_extent 
      
      which causes this log message to be print:
      
                      ext4_msg(sb, KERN_CRIT,
                              "Delayed block allocation failed for "
                              "inode %lu at logical offset %llu with"
                              " max blocks %u with error %d",
                              inode->i_ino,
                              (unsigned long long)map->m_lblk,
                              (unsigned)map->m_len, -err);
      
      (3)Unmap the addr cause warning at
      
      	__set_page_dirty:WARN_ON_ONCE(warn && !PageUptodate(page));
      
      (4) wait for a minute,then bugon happen.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nwangguang <wangguang03@zte.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      4e800c03
  2. 06 9月, 2016 5 次提交
    • D
      ext4: improve ext4lazyinit scalability · e22834f0
      Dmitry Monakhov 提交于
      ext4lazyinit is a global thread. This thread performs itable
      initalization under li_list_mtx mutex.
      
      It basically does the following:
      ext4_lazyinit_thread
        ->mutex_lock(&eli->li_list_mtx);
        ->ext4_run_li_request(elr)
          ->ext4_init_inode_table-> Do a lot of IO if the list is large
      
      And when new mount/umount arrive they have to block on ->li_list_mtx
      because  lazy_thread holds it during full walk procedure.
      ext4_fill_super
       ->ext4_register_li_request
         ->mutex_lock(&ext4_li_info->li_list_mtx);
         ->list_add(&elr->lr_request, &ext4_li_info >li_request_list);
      In my case mount takes 40minutes on server with 36 * 4Tb HDD.
      Common user may face this in case of very slow dev ( /dev/mmcblkXXX)
      Even more. If one of filesystems was frozen lazyinit_thread will simply
      block on sb_start_write() so other mount/umount will be stuck forever.
      
      This patch changes logic like follows:
      - grab ->s_umount read sem before processing new li_request.
        After that it is safe to drop li_list_mtx because all callers of
        li_remove_request are holding ->s_umount for write.
      - li_thread skips frozen SB's
      
      Locking order:
      Mh KOrder is asserted by umount path like follows: s_umount ->li_list_mtx so
      the only way to to grab ->s_mount inside li_thread is via down_read_trylock
      
      xfstests:ext4/023
      #PSBM-49658
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      e22834f0
    • J
      ext4: cleanup ext4_sync_parent() · 6ae4c5a6
      Jan Kara 提交于
      A condition !hlist_empty(&inode->i_dentry) is always true for open file.
      Just remove it. Also ext4_sync_parent() could use some explanation why
      races with rmdir() are not an issue - add a comment explaining that.
      Reported-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6ae4c5a6
    • K
      ext4: remove old feature helpers · 0b7b7779
      Kaho Ng 提交于
      Use the ext4_{has,set,clear}_feature_* helpers to replace the old
      feature helpers.
      Signed-off-by: NKaho Ng <ngkaho1234@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0b7b7779
    • J
      ext4: enable quota enforcement based on mount options · 49da9392
      Jan Kara 提交于
      When quota information is stored in quota files, we enable only quota
      accounting on mount and enforcement is enabled only in response to
      Q_QUOTAON quotactl. To make ext4 behavior consistent with XFS, we add a
      possibility to enable quota enforcement on mount by specifying
      corresponding quota mount option (usrquota, grpquota, prjquota).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      49da9392
    • D
      ext4: reinforce check of i_dtime when clearing high fields of uid and gid · 93e3b4e6
      Daeho Jeong 提交于
      Now, ext4_do_update_inode() clears high 16-bit fields of uid/gid
      of deleted and evicted inode to fix up interoperability with old
      kernels. However, it checks only i_dtime of an inode to determine
      whether the inode was deleted and evicted, and this is very risky,
      because i_dtime can be used for the pointer maintaining orphan inode
      list, too. We need to further check whether the i_dtime is being
      used for the orphan inode list even if the i_dtime is not NULL.
      
      We found that high 16-bit fields of uid/gid of inode are unintentionally
      and permanently cleared when the inode truncation is just triggered,
      but not finished, and the inode metadata, whose high uid/gid bits are
      cleared, is written on disk, and the sudden power-off follows that
      in order.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: NHobin Woo <hobin.woo@samsung.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      93e3b4e6
  3. 30 8月, 2016 8 次提交
  4. 12 8月, 2016 2 次提交
    • J
      ext4: avoid deadlock when expanding inode size · 2e81a4ee
      Jan Kara 提交于
      When we need to move xattrs into external xattr block, we call
      ext4_xattr_block_set() from ext4_expand_extra_isize_ea(). That may end
      up calling ext4_mark_inode_dirty() again which will recurse back into
      the inode expansion code leading to deadlocks.
      
      Protect from recursion using EXT4_STATE_NO_EXPAND inode flag and move
      its management into ext4_expand_extra_isize_ea() since its manipulation
      is safe there (due to xattr_sem) from possible races with
      ext4_xattr_set_handle() which plays with it as well.
      
      CC: stable@vger.kernel.org   # 4.4.x
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      2e81a4ee
    • J
      ext4: properly align shifted xattrs when expanding inodes · 443a8c41
      Jan Kara 提交于
      We did not count with the padding of xattr value when computing desired
      shift of xattrs in the inode when expanding i_extra_isize. As a result
      we could create unaligned start of inline xattrs. Account for alignment
      properly.
      
      CC: stable@vger.kernel.org  # 4.4.x-
      Signed-off-by: NJan Kara <jack@suse.cz>
      443a8c41
  5. 11 8月, 2016 2 次提交
    • J
      ext4: fix xattr shifting when expanding inodes part 2 · 418c12d0
      Jan Kara 提交于
      When multiple xattrs need to be moved out of inode, we did not properly
      recompute total size of xattr headers in the inode and the new header
      position. Thus when moving the second and further xattr we asked
      ext4_xattr_shift_entries() to move too much and from the wrong place,
      resulting in possible xattr value corruption or general memory
      corruption.
      
      CC: stable@vger.kernel.org  # 4.4.x
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      418c12d0
    • J
      ext4: fix xattr shifting when expanding inodes · d0141191
      Jan Kara 提交于
      The code in ext4_expand_extra_isize_ea() treated new_extra_isize
      argument sometimes as the desired target i_extra_isize and sometimes as
      the amount by which we need to grow current i_extra_isize. These happen
      to coincide when i_extra_isize is 0 which used to be the common case and
      so nobody noticed this until recently when we added i_projid to the
      inode and so i_extra_isize now needs to grow from 28 to 32 bytes.
      
      The result of these bugs was that we sometimes unnecessarily decided to
      move xattrs out of inode even if there was enough space and we often
      ended up corrupting in-inode xattrs because arguments to
      ext4_xattr_shift_entries() were just wrong. This could demonstrate
      itself as BUG_ON in ext4_xattr_shift_entries() triggering.
      
      Fix the problem by introducing new isize_diff variable and use it where
      appropriate.
      
      CC: stable@vger.kernel.org   # 4.4.x
      Reported-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      d0141191
  6. 01 8月, 2016 2 次提交
  7. 27 7月, 2016 2 次提交
    • M
      mm, memcg: use consistent gfp flags during readahead · 8a5c743e
      Michal Hocko 提交于
      Vladimir has noticed that we might declare memcg oom even during
      readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
      restriction) while __do_page_cache_readahead uses
      page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
      OOMs.  This gfp mask discrepancy is really unfortunate and easily
      fixable.  Drop page_cache_alloc_readahead() which only has one user and
      outsource the gfp_mask logic into readahead_gfp_mask and propagate this
      mask from __do_page_cache_readahead down to read_pages.
      
      This alone would have only very limited impact as most filesystems are
      implementing ->readpages and the common implementation mpage_readpages
      does GFP_KERNEL (with mapping_gfp restriction) again.  We can tell it to
      use readahead_gfp_mask instead as this function is called only during
      readahead as well.  The same applies to read_cache_pages.
      
      ext4 has its own ext4_mpage_readpages but the path which has pages !=
      NULL can use the same gfp mask.  Btrfs, cifs, f2fs and orangefs are
      doing a very similar pattern to mpage_readpages so the same can be
      applied to them as well.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [mhocko@suse.com: restrict gfp mask in mpage_alloc]
        Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Steve French <sfrench@samba.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Changman Lee <cm224.lee@samsung.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a5c743e
    • R
      dax: remote unused fault wrappers · 6b524995
      Ross Zwisler 提交于
      Remove the unused wrappers dax_fault() and dax_pmd_fault().  After this
      removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
      dax_pmd_fault() respectively, and update all callers.
      
      The dax_fault() and dax_pmd_fault() wrappers were initially intended to
      capture some filesystem independent functionality around page faults
      (calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
      and ctime).
      
      However, the following commits:
      
         5726b27b ("ext2: Add locking for DAX faults")
         ea3d7209 ("ext4: fix races between page faults and hole punching")
      
      added locking to the ext2 and ext4 filesystems after these common
      operations but before __dax_fault() and __dax_pmd_fault() were called.
      This means that these wrappers are no longer used, and are unlikely to
      be used in the future.
      
      XFS has had locking analogous to what was recently added to ext2 and
      ext4 since DAX support was initially introduced by:
      
         6b698ede ("xfs: add DAX file operations support")
      
      Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b524995
  8. 15 7月, 2016 3 次提交
    • V
      ext4: verify extent header depth · 7bc94916
      Vegard Nossum 提交于
      Although the extent tree depth of 5 should enough be for the worst
      case of 2*32 extents of length 1, the extent tree code does not
      currently to merge nodes which are less than half-full with a sibling
      node, or to shrink the tree depth if possible.  So it's possible, at
      least in theory, for the tree depth to be greater than 5.  However,
      even in the worst case, a tree depth of 32 is highly unlikely, and if
      the file system is maliciously corrupted, an insanely large eh_depth
      can cause memory allocation failures that will trigger kernel warnings
      (here, eh_depth = 65280):
      
          JBD2: ext4.exe wants too many credits credits:195849 rsv_credits:0 max:256
          ------------[ cut here ]------------
          WARNING: CPU: 0 PID: 50 at fs/jbd2/transaction.c:293 start_this_handle+0x569/0x580
          CPU: 0 PID: 50 Comm: ext4.exe Not tainted 4.7.0-rc5+ #508
          Stack:
           604a8947 625badd8 0002fd09 00000000
           60078643 00000000 62623910 601bf9bc
           62623970 6002fc84 626239b0 900000125
          Call Trace:
           [<6001c2dc>] show_stack+0xdc/0x1a0
           [<601bf9bc>] dump_stack+0x2a/0x2e
           [<6002fc84>] __warn+0x114/0x140
           [<6002fdff>] warn_slowpath_null+0x1f/0x30
           [<60165829>] start_this_handle+0x569/0x580
           [<60165d4e>] jbd2__journal_start+0x11e/0x220
           [<60146690>] __ext4_journal_start_sb+0x60/0xa0
           [<60120a81>] ext4_truncate+0x131/0x3a0
           [<60123677>] ext4_setattr+0x757/0x840
           [<600d5d0f>] notify_change+0x16f/0x2a0
           [<600b2b16>] do_truncate+0x76/0xc0
           [<600c3e56>] path_openat+0x806/0x1300
           [<600c55c9>] do_filp_open+0x89/0xf0
           [<600b4074>] do_sys_open+0x134/0x1e0
           [<600b4140>] SyS_open+0x20/0x30
           [<6001ea68>] handle_syscall+0x88/0x90
           [<600295fd>] userspace+0x3fd/0x500
           [<6001ac55>] fork_handler+0x85/0x90
      
          ---[ end trace 08b0b88b6387a244 ]---
      
      [ Commit message modified and the extent tree depath check changed
      from 5 to 32 -- tytso ]
      
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      7bc94916
    • V
      ext4: short-cut orphan cleanup on error · c65d5c6c
      Vegard Nossum 提交于
      If we encounter a filesystem error during orphan cleanup, we should stop.
      Otherwise, we may end up in an infinite loop where the same inode is
      processed again and again.
      
          EXT4-fs (loop0): warning: checktime reached, running e2fsck is recommended
          EXT4-fs error (device loop0): ext4_mb_generate_buddy:758: group 2, block bitmap and bg descriptor inconsistent: 6117 vs 0 free clusters
          Aborting journal on device loop0-8.
          EXT4-fs (loop0): Remounting filesystem read-only
          EXT4-fs error (device loop0) in ext4_free_blocks:4895: Journal has aborted
          EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
          EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
          EXT4-fs error (device loop0) in ext4_ext_remove_space:3068: IO failure
          EXT4-fs error (device loop0) in ext4_ext_truncate:4667: Journal has aborted
          EXT4-fs error (device loop0) in ext4_orphan_del:2927: Journal has aborted
          EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
          EXT4-fs (loop0): Inode 16 (00000000618192a0): orphan list check failed!
          [...]
          EXT4-fs (loop0): Inode 16 (0000000061819748): orphan list check failed!
          [...]
          EXT4-fs (loop0): Inode 16 (0000000061819bf0): orphan list check failed!
          [...]
      
      See-also: c9eb13a9 ("ext4: fix hang when processing corrupted orphaned inode list")
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      c65d5c6c
    • V
      ext4: fix reference counting bug on block allocation error · 554a5ccc
      Vegard Nossum 提交于
      If we hit this error when mounted with errors=continue or
      errors=remount-ro:
      
          EXT4-fs error (device loop0): ext4_mb_mark_diskspace_used:2940: comm ext4.exe: Allocating blocks 5090-6081 which overlap fs metadata
      
      then ext4_mb_new_blocks() will call ext4_mb_release_context() and try to
      continue. However, ext4_mb_release_context() is the wrong thing to call
      here since we are still actually using the allocation context.
      
      Instead, just error out. We could retry the allocation, but there is a
      possibility of getting stuck in an infinite loop instead, so this seems
      safer.
      
      [ Fixed up so we don't return EAGAIN to userspace. --tytso ]
      
      Fixes: 8556e8f3 ("ext4: Don't allow new groups to be added during block allocation")
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      554a5ccc
  9. 11 7月, 2016 1 次提交
  10. 06 7月, 2016 3 次提交
  11. 04 7月, 2016 5 次提交
  12. 30 6月, 2016 1 次提交
    • V
      ext4: check for extents that wrap around · f70749ca
      Vegard Nossum 提交于
      An extent with lblock = 4294967295 and len = 1 will pass the
      ext4_valid_extent() test:
      
      	ext4_lblk_t last = lblock + len - 1;
      
      	if (len == 0 || lblock > last)
      		return 0;
      
      since last = 4294967295 + 1 - 1 = 4294967295. This would later trigger
      the BUG_ON(es->es_lblk + es->es_len < es->es_lblk) in ext4_es_end().
      
      We can simplify it by removing the - 1 altogether and changing the test
      to use lblock + len <= lblock, since now if len = 0, then lblock + 0 ==
      lblock and it fails, and if len > 0 then lblock + len > lblock in order
      to pass (i.e. it doesn't overflow).
      
      Fixes: 5946d089 ("ext4: check for overlapping extents in ext4_valid_extent_entries()")
      Fixes: 2f974865 ("ext4: check for zero length extent explicitly")
      Cc: Eryu Guan <guaneryu@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPhil Turnbull <phil.turnbull@oracle.com>
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      f70749ca
  13. 27 6月, 2016 2 次提交
  14. 09 6月, 2016 1 次提交
    • M
      ext4: use bio op helprs in ext4 crypto code · 60a40096
      Mike Christie 提交于
      This was missed from my last patchset.
      
      This patch has ext4 crypto code use the bio op helper
      to set the operation. The operation (discard, write, writesame,
      etc) is now defined seperately from the other REQ bits. They
      still share the bi_rw field to save space, so we use these
      helpers so modules do not have to worry about setting/overwriting
      info.
      
      Jens, I am not sure how you handle patches on top of patches
      in the next branches. If you merge patches that fix issues
      in previous patches in next, then this patch could be part
      of
      
      commit 95fe6c1a
      Author: Mike Christie <mchristi@redhat.com>
      Date:   Sun Jun 5 14:31:48 2016 -0500
      
          block, fs, mm, drivers: use bio set/get op accessors
      Signed-off-by: NMike Christie <mchristi@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      60a40096