1. 13 10月, 2013 1 次提交
    • D
      ext4: fix memory leak in xattr · 6e4ea8e3
      Dave Jones 提交于
      If we take the 2nd retry path in ext4_expand_extra_isize_ea, we
      potentionally return from the function without having freed these
      allocations.  If we don't do the return, we over-write the previous
      allocation pointers, so we leak either way.
      
      Spotted with Coverity.
      
      [ Fixed by tytso to set is and bs to NULL after freeing these
        pointers, in case in the retry loop we later end up triggering an
        error causing a jump to cleanup, at which point we could have a double
        free bug. -- Ted ]
      Signed-off-by: NDave Jones <davej@fedoraproject.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Cc: stable@vger.kernel.org
      6e4ea8e3
  2. 16 9月, 2013 1 次提交
    • J
      ext4: fix performance regression in writeback of random writes · 9c12a831
      Jan Kara 提交于
      The Linux Kernel Performance project guys have reported that commit
      4e7ea81d introduces a performance regression for the following fio
      workload:
      
      [global]
      direct=0
      ioengine=mmap
      size=1500M
      bs=4k
      pre_read=1
      numjobs=1
      overwrite=1
      loops=5
      runtime=300
      group_reporting
      invalidate=0
      directory=/mnt/
      file_service_type=random:36
      file_service_type=random:36
      
      [job0]
      startdelay=0
      rw=randrw
      filename=data0/f1:data0/f2
      
      [job1]
      startdelay=0
      rw=randrw
      filename=data0/f2:data0/f1
      ...
      
      [job7]
      startdelay=0
      rw=randrw
      filename=data0/f2:data0/f1
      
      The culprit of the problem is that after the commit ext4_writepages()
      are more aggressive in writing back pages. Thus we have less consecutive
      dirty pages resulting in more seeking.
      
      This increased aggressivity is caused by a bug in the condition
      terminating ext4_writepages(). We start writing from the beginning of
      the file even if we should have terminated ext4_writepages() because
      wbc->nr_to_write <= 0.
      
      After fixing the condition the throughput of the fio workload is about 20%
      better than before writeback reorganization.
      Reported-by: N"Yan, Zheng" <zheng.z.yan@intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      9c12a831
  3. 13 9月, 2013 1 次提交
  4. 11 9月, 2013 1 次提交
    • D
      fs: convert fs shrinkers to new scan/count API · 1ab6c499
      Dave Chinner 提交于
      Convert the filesystem shrinkers to use the new API, and standardise some
      of the behaviours of the shrinkers at the same time.  For example,
      nr_to_scan means the number of objects to scan, not the number of objects
      to free.
      
      I refactored the CIFS idmap shrinker a little - it really needs to be
      broken up into a shrinker per tree and keep an item count with the tree
      root so that we don't need to walk the tree every time the shrinker needs
      to count the number of objects in the tree (i.e.  all the time under
      memory pressure).
      
      [glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree]
      [assorted fixes folded in]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: NSteven Whitehouse <swhiteho@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1ab6c499
  5. 04 9月, 2013 2 次提交
    • C
      direct-io: Handle O_(D)SYNC AIO · 02afc27f
      Christoph Hellwig 提交于
      Call generic_write_sync() from the deferred I/O completion handler if
      O_DSYNC is set for a write request.  Also make sure various callers
      don't call generic_write_sync if the direct I/O code returns
      -EIOCBQUEUED.
      
      Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
      Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      02afc27f
    • C
      direct-io: Implement generic deferred AIO completions · 7b7a8665
      Christoph Hellwig 提交于
      Add support to the core direct-io code to defer AIO completions to user
      context using a workqueue.  This replaces opencoded and less efficient
      code in XFS and ext4 (we save a memory allocation for each direct IO)
      and will be needed to properly support O_(D)SYNC for AIO.
      
      The communication between the filesystem and the direct I/O code requires
      a new buffer head flag, which is a bit ugly but not avoidable until the
      direct I/O code stops abusing the buffer_head structure for communicating
      with the filesystems.
      
      Currently this creates a per-superblock unbound workqueue for these
      completions, which is taken from an earlier patch by Jan Kara.  I'm
      not really convinced about this use and would prefer a "normal" global
      workqueue with a high concurrency limit, but this needs further discussion.
      
      JK: Fixed ext4 part, dynamic allocation of the workqueue.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7b7a8665
  6. 29 8月, 2013 10 次提交
    • E
      ext4: allow specifying external journal by pathname mount option · ad4eec61
      Eric Sandeen 提交于
      It's always been a hassle that if an external journal's
      device number changes, the filesystem won't mount.
      And since boot-time enumeration can change, device number
      changes aren't unusual.
      
      The current mechanism to update the journal location is by
      passing in a mount option w/ a new devnum, but that's a hassle;
      it's a manual approach, fixing things after the fact.
      
      Adding a mount option, "-o journal_path=/dev/$DEVICE" would
      help, since then we can do i.e.
      
      # mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...
      
      and it'll mount even if the devnum has changed, as shown here:
      
      # losetup /dev/loop0 journalfile
      # mke2fs -L mylabel-journal -O journal_dev /dev/loop0 
      # mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1
      
      Change the journal device number:
      
      # losetup -d /dev/loop0
      # losetup /dev/loop1 journalfile 
      
      And today it will fail:
      
      # mount /dev/sdb1 /mnt/test
      mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
             missing codepage or helper program, or other error
             In some cases useful info is found in syslog - try
             dmesg | tail  or so
      
      # dmesg | tail -n 1
      [17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal
      
      But with this new mount option, we can specify the new path:
      
      # mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
      #
      
      (which does update the encoded device number, incidentally):
      
      # umount /dev/sdb1
      # dumpe2fs -h /dev/sdb1 | grep "Journal device"
      dumpe2fs 1.41.12 (17-May-2010)
      Journal device:	          0x0701
      
      But best of all we can just always mount by journal-path, and
      it'll always work:
      
      # mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
      #
      
      So the journal_path option can be specified in fstab, and as long as
      the disk is available somewhere, and findable by label (or by UUID),
      we can mount.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      ad4eec61
    • D
      ext4: mark group corrupt on group descriptor checksum · bdfb6ff4
      Darrick J. Wong 提交于
      If the group descriptor fails validation, mark the whole blockgroup
      corrupt so that the inode/block allocators skip this group.  The
      previous approach takes the risk of writing to a damaged group
      descriptor; hopefully it was never the case that the [ib]bitmap fields
      pointed to another valid block and got dirtied, since the memset would
      fill the page with 1s.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      bdfb6ff4
    • D
      ext4: mark block group as corrupt on inode bitmap error · 87a39389
      Darrick J. Wong 提交于
      If we detect either a discrepancy between the inode bitmap and the
      inode counts or the inode bitmap fails to pass validation checks, mark
      the block group corrupt and refuse to allocate or deallocate inodes
      from the group.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      87a39389
    • D
      ext4: mark block group as corrupt on block bitmap error · 163a203d
      Darrick J. Wong 提交于
      When we notice a block-bitmap corruption (because of device failure or
      something else), we should mark this group as corrupt and prevent
      further block allocations/deallocations from it. Currently, we end up
      generating one error message for every block in the bitmap. This
      potentially could make the system unstable as noticed in some
      bugs. With this patch, the error will be printed only the first time
      and mark the entire block group as corrupted. This prevents future
      access allocations/deallocations from it.
      
      Also tested by corrupting the block
      bitmap and forcefully introducing the mb_free_blocks error:
      (1) create a largefile (2Gb)
      $ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200
      (2) umount filesystem. use dumpe2fs to see which block-bitmaps
      are in use by largefile and note their block numbers
      (3) use dd to zero-out the used block bitmaps
      $ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct
      (4) mount the FS and delete the largefile.
      (5) recreate the largefile. verify that the new largefile does not
      get any blocks from the groups marked as bad.
      Without the patch, we will see mb_free_blocks error for each bit in
      each zero'ed out bitmap at (4). With the patch, we only see the error
      once per blockgroup:
      [  309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      
      Google-Bug-Id: 7258357
      
      [darrick.wong@oracle.com]
      Further modifications (by Darrick) to make more obvious that this corruption
      bit applies to blocks only.  Set the corruption flag if the block group bitmap
      verification fails.
      
      Original-author: Aditya Kali <adityakali@google.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      163a203d
    • D
      ext4: fix type declaration of ext4_validate_block_bitmap · dbde0abe
      Darrick J. Wong 提交于
      The block_group parameter to ext4_validate_block_bitmap is both used
      as a ext4_group_t inside the function and the same type is passed in
      by all callers.  We might as well use the typedef consistently instead
      of open-coding the 'unsigned int'.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      dbde0abe
    • D
      ext4: error out if verifying the block bitmap fails · 48d9eb97
      Darrick J. Wong 提交于
      The block bitmap verification code assumes that calling ext4_error()
      either panics the system or makes the fs readonly.  However, this is
      not always true: when 'errors=continue' is specified, an error is
      printed but we don't return any indication of error to the caller,
      which is (probably) the block allocator, which pretends that the crud
      we read in off the disk is a usable bitmap.  Yuck.
      
      A block bitmap that fails the check should at least return no bitmap
      to the caller.  The block allocator should be told to go look in a
      different group, but that's a separate issue.
      
      The easiest way to reproduce this is to modify bg_block_bitmap (on a
      ^flex_bg fs) to point to a block outside the block group; or you can
      create a metadata_csum filesystem and zero out the block bitmaps.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      48d9eb97
    • Z
      ext4: isolate ext4_extents.h file · d7b2a00c
      Zheng Liu 提交于
      After applied the commit (4a092d73), we have reduced the number of
      source files that need to #include ext4_extents.h.  But we can do
      better.
      
      This commit defines ext4_zeroout_es() in extents.c and move
      EXT_MAX_BLOCKS into ext4.h in order not to include ext4_extents.h in
      indirect.c and ioctl.c.  Meanwhile we just need to include this file in
      extent_status.c when ES_AGGRESSIVE_TEST is defined.  Otherwise, this
      commit removes a duplicated declaration in trace/events/ext4.h.
      
      After applied this patch, we just need to include ext4_extents.h file
      in {super,migrate,move_extents,extents}.c, and it is easy for us to
      define a new extent disk layout.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d7b2a00c
    • A
      70261f56
    • D
      ext4: convert write_begin methods to stable_page_writes semantics · 7afe5aa5
      Dmitry Monakhov 提交于
      Use wait_for_stable_page() instead of wait_on_page_writeback()
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      7afe5aa5
    • A
      ext4: fix use of potentially uninitialized variables in debugging code · 27b1b228
      Andi Shyti 提交于
      If ext_debugging is enabled and path[depth].p_ext is NULL, len
      and lblock are printed non initialized
      Signed-off-by: NAndi Shyti <andi@etezian.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      27b1b228
  7. 20 8月, 2013 1 次提交
  8. 17 8月, 2013 14 次提交
    • J
      ext4: fix lost truncate due to race with writeback · 90e775b7
      Jan Kara 提交于
      The following race can lead to a loss of i_disksize update from truncate
      thus resulting in a wrong inode size if the inode size isn't updated
      again before inode is reclaimed:
      
      ext4_setattr()				mpage_map_and_submit_extent()
        EXT4_I(inode)->i_disksize = attr->ia_size;
        ...					  ...
      					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
      					  /* False because i_size isn't
      					   * updated yet */
      					  if (disksize > i_size_read(inode))
      					  /* True, because i_disksize is
      					   * already truncated */
      					  if (disksize > EXT4_I(inode)->i_disksize)
      					    /* Overwrite i_disksize
      					     * update from truncate */
      					    ext4_update_i_disksize()
        i_size_write(inode, attr->ia_size);
      
      For other places updating i_disksize such race cannot happen because
      i_mutex prevents these races. Writeback is the only place where we do
      not hold i_mutex and we cannot grab it there because of lock ordering.
      
      We fix the race by doing both i_disksize and i_size update in truncate
      atomically under i_data_sem and in mpage_map_and_submit_extent() we move
      the check against i_size under i_data_sem as well.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      90e775b7
    • J
      ext4: simplify truncation code in ext4_setattr() · 5208386c
      Jan Kara 提交于
      Merge conditions in ext4_setattr() handling inode size changes, also
      move ext4_begin_ordered_truncate() call somewhat earlier because it
      simplifies error recovery in case of failure. Also add error handling in
      case i_disksize update fails.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      5208386c
    • J
      ext4: fix ext4_writepages() in presence of truncate · 5f1132b2
      Jan Kara 提交于
      Inode size can arbitrarily change while writeback is in progress. When
      ext4_writepages() has prepared a long extent for mapping and truncate
      then reduces i_size, mpage_map_and_submit_buffers() will always map just
      one buffer in a page instead of all of them due to lblk < blocks check.
      So we end up not using all blocks we've allocated (thus leaking them)
      and also delalloc accounting goes wrong manifesting as a warning like:
      
      ext4_da_release_space:1333: ext4_da_release_space: ino 12, to_free 1
      with only 0 reserved data blocks
      
      Note that the problem can happen only when blocksize < pagesize because
      otherwise we have only a single buffer in the page.
      
      Fix the problem by removing the size check from the mapping loop. We
      have an extent allocated so we have to use it all before checking for
      i_size. We also rename add_page_bufs_to_extent() to
      mpage_process_page_bufs() and make that function submit the page for IO
      if all buffers (upto EOF) in it are mapped.
      Reported-by: NDave Jones <davej@redhat.com>
      Reported-by: NZheng Liu <gnehzuil.liu@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      5f1132b2
    • J
      ext4: move test whether extent to map can be extended to one place · 09930042
      Jan Kara 提交于
      Currently the logic whether the current buffer can be added to an extent
      of buffers to map is split between mpage_add_bh_to_extent() and
      add_page_bufs_to_extent(). Move the whole logic to
      mpage_add_bh_to_extent() which makes things a bit more straightforward
      and make following i_size fixes easier.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      09930042
    • J
      ext4: fix warning in ext4_da_update_reserve_space() · 7d734532
      Jan Kara 提交于
      reaim workfile.dbase test easily triggers warning in
      ext4_da_update_reserve_space():
      
      EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365:
      ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1
      blocks with reserved 9 data blocks)
      
      The problem is that (one of) tests creates file and then randomly writes
      to it with O_SYNC. That results in writing back pages of the file in
      random order so we create extents for written blocks say 0, 2, 4, 6, 8
      - this last allocation also allocates new block for extents. Then we
      writeout block 1 so we have extents 0-2, 4, 6, 8 and we release
      indirect extent block because extents fit in the inode again. Then we
      writeout block 10 and we need to allocate indirect extent block again
      which triggers the warning because we don't have the reservation
      anymore.
      
      Fix the problem by giving back freed metadata blocks resulting from
      extent merging into inode's reservation pool.
      Signed-off-by: NJan Kara <jack@suse.cz>
      7d734532
    • T
      ext4: avoid reusing recently deleted inodes in no journal mode · 19883bd9
      Theodore Ts'o 提交于
      In no journal mode, if an inode has recently been deleted, we
      shouldn't reuse it right away.  Otherwise it's possible, after an
      unclean shutdown, to hit a situation where a recently deleted inode
      gets reused for some other purpose before the inode table block has
      been written to disk.  However, if the directory entry has been
      updated, then the directory entry will be pointing at the old inode
      contents.
      
      E2fsck will make sure the file system is consistent after the
      unclean shutdown.  However, if the recently deleted inode is a
      character mode device, or an inode with the immutable bit set, even
      after the file system has been fixed up by e2fsck, it can be
      possible for a *.pyc file to be pointing at a character mode
      device, and when python tries to open the *.pyc file, Hilarity
      Ensues.  We could change all of userspace to be very suspicious
      about stat'ing files before opening them, and clearing the
      immutable flag if necessary --- or we can just avoid reusing an
      inode number if it has been recently deleted.
      
      Google-Bug-Id: 10017573
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      19883bd9
    • T
      ext4: allocate delayed allocation blocks before rename · 0e202704
      Theodore Ts'o 提交于
      When ext4_rename() overwrites an already existing file, call
      ext4_alloc_da_blocks() before starting the journal handle which
      actually does the rename, instead of doing this afterwards.  This
      improves the likelihood that the contents will survive a crash if an
      application replaces a file using the sequence:
      
      1)  write replacement contents to foo.new
      2)  <omit fsync of foo.new>
      3)  rename foo.new to foo
      
      It is still not a guarantee, since ext4_alloc_da_blocks() is *not*
      doing a file integrity sync; this means if foo.new is a very large
      file, it may not be completely flushed out to disk.
      
      However, for files smaller than a megabyte or so, any dirty pages
      should be flushed out before we do the rename operation, and so at the
      next journal commit, the CACHE FLUSH command will make sure al of
      these pages are safely on the disk platter.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      0e202704
    • T
      ext4: start handle at least possible moment when renaming files · 5b61de75
      Theodore Ts'o 提交于
      In ext4_rename(), don't start the journal handle until the the
      directory entries have been successfully looked up.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      5b61de75
    • T
      ext4: add support for extent pre-caching · 7869a4a6
      Theodore Ts'o 提交于
      Add a new fiemap flag which forces the all of the extents in an inode
      to be cached in the extent_status tree.  This is critically important
      when using AIO to a preallocated file, since if we need to read in
      blocks from the extent tree, the io_submit(2) system call becomes
      synchronous, and the AIO is no longer "A", which is bad.
      
      In addition, for most files which have an external leaf tree block,
      the cost of caching the information in the extent status tree will be
      less than caching the entire 4k block in the buffer cache.  So it is
      generally a win to keep the extent information cached.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      7869a4a6
    • T
      ext4: cache all of an extent tree's leaf block upon reading · 107a7bd3
      Theodore Ts'o 提交于
      When we read in an extent tree leaf block from disk, arrange to have
      all of its entries cached.  In nearly all cases the in-memory
      representation will be more compact than the on-disk representation in
      the buffer cache, and it allows us to get the information without
      having to traverse the extent tree for successive extents.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      107a7bd3
    • T
      ext4: use unsigned int for es_status values · 3be78c73
      Theodore Ts'o 提交于
      Don't use an unsigned long long for the es_status flags; this requires
      that we pass 64-bit values around which is painful on 32-bit systems.
      Instead pass the extent status flags around using the low 4 bits of an
      unsigned int, and shift them into place when we are reading or writing
      es_pblk.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      3be78c73
    • T
      ext4: print the block number of invalid extent tree blocks · c349179b
      Theodore Ts'o 提交于
      When we find an invalid extent tree block, report the block number of
      the bad block for debugging purposes.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      c349179b
    • T
      ext4: refactor code to read the extent tree block · 7d7ea89e
      Theodore Ts'o 提交于
      Refactor out the code needed to read the extent tree block into a
      single read_extent_tree_block() function.  In addition to simplifying
      the code, it also makes sure that we call the ext4_ext_load_extent
      tracepoint whenever we need to read an extent tree block from disk.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      7d7ea89e
    • J
      jbd2: Fix oops in jbd2_journal_file_inode() · a361293f
      Jan Kara 提交于
      Commit 0713ed0c added
      jbd2_journal_file_inode() call into ext4_block_zero_page_range().
      However that function gets called from truncate path and thus inode
      needn't have jinode attached - that happens in ext4_file_open() but
      the file needn't be ever open since mount. Calling
      jbd2_journal_file_inode() without jinode attached results in the oops.
      
      We fix the problem by attaching jinode to inode also in ext4_truncate()
      and ext4_punch_hole() when we are going to zero out partial blocks.
      Reported-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      a361293f
  9. 12 8月, 2013 2 次提交
    • J
      jbd2: Fix use after free after error in jbd2_journal_dirty_metadata() · 91aa11fa
      Jan Kara 提交于
      When jbd2_journal_dirty_metadata() returns error,
      __ext4_handle_dirty_metadata() stops the handle. However callers of this
      function do not count with that fact and still happily used now freed
      handle. This use after free can result in various issues but very likely
      we oops soon.
      
      The motivation of adding __ext4_journal_stop() into
      __ext4_handle_dirty_metadata() in commit 9ea7a0df seems to be only to
      improve error reporting. So replace __ext4_journal_stop() with
      ext4_journal_abort_handle() which was there before that commit and add
      WARN_ON_ONCE() to dump stack to provide useful information.
      Reported-by: NSage Weil <sage@inktank.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org	# 3.2+
      91aa11fa
    • T
      ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOT · cde2d7a7
      Theodore Ts'o 提交于
      Previously we weren't swapping only some of the extent_status LRU
      fields during the processing of the EXT4_IOC_SWAP_BOOT ioctl.  The
      much safer thing to do is to just completely flush the extent status
      tree when doing the swap.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <gnehzuil.liu@gmail.com>
      Cc: stable@vger.kernel.org
      cde2d7a7
  10. 09 8月, 2013 2 次提交
  11. 30 7月, 2013 2 次提交
  12. 27 7月, 2013 2 次提交
  13. 21 7月, 2013 1 次提交
    • Z
      ext4: fix a BUG when opening a file with O_TMPFILE flag · e94bd349
      Zheng Liu 提交于
      When we try to open a file with O_TMPFILE flag, we will trigger a bug.
      The root cause is that in ext4_orphan_add() we check ->i_nlink == 0 and
      this check always fails because we set ->i_nlink = 1 in
      inode_init_always().  We can use the following program to trigger it:
      
      int main(int argc, char *argv[])
      {
      	int fd;
      
      	fd = open(argv[1], O_TMPFILE, 0666);
      	if (fd < 0) {
      		perror("open ");
      		return -1;
      	}
      	close(fd);
      	return 0;
      }
      
      The oops message looks like this:
      
      kernel BUG at fs/ext4/namei.c:2572!
      invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Modules linked in: dlci bridge stp hidp cmtp kernelcapi l2tp_ppp l2tp_netlink l2tp_core sctp libcrc32c rfcomm tun fuse nfnetli
      nk can_raw ipt_ULOG can_bcm x25 scsi_transport_iscsi ipx p8023 p8022 appletalk phonet psnap vmw_vsock_vmci_transport af_key vmw_vmci rose vsock atm can netrom ax25 af_rxrpc ir
      da pppoe pppox ppp_generic slhc bluetooth nfc rfkill rds caif_socket caif crc_ccitt af_802154 llc2 llc snd_hda_codec_realtek snd_hda_intel snd_hda_codec serio_raw snd_pcm pcsp
      kr edac_core snd_page_alloc snd_timer snd soundcore r8169 mii sr_mod cdrom pata_atiixp radeon backlight drm_kms_helper ttm
      CPU: 1 PID: 1812571 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #12
      Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
      task: ffff88007dfe69a0 ti: ffff88010f7b6000 task.ti: ffff88010f7b6000
      RIP: 0010:[<ffffffff8125ce69>]  [<ffffffff8125ce69>] ext4_orphan_add+0x299/0x2b0
      RSP: 0018:ffff88010f7b7cf8  EFLAGS: 00010202
      RAX: 0000000000000000 RBX: ffff8800966d3020 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff88007dfe70b8 RDI: 0000000000000001
      RBP: ffff88010f7b7d40 R08: ffff880126a3c4e0 R09: ffff88010f7b7ca0
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801271fd668
      R13: ffff8800966d2f78 R14: ffff88011d7089f0 R15: ffff88007dfe69a0
      FS:  00007f70441a3740(0000) GS:ffff88012a800000(0000) knlGS:00000000f77c96c0
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000002834000 CR3: 0000000107964000 CR4: 00000000000007e0
      DR0: 0000000000780000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
      Stack:
       0000000000002000 00000020810b6dde 0000000000000000 ffff88011d46db00
       ffff8800966d3020 ffff88011d7089f0 ffff88009c7f4c10 ffff88010f7b7f2c
       ffff88007dfe69a0 ffff88010f7b7da8 ffffffff8125cfac ffff880100000004
      Call Trace:
       [<ffffffff8125cfac>] ext4_tmpfile+0x12c/0x180
       [<ffffffff811cba78>] path_openat+0x238/0x700
       [<ffffffff8100afc4>] ? native_sched_clock+0x24/0x80
       [<ffffffff811cc647>] do_filp_open+0x47/0xa0
       [<ffffffff811db73f>] ? __alloc_fd+0xaf/0x200
       [<ffffffff811ba2e4>] do_sys_open+0x124/0x210
       [<ffffffff81010725>] ? syscall_trace_enter+0x25/0x290
       [<ffffffff811ba3ee>] SyS_open+0x1e/0x20
       [<ffffffff816ca8d4>] tracesys+0xdd/0xe2
       [<ffffffff81001001>] ? start_thread_common.constprop.6+0x1/0xa0
      Code: 04 00 00 00 89 04 24 31 c0 e8 c4 77 04 00 e9 43 fe ff ff 66 25 00 d0 66 3d 00 80 0f 84 0e fe ff ff 83 7b 48 00 0f 84 04 fe ff ff <0f> 0b 49 8b 8c 24 50 07 00 00 e9 88 fe ff ff 0f 1f 84 00 00 00
      
      Here we couldn't call clear_nlink() directly because in d_tmpfile() we
      will call inode_dec_link_count() to decrease ->i_nlink.  So this commit
      tries to call d_tmpfile() before ext4_orphan_add() to fix this problem.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Tested-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Tested-by: NDave Jones <davej@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      e94bd349