1. 17 Aug, 2013 4 commits
  2. 12 Aug, 2013 2 commits
    • jbd2: Fix use after free after error in jbd2_journal_dirty_metadata() · 91aa11fa
      Jan Kara committed
      When jbd2_journal_dirty_metadata() returns an error,
      __ext4_handle_dirty_metadata() stops the handle. However, callers of this
      function do not account for that and still happily use the now-freed
      handle. This use after free can result in various issues, but very likely
      we oops soon.
      
      The motivation of adding __ext4_journal_stop() into
      __ext4_handle_dirty_metadata() in commit 9ea7a0df seems to be only to
      improve error reporting. So replace __ext4_journal_stop() with
      ext4_journal_abort_handle() which was there before that commit and add
      WARN_ON_ONCE() to dump stack to provide useful information.
      Reported-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org	# 3.2+
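The ownership problem in the jbd2 commit above can be illustrated with a small user-space sketch. Everything here is hypothetical (`toy_handle`, `dirty_metadata_stop()`, `dirty_metadata_abort()` are not kernel names); it only contrasts a helper that releases the handle on error with one that marks it aborted and leaves ownership with the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a journal handle; not the kernel's handle_t. */
struct toy_handle {
	bool aborted;
};

/*
 * Problematic shape: on error the helper releases the handle. This toy
 * version clears the caller's pointer so the effect is visible; a caller
 * holding a plain copy of the pointer would be left with a dangling one.
 */
static int dirty_metadata_stop(struct toy_handle **hp, int err)
{
	if (err) {
		free(*hp);
		*hp = NULL;
	}
	return err;
}

/*
 * Safer shape: on error only mark the handle as aborted. The caller
 * still owns the handle and releases it exactly once, as usual.
 */
static int dirty_metadata_abort(struct toy_handle *h, int err)
{
	if (err)
		h->aborted = true;
	return err;
}
```

With the second shape, a caller can keep using and later release the handle after an error, which is the property the commit message says the existing callers rely on.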
    • ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOT · cde2d7a7
      Theodore Ts'o committed
      Previously we were swapping only some of the extent_status LRU
      fields during the processing of the EXT4_IOC_SWAP_BOOT ioctl.  The
      much safer thing to do is to just completely flush the extent status
      tree when doing the swap.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <gnehzuil.liu@gmail.com>
      Cc: stable@vger.kernel.org
  3. 09 Aug, 2013 2 commits
  4. 30 Jul, 2013 2 commits
  5. 27 Jul, 2013 2 commits
  6. 21 Jul, 2013 1 commit
    • ext4: fix a BUG when opening a file with O_TMPFILE flag · e94bd349
      Zheng Liu committed
      When we try to open a file with O_TMPFILE flag, we will trigger a bug.
      The root cause is that in ext4_orphan_add() we check ->i_nlink == 0 and
      this check always fails because we set ->i_nlink = 1 in
      inode_init_always().  We can use the following program to trigger it:
      
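      #define _GNU_SOURCE	/* exposes O_TMPFILE in <fcntl.h> */
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>
      
      /* Open the path given in argv[1] with O_TMPFILE, then close it. */
      int main(int argc, char *argv[])
      {
      	int fd;
      
      	fd = open(argv[1], O_TMPFILE, 0666);
      	if (fd < 0) {
      		perror("open ");
      		return -1;
      	}
      	close(fd);
      	return 0;
      }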
      
      The oops message looks like this:
      
      kernel BUG at fs/ext4/namei.c:2572!
      invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Modules linked in: dlci bridge stp hidp cmtp kernelcapi l2tp_ppp l2tp_netlink l2tp_core sctp libcrc32c rfcomm tun fuse nfnetlink can_raw ipt_ULOG can_bcm x25 scsi_transport_iscsi ipx p8023 p8022 appletalk phonet psnap vmw_vsock_vmci_transport af_key vmw_vmci rose vsock atm can netrom ax25 af_rxrpc irda pppoe pppox ppp_generic slhc bluetooth nfc rfkill rds caif_socket caif crc_ccitt af_802154 llc2 llc snd_hda_codec_realtek snd_hda_intel snd_hda_codec serio_raw snd_pcm pcspkr edac_core snd_page_alloc snd_timer snd soundcore r8169 mii sr_mod cdrom pata_atiixp radeon backlight drm_kms_helper ttm
      CPU: 1 PID: 1812571 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #12
      Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
      task: ffff88007dfe69a0 ti: ffff88010f7b6000 task.ti: ffff88010f7b6000
      RIP: 0010:[<ffffffff8125ce69>]  [<ffffffff8125ce69>] ext4_orphan_add+0x299/0x2b0
      RSP: 0018:ffff88010f7b7cf8  EFLAGS: 00010202
      RAX: 0000000000000000 RBX: ffff8800966d3020 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff88007dfe70b8 RDI: 0000000000000001
      RBP: ffff88010f7b7d40 R08: ffff880126a3c4e0 R09: ffff88010f7b7ca0
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801271fd668
      R13: ffff8800966d2f78 R14: ffff88011d7089f0 R15: ffff88007dfe69a0
      FS:  00007f70441a3740(0000) GS:ffff88012a800000(0000) knlGS:00000000f77c96c0
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000002834000 CR3: 0000000107964000 CR4: 00000000000007e0
      DR0: 0000000000780000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
      Stack:
       0000000000002000 00000020810b6dde 0000000000000000 ffff88011d46db00
       ffff8800966d3020 ffff88011d7089f0 ffff88009c7f4c10 ffff88010f7b7f2c
       ffff88007dfe69a0 ffff88010f7b7da8 ffffffff8125cfac ffff880100000004
      Call Trace:
       [<ffffffff8125cfac>] ext4_tmpfile+0x12c/0x180
       [<ffffffff811cba78>] path_openat+0x238/0x700
       [<ffffffff8100afc4>] ? native_sched_clock+0x24/0x80
       [<ffffffff811cc647>] do_filp_open+0x47/0xa0
       [<ffffffff811db73f>] ? __alloc_fd+0xaf/0x200
       [<ffffffff811ba2e4>] do_sys_open+0x124/0x210
       [<ffffffff81010725>] ? syscall_trace_enter+0x25/0x290
       [<ffffffff811ba3ee>] SyS_open+0x1e/0x20
       [<ffffffff816ca8d4>] tracesys+0xdd/0xe2
       [<ffffffff81001001>] ? start_thread_common.constprop.6+0x1/0xa0
      Code: 04 00 00 00 89 04 24 31 c0 e8 c4 77 04 00 e9 43 fe ff ff 66 25 00 d0 66 3d 00 80 0f 84 0e fe ff ff 83 7b 48 00 0f 84 04 fe ff ff <0f> 0b 49 8b 8c 24 50 07 00 00 e9 88 fe ff ff 0f 1f 84 00 00 00
      
      Here we couldn't call clear_nlink() directly because in d_tmpfile() we
      will call inode_dec_link_count() to decrease ->i_nlink.  So this commit
      tries to call d_tmpfile() before ext4_orphan_add() to fix this problem.
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Dave Jones <davej@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
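The ordering issue in the O_TMPFILE commit above can be modelled in a few lines. This is a minimal sketch with hypothetical names (`toy_inode`, `toy_orphan_add()`, `toy_d_tmpfile()`); it assumes only what the message states: the orphan-list code expects a link count of 0, a fresh inode starts at 1, and the d_tmpfile() step is what drops it.

```c
#include <assert.h>

/* Hypothetical miniature inode; only the link count matters here. */
struct toy_inode {
	unsigned int nlink;
	int on_orphan_list;
};

/*
 * Mirrors the precondition described above: only an inode whose link
 * count is already 0 may be put on the orphan list. The kernel hits a
 * BUG in this situation; the sketch just returns an error.
 */
static int toy_orphan_add(struct toy_inode *inode)
{
	if (inode->nlink != 0)
		return -1;
	inode->on_orphan_list = 1;
	return 0;
}

/* Stand-in for the d_tmpfile() step: drops the initial link count. */
static void toy_d_tmpfile(struct toy_inode *inode)
{
	assert(inode->nlink > 0);
	inode->nlink--;
}
```

Calling `toy_orphan_add()` on a fresh inode (link count 1) fails, while calling `toy_d_tmpfile()` first and then `toy_orphan_add()` succeeds, which is the reordering the commit describes.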
  7. 16 Jul, 2013 2 commits
    • ext4: call ext4_es_lru_add() after handling cache miss · 63b99968
      Theodore Ts'o committed
      If there are no items in the extent status tree, ext4_es_lru_add() is
      a no-op.  So it is not sufficient to call ext4_es_lru_add() before we
      try to look up an entry in the extent status tree.  We also need to
      call it at the end of ext4_ext_map_blocks(), after items have been
      added to the extent status tree.
      
      This could lead to inodes that have extent status trees but which
      are not in the LRU list, which means they won't get considered for
      eviction by the es_shrinker.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <wenqing.lz@taobao.com>
      Cc: stable@vger.kernel.org
    • ext4: yield during large unlinks · 76828c88
      Theodore Ts'o committed
      During large unlink operations on files with extents, we can use a lot
      of CPU time.  This adds a cond_resched() call when starting to examine
      the next level of a multi-level extent tree.  Multi-level extent trees
      are rare in the first place, and this should rarely be executed.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  8. 15 Jul, 2013 3 commits
    • ext4: make the extent_status code more robust against ENOMEM failures · e15f742c
      Theodore Ts'o committed
      Some callers of ext4_es_remove_extent() and ext4_es_insert_extent()
      may not be completely robust against ENOMEM failures (or the
      consequences of reflecting ENOMEM back up to userspace may lead to
      xfstest or user application failure).
      
      To mitigate against this, when trying to insert an entry in the extent
      status tree, try to shrink the inode's extent status tree before
      returning ENOMEM.  If there are entries which don't record information
      about extents under delayed allocations, freeing one of them is
      preferable to returning ENOMEM.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
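The "shrink before giving up" idea from the commit above can be sketched as a retry around a failing allocation. This is an illustrative user-space model, not the kernel code: `TOY_CAPACITY`, `toy_es_tree`, and the `toy_*` functions are assumptions, and "memory" is just a fixed entry budget.

```c
#include <assert.h>
#include <errno.h>

#define TOY_CAPACITY 4   /* hypothetical memory budget, in entries */

struct toy_es_tree {
	int delayed;      /* entries tracking delayed allocation: must stay */
	int reclaimable;  /* entries that can be dropped and rebuilt later */
};

/* Pretend allocation: fails once the budget is used up. */
static int toy_alloc_entry(const struct toy_es_tree *t)
{
	return (t->delayed + t->reclaimable < TOY_CAPACITY) ? 0 : -ENOMEM;
}

/* Drop one reclaimable entry; returns the number freed (0 or 1). */
static int toy_shrink(struct toy_es_tree *t)
{
	if (t->reclaimable == 0)
		return 0;
	t->reclaimable--;
	return 1;
}

/* Insert an entry, shrinking this tree once before reporting ENOMEM. */
static int toy_insert(struct toy_es_tree *t, int is_delayed)
{
	int err = toy_alloc_entry(t);

	if (err == -ENOMEM && toy_shrink(t) > 0)
		err = toy_alloc_entry(t);
	if (err)
		return err;
	if (is_delayed)
		t->delayed++;
	else
		t->reclaimable++;
	return 0;
}
```

When only delayed-allocation entries remain, nothing can be freed and the sketch still returns `-ENOMEM`, matching the message's point that only entries not describing delayed allocations are candidates for freeing.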
    • ext4: simplify calculation of blocks to free on error · c8e15130
      Theodore Ts'o committed
      In ext4_ext_map_blocks(), if we have successfully allocated the data
      blocks, but then run into trouble inserting the extent into the extent
      tree, most likely due to an ENOSPC condition, determine the arguments
      to ext4_free_blocks() in a simpler way which is easier to prove to be
      correct.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix error handling in ext4_ext_truncate() · 8acd5e9b
      Theodore Ts'o committed
      Previously ext4_ext_truncate() was ignoring potential error returns
      from ext4_es_remove_extent() and ext4_ext_remove_space().  This can
      lead to the on-disk extent tree and the extent status tree cache
      getting out of sync, which is particularly bad, and can lead to file
      system corruption and potential data loss.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  9. 13 Jul, 2013 2 commits
  10. 12 Jul, 2013 2 commits
    • ext4: rate limit printk in buffer_io_error() · e8974c39
      Anatol Pomozov committed
      If there are a lot of outstanding buffered IOs when a device is
      taken offline (due to hardware errors etc), ext4_end_bio prints
      out a message for each failed logical block. While this is desirable,
      we see thousands of such lines being printed out before the
      serial console gets overwhelmed, causing ext4_end_bio() to wait for
      the printk to complete.
      
      This in itself isn't a disaster, except for the detail that this
      function is being called with the queue lock held.
      This causes any other function in the block layer
      to spin on its spin_lock_irqsave while the serial console is
      draining. If NMI watchdog is enabled on this machine then it
      eventually comes along and shoots the machine in the head.
      
      The end result is that losing any one disk causes the machine to
      go down. This patch rate limits the printk to bandaid around the
      problem.
      
      Tested: xfstests
      Change-Id: I8ab5690dcf4f3a67e78be147d45e489fdf4a88d8
      Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
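To make the rate-limiting idea in the commit above concrete, here is a small user-space sketch of an interval/burst limiter. It is an assumption-laden illustration (`toy_ratelimit` and `toy_ratelimit_ok()` are hypothetical, and time is passed in as a plain tick count); it is not the kernel's ratelimit implementation.

```c
#include <assert.h>

/*
 * Hypothetical user-space analogue of a rate-limited message: at most
 * `burst` messages per `interval` ticks; the rest are only counted.
 */
struct toy_ratelimit {
	long interval;
	int  burst;
	long window_start;
	int  printed;
	int  missed;
};

/* Returns 1 if the caller may print at time `now`, 0 if it should not. */
static int toy_ratelimit_ok(struct toy_ratelimit *rl, long now)
{
	if (now - rl->window_start >= rl->interval) {
		rl->window_start = now;   /* open a new window */
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return 1;
	}
	rl->missed++;                     /* suppressed, but accounted for */
	return 0;
}
```

A caller would wrap each error message in `if (toy_ratelimit_ok(&rl, now))`, so a flood of failed blocks costs a counter increment instead of a slow console write while a lock is held.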
    • ext4: don't show usrquota/grpquota twice in /proc/mounts · ad065dd0
      Theodore Ts'o committed
      We now print mount options in a generic fashion in
      ext4_show_options(), so we shouldn't be explicitly printing the
      {usr,grp}quota options in ext4_show_quota_options().
      
      Without this patch, /proc/mounts can look like this:
      
       /dev/vdb /vdb ext4 rw,relatime,quota,usrquota,data=ordered,usrquota 0 0
                                            ^^^^^^^^              ^^^^^^^^
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  11. 11 Jul, 2013 1 commit
    • ext4: fix warning in ext4_evict_inode() · 822dbba3
      Jan Kara committed
      The following race can lead to ext4_evict_inode() seeing i_ioend_count
      > 0 and thus triggering a sanity check warning:
      
              CPU1                                    CPU2
      ext4_end_bio()                          ext4_evict_inode()
        ext4_finish_bio()
          end_page_writeback();
                                                truncate_inode_pages()
                                                  evict page
                                              WARN_ON(i_ioend_count > 0);
        ext4_put_io_end_defer()
          ext4_release_io_end()
            dec i_ioend_count
      
      This is a possible use-after-free bug since we decrement i_ioend_count in
      a possibly released inode.
      
      Since i_ioend_count is used only for sanity checks one possible solution
      would be to just remove it but for now I'd like to keep those sanity
      checks to help debugging the new ext4 writeback code.
      
      This patch changes ext4_end_bio() to call ext4_put_io_end_defer() before
      ext4_finish_bio() in the shortcut case when unwritten extent conversion
      isn't needed.  In that case we don't need the io_end so we are safe to
      drop it early.
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  12. 06 Jul, 2013 2 commits
    • ext4: fix ext4_get_group_number() · 960fd856
      Theodore Ts'o committed
      The function ext4_get_group_number() was introduced as an optimization
      in commit bd86298e.  Unfortunately, this commit incorrectly
      calculated the group number for file systems with a 1k block size (when
      s_first_data_block is 1 instead of zero).  This could cause the
      following kernel BUG:
      
      [  568.877799] ------------[ cut here ]------------
      [  568.877833] kernel BUG at fs/ext4/mballoc.c:3728!
      [  568.877840] Oops: Exception in kernel mode, sig: 5 [#1]
      [  568.877845] SMP NR_CPUS=32 NUMA pSeries
      [  568.877852] Modules linked in: binfmt_misc
      [  568.877861] CPU: 1 PID: 3516 Comm: fs_mark Not tainted 3.10.0-03216-g7c6809ff-dirty #1
      [  568.877867] task: c0000001fb0b8000 ti: c0000001fa954000 task.ti: c0000001fa954000
      [  568.877873] NIP: c0000000002f42a4 LR: c0000000002f4274 CTR: c000000000317ef8
      [  568.877879] REGS: c0000001fa956ed0 TRAP: 0700   Not tainted  (3.10.0-03216-g7c6809ff-dirty)
      [  568.877884] MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 24000428  XER: 00000000
      [  568.877902] SOFTE: 1
      [  568.877905] CFAR: c0000000002b5464
      [  568.877908]
      GPR00: 0000000000000001 c0000001fa957150 c000000000c6a408 c0000001fb588000
      GPR04: 0000000000003fff c0000001fa9571c0 c0000001fa9571c4 000138098c50625f
      GPR08: 1301200000000000 0000000000000002 0000000000000001 0000000000000000
      GPR12: 0000000024000422 c00000000f33a300 0000000000008000 c0000001fa9577f0
      GPR16: c0000001fb7d0100 c000000000c29190 c0000000007f46e8 c000000000a14672
      GPR20: 0000000000000001 0000000000000008 ffffffffffffffff 0000000000000000
      GPR24: 0000000000000100 c0000001fa957278 c0000001fdb2bc78 c0000001fa957288
      GPR28: 0000000000100100 c0000001fa957288 c0000001fb588000 c0000001fdb2bd10
      [  568.877993] NIP [c0000000002f42a4] .ext4_mb_release_group_pa+0xec/0x1c0
      [  568.877999] LR [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0
      [  568.878004] Call Trace:
      [  568.878008] [c0000001fa957150] [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0 (unreliable)
      [  568.878017] [c0000001fa957200] [c0000000002fb070] .ext4_mb_discard_lg_preallocations+0x394/0x444
      [  568.878025] [c0000001fa957340] [c0000000002fb45c] .ext4_mb_release_context+0x33c/0x734
      [  568.878032] [c0000001fa957440] [c0000000002fbcf8] .ext4_mb_new_blocks+0x4a4/0x5f4
      [  568.878039] [c0000001fa957510] [c0000000002ef56c] .ext4_ext_map_blocks+0xc28/0x1178
      [  568.878047] [c0000001fa957640] [c0000000002c1a94] .ext4_map_blocks+0x2c8/0x490
      [  568.878054] [c0000001fa957730] [c0000000002c536c] .ext4_writepages+0x738/0xc60
      [  568.878062] [c0000001fa957950] [c000000000168a78] .do_writepages+0x5c/0x80
      [  568.878069] [c0000001fa9579d0] [c00000000015d1c4] .__filemap_fdatawrite_range+0x88/0xb0
      [  568.878078] [c0000001fa957aa0] [c00000000015d23c] .filemap_write_and_wait_range+0x50/0xfc
      [  568.878085] [c0000001fa957b30] [c0000000002b8edc] .ext4_sync_file+0x220/0x3c4
      [  568.878092] [c0000001fa957be0] [c0000000001f849c] .vfs_fsync_range+0x64/0x80
      [  568.878098] [c0000001fa957c70] [c0000000001f84f0] .vfs_fsync+0x38/0x4c
      [  568.878105] [c0000001fa957d00] [c0000000001f87f4] .do_fsync+0x54/0x90
      [  568.878111] [c0000001fa957db0] [c0000000001f8894] .SyS_fsync+0x28/0x3c
      [  568.878120] [c0000001fa957e30] [c000000000009c88] syscall_exit+0x0/0x7c
      [  568.878125] Instruction dump:
      [  568.878130] 60000000 813d0034 81610070 38000000 7f8b4800 419e001c 813f007c 7d2bfe70
      [  568.878144] 7d604a78 7c005850 54000ffe 7c0007b4 <0b000000> e8a10076 e87f0090 7fa4eb78
      [  568.878160] ---[ end trace 594d911d9654770b ]---
      
      In addition fix the STD_GROUP optimization so that it works for
      bigalloc file systems as well.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reported-by: Li Zhong <lizhongfs@gmail.com>
      Reviewed-by: Lukas Czerner <lczerner@redhat.com>
      Cc: stable@vger.kernel.org  # 3.10
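The off-by-one described in the ext4_get_group_number() commit above comes from ignoring the first data block. A minimal sketch, assuming a simplified geometry (`toy_sb` and both functions are hypothetical, and the kernel's actual optimization is not reproduced here):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified geometry; field names echo the superblock. */
struct toy_sb {
	uint64_t first_data_block;   /* 1 with a 1k block size, else 0 */
	uint64_t blocks_per_group;
};

/* Naive form: only correct when first_data_block is 0. */
static uint64_t toy_group_naive(const struct toy_sb *sb, uint64_t block)
{
	return block / sb->blocks_per_group;
}

/* Form that accounts for the offset of the first data block. */
static uint64_t toy_group_number(const struct toy_sb *sb, uint64_t block)
{
	return (block - sb->first_data_block) / sb->blocks_per_group;
}
```

With a 1k block size and 8192 blocks per group, group 0 covers blocks 1 through 8192, so block 8192 belongs to group 0; the naive form places it in group 1.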
    • ext4: silence warning in ext4_writepages() · 27d7c4ed
      Jan Kara committed
      The loop in mpage_map_and_submit_extent() is guaranteed to always run
      at least once since the caller of mpage_map_and_submit_extent() makes
      sure map->m_len > 0. So make that explicit using do-while instead of
      pure while which also silences the compiler warning about
      uninitialized 'err' variable.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Lukas Czerner <lczerner@redhat.com>
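The do-while point in the commit above can be shown with a toy loop. This is a sketch under assumptions: `toy_map_chunk()` and `toy_map_all()` are hypothetical stand-ins, and the chunk size of 4 is arbitrary.

```c
#include <assert.h>

/* Hypothetical stand-in for mapping one chunk; returns 0 on success. */
static int toy_map_chunk(unsigned int *remaining)
{
	unsigned int step = *remaining < 4 ? *remaining : 4;

	*remaining -= step;
	return 0;
}

/*
 * The caller guarantees len > 0, so the body runs at least once and
 * `err` is always assigned before it is read. Writing the loop as
 * do-while makes that guarantee visible to the compiler as well.
 */
static int toy_map_all(unsigned int len)
{
	int err;

	assert(len > 0);
	do {
		err = toy_map_chunk(&len);
		if (err)
			break;
	} while (len);
	return err;
}
```

With a plain `while (len)` loop the compiler cannot rule out zero iterations, which is why it warns that `err` may be used uninitialized.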
  13. 03 Jul, 2013 2 commits
    • ext4: ->tmpfile() support · af51a2ac
      Al Viro committed
      very similar to ext3 counterpart...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: export lseek_execute() to modules · 46a1c2c7
      Jie Liu committed
      For those file systems (btrfs/ext4/ocfs2/tmpfs) that support
      SEEK_DATA/SEEK_HOLE functions, we end up handling the same
      matter as lseek_execute(): updating the current file offset
      to the desired offset if it is valid.  ceph also does
      similar things in ceph_llseek().
      
      To reduce the duplication, this patch makes lseek_execute()
      publicly accessible so that we can call it directly from the
      underlying file systems.
      
      Thanks Dave Chinner for this suggestion.
      
      [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
      
      v2->v1:
      - Add kernel-doc comments for lseek_execute()
      - Call lseek_execute() in ceph->llseek()
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Ted Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Sage Weil <sage@inktank.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
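The shared step that the vfs commit above factors out (validate the requested offset, then commit it to the file position) might look roughly like this in user space. It is a sketch, not the kernel helper: `toy_file` and `toy_setpos()` are hypothetical, and the `version` field is an assumption used only to show that cached state gets invalidated when the position changes.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical miniature of an open file; only the position is modelled. */
struct toy_file {
	long long pos;
	unsigned long long version;
};

/*
 * Validate `offset` against [0, maxsize] and, if it is valid, commit it.
 * Returns the new offset, or -EINVAL if the offset is out of range.
 */
static long long toy_setpos(struct toy_file *f, long long offset,
			    long long maxsize)
{
	if (offset < 0 || offset > maxsize)
		return -EINVAL;
	if (offset != f->pos) {
		f->pos = offset;
		f->version = 0;   /* position changed: drop cached state */
	}
	return offset;
}
```

A file system's SEEK_DATA/SEEK_HOLE code would compute the target offset itself and then hand it to a helper of this shape instead of repeating the range check and update.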
  14. 01 Jul, 2013 13 commits
    • ext4: optimize starting extent in ext4_ext_rm_leaf() · 6ae06ff5
      Ashish Sangwan committed
      Both hole punch and truncate use ext4_ext_rm_leaf() for removing
      blocks.  Currently we choose the last extent as the starting
      point for removing blocks:
      
      	ex = EXT_LAST_EXTENT(eh);
      
      This is OK for truncate but for hole punch we can optimize the extent
      selection as the path is already initialized.  We could use this
      information to select proper starting extent.  The code change in this
      patch will not affect truncate as for truncate path[depth].p_ext will
      always be NULL.
      Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: translate flag bits to strings in tracepoints · 21ddd568
      Theodore Ts'o committed
      Translate the bitfields used in various flags argument to strings to
      make the tracepoint output more human-readable.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix up error handling for mpage_map_and_submit_extent() · cb530541
      Theodore Ts'o committed
      The function mpage_release_unused_pages() must only be called once;
      otherwise the kernel will BUG() when the second call to
      mpage_release_unused_pages() tries to unlock the pages which had been
      unlocked by the first call.
      
      Also restructure the error handling so that we only give up on writing
      the dirty pages in the case of ENOSPC where retrying the allocation
      won't help.  Otherwise, a transient failure, such as a kmalloc()
      failure in calling ext4_map_blocks() might cause us to give up on
      those pages, leading to a scary message in /var/log/messages plus data
      loss.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
    • ext4: only zero partial blocks in ext4_zero_partial_blocks() · e1be3a92
      Lukas Czerner committed
      Currently, if we pass a range into ext4_zero_partial_blocks() that covers
      an entire block, we attempt to zero it even though we should only zero
      the unaligned part of a block.
      
      Fix this by checking whether the range covers the whole block and
      skipping zeroing if so.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
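The rule in the commit above (zero only the unaligned edges of a range, never a fully covered block) can be worked through with a little arithmetic. This is a sketch under assumptions: `toy_range` and `toy_partial_ranges()` are hypothetical, byte offsets are used throughout, and the block size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* A byte range to zero; len == 0 means "nothing to zero". */
struct toy_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Work out which bytes at the edges of [start, start + len) need
 * zeroing. Blocks that the range covers completely are left alone.
 */
static void toy_partial_ranges(uint64_t start, uint64_t len,
			       uint64_t blocksize,
			       struct toy_range *head, struct toy_range *tail)
{
	uint64_t end = start + len;                    /* exclusive */
	uint64_t head_off = start & (blocksize - 1);   /* offset in first block */
	uint64_t tail_off = end & (blocksize - 1);     /* offset in last block */

	assert(len > 0);
	head->start = head->len = 0;
	tail->start = tail->len = 0;

	/* Range confined to one block: zero it only if it is unaligned. */
	if (start / blocksize == (end - 1) / blocksize) {
		if (head_off || tail_off) {
			head->start = start;
			head->len = len;
		}
		return;
	}
	if (head_off) {            /* partial block at the front */
		head->start = start;
		head->len = blocksize - head_off;
	}
	if (tail_off) {            /* partial block at the back */
		tail->start = end - tail_off;
		tail->len = tail_off;
	}
}
```

For a 4096-byte block size, the range [0, 4096) yields no zeroing at all, while [100, 10100) yields a head of 3996 bytes starting at 100 and a tail of 1908 bytes starting at 8192.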
    • ext4: check error return from ext4_write_inline_data_end() · 42c832de
      Theodore Ts'o committed
      The function ext4_write_inline_data_end() can return an error.  So we
      need to assign it to a signed integer variable to check for an error
      return (since copied is an unsigned int).
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <wenqing.lz@taobao.com>
      Cc: stable@vger.kernel.org
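The signedness issue in the commit above is easy to reproduce outside the kernel. In this sketch, `toy_write_end()` and `toy_caller()` are hypothetical; the only assumption taken from the message is that the helper returns either a byte count or a negative error, and that the caller's byte counter is unsigned.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: returns bytes written, or a negative errno. */
static int toy_write_end(int fail, unsigned int copied)
{
	return fail ? -EIO : (int)copied;
}

/*
 * Capture the result in a signed variable first. Assigning it straight
 * to the unsigned `copied` would turn -EIO into a huge positive count,
 * and a later `copied < 0` test could never fire.
 */
static int toy_caller(int fail, unsigned int *copied)
{
	int ret = toy_write_end(fail, *copied);

	if (ret < 0)
		return ret;           /* error is still recognisable here */
	*copied = (unsigned int)ret;
	return 0;
}
```

On failure the caller sees `-EIO` and the unsigned counter is left untouched.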
    • ext4: delete unnecessary C statements · 353eefd3
      jon ernst committed
      Comparing unsigned variable with 0 always returns false.
      err = 0 is duplicated and unnecessary.
      
      [ tytso: Also cleaned up error handling in ext4_block_zero_page_range() ]
      Signed-off-by: "Jon Ernst" <jonernst07@gmx.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree() · 64cb9273
      Al Viro committed
      Both ext3 and ext4 htree_dirblock_to_tree() is just filling the
      in-core rbtree for use by call_filldir().  All updates of ->f_pos are
      done by the latter; bumping it here (on error) is obviously wrong - we
      might very well have it nowhere near the block we'd found an error in.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: pass inode pointer instead of file pointer to punch hole · aeb2817a
      Ashish Sangwan committed
      No need to pass file pointer when we can directly pass inode pointer.
      Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: improve free space calculation for inline_data · c4932dbe
      boxi liu committed
      In the ext4 inline_data feature, the xattr space in the inode is
      used to store the inline data.  When we calculate the size of the
      inline data as an xattr, we add the pad.  But in the
      get_max_inline_xattr_value_size() function we count the free space
      without the pad.  This causes some contents to be moved to a block
      even if they could be stored in the inode.
      Signed-off-by: liulei <lewis.liulei@huawei.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Tao Ma <boyu.mt@taobao.com>
    • ext4: reduce object size when !CONFIG_PRINTK · e7c96e8e
      Joe Perches committed
      Reducing the object size by ~10% could be useful for embedded systems.
      
      Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
      arguments, passing " " to functions when !CONFIG_PRINTK and still
      verifying format and arguments with no_printk.
      
      $ size fs/ext4/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       239375	    610	    888	 240873	  3ace9	fs/ext4/built-in.o.new
       264167	    738	    888	 265793	  40e41	fs/ext4/built-in.o.old
      
          $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
          # CONFIG_PRINTK is not set
          CONFIG_EXT4_FS=y
          CONFIG_EXT4_USE_FOR_EXT23=y
          CONFIG_EXT4_FS_POSIX_ACL=y
          # CONFIG_EXT4_FS_SECURITY is not set
          # CONFIG_EXT4_DEBUG is not set
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: improve extent cache shrink mechanism to avoid to burn CPU time · d3922a77
      Zheng Liu committed
      Now we maintain a proper in-order LRU list in ext4 to reclaim entries
      from extent status tree when we are under heavy memory pressure.  For
      keeping this order, a spin lock is used to protect this list.  But this
      lock burns a lot of CPU time.  We can use the following steps to trigger
      it.
      
        % cd /dev/shm
        % dd if=/dev/zero of=ext4-img bs=1M count=2k
        % mkfs.ext4 ext4-img
        % mount -t ext4 -o loop ext4-img /mnt
        % cd /mnt
        % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
        % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
        % perf record -a -g
        % perf report
      
      This commit tries to fix this problem.  Now a new member called
      i_touch_when is added into ext4_inode_info to record the last access
      time for an inode.  With it we no longer need to keep a proper in-order
      LRU list, which avoids burning CPU time.  When we try to
      reclaim some entries from extent status tree, we use list_sort() to get
      a proper in-order list.  Then we traverse this list to discard some
      entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
      time of sorting this list.  When we traverse the list, we skip the inode
      that is newer than this time, and move this inode to the tail of LRU
      list.  When the head of the list is newer than s_es_last_sorted, we will
      sort the LRU list again.
      
      In this commit, we break the loop if s_extent_cache_cnt == 0 because
      that means that all extents in extent status tree have been reclaimed.
      
      Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
      changed to save a local variable in these functions.
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
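The "timestamp on access, sort only when shrinking" scheme described in the commit above can be sketched with an array and `qsort()`. This is a deliberately simplified user-space model: `toy_inode`, `toy_touch()`, and `toy_shrink()` are hypothetical, and the s_es_last_sorted bookkeeping from the message is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-inode record: last-touched time and cached entries. */
struct toy_inode {
	unsigned long touch_when;
	int cached;
};

/* Order by last access time, oldest first. */
static int by_touch_time(const void *a, const void *b)
{
	const struct toy_inode *x = a, *y = b;

	return (x->touch_when > y->touch_when) - (x->touch_when < y->touch_when);
}

/* Hot path: just record the access; no list manipulation, no shared lock. */
static void toy_touch(struct toy_inode *inode, unsigned long now)
{
	inode->touch_when = now;
}

/*
 * Slow path: sort once, then reclaim from the least recently touched
 * inode first. Returns the number of entries reclaimed.
 */
static int toy_shrink(struct toy_inode *inodes, size_t n, int nr_to_scan)
{
	int freed = 0;

	qsort(inodes, n, sizeof(*inodes), by_touch_time);
	for (size_t i = 0; i < n && nr_to_scan > 0; i++) {
		int take = inodes[i].cached < nr_to_scan ?
			   inodes[i].cached : nr_to_scan;

		inodes[i].cached -= take;
		nr_to_scan -= take;
		freed += take;
	}
	return freed;
}
```

The trade-off is the one the message names: accesses become a single store, and the ordering cost is paid only when the shrinker actually runs.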
    • ext4: implement error handling of ext4_mb_new_preallocation() · 2c00ef3e
      Alexey Khoroshilov committed
      If memory allocation in ext4_mb_new_group_pa() fails, it returns an
      error code and ext4_mb_new_preallocation() propagates it,
      but ext4_mb_new_blocks() ignores it.
      
      An observed result was:
      
      - allocation fail means ext4_mb_new_group_pa() does not update
        ext4_allocation_context;
      
      - ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len =
        ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead
        of number of blocks requested (1);
      
      - that activates update cycle in ext4_splice_branch():
          for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here
            *(where->p + i) = cpu_to_le32(current_block++);
      
      - it iterates 511 times and corrupts a chunk of memory including inode
        structure;
      
      - page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty();
      
      - system hangs with 'scheduling while atomic' BUG.
      
      The patch implements a check for ext4_mb_new_preallocation() error
      code and handles its failure as if ext4_mb_regular_allocator() fails.
      
      Found by Linux File System Verification project (linuxtesting.org).
      
      [ Patch restructured by tytso to make the flow of control easier to follow. ]
      Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix corruption when online resizing a fs with 1K block size · 6ca792ed
      Maarten ter Huurne committed
      Subtracting the number of the first data block places the superblock
      backups one block too early, corrupting the file system. When the block
      size is larger than 1K, the first data block is 0, so the subtraction
      has no effect and no corruption occurs.
      Signed-off-by: Maarten ter Huurne <maarten@treewalker.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      CC: stable@vger.kernel.org
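The "one block too early" effect in the resize commit above follows directly from where a block group starts. A minimal sketch, assuming a simplified layout (`toy_layout` and `toy_group_first_block()` are hypothetical names, and the resize code itself is not reproduced):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified layout description. */
struct toy_layout {
	uint64_t first_data_block;   /* 1 with a 1K block size, else 0 */
	uint64_t blocks_per_group;
};

/* First block of a group; a backup superblock sits at this block. */
static uint64_t toy_group_first_block(const struct toy_layout *l,
				      uint64_t group)
{
	return l->first_data_block + group * l->blocks_per_group;
}
```

With 1K blocks and 8192 blocks per group, group 1 starts at block 8193. Subtracting the first data block gives 8192, which is still inside group 0; with larger block sizes the first data block is 0 and the subtraction changes nothing.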