1. 29 8月, 2014 1 次提交
    • D
      jbd2: fix descriptor block size handling errors with journal_csum · db9ee220
      Darrick J. Wong 提交于
      It turns out that there are some serious problems with the on-disk
      format of journal checksum v2.  The foremost is that the function to
      calculate descriptor tag size returns sizes that are too big.  This
      causes alignment issues on some architectures and is compounded by the
      fact that some parts of jbd2 use the structure size (incorrectly) to
      determine the presence of a 64bit journal instead of checking the
      feature flags.
      
      Therefore, introduce journal checksum v3, which enlarges the
      descriptor block tag format to allow for full 32-bit checksums of
      journal blocks, fix the journal tag function to return the correct
      sizes, and fix the jbd2 recovery code to use feature flags to
      determine 64bitness.
      
      Add a few function helpers so we don't have to open-code quite so
      many pieces.
      
      Switching to a 16-byte block size was found to increase journal size
      overhead by a maximum of 0.1%, to convert a 32-bit journal with no
      checksumming to a 32-bit journal with checksum v3 enabled.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reported-by: NTR Reardon <thomas_reardon@hotmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      db9ee220
  2. 15 7月, 2014 1 次提交
    • T
      ext4: rearrange initialization to fix EXT4FS_DEBUG · d5e03cbb
      Theodore Ts'o 提交于
      The EXT4FS_DEBUG is a *very* developer specific #ifdef designed for
      ext4 developers only.  (You have to modify fs/ext4/ext4.h to enable
      it.)
      
      Rearrange how we initialize data structures to avoid calling
      ext4_count_free_clusters() until the multiblock allocator has been
      initialized.
      
      This also allows us to only call ext4_count_free_clusters() once, and
      simplifies the code somewhat.
      
      (Thanks to Chen Gang <gang.chen.5i5j@gmail.com> for pointing out a
      !CONFIG_SMP compile breakage in the original patch.)
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      d5e03cbb
  3. 12 7月, 2014 1 次提交
  4. 06 7月, 2014 2 次提交
  5. 13 5月, 2014 2 次提交
  6. 12 5月, 2014 2 次提交
  7. 22 4月, 2014 1 次提交
  8. 21 4月, 2014 1 次提交
    • L
      ext4: rename uninitialized extents to unwritten · 556615dc
      Lukas Czerner 提交于
      Currently in ext4 there is quite a mess when it comes to naming
      unwritten extents. Sometimes we call it uninitialized and sometimes we
      refer to it as unwritten.
      
      The right name for the extent which has been allocated but does not
      contain any written data is _unwritten_. Other file systems are
      using this name consistently, even the buffer head state refers to it as
      unwritten. We need to fix this confusion in ext4.
      
      This commit changes every reference to an uninitialized extent (meaning
      allocated but unwritten) to unwritten extent. This includes comments,
      function names and variable names. It even covers abbreviation of the
      word uninitialized (such as uninit) and some misspellings.
      
      This commit does not change any of the code paths at all. This has been
      confirmed by comparing md5sums of the assembly code of each object file
      after all the function names were stripped from it.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      556615dc
  9. 07 4月, 2014 1 次提交
    • A
      ext4: initialize multi-block allocator before checking block descriptors · 00764937
      Azat Khuzhin 提交于
      With EXT4FS_DEBUG ext4_count_free_clusters() will call
      ext4_read_block_bitmap() without s_group_info initialized, so we need to
      initialize multi-block allocator before.
      
      And dependencies that must be solved, to allow this:
      - multi-block allocator needs in group descriptors
      - need to install s_op before initializing multi-block allocator,
        because in ext4_mb_init_backend() new inode is created.
      - initialize number of group desc blocks (s_gdb_count) otherwise
        number of clusters returned by ext4_free_clusters_after_init() is not correct.
        (see ext4_bg_num_gdb_nometa())
      
      Here is the stack backtrace:
      
      (gdb) bt
       #0  ext4_get_group_info (group=0, sb=0xffff880079a10000) at ext4.h:2430
       #1  ext4_validate_block_bitmap (sb=sb@entry=0xffff880079a10000,
           desc=desc@entry=0xffff880056510000, block_group=block_group@entry=0,
           bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:358
       #2  0xffffffff81232202 in ext4_wait_block_bitmap (sb=sb@entry=0xffff880079a10000,
           block_group=block_group@entry=0,
           bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:476
       #3  0xffffffff81232eaf in ext4_read_block_bitmap (sb=sb@entry=0xffff880079a10000,
           block_group=block_group@entry=0) at balloc.c:489
       #4  0xffffffff81232fc0 in ext4_count_free_clusters (sb=sb@entry=0xffff880079a10000) at balloc.c:665
       #5  0xffffffff81259ffa in ext4_check_descriptors (first_not_zeroed=<synthetic pointer>,
           sb=0xffff880079a10000) at super.c:2143
       #6  ext4_fill_super (sb=sb@entry=0xffff880079a10000, data=<optimized out>,
           data@entry=0x0 <irq_stack_union>, silent=silent@entry=0) at super.c:3851
           ...
      Signed-off-by: NAzat Khuzhin <a3at.mail@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      00764937
  10. 25 3月, 2014 1 次提交
  11. 19 3月, 2014 1 次提交
    • T
      ext4: each filesystem creates and uses its own mb_cache · 9c191f70
      T Makphaibulchoke 提交于
      This patch adds new interfaces to create and destory cache,
      ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and remove
      the cache creation and destory calls from ex4_init_xattr() and
      ext4_exitxattr() in fs/ext4/xattr.c.
      
      fs/ext4/super.c has been changed so that when a filesystem is mounted
      a cache is allocated and attched to its ext4_sb_info structure.
      
      fs/mbcache.c has been changed so that only one slab allocator is
      allocated and used by all mbcache structures.
      Signed-off-by: NT. Makphaibulchoke <tmac@hp.com>
      9c191f70
  12. 14 3月, 2014 1 次提交
  13. 13 3月, 2014 1 次提交
    • T
      fs: push sync_filesystem() down to the file system's remount_fs() · 02b9984d
      Theodore Ts'o 提交于
      Previously, the no-op "mount -o mount /dev/xxx" operation when the
      file system is already mounted read-write causes an implied,
      unconditional syncfs().  This seems pretty stupid, and it's certainly
      documented or guaraunteed to do this, nor is it particularly useful,
      except in the case where the file system was mounted rw and is getting
      remounted read-only.
      
      However, it's possible that there might be some file systems that are
      actually depending on this behavior.  In most file systems, it's
      probably fine to only call sync_filesystem() when transitioning from
      read-write to read-only, and there are some file systems where this is
      not needed at all (for example, for a pseudo-filesystem or something
      like romfs).
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Evgeniy Dushistov <dushistov@mail.ru>
      Cc: Jan Kara <jack@suse.cz>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Anders Larsen <al@alarsen.net>
      Cc: Phillip Lougher <phillip@squashfs.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      Cc: Petr Vandrovec <petr@vandrovec.name>
      Cc: xfs@oss.sgi.com
      Cc: linux-btrfs@vger.kernel.org
      Cc: linux-cifs@vger.kernel.org
      Cc: samba-technical@lists.samba.org
      Cc: codalist@coda.cs.cmu.edu
      Cc: linux-ext4@vger.kernel.org
      Cc: linux-f2fs-devel@lists.sourceforge.net
      Cc: fuse-devel@lists.sourceforge.net
      Cc: cluster-devel@redhat.com
      Cc: linux-mtd@lists.infradead.org
      Cc: jfs-discussion@lists.sourceforge.net
      Cc: linux-nfs@vger.kernel.org
      Cc: linux-nilfs@vger.kernel.org
      Cc: linux-ntfs-dev@lists.sourceforge.net
      Cc: ocfs2-devel@oss.oracle.com
      Cc: reiserfs-devel@vger.kernel.org
      02b9984d
  14. 18 2月, 2014 1 次提交
  15. 13 2月, 2014 1 次提交
    • T
      ext4: don't try to modify s_flags if the the file system is read-only · 23301410
      Theodore Ts'o 提交于
      If an ext4 file system is created by some tool other than mke2fs
      (perhaps by someone who has a pathalogical fear of the GPL) that
      doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags,
      and that file system is then mounted read-only, don't try to modify
      the s_flags field.  Otherwise, if dm_verity is in use, the superblock
      will change, causing an dm_verity failure.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      23301410
  16. 09 12月, 2013 2 次提交
  17. 08 11月, 2013 1 次提交
  18. 18 10月, 2013 1 次提交
    • T
      ext4: add ratelimiting to ext4 messages · efbed4dc
      Theodore Ts'o 提交于
      In the case of a storage device that suddenly disappears, or in the
      case of significant file system corruption, this can result in a huge
      flood of messages being sent to the console.  This can overflow the
      file system containing /var/log/messages, or if a serial console is
      configured, this can slow down the system so much that a hardware
      watchdog can end up triggering forcing a system reboot.
      
      Google-Bug-Id: 7258357
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      efbed4dc
  19. 04 9月, 2013 1 次提交
    • C
      direct-io: Implement generic deferred AIO completions · 7b7a8665
      Christoph Hellwig 提交于
      Add support to the core direct-io code to defer AIO completions to user
      context using a workqueue.  This replaces opencoded and less efficient
      code in XFS and ext4 (we save a memory allocation for each direct IO)
      and will be needed to properly support O_(D)SYNC for AIO.
      
      The communication between the filesystem and the direct I/O code requires
      a new buffer head flag, which is a bit ugly but not avoidable until the
      direct I/O code stops abusing the buffer_head structure for communicating
      with the filesystems.
      
      Currently this creates a per-superblock unbound workqueue for these
      completions, which is taken from an earlier patch by Jan Kara.  I'm
      not really convinced about this use and would prefer a "normal" global
      workqueue with a high concurrency limit, but this needs further discussion.
      
      JK: Fixed ext4 part, dynamic allocation of the workqueue.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7b7a8665
  20. 29 8月, 2013 1 次提交
    • E
      ext4: allow specifying external journal by pathname mount option · ad4eec61
      Eric Sandeen 提交于
      It's always been a hassle that if an external journal's
      device number changes, the filesystem won't mount.
      And since boot-time enumeration can change, device number
      changes aren't unusual.
      
      The current mechanism to update the journal location is by
      passing in a mount option w/ a new devnum, but that's a hassle;
      it's a manual approach, fixing things after the fact.
      
      Adding a mount option, "-o journal_path=/dev/$DEVICE" would
      help, since then we can do i.e.
      
      # mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...
      
      and it'll mount even if the devnum has changed, as shown here:
      
      # losetup /dev/loop0 journalfile
      # mke2fs -L mylabel-journal -O journal_dev /dev/loop0 
      # mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1
      
      Change the journal device number:
      
      # losetup -d /dev/loop0
      # losetup /dev/loop1 journalfile 
      
      And today it will fail:
      
      # mount /dev/sdb1 /mnt/test
      mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
             missing codepage or helper program, or other error
             In some cases useful info is found in syslog - try
             dmesg | tail  or so
      
      # dmesg | tail -n 1
      [17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal
      
      But with this new mount option, we can specify the new path:
      
      # mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
      #
      
      (which does update the encoded device number, incidentally):
      
      # umount /dev/sdb1
      # dumpe2fs -h /dev/sdb1 | grep "Journal device"
      dumpe2fs 1.41.12 (17-May-2010)
      Journal device:	          0x0701
      
      But best of all we can just always mount by journal-path, and
      it'll always work:
      
      # mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
      #
      
      So the journal_path option can be specified in fstab, and as long as
      the disk is available somewhere, and findable by label (or by UUID),
      we can mount.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      ad4eec61
  21. 20 8月, 2013 1 次提交
  22. 09 8月, 2013 2 次提交
  23. 27 7月, 2013 1 次提交
  24. 12 7月, 2013 1 次提交
    • T
      ext4: don't show usrquota/grpquota twice in /proc/mounts · ad065dd0
      Theodore Ts'o 提交于
      We now print mount options in a generic fashion in
      ext4_show_options(), so we shouldn't be explicitly printing the
      {usr,grp}quota options in ext4_show_quota_options().
      
      Without this patch, /proc/mounts can look like this:
      
       /dev/vdb /vdb ext4 rw,relatime,quota,usrquota,data=ordered,usrquota 0 0
                                            ^^^^^^^^              ^^^^^^^^
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      ad065dd0
  25. 06 7月, 2013 1 次提交
    • T
      ext4: fix ext4_get_group_number() · 960fd856
      Theodore Ts'o 提交于
      The function ext4_get_group_number() was introduced as an optimization
      in commit bd86298e.  Unfortunately, this commit incorrectly
      calculate the group number for file systems with a 1k block size (when
      s_first_data_block is 1 instead of zero).  This could cause the
      following kernel BUG:
      
      [  568.877799] ------------[ cut here ]------------
      [  568.877833] kernel BUG at fs/ext4/mballoc.c:3728!
      [  568.877840] Oops: Exception in kernel mode, sig: 5 [#1]
      [  568.877845] SMP NR_CPUS=32 NUMA pSeries
      [  568.877852] Modules linked in: binfmt_misc
      [  568.877861] CPU: 1 PID: 3516 Comm: fs_mark Not tainted 3.10.0-03216-g7c6809ff-dirty #1
      [  568.877867] task: c0000001fb0b8000 ti: c0000001fa954000 task.ti: c0000001fa954000
      [  568.877873] NIP: c0000000002f42a4 LR: c0000000002f4274 CTR: c000000000317ef8
      [  568.877879] REGS: c0000001fa956ed0 TRAP: 0700   Not tainted  (3.10.0-03216-g7c6809ff-dirty)
      [  568.877884] MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 24000428  XER: 00000000
      [  568.877902] SOFTE: 1
      [  568.877905] CFAR: c0000000002b5464
      [  568.877908]
      GPR00: 0000000000000001 c0000001fa957150 c000000000c6a408 c0000001fb588000
      GPR04: 0000000000003fff c0000001fa9571c0 c0000001fa9571c4 000138098c50625f
      GPR08: 1301200000000000 0000000000000002 0000000000000001 0000000000000000
      GPR12: 0000000024000422 c00000000f33a300 0000000000008000 c0000001fa9577f0
      GPR16: c0000001fb7d0100 c000000000c29190 c0000000007f46e8 c000000000a14672
      GPR20: 0000000000000001 0000000000000008 ffffffffffffffff 0000000000000000
      GPR24: 0000000000000100 c0000001fa957278 c0000001fdb2bc78 c0000001fa957288
      GPR28: 0000000000100100 c0000001fa957288 c0000001fb588000 c0000001fdb2bd10
      [  568.877993] NIP [c0000000002f42a4] .ext4_mb_release_group_pa+0xec/0x1c0
      [  568.877999] LR [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0
      [  568.878004] Call Trace:
      [  568.878008] [c0000001fa957150] [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0 (unreliable)
      [  568.878017] [c0000001fa957200] [c0000000002fb070] .ext4_mb_discard_lg_preallocations+0x394/0x444
      [  568.878025] [c0000001fa957340] [c0000000002fb45c] .ext4_mb_release_context+0x33c/0x734
      [  568.878032] [c0000001fa957440] [c0000000002fbcf8] .ext4_mb_new_blocks+0x4a4/0x5f4
      [  568.878039] [c0000001fa957510] [c0000000002ef56c] .ext4_ext_map_blocks+0xc28/0x1178
      [  568.878047] [c0000001fa957640] [c0000000002c1a94] .ext4_map_blocks+0x2c8/0x490
      [  568.878054] [c0000001fa957730] [c0000000002c536c] .ext4_writepages+0x738/0xc60
      [  568.878062] [c0000001fa957950] [c000000000168a78] .do_writepages+0x5c/0x80
      [  568.878069] [c0000001fa9579d0] [c00000000015d1c4] .__filemap_fdatawrite_range+0x88/0xb0
      [  568.878078] [c0000001fa957aa0] [c00000000015d23c] .filemap_write_and_wait_range+0x50/0xfc
      [  568.878085] [c0000001fa957b30] [c0000000002b8edc] .ext4_sync_file+0x220/0x3c4
      [  568.878092] [c0000001fa957be0] [c0000000001f849c] .vfs_fsync_range+0x64/0x80
      [  568.878098] [c0000001fa957c70] [c0000000001f84f0] .vfs_fsync+0x38/0x4c
      [  568.878105] [c0000001fa957d00] [c0000000001f87f4] .do_fsync+0x54/0x90
      [  568.878111] [c0000001fa957db0] [c0000000001f8894] .SyS_fsync+0x28/0x3c
      [  568.878120] [c0000001fa957e30] [c000000000009c88] syscall_exit+0x0/0x7c
      [  568.878125] Instruction dump:
      [  568.878130] 60000000 813d0034 81610070 38000000 7f8b4800 419e001c 813f007c 7d2bfe70
      [  568.878144] 7d604a78 7c005850 54000ffe 7c0007b4 <0b000000> e8a10076 e87f0090 7fa4eb78
      [  568.878160] ---[ end trace 594d911d9654770b ]---
      
      In addition fix the STD_GROUP optimization so that it works for
      bigalloc file systems as well.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reported-by: NLi Zhong <lizhongfs@gmail.com>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      Cc: stable@vger.kernel.org  # 3.10
      960fd856
  26. 01 7月, 2013 2 次提交
    • J
      ext4: reduce object size when !CONFIG_PRINTK · e7c96e8e
      Joe Perches 提交于
      Reduce the object size ~10% could be useful for embedded systems.
      
      Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
      arguments, passing " " to functions when !CONFIG_PRINTK and still
      verifying format and arguments with no_printk.
      
      $ size fs/ext4/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       239375	    610	    888	 240873	  3ace9	fs/ext4/built-in.o.new
       264167	    738	    888	 265793	  40e41	fs/ext4/built-in.o.old
      
          $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
          # CONFIG_PRINTK is not set
          CONFIG_EXT4_FS=y
          CONFIG_EXT4_USE_FOR_EXT23=y
          CONFIG_EXT4_FS_POSIX_ACL=y
          # CONFIG_EXT4_FS_SECURITY is not set
          # CONFIG_EXT4_DEBUG is not set
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      e7c96e8e
    • Z
      ext4: improve extent cache shrink mechanism to avoid to burn CPU time · d3922a77
      Zheng Liu 提交于
      Now we maintain an proper in-order LRU list in ext4 to reclaim entries
      from extent status tree when we are under heavy memory pressure.  For
      keeping this order, a spin lock is used to protect this list.  But this
      lock burns a lot of CPU time.  We can use the following steps to trigger
      it.
      
        % cd /dev/shm
        % dd if=/dev/zero of=ext4-img bs=1M count=2k
        % mkfs.ext4 ext4-img
        % mount -t ext4 -o loop ext4-img /mnt
        % cd /mnt
        % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
        % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
        % perf record -a -g
        % perf report
      
      This commit tries to fix this problem.  Now a new member called
      i_touch_when is added into ext4_inode_info to record the last access
      time for an inode.  Meanwhile we never need to keep a proper in-order
      LRU list.  So this can avoid to burns some CPU time.  When we try to
      reclaim some entries from extent status tree, we use list_sort() to get
      a proper in-order list.  Then we traverse this list to discard some
      entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
      time of sorting this list.  When we traverse the list, we skip the inode
      that is newer than this time, and move this inode to the tail of LRU
      list.  When the head of the list is newer than s_es_last_sorted, we will
      sort the LRU list again.
      
      In this commit, we break the loop if s_extent_cache_cnt == 0 because
      that means that all extents in extent status tree have been reclaimed.
      
      Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
      changed to save a local variable in these functions.
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d3922a77
  27. 17 6月, 2013 1 次提交
  28. 13 6月, 2013 2 次提交
  29. 05 6月, 2013 2 次提交
  30. 28 5月, 2013 2 次提交
  31. 07 5月, 2013 1 次提交