1. 01 7月, 2013 2 次提交
    • J
      ext4: reduce object size when !CONFIG_PRINTK · e7c96e8e
      Joe Perches 提交于
      Reduce the object size ~10% could be useful for embedded systems.
      
      Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
      arguments, passing " " to functions when !CONFIG_PRINTK and still
      verifying format and arguments with no_printk.
      
      $ size fs/ext4/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       239375	    610	    888	 240873	  3ace9	fs/ext4/built-in.o.new
       264167	    738	    888	 265793	  40e41	fs/ext4/built-in.o.old
      
          $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
          # CONFIG_PRINTK is not set
          CONFIG_EXT4_FS=y
          CONFIG_EXT4_USE_FOR_EXT23=y
          CONFIG_EXT4_FS_POSIX_ACL=y
          # CONFIG_EXT4_FS_SECURITY is not set
          # CONFIG_EXT4_DEBUG is not set
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      e7c96e8e
    • Z
      ext4: improve extent cache shrink mechanism to avoid to burn CPU time · d3922a77
      Zheng Liu 提交于
      Now we maintain an proper in-order LRU list in ext4 to reclaim entries
      from extent status tree when we are under heavy memory pressure.  For
      keeping this order, a spin lock is used to protect this list.  But this
      lock burns a lot of CPU time.  We can use the following steps to trigger
      it.
      
        % cd /dev/shm
        % dd if=/dev/zero of=ext4-img bs=1M count=2k
        % mkfs.ext4 ext4-img
        % mount -t ext4 -o loop ext4-img /mnt
        % cd /mnt
        % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
        % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
        % perf record -a -g
        % perf report
      
      This commit tries to fix this problem.  Now a new member called
      i_touch_when is added into ext4_inode_info to record the last access
      time for an inode.  Meanwhile we never need to keep a proper in-order
      LRU list.  So this can avoid to burns some CPU time.  When we try to
      reclaim some entries from extent status tree, we use list_sort() to get
      a proper in-order list.  Then we traverse this list to discard some
      entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
      time of sorting this list.  When we traverse the list, we skip the inode
      that is newer than this time, and move this inode to the tail of LRU
      list.  When the head of the list is newer than s_es_last_sorted, we will
      sort the LRU list again.
      
      In this commit, we break the loop if s_extent_cache_cnt == 0 because
      that means that all extents in extent status tree have been reclaimed.
      
      Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
      changed to save a local variable in these functions.
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d3922a77
  2. 17 6月, 2013 1 次提交
  3. 13 6月, 2013 2 次提交
  4. 05 6月, 2013 2 次提交
  5. 28 5月, 2013 2 次提交
  6. 07 5月, 2013 1 次提交
  7. 21 4月, 2013 1 次提交
  8. 12 4月, 2013 1 次提交
  9. 10 4月, 2013 3 次提交
    • T
      ext4: fix miscellaneous big endian warnings · d6a77105
      Theodore Ts'o 提交于
      None of these result in any bug, but they makes sparse complain.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d6a77105
    • L
      ext4: introduce reserved space · 27dd4385
      Lukas Czerner 提交于
      Currently in ENOSPC condition when writing into unwritten space, or
      punching a hole, we might need to split the extent and grow extent tree.
      However since we can not allocate any new metadata blocks we'll have to
      zero out unwritten part of extent or punched out part of extent, or in
      the worst case return ENOSPC even though use actually does not allocate
      any space.
      
      Also in delalloc path we do reserve metadata and data blocks for the
      time we're going to write out, however metadata block reservation is
      very tricky especially since we expect that logical connectivity implies
      physical connectivity, however that might not be the case and hence we
      might end up allocating more metadata blocks than previously reserved.
      So in future, metadata reservation checks should be removed since we can
      not assure that we do not under reserve.
      
      And this is where reserved space comes into the picture. When mounting
      the file system we slice off a little bit of the file system space (2%
      or 4096 clusters, whichever is smaller) which can be then used for the
      cases mentioned above to prevent costly zeroout, or unexpected ENOSPC.
      
      The number of reserved clusters can be set via sysfs, however it can
      never be bigger than number of free clusters in the file system.
      
      Note that this patch fixes the failure of xfstest 274 as expected.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      27dd4385
    • A
      procfs: new helper - PDE_DATA(inode) · d9dda78b
      Al Viro 提交于
      The only part of proc_dir_entry the code outside of fs/proc
      really cares about is PDE(inode)->data.  Provide a helper
      for that; static inline for now, eventually will be moved
      to fs/proc, along with the knowledge of struct proc_dir_entry
      layout.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d9dda78b
  10. 09 4月, 2013 1 次提交
  11. 04 4月, 2013 3 次提交
    • L
      ext4: make ext4_block_in_group() much more efficient · 68911009
      Lukas Czerner 提交于
      Currently in when getting the block group number for a particular
      block in ext4_block_in_group() we're using
      ext4_get_group_no_and_offset() which uses do_div() to get the block
      group and the remainer which is offset within the group.
      
      We don't need all of that in ext4_block_in_group() as we only need to
      figure out the group number.
      
      This commit changes ext4_block_in_group() to calculate group number
      directly. This shows as a big improvement with regards to cpu
      utilization. Measuring fallocate -l 15T on fresh file system with perf
      showed that 23% of cpu time was spend in the
      ext4_get_group_no_and_offset(). With this change it completely
      disappears from the list only bumping the occurrence of
      ext4_init_block_bitmap() which is the biggest user of
      ext4_block_in_group() by 4%. As the result of this change on my system
      the fallocate call was approx. 10% faster.
      
      However since there is '-g' option in mkfs which allow us setting
      different groups size (mostly for developers) I've introduced new per
      file system flag whether we have a standard block group size or
      not. The flag is used to determine whether we can use the bit shift
      optimization or not.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      68911009
    • D
      ext4: unregister es_shrinker if mount failed · a75ae78f
      Dmitry Monakhov 提交于
      Otherwise destroyed ext_sb_info will be part of global shinker list
      and result in the following OOPS:
      
      JBD2: corrupted journal superblock
      JBD2: recovery failed
      EXT4-fs (dm-2): error loading journal
      general protection fault: 0000 [#1] SMP
      Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\
      mod
      CPU 1
      Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136                  /DH55TC
      RIP: 0010:[<ffffffff811bfb2d>]  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
      RSP: 0000:ffff88011d5cbcd8  EFLAGS: 00010207
      RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246
      RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001
      R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848
      R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000
      FS:  00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80)
      Stack:
      ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1
      00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000
      ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000
      Call Trace:
      [<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0
      [<ffffffff81249381>] mount_bdev+0x331/0x340
      [<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180
      [<ffffffff81362035>] ext4_mount+0x15/0x20
      [<ffffffff8124869a>] mount_fs+0x9a/0x2e0
      [<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170
      [<ffffffff81279c02>] do_new_mount+0x172/0x2e0
      [<ffffffff8127aa56>] do_mount+0x376/0x380
      [<ffffffff8127ab98>] sys_mount+0x138/0x150
      [<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b
      Code: 8b 05 88 04 eb 00 48 3d 90 ff 06 82 48 8d 58 e8 75 19 4c 89 e7 e8 e4 d7 2c 00 48 c7 c7 00 ff 06 82 e8 58 5f ef ff 5b 41 5c c9 c3 <48> 8b 4b 18 48 8b 73 20 48 89 da 31 c0 48 c7 c7 c5 a0 e4 81 e\
      8
      RIP  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
      RSP <ffff88011d5cbcd8>
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      a75ae78f
    • D
      ext4: fix journal callback list traversal · 5d3ee208
      Dmitry Monakhov 提交于
      It is incorrect to use list_for_each_entry_safe() for journal callback
      traversial because ->next may be removed by other task:
      ->ext4_mb_free_metadata()
        ->ext4_mb_free_metadata()
          ->ext4_journal_callback_del()
      
      This results in the following issue:
      
      WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
      Hardware name:
      list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
      Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
      Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
      Call Trace:
       [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
       [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
       [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
       [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
       [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
       [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
       [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
       [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
       [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
       [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
       [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
       [<ffffffff810ac6be>] kthread+0x10e/0x120
       [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
       [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
       [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
      
      This patch fix the issue as follows:
      - ext4_journal_commit_callback() make list truly traversial safe
        simply by always starting from list_head
      - fix race between two ext4_journal_callback_del() and
        ext4_journal_callback_try_del()
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.com
      5d3ee208
  12. 13 3月, 2013 1 次提交
    • E
      fs: Readd the fs module aliases. · fa7614dd
      Eric W. Biederman 提交于
      I had assumed that the only use of module aliases for filesystems
      prior to "fs: Limit sys_mount to only request filesystem modules."
      was in request_module.  It turns out I was wrong.  At least mkinitcpio
      in Arch linux uses these aliases.
      
      So readd the preexising aliases, to keep from breaking userspace.
      
      Userspace eventually will have to follow and use the same aliases the
      kernel does.  So at some point we may be delete these aliases without
      problems.  However that day is not today.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      fa7614dd
  13. 12 3月, 2013 1 次提交
  14. 04 3月, 2013 1 次提交
    • E
      fs: Limit sys_mount to only request filesystem modules. · 7f78e035
      Eric W. Biederman 提交于
      Modify the request_module to prefix the file system type with "fs-"
      and add aliases to all of the filesystems that can be built as modules
      to match.
      
      A common practice is to build all of the kernel code and leave code
      that is not commonly needed as modules, with the result that many
      users are exposed to any bug anywhere in the kernel.
      
      Looking for filesystems with a fs- prefix limits the pool of possible
      modules that can be loaded by mount to just filesystems trivially
      making things safer with no real cost.
      
      Using aliases means user space can control the policy of which
      filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
      with blacklist and alias directives.  Allowing simple, safe,
      well understood work-arounds to known problematic software.
      
      This also addresses a rare but unfortunate problem where the filesystem
      name is not the same as it's module name and module auto-loading
      would not work.  While writing this patch I saw a handful of such
      cases.  The most significant being autofs that lives in the module
      autofs4.
      
      This is relevant to user namespaces because we can reach the request
      module in get_fs_type() without having any special permissions, and
      people get uncomfortable when a user specified string (in this case
      the filesystem type) goes all of the way to request_module.
      
      After having looked at this issue I don't think there is any
      particular reason to perform any filtering or permission checks beyond
      making it clear in the module request that we want a filesystem
      module.  The common pattern in the kernel is to call request_module()
      without regards to the users permissions.  In general all a filesystem
      module does once loaded is call register_filesystem() and go to sleep.
      Which means there is not much attack surface exposed by loading a
      filesytem module unless the filesystem is mounted.  In a user
      namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
      which most filesystems do not set today.
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Reported-by: NKees Cook <keescook@google.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      7f78e035
  15. 03 3月, 2013 4 次提交
  16. 02 3月, 2013 1 次提交
  17. 23 2月, 2013 1 次提交
  18. 18 2月, 2013 2 次提交
    • Z
      ext4: reclaim extents from extent status tree · 74cd15cd
      Zheng Liu 提交于
      Although extent status is loaded on-demand, we also need to reclaim
      extent from the tree when we are under a heavy memory pressure because
      in some cases fragmented extent tree causes status tree costs too much
      memory.
      
      Here we maintain a lru list in super_block.  When the extent status of
      an inode is accessed and changed, this inode will be move to the tail
      of the list.  The inode will be dropped from this list when it is
      cleared.  In the inode, a counter is added to count the number of
      cached objects in extent status tree.  Here only written/unwritten/hole
      extent is counted because delayed extent doesn't be reclaimed due to
      fiemap, bigalloc and seek_data/hole need it.  The counter will be
      increased as a new extent is allocated, and it will be decreased as a
      extent is freed.
      
      In this commit we use normal shrinker framework to reclaim memory from
      the status tree.  ext4_es_reclaim_extents_count() traverses the lru list
      to count the number of reclaimable extents.  ext4_es_shrink() tries to
      reclaim written/unwritten/hole extents from extent status tree.  The
      inode that has been shrunk is moved to the tail of lru list.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      74cd15cd
    • Z
      ext4: remove single extent cache · 69eb33dc
      Zheng Liu 提交于
      Single extent cache could be removed because we have extent status tree
      as a extent cache, and it would be better.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      69eb33dc
  19. 09 2月, 2013 2 次提交
    • T
      ext4: pass context information to jbd2__journal_start() · 9924a92a
      Theodore Ts'o 提交于
      So we can better understand what bits of ext4 are responsible for
      long-running jbd2 handles, use jbd2__journal_start() so we can pass
      context information for logging purposes.
      
      The recommended way for finding the longer-running handles is:
      
         T=/sys/kernel/debug/tracing
         EVENT=$T/events/jbd2/jbd2_handle_stats
         echo "interval > 5" > $EVENT/filter
         echo 1 > $EVENT/enable
      
         ./run-my-fs-benchmark
      
         cat $T/trace > /tmp/problem-handles
      
      This will list handles that were active for longer than 20ms.  Having
      longer-running handles is bad, because a commit started at the wrong
      time could stall for those 20+ milliseconds, which could delay an
      fsync() or an O_SYNC operation.  Here is an example line from the
      trace file describing a handle which lived on for 311 jiffies, or over
      1.2 seconds:
      
      postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
         tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
         dirtied_blocks 0
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      9924a92a
    • T
      ext4: move the jbd2 wrapper functions out of super.c · 722887dd
      Theodore Ts'o 提交于
      Move the jbd2 wrapper functions which start and stop handles out of
      super.c, where they don't really logically belong, and into
      ext4_jbd2.c.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      722887dd
  20. 03 2月, 2013 4 次提交
  21. 29 1月, 2013 1 次提交
  22. 28 1月, 2013 2 次提交
  23. 25 1月, 2013 1 次提交