1. 28 5月, 2013 1 次提交
  2. 07 5月, 2013 1 次提交
  3. 21 4月, 2013 1 次提交
  4. 12 4月, 2013 1 次提交
  5. 10 4月, 2013 3 次提交
    • T
      ext4: fix miscellaneous big endian warnings · d6a77105
      Theodore Ts'o 提交于
      None of these result in any bug, but they makes sparse complain.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d6a77105
    • L
      ext4: introduce reserved space · 27dd4385
      Lukas Czerner 提交于
      Currently in ENOSPC condition when writing into unwritten space, or
      punching a hole, we might need to split the extent and grow extent tree.
      However since we can not allocate any new metadata blocks we'll have to
      zero out unwritten part of extent or punched out part of extent, or in
      the worst case return ENOSPC even though use actually does not allocate
      any space.
      
      Also in delalloc path we do reserve metadata and data blocks for the
      time we're going to write out, however metadata block reservation is
      very tricky especially since we expect that logical connectivity implies
      physical connectivity, however that might not be the case and hence we
      might end up allocating more metadata blocks than previously reserved.
      So in future, metadata reservation checks should be removed since we can
      not assure that we do not under reserve.
      
      And this is where reserved space comes into the picture. When mounting
      the file system we slice off a little bit of the file system space (2%
      or 4096 clusters, whichever is smaller) which can be then used for the
      cases mentioned above to prevent costly zeroout, or unexpected ENOSPC.
      
      The number of reserved clusters can be set via sysfs, however it can
      never be bigger than number of free clusters in the file system.
      
      Note that this patch fixes the failure of xfstest 274 as expected.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      27dd4385
    • A
      procfs: new helper - PDE_DATA(inode) · d9dda78b
      Al Viro 提交于
      The only part of proc_dir_entry the code outside of fs/proc
      really cares about is PDE(inode)->data.  Provide a helper
      for that; static inline for now, eventually will be moved
      to fs/proc, along with the knowledge of struct proc_dir_entry
      layout.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d9dda78b
  6. 09 4月, 2013 1 次提交
  7. 04 4月, 2013 3 次提交
    • L
      ext4: make ext4_block_in_group() much more efficient · 68911009
      Lukas Czerner 提交于
      Currently in when getting the block group number for a particular
      block in ext4_block_in_group() we're using
      ext4_get_group_no_and_offset() which uses do_div() to get the block
      group and the remainer which is offset within the group.
      
      We don't need all of that in ext4_block_in_group() as we only need to
      figure out the group number.
      
      This commit changes ext4_block_in_group() to calculate group number
      directly. This shows as a big improvement with regards to cpu
      utilization. Measuring fallocate -l 15T on fresh file system with perf
      showed that 23% of cpu time was spend in the
      ext4_get_group_no_and_offset(). With this change it completely
      disappears from the list only bumping the occurrence of
      ext4_init_block_bitmap() which is the biggest user of
      ext4_block_in_group() by 4%. As the result of this change on my system
      the fallocate call was approx. 10% faster.
      
      However since there is '-g' option in mkfs which allow us setting
      different groups size (mostly for developers) I've introduced new per
      file system flag whether we have a standard block group size or
      not. The flag is used to determine whether we can use the bit shift
      optimization or not.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      68911009
    • D
      ext4: unregister es_shrinker if mount failed · a75ae78f
      Dmitry Monakhov 提交于
      Otherwise destroyed ext_sb_info will be part of global shinker list
      and result in the following OOPS:
      
      JBD2: corrupted journal superblock
      JBD2: recovery failed
      EXT4-fs (dm-2): error loading journal
      general protection fault: 0000 [#1] SMP
      Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\
      mod
      CPU 1
      Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136                  /DH55TC
      RIP: 0010:[<ffffffff811bfb2d>]  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
      RSP: 0000:ffff88011d5cbcd8  EFLAGS: 00010207
      RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246
      RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001
      R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848
      R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000
      FS:  00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80)
      Stack:
      ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1
      00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000
      ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000
      Call Trace:
      [<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0
      [<ffffffff81249381>] mount_bdev+0x331/0x340
      [<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180
      [<ffffffff81362035>] ext4_mount+0x15/0x20
      [<ffffffff8124869a>] mount_fs+0x9a/0x2e0
      [<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170
      [<ffffffff81279c02>] do_new_mount+0x172/0x2e0
      [<ffffffff8127aa56>] do_mount+0x376/0x380
      [<ffffffff8127ab98>] sys_mount+0x138/0x150
      [<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b
      Code: 8b 05 88 04 eb 00 48 3d 90 ff 06 82 48 8d 58 e8 75 19 4c 89 e7 e8 e4 d7 2c 00 48 c7 c7 00 ff 06 82 e8 58 5f ef ff 5b 41 5c c9 c3 <48> 8b 4b 18 48 8b 73 20 48 89 da 31 c0 48 c7 c7 c5 a0 e4 81 e\
      8
      RIP  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
      RSP <ffff88011d5cbcd8>
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      a75ae78f
    • D
      ext4: fix journal callback list traversal · 5d3ee208
      Dmitry Monakhov 提交于
      It is incorrect to use list_for_each_entry_safe() for journal callback
      traversial because ->next may be removed by other task:
      ->ext4_mb_free_metadata()
        ->ext4_mb_free_metadata()
          ->ext4_journal_callback_del()
      
      This results in the following issue:
      
      WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
      Hardware name:
      list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
      Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
      Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
      Call Trace:
       [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
       [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
       [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
       [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
       [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
       [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
       [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
       [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
       [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
       [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
       [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
       [<ffffffff810ac6be>] kthread+0x10e/0x120
       [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
       [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
       [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
      
      This patch fix the issue as follows:
      - ext4_journal_commit_callback() make list truly traversial safe
        simply by always starting from list_head
      - fix race between two ext4_journal_callback_del() and
        ext4_journal_callback_try_del()
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.com
      5d3ee208
  8. 13 3月, 2013 1 次提交
    • E
      fs: Readd the fs module aliases. · fa7614dd
      Eric W. Biederman 提交于
      I had assumed that the only use of module aliases for filesystems
      prior to "fs: Limit sys_mount to only request filesystem modules."
      was in request_module.  It turns out I was wrong.  At least mkinitcpio
      in Arch linux uses these aliases.
      
      So readd the preexising aliases, to keep from breaking userspace.
      
      Userspace eventually will have to follow and use the same aliases the
      kernel does.  So at some point we may be delete these aliases without
      problems.  However that day is not today.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      fa7614dd
  9. 12 3月, 2013 1 次提交
  10. 04 3月, 2013 1 次提交
    • E
      fs: Limit sys_mount to only request filesystem modules. · 7f78e035
      Eric W. Biederman 提交于
      Modify the request_module to prefix the file system type with "fs-"
      and add aliases to all of the filesystems that can be built as modules
      to match.
      
      A common practice is to build all of the kernel code and leave code
      that is not commonly needed as modules, with the result that many
      users are exposed to any bug anywhere in the kernel.
      
      Looking for filesystems with a fs- prefix limits the pool of possible
      modules that can be loaded by mount to just filesystems trivially
      making things safer with no real cost.
      
      Using aliases means user space can control the policy of which
      filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
      with blacklist and alias directives.  Allowing simple, safe,
      well understood work-arounds to known problematic software.
      
      This also addresses a rare but unfortunate problem where the filesystem
      name is not the same as it's module name and module auto-loading
      would not work.  While writing this patch I saw a handful of such
      cases.  The most significant being autofs that lives in the module
      autofs4.
      
      This is relevant to user namespaces because we can reach the request
      module in get_fs_type() without having any special permissions, and
      people get uncomfortable when a user specified string (in this case
      the filesystem type) goes all of the way to request_module.
      
      After having looked at this issue I don't think there is any
      particular reason to perform any filtering or permission checks beyond
      making it clear in the module request that we want a filesystem
      module.  The common pattern in the kernel is to call request_module()
      without regards to the users permissions.  In general all a filesystem
      module does once loaded is call register_filesystem() and go to sleep.
      Which means there is not much attack surface exposed by loading a
      filesytem module unless the filesystem is mounted.  In a user
      namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
      which most filesystems do not set today.
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Reported-by: NKees Cook <keescook@google.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      7f78e035
  11. 03 3月, 2013 4 次提交
  12. 02 3月, 2013 1 次提交
  13. 23 2月, 2013 1 次提交
  14. 18 2月, 2013 2 次提交
    • Z
      ext4: reclaim extents from extent status tree · 74cd15cd
      Zheng Liu 提交于
      Although extent status is loaded on-demand, we also need to reclaim
      extent from the tree when we are under a heavy memory pressure because
      in some cases fragmented extent tree causes status tree costs too much
      memory.
      
      Here we maintain a lru list in super_block.  When the extent status of
      an inode is accessed and changed, this inode will be move to the tail
      of the list.  The inode will be dropped from this list when it is
      cleared.  In the inode, a counter is added to count the number of
      cached objects in extent status tree.  Here only written/unwritten/hole
      extent is counted because delayed extent doesn't be reclaimed due to
      fiemap, bigalloc and seek_data/hole need it.  The counter will be
      increased as a new extent is allocated, and it will be decreased as a
      extent is freed.
      
      In this commit we use normal shrinker framework to reclaim memory from
      the status tree.  ext4_es_reclaim_extents_count() traverses the lru list
      to count the number of reclaimable extents.  ext4_es_shrink() tries to
      reclaim written/unwritten/hole extents from extent status tree.  The
      inode that has been shrunk is moved to the tail of lru list.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      74cd15cd
    • Z
      ext4: remove single extent cache · 69eb33dc
      Zheng Liu 提交于
      Single extent cache could be removed because we have extent status tree
      as a extent cache, and it would be better.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      69eb33dc
  15. 09 2月, 2013 2 次提交
    • T
      ext4: pass context information to jbd2__journal_start() · 9924a92a
      Theodore Ts'o 提交于
      So we can better understand what bits of ext4 are responsible for
      long-running jbd2 handles, use jbd2__journal_start() so we can pass
      context information for logging purposes.
      
      The recommended way for finding the longer-running handles is:
      
         T=/sys/kernel/debug/tracing
         EVENT=$T/events/jbd2/jbd2_handle_stats
         echo "interval > 5" > $EVENT/filter
         echo 1 > $EVENT/enable
      
         ./run-my-fs-benchmark
      
         cat $T/trace > /tmp/problem-handles
      
      This will list handles that were active for longer than 20ms.  Having
      longer-running handles is bad, because a commit started at the wrong
      time could stall for those 20+ milliseconds, which could delay an
      fsync() or an O_SYNC operation.  Here is an example line from the
      trace file describing a handle which lived on for 311 jiffies, or over
      1.2 seconds:
      
      postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
         tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
         dirtied_blocks 0
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      9924a92a
    • T
      ext4: move the jbd2 wrapper functions out of super.c · 722887dd
      Theodore Ts'o 提交于
      Move the jbd2 wrapper functions which start and stop handles out of
      super.c, where they don't really logically belong, and into
      ext4_jbd2.c.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      722887dd
  16. 03 2月, 2013 4 次提交
  17. 29 1月, 2013 1 次提交
  18. 28 1月, 2013 2 次提交
  19. 25 1月, 2013 2 次提交
  20. 13 1月, 2013 1 次提交
  21. 27 12月, 2012 1 次提交
  22. 26 12月, 2012 2 次提交
  23. 20 12月, 2012 1 次提交
  24. 11 12月, 2012 2 次提交
    • C
      ext4: ensure Inode flags consistency are checked at build time · 9a4c8019
      Carlos Maiolino 提交于
      
      Flags being used by atomic operations in inode flags (e.g.
      ext4_test_inode_flag(), should be consistent with that actually stored
      in inodes, i.e.: EXT4_XXX_FL.
      
      It ensures that this consistency is checked at build-time, not at
      run-time.
      
      Currently, the flags consistency are being checked at run-time, but,
      there is no real reason to not do a build-time check instead of a
      run-time check. The code is comparing macro defined values with enum
      type variables, where both are constants, so, there is no problem in
      comparing constants at build-time.
      
      enum variables are treated as constants by the C compiler, according
      to the C99 specs (see www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf 
      sec. 6.2.5, item 16), so, there is no real problem in comparing an
      enumeration type at build time
      Signed-off-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      9a4c8019
    • T
      ext4: Remove CONFIG_EXT4_FS_XATTR · 939da108
      Tao Ma 提交于
      Ted has sent out a RFC about removing this feature. Eric and Jan
      confirmed that both RedHat and SUSE enable this feature in all their
      product.  David also said that "As far as I know, it's enabled in all
      Android kernels that use ext4."  So it seems OK for us.
      
      And what's more, as inline data depends its implementation on xattr,
      and to be frank, I don't run any test again inline data enabled while
      xattr disabled.  So I think we should add inline data and remove this
      config option in the same release.
      
      [ The savings if you disable CONFIG_EXT4_FS_XATTR is only 27k, which
        isn't much in the grand scheme of things.  Since no one seems to be
        testing this configuration except for some automated compile farms, on
        balance we are better removing this config option, and so that it is
        effectively always enabled. -- tytso ]
      
      Cc: David Brown <davidb@codeaurora.org>
      Cc: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      939da108