1. 11 1月, 2011 3 次提交
  2. 20 12月, 2010 2 次提交
  3. 16 12月, 2010 3 次提交
  4. 15 12月, 2010 1 次提交
    • T
      ext4: Turn off multiple page-io submission by default · 1449032b
      Theodore Ts'o 提交于
      Jon Nelson has found a test case which causes postgresql to fail with
      the error:
      
      psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581
      
      Under memory pressure, it looks like part of a file can end up getting
      replaced by zero's.  Until we can figure out the cause, we'll roll
      back the change and use block_write_full_page() instead of
      ext4_bio_write_page().  The new, more efficient writing function can
      be used via the mount option mblk_io_submit, so we can test and fix
      the new page I/O code.
      
      To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
      memory such that the system just at the end of triggering writeback
      before running the following sql script:
      
      begin;
      create temporary table foo as select x as a, ARRAY[x] as b FROM
      generate_series(1, 10000000 ) AS x;
      create index foo_a_idx on foo (a);
      create index foo_b_idx on foo USING GIN (b);
      rollback;
      
      If the temporary table is created on a hard drive partition which is
      encrypted using dm_crypt, then under memory pressure, approximately
      30-40% of the time, pgsql will issue the above failure.
      
      This patch should fix this problem, and the problem will come back if
      the file system is mounted with the mblk_io_submit mount option.
      Reported-by: NJon Nelson <jnelson@jamponi.net>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      1449032b
  5. 09 11月, 2010 2 次提交
    • T
      ext4: fix potential race when freeing ext4_io_page structures · 83668e71
      Theodore Ts'o 提交于
      Use an atomic_t and make sure we don't free the structure while we
      might still be submitting I/O for that page.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      83668e71
    • T
      ext4: handle writeback of inodes which are being freed · f7ad6d2e
      Theodore Ts'o 提交于
      The following BUG can occur when an inode which is getting freed when
      it still has dirty pages outstanding, and it gets deleted (in this
      because it was the target of a rename).  In ordered mode, we need to
      make sure the data pages are written just in case we crash before the
      rename (or unlink) is committed.  If the inode is being freed then
      when we try to igrab the inode, we end up tripping the BUG_ON at
      fs/ext4/page-io.c:146.
      
      To solve this problem, we need to keep track of the number of io
      callbacks which are pending, and avoid destroying the inode until they
      have all been completed.  That way we don't have to bump the inode
      count to keep the inode from being destroyed; an approach which
      doesn't work because the count could have already been dropped down to
      zero before the inode writeback has started (at which point we're not
      allowed to bump the count back up to 1, since it's already started
      getting freed).
      
      Thanks to Dave Chinner for suggesting this approach, which is also
      used by XFS.
      
        kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
        Call Trace:
         [<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
         [<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
         [<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
         [<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
         [<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
         [<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
         [<ffffffff81087910>] do_writepages+0x1c/0x25
         [<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
         [<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
         [<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
         [<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
         [<ffffffff810c14a3>] evict+0x22/0x92
         [<ffffffff810c1a3d>] iput+0x212/0x249
         [<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
         [<ffffffff810bdf6b>] d_kill+0x3d/0x5d
         [<ffffffff810be613>] dput+0x13a/0x147
         [<ffffffff810b990d>] sys_renameat+0x1b5/0x258
         [<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
         [<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
         [<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
         [<ffffffff810b99c6>] sys_rename+0x16/0x18
         [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
      Reported-by: NNick Bowler <nbowler@elliptictech.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Tested-by: NNick Bowler <nbowler@elliptictech.com>
      f7ad6d2e
  6. 28 10月, 2010 12 次提交
    • E
      ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static · eee4adc7
      Eric Sandeen 提交于
      These functions are only used within fs/ext4/mballoc.c, so move them
      so they are used after they are defined, and then make them be static.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      eee4adc7
    • T
      ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end() · 61d08673
      Theodore Ts'o 提交于
      Fix a namespace leak from fs/ext4
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      
      61d08673
    • T
      ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static · 4a873a47
      Theodore Ts'o 提交于
      Fix a namespace leak by moving the function to the file where it is
      used and making it static.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      4a873a47
    • T
      ext4: make various ext4 functions be static · 1f109d5a
      Theodore Ts'o 提交于
      These functions have no need to be exported beyond file context.
      
      No functions needed to be moved for this commit; just some function
      declarations changed to be static and removed from header files.
      
      (A similar patch was submitted by Eric Sandeen, but I wanted to handle
      code movement in separate patches to make sure code changes didn't
      accidentally get dropped.)
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      1f109d5a
    • T
      ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() · 5dabfc78
      Theodore Ts'o 提交于
      This is a cleanup to avoid namespace leaks out of fs/ext4
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      5dabfc78
    • L
      ext4: Add batched discard support for ext4 · 7360d173
      Lukas Czerner 提交于
      Walk through allocation groups and trim all free extents. It can be
      invoked through FITRIM ioctl on the file system. The main idea is to
      provide a way to trim the whole file system if needed, since some SSD's
      may suffer from performance loss after the whole device was filled (it
      does not mean that fs is full!).
      
      It search for free extents in allocation groups specified by Byte range
      start -> start+len. When the free extent is within this range, blocks
      are marked as used and then trimmed. Afterwards these blocks are marked
      as free in per-group bitmap.
      
      Since fstrim is a long operation it is good to have an ability to
      interrupt it by a signal. This was added by Dmitry Monakhov.
      Thanks Dimitry.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      7360d173
    • T
      ext4: use bio layer instead of buffer layer in mpage_da_submit_io · bd2d0210
      Theodore Ts'o 提交于
      Call the block I/O layer directly instad of going through the buffer
      layer.  This should give us much better performance and scalability,
      as well as lowering our CPU utilization when doing buffered writeback.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      bd2d0210
    • E
      ext4: remove unused ext4_sb_info members · 640e9396
      Eric Sandeen 提交于
      Not that these take up a lot of room, but the structure is long enough
      as it is, and there's no need to confuse people with these various
      undocumented & unused structure members...
      Signed-off-by: NEric Sandeen <sandeen@redaht.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      640e9396
    • T
      ext4: improve llseek error handling for overly large seek offsets · e0d10bfa
      Toshiyuki Okajima 提交于
      The llseek system call should return EINVAL if passed a seek offset
      which results in a write error.  What this maximum offset should be
      depends on whether or not the huge_file file system feature is set,
      and whether or not the file is extent based or not.
      
      
      If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be 
      written (write systemcall) is different from the maximum size which can be 
      sought (lseek systemcall).
      
      For example, the following 2 cases demonstrates the differences
      between the maximum size which can be written, versus the seek offset
      allowed by the llseek system call:
      
      #1: mkfs.ext3 <dev>; mount -t ext4 <dev>
      #2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>
      
      Table. the max file size which we can write or seek
             at each filesystem feature tuning and file flag setting
      +============+===============================+===============================+
      | \ File flag|                               |                               |
      |      \     |     !EXT4_EXTENTS_FL          |        EXT4_EXTETNS_FL        |
      |case       \|                               |                               |
      +------------+-------------------------------+-------------------------------+
      | #1         |   write:      2194719883264   | write:       --------------   |
      |            |   seek:       2199023251456   | seek:        --------------   |
      +------------+-------------------------------+-------------------------------+
      | #2         |   write:      4402345721856   | write:       17592186044415   |
      |            |   seek:      17592186044415   | seek:        17592186044415   |
      +------------+-------------------------------+-------------------------------+
      
      The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes
      (= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped 
      maxbytes).  Although generic_file_llseek uses only extent-mapped maxbytes.
      (llseek of ext4_file_operations is generic_file_llseek which uses
      sb->s_maxbytes.)
      
      Therefore we create ext4 llseek function which uses 2 maxbytes.
      
      The new own function originates from generic_file_llseek().
      If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters 
      inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes.
      Signed-off-by: NToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      e0d10bfa
    • L
      ext4: add interface to advertise ext4 features in sysfs · 857ac889
      Lukas Czerner 提交于
      User-space should have the opportunity to check what features doest ext4
      support in each particular copy. This adds easy interface by creating new
      "features" directory in sys/fs/ext4/. In that directory files
      advertising feature names can be created.
      
      Add lazy_itable_init to the feature list.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      857ac889
    • L
      ext4: add support for lazy inode table initialization · bfff6873
      Lukas Czerner 提交于
      When the lazy_itable_init extended option is passed to mke2fs, it
      considerably speeds up filesystem creation because inode tables are
      not zeroed out.  The fact that parts of the inode table are
      uninitialized is not a problem so long as the block group descriptors,
      which contain information regarding how much of the inode table has
      been initialized, has not been corrupted However, if the block group
      checksums are not valid, e2fsck must scan the entire inode table, and
      the the old, uninitialized data could potentially cause e2fsck to
      report false problems.
      
      Hence, it is important for the inode tables to be initialized as soon
      as possble.  This commit adds this feature so that mke2fs can safely
      use the lazy inode table initialization feature to speed up formatting
      file systems.
      
      This is done via a new new kernel thread called ext4lazyinit, which is
      created on demand and destroyed, when it is no longer needed.  There
      is only one thread for all ext4 filesystems in the system. When the
      first filesystem with inititable mount option is mounted, ext4lazyinit
      thread is created, then the filesystem can register its request in the
      request list.
      
      This thread then walks through the list of requests picking up
      scheduled requests and invoking ext4_init_inode_table(). Next schedule
      time for the request is computed by multiplying the time it took to
      zero out last inode table with wait multiplier, which can be set with
      the (init_itable=n) mount option (default is 10).  We are doing
      this so we do not take the whole I/O bandwidth. When the thread is no
      longer necessary (request list is empty) it frees the appropriate
      structures and exits (and can be created later later by another
      filesystem).
      
      We do not disturb regular inode allocations in any way, it just do not
      care whether the inode table is, or is not zeroed. But when zeroing, we
      have to skip used inodes, obviously. Also we should prevent new inode
      allocations from the group, while zeroing is on the way. For that we
      take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
      in the ext4_claim_inode, so when we are unlucky and allocator hits the
      group which is currently being zeroed, it just has to wait.
      
      This can be suppresed using the mount option no_init_itable.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      bfff6873
    • C
      ext4: use dedicated slab caches for group_info structures · fb1813f4
      Curt Wohlgemuth 提交于
      ext4_group_info structures are currently allocated with kmalloc().
      With a typical 4K block size, these are 136 bytes each -- meaning
      they'll each consume a 256-byte slab object.  On a system with many
      ext4 large partitions, that's a lot of wasted kernel slab space.
      (E.g., a single 1TB partition will have about 8000 block groups, using
      about 2MB of slab, of which nearly 1MB is wasted.)
      
      This patch creates an array of slab pointers created as needed --
      depending on the superblock block size -- and uses these slabs to
      allocate the group info objects.
      
      Google-Bug-Id: 2980809
      Signed-off-by: NCurt Wohlgemuth <curtw@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      fb1813f4
  7. 10 8月, 2010 1 次提交
  8. 05 8月, 2010 1 次提交
    • E
      ext4: re-inline ext4_rec_len_(to|from)_disk functions · 0cfc9255
      Eric Sandeen 提交于
      commit 3d0518f4, "ext4: New rec_len encoding for very
      large blocksizes" made several changes to this path, but from
      a perf perspective, un-inlining ext4_rec_len_from_disk() seems
      most significant.  This function is called from ext4_check_dir_entry(),
      which on a file-creation workload is called extremely often.
      
      I tested this with bonnie:
      
      # bonnie++ -u root -s 0 -f -x 200 -d /mnt/test -n 32
      
      (this does 200 iterations) and got this for the file creations:
      
      ext4 stock:   Average =  21206.8 files/s
      ext4 inlined: Average =  22346.7 files/s  (+5%)
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      0cfc9255
  9. 02 8月, 2010 1 次提交
  10. 27 7月, 2010 7 次提交
  11. 30 6月, 2010 1 次提交
  12. 29 6月, 2010 2 次提交
  13. 15 6月, 2010 1 次提交
  14. 12 6月, 2010 1 次提交
    • T
      ext4: Clean up s_dirt handling · a0375156
      Theodore Ts'o 提交于
      We don't need to set s_dirt in most of the ext4 code when journaling
      is enabled.  In ext3/4 some of the summary statistics for # of free
      inodes, blocks, and directories are calculated from the per-block
      group statistics when the file system is mounted or unmounted.  As a
      result the superblock doesn't have to be updated, either via the
      journal or by setting s_dirt.  There are a few exceptions, most
      notably when resizing the file system, where the superblock needs to
      be modified --- and in that case it should be done as a journalled
      operation if possible, and s_dirt set only in no-journal mode.
      
      This patch will optimize out some unneeded disk writes when using ext4
      with a journal.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      a0375156
  15. 28 5月, 2010 1 次提交
  16. 17 5月, 2010 1 次提交