1. 02 Nov, 2010 1 commit
  2. 29 Oct, 2010 3 commits
  3. 28 Oct, 2010 36 commits
    • D
      ext4: optimize orphan_list handling for ext4_setattr · 3d287de3
      Dmitry Monakhov committed
      Surprisingly, chown() on ext4 is not an SMP-scalable operation.
      The unconditional orphan_del(NULL, inode) in ext4_setattr()
      results in significant performance overhead because of the global
      orphan mutex, especially in no-journal mode (where orphan_add() is
      a no-op). The explicit orphan_del can be skipped whenever possible.
      Results of an fchown() micro-benchmark in no-journal mode:
      while (1) {
         iteration++;
         fchown(fd, uid, gid);
         fchown(fd, uid + 1, gid + 1);
      }
      measured: iterations per millisecond
      | nr_tasks | w/o patch | with patch |
      |        1 |       142 |        185 |
      |        4 |       109 |        642 |
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      3d287de3
    • N
    • K
      ext4: fix compile error in ext4_fallocate() · a6371b63
      Kazuya Mio committed
      When I compiled the 2.6.36-rc3 kernel with EXT4FS_DEBUG defined, I got
      the following compile error.
      
        CC [M]  fs/ext4/extents.o
      fs/ext4/extents.c: In function 'ext4_fallocate':
      fs/ext4/extents.c:3772: error: 'block' undeclared (first use in this function)
      fs/ext4/extents.c:3772: error: (Each undeclared identifier is reported only once
      fs/ext4/extents.c:3772: error: for each function it appears in.)
      make[2]: *** [fs/ext4/extents.o] Error 1
      
      The patch fixes this problem.
      Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      a6371b63
    • E
      ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static · eee4adc7
      Eric Sandeen committed
      These functions are only used within fs/ext4/mballoc.c, so move them
      so that they are defined before they are used, and then make them static.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      eee4adc7
    • T
      ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end() · 61d08673
      Theodore Ts'o committed
      Fix a namespace leak from fs/ext4.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      
      61d08673
    • T
      ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static · 4a873a47
      Theodore Ts'o committed
      Fix a namespace leak by moving the function to the file where it is
      used and making it static.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      4a873a47
    • T
      ext4: rename {ext,idx}_pblock and inline small extent functions · bf89d16f
      Theodore Ts'o committed
      Clean up namespace leaks from fs/ext4 and inline the trivial functions
      ext4_{ext,idx}_pblock() and ext4_{ext,idx}_store_pblock(); they are so
      trivial that the code size actually shrinks when we make these
      functions inline.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      bf89d16f
    • T
      ext4: make various ext4 functions be static · 1f109d5a
      Theodore Ts'o committed
      These functions have no need to be exported beyond file context.
      
      No functions needed to be moved for this commit; just some function
      declarations changed to be static and removed from header files.
      
      (A similar patch was submitted by Eric Sandeen, but I wanted to handle
      code movement in separate patches to make sure code changes didn't
      accidentally get dropped.)
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      1f109d5a
    • T
      ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() · 5dabfc78
      Theodore Ts'o committed
      This is a cleanup to avoid namespace leaks out of fs/ext4.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      5dabfc78
    • T
      ext4: fix kernel oops if the journal superblock has a non-zero j_errno · 7f93cff9
      Theodore Ts'o committed
      Commit 84061e07 fixed an accounting bug only to introduce the
      possibility of a kernel OOPS if the journal has a non-zero j_errno
      field indicating that the file system had detected a fs inconsistency.
      After the journal replay, if the journal superblock indicates that the
      file system has an error, this indication is transferred to the file
      system and then ext4_commit_super() is called to write this to the
      disk.

      But since the percpu counters are now initialized after the journal
      replay, the call to ext4_commit_super() will cause a kernel oops since
      it needs to use the percpu counters in the ext4 superblock structure.
      
      The fix is to skip setting the ext4 free block and free inode fields
      if the percpu counter has not been set.
      
      Thanks to Ken Sumrall for reporting and analyzing the root causes of
      this bug.
      
      Addresses-Google-Bug: #3054080
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      7f93cff9
    • E
      ext4: update writeback_index based on last page scanned · 72f84e65
      Eric Sandeen committed
      As pointed out in a prior patch, updating the mapping's
      writeback_index based on pages written isn't quite right;
      what the writeback index is really supposed to reflect is
      the next page which should be scanned for writeback during
      periodic flush.
      
      As in write_cache_pages(), write_cache_pages_da() does
      this scanning for us as we assemble the mpd for later
      writeout.  If we keep track of the next page after the
      current scan, we can easily update writeback_index without
      worrying about pages written vs. pages skipped, etc.
      
      Without this, an fsync will reset writeback_index to
      0 (its starting index) + however many pages it wrote, which
      can mess up the progress of periodic flush.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      72f84e65
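The difference between advancing writeback_index by pages written and advancing it to the next unscanned page can be seen in a small simulation. This is only a sketch: `scan_result`, `scan_range`, and the boolean dirty array are hypothetical stand-ins for illustration, not kernel structures or functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result of one writeback scan (not a kernel type). */
struct scan_result {
    size_t pages_written; /* dirty pages found and written */
    size_t next_index;    /* first page index not yet scanned */
};

/*
 * Scan up to max_scan pages starting at start. Clean pages are skipped
 * but still count as scanned, so next_index moves past them.
 */
static struct scan_result scan_range(const bool *dirty, size_t npages,
                                     size_t start, size_t max_scan)
{
    struct scan_result r = { 0, start };

    for (size_t i = start; i < npages && i - start < max_scan; i++) {
        if (dirty[i])
            r.pages_written++;
        r.next_index = i + 1;
    }
    return r;
}
```

With dirty pages at indices 0 and 3 and a scan of four pages from index 0, `next_index` is 4, whereas `start + pages_written` would be 2 and would cause pages 2 and 3 to be scanned again on the next pass.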
    • E
      ext4: implement writeback livelock avoidance using page tagging · 5b41d924
      Eric Sandeen committed
      This is analogous to Jan Kara's commit,
      f446daae
      mm: implement writeback livelock avoidance using page tagging
      
      but since we forked write_cache_pages, we need to reimplement
      it there (and in ext4_da_writepages, since range_cyclic handling
      was moved to there)
      
      If you start a large buffered IO to a file, and then run
      fsync after it, you'll find that fsync does not complete
      until the other IO stops.
      
      If you continue re-dirtying the file (say, putting dd
      with conv=notrunc in a loop), when fsync finally completes
      (after all IO is done), it reports via tracing that
      it has written many more pages than the file contains;
      in other words it has synced and re-synced pages in
      the file multiple times.
      
      This then leads to problems with our writeback_index
      update, since it advances it by pages written, and
      essentially sets writeback_index off the end of the
      file...
      
      With the following patch, we only sync as much as was
      dirty at the time of the sync.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      5b41d924
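The idea of syncing only what was dirty when the sync began can be sketched as a two-phase pass: first tag the currently dirty pages, then write only tagged pages. The types and function names below (`sim_page`, `tag_pages_for_writeback`, `write_tagged_pages`) are hypothetical and exist only for this illustration; the kernel tracks this state in its page cache tags, which this sketch does not model.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state for this sketch; not the kernel's page flags. */
struct sim_page {
    bool dirty;   /* page holds data that still needs writing */
    bool towrite; /* page was dirty when the sync started */
};

/* Phase 1: snapshot the dirty set by tagging every page that is dirty now. */
static size_t tag_pages_for_writeback(struct sim_page *pages, size_t n)
{
    size_t tagged = 0;

    for (size_t i = 0; i < n; i++) {
        if (pages[i].dirty) {
            pages[i].towrite = true;
            tagged++;
        }
    }
    return tagged;
}

/*
 * Phase 2: write only tagged pages. After each write, a simulated
 * concurrent writer dirties page redirty_idx again. That page is dirty
 * but untagged, so this sync leaves it for a later one.
 */
static size_t write_tagged_pages(struct sim_page *pages, size_t n,
                                 size_t redirty_idx)
{
    size_t written = 0;

    for (size_t i = 0; i < n; i++) {
        if (!pages[i].towrite)
            continue;
        pages[i].towrite = false;
        pages[i].dirty = false;
        written++;
        if (redirty_idx < n)
            pages[redirty_idx].dirty = true;
    }
    return written;
}
```

Because the number of pages written is bounded by the number tagged in phase 1, a writer that keeps dirtying the file cannot extend the sync indefinitely.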
    • E
      ext4: tidy up a void argument in inode.c · bbd08344
      Eric Sandeen committed
      This doesn't fix anything at all, it just removes a vestige
      of prior use from __mpage_da_writepage()

      __mpage_da_writepage() had a void * argument left over from
      its previous life as a callback; make it reflect the actual type.

      Fixing this up makes it slightly more obvious to read, and
      enables proper typechecking.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      bbd08344
    • L
      ext4: add batched_discard into ext4 feature list · 27ee40df
      Lukas Czerner committed
      Should be applied on top of the "lazy inode table initialization"
      and "batched discard support" patch sets.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      27ee40df
    • L
      ext4: Add batched discard support for ext4 · 7360d173
      Lukas Czerner committed
      Walk through allocation groups and trim all free extents. It can be
      invoked through the FITRIM ioctl on the file system. The main idea is to
      provide a way to trim the whole file system if needed, since some SSDs
      may suffer from performance loss after the whole device has been filled
      (this does not mean that the fs is full!).

      It searches for free extents in the allocation groups specified by the
      byte range start -> start+len. When a free extent is within this range,
      its blocks are marked as used and then trimmed. Afterwards these blocks
      are marked as free in the per-group bitmap.

      Since fstrim is a long operation, it is good to be able to
      interrupt it with a signal. This was added by Dmitry Monakhov.
      Thanks Dmitry.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      7360d173
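The walk described in the commit message above (find a free extent in the range, mark it used, trim it, mark it free again) can be sketched in user space over a simple bitmap. This is a rough sketch under several assumptions: the range is given in block numbers rather than bytes, a single flat bitmap stands in for the per-group bitmaps, and `trim_stats`, `issue_discard`, and `trim_free_extents` are hypothetical names that only record what would be discarded.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accounting for this sketch; no device is touched. */
struct trim_stats {
    size_t extents; /* number of discard requests issued */
    size_t blocks;  /* total blocks covered by those requests */
};

/* Stand-in for a real discard: just record the request. */
static void issue_discard(struct trim_stats *st, size_t first, size_t count)
{
    (void)first;
    st->extents++;
    st->blocks += count;
}

/*
 * Trim every run of free blocks inside [start, start + len).
 * in_use[i] == true means block i is allocated.
 * Returns the total number of blocks recorded in st.
 */
static size_t trim_free_extents(bool *in_use, size_t nblocks,
                                size_t start, size_t len,
                                struct trim_stats *st)
{
    size_t end = start + len;

    if (end > nblocks)
        end = nblocks;

    size_t i = start;
    while (i < end) {
        if (in_use[i]) {
            i++;
            continue;
        }
        size_t first = i;
        while (i < end && !in_use[i])
            i++;
        size_t count = i - first;

        /* Mark the extent used so nothing allocates it during the trim. */
        for (size_t b = first; b < first + count; b++)
            in_use[b] = true;
        issue_discard(st, first, count);
        /* Trim done: hand the blocks back as free. */
        for (size_t b = first; b < first + count; b++)
            in_use[b] = false;
    }
    return st->blocks;
}
```

Marking the extent as used for the duration of the discard is what keeps an allocator from handing out blocks that are about to be trimmed; in this single-threaded sketch it has no visible effect beyond showing the order of the steps.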
    • L
      ext4: Use return value from sb_issue_discard() · 77ca6cdf
      Lukas Czerner committed
      Use the return value from sb_issue_discard() as the return value of
      ext4_issue_discard(). Since sb_issue_discard() may result in more
      serious errors than just -EOPNOTSUPP, it is worth informing callers of
      this function about them so that error cases can be handled properly.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      77ca6cdf
    • N
      ext4: Check return value of sb_getblk() and friends · 87783690
      Namhyung Kim committed
      Fail block allocation if sb_getblk() returns NULL. In that case,
      sb_find_get_block() is also likely to fail, so calling ext4_forget()
      should be skipped.
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      87783690
    • T
      ext4: use bio layer instead of buffer layer in mpage_da_submit_io · bd2d0210
      Theodore Ts'o committed
      Call the block I/O layer directly instead of going through the buffer
      layer.  This should give us much better performance and scalability,
      as well as lowering our CPU utilization when doing buffered writeback.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      bd2d0210
    • T
      ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io() · 1de3e3df
      Theodore Ts'o committed
      This massively simplifies the ext4_da_writepages() code path by
      completely removing mpage_put_bnr_to_bhs(), which is almost 100 lines of
      code iterating over a set of pages using pagevec_lookup(), and folds
      that functionality into mpage_da_submit_io()'s existing
      pagevec_lookup() loop.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      1de3e3df
    • T
      ext4: inline walk_page_buffers() into mpage_da_submit_io · 3ecdb3a1
      Theodore Ts'o committed
      Expand the call:

        if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
                              ext4_bh_delay_or_unwritten))
                goto redirty_page;

      into mpage_da_submit_io().

      This will allow us to merge in mpage_put_bnr_to_bhs() in the next
      patch.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      3ecdb3a1
    • T
      ext4: inline ext4_writepage() into mpage_da_submit_io() · cb20d518
      Theodore Ts'o committed
      As a preparatory step to switching to bio_submit, inline
      ext4_writepage() into mpage_da_submit_io() and then simplify things a
      bit.  This makes it clearer what mpage_da_submit_io() needs to do.
      
      Also, move the ClearPageChecked(page) call into
      __ext4_journalled_writepage(), as a minor bit of cleanup refactoring.
      
      This also allows us to pull i_size_read() and
      ext4_should_journal_data() out of the loop, which should be a very
      minor CPU savings.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      cb20d518
    • T
      ext4: simplify ext4_writepage() · a42afc5f
      Theodore Ts'o committed
      The actual code in ext4_writepage() is unnecessarily convoluted.
      Simplify it so it is easier to understand, but otherwise logically
      equivalent.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      a42afc5f
    • T
      ext4: call mpage_da_submit_io() from mpage_da_map_blocks() · 5a87b7a5
      Theodore Ts'o committed
      Eventually we need to completely reorganize the ext4 writepage
      callpath, but for now, we simplify things a little by calling
      mpage_da_submit_io() from mpage_da_map_blocks(), since every call
      to mpage_da_map_blocks() is followed by a call to
      mpage_da_submit_io().

      We're also a wee bit better with respect to error handling, but there
      are still a number of issues where it's not clear what the right thing
      to do is when ext4 functions deep in the writeback codepath fail.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      5a87b7a5
    • T
      ext4: use KMEM_CACHE instead of kmem_cache_create · 16828088
      Theodore Ts'o committed
      Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem
      cache.  This slab tends to be fairly static, so it shouldn't be marked
      as likely to have free pages that can be reclaimed.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      
      16828088
    • T
      ext4: use search_dirblock() in ext4_dx_find_entry() · 7845c049
      Theodore Ts'o committed
      Use search_dirblock() in ext4_dx_find_entry().  It makes the code
      easier to read, and it takes advantage of common code.  It also saves
      100 bytes or so of text space.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Brad Spengler <spender@grsecurity.net>
      7845c049
    • T
      ext4: avoid uninitialized memory references in ext3_htree_next_block() · 8941ec8b
      Theodore Ts'o committed
      If the first block of htree directory is missing '.' or '..' but is
      otherwise a valid directory, and we do a lookup for '.' or '..', it's
      possible to dereference an uninitialized memory pointer in
      ext4_htree_next_block().
      
      We avoid this by moving the special case from ext4_dx_find_entry() to
      ext4_find_entry(); this also means we can optimize ext4_find_entry()
      slightly when NFS looks up "..".
      
      Thanks to Brad Spengler for pointing out a Clang warning that led me to
      look more closely at this code.  The warning was harmless, but it was
      useful in pointing out code that was too ugly to live.  This warning was
      also reported by Roman Borisov.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Brad Spengler <spender@grsecurity.net>
      8941ec8b
    • E
      ext4: remove unused ext4_sb_info members · 640e9396
      Eric Sandeen committed
      Not that these take up a lot of room, but the structure is long enough
      as it is, and there's no need to confuse people with these various
      undocumented & unused structure members...
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      640e9396
    • E
      ext4: queue conversion after adding to inode's completed IO list · c999af2b
      Eric Sandeen committed
      By queuing the io end on the unwritten workqueue before adding it
      to our inode's list of completed IOs, I think we run the risk
      of the work getting completed, and the IO freed, before we try
      to add it to the inode's i_completed_io_list.
      
      It should be safe to add it to the inode's list of completed
      IOs, and -then- queue it for completion, I think.
      
      Thanks to Dave Chinner for pointing out the race.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Jiaying Zhang <jiayingz@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      c999af2b
    • E
      ext4: don't use ext4_allocation_contexts for tracing · 3e1e5f50
      Eric Sandeen committed
      Many tracepoints were populating an ext4_allocation_context
      to pass in, but this requires a slab allocation even when
      tracepoints are off.  In fact, 4 of 5 of these allocations
      were only for tracing.  In addition, we were only using a
      small fraction of the 144 bytes of this structure for this
      purpose.
      
      We can do away with all these alloc/frees of the ac and
      simply pass in the bits we care about, instead.
      
      I tested this by turning on tracing and running through
      xfstests on x86_64.  I did not actually do anything with
      the trace output, however.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      3e1e5f50
    • T
      ext4: fix potential infinite loop in ext4_da_writepages() · 0c9169cc
      Toshiyuki Okajima committed
      On linux-2.6.36-rc2, if we execute the following script, we can hang
      the system when the /bin/sync command is executed:
      
      ========================================================================
      #!/bin/sh
      
      echo -n "HANG UP TEST: "
      /bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null
      /sbin/mkfs.ext4 -Fq /tmp/img
      /bin/mount -o loop -t ext4 /tmp/img /mnt
      /bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \
      seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null
      /bin/sync
      /bin/umount /mnt
      echo "DONE"
      exit 0
      ========================================================================
      
      We can see the following backtrace if we get the kdump when this
      hangup occurs:
      
      ======================================================================
      kthread()
      => bdi_writeback_thread()
         => wb_do_writeback()
            => wb_writeback()
               => writeback_inodes_wb()
                  => writeback_sb_inodes()
                     => writeback_single_inode()
                        => ext4_da_writepages()  ---+ 
                                      ^ infinite    |
                                      |   loop      |
                                      +-------------+
      ======================================================================
      
      The reason why this hangup happens is as follows:
      1) We write the last extent block of a file whose size is the
         filesystem maximum size.
      2) The "BH_Delay" flag is set on the buffer_head of that block.
      3) - the member "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX)
         - the member "m_len" of struct mpage_da_data is 1
         mpage_put_bnr_to_bhs(), which is called via ext4_da_writepages(),
         cannot clear the "BH_Delay" flag of the buffer_head because the
         type of m_lblk is ext4_lblk_t, so m_lblk + m_len overflows.

         Therefore an infinite loop occurs, because ext4_da_writepages()
         cannot write the page (which corresponds to the block) while the
         "BH_Delay" flag remains set.
      ----------------------------------------------------------------------
      static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd,
      				struct ext4_map_blocks *map)
      {
      ...
      	int blocks = map->m_len;
      ...
      		do {
      			// cur_logical = 4294967295
      			// map->m_lblk = 4294967295
      			// blocks = 1
      			// *** map->m_lblk + blocks == 0 (OVERFLOW!) ***
      			// (cur_logical >= map->m_lblk + blocks) => true
      			if (cur_logical >= map->m_lblk + blocks)
      				break;
      ----------------------------------------------------------------------
      
      NOTE: Mounting with the nodelalloc option will avoid this codepath,
      and thus avoid this hang.
      Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      0c9169cc
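The wraparound in the excerpt above can be reproduced in user space. In this sketch `lblk_t` models the 32-bit ext4_lblk_t, and the two helper names are hypothetical. Widening the sum to 64 bits is shown as one way to avoid the wrap; it is not necessarily the change the commit itself makes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models a 32-bit logical block number such as ext4_lblk_t. */
typedef uint32_t lblk_t;

/*
 * Same comparison as in the excerpt: the sum is computed in 32 bits,
 * so UINT32_MAX + 1 wraps to 0 and the test is true immediately.
 */
static bool past_end_32(lblk_t cur_logical, lblk_t m_lblk, uint32_t blocks)
{
    return cur_logical >= (lblk_t)(m_lblk + blocks);
}

/* Assumed alternative: do the addition in 64 bits so it cannot wrap. */
static bool past_end_64(lblk_t cur_logical, lblk_t m_lblk, uint32_t blocks)
{
    return (uint64_t)cur_logical >= (uint64_t)m_lblk + blocks;
}
```

For cur_logical = m_lblk = 4294967295 and blocks = 1, `past_end_32` reports that the loop should stop before the block is processed, while `past_end_64` does not.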
    • T
      ext4: improve llseek error handling for overly large seek offsets · e0d10bfa
      Toshiyuki Okajima committed
      The llseek system call should return EINVAL if passed a seek offset
      which results in a write error.  What this maximum offset should be
      depends on whether or not the huge_file file system feature is set,
      and whether or not the file is extent based.

      If the file does not have the "EXT4_EXTENTS_FL" flag, the maximum size
      which can be written (write system call) is different from the maximum
      size which can be sought (lseek system call).

      For example, the following 2 cases demonstrate the differences
      between the maximum size which can be written, versus the seek offset
      allowed by the llseek system call:
      
      #1: mkfs.ext3 <dev>; mount -t ext4 <dev>
      #2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>
      
      Table. the max file size which we can write or seek
             at each filesystem feature tuning and file flag setting
      +============+===============================+===============================+
      | \ File flag|                               |                               |
      |      \     |     !EXT4_EXTENTS_FL          |        EXT4_EXTENTS_FL        |
      |case       \|                               |                               |
      +------------+-------------------------------+-------------------------------+
      | #1         |   write:      2194719883264   | write:       --------------   |
      |            |   seek:       2199023251456   | seek:        --------------   |
      +------------+-------------------------------+-------------------------------+
      | #2         |   write:      4402345721856   | write:       17592186044415   |
      |            |   seek:      17592186044415   | seek:        17592186044415   |
      +------------+-------------------------------+-------------------------------+
      
      The differences exist because ext4 has 2 maxbytes values: sb->s_maxbytes
      (= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped
      maxbytes).  However, generic_file_llseek uses only the extent-mapped
      maxbytes.  (llseek of ext4_file_operations is generic_file_llseek, which
      uses sb->s_maxbytes.)

      Therefore we create an ext4 llseek function which uses both maxbytes values.

      The new function is derived from generic_file_llseek().
      If the file flag "EXT4_EXTENTS_FL" is not set, the function uses
      EXT4_SB(inode->i_sb)->s_bitmap_maxbytes in place of
      inode->i_sb->s_maxbytes.
      Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      e0d10bfa
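The selection between the two limits can be sketched as follows. This is a simplified user-space model, not the kernel function: `fs_limits`, `seek_limit`, and `checked_seek_set` are hypothetical names, only an absolute (SEEK_SET style) offset is modelled, and the limits used in the usage note are taken from case #2 of the table above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two per-filesystem limits. */
struct fs_limits {
    int64_t s_maxbytes;        /* limit for extent-mapped files */
    int64_t s_bitmap_maxbytes; /* limit for block-mapped files */
};

/* Pick the limit that matches how the file is mapped. */
static int64_t seek_limit(const struct fs_limits *fs, bool has_extents_flag)
{
    return has_extents_flag ? fs->s_maxbytes : fs->s_bitmap_maxbytes;
}

/* Returns the new offset, or -EINVAL if it is outside the file's limit. */
static int64_t checked_seek_set(const struct fs_limits *fs,
                                bool has_extents_flag, int64_t offset)
{
    if (offset < 0 || offset > seek_limit(fs, has_extents_flag))
        return -EINVAL;
    return offset;
}
```

With s_maxbytes = 17592186044415 and s_bitmap_maxbytes = 4402345721856, an offset of 4402345721857 is accepted for a file with the extents flag and rejected with -EINVAL for a file without it.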
    • M
      ext4: don't update sb journal_devnum when RO dev · c41303ce
      Maciej Żenczykowski committed
      An ext4 filesystem on a read-only device, with an external journal
      which is at a different device number than recorded in the superblock,
      will fail to honor the read-only setting of the device and trigger
      a superblock update (write).
      
      For example:
        - ext4 on a software raid which is in read-only mode
        - external journal on a read-write device which has changed device num
        - attempt to mount with -o journal_dev=<new_number>
        - hits BUG_ON(mddev->ro == 1) in md.c
      
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      c41303ce
    • L
      ext4: use sb_issue_zeroout in ext4_ext_zeroout · 2407518d
      Lukas Czerner committed
      Change ext4_ext_zeroout to use sb_issue_zeroout instead of its
      own approach to zero out extents.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      2407518d
    • L
      ext4: use sb_issue_zeroout in setup_new_group_blocks · a31437b8
      Lukas Czerner committed
      Use sb_issue_zeroout to zero out inode table and descriptor table
      blocks instead of the old approach, which involves journaling.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      a31437b8
    • L
      ext4: add interface to advertise ext4 features in sysfs · 857ac889
      Lukas Czerner committed
      User-space should have the opportunity to check which features a
      particular copy of ext4 supports. This adds an easy interface by creating
      a new "features" directory in /sys/fs/ext4/. In that directory, files
      advertising feature names can be created.

      Add lazy_itable_init to the feature list.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      857ac889
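From user space, checking for an advertised feature could then be as simple as testing whether the corresponding file exists. The sketch below assumes sysfs is mounted at /sys and that each feature appears as a file named after it under the "features" directory; the helper names `ext4_feature_path` and `ext4_feature_advertised` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed location of the directory described in the commit message. */
#define EXT4_FEATURES_DIR "/sys/fs/ext4/features"

/* Build the path for a feature file; false if buf is too small. */
static bool ext4_feature_path(const char *feature, char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen, "%s/%s", EXT4_FEATURES_DIR, feature);

    return n >= 0 && (size_t)n < buflen;
}

/* True if a file for this feature exists on the running system. */
static bool ext4_feature_advertised(const char *feature)
{
    char path[256];

    if (!ext4_feature_path(feature, path, sizeof(path)))
        return false;
    return access(path, F_OK) == 0;
}
```

A caller might use `ext4_feature_advertised("lazy_itable_init")` or `ext4_feature_advertised("batched_discard")`; the result depends on the running kernel.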
    • L
      ext4: add support for lazy inode table initialization · bfff6873
      Lukas Czerner committed
      When the lazy_itable_init extended option is passed to mke2fs, it
      considerably speeds up filesystem creation because inode tables are
      not zeroed out.  The fact that parts of the inode table are
      uninitialized is not a problem so long as the block group descriptors,
      which contain information regarding how much of the inode table has
      been initialized, have not been corrupted.  However, if the block group
      checksums are not valid, e2fsck must scan the entire inode table, and
      the old, uninitialized data could potentially cause e2fsck to
      report false problems.
      
      Hence, it is important for the inode tables to be initialized as soon
      as possible.  This commit adds this feature so that mke2fs can safely
      use the lazy inode table initialization feature to speed up formatting
      file systems.

      This is done via a new kernel thread called ext4lazyinit, which is
      created on demand and destroyed when it is no longer needed.  There
      is only one thread for all ext4 filesystems in the system. When the
      first filesystem with the init_itable mount option is mounted, the
      ext4lazyinit thread is created, and the filesystem can then register
      its request in the request list.
      
      This thread then walks through the list of requests, picking up
      scheduled requests and invoking ext4_init_inode_table(). The next
      schedule time for a request is computed by multiplying the time it
      took to zero out the last inode table by the wait multiplier, which
      can be set with the (init_itable=n) mount option (default is 10).  We
      do this so that we do not take the whole I/O bandwidth. When the
      thread is no longer necessary (the request list is empty), it frees
      the appropriate structures and exits (and can be created later by
      another filesystem).
      
      We do not disturb regular inode allocations in any way; the allocator
      does not care whether the inode table is zeroed or not. But when
      zeroing, we have to skip used inodes, obviously. We should also
      prevent new inode allocations from the group while zeroing is under
      way. For that we take the alloc_sem lock for writing in
      ext4_init_inode_table() and for reading in ext4_claim_inode(), so when
      we are unlucky and the allocator hits the group which is currently
      being zeroed, it just has to wait.

      This can be suppressed using the mount option no_init_itable.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      bfff6873
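The scheduling rule for the lazy-init thread (wait for the time the last zeroing took, multiplied by the wait multiplier) amounts to a small piece of arithmetic. The sketch below works in milliseconds purely for illustration; the kernel's time unit, the function names, and the `DEFAULT_WAIT_MULT` macro are assumptions, with only the default value of 10 taken from the commit message.

```c
#include <stdint.h>

/* Default wait multiplier stated in the commit message. */
#define DEFAULT_WAIT_MULT 10u

/* Delay before the next request: time the last zeroing took, scaled. */
static uint64_t next_wait_ms(uint64_t last_zeroout_ms, unsigned int wait_mult)
{
    return last_zeroout_ms * wait_mult;
}

/* Absolute time at which the next request should be scheduled. */
static uint64_t next_sched_ms(uint64_t now_ms, uint64_t last_zeroout_ms,
                              unsigned int wait_mult)
{
    return now_ms + next_wait_ms(last_zeroout_ms, wait_mult);
}
```

If zeroing one inode table took 25 ms, the thread would wait 250 ms with the default multiplier, so zeroing uses roughly one part in eleven of the elapsed time, and a slower device automatically gets longer pauses.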