1. 18 10月, 2007 3 次提交
    • A
      Ext4: Uninitialized Block Groups · 717d50e4
      Andreas Dilger 提交于
      In pass1 of e2fsck, every inode table in the fileystem is scanned and checked,
      regardless of whether it is in use.  This is this the most time consuming part
      of the filesystem check.  The unintialized block group feature can greatly
      reduce e2fsck time by eliminating checking of uninitialized inodes.
      
      With this feature, there is a a high water mark of used inodes for each block
      group.  Block and inode bitmaps can be uninitialized on disk via a flag in the
      group descriptor to avoid reading or scanning them at e2fsck time.  A checksum
      of each group descriptor is used to ensure that corruption in the group
      descriptor's bit flags does not cause incorrect operation.
      
      The feature is enabled through a mkfs option
      
      	mke2fs /dev/ -O uninit_groups
      
      A patch adding support for uninitialized block groups to e2fsprogs tools has
      been posted to the linux-ext4 mailing list.
      
      The patches have been stress tested with fsstress and fsx.  In performance
      tests testing e2fsck time, we have seen that e2fsck time on ext3 grows
      linearly with the total number of inodes in the filesytem.  In ext4 with the
      uninitialized block groups feature, the e2fsck time is constant, based
      solely on the number of used inodes rather than the total inode count.
      Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
      greatly reduce e2fsck time for users.  With performance improvement of 2-20
      times, depending on how full the filesystem is.
      
      The attached graph shows the major improvements in e2fsck times in filesystems
      with a large total inode count, but few inodes in use.
      
      In each group descriptor if we have
      
      EXT4_BG_INODE_UNINIT set in bg_flags:
              Inode table is not initialized/used in this group. So we can skip
              the consistency check during fsck.
      EXT4_BG_BLOCK_UNINIT set in bg_flags:
              No block in the group is used. So we can skip the block bitmap
              verification for this group.
      
      We also add two new fields to group descriptor as a part of
      uninitialized group patch.
      
              __le16  bg_itable_unused;       /* Unused inodes count */
              __le16  bg_checksum;            /* crc16(sb_uuid+group+desc) */
      
      bg_itable_unused:
      
      If we have EXT4_BG_INODE_UNINIT not set in bg_flags
      then bg_itable_unused will give the offset within
      the inode table till the inodes are used. This can be
      used by fsck to skip list of inodes that are marked unused.
      
      bg_checksum:
      Now that we depend on bg_flags and bg_itable_unused to determine
      the block and inode usage, we need to make sure group descriptor
      is not corrupt. We add checksum to group descriptor to
      detect corruption. If the descriptor is found to be corrupt, we
      mark all the blocks and inodes in the group used.
      Signed-off-by: NAvantika Mathur <mathur@us.ibm.com>
      Signed-off-by: NAndreas Dilger <adilger@clusterfs.com>
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      717d50e4
    • E
      ext4: remove #ifdef CONFIG_EXT4_INDEX · 4074fe37
      Eric Sandeen 提交于
      CONFIG_EXT4_INDEX is not an exposed config option in the kernel, and it is
      unconditionally defined in ext4_fs.h.  tune2fs is already able to turn off
      dir indexing, so at this point it's just cluttering up the code.  Remove
      it.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      4074fe37
    • C
      ext4: Remove (partial, never completed) fragment support · f077d0d7
      Coly Li 提交于
      Fragment support in ext2/3/4 was never implemented, and it probably will
      never be implemented.   So remove it from ext4.
      Signed-off-by: NColy Li <coyli@suse.de>
      Acked-by: NAndreas Dilger <adilger@clusterfs.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      f077d0d7
  2. 18 7月, 2007 6 次提交
    • A
      ext4: Remove 65000 subdirectory limit · f8628a14
      Andreas Dilger 提交于
      This patch adds support to ext4 for allowing more than 65000
      subdirectories. Currently the maximum number of subdirectories is capped
      at 32000.
      
      If we exceed 65000 subdirectories in an htree directory it sets the
      inode link count to 1 and no longer counts subdirectories.  The
      directory link count is not actually used when determining if a
      directory is empty, as that only counts subdirectories and not regular
      files that might be in there. 
      
      A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if
      the subdir count for any directory crosses 65000. A later fsck will clear
      EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any directory
      with >65000 subdirs.
      Signed-off-by: NAndreas Dilger <adilger@clusterfs.com>
      Signed-off-by: NKalpak Shah <kalpak@clusterfs.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      
      f8628a14
    • K
      ext4: Expand extra_inodes space per the s_{want,min}_extra_isize fields · 6dd4ee7c
      Kalpak Shah 提交于
      We need to make sure that existing ext3 filesystems can also avail the
      new fields that have been added to the ext4 inode. We use
      s_want_extra_isize and s_min_extra_isize to decide by how much we should
      expand the inode. If EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature is set
      then we expand the inode by max(s_want_extra_isize, s_min_extra_isize ,
      sizeof(ext4_inode) - EXT4_GOOD_OLD_INODE_SIZE) bytes. Actually it is
      still an open question about whether users should be able to set
      s_*_extra_isize smaller than the known fields or not.
      
      This patch also adds the functionality to expand inodes to include the
      newly added fields. We start by trying to expand by s_want_extra_isize
      bytes and if its fails we try to expand by s_min_extra_isize bytes. This
      is done by changing the i_extra_isize if enough space is available in
      the inode and no EAs are present. If EAs are present and there is enough
      space in the inode then the EAs in the inode are shifted to make space.
      If enough space is not available in the inode due to the EAs then 1 or
      more EAs are shifted to the external EA block. In the worst case when
      even the external EA block does not have enough space we inform the user
      that some EA would need to be deleted or s_min_extra_isize would have to
      be reduced.
      Signed-off-by: NAndreas Dilger <adilger@clusterfs.com>
      Signed-off-by: NKalpak Shah <kalpak@clusterfs.com>
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      6dd4ee7c
    • K
      ext4: Add nanosecond timestamps · ef7f3835
      Kalpak Shah 提交于
      This patch adds nanosecond timestamps for ext4. This involves adding
      *time_extra fields to the ext4_inode to extend the timestamps to
      64-bits.  Creation time is also added by this patch.
      
      These extended fields will fit into an inode if the filesystem was
      formatted with large inodes (-I 256 or larger) and there are currently
      no EAs consuming all of the available space. For new inodes we always
      reserve enough space for the kernel's known extended fields, but for
      inodes created with an old kernel this might not have been the case. So
      this patch also adds the EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature
      flag(ro-compat so that older kernels can't create inodes with a smaller
      extra_isize). which indicates if the fields fitting inside
      s_min_extra_isize are available or not.  If the expansion of inodes if
      unsuccessful then this feature will be disabled.  This feature is only
      enabled if requested by the sysadmin.
      
      None of the extended inode fields is critical for correct filesystem
      operation.
      Signed-off-by: NAndreas Dilger <adilger@clusterfs.com>
      Signed-off-by: NKalpak Shah <kalpak@clusterfs.com>
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Kleikamp <shaggy@linux.vnet.ibm.com>
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      ef7f3835
    • J
      jbd2: Fix CONFIG_JBD_DEBUG ifdef to be CONFIG_JBD2_DEBUG · e23291b9
      Jose R. Santos 提交于
      When the JBD code was forked to create the new JBD2 code base, the
      references to CONFIG_JBD_DEBUG where never changed to
      CONFIG_JBD2_DEBUG.  This patch fixes that.
      Signed-off-by: NJose R. Santos <jrs@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      e23291b9
    • J
      ext4: copy i_flags to inode flags on write · ff9ddf7e
      Jan Kara 提交于
          
      Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into
      ext4-specific i_flags.  Quota code changes these flags on quota files
      (to make it harder for sysadmin to screw himself) and these changes were
      not correctly propagated into the filesystem.
      
      (This is a forward port patch from ext3)
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      ff9ddf7e
    • A
      fallocate support in ext4 · a2df2a63
      Amit Arora 提交于
      This patch implements ->fallocate() inode operation in ext4. With this
      patch users of ext4 file systems will be able to use fallocate() system
      call for persistent preallocation. Current implementation only supports
      preallocation for regular files (directories not supported as of date)
      with extent maps. This patch does not support block-mapped files currently.
      Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of
      now.
      Signed-off-by: NAmit Arora <aarora@in.ibm.com>
      a2df2a63
  3. 01 6月, 2007 2 次提交
  4. 13 2月, 2007 1 次提交
  5. 12 10月, 2006 9 次提交
  6. 01 10月, 2006 2 次提交
  7. 27 9月, 2006 2 次提交
  8. 24 9月, 2006 1 次提交
  9. 01 8月, 2006 1 次提交
    • N
      [PATCH] ext3: avoid triggering ext3_error on bad NFS file handle · 2ccb48eb
      Neil Brown 提交于
      The inode number out of an NFS file handle gets passed eventually to
      ext3_get_inode_block() without any checking.  If ext3_get_inode_block()
      allows it to trigger an error, then bad filehandles can have unpleasant
      effect - ext3_error() will usually cause a forced read-only remount, or a
      panic if `errors=panic' was used.
      
      So remove the call to ext3_error there and put a matching check in
      ext3/namei.c where inode numbers are read off storage.
      
      [akpm@osdl.org: fix off-by-one error]
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Cc: <stable@kernel.org>
      Cc: "Stephen C. Tweedie" <sct@redhat.com>
      Cc: Eric Sandeen <esandeen@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2ccb48eb
  10. 26 6月, 2006 2 次提交
    • M
      [PATCH] ext3_fsblk_t: the rest of in-kernel filesystem blocks conversion · 43d23f90
      Mingming Cao 提交于
      Convert the ext3 in-kernel filesystem blocks to ext3_fsblk_t.  Convert the
      rest of all unsigned long type in-kernel filesystem blocks to ext3_fsblk_t,
      and replace the printk format string respondingly.
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      43d23f90
    • M
      [PATCH] ext3_fsblk_t: filesystem, group blocks and bug fixes · 1c2bf374
      Mingming Cao 提交于
      Some of the in-kernel ext3 block variable type are treated as signed 4 bytes
      int type, thus limited ext3 filesystem to 8TB (4kblock size based).  While
      trying to fix them, it seems quite confusing in the ext3 code where some
      blocks are filesystem-wide blocks, some are group relative offsets that need
      to be signed value (as -1 has special meaning).  So it seem saner to define
      two types of physical blocks: one is filesystem wide blocks, another is
      group-relative blocks.  The following patches clarify these two types of
      blocks in the ext3 code, and fix the type bugs which limit current 32 bit ext3
      filesystem limit to 8TB.
      
      With this series of patches and the percpu counter data type changes in the mm
      tree, we are able to extend exts filesystem limit to 16TB.
      
      This work is also a pre-request for the recent >32 bit ext3 work, and makes
      the kernel to able to address 48 bit ext3 block a lot easier: Simply redefine
      ext3_fsblk_t from unsigned long to sector_t and redefine the format string for
      ext3 filesystem block corresponding.
      
      Two RFC with a series patches have been posted to ext2-devel list and have
      been reviewed and discussed:
      http://marc.theaimsgroup.com/?l=ext2-devel&m=114722190816690&w=2
      
      http://marc.theaimsgroup.com/?l=ext2-devel&m=114784919525942&w=2
      
      Patches are tested on both 32 bit machine and 64 bit machine, <8TB ext3 and
      >8TB ext3 filesystem(with the latest to be released e2fsprogs-1.39).  Tests
      includes overnight fsx, tiobench, dbench and fsstress.
      
      This patch:
      
      Defines ext3_fsblk_t and ext3_grpblk_t, and the printk format string for
      filesystem wide blocks.
      
      This patch classifies all block group relative blocks, and ext3_fsblk_t blocks
      occurs in the same function where used to be confusing before.  Also include
      kernel bug fixes for filesystem wide in-kernel block variables.  There are
      some fileystem wide blocks are treated as int/unsigned int type in the kernel
      currently, especially in ext3 block allocation and reservation code.  This
      patch fixed those bugs by converting those variables to ext3_fsblk_t(unsigned
      long) type.
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1c2bf374
  11. 05 5月, 2006 1 次提交
  12. 25 4月, 2006 1 次提交
  13. 29 3月, 2006 1 次提交
  14. 27 3月, 2006 2 次提交
    • M
      [PATCH] ext3_get_blocks: support multiple blocks allocation in ext3_new_block() · b54e41ec
      Mingming Cao 提交于
      Change ext3_try_to_allocate() (called via ext3_new_blocks()) to try to
      allocate the requested number of blocks on a best effort basis: After
      allocated the first block, it will always attempt to allocate the next few(up
      to the requested size and not beyond the reservation window) adjacent blocks
      at the same time.
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b54e41ec
    • M
      [PATCH] ext3_get_blocks: Mapping multiple blocks at a once · 89747d36
      Mingming Cao 提交于
      Currently ext3_get_block() only maps or allocates one block at a time.  This
      is quite inefficient for sequential IO workload.
      
      I have posted a early implements a simply multiple block map and allocation
      with current ext3.  The basic idea is allocating the 1st block in the existing
      way, and attempting to allocate the next adjacent blocks on a best effort
      basis.  More description about the implementation could be found here:
      http://marc.theaimsgroup.com/?l=ext2-devel&m=112162230003522&w=2
      
      The following the latest version of the patch: break the original patch into 5
      patches, re-worked some logicals, and fixed some bugs.  The break ups are:
      
       [patch 1] Adding map multiple blocks at a time in ext3_get_blocks()
       [patch 2] Extend ext3_get_blocks() to support multiple block allocation
       [patch 3] Implement multiple block allocation in ext3-try-to-allocate
       (called via ext3_new_block()).
       [patch 4] Proper accounting updates in ext3_new_blocks()
       [patch 5] Adjust reservation window size properly (by the given number
       of blocks to allocate) before block allocation to increase the
       possibility of allocating multiple blocks in a single call.
      
      Tests done so far includes fsx,tiobench and dbench.  The following numbers
      collected from Direct IO tests (1G file creation/read) shows the system time
      have been greatly reduced (more than 50% on my 8 cpu system) with the patches.
      
       1G file DIO write:
       	2.6.15		2.6.15+patches
       real    0m31.275s	0m31.161s
       user    0m0.000s	0m0.000s
       sys     0m3.384s	0m0.564s
      
       1G file DIO read:
       	2.6.15		2.6.15+patches
       real    0m30.733s	0m30.624s
       user    0m0.000s	0m0.004s
       sys     0m0.748s	0m0.380s
      
      Some previous test we did on buffered IO with using multiple blocks allocation
      and delayed allocation shows noticeable improvement on throughput and system
      time.
      
      This patch:
      
      Add support of mapping multiple blocks in one call.
      
      This is useful for DIO reads and re-writes (where blocks are already
      allocated), also is in line with Christoph's proposal of using getblocks() in
      mpage_readpage() or mpage_readpages().
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      89747d36
  15. 23 3月, 2006 1 次提交
    • A
      [PATCH] ext3_readdir: use generic readahead · d8733c29
      Andrew Morton 提交于
      Linus points out that ext3_readdir's readahead only cuts in when
      ext3_readdir() is operating at the very start of the directory.  So for large
      directories we end up performing no readahead at all and we suck.
      
      So take it all out and use the core VM's page_cache_readahead().  This means
      that ext3 directory reads will use all of readahead's dynamic sizing goop.
      
      Note that we're using the directory's filp->f_ra to hold the readahead state,
      but readahead is actually being performed against the underlying blockdev's
      address_space.  Fortunately the readahead code is all set up to handle this.
      
      Tested with printk.  It works.  I was struggling to find a real workload which
      actually cared.
      
      (The patch also exports page_cache_readahead() to GPL modules)
      
      Cc: "Stephen C. Tweedie" <sct@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d8733c29
  16. 08 9月, 2005 1 次提交
    • M
      [PATCH] disk quotas fail when /etc/mtab is symlinked to /proc/mounts · 8fc2751b
      Mark Bellon 提交于
      If /etc/mtab is a regular file all of the mount options (of a file system)
      are written to /etc/mtab by the mount command.  The quota tools look there
      for the quota strings for their operation.  If, however, /etc/mtab is a
      symlink to /proc/mounts (a "good thing" in some environments) the tools
      don't write anything - they assume the kernel will take care of things.
      
      While the quota options are sent down to the kernel via the mount system
      call and the file system codes handle them properly unfortunately there is
      no code to echo the quota strings into /proc/mounts and the quota tools
      fail in the symlink case.
      
      The attached patchs modify the EXT[2|3] and JFS codes to add the necessary
      hooks.  The show_options function of each file system in these patches
      currently deal with only those things that seemed related to quotas;
      especially in the EXT3 case more can be done (later?).
      
      Jan Kara also noted the difficulty in moving these changes above the FS
      codes responding similarly to myself to Andrew's comment about possible
      VFS migration. Issue summary:
      
       - FS codes have to process the entire string of options anyway.
      
       - Only FS codes that use quotas must have a show_options function (for
         quotas to work properly) however quotas are only used in a small number
         of FS.
      
       - Since most of the quota using FS support other options these FS codes
         should have the a show_options function to show those options - and the
         quota echoing becomes virtually negligible.
      
      Based on feedback I have modified my patches from the original:
      
         JFS a missing patch has been restored to the posting
         EXT[2|3] and JFS always use the show_options function
             - Each FS has at least one FS specific option displayed
             - QUOTA output is under a CONFIG_QUOTA ifdef
             - a follow-on patch will add a multitude of options for each FS
         EXT[2|3] and JFS "quota" is treated as "usrquota"
         EXT3 journalled data check for journalled quota removed
         EXT[2|3] mount when quota specified but not compiled in
      
       - no changes from my original patch.  I tested the patch and the codes
         warn but
      
       - still mount.  With all due respection I believe the comments
         otherwise were a
      
       - misread of the patch.  Please reread/test and comment.  XFS patch
         removed - the XFS team already made the necessary changes EXT3 mixing
         old and new quotas are handled differently (not purely exclusive)
      
       - if old and new quotas for the same type are used together the old
         type is silently depricated for compatability (e.g.  usrquota and
         usrjquota)
      
       - mixing of old and new quotas is an error (e.g.  usrjquota and
         grpquota)
      Signed-off-by: NMark Bellon <mbellon@mvista.com>
      Acked-by: NDave Kleikamp <shaggy@austin.ibm.com>
      Cc: Jan Kara <jack@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8fc2751b
  17. 13 7月, 2005 1 次提交
  18. 24 6月, 2005 1 次提交
  19. 17 4月, 2005 1 次提交
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds 提交于
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4