1. 28 Oct, 2010 14 commits
    • ext4: don't update sb journal_devnum when RO dev · c41303ce
      Committed by Maciej Żenczykowski
      An ext4 filesystem on a read-only device, with an external journal
      which is at a different device number than recorded in the superblock,
      will fail to honor the read-only setting of the device and trigger
      a superblock update (write).
      
      For example:
        - ext4 on a software raid which is in read-only mode
        - external journal on a read-write device which has changed device num
        - attempt to mount with -o journal_dev=<new_number>
        - hits BUG_ON(mddev->ro == 1) in md.c
      
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
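      The rule described in this commit can be sketched in userspace C. This
      is a minimal illustration, not the ext4 code: the struct, its fields,
      and resolve_journal_dev() are hypothetical names, and the superblock
      write is modelled as a simple counter.

```c
#include <stdbool.h>

/* Hypothetical, simplified superblock state; not the real ext4 structures. */
struct sb_state {
    bool     device_read_only;   /* underlying block device is read-only */
    unsigned journal_dev;        /* journal device number recorded on disk */
    unsigned writes;             /* number of superblock writes issued */
};

/*
 * Resolve which journal device to use for this mount. The on-disk record
 * is refreshed only when the device is writable; on a read-only device the
 * requested number is used without touching the superblock.
 */
unsigned resolve_journal_dev(struct sb_state *sb, unsigned requested)
{
    if (requested == 0)
        return sb->journal_dev;          /* no override given */
    if (!sb->device_read_only && requested != sb->journal_dev) {
        sb->journal_dev = requested;     /* persist the new number */
        sb->writes++;                    /* stands in for a superblock write */
    }
    return requested;
}
```

      On a read-only device the call still returns the requested number, so
      the mount can proceed, but the write counter stays at zero.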
    • ext4: use sb_issue_zeroout in ext4_ext_zeroout · 2407518d
      Committed by Lukas Czerner
      Change ext4_ext_zeroout to use sb_issue_zeroout instead of its
      own approach to zero out extents.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: use sb_issue_zeroout in setup_new_group_blocks · a31437b8
      Committed by Lukas Czerner
      Use sb_issue_zeroout to zero out inode table and descriptor table
      blocks instead of the old approach, which involves journaling.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: add interface to advertise ext4 features in sysfs · 857ac889
      Committed by Lukas Czerner
      User space should be able to check which features ext4 supports in
      any particular build. This adds a simple interface by creating a new
      "features" directory in /sys/fs/ext4/. Files advertising feature
      names can be created in that directory.

      Add lazy_itable_init to the feature list.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: add support for lazy inode table initialization · bfff6873
      Committed by Lukas Czerner
      When the lazy_itable_init extended option is passed to mke2fs, it
      considerably speeds up filesystem creation because inode tables are
      not zeroed out.  The fact that parts of the inode table are
      uninitialized is not a problem so long as the block group descriptors,
      which contain information regarding how much of the inode table has
      been initialized, have not been corrupted.  However, if the block group
      checksums are not valid, e2fsck must scan the entire inode table, and
      the old, uninitialized data could potentially cause e2fsck to
      report false problems.
      
      Hence, it is important for the inode tables to be initialized as soon
      as possible.  This commit adds this feature so that mke2fs can safely
      use the lazy inode table initialization feature to speed up formatting
      file systems.
      
      This is done via a new kernel thread called ext4lazyinit, which is
      created on demand and destroyed when it is no longer needed.  There
      is only one thread for all ext4 filesystems in the system. When the
      first filesystem with the init_itable mount option is mounted, the
      ext4lazyinit thread is created; the filesystem can then register its
      request in the request list.
      
      This thread then walks through the list of requests, picking up
      scheduled requests and invoking ext4_init_inode_table(). The next
      schedule time for a request is computed by multiplying the time it
      took to zero out the last inode table by the wait multiplier, which
      can be set with the init_itable=n mount option (default is 10).  We do
      this so that we do not take the whole I/O bandwidth. When the thread
      is no longer necessary (the request list is empty), it frees the
      appropriate structures and exits (and can be created later by
      another filesystem).
      
      We do not disturb regular inode allocations in any way; the allocator
      does not care whether the inode table is zeroed or not. But when
      zeroing, we have to skip used inodes, obviously. We should also prevent
      new inode allocations from the group while zeroing is under way. For
      that we take the alloc_sem lock for writing in ext4_init_inode_table()
      and for reading in ext4_claim_inode(), so when we are unlucky and the
      allocator hits the group which is currently being zeroed, it just has
      to wait.
      
      This can be suppressed using the mount option noinit_itable.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
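      The scheduling rule in this message (wait for the zeroing time
      multiplied by the wait multiplier) can be written as a small worked
      example. This is a sketch only: next_sched_time() and its parameters
      are illustrative names, and time is counted in arbitrary ticks rather
      than the kernel's own units.

```c
/*
 * Compute when the next inode table should be zeroed.
 * start/finish bracket the zeroing of the previous table; the thread then
 * sleeps for (finish - start) * multiplier ticks before the next one.
 */
unsigned long next_sched_time(unsigned long start, unsigned long finish,
                              unsigned long multiplier)
{
    unsigned long elapsed = finish - start;   /* time spent zeroing */
    return finish + elapsed * multiplier;     /* back off proportionally */
}
```

      With the default multiplier of 10, a table that took 50 ticks to zero
      is followed by a 500-tick pause, so under this model the thread is
      busy for roughly one eleventh of the time.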
    • jbd2: Add sanity check for attempts to start handle during umount · 5c2178e7
      Committed by Theodore Ts'o
      An attempt to modify the file system during the call to
      jbd2_destroy_journal() can lead to a system lockup.  So add some
      checking to make it much more obvious when this happens and to
      determine where the offending code is located.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix NULL pointer dereference in print_daily_error_info() · a1c6c569
      Committed by Sergey Senozhatsky
      Fix a NULL pointer dereference in print_daily_error_info() when it is
      called on an unmounted fs (EXT4_SB(sb) returns NULL), by removing the
      error reporting timer in ext4_put_super().

      Google-Bug-Id: 3017663
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: don't hold spinlock while calling ext4_issue_discard() · 53fdcf99
      Committed by Lukas Czerner
      We can't hold the block group spinlock because ext4_issue_discard()
      waits and hence can get rescheduled.

      Google-Bug-Id: 3017678
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: check for negative error code from sb_issue_discard · 58298709
      Committed by Lukas Czerner
      sb_issue_discard() returns a negative error code, so check for
      -EOPNOTSUPP.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
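      The sign convention matters because a comparison against the positive
      constant never matches a kernel-style return value. The sketch below
      illustrates the check; struct fs_opts and handle_discard_result() are
      hypothetical, and turning discard off on "not supported" is an
      assumption made for the example, not something stated in the message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical mount options; only the discard flag is modelled. */
struct fs_opts {
    bool discard;
};

/*
 * Interpret the result of a discard request that follows the kernel
 * convention of returning 0 on success and a negative errno on failure.
 */
int handle_discard_result(struct fs_opts *opts, int ret)
{
    if (ret == -EOPNOTSUPP) {      /* note the minus sign */
        opts->discard = false;     /* assumed policy: stop issuing discards */
        return 0;                  /* not treated as a hard error */
    }
    return ret;                    /* success or a genuine error */
}
```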
    • ext4: don't bump up LONG_MAX nr_to_write by a factor of 8 · b443e733
      Committed by Eric Sandeen
      I'm uneasy with lots of stuff going on in ext4_da_writepages(),
      but bumping nr_to_write from LONG_MAX to -8 clearly isn't
      making anything better, so avoid the multiplier in that case.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
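      Why the product comes out as -8 can be checked with a short example.
      This is a sketch under assumptions: both helpers are hypothetical,
      inputs are assumed non-negative with a positive factor, and the
      wrapped value is computed in unsigned arithmetic because signed
      overflow is undefined in C (the final conversion back to long relies
      on the usual two's-complement behaviour of common compilers).

```c
#include <limits.h>

/*
 * Scale a writeback budget by `factor`, leaving the "no limit" sentinel
 * LONG_MAX untouched and saturating instead of overflowing.
 */
long bump_nr_to_write(long nr_to_write, long factor)
{
    if (nr_to_write == LONG_MAX)
        return LONG_MAX;                 /* unlimited stays unlimited */
    if (nr_to_write > LONG_MAX / factor)
        return LONG_MAX;                 /* would overflow: saturate */
    return nr_to_write * factor;
}

/* What an unguarded multiplication wraps to on a two's-complement machine. */
long wrapped_product(long nr_to_write, long factor)
{
    return (long)((unsigned long)nr_to_write * (unsigned long)factor);
}
```

      LONG_MAX has every bit set except the sign bit, so shifting it left by
      three (multiplying by 8) leaves all ones with the low three bits
      cleared, which reads as -8.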
    • ext4: stop looping in ext4_num_dirty_pages when max_pages reached · 659c6009
      Committed by Eric Sandeen
      Today we simply break out of the inner loop when we have accumulated
      max_pages; this keeps scanning forward and doing pagevec_lookup_tag()
      in the while (!done) loop, which potentially does a lot of work
      with no net effect.

      When we have accumulated max_pages, just clean up and return.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: use dedicated slab caches for group_info structures · fb1813f4
      Committed by Curt Wohlgemuth
      ext4_group_info structures are currently allocated with kmalloc().
      With a typical 4K block size, these are 136 bytes each -- meaning
      they'll each consume a 256-byte slab object.  On a system with many
      large ext4 partitions, that's a lot of wasted kernel slab space.
      (E.g., a single 1TB partition will have about 8000 block groups, using
      about 2MB of slab, of which nearly 1MB is wasted.)
      
      This patch creates an array of slab pointers created as needed --
      depending on the superblock block size -- and uses these slabs to
      allocate the group info objects.
      
      Google-Bug-Id: 2980809
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
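      The figures quoted in this message can be reproduced with a little
      arithmetic. The sketch assumes power-of-two allocation buckets (which
      matches the 136-byte object landing in a 256-byte slab as stated
      above) and the ext4 layout in which one block group covers
      8 * block_size blocks; the function names are illustrative.

```c
#include <stddef.h>

/* Round an object size up to the next power-of-two bucket (assumed model). */
size_t bucket_size(size_t obj)
{
    size_t b = 8;
    while (b < obj)
        b <<= 1;
    return b;
}

/* Number of block groups: each group spans 8 * block_size blocks. */
unsigned long long block_groups(unsigned long long fs_bytes,
                                unsigned long long block_size)
{
    unsigned long long group_bytes = block_size * 8 * block_size;
    return fs_bytes / group_bytes;
}

/* Bytes lost to rounding when `count` objects each occupy one bucket. */
size_t wasted_bytes(size_t obj, size_t count)
{
    return (bucket_size(obj) - obj) * count;
}
```

      For a 1 TB filesystem with 4K blocks this gives 8192 groups, 2 MB of
      256-byte slab objects, and 983040 bytes (a little under 1 MB) of
      padding, in line with the estimate in the message.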
    • jbd2: Fix I/O hang in jbd2_journal_release_jbd_inode · 39e3ac25
      Committed by Brian King
      This fixes a hang seen in jbd2_journal_release_jbd_inode
      on a lot of Power 6 systems running with ext4. When we get
      in the hung state, all I/O to the disk in question gets blocked
      where we stay indefinitely. Looking at the task list, I can see
      we are stuck in jbd2_journal_release_jbd_inode waiting on a
      wake up. I added some debug code to detect this scenario and
      dump additional data if we were stuck in jbd2_journal_release_jbd_inode
      for longer than 30 minutes. When it hit, I was able to see that
      i_flags was 0, suggesting we missed the wake up.
      
      This patch changes i_flags to be an unsigned long, uses bit operators
      to access it, and adds barriers around the accesses. Prior to applying
      this patch, we were regularly hitting this hang on numerous systems
      in our test environment. After applying the patch, the hangs no longer
      occur.
      Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix EOFBLOCKS_FL handling · 58590b06
      Committed by Theodore Ts'o
      It turns out we have several problems with how EOFBLOCKS_FL is
      handled.  First of all, there was a fencepost error where we were not
      clearing the EOFBLOCKS_FL when filling in the last uninitialized block,
      but rather when we allocated the next block _after_ the uninitialized
      block.  Secondly, we were not testing to see if we needed to clear the
      EOFBLOCKS_FL when writing to the file with O_DIRECT or when we were
      converting an uninitialized block (which is the most common case).
      
      Google-Bug-Id: 2928259
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  2. 24 Sep, 2010 5 commits
  3. 23 Sep, 2010 4 commits
    • /proc/pid/smaps: fix dirty pages accounting · 1c2499ae
      Committed by KOSAKI Motohiro
      Currently, /proc/<pid>/smaps has wrong dirty pages accounting.
      Shared_Dirty and Private_Dirty count only pte-dirty pages and ignore
      the PG_dirty page flag.  This differs from the documentation and is
      also inconsistent with the Referenced field (Referenced checks both
      pte and page flags).
      
      This patch fixes it.
      
      Test program:
      
       large-array.c
       ---------------------------------------------------
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <unistd.h>
      
       char array[1*1024*1024*1024L];
      
       int main(void)
       {
               memset(array, 1, sizeof(array));
               pause();
      
               return 0;
       }
       ---------------------------------------------------
      
      Test case:
       1. run ./large-array
       2. cat /proc/`pidof large-array`/smaps
       3. swapoff -a
       4. cat /proc/`pidof large-array`/smaps again
      
      Test result:
       <before patch>
      
      00601000-40601000 rw-p 00000000 00:00 0
      Size:            1048576 kB
      Rss:             1048576 kB
      Pss:             1048576 kB
      Shared_Clean:          0 kB
      Shared_Dirty:          0 kB
      Private_Clean:    218992 kB   <-- showed pages as clean incorrectly
      Private_Dirty:    829584 kB
      Referenced:       388364 kB
      Swap:                  0 kB
      KernelPageSize:        4 kB
      MMUPageSize:           4 kB
      
       <after patch>
      
      00601000-40601000 rw-p 00000000 00:00 0
      Size:            1048576 kB
      Rss:             1048576 kB
      Pss:             1048576 kB
      Shared_Clean:          0 kB
      Shared_Dirty:          0 kB
      Private_Clean:         0 kB
      Private_Dirty:   1048576 kB  <-- fixed
      Referenced:       388480 kB
      Swap:                  0 kB
      KernelPageSize:        4 kB
      MMUPageSize:           4 kB
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
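      The accounting rule behind this fix, that a page counts as dirty when
      either the pte or the page flag says so, can be sketched as follows.
      The struct and function names are hypothetical and only model the two
      dirty indicators named in the message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page view: the two places a page can be marked dirty. */
struct page_state {
    bool pte_dirty;    /* dirty bit in the page table entry */
    bool page_dirty;   /* PG_dirty flag on the page itself */
};

/* A page is reported dirty if either source says so. */
bool page_is_dirty(struct page_state p)
{
    return p.pte_dirty || p.page_dirty;
}

/* Sum the dirty kilobytes over n pages of page_kb kilobytes each. */
unsigned long dirty_kb(const struct page_state *pages, size_t n,
                       unsigned long page_kb)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        if (page_is_dirty(pages[i]))
            total += page_kb;
    return total;
}
```

      Checking only pte_dirty would miss the second kind of page, which is
      how the "before" output above ends up with a non-zero Private_Clean.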
    • aio: do not return ERESTARTSYS as a result of AIO · a0c42bac
      Committed by Jan Kara
      OCFS2 can return ERESTARTSYS from its write function when the process is
      signalled while waiting for a cluster lock (and the filesystem is mounted
      with intr mount option).  Generally, it seems reasonable to allow
      filesystems to return this error code from their IO functions.  As we must
      not leak ERESTARTSYS (and similar error codes) to userspace as a result of
      an AIO operation, we have to properly convert it to EINTR inside AIO code
      (restarting the syscall isn't really an option because other AIO could
      have been already submitted by the same io_submit syscall).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
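      The conversion described here can be sketched as a small filter on the
      result code. The restart codes are kernel-internal and are not
      provided by the userspace <errno.h>, so the example defines them
      locally with the values used in the kernel headers; treat those
      definitions and the helper name aio_sanitize_result() as assumptions
      made for illustration.

```c
#include <errno.h>

/* Kernel-internal restart codes, defined locally for this sketch. */
#ifndef ERESTARTSYS
#define ERESTARTSYS           512
#endif
#ifndef ERESTARTNOINTR
#define ERESTARTNOINTR        513
#endif
#ifndef ERESTARTNOHAND
#define ERESTARTNOHAND        514
#endif
#ifndef ERESTART_RESTARTBLOCK
#define ERESTART_RESTARTBLOCK 516
#endif

/*
 * Map restart requests to -EINTR before the result is handed back as an
 * AIO completion; every other value passes through unchanged.
 */
long aio_sanitize_result(long ret)
{
    switch (ret) {
    case -ERESTARTSYS:
    case -ERESTARTNOINTR:
    case -ERESTARTNOHAND:
    case -ERESTART_RESTARTBLOCK:
        return -EINTR;
    default:
        return ret;
    }
}
```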
    • /proc/vmcore: fix seeking · c227e690
      Committed by Arnd Bergmann
      Commit 73296bc6 ("procfs: Use generic_file_llseek in /proc/vmcore")
      broke seeking on /proc/vmcore.  This changes it back to use default_llseek
      in order to restore the original behaviour.
      
      The problem with generic_file_llseek is that it only allows seeks up to
      inode->i_sb->s_maxbytes, which is zero on procfs and some other virtual
      file systems.  We should merge generic_file_llseek and default_llseek some
      day and clean this up in a proper way, but for 2.6.35/36, reverting vmcore
      is the safer solution.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Reported-by: CAI Qian <caiqian@redhat.com>
      Tested-by: CAI Qian <caiqian@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
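      The effect of a zero s_maxbytes on seeking can be shown with two toy
      seek routines. They only model absolute seeks and are not the kernel
      implementations; seek_bounded() and seek_unbounded() are illustrative
      names for the two behaviours the message contrasts.

```c
#include <errno.h>

/*
 * Seek that enforces an upper bound, in the style the message describes
 * for generic_file_llseek. Returns the new offset or -EINVAL.
 */
long long seek_bounded(long long offset, long long maxbytes)
{
    if (offset < 0 || offset > maxbytes)
        return -EINVAL;
    return offset;
}

/* Seek without an upper bound, in the style of default_llseek. */
long long seek_unbounded(long long offset)
{
    if (offset < 0)
        return -EINVAL;
    return offset;
}
```

      With maxbytes equal to zero, seek_bounded() rejects every offset other
      than 0, which is the breakage reported for /proc/vmcore.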
    • Prevent freeing uninitialized pointer in compat_do_readv_writev · 767b68e9
      Committed by Dan Rosenberg
      In 32-bit compatibility mode, the error handling for
      compat_do_readv_writev() may free an uninitialized pointer, potentially
      leading to all sorts of ugly memory corruption.  This is reliably
      triggerable by unprivileged users by invoking the readv()/writev()
      syscalls with an invalid iovec pointer.  The below patch fixes this to
      emulate the non-compat version.
      
      Introduced by commit b8373363 ("compat: factor out
      compat_rw_copy_check_uvector from compat_do_readv_writev")
      Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
      Cc: stable@kernel.org (2.6.35)
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 22 Sep, 2010 2 commits
    • bdi: Fix warnings in __mark_inode_dirty for /dev/zero and friends · 692ebd17
      Committed by Jan Kara
      Inodes of devices such as /dev/zero can get dirty for example via
      utime(2) syscall or due to atime update. Backing device of such inodes
      (zero_bdi, etc.) is however unable to handle dirty inodes and thus
      __mark_inode_dirty complains.  In fact, inode should be rather dirtied
      against backing device of the filesystem holding it. This is generally a
      good rule except for filesystems such as 'bdev' or 'mtd_inodefs'. Inodes
      in these pseudofilesystems are referenced from ordinary filesystem
      inodes and carry mapping with real data of the device. Thus for these
      inodes we have to use inode->i_mapping->backing_dev_info as we did so
      far. We distinguish these filesystems by checking whether sb->s_bdi
      points to a non-trivial backing device or not.
      
      Example: Assume we have an ext3 filesystem on /dev/sda1 mounted on /.
      There's a device inode A described by a path "/dev/sdb" on this
      filesystem. This inode will be dirtied against backing device "8:0"
      after this patch. bdev filesystem contains block device inode B coupled
      with our inode A. When someone modifies a page of /dev/sdb, it's B that
      gets dirtied and the dirtying happens against the backing device "8:16".
      Thus both inodes get filed to a correct bdi list.
      
      Cc: stable@kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • char: Mark /dev/zero and /dev/kmem as not capable of writeback · 371d217e
      Committed by Jan Kara
      These devices don't do any writeback but their device inodes still can get
      dirty so mark bdi appropriately so that bdi code does the right thing and files
      inodes to lists of bdi carrying the device inodes.
      
      Cc: stable@kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  5. 20 Sep, 2010 1 commit
  6. 18 Sep, 2010 3 commits
  7. 17 Sep, 2010 3 commits
  8. 15 Sep, 2010 4 commits
    • aio: check for multiplication overflow in do_io_submit · 75e1c70f
      Committed by Jeff Moyer
      Tavis Ormandy pointed out that do_io_submit does not do proper bounds
      checking on the passed-in iocb array:
      
             if (unlikely(nr < 0))
                     return -EINVAL;
      
             if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp)))))
                     return -EFAULT;                      ^^^^^^^^^^^^^^^^^^
      
      The attached patch checks for overflow, and if it is detected, the
      number of iocbs submitted is scaled down to a number that will fit in
      the long.  This is an ok thing to do, as sys_io_submit is documented as
      returning the number of iocbs submitted, so callers should handle a
      return value of less than the 'nr' argument passed in.
      Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
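      The scaling-down step described in this message can be sketched in
      userspace C. clamp_count() is a hypothetical helper, it returns -1
      where the snippet above returns -EINVAL, and it assumes elem_size is
      non-zero.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Clamp an element count so that nr * elem_size cannot exceed LONG_MAX.
 * A negative count is rejected; an oversized count is scaled down to the
 * largest value whose byte size still fits in a long.
 */
long clamp_count(long nr, size_t elem_size)
{
    unsigned long limit = (unsigned long)LONG_MAX / elem_size;

    if (nr < 0)
        return -1;
    if ((unsigned long)nr > limit)
        nr = (long)limit;
    return nr;
}
```

      Dividing the limit by the element size, rather than multiplying the
      count, keeps the check itself free of overflow.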
    • cifs: fix potential double put of TCP session reference · 460cf341
      Committed by Jeff Layton
      cifs_get_smb_ses must be called on a server pointer on which it holds an
      active reference. It first does a search for an existing SMB session. If
      it finds one, it'll put the server reference and then try to ensure that
      the negprot is done, etc.
      
      If it encounters an error at that point then it'll return an error.
      There's a potential problem here though. When cifs_get_smb_ses returns
      an error, the caller will also put the TCP server reference leading to a
      double-put.
      
      Fix this by having cifs_get_smb_ses only put the server reference if
      it found an existing session that it could use and isn't returning an
      error.
      
      Cc: stable@kernel.org
      Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Steve French <sfrench@us.ibm.com>
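      The ownership rule in this fix can be modelled with a plain counter.
      This is a simplified sketch, not the cifs code: the structs,
      server_put(), and get_session() are hypothetical, only the "existing
      session found" path is modelled, and the setup_err flag stands in for
      a failure while preparing the session.

```c
#include <stddef.h>

struct server  { int refs; };             /* reference-counted server */
struct session { struct server *srv; };   /* session holding a server ref */

void server_put(struct server *srv)
{
    srv->refs--;
}

/*
 * The caller holds one reference on `srv`. If an existing session is
 * reused successfully, that extra reference is dropped here, because the
 * session already holds its own. On error nothing is dropped: the caller
 * remains responsible for the put.
 */
struct session *get_session(struct server *srv, struct session *existing,
                            int setup_err)
{
    if (existing == NULL)
        return NULL;          /* creating a new session is not modelled */
    if (setup_err)
        return NULL;          /* error: leave the reference to the caller */
    server_put(srv);          /* success: drop the caller's extra reference */
    return existing;
}
```

      Dropping the reference before the error check would leave the count
      one too low after the caller's own put, which is the double put the
      message describes.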
    • ceph: stop sending FLUSHSNAPs when we hit a dirty capsnap · cfc0bf66
      Committed by Sage Weil
      Stop sending FLUSHSNAP messages when we hit a capsnap that has dirty_pages
      or is still writing.  We'll send the newer capsnaps only after the older
      ones complete.
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: correctly set 'follows' in flushsnap messages · 8bef9239
      Committed by Sage Weil
      The 'follows' should match the seq for the snap context for the given snap
      cap, which is the context under which we have been dirtying and writing
      data and metadata.  The snapshot that _contains_ those updates thus
      _follows_ that context's seq #.
      Signed-off-by: Sage Weil <sage@newdream.net>
  9. 14 Sep, 2010 1 commit
  10. 13 Sep, 2010 3 commits