1. 30 Apr, 2010 1 commit
  2. 29 Apr, 2010 1 commit
  3. 28 Apr, 2010 2 commits
  4. 27 Apr, 2010 2 commits
    • N
      nfsd4: bug in read_buf · 2bc3c117
      Committed by Neil Brown
      When read_buf is called to move over to the next page in the pagelist
      of an NFSv4 request, it sets argp->end to essentially a random
      number, certainly not an address within the page which argp->p now
      points to.  So subsequent calls to READ_BUF will think there is much
      more than a page of spare space (the cast to u32 ensures an unsigned
      comparison) so we can expect to fall off the end of the second
      page.
      
      We never encountered this in testing because typically the only
      operations which use more than two pages are write-like operations,
      which have their own decoding logic.  Something like a getattr after a
      write may cross a page boundary, but it would be very unusual for it to
      cross another boundary after that.
      
      Cc: stable@kernel.org
      Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
      2bc3c117
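The effect of the unsigned comparison described above can be illustrated in userspace. This is a minimal sketch, not the kernel's READ_BUF macro: the helper names `words_available` and `request_fits` are hypothetical, and `p` and `end` are modelled as word offsets rather than real pointers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: p and end are word offsets, not kernel pointers.
 * The subtraction is done in unsigned 32-bit arithmetic, so an end that
 * lies before p wraps to a huge value instead of going negative.
 */
static uint32_t words_available(uint32_t p, uint32_t end)
{
	return end - p;
}

/* Returns 1 if a request for nwords appears to fit, 0 otherwise. */
static int request_fits(uint32_t p, uint32_t end, uint32_t nwords)
{
	assert(nwords != 0); /* a zero-length request is not meaningful here */
	return nwords <= words_available(p, end);
}
```

With a stale `end` that is smaller than `p`, `request_fits` accepts requests far larger than a page, which is the failure mode the commit message describes.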
    • D
      xfs: more swap extent fixes for dynamic fork offsets · dd77ef92
      Committed by Dave Chinner
      A new xfsqa test (226) with a prototype xfs_fsr change to try to
      handle dynamic fork offsets better triggers an assertion failure
      where the inode data fork is in btree format, yet there is room in
      the inode for it to be in extent format. The two inodes look like:
      
      before: ino 0x101 (target), num_extents 11, Max in-fork extents 6, broot size 40, fork offset 96
      before: ino 0x115 (temp),  num_extents 5, Max in-fork extents 3, broot size 40, fork offset 56
      after: ino 0x101 (target), num_extents 5, Max in-fork extents 6, broot size 40, fork offset 96
      after: ino 0x115 (temp), num_extents 11, Max in-fork extents 3, broot size 40, fork offset 56
      
      Basically the target inode ends up with 5 extents in btree format,
      but it had space for 6 extents in extent format, so ends up
      incorrect. Notably here the broot size is the same, and that is
      where the kernel code is going wrong - the btree root will fit, so
      it lets the swap go ahead.
      
      The check should not allow the swap to take place if the number of
      extents while in btree format is less than the number of extents
      that can fit in the inode in extent format. Adding that check will
      prevent this swap and corruption from occurring.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      dd77ef92
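The check described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the XFS code: `struct fork_info`, `enum fork_format`, and `swap_allowed` are hypothetical names, and treating an equal extent count as a rejection is an assumption (the message only says "less than").

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of the fields quoted in the commit message. */
enum fork_format { FORK_EXTENTS, FORK_BTREE };

struct fork_info {
	enum fork_format format;  /* format of the incoming data fork */
	unsigned int num_extents; /* number of extents the fork carries */
};

/*
 * Returns 1 if the incoming fork may be swapped into an inode that can
 * hold max_in_fork_extents in extent format, 0 if the swap must be refused.
 * Assumption: a btree-format fork whose extents would all fit in extent
 * format (count <= maximum) is treated as invalid.
 */
static int swap_allowed(const struct fork_info *incoming,
			unsigned int max_in_fork_extents)
{
	assert(incoming != NULL);
	if (incoming->format == FORK_BTREE &&
	    incoming->num_extents <= max_in_fork_extents)
		return 0;
	return 1;
}
```

Applied to the inodes quoted above, the temp inode's 5-extent btree fork would be refused for the target inode, which has room for 6 extents in extent format.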
  5. 26 Apr, 2010 1 commit
  6. 25 Apr, 2010 7 commits
    • J
      Catch filesystems lacking s_bdi · 5129a469
      Committed by Jörn Engel
      noop_backing_dev_info is used only as a flag to mark filesystems that
      don't have any backing store, like tmpfs, procfs, spufs, etc.
      Signed-off-by: Joern Engel <joern@logfs.org>
      
      Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
      to the noop_backing_dev_info is not legal and will not result in
      them being flushed, but we already catch this condition in
      __mark_inode_dirty() when checking for a registered bdi.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      5129a469
    • P
      squashfs: fix potential buffer over-run on 4K block file systems · e0d1f700
      Committed by Phillip Lougher
      Sizing the buffer based on block size is incorrect, leading
      to a potential buffer over-run on 4K block size file systems
      (because the metadata block size is always 8K).  This bug
      doesn't seem to have triggered because 4K block size file systems
      are not default, and also because metadata blocks after
      compression tend to be less than 4K.
      Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
      e0d1f700
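The sizing problem described above can be illustrated with a small helper. This is only a sketch of the idea, not the actual squashfs patch: `METADATA_SIZE` and `read_buffer_size` are hypothetical names, and taking the larger of the two sizes is an assumed way of covering both cases.

```c
#include <assert.h>
#include <stddef.h>

/* Metadata blocks are always 8K, independent of the data block size. */
#define METADATA_SIZE 8192u

/*
 * Hypothetical helper: size a buffer that may hold either a data block
 * or a metadata block, so a 4K block size cannot under-size it.
 */
static size_t read_buffer_size(size_t block_size)
{
	assert(block_size > 0);
	return block_size > METADATA_SIZE ? block_size : METADATA_SIZE;
}
```

Sizing from the block size alone would give 4096 bytes on a 4K file system, which is too small for an uncompressed 8K metadata block.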
    • P
      squashfs: add missing buffer free · 370ec3d1
      Committed by Phillip Lougher
      Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
      370ec3d1
    • P
      squashfs: fix warn_on when root inode is corrupted · 1cb08e97
      Committed by Phillip Lougher
      Fix warn_on triggered by mounting a fsfuzzer corrupted file system, where
      the root inode has been corrupted.
      Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
      Reported-by: Steve Grubb <sgrubb@redhat.com>
      1cb08e97
    • A
      fs/block_dev.c: fix performance regression in O_DIRECT|O_SYNC writes to block devices · b8af67e2
      Committed by Anton Blanchard
      We are seeing a large regression in database performance on recent
      kernels.  The database opens a block device with O_DIRECT|O_SYNC and a
      number of threads write to different regions of the file at the same time.
      
      A simple test case is below.  I haven't defined DEVICE since getting it
      wrong will destroy your data :) On a 3 disk LVM with a 64k chunk size we
      see about 17MB/sec and only a few threads in IO wait:
      
      procs  -----io---- -system-- -----cpu------
       r  b     bi    bo   in   cs us sy id wa st
       0  3      0 16170  656 2259  0  0 86 14  0
       0  2      0 16704  695 2408  0  0 92  8  0
       0  2      0 17308  744 2653  0  0 86 14  0
       0  2      0 17933  759 2777  0  0 89 10  0
      
      Most threads are blocking in vfs_fsync_range, which has:
      
              mutex_lock(&mapping->host->i_mutex);
              err = fop->fsync(file, dentry, datasync);
              if (!ret)
                      ret = err;
              mutex_unlock(&mapping->host->i_mutex);
      
      commit 148f948b (vfs: Introduce new
      helpers for syncing after writing to O_SYNC file or IS_SYNC inode) offers
      some explanation of what is going on:
      
          Use these new helpers for syncing from generic VFS functions. This makes
          O_SYNC writes to block devices acquire i_mutex for syncing. If we really
          care about this, we can make block_fsync() drop the i_mutex and reacquire
          it before it returns.
      
      Thanks Jan for such a good commit message!  As well as dropping i_mutex,
      Christoph suggests we should remove the call to sync_blockdev():
      
      > sync_blockdev is an overcomplicated alias for filemap_write_and_wait on
      > the block device inode, which is exactly what we did just before calling
      > into ->fsync
      
      The patch below incorporates both suggestions. With it the testcase improves
      from 17MB/sec to 68MB/sec:
      
      procs  -----io---- -system-- -----cpu------
       r  b     bi    bo   in   cs us sy id wa st
       0  7      0 65536 1000 3878  0  0 70 30  0
       0 34      0 69632 1016 3921  0  1 46 53  0
       0 57      0 69632 1000 3921  0  0 55 45  0
       0 53      0 69640  754 4111  0  0 81 19  0
      
      Testcase:
      
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <pthread.h>
      #include <unistd.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      
      #define NR_THREADS 64
      #define BUFSIZE (64 * 1024)
      
      #define DEVICE "/dev/mapper/XXXXXX"
      
      #define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))
      
      static int fd;
      
      static void *doit(void *arg)
      {
      	unsigned long offset = (long)arg;
      	char *b, *buf;
      
      	b = malloc(BUFSIZE + 1024);
      	buf = (char *)ALIGN((unsigned long)b, 1024);
      	memset(buf, 0, BUFSIZE);
      
      	while (1)
      		pwrite(fd, buf, BUFSIZE, offset);
      }
      
      int main(int argc, char *argv[])
      {
      	int flags = O_RDWR|O_DIRECT;
      	int i;
      	unsigned long offset = 0;
      
      	if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
      		flags |= O_SYNC;
      
      	fd = open(DEVICE, flags);
      	if (fd == -1) {
      		perror("open");
      		exit(1);
      	}
      
      	for (i = 0; i < NR_THREADS-1; i++) {
      		pthread_t tid;
      		pthread_create(&tid, NULL, doit, (void *)offset);
      		offset += BUFSIZE;
      	}
      	doit((void *)offset);
      
      	return 0;
      }
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8af67e2
    • J
      reiserfs: fix corruption during shrinking of xattrs · fb2162df
      Committed by Jeff Mahoney
      Commit 48b32a35 ("reiserfs: use generic
      xattr handlers") introduced a problem that causes corruption when extended
      attributes are replaced with a smaller value.
      
      The issue is that the reiserfs_setattr to shrink the xattr file was moved
      from before the write to after the write.
      
      The root issue has always been in the reiserfs xattr code, but was papered
      over by the fact that in the shrink case, the file would just be expanded
      again while the xattr was written.
      
      The end result is that the last 8 bytes of xattr data are lost.
      
      This patch fixes it to use new_size.
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=14826
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reported-by: Christian Kujau <lists@nerdbynature.de>
      Tested-by: Christian Kujau <lists@nerdbynature.de>
      Cc: Edward Shishkin <edward.shishkin@gmail.com>
      Cc: Jethro Beekman <kernel@jbeekman.nl>
      Cc: Greg Surbey <gregsurbey@hotmail.com>
      Cc: Marco Gatti <marco.gatti@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb2162df
    • J
      reiserfs: fix permissions on .reiserfs_priv · cac36f70
      Committed by Jeff Mahoney
      Commit 677c9b2e ("reiserfs: remove
      privroot hiding in lookup") removed the magic from the lookup code to hide
      the .reiserfs_priv directory since it was getting loaded at mount-time
      instead.  The intent was that the entry would be hidden from the user via
      a poisoned d_compare, but this was faulty.
      
      This introduced a security issue where unprivileged users could access and
      modify extended attributes or ACLs belonging to other users, including
      root.
      
      This patch resolves the issue by properly hiding .reiserfs_priv.  This was
      the intent of the xattr poisoning code, but it appears to have never
      worked as expected.  This is fixed by using d_revalidate instead of
      d_compare.
      
      This patch makes -oexpose_privroot a no-op.  I'm fine leaving it this way.
      The effort involved in working out the corner cases wrt permissions and
      caching outweighs the benefit of the feature.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Acked-by: Edward Shishkin <edward.shishkin@gmail.com>
      Reported-by: Matt McCutchen <matt@mattmccutchen.net>
      Tested-by: Matt McCutchen <matt@mattmccutchen.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cac36f70
  7. 24 Apr, 2010 1 commit
  8. 23 Apr, 2010 1 commit
  9. 22 Apr, 2010 9 commits
  10. 21 Apr, 2010 4 commits
  11. 20 Apr, 2010 5 commits
    • T
      eCryptfs: Turn lower lookup error messages into debug messages · 9f37622f
      Committed by Tyler Hicks
      Vague warnings about ENAMETOOLONG errors when looking up an encrypted
      file name have caused many users to become concerned about their data.
      Since this is a rather harmless condition, I'm moving this warning to
      only be printed when the ecryptfs_verbosity module param is 1.
      Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
      9f37622f
    • T
      eCryptfs: Copy lower directory inode times and size on link · 3a8380c0
      Committed by Tyler Hicks
      The timestamps and size of a lower inode involved in a link() call were
      being copied to the upper parent inode.  Instead, we should be
      copying the lower parent inode's timestamps and size to the upper parent
      inode.  I discovered this bug using the POSIX test suite at Tuxera.
      Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
      3a8380c0
    • J
      ecryptfs: fix use with tmpfs by removing d_drop from ecryptfs_destroy_inode · 133b8f9d
      Committed by Jeff Mahoney
      Since tmpfs has no persistent storage, it pins all its dentries in memory
      so they have d_count=1 when other file systems would have d_count=0.
      ->lookup is only used to create new dentries. If the caller doesn't
      instantiate it, it's freed immediately at dput(). ->readdir reads
      directly from the dcache and depends on the dentries being hashed.
      
      When an ecryptfs mount is mounted, it associates the lower file and dentry
      with the ecryptfs files as they're accessed. When it's umounted and
      destroys all the in-memory ecryptfs inodes, it fput's the lower_files and
      d_drop's the lower_dentries. Commit 4981e081 added this and a d_delete in
      2008 and several months later commit caeeeecf removed the d_delete. I
      believe the d_drop() needs to be removed as well.
      
      The d_drop effectively hides any file that has been accessed via ecryptfs
      from the underlying tmpfs since it depends on it being hashed for it to
      be accessible. I've removed the d_drop on my development node and see no
      ill effects with basic testing on both tmpfs and persistent storage.
      
      As a side effect, after ecryptfs d_drops the dentries on tmpfs, tmpfs
      BUGs on umount. This is due to the dentries being unhashed.
      tmpfs->kill_sb is kill_litter_super which calls d_genocide to drop
      the reference pinning the dentry. It skips unhashed and negative dentries,
      but shrink_dcache_for_umount_subtree doesn't. Since those dentries
      still have an elevated d_count, we get a BUG().
      
      This patch removes the d_drop call and fixes both issues.
      
      This issue was reported at:
      https://bugzilla.novell.com/show_bug.cgi?id=567887
      Reported-by: Árpád Bíró <biroa@demasz.hu>
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Cc: Dustin Kirkland <kirkland@canonical.com>
      Cc: stable@kernel.org
      Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
      133b8f9d
    • C
      ecryptfs: fix error code for missing xattrs in lower fs · cfce08c6
      Committed by Christian Pulvermacher
      If the lower file system driver has extended attributes disabled,
      ecryptfs' own access functions return -ENOSYS instead of -EOPNOTSUPP.
      This breaks execution of programs in the ecryptfs mount, since the
      kernel expects the latter error when checking for security
      capabilities in xattrs.
      Signed-off-by: Christian Pulvermacher <pulvermacher@gmx.de>
      Cc: stable@kernel.org
      Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
      cfce08c6
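The error-code choice described above can be sketched in userspace. This is a hedged illustration, not the eCryptfs code: `struct lower_ops`, `sample_lower_getxattr`, and `upper_getxattr` are hypothetical names standing in for a stacked file system that forwards xattr lookups to a lower one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for a lower file system's xattr operations. */
struct lower_ops {
	ssize_t (*getxattr)(const char *name, void *value, size_t size);
};

/* Hypothetical lower implementation that reports an empty attribute. */
static ssize_t sample_lower_getxattr(const char *name, void *value, size_t size)
{
	(void)name;
	(void)value;
	(void)size;
	return 0;
}

/*
 * Forward the lookup to the lower file system. When the lower file system
 * has no xattr support, report -EOPNOTSUPP rather than -ENOSYS, since
 * callers probing for xattrs expect the former.
 */
static ssize_t upper_getxattr(const struct lower_ops *lower, const char *name,
			      void *value, size_t size)
{
	assert(lower != NULL);
	if (!lower->getxattr)
		return -EOPNOTSUPP;
	return lower->getxattr(name, value, size);
}
```

A caller that treats -EOPNOTSUPP as "no xattrs here, carry on" continues normally, whereas an unexpected -ENOSYS would be handled as a hard failure.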
    • T
      eCryptfs: Decrypt symlink target for stat size · 3a60a168
      Committed by Tyler Hicks
      Create a getattr handler for eCryptfs symlinks that is capable of
      reading the lower target and decrypting its path.  Prior to this patch,
      a stat's st_size field would represent the strlen of the encrypted path,
      while readlink() would return the strlen of the decrypted path.  This
      could lead to confusion in some userspace applications, since the two
      values should be equal.
      
      https://bugs.launchpad.net/bugs/524919
      Reported-by: Loïc Minier <loic.minier@canonical.com>
      Cc: stable@kernel.org
      Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
      3a60a168
  12. 18 Apr, 2010 1 commit
  13. 17 Apr, 2010 2 commits
    • D
      xfs: don't warn on EAGAIN in inode reclaim · f1d486a3
      Committed by Dave Chinner
      Any inode reclaim flush that returns EAGAIN will result in the inode
      reclaim being attempted again later. There is no need to issue a
      warning into the logs about this situation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      f1d486a3
    • D
      xfs: ensure that sync updates the log tail correctly · b6f8dd49
      Committed by Dave Chinner
      Updates to the VFS layer removed an extra ->sync_fs call into the
      filesystem during the sync process (from the quota code).
      Unfortunately the sync code was unknowingly relying on this call to
      make sure metadata buffers were flushed via a xfs_buftarg_flush()
      call to move the tail of the log forward in memory before the final
      transactions of the sync process were issued.
      
      As a result, the old code would write a very recent log tail value
      to the log by the end of the sync process, and so a subsequent crash
      would leave nothing for log recovery to do. Hence in qa test 182,
      log recovery only replayed a small handful of inode fsync
      transactions in this case.
      
      However, with the removal of the extra ->sync_fs call, the log tail
      was now not moved forward with the inode fsync transactions near the
      end of the sync process: the first (and only) buftarg flush occurred
      after these transactions went to disk. The result is that log
      recovery now sees a large number of transactions for metadata that
      is already on disk.
      
      This usually isn't a problem, but when the transactions include
      inode chunk allocation, the inode create transactions and all
      subsequent changes are replayed as we cannot rely on what is on disk
      being valid. As a result, if the inode was written and contains
      unlogged changes, the unlogged changes are lost, thereby violating
      sync semantics.
      
      The fix is to always issue a transaction after the buftarg flush
      occurs if the log is not idle or covered. This results in a dummy
      transaction being written that contains the up-to-date log tail
      value, which will be very recent. Indeed, it will be at least as
      recent as the old code would have left on disk, so log recovery
      will behave exactly as it used to in this situation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      b6f8dd49
  14. 16 Apr, 2010 2 commits
  15. 15 Apr, 2010 1 commit