1. 30 10月, 2008 2 次提交
    • D
      Inode: Allow external list initialisation · 8290c35f
      David Chinner 提交于
      To allow XFS to combine the XFS and linux inodes into a single
      structure, we need to drive inode lookup from the XFS inode cache,
      not the generic inode cache. This means that we need initialise a
      struct inode from a context outside alloc_inode() as it is no longer
      used by XFS.
      
      After inode allocation and initialisation, we need to add the inode
      to the superblock list, the in-use list, hash it and do some
      accounting. This all needs to be done with the inode_lock held and
      there are already several places in fs/inode.c that do this list
      manipulation.  Factor out the common code, add a locking wrapper and
      export the function so ti can be called from XFS.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NLachlan McIlroy <lachlan@sgi.com>
      8290c35f
    • D
      Inode: Allow external initialisers · 2cb1599f
      David Chinner 提交于
      To allow XFS to combine the XFS and linux inodes into a single
      structure, we need to drive inode lookup from the XFS inode cache,
      not the generic inode cache. This means that we need initialise a
      struct inode from a context outside alloc_inode() as it is no longer
      used by XFS.
      
      Factor and export the struct inode initialisation code from
      alloc_inode() to inode_init_always() as a counterpart to
      inode_init_once().  i.e. we have to call this init function for each
      inode instantiation (always), as opposed inode_init_once() which is
      only called on slab object instantiation (once).
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NLachlan McIlroy <lachlan@sgi.com>
      2cb1599f
  2. 23 10月, 2008 4 次提交
  3. 21 10月, 2008 9 次提交
  4. 09 10月, 2008 6 次提交
  5. 04 10月, 2008 1 次提交
    • J
      generic block based fiemap implementation · 68c9d702
      Josef Bacik 提交于
      Any block based fs (this patch includes ext3) just has to declare its own
      fiemap() function and then call this generic function with its own
      get_block_t. This works well for block based filesystems that will map
      multiple contiguous blocks at one time, but will work for filesystems that
      only map one block at a time, you will just end up with an "extent" for each
      block. One gotcha is this will not play nicely where there is hole+data
      after the EOF. This function will assume its hit the end of the data as soon
      as it hits a hole after the EOF, so if there is any data past that it will
      not pick that up. AFAIK no block based fs does this anyway, but its in the
      comments of the function anyway just in case.
      Signed-off-by: NJosef Bacik <jbacik@redhat.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: linux-fsdevel@vger.kernel.org
      68c9d702
  6. 09 10月, 2008 1 次提交
    • M
      vfs: vfs-level fiemap interface · c4b929b8
      Mark Fasheh 提交于
      Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap
      inode operation.
      
      Userspace can get extent information on a file via fiemap ioctl. As input,
      the fiemap ioctl takes a struct fiemap which includes an array of struct
      fiemap_extent (fm_extents). Size of the extent array is passed as
      fm_extent_count and number of extents returned will be written into
      fm_mapped_extents. Offset and length fields on the fiemap structure
      (fm_start, fm_length) describe a logical range which will be searched for
      extents. All extents returned will at least partially contain this range.
      The actual extent offsets and ranges returned will be unmodified from their
      offset and range on-disk.
      
      The fiemap ioctl returns '0' on success. On error, -1 is returned and errno
      is set. If errno is equal to EBADR, then fm_flags will contain those flags
      which were passed in which the kernel did not understand. On all other
      errors, the contents of fm_extents is undefined.
      
      As fiemap evolved, there have been many authors of the vfs patch. As far as
      I can tell, the list includes:
      Kalpak Shah <kalpak.shah@sun.com>
      Andreas Dilger <adilger@sun.com>
      Eric Sandeen <sandeen@redhat.com>
      Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      c4b929b8
  7. 04 10月, 2008 1 次提交
    • J
      nfsd: common grace period control · af558e33
      J. Bruce Fields 提交于
      Rewrite grace period code to unify management of grace period across
      lockd and nfsd.  The current code has lockd and nfsd cooperate to
      compute a grace period which is satisfactory to them both, and then
      individually enforce it.  This creates a slight race condition, since
      the enforcement is not coordinated.  It's also more complicated than
      necessary.
      
      Here instead we have lockd and nfsd each inform common code when they
      enter the grace period, and when they're ready to leave the grace
      period, and allow normal locking only after both of them are ready to
      leave.
      
      We also expect the locks_start_grace()/locks_end_grace() interface here
      to be simpler to build on for future cluster/high-availability work,
      which may require (for example) putting individual filesystems into
      grace, or enforcing grace periods across multiple cluster nodes.
      Signed-off-by: NJ. Bruce Fields <bfields@citi.umich.edu>
      af558e33
  8. 30 9月, 2008 1 次提交
    • T
      Configure out file locking features · bfcd17a6
      Thomas Petazzoni 提交于
      This patch adds the CONFIG_FILE_LOCKING option which allows to remove
      support for advisory locks. With this patch enabled, the flock()
      system call, the F_GETLK, F_SETLK and F_SETLKW operations of fcntl()
      and NFS support are disabled. These features are not necessarly needed
      on embedded systems. It allows to save ~11 Kb of kernel code and data:
      
         text          data     bss     dec     hex filename
      1125436        118764  212992 1457192  163c28 vmlinux.old
      1114299        118564  212992 1445855  160fdf vmlinux
       -11137    -200       0  -11337   -2C49 +/-
      
      This patch has originally been written by Matt Mackall
      <mpm@selenic.com>, and is part of the Linux Tiny project.
      Signed-off-by: NThomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: NMatt Mackall <mpm@selenic.com>
      Cc: matthew@wil.cx
      Cc: linux-fsdevel@vger.kernel.org
      Cc: mpm@selenic.com
      Cc: akpm@linux-foundation.org
      Signed-off-by: NJ. Bruce Fields <bfields@citi.umich.edu>
      bfcd17a6
  9. 29 7月, 2008 1 次提交
    • H
      vfs: pagecache usage optimization for pagesize!=blocksize · 8ab22b9a
      Hisashi Hifumi 提交于
      When we read some part of a file through pagecache, if there is a
      pagecache of corresponding index but this page is not uptodate, read IO
      is issued and this page will be uptodate.
      
      I think this is good for pagesize == blocksize environment but there is
      room for improvement on pagesize != blocksize environment.  Because in
      this case a page can have multiple buffers and even if a page is not
      uptodate, some buffers can be uptodate.
      
      So I suggest that when all buffers which correspond to a part of a file
      that we want to read are uptodate, use this pagecache and copy data from
      this pagecache to user buffer even if a page is not uptodate.  This can
      reduce read IO and improve system throughput.
      
      I wrote a benchmark program and got result number with this program.
      
      This benchmark do:
      
        1: mount and open a test file.
      
        2: create a 512MB file.
      
        3: close a file and umount.
      
        4: mount and again open a test file.
      
        5: pwrite randomly 300000 times on a test file.  offset is aligned
           by IO size(1024bytes).
      
        6: measure time of preading randomly 100000 times on a test file.
      
      The result was:
      	2.6.26
              330 sec
      
      	2.6.26-patched
              226 sec
      
      Arch:i386
      Filesystem:ext3
      Blocksize:1024 bytes
      Memory: 1GB
      
      On ext3/4, a file is written through buffer/block.  So random read/write
      mixed workloads or random read after random write workloads are optimized
      with this patch under pagesize != blocksize environment.  This test result
      showed this.
      
      The benchmark program is as follows:
      
      #include <stdio.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <time.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mount.h>
      
      #define LEN 1024
      #define LOOP 1024*512 /* 512MB */
      
      main(void)
      {
      	unsigned long i, offset, filesize;
      	int fd;
      	char buf[LEN];
      	time_t t1, t2;
      
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount\n");
      		exit(1);
      	}
      	memset(buf, 0, LEN);
      	fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
      	if (fd < 0) {
      		perror("cannot open file\n");
      		exit(1);
      	}
      	for (i = 0; i < LOOP; i++)
      		write(fd, buf, LEN);
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount\n");
      		exit(1);
      	}
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount\n");
      		exit(1);
      	}
      	fd = open("/root/test1/testfile", O_RDWR);
      	if (fd < 0) {
      		perror("cannot open file\n");
      		exit(1);
      	}
      
      	filesize = LEN * LOOP;
      	for (i = 0; i < 300000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pwrite(fd, buf, LEN, offset);
      	}
      	printf("start test\n");
      	time(&t1);
      	for (i = 0; i < 100000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pread(fd, buf, LEN, offset);
      	}
      	time(&t2);
      	printf("%ld sec\n", t2-t1);
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount\n");
      		exit(1);
      	}
      }
      Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ab22b9a
  10. 27 7月, 2008 10 次提交
  11. 26 7月, 2008 1 次提交
    • M
      locks: add special return value for asynchronous locks · bde74e4b
      Miklos Szeredi 提交于
      Use a special error value FILE_LOCK_DEFERRED to mean that a locking
      operation returned asynchronously.  This is returned by
      
        posix_lock_file() for sleeping locks to mean that the lock has been
        queued on the block list, and will be woken up when it might become
        available and needs to be retried (either fl_lmops->fl_notify() is
        called or fl_wait is woken up).
      
        f_op->lock() to mean either the above, or that the filesystem will
        call back with fl_lmops->fl_grant() when the result of the locking
        operation is known.  The filesystem can do this for sleeping as well
        as non-sleeping locks.
      
      This is to make sure, that return values of -EAGAIN and -EINPROGRESS by
      filesystems are not mistaken to mean an asynchronous locking.
      
      This also makes error handling in fs/locks.c and lockd/svclock.c slightly
      cleaner.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: David Teigland <teigland@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bde74e4b
  12. 25 7月, 2008 3 次提交
    • U
      flag parameters: NONBLOCK in pipe · be61a86d
      Ulrich Drepper 提交于
      This patch adds O_NONBLOCK support to pipe2.  It is minimally more involved
      than the patches for eventfd et.al but still trivial.  The interfaces of the
      create_write_pipe and create_read_pipe helper functions were changed and the
      one other caller as well.
      
      The following test must be adjusted for architectures other than x86 and
      x86-64 and in case the syscall numbers changed.
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      
      #ifndef __NR_pipe2
      # ifdef __x86_64__
      #  define __NR_pipe2 293
      # elif defined __i386__
      #  define __NR_pipe2 331
      # else
      #  error "need __NR_pipe2"
      # endif
      #endif
      
      int
      main (void)
      {
        int fds[2];
        if (syscall (__NR_pipe2, fds, 0) == -1)
          {
            puts ("pipe2(0) failed");
            return 1;
          }
        for (int i = 0; i < 2; ++i)
          {
            int fl = fcntl (fds[i], F_GETFL);
            if (fl == -1)
              {
                puts ("fcntl failed");
                return 1;
              }
            if (fl & O_NONBLOCK)
              {
                printf ("pipe2(0) set non-blocking mode for fds[%d]\n", i);
                return 1;
              }
            close (fds[i]);
          }
      
        if (syscall (__NR_pipe2, fds, O_NONBLOCK) == -1)
          {
            puts ("pipe2(O_NONBLOCK) failed");
            return 1;
          }
        for (int i = 0; i < 2; ++i)
          {
            int fl = fcntl (fds[i], F_GETFL);
            if (fl == -1)
              {
                puts ("fcntl failed");
                return 1;
              }
            if ((fl & O_NONBLOCK) == 0)
              {
                printf ("pipe2(O_NONBLOCK) does not set non-blocking mode for fds[%d]\n", i);
                return 1;
              }
            close (fds[i]);
          }
      
        puts ("OK");
      
        return 0;
      }
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Signed-off-by: NUlrich Drepper <drepper@redhat.com>
      Acked-by: NDavide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be61a86d
    • U
      flag parameters: pipe · ed8cae8b
      Ulrich Drepper 提交于
      This patch introduces the new syscall pipe2 which is like pipe but it also
      takes an additional parameter which takes a flag value.  This patch implements
      the handling of O_CLOEXEC for the flag.  I did not add support for the new
      syscall for the architectures which have a special sys_pipe implementation.  I
      think the maintainers of those archs have the chance to go with the unified
      implementation but that's up to them.
      
      The implementation introduces do_pipe_flags.  I did that instead of changing
      all callers of do_pipe because some of the callers are written in assembler.
      I would probably screw up changing the assembly code.  To avoid breaking code
      do_pipe is now a small wrapper around do_pipe_flags.  Once all callers are
      changed over to do_pipe_flags the old do_pipe function can be removed.
      
      The following test must be adjusted for architectures other than x86 and
      x86-64 and in case the syscall numbers changed.
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      
      #ifndef __NR_pipe2
      # ifdef __x86_64__
      #  define __NR_pipe2 293
      # elif defined __i386__
      #  define __NR_pipe2 331
      # else
      #  error "need __NR_pipe2"
      # endif
      #endif
      
      int
      main (void)
      {
        int fd[2];
        if (syscall (__NR_pipe2, fd, 0) != 0)
          {
            puts ("pipe2(0) failed");
            return 1;
          }
        for (int i = 0; i < 2; ++i)
          {
            int coe = fcntl (fd[i], F_GETFD);
            if (coe == -1)
              {
                puts ("fcntl failed");
                return 1;
              }
            if (coe & FD_CLOEXEC)
              {
                printf ("pipe2(0) set close-on-exit for fd[%d]\n", i);
                return 1;
              }
          }
        close (fd[0]);
        close (fd[1]);
      
        if (syscall (__NR_pipe2, fd, O_CLOEXEC) != 0)
          {
            puts ("pipe2(O_CLOEXEC) failed");
            return 1;
          }
        for (int i = 0; i < 2; ++i)
          {
            int coe = fcntl (fd[i], F_GETFD);
            if (coe == -1)
              {
                puts ("fcntl failed");
                return 1;
              }
            if ((coe & FD_CLOEXEC) == 0)
              {
                printf ("pipe2(O_CLOEXEC) does not set close-on-exit for fd[%d]\n", i);
                return 1;
              }
          }
        close (fd[0]);
        close (fd[1]);
      
        puts ("OK");
      
        return 0;
      }
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Signed-off-by: NUlrich Drepper <drepper@redhat.com>
      Acked-by: NDavide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed8cae8b
    • K
      fix soft lock up at NFS mount via per-SB LRU-list of unused dentries · da3bbdd4
      Kentaro Makita 提交于
      [Summary]
      
       Split LRU-list of unused dentries to one per superblock to avoid soft
       lock up during NFS mounts and remounting of any filesystem.
      
       Previously I posted here:
       http://lkml.org/lkml/2008/3/5/590
      
      [Descriptions]
      
      - background
      
        dentry_unused is a list of dentries which are not referenced.
        dentry_unused grows up when references on directories or files are
        released.  This list can be very long if there is huge free memory.
      
      - the problem
      
        When shrink_dcache_sb() is called, it scans all dentry_unused linearly
        under spin_lock(), and if dentry->d_sb is differnt from given
        superblock, scan next dentry.  This scan costs very much if there are
        many entries, and very ineffective if there are many superblocks.
      
        IOW, When we need to shrink unused dentries on one dentry, but scans
        unused dentries on all superblocks in the system.  For example, we scan
        500 dentries to unmount a filesystem, but scans 1,000,000 or more unused
        dentries on other superblocks.
      
        In our case , At mounting NFS*, shrink_dcache_sb() is called to shrink
        unused dentries on NFS, but scans 100,000,000 unused dentries on
        superblocks in the system such as local ext3 filesystems.  I hear NFS
        mounting took 1 min on some system in use.
      
      * : NFS uses virtual filesystem in rpc layer, so NFS is affected by
        this problem.
      
        100,000,000 is possible number on large systems.
      
        Per-superblock LRU of unused dentried can reduce the cost in
        reasonable manner.
      
      - How to fix
      
        I found this problem is solved by David Chinner's "Per-superblock
        unused dentry LRU lists V3"(1), so I rebase it and add some fix to
        reclaim with fairness, which is in Andrew Morton's comments(2).
      
        1) http://lkml.org/lkml/2006/5/25/318
        2) http://lkml.org/lkml/2006/5/25/320
      
        Split LRU-list of unused dentries to each superblocks.  Then, NFS
        mounting will check dentries under a superblock instead of all.  But
        this spliting will break LRU of dentry-unused.  So, I've attempted to
        make reclaim unused dentrins with fairness by calculate number of
        dentries to scan on this sb based on following way
      
        number of dentries to scan on this sb =
        count * (number of dentries on this sb / number of dentries in the machine)
      
      - ToDo
       - I have to measuring performance number and do stress tests.
      
       - When unmount occurs during prune_dcache(), scanning on same
        superblock, It is unable to reach next superblock because it is gone
        away.  We restart scannig superblock from first one, it causes
        unfairness of reclaim unused dentries on first superblock.  But I think
        this happens very rarely.
      
      - Test Results
      
        Result on 6GB boxes with excessive unused dentries.
      
      Without patch:
      
      $ cat /proc/sys/fs/dentry-state
      10181835        10180203        45      0       0       0
      # mount -t nfs 10.124.60.70:/work/kernel-src nfs
      real    0m1.830s
      user    0m0.001s
      sys     0m1.653s
      
       With this patch:
      $ cat /proc/sys/fs/dentry-state
      10236610        10234751        45      0       0       0
      # mount -t nfs 10.124.60.70:/work/kernel-src nfs
      real    0m0.106s
      user    0m0.002s
      sys     0m0.032s
      
      [akpm@linux-foundation.org: fix comments]
      Signed-off-by: NKentaro Makita <k-makita@np.css.fujitsu.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: David Chinner <dgc@sgi.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da3bbdd4