Commits · 56b26add02b4bdea81d5e0ebda60db1fe3311ad4 · openanolis / cloud-kernel

21 Oct, 2008 9 commits

[PATCH] kill the rest of struct file propagation in block ioctls · 56b26add

Authored by Al Viro on Sep 19, 2008

Now we can switch blkdev_ioctl() to block_device/mode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[PATCH] get rid of blkdev_driver_ioctl() · e436fdae

Authored by Al Viro on Sep 18, 2008

convert remaining callers to __blkdev_driver_ioctl()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[PATCH] sanitize blkdev_get() and friends · 572c4892

Authored by Al Viro on Oct 08, 2007

* get rid of fake struct file/struct dentry in __blkdev_get()
* merge __blkdev_get() and do_open()
* get rid of flags argument of blkdev_get()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[PATCH] propagate mode through open_bdev_excl/close_bdev_excl · 30c40d2c

Authored by Al Viro on Feb 22, 2008

replace open_bdev_excl/close_bdev_excl with variants taking fmode_t.
superblock gets the value used to mount it stored in sb->s_mode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[PATCH] pass fmode_t to blkdev_put() · 9a1c3542

Authored by Al Viro on Feb 22, 2008

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[PATCH] beginning of methods conversion · d4430d62

Authored by Al Viro on Mar 02, 2008

To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
	1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both.  That's this changeset.
	2) for each driver convert to new methods.  *ALL* drivers
are converted in this series.
	3) kill the old (renamed) methods.

Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of the old methods remains.  The only reason why
we do it this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.

New methods:
	open(bdev, mode)
	release(disk, mode)
	ioctl(bdev, mode, cmd, arg)		/* Called without BKL */
	compat_ioctl(bdev, mode, cmd, arg)
	locked_ioctl(bdev, mode, cmd, arg)	/* Called with BKL, legacy */
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

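As a rough illustration of the new method set listed in d4430d62, the sketch below models the operations table in user space. The stand-in types, the `int` return types, the argument types for `cmd` and `arg`, the name `bdev_ops_sketch`, and the dispatcher `driver_ioctl_sketch()` are all assumptions made for this example and are not taken from the patch; the BKL is represented only by a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in types so the sketch compiles in user space; the kernel's differ. */
struct block_device { int id; };
struct gendisk { int id; };
typedef unsigned int fmode_t;

/* Hypothetical table mirroring the method list in the commit message. */
struct bdev_ops_sketch {
	int (*open)(struct block_device *bdev, fmode_t mode);
	int (*release)(struct gendisk *disk, fmode_t mode);
	int (*ioctl)(struct block_device *bdev, fmode_t mode,
		     unsigned int cmd, unsigned long arg);
	int (*compat_ioctl)(struct block_device *bdev, fmode_t mode,
			    unsigned int cmd, unsigned long arg);
	int (*locked_ioctl)(struct block_device *bdev, fmode_t mode,
			    unsigned int cmd, unsigned long arg);
};

/*
 * Prefer the lock-free ioctl method; fall back to the legacy one, which
 * the kernel would call with the BKL held. Returns -ENOTTY when the
 * driver provides neither.
 */
static int driver_ioctl_sketch(const struct bdev_ops_sketch *ops,
			       struct block_device *bdev, fmode_t mode,
			       unsigned int cmd, unsigned long arg)
{
	assert(ops != NULL);

	if (ops->ioctl)
		return ops->ioctl(bdev, mode, cmd, arg);
	if (ops->locked_ioctl) {
		/* A real implementation would take the BKL around this call. */
		return ops->locked_ioctl(bdev, mode, cmd, arg);
	}
	return -ENOTTY;
}
```

Note that none of the methods takes a struct file any more; the caller's open mode travels as a plain `fmode_t` argument.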

[PATCH] move block_device_operations to blkdev.h · 08f85851

Authored by Al Viro on Oct 08, 2007

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[PATCH] eliminate use of ->f_flags in block methods · 86d434de

Authored by Al Viro on Aug 26, 2007

store needed information in f_mode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[PATCH] introduce fmode_t, do annotations · aeb5d727

Authored by Al Viro on Sep 02, 2008

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

09 Oct, 2008 6 commits

Adjust block device size after an online resize of a disk. · c3279d14

Authored by Andrew Patterson on Sep 04, 2008

The revalidate_disk routine now checks if a disk has been resized by
comparing the gendisk capacity to the bdev inode size. If they are
different (usually because the disk has been resized underneath the kernel)
the bdev inode size is adjusted to match the capacity.
Signed-off-by: Andrew Patterson <andrew.patterson@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Wrapper for lower-level revalidate_disk routines. · 0c002c2f

Authored by Andrew Patterson on Sep 04, 2008

This is a wrapper for the lower-level revalidate_disk call-backs such
as sd_revalidate_disk(). It allows us to perform pre and post
operations when calling them.

We will use this wrapper in a later patch to adjust block device sizes
after an online resize (a _post_ operation).
Signed-off-by: Andrew Patterson <andrew.patterson@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: adjust formatting for large minors and add ext_range sysfs attr · 1f014290

Authored by Tejun Heo on Aug 25, 2008

With extended minors and the soon-to-follow debug feature, large minor
numbers for block devices will be common.  This patch does the
following to make printouts pretty.

* Adapt print formats such that large minors don't break the
  formatting.

* For extended MAJ:MIN, %02x%02x for MAJ:MIN used in
  printk_all_partitions() doesn't cut it anymore.  Update it such that
  %03x:%05x is used if either MAJ or MIN doesn't fit in %02x.

* Implement ext_range sysfs attribute which shows total minors the
  device can use including both conventional minor space and the
  extended one.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

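The MAJ:MIN rule from 1f014290 can be sketched in user space as follows. This only illustrates the format selection described in the commit message and is not the kernel's code; `format_devt()` is a hypothetical name introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a kernel function): format a device number as
 * the commit message describes. The compact %02x%02x form is used when
 * both numbers fit in two hex digits, otherwise %03x:%05x.
 * Returns the number of characters snprintf() reports.
 */
static int format_devt(char *buf, size_t len,
		       unsigned int major, unsigned int minor)
{
	assert(buf != NULL && len > 0);

	if (major <= 0xff && minor <= 0xff)
		return snprintf(buf, len, "%02x%02x", major, minor);
	return snprintf(buf, len, "%03x:%05x", major, minor);
}
```

With these rules, major 8 and minor 1 format as `0801`, while major 259 and minor 70000 format as `103:11170`.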

Allow elevators to sort/merge discard requests · e17fc0a1

Authored by David Woodhouse on Aug 09, 2008

But blkdev_issue_discard() still emits requests which are interpreted as
soft barriers, because naïve callers might otherwise issue subsequent
writes to those same sectors, which might cross on the queue (if they're
reallocated quickly enough).

Callers still _can_ issue non-barrier discard requests, but they have to
take care of queue ordering for themselves.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Add BLKDISCARD ioctl to allow userspace to discard sectors · d30a2605

Authored by David Woodhouse on Aug 11, 2008

We may well want mkfs tools to use this to mark the whole device as
unwanted before they format it, for example.

The ioctl takes a pair of uint64_ts, which are start offset and length
in _bytes_. Although at the moment it might make sense for them both to
be in 512-byte sectors, I don't want to limit the ABI to that.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

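A minimal user-space sketch of calling the ioctl described in d30a2605, assuming the `BLKDISCARD` definition exported by `<linux/fs.h>`. The wrapper name `discard_range()` is hypothetical. Discarded data is gone, so this should only ever be tried on a scratch device.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* BLKDISCARD */

/*
 * Hypothetical wrapper: ask the kernel to discard `len` bytes starting at
 * byte offset `start` on the block device open as `fd`.
 * Returns 0 on success, -1 with errno set on failure (for example ENOTTY
 * when `fd` does not refer to a block device).
 */
static int discard_range(int fd, uint64_t start, uint64_t len)
{
	/* The ioctl argument is { start offset, length }, both in bytes. */
	uint64_t range[2] = { start, len };

	assert(fd >= 0);
	return ioctl(fd, BLKDISCARD, range);
}
```

A mkfs-style tool could call `discard_range(fd, 0, device_size_in_bytes)` before formatting, where the size variable is whatever the tool has already determined for the device.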

Add 'discard' request handling · fb2dce86

Authored by David Woodhouse on Aug 05, 2008

Some block devices benefit from a hint that they can forget the contents
of certain sectors. Add basic support for this to the block core, along
with a 'blkdev_issue_discard()' helper function which issues such
requests.

The caller doesn't get to provide an end_io function, since
blkdev_issue_discard() will automatically split the request up into
multiple bios if appropriate. Neither does the function wait for
completion -- it's expected that callers won't care about when, or even
_if_, the request completes. It's only a hint to the device anyway. By
definition, the file system doesn't _care_ about these sectors any more.

[With feedback from OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> and
Jens Axboe <jens.axboe@oracle.com>]
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

04 Oct, 2008 1 commit

generic block based fiemap implementation · 68c9d702

Authored by Josef Bacik on Oct 03, 2008

Any block based fs (this patch includes ext3) just has to declare its own
fiemap() function and then call this generic function with its own
get_block_t. This works well for block based filesystems that will map
multiple contiguous blocks at one time, but will also work for filesystems
that only map one block at a time; you will just end up with an "extent" for
each block. One gotcha is this will not play nicely where there is hole+data
after the EOF. This function will assume it has hit the end of the data as
soon as it hits a hole after the EOF, so if there is any data past that it
will not pick that up. AFAIK no block based fs does this anyway, but it's in
the comments of the function just in case.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org

09 Oct, 2008 1 commit

vfs: vfs-level fiemap interface · c4b929b8

Authored by Mark Fasheh on Oct 08, 2008

Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap
inode operation.

Userspace can get extent information on a file via fiemap ioctl. As input,
the fiemap ioctl takes a struct fiemap which includes an array of struct
fiemap_extent (fm_extents). Size of the extent array is passed as
fm_extent_count and number of extents returned will be written into
fm_mapped_extents. Offset and length fields on the fiemap structure
(fm_start, fm_length) describe a logical range which will be searched for
extents. All extents returned will at least partially contain this range.
The actual extent offsets and ranges returned will be unmodified from their
offset and range on-disk.

The fiemap ioctl returns '0' on success. On error, -1 is returned and errno
is set. If errno is equal to EBADR, then fm_flags will contain those flags
which were passed in which the kernel did not understand. On all other
errors, the contents of fm_extents is undefined.

As fiemap evolved, there have been many authors of the vfs patch. As far as
I can tell, the list includes:
Kalpak Shah <kalpak.shah@sun.com>
Andreas Dilger <adilger@sun.com>
Eric Sandeen <sandeen@redhat.com>
Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: linux-api@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org

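The input and output fields described in c4b929b8 can be exercised from user space roughly as follows. This is a sketch that assumes the `FS_IOC_FIEMAP` definition from `<linux/fs.h>` and the structures from `<linux/fiemap.h>`; the helper name `query_extents()` and its "at least one extent slot" precondition are choices made for this example only.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <linux/fiemap.h>	/* struct fiemap, struct fiemap_extent */

/*
 * Hypothetical helper: look up at most `max_extents` extents covering the
 * logical byte range [start, start + length) of the file open as `fd`.
 * Returns a calloc()ed struct fiemap that the caller must free(), or NULL
 * on allocation or ioctl failure (errno holds the reason).
 */
static struct fiemap *query_extents(int fd, uint64_t start, uint64_t length,
				    unsigned int max_extents)
{
	struct fiemap *fm;
	size_t size;

	assert(max_extents > 0);
	size = sizeof(*fm) + (size_t)max_extents * sizeof(struct fiemap_extent);

	fm = calloc(1, size);
	if (!fm)
		return NULL;

	fm->fm_start = start;			/* logical range to search */
	fm->fm_length = length;
	fm->fm_flags = 0;			/* no optional behaviour requested */
	fm->fm_extent_count = max_extents;	/* capacity of fm_extents[] */

	if (ioctl(fd, FS_IOC_FIEMAP, fm) == -1) {
		int saved = errno;		/* keep the ioctl error across free() */

		free(fm);
		errno = saved;
		return NULL;
	}
	/* The first fm->fm_mapped_extents entries of fm->fm_extents[] are valid. */
	return fm;
}
```

On success the caller walks `fm->fm_extents[0 .. fm->fm_mapped_extents - 1]`; a filesystem without a ->fiemap operation makes the ioctl fail, in which case the helper returns NULL.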

04 Oct, 2008 1 commit

nfsd: common grace period control · af558e33

Authored by J. Bruce Fields on Sep 06, 2007

Rewrite grace period code to unify management of grace period across
lockd and nfsd.  The current code has lockd and nfsd cooperate to
compute a grace period which is satisfactory to them both, and then
individually enforce it.  This creates a slight race condition, since
the enforcement is not coordinated.  It's also more complicated than
necessary.

Here instead we have lockd and nfsd each inform common code when they
enter the grace period, and when they're ready to leave the grace
period, and allow normal locking only after both of them are ready to
leave.

We also expect the locks_start_grace()/locks_end_grace() interface here
to be simpler to build on for future cluster/high-availability work,
which may require (for example) putting individual filesystems into
grace, or enforcing grace periods across multiple cluster nodes.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

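The coordination idea in af558e33, where each service reports entering and leaving grace and normal locking resumes only after all of them have left, can be modelled with a simple counter. This is a single-threaded user-space sketch with hypothetical names (`grace_state`, `start_grace()`, `end_grace()`, `in_grace()`); it does not reproduce the kernel's locks_start_grace()/locks_end_grace() implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: one shared record instead of per-service timers. */
struct grace_state {
	int managers_in_grace;	/* services that have not left grace yet */
};

/* A service (for example lockd or nfsd) announces it entered grace. */
static void start_grace(struct grace_state *gs)
{
	gs->managers_in_grace++;
}

/* A service announces it is ready to leave grace. */
static void end_grace(struct grace_state *gs)
{
	assert(gs->managers_in_grace > 0);
	gs->managers_in_grace--;
}

/* Normal locking is allowed only once every service has left grace. */
static bool in_grace(const struct grace_state *gs)
{
	return gs->managers_in_grace > 0;
}
```

Because both services consult the same state, neither can start granting normal locks while the other is still in its grace period, which is the race the commit message describes.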

30 Sep, 2008 1 commit

Configure out file locking features · bfcd17a6

Authored by Thomas Petazzoni on Aug 06, 2008

This patch adds the CONFIG_FILE_LOCKING option, which allows support for
advisory locks to be removed. With the option turned off, the flock()
system call, the F_GETLK, F_SETLK and F_SETLKW operations of fcntl()
and NFS support are disabled. These features are not necessarily needed
on embedded systems. It saves ~11 KB of kernel code and data:

   text    data     bss     dec     hex filename
1125436  118764  212992 1457192  163c28 vmlinux.old
1114299  118564  212992 1445855  160fdf vmlinux
 -11137    -200       0  -11337   -2C49 +/-

This patch has originally been written by Matt Mackall
<mpm@selenic.com>, and is part of the Linux Tiny project.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: matthew@wil.cx
Cc: linux-fsdevel@vger.kernel.org
Cc: mpm@selenic.com
Cc: akpm@linux-foundation.org
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

29 Jul, 2008 1 commit

vfs: pagecache usage optimization for pagesize!=blocksize · 8ab22b9a

Authored by Hisashi Hifumi on Jul 28, 2008

When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.

I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment.  Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.

So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate.  This can
reduce read IO and improve system throughput.

I wrote a benchmark program and measured the results with it.

This benchmark does the following:

  1: mount and open a test file.

  2: create a 512MB file.

  3: close a file and umount.

  4: mount and again open a test file.

  5: pwrite randomly 300000 times on a test file.  offset is aligned
     by IO size(1024bytes).

  6: measure time of preading randomly 100000 times on a test file.

The result was:
	2.6.26
        330 sec

	2.6.26-patched
        226 sec

Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB

On ext3/4, a file is written through buffer/block.  So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment.  This test result
showed this.

The benchmark program is as follows:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>

#define LEN 1024
#define LOOP (1024*512) /* 512MB */

int main(void)
{
	unsigned long i, offset, filesize;
	int fd;
	char buf[LEN];
	time_t t1, t2;

	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
		perror("cannot mount\n");
		exit(1);
	}
	memset(buf, 0, LEN);
	fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC, 0644);
	if (fd < 0) {
		perror("cannot open file\n");
		exit(1);
	}
	for (i = 0; i < LOOP; i++)
		write(fd, buf, LEN);
	close(fd);
	if (umount("/root/test1/") < 0) {
		perror("cannot umount\n");
		exit(1);
	}
	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
		perror("cannot mount\n");
		exit(1);
	}
	fd = open("/root/test1/testfile", O_RDWR);
	if (fd < 0) {
		perror("cannot open file\n");
		exit(1);
	}

	filesize = LEN * LOOP;
	for (i = 0; i < 300000; i++){
		offset = (random() % filesize) & (~(LEN - 1));
		pwrite(fd, buf, LEN, offset);
	}
	printf("start test\n");
	time(&t1);
	for (i = 0; i < 100000; i++){
		offset = (random() % filesize) & (~(LEN - 1));
		pread(fd, buf, LEN, offset);
	}
	time(&t2);
	printf("%ld sec\n", (long)(t2 - t1));
	close(fd);
	if (umount("/root/test1/") < 0) {
		perror("cannot umount\n");
		exit(1);
	}
	return 0;
}
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

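The check at the heart of 8ab22b9a, deciding whether every buffer that backs the requested part of a page is already uptodate, can be modelled in user space as below. This is a simplified sketch and not the kernel code: the page is represented by an array of per-buffer flags, and `MODEL_PAGE_SIZE` and `range_is_uptodate()` are names invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this model */

/*
 * `uptodate[i]` says whether the i-th block-sized buffer of a page holds
 * valid data. Returns true when every buffer overlapping the byte range
 * [from, from + count) is uptodate, so a read of that range could be
 * served from the page without issuing read I/O.
 */
static bool range_is_uptodate(const bool *uptodate, size_t blocksize,
			      size_t from, size_t count)
{
	assert(blocksize > 0 && MODEL_PAGE_SIZE % blocksize == 0);
	assert(count > 0 && from + count <= MODEL_PAGE_SIZE);

	size_t first = from / blocksize;		/* first buffer touched */
	size_t last = (from + count - 1) / blocksize;	/* last buffer touched */

	for (size_t i = first; i <= last; i++)
		if (!uptodate[i])
			return false;
	return true;
}
```

With 1024-byte blocks, a page whose second buffer is stale can still satisfy a read of bytes 2048 to 4095, which is the case the benchmark's 1024-byte random reads after random writes are meant to hit.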

27 Jul, 2008 10 commits

[PATCH] get rid of indirect users of namei.h · 3f8206d4

Authored by Al Viro on Jul 26, 2008

fs.h needs path.h, not namei.h; nfs_fs.h doesn't need it at all.
Several places in the tree needed direct include.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[PATCH] f_count may wrap around · 516e0cc5

Authored by Al Viro on Jul 26, 2008

make it atomic_long_t; while we are at it, get rid of useless checks in affs,
hfs and hpfs - ->open() always has it equal to 1, ->release() - to 0.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[PATCH] kill nameidata passing to permission(), rename to inode_permission() · f419a2e3

Authored by Al Viro on Jul 22, 2008

Incidentally, the name that gives hundreds of false positives on grep
is not a good idea...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[patch 1/4] vfs: utimes: move owner check into inode_change_ok() · 9767d749

Authored by Miklos Szeredi on Jul 01, 2008

Add a new ia_valid flag: ATTR_TIMES_SET, to handle the
UTIME_OMIT/UTIME_NOW and UTIME_NOW/UTIME_OMIT cases.  In these
cases neither ATTR_MTIME_SET nor ATTR_ATIME_SET is in the flags, yet
the POSIX draft specifies that permission checking is performed the
same way as if one or both of the times was explicitly set to a
timestamp.

See the patch "vfs: utimensat(): fix error checking for
{UTIME_NOW,UTIME_OMIT} case" by Michael Kerrisk for the patch
introducing this behavior.

This is a cleanup, as well as allowing filesystems (NFS/fuse/...) to
perform their own permission checking instead of the default.

CC: Ulrich Drepper <drepper@redhat.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[PATCH] fix MAY_CHDIR/MAY_ACCESS/LOOKUP_ACCESS mess · a110343f

Authored by Al Viro on Jul 17, 2008

* MAY_CHDIR is redundant - it's an equivalent of MAY_ACCESS
* MAY_ACCESS on fuse should affect only the last step of pathname resolution
* fchdir() and chroot() should pass MAY_ACCESS, for the same reason why
  chdir() needs that.
* now that we pass MAY_ACCESS explicitly in all cases, LOOKUP_ACCESS can be
  removed; it has no business being in nameidata.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[patch 5/5] vfs: remove mode parameter from vfs_symlink() · db2e747b

Authored by Miklos Szeredi on Jun 24, 2008

Remove the unused mode parameter from vfs_symlink and callers.

Thanks to Tetsuo Handa for noticing.

CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>

[patch 3/5] vfs: change remove_suid() to file_remove_suid() · 2f1936b8

Authored by Miklos Szeredi on Jun 24, 2008

All calls to remove_suid() are made with a file pointer, because
(similarly to file_update_time) it is called when the file is written.

Clean up callers by passing in a file instead of a dentry.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>

[PATCH] sanitize ->permission() prototype · e6305c43

Authored by Al Viro on Jul 15, 2008

* kill nameidata * argument; map the 3 bits in ->flags anybody cares
  about to new MAY_... ones and pass with the mask.
* kill redundant gfs2_iop_permission()
* sanitize ecryptfs_permission()
* fix remaining places where ->permission() instances might barf on new
  MAY_... found in mask.

The obvious next target in that direction is permission(9)

folded fix for nfs_permission() breakage from Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

[PATCH] reuse xxx_fifo_fops for xxx_pipe_fops · d2d9648e

Authored by Denys Vlasenko on Jul 01, 2008

Merge fifo and pipe file_operations.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

mm: spinlock tree_lock · 19fd6231

Authored by Nick Piggin on Jul 25, 2008

mapping->tree_lock has no read lockers.  convert the lock from an rwlock
to a spinlock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

26 Jul, 2008 1 commit

locks: add special return value for asynchronous locks · bde74e4b

Authored by Miklos Szeredi on Jul 25, 2008

Use a special error value FILE_LOCK_DEFERRED to mean that a locking
operation returned asynchronously.  This is returned by

  posix_lock_file() for sleeping locks to mean that the lock has been
  queued on the block list, and will be woken up when it might become
  available and needs to be retried (either fl_lmops->fl_notify() is
  called or fl_wait is woken up).

  f_op->lock() to mean either the above, or that the filesystem will
  call back with fl_lmops->fl_grant() when the result of the locking
  operation is known.  The filesystem can do this for sleeping as well
  as non-sleeping locks.

This is to make sure that return values of -EAGAIN and -EINPROGRESS by
filesystems are not mistaken to mean asynchronous locking.

This also makes error handling in fs/locks.c and lockd/svclock.c slightly
cleaner.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: David Teigland <teigland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

25 Jul, 2008 4 commits

flag parameters: NONBLOCK in pipe · be61a86d

Authored by Ulrich Drepper on Jul 23, 2008

This patch adds O_NONBLOCK support to pipe2.  It is minimally more involved
than the patches for eventfd et al. but still trivial.  The interfaces of the
create_write_pipe and create_read_pipe helper functions were changed, and the
one other caller was adjusted as well.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_pipe2
# ifdef __x86_64__
#  define __NR_pipe2 293
# elif defined __i386__
#  define __NR_pipe2 331
# else
#  error "need __NR_pipe2"
# endif
#endif

int
main (void)
{
  int fds[2];
  if (syscall (__NR_pipe2, fds, 0) == -1)
    {
      puts ("pipe2(0) failed");
      return 1;
    }
  for (int i = 0; i < 2; ++i)
    {
      int fl = fcntl (fds[i], F_GETFL);
      if (fl == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (fl & O_NONBLOCK)
        {
          printf ("pipe2(0) set non-blocking mode for fds[%d]\n", i);
          return 1;
        }
      close (fds[i]);
    }

  if (syscall (__NR_pipe2, fds, O_NONBLOCK) == -1)
    {
      puts ("pipe2(O_NONBLOCK) failed");
      return 1;
    }
  for (int i = 0; i < 2; ++i)
    {
      int fl = fcntl (fds[i], F_GETFL);
      if (fl == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((fl & O_NONBLOCK) == 0)
        {
          printf ("pipe2(O_NONBLOCK) does not set non-blocking mode for fds[%d]\n", i);
          return 1;
        }
      close (fds[i]);
    }

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

flag parameters: pipe · ed8cae8b

Authored by Ulrich Drepper on Jul 23, 2008

This patch introduces the new syscall pipe2 which is like pipe but it also
takes an additional parameter which takes a flag value.  This patch implements
the handling of O_CLOEXEC for the flag.  I did not add support for the new
syscall for the architectures which have a special sys_pipe implementation.  I
think the maintainers of those archs have the chance to go with the unified
implementation but that's up to them.

The implementation introduces do_pipe_flags.  I did that instead of changing
all callers of do_pipe because some of the callers are written in assembler.
I would probably screw up changing the assembly code.  To avoid breaking code
do_pipe is now a small wrapper around do_pipe_flags.  Once all callers are
changed over to do_pipe_flags the old do_pipe function can be removed.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_pipe2
# ifdef __x86_64__
#  define __NR_pipe2 293
# elif defined __i386__
#  define __NR_pipe2 331
# else
#  error "need __NR_pipe2"
# endif
#endif

int
main (void)
{
  int fd[2];
  if (syscall (__NR_pipe2, fd, 0) != 0)
    {
      puts ("pipe2(0) failed");
      return 1;
    }
  for (int i = 0; i < 2; ++i)
    {
      int coe = fcntl (fd[i], F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (coe & FD_CLOEXEC)
        {
          printf ("pipe2(0) set close-on-exec for fd[%d]\n", i);
          return 1;
        }
    }
  close (fd[0]);
  close (fd[1]);

  if (syscall (__NR_pipe2, fd, O_CLOEXEC) != 0)
    {
      puts ("pipe2(O_CLOEXEC) failed");
      return 1;
    }
  for (int i = 0; i < 2; ++i)
    {
      int coe = fcntl (fd[i], F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((coe & FD_CLOEXEC) == 0)
        {
          printf ("pipe2(O_CLOEXEC) does not set close-on-exec for fd[%d]\n", i);
          return 1;
        }
    }
  close (fd[0]);
  close (fd[1]);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fix soft lock up at NFS mount via per-SB LRU-list of unused dentries · da3bbdd4

Authored by Kentaro Makita on Jul 23, 2008

[Summary]

 Split LRU-list of unused dentries to one per superblock to avoid soft
 lock up during NFS mounts and remounting of any filesystem.

 Previously I posted here:
 http://lkml.org/lkml/2008/3/5/590

[Descriptions]

- background

  dentry_unused is a list of dentries which are not referenced.
  dentry_unused grows up when references on directories or files are
  released.  This list can be very long if there is huge free memory.

- the problem

  When shrink_dcache_sb() is called, it scans all of dentry_unused linearly
  under spin_lock(), and if dentry->d_sb is different from the given
  superblock, it moves on to the next dentry.  This scan is very costly if
  there are many entries, and very inefficient if there are many superblocks.

  IOW, we need to shrink the unused dentries of one superblock, but we scan
  the unused dentries of all superblocks in the system.  For example, we
  scan 500 dentries to unmount a filesystem, but also scan 1,000,000 or more
  unused dentries on other superblocks.

  In our case, when mounting NFS*, shrink_dcache_sb() is called to shrink
  unused dentries on NFS, but it scans 100,000,000 unused dentries on
  superblocks in the system such as local ext3 filesystems.  I hear NFS
  mounting took 1 min on some systems in use.

* : NFS uses a virtual filesystem in the rpc layer, so NFS is affected by
  this problem.

  100,000,000 is a possible number on large systems.

  A per-superblock LRU of unused dentries can reduce the cost in a
  reasonable manner.

- How to fix

  I found this problem is solved by David Chinner's "Per-superblock
  unused dentry LRU lists V3"(1), so I rebase it and add some fix to
  reclaim with fairness, which is in Andrew Morton's comments(2).

  1) http://lkml.org/lkml/2006/5/25/318
  2) http://lkml.org/lkml/2006/5/25/320

  Split the LRU list of unused dentries into one list per superblock.  Then
  NFS mounting will check the dentries under one superblock instead of all
  of them.  But this splitting will break the LRU ordering of dentry_unused.
  So, I've attempted to reclaim unused dentries fairly by calculating the
  number of dentries to scan on this sb in the following way:

  number of dentries to scan on this sb =
  count * (number of dentries on this sb / number of dentries in the machine)

- ToDo
 - I have to measure performance numbers and do stress tests.

 - When an unmount occurs during prune_dcache() while it is scanning the
  same superblock, it is unable to reach the next superblock because that
  superblock has gone away.  We restart scanning superblocks from the first
  one, which causes unfair reclaim of unused dentries on the first
  superblock.  But I think this happens very rarely.

- Test Results

  Result on 6GB boxes with excessive unused dentries.

Without patch:

$ cat /proc/sys/fs/dentry-state
10181835        10180203        45      0       0       0
# mount -t nfs 10.124.60.70:/work/kernel-src nfs
real    0m1.830s
user    0m0.001s
sys     0m1.653s

 With this patch:
$ cat /proc/sys/fs/dentry-state
10236610        10234751        45      0       0       0
# mount -t nfs 10.124.60.70:/work/kernel-src nfs
real    0m0.106s
user    0m0.002s
sys     0m0.032s

[akpm@linux-foundation.org: fix comments]
Signed-off-by: Kentaro Makita <k-makita@np.css.fujitsu.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Chinner <dgc@sgi.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

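The fairness formula quoted in da3bbdd4 needs a little care in integer arithmetic, because dividing the two dentry counts first would round the ratio down to zero. The sketch below multiplies before dividing; `dentries_to_scan()` is a hypothetical name, the kernel's actual computation may be organised differently, and the product is assumed to fit in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Share of a reclaim request that falls on one superblock:
 *   count * (unused dentries on this sb / unused dentries in the machine)
 * Assumes count * sb_unused does not overflow 64 bits.
 */
static uint64_t dentries_to_scan(uint64_t count, uint64_t sb_unused,
				 uint64_t total_unused)
{
	assert(sb_unused <= total_unused);

	if (total_unused == 0)
		return 0;	/* nothing to reclaim anywhere */

	/* Multiply first so integer division does not round the ratio to 0. */
	return count * sb_unused / total_unused;
}
```

For example, a request to scan 1000 dentries on a machine where this superblock holds 250 of 1000 unused dentries yields 250 for this superblock.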

move memory_read_from_buffer() from fs.h to string.h · e108526e

Authored by Akinobu Mita on Jul 23, 2008

James Bottomley warns that inclusion of linux/fs.h in a low level
driver was always a danger signal.  This patch moves
memory_read_from_buffer() from fs.h to string.h and fixes includes in
existing memory_read_from_buffer() users.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Bob Moore <robert.moore@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

15 Jul, 2008 1 commit

VFS: export sync_sb_inodes · 4ee6afd3

Authored by Artem Bityutskiy on May 07, 2008

This patch exports the 'sync_sb_inodes()' which is needed for
UBIFS because it has to force write-back from time to time.
Namely, the UBIFS budgeting subsystem forces write-back when
its pessimistic calculations show that there is no free
space on the media.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>

14 Jul, 2008 1 commit

file lock: reorder struct file_lock to save space on 64 bit builds · afc1246f

Authored by Richard Kennedy on Jul 11, 2008

Reduce sizeof struct file_lock by 8 on 64 bit builds allowing +1 objects
per slab in the file_lock_cache
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

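The kind of saving reported in afc1246f comes from alignment padding. The sketch below uses two made-up structures, not the real struct file_lock, to show how grouping the 4-byte members removes padding on a typical 64-bit ABI (8-byte pointers, 4-byte `unsigned int`) while changing nothing on a typical 32-bit one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: each unsigned int is followed by padding on LP64. */
struct lock_before {
	void *owner;
	unsigned int flags;	/* 4 bytes + 4 bytes padding on LP64 */
	void *file;
	unsigned int type;	/* 4 bytes + 4 bytes tail padding on LP64 */
};

/* Same members, reordered so the two unsigned ints share one 8-byte slot. */
struct lock_after {
	void *owner;
	void *file;
	unsigned int flags;
	unsigned int type;
};

/* Bytes saved per object by the reordering on the current ABI. */
static size_t bytes_saved(void)
{
	assert(sizeof(struct lock_after) <= sizeof(struct lock_before));
	return sizeof(struct lock_before) - sizeof(struct lock_after);
}
```

A smaller object can let the slab allocator fit one more object into each slab, which is the effect the commit message mentions for file_lock_cache.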

12 Jul, 2008 1 commit

vfs: export filemap_fdatawrite_range() · f4c0a0fd

Authored by Jan Kara on Jul 11, 2008

Make filemap_fdatawrite_range() function public, so that it can later
be used in ordered mode rewrite by JBD/JBD2.
Signed-off-by: Jan Kara <jack@suse.cz>

03 Jul, 2008 1 commit

Remove BKL from remote_llseek v2 · 9465efc9

Authored by Andi Kleen on Jun 27, 2008

- Replace remote_llseek with generic_file_llseek_unlocked (to force compilation
failures in all users)
- Change all users to either use generic_file_llseek_unlocked directly or
take the BKL around. I changed the file systems that don't use the BKL
for anything (CIFS, GFS) to call it directly. NCPFS, SMBFS and NFS
take the BKL, but explicitly in their own source now.

I moved them all over in a single patch to avoid unbisectable sections.

Open problem: 32bit kernels can corrupt fpos because its modification
is not atomic, but they can do that anyway because there are other paths that
modify it without the BKL.

Do we need a special lock for the pos/f_version = 0 checks?

Trond says the NFS BKL is likely not needed, but keep it for now
until his full audit.

v2: Use generic_file_llseek_unlocked instead of remote_llseek_unlocked
    and factor duplicated code (suggested by hch)

Cc: Trond.Myklebust@netapp.com
Cc: swhiteho@redhat.com
Cc: sfrench@samba.org
Cc: vandrove@vc.cvut.cz
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

01 Jul, 2008 1 commit

Properly notify block layer of sync writes · 18ce3751

Authored by Jens Axboe on Jul 01, 2008

fsync_buffers_list() and sync_dirty_buffer() both issue async writes and
then immediately wait on them. Conceptually, that makes them sync writes
and we should treat them as such so that the IO schedulers can handle
them appropriately.

This patch fixes a write starvation issue that Lin Ming reported, where
xx is stuck for more than 2 minutes because of a large number of
synchronous IO in the system:

INFO: task kjournald:20558 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
kjournald     D ffff810010820978  6712 20558      2
ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
Call Trace:
[<ffffffff803ba6f2>] kobject_get+0x12/0x17
[<ffffffff80247537>] getnstimeofday+0x2f/0x83
[<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
[<ffffffff8066d195>] io_schedule+0x5d/0x9f
[<ffffffff8029c1e7>] sync_buffer+0x3b/0x3f
[<ffffffff8066d3f0>] __wait_on_bit+0x40/0x6f
[<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
[<ffffffff8066d48b>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff80243909>] wake_bit_function+0x0/0x23
[<ffffffff8029e3ad>] sync_dirty_buffer+0x98/0xcb
[<ffffffff8030056b>] journal_commit_transaction+0x97d/0xcb6
[<ffffffff8023a676>] lock_timer_base+0x26/0x4b
[<ffffffff8030300a>] kjournald+0xc1/0x1fb
[<ffffffff802438db>] autoremove_wake_function+0x0/0x2e
[<ffffffff80302f49>] kjournald+0x0/0x1fb
[<ffffffff802437bb>] kthread+0x47/0x74
[<ffffffff8022de51>] schedule_tail+0x28/0x5d
[<ffffffff8020cac8>] child_rip+0xa/0x12
[<ffffffff80243774>] kthread+0x0/0x74
[<ffffffff8020cabe>] child_rip+0x0/0x12

Lin Ming confirms that this patch fixes the issue. I've run tests with
it for the past week and no ill effects have been observed, so I'm
proposing it for inclusion into 2.6.26.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
