提交 · c96e41e92b4aaf11e1f9775ecf0d1c8cbff829ed · openanolis / cloud-kernel

29 10月, 2010 2 次提交

beginning of transtion: ->mount() · c96e41e9

由 Al Viro 提交于 7月 25, 2010

eventual replacement for ->get_sb() - does *not* get vfsmount,
return ERR_PTR(error) or root of subtree to be mounted.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

c96e41e9

Fix compile brekage with !CONFIG_BLOCK · b31d42a5

由 Ingo Molnar 提交于 10月 28, 2010

Today's git tree fails to build on !CONFIG_BLOCK, due to upstream commit
367a51a3 ("fs: Add FITRIM ioctl"):

include/linux/fs.h:36: error: expected specifier-qualifier-list before ‘uint64_t’
include/linux/fs.h:36: error: expected specifier-qualifier-list before ‘uint64_t’
include/linux/fs.h:36: error: expected specifier-qualifier-list before ‘uint64_t’

The commit adds uint64_t type usage to fs.h, but linux/types.h is not included
explicitly - it's only included implicitly via linux/blk_types.h, and there only if
CONFIG_BLOCK is enabled.

Add the explicit #include to fix this.
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b31d42a5

28 10月, 2010 3 次提交

fs: Add FITRIM ioctl · 367a51a3

由 Lukas Czerner 提交于 10月 27, 2010

Adds an filesystem independent ioctl to allow implementation of file
system batched discard support. I takes fstrim_range structure as an
argument. fstrim_range is definec in the include/fs.h and its
definition is as follows.

struct fstrim_range {
	start;
	len;
	minlen;
}

start	- first Byte to trim
len	- number of Bytes to trim from start
minlen	- minimum extent length to trim, free extents shorter than this
	  number of Bytes will be ignored. This will be rounded up to fs
	  block size.

It is also possible to specify NULL as an argument. In this case the
arguments will set itself as follows:

start = 0;
len = ULLONG_MAX;
minlen = 0;

So it will trim the whole file system at one run.

After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.
Signed-off-by: NLukas Czerner <lczerner@redhat.com>
Reviewed-by: NDmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

367a51a3

fasync: re-organize fasync entry insertion to allow it under a spinlock · f7347ce4

由 Linus Torvalds 提交于 10月 27, 2010

You currently cannot use "fasync_helper()" in an atomic environment to
insert a new fasync entry, because it will need to allocate the new
"struct fasync_struct".

Yet fcntl_setlease() wants to call this under lock_flocks(), which is in
the process of being converted from the BKL to a spinlock.

In order to fix this, this abstracts out the actual fasync list
insertion and the fasync allocations into functions of their own, and
teaches fs/locks.c to pre-allocate the fasync_struct entry.  That way
the actual list insertion can happen while holding the required
spinlock.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
[bfields@redhat.com: rebase on top of my changes to Arnd's patch]
Tested-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>

f7347ce4

locks/nfsd: allocate file lock outside of spinlock · c5b1f0d9

由 Arnd Bergmann 提交于 10月 27, 2010

As suggested by Christoph Hellwig, this moves allocation
of new file locks out of generic_setlease into the
callers, nfs4_open_delegation and fcntl_setlease in order
to allow GFP_KERNEL allocations when lock_flocks has
become a spinlock.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NJ. Bruce Fields <bfields@redhat.com>

c5b1f0d9

27 10月, 2010 3 次提交

fs: allow for more than 2^31 files · 518de9b3

由 Eric Dumazet 提交于 10月 26, 2010

Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :

<quote>

We were seeing a failure which prevented boot.  The kernel was incapable
of creating either a named pipe or unix domain socket.  This comes down
to a common kernel function called unix_create1() which does:

        atomic_inc(&unix_nr_socks);
        if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                goto out;

The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().

        n = (mempages * (PAGE_SIZE / 1024)) / 10;
        files_stat.max_files = n;

In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000).  That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.

</quote>

Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of atomic_t.

get_max_files() is changed to return an unsigned long.  get_nr_files() is
changed to return a long.

unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.

Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968

After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704     0       2147483648
Reported-by: NRobin Holt <holt@sgi.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Acked-by: NDavid Miller <davem@davemloft.net>
Reviewed-by: NRobin Holt <holt@sgi.com>
Tested-by: NRobin Holt <holt@sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

518de9b3

IMA: explicit IMA i_flag to remove global lock on inode_delete · 196f5181

由 Eric Paris 提交于 10月 25, 2010

Currently for every removed inode IMA must take a global lock and search
the IMA rbtree looking for an associated integrity structure. Instead
we explicitly mark an inode when we add an integrity structure so we
only have to take the global lock and do the removal if it exists.
Signed-off-by: NEric Paris <eparis@redhat.com>
Acked-by: NMimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

196f5181

IMA: move read counter into struct inode · a178d202

由 Eric Paris 提交于 10月 25, 2010

IMA currently allocated an inode integrity structure for every inode in
core.  This stucture is about 120 bytes long.  Most files however
(especially on a system which doesn't make use of IMA) will never need
any of this space.  The problem is that if IMA is enabled we need to
know information about the number of readers and the number of writers
for every inode on the box.  At the moment we collect that information
in the per inode iint structure and waste the rest of the space.  This
patch moves those counters into the struct inode so we can eventually
stop allocating an IMA integrity structure except when absolutely
needed.

This patch does the minimum needed to move the location of the data.
Further cleanups, especially the location of counter updates, may still
be possible.
Signed-off-by: NEric Paris <eparis@redhat.com>
Acked-by: NMimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a178d202

26 10月, 2010 13 次提交

fs: inode split IO and LRU lists · 7ccf19a8

由 Nick Piggin 提交于 10月 21, 2010

The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7ccf19a8

fs: use percpu counter for nr_dentry and nr_dentry_unused · 312d3ca8

由 Christoph Hellwig 提交于 10月 10, 2010

The nr_dentry stat is a globally touched cacheline and atomic operation
twice over the lifetime of a dentry. It is used for the benfit of userspace
only. Turn it into a per-cpu counter and always decrement it in d_free instead
of doing various batching operations to reduce lock hold times in the callers.

Based on an earlier patch from Nick Piggin <npiggin@suse.de>.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

312d3ca8

fs: do not assign default i_ino in new_inode · 85fe4025

由 Christoph Hellwig 提交于 10月 23, 2010

Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves.  For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

85fe4025

new helper: ihold() · 7de9c6ee

由 Al Viro 提交于 10月 23, 2010

Clones an existing reference to inode; caller must already hold one.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7de9c6ee

fs: remove inode_add_to_list/__inode_add_to_list · 646ec461

由 Christoph Hellwig 提交于 10月 23, 2010

Split up inode_add_to_list/__inode_add_to_list.  Locking for the two
lists will be split soon so these helpers really don't buy us much
anymore.

The __ prefixes for the sb list helpers will go away soon, but until
inode_lock is gone we'll need them to distinguish between the locked
and unlocked variants.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

646ec461

fs: Implement lazy LRU updates for inodes · 9e38d86f

由 Nick Piggin 提交于 10月 23, 2010

Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic.  We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount contention
on LRU updates.

This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under it's own lock.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9e38d86f

fs: Convert nr_inodes and nr_unused to per-cpu counters · cffbc8aa

由 Dave Chinner 提交于 10月 23, 2010

The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.

Based on a patch originally from Eric Dumazet.

[AV: cleaned up a bit, fixed build breakage on weird configs
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

cffbc8aa

new helper: inode_unhashed() · 1d3382cb

由 Al Viro 提交于 10月 23, 2010

note: for race-free uses you inode_lock held
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

1d3382cb

A
unexport invalidate_inodes · a8dade34
由 Al Viro 提交于 10月 24, 2010
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
a8dade34

vfs: introduce FMODE_UNSIGNED_OFFSET for allowing negative f_pos · 4a3956c7

由 KAMEZAWA Hiroyuki 提交于 10月 01, 2010

Now, rw_verify_area() checsk f_pos is negative or not.  And if negative,
returns -EINVAL.

But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..  has
negative offsets.  And we can't do any access via read/write to the
file(device).

So introduce FMODE_UNSIGNED_OFFSET to allow negative file offsets.
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4a3956c7

fs: allow for more than 2^31 files · 7e360c38

由 Eric Dumazet 提交于 10月 05, 2010

Andrew,

Could you please review this patch, you probably are the right guy to
take it, because it crosses fs and net trees.

Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesnt
depend on previous patch (sysctl: fix min/max handling in
__do_proc_doulongvec_minmax())

Thanks !

[PATCH V4] fs: allow for more than 2^31 files

Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :

<quote>

We were seeing a failure which prevented boot.  The kernel was incapable
of creating either a named pipe or unix domain socket.  This comes down
to a common kernel function called unix_create1() which does:

        atomic_inc(&unix_nr_socks);
        if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                goto out;

The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().

        n = (mempages * (PAGE_SIZE / 1024)) / 10;
        files_stat.max_files = n;

In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000).  That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.

</quote>

Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of
atomic_t.

get_max_files() is changed to return an unsigned long.
get_nr_files() is changed to return a long.

unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.

Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968

After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704     0       2147483648
Reported-by: NRobin Holt <holt@sgi.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Acked-by: NDavid Miller <davem@davemloft.net>
Reviewed-by: NRobin Holt <holt@sgi.com>
Tested-by: NRobin Holt <holt@sgi.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7e360c38

fs: mark destroy_inode static · 56b0dacf

由 Christoph Hellwig 提交于 10月 06, 2010

Hugetlbfs used to need it, but after the destroy_inode and evict_inode
changes it's not required anymore.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

56b0dacf

fs: add sync_inode_metadata · c3765016

由 Christoph Hellwig 提交于 10月 06, 2010

Add a new helper to write out the inode using the writeback code,
that is including the correct dirty bit and list manipulation.  A few
of filesystems already opencode this, and a lot of others should be
using it instead of using write_inode_now which also writes out the
data.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

c3765016

05 10月, 2010 1 次提交

fs/locks.c: prepare for BKL removal · b89f4321

由 Arnd Bergmann 提交于 9月 18, 2010

This prepares the removal of the big kernel lock from the
file locking code. We still use the BKL as long as fs/lockd
uses it and ceph might sleep, but we can flip the definition
to a private spinlock as soon as that's done.
All users outside of fs/lockd get converted to use
lock_flocks() instead of lock_kernel() where appropriate.

Based on an earlier patch to use a spinlock from Matthew
Wilcox, who has attempted this a few times before, the
earliest patch from over 10 years ago turned it into
a semaphore, which ended up being slower than the BKL
and was subsequently reverted.

Someone should do some serious performance testing when
this becomes a spinlock, since this has caused problems
before. Using a spinlock should be at least as good
as the BKL in theory, but who knows...
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NMatthew Wilcox <willy@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org

b89f4321

22 9月, 2010 1 次提交

fs: {lock,unlock}_flocks() stubs to prepare for BKL removal · 8b15575c

由 Sage Weil 提交于 9月 21, 2010

The lock structs are currently protected by the BKL, but are accessed by
code in fs/locks.c and misc file system and DLM code.  These stubs will
allow all users to switch to the new interface before the implementation
is changed to a spinlock.
Acked-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NSage Weil <sage@newdream.net>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8b15575c

16 9月, 2010 1 次提交

libfs: use generic_file_llseek for simple_attr · 1ec5584e

由 Arnd Bergmann 提交于 8月 15, 2010

Simple attribute files need to be seekable to
allow resetting the file for another read.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>

1ec5584e

10 9月, 2010 4 次提交

ext3/ext4: Factor out disk addressability check · 30ca22c7

由 Patrick J. LoPresti 提交于 7月 22, 2010

As part of adding support for OCFS2 to mount huge volumes, we need to
check that the sector_t and page cache of the system are capable of
addressing the entire volume.

An identical check already appears in ext3 and ext4.  This patch moves
the addressability check into its own function in fs/libfs.c and
modifies ext3 and ext4 to invoke it.

[Edited to -EINVAL instead of BUG_ON() for bad blocksize_bits -- Joel]
Signed-off-by: NPatrick LoPresti <lopresti@gmail.com>
Cc: linux-ext4@vger.kernel.org
Acked-by: NAndreas Dilger <adilger@dilger.ca>
Signed-off-by: NJoel Becker <joel.becker@oracle.com>

30ca22c7

block: remove the BLKDEV_IFL_BARRIER flag · 8c555367

由 Christoph Hellwig 提交于 8月 18, 2010

Remove support for barriers on discards, which is unused now.  Also
remove the DISCARD_NOBARRIER I/O type in favour of just setting the
rw flags up locally in blkdev_issue_discard.

tj: Also remove DISCARD_SECURE and use REQ_SECURE directly.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

8c555367

block: remove the WRITE_BARRIER flag · 31725e65

由 Christoph Hellwig 提交于 8月 18, 2010

It's unused now.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

31725e65

block: implement REQ_FLUSH/FUA based interface for FLUSH/FUA requests · 4fed947c

由 Tejun Heo 提交于 9月 03, 2010

Now that the backend conversion is complete, export sequenced
FLUSH/FUA capability through REQ_FLUSH/FUA flags.  REQ_FLUSH means the
device cache should be flushed before executing the request.  REQ_FUA
means that the data in the request should be on non-volatile media on
completion.

Block layer will choose the correct way of implementing the semantics
and execute it.  The request may be passed to the device directly if
the device can handle it; otherwise, it will be sequenced using one or
more proxy requests.  Devices will never see REQ_FLUSH and/or FUA
which it doesn't support.

Also, unlike the original REQ_HARDBARRIER, REQ_FLUSH/FUA requests are
never failed with -EOPNOTSUPP.  If the underlying device doesn't
support FLUSH/FUA, the block layer simply make those noop.  IOW, it no
longer distinguishes between writeback cache which doesn't support
cache flush and writethrough/no cache.  Devices which have WB cache
w/o flush are very difficult to come by these days and there's nothing
much we can do anyway, so it doesn't make sense to require everyone to
implement -EOPNOTSUPP handling.  This will simplify filesystems and
block drivers as they can drop -EOPNOTSUPP retry logic for barriers.

* QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
  blk-flush.c.

* REQ_FLUSH w/o data can also be directly passed to drivers without
  sequencing but some drivers assume that zero length requests don't
  have rq->bio which isn't true for these requests requiring the use
  of proxy requests.

* REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
  copied from bio to request.

* WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
  WRITE_FLUSH_FUA are added.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

4fed947c

20 8月, 2010 1 次提交

kernel: __rcu annotations · 4d2deb40

由 Arnd Bergmann 提交于 2月 24, 2010

This adds annotations for RCU operations in core kernel components
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: NJosh Triplett <josh@joshtriplett.org>

4d2deb40

18 8月, 2010 4 次提交

fs: scale files_lock · 6416ccb7

由 Nick Piggin 提交于 8月 18, 2010

fs: scale files_lock

Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).

One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on with a new
variale in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from different cpu's list.

However loads with frequent removal of files imply short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.

A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
degenerates to contending on a single lock, which is no worse than before. When
more than one CPU are allocating files, even if they are always freed by
different CPUs, there will be more parallelism than the single-lock case.

Testing results:

On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.

Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

So a file is removed from the same CPU it was added by over 90% of the time.
It remains within the same node 95% of the time.

Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

                throughput
2.6.34-rc2      24.5
+patch          24.9

                us      sys     idle    IO wait (in %)
2.6.34-rc2      51.25   28.25   17.25   3.25
+patch          53.75   18.5    19      8.75

So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.

Single threaded performance difference was within the noise of microbenchmarks.
That is not to say penalty does not exist, the code is larger and more memory
accesses required so it will be slightly slower.

Cc: linux-kernel@vger.kernel.org
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: NNick Piggin <npiggin@kernel.dk>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

6416ccb7

tty: fix fu_list abuse · d996b62a

由 Nick Piggin 提交于 8月 18, 2010

tty: fix fu_list abuse

tty code abuses fu_list, which causes a bug in remount,ro handling.

If a tty device node is opened on a filesystem, then the last link to the inode
removed, the filesystem will be allowed to be remounted readonly. This is
because fs_may_remount_ro does not find the 0 link tty inode on the file sb
list (because the tty code incorrectly removed it to use for its own purpose).
This can result in a filesystem with errors after it is marked "clean".

Taking idea from Christoph's initial patch, allocate a tty private struct
at file->private_data and put our required list fields in there, linking
file and tty. This makes tty nodes behave the same way as other device nodes
and avoid meddling with the vfs, and avoids this bug.

The error handling is not trivial in the tty code, so for this bugfix, I take
the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
This is not a problem because our allocator doesn't fail small allocs as a rule
anyway. So proper error handling is left as an exercise for tty hackers.

[ Arguably filesystem's device inode would ideally be divorced from the
driver's pseudo inode when it is opened, but in practice it's not clear whether
that will ever be worth implementing. ]

Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: NNick Piggin <npiggin@kernel.dk>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d996b62a

fs: cleanup files_lock locking · ee2ffa0d

由 Nick Piggin 提交于 8月 18, 2010

fs: cleanup files_lock locking

Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
manipulate the per-sb files list; unexport the files_lock spinlock.

Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: NAndi Kleen <ak@linux.intel.com>
Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: NNick Piggin <npiggin@kernel.dk>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

ee2ffa0d

remove SWRITE* I/O types · 9cb569d6

由 Christoph Hellwig 提交于 8月 11, 2010

These flags aren't real I/O types, but tell ll_rw_block to always
lock the buffer instead of giving up on a failed trylock.

Instead add a new write_dirty_buffer helper that implements this semantic
and use it from the existing SWRITE* callers.  Note that the ll_rw_block
code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
this patch fixes.

In the ufs code clean up the helper that used to call ll_rw_block
to mirror sync_dirty_buffer, which is the function it implements for
compound buffers.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9cb569d6

14 8月, 2010 2 次提交

Mark arguments to certain syscalls as being const · c7887325

由 David Howells 提交于 8月 11, 2010

Mark arguments to certain system calls as being const where they should be but
aren't.  The list includes:

 (*) The filename arguments of various stat syscalls, execve(), various utimes
     syscalls and some mount syscalls.

 (*) The filename arguments of some syscall helpers relating to the above.

 (*) The buffer argument of various write syscalls.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c7887325

bkl: Remove locked .ioctl file operation · b19dd42f

由 Arnd Bergmann 提交于 7月 04, 2010

The last user is gone, so we can safely remove this
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Cc: John Kacur <jkacur@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>

b19dd42f

12 8月, 2010 1 次提交

block: add secure discard · 8d57a98c

由 Adrian Hunter 提交于 8月 11, 2010

Secure discard is the same as discard except that all copies of the
discarded sectors (perhaps created by garbage collection) must also be
erased.
Signed-off-by: NAdrian Hunter <adrian.hunter@nokia.com>
Acked-by: NJens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8d57a98c

11 8月, 2010 1 次提交

include/linux/fs.h: complete hexification of FMODE_* constants · 13bcbc00

由 Andrew Morton 提交于 8月 10, 2010

One straggler which was missed due to merge ordering issues.

Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

13bcbc00

10 8月, 2010 3 次提交

mm: implement writeback livelock avoidance using page tagging · f446daae

由 Jan Kara 提交于 8月 09, 2010

We try to avoid livelocks of writeback when some steadily creates dirty
pages in a mapping we are writing out.  For memory-cleaning writeback,
using nr_to_write works reasonably well but we cannot really use it for
data integrity writeback.  This patch tries to solve the problem.

The idea is simple: Tag all pages that should be written back with a
special tag (TOWRITE) in the radix tree.  This can be done rather quickly
and thus livelocks should not happen in practice.  Then we start doing the
hard work of locking pages and sending them to disk only for those pages
that have TOWRITE tag set.

Note: Adding new radix tree tag grows radix tree node from 288 to 296
bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
However, the number of slab/slub items per page remains the same (13 and 7
respectively).
Signed-off-by: NJan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f446daae

Fix sget() race with failing mount · 7a4dec53

由 Al Viro 提交于 8月 09, 2010

If sget() finds a matching superblock being set up, it'll
grab an active reference to it and grab s_umount.  That's
fine - we'll wait for completion of foofs_get_sb() that way.
However, if said foofs_get_sb() fails we'll end up holding
the halfway-created superblock.  deactivate_locked_super()
called by foofs_get_sb() will just unlock the sucker since
we are holding another active reference to it.

What we need is a way to tell if superblock has been successfully
set up.  Unfortunately, neither ->s_root nor the check for
MS_ACTIVE quite fit.  Cheap and easy way, suitable for backport:
new flag set by the (only) caller of ->get_sb().  If that flag
isn't present by the time sget() grabbed s_umount on preexisting
superblock it has found, it's seeing a stillborn and should
just bury it with deactivate_locked_super() (and repeat the search).

Longer term we want to set that flag in ->get_sb() instances (and
check for it to distinguish between "sget() found us a live sb"
and "sget() has allocated an sb, we need to set it up" in there,
instead of checking ->s_root as we do now).
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org

7a4dec53

pass a struct path to vfs_statfs · ebabe9a9

由 Christoph Hellwig 提交于 7月 07, 2010

We'll need the path to implement the flags field for statvfs support.
We do have it available in all callers except:

 - ecryptfs_statfs.  This one doesn't actually need vfs_statfs but just
   needs to do a caller to the lower filesystem statfs method.
 - sys_ustat.  Add a non-exported statfs_by_dentry helper for it which
   doesn't won't be able to fill out the flags field later on.

In addition rename the helpers for statfs vs fstatfs to do_*statfs instead
of the misleading vfs prefix.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

ebabe9a9

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功