Commits · bd9b51e79cb0b8bc00a7e0076a4a8963ca4a797c · openanolis / cloud-kernel

11 Dec, 2014 1 commit

make default ->i_fop have ->open() fail with ENXIO · bd9b51e7

Authored by Al Viro on Nov 18, 2014

As it is, default ->i_fop has NULL ->open() (along with all other methods).
The only case where it matters is reopening (via procfs symlink) a file that
didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned
to something sane (default would fail on read/write/ioctl/etc.).

	Unfortunately, such case exists - alloc_file() users, especially
anon_get_file() ones.  There we have tons of opened files of very different
kinds sharing the same inode.  As the result, attempt to reopen those via
procfs succeeds and you get a descriptor you can't do anything with.

	Moreover, in case of sockets we set ->i_fop that will only be used
on such reopen attempts - and put a failing ->open() into it to make sure
those do not succeed.

	It would be simpler to put such ->open() into default ->i_fop and leave
it unchanged both for anon inode (as we do anyway) and for socket ones.  Result:
	* everything going through do_dentry_open() works as it used to
	* sock_no_open() kludge is gone
	* attempts to reopen anon-inode files fail as they really ought to
	* ditto for aio_private_file()
	* ditto for perfmon - this one actually tried to imitate sock_no_open()
trick, but failed to set ->i_fop, so in the current tree reopens succeed and
yield completely useless descriptor.  Intent clearly had been to fail with
-ENXIO on such reopens; now it actually does.
	* everything else that used alloc_file() keeps working - it has ->i_fop
set for its inodes anyway
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
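
The effect described in this commit message can be pictured with a small user-space model. This is only a sketch: the structures are cut down to the one field that matters, and `model_reopen()` is a hypothetical stand-in for the reopen path, not kernel code. The names `no_open` and `default_fops` are chosen for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; not the real definitions. */
struct file;
struct inode;

struct file_operations {
	int (*open)(struct inode *, struct file *);
};

struct inode {
	const struct file_operations *i_fop;
};

struct file {
	const struct file_operations *f_op;
};

/* The failing ->open() that the default ->i_fop carries after this change. */
static int no_open(struct inode *inode, struct file *file)
{
	(void)inode;
	(void)file;
	return -ENXIO;
}

/* Default operations an inode keeps unless something assigns a real ->i_fop. */
static const struct file_operations default_fops = {
	.open = no_open,
};

/* Hypothetical reopen path: take ->f_op from ->i_fop, then call ->open(). */
static int model_reopen(struct inode *inode, struct file *file)
{
	file->f_op = inode->i_fop;
	if (file->f_op && file->f_op->open)
		return file->f_op->open(inode, file);
	return 0;	/* a NULL ->open() means the open just succeeds */
}
```

In this model an inode left with `default_fops` refuses a reopen with `-ENXIO`, while an inode whose operations have a NULL `->open()` (the old default) lets the reopen succeed and hands back a descriptor that cannot do anything.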

bd9b51e7

07 Jun, 2014 1 commit

ia64: convert use of typedef ctl_table to struct ctl_table · 2841efa6

Authored by Joe Perches on Jun 06, 2014

This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2841efa6

05 Mar, 2014 1 commit

ia64: Remove deprecated IRQF_DISABLED · 2958a489

Authored by Michael Opdenacker on Mar 04, 2014

This patch removes the IRQF_DISABLED flag from ia64 architecture
code. It's a NOOP since 2.6.35 and it will be removed one day.
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: paul.gortmaker@windriver.com
Cc: viro@zeniv.linux.org.uk
Cc: srivatsa.bhat@linux.vnet.ibm.com
Cc: andriy.shevchenko@linux.intel.com
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Link: http://lkml.kernel.org/r/1393964953-17002-1-git-send-email-michael.opdenacker@free-electrons.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

2958a489

16 Nov, 2013 1 commit

consolidate simple ->d_delete() instances · b26d4cd3

Authored by Al Viro on Oct 25, 2013

Rename simple_delete_dentry() to always_delete_dentry() and export it.
Export simple_dentry_operations, while we are at it, and get rid of
their duplicates
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b26d4cd3

04 Jun, 2013 1 commit

[IA64] perfmon: Use %*phD specifier to dump small buffers · 7451adc5

Authored by Andy Shevchenko on May 29, 2013

pfm_uuid_t value is defined as unsigned char [16]. Thus, we may dump its value
as byte buffer using %*phD specifier.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>

7451adc5

08 Apr, 2013 1 commit

ia64: Use generic idle loop · 91d591c3

Authored by Thomas Gleixner on Mar 21, 2013

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20130321215234.406851909@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

91d591c3

04 Mar, 2013 1 commit

fs: Limit sys_mount to only request filesystem modules. · 7f78e035

Authored by Eric W. Biederman on Mar 02, 2013

Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.

A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.

Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.

Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives.  Allowing simple, safe,
well understood work-arounds to known problematic software.

This also addresses a rare but unfortunate problem where the filesystem
name is not the same as its module name and module auto-loading
would not work.  While writing this patch I saw a handful of such
cases.  The most significant being autofs that lives in the module
autofs4.

This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.

After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module.  The common pattern in the kernel is to call request_module()
without regards to the users permissions.  In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesystem module unless the filesystem is mounted.  In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

7f78e035

23 Feb, 2013 1 commit

fs: Preserve error code in get_empty_filp(), part 2 · 39b65252

Authored by Anatol Pomozov on Sep 12, 2012

Allocating a file structure in function get_empty_filp() might fail because
of several reasons:
 - not enough memory for file structures
 - operation is not allowed
 - user is over its limit

Currently the function returns NULL in all cases and we lose the exact
reason of the error. All callers of get_empty_filp() assume that the function
can fail with ENFILE only.

Return error through pointer. Change all callers to preserve this error code.

[AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
(things remaining here deal with alloc_file()), removed pipe(2) behaviour change]
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
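
"Return error through pointer" refers to the kernel's error-pointer convention. The sketch below copies that convention into user space. `model_get_empty_filp()` and its two flag parameters are hypothetical, and the errno chosen for each failure is an assumption made for illustration.

```c
#include <errno.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* User-space copies of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errors. */
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

struct file { int unused; };

/* Hypothetical allocator: the flags stand in for two of the failure causes. */
static struct file *model_get_empty_filp(int over_limit, int denied)
{
	struct file *f;

	if (over_limit)
		return ERR_PTR(-ENFILE);
	if (denied)
		return ERR_PTR(-EPERM);
	f = calloc(1, sizeof(*f));
	if (!f)
		return ERR_PTR(-ENOMEM);
	return f;
}
```

A caller checks `IS_ERR(f)` and propagates `PTR_ERR(f)` instead of assuming that every failure means `-ENFILE`.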

39b65252

09 Oct, 2012 1 commit

mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9

Authored by Konstantin Khlebnikov on Oct 08, 2012

A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:

 | effect                 | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump      | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

This patch removes the reserved_vm counter from mm_struct.  Nobody seems
to care about it: it is not exported to userspace directly, it only
reduces the total_vm shown in proc.

Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.

[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
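
The replacement rule stated in this commit ("VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP") can be written as a flag rewrite. This is a sketch only: the bit values are placeholders for illustration, and `replace_vm_reserved()` is a hypothetical helper, since the patch edits each user by hand.

```c
/* Illustrative bit values for this sketch; the real ones live in include/linux/mm.h. */
#define VM_IO		0x00004000UL
#define VM_DONTEXPAND	0x00040000UL
#define VM_RESERVED	0x00080000UL
#define VM_DONTDUMP	0x04000000UL

/* Hypothetical helper: rewrite a flag word the way VM_RESERVED users are converted. */
static unsigned long replace_vm_reserved(unsigned long flags, int is_io_mapping)
{
	if (!(flags & VM_RESERVED))
		return flags;	/* nothing to convert */
	flags &= ~VM_RESERVED;
	return flags | (is_io_mapping ? VM_IO : (VM_DONTEXPAND | VM_DONTDUMP));
}
```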

314e51b9

27 Sep, 2012 3 commits

switch simple cases of fget_light to fdget · 2903ff01

Authored by Al Viro on Aug 28, 2012

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2903ff01

make get_file() return its argument · cb0942b8

Authored by Al Viro on Aug 27, 2012

simplifies a bunch of callers...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

cb0942b8

switch itanic perfmonctl(2) to fget_light() · 7456a29b

Authored by Al Viro on Aug 26, 2012

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7456a29b

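
Of the three commits dated 27 Sep, 2012, "make get_file() return its argument" (cb0942b8) is the one whose benefit is easiest to show. The model below is a sketch: a plain counter replaces the kernel's atomic one, and `struct vma_model` and `attach_file()` are hypothetical.

```c
struct file {
	long f_count;	/* the kernel uses an atomic counter here */
};

/* Model of the new get_file(): take a reference and hand the pointer back. */
static struct file *get_file(struct file *f)
{
	f->f_count++;
	return f;
}

struct vma_model {
	struct file *vm_file;
};

/* A caller can take the reference and store the pointer in one expression. */
static void attach_file(struct vma_model *vma, struct file *f)
{
	vma->vm_file = get_file(f);
}
```

Before the change the same caller needed two statements, one to bump the count and one to assign the pointer.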
21 Sep, 2012 1 commit

userns: On ia64 deal with current_uid and current_gid being kuid and kgid · 6c1ee033

Authored by Eric W. Biederman on Aug 07, 2012

These ia64 uses of current_uid and current_gid slipped through the
cracks when I was converting everything to kuids and kgids; convert
them now.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

6c1ee033

01 Aug, 2012 1 commit

mm: account the total_vm in the vm_stat_account() · 44de9d0c

Authored by Huang Shijie on Jul 31, 2012

vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now.
But we can also account for total_vm in the vm_stat_account() which makes
the code tidy.

Even for mprotect_fixup(), we can get the right result in the end.
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

44de9d0c

31 May, 2012 1 commit

ia64 perfmon: fix get_unmapped_area() use there · 4ad310b8

Authored by Al Viro on May 30, 2012

get_unmapped_area() returns -E... on failure, not 0.  Moreover, the
wrapper around it is completely pointless.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
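
The bug class fixed here is a failure check against the wrong sentinel. The sketch below assumes the usual kernel convention that an `unsigned long` result in the top 4095 values encodes a negative errno. `model_get_unmapped_area()` and `pick_address()` are hypothetical stand-ins.

```c
#include <errno.h>

#define MAX_ERRNO 4095

/* The same test the kernel's IS_ERR_VALUE() performs, written as a function. */
static int is_err_value(unsigned long x)
{
	return x >= (unsigned long)-MAX_ERRNO;
}

/* Stand-in for get_unmapped_area(): an address, or -errno as unsigned long. */
static unsigned long model_get_unmapped_area(int fail)
{
	return fail ? (unsigned long)-ENOMEM : 0x200000UL;
}

/* Returns 0 and stores the address on success, or the negative errno. */
static long pick_address(int fail, unsigned long *addr)
{
	unsigned long ret = model_get_unmapped_area(fail);

	if (is_err_value(ret))	/* a "ret == 0" test would miss this */
		return (long)ret;
	*addr = ret;
	return 0;
}
```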

4ad310b8

21 Apr, 2012 3 commits

kill mm argument of vm_munmap() · bfce281c

Authored by Al Viro on Apr 20, 2012

it's always current->mm
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

bfce281c

perfmon: kill some helpers and arguments · 9f3a4afb

Authored by Al Viro on Apr 20, 2012

pfm_vm_munmap() is simply vm_munmap() and pfm_remove_smpl_mapping()
always get current as the first argument.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9f3a4afb

VM: add "vm_munmap()" helper function · a46ef99d

Authored by Linus Torvalds on Apr 20, 2012

Like the vm_brk() function, this is the same as "do_munmap()", except it
does the VM locking for the caller.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a46ef99d

29 Mar, 2012 1 commit

Disintegrate asm/system.h for IA64 · c140d879

Authored by David Howells on Mar 28, 2012

Disintegrate asm/system.h for IA64.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Tony Luck <tony.luck@intel.com>
cc: linux-ia64@vger.kernel.org

c140d879

04 Jan, 2012 1 commit

vfs: for usbfs, etc. internal vfsmounts ->mnt_sb->s_root == ->mnt_root · 4c1d5a64

Authored by Al Viro on Dec 07, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4c1d5a64

14 Jan, 2011 1 commit

[IA64] fix build error - arch/ia64/kernel/perfmon.c · 09579770

Authored by Tony Luck on Jan 13, 2011

arch/ia64/kernel/perfmon.c:621: error: duplicate 'static'

Introduced by commit c74a1cbb

    pass default dentry_operations to mount_pseudo()
Signed-off-by: Tony Luck <tony.luck@intel.com>

09579770

13 Jan, 2011 1 commit

pass default dentry_operations to mount_pseudo() · c74a1cbb

Authored by Al Viro on Jan 12, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c74a1cbb

07 Jan, 2011 3 commits

fs: scale mntget/mntput · b3e19d92

Authored by Nick Piggin on Jan 07, 2011

The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.

The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.

We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.

- check the global sum once every interval (this will delay zero detection
  for some interval, so it's probably a showstopper for vfsmounts).

- keep a local count and only taking the global sum when local reaches 0 (this
  is difficult for vfsmounts, because we can't hold preempt off for the life of
  a reference, so a counter would need to be per-thread or tied strongly to a
  particular CPU which requires more locking).

- keep a local difference of increments and decrements, which allows us to sum
  the total difference and hence find the refcount when summing all CPUs. Then,
  keep a single integer "long" refcount for slow and long lasting references,
  and only take the global sum of local counters when the long refcount is 0.

This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.

This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.

This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
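
The third scheme in this commit message (per-CPU differences plus one "long" refcount) can be modelled in a few lines. This is a single-threaded sketch with no locking or real per-CPU storage. `NR_CPUS`, `struct mount_model` and both functions are hypothetical names.

```c
#define NR_CPUS 4	/* arbitrary for the sketch */

struct mount_model {
	int pcpu_count[NR_CPUS];	/* per-CPU difference of gets and puts */
	int longrefs;			/* slow, long-lasting references */
};

/* Fast path: a get only touches the local CPU's counter. */
static void mntget_model(struct mount_model *m, int cpu)
{
	m->pcpu_count[cpu]++;
}

/* Returns 1 if this put dropped the last reference, 0 otherwise. */
static int mntput_model(struct mount_model *m, int cpu)
{
	int sum = 0;

	m->pcpu_count[cpu]--;
	if (m->longrefs)	/* common case: still attached, skip the global sum */
		return 0;
	for (int i = 0; i < NR_CPUS; i++)
		sum += m->pcpu_count[i];
	return sum == 0;
}
```

A get on one CPU and the matching put on another leave +1 and -1 in two different slots. Only the sum over all CPUs is meaningful, and it is computed only once the long refcount has reached zero.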

b3e19d92

fs: dcache reduce branches in lookup path · fb045adb

Authored by Nick Piggin on Jan 07, 2011

Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.

Patched with:

git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
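
The idea of caching "which d_ops are set" in the dentry flags can be sketched as follows. The flag values and the simplified callback signatures are placeholders for illustration, not the kernel's definitions. `always_delete()` exists only to give the example a non-NULL callback.

```c
#include <stddef.h>

/* Illustrative flag values; the kernel's DCACHE_OP_* constants differ. */
#define DCACHE_OP_HASH		0x1u
#define DCACHE_OP_COMPARE	0x2u
#define DCACHE_OP_REVALIDATE	0x4u
#define DCACHE_OP_DELETE	0x8u

struct dentry;

/* Simplified signatures; the real callbacks take more arguments. */
struct dentry_operations {
	int (*d_revalidate)(struct dentry *);
	int (*d_hash)(const struct dentry *);
	int (*d_compare)(const struct dentry *);
	int (*d_delete)(const struct dentry *);
};

struct dentry {
	unsigned int d_flags;
	const struct dentry_operations *d_op;
};

/* Set ->d_op and record which methods exist as flag bits. */
static void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op)
{
	dentry->d_op = op;
	if (!op)
		return;
	if (op->d_hash)
		dentry->d_flags |= DCACHE_OP_HASH;
	if (op->d_compare)
		dentry->d_flags |= DCACHE_OP_COMPARE;
	if (op->d_revalidate)
		dentry->d_flags |= DCACHE_OP_REVALIDATE;
	if (op->d_delete)
		dentry->d_flags |= DCACHE_OP_DELETE;
}

/* Lookup path: one flag test instead of loading d_op and then the method. */
static int needs_revalidate(const struct dentry *dentry)
{
	return (dentry->d_flags & DCACHE_OP_REVALIDATE) != 0;
}

/* Sample callback used only to populate one slot. */
static int always_delete(const struct dentry *dentry)
{
	(void)dentry;
	return 1;
}
```

This also explains the sed script in the commit message: every direct assignment to `->d_op` has to go through `d_set_d_op()` so that the flags stay in sync with the operations table.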

fb045adb

fs: change d_delete semantics · fe15ce44

Authored by Nick Piggin on Jan 07, 2011

Change d_delete from a dentry deletion notification to a dentry caching
advice, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.

This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

fe15ce44

29 Dec, 2010 1 commit

[IA64] perfmon: Change vmalloc to vzalloc and drop memset. · e21763db

Authored by Jesper Juhl on Oct 30, 2010

vzalloc() nicely zeroes memory for us, so we don't have to do a vmalloc()
and then manually memset() the returned memory when all we want is for it
to be zero. Patch changes this for pfm_rvmalloc().
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Tony Luck <tony.luck@intel.com>
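
The same simplification exists in user space, which makes for a convenient analogue: `malloc()` followed by `memset()` plays the role of `vmalloc()` plus `memset()`, and `calloc()` plays the role of `vzalloc()`. The two helper names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Before: allocate, then clear by hand. */
static void *alloc_then_clear(size_t size)
{
	void *mem = malloc(size);

	if (mem)
		memset(mem, 0, size);
	return mem;
}

/* After: a single call that returns zeroed memory. */
static void *alloc_zeroed(size_t size)
{
	return calloc(1, size);
}
```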

e21763db

29 Oct, 2010 1 commit

convert get_sb_pseudo() users · 51139ada

Authored by Al Viro on Jul 25, 2010

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

51139ada

24 Sep, 2010 1 commit

[IA64] Remove unnecessary casts of private_data in perfmon.c · df0a59a1

Authored by Joe Perches on Jul 12, 2010

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>

df0a59a1

11 Aug, 2010 1 commit

ia64: perfmon: add d_dname method · 7ae6bdbd

Authored by Miklos Szeredi on Aug 10, 2010

Switch ia64/perfmon to using the d_dname() instead of relying on
__d_path() to prepend the name of the root dentry to the path.

CC: Tony Luck <tony.luck@intel.com>
CC: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7ae6bdbd

22 Jul, 2010 1 commit

ia64/perfmon: Convert to unlocked_ioctl · 29f367bf

Authored by Arnd Bergmann on Jul 04, 2010

The ioctl function in this driver does not
do anything that requires the BKL, so make
it use unlocked_ioctl.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-ia64@vger.kernel.org
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

29f367bf

07 Jul, 2010 1 commit

[IA64] perfmon: convert to unlocked_ioctl · ba58aebf

Authored by Arnd Bergmann on Jul 04, 2010

The ioctl function in this driver does not
do anything that requires the BKL, so make
it use unlocked_ioctl.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>

ba58aebf

30 Mar, 2010 1 commit

include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6

Authored by Tejun Heo on Mar 24, 2010

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   widely available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

5a0e3ad6

07 Mar, 2010 1 commit

mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930

Authored by Rik van Riel on Mar 05, 2010

The old anon_vma code can lead to scalability issues with heavily forking
workloads.  Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes.  However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock.  This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands.  Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA.  At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated.  The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.

This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
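
One way to read the "O(N) to 2" figure: if a fraction 1/N of the pages still costs N to scan and the remainder costs 1, the average is (1/N)·N + (1 − 1/N)·1 = 2 − 1/N. The function below just evaluates that expression. It is an interpretation of the paragraph above, not something taken from the patch, and `average_scan_cost()` is a hypothetical name.

```c
/*
 * Average rmap scan cost per page when a fraction 1/n of the pages costs n
 * (still reachable through the parent's anon_vma) and the rest cost 1.
 */
static double average_scan_cost(double n)
{
	return (1.0 / n) * n + (1.0 - 1.0 / n) * 1.0;
}
```

For the 1000-process example in the message this comes to just under 2.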

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations.  This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures.  This in
turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock.  To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time.  The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5beb4930

27 Feb, 2010 1 commit

[IA64] remove trailing space in messages · 04157e4c

Authored by Frans Pop on Feb 06, 2010

ia64 parts of system wide cleanup to drop trailing whitespace
from lines in message strings.
Signed-off-by: Frans Pop <elendil@planet.nl>
Signed-off-by: Tony Luck <tony.luck@intel.com>

04157e4c

07 Jan, 2010 1 commit

[IA64] use helpers for rlimits · 02b763b8

Authored by Jiri Slaby on Jan 06, 2010

Make sure compiler won't do weird things with limits. E.g. fetching
them twice may return 2 different values after writable limits are
implemented.

I.e. either use rlimit helpers added in
3e10e716
or ACCESS_ONCE if not applicable.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
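
The "fetching them twice" hazard in this commit is avoided by reading the limit once into a local variable. The sketch below uses a kernel-style `ACCESS_ONCE()` macro written with `__typeof__` (a GCC/Clang extension). `struct rlimit_model` and `within_limit()` are hypothetical.

```c
/* Force exactly one read of x by going through a volatile-qualified lvalue. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct rlimit_model {
	unsigned long rlim_cur;
};

/* Read the limit once, then use only the local copy for both checks. */
static int within_limit(struct rlimit_model *lim, unsigned long want)
{
	unsigned long cur = ACCESS_ONCE(lim->rlim_cur);

	return want != 0 && want <= cur;
}
```

Without the single read, a compiler would be free to reload `lim->rlim_cur` between the checks, and a concurrent writer could make them disagree.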

02b763b8

17 Dec, 2009 1 commit

switch alloc_file() to passing struct path · 2c48b9c4

Authored by Al Viro on Aug 09, 2009

... and have the caller grab both mnt and dentry; kill
leak in infiniband, while we are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2c48b9c4

04 Dec, 2009 1 commit

tree-wide: fix assorted typos all over the place · af901ca1

Authored by André Goddard Rosa on Nov 14, 2009

That is "success", "unknown", "through", "performance", "[re|un]mapping"
, "access", "default", "reasonable", "[con]currently", "temperature"
, "channel", "[un]used", "application", "example","hierarchy", "therefore"
, "[over|under]flow", "contiguous", "threshold", "enough" and others.
Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

af901ca1

19 Nov, 2009 1 commit

sysctl: Drop & in front of every proc_handler. · 6d456111

Authored by Eric W. Biederman on Nov 16, 2009

For consistency drop & in front of every proc_handler.  Explicitly
taking the address is unnecessary and it prevents optimizations
like stubbing the proc_handlers to NULL.

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
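
In C a function name used as an initializer already decays to a pointer, so the `&` adds nothing. It also breaks if the name is ever defined as a macro expanding to NULL. The sketch below uses a cut-down `struct ctl_table` and a hypothetical handler with a simplified signature.

```c
/* Simplified handler type; the real proc_handler takes more arguments. */
typedef int proc_handler(int write);

struct ctl_table {
	const char *procname;
	proc_handler *proc_handler;
};

/* Hypothetical handler used by both table entries. */
static int proc_dointvec_model(int write)
{
	return write ? 1 : 0;
}

/* Both initializers store the same function pointer. */
static struct ctl_table with_amp = { "example", &proc_dointvec_model };
static struct ctl_table without_amp = { "example", proc_dointvec_model };
```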

6d456111

12 Nov, 2009 1 commit

sysctl ia64: Remove dead binary sysctl support · d00faf81

Authored by Eric W. Biederman on Apr 03, 2009

Now that sys_sysctl is a generic wrapper around /proc/sys  .ctl_name
and .strategy members of sysctl tables are dead code.  Remove them.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

d00faf81

01 Jul, 2009 1 commit

[IA64] address compiler warnings perfmon.c/salinfo.c · fa276f36

Authored by Jan Beulich on Jun 30, 2009

perfmon.c has a dubious cast directly from "int" to "void *". Add
an intermediate cast to "long" to keep gcc happy.

salinfo.c uses "down_trylock()" in a highly creative way (explained
in the comments in the file) ... but it does kick out this warning:

 arch/ia64/kernel/salinfo.c:195: warning: ignoring return value of 'down_trylock'

which people occasionally try to "fix" in ways that do not work. Use some
casts to keep gcc quiet.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
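
The perfmon.c half of this fix is the familiar "widen before converting to a pointer" idiom. On an LP64 target such as ia64, `int` is 32 bits and pointers are 64 bits, so gcc warns about a direct cast between them. The two helper names below are hypothetical.

```c
/* Widen to long first so the integer matches the pointer width. */
static void *int_to_cookie(int value)
{
	return (void *)(long)value;
}

/* Reverse direction: narrow through long back to int. */
static int cookie_to_int(void *cookie)
{
	return (int)(long)cookie;
}
```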

fa276f36

openanolis / cloud-kernel synced successfully about 2 years ago
