1. 23 Oct 2008, 1 commit
    • PCI: add support for function level reset · 8dd7f803
      Committed by Sheng Yang
      Sometimes it's necessary for software to be able to quiesce and reset
      endpoint hardware with function-level granularity, so provide support
      for it.
      
      The patch implements the Function Level Reset (FLR) feature following
      the PCI Express spec, and is only a first step; more generic methods,
      such as D0/D3 transitions, can be added later to let more devices
      support this capability.
      
      The patch contains two functions.  pcie_reset_function() is the new
      driver API and includes the actions needed to quiesce a device.  The
      other function is a helper: pcie_execute_reset_function() just
      executes the reset for a particular device function.
      
      Currently the usage model is KVM: a function-level reset is needed
      when assigning a device to a guest, or moving it between partitions.
      
      For Function Level Reset (FLR), please refer to PCI Express spec
      chapter 6.6.2.
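      
      A minimal sketch of how a driver or VMM backend might use the new
      API before handing a device to a guest (the exact prototype and
      error value are assumptions; see the patch for the real definitions):
      
      	/* Hedged sketch: quiesce + FLR before device assignment.
      	 * Assumes pcie_reset_function(struct pci_dev *) returns 0 on
      	 * success and a negative errno otherwise (e.g. when the
      	 * device has no FLR capability).
      	 */
      	static int assign_device_example(struct pci_dev *pdev)
      	{
      		int ret;
      
      		ret = pcie_reset_function(pdev);
      		if (ret)
      			return ret;	/* device could not be reset */
      
      		/* device is now quiesced and reset, safe to assign */
      		return 0;
      	}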
      Signed-off-by: Sheng Yang <sheng@linux.intel.com>
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
  2. 21 Oct 2008, 11 commits
  3. 20 Oct 2008, 28 commits
    • x86: sysfs: kill owner field from attribute · 01e8ef11
      Committed by Parag Warudkar
      Tejun's commit 7b595756 made sysfs
      attribute->owner unnecessary.  But the field was left in the structure to
      ease the merge.  It's been over a year since that change and it is now
      time to start killing attribute->owner along with its users - one arch at
      a time!
      
      This patch is attempt #1 at getting rid of attribute->owner, only for
      CONFIG_X86_64 or CONFIG_X86_32.  We will deal with the other arches
      later on, as and when possible - avr32 will be next, since that is
      something I can test.  Compile (make allyesconfig / make allmodconfig
      / custom config) and boot tested.
      
      akpm: the idea is that we put the declaration of attribute.owner inside
      `#ifndef CONFIG_X86'.  But that proved to be too ambitious for now because
      new usages kept on turning up in subsystem trees.
      
      [akpm: remove the ifdef for now]
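      
      For reference, a sketch of the structure in question (field layout as
      in include/linux/sysfs.h of that era, shown here for illustration):
      
      	struct attribute {
      		const char	*name;
      		struct module	*owner;	/* unnecessary since 7b595756;
      					 * this is the field being killed */
      		mode_t		mode;
      	};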
      Signed-off-by: Parag Warudkar <parag.lkml@gmail.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jean Delvare <khali@linux-fr.org>
      Cc: Roland Dreier <rolandd@cisco.com>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/bcd.h: remove comments · 5a85a7dd
      Committed by Adrian Bunk
      - the macros are gone
      - there's no more code in this file,
        LGPL + GPL = GPL,
        and the code that was moved to lib/bcd.c is anyway trivial
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • remove the obsolete BCD*BIN/BIN*BCD macros · a0098efd
      Committed by Adrian Bunk
      Remove the following obsolete macros:
      
      - BCD2BIN
      - BIN2BCD
      - BCD_TO_BIN
      - BIN_TO_BCD
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rtc: use bcd2bin/bin2bcd · 357c6e63
      Committed by Adrian Bunk
      Change various rtc related code to use the new bcd2bin/bin2bcd functions
      instead of the obsolete BCD_TO_BIN/BIN_TO_BCD/BCD2BIN/BIN2BCD macros.
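      
      The new helpers are the trivial conversions now living in lib/bcd.c;
      for illustration, they are equivalent to what the old macros expanded
      to:
      
      	unsigned bcd2bin(unsigned char val)
      	{
      		return (val & 0x0f) + (val >> 4) * 10;
      	}
      
      	unsigned char bin2bcd(unsigned val)
      	{
      		return ((val / 10) << 4) + val % 10;
      	}
      
      So an RTC seconds register reading 0x59 decodes to 59, and writing 59
      back encodes to 0x59.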
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Acked-by: Alessandro Zummo <a.zummo@towertech.it>
      Cc: David Brownell <david-b@pacbell.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • make mm/rmap.c:anon_vma_cachep static · fdd2e5f8
      Committed by Adrian Bunk
      This patch makes the needlessly global anon_vma_cachep static.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • byteorder: provide swabb.h generically in asm/byteorder.h · 1d8cca44
      Committed by Harvey Harrison
      This is needed during the transition to the new byteorder headers as the
      swabb.h functionality will be provided from asm/byteorder.h in the new
      version.  To avoid breakage on arches still using the old implementation,
      provide swabb.h from asm/byteorder.h as well.
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • byteorder: use generic C version for value byteswapping · acf0108a
      Committed by Harvey Harrison
      This makes the new implementation of the byteorder helpers match the
      old one in how it degrades when an arch-defined version is not
      available:
      1) swab()
      	- look for arch defined
      	- if not, use generic c version
      
      2) swabp()
      	- look for arch-defined
      	- if not, deref pointer and use swab()
      
      3) swabs()
      	- look for arch defined
      	- if not, use swabp
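      
      For the 32-bit case, the generic C value version that the helpers
      fall back to looks roughly like this (the real header uses slightly
      different internal names; this function name is illustrative):
      
      	static inline __u32 generic_swab32(__u32 x)
      	{
      		return ((x & 0x000000ffUL) << 24) |
      		       ((x & 0x0000ff00UL) <<  8) |
      		       ((x & 0x00ff0000UL) >>  8) |
      		       ((x & 0xff000000UL) >> 24);
      	}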
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kdump: add is_vmcore_usable() and vmcore_unusable() · 85a0ee34
      Committed by Simon Horman
      The usage of elfcorehdr_addr has changed recently: is_kdump_kernel()
      now compares elfcorehdr_addr against ELFCORE_ADDR_MAX to decide
      whether the code is executing in a kernel booted as a crash (kdump)
      kernel.
      
      However, arch/ia64/kernel/setup.c:reserve_elfcorehdr will reset
      elfcorehdr_addr to ELFCORE_ADDR_MAX on error, which means any subsequent
      calls to is_kdump_kernel() will return 0, even though they should return
      1.
      
      Ok, at this point in time there are no subsequent calls, but I think it's
      fair to say that there is ample scope for error or at the very least
      confusion.
      
      This patch adds an extra state, ELFCORE_ADDR_ERR, which indicates that
      elfcorehdr_addr was passed on the command line, and thus execution is
      taking place in a crashdump kernel, but vmcore can't be used for some
      reason.  This is tested for using is_vmcore_usable() and set using
      vmcore_unusable().  A subsequent patch makes use of this new code.
      
      To summarise, the states that elfcorehdr_addr can now be in are as follows:
      
      ELFCORE_ADDR_MAX: not a crashdump kernel
      ELFCORE_ADDR_ERR: crashdump kernel but vmcore is unusable
      any other value:  crash dump kernel and vmcore is usable
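      
      Given those three states, the new helpers reduce to simple checks; a
      sketch of what the patch adds (details may differ from the final
      include/linux/crash_dump.h):
      
      	/* usable: a kdump kernel and the header address is valid */
      	static inline int is_vmcore_usable(void)
      	{
      		return is_kdump_kernel() &&
      		       elfcorehdr_addr != ELFCORE_ADDR_ERR;
      	}
      
      	/* mark the vmcore unusable, but stay a kdump kernel */
      	static inline void vmcore_unusable(void)
      	{
      		if (is_kdump_kernel())
      			elfcorehdr_addr = ELFCORE_ADDR_ERR;
      	}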
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kdump: make elfcorehdr_addr independent of CONFIG_PROC_VMCORE · 57cac4d1
      Committed by Vivek Goyal
      o elfcorehdr_addr is used not only by the code under CONFIG_PROC_VMCORE
        but also by code which is not inside CONFIG_PROC_VMCORE.  For
        example, is_kdump_kernel() is used by powerpc code to determine if
        the kernel is booting after a panic and, if so, to reuse the previous
        kernel's TCE table.  So even if CONFIG_PROC_VMCORE is not set in the
        second kernel, one should be able to correctly determine that we are
        booting after a panic and set up the calgary iommu accordingly.
      
      o So remove the assumption that elfcorehdr_addr is under
        CONFIG_PROC_VMCORE.
      
      o Move the definition of elfcorehdr_addr to arch-dependent crash files.
        (Unfortunately crash dump does not have an arch-independent file,
        otherwise that would have been the best place.)
      
      o kexec.c is not the right place, as one can have CRASH_DUMP enabled in
        the second kernel without KEXEC being enabled.
      
      o I don't see the sh setup code parsing the command line for
        elfcorehdr_addr, so I wonder how the vmcore interface works on sh.
        Anyway, I am at least defining elfcorehdr_addr so that compilation is
        not broken on sh.
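      
      The check that must now work regardless of CONFIG_PROC_VMCORE is
      essentially the following (a sketch; elfcorehdr_addr defaults to
      ELFCORE_ADDR_MAX and is overwritten by elfcorehdr= early-param
      parsing):
      
      	static inline int is_kdump_kernel(void)
      	{
      		/* true only when an ELF core header was passed in */
      		return elfcorehdr_addr != ELFCORE_ADDR_MAX;
      	}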
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Simon Horman <horms@verge.net.au>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • add CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS · 656eb2cd
      Committed by Roland McGrath
      This adds a kconfig option to change the /proc/PID/coredump_filter default.
      Fedora has been carrying a trivial patch to change the hard-wired value for
      this default since Fedora 8.  The built-in default can't change safely,
      because old GDB versions out there (all before 6.7) are confused by the
      core dump files created by the MMF_DUMP_ELF_HEADERS setting.
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Jones <davej@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • make ptrace_untrace() static · b747c8c1
      Committed by Adrian Bunk
      ptrace_untrace() can now become static.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bitmask: remove bitmap_scnprintf_len() · c4596435
      Committed by Lai Jiangshan
      bitmap_scnprintf_len() is not used anywhere now, so remove it.
      
      Otherwise we would have to maintain it and keep its return value
      always equal to bitmap_scnprintf()'s return value.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • seq_file: add seq_cpumask_list(), seq_nodemask_list() · 3eda2011
      Committed by Lai Jiangshan
      seq_cpumask_list() and seq_nodemask_list() are very much like
      seq_cpumask() and seq_nodemask(), but they print a human-readable
      list string (e.g. "0-3,5") rather than a hex bitmask.
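      
      A hedged usage sketch from a seq_file ->show() handler (the handler
      and "example_mask" are made up for illustration):
      
      	static cpumask_t example_mask;	/* illustrative only */
      
      	static int example_show(struct seq_file *m, void *v)
      	{
      		seq_cpumask(m, &example_mask);		/* hex: "2f" */
      		seq_putc(m, '\n');
      		seq_cpumask_list(m, &example_mask);	/* list: "0-3,5" */
      		seq_putc(m, '\n');
      		return 0;
      	}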
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: allocate all page_cgroup at boot · 52d4b9ac
      Committed by KAMEZAWA Hiroyuki
      Allocate all page_cgroup structures at boot and remove the page_cgroup
      pointer from struct page.  This patch adds an interface:
      
       struct page_cgroup *lookup_page_cgroup(struct page *)
      
      All of FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG are supported.
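      
      Conceptually, the lookup becomes an index into a boot-allocated array
      instead of a pointer stored in struct page.  A sketch of the
      flat-memory flavour (the node_page_cgroup field name is an assumption
      taken from the patch; treat the details as illustrative):
      
      	struct page_cgroup *lookup_page_cgroup(struct page *page)
      	{
      		unsigned long pfn = page_to_pfn(page);
      		struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));
      		struct page_cgroup *base = pgdat->node_page_cgroup;
      
      		if (unlikely(!base))	/* array not allocated */
      			return NULL;
      		return base + (pfn - pgdat->node_start_pfn);
      	}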
      
      Removing the page_cgroup pointer reduces memory usage by
       - 4 bytes per page (32-bit), or
       - 8 bytes per page (64-bit)
      even when the memory controller is disabled (but configured in).
      
      On a usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
      On my x86-64 server with 48GB of memory, this saves 96MB of memory.
      I think this reduction makes sense.
      
      By pre-allocating, the kmalloc/kfree calls in charge/uncharge are removed.
      This means
        - we don't have to worry about kmalloc failure.
          (this could happen because of the gfp_mask type.)
        - we avoid calling kmalloc/kfree.
        - we avoid allocating tons of small objects which can become fragmented.
        - we know up front what amount of memory will be used for this
          extra-LRU handling.
      
      I added printk messages:
      
      	"allocated %ld bytes of page_cgroup"
      	"please try cgroup_disable=memory option if you don't want"
      
      which should be informative enough for users.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: fix declaration of cgroup_mm_owner_callbacks · 886465f4
      Committed by Paul Menage
      The choice of real/dummy declaration for cgroup_mm_owner_callbacks()
      shouldn't be based on CONFIG_MM_OWNER, but on CONFIG_CGROUPS.  Otherwise
      kernel/exit.c fails to compile when something other than a cgroups
      controller selects CONFIG_MM_OWNER.
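      
      The fix is just a matter of which config symbol guards the real
      declaration; a sketch of the intended shape:
      
      	#ifdef CONFIG_CGROUPS
      	extern void cgroup_mm_owner_callbacks(struct task_struct *old,
      					      struct task_struct *new);
      	#else
      	static inline void cgroup_mm_owner_callbacks(struct task_struct *old,
      						     struct task_struct *new)
      	{
      	}
      	#endif /* CONFIG_CGROUPS */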
      Signed-off-by: Paul Menage <menage@google.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: convert tasks file to use a seq_file with shared pid array · cc31edce
      Committed by Paul Menage
      Rather than pre-generating the entire text for the "tasks" file each
      time the file is opened, we instead just generate/update the array of
      process ids and use a seq_file to report these to userspace.  All open
      file handles on the same "tasks" file can share a pid array, which may
      be updated any time that no thread is actively reading the array.  By
      sharing the array, the potential for userspace to DoS the system by
      opening many handles on the same "tasks" file is removed.
      
      [Based on a patch by Lai Jiangshan, extended to use seq_file]
      Signed-off-by: Paul Menage <menage@google.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: fix probable race with put_css_set[_taskexit] and find_css_set · 146aa1bd
      Committed by Lai Jiangshan
      put_css_set_taskexit() may be called while find_css_set() is running on
      another CPU, and the following race can occur:
      
      put_css_set_taskexit side                    find_css_set side
      
                                              |
      atomic_dec_and_test(&kref->refcount)    |
          /* kref->refcount = 0 */            |
      ....................................................................
                                              |  read_lock(&css_set_lock)
                                              |  find_existing_css_set
                                              |  get_css_set
                                              |  read_unlock(&css_set_lock);
      ....................................................................
      __release_css_set                       |
      ....................................................................
                                              | /* use a released css_set */
                                              |
      
      [put_css_set is the same.  But in the current code, all put_css_set calls
      are inside the cgroup mutex critical region, the same as find_css_set.]
      
      [akpm@linux-foundation.org: repair comments]
      [menage@google.com: eliminate race in css_set refcounting]
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ext3: add an option to control error handling on file data · 0e4fb5e2
      Committed by Hidehiro Kawai
      If the journal doesn't abort when it gets an IO error in file data blocks,
      file data corruption will spread silently.  Because most applications and
      commands do buffered writes without fsync(), they don't notice the IO
      error.  That's scary for mission-critical systems.  On the other hand, if
      the journal aborts whenever it gets an IO error in file data blocks, the
      system easily becomes inoperable.  So this patch introduces a filesystem
      option to determine whether the journal aborts or just calls printk()
      when it gets an IO error in file data.
      
      If you mount an ext3 fs with the data_err=abort option, it aborts on a
      file data write error.  If you mount it with data_err=ignore, it doesn't
      abort and just calls printk().  data_err=ignore is the default.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Jan Kara <jack@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fb: convert lock/unlock_kernel() into local fb mutex · 3e680aae
      Committed by Krzysztof Helt
      Change lock_kernel()/unlock_kernel() to a local fb mutex.  Each frame
      buffer instance has its own mutex.
      
      The one-line try_to_load() function is unrolled into request_module()
      calls in two places for readability.
      
      [righi.andrea@gmail.com: fb: fix NULL pointer BUG dereference in fb_open()]
      Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl>
      Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • container freezer: implement freezer cgroup subsystem · dc52ddc0
      Committed by Matt Helsley
      This patch implements a new freezer subsystem in the control groups
      framework.  It provides a way to stop and resume execution of all tasks in
      a cgroup by writing in the cgroup filesystem.
      
      The freezer subsystem in the container filesystem defines a file named
      freezer.state.  Writing "FROZEN" to the state file will freeze all tasks
      in the cgroup.  Subsequently writing "RUNNING" will unfreeze the tasks in
      the cgroup.  Reading will return the current state.
      
      * Examples of usage :
      
         # mkdir /containers/freezer
         # mount -t cgroup -ofreezer freezer  /containers
         # mkdir /containers/0
         # echo $some_pid > /containers/0/tasks
      
      to get status of the freezer subsystem :
      
         # cat /containers/0/freezer.state
         RUNNING
      
      to freeze all tasks in the container :
      
         # echo FROZEN > /containers/0/freezer.state
         # cat /containers/0/freezer.state
         FREEZING
         # cat /containers/0/freezer.state
         FROZEN
      
      to unfreeze all tasks in the container :
      
         # echo RUNNING > /containers/0/freezer.state
         # cat /containers/0/freezer.state
         RUNNING
      
      This is the basic mechanism, which should do the right thing for
      userspace tasks in a simple scenario.
      
      It's important to note that freezing can be incomplete.  In that case we
      return EBUSY.  This means that some tasks in the cgroup are busy doing
      something that prevents us from completely freezing the cgroup at this
      time.  After EBUSY, the cgroup will remain partially frozen -- reflected
      by freezer.state reporting "FREEZING" when read.  The state will remain
      "FREEZING" until one of these things happens:
      
      	1) Userspace cancels the freezing operation by writing "RUNNING" to
      		the freezer.state file
      	2) Userspace retries the freezing operation by writing "FROZEN" to
      		the freezer.state file (writing "FREEZING" is not legal
      		and returns EIO)
      	3) The tasks that blocked the cgroup from entering the "FROZEN"
      		state disappear from the cgroup's set of tasks.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: export thaw_process]
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Tested-by: Matt Helsley <matthltc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • container freezer: make refrigerator always available · 8174f150
      Committed by Matt Helsley
      Now that the TIF_FREEZE flag is available in all architectures, extract
      refrigerator() and freeze_task() from kernel/power/process.c and make
      them available to all.
      
      The refrigerator() can now be used in a control group subsystem
      implementing a control group freezer.
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Tested-by: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • container freezer: add TIF_FREEZE flag to all architectures · 83224b08
      Committed by Matt Helsley
      This patch series introduces a cgroup subsystem that utilizes the swsusp
      freezer to freeze a group of tasks.  It's immediately useful for batch job
      management scripts.  It should also be useful in the future for
      implementing container checkpoint/restart.
      
      The freezer subsystem in the container filesystem defines a cgroup file
      named freezer.state.  Reading freezer.state will return the current state
      of the cgroup.  Writing "FROZEN" to the state file will freeze all tasks
      in the cgroup.  Subsequently writing "RUNNING" will unfreeze the tasks in
      the cgroup.
      
      * Examples of usage :
      
         # mkdir /containers/freezer
         # mount -t cgroup -ofreezer freezer  /containers
         # mkdir /containers/0
         # echo $some_pid > /containers/0/tasks
      
      to get status of the freezer subsystem :
      
         # cat /containers/0/freezer.state
         RUNNING
      
      to freeze all tasks in the container :
      
         # echo FROZEN > /containers/0/freezer.state
         # cat /containers/0/freezer.state
         FREEZING
         # cat /containers/0/freezer.state
         FROZEN
      
      to unfreeze all tasks in the container :
      
         # echo RUNNING > /containers/0/freezer.state
         # cat /containers/0/freezer.state
         RUNNING
      
      This patch:
      
      The first step in making the refrigerator() available to all
      architectures, even for those without power management.
      
      The purpose of such a change is to be able to use the refrigerator() in a
      new control group subsystem which will implement a control group freezer.
      
      [akpm@linux-foundation.org: fix sparc]
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
      Acked-by: Pavel Machek <pavel@suse.cz>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Nigel Cunningham <nigel@tuxonice.net>
      Tested-by: Matt Helsley <matthltc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • coredump_filter: add hugepage dumping · e575f111
      Committed by KOSAKI Motohiro
      Presently a hugepage vma has the VM_RESERVED flag set so that it is not
      swapped.  But a VM_RESERVED vma isn't core dumped, because this flag is
      often used for kernel vmas (e.g. vmalloc, sound related).
      
      Thus hugepages are never dumped, and they can't be debugged easily.
      Many developers want hugepages to be included in core dumps.
      
      However, we can't simply dump every VM_RESERVED area, because such an
      area is often an I/O mapping, and reading it may change device state,
      which is a definitely undesirable side effect.
      
      So adding a hugepage-specific bit to the coredump filter is better: it
      enables hugepage core dumping without causing side effects on any I/O
      devices.
      
      In addition, libhugetlbfs uses hugetlb private mapping pages as
      anonymous pages, so hugepage private mapping pages should be core
      dumped by default.
      
      Thus /proc/[pid]/coredump_filter gets two new bits:
      
       - bit 5 controls whether hugetlb private mapping pages are dumped (default: yes)
       - bit 6 controls whether hugetlb shared mapping pages are dumped (default: no)
      
      I tested with the following method.
      
      % ulimit -c unlimited
      % ./crash_hugepage  50
      % ./crash_hugepage  50  -p
      % ls -lh
      % gdb ./crash_hugepage core
      %
      % echo 0x43 > /proc/self/coredump_filter
      % ./crash_hugepage  50
      % ./crash_hugepage  50  -p
      % ls -lh
      % gdb ./crash_hugepage core
      
      #include <stdlib.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/mman.h>
      #include <string.h>
      
      #include "hugetlbfs.h"
      
      int main(int argc, char** argv){
      	char* p;
      	int ch;
      	int mmap_flags = MAP_SHARED;
      	int fd;
      	int nr_pages;
      
      	while((ch = getopt(argc, argv, "p")) != -1) {
      		switch (ch) {
      		case 'p':
      			mmap_flags &= ~MAP_SHARED;
      			mmap_flags |= MAP_PRIVATE;
      			break;
      		default:
      			/* nothing*/
      			break;
      		}
      	}
      	argc -= optind;
      	argv += optind;
      
      	if (argc == 0){
      		printf("need # of pages\n");
      		exit(1);
      	}
      
      	nr_pages = atoi(argv[0]);
      	if (nr_pages < 2) {
      		printf("nr_pages must be >= 2\n");
      		exit(1);
      	}
      
      	fd = hugetlbfs_unlinked_fd();
      	p = mmap(NULL, nr_pages * gethugepagesize(),
      		 PROT_READ|PROT_WRITE, mmap_flags, fd, 0);
      	if (p == MAP_FAILED) {	/* bail out instead of faulting */
      		perror("mmap");
      		exit(1);
      	}
      
      	sleep(2);
      
      	*(p + gethugepagesize()) = 1; /* COW */
      	sleep(2);
      
      	/* crash! */
      	*(int*)0 = 1;
      
      	return 0;
      }
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: William Irwin <wli@holomorphy.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rewrite vmap layer · db64fe02
      Committed by Nick Piggin
      Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
      provide a fast, scalable percpu frontend for small vmaps (requires a
      slightly different API, though).
      
      The biggest problem with vmap is actually vunmap.  Presently this requires
      a global kernel TLB flush, which on most architectures is a broadcast IPI
      to all CPUs to flush their TLBs.  This is all done under a global lock.  As
      the number of CPUs increases, so will the number of vunmaps a scaled
      workload will want to perform, and so will the cost of a global TLB flush.
      This gives terrible quadratic scalability characteristics.
      
      Another problem is that the entire vmap subsystem works under a single
      lock.  It is an rwlock, but it is actually taken for write in all the fast
      paths, and the read locking would likely never run concurrently anyway,
      so it's just pointless.
      
      This is a rewrite of vmap subsystem to solve those problems.  The existing
      vmalloc API is implemented on top of the rewritten subsystem.
      
      The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
      addresses do not have to be flushed immediately when they are vunmapped,
      because the kernel will not reuse them again (would be a use-after-free)
      until they are reallocated.  So the addresses aren't allocated again until
      a subsequent TLB flush.  A single TLB flush then can flush multiple
      vunmaps from each CPU.
      
      XEN and PAT and such do not like deferred TLB flushing because they can't
      always handle multiple aliasing virtual addresses to a physical address.
      They now call vm_unmap_aliases() in order to flush any deferred mappings.
      That call is very expensive (well, actually not a lot more expensive than
      a single vunmap under the old scheme), however it should be OK if not
      called too often.
      
      The virtual memory extent information is stored in an rbtree rather than a
      linked list to improve the algorithmic scalability.
      
      There is a per-CPU allocator for small vmaps, which amortizes or avoids
      global locking.
      
      To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
      must be used in place of vmap and vunmap.  Vmalloc does not use these
      interfaces at the moment, so it will not be quite so scalable (although it
      will use lazy TLB flushing).
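      
      A hedged sketch of the per-CPU interface (signatures as introduced
      here: vm_map_ram(pages, count, node, prot) and vm_unmap_ram(addr,
      count); the wrapper function itself is made up):
      
      	static int copy_via_vmap(struct page **pages, unsigned int count,
      				 const void *data, size_t len)
      	{
      		/* node -1: no NUMA preference */
      		void *addr = vm_map_ram(pages, count, -1, PAGE_KERNEL);
      
      		if (!addr)
      			return -ENOMEM;
      		memcpy(addr, data, len);	/* use the mapping */
      		/* unmap; the TLB flush may be deferred and batched */
      		vm_unmap_ram(addr, count);
      		return 0;
      	}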
      
      As a quick test of performance, I ran a test that loops in the kernel,
      linearly mapping, then touching, then unmapping 4 pages.  Different numbers
      of tests were run in parallel on a 4-core, 2-socket Opteron.  Results are
      in nanoseconds per map+touch+unmap.
      
      threads           vanilla         vmap rewrite
      1                 14700           2900
      2                 33600           3000
      4                 49500           2800
      8                 70631           2900
      
      So with 8 cores, the rewritten version is already 25x faster.
      
      In a slightly more realistic test (although with an older and less
      scalable version of the patch), I ripped the not-very-good vunmap batching
      code out of XFS, and implemented the large buffer mapping with vm_map_ram
      and vm_unmap_ram...  along with a couple of other tricks, I was able to
      speed up a large directory workload by 20x on a 64 CPU system.  I believe
      vmap/vunmap is actually sped up a lot more than 20x on such a system, but
      I'm running into other locks now.  vmap is pretty well blown off the
      profiles.
      
      Before:
      1352059 total                                      0.1401
      798784 _write_lock                              8320.6667 <- vmlist_lock
      529313 default_idle                             1181.5022
       15242 smp_call_function                         15.8771  <- vmap tlb flushing
        2472 __get_vm_area_node                         1.9312  <- vmap
        1762 remove_vm_area                             4.5885  <- vunmap
         316 map_vm_area                                0.2297  <- vmap
         312 kfree                                      0.1950
         300 _spin_lock                                 3.1250
         252 sn_send_IPI_phys                           0.4375  <- tlb flushing
         238 vmap                                       0.8264  <- vmap
         216 find_lock_page                             0.5192
         196 find_next_bit                              0.3603
         136 sn2_send_IPI                               0.2024
         130 pio_phys_write_mmr                         2.0312
         118 unmap_kernel_range                         0.1229
      
      After:
       78406 total                                      0.0081
       40053 default_idle                              89.4040
       33576 ia64_spinlock_contention                 349.7500
        1650 _spin_lock                                17.1875
         319 __reg_op                                   0.5538
         281 _atomic_dec_and_lock                       1.0977
         153 mutex_unlock                               1.5938
         123 iget_locked                                0.1671
         117 xfs_dir_lookup                             0.1662
         117 dput                                       0.1406
         114 xfs_iget_core                              0.0268
          92 xfs_da_hashname                            0.1917
          75 d_alloc                                    0.0670
          68 vmap_page_range                            0.0462 <- vmap
          58 kmem_cache_alloc                           0.0604
          57 memset                                     0.0540
          52 rb_next                                    0.1625
          50 __copy_user                                0.0208
          49 bitmap_find_free_region                    0.2188 <- vmap
          46 ia64_sn_udelay                             0.1106
          45 find_inode_fast                            0.1406
          42 memcmp                                     0.2188
          42 finish_task_switch                         0.1094
          42 __d_lookup                                 0.0410
          40 radix_tree_lookup_slot                     0.1250
          37 _spin_unlock_irqrestore                    0.3854
          36 xfs_bmapi                                  0.0050
          36 kmem_cache_free                            0.0256
          35 xfs_vn_getattr                             0.0322
          34 radix_tree_lookup                          0.1062
          33 __link_path_walk                           0.0035
          31 xfs_da_do_buf                              0.0091
          30 _xfs_buf_find                              0.0204
          28 find_get_page                              0.0875
          27 xfs_iread                                  0.0241
          27 __strncpy_from_user                        0.2812
          26 _xfs_buf_initialize                        0.0406
          24 _xfs_buf_lookup_pages                      0.0179
          24 vunmap_page_range                          0.0250 <- vunmap
          23 find_lock_page                             0.0799
          22 vm_map_ram                                 0.0087 <- vmap
          20 kfree                                      0.0125
          19 put_page                                   0.0330
          18 __kmalloc                                  0.0176
          17 xfs_da_node_lookup_int                     0.0086
          17 _read_lock                                 0.0885
          17 page_waitqueue                             0.0664
      
      vmap has gone from being the top 5 on the profiles and flushing the crap
      out of all TLBs, to using less than 1% of kernel time.
      
      [akpm@linux-foundation.org: cleanups, section fix]
      [akpm@linux-foundation.org: fix build on alpha]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: buffer lock use lock bitops · 51b07fc3
      Committed by Nick Piggin
      trylock_buffer and unlock_buffer open and close a critical section.
      Hence, we can use the lock bitops to get the desired memory ordering.
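      
      Roughly, the lock side uses test_and_set_bit_lock() (acquire
      semantics) and the unlock side clear_bit_unlock() (release semantics),
      instead of plain bitops plus explicit full barriers; a sketch of the
      result:
      
      	int trylock_buffer(struct buffer_head *bh)
      	{
      		return likely(!test_and_set_bit_lock(BH_Lock, &bh->b_state));
      	}
      
      	void unlock_buffer(struct buffer_head *bh)
      	{
      		clear_bit_unlock(BH_Lock, &bh->b_state);
      		smp_mb__after_clear_bit();
      		wake_up_bit(&bh->b_state, BH_Lock);
      	}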
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page lock use lock bitops · 8413ac9d
      Committed by Nick Piggin
      trylock_page, unlock_page open and close a critical section. Hence,
      we can use the lock bitops to get the desired memory ordering.
      
      Also, mark trylock as likely to succeed (and remove the annotation from
      callers).
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: pagecache insertion fewer atomics · f45840b5
      Committed by Nick Piggin
      Setting and clearing the page-locked flag when inserting a page into the
      swapcache/pagecache, while it has no other references, can use non-atomic
      page flag operations, because no other CPU may be operating on the page
      at that time.
      
      This saves one atomic operation per page inserted into the pagecache.
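      
      In other words, the insertion path can use the non-atomic bitop
      variant; a sketch of the kind of helper this enables (name and
      placement illustrative):
      
      	/* Safe only while the page is new and unpublished: no other
      	 * CPU can see it until it is inserted into the radix tree.
      	 */
      	static inline void __set_page_locked(struct page *page)
      	{
      		__set_bit(PG_locked, &page->flags);
      	}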
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>