Commits · 93220587f76b8a4eca89cb655fc0cc04e9da663d · openeuler / Kernel

Oct 31, 2013: 8 commits

ACPICA: Update aclinux.h for new OSL override mechanism. · 93220587

Committed by Lv Zheng on Oct 29, 2013

The new ACPICA OSL override mechanism is used to solve these issues
for the Linux OSL:
 1. Linux can implement OSL using a macro.
 2. Linux can implement OSL using an inlined function.
 3. Linux can leave OSL interfaces unimplemented for code fragments that
    are only compiled when __KERNEL__ is undefined.
 4. Linux can add sparse declarators (__iomem) to OSL.
 5. Linux can add memory tuning declarators (__init/__exit) to OSL.
This patch also moves Linux specific OSL to aclinux.h which has not been
maintained in the ACPICA code base.  Lv Zheng.

Known issue:

 From ACPICA's perspective, actypes.h should be included after acenv.h.
 But currently in Linux, aclinux.h (which is included by acenv.h) already
 includes actypes.h so that ACPICA types are available to its inline
 functions.  This is a known, pre-existing issue, and it currently causes
 no real problem for the Linux kernel build.  Thus it is not addressed by
 this cleanup commit.
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

ACPICA: Add support to allow host OS to redefine individual OSL prototypes. · 7e94632f

Committed by Lv Zheng on Oct 29, 2013

This change enables the host OS to redefine OSL prototypes found in the
acpiosxf.h file. This allows the host OS to implement OSL interfaces with
a macro or inlined function. Further, it allows the host OS to add any
additional required modifiers such as __iomem, __init, __exit, etc.,
as necessary on a per-interface basis. This enables maximum flexibility
for the OSL interfaces. Lv Zheng.
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
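The per-interface override described in the two commits above can be sketched as follows. This is a minimal illustration that assumes one guard macro per prototype. The names `USE_ALTERNATE_PROTOTYPE_os_get_timer`, `os_get_timer()` and `os_sleep()` are hypothetical stand-ins and are not the actual ACPICA identifiers. The host header is placed first because, as described above, acenv.h pulls in aclinux.h before the generic OSL prototypes are seen.

```c
#include <stdint.h>

/* --- Host-specific header (plays the role of aclinux.h) --------------- */
/* The host opts out of the generic prototype and supplies an inline body;
 * it could just as well use a macro or add __iomem/__init-style modifiers. */
#define USE_ALTERNATE_PROTOTYPE_os_get_timer
static inline uint64_t os_get_timer(void)
{
	return 42; /* placeholder timer value for the sketch */
}

/* --- Generic OSL header (plays the role of acpiosxf.h) ---------------- */
/* Each prototype has its own guard, so hosts override them one by one. */
#ifndef USE_ALTERNATE_PROTOTYPE_os_get_timer
uint64_t os_get_timer(void);
#endif

#ifndef USE_ALTERNATE_PROTOTYPE_os_sleep
void os_sleep(uint64_t milliseconds); /* not overridden: plain prototype */
#endif
```

Because the guards are per interface, a host that overrides nothing gets exactly the generic prototypes.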

ACPICA: Simplify configuration of global ACPI_REDUCED_HARDWARE macro. · c0144dc0

Committed by Bob Moore on Oct 29, 2013

Surround the definition of this macro with #ifndef so that the kernel
can define it elsewhere if desired.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
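The #ifndef guard pattern is shown below. The default value of 0 is an assumption of this sketch, and `hardware_is_reduced()` is a hypothetical helper used only to observe the result. A kernel build could override the value by defining the macro earlier, for example with `-DACPI_REDUCED_HARDWARE=1`.

```c
/* Generic default: applied only if nobody defined the macro earlier
 * (host header or compiler command line). */
#ifndef ACPI_REDUCED_HARDWARE
#define ACPI_REDUCED_HARDWARE 0
#endif

/* Hypothetical helper so the configured value can be checked at runtime. */
static int hardware_is_reduced(void)
{
	return ACPI_REDUCED_HARDWARE;
}
```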

ACPICA: Fix indentation issues for macro invocations. · cd27d79f

Committed by Lv Zheng on Oct 29, 2013

During the automatic translation of the upstream ACPICA source code
into Linux kernel source code, the "indent" program adds extra white
space at the beginning of every line that invokes a macro without a
terminating ";".

For this reason, a new mode has been added to the translation scripts
to remove the extra spaces inserted before such macro invocations and
to add an empty line between consecutive invocations (as is done for
other function declarations).  This new mode runs after "indent"
during the Linux release process.  Consequently, some existing ACPICA
source code in the Linux kernel tree needs to be adjusted to allow the
new scripts to work correctly.

The affected macros and files are:
 1. ACPI_HW_DEPENDENT_RETURN (acpixf.h/acdebug.h/acevents.h):
    This macro is used as a wrapper for hardware-dependent APIs to offer
    a stub when reduced hardware support is configured at compile time.
 2. ACPI_EXPORT_SYMBOL (utglobal.c):
    This macro is used by Linux to export symbols to be found by Linux
    modules.  All such invocations are well formatted except those
    exported as global variables.

This can help to reduce the source code differences between Linux
and upstream ACPICA, and also help to automate the release process.
No functional or binary generation changes should result from it.
Lv Zheng.

[rjw: Changelog]
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
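Below is a sketch of a hardware-dependent wrapper macro of the kind described in item 1. It is loosely modelled on that description, and all names and status values are hypothetical. With reduced hardware configured, the wrapper turns a prototype into an inline stub. Otherwise it turns it into a plain declaration. Each invocation has no trailing ";", which is exactly the line shape that confused "indent". The void variant keeps an explicit `return;`, so its body is never an empty block.

```c
#include <stdint.h>

typedef uint32_t status_t;            /* hypothetical status type */
#define STATUS_OK             0u
#define STATUS_NOT_CONFIGURED 1u      /* hypothetical value */

#ifndef REDUCED_HARDWARE
#define REDUCED_HARDWARE 1            /* build the stub variant in this sketch */
#endif

#if REDUCED_HARDWARE
/* Reduced hardware: expand the prototype into an inline stub. */
#define HW_DEPENDENT_RETURN_STATUS(prototype) \
	static inline prototype { return STATUS_NOT_CONFIGURED; }
#define HW_DEPENDENT_RETURN_VOID(prototype) \
	static inline prototype { return; }
#else
/* Full hardware: a plain declaration; the real function lives elsewhere. */
#define HW_DEPENDENT_RETURN_STATUS(prototype) prototype;
#define HW_DEPENDENT_RETURN_VOID(prototype) prototype;
#endif

/* Invocations: no ';' at the end, one empty line between them. */
HW_DEPENDENT_RETURN_STATUS(status_t enable_event(uint32_t event))

HW_DEPENDENT_RETURN_VOID(void clear_all_gpes(void))
```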

ACPICA: Prevent possible build issues for use of ACPI_PRINTF_LIKE macro · 4506bf23

Committed by Lv Zheng on Oct 29, 2013

The following build error:
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   CC      arch/x86/kernel/setup.o
 In file included from include/acpi/acpi.h:64:0,
                  from include/linux/iscsi_ibft.h:24,
                  from arch/x86/kernel/setup.c:43:
 include/acpi/acpixf.h:543:1: error: expected ',' or ';' before '{' token
 include/acpi/acpixf.h:540:1: warning: 'acpi_error' declared 'static' but never defined [-Wunused-function]
 make[2]: *** [arch/x86/kernel/setup.o] Error 1
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
can be triggered by the following stub function (if implemented):
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 static inline void ACPI_INTERNAL_VAR_XFACE
 acpi_error(const char *module_name,
 	   u32 line_number, const char *format, ...) ACPI_PRINTF_LIKE(3)
 {
 }
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This patch changes the position of ACPI_PRINTF_LIKE(x) to follow the
style of __printf(x, x+1) used in Linux to prevent such issues from
happening.  Lv Zheng.
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
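This sketch shows the attribute placement that avoids the build error above. The macro follows the `__printf(x, x+1)` shape: the format string is argument `c` and the variadic arguments start at `c + 1`. `PRINTF_LIKE` and `format_error()` are hypothetical names. Because the attribute comes before the declarator, the same line works both as a prototype and as an inline definition. Placing it after the closing ")" of a definition, as in the failing stub, is what GCC rejects.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Format string is parameter c; variadic arguments begin at c + 1. */
#define PRINTF_LIKE(c) __attribute__((format(printf, c, c + 1)))

/* Attribute in front of the declarator: valid for a definition. */
static inline PRINTF_LIKE(3) int
format_error(char *buf, size_t len, const char *format, ...)
{
	va_list args;
	int n;

	va_start(args, format);
	n = vsnprintf(buf, len, format, args);
	va_end(args);
	return n;
}
```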

ACPICA: Deploy ACPI_EXPORT_SYMBOL_INIT for main ACPICA initialization interfaces. · d21f600b

Committed by Lv Zheng on Oct 29, 2013

This change reduces source code differences between Linux and upstream
ACPICA, which helps to improve the release automation.

The side effects of applying this patch in Linux are:
1. Some ACPICA initialization/termination APIs are no longer exported in
   Linux, these include:
    acpi_load_tables
    acpi_initialize_subsystem
    acpi_enable_subsystem
    acpi_initialize_objects
    acpi_terminate
2. This patch does not affect the following APIs as they are currently not
   marked with ACPI_EXPORT_SYMBOL in Linux:
    acpi_reallocate_root_table
    acpi_initialize_tables
Such functions should not be exported, as they are internal to the ACPI
subsystem in Linux and are only invoked from the ACPI subsystem's
initialization routines marked with __init and termination routines marked
with __exit.  On other OSPMs, however, such functions may still need to be
exported.

Thus this patch makes this configurable in ACPICA, leaving it to each OSPM
to decide whether the __init/__exit marked functions should be exported
or not.  Lv Zheng.
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

ACPICA: Clarify ACPI_FREE_BUFFER usage. · bb1cab3d

Committed by Bob Moore on Oct 29, 2013

Add a comment to clarify the reason for using ACPI_FREE_BUFFER directly
instead of ACPI_FREE.

In addition to that, change one instance in which ACPI_FREE_BUFFER()
should be used instead of ACPI_FREE().

[rjw: Subject and changelog]
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

ACPICA: Add EXPORT_ACPI_INTERFACES macro to external interface modules. · 839e928f

Committed by Lv Zheng on Oct 29, 2013

For Linux, there are no functional changes/binary generation differences
introduced by this patch.

This change adds a new macro to all files that contain external ACPICA
interfaces. It can be detected and used by the host (via the host-specific
header) for any special processing required for such modules. Lv Zheng.
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Oct 30, 2013: 4 commits

ACPICA: Hardcode access width for the reset register. · e07fcfd8

Committed by Bob Moore on Oct 29, 2013

The ACPI spec requires the reset register width to be 8, so we
now hardcode it and ignore the FADT value. This provides/maintains
compatibility with other ACPI implementations that have allowed
BIOS code with bad register width values to go unnoticed.
Matthew Garrett, Bob Moore, Lv Zheng.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
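A minimal sketch of "hardcode the width and ignore the FADT value" follows. It keeps the firmware-provided register location and forces the width to 8 bits. The structure is a simplified Generic Address Structure, and `sanitize_reset_register()` is a hypothetical helper. Where ACPICA actually applies the override is not described here.

```c
#include <stdint.h>

/* Simplified Generic Address Structure (field set assumed for the sketch). */
struct generic_address {
	uint8_t  space_id;
	uint8_t  bit_width;
	uint8_t  bit_offset;
	uint8_t  access_width;
	uint64_t address;
};

/* Keep where the reset register is, but always use an 8-bit width,
 * whatever (possibly bogus) width the FADT reports. */
static struct generic_address
sanitize_reset_register(struct generic_address fadt_reset)
{
	fadt_reset.bit_width = 8;
	return fadt_reset;
}
```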

ACPICA: Predefine name macros: Sort list. · bf4994ac

Committed by Bob Moore on Oct 29, 2013

Sort the method names in acnames.h.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

ACPICA: Cleanup memory allocation macros and configurability. · b3c86c30

Committed by Lv Zheng on Oct 29, 2013

In the common case, the ACPI_ALLOCATE and related macros now resolve
directly to their respective acpi_os* OSL interfaces. Two options:
1) The ACPI_ALLOCATE_ZEROED macro defaults to a simple local
implementation, unless overridden by the USE_NATIVE_ALLOCATE_ZEROED define.
2) For the ACPI execution simulation environment (AcpiExec), which is not
shipped with the Linux kernel, the macros can optionally be resolved to
the local interfaces that track each allocation (used to immediately
detect memory leaks).
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
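Option 1 can be sketched as follows. `os_allocate()` and `os_allocate_zeroed()` are hypothetical stand-ins for the OSL interfaces and are backed by the C library here. The allocation-tracking variant (option 2) is omitted. By default, the zeroed allocator is a simple local wrapper around the plain allocator. Defining `USE_NATIVE_ALLOCATE_ZEROED` switches to a host-provided version.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the plain OSL allocator (hypothetical name). */
static void *os_allocate(size_t size) { return malloc(size); }

#ifdef USE_NATIVE_ALLOCATE_ZEROED
/* The host supplies its own zeroing allocator. */
static void *os_allocate_zeroed(size_t size) { return calloc(1, size); }
#else
/* Default: simple local implementation on top of the plain allocator. */
static void *os_allocate_zeroed(size_t size)
{
	void *p = os_allocate(size);

	if (p)
		memset(p, 0, size);
	return p;
}
#endif

/* Common case: the macros resolve directly to the OSL calls. */
#define ACPI_ALLOCATE(size)        os_allocate(size)
#define ACPI_ALLOCATE_ZEROED(size) os_allocate_zeroed(size)
#define ACPI_FREE(p)               free(p)
```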

ACPICA: Fix a macro for the hardware-reduced case · c26f3c90

Committed by Bob Moore on Oct 29, 2013

This fix repairs a version of a macro that is used for the hardware-reduced
case only. It adds a return statement to the macro definition so that the
translation into the Linux kernel source does not delete the second line of
the macro entirely, mistaking it for an empty block. It also makes the
intended use of the macro clearer.
Reported-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Sep 24, 2013: 6 commits

ACPICA: Update version to 20130823. · 94d7ba99

Committed by Bob Moore on Sep 23, 2013

Version 20130823.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

ACPICA: SCI Handlers: Update handler interface, eliminate unnecessary argument. · c53ae3a6

Committed by Bob Moore on Sep 23, 2013

The SCI interrupt number is not needed for the SCI handlers, and was
just unnecessary overhead.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

ACPICA: Cleanup exception codes. · 31e93a16

Committed by Lv Zheng on Sep 23, 2013

This patch adds AE_ACCESS for EACCES or EPERM.  Some error messages are
also cleaned up in this patch.  Lv Zheng.
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
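The sketch below shows how a host layer might map errno values onto status codes, assuming it translates errno at all. Only the status names follow ACPICA. Their numeric values and `status_from_errno()` are hypothetical. The point illustrated is that both permission errors, EACCES and EPERM, map to the single new AE_ACCESS code.

```c
#include <errno.h>
#include <stdint.h>

typedef uint32_t status_t;

/* Hypothetical values; only the names follow ACPICA. */
#define AE_OK        0u
#define AE_ERROR     1u
#define AE_NO_MEMORY 2u
#define AE_ACCESS    3u

/* Translate a host errno into an ACPICA-style status code. */
static status_t status_from_errno(int err)
{
	switch (err) {
	case 0:
		return AE_OK;
	case ENOMEM:
		return AE_NO_MEMORY;
	case EACCES:
	case EPERM:
		return AE_ACCESS; /* both permission errors share one code */
	default:
		return AE_ERROR;
	}
}
```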

ACPICA: Tables: Cleanup RSDP signature codes. · cacba865

Committed by Lv Zheng on Sep 23, 2013

This patch introduces new macros to handle the RSDP signature and cleans up
the affected code.  Lv Zheng.
Some of the updates are only used by ACPICA utilities, which are not shipped
in the kernel yet.
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
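Signature-handling macros of this kind can be sketched as follows. The RSDP is identified by the 8-byte signature "RSD PTR ", which includes a trailing space and has no terminating NUL in the table. The macro names are hypothetical. One macro validates a candidate signature, and the other writes one, for example when a utility builds a table.

```c
#include <string.h>

/* RSDP signature: 8 bytes, note the trailing space. */
#define SIG_RSDP     "RSD PTR "
#define SIG_RSDP_LEN 8

/* True if the 8 bytes at p are the RSDP signature. */
#define VALIDATE_RSDP_SIG(p) (memcmp((p), SIG_RSDP, SIG_RSDP_LEN) == 0)

/* Store the signature (without NUL) at dst. */
#define MAKE_RSDP_SIG(dst) memcpy((dst), SIG_RSDP, SIG_RSDP_LEN)
```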

ACPICA: Add support for host-installed SCI handlers. · a2fd4b4b

Committed by Lv Zheng on Sep 23, 2013

This change adds support to allow hosts to install System Control
Interrupt handlers. Certain ACPI functionality requires the host
to handle raw SCIs. For example, the "SCI Doorbell" that is defined
for memory power state support requires the host device driver to
handle SCIs to examine if the doorbell has been activated. Multiple
SCI handlers can be installed to allow for future expansion.
Debugger support is included.
Lv Zheng, Bob Moore. ACPICA BZ 1032.

Bug summary:
It is reported when the PCC (Platform Communication Channel, via
MPST table, defined in ACPI specification 5.0) subchannel responds
to the host, it issues an SCI and the host must probe the subchannel
for channel status.

Buglink: http://bugs.acpica.org/show_bug.cgi?id=1032
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
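The sketch below covers two points. First, the handler takes only a context pointer and no SCI number, in line with the interface change above. Second, several host handlers can coexist and all run on each SCI. Every name here is hypothetical, and the list is a simple linked list with no locking, purely for illustration. The example handler mimics a doorbell check on a flag owned by the caller.

```c
#include <stddef.h>
#include <stdint.h>

#define INTERRUPT_NOT_HANDLED 0u
#define INTERRUPT_HANDLED     1u

/* Handler signature: a context pointer only, no SCI number argument. */
typedef uint32_t (*sci_handler_fn)(void *context);

struct sci_handler {
	sci_handler_fn      fn;
	void               *context;
	struct sci_handler *next;
};

static struct sci_handler *sci_handlers; /* installed handlers */

/* Install one more host handler; several may be installed. */
static void install_sci_handler(struct sci_handler *h, sci_handler_fn fn,
				void *ctx)
{
	h->fn = fn;
	h->context = ctx;
	h->next = sci_handlers;
	sci_handlers = h;
}

/* On each SCI, run every handler and merge the results. */
static uint32_t dispatch_sci_handlers(void)
{
	uint32_t status = INTERRUPT_NOT_HANDLED;

	for (struct sci_handler *h = sci_handlers; h; h = h->next)
		status |= h->fn(h->context);
	return status;
}

/* Example: a "doorbell" check on a flag the caller owns. */
static uint32_t doorbell_handler(void *context)
{
	int *doorbell_rung = context;

	return *doorbell_rung ? INTERRUPT_HANDLED : INTERRUPT_NOT_HANDLED;
}
```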

ACPICA: Linux-specific header: enable "aarch64" 64-bit build. · 30095207

Committed by Naresh Bhat on Sep 23, 2013

Add support for the __aarch64__ define for 64-bit builds.
Signed-off-by: Naresh Bhat <naresh.bhat@linaro.org>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
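Here is a sketch of the kind of check described above. Compiler-predefined macros select the machine word width, and `__aarch64__` joins `__x86_64__` as a 64-bit indicator. `MACHINE_WIDTH` is a hypothetical name, and the real header's full list of architectures is not reproduced here.

```c
/* Select the machine word width from compiler-predefined macros. */
#if defined(__x86_64__) || defined(__aarch64__)
#define MACHINE_WIDTH 64
#else
#define MACHINE_WIDTH 32
#endif
```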

Sep 22, 2013: 1 commit

block: Add nr_bios to block_rq_remap tracepoint · 75afb352

Committed by Jun'ichi Nomura on Sep 21, 2013

Add the number of bios in a remapped request to the 'block_rq_remap'
tracepoint.

The request remapper clones the bios in a request to track the completion
status of each bio, so the number of bios can be useful information
for investigation.

Related discussions:
  http://www.redhat.com/archives/dm-devel/2013-August/msg00084.html
  http://www.redhat.com/archives/dm-devel/2013-September/msg00024.html

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
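The value the tracepoint reports can be sketched as a walk over the request's bio chain, counting each bio. The structures below are minimal user-space stand-ins, and `rq_count_bios()` is a hypothetical name, not necessarily the helper the patch adds.

```c
#include <stddef.h>

/* Minimal stand-ins for the block-layer structures. */
struct bio {
	struct bio *bi_next;   /* next bio in the same request */
};

struct request {
	struct bio *bio;       /* first bio of the request */
};

/* Count the bios chained in a request. */
static unsigned int rq_count_bios(const struct request *rq)
{
	unsigned int nr = 0;

	for (const struct bio *b = rq->bio; b; b = b->bi_next)
		nr++;
	return nr;
}
```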

Sep 21, 2013: 1 commit

btrfs: add lockdep and tracing annotations for uuid tree · 13fd8da9

Committed by David Sterba on Sep 03, 2013

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>

Sep 20, 2013: 2 commits

Revert "drm: mark context support as a legacy subsystem" · c21eb21c

Committed by Dave Airlie on Sep 20, 2013

This reverts commit 7c510133.

Well, it looks like not enough digging was done: libdrm_nouveau before
2.4.33 used contexts, and

292da616fe1f936ca78a3fa8e1b1b19883e343b6 nouveau: pull in major libdrm rewrite

got rid of them.
Reported-by: Paul Zimmerman <Paul.Zimmerman@synopsys.com>
Reported-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Dave Airlie <airlied@redhat.com>

ip: generate unique IP identificator if local fragmentation is allowed · 703133de

Committed by Ansis Atteka on Sep 18, 2013

If local fragmentation is allowed, then ip_select_ident() and
ip_select_ident_more() need to generate unique IDs to ensure
correct defragmentation on the peer.

For example, if IPsec (tunnel mode) has to encrypt large skbs
that have the local_df bit set, then all IP fragments that belonged
to different ESP datagrams would have used the same identifier.
If one of these IP fragments got lost or reordered, the peer
could stitch together wrong IP fragments that did not belong
to the same datagram. This would lead to packet loss or data
corruption.
Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
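The rule above can be shown in a heavily simplified model. Only a packet with DF set that may not be fragmented locally can get away with a fixed ID. Once local fragmentation is allowed, every datagram needs a fresh ID so the peer never mixes fragments during reassembly. The fixed ID of 0, the per-destination counter, and `select_ident()` are assumptions of this sketch. The kernel's real ID generation differs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-destination state: an incrementing ID counter. */
struct dest_state {
	uint16_t next_id;
};

/* Pick the IP ID for an outgoing datagram. */
static uint16_t select_ident(struct dest_state *d, bool df, bool local_df)
{
	if (df && !local_df)
		return 0;          /* never fragmented: fixed ID (simplified) */
	return d->next_id++;   /* may be fragmented: ID must be unique */
}
```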

Sep 18, 2013: 1 commit

perf: Fix UAPI export of PERF_EVENT_IOC_ID · a8e0108c

Committed by Vince Weaver on Sep 17, 2013

Without the following patch, I have problems compiling code using
the new PERF_EVENT_IOC_ID ioctl(). It looks like u64 was used
instead of __u64.
Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1309171450380.11444@vincent-weaver-1.um.maine.edu
Signed-off-by: Ingo Molnar <mingo@kernel.org>
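Why `__u64` matters: UAPI headers are compiled by user space, which gets the double-underscore types from `<linux/types.h>`, but the kernel-internal `u64` is not defined there. The sketch below builds an ioctl number of the same shape with `_IOR`. The `'$'` type and number 7 are assumptions of this sketch, and `SKETCH_PERF_EVENT_IOC_ID` is a hypothetical name. It compiles in user space only because it uses `__u64`.

```c
#include <linux/ioctl.h>
#include <linux/types.h>

/* An ioctl number that reads a pointer-sized argument; with plain u64
 * this would not compile in user space. */
#define SKETCH_PERF_EVENT_IOC_ID _IOR('$', 7, __u64 *)
```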

Sep 17, 2013: 2 commits

KVM: mmu: allow page tables to be in read-only slots · ba6a3541

Committed by Paolo Bonzini on Sep 09, 2013

Page tables in a read-only memory slot will currently cause a triple
fault because the page walker uses gfn_to_hva and it fails on such a slot.

OVMF uses such a page table; however, real hardware seems to be fine with
that as long as the accessed/dirty bits are set. Save whether the slot
is read-only, and later check it when updating the accessed and dirty bits.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
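A sketch of "save whether the slot is read-only, and check it when updating the A/D bits". The walker records writability for each PTE it reads. The update step only writes PTEs whose bits are missing and whose slot is writable. The message does not say what KVM does when a needed update hits a read-only slot, so skipping the write is an assumption here. The structure and function names are hypothetical. The bit positions are the x86 accessed (bit 5) and dirty (bit 6) bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define PT_ACCESSED (1u << 5)
#define PT_DIRTY    (1u << 6)

/* Hypothetical walker state: index 0 is the leaf PTE. */
struct walker {
	int      levels;
	uint64_t ptes[4];
	bool     pte_writable[4];  /* was this PTE's slot writable? */
};

/* Set A (and D on the leaf for writes); returns how many PTEs were written. */
static int update_accessed_dirty(struct walker *w, bool write_fault)
{
	int updated = 0;

	for (int i = 0; i < w->levels; i++) {
		uint64_t want = w->ptes[i] | PT_ACCESSED;

		if (write_fault && i == 0)
			want |= PT_DIRTY;
		if (want == w->ptes[i])
			continue;      /* bits already set: nothing to write */
		if (!w->pte_writable[i])
			continue;      /* read-only slot: skip (assumed policy) */
		w->ptes[i] = want;
		updated++;
	}
	return updated;
}
```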

netfilter: ipset: Consistent userspace testing with nomatch flag · 0f1799ba

Committed by Jozsef Kadlecsik on Sep 16, 2013

The "nomatch" commandline flag should invert the matching at testing,
similarly to the --return-nomatch flag of the "set" match of iptables.
Until now it worked with the elements with "nomatch" flag only. From
now on it works with elements without the flag too, i.e:

 # ipset n test hash:net
 # ipset a test 10.0.0.0/24 nomatch
 # ipset t test 10.0.0.1
 10.0.0.1 is NOT in set test.
 # ipset t test 10.0.0.1 nomatch
 10.0.0.1 is in set test.

 # ipset a test 192.168.0.0/24
 # ipset t test 192.168.0.1
 192.168.0.1 is in set test.
 # ipset t test 192.168.0.1 nomatch
 192.168.0.1 is NOT in set test.

 Before the patch, the results were:

 ...
 # ipset t test 192.168.0.1
 192.168.0.1 is in set test.
 # ipset t test 192.168.0.1 nomatch
 192.168.0.1 is in set test.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
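The new behaviour shown in the session above reduces to one comparison. A matching element reports "in set" exactly when its own nomatch flag equals the nomatch flag given on the test command. The function name is hypothetical. Treating an address covered by no element as "not in set" regardless of the flag is an assumption, because the session does not show that case.

```c
#include <stdbool.h>

/*
 * matched      - the address falls inside some element of the set
 * elem_nomatch - that element was added with "nomatch"
 * cmd_nomatch  - "nomatch" was given on the test command line
 */
static bool ipset_test_result(bool matched, bool elem_nomatch, bool cmd_nomatch)
{
	if (!matched)
		return false;                 /* assumption: flag ignored here */
	return elem_nomatch == cmd_nomatch;   /* the command flag inverts */
}
```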

Sep 16, 2013: 1 commit

vxlan: Fix sparse warnings · 35e42379

Committed by Joseph Gasparakis on Sep 13, 2013

This patch fixes sparse warnings caused by incorrect handling of the port
number and by using int instead of unsigned int when iterating through
&vn->sock_list[]. Keeping the port as __be16 also makes things clearer with
respect to endianness. Also, it was pointed out that vxlan_get_rx_port() had
unnecessary checks, which have been removed.
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Sep 13, 2013: 14 commits

HID: provide a helper for validating hid reports · 331415ff

Committed by Kees Cook on Sep 11, 2013

Many drivers need to validate the characteristics of their HID reports
during initialization to avoid misusing the reports. This adds a common
helper that validates that the report exists, that the field exists, and
that the field holds the expected number of values.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
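The three checks the helper performs are sketched below on much-reduced stand-in structures. The helper's name, its signature and the fixed field array are hypothetical. The helper returns the report only if the report exists, the requested field exists, and that field holds at least the expected number of values. Otherwise it returns NULL, so a driver can bail out during initialization.

```c
#include <stddef.h>

#define MAX_FIELDS 4

/* Much-reduced stand-ins for the HID structures. */
struct hid_field {
	unsigned int report_count;          /* values in this field */
};

struct hid_report {
	unsigned int      maxfield;         /* valid entries in field[] */
	struct hid_field *field[MAX_FIELDS];
};

static struct hid_report *validate_values(struct hid_report *report,
					  unsigned int field_index,
					  unsigned int report_counts)
{
	if (!report)
		return NULL;                        /* report missing */
	if (field_index >= report->maxfield || field_index >= MAX_FIELDS ||
	    !report->field[field_index])
		return NULL;                        /* field missing */
	if (report->field[field_index]->report_count < report_counts)
		return NULL;                        /* too few values */
	return report;
}
```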

Remove GENERIC_HARDIRQ config option · 0244ad00

Committed by Martin Schwidefsky on Aug 30, 2013

After the last architecture switched to generic hard irqs the config
options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
for !CONFIG_GENERIC_HARDIRQS can be removed.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

netfilter: nf_conntrack: use RCU safe kfree for conntrack extensions · c13a84a8

Committed by Michal Kubeček on Sep 11, 2013

Commit 68b80f11 (netfilter: nf_nat: fix RCU races) introduced
RCU protection for freeing extension data when reallocation
moves them to a new location. We need the same protection when
freeing them in nf_ct_ext_free() in order to prevent a
use-after-free by other threads referencing a NAT extension data
via bysource list.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() · c0292554

Committed by Kirill A. Shutemov on Sep 12, 2013

do_huge_pmd_anonymous_page() has a copy-pasted piece of handle_mm_fault()
to handle the fallback path.

Let's consolidate the code by introducing a VM_FAULT_FALLBACK return
code.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
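The consolidation can be sketched as follows. The huge-page path no longer carries its own copy of the small-page code. Instead, it returns a fallback bit, and the single caller handles that case. The constant values and function names are hypothetical and only mimic the VM_FAULT_* style.

```c
#include <stdbool.h>

/* Hypothetical return-code bits in the VM_FAULT_* style. */
#define FAULT_OK       0x0000
#define FAULT_FALLBACK 0x0800   /* "no huge page here, use small pages" */

static int small_page_fault(void) { return FAULT_OK; }

/* Huge-page path: just report that the caller should fall back. */
static int huge_page_fault(bool huge_page_available)
{
	return huge_page_available ? FAULT_OK : FAULT_FALLBACK;
}

/* The one place that handles the fallback. */
static int handle_fault(bool want_huge, bool huge_page_available,
			bool *used_huge)
{
	*used_huge = false;
	if (want_huge) {
		int ret = huge_page_fault(huge_page_available);

		if (!(ret & FAULT_FALLBACK)) {
			*used_huge = true;
			return ret;
		}
	}
	return small_page_fault();
}
```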

truncate: drop 'oldsize' truncate_pagecache() parameter · 7caef267

Committed by Kirill A. Shutemov on Sep 12, 2013

truncate_pagecache() doesn't care about old size since commit
cedabed4 ("vfs: Fix vmtruncate() regression").  Let's drop it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: make lru_add_drain_all() selective · 5fbc4616

Committed by Chris Metcalf on Sep 12, 2013

Make lru_add_drain_all() selectively interrupt only the CPUs that have
per-cpu pages that can be drained.

This is important in nohz mode, where calling mlockall(), for example,
would otherwise interrupt every core unnecessarily.

This is important on workloads where nohz cores are handling 10 Gb traffic
in userspace.  Those CPUs do not enter the kernel and place pages into LRU
pagevecs, and they really, really don't want to be interrupted, or they
drop packets on the floor.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
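The selection step can be sketched as building a mask of only those CPUs whose pagevecs hold pages, so that idle nohz CPUs are never sent drain work. The per-CPU array, the CPU count and the function name are hypothetical user-space stand-ins for the kernel's per-cpu data and cpumask.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 8

/* Hypothetical per-CPU count of pages sitting in LRU pagevecs. */
static unsigned int pagevec_pages[NR_CPUS_SKETCH];

/* Bitmask of CPUs that actually need a drain; only these get work. */
static uint32_t cpus_needing_drain(void)
{
	uint32_t mask = 0;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (pagevec_pages[cpu])
			mask |= UINT32_C(1) << cpu;
	return mask;
}
```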

memcg: add per cgroup writeback pages accounting · 3ea67d06

Committed by Sha Zhengju on Sep 12, 2013

Add memcg routines to count writeback pages; dirty pages will also be
accounted later.

After Kame's commit 89c06bd5 ("memcg: use new logic for page stat
accounting"), we can use the 'struct page' flags to test page state instead
of per-page_cgroup flags.  But memcg can move a page from one cgroup to
another, so there may be a race between "move" and "page stat
accounting".  To avoid the race, we have designed a new lock:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()

It requires both (a) and (b) (writeback pages accounting) to be protected
in mem_cgroup_{begin/end}_update_page_stat().  It is a full no-op for
!CONFIG_MEMCG, almost a no-op if memcg is disabled (but compiled in), an RCU
read lock in most cases (no task is moving), and spin_lock_irqsave
on top in the slow path.

There are two writeback interfaces to modify: test_{clear/set}_page_writeback().
And the lock order is:
	--> memcg->move_lock
	  --> mapping->tree_lock
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: remove MEMCG_NR_FILE_MAPPED · 68b4876d

Committed by Sha Zhengju on Sep 12, 2013

When accounting memcg page stats, it's not worth using
MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the
complexity and presumed performance overhead.  We can use
MEM_CGROUP_STAT_FILE_MAPPED directly.
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: rename RESOURCE_MAX to RES_COUNTER_MAX · 6de5a8bf

Committed by Sha Zhengju on Sep 12, 2013

RESOURCE_MAX is far too general a name; change it to RES_COUNTER_MAX.
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: correct RESOURCE_MAX to ULLONG_MAX · 34ff8dc0

Committed by Sha Zhengju on Sep 12, 2013

Currently RESOURCE_MAX is LLONG_MAX, but the value used to set a resource
limit is an unsigned long long, so a value bigger than RESOURCE_MAX can be
set, which is strange.  The XXX_MAX value should be a reasonable maximum;
anything bigger than that should be treated as overflow.

Notice that this change will affect user output of default *.limit_in_bytes:
before change:

  $ cat /cgroup/memory/memory.limit_in_bytes
  9223372036854775807

after change:

  $ cat /cgroup/memory/memory.limit_in_bytes
  18446744073709551615

But it doesn't alter the API in terms of input: we can still use "echo -1
> *.limit_in_bytes" to reset the numbers to "unlimited".
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
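The arithmetic behind the two outputs shown above: the old default 9223372036854775807 is LLONG_MAX (2^63 - 1), and the new one, 18446744073709551615, is ULLONG_MAX (2^64 - 1). The sketch also shows why "-1" still means "unlimited". In C, `strtoull()` negates in unsigned arithmetic, so "-1" parses to exactly ULLONG_MAX. The kernel uses its own parser. `parse_limit()` is only a user-space illustration of the arithmetic.

```c
#include <limits.h>
#include <stdlib.h>

/* Old default limit: LLONG_MAX, i.e. 2^63 - 1. */
static const unsigned long long old_max = (unsigned long long)LLONG_MAX;
/* New default limit: ULLONG_MAX, i.e. 2^64 - 1. */
static const unsigned long long new_max = ULLONG_MAX;

/* "-1" is negated in unsigned long long, giving ULLONG_MAX. */
static unsigned long long parse_limit(const char *s)
{
	return strtoull(s, NULL, 10);
}
```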

mm: memcg: do not trap chargers with full callstack on OOM · 3812c8c8

Committed by Johannes Weiner on Sep 12, 2013

The memcg OOM handling is incredibly fragile and can deadlock.  When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds.  Meanwhile, any other task
that enters the charge path at this point will go to a waitqueue right
then and there and sleep until the OOM situation is resolved.  The problem
is that these tasks may hold filesystem locks and the mmap_sem; locks that
the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer was
about to charge a page cache page during a write(), which holds the
i_mutex.  The OOM killer selected a task that was just entering truncate()
and trying to acquire the i_mutex:

OOM invoking task:
  mem_cgroup_handle_oom+0x241/0x3b0
  mem_cgroup_cache_charge+0xbe/0xe0
  add_to_page_cache_locked+0x4c/0x140
  add_to_page_cache_lru+0x22/0x50
  grab_cache_page_write_begin+0x8b/0xe0
  ext3_write_begin+0x88/0x270
  generic_file_buffered_write+0x116/0x290
  __generic_file_aio_write+0x27c/0x480
  generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
  do_sync_write+0xea/0x130
  vfs_write+0xf3/0x1f0
  sys_write+0x51/0x90
  system_call_fastpath+0x18/0x1d

OOM kill victim:
  do_truncate+0x58/0xa0              # takes i_mutex
  do_last+0x250/0xa30
  path_openat+0xd7/0x440
  do_filp_open+0x49/0xa0
  do_sys_open+0x106/0x240
  sys_open+0x20/0x30
  system_call_fastpath+0x18/0x1d

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg is
disabled and a userspace task is in charge of resolving OOM situations.
In this case, ALL tasks that enter the OOM path will be made to sleep on
the OOM waitqueue and wait for userspace to free resources or increase
the group's limit.  But a userspace OOM handler is prone to deadlock
itself on the locks held by the waiting tasks.  For example one of the
sleeping tasks may be stuck in a brk() call with the mmap_sem held for
writing but the userspace handler, in order to pick an optimal victim,
may need to read files from /proc/<pid>, which tries to acquire the same
mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM and
makes sure nobody loops or sleeps with locks held:

1. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt.  This way, the OOM
   victim cannot get stuck on locks the looping task may hold.

2. When OOMing in a user fault but somebody else is handling it
   (either the kernel OOM killer or a userspace handler), don't go to
   sleep in the charge context.  Instead, remember the OOMing memcg in
   the task struct and then fully unwind the page fault stack with
   -ENOMEM.  pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault.  The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.

Debugged by Michal Hocko.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: azurIt <azurit@pobox.sk>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
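The control flow described in this commit, together with the user-fault flag from the two related commits listed next, can be sketched as follows. The charge path never loops or sleeps. It records the OOMing group, but only for user faults, and fails with -ENOMEM. The OOM kill or the wait on the waitqueue happens only after the fault stack has unwound and no locks are held. This is a heavily simplified model. All structures and names are hypothetical, the two cases of the patch are collapsed into one "defer" path, and the killer and the waitqueue are modelled as counters.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct memcg {
	bool oom_kill_disabled;   /* a userspace handler owns OOM */
	int  oom_waiters;         /* stands in for the OOM waitqueue */
	int  oom_kills;           /* stands in for invoking the OOM killer */
};

struct task {
	bool          in_user_fault;  /* flag passed down by arch fault code */
	struct memcg *memcg_in_oom;   /* remembered OOMing group, if any */
};

/* Charge path: locks may be held here, so never loop or sleep. */
static int try_charge(struct task *t, struct memcg *mc, bool over_limit)
{
	if (!over_limit)
		return 0;
	if (t->in_user_fault)
		t->memcg_in_oom = mc;  /* handle after the stack unwinds */
	return -ENOMEM;                /* syscalls, kernel faults: just fail */
}

/* Fault exit path, no locks held.  Returns true if the -ENOMEM came
 * from a memcg and the fault should be restarted. */
static bool oom_after_unwind(struct task *t)
{
	struct memcg *mc = t->memcg_in_oom;

	if (!mc)
		return false;          /* not a memcg OOM */
	t->memcg_in_oom = NULL;
	if (mc->oom_kill_disabled)
		mc->oom_waiters++;     /* sleep until userspace resolves it */
	else
		mc->oom_kills++;       /* kill a victim, safely */
	return true;
}
```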

mm: memcg: enable memcg OOM killer only for user faults · 519e5247

Committed by Johannes Weiner on Sep 12, 2013

System calls and kernel faults (uaccess, gup) can handle an out of memory
situation gracefully and just return -ENOMEM.

Enable the memcg OOM killer only for user faults, where it's really the
only option available.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

arch: mm: pass userspace fault flag to generic fault handler · 759496ba

Committed by Johannes Weiner on Sep 12, 2013

Unlike global OOM handling, memory cgroup code will invoke the OOM killer
in any OOM situation because it has no way of telling faults occurring in
kernel context, which could be handled more gracefully, from
user-triggered faults.

Pass a flag that identifies faults originating in user space from the
architecture-specific fault handlers to generic code so that memcg OOM
handling can be improved.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: enhance memcg iterator to support predicates · de57780d

Committed by Michal Hocko on Sep 12, 2013

The caller of the iterator might know that some nodes or even subtrees
should be skipped, but there is no way to tell the iterators about that, so
the only choice left is to let the iterators visit each node and do the
selection outside of the iterating code.  This, however, doesn't scale
well with hierarchies with many groups where only a few groups are
interesting.

This patch adds a mem_cgroup_iter_cond variant of the iterator with a
callback which gets called for every visited node.  The callback can
influence the walk in three possible ways: the node is visited; the node
is skipped but the walk continues down the tree; or the whole subtree of
the current group is skipped.

[hughd@google.com: fix memcg-less page reclaim]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
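The three callback outcomes can be sketched with a pre-order tree walk that takes a predicate. The outcomes are: visit the node; skip the node but descend into its children; or skip the whole subtree. The enum, structures and functions are hypothetical illustrations and are not the memcg iterator API. The walk records visited ids so the effect of the predicate is easy to check.

```c
#include <stddef.h>

enum filter_result {
	VISIT,       /* visit the node and descend */
	SKIP,        /* don't visit the node, but still walk its children */
	SKIP_TREE,   /* skip the node and its whole subtree */
};

struct group {
	int           id;
	struct group *child[3];   /* fixed fan-out for the sketch */
	int           nr_children;
};

typedef enum filter_result (*group_filter)(const struct group *g, void *arg);

/* Pre-order walk; appends visited ids to out[] and returns the new count. */
static int walk_cond(const struct group *g, group_filter cond, void *arg,
		     int *out, int n)
{
	enum filter_result r = cond ? cond(g, arg) : VISIT;

	if (r == SKIP_TREE)
		return n;
	if (r == VISIT)
		out[n++] = g->id;
	for (int i = 0; i < g->nr_children; i++)
		n = walk_cond(g->child[i], cond, arg, out, n);
	return n;
}

/* Example predicate: skip node 1 itself, skip the subtree rooted at 2. */
static enum filter_result example_filter(const struct group *g, void *arg)
{
	(void)arg;
	if (g->id == 2)
		return SKIP_TREE;
	if (g->id == 1)
		return SKIP;
	return VISIT;
}
```

For a root 1 with children 2 (child 4) and 3 (child 5), the filtered walk visits only 3 and 5.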
