- 10 10月, 2013 6 次提交
-
-
由 Rob Herring 提交于
Implement of_node_to_nid as weak function to remove the dependency on asm/prom.h. This is in preparation to make prom.h optional. Signed-off-by: NRob Herring <rob.herring@calxeda.com> Acked-by: NGrant Likely <grant.likely@linaro.org>
-
由 Rob Herring 提交于
Implement pci_address_to_pio as weak function to remove the dependency on asm/prom.h. This is in preparation to make prom.h optional. Signed-off-by: NRob Herring <rob.herring@calxeda.com> Acked-by: NGrant Likely <grant.likely@linaro.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: Grant Likely <grant.likely@linaro.org>
-
由 Rob Herring 提交于
Introduce common of_flat_dt_match_machine and of_flat_dt_get_machine_name functions to unify architectures' handling of machine level model and compatible properties. Several architectures match the root compatible string with an arch specific list of machine descriptors duplicating the same search algorithm. Create a common implementation with a simple architecture specific hook to iterate over each machine's match table. Signed-off-by: NRob Herring <rob.herring@calxeda.com> Acked-by: NGrant Likely <grant.likely@linaro.org>
-
由 Rob Herring 提交于
All arches do essentially the same thing now for early_init_dt_setup_initrd_arch, so it can now be removed. Signed-off-by: NRob Herring <rob.herring@calxeda.com> Acked-by: NVineet Gupta <vgupta@synopsys.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Mark Salter <msalter@redhat.com> Cc: Aurelien Jacquiot <a-jacquiot@ti.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Acked-by: NGrant Likely <grant.likely@linaro.org>
-
由 Rob Herring 提交于
Most architectures scan the all the same items early in the FDT and none are really architecture specific. Create a common early_init_dt_scan to unify the early scan of root, memory, and chosen nodes in the flattened DT. Signed-off-by: NRob Herring <rob.herring@calxeda.com> Acked-by: NGrant Likely <grant.likely@linaro.org>
-
由 Rob Herring 提交于
Several architectures using DT support built-in dtb's in the init section. These platforms need to copy the dtb from init since the strings are referenced after unflattening. Every arch has their own copying routine which do the same thing. Create a common function, unflatten_and_copy_device_tree, to copy the dtb when unflattening the dtb. Signed-off-by: NRob Herring <rob.herring@calxeda.com> Acked-by: NGrant Likely <grant.likely@linaro.org>
-
- 28 9月, 2013 1 次提交
-
-
由 Heiko Carstens 提交于
Linus suggested to replace #ifndef CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX #define arch_mutex_cpu_relax() cpu_relax() #endif with just a simple #ifndef arch_mutex_cpu_relax # define arch_mutex_cpu_relax() cpu_relax() #endif to get rid of CONFIG_HAVE_CPU_RELAX_SIMPLE. So architectures can simply define arch_mutex_cpu_relax if they want an architecture specific function instead of having to add a select statement in their Kconfig in addition. Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
-
- 27 9月, 2013 1 次提交
-
-
由 K. Y. Srinivasan 提交于
The current code does not correctly negotiate the version numbers for the util driver when hosted on earlier hosts. The version numbers presented by this driver were not compatible with the version numbers supported by Windows Server 2008. Fix this problem. I would like to thank Olaf Hering (ohering@suse.com) for identifying the problem. Reported-by: NOlaf Hering <ohering@suse.com> Signed-off-by: NK. Y. Srinivasan <kys@microsoft.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 25 9月, 2013 5 次提交
-
-
由 Rob Herring 提交于
Much of of_irq.h is needlessly ifdef'ed. Clean this up and minimize the amount ifdef'ed code. This fixes some build warnings when CONFIG_OF is not enabled (seen on i386 and x86_64): include/linux/of_irq.h:82:7: warning: 'struct device_node' declared inside parameter list [enabled by default] include/linux/of_irq.h:82:7: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default] include/linux/of_irq.h:87:47: warning: 'struct device_node' declared inside parameter list [enabled by default] Compile tested on i386, sparc and arm. Reported-by: NRandy Dunlap <rdunlap@infradead.org> Cc: Grant Likely <grant.likely@linaro.org> Signed-off-by: NRob Herring <rob.herring@calxeda.com>
-
由 Andrew Morton 提交于
Revert commit 3b38722e ("memcg, vmscan: integrate soft reclaim tighter with zone shrinking code") I merged this prematurely - Michal and Johannes still disagree about the overall design direction and the future remains unclear. Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
Revert commit a5b7c87f ("vmscan, memcg: do softlimit reclaim also for targeted reclaim") I merged this prematurely - Michal and Johannes still disagree about the overall design direction and the future remains unclear. Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
Revert commit de57780d ("memcg: enhance memcg iterator to support predicates") I merged this prematurely - Michal and Johannes still disagree about the overall design direction and the future remains unclear. Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
watchdog_tresh controls how often nmi perf event counter checks per-cpu hrtimer_interrupts counter and blows up if the counter hasn't changed since the last check. The counter is updated by per-cpu watchdog_hrtimer hrtimer which is scheduled with 2/5 watchdog_thresh period which guarantees that hrtimer is scheduled 2 times per the main period. Both hrtimer and perf event are started together when the watchdog is enabled. So far so good. But... But what happens when watchdog_thresh is updated from sysctl handler? proc_dowatchdog will set a new sampling period and hrtimer callback (watchdog_timer_fn) will use the new value in the next round. The problem, however, is that nobody tells the perf event that the sampling period has changed so it is ticking with the period configured when it has been set up. This might result in an ear ripping dissonance between perf and hrtimer parts if the watchdog_thresh is increased. And even worse it might lead to KABOOM if the watchdog is configured to panic on such a spurious lockup. This patch fixes the issue by updating both nmi perf even counter and hrtimers if the threshold value has changed. The nmi one is disabled and then reinitialized from scratch. This has an unpleasant side effect that the allocation of the new event might fail theoretically so the hard lockup detector would be disabled for such cpus. On the other hand such a memory allocation failure is very unlikely because the original event is deallocated right before. It would be much nicer if we just changed perf event period but there doesn't seem to be any API to do that right now. It is also unfortunate that perf_event_alloc uses GFP_KERNEL allocation unconditionally so we cannot use on_each_cpu() and do the same thing from the per-cpu context. The update from the current CPU should be safe because perf_event_disable removes the event atomically before it clears the per-cpu watchdog_ev so it cannot change anything under running handler feet. The hrtimer is simply restarted (thanks to Don Zickus who has pointed this out) if it is queued because we cannot rely it will fire&adopt to the new sampling period before a new nmi event triggers (when the treshold is decreased). [akpm@linux-foundation.org: the UP version of __smp_call_function_single ended up in the wrong place] Signed-off-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NDon Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Fabio Estevam <festevam@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 9月, 2013 1 次提交
-
-
由 Jun'ichi Nomura 提交于
Adding the number of bios in a remapped request to 'block_rq_remap' tracepoint. Request remapper clones bios in a request to track the completion status of each bio. So the number of bios can be useful information for investigation. Related discussions: http://www.redhat.com/archives/dm-devel/2013-August/msg00084.html http://www.redhat.com/archives/dm-devel/2013-September/msg00024.htmlSigned-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: NMike Snitzer <snitzer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 21 9月, 2013 2 次提交
-
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NJosef Bacik <jbacik@fusionio.com> Signed-off-by: NChris Mason <chris.mason@fusionio.com>
-
由 Michel Dänzer 提交于
CIK uses a different index for 1D DST surfaces compared to SI. Expose the new index so libdrm_radeon can use it properly for userspace drivers. Signed-off-by: NMichel Dänzer <michel.daenzer@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 20 9月, 2013 5 次提交
-
-
由 Mike Snitzer 提交于
Workaround the SCSI layer's problematic WRITE SAME heuristics by disabling WRITE SAME in the DM multipath device's queue_limits if an underlying device disabled it. The WRITE SAME heuristics, with both the original commit 5db44863 ("[SCSI] sd: Implement support for WRITE SAME") and the updated commit 66c28f97 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling WRITE SAME(10) even without successfully determining it is supported. After the first failed WRITE SAME the SCSI layer will disable WRITE SAME for the device (by setting sdkp->device->no_write_same which results in 'max_write_same_sectors' in device's queue_limits to be set to 0). When a device is stacked ontop of such a SCSI device any changes to that SCSI device's queue_limits do not automatically propagate up the stack. As such, a DM multipath device will not have its WRITE SAME support disabled. This causes the block layer to continue to issue WRITE SAME requests to the mpath device which causes paths to fail and (if mpath IO isn't configured to queue when no paths are available) it will result in actual IO errors to the upper layers. This fix doesn't help configurations that have additional devices stacked ontop of the mpath device (e.g. LVM created linear DM devices ontop). A proper fix that restacks all the queue_limits from the bottom of the device stack up will need to be explored if SCSI will continue to use this model of optimistically allowing op codes and then disabling them after they fail for the first time. Before this patch: EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null) device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121) device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121 end_request: critical target error, dev dm-6, sector 528 dm-6: WRITE SAME failed. Manually zeroing. device-mapper: multipath: Failing path 8:112. end_request: I/O error, dev dm-6, sector 4616 dm-6: WRITE SAME failed. Manually zeroing. end_request: I/O error, dev dm-6, sector 4616 end_request: I/O error, dev dm-6, sector 5640 end_request: I/O error, dev dm-6, sector 6664 end_request: I/O error, dev dm-6, sector 7688 end_request: I/O error, dev dm-6, sector 524288 Buffer I/O error on device dm-6, logical block 65536 lost page write due to I/O error on dm-6 JBD2: Error -5 detected when updating journal superblock for dm-6-8. end_request: I/O error, dev dm-6, sector 524296 Aborting journal on device dm-6-8. end_request: I/O error, dev dm-6, sector 524288 Buffer I/O error on device dm-6, logical block 65536 lost page write due to I/O error on dm-6 JBD2: Error -5 detected when updating journal superblock for dm-6-8. # cat /sys/block/sdh/queue/write_same_max_bytes 0 # cat /sys/block/dm-6/queue/write_same_max_bytes 33553920 After this patch: EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null) device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121) device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121 end_request: critical target error, dev dm-6, sector 528 dm-6: WRITE SAME failed. Manually zeroing. # cat /sys/block/sdh/queue/write_same_max_bytes 0 # cat /sys/block/dm-6/queue/write_same_max_bytes 0 It should be noted that WRITE SAME support wasn't enabled in DM multipath until v3.10. Signed-off-by: NMike Snitzer <snitzer@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Hannes Reinecke <hare@suse.de> Cc: stable@vger.kernel.org # 3.10+
-
由 Peter Zijlstra 提交于
Solve the problems around the broken definition of perf_event_mmap_page:: cap_usr_time and cap_usr_rdpmc fields which used to overlap, partially fixed by: 860f085b ("perf: Fix broken union in 'struct perf_event_mmap_page'") The problem with the fix (merged in v3.12-rc1 and not yet released officially), noticed by Vince Weaver is that the new behavior is not detectable by new user-space, and that due to the reuse of the field names it's easy to mis-compile a binary if old headers are used on a new kernel or new headers are used on an old kernel. To solve all that make this change explicit, detectable and self-contained, by iterating the ABI the following way: - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not confuse old user-space binaries. RDPMC will be marked as unavailable to old binaries but that's within the ABI, this is a capability bit. - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new libraries can reliably detect that bit 0 is deprecated and perma-zero without having to check the kernel version. - Use bits 2, 3, 4 for the newly defined, correct functionality: cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */ cap_user_time : 1, /* The time_* fields are used */ cap_user_time_zero : 1, /* The time_zero field is used */ - Rename all the bitfield names in perf_event.h to be different from the old names, to make sure it's not possible to mis-compile it accidentally with old assumptions. The 'size' field can then be used in the future to add new fields and it will act as a natural ABI version indicator as well. Also adjust tools/perf/ userspace for the new definitions, noticed by Adrian Hunter. Reported-by: NVince Weaver <vincent.weaver@maine.edu> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Also-Fixed-by: NAdrian Hunter <adrian.hunter@intel.com> Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
For some mysterious reason the sample_id field of PERF_RECORD_MMAP went AWOL. Reported-by: NVince Weaver <vince@deater.net> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Dave Airlie 提交于
This reverts commit 7c510133. Well looks like not enough digging was done, libdrm_nouveau before 2.4.33 used contexts, 292da616fe1f936ca78a3fa8e1b1b19883e343b6 nouveau: pull in major libdrm rewrite got rid of them, Reported-by: NPaul Zimmerman <Paul.Zimmerman@synopsys.com> Reported-by: NMikael Pettersson <mikpe@it.uu.se> Signed-off-by: NDave Airlie <airlied@redhat.com>
-
由 Ansis Atteka 提交于
If local fragmentation is allowed, then ip_select_ident() and ip_select_ident_more() need to generate unique IDs to ensure correct defragmentation on the peer. For example, if IPsec (tunnel mode) has to encrypt large skbs that have local_df bit set, then all IP fragments that belonged to different ESP datagrams would have used the same identificator. If one of these IP fragments would get lost or reordered, then peer could possibly stitch together wrong IP fragments that did not belong to the same datagram. This would lead to a packet loss or data corruption. Signed-off-by: NAnsis Atteka <aatteka@nicira.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 9月, 2013 1 次提交
-
-
由 Vince Weaver 提交于
Without the following patch I have problems compiling code using the new PERF_EVENT_IOC_ID ioctl(). It looks like u64 was used instead of __u64 Signed-off-by: NVince Weaver <vincent.weaver@maine.edu> Acked-by: NPeter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1309171450380.11444@vincent-weaver-1.um.maine.eduSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 17 9月, 2013 2 次提交
-
-
由 Paolo Bonzini 提交于
Page tables in a read-only memory slot will currently cause a triple fault because the page walker uses gfn_to_hva and it fails on such a slot. OVMF uses such a page table; however, real hardware seems to be fine with that as long as the accessed/dirty bits are set. Save whether the slot is readonly, and later check it when updating the accessed and dirty bits. Reviewed-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: NGleb Natapov <gleb@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jozsef Kadlecsik 提交于
The "nomatch" commandline flag should invert the matching at testing, similarly to the --return-nomatch flag of the "set" match of iptables. Until now it worked with the elements with "nomatch" flag only. From now on it works with elements without the flag too, i.e: # ipset n test hash:net # ipset a test 10.0.0.0/24 nomatch # ipset t test 10.0.0.1 10.0.0.1 is NOT in set test. # ipset t test 10.0.0.1 nomatch 10.0.0.1 is in set test. # ipset a test 192.168.0.0/24 # ipset t test 192.168.0.1 192.168.0.1 is in set test. # ipset t test 192.168.0.1 nomatch 192.168.0.1 is NOT in set test. Before the patch the results were ... # ipset t test 192.168.0.1 192.168.0.1 is in set test. # ipset t test 192.168.0.1 nomatch 192.168.0.1 is in set test. Signed-off-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
-
- 16 9月, 2013 1 次提交
-
-
由 Joseph Gasparakis 提交于
This patch fixes sparse warnings when incorrectly handling the port number and using int instead of unsigned int iterating through &vn->sock_list[]. Keeping the port as __be16 also makes things clearer wrt endianess. Also, it was pointed out that vxlan_get_rx_port() had unnecessary checks which got removed. Signed-off-by: NJoseph Gasparakis <joseph.gasparakis@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 9月, 2013 15 次提交
-
-
由 Kees Cook 提交于
Many drivers need to validate the characteristics of their HID report during initialization to avoid misusing the reports. This adds a common helper to perform validation of the report exisitng, the field existing, and the expected number of values within the field. Signed-off-by: NKees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Reviewed-by: NBenjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: NJiri Kosina <jkosina@suse.cz>
-
由 Martin Schwidefsky 提交于
After the last architecture switched to generic hard irqs the config options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code for !CONFIG_GENERIC_HARDIRQS can be removed. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
由 Michal Kubeček 提交于
Commit 68b80f11 (netfilter: nf_nat: fix RCU races) introduced RCU protection for freeing extension data when reallocation moves them to a new location. We need the same protection when freeing them in nf_ct_ext_free() in order to prevent a use-after-free by other threads referencing a NAT extension data via bysource list. Signed-off-by: NMichal Kubecek <mkubecek@suse.cz> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Kirill A. Shutemov 提交于
do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault() to handle fallback path. Let's consolidate code back by introducing VM_FAULT_FALLBACK return code. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NHillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
truncate_pagecache() doesn't care about old size since commit cedabed4 ("vfs: Fix vmtruncate() regression"). Let's drop it. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chris Metcalf 提交于
make lru_add_drain_all() only selectively interrupt the cpus that have per-cpu free pages that can be drained. This is important in nohz mode where calling mlockall(), for example, otherwise will interrupt every core unnecessarily. This is important on workloads where nohz cores are handling 10 Gb traffic in userspace. Those CPUs do not enter the kernel and place pages into LRU pagevecs and they really, really don't want to be interrupted, or they drop packets on the floor. Signed-off-by: NChris Metcalf <cmetcalf@tilera.com> Reviewed-by: NTejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sha Zhengju 提交于
Add memcg routines to count writeback pages, later dirty pages will also be accounted. After Kame's commit 89c06bd5 ("memcg: use new logic for page stat accounting"), we can use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg has a feature to move a page from a cgroup to another one and may have race between "move" and "page stat accounting". So in order to avoid the race we have designed a new lock: mem_cgroup_begin_update_page_stat() modify page information -->(a) mem_cgroup_update_page_stat() -->(b) mem_cgroup_end_update_page_stat() It requires both (a) and (b)(writeback pages accounting) to be pretected in mem_cgroup_{begin/end}_update_page_stat(). It's full no-op for !CONFIG_MEMCG, almost no-op if memcg is disabled (but compiled in), rcu read lock in the most cases (no task is moving), and spin_lock_irqsave on top in the slow path. There're two writeback interfaces to modify: test_{clear/set}_page_writeback(). And the lock order is: --> memcg->move_lock --> mapping->tree_lock Signed-off-by: NSha Zhengju <handai.szj@taobao.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NGreg Thelen <gthelen@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sha Zhengju 提交于
While accounting memcg page stat, it's not worth to use MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the complexity and presumed performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly. Signed-off-by: NSha Zhengju <handai.szj@taobao.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NFengguang Wu <fengguang.wu@intel.com> Reviewed-by: NGreg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sha Zhengju 提交于
RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX. Signed-off-by: NSha Zhengju <handai.szj@taobao.com> Signed-off-by: NQiang Huang <h.huangqiang@huawei.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sha Zhengju 提交于
Current RESOURCE_MAX is ULONG_MAX, but the value we used to set resource limit is unsigned long long, so we can set bigger value than that which is strange. The XXX_MAX should be reasonable max value, bigger than that should be overflow. Notice that this change will affect user output of default *.limit_in_bytes: before change: $ cat /cgroup/memory/memory.limit_in_bytes 9223372036854775807 after change: $ cat /cgroup/memory/memory.limit_in_bytes 18446744073709551615 But it doesn't alter the API in term of input - we can still use "echo -1 > *.limit_in_bytes" to reset the numbers to "unlimited". Signed-off-by: NSha Zhengju <handai.szj@taobao.com> Signed-off-by: NQiang Huang <h.huangqiang@huawei.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reported-by: NazurIt <azurit@pobox.sk> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
System calls and kernel faults (uaccess, gup) can handle an out of memory situation gracefully and just return -ENOMEM. Enable the memcg OOM killer only for user faults, where it's really the only option available. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Unlike global OOM handling, memory cgroup code will invoke the OOM killer in any OOM situation because it has no way of telling faults occuring in kernel context - which could be handled more gracefully - from user-triggered faults. Pass a flag that identifies faults originating in user space from the architecture-specific fault handlers to generic code so that memcg OOM handling can be improved. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
The caller of the iterator might know that some nodes or even subtrees should be skipped but there is no way to tell iterators about that so the only choice left is to let iterators to visit each node and do the selection outside of the iterating code. This, however, doesn't scale well with hierarchies with many groups where only few groups are interesting. This patch adds mem_cgroup_iter_cond variant of the iterator with a callback which gets called for every visited node. There are three possible ways how the callback can influence the walk. Either the node is visited, it is skipped but the tree walk continues down the tree or the whole subtree of the current group is skipped. [hughd@google.com: fix memcg-less page reclaim] Signed-off-by: NMichal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@openvz.org> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Soft reclaim has been done only for the global reclaim (both background and direct). Since "memcg: integrate soft reclaim tighter with zone shrinking code" there is no reason for this limitation anymore as the soft limit reclaim doesn't use any special code paths and it is a part of the zone shrinking code which is used by both global and targeted reclaims. From the semantic point of view it is natural to consider soft limit before touching all groups in the hierarchy tree which is touching the hard limit because soft limit tells us where to push back when there is a memory pressure. It is not important whether the pressure comes from the limit or imbalanced zones. This patch simply enables soft reclaim unconditionally in mem_cgroup_should_soft_reclaim so it is enabled for both global and targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn about the root of the reclaim to know where to stop checking soft limit state of parents up the hierarchy. Say we have A (over soft limit) \ B (below s.l., hit the hard limit) / \ C D (below s.l.) B is the source of the outside memory pressure now for D but we shouldn't soft reclaim it because it is behaving well under B subtree and we can still reclaim from C (pressumably it is over the limit). mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the hierarchy at B (root of the memory pressure). Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NGlauber Costa <glommer@openvz.org> Reviewed-by: NTejun Heo <tj@kernel.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-