- 01 6月, 2012 7 次提交
-
-
由 Anton Vorontsov 提交于
Many architectures clear tasks' mm_cpumask like this: read_lock(&tasklist_lock); for_each_process(p) { if (p->mm) cpumask_clear_cpu(cpu, mm_cpumask(p->mm)); } read_unlock(&tasklist_lock); Depending on the context, the code above may have several problems, such as: 1. Working with task->mm w/o getting mm or grabing the task lock is dangerous as ->mm might disappear (exit_mm() assigns NULL under task_lock(), so tasklist lock is not enough). 2. Checking for process->mm is not enough because process' main thread may exit or detach its mm via use_mm(), but other threads may still have a valid mm. This patch implements a small helper function that does things correctly, i.e.: 1. We take the task's lock while whe handle its mm (we can't use get_task_mm()/mmput() pair as mmput() might sleep); 2. To catch exited main thread case, we use find_lock_task_mm(), which walks up all threads and returns an appropriate task (with task lock held). Also, Per Peter Zijlstra's idea, now we don't grab tasklist_lock in the new helper, instead we take the rcu read lock. We can do this because the function is called after the cpu is taken down and marked offline, so no new tasks will get this cpu set in their mm mask. Signed-off-by: NAnton Vorontsov <anton.vorontsov@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Commit 8f92054e ("CRED: Fix __task_cred()'s lockdep check and banner comment"): add the following validation condition: task->exit_state >= 0 to permit the access if the target task is dead and therefore unable to change its own credentials. OK, but afaics currently this can only help wait_task_zombie() which calls __task_cred() without rcu lock. Remove this validation and change wait_task_zombie() to use task_uid() instead. This means we do rcu_read_lock() only to shut up the lockdep, but we already do the same in, say, wait_task_stopped(). task_is_dead() should die, task->exit_state != 0 means that this task has passed exit_notify(), only do_wait-like code paths should use this. Unfortunately, we can't kill task_is_dead() right now, it has already acquired buggy users in drivers/staging. The fix already exists. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Reviewed-by: N"Eric W. Biederman" <ebiederm@xmission.com> Acked-by: NDavid Howells <dhowells@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Boaz Harrosh 提交于
If we move call_usermodehelper_fns() to kmod.c file and EXPORT_SYMBOL it we can avoid exporting all it's helper functions: call_usermodehelper_setup call_usermodehelper_setfns call_usermodehelper_exec And make all of them static to kmod.c Since the optimizer will see all these as a single call site it will inline them inside call_usermodehelper_fns(). So we loose the call to _fns but gain 3 calls to the helpers. (Not that it matters) Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Boaz Harrosh 提交于
call_usermodehelper_freeinfo() is not used outside of kmod.c. So unexport it, and make it static to kmod.c Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Artem Bityutskiy 提交于
This is patchset makes fatfs stop using the VFS '->write_super()' method for writing out the FSINFO block. The final goal is to get rid of the 'sync_supers()' kernel thread. This kernel thread wakes up every 5 seconds (by default) and calls '->write_super()' for all mounted file-systems. And the bad thing is that this is done even if all the superblocks are clean. Moreover, some file-systems do not even need this end they do not register the '->write_super()' method at all (e.g., btrfs). So 'sync_supers()' most often just generates useless wake-ups and wastes power. I am trying to make all file-systems independent of '->write_super()' and plan to remove 'sync_supers()' and '->write_super' completely once there are no more users. The '->write_supers()' method is mostly used by baroque file-systems like hfs, udf, etc. Modern file-systems like btrfs and xfs do not use it. This justifies removing this stuff from VFS completely and make every FS self-manage own superblock. Tested with xfstests. This patch: Preparation for further changes. It introduces a special inode ('fsinfo_inode') in FAT file-system which we'll later use for managing the FSINFO block. Note, this there is already one special inode ('fat_inode') which is used for managing the FAT tables. Introduce new 'MSDOS_FSINFO_INO' constant for this special inode. It is safe to do because FAT file-system does not store inode numbers on the media but generates them run-time. I've also cleaned up the comment to existing 'MSDOS_ROOT_INO' constant, while on it. Signed-off-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Denys Vlasenko 提交于
Previous code was using optimizations which were developed to work well even on narrow-word CPUs (by today's standards). But Linux runs only on 32-bit and wider CPUs. We can use that. First: using 32x32->64 multiply and trivial 32-bit shift, we can correctly divide by 10 much larger numbers, and thus we can print groups of 9 digits instead of groups of 5 digits. Next: there are two algorithms to print larger numbers. One is generic: divide by 1000000000 and repeatedly print groups of (up to) 9 digits. It's conceptually simple, but requires an (unsigned long long) / 1000000000 division. Second algorithm splits 64-bit unsigned long long into 16-bit chunks, manipulates them cleverly and generates groups of 4 decimal digits. It so happens that it does NOT require long long division. If long is > 32 bits, division of 64-bit values is relatively easy, and we will use the first algorithm. If long long is > 64 bits (strange architecture with VERY large long long), second algorithm can't be used, and we again use the first one. Else (if long is 32 bits and long long is 64 bits) we use second one. And third: there is a simple optimization which takes fast path not only for zero as was done before, but for all one-digit numbers. In all tested cases new code is faster than old one, in many cases by 30%, in few cases by more than 50% (for example, on x86-32, conversion of 12345678). Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case. This patch is based upon an original from Michal Nazarewicz. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: NMichal Nazarewicz <mina86@mina86.com> Signed-off-by: NDenys Vlasenko <vda.linux@googlemail.com> Cc: Douglas W Jones <jones@cs.uiowa.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Xi Wang 提交于
ULONG_MAX is often used to check for integer overflow when calculating allocation size. While ULONG_MAX happens to work on most systems, there is no guarantee that `size_t' must be the same size as `long'. This patch introduces SIZE_MAX, the maximum value of `size_t', to improve portability and readability for allocation size validation. Signed-off-by: NXi Wang <xi.wang@gmail.com> Acked-by: NAlex Elder <elder@dreamhost.com> Cc: David Airlie <airlied@linux.ie> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 5月, 2012 33 次提交
-
-
由 Wolfram Sang 提交于
Some DS13XX devices have "trickle chargers". Its configuration register is at different locations, the setup is the same, though. Since the configuration is board specific, introduce a platform_data to this driver. Tested with a DS1339 on a custom board. Signed-off-by: NWolfram Sang <w.sang@pengutronix.de> Cc: Alessandro Zummo <alessandro.zummo@towertech.it> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexander Stein 提交于
Currently there is no generic way to get the RTC battery status within an application. So add an ioctl to read the status bit. The idea is that the bit is set once a low voltage is detected. It stays there until it is reset using the RTC_VL_CLR ioctl. Signed-off-by: NAlexander Stein <alexander.stein@systec-electronic.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Stephen Boyd 提交于
Using %ps in a printk format will sometimes fail silently and print the empty string if the address passed in does not match a symbol that kallsyms knows about. But using %pS will fall back to printing the full address if kallsyms can't find the symbol. Make %ps act the same as %pS by falling back to printing the address. While we're here also make %ps print the module that a symbol comes from so that it matches what %pS already does. Take this simple function for example (in a module): static void test_printk(void) { int test; pr_info("with pS: %pS\n", &test); pr_info("with ps: %ps\n", &test); } Before this patch: with pS: 0xdff7df44 with ps: After this patch: with pS: 0xdff7df44 with ps: 0xdff7df44 Signed-off-by: NStephen Boyd <sboyd@codeaurora.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kim, Milo 提交于
max brightness is 127, so the range of brt_val should be from 0 to 127 Signed-off-by: NMilo(Woogyom) Kim <milo.kim@ti.com> Acked-by: NLinus Walleij <linus.walleij@linaro.org> Cc: Shreshtha Kumar SAHU <shreshthakumar.sahu@stericsson.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Bryan Wu <bryan.wu@canonical.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shuah Khan 提交于
Add a new field to led_classdev to save activattion state after activate routine is successful. This saved state is used in deactivate routine to do cleanup such as removing device files, and free memory allocated during activation. Currently trigger_data not being null is used for this purpose. Existing triggers will need changes to use this new field. Signed-off-by: NShuah Khan <shuahkhan@gmail.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Bryan Wu <bryan.wu@canonical.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 H Hartley Sweeten 提交于
Include the header to pickup the exported symbol prototype. Quiets the sparse warning: warning: symbol 'apple_bl_register' was not declared. Should it be static? warning: symbol 'apple_bl_unregister' was not declared. Should it be static? [akpm@linux-foundation.org: fix resulting build error] Signed-off-by: NH Hartley Sweeten <hsweeten@visionengravers.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: NFlorian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Inki Dae 提交于
This patchset adds early fb blank feature that a callback of lcd panel driver is called prior to specific fb driver's one. In the case of MIPI-DSI based video mode LCD Panel, for lcd power off, the power off commands should be transferred to lcd panel with display and mipi-dsi controller enabled because the commands is set to lcd panel at vsync porch period. and in opposite case, the callback of fb driver should be called prior to lcd panel driver's one because of same issue. Also if fb_blank mode is changed to FB_BLANK_POWERDOWN then display controller would be off(clock disable) but lcd panel would be still on. at this time, you could see some issue like sparkling on lcd panel because video clock to be delivered to ldi module of lcd panel was disabled. this issue could occurs for all lcd panels. The callback order is as the following: at fb_blank function of fbmem.c -> fb_notifier_call_chain(FB_EARLY_EVENT_BLANK) -> lcd panel driver's early_set_power() -> info->fbops->fb_blank() -> spcefic fb driver's fb_blank() -> fb_notifier_call_chain(FB_EVENT_BLANK) -> lcd panel driver's set_power() -> fb_notifier_call_chain(FB_R_EARLY_EVENT_BLANK) if info->fops->fb_blank() was failed. fb_notifier_call_chain(FB_R_EARLY_EVENT_BLANK) would be called to revert the effects of previous FB_EARLY_EVENT_BLANK call. and note that if early_set_power() of lcd_ops is NULL then early fb blank callback would be ignored. This patch: Add early_set_power and r_early_set_power callbacks. early_set_power callback is called prior to fb_blank() of fbmem.c and r_early_set_power callback is called if fb_blank() was failed to revert the effects of the early_set_power call of lcd panel driver. Signed-off-by: NInki Dae <inki.dae@samsung.com> Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Inki Dae 提交于
Add FB_EARLY_EVENT_BLANK and FB_R_EARLY_EVENT_BLANK event mode supports. first, fb_notifier_call_chain() is called with FB_EARLY_EVENT_BLANK and fb_blank() of specific fb driver is called and then fb_notifier_call_chain() is called with FB_EVENT_BLANK again at fb_blank(). and if fb_blank() was failed then fb_nitifier_call_chain() would be called with FB_R_EARLY_EVENT_BLANK to revert the previous effects. Signed-off-by: NInki Dae <inki.dae@samsung.com> Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Acked-by: NFlorian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
We call the destroy function when a cgroup starts to be removed, such as by a rmdir event. However, because of our reference counters, some objects are still inflight. Right now, we are decrementing the static_keys at destroy() time, meaning that if we get rid of the last static_key reference, some objects will still have charges, but the code to properly uncharge them won't be run. This becomes a problem specially if it is ever enabled again, because now new charges will be added to the staled charges making keeping it pretty much impossible. We just need to be careful with the static branch activation: since there is no particular preferred order of their activation, we need to make sure that we only start using it after all call sites are active. This is achieved by having a per-memcg flag that is only updated after static_key_slow_inc() returns. At this time, we are sure all sites are active. This is made per-memcg, not global, for a reason: it also has the effect of making socket accounting more consistent. The first memcg to be limited will trigger static_key() activation, therefore, accounting. But all the others will then be accounted no matter what. After this patch, only limited memcgs will have its sockets accounted. [akpm@linux-foundation.org: move enum sock_flag_bits into sock.h, document enum sock_flag_bits, convert memcg_proto_active() and memcg_proto_activated() to test_bit(), redo tcp_update_limit() comment to 80 cols] Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Acked-by: NDavid Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Take lruvec further: pass it instead of zone to add_page_to_lru_list() and del_page_from_lru_list(); and pagevec_lru_move_fn() pass lruvec down to its target functions. This cleanup eliminates a swathe of cruft in memcontrol.c, including mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and mem_cgroup_lru_move_lists() - which never actually touched the lists. In their place, mem_cgroup_page_lruvec() to decide the lruvec, previously a side-effect of add, and mem_cgroup_update_lru_size() to maintain the lru_size stats. Whilst these are simplifications in their own right, the goal is to bring the evaluation of lruvec next to the spin_locking of the lrus, in preparation for a future patch. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Konstantin just introduced mem_cgroup_get_lruvec_size() and get_lruvec_size(), I'm about to add mem_cgroup_update_lru_size(): but we're dealing with the same thing, lru_size[lru]. We ought to agree on the naming, and I do think lru_size is the more correct: so rename his ones to get_lru_size(). Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Since we will succeed with the allocation no matter what, there isn't a need to use __must_check with it. It can very well be optional. Signed-off-by: NGlauber Costa <glommer@parallels.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ying Han <yinghan@google.com> Reviewed-by: NTejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Frederic Weisbecker 提交于
When killing a res_counter which is a child of other counter, we need to do res_counter_uncharge(child, xxx) res_counter_charge(parent, xxx) This is not atomic and wastes CPU. This patch adds res_counter_uncharge_until(). This function's uncharge propagates to ancestors until specified res_counter. res_counter_uncharge_until(child, parent, xxx) Now the operation is atomic and efficient. Signed-off-by: NFrederic Weisbecker <fweisbec@redhat.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Cc: Glauber Costa <glommer@parallels.com> Reviewed-by: NTejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
Switch mem_cgroup_inactive_anon_is_low() to lruvec pointers, mem_cgroup_get_lruvec_size() is more effective than mem_cgroup_zone_nr_lru_pages() Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
If memory cgroup is enabled we always use lruvecs which are embedded into struct mem_cgroup_per_zone, so we can reach lru_size counters via container_of(). Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
This is the first stage of struct mem_cgroup_zone removal. Further patches replace struct mem_cgroup_zone with a pointer to struct lruvec. If CONFIG_CGROUP_MEM_RES_CTLR=n lruvec_zone() is just container_of(). Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
This patch kills mem_cgroup_lru_del(), we can use mem_cgroup_lru_del_list() instead. On 0-order isolation we already have right lru list id. Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can completely remove anon/file and active/inactive lru type filters from __isolate_lru_page(), because isolation for 0-order reclaim always isolates pages from right lru list. And pages-isolation for lumpy shrink_inactive_list() or memory-compaction anyway allowed to isolate pages from all evictable lru lists. Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
GCC sometimes ignores "inline" directives even for small and simple functions. This supposed to be fixed in gcc 4.7, but it was released only yesterday. Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
With mem_cgroup_disabled() now explicit, it becomes clear that the zone_reclaim_stat structure actually belongs in lruvec, per-zone when memcg is disabled but per-memcg per-zone when it's enabled. We can delete mem_cgroup_get_reclaim_stat(), and change update_page_reclaim_stat() to update just the one set of stats, the one which get_scan_count() will actually use. Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
This patch changes memcg's behavior at task_move(). At task_move(), the kernel scans a task's page table and move the changes for mapped pages from source cgroup to target cgroup. There has been a bug at handling shared anonymous pages for a long time. Before patch: - The spec says 'shared anonymous pages are not moved.' - The implementation was 'shared anonymoys pages may be moved'. If page_mapcount <=2, shared anonymous pages's charge were moved. After patch: - The spec says 'all anonymous pages are moved'. - The implementation is 'all anonymous pages are moved'. Considering usage of memcg, this will not affect user's experience. 'shared anonymous' pages only exists between a tree of processes which don't do exec(). Moving one of process without exec() seems not sane. For example, libcgroup will not be affected by this change. (Anyway, no one noticed the implementation for a long time...) Below is a discussion log: - current spec/implementation are complex - Now, shared file caches are moved - It adds unclear check as page_mapcount(). To do correct check, we should check swap users, etc. - No one notice this implementation behavior. So, no one get benefit from the design. - In general, once task is moved to a cgroup for running, it will not be moved.... - Finally, we have control knob as memory.move_charge_at_immigrate. Here is a patch to allow moving shared pages, completely. This makes memcg simpler and fix current broken code. Suggested-by: NHugh Dickins <hughd@google.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pravin B Shelar 提交于
Transparent huge pages can change page->flags (PG_compound_lock) without taking Slab lock. Since THP can not break slab pages we can safely access compound page without taking compound lock. Specifically this patch fixes a race between compound_unlock() and slab functions which perform page-flags updates. This can occur when get_page()/put_page() is called on a page from slab. [akpm@linux-foundation.org: tweak comment text, fix comment layout, fix label indenting] Reported-by: NAmey Bhide <abhide@nicira.com> Signed-off-by: NPravin B Shelar <pshelar@nicira.com> Reviewed-by: NChristoph Lameter <cl@linux.com> Acked-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
When holding the mmap_sem for reading, pmd_offset_map_lock should only run on a pmd_t that has been read atomically from the pmdp pointer, otherwise we may read only half of it leading to this crash. PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic" #0 [f06a9dd8] crash_kexec at c049b5ec #1 [f06a9e2c] oops_end at c083d1c2 #2 [f06a9e40] no_context at c0433ded #3 [f06a9e64] bad_area_nosemaphore at c043401a #4 [f06a9e6c] __do_page_fault at c0434493 #5 [f06a9eec] do_page_fault at c083eb45 #6 [f06a9f04] error_code (via page_fault) at c083c5d5 EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP: 00000000 DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0 CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246 #7 [f06a9f38] _spin_lock at c083bc14 #8 [f06a9f44] sys_mincore at c0507b7d #9 [f06a9fb0] system_call at c083becd start len EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00 SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033 CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286 This should be a longstanding bug affecting x86 32bit PAE without THP. Only archs with 64bit large pmd_t and 32bit unsigned long should be affected. With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad() would partly hide the bug when the pmd transition from none to stable, by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is enabled a new set of problem arises by the fact could then transition freely in any of the none, pmd_trans_huge or pmd_trans_stable states. So making the barrier in pmd_none_or_trans_huge_or_clear_bad() unconditional isn't good idea and it would be a flakey solution. This should be fully fixed by introducing a pmd_read_atomic that reads the pmd in order with THP disabled, or by reading the pmd atomically with cmpxchg8b with THP enabled. Luckily this new race condition only triggers in the places that must already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix is localized there but this bug is not related to THP. NOTE: this can trigger on x86 32bit systems with PAE enabled with more than 4G of ram, otherwise the high part of the pmd will never risk to be truncated because it would be zero at all times, in turn so hiding the SMP race. This bug was discovered and fully debugged by Ulrich, quote: ---- [..] pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and eax. 496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd) 497 { 498 /* depend on compiler for an atomic pmd read */ 499 pmd_t pmdval = *pmd; // edi = pmd pointer 0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi ... // edx = PTE page table high address 0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx ... // eax = PTE page table low address 0xc0507a8e <sys_mincore+574>: mov (%edi),%eax [..] Please note that the PMD is not read atomically. These are two "mov" instructions where the high order bits of the PMD entry are fetched first. Hence, the above machine code is prone to the following race. - The PMD entry {high|low} is 0x0000000000000000. The "mov" at 0xc0507a84 loads 0x00000000 into edx. - A page fault (on another CPU) sneaks in between the two "mov" instructions and instantiates the PMD. - The PMD entry {high|low} is now 0x00000003fda38067. The "mov" at 0xc0507a8e loads 0xfda38067 into eax. ---- Reported-by: NUlrich Obergfell <uobergfe@redhat.com> Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
The oom_score_adj scale ranges from -1000 to 1000 and represents the proportion of memory available to the process at allocation time. This means an oom_score_adj value of 300, for example, will bias a process as though it was using an extra 30.0% of available memory and a value of -350 will discount 35.0% of available memory from its usage. The oom killer badness heuristic also uses this scale to report the oom score for each eligible process in determining the "best" process to kill. Thus, it can only differentiate each process's memory usage by 0.1% of system RAM. On large systems, this can end up being a large amount of memory: 256MB on 256GB systems, for example. This can be fixed by having the badness heuristic to use the actual memory usage in scoring threads and then normalizing it to the oom_score_adj scale for userspace. This results in better comparison between eligible threads for kill and no change from the userspace perspective. Suggested-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: NDave Jones <davej@redhat.com> Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Remove vmtruncate_range(), and remove the truncate_range method from struct inode_operations: only tmpfs ever supported it, and tmpfs has now converted over to using the fallocate method of file_operations. Update Documentation accordingly, adding (setlease and) fallocate lines. And while we're in mm.h, remove duplicate declarations of shmem_lock() and shmem_file_setup(): everyone is now using the ones in shmem_fs.h. Based-on-patch-by: NCong Wang <amwang@redhat.com> Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Cong Wang <amwang@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the backing RAM has to be below 4GB. Not a problem while the boards supported only 4GB: but now Intel's D2700MUD boards support 8GB, and their GMA3600 is managed by the GMA500 driver. shmem/tmpfs has never pretended to support hardware restrictions on the backing memory, but it might have appeared to do so before v3.1, and even now it works fine until a page is swapped out then back in. When read_cache_page_gfp() supplied a freshly allocated page for copy, that compensated for whatever choice might have been made by earlier swapin readahead; but swapoff was likely to destroy the illusion. We'd like to continue to support GMA500, so now add a new shmem_should_replace_page() check on the zone when about to move a page from swapcache to filecache (in swapin and swapoff cases), with shmem_replace_page() to allocate and substitute a suitable page (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). This does involve a minor extension to mem_cgroup_replace_page_cache() (the page may or may not have already been charged); and I've removed a comment and call to mem_cgroup_uncharge_cache_page(), which in fact is always a no-op while PageSwapCache. Also removed optimization of an unlikely path in shmem_getpage_gfp(), now that we need to check PageSwapCache more carefully (a racing caller might already have made the copy). And at one point shmem_unuse_inode() needs to use the hitherto private page_swapcount(), to guard against racing with inode eviction. It would make sense to extend shmem_should_replace_page(), to cover cpuset and NUMA mempolicy restrictions too, but set that aside for now: needs a cleanup of shmem mempolicy handling, and more testing, and ought to handle swap faults in do_swap_page() as well as shmem. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Stephane Marchesin <marcheu@chromium.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Rob Clark <rob.clark@linaro.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an allocation takes ownership of the block may take too long. The type of the pageblock remains unchanged so the pageblock cannot be used as a migration target during compaction. Fix it by: * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and COMPACT_SYNC) and then converting sync field in struct compact_control to use it. * Adding nr_pageblocks_skipped field to struct compact_control and tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type. If COMPACT_ASYNC_MOVABLE mode compaction ran fully in try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a suitable page for allocation. In this case then check how if there were enough MIGRATE_UNMOVABLE pageblocks to try a second pass in COMPACT_ASYNC_UNMOVABLE mode. * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on finding PageBuddy pages, page_count(page) == 0 or PageLRU pages. If all pages within the MIGRATE_UNMOVABLE pageblock are in one of those three sets change the whole pageblock type to MIGRATE_MOVABLE. My particular test case (on a ARM EXYNOS4 device with 512 MiB, which means 131072 standard 4KiB pages in 'Normal' zone) is to: - allocate 120000 pages for kernel's usage - free every second page (60000 pages) of memory just allocated - allocate and use 60000 pages from user space - free remaining 60000 pages of kernel memory (now we have fragmented memory occupied mostly by user space pages) - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage The results: - with compaction disabled I get 11 successful allocations - with compaction enabled - 14 successful allocations - with this patch I'm able to get all 100 successful allocations NOTE: If we can make kswapd aware of order-0 request during compaction, we can enhance kswapd with changing mode to COMPACT_ASYNC_FULL (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE). Please see the following thread: http://marc.info/?l=linux-mm&m=133552069417068&w=2 [minchan@kernel.org: minor cleanups] Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
alloc_bootmem_section() derives allocation area constraints from the specified sparsemem section. This is a bit specific for a generic memory allocator like bootmem, though, so move it over to sparsemem. As __alloc_bootmem_node_nopanic() already retries failed allocations with relaxed area constraints, the fallback code in sparsemem.c can be removed and the code becomes a bit more compact overall. [akpm@linux-foundation.org: fix build] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: NDavid S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alex Shi 提交于
When transparent_hugepage_enabled() is used outside mm/, such as in arch/x86/xx/tlb.c: + if (!cpu_has_invlpg || vma->vm_flags & VM_HUGETLB + || transparent_hugepage_enabled(vma)) { + flush_tlb_mm(vma->vm_mm); is_vma_temporary_stack() isn't referenced in huge_mm.h, so it has compile errors: arch/x86/mm/tlb.c: In function `flush_tlb_range': arch/x86/mm/tlb.c:324:4: error: implicit declaration of function `is_vma_temporary_stack' [-Werror=implicit-function-declaration] Since is_vma_temporay_stack() is just used in rmap.c and huge_memory.c, it is better to move it to huge_mm.h from rmap.h to avoid such errors. Signed-off-by: NAlex Shi <alex.shi@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ulrich Drepper 提交于
Programs using /proc/kpageflags need to know about the various flags. The <linux/kernel-page-flags.h> provides them and the comments in the file indicate that it is supposed to be used by user-level code. But the file is not installed. Install the headers and mark the unstable flags as out-of-bounds. The page-type tool is also adjusted to not duplicate the definitions Signed-off-by: NUlrich Drepper <drepper@gmail.com> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
Even if CONFIG_DEBUG_VM=n gcc genereates code for some VM_BUG_ON() for example VM_BUG_ON(!PageCompound(page) || !PageHead(page)); in do_huge_pmd_wp_page() generates 114 bytes of code. But they mostly disappears when I split this VM_BUG_ON into two: -VM_BUG_ON(!PageCompound(page) || !PageHead(page)); +VM_BUG_ON(!PageCompound(page)); +VM_BUG_ON(!PageHead(page)); weird... but anyway after this patch code disappears completely. add/remove: 0/0 grow/shrink: 7/97 up/down: 135/-1784 (-1649) Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
Sometimes we want to check some expressions correctness at compile time. "(void)(e);" or "if (e);" can be dangerous if the expression has side-effects, and gcc sometimes generates a lot of code, even if the expression has no effect. This patch introduces macro BUILD_BUG_ON_INVALID() for such checks, it forces a compilation error if expression is invalid without any extra code. [Cast to "long" required because sizeof does not work for bit-fields.] Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The rmap walker checking page table references has historically ignored references from VMAs that were not part of the memcg that was being reclaimed during memcg hard limit reclaim. When transitioning global reclaim to memcg hierarchy reclaim, I missed that bit and now references from outside a memcg are ignored even during global reclaim. Reverting back to traditional behaviour - count all references during global reclaim and only mind references of the memcg being reclaimed during limit reclaim would be one option. However, the more generic idea is to ignore references exactly then when they are outside the hierarchy that is currently under reclaim; because only then will their reclamation be of any use to help the pressure situation. It makes no sense to ignore references from a sibling memcg and then evict a page that will be immediately refaulted by that sibling which contributes to the same usage of the common ancestor under reclaim. The solution: make the rmap walker ignore references from VMAs that are not part of the hierarchy that is being reclaimed. Flat limit reclaim will stay the same, hierarchical limit reclaim will mind the references only to pages that the hierarchy owns. Global reclaim, since it reclaims from all memcgs, will be fixed to regard all references. [akpm@linux-foundation.org: name the args in the declaration] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reported-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Konstantin Khlebnikov<khlebnikov@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-