1. 28 Jan 2010 (1 commit)
    • mm: add new 'read_cache_page_gfp()' helper function · 0531b2aa
      Linus Torvalds committed
      It's a simplified 'read_cache_page()' which takes a page allocation
      flag, so that different paths can control how aggressive the memory
      allocations are that populate an address space.
      
      In particular, the Intel GPU object mapping code wants to be able to do
      a certain amount of its own internal memory management by automatically
      shrinking the address space when memory starts getting tight.  This
      allows it to dynamically use different memory allocation policies on a
      per-allocation basis, rather than depend on the (static) address space
      gfp policy.
      
      The actual new function is a one-liner, but re-organizing the helper
      functions to the point where you can do this with a single line of code
      is what most of the patch is all about.
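
      A minimal sketch of what the resulting one-liner could look like, assuming
      the reorganized do_read_cache_page() helper now takes an explicit gfp_t and
      that the mapping's ->readpage op is usable as the filler (internal names
      here are illustrative, not authoritative):

      	struct page *read_cache_page_gfp(struct address_space *mapping,
      					 pgoff_t index, gfp_t gfp)
      	{
      		/* reuse the address space's own readpage op as the filler */
      		filler_t *filler = (filler_t *)mapping->a_ops->readpage;

      		/* only caller-visible difference: the explicit gfp mask */
      		return do_read_cache_page(mapping, index, filler, NULL, gfp);
      	}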
      Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0531b2aa
  2. 21 Jan 2010 (1 commit)
    • vmalloc: remove BUG_ON due to racy counting of VM_LAZY_FREE · 88f50044
      Yongseok Koh committed
      In free_unmap_area_noflush(), va->flags is marked as VM_LAZY_FREE first, and
      then vmap_lazy_nr is increased atomically.
      
      But in __purge_vmap_area_lazy(), while traversing vmap_area_list, nr is
      counted by checking whether VM_LAZY_FREE is set in va->flags.  After
      counting nr, the kernel reads vmap_lazy_nr atomically and hits a BUG_ON
      if nr is greater than vmap_lazy_nr, a check meant to keep vmap_lazy_nr
      from going negative.
      
      The problem is that, if we are interrupted right after marking
      VM_LAZY_FREE, the increment of vmap_lazy_nr can be delayed.  Consequently,
      the BUG_ON condition can be met because nr is counted higher than
      vmap_lazy_nr.
      
      This is highly probable when vmalloc/vfree are called frequently.  The
      scenario has been verified by adding a delay between marking VM_LAZY_FREE
      and increasing vmap_lazy_nr in free_unmap_area_noflush().
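
      The window can also be reproduced outside the kernel with a small
      user-space model (purely illustrative, not kernel code): one thread marks
      a per-area flag and only later bumps the global counter, while another
      thread counts the flags and compares them against the counter, just like
      the old BUG_ON(nr > vmap_lazy_nr) check:

      	/* build with: gcc -O2 -pthread lazy_race.c */
      	#include <pthread.h>
      	#include <stdatomic.h>
      	#include <stdio.h>
      	#include <unistd.h>

      	#define NAREAS 64

      	static atomic_int lazy_flag[NAREAS];	/* models va->flags & VM_LAZY_FREE */
      	static atomic_int vmap_lazy_nr;		/* models the global counter */

      	static void *freer(void *arg)
      	{
      		for (int i = 0; i < NAREAS; i++) {
      			atomic_store(&lazy_flag[i], 1);	/* mark VM_LAZY_FREE ...  */
      			usleep(1000);			/* ... get preempted here */
      			atomic_fetch_add(&vmap_lazy_nr, 1);
      		}
      		return NULL;
      	}

      	static void *purger(void *arg)
      	{
      		for (;;) {
      			int nr = 0;

      			for (int i = 0; i < NAREAS; i++)
      				nr += atomic_load(&lazy_flag[i]);

      			int lazy = atomic_load(&vmap_lazy_nr);
      			if (nr > lazy) {	/* the removed BUG_ON condition */
      				printf("BUG_ON would fire: nr=%d vmap_lazy_nr=%d\n",
      				       nr, lazy);
      				return NULL;
      			}
      			if (lazy == NAREAS)
      				return NULL;	/* everything accounted for */
      		}
      	}

      	int main(void)
      	{
      		pthread_t a, b;

      		pthread_create(&a, NULL, freer, NULL);
      		pthread_create(&b, NULL, purger, NULL);
      		pthread_join(a, NULL);
      		pthread_join(b, NULL);
      		return 0;
      	}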
      
      Even though vmap_lazy_nr is there to check a high watermark, it was never
      a strict watermark.  And although the BUG_ON condition is meant to prevent
      vmap_lazy_nr from going negative, vmap_lazy_nr is a signed variable, so it
      can go down to a negative value temporarily.
      
      Consequently, removing the BUG_ON condition is proper.
      
      A possible BUG_ON message looks like the one below.
      
         kernel BUG at mm/vmalloc.c:517!
         invalid opcode: 0000 [#1] SMP
         EIP: 0060:[<c04824a4>] EFLAGS: 00010297 CPU: 3
         EIP is at __purge_vmap_area_lazy+0x144/0x150
         EAX: ee8a8818 EBX: c08e77d4 ECX: e7c7ae40 EDX: c08e77ec
         ESI: 000081fe EDI: e7c7ae60 EBP: e7c7ae64 ESP: e7c7ae3c
         DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
         Call Trace:
         [<c0482ad9>] free_unmap_vmap_area_noflush+0x69/0x70
         [<c0482b02>] remove_vm_area+0x22/0x70
         [<c0482c15>] __vunmap+0x45/0xe0
         [<c04831ec>] vmalloc+0x2c/0x30
         Code: 8d 59 e0 eb 04 66 90 89 cb 89 d0 e8 87 fe ff ff 8b 43 20 89 da 8d 48 e0 8d 43 20 3b 04 24 75 e7 fe 05 a8 a5 a3 c0 e9 78 ff ff ff <0f> 0b eb fe 90 8d b4 26 00 00 00 00 56 89 c6 b8 ac a5 a3 c0 31
         EIP: [<c04824a4>] __purge_vmap_area_lazy+0x144/0x150 SS:ESP 0068:e7c7ae3c
      
      [ See also http://marc.info/?l=linux-kernel&m=126335856228090&w=2 ]
      Signed-off-by: Yongseok Koh <yongseok.koh@samsung.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88f50044
  3. 17 Jan 2010 (8 commits)
    • page allocator: update NR_FREE_PAGES only when necessary · 6ccf80eb
      KOSAKI Motohiro committed
      Commit f2260e6b (page allocator: update NR_FREE_PAGES only as necessary)
      introduced one minor regression: if __rmqueue() failed, the NR_FREE_PAGES
      stat went wrong.  This patch fixes it.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reported-by: Huang Shijie <shijie8@gmail.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ccf80eb
    • nommu: fix shared mmap after truncate shrinkage problems · 7e660872
      David Howells committed
      Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen
      over the end of a truncation.  The problem is that
      ramfs_nommu_check_mappings() checks the reduced file size against the
      VMA tree, but not the vm_region tree.
      
      The following sequence of events can cause the problem:
      
      	fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600);
      	ftruncate(fd, 32 * 1024);
      	a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      	b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      	munmap(a, 32 * 1024);
      	ftruncate(fd, 16 * 1024);
      	c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      
      Mapping 'a' creates a vm_region covering 32KB of the file.  Mapping 'b'
      sees that the vm_region from 'a' is covering the region it wants and so
      shares it, pinning it in memory.
      
      Mapping 'a' then goes away and the file is truncated to the end of VMA
      'b'.  However, the region allocated by 'a' is still in effect, and has
      _not_ been reduced.
      
      Mapping 'c' is then created, and because there's a vm_region covering the
      desired region, get_unmapped_area() is _not_ called to repeat the check,
      and the mapping is granted, even though the pages from the latter half of
      the mapping have been discarded.
      
      However:
      
      	d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      
      Mapping 'd' should work, and should end up sharing the region allocated by
      'a'.
      
      To deal with this, we shrink the vm_region struct during the truncation,
      lest do_mmap_pgoff() take it as licence to share the full region
      automatically without calling the get_unmapped_area() file op again.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e660872
    • nommu: don't need get_unmapped_area() for NOMMU · efc1a3b1
      David Howells committed
      get_unmapped_area() is unnecessary for NOMMU as no-one calls it.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efc1a3b1
    • nommu: remove a superfluous check of vm_region::vm_usage · 779c1023
      David Howells committed
      In split_vma(), there's no need to check if the VMA being split has a
      region that's in use by more than one VMA because:
      
       (1) The preceding test prohibits splitting of non-anonymous VMAs and regions
           (eg: file or chardev backed VMAs).
      
       (2) Anonymous regions can't be mapped multiple times because there's no handle
           by which to refer to the already existing region.
      
       (3) If a VMA has previously been split, then the region backing it has also
           been split into two regions, each of usage 1.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      779c1023
    • nommu: struct vm_region's vm_usage count need not be atomic · 1e2ae599
      David Howells committed
      The vm_usage count field in struct vm_region does not need to be atomic as
      it's only ever modified whilst nommu_region_sem is write locked.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e2ae599
    • memcg: ensure list is empty at rmdir · fce66477
      Daisuke Nishimura committed
      Current mem_cgroup_force_empty() only ensures mem->res.usage == 0 on
      success.  But this doesn't guarantee memcg's LRU is really empty, because
      there are some cases in which !PageCgroupUsed pages exist on memcg's LRU.
      
      For example:
      - Pages can be uncharged by their owner process while they are on the LRU.
      - A race between mem_cgroup_add_lru_list() and __mem_cgroup_uncharge_common().
      
      So there can be a case in which the usage is zero but some of the LRUs are not empty.
      
      OTOH, mem_cgroup_del_lru_list(), which can be called asynchronously with
      rmdir, accesses the mem_cgroup, so this access can cause a problem if it
      races with rmdir because the mem_cgroup might have been freed by rmdir.
      
      Actually, I saw a bug which seems to be caused by this race.
      
      	[1530745.949906] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230
      	[1530745.950651] IP: [<ffffffff810fbc11>] mem_cgroup_del_lru_list+0x30/0x80
      	[1530745.950651] PGD 3863de067 PUD 3862c7067 PMD 0
      	[1530745.950651] Oops: 0002 [#1] SMP
      	[1530745.950651] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index1/shared_cpu_map
      	[1530745.950651] CPU 3
      	[1530745.950651] Modules linked in: configs ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp nfsd nfs_acl auth_rpcgss exportfs autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc battery ac lp kvm_intel kvm sg ide_cd_mod cdrom serio_raw tpm_tis tpm tpm_bios acpi_memhotplug button parport_pc parport rtc_cmos rtc_core rtc_lib e1000 i2c_i801 i2c_core pcspkr dm_region_hash dm_log dm_mod ata_piix libata shpchp megaraid_mbox sd_mod scsi_mod megaraid_mm ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
      	[1530745.950651] Pid: 19653, comm: shmem_test_02 Tainted: G   M       2.6.32-mm1-00701-g2b04386 #3 Express5800/140Rd-4 [N8100-1065]
      	[1530745.950651] RIP: 0010:[<ffffffff810fbc11>]  [<ffffffff810fbc11>] mem_cgroup_del_lru_list+0x30/0x80
      	[1530745.950651] RSP: 0018:ffff8803863ddcb8  EFLAGS: 00010002
      	[1530745.950651] RAX: 00000000000001e0 RBX: ffff8803abc02238 RCX: 00000000000001e0
      	[1530745.950651] RDX: 0000000000000000 RSI: ffff88038611a000 RDI: ffff8803abc02238
      	[1530745.950651] RBP: ffff8803863ddcc8 R08: 0000000000000002 R09: ffff8803a04c8643
      	[1530745.950651] R10: 0000000000000000 R11: ffffffff810c7333 R12: 0000000000000000
      	[1530745.950651] R13: ffff880000017f00 R14: 0000000000000092 R15: ffff8800179d0310
      	[1530745.950651] FS:  0000000000000000(0000) GS:ffff880017800000(0000) knlGS:0000000000000000
      	[1530745.950651] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      	[1530745.950651] CR2: 0000000000000230 CR3: 0000000379d87000 CR4: 00000000000006e0
      	[1530745.950651] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	[1530745.950651] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      	[1530745.950651] Process shmem_test_02 (pid: 19653, threadinfo ffff8803863dc000, task ffff88038612a8a0)
      	[1530745.950651] Stack:
      	[1530745.950651]  ffffea00040c2fe8 0000000000000000 ffff8803863ddd98 ffffffff810c739a
      	[1530745.950651] <0> 00000000863ddd18 000000000000000c 0000000000000000 0000000000000000
      	[1530745.950651] <0> 0000000000000002 0000000000000000 ffff8803863ddd68 0000000000000046
      	[1530745.950651] Call Trace:
      	[1530745.950651]  [<ffffffff810c739a>] release_pages+0x142/0x1e7
      	[1530745.950651]  [<ffffffff810c778f>] ? pagevec_move_tail+0x6e/0x112
      	[1530745.950651]  [<ffffffff810c781e>] pagevec_move_tail+0xfd/0x112
      	[1530745.950651]  [<ffffffff810c78a9>] lru_add_drain+0x76/0x94
      	[1530745.950651]  [<ffffffff810dba0c>] exit_mmap+0x6e/0x145
      	[1530745.950651]  [<ffffffff8103f52d>] mmput+0x5e/0xcf
      	[1530745.950651]  [<ffffffff81043ea8>] exit_mm+0x11c/0x129
      	[1530745.950651]  [<ffffffff8108fb29>] ? audit_free+0x196/0x1c9
      	[1530745.950651]  [<ffffffff81045353>] do_exit+0x1f5/0x6b7
      	[1530745.950651]  [<ffffffff8106133f>] ? up_read+0x2b/0x2f
      	[1530745.950651]  [<ffffffff8137d187>] ? lockdep_sys_exit_thunk+0x35/0x67
      	[1530745.950651]  [<ffffffff81045898>] do_group_exit+0x83/0xb0
      	[1530745.950651]  [<ffffffff810458dc>] sys_exit_group+0x17/0x1b
      	[1530745.950651]  [<ffffffff81002c1b>] system_call_fastpath+0x16/0x1b
      	[1530745.950651] Code: 54 53 0f 1f 44 00 00 83 3d cc 29 7c 00 00 41 89 f4 75 63 eb 4e 48 83 7b 08 00 75 04 0f 0b eb fe 48 89 df e8 18 f3 ff ff 44 89 e2 <48> ff 4c d0 50 48 8b 05 2b 2d 7c 00 48 39 43 08 74 39 48 8b 4b
      	[1530745.950651] RIP  [<ffffffff810fbc11>] mem_cgroup_del_lru_list+0x30/0x80
      	[1530745.950651]  RSP <ffff8803863ddcb8>
      	[1530745.950651] CR2: 0000000000000230
      	[1530745.950651] ---[ end trace c3419c1bb8acc34f ]---
      	[1530745.950651] Fixing recursive fault but reboot is needed!
      
      The problem here is that pages on the LRU may contain a pointer to a stale
      memcg.  To make res->usage become 0, all pages on the memcg must be
      uncharged or moved to another (parent) memcg.  A moved page_cgroup has
      already been removed from the original LRU, but an uncharged page_cgroup
      still contains a pointer to the memcg without the PCG_USED bit.  (This
      asynchronous LRU handling is done for performance.)  If the PCG_USED bit
      is not set, the page_cgroup will never be added to the memcg's LRU, so
      pages that are not on an LRU never access the stale pointer.  What we have
      to take care of, then, are page_cgroups that are _on_ an LRU list.  This
      patch fixes the problem by making mem_cgroup_force_empty() visit all LRUs
      before exiting its loop, guaranteeing there are no pages left on its LRUs.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fce66477
    • vmscan: kswapd: don't retry balance_pgdat() if all zones are unreclaimable · de3fab39
      KOSAKI Motohiro committed
      Commit f50de2d3 (vmscan: have kswapd sleep for a short interval and double
      check it should be asleep) can cause kswapd to enter an infinite loop if
      running on a single-CPU system.  If all zones are unreclaimable,
      sleeping_prematurely() returns 1 and kswapd calls balance_pgdat() again,
      but that is completely pointless: balance_pgdat() can't do anything about
      an unreclaimable zone!
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reported-by: Will Newton <will.newton@gmail.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Tested-by: Will Newton <will.newton@gmail.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de3fab39
    • mm/page_alloc: fix the range check for backward merging · d2dbe08d
      Kazuhisa Ichikawa committed
      The current check for 'backward merging' within add_active_range() does
      not seem correct.  start_pfn must be compared against
      early_node_map[i].start_pfn (and NOT against .end_pfn) to find out whether
      the new region is backward-mergeable with the existing range.
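
      A standalone model of the corrected test (names are illustrative, not the
      kernel's): comparing the new range's start_pfn against the existing
      entry's start_pfn, rather than against its end_pfn, is what separates a
      genuine backward merge from a range that merely begins inside the entry:

      	#include <stdbool.h>
      	#include <stdio.h>

      	struct range { unsigned long start_pfn, end_pfn; };

      	static bool merge_backward(struct range *e, unsigned long start_pfn,
      				   unsigned long end_pfn)
      	{
      		/* corrected check: compare against e->start_pfn, not e->end_pfn */
      		if (start_pfn < e->start_pfn && end_pfn >= e->start_pfn) {
      			e->start_pfn = start_pfn;
      			return true;
      		}
      		return false;
      	}

      	int main(void)
      	{
      		struct range e = { .start_pfn = 0x100, .end_pfn = 0x200 };

      		/* begins inside the entry: must NOT merge backwards (prints 0) */
      		printf("inside: %d\n", merge_backward(&e, 0x180, 0x280));
      		/* genuinely starts below the entry: merges and lowers start_pfn */
      		printf("before: %d\n", merge_backward(&e, 0x080, 0x180));
      		printf("entry now [%#lx, %#lx)\n", e.start_pfn, e.end_pfn);
      		return 0;
      	}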
      Signed-off-by: Kazuhisa Ichikawa <ki@epsilou.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2dbe08d
  4. 14 Jan 2010 (1 commit)
  5. 12 Jan 2010 (2 commits)
  6. 08 Jan 2010 (1 commit)
  7. 07 Jan 2010 (2 commits)
    • NOMMU: Use copy_*_user_page() in access_process_vm() · 7959722b
      Jie Zhang committed
      The MMU code uses the copy_*_user_page() variants in access_process_vm()
      rather than copy_*_user() as the former includes an icache flush.  This
      is important when doing things like setting software breakpoints with
      gdb.  So switch the NOMMU code over to do the same.
      
      This patch makes the reasonable assumption that copy_from_user_page()
      won't fail - which is probably fine, as we've checked the VMA from which
      we're copying is usable, and the copy is not allowed to cross VMAs.  The
      one case where it might go wrong is if the VMA is a device rather than
      RAM and that device returns an error - in which case rubbish will be
      returned rather than EIO.
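
      A hedged sketch of the switch described above, using the usual
      copy_{to,from}_user_page(vma, page, vaddr, dst, src, len) argument order;
      the NULL page argument and the exact vm_flags checks are assumptions:

      	if (write && vma->vm_flags & VM_MAYWRITE)
      		copy_to_user_page(vma, NULL, addr,
      				  (void *)addr, buf, len);	/* flushes icache */
      	else if (!write && vma->vm_flags & VM_MAYREAD)
      		copy_from_user_page(vma, NULL, addr,
      				    buf, (void *)addr, len);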
      Signed-off-by: Jie Zhang <jie.zhang@analog.com>
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: David McCullough <david_mccullough@mcafee.com>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
      Acked-by: Greg Ungerer <gerg@uclinux.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7959722b
    • NOMMU: Avoiding duplicate icache flushes of shared maps · cfe79c00
      Mike Frysinger committed
      When working with FDPIC, there are many shared mappings of read-only
      code regions between applications (the C library, applet packages like
      busybox, etc.), but the current do_mmap_pgoff() function will issue an
      icache flush whenever a VMA is added to an MM instead of only doing it
      when the map is initially created.
      
      The flush can instead be done when a region is first mmapped PROT_EXEC.
      Note that we may not rely on the first mapping of a region being
      executable - it's possible for it to be PROT_READ only, so we have to
      remember whether we've flushed the region or not, and then flush the
      entire region when a bit of it is made executable.
      
      However, this also affects the brk area, which will no longer be
      executable.  We can mprotect() it to PROT_EXEC on MPU-mode kernels, but
      for NOMMU-mode kernels, having sys_brk() flush the extra from the icache
      when it increases the brk allocation should suffice.  The brk area
      probably isn't used by NOMMU programs anyway, since it can only use up the
      leavings from the stack allocation, where the stack allocation is larger
      than requested.
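
      A hedged sketch of the mmap side of this idea (the per-region
      vm_icache_flushed flag name and its placement are assumptions):

      	/* flush the whole region the first time any part of it becomes
      	 * executable; later executable mappings of the same region can
      	 * then skip the flush entirely */
      	if (vma->vm_flags & VM_EXEC && !region->vm_icache_flushed) {
      		flush_icache_range(region->vm_start, region->vm_end);
      		region->vm_icache_flushed = true;
      	}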
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cfe79c00
  8. 31 Dec 2009 (1 commit)
  9. 29 Dec 2009 (1 commit)
  10. 24 Dec 2009 (1 commit)
  11. 22 Dec 2009 (1 commit)
  12. 18 Dec 2009 (2 commits)
    • mm: Add notifier in pageblock isolation for balloon drivers · 925cc71e
      Robert Jennings committed
      Memory balloon drivers can allocate a large amount of memory which is not
      movable but could be freed to accommodate memory hotplug remove.
      
      Prior to calling the memory hotplug notifier chain the memory in the
      pageblock is isolated.  Currently, if the migrate type is not
      MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal
      for that page range to fail.
      
      Rather than failing pageblock isolation if the migratetype is not
      MIGRATE_MOVABLE, this patch checks whether all of the pages in the pageblock
      that are not on the LRU are owned by a registered balloon driver (or other
      entity) using a notifier chain.  If all of the non-movable pages are owned
      by a balloon, they can be freed later through the memory notifier chain
      and the range can still be isolated in set_migratetype_isolate().
      Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Brian King <brking@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Gerald Schaefer <geralds@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      925cc71e
    • readahead: add blk_run_backing_dev · 65a80b4c
      Hisashi Hifumi committed
      I added blk_run_backing_dev() in page_cache_async_readahead() so that
      readahead I/O is unplugged, to improve throughput especially in RAID
      environments.
      
      The normal case is: if page N becomes uptodate at time T(N), then T(N) <=
      T(N+1) holds.  With RAID (and NFS to some degree), there is no strict
      ordering: the data arrival time depends on the runtime status of individual
      disks, which breaks that formula.  So in do_generic_file_read(), just
      after submitting the async readahead IO request, the current page may well
      be uptodate, so the page won't be locked, and the block device won't be
      implicitly unplugged:
      
                     if (PageReadahead(page))
                              page_cache_async_readahead()
                      if (!PageUptodate(page))
                                      goto page_not_up_to_date;
                      //...
      page_not_up_to_date:
                      lock_page_killable(page);
      
      Therefore explicit unplugging can help.
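
      A hedged sketch of where the explicit kick could go, right after the
      readahead has been queued (the placement inside
      page_cache_async_readahead() and the CONFIG_BLOCK guard are illustrative):

      	#ifdef CONFIG_BLOCK
      	/* unplug the backing device so the just-queued readahead I/O is
      	 * dispatched even if we never end up locking the current page */
      	blk_run_backing_dev(mapping->backing_dev_info, NULL);
      	#endif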
      
      Following is the test result with dd.
      
      #dd if=testdir/testfile of=/dev/null bs=16384
      
      -2.6.30-rc6
      1048576+0 records in
      1048576+0 records out
      17179869184 bytes (17 GB) copied, 224.182 seconds, 76.6 MB/s
      
      -2.6.30-rc6-patched
      1048576+0 records in
      1048576+0 records out
      17179869184 bytes (17 GB) copied, 206.465 seconds, 83.2 MB/s
      
      (7Disks RAID-0 Array)
      
      -2.6.30-rc6
      1054976+0 records in
      1054976+0 records out
      17284726784 bytes (17 GB) copied, 212.233 seconds, 81.4 MB/s
      
      -2.6.30-rc6-patched
      1054976+0 records out
      17284726784 bytes (17 GB) copied, 198.878 seconds, 86.9 MB/s
      
      (7Disks RAID-5 Array)
      
      The patch was found to improve performance with the SCST scsi target
      driver.  See
      http://sourceforge.net/mailarchive/forum.php?thread_name=a0272b440906030714g67eabc5k8f847fb1e538cc62%40mail.gmail.com&forum_name=scst-devel
      
      [akpm@linux-foundation.org: unbust comment layout]
      [akpm@linux-foundation.org: "fix" CONFIG_BLOCK=n]
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: Ronald <intercommit@gmail.com>
      Cc: Bart Van Assche <bart.vanassche@gmail.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65a80b4c
  13. 17 Dec 2009 (10 commits)
    • cpumask: avoid deprecated function in mm/slab.c · 58463c1f
      Rusty Russell committed
      These days we use cpumask_empty() which takes a pointer.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      58463c1f
    • Fix breakage in shmem.c · 718deb6b
      Al Viro committed
      Replacing
      	error = 0;
      	if (error)
      		op
      with nothing is not quite an equivalent transformation ;-)
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      718deb6b
    • x86: Fix checking of SRAT when node 0 ram is not from 0 · 32996250
      Yinghai Lu committed
      Found one system that boots from socket1 instead of socket0, and its SRAT gets rejected...
      
      [    0.000000] SRAT: Node 1 PXM 0 0-a0000
      [    0.000000] SRAT: Node 1 PXM 0 100000-80000000
      [    0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
      [    0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
      [    0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
      [    0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
      [    0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
      [    0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
      [    0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
      [    0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
      ...
      [    0.000000] NUMA: Allocated memnodemap from 500000 - 701040
      [    0.000000] NUMA: Using 20 for the hash shift.
      [    0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
      [    0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
      [    0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
      [    0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
      [    0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
      [    0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
      [    0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
      [    0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
      [    0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
      [    0.000000] SRAT: SRAT not used.
      
      The early_node_map is not sorted, because node0, with a non-zero start, comes first.
      
      So sort it right away after all regions are registered.
      
      This also fixes a regression introduced by commit 8716273c (x86: Export srat physical topology).
      
      -v2: make it more solid to handle cross node case like node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
      -v3: update comments.
      Reported-and-tested-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B2579D2.3010201@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      32996250
    • NOMMU: Optimise away the {dac_,}mmap_min_addr tests · 6e141546
      David Howells committed
      In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be
      skipped by the compiler.  We do this as the minimum mmap address doesn't make
      any sense in NOMMU mode.
      
      mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode.
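
      A hedged sketch of the clamp: a compile-time constant zero lets the
      compiler drop the comparisons entirely (the exact header layout is an
      assumption):

      	#ifdef CONFIG_MMU
      	extern unsigned long dac_mmap_min_addr;
      	#else
      	/* no MMU: there is no minimum mapping address to enforce */
      	#define dac_mmap_min_addr	0UL
      	#endif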
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: James Morris <jmorris@namei.org>
      6e141546
    • direct I/O fallback sync simplification · c05c4edd
      Christoph Hellwig committed
      In the case of direct I/O falling back to buffered I/O we sync data
      twice currently: once at the end of generic_file_buffered_write using
      filemap_write_and_wait_range and once a little later in
      __generic_file_aio_write using do_sync_mapping_range with all flags set.
      
      The wait before write of the do_sync_mapping_range call does not make
      any sense, so just keep the filemap_write_and_wait_range call and move
      it to the right spot.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c05c4edd
    • make generic_acl slightly more generic · 1c7c474c
      Christoph Hellwig committed
      Now that we cache the ACL pointers in the generic inode all the generic_acl
      cruft can go away and generic_acl.c can directly implement xattr handlers
      dealing with the full Posix ACL semantics for in-memory filesystems.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1c7c474c
    • sanitize xattr handler prototypes · 431547b3
      Christoph Hellwig committed
      Add a flags argument to struct xattr_handler and pass it to all xattr
      handler methods.  This allows using the same methods for multiple
      handlers, e.g. for the ACL methods which perform exactly the same action
      for the access and default ACLs, just using a different underlying
      attribute.  With a little more groundwork it'll also allow sharing the
      methods for the regular user/trusted/secure handlers in extN, ocfs2 and
      jffs2 like it's already done for xfs in this patch.
      
      Also change the inode argument to the handlers to a dentry to allow
      using the handler mechanism for filesystems that require it later,
      e.g. cifs.
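
      The reshaped handler interface could look roughly like this (the field
      order and the list() method details are a sketch, not the authoritative
      definition):

      	struct xattr_handler {
      		const char *prefix;
      		int flags;	/* fs-private, e.g. ACL_TYPE_ACCESS vs ACL_TYPE_DEFAULT */
      		size_t (*list)(struct dentry *dentry, char *list, size_t list_size,
      			       const char *name, size_t name_len, int handler_flags);
      		int (*get)(struct dentry *dentry, const char *name,
      			   void *buffer, size_t size, int handler_flags);
      		int (*set)(struct dentry *dentry, const char *name,
      			   const void *buffer, size_t size, int flags,
      			   int handler_flags);
      	};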
      
      [with GFS2 bits updated by Steven Whitehouse <swhiteho@redhat.com>]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jmorris@namei.org>
      Acked-by: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      431547b3
    • Untangling ima mess, part 1: alloc_file() · 0552f879
      Al Viro committed
      There are 2 groups of alloc_file() callers:
      	* ones that are followed by ima_counts_get
      	* ones giving non-regular files
      So let's pull that ima_counts_get() into alloc_file();
      it's a no-op in case of non-regular files.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      0552f879
    • switch alloc_file() to passing struct path · 2c48b9c4
      Al Viro committed
      ... and have the caller grab both mnt and dentry; kill
      leak in infiniband, while we are at it.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2c48b9c4
    • switch shmem_file_setup() to alloc_file() · 4b42af81
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      4b42af81
  14. 16 Dec 2009 (8 commits)
    • memcg: code clean, remove unused variable in mem_cgroup_resize_limit() · aa20d489
      Bob Liu committed
      Variable `progress' isn't used in mem_cgroup_resize_limit() any more.
      Remove it.
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa20d489
    • memcg: remove memcg_tasklist · 9ab322ca
      Daisuke Nishimura committed
      memcg_tasklist was introduced by commit 7f4d454d (memcg: avoid deadlock
      caused by race between oom and cpuset_attach) as a replacement for
      cgroup_mutex, to fix a deadlock problem.  The cgroup_mutex in
      mem_cgroup_out_of_memory(), which that commit removed, was originally
      introduced by commit c7ba5c9e (Memory controller: OOM handling).
      
      IIUC, the intention of this cgroup_mutex was to prevent tasks from moving
      during select_bad_process() so that situations like the ones below can be avoided.
      
        Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
        1. Process A, which has been in cgroup "baa" and uses large memory, is just
           moved to cgroup "foo". Process A can then become a candidate for being killed.
        2. Process B, which has been in cgroup "foo" and uses large memory, is just
           moved from cgroup "foo". Process B can be excluded from the candidates for
           being killed.
      
      But these race windows exist anyway, even if we hold a lock, because
      __mem_cgroup_try_charge() decides whether it should trigger oom or not
      outside of the lock.  So the original cgroup_mutex in
      mem_cgroup_out_of_memory(), and thus the current memcg_tasklist, has no
      use.  And IMHO, those races are not so critical for users.
      
      This patch removes it and makes the code simpler.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ab322ca
    • memcg: avoid oom-killing innocent task in case of use_hierarchy · d31f56db
      Daisuke Nishimura committed
      task_in_mem_cgroup(), which is called by select_bad_process() to check
      whether a task can be a candidate for being oom-killed from memcg's limit,
      checks "curr->use_hierarchy"("curr" is the mem_cgroup the task belongs
      to).
      
      But this check returns true (a false positive) when:
      
      	<some path>/aa		use_hierarchy == 0	<- hitting limit
      	  <some path>/aa/00	use_hierarchy == 1	<- the task belongs to
      
      This leads to killing an innocent task in aa/00.  This patch fixes that
      bug.  It also fixes the argument to mem_cgroup_print_oom_info(): we should
      print information about the mem_cgroup which the task being killed, not
      current, belongs to.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d31f56db
    • memcg: cleanup mem_cgroup_move_parent() · 57f9fd7d
      Daisuke Nishimura committed
      mem_cgroup_move_parent() calls try_charge first and cancel_charge on
      failure.  IMHO, charge/uncharge (especially charge) is a high-cost
      operation, so we should avoid it as far as possible.
      
      This patch tries to delay try_charge in mem_cgroup_move_parent() by
      re-ordering the checks it does.
      
      And this patch renames mem_cgroup_move_account() to
      __mem_cgroup_move_account(), changes the return value of
      __mem_cgroup_move_account() from int to void, and adds a new
      wrapper(mem_cgroup_move_account()), which checks whether a @pc is valid
      for moving account and calls __mem_cgroup_move_account().
      
      This patch removes the last caller of trylock_page_cgroup(), so removes
      its definition too.
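
      A hedged sketch of the wrapper shape described above (the exact validity
      checks are assumptions):

      	/* check that @pc is still a charged page_cgroup of @from, then hand
      	 * off to the helper that assumes the checks have been done */
      	static int mem_cgroup_move_account(struct page_cgroup *pc,
      					   struct mem_cgroup *from,
      					   struct mem_cgroup *to)
      	{
      		int ret = -EINVAL;

      		lock_page_cgroup(pc);
      		if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
      			__mem_cgroup_move_account(pc, from, to);
      			ret = 0;
      		}
      		unlock_page_cgroup(pc);
      		return ret;
      	}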
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57f9fd7d
    • memcg: add mem_cgroup_cancel_charge() · a3032a2c
      Daisuke Nishimura committed
      There are some places that call both res_counter_uncharge() and css_put()
      to cancel the charge and drop the refcount we took via
      mem_cgroup_try_charge().

      This patch introduces mem_cgroup_cancel_charge() and calls it in those
      places.
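
      A hedged, simplified sketch of the new helper (swap accounting details and
      the root-cgroup special case are glossed over):

      	/* give back both the res_counter charge and the css reference taken
      	 * by a successful try_charge that we no longer want to keep */
      	static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
      	{
      		res_counter_uncharge(&mem->res, PAGE_SIZE);
      		if (do_swap_account)
      			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
      		css_put(&mem->css);
      	}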
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3032a2c
    • memcg: make memcg's file mapped consistent with global VM · d8046582
      KAMEZAWA Hiroyuki committed
      The global VM uses FILE_MAPPED, but memcg uses MAPPED_FILE.  This makes
      grepping difficult.  Replace memcg's MAPPED_FILE with FILE_MAPPED.

      Also, in the global VM, mapped shared memory is accounted into FILE_MAPPED,
      but memcg doesn't do that.  Fix it.
      Note:
        page_is_file_cache() just checks SwapBacked or not,
        so we also need to check PageAnon.
      
      Cc: Balbir Singh <balbir@in.ibm.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8046582
    • memcg: coalesce charging via percpu storage · cdec2e42
      KAMEZAWA Hiroyuki committed
      This is a patch for coalescing access to res_counter at charge time by
      percpu caching.  At charge, memcg charges 64 pages and remembers the
      surplus in a percpu cache.  Because it is a cache, it is drained/flushed
      when necessary.

      This version uses the public percpu area.
      Two benefits of using the public percpu area:
       1. The sum of stocked charges in the system is limited to the number of
          cpus, not to the number of memcgs.  This gives better synchronization.
       2. The drain code for flush/cpu-hotplug is very easy (and quick).
      
      The most important point of this patch is that we never touch the
      res_counter in the fast path.  The res_counter is a system-wide shared
      counter which is modified very frequently, so we should avoid touching it
      as much as we can to avoid false sharing.
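
      A hedged sketch of the per-cpu stock this describes (field and function
      names follow the changelog's wording but are assumptions):

      	struct memcg_stock_pcp {
      		struct mem_cgroup *cached;	/* memcg the stock belongs to */
      		int charge;			/* pre-charged bytes still unused */
      		struct work_struct work;	/* async drain (cpu hotplug etc.) */
      	};
      	static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);

      	/* fast path: consume locally stocked charge without touching the
      	 * shared res_counter; fall back to the slow path otherwise */
      	static bool consume_stock(struct mem_cgroup *mem)
      	{
      		struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
      		bool ret = true;

      		if (mem == stock->cached && stock->charge)
      			stock->charge -= PAGE_SIZE;
      		else
      			ret = false;
      		put_cpu_var(memcg_stock);
      		return ret;
      	}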
      
      On an x86-64 8-cpu server, I tested the overhead of memcg at page fault by
      running a program which does map/fault/unmap in a loop, running one task
      per cpu via taskset, and summing the number of page faults over 60 seconds.
      
      [without memcg config]
        40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
        27.67 cache-miss/faults
      
      [root cgroup]
        36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
        31.58 cache miss/faults
      
      [in a child cgroup]
        18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
        69.96 cache miss/faults
      
      [ + coalescing uncharge patch]
        27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
        47.16 cache miss/faults
      
      [ + coalescing uncharge patch + this patch ]
        34224709  page-faults              #      0.072 M/sec   ( +-   0.173% )
        34.69 cache miss/faults
      
      Changelog (since Oct/2):
        - updated comments
        - replaced get_cpu_var() with __get_cpu_var() if possible.
        - removed the mutex for system-wide drain; added a counter instead.
        - removed CONFIG_HOTPLUG_CPU
      
      Changelog (old):
        - rebased onto the latest mmotm
        - moved the charge size check before the __GFP_WAIT check to avoid unnecessary
        - added asynchronous flush routine.
        - fixed bugs pointed out by Nishimura-san.
      
      [akpm@linux-foundation.org: tweak comments]
      [nishimura@mxp.nes.nec.co.jp: don't do INIT_WORK() repeatedly against the same work_struct]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdec2e42
    • memcg: coalesce uncharge during unmap/truncate · 569b846d
      KAMEZAWA Hiroyuki committed
      In a massively parallel environment, res_counter can be a performance
      bottleneck.  One strong technique to reduce lock contention is to reduce
      the number of calls by coalescing several calls into one.
      
      Considering charge/uncharge characteristics:
      	- charge is done one by one via demand paging.
      	- uncharge is done
      		- in chunks at munmap, truncate, exit, execve...
      		- one by one via vmscan/paging.
      
      It seems we have a chance to coalesce uncharges for improving scalability
      at unmap/truncation.
      
      This patch is for coalescing uncharges.  To avoid scattering memcg's
      structures into functions under mm/, it adds memcg batch-uncharge
      information to the task.  The reason for per-task batching is to make use
      of the caller's context information.  We do batched (delayed) uncharge
      when truncation/unmap occurs, but do direct uncharge when the uncharge is
      called by memory reclaim (vmscan.c).
      
      The degree of coalescing depends on the caller:
        - at invalidate/truncate... pagevec size
        - at unmap ....ZAP_BLOCK_SIZE
      (Memory itself is freed in chunks of this size.)
      So we will not coalesce too much.
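
      A hedged sketch of how a bulk path could bracket its uncharges with this
      batching (the helper names follow the description above and are
      assumptions):

      	/* e.g. around a truncation or a pagevec release: */
      	mem_cgroup_uncharge_start();	/* current->memcg_batch.do_batch++ */
      	/* ... uncharge page by page; each call only accumulates bytes into
      	 * current->memcg_batch instead of hitting res_counter ... */
      	mem_cgroup_uncharge_end();	/* flush the batch to res_counter once */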
      
      On an x86-64 8-cpu server, I tested the overhead of memcg at page fault by
      running a program which does map/fault/unmap in a loop, running one task
      per cpu via taskset, and summing the number of page faults over 60 seconds.
      
      [without memcg config]
        40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
        27.67 cache-miss/faults
      [root cgroup]
        36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
        31.58 miss/faults
      [in a child cgroup]
        18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
        69.96 miss/faults
      [child with this patch]
        27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
        47.16 miss/faults
      
      We can see some amount of improvement.
      (The root cgroup isn't affected by this patch.)
      Another patch for "charge" will follow this one, and the above will improve further.
      
      Changelog (since 2009/10/02):
       - renamed fields of memcg_batch (pages to bytes, memsw to memsw_bytes)
       - some cleanup and commentary/description updates.
       - added initialization code to copy_process(). (possible bug fix)
      
      Changelog (old):
       - fixed the !CONFIG_MEM_CGROUP case.
       - rebased onto the latest mmotm + softlimit fix patches.
       - unified patch for callers.
       - added comments.
       - made ->do_batch a bool.
       - removed css_get() et al.  We don't need it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      569b846d