1. 25 7月, 2008 40 次提交
    • J
      powerpc: scan device tree for gigantic pages · 658013e9
      Jon Tollefson 提交于
      The 16G huge pages have to be reserved in the HMC prior to boot.  The
      location of the pages are placed in the device tree.  This patch adds code
      to scan the device tree during very early boot and save these page
      locations until hugetlbfs is ready for them.
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Signed-off-by: NJon Tollefson <kniht@linux.vnet.ibm.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      658013e9
    • J
      powerpc: function to allocate gigantic hugepages · ec4b2c0c
      Jon Tollefson 提交于
      The 16G page locations have been saved during early boot in an array.  The
      alloc_bootmem_huge_page() function adds a page from here to the
      huge_boot_pages list.
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Signed-off-by: NJon Tollefson <kniht@linux.vnet.ibm.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec4b2c0c
    • J
      hugetlb: allow arch overridden hugepage allocation · 53ba51d2
      Jon Tollefson 提交于
      Allow alloc_bootmem_huge_page() to be overridden by architectures that
      can't always use bootmem.  This requires huge_boot_pages to be available
      for use by this function.
      
      This is required for powerpc 16G pages, which have to be reserved prior to
      boot-time.  The location of these pages are indicated in the device tree.
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Signed-off-by: NJon Tollefson <kniht@linux.vnet.ibm.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53ba51d2
    • N
      hugetlb: override default huge page size · e11bfbfc
      Nick Piggin 提交于
      Allow configurations with the default huge page size which is different to
      the traditional HPAGE_SIZE size.  The default huge page size is the one
      represented in the legacy /proc ABIs, SHM, and which is defaulted to when
      mounting hugetlbfs filesystems.
      
      This is implemented with a new kernel option default_hugepagesz=, which
      defaults to HPAGE_SIZE if not specified.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e11bfbfc
    • A
      x86: add hugepagesz option on 64-bit · b4718e62
      Andi Kleen 提交于
      Add an hugepagesz=...  option similar to IA64, PPC etc.  to x86-64.
      
      This finally allows to select GB pages for hugetlbfs in x86 now that all
      the infrastructure is in place.
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4718e62
    • A
      39c11e6c
    • A
      hugetlb: introduce pud_huge · ceb86879
      Andi Kleen 提交于
      Straight forward extensions for huge pages located in the PUD instead of
      PMDs.
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ceb86879
    • A
      hugetlb: printk cleanup · 4abd32db
      Andi Kleen 提交于
      - Reword sentence to clarify meaning with multiple options
      - Add support for using GB prefixes for the page size
      - Add extra printk to delayed > MAX_ORDER allocation code
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4abd32db
    • A
      hugetlb: support boot allocate different sizes · 8faa8b07
      Andi Kleen 提交于
      Make some infrastructure changes to allow boot-time allocation of
      different hugepage page sizes.
      
      - move all basic hstate initialisation into hugetlb_add_hstate
      - create a new function hugetlb_hstate_alloc_pages() to do the
        actual initial page allocations. Call this function early in
        order to allocate giant pages from bootmem.
      - Check for multiple hugepages= parameters
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: NAndrew Hastings <abh@cray.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8faa8b07
    • A
      hugetlb: support larger than MAX_ORDER · aa888a74
      Andi Kleen 提交于
      This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
      not practical to enlarge MAX_ORDER to 1GB.
      
      Instead the 1GB pages are only allocated at boot using the bootmem
      allocator using the hugepages=...  option.
      
      These 1G bootmem pages are never freed.  In theory it would be possible to
      implement that with some complications, but since it would be a one-way
      street (>= MAX_ORDER pages cannot be allocated later) I decided not to
      currently.
      
      The >= MAX_ORDER code is not ifdef'ed per architecture.  It is not very
      big and the ifdef uglyness seemed not be worth it.
      
      Known problems: /proc/meminfo and "free" do not display the memory
      allocated for gb pages in "Total".  This is a little confusing for the
      user.
      Acked-by: NAndrew Hastings <abh@cray.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa888a74
    • A
      mm: export prep_compound_page to mm · 01ad1c08
      Andi Kleen 提交于
      hugetlb will need to get compound pages from bootmem to handle the case of
      them being greater than or equal to MAX_ORDER.  Export the constructor
      function needed for this.
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01ad1c08
    • A
      mm: introduce non panic alloc_bootmem · b54bbf7b
      Andi Kleen 提交于
      Straight forward variant of the existing __alloc_bootmem_node, only
      subsequent patch when allocating giant hugepages at boot -- don't want to
      panic if we can't allocate as many as the user asked for.
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b54bbf7b
    • A
      hugetlb: abstract numa round robin selection · 5ced66c9
      Andi Kleen 提交于
      Need this as a separate function for a future patch.
      
      No behaviour change.
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ced66c9
    • N
      hugetlb: new sysfs interface · a3437870
      Nishanth Aravamudan 提交于
      Provide new hugepages user APIs that are more suited to multiple hstates
      in sysfs.  There is a new directory, /sys/kernel/hugepages.  Underneath
      that directory there will be a directory per-supported hugepage size,
      e.g.:
      
      /sys/kernel/hugepages/hugepages-64kB
      /sys/kernel/hugepages/hugepages-16384kB
      /sys/kernel/hugepages/hugepages-16777216kB
      
      corresponding to 64k, 16m and 16g respectively.  Within each
      hugepages-size directory there are a number of files, corresponding to the
      tracked counters in the hstate, e.g.:
      
      /sys/kernel/hugepages/hugepages-64/nr_hugepages
      /sys/kernel/hugepages/hugepages-64/nr_overcommit_hugepages
      /sys/kernel/hugepages/hugepages-64/free_hugepages
      /sys/kernel/hugepages/hugepages-64/resv_hugepages
      /sys/kernel/hugepages/hugepages-64/surplus_hugepages
      
      Of these files, the first two are read-write and the latter three are
      read-only.  The size of the hugepage being manipulated is trivially
      deducible from the enclosing directory and is always expressed in kB (to
      match meminfo).
      
      [dave@linux.vnet.ibm.com: fix build]
      [nacc@us.ibm.com: hugetlb: hang off of /sys/kernel/mm rather than /sys/kernel]
      [nacc@us.ibm.com: hugetlb: remove CONFIG_SYSFS dependency]
      Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a3437870
    • A
      hugetlbfs: per mount huge page sizes · a137e1cc
      Andi Kleen 提交于
      Add the ability to configure the hugetlb hstate used on a per mount basis.
      
      - Add a new pagesize= option to the hugetlbfs mount that allows setting
        the page size
      - This option causes the mount code to find the hstate corresponding to the
        specified size, and sets up a pointer to the hstate in the mount's
        superblock.
      - Change the hstate accessors to use this information rather than the
        global_hstate they were using (requires a slight change in mm/memory.c
        so we don't NULL deref in the error-unmap path -- see comments).
      
      [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a137e1cc
    • A
      hugetlb: multiple hstates for multiple page sizes · e5ff2159
      Andi Kleen 提交于
      Add basic support for more than one hstate in hugetlbfs.  This is the key
      to supporting multiple hugetlbfs page sizes at once.
      
      - Rather than a single hstate, we now have an array, with an iterator
      - default_hstate continues to be the struct hstate which we use by default
      - Add functions for architectures to register new hstates
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e5ff2159
    • A
      hugetlb: modular state for hugetlb page size · a5516438
      Andi Kleen 提交于
      The goal of this patchset is to support multiple hugetlb page sizes.  This
      is achieved by introducing a new struct hstate structure, which
      encapsulates the important hugetlb state and constants (eg.  huge page
      size, number of huge pages currently allocated, etc).
      
      The hstate structure is then passed around the code which requires these
      fields, they will do the right thing regardless of the exact hstate they
      are operating on.
      
      This patch adds the hstate structure, with a single global instance of it
      (default_hstate), and does the basic work of converting hugetlb to use the
      hstate.
      
      Future patches will add more hstate structures to allow for different
      hugetlbfs mounts to have different page sizes.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5516438
    • A
      hugetlb: factor out prep_new_huge_page · b7ba30c6
      Andi Kleen 提交于
      Needed to avoid code duplication in follow up patches.
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7ba30c6
    • N
      mm: create /sys/kernel/mm · ff7ea79c
      Nishanth Aravamudan 提交于
      Add a kobject to create /sys/kernel/mm when sysfs is mounted.  The kobject
      will exist regardless.  This will allow for the hugepage related sysfs
      directories to exist under the mm "subsystem" directory.  Add an ABI file
      appropriately.
      
      [kosaki.motohiro@jp.fujitsu.com: fix build]
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff7ea79c
    • N
      mm: remove mm_init compilation dependency on CONFIG_DEBUG_MEMORY_INIT · 5e9426ab
      Nishanth Aravamudan 提交于
      Towards the end of putting all core mm initialization in mm_init.c, I
      plan on putting the creation of a mm kobject in a function in that file.
      However, the file is currently only compiled if CONFIG_DEBUG_MEMORY_INIT
      is set. Remove this dependency, but put the code under an #ifdef on the
      same config option. This should result in no functional changes.
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e9426ab
    • E
      vmallocinfo: add NUMA information · a47a126a
      Eric Dumazet 提交于
      Christoph recently added /proc/vmallocinfo file to get information about
      vmalloc allocations.
      
      This patch adds NUMA specific information, giving number of pages
      allocated on each memory node.
      
      This should help to check that vmalloc() is able to respect NUMA policies.
      
      Example of output on a four nodes machine (one cpu per node)
      
      1) network hash tables are evenly spreaded on four nodes (OK) (Same
         point for inodes and dentries hash tables)
      
      2) iptables tables (x_tables) are correctly allocated on each cpu node
         (OK).
      
      3) sys_swapon() allocates its memory from one node only.
      
      4) each loaded module is using memory on one node.
      
      Sysadmins could tune their setup to change points 3) and 4) if necessary.
      
      grep "pages="  /proc/vmallocinfo
      0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204/0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128
      0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
      0xffffc2000031a000-0xffffc2000031d000   12288 alloc_large_system_hash+0x204/0x2c0 pages=2 vmalloc N1=1 N2=1
      0xffffc2000031f000-0xffffc2000032b000   49152 cramfs_uncompress_init+0x2e/0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3
      0xffffc2000033e000-0xffffc20000341000   12288 sys_swapon+0x640/0xac0 pages=2 vmalloc N0=2
      0xffffc20000341000-0xffffc20000344000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N0=2
      0xffffc20000344000-0xffffc20000347000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N1=2
      0xffffc20000347000-0xffffc2000034a000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N2=2
      0xffffc2000034a000-0xffffc2000034d000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N3=2
      0xffffc20004381000-0xffffc20004402000  528384 alloc_large_system_hash+0x204/0x2c0 pages=128 vmalloc N0=32 N1=32 N2=32 N3=32
      0xffffc20004402000-0xffffc20004803000 4198400 alloc_large_system_hash+0x204/0x2c0 pages=1024 vmalloc vpages N0=256 N1=256 N2=256 N3=256
      0xffffc20004803000-0xffffc20004904000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
      0xffffc20004904000-0xffffc20004bec000 3047424 sys_swapon+0x640/0xac0 pages=743 vmalloc vpages N0=743
      0xffffffffa0000000-0xffffffffa000f000   61440 sys_init_module+0xc27/0x1d00 pages=14 vmalloc N1=14
      0xffffffffa000f000-0xffffffffa0014000   20480 sys_init_module+0xc27/0x1d00 pages=4 vmalloc N0=4
      0xffffffffa0014000-0xffffffffa0017000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N0=2
      0xffffffffa0017000-0xffffffffa0022000   45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N1=10
      0xffffffffa0022000-0xffffffffa0028000   24576 sys_init_module+0xc27/0x1d00 pages=5 vmalloc N3=5
      0xffffffffa0028000-0xffffffffa0050000  163840 sys_init_module+0xc27/0x1d00 pages=39 vmalloc N1=39
      0xffffffffa0050000-0xffffffffa0052000    8192 sys_init_module+0xc27/0x1d00 pages=1 vmalloc N1=1
      0xffffffffa0052000-0xffffffffa0056000   16384 sys_init_module+0xc27/0x1d00 pages=3 vmalloc N1=3
      0xffffffffa0056000-0xffffffffa0081000  176128 sys_init_module+0xc27/0x1d00 pages=42 vmalloc N3=42
      0xffffffffa0081000-0xffffffffa00ae000  184320 sys_init_module+0xc27/0x1d00 pages=44 vmalloc N3=44
      0xffffffffa00ae000-0xffffffffa00b1000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
      0xffffffffa00b1000-0xffffffffa00b9000   32768 sys_init_module+0xc27/0x1d00 pages=7 vmalloc N0=7
      0xffffffffa00b9000-0xffffffffa00c4000   45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N3=10
      0xffffffffa00c6000-0xffffffffa00e0000  106496 sys_init_module+0xc27/0x1d00 pages=25 vmalloc N2=25
      0xffffffffa00e0000-0xffffffffa00f1000   69632 sys_init_module+0xc27/0x1d00 pages=16 vmalloc N2=16
      0xffffffffa00f1000-0xffffffffa00f4000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
      0xffffffffa00f4000-0xffffffffa00f7000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
      
      [akpm@linux-foundation.org: fix comment]
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a47a126a
    • P
      SYNC_FILE_RANGE_WRITE may and will block. Document that. · cce77081
      Pavel Machek 提交于
      [akpm@linux-foundation.org: fix comment text]
      Signed-off-by: NPavel Machek <pavel@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cce77081
    • H
      tmpfs: support aio · bcd78e49
      Hugh Dickins 提交于
      We have a request for tmpfs to support the AIO interface: easily done, no
      more than replacing the old shmem_file_read by shmem_file_aio_read,
      cribbed from generic_file_aio_read.  (In 2.6.25 its write side was already
      changed to use generic_file_aio_write.)
      
      Incorporate cleanups from Andrew Morton and Harvey Harrison.
      
      Tests out fine with LTP's ltp-aiodio.sh, given hacks (not included) to
      support O_DIRECT.  tmpfs cannot honestly support O_DIRECT: its
      cache-avoiding-IO nature is at odds with direct IO-avoiding-cache.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Tested-by: NLawrence Greenfield <leg@google.com>
      Cc: Christoph Rohland <hans-christoph.rohland@sap.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bcd78e49
    • H
      generic_file_aio_read() cleanups · 11fa977e
      Hugh Dickins 提交于
      As akpm points out, there's really no need for generic_file_aio_read to
      make a special case of count 0: just loop through nr_segs doing nothing.
      And as Harvey Harrison points out, there's no need to reset retval to 0
      where it's already 0.
      
      Setting count (or ocount) to 0 before calling generic_segment_checks is
      unnecessary too; but reluctantly I'll leave that removal to someone with a
      wider range of gcc versions to hand - 4.1.2 and 4.2.1 don't warn about it,
      but perhaps others do - I forget which are the warniest versions.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Tested-by: NLawrence Greenfield <leg@google.com>
      Cc: Christoph Rohland <hans-christoph.rohland@sap.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11fa977e
    • J
      vma_page_offset() has no callees: drop it · a858f7b2
      Johannes Weiner 提交于
      Hugh adds: vma_pagecache_offset() has a dangerously misleading name, since
      it's using hugepage units: rename it to vma_hugecache_offset().
      
      [apw@shadowen.org: restack onto fixed MAP_PRIVATE reservations]
      [akpm@linux-foundation.org: vma_split conversion]
      Signed-off-by: NJohannes Weiner <hannes@saeurebad.de>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a858f7b2
    • A
      hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma splits · 84afd99b
      Andy Whitcroft 提交于
      When a hugetlb mapping with a reservation is split, a new VMA is cloned
      from the original.  This new VMA is a direct copy of the original
      including the reservation count.  When this pair of VMAs are unmapped we
      will incorrect double account the unused reservation and the overall
      reservation count will be incorrect, in extreme cases it will wrap.
      
      The problem occurs when we split an existing VMA say to unmap a page in
      the middle.  split_vma() will create a new VMA copying all fields from the
      original.  As we are storing our reservation count in vm_private_data this
      is also copies, endowing the new VMA with a duplicate of the original
      VMA's reservation.  Neither of the new VMAs can exhaust these reservations
      as they are too small, but when we unmap and close these VMAs we will
      incorrect credit the remainder twice and resv_huge_pages will become out
      of sync.  This can lead to allocation failures on mappings with
      reservations and even to resv_huge_pages wrapping which prevents all
      subsequent hugepage allocations.
      
      The simple fix would be to correctly apportion the remaining reservation
      count when the split is made.  However the only hook we have vm_ops->open
      only has the new VMA we do not know the identity of the preceeding VMA.
      Also even if we did have that VMA to hand we do not know how much of the
      reservation was consumed each side of the split.
      
      This patch therefore takes a different tack.  We know that the whole of
      any private mapping (which has a reservation) has a reservation over its
      whole size.  Any present pages represent consumed reservation.  Therefore
      if we track the instantiated pages we can calculate the remaining
      reservation.
      
      This patch reuses the existing regions code to track the regions for which
      we have consumed reservation (ie.  the instantiated pages), as each page
      is faulted in we record the consumption of reservation for the new page.
      When we need to return unused reservations at unmap time we simply count
      the consumed reservation region subtracting that from the whole of the
      map.  During a VMA split the newly opened VMA will point to the same
      region map, as this map is offset oriented it remains valid for both of
      the split VMAs.  This map is referenced counted so that it is removed when
      all VMAs which are part of the mmap are gone.
      
      Thanks to Adam Litke and Mel Gorman for their review feedback.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84afd99b
    • A
      hugetlb: allow huge page mappings to be created without reservations · c37f9fb1
      Andy Whitcroft 提交于
      By default all shared mappings and most private mappings now have
      reservations associated with them.  This improves semantics by providing
      allocation guarentees to the mapper.  However a small number of
      applications may attempt to make very large sparse mappings, with these
      strict reservations the system will never be able to honour the mapping.
      
      This patch set brings MAP_NORESERVE support to hugetlb files.  This allows
      new mappings to be made to hugetlbfs files without an associated
      reservation, for both shared and private mappings.  This allows
      applications which want to create very sparse mappings to opt-out of the
      reservation system.  Obviously as there is no reservation they are liable
      to fault at runtime if the huge page pool becomes exhausted; buyer beware.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c37f9fb1
    • A
      hugetlb: move reservation region support earlier · 96822904
      Andy Whitcroft 提交于
      The following patch will require use of the reservation regions support.
      Move this earlier in the file.  No changes have been made to this code.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96822904
    • A
      mm: record MAP_NORESERVE status on vmas and fix small page mprotect reservations · cdfd4325
      Andy Whitcroft 提交于
      With Mel's hugetlb private reservation support patches applied, strict
      overcommit semantics are applied to both shared and private huge page
      mappings.  This can be a problem if an application relied on unlimited
      overcommit semantics for private mappings.  An example of this would be an
      application which maps a huge area with the intention of using it very
      sparsely.  These application would benefit from being able to opt-out of
      the strict overcommit.  It should be noted that prior to hugetlb
      supporting demand faulting all mappings were fully populated and so
      applications of this type should be rare.
      
      This patch stack implements the MAP_NORESERVE mmap() flag for huge page
      mappings.  This flag has the same meaning as for small page mappings,
      suppressing reservations for that mapping.
      
      Thanks to Mel Gorman for reviewing a number of early versions of these
      patches.
      
      This patch:
      
      When a small page mapping is created with mmap() reservations are created
      by default for any memory pages required.  When the region is read/write
      the reservation is increased for every page, no reservation is needed for
      read-only regions (as they implicitly share the zero page).  Reservations
      are tracked via the VM_ACCOUNT vma flag which is present when the region
      has reservation backing it.  When we convert a region from read-only to
      read-write new reservations are aquired and VM_ACCOUNT is set.  However,
      when a read-only map is created with MAP_NORESERVE it is indistinguishable
      from a normal mapping.  When we then convert that to read/write we are
      forced to incorrectly create reservations for it as we have no record of
      the original MAP_NORESERVE.
      
      This patch introduces a new vma flag VM_NORESERVE which records the
      presence of the original MAP_NORESERVE flag.  This allows us to
      distinguish these two circumstances and correctly account the reserve.
      
      As well as fixing this FIXME in the code, this makes it much easier to
      introduce MAP_NORESERVE support for huge pages as this flag is available
      consistantly for the life of the mapping.  VM_ACCOUNT on the other hand is
      heavily used at the generic level in association with small pages.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cdfd4325
    • A
      huge page private reservation review cleanups · e7c4b0bf
      Andy Whitcroft 提交于
      Create some new accessors for vma private data to cut down on and contain
      the casts.  Encapsulates the huge and small page offset calculations.
      Also adds a couple of VM_BUG_ONs for consistency.
      
      [akpm@linux-foundation.org: Make things static]
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7c4b0bf
    • M
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE)... · 04f2cbe3
      Mel Gorman 提交于
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
      
      After patch 2 in this series, a process that successfully calls mmap() for
      a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
      process calls fork().  At that point, the next write fault from the parent
      could fail due to COW if the child still has a reference.
      
      We only reserve pages for the parent but a copy must be made to avoid
      leaking data from the parent to the child after fork().  Reserves could be
      taken for both parent and child at fork time to guarantee faults but if
      the mapping is large it is highly likely we will not have sufficient pages
      for the reservation, and it is common to fork only to exec() immediatly
      after.  A failure here would be very undesirable.
      
      Note that the current behaviour of mainline with MAP_PRIVATE pages is
      pretty bad.  The following situation is allowed to occur today.
      
      1. Process calls mmap(MAP_PRIVATE)
      2. Process calls mlock() to fault all pages and makes sure it succeeds
      3. Process forks()
      4. Process writes to MAP_PRIVATE mapping while child still exists
      5. If the COW fails at this point, the process gets SIGKILLed even though it
         had taken care to ensure the pages existed
      
      This patch improves the situation by guaranteeing the reliability of the
      process that successfully calls mmap().  When the parent performs COW, it
      will try to satisfy the allocation without using reserves.  If that fails
      the parent will steal the page leaving any children without a page.
      Faults from the child after that point will result in failure.  If the
      child COW happens first, an attempt will be made to allocate the page
      without reserves and the child will get SIGKILLed on failure.
      
      To summarise the new behaviour:
      
      1. If the original mapper performs COW on a private mapping with multiple
         references, it will attempt to allocate a hugepage from the pool or
         the buddy allocator without using the existing reserves. On fail, VMAs
         mapping the same area are traversed and the page being COW'd is unmapped
         where found. It will then steal the original page as the last mapper in
         the normal way.
      
      2. The VMAs the pages were unmapped from are flagged to note that pages
         with data no longer exist. Future no-page faults on those VMAs will
         terminate the process as otherwise it would appear that data was corrupted.
         A warning is printed to the console that this situation occured.
      
      2. If the child performs COW first, it will attempt to satisfy the COW
         from the pool if there are enough pages or via the buddy allocator if
         overcommit is allowed and the buddy allocator can satisfy the request. If
         it fails, the child will be killed.
      
      If the pool is large enough, existing applications will not notice that
      the reserves were a factor.  Existing applications depending on the
      no-reserves been set are unlikely to exist as for much of the history of
      hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
      point or failing the mmap().
      
      [npiggin@suse.de: fix CONFIG_HUGETLB=n build]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04f2cbe3
    • M
      hugetlb: reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() · a1e78772
      Mel Gorman 提交于
      This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
      a similar manner to the reservations taken for MAP_SHARED mappings.  The
      reserve count is accounted both globally and on a per-VMA basis for
      private mappings.  This guarantees that a process that successfully calls
      mmap() will successfully fault all pages in the future unless fork() is
      called.
      
      The characteristics of private mappings of hugetlbfs files behaviour after
      this patch are;
      
      1. The process calling mmap() is guaranteed to succeed all future faults until
         it forks().
      2. On fork(), the parent may die due to SIGKILL on writes to the private
         mapping if enough pages are not available for the COW. For reasonably
         reliable behaviour in the face of a small huge page pool, children of
         hugepage-aware processes should not reference the mappings; such as
         might occur when fork()ing to exec().
      3. On fork(), the child VMAs inherit no reserves. Reads on pages already
         faulted by the parent will succeed. Successful writes will depend on enough
         huge pages being free in the pool.
      4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
         and at fault time otherwise.
      
      Before this patch, all reads or writes in the child potentially needs page
      allocations that can later lead to the death of the parent.  This applies
      to reads and writes of uninstantiated pages as well as COW.  After the
      patch it is only a write to an instantiated page that causes problems.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a1e78772
    • M
      hugetlb: move hugetlb_acct_memory() · fc1b8a73
      Mel Gorman 提交于
      This is a patchset to give reliable behaviour to a process that
      successfully calls mmap(MAP_PRIVATE) on a hugetlbfs file.  Currently, it
      is possible for the process to be killed due to a small hugepage pool size
      even if it calls mlock().
      
      MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time.  This
      guarantees all future faults against the mapping will succeed.  This
      allows local allocations at first use improving NUMA locality whilst
      retaining reliability.
      
      MAP_PRIVATE mappings do not reserve pages.  This can result in an
      application being SIGKILLed later if a huge page is not available at fault
      time.  This makes huge pages usage very ill-advised in some cases as the
      unexpected application failure cannot be detected and handled as it is
      immediately fatal.  Although an application may force instantiation of the
      pages using mlock(), this may lead to poor memory placement and the
      process may still be killed when performing COW.
      
      This patchset introduces a reliability guarantee for the process which
      creates a private mapping, i.e.  the process that calls mmap() on a
      hugetlbfs file successfully.  The first patch of the set is purely
      mechanical code move to make later diffs easier to read.  The second patch
      will guarantee faults up until the process calls fork().  After patch two,
      as long as the child keeps the mappings, the parent is no longer
      guaranteed to be reliable.  Patch 3 guarantees that the parent will always
      successfully COW by unmapping the pages from the child in the event there
      are insufficient pages in the hugepage pool in allocate a new page, be it
      via a static or dynamic pool.
      
      Existing hugepage-aware applications are unlikely to be affected by this
      change.  For much of hugetlbfs's history, pages were pre-faulted at mmap()
      time or mmap() failed which acts in a reserve-like manner.  If the pool is
      sized correctly already so that parent and child can fault reliably, the
      application will not even notice the reserves.  It's only when the pool is
      too small for the application to function perfectly reliably that the
      reserves come into play.
      
      Credit goes to Andy Whitcroft for cleaning up a number of mistakes during
      review before the patches were released.
      
      This patch:
      
      A later patch in this set needs to call hugetlb_acct_memory() before it is
      defined.  This patch moves the function without modification.  This makes
      later diffs easier to read.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc1b8a73
    • J
      mm: drop unneeded pgdat argument from free_area_init_node() · 9109fb7b
      Johannes Weiner 提交于
      free_area_init_node() gets passed in the node id as well as the node
      descriptor.  This is redundant as the function can trivially get the node
      descriptor itself by means of NODE_DATA() and the node's id.
      
      I checked all the users and NODE_DATA() seems to be usable everywhere
      from where this function is called.
      Signed-off-by: NJohannes Weiner <hannes@saeurebad.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9109fb7b
    • A
      mapping_set_error: add unlikely() · 2185e69f
      Andrew Morton 提交于
      This is called on a per-page basis and in the vast majority of cases
      `error' is zero.
      
      Cc: Guillaume Chazarain <guichaz@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2185e69f
    • A
      slob: record page flag overlays explicitly · 9023cb7e
      Andy Whitcroft 提交于
      SLOB reuses two page bits for internal purposes, it overlays PG_active and
      PG_private.  This is hidden away in slob.c.  Document these overlays
      explicitly in the main page-flags enum along with all the others.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9023cb7e
    • A
      slub: record page flag overlays explicitly · 8a38082d
      Andy Whitcroft 提交于
      SLUB reuses two page bits for internal purposes, it overlays PG_active and
      PG_error.  This is hidden away in slub.c.  Document these overlays
      explicitly in the main page-flags enum along with all the others.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Tested-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a38082d
    • A
      page-flags: record page flag overlays explicitly · 0cad47cf
      Andy Whitcroft 提交于
      With the recent page flag reorganisation we have a single enum which
      defines the valid page flags and their values, nice and clear.  However
      there are a number of bits which are overloaded by different subsystems.
      Firstly there is PG_owner_priv_1 which is used by filesystems and by XEN.
      Secondly both SLOB and SLUB use a couple of extra page bits to manage
      internal state for pages they own; both overlay other bits.  All of these
      "aliases" are scattered about the source making it very hard for a reader
      to know if the bits are safe to rely on in all contexts; confusion here is
      bad.
      
      As we now have a single place where the bits are clearly assigned it makes
      sense to clarify the reuse of bits by making the aliases explicit and
      visible with the original bit assignments.  This patch creates explicit
      aliases within the enum itself for the overloaded bits, creates standard
      bit accessors PageFoo etc.  and uses those throughout.
      
      This version pulls the bit manipulation out to standard named page bit
      accessors as suggested by Christoph, it retains the explicit mapping to
      the overlayed bits.  A fusion of both ideas.  This has been SLUB and SLOB
      have been compile tested on x86_64 only, and SLUB boot tested.  If people
      feel this is worth doing then I can run a fuller set of testing.
      
      This patch:
      
      Some page flags are used for more than one purpose, for example
      PG_owner_priv_1.  Currently there are individual accessors for each user,
      each built using the common flag name far away from the bit definitions.
      This makes it hard to see all possible uses of these bits.
      
      Now that we have a single enum to generate the bit orders it makes sense
      to express overlays in the same place.  So create per use aliases for this
      bit in the main page-flags enum and use those in the accessors.
      
      [akpm@linux-foundation.org: fix xen]
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0cad47cf
    • K
      fix soft lock up at NFS mount via per-SB LRU-list of unused dentries · da3bbdd4
      Kentaro Makita 提交于
      [Summary]
      
       Split LRU-list of unused dentries to one per superblock to avoid soft
       lock up during NFS mounts and remounting of any filesystem.
      
       Previously I posted here:
       http://lkml.org/lkml/2008/3/5/590
      
      [Descriptions]
      
      - background
      
        dentry_unused is a list of dentries which are not referenced.
        dentry_unused grows up when references on directories or files are
        released.  This list can be very long if there is huge free memory.
      
      - the problem
      
        When shrink_dcache_sb() is called, it scans all dentry_unused linearly
        under spin_lock(), and if dentry->d_sb is differnt from given
        superblock, scan next dentry.  This scan costs very much if there are
        many entries, and very ineffective if there are many superblocks.
      
        IOW, When we need to shrink unused dentries on one dentry, but scans
        unused dentries on all superblocks in the system.  For example, we scan
        500 dentries to unmount a filesystem, but scans 1,000,000 or more unused
        dentries on other superblocks.
      
        In our case , At mounting NFS*, shrink_dcache_sb() is called to shrink
        unused dentries on NFS, but scans 100,000,000 unused dentries on
        superblocks in the system such as local ext3 filesystems.  I hear NFS
        mounting took 1 min on some system in use.
      
      * : NFS uses virtual filesystem in rpc layer, so NFS is affected by
        this problem.
      
        100,000,000 is possible number on large systems.
      
        Per-superblock LRU of unused dentried can reduce the cost in
        reasonable manner.
      
      - How to fix
      
        I found this problem is solved by David Chinner's "Per-superblock
        unused dentry LRU lists V3"(1), so I rebase it and add some fix to
        reclaim with fairness, which is in Andrew Morton's comments(2).
      
        1) http://lkml.org/lkml/2006/5/25/318
        2) http://lkml.org/lkml/2006/5/25/320
      
        Split LRU-list of unused dentries to each superblocks.  Then, NFS
        mounting will check dentries under a superblock instead of all.  But
        this spliting will break LRU of dentry-unused.  So, I've attempted to
        make reclaim unused dentrins with fairness by calculate number of
        dentries to scan on this sb based on following way
      
        number of dentries to scan on this sb =
        count * (number of dentries on this sb / number of dentries in the machine)
      
      - ToDo
       - I have to measuring performance number and do stress tests.
      
       - When unmount occurs during prune_dcache(), scanning on same
        superblock, It is unable to reach next superblock because it is gone
        away.  We restart scannig superblock from first one, it causes
        unfairness of reclaim unused dentries on first superblock.  But I think
        this happens very rarely.
      
      - Test Results
      
        Result on 6GB boxes with excessive unused dentries.
      
      Without patch:
      
      $ cat /proc/sys/fs/dentry-state
      10181835        10180203        45      0       0       0
      # mount -t nfs 10.124.60.70:/work/kernel-src nfs
      real    0m1.830s
      user    0m0.001s
      sys     0m1.653s
      
       With this patch:
      $ cat /proc/sys/fs/dentry-state
      10236610        10234751        45      0       0       0
      # mount -t nfs 10.124.60.70:/work/kernel-src nfs
      real    0m0.106s
      user    0m0.002s
      sys     0m0.032s
      
      [akpm@linux-foundation.org: fix comments]
      Signed-off-by: NKentaro Makita <k-makita@np.css.fujitsu.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: David Chinner <dgc@sgi.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da3bbdd4
    • A
      buddy: clarify comments describing buddy merge · 3c82d0ce
      Andy Whitcroft 提交于
      In __free_one_page(), the comment "Move the buddy up one level" appears
      attached to the break and by implication when the break is taken we are
      moving it up one level:
      
      	if (!page_is_buddy(page, buddy, order))
      		break;          /* Move the buddy up one level. */
      
      In reality the inverse is true, we break out when we can no longer merge
      this page with its buddy.  Looking back into pre-history (into the full
      git history) it appears that these two lines accidentally got joined as
      part of another change.
      
      Move the comment down where it belongs below the if and clarify its
      language.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3c82d0ce