1. 24 June 2009, 1 commit
      percpu: cleanup percpu array definitions · 204fba4a
Authored by Tejun Heo
      Currently, the following three different ways to define percpu arrays
      are in use.
      
      1. DEFINE_PER_CPU(elem_type[array_len], array_name);
      2. DEFINE_PER_CPU(elem_type, array_name[array_len]);
      3. DEFINE_PER_CPU(elem_type, array_name)[array_len];
      
Unify on #1, which correctly separates the roles of the two parameters
and thus allows more flexibility in how percpu variables are defined.
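
As an illustration, a per-cpu array written in style #2 becomes the
following in style #1 (the variable name here is hypothetical):

	/* style #2: the array length is attached to the name parameter */
	DEFINE_PER_CPU(int, example_counts[4]);

	/* style #1: the element type carries the length; the name
	 * stays a plain identifier */
	DEFINE_PER_CPU(int[4], example_counts);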
      
      [ Impact: cleanup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: linux-mm@kvack.org
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: David S. Miller <davem@davemloft.net>
2. 13 March 2009, 1 commit
3. 03 September 2008, 1 commit
      mm: size of quicklists shouldn't be proportional to the number of CPUs · b9541852
Authored by KOSAKI Motohiro
Quicklists cache pages per CPU (each CPU may cache up to
node_free_pages/16 pages).

They are used as a page table cache: exit() grows the cache, while
fork() consumes it.
      
So, for example, when an Apache-style application (one parent, many
children) runs, one CPU keeps calling fork() while the other CPUs do
the middleware work and call exit().

The CPU on which the parent runs then holds no page table cache at
all, while the CPUs on which the children run hold the maximum.
      
Since each of the other (#ofCPUs - 1) CPUs can cache up to Free/16
pages, the worst case is

	QList_max = (#ofCPUs - 1) x Free / 16
	=> QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)
      
So how much quicklist memory is used in the worst case?

It is proportional to the number of CPUs, because the per-cpu
quicklist limit does not take the number of CPUs into account.
      
The above calculation gives
      
      	 Number of CPUs per node            2    4    8   16
      	 ==============================  ====================
      	 QList_max / (Free + QList_max)   5.8%  16%  30%  48%
      
Wow! Quicklists can consume about 50% of memory in the worst case.
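
Those percentages can be sanity-checked with a few lines of C
(illustrative only, not part of the patch):

	#include <stdio.h>

	/* print (n - 1) / (16 + n - 1) for the CPU counts above */
	int main(void)
	{
		int cpus[] = { 2, 4, 8, 16 };

		for (int i = 0; i < 4; i++) {
			double r = (double)(cpus[i] - 1) / (16 + cpus[i] - 1);
			printf("%2d cpus: %4.1f%%\n", cpus[i], 100.0 * r);
		}
		return 0;
	}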
      
My demonstration program is below:
      --------------------------------------------------------------------------------
      #define _GNU_SOURCE
      
      #include <stdio.h>
      #include <errno.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sched.h>
      #include <unistd.h>
      #include <sys/mman.h>
      #include <sys/wait.h>
      
      #define BUFFSIZE 512
      
int max_cpu(void)	/* return the highest logical cpu number in /proc/cpuinfo */
      {
        FILE *fd;
        char *ret, buffer[BUFFSIZE];
        int cpu = 1;
      
        fd = fopen("/proc/cpuinfo", "r");
        if (fd == NULL) {
          perror("fopen(/proc/cpuinfo)");
          exit(EXIT_FAILURE);
        }
        while (1) {
          ret = fgets(buffer, BUFFSIZE, fd);
          if (ret == NULL)
            break;
          if (!strncmp(buffer, "processor", 9))
            cpu = atoi(strchr(buffer, ':') + 2);
        }
        fclose(fd);
        return cpu;
      }
      
      void cpu_bind(int cpu)	/* bind current process to one cpu */
      {
        cpu_set_t mask;
        int ret;
      
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        ret = sched_setaffinity(0, sizeof(mask), &mask);
        if (ret == -1) {
          perror("sched_setaffinity()");
          exit(EXIT_FAILURE);
        }
        sched_yield();	/* not necessary */
      }
      
      #define MMAP_SIZE (10 * 1024 * 1024)	/* 10 MB */
      #define FORK_INTERVAL 1	/* 1 second */
      
int main(int argc, char *argv[])
      {
        int cpu_max, nextcpu;
        long pagesize;
        pid_t pid;
      
        /* set max number of logical cpu */
        if (argc > 1)
          cpu_max = atoi(argv[1]) - 1;
        else
          cpu_max = max_cpu();
      
        /* get the page size */
        pagesize = sysconf(_SC_PAGESIZE);
        if (pagesize == -1) {
          perror("sysconf(_SC_PAGESIZE)");
          exit(EXIT_FAILURE);
        }
      
        /* prepare parent process */
        cpu_bind(0);
        nextcpu = cpu_max;
      
      loop:
      
        /* select destination cpu for child process by round-robin rule */
        if (++nextcpu > cpu_max)
          nextcpu = 1;
      
        pid = fork();
      
        if (pid == 0) { /* child action */
      
          char *p;
          int i;
      
          /* consume page tables */
    p = mmap(NULL, MMAP_SIZE, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
      perror("mmap()");
      exit(EXIT_FAILURE);
    }
          i = MMAP_SIZE / pagesize;
          while (i-- > 0) {
            *p = 1;
            p += pagesize;
          }
      
          /* move to other cpu */
          cpu_bind(nextcpu);
      /*
          printf("a child moved to cpu%d after mmap().\n", nextcpu);
          fflush(stdout);
       */
      
    /* give the page tables back to pgtable_quicklist */
          exit(0);
      
        } else if (pid > 0) { /* parent action */
      
          sleep(FORK_INTERVAL);
          waitpid(pid, NULL, WNOHANG);
      
        }
      
        goto loop;
      }
      ----------------------------------------
      
When the above program, which does task migration, runs, my 8GB box
spends 800MB of memory on quicklists.  This is not a memory leak, but it
still doesn't seem good.
      
      % cat /proc/meminfo
      
      MemTotal:        7701568 kB
      MemFree:         4724672 kB
      (snip)
      Quicklists:       844800 kB
      
because

- my machine spec is:
	number of numa nodes: 2
	number of cpus:       8 (4 CPUs x 2 nodes)
	total mem:            8GB (4GB x 2 nodes)
	free mem:             about 5GB

- then, 4.7GB x 16% ~= 880MB.
  So quicklists can use 800MB.
      
So, if a machine with the following spec runs that program:

   CPUs: 64 (8 CPUs x 8 nodes)
   Mem:  1TB (128GB x 8 nodes)

then quicklists can waste 300GB (= 1TB x 30%).  That is far too much.

So I don't like cache policies that are proportional to the number of
CPUs.
      
My patch changes the per-cpu cache limit
from:
   per-cpu-cache-amount = memory_on_node / 16
to:
   per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node
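
A minimal sketch of the new limit calculation (the function and
parameter names here are illustrative, not the actual kernel code):

	/* illustrative: cap on pages one cpu may keep on its quicklist */
	static unsigned long per_cpu_cache_max(unsigned long node_free_pages,
					       unsigned int cpus_on_node)
	{
		/* old behaviour: node_free_pages / 16, per cpu */
		/* new behaviour: the 1/16 budget is split across the
		 * node's cpus, so the node-wide total stays bounded */
		return node_free_pages / 16 / cpus_on_node;
	}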
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Tested-by: David Miller <davem@davemloft.net>
Acked-by: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4. 15 January 2008, 1 commit
5. 08 May 2007, 1 commit
      Quicklists for page table pages · 6225e937
Authored by Christoph Lameter
On x86_64 this cuts allocation overhead for page table pages down to a
fraction (kernel compile / editing load; TSC-based measurement of time
spent in each function):
      
      no quicklist
      
      pte_alloc               1569048 4.3s(401ns/2.7us/179.7us)
      pmd_alloc                780988 2.1s(337ns/2.7us/86.1us)
      pud_alloc                780072 2.2s(424ns/2.8us/300.6us)
      pgd_alloc                260022 1s(920ns/4us/263.1us)
      
      quicklist:
      
      pte_alloc                452436 573.4ms(8ns/1.3us/121.1us)
      pmd_alloc                196204 174.5ms(7ns/889ns/46.1us)
      pud_alloc                195688 172.4ms(7ns/881ns/151.3us)
      pgd_alloc                 65228 9.8ms(8ns/150ns/6.1us)
      
pgd allocations are the most complex, and there we see the most dramatic
improvement (maybe we can cut down the number of pgds cached somewhat?).
But even the pte allocations still see a doubling of performance.
      
1. Proven code from the IA64 arch.

	The method used here has been fine-tuned for years and
	is NUMA aware.  It is based on the knowledge that accesses
	to page table pages are sparse in nature.  Taking a page
	off the freelists instead of allocating a zeroed page
	reduces the number of cachelines touched, in addition to
	getting rid of the slab overhead, so performance improves.
	This is particularly useful if pgds contain standard
	mappings.  We can save on the teardown and setup of such a
	page if we have some on the quicklists.  This includes
	avoiding the list operations that are otherwise necessary
	on alloc and free to track pgds.
      
2. Lightweight alternative to using slab to manage page-size pages

	Slab overhead is significant, and even page allocator use
	is pretty heavyweight.  The use of a per-cpu quicklist
	means that we touch only two cachelines for an allocation.
	There is no need to access the page struct (unless arch code
	needs to fiddle around with it).  So the fast path just
	means bringing in one cacheline at the beginning of the
	page.  That same cacheline may then be used to store the
	page table entry.  Or a second cacheline may be used
	if the page table entry is not in the first cacheline of
	the page.  The current code will zero the page, which means
	touching 32 cachelines (assuming a 4KB page and 128-byte
	cachelines: 4096 / 128 = 32).  We get down from 32 to 2
	cachelines in the fast path.
      
3. x86_64 gets lightweight page table page management.

	This will allow x86_64 arch code to repopulate pgds and
	other page table entries faster.  The list operations for
	pgds are reduced in the same way as for i386: they happen
	only when a pgd is allocated from the page allocator and
	when it is freed back to the page allocator.  A pgd can
	pass through the quicklists without having to be
	reinitialized.
      
4. Consolidation of code from multiple arches

	So far, arches have had their own implementations of
	quicklist management.  This patch moves the feature into
	the core, allowing easier maintenance and consistent
	management of quicklists.
      
Page table pages have the characteristic that they are typically zero or
in a known state when they are freed.  This is usually exactly the same
state as needed after allocation.  So it makes sense to build a list of
freed page table pages and then consume those recently used pages first.
Such pages have already been initialized correctly (thus no need to zero
them) and are likely already cached in such a way that the MMU can use
them most effectively.  Page table pages are used in a sparse way, so
zeroing them on allocation is not too useful.
      
Such an implementation already exists for ia64.  However, that
implementation did not support the constructors and destructors needed
by i386 / x86_64.  It also only supported a single quicklist.  The
implementation here has constructor and destructor support as well as
the ability for an arch to specify how many quicklists are needed.
      
Quicklists are enabled by an arch defining CONFIG_QUICKLIST.  If more
than one quicklist is necessary, the arch can define NR_QUICK for
additional lists.  E.g. i386 needs two and thus has
      
      config NR_QUICK
      	int
      	default 2
      
      If an arch has requested quicklist support then pages can be allocated
      from the quicklist (or from the page allocator if the quicklist is
      empty) via:
      
      quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>)
      
      Page table pages can be freed using:
      
      quicklist_free(<quicklist-nr>, <destructor>, <page>)
      
      Pages must have a definite state after allocation and before
      they are freed. If no constructor is specified then pages
      will be zeroed on allocation and must be zeroed before they are
      freed.
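
A hedged usage sketch of the calls described above, with no
constructor or destructor (so the page is handed out zeroed and must
be zeroed again before it is freed):

	/* allocate a page table page from quicklist 0; with a NULL
	 * constructor the page comes back zeroed */
	void *pt = quicklist_alloc(0, GFP_KERNEL, NULL);

	/* ... use the page, then restore the zeroed state ... */

	/* return the page to quicklist 0 with no destructor */
	quicklist_free(0, NULL, pt);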
      
If a constructor is used, the constructor will establish a definite
page state.  E.g. the i386 and x86_64 pgd constructors establish
certain mappings.
      
      Constructors and destructors can also be used to track the pages.
      i386 and x86_64 use a list of pgds in order to be able to dynamically
      update standard mappings.
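
As a hypothetical illustration of such tracking (the list head and the
constructor/destructor bodies are assumptions, not the actual
i386/x86_64 code):

	static LIST_HEAD(tracked_pgds);	/* assumed tracking list */

	static void pgd_track_ctor(void *pgd)
	{
		/* establish a definite state: clear the user entries */
		memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
		/* remember this pgd so mappings can be updated later */
		list_add(&virt_to_page(pgd)->lru, &tracked_pgds);
	}

	static void pgd_track_dtor(void *pgd)
	{
		list_del(&virt_to_page(pgd)->lru);
	}

	/* pgd = quicklist_alloc(0, GFP_KERNEL, pgd_track_ctor);
	 * ...
	 * quicklist_free(0, pgd_track_dtor, pgd); */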
Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>