• P
    [PATCH] memory page_alloc zonelist caching speedup · 9276b1bc
    Paul Jackson 提交于
    Optimize the critical zonelist scanning for free pages in the kernel memory
    allocator by caching the zones that were found to be full recently, and
    skipping them.
    
    Remembers the zones in a zonelist that were short of free memory in the
    last second.  And it stashes a zone-to-node table in the zonelist struct,
    to optimize that conversion (minimize its cache footprint.)
    
    Recent changes:
    
        This differs in a significant way from a similar patch that I
        posted a week ago.  Now, instead of having a nodemask_t of
        recently full nodes, I have a bitmask of recently full zones.
        This solves a problem that last weeks patch had, which on
        systems with multiple zones per node (such as DMA zone) would
        take seeing any of these zones full as meaning that all zones
        on that node were full.
    
        Also I changed names - from "zonelist faster" to "zonelist cache",
        as that seemed to better convey what we're doing here - caching
        some of the key zonelist state (for faster access.)
    
        See below for some performance benchmark results.  After all that
        discussion with David on why I didn't need them, I went and got
        some ;).  I wanted to verify that I had not hurt the normal case
        of memory allocation noticeably.  At least for my one little
        microbenchmark, I found (1) the normal case wasn't affected, and
        (2) workloads that forced scanning across multiple nodes for
        memory improved up to 10% fewer System CPU cycles and lower
        elapsed clock time ('sys' and 'real').  Good.  See details, below.
    
        I didn't have the logic in get_page_from_freelist() for various
        full nodes and zone reclaim failures correct.  That should be
        fixed up now - notice the new goto labels zonelist_scan,
        this_zone_full, and try_next_zone, in get_page_from_freelist().
    
    There are two reasons I persued this alternative, over some earlier
    proposals that would have focused on optimizing the fake numa
    emulation case by caching the last useful zone:
    
     1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
        have seen real customer loads where the cost to scan the zonelist
        was a problem, due to many nodes being full of memory before
        we got to a node we could use.  Or at least, I think we have.
        This was related to me by another engineer, based on experiences
        from some time past.  So this is not guaranteed.  Most likely, though.
    
        The following approach should help such real numa systems just as
        much as it helps fake numa systems, or any combination thereof.
    
     2) The effort to distinguish fake from real numa, using node_distance,
        so that we could cache a fake numa node and optimize choosing
        it over equivalent distance fake nodes, while continuing to
        properly scan all real nodes in distance order, was going to
        require a nasty blob of zonelist and node distance munging.
    
        The following approach has no new dependency on node distances or
        zone sorting.
    
    See comment in the patch below for a description of what it actually does.
    
    Technical details of note (or controversy):
    
     - See the use of "zlc_active" and "did_zlc_setup" below, to delay
       adding any work for this new mechanism until we've looked at the
       first zone in zonelist.  I figured the odds of the first zone
       having the memory we needed were high enough that we should just
       look there, first, then get fancy only if we need to keep looking.
    
     - Some odd hackery was needed to add items to struct zonelist, while
       not tripping up the custom zonelists built by the mm/mempolicy.c
       code for MPOL_BIND.  My usual wordy comments below explain this.
       Search for "MPOL_BIND".
    
     - Some per-node data in the struct zonelist is now modified frequently,
       with no locking.  Multiple CPU cores on a node could hit and mangle
       this data.  The theory is that this is just performance hint data,
       and the memory allocator will work just fine despite any such mangling.
       The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
       (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
       all be self correcting after at most a one second delay.
    
     - This still does a linear scan of the same lengths as before.  All
       I've optimized is making the scan faster, not algorithmically
       shorter.  It is now able to scan a compact array of 'unsigned
       short' in the case of many full nodes, so one cache line should
       cover quite a few nodes, rather than each node hitting another
       one or two new and distinct cache lines.
    
     - If both Andi and Nick don't find this too complicated, I will be
       (pleasantly) flabbergasted.
    
     - I removed the comment claiming we only use one cachline's worth of
       zonelist.  We seem, at least in the fake numa case, to have put the
       lie to that claim.
    
     - I pay no attention to the various watermarks and such in this performance
       hint.  A node could be marked full for one watermark, and then skipped
       over when searching for a page using a different watermark.  I think
       that's actually quite ok, as it will tend to slightly increase the
       spreading of memory over other nodes, away from a memory stressed node.
    
    ===============
    
    Performance - some benchmark results and analysis:
    
    This benchmark runs a memory hog program that uses multiple
    threads to touch alot of memory as quickly as it can.
    
    Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
    the total 96 GBytes on the system, and using 1, 19, 37, or 55
    threads (on a 56 CPU system.)  System, user and real (elapsed)
    timings were recorded for each run, shown in units of seconds,
    in the table below.
    
    Two kernels were tested - 2.6.18-mm3 and the same kernel with
    this zonelist caching patch added.  The table also shows the
    percentage improvement the zonelist caching sys time is over
    (lower than) the stock *-mm kernel.
    
          number     2.6.18-mm3	   zonelist-cache    delta (< 0 good)	percent
     GBs    N  	------------	   --------------    ----------------	systime
     mem threads   sys user  real	  sys  user  real     sys  user  real	 better
      12	 1     153   24   177	  151	 24   176      -2     0    -1	   1%
      12	19	99   22     8	   99	 22	8	0     0     0	   0%
      12	37     111   25     6	  112	 25	6	1     0     0	  -0%
      12	55     115   25     5	  110	 23	5      -5    -2     0	   4%
      38	 1     502   74   576	  497	 73   570      -5    -1    -6	   0%
      38	19     426   78    48	  373	 76    39     -53    -2    -9	  12%
      38	37     544   83    36	  547	 82    36	3    -1     0	  -0%
      38	55     501   77    23	  511	 80    24      10     3     1	  -1%
      64	 1     917  125  1042	  890	124  1014     -27    -1   -28	   2%
      64	19    1118  138   119	  965	141   103    -153     3   -16	  13%
      64	37    1202  151    94	 1136	150    81     -66    -1   -13	   5%
      64	55    1118  141    61	 1072	140    58     -46    -1    -3	   4%
      90	 1    1342  177  1519	 1275	174  1450     -67    -3   -69	   4%
      90	19    2392  199   192	 2116	189   176    -276   -10   -16	  11%
      90	37    3313  238   175	 2972	225   145    -341   -13   -30	  10%
      90	55    1948  210   104	 1843	213   100    -105     3    -4	   5%
    
    Notes:
     1) This test ran a memory hog program that started a specified number N of
        threads, and had each thread allocate and touch 1/N'th of
        the total memory to be used in the test run in a single loop,
        writing a constant word to memory, one store every 4096 bytes.
        Watching this test during some earlier trial runs, I would see
        each of these threads sit down on one CPU and stay there, for
        the remainder of the pass, a different CPU for each thread.
    
     2) The 'real' column is not comparable to the 'sys' or 'user' columns.
        The 'real' column is seconds wall clock time elapsed, from beginning
        to end of that test pass.  The 'sys' and 'user' columns are total
        CPU seconds spent on that test pass.  For a 19 thread test run,
        for example, the sum of 'sys' and 'user' could be up to 19 times the
        number of 'real' elapsed wall clock seconds.
    
     3) Tests were run on a fresh, single-user boot, to minimize the amount
        of memory already in use at the start of the test, and to minimize
        the amount of background activity that might interfere.
    
     4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
    
     5) Notice that the 'real' time gets large for the single thread runs, even
        though the measured 'sys' and 'user' times are modest.  I'm not sure what
        that means - probably something to do with it being slow for one thread to
        be accessing memory along ways away.  Perhaps the fake numa system, running
        ostensibly the same workload, would not show this substantial degradation
        of 'real' time for one thread on many nodes -- lets hope not.
    
     6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
        ran quite efficiently, as one might expect.  Each pair of threads needed
        to allocate and touch the memory on the node the two threads shared, a
        pleasantly parallizable workload.
    
     7) The intermediate thread count passes, when asking for alot of memory forcing
        them to go to a few neighboring nodes, improved the most with this zonelist
        caching patch.
    
    Conclusions:
     * This zonelist cache patch probably makes little difference one way or the
       other for most workloads on real numa hardware, if those workloads avoid
       heavy off node allocations.
     * For memory intensive workloads requiring substantial off-node allocations
       on real numa hardware, this patch improves both kernel and elapsed timings
       up to ten per-cent.
     * For fake numa systems, I'm optimistic, but will have to leave that up to
       Rohit Seth to actually test (once I get him a 2.6.18 backport.)
    Signed-off-by: NPaul Jackson <pj@sgi.com>
    Cc: Rohit Seth <rohitseth@google.com>
    Cc: Christoph Lameter <clameter@engr.sgi.com>
    Cc: David Rientjes <rientjes@cs.washington.edu>
    Cc: Paul Menage <menage@google.com>
    Signed-off-by: NAndrew Morton <akpm@osdl.org>
    Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
    9276b1bc
cpuset.h 3.5 KB