1. 07 Nov 2015, 2 commits
    • mm: page_alloc: hide some GFP internals and document the bits and flag combinations · dd56b046
      Committed by Mel Gorman
      Andrew stated the following
      
      	We have quite a history of remote parts of the kernel using
      	weird/wrong/inexplicable combinations of __GFP_ flags.	I tend
      	to think that this is because we didn't adequately explain the
      	interface.
      
      	And I don't think that gfp.h really improved much in this area as
      	a result of this patchset.  Could you go through it some time and
      	decide if we've adequately documented all this stuff?
      
      This patch first moves some GFP flag combinations that are part of the MM
      internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
      bits under various headings and then documents the flag combinations. It
      will not help callers that are brain damaged but the clarity might motivate
      some fixes and avoid future mistakes.
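      As a rough illustration of what the documented combinations buy a caller (a
      paraphrase of the guidance, not the gfp.h text; the helper below is
      hypothetical):

      static void *alloc_for_context(size_t len, bool atomic_path, bool fs_path)
      {
              if (atomic_path)
                      return kmalloc(len, GFP_ATOMIC);  /* cannot sleep, may tap reserves */
              if (fs_path)
                      return kmalloc(len, GFP_NOFS);    /* may sleep, but no FS re-entry */
              return kmalloc(len, GFP_KERNEL);          /* may sleep, may do I/O and FS */
      }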
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd56b046
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Committed by Mel Gorman
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min", which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
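      A minimal sketch of the helper-based check described above, assuming a caller
      that used to test __GFP_WAIT directly (the surrounding function is invented
      for illustration; gfpflags_allow_blocking() is the helper this patch adds):

      static void *try_alloc(size_t len, gfp_t gfp)
      {
              void *p = kmalloc(len, gfp | __GFP_NOWARN);

              /* Retry harder only if the gfp mask says the caller may block. */
              if (!p && gfpflags_allow_blocking(gfp))
                      p = kmalloc(len, gfp);
              return p;
      }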
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  2. 06 Nov 2015, 1 commit
  3. 02 Nov 2015, 1 commit
    • mm: get rid of 'vmalloc_info' from /proc/meminfo · a5ad88ce
      Committed by Linus Torvalds
      It turns out that at least some versions of glibc end up reading
      /proc/meminfo at every single startup, because glibc wants to know the
      amount of memory the machine has.  And while that's arguably insane,
      it's just how things are.
      
      And it turns out that it's not all that expensive most of the time, but
      the vmalloc information statistics (amount of virtual memory used in the
      vmalloc space, and the biggest remaining chunk) can be rather expensive
      to compute.
      
      The 'get_vmalloc_info()' function actually showed up on my profiles as
      4% of the CPU usage of "make test" in the git source repository, because
      the git tests are lots of very short-lived shell-scripts etc.
      
      It turns out that apparently this same silly vmalloc info gathering
      shows up on the facebook servers too, according to Dave Jones.  So it's
      not just "make test" for git.
      
      We had two patches to just cache the information (one by me, one by
      Ingo) to mitigate this issue, but the whole vmalloc information is of
      rather dubious value to begin with, and people who *actually* want to
      know what the situation is wrt the vmalloc area should just look at the
      much more complete /proc/vmallocinfo instead.
      
      In fact, according to my testing - and perhaps more importantly,
      according to that big search engine in the sky: Google - there is
      nothing out there that actually cares about those two expensive fields:
      VmallocUsed and VmallocChunk.
      
      So let's try to just remove them entirely.  Actually, this just removes
      the computation and reports the numbers as zero for now, just to try to
      be minimally intrusive.
      
      If this breaks anything, we'll obviously have to re-introduce the code
      to compute this all and add the caching patches on top.  But if given
      the option, I'd really prefer to just remove this bad idea entirely
      rather than add even more code to work around our historical mistake
      that likely nobody really cares about.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5ad88ce
  4. 16 Apr 2015, 3 commits
    • mm/vmalloc: get rid of dirty bitmap inside vmap_block structure · 7d61bfe8
      Committed by Roman Pen
      In the original implementation of vm_map_ram, made by Nick Piggin, there were
      two bitmaps: alloc_map and dirty_map.  Neither was used as intended, i.e. for
      finding a suitable free hole for the next allocation in a block.  vm_map_ram
      allocates space sequentially in a block and, on free, marks pages as dirty,
      so freed space can't be reused anymore.
      
      Actually it would be very interesting to know the real meaning of those
      bitmaps; maybe the implementation was incomplete, etc.
      
      But a long time ago Zhang Yanfei removed alloc_map with these two commits:
      
        mm/vmalloc.c: remove dead code in vb_alloc
           3fcd76e8
        mm/vmalloc.c: remove alloc_map from vmap_block
           b8e748b6
      
      In this patch I replaced dirty_map with two range variables: dirty min and
      max.  These variables store the minimum and maximum positions of dirty space
      in a block, since we only need to know the dirty range, not the exact
      position of dirty pages.
      
      Why was this done?  Several reasons: at first glance it seems that the
      vm_map_ram allocator cares about fragmentation and therefore uses bitmaps
      for finding free holes, but that is not true.  To avoid complexity it seems
      better to use something simple, like min/max range values.  Secondly, the
      code also becomes simpler, without iteration over a bitmap, just comparing
      values with the min and max macros.  Thirdly, the bitmap occupies up to 1024
      bits (4MB is the maximum size of a block).  Here I replaced the whole bitmap
      with two longs.
      
      Finally vm_unmap_aliases should be slightly faster and the whole
      vmap_block structure occupies less memory.
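      A toy sketch of the min/max range idea in plain C (structure and function
      names are invented, not the kernel's):

      struct toy_block {
              unsigned long dirty_min;        /* first dirty page slot */
              unsigned long dirty_max;        /* one past the last dirty page slot */
      };

      /* Instead of setting bits in a bitmap, just widen the [min, max) range. */
      static void mark_dirty(struct toy_block *b, unsigned long start,
                             unsigned long count)
      {
              if (start < b->dirty_min)
                      b->dirty_min = start;
              if (start + count > b->dirty_max)
                      b->dirty_max = start + count;
      }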
      Signed-off-by: Roman Pen <r.peniaev@gmail.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: Rob Jones <rob.jones@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d61bfe8
    • mm/vmalloc: occupy newly allocated vmap block just after allocation · cf725ce2
      Committed by Roman Pen
      The previous implementation allocates a new vmap block and repeats the search
      for a free block from the very beginning, iterating over the CPU free list.
      
      Why can this be better?
      
      1. Allocation can happen on one CPU, but the search can be done on another CPU.
         In the worst case we preallocate as many vmap blocks as there are CPUs on
         the system.
      
      2. In the previous patch I added the newly allocated block to the tail of the
         free list, to avoid quick exhaustion of virtual space and to give a chance
         to occupy blocks which were allocated long ago.  Thus, to find the newly
         allocated block the whole search sequence has to be repeated, which does
         not seem efficient.
      
      In this patch the newly allocated block is occupied right away and the
      address of the virtual space is returned to the caller, so there is no need
      to repeat the search sequence; the allocation job is done.
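      A schematic sketch of the idea in plain C (names invented for illustration):
      the request that triggered the block creation is satisfied before the block
      is ever published on the free list.

      struct toy_vmap_block {
              unsigned long va_start;         /* base address of the block */
              unsigned long free_slots;       /* page slots still available */
              unsigned long next_slot;        /* next slot to hand out */
      };

      static unsigned long new_block_and_occupy(struct toy_vmap_block *vb,
                                                unsigned long va_start,
                                                unsigned long total_slots,
                                                unsigned long npages)
      {
              vb->va_start = va_start;
              vb->next_slot = npages;                 /* occupied right away */
              vb->free_slots = total_slots - npages;
              /* only now would the block be added to the tail of the free list */
              return vb->va_start;
      }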
      Signed-off-by: Roman Pen <r.peniaev@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: Rob Jones <rob.jones@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf725ce2
    • mm/vmalloc: fix possible exhaustion of vmalloc space caused by vm_map_ram allocator · 68ac546f
      Committed by Roman Pen
      Recently I came across high fragmentation in the vm_map_ram allocator:
      vmap_blocks have free space, but new blocks still continue to appear.
      Further investigation showed that certain mapping/unmapping sequences
      can exhaust vmalloc space.  On small 32-bit systems that's not a big
      problem, because purging will be called soon on the first allocation failure
      (alloc_vmap_area), but on 64-bit machines, e.g. x86_64 with 45 bits of
      vmalloc space, that can be a disaster.
      
      1) I came up with a simple allocation sequence, which exhausts virtual
         space very quickly:
      
        while (iters) {
      
                      /* Map/unmap big chunk */
                      vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
                      vm_unmap_ram(vaddr, 16);
      
                      /* Map/unmap small chunks.
                       *
                       * -1 for hole, which should be left at the end of each block
                       * to keep it partially used, with some free space available */
                      for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                              vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                              vm_unmap_ram(vaddr, 8);
                      }
        }
      
      The idea behind it is simple:
      
       1. We have to map a big chunk, e.g. 16 pages.
      
       2. Then we have to occupy the remaining space with smaller chunks, i.e.
          8 pages.  At the end a small hole should remain to keep the block in the
          free list, but not let a big chunk occupy the remaining space.
      
       3. Goto 1 - the allocation request for 16 pages can't be completed (only 8
          slots were left free in the block in step #2), so a new block will be
          allocated and all further requests will land in the newly allocated block.
      
      To have some measurement numbers for all further tests I set up ftrace and
      enabled 4 basic calls in a function profile:
      
              echo vm_map_ram              > /sys/kernel/debug/tracing/set_ftrace_filter;
              echo alloc_vmap_area        >> /sys/kernel/debug/tracing/set_ftrace_filter;
              echo vm_unmap_ram           >> /sys/kernel/debug/tracing/set_ftrace_filter;
              echo free_vmap_block        >> /sys/kernel/debug/tracing/set_ftrace_filter;
      
      So for this scenario I got these results:
      
      BEFORE (all new blocks are put to the head of a free list)
      # cat /sys/kernel/debug/tracing/trace_stat/function0
        Function                               Hit    Time            Avg             s^2
        --------                               ---    ----            ---             ---
        vm_map_ram                          126000    30683.30 us     0.243 us        30819.36 us
        vm_unmap_ram                        126000    22003.24 us     0.174 us        340.886 us
        alloc_vmap_area                       1000    4132.065 us     4.132 us        0.903 us
      
      AFTER (all new blocks are put to the tail of a free list)
      # cat /sys/kernel/debug/tracing/trace_stat/function0
        Function                               Hit    Time            Avg             s^2
        --------                               ---    ----            ---             ---
        vm_map_ram                          126000    28713.13 us     0.227 us        24944.70 us
        vm_unmap_ram                        126000    20403.96 us     0.161 us        1429.872 us
        alloc_vmap_area                        993    3916.795 us     3.944 us        29.370 us
        free_vmap_block                        992    654.157 us      0.659 us        1.273 us
      
      SUMMARY:
      
      The most interesting numbers in those tables are the numbers of block
      allocations and deallocations: the alloc_vmap_area and free_vmap_block
      calls, which show that before the change blocks were not freed, and
      virtual space and physical memory (vmap_block structure allocations,
      etc.) were being consumed.
      
      The average time spent in vm_map_ram/vm_unmap_ram became slightly
      better.  That can be explained by the reasonable number of blocks in the
      free list, which we need to iterate to find a suitable free block.
      
      2) Another scenario is a random allocation:
      
        while (iters) {
      
                      /* Randomly take number from a range [1..32/64] */
                      nr = rand(1, VMAP_MAX_ALLOC);
                      vaddr = vm_map_ram(pages, nr, -1, PAGE_KERNEL);
                      vm_unmap_ram(vaddr, nr);
        }
      
      I chose the Mersenne Twister PRNG to generate a persistent random state to
      guarantee that both runs have the same random sequence.  For each
      vm_map_ram call a random number from [1..32/64] was taken to represent the
      number of pages to map.
      
      I did 10'000 vm_map_ram calls and got these two tables:
      
      BEFORE (all new blocks are put to the head of a free list)
      
      # cat /sys/kernel/debug/tracing/trace_stat/function0
        Function                               Hit    Time            Avg             s^2
        --------                               ---    ----            ---             ---
        vm_map_ram                           10000    10170.01 us     1.017 us        993.609 us
        vm_unmap_ram                         10000    5321.823 us     0.532 us        59.789 us
        alloc_vmap_area                        420    2150.239 us     5.119 us        3.307 us
        free_vmap_block                         37    159.587 us      4.313 us        134.344 us
      
      AFTER (all new blocks are put to the tail of a free list)
      
      # cat /sys/kernel/debug/tracing/trace_stat/function0
        Function                               Hit    Time            Avg             s^2
        --------                               ---    ----            ---             ---
        vm_map_ram                           10000    7745.637 us     0.774 us        395.229 us
        vm_unmap_ram                         10000    5460.573 us     0.546 us        67.187 us
        alloc_vmap_area                        414    2201.650 us     5.317 us        5.591 us
        free_vmap_block                        412    574.421 us      1.394 us        15.138 us
      
      SUMMARY:
      
      The 'BEFORE' table shows that 420 blocks were allocated and only 37 were
      freed.  The remaining 383 blocks are still in the free list, consuming
      virtual space and physical memory.

      The 'AFTER' table shows that 414 blocks were allocated and 412 were really
      freed.  2 blocks remained in the free list.
      
      So fragmentation was dramatically reduced.  Why?  Because when we put a
      newly allocated block at the head, all further requests occupy the new
      block, regardless of the remaining space in other blocks.  In this scenario
      all requests come randomly.  Eventually the remaining free space will be
      less than the requested size, the free list will be iterated and it is
      possible that nothing will be found there - finally a new block will be
      created.  So exhaustion in the random scenario happens for the maximum
      possible allocation size: 32 pages on a 32-bit system and 64 pages on a
      64-bit system.
      
      Also average cost of vm_map_ram was reduced from 1.017 us to 0.774 us.
      Again this can be explained by iteration through smaller list of free
      blocks.
      
      3) The next simple scenario is sequential allocation, where the allocation
         order is increased for each block.  This scenario forces the allocator to
         reach the maximum number of partially free blocks in a free list:
      
        while (iters) {

                /* Populate free list with blocks with remaining space */
                for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
                        nr = VMAP_BBMAP_BITS / (1 << order);

                        /* Leave a hole */
                        nr -= 1;

                        for (i = 0; i < nr; i++) {
                                vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                                vm_unmap_ram(vaddr, (1 << order));
                        }
                }

                /* Completely occupy blocks from a free list */
                for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
                        vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                        vm_unmap_ram(vaddr, (1 << order));
                }
        }
      
      Results which I got:
      
      BEFORE (all new blocks are put to the head of a free list)
      
      # cat /sys/kernel/debug/tracing/trace_stat/function0
        Function                               Hit    Time            Avg             s^2
        --------                               ---    ----            ---             ---
        vm_map_ram                         2032000    399545.2 us     0.196 us        467123.7 us
        vm_unmap_ram                       2032000    363225.7 us     0.178 us        111405.9 us
        alloc_vmap_area                       7001    30627.76 us     4.374 us        495.755 us
        free_vmap_block                       6993    7011.685 us     1.002 us        159.090 us
      
      AFTER (all new blocks are put to the tail of a free list)
      
      # cat /sys/kernel/debug/tracing/trace_stat/function0
        Function                               Hit    Time            Avg             s^2
        --------                               ---    ----            ---             ---
        vm_map_ram                         2032000    394259.7 us     0.194 us        589395.9 us
        vm_unmap_ram                       2032000    292500.7 us     0.143 us        94181.08 us
        alloc_vmap_area                       7000    31103.11 us     4.443 us        703.225 us
        free_vmap_block                       7000    6750.844 us     0.964 us        119.112 us
      
      SUMMARY:
      
      No surprises here, almost all numbers are the same.
      
      While fixing this fragmentation problem I also made some improvements to the
      allocation logic of a new vmap block: occupy the block immediately and get
      rid of the extra search in the free list.

      Also I replaced the dirty bitmap with min/max dirty range values to make the
      logic simpler and slightly faster, since comparing two longs costs less
      than looping through a bitmap.
      
      This patchset raises several questions:
      
       Q: I think the problem you comment on is already known, which is why I wrote
          a comment about it: "it could consume lots of address space through
          fragmentation".  Could you tell me about your situation and the reason
          why it should be avoided?
                                                                           Gioh Kim
      
       A: Indeed, there was a commit 36437638 which adds an explicit comment about
          fragmentation.  But the fragmentation described in that comment is caused
          by mixing long-lived and short-lived objects, when a whole block is pinned
          in memory because some page slots are still in use.  But here I am talking
          about blocks which are free, which nobody uses, and which the allocator
          keeps alive forever, continuously allocating new blocks.
      
       Q: I think that if you put a newly allocated block at the tail of a free
          list, the example below would result in enormous performance degradation.
      
          new block: 1MB (256 pages)
      
          while (iters--) {
            vm_map_ram(3 or something else not dividable for 256) * 85
            vm_unmap_ram(3) * 85
          }
      
          On every iteration it needs a newly allocated block, which is put at the
          tail of the free list, so finding it consumes a large amount of time.
                                                                          Joonsoo Kim
      
       A: The second patch in the current patchset gets rid of the extra search in
          a free list, so the new block will be immediately occupied.

          Also, the scenario above is impossible, because vm_map_ram allocates
          virtual ranges in orders, i.e. 2^n.  Passing 3 to vm_map_ram you will
          allocate 4 slots in a block, and 256 slots (the capacity of a block) is
          of course divisible by 4, so the block will be completely occupied.
      
          But there is a worst case which we can achieve: each free block has a hole
          equal to order size.
      
          The maximum size of allocation is 64 pages for 64-bit system
          (if you try to map more, original alloc_vmap_area will be called).
      
          So the maximum order is 6.  That means that the worst case, before the
          allocator decides to allocate a new block, is to iterate over 7 blocks:
      
          HEAD
          1st block - has 1  page slot  free (order 0)
          2nd block - has 2  page slots free (order 1)
          3rd block - has 4  page slots free (order 2)
          4th block - has 8  page slots free (order 3)
          5th block - has 16 page slots free (order 4)
          6th block - has 32 page slots free (order 5)
          7th block - has 64 page slots free (order 6)
          TAIL
      
          So the worst scenario on 64-bit system is that each CPU queue can have 7
          blocks in a free list.
      
          This can happen if and only if you allocate blocks with increasing order
          (as I did in the function written in the comment of the first patch).
          This is a weird and rare case, but it is still possible.  Afterwards you
          will get 7 blocks in a list.
      
          All further requests should be placed in a newly allocated block or
          satisfied from free slots found in the free list.
          That does not seem dramatically awful.
      
      This patch (of 3):
      
      If a suitable block can't be found, a new block is allocated and put at the
      head of the free list, so on the next iteration this new block will be found
      first.
      
      That's bad, because old blocks in a free list will not get a chance to be
      fully used, thus fragmentation will grow.
      
      Let's consider this simple example:
      
       #1 We have one block in a free list which is partially used, and where only
          one page is free:
      
          HEAD |xxxxxxxxx-| TAIL
                         ^
                         free space for 1 page, order 0
      
       #2 A new allocation request of order 1 (2 pages) comes in; a new block is
          allocated since we do not have free space to complete this request.  The
          new block is put at the head of the free list:
      
          HEAD |----------|xxxxxxxxx-| TAIL
      
       #3 Two pages are occupied in the newly found block:
      
          HEAD |xx--------|xxxxxxxxx-| TAIL
                ^
                two pages mapped here
      
       #4 A new allocation request of order 0 (1 page) comes in.  The block which
          was created in step #2 is located at the beginning of the free list, so
          it will be found first:
      
        HEAD |xxX-------|xxxxxxxxx-| TAIL
                ^                 ^
                page mapped here, but better to use this hole
      
      It is obvious that it is better to complete the request from step #4 using
      the old block, where free space is left, because otherwise fragmentation
      will be highly increased.
      
      But fragmentation is not the only problem.  The worst thing is that I can
      easily create a scenario where the whole vmalloc space is exhausted by
      blocks which are not used, but are already dirty and have several free pages.
      
      Let's consider this function, whose execution should be pinned to one CPU:
      
      static void exhaust_virtual_space(struct page *pages[16], int iters)
      {
              /* Firstly we have to map a big chunk, e.g. 16 pages.
               * Then we have to occupy the remaining space with smaller
               * chunks, i.e. 8 pages. At the end small hole should remain.
               * So at the end of our allocation sequence block looks like
               * this:
               *                XX  big chunk
               * |XXxxxxxxx-|    x  small chunk
               *                 -  hole, which is enough for a small chunk,
               *                    but is not enough for a big chunk
               */
              while (iters--) {
                      int i;
                      void *vaddr;
      
                      /* Map/unmap big chunk */
                      vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
                      vm_unmap_ram(vaddr, 16);
      
                      /* Map/unmap small chunks.
                       *
                       * -1 for hole, which should be left at the end of each block
                       * to keep it partially used, with some free space available */
                      for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                              vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                              vm_unmap_ram(vaddr, 8);
                      }
              }
      }
      
      On every iteration a new block (1MB of vm area in my case) will be
      allocated and then occupied, without any attempt to resolve small
      allocation requests using previously allocated blocks in the free list.
      
      In the case of random allocation (sizes randomly taken from the range
      [1..64] in the 64-bit case or [1..32] in the 32-bit case) the situation is
      the same: new blocks continue to appear if the maximum possible allocation
      size (32 or 64) is passed to the allocator, because none of the remaining
      blocks in the free list have enough free space to complete this allocation
      request.
      
      In summary, if new blocks are put at the head of the free list, virtual
      space will eventually be exhausted.
      
      In the current patch I simply put the newly allocated block at the tail of
      the free list, thus reducing fragmentation and giving a chance to resolve
      allocation requests using older blocks with possible holes left.
      Signed-off-by: Roman Pen <r.peniaev@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: Rob Jones <rob.jones@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68ac546f
  5. 15 Apr 2015, 2 commits
    • mm: change vunmap to tear down huge KVA mappings · b9820d8f
      Committed by Toshi Kani
      Change vunmap_pmd_range() and vunmap_pud_range() to tear down huge KVA
      mappings when they are set.  pud_clear_huge() and pmd_clear_huge() return
      zero when no operation is performed, i.e. a huge page mapping was not used.
      
      These changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined
      on the architecture.
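      A sketch of the shape of the vunmap_pmd_range() change, reconstructed from
      the description above (treat it as an approximation of the diff, not a
      quote of it): a successful pmd_clear_huge() tears down the huge mapping and
      skips the PTE-level walk.

      static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
      {
              pmd_t *pmd;
              unsigned long next;

              pmd = pmd_offset(pud, addr);
              do {
                      next = pmd_addr_end(addr, end);
                      if (pmd_clear_huge(pmd))
                              continue;       /* huge mapping torn down, no PTEs below */
                      if (pmd_none_or_clear_bad(pmd))
                              continue;
                      vunmap_pte_range(pmd, addr, next);
              } while (pmd++, addr = next, addr != end);
      }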
      
      [akpm@linux-foundation.org: use consistent code layout]
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Robert Elliott <Elliott@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9820d8f
    • mm: change __get_vm_area_node() to use fls_long() · 0f616be1
      Committed by Toshi Kani
      ioremap() and its related interfaces are used to create I/O mappings to
      memory-mapped I/O devices.  The mapping sizes of the traditional I/O
      devices are relatively small.  Non-volatile memory (NVM), however, has
      many GB and is going to have TB soon.  It is not very efficient to create
      large I/O mappings with 4KB.
      
      This patchset extends the ioremap() interfaces to transparently create I/O
      mappings with huge pages whenever possible.  ioremap() continues to use
      4KB mappings when a huge page does not fit into a requested range.  There
      is no change necessary to the drivers using ioremap().  A requested
      physical address must be aligned by a huge page size (1GB or 2MB on x86)
      for using huge page mapping, though.  The kernel huge I/O mapping will
      improve performance of NVM and other devices with large memory, and reduce
      the time to create their mappings as well.
      
      On x86, MTRRs can override PAT memory types with a 4KB granularity.  When
      using a huge page, MTRRs can override the memory type of the huge page,
      which may lead to a performance penalty.  The processor can also behave in
      an undefined manner if a huge page is mapped to a memory range that MTRRs
      have mapped with multiple different memory types.  Therefore, the mapping
      code falls back to using a smaller page size toward 4KB when a mapping range
      is covered by non-WB type MTRRs.  The WB type of MTRRs has no effect on
      the PAT memory types.
      
      The patchset introduces HAVE_ARCH_HUGE_VMAP, which indicates that the arch
      supports huge KVA mappings for ioremap().  User may specify a new kernel
      option "nohugeiomap" to disable the huge I/O mapping capability of
      ioremap() when necessary.
      
      Patch 1-4 change common files to support huge I/O mappings.  There is no
      change in the functionalities unless HAVE_ARCH_HUGE_VMAP is defined on the
      architecture of the system.
      
      Patch 5-6 implement the HAVE_ARCH_HUGE_VMAP funcs on x86, and set
      HAVE_ARCH_HUGE_VMAP on x86.
      
      This patch (of 6):
      
      __get_vm_area_node() takes an unsigned long size, which is a 64-bit value on
      a 64-bit kernel.  However, fls(size) simply ignores the upper 32 bits.
      Change it to use fls_long() to handle the size properly.
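      A small numeric illustration of the difference (assuming a 64-bit unsigned
      long; the wrapper function is only for illustration):

      static void fls_example(void)
      {
              unsigned long size = 8UL << 30;         /* an 8GB request */

              int a = fls(size);      /* int argument: only the low 32 bits are seen,
                                         and they are all zero here, so a == 0 */
              int b = fls_long(size); /* full unsigned long: bit 34 is the highest
                                         set bit, so b == 34 */
              (void)a;
              (void)b;
      }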
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Robert Elliott <Elliott@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f616be1
  6. 13 Mar 2015, 1 commit
    • kasan, module, vmalloc: rework shadow allocation for modules · a5af5aa8
      Committed by Andrey Ryabinin
      The current approach to handling shadow memory for modules is broken.

      Shadow memory can be freed only after the memory it shadows is no
      longer used.  vfree() called from interrupt context could use the memory it
      is freeing to store a 'struct llist_node' in it:
      
          void vfree(const void *addr)
          {
          ...
              if (unlikely(in_interrupt())) {
                  struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
                  if (llist_add((struct llist_node *)addr, &p->list))
                          schedule_work(&p->wq);
      
      Later this list node is used in free_work(), which actually frees the memory.
      Currently module_memfree() called in interrupt context will free the shadow
      before freeing the module's memory, which could provoke a kernel crash.
      
      So shadow memory should be freed after the module's memory.  However, such a
      deallocation order could race with kasan_module_alloc() in module_alloc().
      
      Free the shadow right before releasing the vm area.  At this point the
      vfree()'d memory is not used anymore, yet it is not available for other
      allocations.  A new VM_KASAN flag is used to indicate that a vm area has
      dynamically allocated shadow memory, so kasan frees the shadow only if it
      was previously allocated.
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5af5aa8
  7. 14 Feb 2015, 2 commits
    • mm: vmalloc: pass additional vm_flags to __vmalloc_node_range() · cb9e3c29
      Committed by Andrey Ryabinin
      For instrumenting global variables KASan will shadow the memory backing
      modules.  So on module load we will need to allocate memory for the shadow
      and map it at the address in the shadow region that corresponds to the
      address allocated in module_alloc().
      
      __vmalloc_node_range() could be used for this purpose, except that it puts a
      guard hole after the allocated area.  A guard hole in the shadow memory
      would be a problem because at some future point we might need to have shadow
      memory at the address occupied by the guard hole.  So we could fail to
      allocate the shadow for module_alloc().
      
      Now we have the VM_NO_GUARD flag disabling the guard page, so we need to
      pass it into __vmalloc_node_range().  Add a new parameter 'vm_flags' to the
      __vmalloc_node_range() function.
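      A sketch of the kind of call this enables for the module-shadow case
      (argument values and the wrapper are illustrative; the parameter order
      follows the prototype as described, with vm_flags between the protection
      and the node):

      static void *alloc_shadow_sketch(unsigned long shadow_start, size_t shadow_size)
      {
              return __vmalloc_node_range(shadow_size, 1, shadow_start,
                                          shadow_start + shadow_size,
                                          GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
                                          VM_NO_GUARD, NUMA_NO_NODE,
                                          __builtin_return_address(0));
      }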
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb9e3c29
    • mm: vmalloc: add flag preventing guard hole allocation · 71394fe5
      Committed by Andrey Ryabinin
      For instrumenting global variables KASan will shadow the memory backing
      modules.  So on module load we will need to allocate memory for the shadow
      and map it at the address in the shadow region that corresponds to the
      address allocated in module_alloc().
      
      __vmalloc_node_range() could be used for this purpose, except that it puts a
      guard hole after the allocated area.  A guard hole in the shadow memory
      would be a problem because at some future point we might need to have shadow
      memory at the address occupied by the guard hole.  So we could fail to
      allocate the shadow for module_alloc().
      
      Add a new vm_struct flag 'VM_NO_GUARD' indicating that the vm area doesn't
      have a guard hole.
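      One plausible consumer of the flag, sketched from the description (whether
      the patch touches exactly this helper is not shown here): size accounting
      must stop subtracting the guard page when VM_NO_GUARD is set.

      static inline size_t get_vm_area_size(const struct vm_struct *area)
      {
              if (!(area->flags & VM_NO_GUARD))
                      /* return the usable size without the guard page */
                      return area->size - PAGE_SIZE;

              return area->size;
      }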
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71394fe5
  8. 14 Dec 2014, 1 commit
  9. 11 Dec 2014, 1 commit
  10. 10 Oct 2014, 1 commit
  11. 07 Aug 2014, 4 commits
    • mm/vmalloc.c: clean up map_vm_area third argument · f6f8ed47
      Committed by WANG Chao
      Currently map_vm_area() takes (struct page *** pages) as its third argument,
      and after mapping, it moves (*pages) to point to (*pages +
      nr_mapped_pages).
      
      It looks like this kind of increment is useless to its caller these
      days.  The callers don't care about the increments and actually they're
      trying to avoid this by passing another copy to map_vm_area().
      
      The caller can always guarantee all the pages can be mapped into vm_area
      as specified in the first argument, and the caller only cares about whether
      map_vm_area() fails or not.
      
      This patch cleans up the pointer movement in map_vm_area() and updates
      its callers accordingly.
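      The shape of the interface change, shown as a sketch of the prototypes
      rather than the full diff:

      /* Before: the third argument was advanced past the mapped pages. */
      int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages);

      /* After: callers just pass the page array; no pointer is moved. */
      int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages);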
      Signed-off-by: WANG Chao <chaowang@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6f8ed47
    • mm, vmalloc: constify allocation mask · 930f036b
      Committed by David Rientjes
      tmp_mask in the __vmalloc_area_node() iteration never changes so it can
      be moved into function scope and marked with const.  This causes the
      movl and orl to only be done once per call rather than area->nr_pages
      times.
      
      nested_gfp can also be marked const.
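      A simplified sketch of the shape of the change (the helper below stands in
      for the relevant part of __vmalloc_area_node(); it is not the actual diff):
      hoisting the mask out of the loop and marking it const lets it be computed
      once per call.

      static void fill_area_pages(struct vm_struct *area, struct page **pages,
                                  gfp_t gfp_mask)
      {
              /* was: a gfp_t tmp_mask recomputed on every loop iteration */
              const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
              unsigned int i;

              for (i = 0; i < area->nr_pages; i++)
                      pages[i] = alloc_page(alloc_mask);
      }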
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      930f036b
    • mm/vmalloc.c: add a schedule point to vmalloc() · 660654f9
      Committed by Eric Dumazet
      It is not uncommon on busy servers to get stuck for hundreds of ms in
      vmalloc() calls (like file descriptor expansions).
      
      Add a cond_resched() to __vmalloc_area_node() to be gentle to
      other tasks.
      
      [akpm@linux-foundation.org: only do it for __GFP_WAIT, per David]
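      A sketch of the idea under that constraint (the helper stands in for the
      allocation loop in __vmalloc_area_node(); it is not the actual diff):

      static void fill_area_pages_resched(struct vm_struct *area, gfp_t gfp_mask,
                                          gfp_t alloc_mask)
      {
              unsigned int i;

              for (i = 0; i < area->nr_pages; i++) {
                      /* Only yield when the caller is allowed to sleep. */
                      if (gfp_mask & __GFP_WAIT)
                              cond_resched();
                      area->pages[i] = alloc_page(alloc_mask);
              }
      }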
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      660654f9
    • vmalloc: use rcu list iterator to reduce vmap_area_lock contention · 474750ab
      Committed by Joonsoo Kim
      Richard Yao reported a month ago that his system has trouble with
      vmap_area_lock contention during performance analysis via /proc/meminfo.
      Andrew asked why his analysis reads /proc/meminfo so heavily, but he
      didn't answer.
      
        https://lkml.org/lkml/2014/4/10/416
      
      Although I'm not sure whether this is the right usage or not, there is a
      solution that reduces vmap_area_lock contention with no side effect.  That
      is just to use the rcu list iterator in get_vmalloc_info().
      
      RCU can be used in this function because the whole RCU protocol is already
      respected by writers, since Nick Piggin's commit db64fe02 ("mm:
      rewrite vmap layer") back in linux-2.6.28.
      
      Specifically :
         insertions use list_add_rcu(),
         deletions use list_del_rcu() and kfree_rcu().
      
      Note the rb tree is not used from rcu reader (it would not be safe),
      only the vmap_area_list has full RCU protection.
      
      Note that __purge_vmap_area_lazy() already uses this rcu protection.
      
              rcu_read_lock();
              list_for_each_entry_rcu(va, &vmap_area_list, list) {
                      if (va->flags & VM_LAZY_FREE) {
                              if (va->va_start < *start)
                                      *start = va->va_start;
                              if (va->va_end > *end)
                                      *end = va->va_end;
                              nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
                              list_add_tail(&va->purge_list, &valist);
                              va->flags |= VM_LAZY_FREEING;
                              va->flags &= ~VM_LAZY_FREE;
                      }
              }
              rcu_read_unlock();
      
      Peter:
      
      : While rcu list traversal over the vmap_area_list is safe, this may
      : arrive at different results than the spinlocked version. The rcu list
      : traversal version will not be a 'snapshot' of a single, valid instant
      : of the entire vmap_area_list, but rather a potential amalgam of
      : different list states.
      
      Joonsoo:
      
      : Yes, you are right, but I don't think that we should be strict here.
      : Meminfo is already not a 'snapshot' at specific time.  While we try to get
      : certain stats, the other stats can change.  And, although we may arrive at
      : different results than the spinlocked version, the difference would not be
      : large and would not make serious side-effect.
      
      [edumazet@google.com: add more commit description]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: Richard Yao <ryao@gentoo.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      474750ab
  12. 05 Jun 2014, 3 commits
  13. 08 Apr 2014, 2 commits
    • mm/vmalloc.c: enhance vm_map_ram() comment · 36437638
      Committed by Gioh Kim
      vm_map_ram() has a fragmentation problem: it cannot purge a
      chunk (i.e., a 4M address space) if there is a pinned object in that
      address space.  So it can easily consume all of the VMALLOC address space.
      
      We can fix the fragmentation problem by using vmap instead of
      vm_map_ram(), but vmap() is known to be slow compared to vm_map_ram().
      Minchan said vm_map_ram is 5 times faster than vmap in his tests.  So I
      thought we should fix the fragmentation problem of vm_map_ram because our
      proprietary GPU driver uses it heavily.
      
      On second thought, it's not easy, because we would have to reuse freed space
      to solve the problem, and that could mean more IPIs and bitmap operations
      while searching for a hole.  That could undermine the API's goal, which is
      very fast mapping.  And the fragmentation problem wouldn't even show up on
      a 64-bit machine.
      
      Another option is for the user to separate long-lived and short-lived
      objects and use vmap for the long-lived ones but vm_map_ram for the
      short-lived ones.  If we inform the user about this characteristic of
      vm_map_ram, the user can choose one according to the page lifetime.
      
      Let's add some notes for the user.
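      An illustrative split along the lines the comment suggests (the helper is
      hypothetical, not part of the kernel API):

      static void *map_pages_by_lifetime(struct page **pages, unsigned int count,
                                         bool long_lived)
      {
              if (long_lived)
                      return vmap(pages, count, VM_MAP, PAGE_KERNEL);
              /* short-lived, frequently remapped buffers */
              return vm_map_ram(pages, count, -1, PAGE_KERNEL);
      }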
      
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: Gioh Kim <gioh.kim@lge.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36437638
    • mm: use macros from compiler.h instead of __attribute__((...)) · 3b32123d
      Committed by Gideon Israel Dsouza
      To increase compiler portability there is <linux/compiler.h> which
      provides convenience macros for various gcc constructs.  Eg: __weak for
      __attribute__((weak)).  I've replaced all instances of gcc attributes with
      the right macro in the memory management (/mm) subsystem.
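      For instance (a minimal sketch; the hook name is made up):

      /* was: void __attribute__((weak)) example_arch_hook(void) */
      void __weak example_arch_hook(void)
      {
              /* default no-op; an architecture may override this */
      }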
      
      [akpm@linux-foundation.org: while-we're-there consistency tweaks]
      Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b32123d
  14. 28 Jan 2014, 1 commit
    • Revert "mm/vmalloc: interchage the implementation of vmalloc_to_{pfn,page}" · add688fb
      Committed by malc
      Revert commit ece86e22, which was intended as a small performance
      improvement.
      
      Despite the claim that the patch doesn't introduce any functional
      changes, in fact it does.
      
      The "no page" path behaves different now.  Originally, vmalloc_to_page
      might return NULL under some conditions, with new implementation it
      returns pfn_to_page(0) which is not the same as NULL.
      
      Simple test shows the difference.
      
      test.c
      
      #include <linux/kernel.h>
      #include <linux/module.h>
      #include <linux/vmalloc.h>
      #include <linux/mm.h>
      
      int __init myi(void)
      {
      	struct page *p;
      	void *v;
      
      	v = vmalloc(PAGE_SIZE);
      	/* trigger the "no page" path in vmalloc_to_page*/
      	vfree(v);
      
      	p = vmalloc_to_page(v);
      
      	pr_err("expected val = NULL, returned val = %p", p);
      
      	return -EBUSY;
      }
      
      void __exit mye(void)
      {
      
      }
      module_init(myi)
      module_exit(mye)
      
      Before interchange:
      expected val = NULL, returned val =   (null)
      
      After interchange:
      expected val = NULL, returned val = c7ebe000
      Signed-off-by: Vladimir Murzin <murzin.v@gmail.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      add688fb
  15. 22 Jan 2014, 1 commit
  16. 13 Nov 2013, 6 commits
  17. 12 Sep 2013, 3 commits
  18. 10 Jul 2013, 5 commits