1. 14 11月, 2014 10 次提交
    • T
      mem-hotplug: reset node managed pages when hot-adding a new pgdat · f784a3f1
      Tang Chen 提交于
      In free_area_init_core(), zone->managed_pages is set to an approximate
      value for lowmem, and will be adjusted when the bootmem allocator frees
      pages into the buddy system.
      
      But free_area_init_core() is also called by hotadd_new_pgdat() when
      hot-adding memory.  As a result, zone->managed_pages of the newly added
      node's pgdat is set to an approximate value in the very beginning.
      
      Even if the memory on that node has node been onlined,
      /sys/device/system/node/nodeXXX/meminfo has wrong value:
      
        hot-add node2 (memory not onlined)
        cat /sys/device/system/node/node2/meminfo
        Node 2 MemTotal:       33554432 kB
        Node 2 MemFree:               0 kB
        Node 2 MemUsed:        33554432 kB
        Node 2 Active:                0 kB
      
      This patch fixes this problem by reset node managed pages to 0 after
      hot-adding a new node.
      
      1. Move reset_managed_pages_done from reset_node_managed_pages() to
         reset_all_zones_managed_pages()
      2. Make reset_node_managed_pages() non-static
      3. Call reset_node_managed_pages() in hotadd_new_pgdat() after pgdat
         is initialized
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>	[3.16+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f784a3f1
    • J
      mm/debug-pagealloc: correct freepage accounting and order resetting · 57cbc87e
      Joonsoo Kim 提交于
      One thing I did in this patch is fixing freepage accounting.  If we
      clear guard page and link it onto isolate buddy list, we should not
      increase freepage count.  This patch adds conditional branch to skip
      counting in this case.  Without this patch, this overcounting happens
      frequently if guard order is set and CMA is used.
      
      Another thing fixed in this patch is the target to reset order.  In
      __free_one_page(), we check the buddy page whether it is a guard page or
      not.  And, if so, we should clear guard attribute on the buddy page and
      reset order of it to 0.  But, current code resets original page's order
      rather than buddy one's.  Maybe, this doesn't have any problem, because
      whole merged page's order will be re-assigned soon.  But, it is better
      to correct code.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57cbc87e
    • V
      mm, compaction: prevent infinite loop in compact_zone · 1d5bfe1f
      Vlastimil Babka 提交于
      Several people have reported occasionally seeing processes stuck in
      compact_zone(), even triggering soft lockups, in 3.18-rc2+.
      
      Testing a revert of commit e14c720e ("mm, compaction: remember
      position within pageblock in free pages scanner") fixed the issue,
      although the stuck processes do not appear to involve the free scanner.
      
      Finally, by code inspection, the bug was found in isolate_migratepages()
      which uses a slightly different condition to detect if the migration and
      free scanners have met, than compact_finished().  That has not been a
      problem until commit e14c720e allowed the free scanner position
      between individual invocations to be in the middle of a pageblock.
      
      In a relatively rare case, the migration scanner position can end up at
      the beginning of a pageblock, with the free scanner position in the
      middle of the same pageblock.  If it's the migration scanner's turn,
      isolate_migratepages() exits immediately (without updating the
      position), while compact_finished() decides to continue compaction,
      resulting in a potentially infinite loop.  The system can recover only
      if another process creates enough high-order pages to make the watermark
      checks in compact_finished() pass.
      
      This patch fixes the immediate problem by bumping the migration
      scanner's position to meet the free scanner in isolate_migratepages(),
      when both are within the same pageblock.  This causes compact_finished()
      to terminate properly.  A more robust check in compact_finished() is
      planned as a cleanup for better future maintainability.
      
      Fixes: e14c720e ("mm, compaction: remember position within pageblock in free pages scanner)
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: NP. Christeas <xrg@linux.gr>
      Tested-by: NP. Christeas <xrg@linux.gr>
      Link: http://marc.info/?l=linux-mm&m=141508604232522&w=2Reported-by: NNorbert Preining <preining@logic.at>
      Tested-by: NNorbert Preining <preining@logic.at>
      Link: https://lkml.org/lkml/2014/11/4/904Reported-by: NPavel Machek <pavel@ucw.cz>
      Link: https://lkml.org/lkml/2014/11/7/164
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d5bfe1f
    • M
      mm: alloc_contig_range: demote pages busy message from warn to info · dae803e1
      Michal Nazarewicz 提交于
      Having test_pages_isolated failure message as a warning confuses users
      into thinking that it is more serious than it really is.  In reality, if
      called via CMA, allocation will be retried so a single
      test_pages_isolated failure does not prevent allocation from succeeding.
      
      Demote the warning message to an info message and reformat it such that
      the text "failed" does not appear and instead a less worrying "PFNS
      busy" is used.
      
      This message is trivially reproducible on a 10GB x86 machine on 3.16.y
      kernels configured with CONFIG_DMA_CMA.
      Signed-off-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dae803e1
    • J
      mm/slab: fix unalignment problem on Malta with EVA due to slab merge · 95069ac8
      Joonsoo Kim 提交于
      Unlike SLUB, sometimes, object isn't started at the beginning of the
      slab in SLAB.  This causes the unalignment problem after slab merging is
      supported by commit 12220dea ("mm/slab: support slab merge").
      
      Following is the report from Markos that fail to boot on Malta with EVA.
      
          Calibrating delay loop... 19.86 BogoMIPS (lpj=99328)
          pid_max: default: 32768 minimum: 301
          Mount-cache hash table entries: 4096 (order: 0, 16384 bytes)
          Mountpoint-cache hash table entries: 4096 (order: 0, 16384 bytes)
          Kernel bug detected[#1]:
          CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.17.0-05639-g12220dea #1631
          task: 1f04f5d8 ti: 1f050000 task.ti: 1f050000
          epc   : 80141190 alloc_unbound_pwq+0x234/0x304
              Not tainted
          ra    : 80141184 alloc_unbound_pwq+0x228/0x304
          Process swapper/0 (pid: 1, threadinfo=1f050000, task=1f04f5d8, tls=00000000)
          Call Trace:
            alloc_unbound_pwq+0x234/0x304
            apply_workqueue_attrs+0x11c/0x294
            __alloc_workqueue_key+0x23c/0x470
            init_workqueues+0x320/0x400
            do_one_initcall+0xe8/0x23c
            kernel_init_freeable+0x9c/0x224
            kernel_init+0x10/0x100
            ret_from_kernel_thread+0x14/0x1c
          [ end trace cb88537fdc8fa200 ]
          Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
      
      alloc_unbound_pwq() allocates slab object from pool_workqueue.  This
      kmem_cache requires 256 bytes alignment, but, current merging code
      doesn't honor that, and merge it with kmalloc-256.  kmalloc-256 requires
      only cacheline size alignment so that above failure occurs.  However, in
      x86, kmalloc-256 is luckily aligned in 256 bytes, so the problem didn't
      happen on it.
      
      To fix this problem, this patch introduces alignment mismatch check in
      find_mergeable().  This will fix the problem.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: NMarkos Chandras <Markos.Chandras@imgtec.com>
      Tested-by: NMarkos Chandras <Markos.Chandras@imgtec.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95069ac8
    • J
      mm/page_alloc: restrict max order of merging on isolated pageblock · 3c605096
      Joonsoo Kim 提交于
      Current pageblock isolation logic could isolate each pageblock
      individually.  This causes freepage accounting problem if freepage with
      pageblock order on isolate pageblock is merged with other freepage on
      normal pageblock.  We can prevent merging by restricting max order of
      merging to pageblock order if freepage is on isolate pageblock.
      
      A side-effect of this change is that there could be non-merged buddy
      freepage even if finishing pageblock isolation, because undoing
      pageblock isolation is just to move freepage from isolate buddy list to
      normal buddy list rather than to consider merging.  So, the patch also
      makes undoing pageblock isolation consider freepage merge.  When
      un-isolation, freepage with more than pageblock order and it's buddy are
      checked.  If they are on normal pageblock, instead of just moving, we
      isolate the freepage and free it in order to get merged.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3c605096
    • J
      mm/page_alloc: move freepage counting logic to __free_one_page() · 8f82b55d
      Joonsoo Kim 提交于
      All the caller of __free_one_page() has similar freepage counting logic,
      so we can move it to __free_one_page().  This reduce line of code and
      help future maintenance.
      
      This is also preparation step for "mm/page_alloc: restrict max order of
      merging on isolated pageblock" which fix the freepage counting problem
      on freepage with more than pageblock order.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f82b55d
    • J
      mm/page_alloc: add freepage on isolate pageblock to correct buddy list · 51bb1a40
      Joonsoo Kim 提交于
      In free_pcppages_bulk(), we use cached migratetype of freepage to
      determine type of buddy list where freepage will be added.  This
      information is stored when freepage is added to pcp list, so if
      isolation of pageblock of this freepage begins after storing, this
      cached information could be stale.  In other words, it has original
      migratetype rather than MIGRATE_ISOLATE.
      
      There are two problems caused by this stale information.
      
      One is that we can't keep these freepages from being allocated.
      Although this pageblock is isolated, freepage will be added to normal
      buddy list so that it could be allocated without any restriction.  And
      the other problem is incorrect freepage accounting.  Freepages on
      isolate pageblock should not be counted for number of freepage.
      
      Following is the code snippet in free_pcppages_bulk().
      
          /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
          __free_one_page(page, page_to_pfn(page), zone, 0, mt);
          trace_mm_page_pcpu_drain(page, 0, mt);
          if (likely(!is_migrate_isolate_page(page))) {
              __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
              if (is_migrate_cma(mt))
                  __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
          }
      
      As you can see above snippet, current code already handle second
      problem, incorrect freepage accounting, by re-fetching pageblock
      migratetype through is_migrate_isolate_page(page).
      
      But, because this re-fetched information isn't used for
      __free_one_page(), first problem would not be solved.  This patch try to
      solve this situation to re-fetch pageblock migratetype before
      __free_one_page() and to use it for __free_one_page().
      
      In addition to move up position of this re-fetch, this patch use
      optimization technique, re-fetching migratetype only if there is isolate
      pageblock.  Pageblock isolation is rare event, so we can avoid
      re-fetching in common case with this optimization.
      
      This patch also correct migratetype of the tracepoint output.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51bb1a40
    • J
      mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype · ad53f92e
      Joonsoo Kim 提交于
      Before describing bugs itself, I first explain definition of freepage.
      
       1. pages on buddy list are counted as freepage.
       2. pages on isolate migratetype buddy list are *not* counted as freepage.
       3. pages on cma buddy list are counted as CMA freepage, too.
      
      Now, I describe problems and related patch.
      
      Patch 1: There is race conditions on getting pageblock migratetype that
      it results in misplacement of freepages on buddy list, incorrect
      freepage count and un-availability of freepage.
      
      Patch 2: Freepages on pcp list could have stale cached information to
      determine migratetype of buddy list to go.  This causes misplacement of
      freepages on buddy list and incorrect freepage count.
      
      Patch 4: Merging between freepages on different migratetype of
      pageblocks will cause freepages accouting problem.  This patch fixes it.
      
      Without patchset [3], above problem doesn't happens on my CMA allocation
      test, because CMA reserved pages aren't used at all.  So there is no
      chance for above race.
      
      With patchset [3], I did simple CMA allocation test and get below
      result:
      
       - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
       - run kernel build (make -j16) on background
       - 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
       - Result: more than 5000 freepage count are missed
      
      With patchset [3] and this patchset, I found that no freepage count are
      missed so that I conclude that problems are solved.
      
      On my simple memory offlining test, these problems also occur on that
      environment, too.
      
      This patch (of 4):
      
      There are two paths to reach core free function of buddy allocator,
      __free_one_page(), one is free_one_page()->__free_one_page() and the
      other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page().
      Each paths has race condition causing serious problems.  At first, this
      patch is focused on first type of freepath.  And then, following patch
      will solve the problem in second type of freepath.
      
      In the first type of freepath, we got migratetype of freeing page
      without holding the zone lock, so it could be racy.  There are two cases
      of this race.
      
       1. pages are added to isolate buddy list after restoring orignal
          migratetype
      
          CPU1                                   CPU2
      
          get migratetype => return MIGRATE_ISOLATE
          call free_one_page() with MIGRATE_ISOLATE
      
                                      grab the zone lock
                                      unisolate pageblock
                                      release the zone lock
      
          grab the zone lock
          call __free_one_page() with MIGRATE_ISOLATE
          freepage go into isolate buddy list,
          although pageblock is already unisolated
      
      This may cause two problems.  One is that we can't use this page anymore
      until next isolation attempt of this pageblock, because freepage is on
      isolate buddy list.  The other is that freepage accouting could be wrong
      due to merging between different buddy list.  Freepages on isolate buddy
      list aren't counted as freepage, but ones on normal buddy list are
      counted as freepage.  If merge happens, buddy freepage on normal buddy
      list is inevitably moved to isolate buddy list without any consideration
      of freepage accouting so it could be incorrect.
      
       2. pages are added to normal buddy list while pageblock is isolated.
          It is similar with above case.
      
      This also may cause two problems.  One is that we can't keep these
      freepages from being allocated.  Although this pageblock is isolated,
      freepage would be added to normal buddy list so that it could be
      allocated without any restriction.  And the other problem is same as
      case 1, that it, incorrect freepage accouting.
      
      This race condition would be prevented by checking migratetype again
      with holding the zone lock.  Because it is somewhat heavy operation and
      it isn't needed in common case, we want to avoid rechecking as much as
      possible.  So this patch introduce new variable, nr_isolate_pageblock in
      struct zone to check if there is isolated pageblock.  With this, we can
      avoid to re-check migratetype in common case and do it only if there is
      isolated pageblock or migratetype is MIGRATE_ISOLATE.  This solve above
      mentioned problems.
      
      Changes from v3:
      Add one more check in free_one_page() that checks whether migratetype is
      MIGRATE_ISOLATE or not. Without this, abovementioned case 1 could happens.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad53f92e
    • J
      mm/compaction: skip the range until proper target pageblock is met · 58420016
      Joonsoo Kim 提交于
      Commit 7d49d886 ("mm, compaction: reduce zone checking frequency in
      the migration scanner") has a side-effect that changes the iteration
      range calculation.  Before the change, block_end_pfn is calculated using
      start_pfn, but now it blindly adds pageblock_nr_pages to the previous
      value.
      
      This causes the problem that isolation_start_pfn is larger than
      block_end_pfn when we isolate the page with more than pageblock order.
      In this case, isolation would fail due to an invalid range parameter.
      
      To prevent this, this patch implements skipping the range until a proper
      target pageblock is met.  Without this patch, CMA with more than
      pageblock order always fails but with this patch it will succeed.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58420016
  2. 07 11月, 2014 1 次提交
  3. 30 10月, 2014 11 次提交
    • J
      mm: Remove false WARN_ON from pagecache_isize_extended() · f55fefd1
      Jan Kara 提交于
      The WARN_ON checking whether i_mutex is held in
      pagecache_isize_extended() was wrong because some filesystems (e.g.
      XFS) use different locks for serialization of truncates / writes. So
      just remove the check.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      f55fefd1
    • K
      mm/balloon_compaction: fix deflation when compaction is disabled · 4d88e6f7
      Konstantin Khlebnikov 提交于
      If CONFIG_BALLOON_COMPACTION=n balloon_page_insert() does not link pages
      with balloon and doesn't set PagePrivate flag, as a result
      balloon_page_dequeue() cannot get any pages because it thinks that all
      of them are isolated.  Without balloon compaction nobody can isolate
      ballooned pages.  It's safe to remove this check.
      
      Fixes: d6d86c0a ("mm/balloon_compaction: redesign ballooned pages management").
      Signed-off-by: NKonstantin Khlebnikov <k.khlebnikov@samsung.com>
      Reported-by: NMatt Mullins <mmullins@mmlx.us>
      Cc: <stable@vger.kernel.org>	[3.17]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d88e6f7
    • M
      mm/slab_common: don't check for duplicate cache names · 8aba7e0a
      Mikulas Patocka 提交于
      The SLUB cache merges caches with the same size and alignment and there
      was long standing bug with this behavior:
      
       - create the cache named "foo"
       - create the cache named "bar" (which is merged with "foo")
       - delete the cache named "foo" (but it stays allocated because "bar"
         uses it)
       - create the cache named "foo" again - it fails because the name "foo"
         is already used
      
      That bug was fixed in commit 69461747 ("slab_common: fix the check
      for duplicate slab names") by not warning on duplicate cache names when
      the SLUB subsystem is used.
      
      Recently, cache merging was implemented the with SLAB subsystem too, in
      12220dea ("mm/slab: support slab merge")).  Therefore we need stop
      checking for duplicate names even for the SLAB subsystem.
      
      This patch fixes the bug by removing the check.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8aba7e0a
    • J
      mm: rmap: split out page_remove_file_rmap() · 8186eb6a
      Johannes Weiner 提交于
      page_remove_rmap() has too many branches on PageAnon() and is hard to
      follow.  Move the file part into a separate function.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8186eb6a
    • J
      mm: memcontrol: fix missed end-writeback page accounting · d7365e78
      Johannes Weiner 提交于
      Commit 0a31bc97 ("mm: memcontrol: rewrite uncharge API") changed
      page migration to uncharge the old page right away.  The page is locked,
      unmapped, truncated, and off the LRU, but it could race with writeback
      ending, which then doesn't unaccount the page properly:
      
      test_clear_page_writeback()              migration
                                                 wait_on_page_writeback()
        TestClearPageWriteback()
                                                 mem_cgroup_migrate()
                                                   clear PCG_USED
        mem_cgroup_update_page_stat()
          if (PageCgroupUsed(pc))
            decrease memcg pages under writeback
      
        release pc->mem_cgroup->move_lock
      
      The per-page statistics interface is heavily optimized to avoid a
      function call and a lookup_page_cgroup() in the file unmap fast path,
      which means it doesn't verify whether a page is still charged before
      clearing PageWriteback() and it has to do it in the stat update later.
      
      Rework it so that it looks up the page's memcg once at the beginning of
      the transaction and then uses it throughout.  The charge will be
      verified before clearing PageWriteback() and migration can't uncharge
      the page as long as that is still set.  The RCU lock will protect the
      memcg past uncharge.
      
      As far as losing the optimization goes, the following test results are
      from a microbenchmark that maps, faults, and unmaps a 4GB sparse file
      three times in a nested fashion, so that there are two negative passes
      that don't account but still go through the new transaction overhead.
      There is no actual difference:
      
       old:     33.195102545 seconds time elapsed       ( +-  0.01% )
       new:     33.199231369 seconds time elapsed       ( +-  0.03% )
      
      The time spent in page_remove_rmap()'s callees still adds up to the
      same, but the time spent in the function itself seems reduced:
      
           # Children      Self  Command        Shared Object       Symbol
       old:     0.12%     0.11%  filemapstress  [kernel.kallsyms]   [k] page_remove_rmap
       new:     0.12%     0.08%  filemapstress  [kernel.kallsyms]   [k] page_remove_rmap
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: <stable@vger.kernel.org>	[3.17.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7365e78
    • J
      mm: page-writeback: inline account_page_dirtied() into single caller · 3a3c02ec
      Johannes Weiner 提交于
      A follow-up patch would have changed the call signature.  To save the
      trouble, just fold it instead.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: <stable@vger.kernel.org>	[3.17.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a3c02ec
    • Y
      memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node() · 35dca71c
      Yasuaki Ishimatsu 提交于
      When hot adding the same memory after hot removal, the following
      messages are shown:
      
        WARNING: CPU: 20 PID: 6 at mm/page_alloc.c:4968 free_area_init_node+0x3fe/0x426()
        ...
        Call Trace:
          dump_stack+0x46/0x58
          warn_slowpath_common+0x81/0xa0
          warn_slowpath_null+0x1a/0x20
          free_area_init_node+0x3fe/0x426
          hotadd_new_pgdat+0x90/0x110
          add_memory+0xd4/0x200
          acpi_memory_device_add+0x1aa/0x289
          acpi_bus_attach+0xfd/0x204
          acpi_bus_attach+0x178/0x204
          acpi_bus_scan+0x6a/0x90
          acpi_device_hotplug+0xe8/0x418
          acpi_hotplug_work_fn+0x1f/0x2b
          process_one_work+0x14e/0x3f0
          worker_thread+0x11b/0x510
          kthread+0xe1/0x100
          ret_from_fork+0x7c/0xb0
      
      The detaled explanation is as follows:
      
      When hot removing memory, pgdat is set to 0 in try_offline_node().  But
      if the pgdat is allocated by bootmem allocator, the clearing step is
      skipped.
      
      And when hot adding the same memory, the uninitialized pgdat is reused.
      But free_area_init_node() checks wether pgdat is set to zero.  As a
      result, free_area_init_node() hits WARN_ON().
      
      This patch clears pgdat which is allocated by bootmem allocator in
      try_offline_node().
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35dca71c
    • D
      mm, thp: fix collapsing of hugepages on madvise · 6d50e60c
      David Rientjes 提交于
      If an anonymous mapping is not allowed to fault thp memory and then
      madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
      collapse this memory into thp memory.
      
      This occurs because the madvise(2) handler for thp, hugepage_madvise(),
      clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
      until the final action of madvise_behavior().  This causes the
      khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
      the vma had previously had VM_NOHUGEPAGE set.
      
      Fix this by passing the correct vma flags to the khugepaged mm slot
      handler.  There's no chance khugepaged can run on this vma until after
      madvise_behavior() returns since we hold mm->mmap_sem.
      
      It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
      in hugepage_advise(), but I didn't want to introduce special case
      behavior into madvise_behavior().  I think it's best to just let it
      always set vma->vm_flags itself.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NSuleiman Souhlal <suleiman@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d50e60c
    • Y
      mm: free compound page with correct order · 5ddacbe9
      Yu Zhao 提交于
      Compound page should be freed by put_page() or free_pages() with correct
      order.  Not doing so will cause tail pages leaked.
      
      The compound order can be obtained by compound_order() or use
      HPAGE_PMD_ORDER in our case.  Some people would argue the latter is
      faster but I prefer the former which is more general.
      
      This bug was observed not just on our servers (the worst case we saw is
      11G leaked on a 48G machine) but also on our workstations running Ubuntu
      based distro.
      
        $ cat /proc/vmstat  | grep thp_zero_page_alloc
        thp_zero_page_alloc 55
        thp_zero_page_alloc_failed 0
      
      This means there is (thp_zero_page_alloc - 1) * (2M - 4K) memory leaked.
      
      Fixes: 97ae1749 ("thp: implement refcounting for huge zero page")
      Signed-off-by: NYu Zhao <yuzhao@google.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: <stable@vger.kernel.org>	[3.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ddacbe9
    • J
      mm/compaction.c: avoid premature range skip in isolate_migratepages_range · 6ea41c0c
      Joonsoo Kim 提交于
      Commit edc2ca61 ("mm, compaction: move pageblock checks up from
      isolate_migratepages_range()") commonizes isolate_migratepages variants
      and make them use isolate_migratepages_block().
      
      isolate_migratepages_block() could stop the execution when enough pages
      are isolated, but, there is no code in isolate_migratepages_range() to
      handle this case.  In the result, even if isolate_migratepages_block()
      returns prematurely without checking all pages in the range,
      
      isolate_migratepages_block() is called repeately on the following
      pageblock and some pages in the previous range are skipped to check.
      Then, CMA is failed frequently due to this fact.
      
      To fix this problem, this patch let isolate_migratepages_range() know
      the situation that enough pages are isolated and stop the isolation in
      that case.
      
      Note that isolate_migratepages() has no such problem, because, it always
      stops the isolation after just one call of isolate_migratepages_block().
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ea41c0c
    • W
      cgroup/kmemleak: add kmemleak_free() for cgroup deallocations. · 401507d6
      Wang Nan 提交于
      Commit ff7ee93f ("cgroup/kmemleak: Annotate alloc_page() for cgroup
      allocations") introduces kmemleak_alloc() for alloc_page_cgroup(), but
      corresponding kmemleak_free() is missing, which makes kmemleak be
      wrongly disabled after memory offlining.  Log is pasted at the end of
      this commit message.
      
      This patch add kmemleak_free() into free_page_cgroup().  During page
      offlining, this patch removes corresponding entries in kmemleak rbtree.
      After that, the freed memory can be allocated again by other subsystems
      without killing kmemleak.
      
        bash # for x in 1 2 3 4; do echo offline > /sys/devices/system/memory/memory$x/state ; sleep 1; done ; dmesg | grep leak
      
        Offlined Pages 32768
        kmemleak: Cannot insert 0xffff880016969000 into the object search tree (overlaps existing)
        CPU: 0 PID: 412 Comm: sleep Not tainted 3.17.0-rc5+ #86
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x46/0x58
          create_object+0x266/0x2c0
          kmemleak_alloc+0x26/0x50
          kmem_cache_alloc+0xd3/0x160
          __sigqueue_alloc+0x49/0xd0
          __send_signal+0xcb/0x410
          send_signal+0x45/0x90
          __group_send_sig_info+0x13/0x20
          do_notify_parent+0x1bb/0x260
          do_exit+0x767/0xa40
          do_group_exit+0x44/0xa0
          SyS_exit_group+0x17/0x20
          system_call_fastpath+0x16/0x1b
      
        kmemleak: Kernel memory leak detector disabled
        kmemleak: Object 0xffff880016900000 (size 524288):
        kmemleak:   comm "swapper/0", pid 0, jiffies 4294667296
        kmemleak:   min_count = 0
        kmemleak:   count = 0
        kmemleak:   flags = 0x1
        kmemleak:   checksum = 0
        kmemleak:   backtrace:
              log_early+0x63/0x77
              kmemleak_alloc+0x4b/0x50
              init_section_page_cgroup+0x7f/0xf5
              page_cgroup_init+0xc5/0xd0
              start_kernel+0x333/0x408
              x86_64_start_reservations+0x2a/0x2c
              x86_64_start_kernel+0xf5/0xfc
      
      Fixes: ff7ee93f (cgroup/kmemleak: Annotate alloc_page() for cgroup allocations)
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      401507d6
  4. 29 10月, 2014 1 次提交
    • W
      zap_pte_range: update addr when forcing flush after TLB batching faiure · ce9ec37b
      Will Deacon 提交于
      When unmapping a range of pages in zap_pte_range, the page being
      unmapped is added to an mmu_gather_batch structure for asynchronous
      freeing. If we run out of space in the batch structure before the range
      has been completely unmapped, then we break out of the loop, force a
      TLB flush and free the pages that we have batched so far. If there are
      further pages to unmap, then we resume the loop where we left off.
      
      Unfortunately, we forget to update addr when we break out of the loop,
      which causes us to truncate the range being invalidated as the end
      address is exclusive. When we re-enter the loop at the same address, the
      page has already been freed and the pte_present test will fail, meaning
      that we do not reconsider the address for invalidation.
      
      This patch fixes the problem by incrementing addr by the PAGE_SIZE
      before breaking out of the loop on batch failure.
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce9ec37b
  5. 27 10月, 2014 4 次提交
  6. 24 10月, 2014 1 次提交
    • M
      shmem: support RENAME_WHITEOUT · 46fdb794
      Miklos Szeredi 提交于
      Allocate a dentry, initialize it with a whiteout and hash it in the place
      of the old dentry.  Later the old dentry will be moved away and the
      whiteout will remain.
      
      i_mutex protects agains concurrent readdir.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      46fdb794
  7. 22 10月, 2014 1 次提交
    • M
      OOM, PM: OOM killed task shouldn't escape PM suspend · 5695be14
      Michal Hocko 提交于
      PM freezer relies on having all tasks frozen by the time devices are
      getting frozen so that no task will touch them while they are getting
      frozen. But OOM killer is allowed to kill an already frozen task in
      order to handle OOM situtation. In order to protect from late wake ups
      OOM killer is disabled after all tasks are frozen. This, however, still
      keeps a window open when a killed task didn't manage to die by the time
      freeze_processes finishes.
      
      Reduce the race window by checking all tasks after OOM killer has been
      disabled. This is still not race free completely unfortunately because
      oom_killer_disable cannot stop an already ongoing OOM killer so a task
      might still wake up from the fridge and get killed without
      freeze_processes noticing. Full synchronization of OOM and freezer is,
      however, too heavy weight for this highly unlikely case.
      
      Introduce and check oom_kills counter which gets incremented early when
      the allocator enters __alloc_pages_may_oom path and only check all the
      tasks if the counter changes during the freezing attempt. The counter
      is updated so early to reduce the race window since allocator checked
      oom_killer_disabled which is set by PM-freezing code. A false positive
      will push the PM-freezer into a slow path but that is not a big deal.
      
      Changes since v1
      - push the re-check loop out of freeze_processes into
        check_frozen_processes and invert the condition to make the code more
        readable as per Rafael
      
      Fixes: f660daac (oom: thaw threads if oom killed thread is frozen before deferring)
      Cc: 3.2+ <stable@vger.kernel.org> # 3.2+
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      5695be14
  8. 14 10月, 2014 4 次提交
    • P
      mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared · 64e45507
      Peter Feiner 提交于
      For VMAs that don't want write notifications, PTEs created for read faults
      have their write bit set.  If the read fault happens after VM_SOFTDIRTY is
      cleared, then the PTE's softdirty bit will remain clear after subsequent
      writes.
      
      Here's a simple code snippet to demonstrate the bug:
      
        char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
        system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
        assert(*m == '\0');     /* new PTE allows write access */
        assert(!soft_dirty(x));
        *m = 'x';               /* should dirty the page */
        assert(soft_dirty(x));  /* fails */
      
      With this patch, write notifications are enabled when VM_SOFTDIRTY is
      cleared.  Furthermore, to avoid unnecessary faults, write notifications
      are disabled when VM_SOFTDIRTY is set.
      
      As a side effect of enabling and disabling write notifications with
      care, this patch fixes a bug in mprotect where vm_page_prot bits set by
      drivers were zapped on mprotect.  An analogous bug was fixed in mmap by
      commit c9d0bf24 ("mm: uncached vma support with writenotify").
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Reported-by: NPeter Feiner <pfeiner@google.com>
      Suggested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64e45507
    • M
      drivers: dma-contiguous: add initialization from device tree · de9e14ee
      Marek Szyprowski 提交于
      Add a function to create CMA region from previously reserved memory and
      add support for handling 'shared-dma-pool' reserved-memory device tree
      nodes.
      
      Based on previous code provided by Josh Cartwright <joshc@codeaurora.org>
      Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Josh Cartwright <joshc@codeaurora.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de9e14ee
    • W
      mm/cma: fix cma bitmap aligned mask computing · 68faed63
      Weijie Yang 提交于
      The current cma bitmap aligned mask computation is incorrect.  It could
      cause an unexpected alignment when using cma_alloc() if the wanted align
      order is larger than cma->order_per_bit.
      
      Take kvm for example (PAGE_SHIFT = 12), kvm_cma->order_per_bit is set to
      6.  When kvm_alloc_rma() tries to alloc kvm_rma_pages, it will use 15 as
      the expected align value.  After using the current implementation however,
      we get 0 as cma bitmap aligned mask other than 511.
      
      This patch fixes the cma bitmap aligned mask calculation.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[3.17]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68faed63
    • J
      mm/slab: fix unaligned access on sparc64 · 85c9f4b0
      Joonsoo Kim 提交于
      Commit bf0dea23 ("mm/slab: use percpu allocator for cpu cache")
      changed the allocation method for cpu cache array from slab allocator to
      percpu allocator.  Alignment should be provided for aligned memory in
      percpu allocator case, but, that commit mistakenly set this alignment to
      0.  So, percpu allocator returns unaligned memory address.  It doesn't
      cause any problem on x86 which permits unaligned access, but, it causes
      the problem on sparc64 which needs strong guarantee of alignment.
      
      Following bug report is reported from David Miller.
      
        I'm getting tons of the following on sparc64:
      
        [603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
        [603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
        ...
        [603970.554394] log_unaligned: 333 callbacks suppressed
        ...
      
      This patch provides a proper alignment parameter when allocating cpu
      cache to fix this unaligned memory access problem on sparc64.
      Reported-by: NDavid Miller <davem@davemloft.net>
      Tested-by: NDavid Miller <davem@davemloft.net>
      Tested-by: NMeelis Roos <mroos@linux.ee>
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      85c9f4b0
  9. 11 10月, 2014 1 次提交
  10. 10 10月, 2014 6 次提交
    • C
      zbud: avoid accessing last unused freelist · f203c3b3
      Chao Yu 提交于
      For now, there are NCHUNKS of 64 freelists in zbud_pool, the last
      unbuddied[63] freelist linked with all zbud pages which have free chunks
      of 63.  Calculating according to context of num_free_chunks(), our max
      chunk number of unbuddied zbud page is 62, so none of zbud pages will be
      added/removed in last freelist, but still we will try to find an unbuddied
      zbud page in the last unused freelist, it is unneeded.
      
      This patch redefines NCHUNKS to 63 as free chunk number in one zbud page,
      hence we can decrease size of zpool and avoid accessing the last unused
      freelist whenever failing to allocate zbud from freelist in zbud_alloc.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f203c3b3
    • D
      zsmalloc: simplify init_zspage free obj linking · 5538c562
      Dan Streetman 提交于
      Change zsmalloc init_zspage() logic to iterate through each object on each
      of its pages, checking the offset to verify the object is on the current
      page before linking it into the zspage.
      
      The current zsmalloc init_zspage free object linking code has logic that
      relies on there only being one page per zspage when PAGE_SIZE is a
      multiple of class->size.  It calculates the number of objects for the
      current page, and iterates through all of them plus one, to account for
      the assumed partial object at the end of the page.  While this currently
      works, the logic can be simplified to just link the object at each
      successive offset until the offset is larger than PAGE_SIZE, which does
      not rely on PAGE_SIZE being a multiple of class->size.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5538c562
    • W
      mm/zsmalloc.c: correct comment for fullness group computation · 6dd9737e
      Wang Sheng-Hui 提交于
      The letter 'f' in "n <= N/f" stands for fullness_threshold_frac, not
      1/fullness_threshold_frac.
      Signed-off-by: NWang Sheng-Hui <shhuiw@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6dd9737e
    • M
      zsmalloc: change return value unit of zs_get_total_size_bytes · 722cdc17
      Minchan Kim 提交于
      zs_get_total_size_bytes returns a amount of memory zsmalloc consumed with
      *byte unit* but zsmalloc operates *page unit* rather than byte unit so
      let's change the API so benefit we could get is that reduce unnecessary
      overhead (ie, change page unit with byte unit) in zsmalloc.
      
      Since return type is pages, "zs_get_total_pages" is better than
      "zs_get_total_size_bytes".
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: David Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      722cdc17
    • M
      zsmalloc: move pages_allocated to zs_pool · 13de8933
      Minchan Kim 提交于
      Currently, zram has no feature to limit memory so theoretically zram can
      deplete system memory.  Users have asked for a limit several times as even
      without exhaustion zram makes it hard to control memory usage of the
      platform.  This patchset adds the feature.
      
      Patch 1 makes zs_get_total_size_bytes faster because it would be used
      frequently in later patches for the new feature.
      
      Patch 2 changes zs_get_total_size_bytes's return unit from bytes to page
      so that zsmalloc doesn't need unnecessary operation(ie, << PAGE_SHIFT).
      
      Patch 3 adds new feature.  I added the feature into zram layer, not
      zsmalloc because limiation is zram's requirement, not zsmalloc so any
      other user using zsmalloc(ie, zpool) shouldn't affected by unnecessary
      branch of zsmalloc.  In future, if every users of zsmalloc want the
      feature, then, we could move the feature from client side to zsmalloc
      easily but vice versa would be painful.
      
      Patch 4 adds news facility to report maximum memory usage of zram so that
      this avoids user polling frequently via /sys/block/zram0/ mem_used_total
      and ensures transient max are not missed.
      
      This patch (of 4):
      
      pages_allocated has counted in size_class structure and when user of
      zsmalloc want to see total_size_bytes, it should gather all of count from
      each size_class to report the sum.
      
      It's not bad if user don't see the value often but if user start to see
      the value frequently, it would be not a good deal for performance pov.
      
      This patch moves the count from size_class to zs_pool so it could reduce
      memory footprint (from [255 * 8byte] to [sizeof(atomic_long_t)]).
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Reviewed-by: NDavid Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13de8933
    • C
      vmstat: on-demand vmstat workers V8 · 7cc36bbd
      Christoph Lameter 提交于
      vmstat workers are used for folding counter differentials into the zone,
      per node and global counters at certain time intervals.  They currently
      run at defined intervals on all processors which will cause some holdoff
      for processors that need minimal intrusion by the OS.
      
      The current vmstat_update mechanism depends on a deferrable timer firing
      every other second by default which registers a work queue item that runs
      on the local CPU, with the result that we have 1 interrupt and one
      additional schedulable task on each CPU every 2 seconds If a workload
      indeed causes VM activity or multiple tasks are running on a CPU, then
      there are probably bigger issues to deal with.
      
      However, some workloads dedicate a CPU for a single CPU bound task.  This
      is done in high performance computing, in high frequency financial
      applications, in networking (Intel DPDK, EZchip NPS) and with the advent
      of systems with more and more CPUs over time, this may become more and
      more common to do since when one has enough CPUs one cares less about
      efficiently sharing a CPU with other tasks and more about efficiently
      monopolizing a CPU per task.
      
      The difference of having this timer firing and workqueue kernel thread
      scheduled per second can be enormous.  An artificial test measuring the
      worst case time to do a simple "i++" in an endless loop on a bare metal
      system and under Linux on an isolated CPU with dynticks and with and
      without this patch, have Linux match the bare metal performance (~700
      cycles) with this patch and loose by couple of orders of magnitude (~200k
      cycles) without it[*].  The loss occurs for something that just calculates
      statistics.  For networking applications, for example, this could be the
      difference between dropping packets or sustaining line rate.
      
      Statistics are important and useful, but it would be great if there would
      be a way to not cause statistics gathering produce a huge performance
      difference.  This patche does just that.
      
      This patch creates a vmstat shepherd worker that monitors the per cpu
      differentials on all processors.  If there are differentials on a
      processor then a vmstat worker local to the processors with the
      differentials is created.  That worker will then start folding the diffs
      in regular intervals.  Should the worker find that there is no work to be
      done then it will make the shepherd worker monitor the differentials
      again.
      
      With this patch it is possible then to have periods longer than
      2 seconds without any OS event on a "cpu" (hardware thread).
      
      The patch shows a very minor increased in system performance.
      
      hackbench -s 512 -l 2000 -g 15 -f 25 -P
      
      Results before the patch:
      
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.992
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.971
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 5.063
      
      Hackbench after the patch:
      
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.973
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.990
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.993
      
      [fengguang.wu@intel.com: cpu_stat_off can be static]
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Reviewed-by: NGilad Ben-Yossef <gilad@benyossef.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Max Krasnyansky <maxk@qti.qualcomm.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cc36bbd