1. 12 Feb 2015 (15 commits)
    • mm, asm-generic: define PUD_SHIFT in <asm-generic/4level-fixup.h> · 4155b8e0
      Committed by Kirill A. Shutemov
      If an architecture uses <asm-generic/4level-fixup.h>, the build fails if we
      try to use PUD_SHIFT in generic code:
      
         In file included from arch/microblaze/include/asm/bug.h:1:0,
                          from include/linux/bug.h:4,
                          from include/linux/thread_info.h:11,
                          from include/asm-generic/preempt.h:4,
                          from arch/microblaze/include/generated/asm/preempt.h:1,
                          from include/linux/preempt.h:18,
                          from include/linux/spinlock.h:50,
                          from include/linux/mmzone.h:7,
                          from include/linux/gfp.h:5,
                          from include/linux/slab.h:14,
                          from mm/mmap.c:12:
         mm/mmap.c: In function 'exit_mmap':
      >> mm/mmap.c:2858:46: error: 'PUD_SHIFT' undeclared (first use in this function)
             round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
                                                       ^
         include/asm-generic/bug.h:86:25: note: in definition of macro 'WARN_ON'
           int __ret_warn_on = !!(condition);    \
                                  ^
         mm/mmap.c:2858:46: note: each undeclared identifier is reported only once for each function it appears in
             round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
                                                       ^
         include/asm-generic/bug.h:86:25: note: in definition of macro 'WARN_ON'
           int __ret_warn_on = !!(condition);    \
                                  ^
      As with <asm-generic/pgtable-nopud.h>, let's define PUD_SHIFT to
      PGDIR_SHIFT.
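
      A minimal sketch of what the fixup amounts to (illustrative; the actual
      hunk in <asm-generic/4level-fixup.h> may define more than this):

        #define PUD_SHIFT       PGDIR_SHIFT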
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4155b8e0
    • oom, PM: make OOM detection in the freezer path raceless · c32b3cbe
      Committed by Michal Hocko
      Commit 5695be14 ("OOM, PM: OOM killed task shouldn't escape PM
      suspend") left a race window open: the OOM killer can manage to
      note_oom_kill after freeze_processes has checked the counter.  The race
      window is quite small and really unlikely, and a partial solution was
      deemed sufficient at the time of submission.
      
      Tejun wasn't happy with that partial solution, though, and insisted on a
      full one.  That requires full exclusion between the OOM killer and the
      freezer's task freezing.  This patch provides it by introducing an
      oom_sem RW lock and turning oom_killer_disable() into a full OOM barrier.
      
      The oom_killer_disabled check is moved from the allocation path to the
      OOM level, and oom_sem is taken for reading around both the check and the
      whole OOM invocation.
      
      oom_killer_disable() takes oom_sem for writing, so it waits for all
      currently running OOM killer invocations.  It then disables all further
      OOMs by setting oom_killer_disabled and checks for any outstanding oom
      victims.  Victims are counted via mark_tsk_oom_victim and uncounted via
      unmark_oom_victim.  The last victim wakes up all waiters enqueued by
      oom_killer_disable().  Therefore this function acts as the full OOM
      barrier.
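
      A rough sketch of the scheme described above (illustrative only: the
      names follow this changelog, try_oom() is a made-up caller, and the real
      mm/oom_kill.c code differs in detail):

        static DECLARE_RWSEM(oom_sem);
        static bool oom_killer_disabled;
        static atomic_t oom_victims = ATOMIC_INIT(0);
        static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);

        /* OOM invocation path: reader side of the barrier. */
        static bool try_oom(void)
        {
                bool ran = false;

                down_read(&oom_sem);
                if (!oom_killer_disabled) {
                        /* ... select and kill a victim, bumping oom_victims
                         * via mark_tsk_oom_victim() ... */
                        ran = true;
                }
                up_read(&oom_sem);
                return ran;
        }

        /* Full OOM barrier: writer side. */
        void oom_killer_disable(void)
        {
                down_write(&oom_sem);
                oom_killer_disabled = true;
                up_write(&oom_sem);
                /* Wait for already-selected victims; the last
                 * unmark_oom_victim() wakes this up. */
                wait_event(oom_victims_wait, !atomic_read(&oom_victims));
        }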
      
      The page fault path is now covered as well, although it was previously
      assumed to be safe.  As per Tejun, "We used to have freezing points deep
      in file system code which may be reachable from page fault," so it is
      better and more robust not to rely on freezing points here.  The same
      applies to the memcg OOM killer.
      
      out_of_memory tells the caller whether the OOM was allowed to trigger,
      and the callers are supposed to handle the situation.  The page
      allocation path simply fails the allocation, same as before.  The page
      fault path will retry the fault (more on that later) and the Sysrq OOM
      trigger will simply complain to the log.
      
      Normally there wouldn't be any unfrozen user tasks after
      try_to_freeze_tasks, so the function will not block.  But if an OOM
      killer invocation raced with try_to_freeze_tasks and the OOM victim
      hasn't finished yet, then we have to wait for it.  This should complete
      in a finite time, though, because
      
      	- the victim cannot loop in the page fault handler (it would die
      	  on the way out from the exception)
      	- it cannot loop in the page allocator because all the further
      	  allocation would fail and __GFP_NOFAIL allocations are not
      	  acceptable at this stage
      	- it shouldn't be blocked on any locks held by frozen tasks
      	  (try_to_freeze expects lockless context) and kernel threads and
      	  work queues are not frozen yet
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c32b3cbe
    • oom: add helpers for setting and clearing TIF_MEMDIE · 49550b60
      Committed by Michal Hocko
      This patchset addresses a race which was described in the changelog for
      5695be14 ("OOM, PM: OOM killed task shouldn't escape PM suspend"):
      
      : PM freezer relies on having all tasks frozen by the time devices are
      : getting frozen so that no task will touch them while they are getting
      : frozen.  But OOM killer is allowed to kill an already frozen task in order
      : to handle OOM situation.  In order to protect from late wake ups OOM
      : killer is disabled after all tasks are frozen.  This, however, still keeps
      : a window open when a killed task didn't manage to die by the time
      : freeze_processes finishes.
      
      The original patch did not close the race window completely because that
      would require a more complex solution, as can be seen from this patchset.
      
      The primary motivation was to close the race condition between the OOM
      killer and the PM freezer _completely_.  As Tejun pointed out, even
      though the race condition is unlikely, it would be that much harder to
      debug weird bugs deep in the PM freezer, where the debugging options are
      considerably reduced.  I can only speculate about what might happen when
      a task is unexpectedly still runnable.
      
      On the plus side, and as a side effect, oom enable/disable now has
      better (full barrier) semantics without polluting hot paths.
      
      I have tested the series in KVM with 100M RAM:
      - many small tasks (20M anon mmap) which are triggering OOM continually
      - s2ram which resumes automatically is triggered in a loop
      	echo processors > /sys/power/pm_test
      	while true
      	do
      		echo mem > /sys/power/state
      		sleep 1s
      	done
      - simple module which allocates and frees 20M in 8K chunks. If it sees
        freezing(current) then it tries another round of allocation before calling
        try_to_freeze
      - debugging messages of PM stages and OOM killer enable/disable/fail added
        and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
        it wakes up waiters.
      - rebased on top of the current mmotm which means some necessary updates
        in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
        I think this should be OK because __thaw_task shouldn't interfere with any
        locking down wake_up_process. Oleg?
      
      As expected, there are no OOM killed tasks after oom is disabled, and
      allocations requested by the kernel thread fail after all the tasks are
      frozen and OOM is disabled.  I wasn't able to catch a race where
      oom_killer_disable would really have to wait, but I kind of expected
      that, as the race is really unlikely.
      
      [  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
      [  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
      [  243.636072] (elapsed 2.837 seconds) done.
      [  243.641985] Trying to disable OOM killer
      [  243.643032] Waiting for concurent OOM victims
      [  243.644342] OOM killer disabled
      [  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
      [  243.652983] Suspending console(s) (use no_console_suspend to debug)
      [  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
      [...]
      [  243.992600] PM: suspend of devices complete after 336.667 msecs
      [  243.993264] PM: late suspend of devices complete after 0.660 msecs
      [  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
      [  243.994717] ACPI: Preparing to enter system sleep state S3
      [  243.994795] PM: Saving platform NVS memory
      [  243.994796] Disabling non-boot CPUs ...
      
      The first 2 patches are simple cleanups for OOM.  They should go in
      regardless of the rest, IMO.
      
      Patches 3 and 4 are trivial printk -> pr_info conversions and should go
      in as well.
      
      The main patch is the last one and I would appreciate acks from Tejun
      and Rafael.  I think the OOM part should be OK (except for __thaw_task
      vs. task_lock, where a look from Oleg would be appreciated), but I am
      not so sure I haven't screwed anything up in the freezer code.  I have
      found several surprises there.
      
      This patch (of 5):
      
      This patch is just preparatory and doesn't introduce any functional
      change.
      
      Note:
      I am utterly unhappy about the lowmemory killer abusing TIF_MEMDIE just
      to wait for the oom victim and to prevent new killing.  This is just a
      side effect of the flag.  The primary meaning is to give the oom victim
      access to the memory reserves, and that shouldn't be necessary here.
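
      For illustration, the helpers amount to roughly the following (a sketch;
      the real versions in mm/oom_kill.c also interact with the freezer):

        void mark_tsk_oom_victim(struct task_struct *tsk)
        {
                set_tsk_thread_flag(tsk, TIF_MEMDIE);
        }

        void unmark_oom_victim(void)
        {
                clear_thread_flag(TIF_MEMDIE);
        }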
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49550b60
    • mm: memcontrol: default hierarchy interface for memory · 241994ed
      Committed by Johannes Weiner
      Introduce the basic control files to account, partition, and limit
      memory using cgroups in default hierarchy mode.
      
      This interface versioning allows us to address fundamental design
      issues in the existing memory cgroup interface, further explained
      below.  The old interface will be maintained indefinitely, but a
      clearer model and improved workload performance should encourage
      existing users to switch over to the new one eventually.
      
      The control files are thus:
      
        - memory.current shows the current consumption of the cgroup and its
          descendants, in bytes.
      
        - memory.low configures the lower end of the cgroup's expected
          memory consumption range.  The kernel considers memory below that
          boundary to be a reserve - the minimum that the workload needs in
          order to make forward progress - and generally avoids reclaiming
          it, unless there is an imminent risk of entering an OOM situation.
      
        - memory.high configures the upper end of the cgroup's expected
          memory consumption range.  A cgroup whose consumption grows beyond
          this threshold is forced into direct reclaim, to work off the
          excess and to throttle new allocations heavily, but is generally
          allowed to continue and the OOM killer is not invoked.
      
        - memory.max configures the hard maximum amount of memory that the
          cgroup is allowed to consume before the OOM killer is invoked.
      
        - memory.events shows event counters that indicate how often the
          cgroup was reclaimed while below memory.low, how often it was
          forced to reclaim excess beyond memory.high, how often it hit
          memory.max, and how often it entered OOM due to memory.max.  This
          allows users to identify configuration problems when observing a
          degradation in workload performance.  An overcommitted system will
          have an increased rate of low boundary breaches, whereas increased
          rates of high limit breaches, maximum hits, or even OOM situations
          will indicate internally overcommitted cgroups.
      
      For existing users of memory cgroups, the following deviations from
      the current interface are worth pointing out and explaining:
      
        - The original lower boundary, the soft limit, is defined as a limit
          that is per default unset.  As a result, the set of cgroups that
          global reclaim prefers is opt-in, rather than opt-out.  The costs
          for optimizing these mostly negative lookups are so high that the
          implementation, despite its enormous size, does not even provide
          the basic desirable behavior.  First off, the soft limit has no
          hierarchical meaning.  All configured groups are organized in a
          global rbtree and treated like equal peers, regardless where they
          are located in the hierarchy.  This makes subtree delegation
          impossible.  Second, the soft limit reclaim pass is so aggressive
          that it not just introduces high allocation latencies into the
          system, but also impacts system performance due to overreclaim, to
          the point where the feature becomes self-defeating.
      
          The memory.low boundary on the other hand is a top-down allocated
          reserve.  A cgroup enjoys reclaim protection when it and all its
          ancestors are below their low boundaries, which makes delegation
          of subtrees possible.  Secondly, new cgroups have no reserve per
          default and in the common case most cgroups are eligible for the
          preferred reclaim pass.  This allows the new low boundary to be
          efficiently implemented with just a minor addition to the generic
          reclaim code, without the need for out-of-band data structures and
          reclaim passes.  Because the generic reclaim code considers all
          cgroups except for the ones running low in the preferred first
          reclaim pass, overreclaim of individual groups is eliminated as
          well, resulting in much better overall workload performance.
      
        - The original high boundary, the hard limit, is defined as a strict
          limit that can not budge, even if the OOM killer has to be called.
          But this generally goes against the goal of making the most out of
          the available memory.  The memory consumption of workloads varies
          during runtime, and that requires users to overcommit.  But doing
          that with a strict upper limit requires either a fairly accurate
          prediction of the working set size or adding slack to the limit.
          Since working set size estimation is hard and error prone, and
          getting it wrong results in OOM kills, most users tend to err on
          the side of a looser limit and end up wasting precious resources.
      
          The memory.high boundary on the other hand can be set much more
          conservatively.  When hit, it throttles allocations by forcing
          them into direct reclaim to work off the excess, but it never
          invokes the OOM killer.  As a result, a high boundary that is
          chosen too aggressively will not terminate the processes, but
          instead it will lead to gradual performance degradation.  The user
          can monitor this and make corrections until the minimal memory
          footprint that still gives acceptable performance is found.
      
          In extreme cases, with many concurrent allocations and a complete
          breakdown of reclaim progress within the group, the high boundary
          can be exceeded.  But even then it's mostly better to satisfy the
          allocation from the slack available in other groups or the rest of
          the system than killing the group.  Otherwise, memory.max is there
          to limit this type of spillover and ultimately contain buggy or
          even malicious applications.
      
        - The original control file names are unwieldy and inconsistent in
          many different ways.  For example, the upper boundary hit count is
          exported in the memory.failcnt file, but an OOM event count has to
          be manually counted by listening to memory.oom_control events, and
          lower boundary / soft limit events have to be counted by first
          setting a threshold for that value and then counting those events.
          Also, usage and limit files encode their units in the filename.
          That makes the filenames very long, even though this is not
          information that a user needs to be reminded of every time they
          type out those names.
      
          To address these naming issues, as well as to signal clearly that
          the new interface carries a new configuration model, the naming
          conventions in it necessarily differ from the old interface.
      
        - The original limit files indicate the state of an unset limit with
          a very high number, and a configured limit can be unset by echoing
          -1 into those files.  But that very high number is implementation
          and architecture dependent and not very descriptive.  And while -1
          can be understood as an underflow into the highest possible value,
          -2 or -10M etc. do not work, so it's not consistent.
      
          memory.low, memory.high, and memory.max will use the string
          "infinity" to indicate and set the highest possible value.
      
      [akpm@linux-foundation.org: use seq_puts() for basic strings]
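
      As an illustration, the new files can be driven from userspace roughly
      like this (a sketch; the unified-hierarchy mount point and the cgroup
      name "mygroup" are assumptions, not part of this patch):

        #include <stdio.h>

        int main(void)
        {
                char buf[64];
                FILE *f;

                /* Expected upper end of the consumption range: 512M. */
                f = fopen("/sys/fs/cgroup/mygroup/memory.high", "w");
                if (f) {
                        fprintf(f, "%llu\n", 512ULL << 20);
                        fclose(f);
                }

                /* No hard limit: write the "infinity" keyword. */
                f = fopen("/sys/fs/cgroup/mygroup/memory.max", "w");
                if (f) {
                        fputs("infinity\n", f);
                        fclose(f);
                }

                /* Current consumption of the group and its descendants. */
                f = fopen("/sys/fs/cgroup/mygroup/memory.current", "r");
                if (f) {
                        if (fgets(buf, sizeof(buf), f))
                                printf("memory.current: %s", buf);
                        fclose(f);
                }
                return 0;
        }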
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      241994ed
    • mm: page_counter: pull "-1" handling out of page_counter_memparse() · 650c5e56
      Committed by Johannes Weiner
      The unified hierarchy interface for memory cgroups will no longer use "-1"
      to mean maximum possible resource value.  In preparation for this, make
      the string an argument and let the caller supply it.
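
      Roughly, the call sites change as follows (a sketch; the exact prototype
      lives in include/linux/page_counter.h):

        /* before: the "-1" string was special-cased inside the helper */
        ret = page_counter_memparse(buf, &nr_pages);

        /* after: the caller supplies the "maximum value" string itself */
        ret = page_counter_memparse(buf, "-1", &nr_pages);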
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      650c5e56
    • vmscan: force scan offline memory cgroups · 90cbc250
      Committed by Vladimir Davydov
      Since commit b2052564 ("mm: memcontrol: continue cache reclaim from
      offlined groups") pages charged to a memory cgroup are not reparented when
      the cgroup is removed.  Instead, they are supposed to be reclaimed in a
      regular way, along with pages accounted to online memory cgroups.
      
      However, an lruvec of an offline memory cgroup will sooner or later get so
      small that it will be scanned only at low scan priorities (see
      get_scan_count()).  Therefore, if there are enough reclaimable pages in
      big lruvecs, pages accounted to offline memory cgroups will never be
      scanned at all, wasting memory.
      
      Fix this by unconditionally forcing scanning dead lruvecs from kswapd.
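
      The idea can be sketched as follows (illustrative; the helper name and
      its exact placement inside get_scan_count() are assumptions here):

        if (current_is_kswapd() && !mem_cgroup_lruvec_online(lruvec))
                force_scan = true;   /* never let low priorities skip dead lruvecs */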
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90cbc250
    • mm: microoptimize zonelist operations · 05891fb0
      Committed by Vlastimil Babka
      next_zones_zonelist() returns a zoneref pointer, as well as a zone
      pointer via an extra parameter.  Since the latter can be trivially
      obtained by dereferencing the former, the overhead of the extra parameter
      is unjustified.
      
      This patch thus removes the zone parameter from next_zones_zonelist().
      Both callers happen to be in the same header file, so it's simple to add
      the zoneref dereference inline.  We save some bytes of code size.
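
      Caller-side, the change is essentially this (a sketch):

        /* before: the zone was returned through an output parameter */
        z = next_zones_zonelist(++z, high_zoneidx, nodemask, &zone);

        /* after: dereference the returned zoneref inline */
        z = next_zones_zonelist(++z, high_zoneidx, nodemask);
        zone = zonelist_zone(z);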
      
      add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-105 (-105)
      function                                     old     new   delta
      nr_free_zone_pages                           129     115     -14
      __alloc_pages_nodemask                      2300    2285     -15
      get_page_from_freelist                      2652    2576     -76
      
      add/remove: 0/0 grow/shrink: 1/0 up/down: 10/0 (10)
      function                                     old     new   delta
      try_to_compact_pages                         569     579     +10
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05891fb0
    • mm: reduce try_to_compact_pages parameters · 1a6d53a1
      Committed by Vlastimil Babka
      Expand the usage of the struct alloc_context introduced in the previous
      patch also for calling try_to_compact_pages(), to reduce the number of its
      parameters.  Since the function is in a different compilation unit, we
      need to move the alloc_context definition into the shared mm/internal.h
      header.
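
      For reference, the shared context looks roughly like this (a sketch of
      the struct introduced by the previous patch; the exact field list in
      mm/internal.h may differ):

        struct alloc_context {
                struct zonelist *zonelist;
                nodemask_t *nodemask;
                struct zone *preferred_zone;
                int classzone_idx;
                int migratetype;
                enum zone_type high_zoneidx;
        };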
      
      With this change we get simpler code and small savings of code size and stack
      usage:
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
      function                                     old     new   delta
      __alloc_pages_direct_compact                 283     256     -27
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13)
      function                                     old     new   delta
      try_to_compact_pages                         582     569     -13
      
      Stack usage of __alloc_pages_direct_compact goes from 24 to none (per
      scripts/checkstack.pl).
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a6d53a1
    • mm/hugetlb: take page table lock in follow_huge_pmd() · e66f17ff
      Committed by Naoya Horiguchi
      We have a race condition between move_pages() and freeing hugepages,
      where move_pages() calls follow_page(FOLL_GET) for hugepages internally
      and tries to get its refcount without preventing concurrent freeing.
      This race crashes the kernel, so this patch fixes it by moving the
      FOLL_GET code for hugepages into follow_huge_pmd() and taking the page
      table lock there.
      
      This patch intentionally removes the page==NULL check after pte_page().
      This is justified because pte_page() never returns NULL on any
      architecture or configuration.
      
      This patch changes the behavior of follow_huge_pmd() for tail pages:
      tail pages can now be pinned and returned, so the caller must be changed
      to handle the returned tail pages properly.
      
      We could add similar locking to follow_huge_(addr|pud) for consistency,
      but it's not necessary because these functions currently don't support
      the FOLL_GET flag, so let's leave that for future development.
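
      The locking idea is roughly the following (a sketch; the real
      follow_huge_pmd() also has to deal with hugepage migration entries and
      retries, which are omitted here):

        struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
                                     pmd_t *pmd, int flags)
        {
                struct page *page = NULL;
                spinlock_t *ptl;

                ptl = pmd_lockptr(mm, pmd);
                spin_lock(ptl);
                if (pmd_present(*pmd)) {
                        page = pmd_page(*pmd) +
                               ((address & ~PMD_MASK) >> PAGE_SHIFT);
                        if (flags & FOLL_GET)
                                get_page(page); /* pin while the lock is held */
                }
                spin_unlock(ptl);
                return page;
        }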
      
      Here is the reproducer:
      
        $ cat movepages.c
        #include <stdio.h>
        #include <stdlib.h>
        #include <err.h>        /* err() */
        #include <numaif.h>     /* link with -lnuma */
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
        #define PS              0x1000
      
        int main(int argc, char *argv[]) {
                int i;
                int nr_hp = strtol(argv[1], NULL, 0);
                int nr_p  = nr_hp * HPS / PS;
                int ret;
                void **addrs;
                int *status;
                int *nodes;
                pid_t pid;
      
                pid = strtol(argv[2], NULL, 0);
                addrs  = malloc(sizeof(void *) * nr_p);
                status = malloc(sizeof(int) * nr_p);
                nodes  = malloc(sizeof(int) * nr_p);
      
                while (1) {
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 1;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err(1, "move_pages");
      
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 0;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err(1, "move_pages");
                }
                return 0;
        }
      
        $ cat hugepage.c
        #include <stdio.h>
        #include <stdlib.h>     /* strtol() */
        #include <sys/mman.h>
        #include <string.h>
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
      
        int main(int argc, char *argv[]) {
                int nr_hp = strtol(argv[1], NULL, 0);
                char *p;
      
                while (1) {
                        p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                        if (p != (void *)ADDR_INPUT) {
                                perror("mmap");
                                break;
                        }
                        memset(p, 0, nr_hp * HPS);
                        munmap(p, nr_hp * HPS);
                }
        }
      
        $ sysctl vm.nr_hugepages=40
        $ ./hugepage 10 &
        $ ./movepages 10 $(pgrep -f hugepage)
      
      Fixes: e632a938 ("mm: migrate: add hugepage migration code to move_pages()")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e66f17ff
    • mm: fix typo of MIGRATE_RESERVE in comment · 44628d97
      Committed by Baoquan He
      Found this when trying to jump to the definition of MIGRATE_RESERVE with
      ctags.
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44628d97
    • mm: memcontrol: track move_lock state internally · 6de22619
      Committed by Johannes Weiner
      The complexity of memcg page stat synchronization is currently leaking
      into the callsites, forcing them to keep track of the move_lock state and
      the IRQ flags.  Simplify the API by tracking it in the memcg.
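
      Call sites then shrink roughly as follows (a sketch; the stat-update
      call in the middle is only an example of what such a section protects):

        struct mem_cgroup *memcg;

        /* before: the caller also had to carry a 'locked' bool and the saved
         * IRQ flags between the two calls; now the memcg remembers that
         * state itself */
        memcg = mem_cgroup_begin_page_stat(page);
        mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED);
        mem_cgroup_end_page_stat(memcg);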
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6de22619
    • swap: remove unused mem_cgroup_uncharge_swapcache declaration · 93aa7d95
      Committed by Vladimir Davydov
      The body of this function was removed by commit 0a31bc97 ("mm:
      memcontrol: rewrite uncharge API").
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93aa7d95
    • mm:add KPF_ZERO_PAGE flag for /proc/kpageflags · 56873f43
      Committed by Wang, Yalin
      Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can
      detect zero_page in /proc/kpageflags, and then do memory analysis more
      accurately.
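
      A user-space check might look roughly like this (a sketch; the bit
      position of KPF_ZERO_PAGE is taken from kernel-page-flags.h and is an
      assumption here, and reading /proc/kpageflags requires root):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                const int KPF_ZERO_PAGE = 24;   /* assumed bit position */
                uint64_t pfn = 0, flags = 0;    /* flags word of one PFN */
                FILE *f = fopen("/proc/kpageflags", "rb");

                if (!f)
                        return 1;
                fseek(f, pfn * sizeof(flags), SEEK_SET);
                if (fread(&flags, sizeof(flags), 1, f) == 1 &&
                    (flags & (1ULL << KPF_ZERO_PAGE)))
                        printf("pfn %llu maps the zero page\n",
                               (unsigned long long)pfn);
                fclose(f);
                return 0;
        }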
      Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56873f43
    • mm: add VM_BUG_ON_PAGE() to page_mapcount() · 1d148e21
      Committed by Wang, Yalin
      Add VM_BUG_ON_PAGE() for slab pages.  _mapcount is in a union with the
      slab fields in struct page, so we must avoid accessing _mapcount if this
      page is a slab page.  Also remove the unneeded bracket.
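
      The result is roughly this (a sketch of the helper in
      include/linux/mm.h):

        static inline int page_mapcount(struct page *page)
        {
                VM_BUG_ON_PAGE(PageSlab(page), page);
                return atomic_read(&page->_mapcount) + 1;
        }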
      Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d148e21
    • mm: add fields for compound destructor and order into struct page · e4b294c2
      Committed by Kirill A. Shutemov
      Currently, we use lru.next/lru.prev plus a cast to access or set the
      destructor and order of a compound page.
      
      Let's replace it with explicit fields in struct page.
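
      Schematically, the relevant union in struct page gains explicit fields
      instead of being reused through casts of lru.next/lru.prev (a simplified
      sketch, not the full mm_types.h layout):

        union {
                struct list_head lru;   /* old home of the casts */
                struct {                /* used on tail pages of a compound page */
                        compound_page_dtor *compound_dtor;
                        unsigned long compound_order;
                };
        };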
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4b294c2
  2. 11 Feb 2015 (14 commits)
  3. 10 Feb 2015 (8 commits)
  4. 09 Feb 2015 (3 commits)
    • ALSA: pcm: allow for trigger_tstamp snapshot in .trigger · 2b79d7a6
      Committed by Pierre-Louis Bossart
      Don't use the generic snapshot of trigger_tstamp if the low-level driver
      or hardware can provide a more precise value, for better audio/system
      time synchronization.
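
      An illustrative driver-side use (a sketch; mychip_pcm_trigger() is a
      made-up callback, and the handshake that tells the PCM core the
      timestamp was already latched is not shown):

        static int mychip_pcm_trigger(struct snd_pcm_substream *substream, int cmd)
        {
                struct snd_pcm_runtime *runtime = substream->runtime;

                if (cmd == SNDRV_PCM_TRIGGER_START)
                        /* driver/hardware-provided timestamp, more precise
                         * than the generic snapshot taken by the core */
                        snd_pcm_gettime(runtime, &runtime->trigger_tstamp);
                /* ... start the hardware ... */
                return 0;
        }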
      Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      2b79d7a6
    • net:rfs: adjust table size checking · 93c1af6c
      Committed by Eric Dumazet
      Make sure root user does not try something stupid.
      
      Also make sure the mask field in struct rps_sock_flow_table does not
      share a cache line with the potentially often-dirtied flow table.
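
      Schematically (a sketch; see struct rps_sock_flow_table in
      include/linux/netdevice.h for the real definition):

        struct rps_sock_flow_table {
                u32     mask;

                u32     ents[0] ____cacheline_aligned_in_smp;
        };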
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: 567e4b79 ("net: rfs: add hash collision detection")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93c1af6c
    • net: rfs: add hash collision detection · 567e4b79
      Committed by Eric Dumazet
      Receive Flow Steering is a nice solution but suffers from hash
      collisions when a mix of connected and unconnected traffic is received on
      the host and the flow hash table is populated.
      
      Also, clearing the flow in inet_release() makes RFS not very good for
      short-lived flows, as many packets can follow close()
      (FIN, ACK packets, ...).
      
      This patch extends the information stored in the global hash table to
      include not only the cpu number but also the upper part of the hash
      value.
      
      I use a 32-bit value and dynamically split it into two parts.
      
      For hosts with no more than 64 possible cpus, this gives 6 bits for the
      cpu number and 26 (32 - 6) bits for the upper part of the hash.
      
      Since hash bucket selection uses the low-order bits of the hash, we get
      a full hash match if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in the flow table does not match, we fall back to RPS
      (if it is enabled for the rx queue).
      
      This means that a packet for a non-connected flow can avoid an IPI to an
      unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket close
      time, and this helps short-lived flow performance.
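
      The split can be illustrated with a small stand-alone example (not
      kernel code; the kernel derives the number of cpu bits from the count of
      possible cpus):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                uint32_t hash = 0xdeadbeef;             /* flow hash */
                uint32_t nr_cpu_bits = 6;               /* enough for 64 cpus */
                uint32_t cpu_mask = (1u << nr_cpu_bits) - 1;
                uint32_t cpu = 13;

                /* Table entry: upper hash bits plus the cpu in the low bits. */
                uint32_t entry = (hash & ~cpu_mask) | cpu;

                /* On receive: only a full match of the upper hash bits counts;
                 * otherwise fall back to plain RPS. */
                if ((entry & ~cpu_mask) == (hash & ~cpu_mask))
                        printf("flow matches, steer to cpu %u\n", entry & cpu_mask);
                return 0;
        }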
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      567e4b79