1. 12 February 2014 (2 commits)
    • cgroup: introduce cgroup_ino() · b1664924
      Committed by Tejun Heo
      mm/memory-failure.c::hwpoison_filter_task() has been reaching into
      cgroup to extract the associated ino to be used as a filtering
      criterion.  This is an implementation detail which shouldn't be
      depended upon from outside cgroup proper and is about to change with
      the scheduled kernfs conversion.
      
      This patch introduces a proper interface to determine the associated
      ino, cgroup_ino(), and updates hwpoison_filter_task() to use it
      instead of reaching directly into cgroup.
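      
      A rough illustration of the change at the hwpoison_filter_task() call
      site (the old field chain is reconstructed from memory and may differ
      slightly in detail):
      
              /* before: reaching into cgroup internals for the inode number */
              ino = css->cgroup->dentry->d_inode->i_ino;
      
              /* after: use the new accessor */
              ino = cgroup_ino(css->cgroup);
      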
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
    • cgroup: improve css_from_dir() into css_tryget_from_dir() · 5a17f543
      Committed by Tejun Heo
      css_from_dir() returns the matching css (cgroup_subsys_state) given a
      dentry and subsystem.  The function doesn't pin the css before
      returning and requires the caller to be holding RCU read lock or
      cgroup_mutex and handling pinning on the caller side.
      
      Given that users of the function are likely to want to pin the
      returned css (both existing users do) and that getting and putting
      css's are very cheap, there's no reason for the interface to be tricky
      like this.
      
      Rename css_from_dir() to css_tryget_from_dir() and make it try to pin
      the found css and return it only if pinning succeeded.  The callers
      are updated so that they no longer do RCU locking and pinning around
      the function and just use the returned css.
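      
      A sketch of the caller pattern before and after the rename (the error
      conventions shown here are approximations, not the exact call sites):
      
              /* before: caller locks, looks up, and pins by hand */
              rcu_read_lock();
              css = css_from_dir(dentry, ss);
              if (!IS_ERR_OR_NULL(css) && !css_tryget(css))
                      css = NULL;
              rcu_read_unlock();
      
              /* after: the returned css is already pinned (or an error) */
              css = css_tryget_from_dir(dentry, ss);
              if (IS_ERR(css))
                      return PTR_ERR(css);
              /* ... use css ... */
              css_put(css);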
      
      This will also ease converting cgroup to kernfs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
  2. 08 February 2014 (1 commit)
    • cgroup: clean up cgroup_subsys names and initialization · 073219e9
      Committed by Tejun Heo
      cgroup_subsys is a bit messier than it needs to be.
      
      * The name of a subsys can be different from its internal identifier
        defined in cgroup_subsys.h.  Most subsystems use the matching name
        but three - cpu, memory and perf_event - use different ones.
      
      * cgroup_subsys_id enums are postfixed with _subsys_id and each
        cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
        included throughout various subsystems; it doesn't and shouldn't
        have a claim on such generic names, which don't have any qualifier
        indicating that they belong to cgroup.
      
      * cgroup_subsys->subsys_id should always equal the matching
        cgroup_subsys_id enum; however, we require each controller to
        initialize it and then BUG if they don't match, which is a bit
        silly.
      
      This patch cleans up cgroup_subsys names and initialization by doing
      the following:
      
      * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
        cgroup_subsys with _cgrp_subsys.
      
      * With the above, renaming subsys identifiers to match the userland
        visible names doesn't cause any naming conflicts.  All non-matching
        identifiers are renamed to match the official names.
      
        cpu_cgroup -> cpu
        mem_cgroup -> memory
        perf -> perf_event
      
      * controllers no longer need to initialize ->subsys_id and ->name.
        They're generated in cgroup core and set automatically during boot.
      
      * Redundant cgroup_subsys declarations removed.
      
      * While updating BUG_ON()s in cgroup_init_early(), convert them to
        WARN()s.  BUGging that early during boot is stupid - the kernel
        can't print anything, even through serial console and the trap
        handler doesn't even link stack frame properly for back-tracing.
      
      This patch doesn't introduce any behavior changes.
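      
      Illustrating the renames with the memory controller (a sketch; the other
      initializer fields are elided):
      
              /* before */
              struct cgroup_subsys mem_cgroup_subsys = {
                      .name           = "memory",
                      .subsys_id      = mem_cgroup_subsys_id,
                      /* ... callbacks ... */
              };
      
              /* after: the identifier matches the userland-visible name, and
               * ->name and ->subsys_id are filled in by cgroup core at boot
               */
              struct cgroup_subsys memory_cgrp_subsys = {
                      /* ... callbacks ... */
              };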
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Ingo Molnar <mingo@redhat.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
  3. 04 February 2014 (1 commit)
    • arm, pm, vmpressure: add missing slab.h includes · 1ff6bbfd
      Committed by Tejun Heo
      arch/arm/mach-tegra/pm.c, kernel/power/console.c and mm/vmpressure.c
      were somehow getting slab.h indirectly through cgroup.h which in turn
      was getting it indirectly through xattr.h.  A scheduled cgroup change
      drops xattr.h inclusion from cgroup.h and breaks compilation of these
      three files.  Add explicit slab.h includes to the three files.
      
      A pending cgroup patch depends on this change and it'd be great if
      this can be routed through cgroup/for-3.14-fixes branch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Thierry Reding <thierry.reding@gmail.com>
      Cc: linux-tegra@vger.kernel.org
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: cgroups@vger.kernel.org
  4. 31 January 2014 (8 commits)
    • mm: Fix warning on make htmldocs caused by slab.c · cb8ee1a3
      Committed by Masanari Iida
      This patch fixes the following warnings from "make htmldocs":
      Warning(/mm/slab.c:1956): No description found for parameter 'page'
      Warning(/mm/slab.c:1956): Excess function parameter 'slabp' description in 'slab_destroy'
      
      The kernel-doc comment documented the incorrect parameter "slabp" instead of "page".
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Masanari Iida <standby24x7@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • mm: slub: work around unneeded lockdep warning · 67b6c900
      Committed by Dave Hansen
      The slub code does some setup during early boot in
      early_kmem_cache_node_alloc() with some local data.  There is no
      possible way that another CPU can see this data, so the slub code
      doesn't unnecessarily lock it.  However, some new lockdep asserts
      check to make sure that add_partial() _always_ has the list_lock
      held.
      
      Just add the locking, even though it is technically unnecessary.
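      
      A sketch of the workaround, assuming the add_partial() call sits in
      early_kmem_cache_node_alloc() and n is the freshly initialized
      kmem_cache_node:
      
              /*
               * The lock is only taken to satisfy lockdep's assertion;
               * nothing else can see this node yet.
               */
              spin_lock(&n->list_lock);
              add_partial(n, page, DEACTIVATE_TO_HEAD);
              spin_unlock(&n->list_lock);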
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • memcg: fix mutex not unlocked on memcg_create_kmem_cache fail path · 7c094fd6
      Committed by Vladimir Davydov
      Commit 842e2873 ("memcg: get rid of kmem_cache_dup()") introduced a
      mutex for memcg_create_kmem_cache() to protect the tmp_name buffer that
      holds the memcg name.  It failed to unlock the mutex if this buffer
      could not be allocated.
      
      This patch fixes the issue by appropriately unlocking the mutex if the
      allocation fails.
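      
      A minimal sketch of the corrected error path; the mutex name here is a
      placeholder for the one guarding tmp_name in memcg_create_kmem_cache():
      
              mutex_lock(&memcg_create_mutex);        /* placeholder name */
              tmp_name = kmalloc(PATH_MAX, GFP_KERNEL);
              if (!tmp_name)
                      goto out_unlock;        /* the bug: this path used to return with the mutex held */
      
              /* ... format the name and create the per-memcg cache ... */
      
              kfree(tmp_name);
      out_unlock:
              mutex_unlock(&memcg_create_mutex);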
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: base root bonus on current usage · 778c14af
      Committed by David Rientjes
      A 3% of system memory bonus is sometimes too excessive in comparison to
      other processes.
      
      With commit a63d83f4 ("oom: badness heuristic rewrite"), the OOM
      killer tries to avoid killing privileged tasks by subtracting 3% of
      overall memory (system or cgroup) from their per-task consumption.  But
      as a result, all root tasks that consume less than 3% of overall memory
      are considered equal, and so it only takes 33+ privileged tasks pushing
      the system out of memory for the OOM killer to do something stupid and
      kill dhclient or other root-owned processes.  For example, on a 32G
      machine it can't tell the difference between the 1M agetty and the 10G
      fork bomb member.
      
      The changelog describes this 3% boost as the equivalent to the global
      overcommit limit being 3% higher for privileged tasks, but this is not
      the same as discounting 3% of overall memory from _every privileged task
      individually_ during OOM selection.
      
      Replace the 3% of system memory bonus with a 3% of current memory usage
      bonus.
      
      By giving root tasks a bonus that is proportional to their actual size,
      they remain comparable even when relatively small.  In the example
      above, the OOM killer will discount the 1M agetty's 256 badness points
      down to 179, and the 10G fork bomb's 262144 points down to 183500 points
      and make the right choice, instead of discounting both to 0 and killing
      agetty because it's first in the task list.
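      
      A sketch of the before/after in oom_badness(), from memory of the
      3.13-era code (variable names and details may differ):
      
              /* before: a flat bonus worth 3% of all allowed pages */
              if (has_capability_noaudit(p, CAP_SYS_ADMIN))
                      adj -= 30;      /* oom_score_adj units; scaled by totalpages / 1000 */
      
              /* after: a bonus proportional to the task's own badness points */
              if (has_capability_noaudit(p, CAP_SYS_ADMIN))
                      points -= (points * 3) / 100;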
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub.c: fix page->_count corruption (again) · a0320865
      Committed by Dave Hansen
      Commit abca7c49 ("mm: fix slab->page _count corruption when using
      slub") notes that we can not _set_ a page->counters directly, except
      when using a real double-cmpxchg.  Doing so can lose updates to
      ->_count.
      
      That is an absolute rule:
      
              You may not *set* page->counters except via a cmpxchg.
      
      Commit abca7c49 fixed this for the folks who have the slub
      cmpxchg_double code turned off at compile time, but it left the bad case
      alone.  It can still be reached, and the same bug can be triggered in two
      cases:
      
      1. Turning on slub debugging at runtime, which is available on
         the distro kernels that I looked at.
      2. On 64-bit CPUs with no CMPXCHG16B (some early AMD x86-64
         cpus, evidently)
      
      There are at least 3 ways we could fix this:
      
       1. Take all of the existing calls to cmpxchg_double_slab() and
         __cmpxchg_double_slab() and convert them to take an old, new
         and target 'struct page'.
      2. Do (1), but with the newly-introduced 'slub_data'.
      3. Do some magic inside the two cmpxchg...slab() functions to
         pull the counters out of new_counters and only set those
         fields in page->{inuse,frozen,objects}.
      
      I've done (2) as well, but it's a bunch more code.  This patch is an
      attempt at (3).  This was the most straightforward and foolproof way
      that I could think to do this.
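      
      A sketch of approach (3), close to the helper the patch adds
      (reconstructed from memory, so details may differ): decode the new
      counters into a temporary page and copy back only the slub-owned fields,
      leaving _count untouched.
      
              static inline void set_page_slub_counters(struct page *page,
                                                        unsigned long counters_new)
              {
                      struct page tmp;
      
                      tmp.counters = counters_new;
                      /*
                       * page->counters overlaps page->_count.  Assigning to
                       * ->counters directly can lose concurrent updates to
                       * ->_count, so only write the fields slub actually owns.
                       */
                      page->inuse   = tmp.inuse;
                      page->objects = tmp.objects;
                      page->frozen  = tmp.frozen;
              }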
      
      This would also technically allow us to get rid of the ugly
      
      #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
             defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
      
      in 'struct page', but leaving it alone has the added benefit that
      'counters' stays 'unsigned' instead of 'unsigned long', so all the
      copies that the slub code does stay a bit smaller.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy.c: fix mempolicy printing in numa_maps · 8790c71a
      Committed by David Rientjes
      As a result of commit 5606e387 ("mm: numa: Migrate on reference
      policy"), /proc/<pid>/numa_maps prints the mempolicy for any <pid> as
      "prefer:N" for the local node, N, of the process reading the file.
      
      This should only be printed when the mempolicy of <pid> is
      MPOL_PREFERRED for node N.
      
      If the process is actually only using the default mempolicy for local
      node allocation, make sure "default" is printed as expected.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Robert Lippert <rlippert@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>	[3.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: add copyright · 31fc00bb
      Committed by Minchan Kim
      Add my copyright to the zsmalloc source code which I maintain.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: move it under mm · bcf1647d
      Committed by Minchan Kim
      This patch moves zsmalloc under mm directory.
      
      Before that, a short description will explain why we have needed a
      custom allocator.
      
      Zsmalloc is a new slab-based memory allocator for storing compressed
      pages.  It is designed for low fragmentation and a high allocation
      success rate for large objects that are no bigger than PAGE_SIZE.
      
      zsmalloc differs from the kernel slab allocator in two primary ways to
      achieve these design goals.
      
      zsmalloc never requires high order page allocations to back slabs, or
      "size classes" in zsmalloc terms.  Instead it allows multiple
      single-order pages to be stitched together into a "zspage" which backs
      the slab.  This allows for higher allocation success rate under memory
      pressure.
      
      Also, zsmalloc allows objects to span page boundaries within the zspage.
      This allows for lower fragmentation than could be had with the kernel
      slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.  With the
      kernel slab allocator, if a page compresses to 60% of its original size,
      the memory savings gained through compression is lost in fragmentation
      because another object of the same size can't be stored in the leftover
      space.
      
      This ability to span pages results in zsmalloc allocations not being
      directly addressable by the user.  The user is given a
      non-dereferenceable handle in response to an allocation request.  That
      handle must be mapped, using zs_map_object(), which returns a pointer to
      the mapped region that can be used.  The mapping is necessary since the
      object data may reside in two different noncontiguous pages.
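      
      A sketch of the usage pattern described above, with signatures roughly
      as in the 3.14-era API (they have changed in later kernels):
      
              struct zs_pool *pool = zs_create_pool(GFP_KERNEL);
              unsigned long handle;
              void *dst;
      
              handle = zs_malloc(pool, compressed_len);       /* opaque handle, not a pointer */
              if (!handle)
                      return -ENOMEM;
      
              dst = zs_map_object(pool, handle, ZS_MM_WO);    /* object may straddle two pages */
              memcpy(dst, compressed_data, compressed_len);
              zs_unmap_object(pool, handle);
      
              /* ... later ... */
              zs_free(pool, handle);
              zs_destroy_pool(pool);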
      
      zsmalloc fulfills the allocation needs for zram perfectly.
      
      [sjenning@linux.vnet.ibm.com: borrow Seth's quote]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Nitin Gupta <ngupta@vflare.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 30 January 2014 (8 commits)
  6. 28 January 2014 (4 commits)
  7. 26 January 2014 (2 commits)
    • Fix race when checking i_size on direct i/o read · 9fe55eea
      Committed by Steven Whitehouse
      So far I've had one ACK for this, and no other comments. So I think it
      is probably time to send this via some suitable tree. I'm guessing that
      the vfs tree would be the most appropriate route, but not sure that
      there is one at the moment (don't see anything recent at kernel.org)
      so in that case I think -mm is the "back up plan". Al, please let me
      know if you will take this?
      
      Steve.
      
      ---------------------
      
      Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
      reads with append dio writes" thread on linux-fsdevel, this patch is my
      current version of the fix proposed as option (b) in that thread.
      
      Removing the i_size test from the direct i/o read path at vfs level
      means that filesystems now have to deal with requests which are beyond
      i_size themselves. These I've divided into three sets:
      
       a) Those with "no op" ->direct_IO (9p, cifs, ceph)
      These are obviously not going to be an issue
      
       b) Those with "home brew" ->direct_IO (nfs, fuse)
      I've been told that NFS should not have any problem with the larger
      i_size, however I've added an extra test to FUSE to duplicate the
      original behaviour just to be on the safe side.
      
       c) Those using __blockdev_direct_IO()
      These call through to ->get_block() which should deal with the EOF
      condition correctly. I've verified that with GFS2 and I believe that
      Zheng has verified it for ext4. I've also run the test on XFS and it
      passes both before and after this change.
      
      The part of the patch in filemap.c looks a lot larger than it really is
      - there are only two lines of real change. The rest is just indentation
      of the contained code.
      
      There remains a test of i_size though, which was added for btrfs. It
      doesn't cause the other filesystems a problem as the test is performed
      after ->direct_IO has been called.  It is possible that there is a race
      that does matter to btrfs; however, this patch doesn't change that, so
      it's still an overall improvement.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Acked-by: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: remove generic_acl · feda821e
      Committed by Christoph Hellwig
      And instead convert tmpfs to use the new generic ACL code, with two stub
      methods provided for in-memory filesystems.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  8. 24 January 2014 (14 commits)
    • mm: ignore VM_SOFTDIRTY on VMA merging · 34228d47
      Committed by Cyrill Gorcunov
      The VM_SOFTDIRTY bit affects the vma merge routine: if two VMAs have all
      bits in vm_flags matching except the dirty bit, the kernel can no longer
      merge them, and this forces the kernel to generate new VMAs instead.
      
      This may eventually lead to a situation where a userspace application
      reaches the vm.max_map_count limit and, in the worst case, crashes:
      
       | (gimp:11768): GLib-ERROR **: gmem.c:110: failed to allocate 4096 bytes
       |
       | (file-tiff-load:12038): LibGimpBase-WARNING **: file-tiff-load: gimp_wire_read(): error
       | xinit: connection to X server lost
       |
       | waiting for X server to shut down
       | /usr/lib64/gimp/2.0/plug-ins/file-tiff-load terminated: Hangup
       | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup
       | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup
      
        https://bugzilla.kernel.org/show_bug.cgi?id=67651
        https://bugzilla.gnome.org/show_bug.cgi?id=719619#c0
      
      The initial problem came from a missed VM_SOFTDIRTY in the do_brk()
      routine, but even if we set VM_SOFTDIRTY there, there is still a way to
      prevent VMAs from merging: one can call
      
       | echo 4 > /proc/$PID/clear_refs
      
      and clear VM_SOFTDIRTY on all VMAs present in the memory map; a new
      do_brk() will then try to extend the old VMA, find that the dirty bit
      doesn't match, and so generate a new VMA.
      
      As discussed with Pavel, the right approach is to ignore the
      VM_SOFTDIRTY bit when we're trying to merge VMAs and, if the merge
      succeeds, mark the extended VMA with the dirty bit where needed.
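      
      The core of that approach, roughly as it appears in is_mergeable_vma()
      after this change (a sketch):
      
              /* ignore the soft-dirty bit when comparing flags for merging */
              if ((vma->vm_flags ^ vm_flags) & ~VM_SOFTDIRTY)
                      return 0;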
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Reported-by: Bastian Hougaard <gnome@rvzt.net>
      Reported-by: Mel Gorman <mgorman@suse.de>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/rmap: fix coccinelle warnings · 871beb8c
      Committed by Fengguang Wu
      mm/rmap.c:851:9-10: WARNING: return of 0/1 in function 'invalid_mkclean_vma' with return type bool
      
       Return statements in functions returning bool should use
       true/false instead of 1/0.
      
      Generated by: coccinelle/misc/boolreturn.cocci
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swapfile.c: do not skip lowest_bit in scan_swap_map() scan loop · a5998061
      Committed by Jamie Liu
      In the second half of scan_swap_map()'s scan loop, offset is set to
      si->lowest_bit and then incremented before entering the loop for the
      first time, causing si->swap_map[si->lowest_bit] to be skipped.
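      
      A simplified illustration of the off-by-one (not the actual
      scan_swap_map() code): pre-incrementing the cursor before the first test
      means the slot at lowest_bit is never examined.
      
              /* buggy pattern: the first slot checked is lowest_bit + 1 */
              offset = si->lowest_bit;
              while (++offset <= si->highest_bit) {
                      if (!si->swap_map[offset])
                              goto found;
              }
      
              /* fixed pattern: start the scan at lowest_bit itself */
              for (offset = si->lowest_bit; offset <= si->highest_bit; offset++) {
                      if (!si->swap_map[offset])
                              goto found;
              }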
      Signed-off-by: Jamie Liu <jamieliu@google.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: remove unused code from kmem_cache_destroy_work_func · 0d8a4a37
      Committed by Vladimir Davydov
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: improve documentation of page_order · 6c14466c
      Committed by Mel Gorman
      Developers occasionally try to optimise PFN scanners by using
      page_order but miss that, in general, it requires zone->lock.  This has
      happened twice for compaction.c and been rejected both times.  This patch
      clarifies the documentation of page_order and adds a note to
      compaction.c why page_order is not used.
      
      [akpm@linux-foundation.org: tweaks]
      [lauraa@codeaurora.org: Corrected a page_zone(page)->lock reference]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix css reference leak and endless loop in mem_cgroup_iter · 0eef6156
      Committed by Michal Hocko
      Commit 19f39402 ("memcg: simplify mem_cgroup_iter") has reorganized
      mem_cgroup_iter code in order to simplify it.  A part of that change was
      dropping an optimization which didn't call css_tryget on the root of the
      walked tree.  The patch however didn't change the css_put part in
      mem_cgroup_iter which excludes root.
      
      This wasn't an issue at the time because __mem_cgroup_iter_next bailed
      out for root early without taking a reference as cgroup iterators
      (css_next_descendant_pre) didn't visit root themselves.
      
      Nevertheless, cgroup iterators were reworked to visit root by commit
      bd8815a6 ("cgroup: make css_for_each_descendant() and friends
      include the origin css in the iteration"), when the root bypass was
      dropped in __mem_cgroup_iter_next.  This means that css_put is not
      called for root, and so the css, along with the mem_cgroup and other
      cgroup-internal objects tied to the css lifetime, are never freed.
      
      Fix the issue by reintroducing root check in __mem_cgroup_iter_next and
      do not take css reference for it.
      
      This reference-counting magic also protects us from another issue, an
      endless loop reported by Hugh Dickins when reclaim races with root
      removal and the css_tryget called internally by the iterator fails.  There
      would be no other nodes to visit so __mem_cgroup_iter_next would return
      NULL and mem_cgroup_iter would interpret it as "start looping from root
      again" and so mem_cgroup_iter would loop forever internally.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Hugh Dickins <hughd@google.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix endless loop caused by mem_cgroup_iter · ecc736fc
      Committed by Michal Hocko
      Hugh has reported an endless loop when the hardlimit reclaim sees the
      same group all the time.  This might happen when the reclaim races with
      the memcg removal.
      
      shrink_zone
                                                      [rmdir root]
        mem_cgroup_iter(root, NULL, reclaim)
          // prev = NULL
          rcu_read_lock()
          mem_cgroup_iter_load
            last_visited = iter->last_visited   // gets root || NULL
            css_tryget(last_visited)            // failed
            last_visited = NULL                 [1]
          memcg = root = __mem_cgroup_iter_next(root, NULL)
          mem_cgroup_iter_update
            iter->last_visited = root;
          reclaim->generation = iter->generation
      
       mem_cgroup_iter(root, root, reclaim)
         // prev = root
         rcu_read_lock
          mem_cgroup_iter_load
            last_visited = iter->last_visited   // gets root
            css_tryget(last_visited)            // failed
          [1]
      
      The issue seemed to be introduced by commit 5f578161 ("memcg: relax
      memcg iter caching") which has replaced unconditional css_get/css_put by
      css_tryget/css_put for the cached iterator.
      
      This patch fixes the issue by skipping css_tryget on the root of the
      tree walk in mem_cgroup_iter_load and symmetrically doesn't release it
      in mem_cgroup_iter_update.
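      
      A sketch of the fix as described above; names follow the 3.13-era
      mem_cgroup_iter code and may not match exactly:
      
              /* mem_cgroup_iter_load(): don't try to pin the root of the walk */
              position = iter->last_visited;
              if (position && position != root &&
                  !css_tryget(&position->css))
                      position = NULL;
      
              /* mem_cgroup_iter_update(): symmetrically, never drop a reference on root */
              if (last_visited && last_visited != root)
                      css_put(&last_visited->css);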
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Hugh Dickins <hughd@google.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>	[3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: prefer thread group leaders for display purposes · d49ad935
      Committed by David Rientjes
      When two threads have the same badness score, it's preferable to kill
      the thread group leader so that the actual process name is printed to
      the kernel log rather than the thread group name which may be shared
      amongst several processes.
      
      This was the behavior when select_bad_process() used to do
      for_each_process(), but it now iterates threads instead and leads to
      ambiguity.
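      
      Roughly how the selection loop changes (a sketch; the surrounding
      select_bad_process() logic is elided):
      
              points = oom_badness(p, NULL, nodemask, totalpages);
              if (!points || points < chosen_points)
                      continue;
              /* prefer thread group leaders so the real process name is logged */
              if (points == chosen_points && thread_group_leader(chosen))
                      continue;
              chosen = p;
              chosen_points = points;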
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: iteration skip memcgs not yet fully initialized · d8ad3055
      Committed by Hugh Dickins
      It is surprising that the mem_cgroup iterator can return memcgs which
      have not yet been fully initialized.  By accident (or trial and error?)
      this appears not to present an actual problem; but it may be better to
      prevent such surprises, by skipping memcgs not yet online.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: fix last_dead_count memory wastage · d2ab70aa
      Committed by Hugh Dickins
      Shorten mem_cgroup_reclaim_iter.last_dead_count from unsigned long to
      int: it's assigned from an int and compared with an int, and adjacent to
      an unsigned int: so there's no point to it being unsigned long, which
      wasted 104 bytes in every mem_cgroup_per_zone.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: audit/fix non-modular users of module_init in core code · a64fb3cd
      Committed by Paul Gortmaker
      Code that is obj-y (always built-in) or dependent on a bool Kconfig
      (built-in or absent) can never be modular.  So using module_init as an
      alias for __initcall can be somewhat misleading.
      
      Fix these up now, so that we can relocate module_init from init.h into
      module.h in the future.  If we don't do this, we'd have to add module.h
      to obviously non-modular code, and that would be a worse thing.
      
      The audit targets the following module_init users for change:
       mm/ksm.c                       bool KSM
       mm/mmap.c                      bool MMU
       mm/huge_memory.c               bool TRANSPARENT_HUGEPAGE
       mm/mmu_notifier.c              bool MMU_NOTIFIER
      
      Note that direct use of __initcall is discouraged, vs.  one of the
      priority categorized subgroups.  As __initcall gets mapped onto
      device_initcall, our use of subsys_initcall (which makes sense for these
      files) will thus change this registration from level 6-device to level
      4-subsys (i.e.  slightly earlier).
      
      However no observable impact of that difference has been observed during
      testing.
      
      One might think that core_initcall (l2) or postcore_initcall (l3) would
      be more appropriate for anything in mm/ but if we look at some actual
      init functions themselves, we see things like:
      
      mm/huge_memory.c --> hugepage_init     --> hugepage_init_sysfs
      mm/mmap.c        --> init_user_reserve --> sysctl_user_reserve_kbytes
      mm/ksm.c         --> ksm_init          --> sysfs_create_group
      
      and hence the choice of subsys_initcall (l4) seems reasonable, and at
      the same time minimizes the risk of changing the priority too
      drastically all at once.  We can adjust further in the future.
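      
      Using mm/ksm.c as the example, the change is essentially (a sketch):
      
              /* before: misleading, since KSM is a bool Kconfig and can never be a module */
              module_init(ksm_init)
      
              /* after: registered at subsys_initcall time (level 4) instead of
               * device_initcall time (level 6)
               */
              subsys_initcall(ksm_init);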
      
      Also, several instances of missing ";" at EOL are fixed.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mm_init.c: make creation of the mm_kobj happen earlier than device_initcall · da29bd36
      Committed by Paul Gortmaker
      The use of __initcall is to be eventually replaced by choosing one from
      the prioritized groupings laid out in init.h header:
      
      	pure_initcall               0
      	core_initcall               1
      	postcore_initcall           2
      	arch_initcall               3
      	subsys_initcall             4
      	fs_initcall                 5
      	device_initcall             6
      	late_initcall               7
      
      In the interim, all __initcall are mapped onto device_initcall, which as
      can be seen above, comes quite late in the ordering.
      
      Currently the mm_kobj is created with __initcall in mm_sysfs_init().
      This means that any other initcalls that want to reference the mm_kobj
      have to be device_initcall (or later), otherwise we will for example,
      trip the BUG_ON(!kobj) in sysfs's internal_create_group().  This
      unfairly restricts those users; for example something that clearly makes
      sense to be an arch_initcall will not be able to choose that.
      
      However, upon examination, it is only this way for historical reasons
      (i.e.  simply not reprioritized yet).  We see that sysfs is ready quite
      earlier in init/main.c via:
      
       vfs_caches_init
       |_ mnt_init
          |_ sysfs_init
      
      well ahead of the processing of the prioritized calls listed above.
      
      So we can recategorize mm_sysfs_init to be a pure_initcall, which in
      turn allows any mm_kobj initcall users a wider range (1 --> 7) of
      initcall priorities to choose from.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: show message when updating min_free_kbytes in thp · 42aa83cb
      Committed by Han Pingtian
      min_free_kbytes may be raised during THP's initialization.  Sometimes,
      this will change the value which was set by the user.  Showing a
      message when this happens avoids that confusion.
      
      Per Michal Hocko's suggestion, only show this message when changing a
      value which was set by the user.
      
      Per Dave Hansen's suggestion, show the old value of min_free_kbytes.
      This gives the user the chance to restore the old value of
      min_free_kbytes.
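      
      A sketch of the resulting message in set_recommended_min_free_kbytes(),
      assuming a user_min_free_kbytes variable that records whether the sysctl
      was ever written by the user:
      
              if (recommended_min > min_free_kbytes) {
                      if (user_min_free_kbytes >= 0)
                              pr_info("raising min_free_kbytes from %d to %lu to help transparent hugepage allocations\n",
                                      min_free_kbytes, recommended_min);
                      min_free_kbytes = recommended_min;
                      setup_per_zone_wmarks();
              }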
      Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug.c: move register_memory_resource out of the lock_memory_hotplug · ac13c462
      Committed by Nathan Zimmer
      We don't need to do register_memory_resource() under
      lock_memory_hotplug() since it has its own lock and doesn't make any
      callbacks.
      
      Also, register_memory_resource() returns NULL on failure, so we don't
      have anything to clean up at this point.
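      
      A sketch of the reordering in add_memory() (details approximated):
      
              /* before: resource registration happened inside the hotplug lock */
              lock_memory_hotplug();
              res = register_memory_resource(start, size);
      
              /* after: register first (it has its own locking), then take the lock */
              res = register_memory_resource(start, size);
              if (!res)
                      return -EEXIST;         /* nothing registered yet, nothing to clean up */
              lock_memory_hotplug();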
      
      The reason for this RFC is that I was doing some experiments with hotplugging
      of memory on some of our larger systems.  While it seems to work, it can
      be quite slow.  With some preliminary digging I found that
      lock_memory_hotplug is clearly ripe for breakup.
      
      It could be broken up per nid or something but it also covers the
      online_page_callback.  The online_page_callback shouldn't be very hard
      to break out.
      
      Also there is the issue of various structures (wmarks come to mind) that
      are only updated under lock_memory_hotplug and would need to be
      dealt with.
      
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Hedi <hedi@sgi.com>
      Cc: Mike Travis <travis@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>