1. 30 4月, 2013 18 次提交
    • J
      mm, vmalloc: move get_vmalloc_info() to vmalloc.c · db3808c1
      Joonsoo Kim 提交于
      Now get_vmalloc_info() is in fs/proc/mmu.c.  There is no reason that this
      code must be here and it's implementation needs vmlist_lock and it iterate
      a vmlist which may be internal data structure for vmalloc.
      
      It is preferable that vmlist_lock and vmlist is only used in vmalloc.c
      for maintainability. So move the code to vmalloc.c
      Signed-off-by: NJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Dave Anderson <anderson@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      db3808c1
    • D
      mm: make snapshotting pages for stable writes a per-bio operation · 71368511
      Darrick J. Wong 提交于
      Walking a bio's page mappings has proved problematic, so create a new
      bio flag to indicate that a bio's data needs to be snapshotted in order
      to guarantee stable pages during writeback.  Next, for the one user
      (ext3/jbd) of snapshotting, hook all the places where writes can be
      initiated without PG_writeback set, and set BIO_SNAP_STABLE there.
      
      We must also flag journal "metadata" bios for stable writeout, since
      file data can be written through the journal.  Finally, the
      MS_SNAP_STABLE mount flag (only used by ext3) is now superfluous, so get
      rid of it.
      
      [akpm@linux-foundation.org: rename _submit_bh()'s `flags' to `bio_flags', delobotomize the _submit_bh declaration]
      [akpm@linux-foundation.org: teeny cleanup]
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71368511
    • G
      mm/hugetlb: add more arch-defined huge_pte functions · 106c992a
      Gerald Schaefer 提交于
      Commit abf09bed ("s390/mm: implement software dirty bits")
      introduced another difference in the pte layout vs.  the pmd layout on
      s390, thoroughly breaking the s390 support for hugetlbfs.  This requires
      replacing some more pte_xxx functions in mm/hugetlbfs.c with a
      huge_pte_xxx version.
      
      This patch introduces those huge_pte_xxx functions and their generic
      implementation in asm-generic/hugetlb.h, which will now be included on
      all architectures supporting hugetlbfs apart from s390.  This change
      will be a no-op for those architectures.
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>	[for !s390 parts]
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      106c992a
    • M
      memcg: further simplify mem_cgroup_iter · 16248d8f
      Michal Hocko 提交于
      mem_cgroup_iter basically does two things currently.  It takes care of
      the house keeping (reference counting, raclaim cookie) and it iterates
      through a hierarchy tree (by using cgroup generic tree walk).  The code
      would be much more easier to follow if we move the iteration outside of
      the function (to __mem_cgrou_iter_next) so the distinction is more
      clear.  This patch doesn't introduce any functional changes.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16248d8f
    • M
      memcg: simplify mem_cgroup_iter · 19f39402
      Michal Hocko 提交于
      The current implementation of mem_cgroup_iter has to consider both css
      and memcg to find out whether no group has been found (css==NULL - aka
      the loop is completed) and that no memcg is associated with the found
      node (!memcg - aka css_tryget failed because the group is no longer
      alive).  This leads to awkward tweaks like tests for css && !memcg to
      skip the current node.
      
      It will be much easier if we got rid off css variable altogether and
      only rely on memcg.  In order to do that the iteration part has to skip
      dead nodes.  This sounds natural to me and as a nice side effect we will
      get a simple invariant that memcg is always alive when non-NULL and all
      nodes have been visited otherwise.
      
      We could get rid of the surrounding while loop but keep it in for now to
      make review easier.  It will go away in the following patch.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19f39402
    • M
      memcg: relax memcg iter caching · 5f578161
      Michal Hocko 提交于
      Now that the per-node-zone-priority iterator caches memory cgroups
      rather than their css ids we have to be careful and remove them from the
      iterator when they are on the way out otherwise they might live for
      unbounded amount of time even though their group is already gone (until
      the global/targeted reclaim triggers the zone under priority to find out
      the group is dead and let it to find the final rest).
      
      We can fix this issue by relaxing rules for the last_visited memcg.
      Instead of taking a reference to the css before it is stored into
      iter->last_visited we can just store its pointer and track the number of
      removed groups from each memcg's subhierarchy.
      
      This number would be stored into iterator everytime when a memcg is
      cached.  If the iter count doesn't match the curent walker root's one we
      will start from the root again.  The group counter is incremented
      upwards the hierarchy every time a group is removed.
      
      The iter_lock can be dropped because racing iterators cannot leak the
      reference anymore as the reference count is not elevated for
      last_visited when it is cached.
      
      Locking rules got a bit complicated by this change though.  The iterator
      primarily relies on rcu read lock which makes sure that once we see a
      valid last_visited pointer then it will be valid for the whole RCU walk.
      smp_rmb makes sure that dead_count is read before last_visited and
      last_dead_count while smp_wmb makes sure that last_visited is updated
      before last_dead_count so the up-to-date last_dead_count cannot point to
      an outdated last_visited.  css_tryget then makes sure that the
      last_visited is still alive in case the iteration races with the cached
      group removal (css is invalidated before mem_cgroup_css_offline
      increments dead_count).
      
      In short:
      mem_cgroup_iter
       rcu_read_lock()
       dead_count = atomic_read(parent->dead_count)
       smp_rmb()
       if (dead_count != iter->last_dead_count)
       	last_visited POSSIBLY INVALID -> last_visited = NULL
       if (!css_tryget(iter->last_visited))
       	last_visited DEAD -> last_visited = NULL
       next = find_next(last_visited)
       css_tryget(next)
       css_put(last_visited) 	// css would be invalidated and parent->dead_count
       			// incremented if this was the last reference
       iter->last_visited = next
       smp_wmb()
       iter->last_dead_count = dead_count
       rcu_read_unlock()
      
      cgroup_rmdir
       cgroup_destroy_locked
        atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail
         mem_cgroup_css_offline
          mem_cgroup_invalidate_reclaim_iterators
           while(parent = parent_mem_cgroup)
           	atomic_inc(parent->dead_count)
        css_put(css) // last reference held by cgroup core
      
      Spotted by Ying Han.
      
      Original idea from Johannes Weiner.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f578161
    • M
      memcg: rework mem_cgroup_iter to use cgroup iterators · 542f85f9
      Michal Hocko 提交于
      mem_cgroup_iter curently relies on css->id when walking down a group
      hierarchy tree.  This is really awkward because the tree walk depends on
      the groups creation ordering.  The only guarantee is that a parent node is
      visited before its children.
      
      Example:
      
       1) mkdir -p a a/d a/b/c
       2) mkdir -a a/b/c a/d
      
      Will create the same trees but the tree walks will be different:
      
       1) a, d, b, c
       2) a, b, c, d
      
      Commit 574bd9f7 ("cgroup: implement generic child / descendant walk
      macros") has introduced generic cgroup tree walkers which provide either
      pre-order or post-order tree walk.  This patch converts css->id based
      iteration to pre-order tree walk to keep the semantic with the original
      iterator where parent is always visited before its subtree.
      
      cgroup_for_each_descendant_pre suggests using post_create and
      pre_destroy for proper synchronization with groups addidition resp.
      removal.  This implementation doesn't use those because a new memory
      cgroup is initialized sufficiently for iteration in mem_cgroup_css_alloc
      already and css reference counting enforces that the group is alive for
      both the last seen cgroup and the found one resp.  it signals that the
      group is dead and it should be skipped.
      
      If the reclaim cookie is used we need to store the last visited group
      into the iterator so we have to be careful that it doesn't disappear in
      the mean time.  Elevated reference count on the css keeps it alive even
      though the group have been removed (parked waiting for the last dput so
      that it can be freed).
      
      Per node-zone-prio iter_lock has been introduced to ensure that
      css_tryget and iter->last_visited is set atomically.  Otherwise two
      racing walkers could both take a references and only one release it
      leading to a css leak (which pins cgroup dentry).
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      542f85f9
    • M
      memcg: keep prev's css alive for the whole mem_cgroup_iter · c40046f3
      Michal Hocko 提交于
      The patchset tries to make mem_cgroup_iter saner in the way how it walks
      hierarchies.  css->id based traversal is far from being ideal as it is not
      deterministic because it depends on the creation ordering.  Additional to
      that css_id is considered a burden for cgroup maintainers because it is
      quite some code and memcg is the last user of it.  After this series only
      the swap accounting uses css_id but that one will follow up later.
      
      Diffstat (if we exclude removed/added comments) looks quite
      promising. We got rid of some code:
      
        $ git diff mmotm... | grep -v "^[+-][[:space:]]*[/ ]\*" | diffstat
         b/include/linux/cgroup.h |    3 ---
         kernel/cgroup.c          |   33 ---------------------------------
         mm/memcontrol.c          |    4 +++-
         3 files changed, 3 insertions(+), 37 deletions(-)
      
      The first patch is just preparatory and it changes when we release css of
      the previously returned memcg.  Nothing controlversial.
      
      The second patch is the core of the patchset and it replaces css_get_next
      based on css_id by the generic cgroup pre-order.  This brings some
      chalanges for the last visited group caching during the reclaim
      (mem_cgroup_per_zone::reclaim_iter).  We have to use memcg pointers
      directly now which means that we have to keep a reference to those groups'
      css to keep them alive.
      
      I also folded iter_lock introduced by https://lkml.org/lkml/2013/1/3/295
      in the previous version into this patch.  Johannes felt the race I was
      describing should be mostly harmless and I haven't been able to trigger it
      so the lock doesn't deserve its own patch.  It is still needed
      temporarily, though, because the reference counting on iter->last_visited
      depends on it.  It will go away with the next patch.
      
      The next patch fixups an unbounded cgroup removal holdoff caused by the
      elevated css refcount.  The issue has been observed by Ying Han.  Johannes
      wasn't impressed by the previous version of the fix
      (https://lkml.org/lkml/2013/2/8/379) which cleaned up pending references
      during mem_cgroup_css_offline when a group is removed.  He has suggested a
      different way when the iterator checks whether a cached memcg is still
      valid or no.  More on that in the patch but the basic idea is that every
      memcg tracks the number removed subgroups and iterator records this number
      when a group is cached.  These numbers are checked before
      iter->last_visited is about to be used and the iteration is restarted if
      it is invalid.
      
      The fourth and fifth patches are an attempt for simplification of the
      mem_cgroup_iter.  css juggling is removed and the iteration logic is moved
      to a helper so that the reference counting and iteration are separated.
      
      The last patch just removes css_get_next as there is no user for it any
      longer.
      
      My testing looked as follows:
              A (use_hierarchy=1, limit_in_bytes=150M)
             /|\
            1 2 3
      
      Children groups were created so that the number is never higher than 3 and
      their limits were random between 50-100M.  Each group hosts a kernel build
      (starting with tar -xf so the tree is not shared and make -jNUM_CPUs/3)
      and terminated after random time - up to 5 minutes) and then it is
      removed.
      
      This should exercise both leaf and hierarchical reclaim as well as races
      with cgroup removals and debugging messages I added on top proved that.
      100 groups were created during the test.
      
      This patch:
      
      css reference counting keeps the cgroup alive even though it has been
      already removed.  mem_cgroup_iter relies on this fact and takes a
      reference to the returned group.  The reference is then released on the
      next iteration or mem_cgroup_iter_break.  mem_cgroup_iter currently
      releases the reference right after it gets the last css_id.
      
      This is correct because neither prev's memcg nor cgroup are accessed after
      then.  This will change in the next patch so we need to hold the group
      alive a bit longer so let's move the css_put at the end of the function.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c40046f3
    • J
      mm: introduce free_highmem_page() helper to free highmem pages into buddy system · cfa11e08
      Jiang Liu 提交于
      The original goal of this patchset is to fix the bug reported by
      
        https://bugzilla.kernel.org/show_bug.cgi?id=53501
      
      Now it has also been expanded to reduce common code used by memory
      initializion.
      
      This is the second part, which applies to the previous part at:
        http://marc.info/?l=linux-mm&m=136289696323825&w=2
      
      It introduces a helper function free_highmem_page() to free highmem
      pages into the buddy system when initializing mm subsystem.
      Introduction of free_highmem_page() is one step forward to clean up
      accesses and modificaitons of totalhigh_pages, totalram_pages and
      zone->managed_pages etc. I hope we could remove all references to
      totalhigh_pages from the arch/ subdirectory.
      
      We have only tested these patchset on x86 platforms, and have done basic
      compliation tests using cross-compilers from ftp.kernel.org. That means
      some code may not pass compilation on some architectures. So any help
      to test this patchset are welcomed!
      
      There are several other parts still under development:
      Part3: refine code to manage totalram_pages, totalhigh_pages and
      	zone->managed_pages
      Part4: introduce helper functions to simplify mem_init() and remove the
      	global variable num_physpages.
      
      This patch:
      
      Introduce helper function free_highmem_page(), which will be used by
      architectures with HIGHMEM enabled to free highmem pages into the buddy
      system.
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Suzuki K. Poulose" <suzuki@in.ibm.com>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Attilio Rao <attilio.rao@citrix.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Reviewed-by: NPekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cfa11e08
    • J
      mm: introduce common help functions to deal with reserved/managed pages · 69afade7
      Jiang Liu 提交于
      The original goal of this patchset is to fix the bug reported by
      
        https://bugzilla.kernel.org/show_bug.cgi?id=53501
      
      Now it has also been expanded to reduce common code used by memory
      initializion.
      
      This is the first part, which applies to v3.9-rc1.
      
      It introduces following common helper functions to simplify
      free_initmem() and free_initrd_mem() on different architectures:
      
      adjust_managed_page_count():
      	will be used to adjust totalram_pages, totalhigh_pages,
      	zone->managed_pages when reserving/unresering a page.
      
      __free_reserved_page():
      	free a reserved page into the buddy system without adjusting
      	page statistics info
      
      free_reserved_page():
      	free a reserved page into the buddy system and adjust page
      	statistics info
      
      mark_page_reserved():
      	mark a page as reserved and adjust page statistics info
      
      free_reserved_area():
      	free a continous ranges of pages by calling free_reserved_page()
      
      free_initmem_default():
      	default method to free __init pages.
      
      We have only tested these patchset on x86 platforms, and have done basic
      compliation tests using cross-compilers from ftp.kernel.org.  That means
      some code may not pass compilation on some architectures.  So any help to
      test this patchset are welcomed!
      
      There are several other parts still under development:
      Part2: introduce free_highmem_page() to simplify freeing highmem pages
      Part3: refine code to manage totalram_pages, totalhigh_pages and
      	zone->managed_pages
      Part4: introduce helper functions to simplify mem_init() and remove the
      	global variable num_physpages.
      
      This patch:
      
      Code to deal with reserved/managed pages are duplicated by many
      architectures, so introduce common help functions to reduce duplicated
      code.  These common help functions will also be used to concentrate code
      to modify totalram_pages and zone->managed_pages, which makes the code
      much more clear.
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Anatolij Gustschin <agust@denx.de>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.chen@sunplusct.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69afade7
    • H
      mm/vmscan.c: minor cleanup for kswapd · 2d42a40d
      Hillf Danton 提交于
      Local variable total_scanned is no longer used.
      Signed-off-by: NHillf Danton <dhillf@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d42a40d
    • T
      mm: walk_memory_range(): fix typo in comment · e05c4bbf
      Toshi Kani 提交于
      Fix a typo "end_pft" in the comment of walk_memory_range().
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e05c4bbf
    • V
      memblock: add assertion for zero allocation alignment · 94f3d3af
      Vineet Gupta 提交于
      This came to light when calling memblock allocator from arc port (for
      copying flattended DT).  If a "0" alignment is passed, the allocator
      round_up() call incorrectly rounds up the size to 0.
      
      round_up(num, alignto) => ((num - 1) | (alignto -1)) + 1
      
      While the obvious allocation failure causes kernel to panic, it is better
      to warn the caller to fix the code.
      
      Tejun suggested that instead of BUG_ON(!align) - which might be
      ineffective due to pending console init and such, it is better to WARN_ON,
      and continue the boot with a reasonable default align.
      
      Caller passing @size need not be handled similarly as the subsequent
      panic will indicate that anyhow.
      Signed-off-by: NVineet Gupta <vgupta@synopsys.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94f3d3af
    • H
      rmap: recompute pgoff for unmapping huge page · 369a713e
      Hillf Danton 提交于
      We have to recompute pgoff if the given page is huge, since result based
      on HPAGE_SIZE is not approapriate for scanning the vma interval tree, as
      shown by commit 36e4f20a ("hugetlb: do not use vma_hugecache_offset()
      for vma_prio_tree_foreach").
      Signed-off-by: NHillf Danton <dhillf@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Michel Lespinasse <walken@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      369a713e
    • A
      mm/shmem.c: remove an ifdef · 250297ed
      Andrew Morton 提交于
      Create a CONFIG_MMU=y stub for ramfs_nommu_expand_for_mapping() in the
      usual fashion.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Wolfram Sang <wsa@the-dreams.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      250297ed
    • D
      mm, show_mem: suppress page counts in non-blockable contexts · 4b59e6c4
      David Rientjes 提交于
      On large systems with a lot of memory, walking all RAM to determine page
      types may take a half second or even more.
      
      In non-blockable contexts, the page allocator will emit a page allocation
      failure warning unless __GFP_NOWARN is specified.  In such contexts, irqs
      are typically disabled and such a lengthy delay may even result in NMI
      watchdog timeouts.
      
      To fix this, suppress the page walk in such contexts when printing the
      page allocation failure warning.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b59e6c4
    • R
      mm: trace filemap add and del · fe0bfaaf
      Robert Jarzmik 提交于
      Use the events API to trace filemap loading and unloading of file pieces
      into the page cache.
      
      This patch aims at tracing the eviction reload cycle of executable and
      shared libraries pages in a memory constrained environment.
      
      The typical usage is to spot a specific device and inode (for example
      /lib/libc.so) to see the eviction cycles, and find out if frequently
      used code is rather spread across many pages (bad) or coallesced (good).
      Signed-off-by: NRobert Jarzmik <robert.jarzmik@free.fr>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe0bfaaf
    • N
      HWPOISON: check dirty flag to match against clean page · e3986295
      Naoya Horiguchi 提交于
      Currently page_action() does not check dirty flag to determine whether
      the error page is "clean mlocked/unevictable LRU" page.  This doesn't
      cause any misjudgement because we do matching against "dirty
      mlocked/unevictable LRU" just before the check.  But in order to make
      code consistent and/or to avoid potential regression, we had better
      check dirty flag explicitly.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Suggested-by: NChen Gong <gong.chen@linux.intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e3986295
  2. 28 4月, 2013 1 次提交
    • L
      vm: add no-mmu vm_iomap_memory() stub · 3c0b9de6
      Linus Torvalds 提交于
      I think we could just move the full vm_iomap_memory() function into
      util.h or similar, but I didn't get any reply from anybody actually
      using nommu even to this trivial patch, so I'm not going to touch it any
      more than required.
      
      Here's the fairly minimal stub to make the nommu case at least
      potentially work.  It doesn't seem like anybody cares, though.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3c0b9de6
  3. 18 4月, 2013 2 次提交
  4. 17 4月, 2013 1 次提交
    • L
      vm: add vm_iomap_memory() helper function · b4cbb197
      Linus Torvalds 提交于
      Various drivers end up replicating the code to mmap() their memory
      buffers into user space, and our core memory remapping function may be
      very flexible but it is unnecessarily complicated for the common cases
      to use.
      
      Our internal VM uses pfn's ("page frame numbers") which simplifies
      things for the VM, and allows us to pass physical addresses around in a
      denser and more efficient format than passing a "phys_addr_t" around,
      and having to shift it up and down by the page size.  But it just means
      that drivers end up doing that shifting instead at the interface level.
      
      It also means that drivers end up mucking around with internal VM things
      like the vma details (vm_pgoff, vm_start/end) way more than they really
      need to.
      
      So this just exports a function to map a certain physical memory range
      into user space (using a phys_addr_t based interface that is much more
      natural for a driver) and hides all the complexity from the driver.
      Some drivers will still end up tweaking the vm_page_prot details for
      things like prefetching or cacheability etc, but that's actually
      relevant to the driver, rather than caring about what the page offset of
      the mapping is into the particular IO memory region.
      Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4cbb197
  5. 13 4月, 2013 1 次提交
    • D
      x86-32: Fix possible incomplete TLB invalidate with PAE pagetables · 1de14c3c
      Dave Hansen 提交于
      This patch attempts to fix:
      
      	https://bugzilla.kernel.org/show_bug.cgi?id=56461
      
      The symptom is a crash and messages like this:
      
      	chrome: Corrupted page table at address 34a03000
      	*pdpt = 0000000000000000 *pde = 0000000000000000
      	Bad pagetable: 000f [#1] PREEMPT SMP
      
      Ingo guesses this got introduced by commit 611ae8e3 ("x86/tlb:
      enable tlb flush range support for x86") since that code started to free
      unused pagetables.
      
      On x86-32 PAE kernels, that new code has the potential to free an entire
      PMD page and will clear one of the four page-directory-pointer-table
      (aka pgd_t entries).
      
      The hardware aggressively "caches" these top-level entries and invlpg
      does not actually affect the CPU's copy.  If we clear one we *HAVE* to
      do a full TLB flush, otherwise we might continue using a freed pmd page.
      (note, we do this properly on the population side in pud_populate()).
      
      This patch tracks whenever we clear one of these entries in the 'struct
      mmu_gather', and ensures that we follow up with a full tlb flush.
      
      BTW, I disassembled and checked that:
      
      	if (tlb->fullmm == 0)
      and
      	if (!tlb->fullmm && !tlb->need_flush_all)
      
      generate essentially the same code, so there should be zero impact there
      to the !PAE case.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Artem S Tashkinov <t.artem@mailcity.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1de14c3c
  6. 05 4月, 2013 1 次提交
    • J
      mm: prevent mmap_cache race in find_vma() · b6a9b7f6
      Jan Stancek 提交于
      find_vma() can be called by multiple threads with read lock
      held on mm->mmap_sem and any of them can update mm->mmap_cache.
      Prevent compiler from re-fetching mm->mmap_cache, because other
      readers could update it in the meantime:
      
                     thread 1                             thread 2
                                              |
        find_vma()                            |  find_vma()
          struct vm_area_struct *vma = NULL;  |
          vma = mm->mmap_cache;               |
          if (!(vma && vma->vm_end > addr     |
              && vma->vm_start <= addr)) {    |
                                              |    mm->mmap_cache = vma;
          return vma;                         |
           ^^ compiler may optimize this      |
              local variable out and re-read  |
              mm->mmap_cache                  |
      
      This issue can be reproduced with gcc-4.8.0-1 on s390x by running
      mallocstress testcase from LTP, which triggers:
      
        kernel BUG at mm/rmap.c:1088!
          Call Trace:
           ([<000003d100c57000>] 0x3d100c57000)
            [<000000000023a1c0>] do_wp_page+0x2fc/0xa88
            [<000000000023baae>] handle_pte_fault+0x41a/0xac8
            [<000000000023d832>] handle_mm_fault+0x17a/0x268
            [<000000000060507a>] do_protection_exception+0x1e2/0x394
            [<0000000000603a04>] pgm_check_handler+0x138/0x13c
            [<000003fffcf1f07a>] 0x3fffcf1f07a
          Last Breaking-Event-Address:
            [<000000000024755e>] page_add_new_anon_rmap+0xc2/0x168
      
      Thanks to Jakub Jelinek for his insight on gcc and helping to
      track this down.
      Signed-off-by: NJan Stancek <jstancek@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6a9b7f6
  7. 29 3月, 2013 2 次提交
  8. 23 3月, 2013 2 次提交
  9. 15 3月, 2013 1 次提交
    • M
      mm/fremap.c: fix possible oops on error path · a2362d24
      Michel Lespinasse 提交于
      The vm_flags introduced in 6d7825b1 ("mm/fremap.c: fix oops on error
      path") is supposed to avoid a compiler warning about unitialized
      vm_flags without changing the generated code.
      
      However I am concerned that this is going to be very brittle, and fail
      with some compiler versions. The failure could be either of:
      
      - compiler could actually load vma->vm_flags before checking for the
        !vma condition, thus reintroducing the oops
      
      - compiler could optimize out the !vma check, since the pointer just got
        dereferenced shortly before (so the compiler knows it can't be NULL!)
      
      I propose reversing this part of the change and initializing vm_flags to 0
      just to avoid the bogus uninitialized use warning.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Tommi Rantala <tt.rantala@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2362d24
  10. 14 3月, 2013 2 次提交
  11. 13 3月, 2013 2 次提交
    • S
      Select VIRT_TO_BUS directly where needed · 4febd95a
      Stephen Rothwell 提交于
      In commit 887cbce0 ("arch Kconfig: centralise ARCH_NO_VIRT_TO_BUS")
      I introduced the config sybmol HAVE_VIRT_TO_BUS and selected that where
      needed.  I am not sure what I was thinking.  Instead, just directly
      select VIRT_TO_BUS where it is needed.
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4febd95a
    • M
      Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys · 8aec0f5d
      Mathieu Desnoyers 提交于
      Looking at mm/process_vm_access.c:process_vm_rw() and comparing it to
      compat_process_vm_rw() shows that the compatibility code requires an
      explicit "access_ok()" check before calling
      compat_rw_copy_check_uvector(). The same difference seems to appear when
      we compare fs/read_write.c:do_readv_writev() to
      fs/compat.c:compat_do_readv_writev().
      
      This subtle difference between the compat and non-compat requirements
      should probably be debated, as it seems to be error-prone. In fact,
      there are two others sites that use this function in the Linux kernel,
      and they both seem to get it wrong:
      
      Now shifting our attention to fs/aio.c, we see that aio_setup_iocb()
      also ends up calling compat_rw_copy_check_uvector() through
      aio_setup_vectored_rw(). Unfortunately, the access_ok() check appears to
      be missing. Same situation for
      security/keys/compat.c:compat_keyctl_instantiate_key_iov().
      
      I propose that we add the access_ok() check directly into
      compat_rw_copy_check_uvector(), so callers don't have to worry about it,
      and it therefore makes the compat call code similar to its non-compat
      counterpart. Place the access_ok() check in the same location where
      copy_from_user() can trigger a -EFAULT error in the non-compat code, so
      the ABI behaviors are alike on both compat and non-compat.
      
      While we are here, fix compat_do_readv_writev() so it checks for
      compat_rw_copy_check_uvector() negative return values.
      
      And also, fix a memory leak in compat_keyctl_instantiate_key_iov() error
      handling.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8aec0f5d
  12. 09 3月, 2013 4 次提交
  13. 03 3月, 2013 1 次提交
    • Y
      x86, ACPI, mm: Revert movablemem_map support · 20e6926d
      Yinghai Lu 提交于
      Tim found:
      
        WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
        Hardware name: S2600CP
        sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
        smpboot: Booting Node   1, Processors  #1
        Modules linked in:
        Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
        Call Trace:
          set_cpu_sibling_map+0x279/0x449
          start_secondary+0x11d/0x1e5
      
      Don Morris reproduced on a HP z620 workstation, and bisected it to
      commit e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock
      is ready")
      
      It turns out movable_map has some problems, and it breaks several things
      
      1. numa_init is called several times, NOT just for srat. so those
      	nodes_clear(numa_nodes_parsed)
      	memset(&numa_meminfo, 0, sizeof(numa_meminfo))
         can not be just removed.  Need to consider sequence is: numaq, srat, amd, dummy.
         and make fall back path working.
      
      2. simply split acpi_numa_init to early_parse_srat.
         a. that early_parse_srat is NOT called for ia64, so you break ia64.
         b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
      	     set_apicid_to_node(i, NUMA_NO_NODE)
           still left in numa_init. So it will just clear result from early_parse_srat.
           it should be moved before that....
         c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
             early before override from INITRD is settled.
      
      3. that patch TITLE is total misleading, there is NO x86 in the title,
         but it changes critical x86 code. It caused x86 guys did not
         pay attention to find the problem early. Those patches really should
         be routed via tip/x86/mm.
      
      4. after that commit, following range can not use movable ram:
        a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
        b. initrd... it will be freed after booting, so it could be on movable...
        c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
      	anymore.
        d. init_mem_mapping: can not put page table high anymore.
        e. initmem_init: vmemmap can not be high local node anymore. That is
           not good.
      
      If node is hotplugable, the mem related range like page table and
      vmemmap could be on the that node without problem and should be on that
      node.
      
      We have workaround patch that could fix some problems, but some can not
      be fixed.
      
      So just remove that offending commit and related ones including:
      
       f7210e6c ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
          protect movablecore_map in memblock_overlaps_region().")
      
       01a178a9 ("acpi, memory-hotplug: support getting hotplug info from
          SRAT")
      
       27168d38 ("acpi, memory-hotplug: extend movablemem_map ranges to
          the end of node")
      
       e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock is
          ready")
      
       fb06bc8e ("page_alloc: bootmem limit with movablecore_map")
      
       42f47e27 ("page_alloc: make movablemem_map have higher priority")
      
       6981ec31 ("page_alloc: introduce zone_movable_limit[] to keep
          movable limit for nodes")
      
       34b71f1e ("page_alloc: add movable_memmap kernel parameter")
      
       4d59a751 ("x86: get pg_data_t's memory from other node")
      
      Later we should have patches that will make sure kernel put page table
      and vmemmap on local node ram instead of push them down to node0.  Also
      need to find way to put other kernel used ram to local node ram.
      Reported-by: NTim Gardner <tim.gardner@canonical.com>
      Reported-by: NDon Morris <don.morris@hp.com>
      Bisected-by: NDon Morris <don.morris@hp.com>
      Tested-by: NDon Morris <don.morris@hp.com>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20e6926d
  14. 02 3月, 2013 1 次提交
  15. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d