1. 13 1月, 2012 13 次提交
    • K
      memcg: make mem_cgroup_split_huge_fixup() more efficient · e94c8a9c
      KAMEZAWA Hiroyuki 提交于
      In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle
      page_cgroup modifcations.  It takes move_lock_page_cgroup() and modifies
      page_cgroup and LRU accounting jobs and called HPAGE_PMD_SIZE - 1 times.
      
      But thinking again,
        - compound_lock() is held at move_accout...then, it's not necessary
          to take move_lock_page_cgroup().
        - LRU is locked and all tail pages will go into the same LRU as
          head is now on.
        - page_cgroup is contiguous in huge page range.
      
      This patch fixes mem_cgroup_split_huge_fixup() as to be called once per
      hugepage and reduce costs for spliting.
      
      [akpm@linux-foundation.org: fix typo, per Michal]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e94c8a9c
    • J
      mm: memcg: remove unused node/section info from pc->flags · 6b208e3f
      Johannes Weiner 提交于
      To find the page corresponding to a certain page_cgroup, the pc->flags
      encoded the node or section ID with the base array to compare the pc
      pointer to.
      
      Now that the per-memory cgroup LRU lists link page descriptors directly,
      there is no longer any code that knows the struct page_cgroup of a PFN
      but not the struct page.
      
      [hughd@google.com: remove unused node/section info from pc->flags fix]
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b208e3f
    • J
      mm: make per-memcg LRU lists exclusive · 925b7673
      Johannes Weiner 提交于
      Now that all code that operated on global per-zone LRU lists is
      converted to operate on per-memory cgroup LRU lists instead, there is no
      reason to keep the double-LRU scheme around any longer.
      
      The pc->lru member is removed and page->lru is linked directly to the
      per-memory cgroup LRU lists, which removes two pointers from a
      descriptor that exists for every page frame in the system.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NYing Han <yinghan@google.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      925b7673
    • J
      mm: collect LRU list heads into struct lruvec · 6290df54
      Johannes Weiner 提交于
      Having a unified structure with a LRU list set for both global zones and
      per-memcg zones allows to keep that code simple which deals with LRU
      lists and does not care about the container itself.
      
      Once the per-memcg LRU lists directly link struct pages, the isolation
      function and all other list manipulations are shared between the memcg
      case and the global LRU case.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6290df54
    • J
      mm: move memcg hierarchy reclaim to generic reclaim code · 5660048c
      Johannes Weiner 提交于
      Memory cgroup limit reclaim and traditional global pressure reclaim will
      soon share the same code to reclaim from a hierarchical tree of memory
      cgroups.
      
      In preparation of this, move the two right next to each other in
      shrink_zone().
      
      The mem_cgroup_hierarchical_reclaim() polymath is split into a soft
      limit reclaim function, which still does hierarchy walking on its own,
      and a limit (shrinking) reclaim function, which relies on generic
      reclaim code to walk the hierarchy.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5660048c
    • K
      memcg: add mem_cgroup_replace_page_cache() to fix LRU issue · ab936cbc
      KAMEZAWA Hiroyuki 提交于
      Commit ef6a3c63 ("mm: add replace_page_cache_page() function") added a
      function replace_page_cache_page().  This function replaces a page in the
      radix-tree with a new page.  WHen doing this, memory cgroup needs to fix
      up the accounting information.  memcg need to check PCG_USED bit etc.
      
      In some(many?) cases, 'newpage' is on LRU before calling
      replace_page_cache().  So, memcg's LRU accounting information should be
      fixed, too.
      
      This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
       In that function, old pages will be unaccounted without touching
      res_counter and new page will be accounted to the memcg (of old page).
      WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
      races with LRU handling.
      
      Background:
        replace_page_cache_page() is called by FUSE code in its splice() handling.
        Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
        page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
        because rmdir() checks the whole LRU is empty and there is no account leak.
        If a page is on the other LRU than it should be, rmdir() will fail.
      
      This bug was added in March 2011, but no bug report yet.  I guess there
      are not many people who use memcg and FUSE at the same time with upstream
      kernels.
      
      The result of this bug is that admin cannot destroy a memcg because of
      account leak.  So, no panic, no deadlock.  And, even if an active cgroup
      exist, umount can succseed.  So no problem at shutdown.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab936cbc
    • J
      epoll: limit paths · 28d82dc1
      Jason Baron 提交于
      The current epoll code can be tickled to run basically indefinitely in
      both loop detection path check (on ep_insert()), and in the wakeup paths.
      The programs that tickle this behavior set up deeply linked networks of
      epoll file descriptors that cause the epoll algorithms to traverse them
      indefinitely.  A couple of these sample programs have been previously
      posted in this thread: https://lkml.org/lkml/2011/2/25/297.
      
      To fix the loop detection path check algorithms, I simply keep track of
      the epoll nodes that have been already visited.  Thus, the loop detection
      becomes proportional to the number of epoll file descriptor and links.
      This dramatically decreases the run-time of the loop check algorithm.  In
      one diabolical case I tried it reduced the run-time from 15 mintues (all
      in kernel time) to .3 seconds.
      
      Fixing the wakeup paths could be done at wakeup time in a similar manner
      by keeping track of nodes that have already been visited, but the
      complexity is harder, since there can be multiple wakeups on different
      cpus...Thus, I've opted to limit the number of possible wakeup paths when
      the paths are created.
      
      This is accomplished, by noting that the end file descriptor points that
      are found during the loop detection pass (from the newly added link), are
      actually the sources for wakeup events.  I keep a list of these file
      descriptors and limit the number and length of these paths that emanate
      from these 'source file descriptors'.  In the current implemetation I
      allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
      length 4 and 10 of length 5.  Note that it is sufficient to check the
      'source file descriptors' reachable from the newly added link, since no
      other 'source file descriptors' will have newly added links.  This allows
      us to check only the wakeup paths that may have gotten too long, and not
      re-check all possible wakeup paths on the system.
      
      In terms of the path limit selection, I think its first worth noting that
      the most common case for epoll, is probably the model where you have 1
      epoll file descriptor that is monitoring n number of 'source file
      descriptors'.  In this case, each 'source file descriptor' has a 1 path of
      length 1.  Thus, I believe that the limits I'm proposing are quite
      reasonable and in fact may be too generous.  Thus, I'm hoping that the
      proposed limits will not prevent any workloads that currently work to
      fail.
      
      In terms of locking, I have extended the use of the 'epmutex' to all
      epoll_ctl add and remove operations.  Currently its only used in a subset
      of the add paths.  I need to hold the epmutex, so that we can correctly
      traverse a coherent graph, to check the number of paths.  I believe that
      this additional locking is probably ok, since its in the setup/teardown
      paths, and doesn't affect the running paths, but it certainly is going to
      add some extra overhead.  Also, worth noting is that the epmuex was
      recently added to the ep_ctl add operations in the initial path loop
      detection code using the argument that it was not on a critical path.
      
      Another thing to note here, is the length of epoll chains that is allowed.
      Currently, eventpoll.c defines:
      
      /* Maximum number of nesting allowed inside epoll sets */
      #define EP_MAX_NESTS 4
      
      This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
      + 1).  However, this limit is currently only enforced during the loop
      check detection code, and only when the epoll file descriptors are added
      in a certain order.  Thus, this limit is currently easily bypassed.  The
      newly added check for wakeup paths, stricly limits the wakeup paths to a
      length of 5, regardless of the order in which ep's are linked together.
      Thus, a side-effect of the new code is a more consistent enforcement of
      the graph depth.
      
      Thus far, I've tested this, using the sample programs previously
      mentioned, which now either return quickly or return -EINVAL.  I've also
      testing using the piptest.c epoll tester, which showed no difference in
      performance.  I've also created a number of different epoll networks and
      tested that they behave as expectded.
      
      I believe this solves the original diabolical test cases, while still
      preserving the sane epoll nesting.
      Signed-off-by: NJason Baron <jbaron@redhat.com>
      Cc: Nelson Elhage <nelhage@ksplice.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28d82dc1
    • H
      mm,slub,x86: decouple size of struct page from CONFIG_CMPXCHG_LOCAL · 43570fd2
      Heiko Carstens 提交于
      While implementing cmpxchg_double() on s390 I realized that we don't set
      CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.
      
      However setting that option will increase the size of struct page by
      eight bytes on 64 bit, which we certainly do not want.  Also, it doesn't
      make sense that a present cpu feature should increase the size of struct
      page.
      
      Besides that it looks like the dependency to CMPXCHG_LOCAL is wrong and
      that it should depend on CMPXCHG_DOUBLE instead.
      
      This patch:
      
      If an architecture supports CMPXCHG_LOCAL this shouldn't result
      automatically in larger struct pages if the SLUB allocator is used.
      Instead introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which
      can be selected if a double word aligned struct page is required.  Also
      update x86 Kconfig so that it should work as before.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43570fd2
    • J
      include/linux/linkage.h: remove unused ATTRIB_NORET macro · 0d259cf8
      Joe Perches 提交于
      The uses have been renamed so delete the unused macro.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d259cf8
    • J
      treewide: convert uses of ATTRIB_NORETURN to __noreturn · ff2d8b19
      Joe Perches 提交于
      Use the more commonly used __noreturn instead of ATTRIB_NORETURN.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NJoe Perches <joe@perches.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: NRalf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff2d8b19
    • J
      treewide: remove useless NORET_TYPE macro and uses · 9402c95f
      Joe Perches 提交于
      It's a very old and now unused prototype marking so just delete it.
      
      Neaten panic pointer argument style to keep checkpatch quiet.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: NRalf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9402c95f
    • J
      include/linux/linkage.h: remove unused NORET_AND macro · 80bf007f
      Joe Perches 提交于
      The only use in kernel.h is gone so remove the macro.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80bf007f
    • J
      kernel.h: neaten panic prototype · 4da47859
      Joe Perches 提交于
      Use __printf macro.
      Convert NORET_AND to ATTRIB_NORET.
      Use the normal kernel style for pointer arguments.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4da47859
  2. 11 1月, 2012 18 次提交
    • T
      workqueue: make alloc_workqueue() take printf fmt and args for name · b196be89
      Tejun Heo 提交于
      alloc_workqueue() currently expects the passed in @name pointer to remain
      accessible.  This is inconvenient and a bit silly given that the whole wq
      is being dynamically allocated.  This patch updates alloc_workqueue() and
      friends to take printf format string instead of opaque string and matching
      varargs at the end.  The name is allocated together with the wq and
      formatted.
      
      alloc_ordered_workqueue() is converted to a macro to unify varargs
      handling with alloc_workqueue(), and, while at it, add comment to
      alloc_workqueue().
      
      None of the current in-kernel users pass in string with '%' as constant
      name and this change shouldn't cause any problem.
      
      [akpm@linux-foundation.org: use __printf]
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Suggested-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b196be89
    • V
      procfs: add hidepid= and gid= mount options · 0499680a
      Vasiliy Kulikov 提交于
      Add support for mount options to restrict access to /proc/PID/
      directories.  The default backward-compatible "relaxed" behaviour is left
      untouched.
      
      The first mount option is called "hidepid" and its value defines how much
      info about processes we want to be available for non-owners:
      
      hidepid=0 (default) means the old behavior - anybody may read all
      world-readable /proc/PID/* files.
      
      hidepid=1 means users may not access any /proc/<pid>/ directories, but
      their own.  Sensitive files like cmdline, sched*, status are now protected
      against other users.  As permission checking done in proc_pid_permission()
      and files' permissions are left untouched, programs expecting specific
      files' modes are not confused.
      
      hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
      users.  It doesn't mean that it hides whether a process exists (it can be
      learned by other means, e.g.  by kill -0 $PID), but it hides process' euid
      and egid.  It compicates intruder's task of gathering info about running
      processes, whether some daemon runs with elevated privileges, whether
      another user runs some sensitive program, whether other users run any
      program at all, etc.
      
      gid=XXX defines a group that will be able to gather all processes' info
      (as in hidepid=0 mode).  This group should be used instead of putting
      nonroot user in sudoers file or something.  However, untrusted users (like
      daemons, etc.) which are not supposed to monitor the tasks in the whole
      system should not be added to the group.
      
      hidepid=1 or higher is designed to restrict access to procfs files, which
      might reveal some sensitive private information like precise keystrokes
      timings:
      
      http://www.openwall.com/lists/oss-security/2011/11/05/3
      
      hidepid=1/2 doesn't break monitoring userspace tools.  ps, top, pgrep, and
      conky gracefully handle EPERM/ENOENT and behave as if the current user is
      the only user running processes.  pstree shows the process subtree which
      contains "pstree" process.
      
      Note: the patch doesn't deal with setuid/setgid issues of keeping
      preopened descriptors of procfs files (like
      https://lkml.org/lkml/2011/2/7/368).  We rely on that the leaked
      information like the scheduling counters of setuid apps doesn't threaten
      anybody's privacy - only the user started the setuid program may read the
      counters.
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Theodore Tso <tytso@MIT.EDU>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: James Morris <jmorris@namei.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0499680a
    • P
      procfs: introduce the /proc/<pid>/map_files/ directory · 640708a2
      Pavel Emelyanov 提交于
      This one behaves similarly to the /proc/<pid>/fd/ one - it contains
      symlinks one for each mapping with file, the name of a symlink is
      "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
      results in a file that point exactly to the same inode as them vma's one.
      
      For example the ls -l of some arbitrary /proc/<pid>/map_files/
      
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
      
      This *helps* checkpointing process in three ways:
      
      1. When dumping a task mappings we do know exact file that is mapped
         by particular region.  We do this by opening
         /proc/$pid/map_files/$address symlink the way we do with file
         descriptors.
      
      2. This also helps in determining which anonymous shared mappings are
         shared with each other by comparing the inodes of them.
      
      3. When restoring a set of processes in case two of them has a mapping
         shared, we map the memory by the 1st one and then open its
         /proc/$pid/map_files/$address file and map it by the 2nd task.
      
      Using /proc/$pid/maps for this is quite inconvenient since it brings
      repeatable re-reading and reparsing for this text file which slows down
      restore procedure significantly.  Also as being pointed in (3) it is a way
      easier to use top level shared mapping in children as
      /proc/$pid/map_files/$address when needed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: NVasiliy Kulikov <segoon@openwall.com>
      Reviewed-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      640708a2
    • C
      procfs: make proc_get_link to use dentry instead of inode · 7773fbc5
      Cyrill Gorcunov 提交于
      Prepare the ground for the next "map_files" patch which needs a name of a
      link file to analyse.
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7773fbc5
    • M
      signal: add block_sigmask() for adding sigmask to current->blocked · 5e6292c0
      Matt Fleming 提交于
      Abstract the code sequence for adding a signal handler's sa_mask to
      current->blocked because the sequence is identical for all architectures.
      Furthermore, in the past some architectures actually got this code wrong,
      so introduce a wrapper that all architectures can use.
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e6292c0
    • N
      leds: add driver for TCA6507 LED controller · a6d511e5
      NeilBrown 提交于
      TI's TCA6507 is the LED driver in the GTA04 Openmoko motherboard.  The
      driver provides full support for brightness levels and hardware blinking.
      
      This driver can drive each of 7 outputs as an LED or a GPIO output,
      and provides hardware-assist blinking.
      
      [akpm@linux-foundation.org: fix __mod_i2c_device_table alias]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6d511e5
    • K
      mm/mempolicy.c: mpol_equal(): use bool · fcfb4dcc
      KOSAKI Motohiro 提交于
      mpol_equal() logically returns a boolean.  Use a bool type to slightly
      improve readability.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Stephen Wilson <wilsons@start.ca>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcfb4dcc
    • A
      mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma() · 948f017b
      Andrea Arcangeli 提交于
      migrate was doing an rmap_walk with speculative lock-less access on
      pagetables.  That could lead it to not serializing properly against mremap
      PT locks.  But a second problem remains in the order of vmas in the
      same_anon_vma list used by the rmap_walk.
      
      If vma_merge succeeds in copy_vma, the src vma could be placed after the
      dst vma in the same_anon_vma list.  That could still lead to migrate
      missing some pte.
      
      This patch adds an anon_vma_moveto_tail() function to force the dst vma at
      the end of the list before mremap starts to solve the problem.
      
      If the mremap is very large and there are a lots of parents or childs
      sharing the anon_vma root lock, this should still scale better than taking
      the anon_vma root lock around every pte copy practically for the whole
      duration of mremap.
      
      Update: Hugh noticed special care is needed in the error path where
      move_page_tables goes in the reverse direction, a second
      anon_vma_moveto_tail() call is needed in the error path.
      
      This program exercises the anon_vma_moveto_tail:
      
      ===
      
      int main()
      {
      	static struct timeval oldstamp, newstamp;
      	long diffsec;
      	char *p, *p2, *p3, *p4;
      	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      
      	memset(p, 0xff, SIZE);
      	printf("%p\n", p);
      	memset(p2, 0xff, SIZE);
      	memset(p3, 0x77, 4096);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
      	if (p4 != p3)
      		perror("mremap"), exit(1);
      	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
      	if (p4 != p+SIZE/2)
      		perror("mremap"), exit(1);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	printf("ok\n");
      
      	return 0;
      }
      ===
      
      $ perf probe -a anon_vma_moveto_tail
      Add new event:
        probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
      
      You can now use it on all perf tools, such as:
      
              perf record -e probe:anon_vma_moveto_tail -aR sleep 1
      
      $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
      0x7f2ca2800000
      ok
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
      $ perf report --stdio
         100.00%  anon_vma_moveto  [kernel.kallsyms]  [k] anon_vma_moveto_tail
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NNai Xia <nai.xia@gmail.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pawel Sikora <pluto@agmk.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      948f017b
    • J
      mm: try to distribute dirty pages fairly across zones · a756cf59
      Johannes Weiner 提交于
      The maximum number of dirty pages that exist in the system at any time is
      determined by a number of pages considered dirtyable and a user-configured
      percentage of those, or an absolute number in bytes.
      
      This number of dirtyable pages is the sum of memory provided by all the
      zones in the system minus their lowmem reserves and high watermarks, so
      that the system can retain a healthy number of free pages without having
      to reclaim dirty pages.
      
      But there is a flaw in that we have a zoned page allocator which does not
      care about the global state but rather the state of individual memory
      zones.  And right now there is nothing that prevents one zone from filling
      up with dirty pages while other zones are spared, which frequently leads
      to situations where kswapd, in order to restore the watermark of free
      pages, does indeed have to write pages from that zone's LRU list.  This
      can interfere so badly with IO from the flusher threads that major
      filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
      already, taking away the VM's only possibility to keep such a zone
      balanced, aside from hoping the flushers will soon clean pages from that
      zone.
      
      Enter per-zone dirty limits.  They are to a zone's dirtyable memory what
      the global limit is to the global amount of dirtyable memory, and try to
      make sure that no single zone receives more than its fair share of the
      globally allowed dirty pages in the first place.  As the number of pages
      considered dirtyable excludes the zones' lowmem reserves and high
      watermarks, the maximum number of dirty pages in a zone is such that the
      zone can always be balanced without requiring page cleaning.
      
      As this is a placement decision in the page allocator and pages are
      dirtied only after the allocation, this patch allows allocators to pass
      __GFP_WRITE when they know in advance that the page will be written to and
      become dirty soon.  The page allocator will then attempt to allocate from
      the first zone of the zonelist - which on NUMA is determined by the task's
      NUMA memory policy - that has not exceeded its dirty limit.
      
      At first glance, it would appear that the diversion to lower zones can
      increase pressure on them, but this is not the case.  With a full high
      zone, allocations will be diverted to lower zones eventually, so it is
      more of a shift in timing of the lower zone allocations.  Workloads that
      previously could fit their dirty pages completely in the higher zone may
      be forced to allocate from lower zones, but the amount of pages that
      "spill over" are limited themselves by the lower zones' dirty constraints,
      and thus unlikely to become a problem.
      
      For now, the problem of unfair dirty page distribution remains for NUMA
      configurations where the zones allowed for allocation are in sum not big
      enough to trigger the global dirty limits, wake up the flusher threads and
      remedy the situation.  Because of this, an allocation that could not
      succeed on any of the considered zones is allowed to ignore the dirty
      limits before going into direct reclaim or even failing the allocation,
      until a future patch changes the global dirty throttling and flusher
      thread activation so that they take individual zone states into account.
      
      			Test results
      
      15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
      40% dirty ratio
      16G USB thumb drive
      10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
      
      		seconds			nr_vmscan_write
      		        (stddev)	       min|     median|        max
      xfs
      vanilla:	 549.747( 3.492)	     0.000|      0.000|      0.000
      patched:	 550.996( 3.802)	     0.000|      0.000|      0.000
      
      fuse-ntfs
      vanilla:	1183.094(53.178)	 54349.000|  59341.000|  65163.000
      patched:	 558.049(17.914)	     0.000|      0.000|     43.000
      
      btrfs
      vanilla:	 573.679(14.015)	156657.000| 460178.000| 606926.000
      patched:	 563.365(11.368)	     0.000|      0.000|   1362.000
      
      ext4
      vanilla:	 561.197(15.782)	     0.000|2725438.000|4143837.000
      patched:	 568.806(17.496)	     0.000|      0.000|      0.000
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Tested-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a756cf59
    • J
      mm: exclude reserved pages from dirtyable memory · ab8fabd4
      Johannes Weiner 提交于
      Per-zone dirty limits try to distribute page cache pages allocated for
      writing across zones in proportion to the individual zone sizes, to reduce
      the likelihood of reclaim having to write back individual pages from the
      LRU lists in order to make progress.
      
      This patch:
      
      The amount of dirtyable pages should not include the full number of free
      pages: there is a number of reserved pages that the page allocator and
      kswapd always try to keep free.
      
      The closer (reclaimable pages - dirty pages) is to the number of reserved
      pages, the more likely it becomes for reclaim to run into dirty pages:
      
             +----------+ ---
             |   anon   |  |
             +----------+  |
             |          |  |
             |          |  -- dirty limit new    -- flusher new
             |   file   |  |                     |
             |          |  |                     |
             |          |  -- dirty limit old    -- flusher old
             |          |                        |
             +----------+                       --- reclaim
             | reserved |
             +----------+
             |  kernel  |
             +----------+
      
      This patch introduces a per-zone dirty reserve that takes both the lowmem
      reserve as well as the high watermark of the zone into account, and a
      global sum of those per-zone values that is subtracted from the global
      amount of dirtyable pages.  The lowmem reserve is unavailable to page
      cache allocations and kswapd tries to keep the high watermark free.  We
      don't want to end up in a situation where reclaim has to clean pages in
      order to balance zones.
      
      Not treating reserved pages as dirtyable on a global level is only a
      conceptual fix.  In reality, dirty pages are not distributed equally
      across zones and reclaim runs into dirty pages on a regular basis.
      
      But it is important to get this right before tackling the problem on a
      per-zone level, where the distance between reclaim and the dirty pages is
      mostly much smaller in absolute numbers.
      
      [akpm@linux-foundation.org: fix highmem build]
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab8fabd4
    • D
      mm, debug: test for online nid when allocating on single node · f6d7e0cb
      David Rientjes 提交于
      Calling alloc_pages_exact_node() means the allocation only passes the
      zonelist of a single node into the page allocator.  If that node isn't
      online, it's zonelist may never have been initialized causing a strange
      oops that may not immediately be clear.
      
      I recently debugged an issue where node 0 wasn't online and an allocator
      was passing 0 to alloc_pages_exact_node() and it resulted in a NULL
      pointer on zonelist->_zoneref.  If CONFIG_DEBUG_VM is enabled, though, it
      would be nice to catch this a bit earlier.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f6d7e0cb
    • S
      mm: more intensive memory corruption debugging · c0a32fc5
      Stanislaw Gruszka 提交于
      With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
      on access (read,write) to an unallocated page, which permits us to catch
      code which corrupts memory.  However the kernel is trying to maximise
      memory usage, hence there are usually few free pages in the system and
      buggy code usually corrupts some crucial data.
      
      This patch changes the buddy allocator to keep more free/protected pages
      and to interlace free/protected and allocated pages to increase the
      probability of catching corruption.
      
      When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
      debug_guardpage_minorder defines the minimum order used by the page
      allocator to grant a request.  The requested size will be returned with
      the remaining pages used as guard pages.
      
      The default value of debug_guardpage_minorder is zero: no change from
      current behaviour.
      
      [akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
      Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0a32fc5
    • D
      kernel.h: add BUILD_BUG() macro · 1399ff86
      David Daney 提交于
      We can place this in definitions that we expect the compiler to remove by
      dead code elimination.  If this assertion fails, we get a nice error
      message at build time.
      
      The GCC function attribute error("message") was added in version 4.3, so
      we define a new macro __linktime_error(message) to expand to this for
      GCC-4.3 and later.  This will give us an error diagnostic from the
      compiler on the line that fails.  For other compilers
      __linktime_error(message) expands to nothing, and we have to be content
      with a link time error, but at least we will still get a build error.
      
      BUILD_BUG() expands to the undefined function __build_bug_failed() and
      will fail at link time if the compiler ever emits code for it.  On GCC-4.3
      and later, attribute((error())) is used so that the failure will be noted
      at compile time instead.
      Signed-off-by: NDavid Daney <david.daney@cavium.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: DM <dm.n9107@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1399ff86
    • M
      mm: avoid livelock on !__GFP_FS allocations · f90ac398
      Mel Gorman 提交于
      Colin Cross reported;
      
        Under the following conditions, __alloc_pages_slowpath can loop forever:
        gfp_mask & __GFP_WAIT is true
        gfp_mask & __GFP_FS is false
        reclaim and compaction make no progress
        order <= PAGE_ALLOC_COSTLY_ORDER
      
        These conditions happen very often during suspend and resume,
        when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
        allocations into __GFP_WAIT.
      
        The oom killer is not run because gfp_mask & __GFP_FS is false,
        but should_alloc_retry will always return true when order is less
        than PAGE_ALLOC_COSTLY_ORDER.
      
      In his fix, he avoided retrying the allocation if reclaim made no progress
      and __GFP_FS was not set.  The problem is that this would result in
      GFP_NOIO allocations failing that previously succeeded which would be very
      unfortunate.
      
      The big difference between GFP_NOIO and suspend converting GFP_KERNEL to
      behave like GFP_NOIO is that normally flushers will be cleaning pages and
      kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay.
      The same does not necessarily apply during suspend as the storage device
      may be suspended.
      
      This patch special cases the suspend case to fail the page allocation if
      reclaim cannot make progress and adds some documentation on how
      gfp_allowed_mask is currently used.  Failing allocations like this may
      cause suspend to abort but that is better than a livelock.
      
      [mgorman@suse.de: Rework fix to be suspend specific]
      [rientjes@google.com: Move suspended device check to should_alloc_retry]
      Reported-by: NColin Cross <ccross@android.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f90ac398
    • K
      mm: remove unused pagevec_free · da066ad3
      Konstantin Khlebnikov 提交于
      It not exported and now nobody uses it.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da066ad3
    • K
      mm: add free_hot_cold_page_list() helper · cc59850e
      Konstantin Khlebnikov 提交于
      This patch adds helper free_hot_cold_page_list() to free list of 0-order
      pages.  It frees pages directly from list without temporary page-vector.
      It also calls trace_mm_pagevec_free() to simulate pagevec_free()
      behaviour.
      
      bloat-o-meter:
      
      add/remove: 1/1 grow/shrink: 1/3 up/down: 267/-295 (-28)
      function                                     old     new   delta
      free_hot_cold_page_list                        -     264    +264
      get_page_from_freelist                      2129    2132      +3
      __pagevec_free                               243     239      -4
      split_free_page                              380     373      -7
      release_pages                                606     510     -96
      free_page_list                               188       -    -188
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc59850e
    • J
      mm/page-writeback.c: make determine_dirtyable_memory static again · 1edf2234
      Johannes Weiner 提交于
      The tracing ring-buffer used this function briefly, but not anymore.
      Make it local to the writeback code again.
      
      Also, move the function so that no forward declaration needs to be
      reintroduced.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1edf2234
    • M
      fix shrink_dcache_parent() livelock · eaf5f907
      Miklos Szeredi 提交于
      Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may
      cause shrink_dcache_parent() to loop forever.
      
      Here's what appears to happen:
      
      1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1
      
      2 - CPU1: select_parent(P) locks P->d_lock
      
      3 - CPU0: shrink_dentry_list() locks C->d_lock
         dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock
      
      4 - CPU1: select_parent(P) locks C->d_lock,
               moves C from dispose list being processed on CPU0 to the new
      dispose list, returns 1
      
      5 - CPU0: shrink_dentry_list() finds dispose list empty, returns
      
      6 - Goto 2 with CPU0 and CPU1 switched
      
      Basically select_parent() steals the dentry from shrink_dentry_list() and thinks
      it found a new one, causing shrink_dentry_list() to think it's making progress
      and loop over and over.
      
      One way to trigger this is to make udev calls stat() on the sysfs file while it
      is going away.
      
      Having a file in /lib/udev/rules.d/ with only this one rule seems to the trick:
      
      ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true"
      
      Then execute the following loop:
      
      while true; do
              echo -bond0 > /sys/class/net/bonding_masters
              echo +bond0 > /sys/class/net/bonding_masters
              echo -bond1 > /sys/class/net/bonding_masters
              echo +bond1 > /sys/class/net/bonding_masters
      done
      
      One fix would be to check all callers and prevent concurrent calls to
      shrink_dcache_parent().  But I think a better solution is to stop the
      stealing behavior.
      
      This patch adds a new dentry flag that is set when the dentry is added to the
      dispose list.  The flag is cleared in dentry_lru_del() in case the dentry gets a
      new reference just before being pruned.
      
      If the dentry has this flag, select_parent() will skip it and let
      shrink_dentry_list() retry pruning it.  With select_parent() skipping those
      dentries there will not be the appearance of progress (new dentries found) when
      there is none, hence shrink_dcache_parent() will not loop forever.
      
      Set the flag is also set in prune_dcache_sb() for consistency as suggested by
      Linus.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      CC: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      eaf5f907
  3. 10 1月, 2012 9 次提交