1. 13 Jan, 2012 · 30 commits
    • memcg: make mem_cgroup_split_huge_fixup() more efficient · e94c8a9c
      KAMEZAWA Hiroyuki committed
      In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle
      page_cgroup modifications.  It takes move_lock_page_cgroup(), modifies
      page_cgroup, does the LRU accounting, and is called HPAGE_PMD_SIZE - 1 times.
      
      But thinking again,
        - compound_lock() is held in move_account, so it is not necessary
          to take move_lock_page_cgroup().
        - LRU is locked and all tail pages will go into the same LRU that
          the head is on now.
        - page_cgroup is contiguous in the huge page range.
      
      This patch changes mem_cgroup_split_huge_fixup() so that it is called once
      per hugepage, which reduces the cost of splitting.
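
The saving can be pictured with a toy model. This is a sketch, not kernel code: the dict-based page_cgroup records, the lock counter, and HPAGE_PMD_NR = 512 (a 2 MiB huge page made of 4 KiB base pages) are assumptions made for illustration.

```python
HPAGE_PMD_NR = 512  # assumed: 2 MiB huge page made of 4 KiB base pages


def fixup_per_tail(pcs, head):
    """Old scheme (modelled): one locked fixup call per tail page."""
    lock_ops = 0
    for i in range(1, HPAGE_PMD_NR):
        lock_ops += 1                    # stands in for move_lock_page_cgroup()
        pcs[head + i] = dict(pcs[head])  # tail inherits memcg and flags from head
    return lock_ops


def fixup_once(pcs, head):
    """New scheme (modelled): one call walks the contiguous page_cgroup range."""
    for i in range(1, HPAGE_PMD_NR):
        pcs[head + i] = dict(pcs[head])  # no per-page lock; caller already holds the locks
    return 0


def make_range():
    """Build a hypothetical page_cgroup range with only the head charged."""
    pcs = [{"memcg": None, "used": False} for _ in range(HPAGE_PMD_NR)]
    pcs[0] = {"memcg": "A", "used": True}
    return pcs
```

Both variants leave the tail entries in the same state; the difference in the model is only the number of lock round-trips.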
      
      [akpm@linux-foundation.org: fix typo, per Michal]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: remove unused node/section info from pc->flags · 6b208e3f
      Johannes Weiner committed
      To find the page corresponding to a certain page_cgroup, the pc->flags
      encoded the node or section ID with the base array to compare the pc
      pointer to.
      
      Now that the per-memory cgroup LRU lists link page descriptors directly,
      there is no longer any code that knows the struct page_cgroup of a PFN
      but not the struct page.
      
      [hughd@google.com: remove unused node/section info from pc->flags fix]
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make per-memcg LRU lists exclusive · 925b7673
      Johannes Weiner committed
      Now that all code that operated on global per-zone LRU lists is
      converted to operate on per-memory cgroup LRU lists instead, there is no
      reason to keep the double-LRU scheme around any longer.
      
      The pc->lru member is removed and page->lru is linked directly to the
      per-memory cgroup LRU lists, which removes two pointers from a
      descriptor that exists for every page frame in the system.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Ying Han <yinghan@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: collect LRU list heads into struct lruvec · 6290df54
      Johannes Weiner committed
      Having a unified structure with an LRU list set for both global zones and
      per-memcg zones keeps code simple when it deals with LRU lists and does
      not care about the container itself.
      
      Once the per-memcg LRU lists directly link struct pages, the isolation
      function and all other list manipulations are shared between the memcg
      case and the global LRU case.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: convert global reclaim to per-memcg LRU lists · b95a2f2d
      Johannes Weiner committed
      The global per-zone LRU lists are about to go away on memcg-enabled
      kernels, so global reclaim must be able to find its pages on the per-memcg
      LRU lists.
      
      Since the LRU pages of a zone are distributed over all existing memory
      cgroups, a scan target for a zone is complete when all memory cgroups
      are scanned for their proportional share of a zone's memory.
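
One way to picture the proportional share: if each memcg scans its own LRU size shifted right by the reclaim priority, the per-memcg targets add up to the zone-wide target. The sketch below uses made-up sizes; the function name, the memcg names, and the shift-by-priority rule are illustrative assumptions rather than code quoted from this patch.

```python
def scan_target(lru_pages, priority):
    """Pages to scan from one LRU list at the given priority (assumed rule)."""
    return lru_pages >> priority


# Hypothetical distribution of one zone's LRU pages over three memcgs.
memcg_lru_pages = {"root": 40960, "A": 16384, "B": 8192}
priority = 12  # illustrative starting priority

targets = {name: scan_target(pages, priority)
           for name, pages in memcg_lru_pages.items()}
zone_total = sum(memcg_lru_pages.values())
```

With sizes that are multiples of 2**priority the per-memcg targets sum exactly to the zone target; with other sizes the rounding in each shift makes the sum slightly smaller.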
      
      The forced scanning of small scan targets from kswapd is limited to
      zones marked unreclaimable, otherwise kswapd can quickly overreclaim by
      force-scanning the LRU lists of multiple memory cgroups.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty · ad2b8e60
      Johannes Weiner committed
      root_mem_cgroup, lacking a configurable limit, was never subject to
      limit reclaim, so the pages charged to it could be kept off its LRU
      lists.  They would be found on the global per-zone LRU lists upon
      physical memory pressure and it made sense to avoid uselessly linking
      them to both lists.
      
      The global per-zone LRU lists are about to go away on memcg-enabled
      kernels, with all pages being exclusively linked to their respective
      per-memcg LRU lists.  As a result, pages of the root_mem_cgroup must
      also be linked to its LRU lists again.  This is purely about the LRU
      list, root_mem_cgroup is still not charged.
      
      The overhead is temporary until the double-LRU scheme goes away
      completely.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move memcg hierarchy reclaim to generic reclaim code · 5660048c
      Johannes Weiner committed
      Memory cgroup limit reclaim and traditional global pressure reclaim will
      soon share the same code to reclaim from a hierarchical tree of memory
      cgroups.
      
      In preparation for this, move the two right next to each other in
      shrink_zone().
      
      The mem_cgroup_hierarchical_reclaim() polymath is split into a soft
      limit reclaim function, which still does hierarchy walking on its own,
      and a limit (shrinking) reclaim function, which relies on generic
      reclaim code to walk the hierarchy.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: per-priority per-zone hierarchy scan generations · 527a5ec9
      Johannes Weiner committed
      Memory cgroup limit reclaim currently picks one memory cgroup out of the
      target hierarchy, remembers it as the last scanned child, and reclaims
      all zones in it with decreasing priority levels.
      
      The new hierarchy reclaim code will pick memory cgroups from the same
      hierarchy concurrently from different zones and priority levels, so it
      becomes necessary that hierarchy roots not only remember the last
      scanned child, but do so for each zone and priority level.
      
      Until now, we reclaimed memcgs like this:
      
          mem = mem_cgroup_iter(root)
          for each priority level:
            for each zone in zonelist:
              reclaim(mem, zone)
      
      But subsequent patches will move the memcg iteration inside the loop
      over the zones:
      
          for each priority level:
            for each zone in zonelist:
              mem = mem_cgroup_iter(root)
              reclaim(mem, zone)
      
      And to keep with the original scan order - memcg -> priority -> zone -
      the last scanned memcg has to be remembered per zone and per priority
      level.
      
      Furthermore, global reclaim will be switched to the hierarchy walk as
      well.  Different from limit reclaim, which can just recheck the limit
      after some reclaim progress, its target is to scan all memcgs for the
      desired zone pages, proportional to the memcg size, and so reliably
      detecting a full hierarchy round-trip will become crucial.
      
      Currently, the code relies on one reclaimer encountering the same memcg
      twice, but that is error-prone with concurrent reclaimers.  Instead, use
      a generation counter that is increased every time the child with the
      highest ID has been visited, so that reclaimers can stop when the
      generation changes.
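
The generation idea can be sketched as follows. This is a simplified model under stated assumptions, not the kernel implementation: `ReclaimIter`, `memcg_iter`, the list-based hierarchy and the dict used as a per-reclaimer cookie are all hypothetical names, and one `ReclaimIter` would exist per (zone, priority) pair of a hierarchy root.

```python
class ReclaimIter:
    """Shared iteration state for one (zone, priority) pair of a hierarchy root."""

    def __init__(self):
        self.position = 0    # index of the next memcg to hand out
        self.generation = 0  # bumped each time the walk wraps around


def memcg_iter(memcgs, state, cookie):
    """Return the next memcg, or None once a full round-trip is detected.

    cookie is a dict private to one reclaimer; it remembers the generation
    the reclaimer observed on its first call.
    """
    if "generation" not in cookie:
        cookie["generation"] = state.generation
    elif cookie["generation"] != state.generation:
        return None                      # the round-trip was completed
    memcg = memcgs[state.position]
    state.position += 1
    if state.position == len(memcgs):    # last child of the hierarchy visited
        state.position = 0
        state.generation += 1
    return memcg
```

Because the position is shared, two concurrent reclaimers split the hierarchy between them, and both stop as soon as the generation they started in has ended, without either having to see the same memcg twice.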
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: distinguish between memcg triggering reclaim and memcg being scanned · f16015fb
      Johannes Weiner committed
      Memory cgroup hierarchies are currently handled completely outside of
      the traditional reclaim code, which is invoked with a single memory
      cgroup as an argument for the whole call stack.
      
      Subsequent patches will switch this code to do hierarchical reclaim, so
      there needs to be a distinction between a) the memory cgroup that is
      triggering reclaim due to hitting its limit and b) the memory cgroup
      that is being scanned as a child of a).
      
      This patch introduces a struct mem_cgroup_zone that contains the
      combination of the memory cgroup and the zone being scanned, which is
      then passed down the stack instead of the zone argument.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: distinguish global reclaim from global LRU scanning · 89b5fae5
      Johannes Weiner committed
      The traditional zone reclaim code is scanning the per-zone LRU lists
      during direct reclaim and kswapd, and the per-zone per-memory cgroup LRU
      lists when reclaiming on behalf of a memory cgroup limit.
      
      Subsequent patches will convert the traditional reclaim code to reclaim
      exclusively from the per-memory cgroup LRU lists.  As a result, using
      the predicate for which LRU list is scanned will no longer be
      appropriate to tell global reclaim from limit reclaim.
      
      This patch adds a global_reclaim() predicate to tell direct/kswapd
      reclaim from memory cgroup limit reclaim and substitutes it in all
      places where currently scanning_global_lru() is used for that.
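
A minimal sketch of such a predicate, assuming the reclaim control structure records which memcg (if any) triggered the reclaim. The `ScanControl` class and its `target_memcg` field are hypothetical names for illustration, not the kernel's definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanControl:
    # memcg whose limit triggered reclaim; None for kswapd or direct reclaim
    target_memcg: Optional[str] = None


def global_reclaim(sc: ScanControl) -> bool:
    """True for direct/kswapd reclaim, False for memcg limit reclaim."""
    return sc.target_memcg is None
```

The point of the predicate is that it asks who triggered reclaim, not which list is being scanned, so it stays valid once every reclaimer scans per-memcg lists.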
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: consolidate hierarchy iteration primitives · 9f3a0d09
      Johannes Weiner committed
      The memcg naturalization series:
      
      Memory control groups are currently bolted onto the side of
      traditional memory management in places where better integration would
      be preferable.  To reclaim memory, for example, memory control groups
      maintain their own LRU list and reclaim strategy aside from the global
      per-zone LRU list reclaim.  But an extra list head for each existing
      page frame is expensive and maintaining it requires additional code.
      
      This patchset disables the global per-zone LRU lists on memory cgroup
      configurations and converts all its users to operate on the per-memory
      cgroup lists instead.  As LRU pages are then exclusively on one list,
      this saves two list pointers for each page frame in the system:
      
      page_cgroup array size with 4G physical memory
      
        vanilla: allocated 31457280 bytes of page_cgroup
        patched: allocated 15728640 bytes of page_cgroup
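
The two figures are consistent with one two-pointer list head being dropped from every page_cgroup. The check below is a worked calculation, assuming 64-bit pointers and 4 KiB pages; the page count is inferred from the numbers and is not stated in the commit.

```python
POINTER_SIZE = 8                    # assumption: 64-bit kernel
LIST_HEAD_SIZE = 2 * POINTER_SIZE   # a list head holds next and prev pointers
PAGE_SIZE = 4096                    # assumption: 4 KiB base pages

vanilla_bytes = 31457280            # from the commit message
patched_bytes = 15728640            # from the commit message

saved_bytes = vanilla_bytes - patched_bytes
pages = saved_bytes // LIST_HEAD_SIZE          # inferred number of page frames
per_page_vanilla = vanilla_bytes // pages      # bytes per page_cgroup before
per_page_patched = patched_bytes // pages      # bytes per page_cgroup after
covered_gib = pages * PAGE_SIZE / 2**30        # memory covered by the array
```

Under these assumptions the array shrinks from 32 to 16 bytes per page frame over 983040 frames, which covers 3.75 GiB of the 4G machine.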
      
      At the same time, system performance for various workloads is
      unaffected:
      
      100G sparse file cat, 4G physical memory, 10 runs, to test for code
      bloat in the traditional LRU handling and kswapd & direct reclaim
      paths, without/with the memory controller configured in
      
        vanilla: 71.603(0.207) seconds
        patched: 71.640(0.156) seconds
      
        vanilla: 79.558(0.288) seconds
        patched: 77.233(0.147) seconds
      
      100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
      bloat in the traditional memory cgroup LRU handling and reclaim path
      
        vanilla: 96.844(0.281) seconds
        patched: 94.454(0.311) seconds
      
      4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
      swap on SSD, 10 runs, to test for regressions in kswapd & direct
      reclaim using per-memcg LRU lists with multiple memcgs and multiple
      allocators within each memcg
      
        vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
        patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]
      
      16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
      swap on SSD, 10 runs, to test for regressions in hierarchical memcg
      setups
      
        vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
        patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]
      
      This patch:
      
      There are currently two different implementations of iterating over a
      memory cgroup hierarchy tree.
      
      Consolidate them into one worker function and base the convenience
      looping-macros on top of it.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: add mem_cgroup_replace_page_cache() to fix LRU issue · ab936cbc
      KAMEZAWA Hiroyuki committed
      Commit ef6a3c63 ("mm: add replace_page_cache_page() function") added a
      function replace_page_cache_page().  This function replaces a page in the
      radix-tree with a new page.  When doing this, memory cgroup needs to fix
      up the accounting information.  memcg needs to check PCG_USED bit etc.
      
      In some (many?) cases, 'newpage' is on LRU before calling
      replace_page_cache().  So, memcg's LRU accounting information should be
      fixed, too.
      
      This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
      In that function, the old page will be unaccounted without touching
      res_counter and the new page will be accounted to the memcg (of the old page).
      When overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
      races with LRU handling.
      
      Background:
        replace_page_cache_page() is called by FUSE code in its splice() handling.
        Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
        page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
        because rmdir() checks that the whole LRU is empty and there is no account leak.
        If a page is on a different LRU than it should be, rmdir() will fail.
      
      This bug was added in March 2011, but there is no bug report yet.  I guess there
      are not many people who use memcg and FUSE at the same time with upstream
      kernels.
      
      The result of this bug is that the admin cannot destroy a memcg because of
      an account leak.  So, no panic, no deadlock.  And even if an active cgroup
      exists, umount can succeed.  So there is no problem at shutdown.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: limit paths · 28d82dc1
      Jason Baron committed
      The current epoll code can be tickled to run basically indefinitely in
      both loop detection path check (on ep_insert()), and in the wakeup paths.
      The programs that tickle this behavior set up deeply linked networks of
      epoll file descriptors that cause the epoll algorithms to traverse them
      indefinitely.  A couple of these sample programs have been previously
      posted in this thread: https://lkml.org/lkml/2011/2/25/297.
      
      To fix the loop detection path check algorithms, I simply keep track of
      the epoll nodes that have been already visited.  Thus, the loop detection
      becomes proportional to the number of epoll file descriptors and links.
      This dramatically decreases the run-time of the loop check algorithm.  In
      one diabolical case I tried it reduced the run-time from 15 minutes (all
      in kernel time) to 0.3 seconds.
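
The visited-set idea can be sketched as a plain graph search. This is an illustration under assumptions, not the eventpoll.c code: the `links` dictionary (each epoll node mapped to the epoll nodes it monitors) and the function name are hypothetical.

```python
def creates_loop(links, src, dst):
    """Would adding the link src -> dst (src monitors dst) close a cycle?

    A cycle appears exactly when src is already reachable from dst.
    """
    visited = set()
    stack = [dst]
    while stack:
        node = stack.pop()
        if node == src:
            return True
        if node in visited:
            continue                  # already explored: skip, do not re-walk
        visited.add(node)
        stack.extend(links.get(node, ()))
    return False
```

Each node is expanded at most once, so the cost is bounded by the number of nodes plus links, even in densely connected networks where re-walking every path would grow exponentially.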
      
      Fixing the wakeup paths could be done at wakeup time in a similar manner
      by keeping track of nodes that have already been visited, but the
      complexity is harder, since there can be multiple wakeups on different
      cpus...Thus, I've opted to limit the number of possible wakeup paths when
      the paths are created.
      
      This is accomplished, by noting that the end file descriptor points that
      are found during the loop detection pass (from the newly added link), are
      actually the sources for wakeup events.  I keep a list of these file
      descriptors and limit the number and length of these paths that emanate
      from these 'source file descriptors'.  In the current implementation I
      allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
      length 4 and 10 of length 5.  Note that it is sufficient to check the
      'source file descriptors' reachable from the newly added link, since no
      other 'source file descriptors' will have newly added links.  This allows
      us to check only the wakeup paths that may have gotten too long, and not
      re-check all possible wakeup paths on the system.
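
A sketch of how such a limit table could be checked, using the numbers quoted above. The function name and the representation of the wakeup paths as a list of lengths are assumptions for illustration; the real code counts paths while walking the graph.

```python
# Allowed number of wakeup paths of length 1..5, as listed in the commit message.
PATH_LIMITS = [1000, 500, 100, 50, 10]


def paths_within_limits(path_lengths):
    """Return False as soon as any length bucket exceeds its limit."""
    counts = [0] * len(PATH_LIMITS)
    for length in path_lengths:
        if length < 1 or length > len(PATH_LIMITS):
            return False              # longer than the maximum allowed depth
        counts[length - 1] += 1
        if counts[length - 1] > PATH_LIMITS[length - 1]:
            return False
    return True
```

In the common case described in the commit, one epoll descriptor watching n sources, every source contributes a single path of length 1, far below the first limit.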
      
      In terms of the path limit selection, I think it's first worth noting that
      the most common case for epoll is probably the model where you have 1
      epoll file descriptor that is monitoring n 'source file
      descriptors'.  In this case, each 'source file descriptor' has 1 path of
      length 1.  Thus, I believe that the limits I'm proposing are quite
      reasonable and in fact may be too generous.  Thus, I'm hoping that the
      proposed limits will not cause any workloads that currently work to
      fail.
      
      In terms of locking, I have extended the use of the 'epmutex' to all
      epoll_ctl add and remove operations.  Currently it's only used in a subset
      of the add paths.  I need to hold the epmutex, so that we can correctly
      traverse a coherent graph, to check the number of paths.  I believe that
      this additional locking is probably ok, since it's in the setup/teardown
      paths, and doesn't affect the running paths, but it certainly is going to
      add some extra overhead.  Also worth noting is that the epmutex was
      recently added to the ep_ctl add operations in the initial path loop
      detection code using the argument that it was not on a critical path.
      
      Another thing to note here, is the length of epoll chains that is allowed.
      Currently, eventpoll.c defines:
      
      /* Maximum number of nesting allowed inside epoll sets */
      #define EP_MAX_NESTS 4
      
      This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
      + 1).  However, this limit is currently only enforced during the loop
      check detection code, and only when the epoll file descriptors are added
      in a certain order.  Thus, this limit is currently easily bypassed.  The
      newly added check for wakeup paths strictly limits the wakeup paths to a
      length of 5, regardless of the order in which ep's are linked together.
      Thus, a side-effect of the new code is a more consistent enforcement of
      the graph depth.
      
      Thus far, I've tested this using the sample programs previously
      mentioned, which now either return quickly or return -EINVAL.  I've also
      tested using the piptest.c epoll tester, which showed no difference in
      performance.  I've also created a number of different epoll networks and
      tested that they behave as expected.
      
      I believe this solves the original diabolical test cases, while still
      preserving the sane epoll nesting.
      Signed-off-by: Jason Baron <jbaron@redhat.com>
      Cc: Nelson Elhage <nelhage@ksplice.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pipe: fail cleanly when root tries F_SETPIPE_SZ with big size · 2ccd4f4d
      Sasha Levin committed
      When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe
      with a size bigger than kmalloc() can allocate, it spits out an ugly warning:
      
        ------------[ cut here ]------------
        WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
        Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
        Call Trace:
           warn_slowpath_common+0x75/0xb0
           warn_slowpath_null+0x15/0x20
           __alloc_pages_nodemask+0x5d3/0x7a0
           __get_free_pages+0x12/0x50
           __kmalloc+0x12b/0x150
           pipe_set_size+0x75/0x120
           pipe_fcntl+0xf8/0x140
           do_fcntl+0x2d4/0x410
           sys_fcntl+0x66/0xa0
           system_call_fastpath+0x16/0x1b
        ---[ end trace 432f702e6db7b5ee ]---
      
      Instead, make kcalloc() handle the overflow case and fail quietly.
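
The overflow half of that idea can be sketched as a checked multiplication: an array allocator refuses the request when count times element size would not fit in the size type. This is a model under assumptions (a 64-bit size type, hypothetical function name), not the kernel's kcalloc() source, and it does not model the separate question of how large a contiguous allocation the allocator can satisfy.

```python
SIZE_MAX = 2**64 - 1  # assumption: 64-bit size type


def checked_array_bytes(n, elem_size):
    """Return n * elem_size, or None if the product would overflow SIZE_MAX."""
    if elem_size != 0 and n > SIZE_MAX // elem_size:
        return None   # fail quietly instead of allocating a wrapped-around size
    return n * elem_size
```

Returning a failure value lets the caller report an error to user space, rather than proceeding with a truncated size.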
      
      [akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
      Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: document setting min order with debug_guardpage_minorder > 0 · 888a214d
      Stanislaw Gruszka committed
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • parisc, exec: remove redundant set_fs(USER_DS) · 15ee2d00
      Mathias Krause committed
      The address limit is already set in flush_old_exec() so those calls to
      set_fs(USER_DS) are redundant.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Helge Deller <deller@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ia64, exec: remove redundant set_fs(USER_DS) · 01fa310c
      Mathias Krause committed
      The address limit is already set in flush_old_exec() so this
      set_fs(USER_DS) is redundant.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/video/nvidia/nvidia.c: fix warning · 08346bf8
      Andrew Morton committed
      Fix the int/bool confusion in there.
      
        drivers/video/nvidia/nvidia.c:1602: warning: return from incompatible pointer type
      
      Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,x86,um: move CMPXCHG_DOUBLE config option · 2565409f
      Heiko Carstens committed
      Move CMPXCHG_DOUBLE and rename it to HAVE_CMPXCHG_DOUBLE so architectures
      can simply select the option if it is supported.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,x86,um: move CMPXCHG_LOCAL config option · 4156153c
      Heiko Carstens committed
      Move CMPXCHG_LOCAL and rename it to HAVE_CMPXCHG_LOCAL so architectures
      can simply select the option if it is supported.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,slub,x86: decouple size of struct page from CONFIG_CMPXCHG_LOCAL · 43570fd2
      Heiko Carstens committed
      While implementing cmpxchg_double() on s390 I realized that we don't set
      CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.
      
      However setting that option will increase the size of struct page by
      eight bytes on 64 bit, which we certainly do not want.  Also, it doesn't
      make sense that a present cpu feature should increase the size of struct
      page.
      
      Besides that, it looks like the dependency on CMPXCHG_LOCAL is wrong and
      that it should depend on CMPXCHG_DOUBLE instead.
      
      This patch:
      
      If an architecture supports CMPXCHG_LOCAL this shouldn't result
      automatically in larger struct pages if the SLUB allocator is used.
      Instead introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which
      can be selected if a double word aligned struct page is required.  Also
      update x86 Kconfig so that it should work as before.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43570fd2
    • J
      include/linux/linkage.h: remove unused ATTRIB_NORET macro · 0d259cf8
      Joe Perches committed
      The uses have been renamed so delete the unused macro.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d259cf8
    • J
      treewide: convert uses of ATTRIB_NORETURN to __noreturn · ff2d8b19
      Joe Perches committed
      Use the more commonly used __noreturn instead of ATTRIB_NORETURN.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff2d8b19
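As an illustration of what the `__noreturn` annotation in the commit above expresses, here is a minimal userspace sketch. The macro name `demo_noreturn` and the functions `fail_with()` and `check_value()` are hypothetical and chosen for this example only; the sketch assumes a GCC-compatible compiler that accepts `__attribute__((noreturn))`.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical userspace stand-in for the kernel's __noreturn macro. */
#define demo_noreturn __attribute__((noreturn))

static jmp_buf recovery_point;

/* Never returns to its caller: control leaves through longjmp(). */
static demo_noreturn void fail_with(int code)
{
	longjmp(recovery_point, code);
}

/* Returns 0 for a non-negative value, -1 if fail_with() was invoked. */
int check_value(int value)
{
	if (setjmp(recovery_point) != 0)
		return -1;	/* reached only via longjmp() */

	if (value < 0)
		fail_with(1);	/* the compiler knows nothing runs after this */

	return 0;
}
```

Because the function is marked as not returning, the compiler can drop dead code after the call and avoid spurious "control reaches end of function" style diagnostics in callers.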
    • J
      treewide: remove useless NORET_TYPE macro and uses · 9402c95f
      Joe Perches committed
      It's a very old and now unused prototype marking so just delete it.
      
      Neaten panic pointer argument style to keep checkpatch quiet.
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9402c95f
    • J
      include/linux/linkage.h: remove unused NORET_AND macro · 80bf007f
      Joe Perches committed
      The only use in kernel.h is gone so remove the macro.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80bf007f
    • J
      kernel.h: neaten panic prototype · 4da47859
      Joe Perches committed
      Use __printf macro.
      Convert NORET_AND to ATTRIB_NORET.
      Use the normal kernel style for pointer arguments.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4da47859
    • S
      kprobes: silence DEBUG_STRICT_USER_COPY_CHECKS=y warning · efeb156e
      Stephen Boyd committed
      Enabling DEBUG_STRICT_USER_COPY_CHECKS causes the following warning:
      
        In file included from arch/x86/include/asm/uaccess.h:573,
                         from kernel/kprobes.c:55:
        In function 'copy_from_user',
            inlined from 'write_enabled_file_bool' at
            kernel/kprobes.c:2191:
        arch/x86/include/asm/uaccess_64.h:65:
        warning: call to 'copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
      
      presumably due to buf_size being signed causing GCC to fail to see that
      buf_size can't become negative.
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efeb156e
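The kprobes commit above attributes the warning to a signed `buf_size`. A rough userspace sketch of the same idea, keeping the length unsigned so that the bound is visible to the compiler, might look like the following. The function `bounded_copy()` is hypothetical, and `memcpy()` merely stands in for `copy_from_user()`; this is not the kernel code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most dst_size - 1 bytes from src and NUL-terminate dst.
 * All lengths are size_t, so the clamped length can never be negative.
 * Returns the number of bytes copied (excluding the terminator).
 */
size_t bounded_copy(char *dst, size_t dst_size, const char *src, size_t count)
{
	size_t n;

	if (dst_size == 0)
		return 0;

	/* Clamp the requested length to the room left in dst. */
	n = count < dst_size - 1 ? count : dst_size - 1;
	memcpy(dst, src, n);
	dst[n] = '\0';
	return n;
}
```

With a signed length, the compiler would have to prove separately that the value is non-negative before it could bound the copy; with `size_t` the clamp alone is enough.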
    • X
      proc: fix null pointer deref in proc_pid_permission() · a2ef990a
      Xiaotian Feng committed
      get_proc_task() can fail to find the task and return NULL;
      put_task_struct() will then crash the kernel with the following oops:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
        IP: [<ffffffff81217d34>] proc_pid_permission+0x64/0xe0
        PGD 112075067 PUD 112814067 PMD 0
        Oops: 0002 [#1] PREEMPT SMP
      
      This is a regression introduced by commit 0499680a ("procfs: add hidepid=
      and gid= mount options").  The kernel should return -ESRCH if
      get_proc_task() failed.
      Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Stephen Wilson <wilsons@start.ca>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2ef990a
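The fix described in the proc commit above follows a common pattern: check the result of a lookup before releasing the reference, and report -ESRCH when nothing was found. The sketch below shows that pattern in userspace. `struct task`, `lookup_task()`, `release_task()` and `task_permission()` are hypothetical stand-ins, not the kernel's `get_proc_task()` or `put_task_struct()`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted task object. */
struct task {
	int id;
	int refcount;
};

static struct task known_task = { 1, 0 };

/* Returns the task with a reference taken, or NULL if it does not exist. */
static struct task *lookup_task(int id)
{
	if (id != known_task.id)
		return NULL;
	known_task.refcount++;
	return &known_task;
}

/* Drops a reference; dereferences t, so t must not be NULL. */
static void release_task(struct task *t)
{
	t->refcount--;
}

/* Returns 0 if the task exists, -ESRCH if the lookup failed. */
int task_permission(int id)
{
	struct task *t = lookup_task(id);

	if (!t)
		return -ESRCH;	/* bail out before touching the pointer */

	release_task(t);
	return 0;
}
```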
    • A
      x86: Get rid of 'dubious one-bit signed bitfield' sparse warning · bccd1729
      Anton Vorontsov committed
      This very noisy sparse warning appears on almost every file in the
      kernel:
      
        CHECK   init/main.c
        arch/x86/include/asm/thread_info.h:43:55: error: dubious one-bit signed bitfield
        arch/x86/include/asm/thread_info.h:44:46: error: dubious one-bit signed bitfield
      
      This patch changes sig_on_uaccess_error and uaccess_err flags to unsigned
      type and thus fixes the warning.
      Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com>
      Acked-by: Andy Lutomirski <luto@mit.edu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bccd1729
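To see why sparse calls a one-bit signed bitfield "dubious", as in the commit above, consider this small sketch. The struct and function names are hypothetical. A signed one-bit field has room for the sign bit only, so it cannot represent the value 1; the exact value read back is implementation-defined (GCC and Clang yield -1).

```c
#include <assert.h>

/* One-bit flag declared signed: can only represent 0 and a negative value. */
struct flags_signed {
	signed int on:1;
};

/* One-bit flag declared unsigned: represents 0 and 1 as expected. */
struct flags_unsigned {
	unsigned int on:1;
};

/* Store v in the signed field and return what is read back. */
int store_signed(int v)
{
	struct flags_signed f;

	f.on = v;
	return f.on;
}

/* Store v in the unsigned field and return what is read back. */
int store_unsigned(int v)
{
	struct flags_unsigned f;

	f.on = v;
	return f.on;
}
```

A test such as `if (flag == 1)` silently fails with the signed variant, which is why switching such flags to an unsigned type is the usual remedy.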
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · a429638c
      Linus Torvalds committed
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (526 commits)
        ASoC: twl6040 - Add method to query optimum PDM_DL1 gain
        ALSA: hda - Fix the lost power-setup of secondary pins after PM resume
        ALSA: usb-audio: add Yamaha MOX6/MOX8 support
        ALSA: virtuoso: add S/PDIF input support for all Xonars
        ALSA: ice1724 - Support for ooAoo SQ210a
        ALSA: ice1724 - Allow card info based on model only
        ALSA: ice1724 - Create capture pcm only for ADC-enabled configurations
        ALSA: hdspm - Provide unique driver id based on card serial
        ASoC: Dynamically allocate the rtd device for a non-empty release()
        ASoC: Fix recursive dependency due to select ATMEL_SSC in SND_ATMEL_SOC_SSC
        ALSA: hda - Fix the detection of "Loopback Mixing" control for VIA codecs
        ALSA: hda - Return the error from get_wcaps_type() for invalid NIDs
        ALSA: hda - Use auto-parser for HP laptops with cx20459 codec
        ALSA: asihpi - Fix potential Oops in snd_asihpi_cmode_info()
        ALSA: hdsp - Fix potential Oops in snd_hdsp_info_pref_sync_ref()
        ALSA: hda/cirrus - support for iMac12,2 model
        ASoC: cx20442: add bias control over a platform provided regulator
        ALSA: usb-audio - Avoid flood of frame-active debug messages
        ALSA: snd-usb-us122l: Delete calls to preempt_disable
        mfd: Put WM8994 into cache only mode when suspending
        ...
      
      Fix up trivial conflicts in:
       - arch/arm/mach-s3c64xx/mach-crag6410.c:
      	renamed speyside_wm8962 to tobermory, added littlemill right
      	next to it
       - drivers/base/regmap/{regcache.c,regmap.c}:
      	duplicate diff that had already come in with other changes in
      	the regmap tree
      a429638c
  2. 12 Jan, 2012 10 commits