1. 21 5月, 2016 24 次提交
  2. 18 3月, 2016 5 次提交
    • M
      radix_tree: add radix_tree_dump · 7cf19af4
      Matthew Wilcox 提交于
      This is debug code which is #if 0 out.
      Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cf19af4
    • M
      radix_tree: add support for multi-order entries · e6145236
      Matthew Wilcox 提交于
      With huge pages, it is convenient to have the radix tree be able to
      return an entry that covers multiple indices.  Previous attempts to deal
      with the problem have involved inserting N duplicate entries, which is a
      waste of memory and leads to problems trying to handle aliased tags, or
      probing the tree multiple times to find alternative entries which might
      cover the requested index.
      
      This approach inserts one canonical entry into the tree for a given
      range of indices, and may also insert other entries in order to ensure
      that lookups find the canonical entry.
      
      This solution only tolerates inserting powers of two that are greater
      than the fanout of the tree.  If we wish to expand the radix tree's
      abilities to support large-ish pages that is less than the fanout at the
      penultimate level of the tree, then we would need to add one more step
      in lookup to ensure that any sibling nodes in the final level of the
      tree are dereferenced and we return the canonical entry that they
      reference.
      Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6145236
    • M
      radix_tree: loop based on shift count, not height · 0070e28d
      Matthew Wilcox 提交于
      When we introduce entries that can cover multiple indices, we will need
      to stop in __radix_tree_create based on the shift, not the height.
      Split out for ease of bisect.
      Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0070e28d
    • M
      radix_tree: tag all internal tree nodes as indirect pointers · 339e6353
      Matthew Wilcox 提交于
      Set the 'indirect_ptr' bit on all the pointers to internal nodes, not
      just on the root node.  This enables the following patches to support
      multi-order entries in the radix tree.  This patch is split out for ease
      of bisection.
      Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      339e6353
    • V
      radix-tree: account radix_tree_node to memory cgroup · 58e698af
      Vladimir Davydov 提交于
      Allocation of radix_tree_node objects can be easily triggered from
      userspace, so we should account them to memory cgroup.  Besides, we need
      them accounted for making shadow node shrinker per memcg (see
      mm/workingset.c).
      
      A tricky thing about accounting radix_tree_node objects is that they are
      mostly allocated through radix_tree_preload(), so we can't just set
      SLAB_ACCOUNT for radix_tree_node_cachep - that would likely result in a
      lot of unrelated cgroups using objects from each other's caches.
      
      One way to overcome this would be making radix tree preloads per memcg,
      but that would probably look cumbersome and overcomplicated.
      
      Instead, we make radix_tree_node_alloc() first try to allocate from the
      cache with __GFP_ACCOUNT, no matter if the caller has preloaded or not,
      and only if it fails fall back on using per cpu preloads.  This should
      make most allocations accounted.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58e698af
  3. 04 2月, 2016 1 次提交
    • M
      radix-tree: fix race in gang lookup · 46437f9a
      Matthew Wilcox 提交于
      If the indirect_ptr bit is set on a slot, that indicates we need to redo
      the lookup.  Introduce a new function radix_tree_iter_retry() which
      forces the loop to retry the lookup by setting 'slot' to NULL and
      turning the iterator back to point at the problematic entry.
      
      This is a pretty rare problem to hit at the moment; the lookup has to
      race with a grow of the radix tree from a height of 0.  The consequences
      of hitting this race are that gang lookup could return a pointer to a
      radix_tree_node instead of a pointer to whatever the user had inserted
      in the tree.
      
      Fixes: cebbd29e ("radix-tree: rewrite gang lookup using iterator")
      Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ohad Ben-Cohen <ohad@wizery.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46437f9a
  4. 07 11月, 2015 1 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  5. 26 6月, 2015 1 次提交
  6. 19 5月, 2015 1 次提交
    • F
      sched/preempt: Merge preempt_mask.h into preempt.h · 92cf2118
      Frederic Weisbecker 提交于
      preempt_mask.h defines all the preempt_count semantics and related
      symbols: preempt, softirq, hardirq, nmi, preempt active, need resched,
      etc...
      
      preempt.h defines the accessors and mutators of preempt_count.
      
      But there is a messy dependency game around those two header files:
      
      	* preempt_mask.h includes preempt.h in order to access preempt_count()
      
      	* preempt_mask.h defines all preempt_count semantic and symbols
      	  except PREEMPT_NEED_RESCHED that is needed by asm/preempt.h
      	  Thus we need to define it from preempt.h, right before including
      	  asm/preempt.h, instead of defining it to preempt_mask.h with the
      	  other preempt_count symbols. Therefore the preempt_count semantics
      	  happen to be spread out.
      
      	* We plan to introduce preempt_active_[enter,exit]() to consolidate
      	  preempt_schedule*() code. But we'll need to access both preempt_count
      	  mutators (preempt_count_add()) and preempt_count symbols
      	  (PREEMPT_ACTIVE, PREEMPT_OFFSET). The usual place to define preempt
      	  operations is in preempt.h but then we'll need symbols in
      	  preempt_mask.h which already includes preempt.h. So we end up with
      	  a ressource circle dependency.
      
      Lets merge preempt_mask.h into preempt.h to solve these dependency issues.
      This way we gather semantic symbols and operation definition of
      preempt_count in a single file.
      
      This is a dumb copy-paste merge. Further merge re-arrangments are
      performed in a subsequent patch to ease review.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431441711-29753-2-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      92cf2118
  7. 13 2月, 2015 1 次提交
  8. 07 6月, 2014 1 次提交
  9. 05 6月, 2014 2 次提交
  10. 04 4月, 2014 3 次提交
    • J
      mm: keep page cache radix tree nodes in check · 449dd698
      Johannes Weiner 提交于
      Previously, page cache radix tree nodes were freed after reclaim emptied
      out their page pointers.  But now reclaim stores shadow entries in their
      place, which are only reclaimed when the inodes themselves are
      reclaimed.  This is problematic for bigger files that are still in use
      after they have a significant amount of their cache reclaimed, without
      any of those pages actually refaulting.  The shadow entries will just
      sit there and waste memory.  In the worst case, the shadow entries will
      accumulate until the machine runs out of memory.
      
      To get this under control, the VM will track radix tree nodes
      exclusively containing shadow entries on a per-NUMA node list.  Per-NUMA
      rather than global because we expect the radix tree nodes themselves to
      be allocated node-locally and we want to reduce cross-node references of
      otherwise independent cache workloads.  A simple shrinker will then
      reclaim these nodes on memory pressure.
      
      A few things need to be stored in the radix tree node to implement the
      shadow node LRU and allow tree deletions coming from the list:
      
      1. There is no index available that would describe the reverse path
         from the node up to the tree root, which is needed to perform a
         deletion.  To solve this, encode in each node its offset inside the
         parent.  This can be stored in the unused upper bits of the same
         member that stores the node's height at no extra space cost.
      
      2. The number of shadow entries needs to be counted in addition to the
         regular entries, to quickly detect when the node is ready to go to
         the shadow node LRU list.  The current entry count is an unsigned
         int but the maximum number of entries is 64, so a shadow counter
         can easily be stored in the unused upper bits.
      
      3. Tree modification needs tree lock and tree root, which are located
         in the address space, so store an address_space backpointer in the
         node.  The parent pointer of the node is in a union with the 2-word
         rcu_head, so the backpointer comes at no extra cost as well.
      
      4. The node needs to be linked to an LRU list, which requires a list
         head inside the node.  This does increase the size of the node, but
         it does not change the number of objects that fit into a slab page.
      
      [akpm@linux-foundation.org: export the right function]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      449dd698
    • J
      lib: radix_tree: tree node interface · 139e5616
      Johannes Weiner 提交于
      Make struct radix_tree_node part of the public interface and provide API
      functions to create, look up, and delete whole nodes.  Refactor the
      existing insert, look up, delete functions on top of these new node
      primitives.
      
      This will allow the VM to track and garbage collect page cache radix
      tree nodes.
      
      [sasha.levin@oracle.com: return correct error code on insertion failure]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      139e5616
    • J
      mm: filemap: move radix tree hole searching here · e7b563bb
      Johannes Weiner 提交于
      The radix tree hole searching code is only used for page cache, for
      example the readahead code trying to get a a picture of the area
      surrounding a fault.
      
      It sufficed to rely on the radix tree definition of holes, which is
      "empty tree slot".  But this is about to change, though, as shadow page
      descriptors will be stored in the page cache after the actual pages get
      evicted from memory.
      
      Move the functions over to mm/filemap.c and make them native page cache
      operations, where they can later be adapted to handle the new definition
      of "page cache hole".
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7b563bb