1. 05 6月, 2014 20 次提交
    • V
      memcg, slab: simplify synchronization scheme · bd673145
      Vladimir Davydov 提交于
      At present, we have the following mutexes protecting data related to per
      memcg kmem caches:
      
       - slab_mutex.  This one is held during the whole kmem cache creation
         and destruction paths.  We also take it when updating per root cache
         memcg_caches arrays (see memcg_update_all_caches).  As a result, taking
         it guarantees there will be no changes to any kmem cache (including per
         memcg).  Why do we need something else then?  The point is it is
         private to slab implementation and has some internal dependencies with
         other mutexes (get_online_cpus).  So we just don't want to rely upon it
         and prefer to introduce additional mutexes instead.
      
       - activate_kmem_mutex.  Initially it was added to synchronize
         initializing kmem limit (memcg_activate_kmem).  However, since we can
         grow per root cache memcg_caches arrays only on kmem limit
         initialization (see memcg_update_all_caches), we also employ it to
         protect against memcg_caches arrays relocation (e.g.  see
         __kmem_cache_destroy_memcg_children).
      
       - We have a convention not to take slab_mutex in memcontrol.c, but we
         want to walk over per memcg memcg_slab_caches lists there (e.g.  for
         destroying all memcg caches on offline).  So we have per memcg
         slab_caches_mutex's protecting those lists.
      
      The mutexes are taken in the following order:
      
         activate_kmem_mutex -> slab_mutex -> memcg::slab_caches_mutex
      
      Such a syncrhonization scheme has a number of flaws, for instance:
      
       - We can't call kmem_cache_{destroy,shrink} while walking over a
         memcg::memcg_slab_caches list due to locking order.  As a result, in
         mem_cgroup_destroy_all_caches we schedule the
         memcg_cache_params::destroy work shrinking and destroying the cache.
      
       - We don't have a mutex to synchronize per memcg caches destruction
         between memcg offline (mem_cgroup_destroy_all_caches) and root cache
         destruction (__kmem_cache_destroy_memcg_children).  Currently we just
         don't bother about it.
      
      This patch simplifies it by substituting per memcg slab_caches_mutex's
      with the global memcg_slab_mutex.  It will be held whenever a new per
      memcg cache is created or destroyed, so it protects per root cache
      memcg_caches arrays and per memcg memcg_slab_caches lists.  The locking
      order is following:
      
         activate_kmem_mutex -> memcg_slab_mutex -> slab_mutex
      
      This allows us to call kmem_cache_{create,shrink,destroy} under the
      memcg_slab_mutex.  As a result, we don't need memcg_cache_params::destroy
      work any more - we can simply destroy caches while iterating over a per
      memcg slab caches list.
      
      Also using the global mutex simplifies synchronization between concurrent
      per memcg caches creation/destruction, e.g.  mem_cgroup_destroy_all_caches
      vs __kmem_cache_destroy_memcg_children.
      
      The downside of this is that we substitute per-memcg slab_caches_mutex's
      with a hummer-like global mutex, but since we already take either the
      slab_mutex or the cgroup_mutex along with a memcg::slab_caches_mutex, it
      shouldn't hurt concurrency a lot.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bd673145
    • V
      memcg, slab: merge memcg_{bind,release}_pages to memcg_{un}charge_slab · c67a8a68
      Vladimir Davydov 提交于
      Currently we have two pairs of kmemcg-related functions that are called on
      slab alloc/free.  The first is memcg_{bind,release}_pages that count the
      total number of pages allocated on a kmem cache.  The second is
      memcg_{un}charge_slab that {un}charge slab pages to kmemcg resource
      counter.  Let's just merge them to keep the code clean.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c67a8a68
    • V
      memcg, slab: do not schedule cache destruction when last page goes away · 1e32e77f
      Vladimir Davydov 提交于
      This patchset is a part of preparations for kmemcg re-parenting.  It
      targets at simplifying kmemcg work-flows and synchronization.
      
      First, it removes async per memcg cache destruction (see patches 1, 2).
      Now caches are only destroyed on memcg offline.  That means the caches
      that are not empty on memcg offline will be leaked.  However, they are
      already leaked, because memcg_cache_params::nr_pages normally never drops
      to 0 so the destruction work is never scheduled except kmem_cache_shrink
      is called explicitly.  In the future I'm planning reaping such dead caches
      on vmpressure or periodically.
      
      Second, it substitutes per memcg slab_caches_mutex's with the global
      memcg_slab_mutex, which should be taken during the whole per memcg cache
      creation/destruction path before the slab_mutex (see patch 3).  This
      greatly simplifies synchronization among various per memcg cache
      creation/destruction paths.
      
      I'm still not quite sure about the end picture, in particular I don't know
      whether we should reap dead memcgs' kmem caches periodically or try to
      merge them with their parents (see https://lkml.org/lkml/2014/4/20/38 for
      more details), but whichever way we choose, this set looks like a
      reasonable change to me, because it greatly simplifies kmemcg work-flows
      and eases further development.
      
      This patch (of 3):
      
      After a memcg is offlined, we mark its kmem caches that cannot be deleted
      right now due to pending objects as dead by setting the
      memcg_cache_params::dead flag, so that memcg_release_pages will schedule
      cache destruction (memcg_cache_params::destroy) as soon as the last slab
      of the cache is freed (memcg_cache_params::nr_pages drops to zero).
      
      I guess the idea was to destroy the caches as soon as possible, i.e.
      immediately after freeing the last object.  However, it just doesn't work
      that way, because kmem caches always preserve some pages for the sake of
      performance, so that nr_pages never gets to zero unless the cache is
      shrunk explicitly using kmem_cache_shrink.  Of course, we could account
      the total number of objects on the cache or check if all the slabs
      allocated for the cache are empty on kmem_cache_free and schedule
      destruction if so, but that would be too costly.
      
      Thus we have a piece of code that works only when we explicitly call
      kmem_cache_shrink, but complicates the whole picture a lot.  Moreover,
      it's racy in fact.  For instance, kmem_cache_shrink may free the last slab
      and thus schedule cache destruction before it finishes checking that the
      cache is empty, which can lead to use-after-free.
      
      So I propose to remove this async cache destruction from
      memcg_release_pages, and check if the cache is empty explicitly after
      calling kmem_cache_shrink instead.  This will simplify things a lot w/o
      introducing any functional changes.
      
      And regarding dead memcg caches (i.e.  those that are left hanging around
      after memcg offline for they have objects), I suppose we should reap them
      either periodically or on vmpressure as Glauber suggested initially.  I'm
      going to implement this later.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e32e77f
    • O
      memcg: kill CONFIG_MM_OWNER · f98bafa0
      Oleg Nesterov 提交于
      CONFIG_MM_OWNER makes no sense.  It is not user-selectable, it is only
      selected by CONFIG_MEMCG automatically.  So we can kill this option in
      init/Kconfig and do s/CONFIG_MM_OWNER/CONFIG_MEMCG/ globally.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f98bafa0
    • J
      mm/swap.c: clean up *lru_cache_add* functions · 2329d375
      Jianyu Zhan 提交于
      In mm/swap.c, __lru_cache_add() is exported, but actually there are no
      users outside this file.
      
      This patch unexports __lru_cache_add(), and makes it static.  It also
      exports lru_cache_add_file(), as it is use by cifs and fuse, which can
      loaded as modules.
      Signed-off-by: NJianyu Zhan <nasa4836@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2329d375
    • V
      mem-hotplug: implement get/put_online_mems · bfc8c901
      Vladimir Davydov 提交于
      kmem_cache_{create,destroy,shrink} need to get a stable value of
      cpu/node online mask, because they init/destroy/access per-cpu/node
      kmem_cache parts, which can be allocated or destroyed on cpu/mem
      hotplug.  To protect against cpu hotplug, these functions use
      {get,put}_online_cpus.  However, they do nothing to synchronize with
      memory hotplug - taking the slab_mutex does not eliminate the
      possibility of race as described in patch 2.
      
      What we need there is something like get_online_cpus, but for memory.
      We already have lock_memory_hotplug, which serves for the purpose, but
      it's a bit of a hammer right now, because it's backed by a mutex.  As a
      result, it imposes some limitations to locking order, which are not
      desirable, and can't be used just like get_online_cpus.  That's why in
      patch 1 I substitute it with get/put_online_mems, which work exactly
      like get/put_online_cpus except they block not cpu, but memory hotplug.
      
      [ v1 can be found at https://lkml.org/lkml/2014/4/6/68.  I NAK'ed it by
        myself, because it used an rw semaphore for get/put_online_mems,
        making them dead lock prune.  ]
      
      This patch (of 2):
      
      {un}lock_memory_hotplug, which is used to synchronize against memory
      hotplug, is currently backed by a mutex, which makes it a bit of a
      hammer - threads that only want to get a stable value of online nodes
      mask won't be able to proceed concurrently.  Also, it imposes some
      strong locking ordering rules on it, which narrows down the set of its
      usage scenarios.
      
      This patch introduces get/put_online_mems, which are the same as
      get/put_online_cpus, but for memory hotplug, i.e.  executing a code
      inside a get/put_online_mems section will guarantee a stable value of
      online nodes, present pages, etc.
      
      lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bfc8c901
    • M
      mm: page_alloc: do not cache reclaim distances · 5f7a75ac
      Mel Gorman 提交于
      pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed
      by zone_reclaim due to its distance.  As it is expected that
      zone_reclaim_mode will be rarely enabled it is unreasonable for all
      machines to take a penalty.  Fortunately, the zone_reclaim_mode() path
      is already slow and it is the path that takes the hit.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f7a75ac
    • M
      mm: disable zone_reclaim_mode by default · 4f9b16a6
      Mel Gorman 提交于
      When it was introduced, zone_reclaim_mode made sense as NUMA distances
      punished and workloads were generally partitioned to fit into a NUMA
      node.  NUMA machines are now common but few of the workloads are
      NUMA-aware and it's routine to see major performance degradation due to
      zone_reclaim_mode being enabled but relatively few can identify the
      problem.
      
      Those that require zone_reclaim_mode are likely to be able to detect
      when it needs to be enabled and tune appropriately so lets have a
      sensible default for the bulk of users.
      
      This patch (of 2):
      
      zone_reclaim_mode causes processes to prefer reclaiming memory from
      local node instead of spilling over to other nodes.  This made sense
      initially when NUMA machines were almost exclusively HPC and the
      workload was partitioned into nodes.  The NUMA penalties were
      sufficiently high to justify reclaiming the memory.  On current machines
      and workloads it is often the case that zone_reclaim_mode destroys
      performance but not all users know how to detect this.  Favour the
      common case and disable it by default.  Users that are sophisticated
      enough to know they need zone_reclaim_mode will detect it.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f9b16a6
    • L
      hugetlb: add hstate_is_gigantic() · bae7f4ae
      Luiz Capitulino 提交于
      Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: NDavidlohr Bueso <davidlohr@hp.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bae7f4ae
    • D
      mm: pass VM_BUG_ON() reason to dump_page() · e4f67422
      Dave Hansen 提交于
      I recently added a patch to let folks pass a "reason" string dump_page()
      which gets dumped out along with the page's data.  This essentially
      saves the bug-reader a trip in to the source to figure out why we
      BUG_ON()'d.
      
      The new VM_BUG_ON_PAGE() passes in NULL for "reason".  It seems like we
      might as well pass the BUG_ON() condition if we have it.  This will
      bloat kernels a bit with ~160 new strings, but this is all under a
      debugging option anyway.
      
      	page:ffffea0008560280 count:1 mapcount:0 mapping:(null) index:0x0
      	page flags: 0xbfffc0000000001(locked)
      	page dumped because: VM_BUG_ON_PAGE(PageLocked(page))
      	------------[ cut here ]------------
      	kernel BUG at /home/davehans/linux.git/mm/filemap.c:464!
      	invalid opcode: 0000 [#1] SMP
      	CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.14.0+ #251
      	Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      	...
      
      [akpm@linux-foundation.org: include stringify.h]
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NDavidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e4f67422
    • A
      include/linux/mmdebug.h: add VM_WARN_ON() and VM_WARN_ON_ONCE() · 02a8efed
      Andrew Morton 提交于
      WARN_ON() and WARN_ON_ONCE(), dependent on CONFIG_DEBUG_VM
      
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02a8efed
    • A
      cma: add placement specifier for "cma=" kernel parameter · 5ea3b1b2
      Akinobu Mita 提交于
      Currently, "cma=" kernel parameter is used to specify the size of CMA,
      but we can't specify where it is located.  We want to locate CMA below
      4GB for devices only supporting 32-bit addressing on 64-bit systems
      without iommu.
      
      This enables to specify the placement of CMA by extending "cma=" kernel
      parameter.
      
      Examples:
       1. locate 64MB CMA below 4GB by "cma=64M@0-4G"
       2. locate 64MB CMA exact at 512MB by "cma=64M@512M"
      
      Note that the DMA contiguous memory allocator on x86 assumes that
      page_address() works for the pages to allocate.  So this change requires
      to limit end address of contiguous memory area upto max_pfn_mapped to
      prevent from locating it on highmem area by the argument of
      dma_contiguous_reserve().
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ea3b1b2
    • A
      memblock: introduce memblock_alloc_range() · 2bfc2862
      Akinobu Mita 提交于
      This introduces memblock_alloc_range() which allocates memblock from the
      specified range of physical address.  I would like to use this function
      to specify the location of CMA.
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2bfc2862
    • A
      x86: enable DMA CMA with swiotlb · 9c5a3621
      Akinobu Mita 提交于
      The DMA Contiguous Memory Allocator support on x86 is disabled when
      swiotlb config option is enabled.  So DMA CMA is always disabled on
      x86_64 because swiotlb is always enabled.  This attempts to support for
      DMA CMA with enabling swiotlb config option.
      
      The contiguous memory allocator on x86 is integrated in the function
      dma_generic_alloc_coherent() which is .alloc callback in nommu_dma_ops
      for dma_alloc_coherent().
      
      x86_swiotlb_alloc_coherent() which is .alloc callback in swiotlb_dma_ops
      tries to allocate with dma_generic_alloc_coherent() firstly and then
      swiotlb_alloc_coherent() is called as a fallback.
      
      The main part of supporting DMA CMA with swiotlb is that changing
      x86_swiotlb_free_coherent() which is .free callback in swiotlb_dma_ops
      for dma_free_coherent() so that it can distinguish memory allocated by
      dma_generic_alloc_coherent() from one allocated by
      swiotlb_alloc_coherent() and release it with dma_generic_free_coherent()
      which can handle contiguous memory.  This change requires making
      is_swiotlb_buffer() global function.
      
      This also needs to change .free callback in the dma_map_ops for amd_gart
      and sta2x11, because these dma_ops are also using
      dma_generic_alloc_coherent().
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Acked-by: NMarek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c5a3621
    • D
      mm,vmacache: add debug data · 4f115147
      Davidlohr Bueso 提交于
      Introduce a CONFIG_DEBUG_VM_VMACACHE option to enable counting the cache
      hit rate -- exported in /proc/vmstat.
      
      Any updates to the caching scheme needs this kind of data, thus it can
      save some work re-implementing the counting all the time.
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f115147
    • V
      mm: get rid of __GFP_KMEMCG · 52383431
      Vladimir Davydov 提交于
      Currently to allocate a page that should be charged to kmemcg (e.g.
      threadinfo), we pass __GFP_KMEMCG flag to the page allocator.  The page
      allocated is then to be freed by free_memcg_kmem_pages.  Apart from
      looking asymmetrical, this also requires intrusion to the general
      allocation path.  So let's introduce separate functions that will
      alloc/free pages charged to kmemcg.
      
      The new functions are called alloc_kmem_pages and free_kmem_pages.  They
      should be used when the caller actually would like to use kmalloc, but
      has to fall back to the page allocator for the allocation is large.
      They only differ from alloc_pages and free_pages in that besides
      allocating or freeing pages they also charge them to the kmem resource
      counter of the current memory cgroup.
      
      [sfr@canb.auug.org.au: export kmalloc_order() to modules]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52383431
    • V
      sl[au]b: charge slabs to kmemcg explicitly · 5dfb4175
      Vladimir Davydov 提交于
      We have only a few places where we actually want to charge kmem so
      instead of intruding into the general page allocation path with
      __GFP_KMEMCG it's better to explictly charge kmem there.  All kmem
      charges will be easier to follow that way.
      
      This is a step towards removing __GFP_KMEMCG.  It removes __GFP_KMEMCG
      from memcg caches' allocflags.  Instead it makes slab allocation path
      call memcg_charge_kmem directly getting memcg to charge from the cache's
      memcg params.
      
      This also eliminates any possibility of misaccounting an allocation
      going from one memcg's cache to another memcg, because now we always
      charge slabs against the memcg the cache belongs to.  That's why this
      patch removes the big comment to memcg_kmem_get_cache.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5dfb4175
    • M
      x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels · c46a7c81
      Mel Gorman 提交于
      _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
      faults on x86.  Care is taken such that _PAGE_NUMA is used only in
      situations where the VMA flags distinguish between NUMA hinting faults
      and prot_none faults.  This decision was x86-specific and conceptually
      it is difficult requiring special casing to distinguish between PROTNONE
      and NUMA ptes based on context.
      
      Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
      between an entry that is really unmapped and a page that is protected
      for NUMA hinting faults as if the PTE is not present then a fault will
      be trapped.
      
      Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
      This patch shrinks the maximum possible swap size and uses the bit to
      uniquely distinguish between NUMA hinting ptes and swap ptes.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c46a7c81
    • F
      fs/libfs.c: add generic data flush to fsync · ac13a829
      Fabian Frederick 提交于
      Description by Jan Kara:
       "A lot of older filesystems don't properly flush volatile disk caches
        on fsync(2) which can lead to loss of fsynced data after power failure.
      
      This patch makes generic_file_fsync() issue proper cache flush to fix the
      problem.  Sysadmin can use /sys/devices/.../cache_type to tell the system
      it should not send the cache flush."
      
      [akpm@linux-foundation.org: nuke ifdef]
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Suggested-by: NJan Kara <jack@suse.cz>
      Suggested-by: NChristoph Hellwig <hch@infradead.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac13a829
    • N
      hugetlb: restrict hugepage_migration_support() to x86_64 · c177c81e
      Naoya Horiguchi 提交于
      Currently hugepage migration is available for all archs which support
      pmd-level hugepage, but testing is done only for x86_64 and there're
      bugs for other archs.  So to avoid breaking such archs, this patch
      limits the availability strictly to x86_64 until developers of other
      archs get interested in enabling this feature.
      
      Simply disabling hugepage migration on non-x86_64 archs is not enough to
      fix the reported problem where sys_move_pages() hits the BUG_ON() in
      follow_page(FOLL_GET), so let's fix this by checking if hugepage
      migration is supported in vma_migratable().
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NMichael Ellerman <mpe@ellerman.id.au>
      Tested-by: NMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c177c81e
  2. 04 6月, 2014 1 次提交
    • A
      of/irq: provide more wrappers for !CONFIG_OF · 64c5c759
      Arnd Bergmann 提交于
      The pci-rcar driver is enabled for compile tests, and this has
      now shown that the driver cannot build without CONFIG_OF,
      following the inclusion of f8f2fe73 "PCI: rcar: Use new OF
      interrupt mapping when possible":
      
      drivers/built-in.o: In function `rcar_pci_map_irq':
      :(.text+0x1cc7c): undefined reference to `of_irq_parse_and_map_pci'
      pci/host/pcie-rcar.c: In function 'pci_dma_range_parser_init':
      pci/host/pcie-rcar.c:875:2: error: implicit declaration of function 'of_n_addr_cells' [-Werror=implicit-function-declaration]
      
      As pointed out by Ben Dooks and Geert Uytterhoeven, this is actually
      supposed to build fine, which we can achieve if we make the
      declaration of of_irq_parse_and_map_pci conditional on CONFIG_OF
      and provide an empty inline function otherwise, as we do for
      a lot of other of interfaces.
      
      This lets us build the rcar_pci driver again without CONFIG_OF
      for build testing. All platforms using this driver select OF,
      so this doesn't change anything for the users.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: devicetree@vger.kernel.org
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Magnus Damm <damm@opensource.se>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ben Dooks <ben.dooks@codethink.co.uk>
      Cc: linux-pci@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      [robh: drop wrappers for of_n_addr_cells and of_n_size_cells which are
      low-level functions that should not be used for !OF]
      Signed-off-by: NRob Herring <robh@kernel.org>
      64c5c759
  3. 03 6月, 2014 1 次提交
  4. 02 6月, 2014 2 次提交
  5. 30 5月, 2014 6 次提交
    • Y
      PCI: Make pci_bus_add_device() void · c893d133
      Yijing Wang 提交于
      pci_bus_add_device() always returns 0, so there's no point in returning
      anything at all.  Make it a void function and remove the tests of the
      return value from the callers.
      
      [bhelgaas: changelog, remove unused "err" from i82875p_setup_overfl_dev()]
      Signed-off-by: NYijing Wang <wangyijing@huawei.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      c893d133
    • J
      blk-mq: make the sysfs mq/ layout reflect current mappings · 67aec14c
      Jens Axboe 提交于
      Currently blk-mq registers all the hardware queues in sysfs,
      regardless of whether it uses them (e.g. they have CPU mappings)
      or not. The unused hardware queues lack the cpux/ directories,
      and the other sysfs entries (like active, pending, etc) are all
      zeroes.
      
      Change this so that sysfs correctly reflects the current mappings
      of the hardware queues.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      67aec14c
    • S
      blk-mq: blk_mq_tag_to_rq should handle flush request · 22302375
      Shaohua Li 提交于
      flush request is special, which borrows the tag from the parent
      request. Hence blk_mq_tag_to_rq needs special handling to return
      the flush request from the tag.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      22302375
    • Z
      ACPI / PNP: use device ID list for PNPACPI device enumeration · eec15edb
      Zhang Rui 提交于
      ACPI can be used to enumerate PNP devices, but the code does not
      handle this in the right way currently.  Namely, if an ACPI device
      object
       1. Has a _CRS method,
       2. Has an identification of
          "three capital characters followed by four hex digits",
       3. Is not in the excluded IDs list,
      it will be enumerated to PNP bus (that is, a PNP device object will
      be create for it).  This means that, actually, the PNP bus type is
      used as the default bus type for enumerating _HID devices in ACPI.
      
      However, more and more _HID devices need to be enumerated to the
      platform bus instead (that is, platform device objects need to be
      created for them).  As a result, the device ID list in acpi_platform.c
      is used to enforce creating platform device objects rather than PNP
      device objects for matching devices.  That list has been continuously
      growing recently, unfortunately, and it is pretty much guaranteed to
      grow even more in the future.
      
      To address that problem it is better to enumerate _HID devices
      as platform devices by default.  To this end, change the way of
      enumerating PNP devices by adding a PNP ACPI scan handler that
      will use a device ID list to create PNP devices for the ACPI
      device objects whose device IDs are present in that list.
      
      The initial device ID list in the PNP ACPI scan handler contains
      all of the pnp_device_id strings from all the existing PNP drivers,
      so this change should be transparent to the PNP core and all of the
      PNP drivers.  Still, in the future it should be possible to reduce
      its size by converting PNP drivers that need not be PNP for any
      technical reasons into platform drivers.
      Signed-off-by: NZhang Rui <rui.zhang@intel.com>
      [rjw: Rewrote the changelog, modified the PNP ACPI scan handler code]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: NMika Westerberg <mika.westerberg@linux.intel.com>
      eec15edb
    • Z
      power_supply: allow power supply devices registered w/o wakeup source · 9113e260
      Zhang Rui 提交于
      Currently, all the power supply devices are registered with wakeup source,
      this results in that every power_supply_changed() invocation brings
      the system out of suspend-to-freeze state.
      
      This is overkill as some device drivers, e.g. ACPI battery driver,
      have the ability to check the device status and wake up the system
      from sleeping only when necessary.
      
      Thus introduce a new API which allows device to be registered
      w/o wakeup source.
      Signed-off-by: NZhang Rui <rui.zhang@intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      9113e260
    • D
  6. 29 5月, 2014 8 次提交
    • J
      block: add queue flag for disabling SG merging · 05f1dd53
      Jens Axboe 提交于
      If devices are not SG starved, we waste a lot of time potentially
      collapsing SG segments. Enough that 1.5% of the CPU time goes
      to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE,
      which just returns the number of vectors in a bio instead of looping
      over all segments and checking for collapsible ones.
      
      Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg
      merging, if they so desire.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      05f1dd53
    • J
      block: remove 'magic' from struct blk_plug · 4d92a9be
      Jens Axboe 提交于
      I don't think we've ever caught any bugs with this, and there's the
      list poisoning for the plug lists to catch uninitialized cases.
      So remove the magic member and save 8 bytes in the struct.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4d92a9be
    • A
      PCI: Introduce new device binding path using pci_dev.driver_override · 782a985d
      Alex Williamson 提交于
      The driver_override field allows us to specify the driver for a device
      rather than relying on the driver to provide a positive match of the
      device.  This shortcuts the existing process of looking up the vendor and
      device ID, adding them to the driver new_id, binding the device, then
      removing the ID, but it also provides a couple advantages.
      
      First, the above existing process allows the driver to bind to any device
      matching the new_id for the window where it's enabled.  This is often not
      desired, such as the case of trying to bind a single device to a meta
      driver like pci-stub or vfio-pci.  Using driver_override we can do this
      deterministically using:
      
        echo pci-stub > /sys/bus/pci/devices/0000:03:00.0/driver_override
        echo 0000:03:00.0 > /sys/bus/pci/devices/0000:03:00.0/driver/unbind
        echo 0000:03:00.0 > /sys/bus/pci/drivers_probe
      
      Previously we could not invoke drivers_probe after adding a device to
      new_id for a driver as we get non-deterministic behavior whether the driver
      we intend or the standard driver will claim the device.  Now it becomes a
      deterministic process, only the driver matching driver_override will probe
      the device.
      
      To return the device to the standard driver, we simply clear the
      driver_override and reprobe the device:
      
        echo > /sys/bus/pci/devices/0000:03:00.0/driver_override
        echo 0000:03:00.0 > /sys/bus/pci/devices/0000:03:00.0/driver/unbind
        echo 0000:03:00.0 > /sys/bus/pci/drivers_probe
      
      Another advantage to this approach is that we can specify a driver override
      to force a specific binding or prevent any binding.  For instance when an
      IOMMU group is exposed to userspace through VFIO we require that all
      devices within that group are owned by VFIO.  However, devices can be
      hot-added into an IOMMU group, in which case we want to prevent the device
      from binding to any driver (override driver = "none") or perhaps have it
      automatically bind to vfio-pci.  With driver_override it's a simple matter
      for this field to be set internally when the device is first discovered to
      prevent driver matches.
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: NAlexander Graf <agraf@suse.de>
      Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      782a985d
    • T
      console: Use explicit pointer type for vc_uni_pagedir* fields · e4bdab70
      Takashi Iwai 提交于
      The vc_data.vc_uni_pagedir filed is currently long int, supposedly to
      be served generically.  This, however, leads to lots of cast to
      pointer, and rather it worsens the readability significantly.
      
      Actually, we have now only a single uni_pagedir map implementation,
      and this won't change likely.  So, it'd be much more simple and
      error-prone to just use the exact pointer for struct uni_pagedir
      instead of long.
      
      Ditto for vc_uni_pagedir_loc.  It's a pointer to the uni_pagedir, thus
      it can be changed similarly to the exact type.
      Signed-off-by: NTakashi Iwai <tiwai@suse.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e4bdab70
    • R
      tty/serial: at91: use mctrl_gpio helpers · e0b0baad
      Richard Genoud 提交于
      On sam9x5, dedicated CTS (and RTS) pins are unusable together with the
      LCDC, the EMAC, or the MMC because they share the same line.
      
      Moreover, the USART controller doesn't handle DTR/DSR/DCD/RI signals,
      so we have to control them via GPIO.
      
      This patch permits to use GPIOs to control the CTS/RTS/DTR/DSR/DCD/RI
      signals.
      Signed-off-by: NRichard Genoud <richard.genoud@gmail.com>
      Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: NNicolas Ferre <nicolas.ferre@atmel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e0b0baad
    • G
      Revert "usb: gadget: net2280: Add support for PLX USB338X" · 357d596e
      Greg Kroah-Hartman 提交于
      This reverts commit c4128cac.
      
      This should come through Felipe's tree first, and there was a bunch of
      other patches that are needed after this one as well that I didn't have.
      
      Cc: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      357d596e
    • V
      regulators: Add definition of regulator_set_voltage_time() for !CONFIG_REGULATOR · d660e92a
      Viresh Kumar 提交于
      We already have dummy implementation for most of the regulators APIs for
      !CONFIG_REGULATOR case and were missing it for regulator_set_voltage_time().
      
      Found this issue while compiling cpufreq-cpu0 driver without regulators support
      in kernel.
      
      drivers/cpufreq/cpufreq-cpu0.c: In function ‘cpu0_cpufreq_probe’:
      drivers/cpufreq/cpufreq-cpu0.c:186:3: error: implicit declaration of function ‘regulator_set_voltage_time’ [-Werror=implicit-function-declaration]
      
      Fix this by adding dummy definition for regulator_set_voltage_time().
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NMark Brown <broonie@linaro.org>
      d660e92a
    • C
      blk-mq: remove alloc_hctx and free_hctx methods · cdef54dd
      Christoph Hellwig 提交于
      There is no need for drivers to control hardware context allocation
      now that we do the context to node mapping in common code.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      cdef54dd
  7. 28 5月, 2014 2 次提交