1. 08 9月, 2014 1 次提交
    • T
      percpu_counter: add @gfp to percpu_counter_init() · 908c7f19
      Tejun Heo 提交于
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_counters too.
      
      We could have left percpu_counter_init() alone and added
      percpu_counter_init_gfp(); however, the number of users isn't that
      high and introducing _gfp variants to all percpu data structures would
      be quite ugly, so let's just do the conversion.  This is the one with
      the most users.  Other percpu data structures are a lot easier to
      convert.
      
      This patch doesn't make any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Cc: x86@kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      908c7f19
  2. 03 9月, 2014 13 次提交
    • T
      percpu: implement asynchronous chunk population · 1a4d7607
      Tejun Heo 提交于
      The percpu allocator now supports atomic allocations by only
      allocating from already populated areas but the mechanism to ensure
      that there's adequate amount of populated areas was missing.
      
      This patch expands pcpu_balance_work so that in addition to freeing
      excess free chunks it also populates chunks to maintain an adequate
      level of populated areas.  pcpu_alloc() schedules pcpu_balance_work if
      the amount of free populated areas is too low or after an atomic
      allocation failure.
      
      * PERPCU_DYNAMIC_RESERVE is increased by two pages to account for
        PCPU_EMPTY_POP_PAGES_LOW.
      
      * pcpu_async_enabled is added to gate both async jobs -
        chunk->map_extend_work and pcpu_balance_work - so that we don't end
        up scheduling them while the needed subsystems aren't up yet.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      1a4d7607
    • T
      percpu: rename pcpu_reclaim_work to pcpu_balance_work · fe6bd8c3
      Tejun Heo 提交于
      pcpu_reclaim_work will also be used to populate chunks asynchronously.
      Rename it to pcpu_balance_work in preparation.  pcpu_reclaim() is
      renamed to pcpu_balance_workfn() and some of its local variables are
      renamed too.
      
      This is pure rename.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      fe6bd8c3
    • T
      percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated · b539b87f
      Tejun Heo 提交于
      pcpu_nr_empty_pop_pages counts the number of empty populated pages
      across all chunks and chunk->nr_populated counts the number of
      populated pages in a chunk.  Both will be used to implement pre/async
      population for atomic allocations.
      
      pcpu_chunk_[de]populated() are added to update chunk->populated,
      chunk->nr_populated and pcpu_nr_empty_pop_pages together.  All
      successful chunk [de]populations should be followed by the
      corresponding pcpu_chunk_[de]populated() calls.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b539b87f
    • T
      percpu: make sure chunk->map array has available space · 9c824b6a
      Tejun Heo 提交于
      An allocation attempt may require extending chunk->map array which
      requires GFP_KERNEL context which isn't available for atomic
      allocations.  This patch ensures that chunk->map array usually keeps
      some amount of available space by directly allocating buffer space
      during GFP_KERNEL allocations and scheduling async extension during
      atomic ones.  This should make atomic allocation failures from map
      space exhaustion rare.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      9c824b6a
    • T
      percpu: implement [__]alloc_percpu_gfp() · 5835d96e
      Tejun Heo 提交于
      Now that pcpu_alloc_area() can allocate only from populated areas,
      it's easy to add atomic allocation support to [__]alloc_percpu().
      Update pcpu_alloc() so that it accepts @gfp and skips all the blocking
      operations and allocates only from the populated areas if @gfp doesn't
      contain GFP_KERNEL.  New interface functions [__]alloc_percpu_gfp()
      are added.
      
      While this means that atomic allocations are possible, this isn't
      complete yet as there's no mechanism to ensure that certain amount of
      populated areas is kept available and atomic allocations may keep
      failing under certain conditions.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      5835d96e
    • T
      percpu: indent the population block in pcpu_alloc() · e04d3208
      Tejun Heo 提交于
      The next patch will conditionalize the population block in
      pcpu_alloc() which will end up making a rather large indentation
      change obfuscating the actual logic change.  This patch puts the block
      under "if (true)" so that the next patch can avoid indentation
      changes.  The defintions of the local variables which are used only in
      the block are moved into the block.
      
      This patch is purely cosmetic.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e04d3208
    • T
      percpu: make pcpu_alloc_area() capable of allocating only from populated areas · a16037c8
      Tejun Heo 提交于
      Update pcpu_alloc_area() so that it can skip unpopulated areas if the
      new parameter @pop_only is true.  This is implemented by a new
      function, pcpu_fit_in_area(), which determines the amount of head
      padding considering the alignment and populated state.
      
      @pop_only is currently always false but this will be used to implement
      atomic allocation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a16037c8
    • T
      percpu: restructure locking · b38d08f3
      Tejun Heo 提交于
      At first, the percpu allocator required a sleepable context for both
      alloc and free paths and used pcpu_alloc_mutex to protect everything.
      Later, pcpu_lock was introduced to protect the index data structure so
      that the free path can be invoked from atomic contexts.  The
      conversion only updated what's necessary and left most of the
      allocation path under pcpu_alloc_mutex.
      
      The percpu allocator is planned to add support for atomic allocation
      and this patch restructures locking so that the coverage of
      pcpu_alloc_mutex is further reduced.
      
      * pcpu_alloc() now grab pcpu_alloc_mutex only while creating a new
        chunk and populating the allocated area.  Everything else is now
        protected soley by pcpu_lock.
      
        After this change, multiple instances of pcpu_extend_area_map() may
        race but the function already implements sufficient synchronization
        using pcpu_lock.
      
        This also allows multiple allocators to arrive at new chunk
        creation.  To avoid creating multiple empty chunks back-to-back, a
        new chunk is created iff there is no other empty chunk after
        grabbing pcpu_alloc_mutex.
      
      * pcpu_lock is now held while modifying chunk->populated bitmap.
        After this, all data structures are protected by pcpu_lock.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b38d08f3
    • T
      percpu: make percpu-km set chunk->populated bitmap properly · a63d4ac4
      Tejun Heo 提交于
      percpu-km instantiates the whole chunk on creation and doesn't make
      use of chunk->populated bitmap and leaves it as zero.  While this
      currently doesn't cause any problem, the inconsistency makes it
      difficult to build further logic on top of chunk->populated.  This
      patch makes percpu-km fill chunk->populated on creation so that the
      bitmap is always consistent.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      a63d4ac4
    • T
      percpu: move region iterations out of pcpu_[de]populate_chunk() · a93ace48
      Tejun Heo 提交于
      Previously, pcpu_[de]populate_chunk() were called with the range which
      may contain multiple target regions in it and
      pcpu_[de]populate_chunk() iterated over the regions.  This has the
      benefit of batching up cache flushes for all the regions; however,
      we're planning to add more bookkeeping logic around [de]population to
      support atomic allocations and this delegation of iterations gets in
      the way.
      
      This patch moves the region iterations out of
      pcpu_[de]populate_chunk() into its callers - pcpu_alloc() and
      pcpu_reclaim() - so that we can later add logic to track more states
      around them.  This change may make cache and tlb flushes more frequent
      but multi-region [de]populations are rare anyway and if this actually
      becomes a problem, it's not difficult to factor out cache flushes as
      separate callbacks which are directly invoked from percpu.c.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a93ace48
    • T
      percpu: move common parts out of pcpu_[de]populate_chunk() · dca49645
      Tejun Heo 提交于
      percpu-vm and percpu-km implement separate versions of
      pcpu_[de]populate_chunk() and some part which is or should be common
      are currently in the specific implementations.  Make the following
      changes.
      
      * Allocate area clearing is moved from the pcpu_populate_chunk()
        implementations to pcpu_alloc().  This makes percpu-km's version
        noop.
      
      * Quick exit tests in pcpu_[de]populate_chunk() of percpu-vm are moved
        to their respective callers so that they are applied to percpu-km
        too.  This doesn't make any meaningful difference as both functions
        are noop for percpu-km; however, this is more consistent and will
        help implementing atomic allocation support.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      dca49645
    • T
      percpu: remove @may_alloc from pcpu_get_pages() · cdb4cba5
      Tejun Heo 提交于
      pcpu_get_pages() creates the temp pages array if not already allocated
      and returns the pointer to it.  As the function is called from both
      [de]population paths and depopulation can only happen after at least
      one successful population, the param doesn't make any difference - the
      allocation will always happen on the population path anyway.
      
      Remove @may_alloc from pcpu_get_pages().  Also, add an lockdep
      assertion pcpu_alloc_mutex instead of vaguely stating that the
      exclusion is the caller's responsibility.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      cdb4cba5
    • T
      percpu: remove the usage of separate populated bitmap in percpu-vm · fbbb7f4e
      Tejun Heo 提交于
      percpu-vm uses pcpu_get_pages_and_bitmap() to acquire temp pages array
      and populated bitmap and uses the two during [de]population.  The temp
      bitmap is used only to build the new bitmap that is copied to
      chunk->populated after the operation succeeds; however, the new bitmap
      can be trivially set after success without using the temp bitmap.
      
      This patch removes the temp populated bitmap usage from percpu-vm.c.
      
      * pcpu_get_pages_and_bitmap() is renamed to pcpu_get_pages() and no
        longer hands out the temp bitmap.
      
      * @populated arugment is dropped from all the related functions.
        @populated updates in pcpu_[un]map_pages() are dropped.
      
      * Two loops in pcpu_map_pages() are merged.
      
      * pcpu_[de]populated_chunk() modify chunk->populated bitmap directly
        from @page_start and @page_end after success.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      fbbb7f4e
  3. 16 8月, 2014 3 次提交
    • H
      percpu: free percpu allocation info for uniprocessor system · 3189eddb
      Honggang Li 提交于
      Currently, only SMP system free the percpu allocation info.
      Uniprocessor system should free it too. For example, one x86 UML
      virtual machine with 256MB memory, UML kernel wastes one page memory.
      Signed-off-by: NHonggang Li <enjoymindful@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      3189eddb
    • T
      percpu: perform tlb flush after pcpu_map_pages() failure · 849f5169
      Tejun Heo 提交于
      If pcpu_map_pages() fails midway, it unmaps the already mapped pages.
      Currently, it doesn't flush tlb after the partial unmapping.  This may
      be okay in most cases as the established mapping hasn't been used at
      that point but it can go wrong and when it goes wrong it'd be
      extremely difficult to track down.
      
      Flush tlb after the partial unmapping.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      849f5169
    • T
      percpu: fix pcpu_alloc_pages() failure path · f0d27965
      Tejun Heo 提交于
      When pcpu_alloc_pages() fails midway, pcpu_free_pages() is invoked to
      free what has already been allocated.  The invocation is across the
      whole requested range and pcpu_free_pages() will try to free all
      non-NULL pages; unfortunately, this is incorrect as
      pcpu_get_pages_and_bitmap(), unlike what its comment suggests, doesn't
      clear the pages array and thus the array may have entries from the
      previous invocations making the partial failure path free incorrect
      pages.
      
      Fix it by open-coding the partial freeing of the already allocated
      pages.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      f0d27965
  4. 15 8月, 2014 1 次提交
  5. 12 8月, 2014 1 次提交
  6. 09 8月, 2014 12 次提交
    • D
      shm: wait for pins to be released when sealing · 05f65b5c
      David Herrmann 提交于
      If we set SEAL_WRITE on a file, we must make sure there cannot be any
      ongoing write-operations on the file.  For write() calls, we simply lock
      the inode mutex, for mmap() we simply verify there're no writable
      mappings.  However, there might be pages pinned by AIO, Direct-IO and
      similar operations via GUP.  We must make sure those do not write to the
      memfd file after we set SEAL_WRITE.
      
      As there is no way to notify GUP users to drop pages or to wait for them
      to be done, we implement the wait ourself: When setting SEAL_WRITE, we
      check all pages for their ref-count.  If it's bigger than 1, we know
      there's some user of the page.  We then mark the page and wait for up to
      150ms for those ref-counts to be dropped.  If the ref-counts are not
      dropped in time, we refuse the seal operation.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05f65b5c
    • D
      shm: add memfd_create() syscall · 9183df25
      David Herrmann 提交于
      memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
      that you can pass to mmap().  It can support sealing and avoids any
      connection to user-visible mount-points.  Thus, it's not subject to quotas
      on mounted file-systems, but can be used like malloc()'ed memory, but with
      a file-descriptor to it.
      
      memfd_create() returns the raw shmem file, so calls like ftruncate() can
      be used to modify the underlying inode.  Also calls like fstat() will
      return proper information and mark the file as regular file.  If you want
      sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
      supported (like on all other regular files).
      
      Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
      subject to a filesystem size limit.  It is still properly accounted to
      memcg limits, though, and to the same overcommit or no-overcommit
      accounting as all user memory.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9183df25
    • D
      shm: add sealing API · 40e041a2
      David Herrmann 提交于
      If two processes share a common memory region, they usually want some
      guarantees to allow safe access. This often includes:
        - one side cannot overwrite data while the other reads it
        - one side cannot shrink the buffer while the other accesses it
        - one side cannot grow the buffer beyond previously set boundaries
      
      If there is a trust-relationship between both parties, there is no need
      for policy enforcement.  However, if there's no trust relationship (eg.,
      for general-purpose IPC) sharing memory-regions is highly fragile and
      often not possible without local copies.  Look at the following two
      use-cases:
      
        1) A graphics client wants to share its rendering-buffer with a
           graphics-server. The memory-region is allocated by the client for
           read/write access and a second FD is passed to the server. While
           scanning out from the memory region, the server has no guarantee that
           the client doesn't shrink the buffer at any time, requiring rather
           cumbersome SIGBUS handling.
        2) A process wants to perform an RPC on another process. To avoid huge
           bandwidth consumption, zero-copy is preferred. After a message is
           assembled in-memory and a FD is passed to the remote side, both sides
           want to be sure that neither modifies this shared copy, anymore. The
           source may have put sensible data into the message without a separate
           copy and the target may want to parse the message inline, to avoid a
           local copy.
      
      While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
      ways to achieve most of this, the first one is unproportionally ugly to
      use in libraries and the latter two are broken/racy or even disabled due
      to denial of service attacks.
      
      This patch introduces the concept of SEALING.  If you seal a file, a
      specific set of operations is blocked on that file forever.  Unlike locks,
      seals can only be set, never removed.  Hence, once you verified a specific
      set of seals is set, you're guaranteed that no-one can perform the blocked
      operations on this file, anymore.
      
      An initial set of SEALS is introduced by this patch:
        - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
                  in size. This affects ftruncate() and open(O_TRUNC).
        - GROW: If SEAL_GROW is set, the file in question cannot be increased
                in size. This affects ftruncate(), fallocate() and write().
        - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
                 are possible. This affects fallocate(PUNCH_HOLE), mmap() and
                 write().
        - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
                This basically prevents the F_ADD_SEAL operation on a file and
                can be set to prevent others from adding further seals that you
                don't want.
      
      The described use-cases can easily use these seals to provide safe use
      without any trust-relationship:
      
        1) The graphics server can verify that a passed file-descriptor has
           SEAL_SHRINK set. This allows safe scanout, while the client is
           allowed to increase buffer size for window-resizing on-the-fly.
           Concurrent writes are explicitly allowed.
        2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
           SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
           process can modify the data while the other side parses it.
           Furthermore, it guarantees that even with writable FDs passed to the
           peer, it cannot increase the size to hit memory-limits of the source
           process (in case the file-storage is accounted to the source).
      
      The new API is an extension to fcntl(), adding two new commands:
        F_GET_SEALS: Return a bitset describing the seals on the file. This
                     can be called on any FD if the underlying file supports
                     sealing.
        F_ADD_SEALS: Change the seals of a given file. This requires WRITE
                     access to the file and F_SEAL_SEAL may not already be set.
                     Furthermore, the underlying file must support sealing and
                     there may not be any existing shared mapping of that file.
                     Otherwise, EBADF/EPERM is returned.
                     The given seals are _added_ to the existing set of seals
                     on the file. You cannot remove seals again.
      
      The fcntl() handler is currently specific to shmem and disabled on all
      files. A file needs to explicitly support sealing for this interface to
      work. A separate syscall is added in a follow-up, which creates files that
      support sealing. There is no intention to support this on other
      file-systems. Semantics are unclear for non-volatile files and we lack any
      use-case right now. Therefore, the implementation is specific to shmem.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40e041a2
    • D
      mm: allow drivers to prevent new writable mappings · 4bb5f5d9
      David Herrmann 提交于
      This patch (of 6):
      
      The i_mmap_writable field counts existing writable mappings of an
      address_space.  To allow drivers to prevent new writable mappings, make
      this counter signed and prevent new writable mappings if it is negative.
      This is modelled after i_writecount and DENYWRITE.
      
      This will be required by the shmem-sealing infrastructure to prevent any
      new writable mappings after the WRITE seal has been set.  In case there
      exists a writable mapping, this operation will fail with EBUSY.
      
      Note that we rely on the fact that iff you already own a writable mapping,
      you can increase the counter without using the helpers.  This is the same
      that we do for i_writecount.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4bb5f5d9
    • A
      arm64,ia64,ppc,s390,sh,tile,um,x86,mm: remove default gate area · a6c19dfe
      Andy Lutomirski 提交于
      The core mm code will provide a default gate area based on
      FIXADDR_USER_START and FIXADDR_USER_END if
      !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).
      
      This default is only useful for ia64.  arm64, ppc, s390, sh, tile, 64-bit
      UML, and x86_32 have their own code just to disable it.  arm, 32-bit UML,
      and x86_64 have gate areas, but they have their own implementations.
      
      This gets rid of the default and moves the code into ia64.
      
      This should save some code on architectures without a gate area: it's now
      possible to inline the gate_area functions in the default case.
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Acked-by: NNathan Lynch <nathan_lynch@mentor.com>
      Acked-by: NH. Peter Anvin <hpa@linux.intel.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [in principle]
      Acked-by: Richard Weinberger <richard@nod.at> [for um]
      Acked-by: Will Deacon <will.deacon@arm.com> [for arm64]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Nathan Lynch <Nathan_Lynch@mentor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6c19dfe
    • F
      mm/zswap.c: add __init to zswap_entry_cache_destroy() · c119239b
      Fabian Frederick 提交于
      zswap_entry_cache_destroy() is only called by __init init_zswap().
      
      This patch also fixes function name zswap_entry_cache_ s/destory/destroy
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Acked-by: NSeth Jennings <sjennings@variantweb.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c119239b
    • J
      mm: memcontrol: avoid charge statistics churn during page migration · 6abb5a86
      Johannes Weiner 提交于
      Charge migration currently disables IRQs twice to update the charge
      statistics for the old page and then again for the new page.
      
      But migration is a seamless transition of a charge from one physical
      page to another one of the same size, so this should be a non-event from
      an accounting point of view.  Leave the statistics alone.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6abb5a86
    • J
      mm: memcontrol: use page lists for uncharge batching · 747db954
      Johannes Weiner 提交于
      Pages are now uncharged at release time, and all sources of batched
      uncharges operate on lists of pages.  Directly use those lists, and
      get rid of the per-task batching state.
      
      This also batches statistics accounting, in addition to the res
      counter charges, to reduce IRQ-disabling and re-enabling.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      747db954
    • J
      mm: memcontrol: rewrite uncharge API · 0a31bc97
      Johannes Weiner 提交于
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: NJet Chen <jet.chen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Tested-by: NFelipe Balbi <balbi@ti.com>
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a31bc97
    • J
      mm: memcontrol: rewrite charge API · 00501b53
      Johannes Weiner 提交于
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
       executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.
      
      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00501b53
    • O
      vm_is_stack: use for_each_thread() rather then buggy while_each_thread() · 4449a51a
      Oleg Nesterov 提交于
      Aleksei hit the soft lockup during reading /proc/PID/smaps.  David
      investigated the problem and suggested the right fix.
      
      while_each_thread() is racy and should die, this patch updates
      vm_is_stack().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reported-by: NAleksei Besogonov <alex.besogonov@gmail.com>
      Tested-by: NAleksei Besogonov <alex.besogonov@gmail.com>
      Suggested-by: NDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4449a51a
    • J
      Revert "slab: remove BAD_ALIEN_MAGIC" · edcad250
      Joonsoo Kim 提交于
      This reverts commit a6406168 ("slab: remove BAD_ALIEN_MAGIC").
      
      commit a6406168 ("slab: remove BAD_ALIEN_MAGIC") assumes that the
      system with !CONFIG_NUMA has only one memory node.  But, it turns out to
      be false by the report from Geert.  His system, m68k, has many memory
      nodes and is configured in !CONFIG_NUMA.  So it couldn't boot with above
      change.
      
      Here goes his failure report.
      
        With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM:
      
        enable_cpucache failed for radix_tree_node, error 12.
        kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522!
        *** TRAP #7 ***   FORMAT=0
        Current process id is 0
        BAD KERNEL TRAP: 00000000
        Modules linked in:
        PC: [<0039c92c>] kmem_cache_init_late+0x70/0x8c
        SR: 2200  SP: 00345f90  a2: 0034c2e8
        d0: 0000003d    d1: 00000000    d2: 00000000    d3: 003ac942
        d4: 00000000    d5: 00000000    a0: 0034f686    a1: 0034f682
        Process swapper (pid: 0, task=0034c2e8)
        Frame format=0
        Stack from 00345fc4:
                002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000
                00000000 00000000 00000000 00000000 00000000 003ac942 00000000
                003912d6
        Call Trace: [<000360fa>] parse_args+0x0/0x2ca
         [<0017d806>] strlen+0x0/0x1a
         [<003921d4>] start_kernel+0x23c/0x428
         [<003912d6>] _sinittext+0x2d6/0x95e
      
        Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 <4879> 0035 4b1c 61ff
        fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c
        Disabling lock debugging due to kernel taint
        Kernel panic - not syncing: Attempted to kill the idle task!
      
      Although there is a alternative way to fix this issue such as disabling
      use of alien cache on !CONFIG_NUMA, but, reverting issued commit is better
      to me in this time.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      edcad250
  7. 08 8月, 2014 3 次提交
  8. 07 8月, 2014 6 次提交
    • D
      mm/zpool: update zswap to use zpool · 12d79d64
      Dan Streetman 提交于
      Change zswap to use the zpool api instead of directly using zbud.  Add a
      boot-time param to allow selecting which zpool implementation to use,
      with zbud as the default.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Tested-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12d79d64
    • D
      mm/zpool: zbud/zsmalloc implement zpool · c795779d
      Dan Streetman 提交于
      Update zbud and zsmalloc to implement the zpool api.
      
      [fengguang.wu@intel.com: make functions static]
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Tested-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c795779d
    • D
      mm/zpool: implement common zpool api to zbud/zsmalloc · af8d417a
      Dan Streetman 提交于
      Add zpool api.
      
      zpool provides an interface for memory storage, typically of compressed
      memory.  Users can select what backend to use; currently the only
      implementations are zbud, a low density implementation with up to two
      compressed pages per storage page, and zsmalloc, a higher density
      implementation with multiple compressed pages per storage page.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Tested-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af8d417a
    • D
      mm/zbud: change zbud_alloc size type to size_t · 99eef8e9
      Dan Streetman 提交于
      Change the type of the zbud_alloc() size param from unsigned int to
      size_t.
      
      Technically, this should not make any difference, as the zbud
      implementation already restricts the size to well within either type's
      limits; but as zsmalloc (and kmalloc) use size_t, and zpool will use
      size_t, this brings the size parameter type in line with zsmalloc/zpool.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Acked-by: NSeth Jennings <sjennings@variantweb.net>
      Tested-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99eef8e9
    • M
      mm/highmem: make kmap cache coloring aware · 15de36a4
      Max Filippov 提交于
      User-visible effect:
       Architectures that choose this method of maintaining cache coherency
      (MIPS and xtensa currently) are able to use high memory on cores with
      aliasing data cache.  Without this fix such architectures can not use
      high memory (in case of xtensa it means that at most 128 MBytes of
      physical memory is available).
      
      The problem:
       VIPT cache with way size larger than MMU page size may suffer from
      aliasing problem: a single physical address accessed via different
      virtual addresses may end up in multiple locations in the cache.
      Virtual mappings of a physical address that always get cached in
      different cache locations are said to have different colors.  L1 caching
      hardware usually doesn't handle this situation leaving it up to
      software.  Software must avoid this situation as it leads to data
      corruption.
      
      What can be done:
       One way to handle this is to flush and invalidate data cache every time
      page mapping changes color.  The other way is to always map physical
      page at a virtual address with the same color.  Low memory pages already
      have this property.  Giving architecture a way to control color of high
      memory page mapping allows reusing of existing low memory cache alias
      handling code.
      
      How this is done with this patch:
       Provide hooks that allow architectures with aliasing cache to align
      mapping address of high pages according to their color.  Such
      architectures may enforce similar coloring of low- and high-memory page
      mappings and reuse existing cache management functions to support
      highmem.
      
      This code is based on the implementation of similar feature for MIPS by
      Leonid Yegoshin.
      Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com>
      Cc: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Marc Gauthier <marc@cadence.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Steven Hill <Steven.Hill@imgtec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15de36a4
    • P
      mmu_notifier: add call_srcu and sync function for listener to delay call and sync · b972216e
      Peter Zijlstra 提交于
      When kernel device drivers or subsystems want to bind their lifespan to
      t= he lifespan of the mm_struct, they usually use one of the following
      methods:
      
      1. Manually calling a function in the interested kernel module.  The
         funct= ion call needs to be placed in mmput.  This method was rejected
         by several ker= nel maintainers.
      
      2. Registering to the mmu notifier release mechanism.
      
      The problem with the latter approach is that the mmu_notifier_release
      cal= lback is called from__mmu_notifier_release (called from exit_mmap).
      That functi= on iterates over the list of mmu notifiers and don't expect
      the release call= back function to remove itself from the list.
      Therefore, the callback function= in the kernel module can't release the
      mmu_notifier_object, which is actuall= y the kernel module's object
      itself.  As a result, the destruction of the kernel module's object must
      to be done in a delayed fashion.
      
      This patch adds support for this delayed callback, by adding a new
      mmu_notifier_call_srcu function that receives a function ptr and calls
      th= at function with call_srcu.  In that function, the kernel module
      releases its object.  To use mmu_notifier_call_srcu, the calling module
      needs to call b= efore that a new function called
      mmu_notifier_unregister_no_release that as its= name implies,
      unregisters a notifier without calling its notifier release call= back.
      
      This patch also adds a function that will call barrier_srcu so those
      kern= el modules can sync with mmu_notifier.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NOded Gabbay <oded.gabbay@amd.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b972216e