1. 03 Oct, 2014 - 1 commit
    • mm: memcontrol: do not iterate uninitialized memcgs · 2f7dd7a4
      Committed by Johannes Weiner
      The cgroup iterators yield css objects that have not yet gone through
      css_online(), but they are not complete memcgs at this point and so the
      memcg iterators should not return them.  Commit d8ad3055 ("mm/memcg:
      iteration skip memcgs not yet fully initialized") set out to implement
      exactly this, but it uses CSS_ONLINE, a cgroup-internal flag that does
      not meet the ordering requirements for memcg, and so the iterator may
      skip over initialized groups, or return partially initialized memcgs.
      
      The cgroup core can not reasonably provide a clear answer on whether the
      object around the css has been fully initialized, as that depends on
      controller-specific locking and lifetime rules.  Thus, introduce a
      memcg-specific flag that is set after the memcg has been initialized in
      css_online(), and read before mem_cgroup_iter() callers access the memcg
      members.
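
      A minimal sketch of the flag pattern described above, assuming an illustrative field name
      "initialized" and the acquire/release helpers available in kernels of that era; the exact
      names and ordering primitives in the patch may differ:

      /* Sketch only: names are illustrative, not a copy of the patch. */
      struct mem_cgroup {
              /* ... */
              int initialized;        /* set once css_online() has finished */
      };

      static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
      {
              struct mem_cgroup *memcg = mem_cgroup_from_css(css);

              /* ... memcg-specific initialization ... */

              /* publish the fully initialized memcg to the iterators */
              smp_store_release(&memcg->initialized, 1);
              return 0;
      }

      /* mem_cgroup_iter() then skips css objects whose memcg is not ready yet. */
      static bool memcg_ready(struct mem_cgroup *memcg)
      {
              return smp_load_acquire(&memcg->initialized); /* pairs with the release above */
      }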
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 27 Sep, 2014 - 2 commits
    • fuse: honour max_read and max_write in direct_io mode · 2c80929c
      Committed by Miklos Szeredi
      The third argument of fuse_get_user_pages(), "nbytesp", refers to the number of
      bytes the caller asked to pack into the fuse request.  This value may be less
      than the capacity of the fuse request or of the iov_iter, so
      fuse_get_user_pages() must ensure that *nbytesp does not grow.
      
      Now that the helper iov_iter_get_pages() performs all the hard work of
      extracting pages from the iov_iter, this can be done by passing a properly
      calculated "maxsize" to the helper.
      
      The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need
      this capability, so pass LONG_MAX as the maxsize argument here.
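
      A hedged sketch of the idea: clamp the extraction to *nbytesp by passing it as the
      "maxsize" argument.  The variable names (req, ii, npages) are illustrative fragments of
      fuse_get_user_pages(), not the literal patch:

      /* Sketch: let iov_iter_get_pages() do the clamping for us. */
      size_t nbytes = *nbytesp;       /* what the caller asked to pack */
      size_t start;
      ssize_t ret;

      ret = iov_iter_get_pages(ii, &req->pages[req->num_pages],
                               nbytes,        /* maxsize: never extract more than this */
                               npages, &start);
      if (ret < 0)
              return ret;
      *nbytesp = ret;                 /* may shrink, but can no longer grow */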
      
      Fixes: c9c37e2e ("fuse: switch to iov_iter_get_pages()")
      Reported-by: Werner Baumann <werner.baumann@onlinehome.de>
      Tested-by: Maxim Patlasov <mpatlasov@parallels.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • shmem: fix nlink for rename overwrite directory · b928095b
      Committed by Miklos Szeredi
      If overwriting an empty directory with rename, then we need to drop the
      extra nlink.
      
      Test prog:
      
      #include <stdio.h>
      #include <fcntl.h>
      #include <err.h>
      #include <sys/stat.h>
      
      int main(void)
      {
      	const char *test_dir1 = "test-dir1";
      	const char *test_dir2 = "test-dir2";
      	int res;
      	int fd;
      	struct stat statbuf;
      
      	res = mkdir(test_dir1, 0777);
      	if (res == -1)
      		err(1, "mkdir(\"%s\")", test_dir1);
      
      	res = mkdir(test_dir2, 0777);
      	if (res == -1)
      		err(1, "mkdir(\"%s\")", test_dir2);
      
      	fd = open(test_dir2, O_RDONLY);
      	if (fd == -1)
      		err(1, "open(\"%s\")", test_dir2);
      
      	res = rename(test_dir1, test_dir2);
      	if (res == -1)
      		err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2);
      
      	res = fstat(fd, &statbuf);
      	if (res == -1)
      		err(1, "fstat(%i)", fd);
      
      	if (statbuf.st_nlink != 0) {
      		fprintf(stderr, "nlink is %lu, should be 0\n", statbuf.st_nlink);
      		return 1;
      	}
      
      	return 0;
      }
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 26 Sep, 2014 - 2 commits
  4. 25 Sep, 2014 - 1 commit
    • cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags · 2ad654bc
      Committed by Zefan Li
      When we change cpuset.memory_spread_{page,slab}, cpuset will flip the
      PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
      This should be done using atomic bitops, but currently it isn't,
      which is broken.
      
      Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
      when one thread tried to clear PF_USED_MATH while at the same time another
      thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
      the same task.
      
      Here's the full report:
      https://lkml.org/lkml/2014/9/19/230
      
      To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
      
      v4:
      - updated mm/slab.c. (Fengguang Wu)
      - updated Documentation.
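
      A minimal sketch of the approach, assuming an illustrative per-task atomic_flags word and
      accessor names; the actual helpers in the patch may be generated by macros:

      /* Sketch: keep the spread flags in a word only touched with atomic bitops. */
      #define PFA_SPREAD_PAGE   1     /* illustrative bit numbers */
      #define PFA_SPREAD_SLAB   2

      struct task_struct {
              /* ... */
              unsigned long atomic_flags;     /* modified with set_bit()/clear_bit() */
      };

      static inline bool task_spread_page(struct task_struct *p)
      {
              return test_bit(PFA_SPREAD_PAGE, &p->atomic_flags);
      }

      static inline void task_set_spread_page(struct task_struct *p)
      {
              set_bit(PFA_SPREAD_PAGE, &p->atomic_flags);
      }

      static inline void task_clear_spread_page(struct task_struct *p)
      {
              clear_bit(PFA_SPREAD_PAGE, &p->atomic_flags);
      }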
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Kees Cook <keescook@chromium.org>
      Fixes: 950592f7 ("cpusets: update tasks' page/slab spread flags in time")
      Cc: <stable@vger.kernel.org> # 2.6.31+
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  5. 19 Sep, 2014 - 1 commit
  6. 14 Sep, 2014 - 1 commit
  7. 11 Sep, 2014 - 2 commits
  8. 05 Sep, 2014 - 1 commit
  9. 30 Aug, 2014 - 5 commits
  10. 16 Aug, 2014 - 3 commits
    • percpu: free percpu allocation info for uniprocessor system · 3189eddb
      Committed by Honggang Li
      Currently, only SMP systems free the percpu allocation info;
      uniprocessor systems should free it too.  For example, on an x86 UML
      virtual machine with 256MB of memory, the UML kernel wastes one page of memory.
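
      A hedged sketch of the fix, assuming the UP variant of setup_per_cpu_areas() simply frees
      the allocation info once the first chunk is set up (shape is illustrative, not the literal
      patch):

      /* Sketch of the !SMP path. */
      void __init setup_per_cpu_areas(void)
      {
              struct pcpu_alloc_info *ai;
              void *fc;

              ai = pcpu_alloc_alloc_info(1, 1);
              /* ... size the single group/unit and allocate the first chunk into fc ... */
              pcpu_setup_first_chunk(ai, fc);
              pcpu_free_alloc_info(ai);       /* the UP case used to leak this page */
      }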
      Signed-off-by: Honggang Li <enjoymindful@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
    • percpu: perform tlb flush after pcpu_map_pages() failure · 849f5169
      Committed by Tejun Heo
      If pcpu_map_pages() fails midway, it unmaps the already mapped pages.
      Currently, it doesn't flush the TLB after the partial unmapping.  This may
      be okay in most cases, as the established mapping hasn't been used at
      that point, but it can go wrong, and when it does it is
      extremely difficult to track down.
      
      Flush tlb after the partial unmapping.
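
      A hedged sketch of the error path, with helper names along the lines of mm/percpu-vm.c;
      the exact cleanup code in the patch may differ:

      /* Sketch: pcpu_map_pages() failure path (illustrative fragment). */
      err:
              for_each_possible_cpu(tcpu) {
                      if (tcpu == cpu)
                              break;
                      __pcpu_unmap_pages(pcpu_chunk_addr(chunk, tcpu, page_start),
                                         page_end - page_start);
              }
              /* the new bit: flush the TLB for the partially established mappings */
              pcpu_post_unmap_tlb_flush(chunk, page_start, page_end);
              return err;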
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
    • percpu: fix pcpu_alloc_pages() failure path · f0d27965
      Committed by Tejun Heo
      When pcpu_alloc_pages() fails midway, pcpu_free_pages() is invoked to
      free what has already been allocated.  The invocation is across the
      whole requested range, and pcpu_free_pages() will try to free all
      non-NULL pages; unfortunately, this is incorrect because
      pcpu_get_pages_and_bitmap(), unlike what its comment suggests, doesn't
      clear the pages array, and thus the array may contain entries from
      previous invocations, making the partial failure path free the wrong
      pages.
      
      Fix it by open-coding the partial freeing of the already allocated
      pages.
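
      A hedged sketch of the open-coded partial freeing, with names modeled on mm/percpu-vm.c;
      treat it as illustrative rather than the literal patch:

      /* Sketch: free only what this invocation actually allocated. */
      err:
              while (--i >= page_start)
                      __free_page(pages[pcpu_page_idx(cpu, i)]);

              for_each_possible_cpu(tcpu) {
                      if (tcpu == cpu)
                              break;
                      for (i = page_start; i < page_end; i++)
                              __free_page(pages[pcpu_page_idx(tcpu, i)]);
              }
              return -ENOMEM;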
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
  11. 15 Aug, 2014 - 1 commit
  12. 12 Aug, 2014 - 1 commit
  13. 09 Aug, 2014 - 12 commits
    • shm: wait for pins to be released when sealing · 05f65b5c
      Committed by David Herrmann
      If we set SEAL_WRITE on a file, we must make sure there cannot be any
      ongoing write operations on the file.  For write() calls, we simply lock
      the inode mutex; for mmap(), we simply verify there are no writable
      mappings.  However, there might be pages pinned by AIO, Direct-IO and
      similar operations via GUP.  We must make sure those do not write to the
      memfd file after we set SEAL_WRITE.
      
      As there is no way to notify GUP users to drop pages or to wait for them
      to be done, we implement the wait ourselves: when setting SEAL_WRITE, we
      check all pages for their ref-count.  If it's bigger than 1, we know
      there's some user of the page.  We then mark the page and wait for up to
      150ms for those ref-counts to be dropped.  If the ref-counts are not
      dropped in time, we refuse the seal operation.
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shm: add memfd_create() syscall · 9183df25
      Committed by David Herrmann
      memfd_create() is similar to mmap(MAP_ANON), but returns a file descriptor
      that you can pass to mmap().  It can support sealing and avoids any
      connection to user-visible mount points.  Thus, it is not subject to quotas
      on mounted file systems, and it can be used like malloc()'ed memory, but with
      a file descriptor to it.
      
      memfd_create() returns the raw shmem file, so calls like ftruncate() can
      be used to modify the underlying inode.  Calls like fstat() will also
      return proper information and mark the file as a regular file.  If you want
      sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
      supported (like on all other regular files).
      
      Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
      subject to a filesystem size limit.  It is still properly accounted to
      memcg limits, though, and to the same overcommit or no-overcommit
      accounting as all user memory.
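
      A hedged user-space sketch of the call pattern described above; it assumes a libc new
      enough to expose memfd_create() in <sys/mman.h> (older systems would go through
      syscall(__NR_memfd_create, ...)):

      #define _GNU_SOURCE
      #include <err.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      int main(void)
      {
              int fd = memfd_create("demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
              if (fd == -1)
                      err(1, "memfd_create");

              /* size it like a regular file ... */
              if (ftruncate(fd, 4096) == -1)
                      err(1, "ftruncate");

              /* ... and map it like anonymous memory */
              char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
              if (p == MAP_FAILED)
                      err(1, "mmap");

              strcpy(p, "hello memfd");
              printf("%s\n", p);

              munmap(p, 4096);
              close(fd);
              return 0;
      }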
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shm: add sealing API · 40e041a2
      Committed by David Herrmann
      If two processes share a common memory region, they usually want some
      guarantees to allow safe access. This often includes:
        - one side cannot overwrite data while the other reads it
        - one side cannot shrink the buffer while the other accesses it
        - one side cannot grow the buffer beyond previously set boundaries
      
      If there is a trust-relationship between both parties, there is no need
      for policy enforcement.  However, if there's no trust relationship (eg.,
      for general-purpose IPC) sharing memory-regions is highly fragile and
      often not possible without local copies.  Look at the following two
      use-cases:
      
        1) A graphics client wants to share its rendering-buffer with a
           graphics-server. The memory-region is allocated by the client for
           read/write access and a second FD is passed to the server. While
           scanning out from the memory region, the server has no guarantee that
           the client doesn't shrink the buffer at any time, requiring rather
           cumbersome SIGBUS handling.
        2) A process wants to perform an RPC on another process. To avoid huge
           bandwidth consumption, zero-copy is preferred. After a message is
           assembled in-memory and a FD is passed to the remote side, both sides
           want to be sure that neither modifies this shared copy anymore.  The
           source may have put sensitive data into the message without keeping a
           separate copy, and the target may want to parse the message inline, to
           avoid a local copy.
      
      While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
      ways to achieve most of this, the first is disproportionately ugly to
      use in libraries, and the latter two are broken/racy or even disabled due
      to denial-of-service attacks.
      
      This patch introduces the concept of SEALING.  If you seal a file, a
      specific set of operations is blocked on that file forever.  Unlike locks,
      seals can only be set, never removed.  Hence, once you have verified that a
      specific set of seals is set, you are guaranteed that no one can perform the
      blocked operations on this file anymore.
      
      An initial set of SEALS is introduced by this patch:
        - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
                  in size. This affects ftruncate() and open(O_TRUNC).
        - GROW: If SEAL_GROW is set, the file in question cannot be increased
                in size. This affects ftruncate(), fallocate() and write().
        - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
                 are possible. This affects fallocate(PUNCH_HOLE), mmap() and
                 write().
        - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
                This basically prevents the F_ADD_SEAL operation on a file and
                can be set to prevent others from adding further seals that you
                don't want.
      
      The described use-cases can easily use these seals to provide safe use
      without any trust-relationship:
      
        1) The graphics server can verify that a passed file-descriptor has
           SEAL_SHRINK set. This allows safe scanout, while the client is
           allowed to increase buffer size for window-resizing on-the-fly.
           Concurrent writes are explicitly allowed.
        2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
           SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
           process can modify the data while the other side parses it.
           Furthermore, it guarantees that even with writable FDs passed to the
           peer, it cannot increase the size to hit memory-limits of the source
           process (in case the file-storage is accounted to the source).
      
      The new API is an extension to fcntl(), adding two new commands:
        F_GET_SEALS: Return a bitset describing the seals on the file. This
                     can be called on any FD if the underlying file supports
                     sealing.
        F_ADD_SEALS: Change the seals of a given file. This requires WRITE
                     access to the file and F_SEAL_SEAL may not already be set.
                     Furthermore, the underlying file must support sealing and
                     there may not be any existing shared mapping of that file.
                     Otherwise, EBADF/EPERM is returned.
                     The given seals are _added_ to the existing set of seals
                     on the file. You cannot remove seals again.
      
      The fcntl() handler is currently specific to shmem and disabled on all
      files. A file needs to explicitly support sealing for this interface to
      work. A separate syscall is added in a follow-up, which creates files that
      support sealing. There is no intention to support this on other
      file-systems. Semantics are unclear for non-volatile files and we lack any
      use-case right now. Therefore, the implementation is specific to shmem.
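
      A hedged user-space sketch of the new fcntl() commands; it assumes the
      F_ADD_SEALS/F_GET_SEALS and F_SEAL_* constants are visible via <fcntl.h> with _GNU_SOURCE
      (otherwise <linux/fcntl.h> provides them):

      #define _GNU_SOURCE
      #include <err.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>

      int main(void)
      {
              int fd = memfd_create("sealed-buf", MFD_ALLOW_SEALING);
              if (fd == -1)
                      err(1, "memfd_create");
              if (ftruncate(fd, 4096) == -1)
                      err(1, "ftruncate");

              /* freeze the size and contents, then forbid any further seals */
              if (fcntl(fd, F_ADD_SEALS,
                        F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL) == -1)
                      err(1, "F_ADD_SEALS");

              int seals = fcntl(fd, F_GET_SEALS);
              if (seals == -1)
                      err(1, "F_GET_SEALS");
              printf("seals: %#x\n", seals);

              /* growing the sealed file must now fail */
              if (ftruncate(fd, 8192) == 0)
                      fprintf(stderr, "unexpected: grow succeeded on sealed fd\n");

              close(fd);
              return 0;
      }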
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: allow drivers to prevent new writable mappings · 4bb5f5d9
      Committed by David Herrmann
      This patch (of 6):
      
      The i_mmap_writable field counts existing writable mappings of an
      address_space.  To allow drivers to prevent new writable mappings, make
      this counter signed and prevent new writable mappings if it is negative.
      This is modelled after i_writecount and DENYWRITE.
      
      This will be required by the shmem-sealing infrastructure to prevent any
      new writable mappings after the WRITE seal has been set.  In case there
      exists a writable mapping, this operation will fail with EBUSY.
      
      Note that we rely on the fact that if you already own a writable mapping,
      you can increase the counter without using the helpers.  This is the same
      as what we do for i_writecount.
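
      A hedged sketch of the counter discipline, using illustrative helper names; the patch's
      actual helpers may differ in naming and error codes:

      /* Sketch: i_mmap_writable as a signed count that drivers can force negative. */
      static inline int mapping_map_writable(struct address_space *mapping)
      {
              /* new writable mapping: refused once the count has been driven negative */
              return atomic_inc_unless_negative(&mapping->i_mmap_writable) ? 0 : -EPERM;
      }

      static inline void mapping_unmap_writable(struct address_space *mapping)
      {
              atomic_dec(&mapping->i_mmap_writable);
      }

      static inline int mapping_deny_writable(struct address_space *mapping)
      {
              /* seal against new writable mappings: fails if one already exists */
              return atomic_dec_unless_positive(&mapping->i_mmap_writable) ? 0 : -EBUSY;
      }

      static inline void mapping_allow_writable(struct address_space *mapping)
      {
              atomic_inc(&mapping->i_mmap_writable);
      }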
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arm64,ia64,ppc,s390,sh,tile,um,x86,mm: remove default gate area · a6c19dfe
      Committed by Andy Lutomirski
      The core mm code will provide a default gate area based on
      FIXADDR_USER_START and FIXADDR_USER_END if
      !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).
      
      This default is only useful for ia64.  arm64, ppc, s390, sh, tile, 64-bit
      UML, and x86_32 have their own code just to disable it.  arm, 32-bit UML,
      and x86_64 have gate areas, but they have their own implementations.
      
      This gets rid of the default and moves the code into ia64.
      
      This should save some code on architectures without a gate area: it's now
      possible to inline the gate_area functions in the default case.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Acked-by: Nathan Lynch <nathan_lynch@mentor.com>
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [in principle]
      Acked-by: Richard Weinberger <richard@nod.at> [for um]
      Acked-by: Will Deacon <will.deacon@arm.com> [for arm64]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Nathan Lynch <Nathan_Lynch@mentor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: add __init to zswap_entry_cache_destroy() · c119239b
      Committed by Fabian Frederick
      zswap_entry_cache_destroy() is only called by __init init_zswap().
      
      This patch also fixes the function name: s/zswap_entry_cache_destory/zswap_entry_cache_destroy/.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: avoid charge statistics churn during page migration · 6abb5a86
      Committed by Johannes Weiner
      Charge migration currently disables IRQs twice to update the charge
      statistics for the old page and then again for the new page.
      
      But migration is a seamless transition of a charge from one physical
      page to another one of the same size, so this should be a non-event from
      an accounting point of view.  Leave the statistics alone.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: use page lists for uncharge batching · 747db954
      Committed by Johannes Weiner
      Pages are now uncharged at release time, and all sources of batched
      uncharges operate on lists of pages.  Directly use those lists, and
      get rid of the per-task batching state.
      
      This also batches statistics accounting, in addition to the res
      counter charges, to reduce IRQ-disabling and re-enabling.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: rewrite uncharge API · 0a31bc97
      Committed by Johannes Weiner
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
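
      The entry points named above, written out as prototypes; the signatures are given from
      memory of that kernel era and should be treated as illustrative:

      /* Sketch of the reworked uncharge-side API (illustrative signatures). */
      void mem_cgroup_uncharge(struct page *page);    /* after the final put_page() */
      void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
      void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
                              bool lrucare);          /* after the new page is rmapped */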
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: Jet Chen <jet.chen@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Felipe Balbi <balbi@ti.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: rewrite charge API · 00501b53
      Committed by Johannes Weiner
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
       executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.
      
      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
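
      A hedged sketch of the transaction at a call site such as an anonymous fault; the three
      function names follow the description above, while the surrounding details (including the
      placeholder later_step_failed()) are illustrative:

      /* Sketch: charging as a try/commit/cancel transaction. */
      struct mem_cgroup *memcg;

      if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
              return VM_FAULT_OOM;                    /* reservation failed */

      if (later_step_failed()) {                      /* placeholder for any later failure */
              mem_cgroup_cancel_charge(page, memcg);  /* abort the transaction */
              return VM_FAULT_RETRY;
      }

      page_add_new_anon_rmap(page, vma, address);     /* the page now has a type */
      mem_cgroup_commit_charge(page, memcg, false);   /* commit the reservation */
      lru_cache_add_active_or_unevictable(page, vma); /* revived LRU-add helper */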
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vm_is_stack: use for_each_thread() rather than buggy while_each_thread() · 4449a51a
      Committed by Oleg Nesterov
      Aleksei hit a soft lockup while reading /proc/PID/smaps.  David
      investigated the problem and suggested the right fix.
      
      while_each_thread() is racy and should die; this patch updates
      vm_is_stack().
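
      A hedged sketch of the replacement loop; vm_is_stack_for_task() and the surrounding
      context are recalled from that era and may differ from the actual patch:

      /* Sketch: walk the thread group with for_each_thread() under RCU. */
      struct task_struct *t;
      pid_t ret = 0;

      rcu_read_lock();
      for_each_thread(task, t) {
              if (vm_is_stack_for_task(t, vma)) {
                      ret = t->pid;
                      break;
              }
      }
      rcu_read_unlock();
      return ret;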
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Aleksei Besogonov <alex.besogonov@gmail.com>
      Tested-by: Aleksei Besogonov <alex.besogonov@gmail.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "slab: remove BAD_ALIEN_MAGIC" · edcad250
      Committed by Joonsoo Kim
      This reverts commit a6406168 ("slab: remove BAD_ALIEN_MAGIC").
      
      Commit a6406168 ("slab: remove BAD_ALIEN_MAGIC") assumes that a
      system with !CONFIG_NUMA has only one memory node.  But Geert's report
      shows this to be false: his m68k system has many memory nodes and is
      configured with !CONFIG_NUMA, so it couldn't boot with the above
      change.
      
      Here is his failure report:
      
        With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM:
      
        enable_cpucache failed for radix_tree_node, error 12.
        kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522!
        *** TRAP #7 ***   FORMAT=0
        Current process id is 0
        BAD KERNEL TRAP: 00000000
        Modules linked in:
        PC: [<0039c92c>] kmem_cache_init_late+0x70/0x8c
        SR: 2200  SP: 00345f90  a2: 0034c2e8
        d0: 0000003d    d1: 00000000    d2: 00000000    d3: 003ac942
        d4: 00000000    d5: 00000000    a0: 0034f686    a1: 0034f682
        Process swapper (pid: 0, task=0034c2e8)
        Frame format=0
        Stack from 00345fc4:
                002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000
                00000000 00000000 00000000 00000000 00000000 003ac942 00000000
                003912d6
        Call Trace: [<000360fa>] parse_args+0x0/0x2ca
         [<0017d806>] strlen+0x0/0x1a
         [<003921d4>] start_kernel+0x23c/0x428
         [<003912d6>] _sinittext+0x2d6/0x95e
      
        Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 <4879> 0035 4b1c 61ff
        fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c
        Disabling lock debugging due to kernel taint
        Kernel panic - not syncing: Attempted to kill the idle task!
      
      Although there is an alternative way to fix this issue, such as disabling
      use of the alien cache on !CONFIG_NUMA, reverting the offending commit seems
      better to me at this time.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 08 Aug, 2014 - 3 commits
  15. 07 Aug, 2014 - 4 commits