1. 30 8月, 2014 5 次提交
  2. 15 8月, 2014 1 次提交
  3. 12 8月, 2014 1 次提交
  4. 09 8月, 2014 12 次提交
    • D
      shm: wait for pins to be released when sealing · 05f65b5c
      David Herrmann 提交于
      If we set SEAL_WRITE on a file, we must make sure there cannot be any
      ongoing write-operations on the file.  For write() calls, we simply lock
      the inode mutex, for mmap() we simply verify there're no writable
      mappings.  However, there might be pages pinned by AIO, Direct-IO and
      similar operations via GUP.  We must make sure those do not write to the
      memfd file after we set SEAL_WRITE.
      
      As there is no way to notify GUP users to drop pages or to wait for them
      to be done, we implement the wait ourself: When setting SEAL_WRITE, we
      check all pages for their ref-count.  If it's bigger than 1, we know
      there's some user of the page.  We then mark the page and wait for up to
      150ms for those ref-counts to be dropped.  If the ref-counts are not
      dropped in time, we refuse the seal operation.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05f65b5c
    • D
      shm: add memfd_create() syscall · 9183df25
      David Herrmann 提交于
      memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
      that you can pass to mmap().  It can support sealing and avoids any
      connection to user-visible mount-points.  Thus, it's not subject to quotas
      on mounted file-systems, but can be used like malloc()'ed memory, but with
      a file-descriptor to it.
      
      memfd_create() returns the raw shmem file, so calls like ftruncate() can
      be used to modify the underlying inode.  Also calls like fstat() will
      return proper information and mark the file as regular file.  If you want
      sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
      supported (like on all other regular files).
      
      Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
      subject to a filesystem size limit.  It is still properly accounted to
      memcg limits, though, and to the same overcommit or no-overcommit
      accounting as all user memory.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9183df25
    • D
      shm: add sealing API · 40e041a2
      David Herrmann 提交于
      If two processes share a common memory region, they usually want some
      guarantees to allow safe access. This often includes:
        - one side cannot overwrite data while the other reads it
        - one side cannot shrink the buffer while the other accesses it
        - one side cannot grow the buffer beyond previously set boundaries
      
      If there is a trust-relationship between both parties, there is no need
      for policy enforcement.  However, if there's no trust relationship (eg.,
      for general-purpose IPC) sharing memory-regions is highly fragile and
      often not possible without local copies.  Look at the following two
      use-cases:
      
        1) A graphics client wants to share its rendering-buffer with a
           graphics-server. The memory-region is allocated by the client for
           read/write access and a second FD is passed to the server. While
           scanning out from the memory region, the server has no guarantee that
           the client doesn't shrink the buffer at any time, requiring rather
           cumbersome SIGBUS handling.
        2) A process wants to perform an RPC on another process. To avoid huge
           bandwidth consumption, zero-copy is preferred. After a message is
           assembled in-memory and a FD is passed to the remote side, both sides
           want to be sure that neither modifies this shared copy, anymore. The
           source may have put sensible data into the message without a separate
           copy and the target may want to parse the message inline, to avoid a
           local copy.
      
      While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
      ways to achieve most of this, the first one is unproportionally ugly to
      use in libraries and the latter two are broken/racy or even disabled due
      to denial of service attacks.
      
      This patch introduces the concept of SEALING.  If you seal a file, a
      specific set of operations is blocked on that file forever.  Unlike locks,
      seals can only be set, never removed.  Hence, once you verified a specific
      set of seals is set, you're guaranteed that no-one can perform the blocked
      operations on this file, anymore.
      
      An initial set of SEALS is introduced by this patch:
        - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
                  in size. This affects ftruncate() and open(O_TRUNC).
        - GROW: If SEAL_GROW is set, the file in question cannot be increased
                in size. This affects ftruncate(), fallocate() and write().
        - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
                 are possible. This affects fallocate(PUNCH_HOLE), mmap() and
                 write().
        - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
                This basically prevents the F_ADD_SEAL operation on a file and
                can be set to prevent others from adding further seals that you
                don't want.
      
      The described use-cases can easily use these seals to provide safe use
      without any trust-relationship:
      
        1) The graphics server can verify that a passed file-descriptor has
           SEAL_SHRINK set. This allows safe scanout, while the client is
           allowed to increase buffer size for window-resizing on-the-fly.
           Concurrent writes are explicitly allowed.
        2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
           SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
           process can modify the data while the other side parses it.
           Furthermore, it guarantees that even with writable FDs passed to the
           peer, it cannot increase the size to hit memory-limits of the source
           process (in case the file-storage is accounted to the source).
      
      The new API is an extension to fcntl(), adding two new commands:
        F_GET_SEALS: Return a bitset describing the seals on the file. This
                     can be called on any FD if the underlying file supports
                     sealing.
        F_ADD_SEALS: Change the seals of a given file. This requires WRITE
                     access to the file and F_SEAL_SEAL may not already be set.
                     Furthermore, the underlying file must support sealing and
                     there may not be any existing shared mapping of that file.
                     Otherwise, EBADF/EPERM is returned.
                     The given seals are _added_ to the existing set of seals
                     on the file. You cannot remove seals again.
      
      The fcntl() handler is currently specific to shmem and disabled on all
      files. A file needs to explicitly support sealing for this interface to
      work. A separate syscall is added in a follow-up, which creates files that
      support sealing. There is no intention to support this on other
      file-systems. Semantics are unclear for non-volatile files and we lack any
      use-case right now. Therefore, the implementation is specific to shmem.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40e041a2
    • D
      mm: allow drivers to prevent new writable mappings · 4bb5f5d9
      David Herrmann 提交于
      This patch (of 6):
      
      The i_mmap_writable field counts existing writable mappings of an
      address_space.  To allow drivers to prevent new writable mappings, make
      this counter signed and prevent new writable mappings if it is negative.
      This is modelled after i_writecount and DENYWRITE.
      
      This will be required by the shmem-sealing infrastructure to prevent any
      new writable mappings after the WRITE seal has been set.  In case there
      exists a writable mapping, this operation will fail with EBUSY.
      
      Note that we rely on the fact that iff you already own a writable mapping,
      you can increase the counter without using the helpers.  This is the same
      that we do for i_writecount.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4bb5f5d9
    • A
      arm64,ia64,ppc,s390,sh,tile,um,x86,mm: remove default gate area · a6c19dfe
      Andy Lutomirski 提交于
      The core mm code will provide a default gate area based on
      FIXADDR_USER_START and FIXADDR_USER_END if
      !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).
      
      This default is only useful for ia64.  arm64, ppc, s390, sh, tile, 64-bit
      UML, and x86_32 have their own code just to disable it.  arm, 32-bit UML,
      and x86_64 have gate areas, but they have their own implementations.
      
      This gets rid of the default and moves the code into ia64.
      
      This should save some code on architectures without a gate area: it's now
      possible to inline the gate_area functions in the default case.
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Acked-by: NNathan Lynch <nathan_lynch@mentor.com>
      Acked-by: NH. Peter Anvin <hpa@linux.intel.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [in principle]
      Acked-by: Richard Weinberger <richard@nod.at> [for um]
      Acked-by: Will Deacon <will.deacon@arm.com> [for arm64]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Nathan Lynch <Nathan_Lynch@mentor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6c19dfe
    • F
      mm/zswap.c: add __init to zswap_entry_cache_destroy() · c119239b
      Fabian Frederick 提交于
      zswap_entry_cache_destroy() is only called by __init init_zswap().
      
      This patch also fixes function name zswap_entry_cache_ s/destory/destroy
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Acked-by: NSeth Jennings <sjennings@variantweb.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c119239b
    • J
      mm: memcontrol: avoid charge statistics churn during page migration · 6abb5a86
      Johannes Weiner 提交于
      Charge migration currently disables IRQs twice to update the charge
      statistics for the old page and then again for the new page.
      
      But migration is a seamless transition of a charge from one physical
      page to another one of the same size, so this should be a non-event from
      an accounting point of view.  Leave the statistics alone.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6abb5a86
    • J
      mm: memcontrol: use page lists for uncharge batching · 747db954
      Johannes Weiner 提交于
      Pages are now uncharged at release time, and all sources of batched
      uncharges operate on lists of pages.  Directly use those lists, and
      get rid of the per-task batching state.
      
      This also batches statistics accounting, in addition to the res
      counter charges, to reduce IRQ-disabling and re-enabling.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      747db954
    • J
      mm: memcontrol: rewrite uncharge API · 0a31bc97
      Johannes Weiner 提交于
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: NJet Chen <jet.chen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Tested-by: NFelipe Balbi <balbi@ti.com>
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a31bc97
    • J
      mm: memcontrol: rewrite charge API · 00501b53
      Johannes Weiner 提交于
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
       executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.
      
      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00501b53
    • O
      vm_is_stack: use for_each_thread() rather then buggy while_each_thread() · 4449a51a
      Oleg Nesterov 提交于
      Aleksei hit the soft lockup during reading /proc/PID/smaps.  David
      investigated the problem and suggested the right fix.
      
      while_each_thread() is racy and should die, this patch updates
      vm_is_stack().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reported-by: NAleksei Besogonov <alex.besogonov@gmail.com>
      Tested-by: NAleksei Besogonov <alex.besogonov@gmail.com>
      Suggested-by: NDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4449a51a
    • J
      Revert "slab: remove BAD_ALIEN_MAGIC" · edcad250
      Joonsoo Kim 提交于
      This reverts commit a6406168 ("slab: remove BAD_ALIEN_MAGIC").
      
      commit a6406168 ("slab: remove BAD_ALIEN_MAGIC") assumes that the
      system with !CONFIG_NUMA has only one memory node.  But, it turns out to
      be false by the report from Geert.  His system, m68k, has many memory
      nodes and is configured in !CONFIG_NUMA.  So it couldn't boot with above
      change.
      
      Here goes his failure report.
      
        With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM:
      
        enable_cpucache failed for radix_tree_node, error 12.
        kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522!
        *** TRAP #7 ***   FORMAT=0
        Current process id is 0
        BAD KERNEL TRAP: 00000000
        Modules linked in:
        PC: [<0039c92c>] kmem_cache_init_late+0x70/0x8c
        SR: 2200  SP: 00345f90  a2: 0034c2e8
        d0: 0000003d    d1: 00000000    d2: 00000000    d3: 003ac942
        d4: 00000000    d5: 00000000    a0: 0034f686    a1: 0034f682
        Process swapper (pid: 0, task=0034c2e8)
        Frame format=0
        Stack from 00345fc4:
                002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000
                00000000 00000000 00000000 00000000 00000000 003ac942 00000000
                003912d6
        Call Trace: [<000360fa>] parse_args+0x0/0x2ca
         [<0017d806>] strlen+0x0/0x1a
         [<003921d4>] start_kernel+0x23c/0x428
         [<003912d6>] _sinittext+0x2d6/0x95e
      
        Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 <4879> 0035 4b1c 61ff
        fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c
        Disabling lock debugging due to kernel taint
        Kernel panic - not syncing: Attempted to kill the idle task!
      
      Although there is a alternative way to fix this issue such as disabling
      use of alien cache on !CONFIG_NUMA, but, reverting issued commit is better
      to me in this time.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      edcad250
  5. 08 8月, 2014 3 次提交
  6. 07 8月, 2014 18 次提交
    • D
      mm/zpool: update zswap to use zpool · 12d79d64
      Dan Streetman 提交于
      Change zswap to use the zpool api instead of directly using zbud.  Add a
      boot-time param to allow selecting which zpool implementation to use,
      with zbud as the default.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Tested-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12d79d64
    • D
      mm/zpool: zbud/zsmalloc implement zpool · c795779d
      Dan Streetman 提交于
      Update zbud and zsmalloc to implement the zpool api.
      
      [fengguang.wu@intel.com: make functions static]
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Tested-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c795779d
    • D
      mm/zpool: implement common zpool api to zbud/zsmalloc · af8d417a
      Dan Streetman 提交于
      Add zpool api.
      
      zpool provides an interface for memory storage, typically of compressed
      memory.  Users can select what backend to use; currently the only
      implementations are zbud, a low density implementation with up to two
      compressed pages per storage page, and zsmalloc, a higher density
      implementation with multiple compressed pages per storage page.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Tested-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af8d417a
    • D
      mm/zbud: change zbud_alloc size type to size_t · 99eef8e9
      Dan Streetman 提交于
      Change the type of the zbud_alloc() size param from unsigned int to
      size_t.
      
      Technically, this should not make any difference, as the zbud
      implementation already restricts the size to well within either type's
      limits; but as zsmalloc (and kmalloc) use size_t, and zpool will use
      size_t, this brings the size parameter type in line with zsmalloc/zpool.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Acked-by: NSeth Jennings <sjennings@variantweb.net>
      Tested-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99eef8e9
    • M
      mm/highmem: make kmap cache coloring aware · 15de36a4
      Max Filippov 提交于
      User-visible effect:
       Architectures that choose this method of maintaining cache coherency
      (MIPS and xtensa currently) are able to use high memory on cores with
      aliasing data cache.  Without this fix such architectures can not use
      high memory (in case of xtensa it means that at most 128 MBytes of
      physical memory is available).
      
      The problem:
       VIPT cache with way size larger than MMU page size may suffer from
      aliasing problem: a single physical address accessed via different
      virtual addresses may end up in multiple locations in the cache.
      Virtual mappings of a physical address that always get cached in
      different cache locations are said to have different colors.  L1 caching
      hardware usually doesn't handle this situation leaving it up to
      software.  Software must avoid this situation as it leads to data
      corruption.
      
      What can be done:
       One way to handle this is to flush and invalidate data cache every time
      page mapping changes color.  The other way is to always map physical
      page at a virtual address with the same color.  Low memory pages already
      have this property.  Giving architecture a way to control color of high
      memory page mapping allows reusing of existing low memory cache alias
      handling code.
      
      How this is done with this patch:
       Provide hooks that allow architectures with aliasing cache to align
      mapping address of high pages according to their color.  Such
      architectures may enforce similar coloring of low- and high-memory page
      mappings and reuse existing cache management functions to support
      highmem.
      
      This code is based on the implementation of similar feature for MIPS by
      Leonid Yegoshin.
      Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com>
      Cc: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Marc Gauthier <marc@cadence.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Steven Hill <Steven.Hill@imgtec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15de36a4
    • P
      mmu_notifier: add call_srcu and sync function for listener to delay call and sync · b972216e
      Peter Zijlstra 提交于
      When kernel device drivers or subsystems want to bind their lifespan to
      t= he lifespan of the mm_struct, they usually use one of the following
      methods:
      
      1. Manually calling a function in the interested kernel module.  The
         funct= ion call needs to be placed in mmput.  This method was rejected
         by several ker= nel maintainers.
      
      2. Registering to the mmu notifier release mechanism.
      
      The problem with the latter approach is that the mmu_notifier_release
      cal= lback is called from__mmu_notifier_release (called from exit_mmap).
      That functi= on iterates over the list of mmu notifiers and don't expect
      the release call= back function to remove itself from the list.
      Therefore, the callback function= in the kernel module can't release the
      mmu_notifier_object, which is actuall= y the kernel module's object
      itself.  As a result, the destruction of the kernel module's object must
      to be done in a delayed fashion.
      
      This patch adds support for this delayed callback, by adding a new
      mmu_notifier_call_srcu function that receives a function ptr and calls
      th= at function with call_srcu.  In that function, the kernel module
      releases its object.  To use mmu_notifier_call_srcu, the calling module
      needs to call b= efore that a new function called
      mmu_notifier_unregister_no_release that as its= name implies,
      unregisters a notifier without calling its notifier release call= back.
      
      This patch also adds a function that will call barrier_srcu so those
      kern= el modules can sync with mmu_notifier.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NOded Gabbay <oded.gabbay@amd.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b972216e
    • J
      mm: memcontrol: clean up reclaim size variable use in try_charge() · 61e02c74
      Johannes Weiner 提交于
      Charge reclaim and OOM currently use the charge batch variable, but
      batching is already disabled at that point.  To simplify the charge
      logic, the batch variable is reset to the original request size when
      reclaim is entered, so it's functionally equal, but it's misleading.
      
      Switch reclaim/OOM to nr_pages, which is the original request size.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61e02c74
    • R
      mm: change confusing #ifdef use in __access_remote_vm · dbffcd03
      Rik van Riel 提交于
      This patch changes confusing #ifdef use in __access_remote_vm into
      merely ugly #ifdef use.
      
      Addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=81651Signed-off-by: NRik van Riel <riel@redhat.com>
      Reported-by: NDavid Binderman <dcb314@hotmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dbffcd03
    • K
      mm: mark fault_around_bytes __read_mostly · 3a91053a
      Kirill A. Shutemov 提交于
      fault_around_bytes can only be changed via debugfs.  Let's mark it
      read-mostly.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Suggested-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a91053a
    • K
      mm: close race between do_fault_around() and fault_around_bytes_set() · aecd6f44
      Kirill A. Shutemov 提交于
      Things can go wrong if fault_around_bytes will be changed under
      do_fault_around(): between fault_around_mask() and fault_around_pages().
      
      Let's read fault_around_bytes only once during do_fault_around() and
      calculate mask based on the reading.
      
      Note: fault_around_bytes can only be updated via debug interface.  Also
      I've tried but was not able to trigger a bad behaviour without the
      patch.  So I would not consider this patch as urgent.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aecd6f44
    • J
      memcg, vmscan: Fix forced scan of anonymous pages · 2ab051e1
      Jerome Marchand 提交于
      When memory cgoups are enabled, the code that decides to force to scan
      anonymous pages in get_scan_count() compares global values (free,
      high_watermark) to a value that is restricted to a memory cgroup (file).
      It make the code over-eager to force anon scan.
      
      For instance, it will force anon scan when scanning a memcg that is
      mainly populated by anonymous page, even when there is plenty of file
      pages to get rid of in others memcgs, even when swappiness == 0.  It
      breaks user's expectation about swappiness and hurts performance.
      
      This patch makes sure that forced anon scan only happens when there not
      enough file pages for the all zone, not just in one random memcg.
      
      [hannes@cmpxchg.org: cleanups]
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ab051e1
    • J
      mm, vmscan: fix an outdated comment still mentioning get_scan_ratio · 7c0db9e9
      Jerome Marchand 提交于
      Quite a while ago, get_scan_ratio() has been renamed get_scan_count(),
      however a comment in shrink_active_list() still mention it.  This patch
      fixes the outdated comment.
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c0db9e9
    • D
      mm, oom: remove unnecessary exit_state check · fb794bcb
      David Rientjes 提交于
      The oom killer scans each process and determines whether it is eligible
      for oom kill or whether the oom killer should abort because of
      concurrent memory freeing.  It will abort when an eligible process is
      found to have TIF_MEMDIE set, meaning it has already been oom killed and
      we're waiting for it to exit.
      
      Processes with task->mm == NULL should not be considered because they
      are either kthreads or have already detached their memory and killing
      them would not lead to memory freeing.  That memory is only freed after
      exit_mm() has returned, however, and not when task->mm is first set to
      NULL.
      
      Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
      is no longer considered for oom kill, but only until exit_mm() has
      returned.  This was fragile in the past because it relied on
      exit_notify() to be reached before no longer considering TIF_MEMDIE
      processes.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb794bcb
    • L
      mm: fix potential infinite loop in dissolve_free_huge_pages() · d0177639
      Li Zhong 提交于
      It is possible for some platforms, such as powerpc to set HPAGE_SHIFT to
      0 to indicate huge pages not supported.
      
      When this is the case, hugetlbfs could be disabled during boot time:
      hugetlbfs: disabling because there are no supported hugepage sizes
      
      Then in dissolve_free_huge_pages(), order is kept maximum (64 for
      64bits), and the for loop below won't end: for (pfn = start_pfn; pfn <
      end_pfn; pfn += 1 << order)
      
      As suggested by Naoya, below fix checks hugepages_supported() before
      calling dissolve_free_huge_pages().
      
      [rientjes@google.com: no legitimate reason to call dissolve_free_huge_pages() when !hugepages_supported()]
      Signed-off-by: NLi Zhong <zhong@linux.vnet.ibm.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0177639
    • D
      mm, thp: restructure thp avoidance of light synchronous migration · 8fe78048
      David Rientjes 提交于
      __GFP_NO_KSWAPD, once the way to determine if an allocation was for thp
      or not, has gained more users.  Their use is not necessarily wrong, they
      are trying to do a memory allocation that can easily fail without
      disturbing kswapd, so the bit has gained additional usecases.
      
      This restructures the check to determine whether MIGRATE_SYNC_LIGHT
      should be used for memory compaction in the page allocator.  Rather than
      testing solely for __GFP_NO_KSWAPD, test for all bits that must be set
      for thp allocations.
      
      This also moves the check to be done only after the page allocator is
      aborted for deferred or contended memory compaction since setting
      migration_mode for this case is pointless.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8fe78048
    • D
      mm, oom: rename zonelist locking functions · e972a070
      David Rientjes 提交于
      try_set_zonelist_oom() and clear_zonelist_oom() are not named properly
      to imply that they require locking semantics to avoid out_of_memory()
      being reordered.
      
      zone_scan_lock is required for both functions to ensure that there is
      proper locking synchronization.
      
      Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename
      clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper
      locking semantics.
      
      At the same time, convert oom_zonelist_trylock() to return bool instead
      of int since only success and failure are tested.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e972a070
    • D
      mm, oom: ensure memoryless node zonelist always includes zones · 8d060bf4
      David Rientjes 提交于
      With memoryless node support being worked on, it's possible that for
      optimizations that a node may not have a non-NULL zonelist.  When
      CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist
      for first_online_node may become NULL.
      
      The oom killer requires a zonelist that includes all memory zones for
      the sysrq trigger and pagefault out of memory handler.
      
      Ensure that a non-NULL zonelist is always passed to the oom killer.
      
      [akpm@linux-foundation.org: fix non-numa build]
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d060bf4
    • W
      memory-hotplug: add zone_for_memory() for selecting zone for new memory · 63264400
      Wang Nan 提交于
      This series of patches fixes a problem when adding memory in bad manner.
      For example: for a x86_64 machine booted with "mem=400M" and with 2GiB
      memory installed, following commands cause problem:
      
        # echo 0x40000000 > /sys/devices/system/memory/probe
       [   28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff]
        # echo 0x48000000 > /sys/devices/system/memory/probe
       [   28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff]
        # echo online_movable > /sys/devices/system/memory/memory9/state
        # echo 0x50000000 > /sys/devices/system/memory/probe
       [   29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff]
        # echo 0x58000000 > /sys/devices/system/memory/probe
       [   29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff]
        # echo online_movable > /sys/devices/system/memory/memory11/state
        # echo online> /sys/devices/system/memory/memory8/state
        # echo online> /sys/devices/system/memory/memory10/state
        # echo offline> /sys/devices/system/memory/memory9/state
       [   30.558819] Offlined Pages 32768
        # free
                    total       used       free     shared    buffers     cached
       Mem:        780588 18014398509432020     830552          0          0      51180
       -/+ buffers/cache: 18014398509380840     881732
       Swap:            0          0          0
      
      This is because the above commands probe higher memory after online a
      section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
      for systems without ZONE_HIGHMEM) overlaps ZONE_MOVABLE.
      
      After the second online_movable, the problem can be observed from
      zoneinfo:
      
        # cat /proc/zoneinfo
        ...
        Node 0, zone  Movable
          pages free     65491
                min      250
                low      312
                high     375
                scanned  0
                spanned  18446744073709518848
                present  65536
                managed  65536
        ...
      
      This series of patches solve the problem by checking ZONE_MOVABLE when
      choosing zone for new memory.  If new memory is inside or higher than
      ZONE_MOVABLE, makes it go there instead.
      
      After applying this series of patches, following are free and zoneinfo
      result (after offlining memory9):
      
        bash-4.2# free
                      total       used       free     shared    buffers     cached
         Mem:        780956      80112     700844          0          0      51180
         -/+ buffers/cache:      28932     752024
         Swap:            0          0          0
      
        bash-4.2# cat /proc/zoneinfo
      
        Node 0, zone      DMA
          pages free     3389
                min      14
                low      17
                high     21
                scanned  0
                spanned  4095
                present  3998
                managed  3977
            nr_free_pages 3389
        ...
          start_pfn:         1
          inactive_ratio:    1
        Node 0, zone    DMA32
          pages free     73724
                min      341
                low      426
                high     511
                scanned  0
                spanned  98304
                present  98304
                managed  92958
            nr_free_pages 73724
          ...
          start_pfn:         4096
          inactive_ratio:    1
        Node 0, zone   Normal
          pages free     32630
                min      120
                low      150
                high     180
                scanned  0
                spanned  32768
                present  32768
                managed  32768
            nr_free_pages 32630
        ...
          start_pfn:         262144
          inactive_ratio:    1
        Node 0, zone  Movable
          pages free     65476
                min      241
                low      301
                high     361
                scanned  0
                spanned  98304
                present  65536
                managed  65536
            nr_free_pages 65476
        ...
          start_pfn:         294912
          inactive_ratio:    1
      
      This patch (of 7):
      
      Introduce zone_for_memory() in arch independent code for
      arch_add_memory() use.
      
      Many arch_add_memory() function simply selects ZONE_HIGHMEM or
      ZONE_NORMAL and add new memory into it.  However, with the existance of
      ZONE_MOVABLE, the selection method should be carefully considered: if
      new, higher memory is added after ZONE_MOVABLE is setup, the default
      zone and ZONE_MOVABLE may overlap each other.
      
      should_add_memory_movable() checks the status of ZONE_MOVABLE.  If it
      has already contain memory, compare the address of new memory and
      movable memory.  If new memory is higher than movable, it should be
      added into ZONE_MOVABLE instead of default zone.
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63264400