1. 23 Feb, 2016 7 commits
  2. 13 Feb, 2016 1 commit
    • J
      cgroup: provide cgroup_no_v1= to disable controllers in v1 mounts · 223ffb29
      Committed by Johannes Weiner
      Testing cgroup2 can be painful with system software automatically
      mounting and populating all cgroup controllers in v1 mode. Sometimes
      they can be unmounted from rc.local, sometimes even that is too late.
      
      Provide a commandline option to disable certain controllers in v1
      mounts, so that they remain available for cgroup2 mounts.
      
      Example use:
      
      cgroup_no_v1=memory,cpu
      cgroup_no_v1=all
      
      Disabling will be confirmed at boot-time as such:
      
      [    0.013770] Disabling cpu control group subsystem in v1 mounts
      [    0.016004] Disabling memory control group subsystem in v1 mounts
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
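
The option value described above is a comma-separated list of controller names, or "all". A minimal sketch of how such a value could be turned into a bitmask is shown here; the controller table, `parse_no_v1()` and `ALL_MASK` are hypothetical names chosen for illustration and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical controller table; the real kernel list is longer. */
static const char *const controllers[] = { "cpu", "memory", "io", "pids" };
#define NR_CONTROLLERS (sizeof(controllers) / sizeof(controllers[0]))
#define ALL_MASK ((1u << NR_CONTROLLERS) - 1)

/*
 * Parse a value such as "memory,cpu" or "all" into a bitmask with one
 * bit per entry of controllers[]. Unknown or empty names are ignored.
 */
static uint32_t parse_no_v1(const char *arg)
{
    uint32_t mask = 0;

    while (*arg) {
        size_t len = strcspn(arg, ",");   /* length of the current token */

        if (len == 3 && strncmp(arg, "all", 3) == 0)
            return ALL_MASK;
        for (size_t i = 0; i < NR_CONTROLLERS; i++)
            if (strlen(controllers[i]) == len &&
                strncmp(arg, controllers[i], len) == 0)
                mask |= 1u << i;

        arg += len;
        if (*arg == ',')
            arg++;                        /* skip the separator */
    }
    return mask;
}
```

With this table, `parse_no_v1("memory,cpu")` sets the bits for `cpu` and `memory`, and `parse_no_v1("all")` sets every bit.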
  3. 31 Jan, 2016 1 commit
  4. 22 Jan, 2016 3 commits
    • T
      cgroup: make sure a parent css isn't freed before its children · 8bb5ef79
      Committed by Tejun Heo
      There are three subsystem callbacks in css shutdown path -
      css_offline(), css_released() and css_free().  Except for
      css_released(), cgroup core didn't guarantee the order of invocation.
      css_offline() or css_free() could be called on a parent css before its
      children.  This behavior is unexpected and led to bugs in cpu and
      memory controller.
      
      The previous patch updated ordering for css_offline() which fixes the
      cpu controller issue.  While there currently isn't a known bug caused
      by misordering of css_free() invocations, let's fix it too for
      consistency.
      
      css_free() ordering can be trivially fixed by moving the put of the
      parent css below the css_free() invocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • T
      cgroup: make sure a parent css isn't offlined before its children · aa226ff4
      Committed by Tejun Heo
      There are three subsystem callbacks in css shutdown path -
      css_offline(), css_released() and css_free().  Except for
      css_released(), cgroup core didn't guarantee the order of invocation.
      css_offline() or css_free() could be called on a parent css before its
      children.  This behavior is unexpected and led to bugs in cpu and
      memory controller.
      
      This patch updates offline path so that a parent css is never offlined
      before its children.  Each css keeps online_cnt, which reaches zero iff
      it and all its children are offline, and offline_css() is invoked
      only after online_cnt reaches zero.
      
      This fixes the memory controller bug and allows the fix for cpu
      controller.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com>
      Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
      Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
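
The online_cnt rule described in this commit can be illustrated with a small counting sketch. It assumes a much-reduced `struct node` in place of a css; `node_online()` and `node_put_online()` are hypothetical names, and the real code uses atomic counters and deferred work rather than plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for a css. */
struct node {
    struct node *parent;
    int online_cnt;     /* 1 for the node itself plus 1 per online child */
    bool offlined;      /* set once the offline callback has run */
    int offline_seq;    /* order in which the offline callback ran */
};

static int seq;

/* Bring a node online and account for it in its parent. */
static void node_online(struct node *n, struct node *parent)
{
    n->parent = parent;
    n->online_cnt = 1;
    n->offlined = false;
    n->offline_seq = 0;
    if (parent)
        parent->online_cnt++;
}

/*
 * Drop one count. The offline callback runs only when the count hits
 * zero, and the drop then propagates to the parent, so a parent can
 * never be offlined before all of its children.
 */
static void node_put_online(struct node *n)
{
    while (n && --n->online_cnt == 0) {
        n->offlined = true;
        n->offline_seq = ++seq;
        n = n->parent;
    }
}
```

Killing a parent first only drops its own count; its offline callback is deferred until the last child has gone offline.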
    • T
      cpuset: make mm migration asynchronous · e93ad19d
      Committed by Tejun Heo
      If "cpuset.memory_migrate" is set, when a process is moved from one
      cpuset to another with a different memory node mask, pages in use by
      the process are migrated to the new set of nodes.  This was performed
      synchronously in the ->attach() callback, which is synchronized
      against process management.  Recently, the synchronization was changed
      from per-process rwsem to global percpu rwsem for simplicity and
      optimization.
      
      Combined with the synchronous mm migration, this led to deadlocks
      because mm migration could schedule a work item which may in turn try
      to create a new worker blocking on the process management lock held
      from cgroup process migration path.
      
      This heavy an operation shouldn't be performed synchronously from that
      deep inside cgroup migration in the first place.  This patch punts the
      actual migration to an ordered workqueue and updates cgroup process
      migration and cpuset config update paths to flush the workqueue after
      all locks are released.  This way, the operations still seem
      synchronous to userland without entangling mm migration with process
      management synchronization.  CPU hotplug can also invoke mm migration
      but there's no reason for it to wait for mm migrations and thus
      doesn't synchronize against their completions.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: stable@vger.kernel.org # v4.4+
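
The pattern described here, queueing the heavy work while the lock is held and flushing only after it is released, can be sketched in a single-threaded simulation. Everything below is hypothetical: a fixed-size FIFO stands in for the ordered workqueue, a boolean stands in for the process management lock, and the flush runs the items itself instead of waiting for a worker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORK 16

typedef void (*work_fn)(int arg);

/* Hypothetical FIFO standing in for an ordered workqueue. */
static struct { work_fn fn; int arg; } queue[MAX_WORK];
static size_t q_head, q_tail;

static bool lock_held;          /* stands in for the process management lock */
static int log_buf[MAX_WORK];   /* records execution order for inspection */
static size_t log_len;

/* The heavy operation: it must never run while the lock is held. */
static void migrate_pages_work(int arg)
{
    assert(!lock_held);
    log_buf[log_len++] = arg;
}

/* Queue one item; returns false when the fixed-size queue is full. */
static bool queue_work(work_fn fn, int arg)
{
    if (q_tail == MAX_WORK)
        return false;
    queue[q_tail].fn = fn;
    queue[q_tail].arg = arg;
    q_tail++;
    return true;
}

/* Run everything queued so far, in submission order. */
static void flush_queue(void)
{
    while (q_head < q_tail) {
        queue[q_head].fn(queue[q_head].arg);
        q_head++;
    }
}

/* Attach path: only queue under the lock, flush after releasing it. */
static void attach_tasks(const int *ids, size_t n)
{
    lock_held = true;
    for (size_t i = 0; i < n; i++)
        queue_work(migrate_pages_work, ids[i]);
    lock_held = false;
    flush_queue();   /* caller still observes completed work on return */
}
```

Because `attach_tasks()` returns only after the flush, the caller sees the same "synchronous" result, while the work itself never executes under the lock.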
  5. 21 Jan, 2016 14 commits
  6. 19 Jan, 2016 1 commit
  7. 17 Jan, 2016 4 commits
  8. 16 Jan, 2016 9 commits
    • D
      mm: bring in additional flag for fixup_user_fault to signal unlock · 4a9e1cda
      Committed by Dominik Dingel
      During Jason's work with postcopy migration support for s390 a problem
      regarding gmap faults was discovered.
      
      The gmap code calls fixup_user_fault, which always ends up in
      handle_mm_fault.  Until now we never cared about retries, but because
      the userfaultfd code relies on them, this needs a fix.
      
      This patchset does not take care of the futex code.  I will now look
      closer at this.
      
      This patch (of 2):
      
      With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
      to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
      faulting we ever unlocked mmap_sem.
      
      This patch adds the logic to handle retries and cleans up the current
      documentation.  fixup_user_fault did not have the same semantics as
      filemap_fault: it never indicated whether a retry happened, so a caller
      could not handle that case.  The behaviour is now changed to always
      retry with mmap_sem held.
      Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
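
The retry feedback described in this commit can be sketched as a function that is entered and left with the lock held and reports through an out-parameter whether the lock was dropped in between. This is a simplified model: `handle_fault()`, `fixup_fault()`, `sem_held` and `retries_left` are hypothetical stand-ins, not the kernel's functions or state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fault_result { FAULT_OK, FAULT_RETRY };

static bool sem_held;       /* stands in for mmap_sem */
static int retries_left;    /* how often the handler will ask for a retry */

/* Hypothetical handler: when it asks for a retry, it has dropped the lock. */
static enum fault_result handle_fault(bool allow_retry)
{
    assert(sem_held);
    if (allow_retry && retries_left > 0) {
        retries_left--;
        sem_held = false;
        return FAULT_RETRY;
    }
    return FAULT_OK;
}

/*
 * Called with the lock held; always returns with it held. Retries are
 * allowed only when 'unlocked' is non-NULL, and *unlocked then reports
 * whether the lock was dropped at any point during the fault.
 */
static int fixup_fault(bool *unlocked)
{
    bool allow_retry = (unlocked != NULL);

    if (unlocked)
        *unlocked = false;

    while (handle_fault(allow_retry) == FAULT_RETRY) {
        sem_held = true;        /* retake the lock before trying again */
        *unlocked = true;       /* non-NULL here: retry was allowed */
        allow_retry = false;    /* the second attempt must not drop it */
    }
    return 0;
}
```

A caller that cached anything derived from state protected by the lock can check `*unlocked` and revalidate it when the flag is set.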
    • D
      mm, x86: get_user_pages() for dax mappings · 3565fce3
      Committed by Dan Williams
      A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
      has established a devm_memremap_pages() mapping, i.e. when the pfn_t
      returned from ->direct_access() has PFN_DEV and PFN_MAP set.  Later, when
      encountering _PAGE_DEVMAP during a page table walk we look up and pin a
      struct dev_pagemap instance to keep the result of pfn_to_page() valid
      until put_page().
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup · 5c2c2587
      Committed by Dan Williams
      get_dev_pagemap() enables paths like get_user_pages() to pin a dynamically
      mapped pfn-range (devm_memremap_pages()) while the resulting struct page
      objects are in use.  Unlike get_page() it may fail if the device is, or
      is in the process of being, disabled.  While the initial lookup of the
      range may be an expensive list walk, the result is cached to speed up
      subsequent lookups which are likely to be in the same mapped range.
      
      devm_memremap_pages() now requires a reference counter to be specified
      at init time.  For pmem this means moving request_queue allocation into
      pmem_alloc() so the existing queue usage counter can track "device
      pages".
      
      ZONE_DEVICE pages always have an elevated count and will never be on an
      lru reclaim list.  That space in 'struct page' can be redirected for
      other uses, but for safety introduce a poison value that will always
      trip __list_add() to assert.  This allows half of the struct list_head
      storage to be reclaimed with some assurance to back up the assumption
      that the page count never goes to zero and a list_add() is never
      attempted.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
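
The behaviour described in the first paragraph, a pin that can fail while the device is going away plus a cached previous result to avoid the list walk, might look roughly like the following. This is a simplified sketch under stated assumptions: `struct pagemap`, `get_pagemap()` and `put_pagemap()` are hypothetical, use a plain integer in place of the reference counter, and do not reproduce the kernel's exact reference handling for the cached entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one devm_memremap_pages() mapping. */
struct pagemap {
    unsigned long start_pfn, end_pfn;   /* inclusive pfn range */
    int refs;                           /* pins currently held by users */
    bool dying;                         /* device is being disabled */
};

static struct pagemap maps[4];
static size_t nr_maps;

static bool in_range(const struct pagemap *m, unsigned long pfn)
{
    return pfn >= m->start_pfn && pfn <= m->end_pfn;
}

/*
 * Pin the mapping that covers 'pfn'. 'cached' is the result of an
 * earlier call and is checked first, so the walk over maps[] is skipped
 * when consecutive lookups hit the same range. Returns NULL when no
 * mapping covers the pfn or when the device is going away.
 */
static struct pagemap *get_pagemap(unsigned long pfn, struct pagemap *cached)
{
    struct pagemap *m = NULL;

    if (cached && in_range(cached, pfn)) {
        m = cached;
    } else {
        for (size_t i = 0; i < nr_maps; i++) {
            if (in_range(&maps[i], pfn)) {
                m = &maps[i];
                break;
            }
        }
    }
    if (!m || m->dying)
        return NULL;    /* unlike get_page(), the pin can fail */
    m->refs++;
    return m;
}

/* Release one pin taken by get_pagemap(). */
static void put_pagemap(struct pagemap *m)
{
    if (m) {
        assert(m->refs > 0);
        m->refs--;
    }
}
```

A teardown path in this model would set `dying` and then wait for `refs` to reach zero before releasing the range.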
    • D
      x86, mm: introduce vmem_altmap to augment vmemmap_populate() · 4b94ffdc
      Committed by Dan Williams
      In support of providing struct page for large persistent memory
      capacities, use struct vmem_altmap to change the default policy for
      allocating memory for the memmap array.  The default vmemmap_populate()
      allocates page table storage area from the page allocator.  Given
      persistent memory capacities relative to DRAM it may not be feasible to
      store the memmap in 'System Memory'.  Instead vmem_altmap represents
      pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
      requests.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: kbuild test robot <lkp@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
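
The idea of satisfying memmap allocations from pages reserved inside the device range, rather than from the page allocator, can be sketched as a simple bump allocator. The structure and function below are hypothetical and ignore alignment; they only illustrate the bookkeeping, not the kernel's actual struct vmem_altmap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced bookkeeping for pre-allocated device pages. */
struct altmap {
    unsigned long base_pfn;  /* first pfn of the reserved device pages */
    unsigned long free;      /* number of pages set aside for the memmap */
    unsigned long alloc;     /* number of pages handed out so far */
};

/*
 * Hand out 'nr' contiguous pages from the reserved area. On success the
 * first pfn of the run is stored in *pfn and 0 is returned; -1 means
 * the reserved area is exhausted.
 */
static int altmap_alloc(struct altmap *a, unsigned long nr, unsigned long *pfn)
{
    if (nr > a->free - a->alloc)
        return -1;
    *pfn = a->base_pfn + a->alloc;
    a->alloc += nr;
    return 0;
}
```

The driver decides up front how many pages to reserve, so the memmap for the device range never consumes regular system memory in this model.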
    • D
      mm: introduce find_dev_pagemap() · 9476df7d
      Committed by Dan Williams
      There are several scenarios where we need to retrieve and update
      metadata associated with a given devm_memremap_pages() mapping, and the
      only lookup key available is a pfn in the range:
      
      1/ We want to augment vmemmap_populate() (called via arch_add_memory())
         to allocate memmap storage from pre-allocated pages reserved by the
         device driver.  At vmemmap_alloc_block_buf() time it grabs device pages
         rather than page allocator pages.  This is in support of
         devm_memremap_pages() mappings where the memmap is too large to fit in
         main memory (i.e. large persistent memory devices).
      
      2/ Taking a reference against the mapping when inserting device pages
         into the address_space radix of a given inode.  This facilitates
         unmap_mapping_range() and truncate_inode_pages() operations when the
         driver is tearing down the mapping.
      
      3/ get_user_pages() operations on ZONE_DEVICE memory require taking a
         reference against the mapping so that the driver teardown path can
         revoke and drain usage of device pages.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm, dax, pmem: introduce pfn_t · 34c0fd54
      Committed by Dan Williams
      For the purpose of communicating the optional presence of a 'struct
      page' for the pfn returned from ->direct_access(), introduce a type that
      encapsulates a page-frame-number plus flags.  These flags contain the
      historical "page_link" encoding for a scatterlist entry, but can also
      denote "device memory", where "device memory" is a set of pfns that are
      not part of the kernel's linear mapping by default, but are accessed via
      the same memory controller as ram.
      
      The motivation for this new type is large capacity persistent memory
      that needs struct page entries in the 'memmap' to support 3rd party DMA
      (i.e.  O_DIRECT I/O with a persistent memory source/target).  However,
      we also need it in support of maintaining a list of mapped inodes which
      need to be unmapped at driver teardown or freeze_bdev() time.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
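
A type that carries a page frame number together with flags can be sketched as a 64-bit value with the flags packed into the top bits. The bit positions, the width of the flag field and the constructor name `make_pfn_t()` are assumptions made for this illustration; only the idea of "pfn plus flags" and the PFN_DEV / PFN_MAP flag names come from the commit messages above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrapping the value in a struct keeps it from mixing with raw pfns. */
typedef struct { uint64_t val; } pfn_t;

/* Assumed layout: the top four bits hold flags, the rest the pfn. */
#define PFN_FLAG_BITS  4
#define PFN_FLAGS_MASK (~UINT64_C(0) << (64 - PFN_FLAG_BITS))
#define PFN_DEV        (UINT64_C(1) << 61)  /* pfn is device memory */
#define PFN_MAP        (UINT64_C(1) << 60)  /* a struct page exists for it */

/* Hypothetical constructor combining a raw pfn with flags. */
static pfn_t make_pfn_t(uint64_t pfn, uint64_t flags)
{
    pfn_t p = { (pfn & ~PFN_FLAGS_MASK) | (flags & PFN_FLAGS_MASK) };
    return p;
}

/* Strip the flags and return the plain page frame number. */
static uint64_t pfn_t_to_pfn(pfn_t p)
{
    return p.val & ~PFN_FLAGS_MASK;
}

/* Ordinary RAM always has a struct page; device memory only if mapped. */
static bool pfn_t_has_page(pfn_t p)
{
    return (p.val & PFN_MAP) == PFN_MAP || (p.val & PFN_DEV) == 0;
}
```

Callers that receive such a value can test for the presence of a struct page before attempting a pfn-to-page conversion.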
    • K
      futex, thp: remove special case for THP in get_futex_key · 14d27abd
      Committed by Kirill A. Shutemov
      With the new THP refcounting, we don't need tricks to stabilize a huge
      page.  If we hold a reference to a tail page, it can't be split under us.
      
      This patch effectively reverts a5b338f2 ("thp: update futex compound
      knowledge").
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: Artem Savkov <artem.savkov@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: adjust to support new THP refcounting · f627c2f5
      Committed by Kirill A. Shutemov
      As with rmap, with the new refcounting we cannot rely on PageTransHuge()
      to check whether we need to charge the size of a huge page from the
      cgroup.  We need information from the caller to know whether it was
      mapped with a PMD or a PTE.
      
      We uncharge when the last reference on the page is gone.  At that point,
      if we see PageTransHuge(), it means we need to uncharge the whole huge
      page.
      
      The tricky part is partial unmap -- when we try to unmap part of a huge
      page.  We don't do any special handling of this situation, meaning we
      don't uncharge the part of the huge page unless the last user is gone or
      split_huge_page() is triggered.  If cgroup memory pressure occurs, the
      partially unmapped page will be split through the shrinker.  This should
      be good enough.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      rmap: add argument to charge compound page · d281ee61
      Committed by Kirill A. Shutemov
      We're going to allow mapping of individual 4k pages of a THP compound
      page.  This means we cannot rely on a PageTransHuge() check to decide
      whether to map/unmap a small page or a THP.
      
      The patch adds a new argument to the rmap functions to indicate whether
      we want to operate on the whole compound page or only the small page.
      
      [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
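
Why the caller has to say whether it is mapping the whole compound page or a single small page can be seen in a toy accounting model. The structure and the functions `rmap_add()`, `rmap_remove()` and `subpage_mapcount()` are hypothetical and greatly simplified; the value 512 assumes 2 MiB huge pages built from 4 KiB base pages.

```c
#include <assert.h>
#include <stdbool.h>

#define SUBPAGES 512    /* assumed: 2 MiB huge page / 4 KiB base pages */

/* Hypothetical, reduced model of a THP compound page. */
struct huge_page {
    int compound_mapcount;      /* mappings of the whole page (PMD) */
    int mapcount[SUBPAGES];     /* mappings of each small page (PTE) */
};

/* The 'compound' argument selects which counter the mapping affects. */
static void rmap_add(struct huge_page *hp, int idx, bool compound)
{
    if (compound) {
        hp->compound_mapcount++;
    } else {
        assert(idx >= 0 && idx < SUBPAGES);
        hp->mapcount[idx]++;
    }
}

static void rmap_remove(struct huge_page *hp, int idx, bool compound)
{
    if (compound) {
        assert(hp->compound_mapcount > 0);
        hp->compound_mapcount--;
    } else {
        assert(idx >= 0 && idx < SUBPAGES);
        assert(hp->mapcount[idx] > 0);
        hp->mapcount[idx]--;
    }
}

/* Number of mappings that cover small page 'idx'. */
static int subpage_mapcount(const struct huge_page *hp, int idx)
{
    return hp->compound_mapcount + hp->mapcount[idx];
}
```

The page itself looks the same in both cases, so a check on the page alone cannot tell a PMD mapping from a PTE mapping; only the caller knows which one it is installing or removing.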