1. 19 Feb 2016 (4 commits)
    • mm: slab: free kmem_cache_node after destroy sysfs file · 52b4b950
      Committed by Dmitry Safonov
      When slub_debug alloc_calls_show is enabled we try to track the
      location and user of each slab object on every online node; the
      kmem_cache_node structures and cpu_cache/cpu_slub must not be freed
      until the last reference to the sysfs file has been dropped.
      
      This fixes the following panic:
      
         BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
         IP:  list_locations+0x169/0x4e0
         PGD 257304067 PUD 438456067 PMD 0
         Oops: 0000 [#1] SMP
         CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30
         Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
         task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000
         RIP: list_locations+0x169/0x4e0
         Call Trace:
           alloc_calls_show+0x1d/0x30
           slab_attr_show+0x1b/0x30
           sysfs_read_file+0x9a/0x1a0
           vfs_read+0x9c/0x170
           SyS_read+0x58/0xb0
           system_call_fastpath+0x16/0x1b
         Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 <48> 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10
         CR2: 0000000000000020
      
      Separate __kmem_cache_release from __kmem_cache_shutdown;
      __kmem_cache_release is now called from slab_kmem_cache_release (after
      the last reference to the sysfs file object has been dropped).

      Reintroduce locking in free_partial, as the sysfs file might access the
      cache's partial list after shutdown; this is a partial revert of commit
      69cb8e6b ("slub: free slabs without holding locks").  Zap
      __remove_partial and use remove_partial (w/o underscores), as
      free_partial now takes list_lock; this is a partial revert of commit
      1e4dd946 ("slub: do not assert not having lock in removing freed
      partial").
      Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
      Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb.c: fix incorrect proc nr_hugepages value · f8b74815
      Committed by Vaishali Thakkar
      Currently an incorrect default hugepage pool size is reported by
      /proc/sys/vm/nr_hugepages when the number of pages for the default huge
      page size is specified twice.

      When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
      indicates the current number of pre-allocated huge pages of the default
      size.  Basically /proc/sys/vm/nr_hugepages displays
      default_hstate->max_huge_pages and, after boot-time pre-allocation,
      max_huge_pages should equal the number of pre-allocated pages
      (nr_hugepages).
      
      Test case:
      
      Note that this is specific to x86 architecture.
      
      Boot the kernel with command line option 'default_hugepagesz=1G
      hugepages=X hugepagesz=2M hugepages=Y hugepagesz=1G hugepages=Z'.  After
      boot, 'cat /proc/sys/vm/nr_hugepages' and 'sysctl -a | grep hugepages'
      returns the value X.  However, dmesg output shows that Z huge pages were
      pre-allocated.
      
      So, the root cause of the problem here is that the global variable
      default_hstate_max_huge_pages is set if a default huge page size is
      specified (directly or indirectly) on the command line.  After the command
      line processing in hugetlb_init, if default_hstate_max_huge_pages is set,
      the value is assigned to default_hstate.max_huge_pages.  However,
      default_hstate.max_huge_pages may have already been set based on the
      number of pre-allocated huge pages of default_hstate size.
      
      The solution is that if hstate->max_huge_pages is already set, it should
      not be overwritten by the global max_huge_pages value.  Basically, if the
      hugepages value is specified multiple times on the command line for a
      specific supported hugepagesize, the proc layer should report the last
      specified value.
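      A minimal sketch of that guard, assuming the check sits in hugetlb_init()
      after command-line parsing (the helper name here is made up for
      illustration):

      	/* Illustrative only: apply the saved "hugepages=" count for the
      	 * default size unless per-size parsing already populated it. */
      	static void __init apply_default_hugepage_count(void)
      	{
      		if (!default_hstate_max_huge_pages)
      			return;
      		if (!default_hstate.max_huge_pages)
      			default_hstate.max_huge_pages =
      				default_hstate_max_huge_pages;
      		/* otherwise keep the last value parsed for the default size */
      	}
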
      Signed-off-by: Vaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix regression in remap_file_pages() emulation · 48f7df32
      Committed by Kirill A. Shutemov
      Grazvydas Ignotas has reported a regression in remap_file_pages()
      emulation.
      
      Testcase:
      	#define _GNU_SOURCE
      	#include <assert.h>
      	#include <stdlib.h>
      	#include <stdio.h>
      	#include <sys/mman.h>
      
      	#define SIZE    (4096 * 3)
      
      	int main(int argc, char **argv)
      	{
      		unsigned long *p;
      		long i;
      
      		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      		if (p == MAP_FAILED) {
      			perror("mmap");
      			return -1;
      		}
      
      		for (i = 0; i < SIZE / 4096; i++)
      			p[i * 4096 / sizeof(*p)] = i;
      
      		if (remap_file_pages(p, 4096, 0, 1, 0)) {
      			perror("remap_file_pages");
      			return -1;
      		}
      
      		if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
      			perror("remap_file_pages");
      			return -1;
      		}
      
      		assert(p[0] == 1);
      
      		munmap(p, SIZE);
      
      		return 0;
      	}
      
      The second remap_file_pages() fails with -EINVAL.
      
      The reason is that the remap_file_pages() emulation assumes that the
      target VMA covers the whole area we want to over-map.  That assumption is
      broken by the first remap_file_pages() call: it splits the area into two
      VMAs.

      The solution is to check the next adjacent VMAs and see whether they map
      the same file with the same flags.
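      A hedged sketch of such a check (the helper name is hypothetical; it
      walks the VMA list via vm_next and elides the locking and pgoff checks
      the real emulation needs):

      	static bool covers_range(struct vm_area_struct *vma,
      				 unsigned long start, unsigned long size)
      	{
      		unsigned long end = start + size;
      		unsigned long covered = vma->vm_end;
      		struct vm_area_struct *next = vma->vm_next;

      		if (covered >= end)
      			return true;

      		while (next && next->vm_start == covered &&	/* adjacent */
      		       next->vm_file == vma->vm_file &&		/* same file */
      		       next->vm_flags == vma->vm_flags) {	/* same flags */
      			covered = next->vm_end;
      			if (covered >= end)
      				return true;
      			next = next->vm_next;
      		}
      		return false;
      	}
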
      
      Fixes: c8d78c18 ("mm: replace remap_file_pages() syscall with emulation")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Grazvydas Ignotas <notasas@gmail.com>
      Tested-by: Grazvydas Ignotas <notasas@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp, dax: do not try to withdraw pgtable from non-anon VMA · 69a8ec2d
      Committed by Kirill A. Shutemov
      DAX doesn't deposit page tables when it maps huge pages, so there is
      nothing to withdraw; trying to do so can lead to a crash.
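      A minimal sketch of the guard this implies in the huge-pmd teardown path
      (illustrative, not the verbatim patch):

      	/* Only anonymous VMAs deposited a page table alongside the huge
      	 * pmd; DAX (file-backed) mappings never did, so skip the withdraw. */
      	if (vma_is_anonymous(vma)) {
      		pgtable_t pgtable = pgtable_trans_huge_withdraw(mm, pmd);
      		pte_free(mm, pgtable);
      	}
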
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 12 Feb 2016 (5 commits)
  3. 06 Feb 2016 (12 commits)
  4. 04 Feb 2016 (9 commits)
  5. 01 Feb 2016 (1 commit)
  6. 27 Jan 2016 (1 commit)
  7. 25 Jan 2016 (1 commit)
    • vmstat: Remove BUG_ON from vmstat_update · 587198ba
      Committed by Christoph Lameter
      If we detect that there is nothing to do just set the flag and do not
      check if it was already set before.  Races really do not matter.  If the
      flag is set by any code then the shepherd will start dealing with the
      situation and reenable the vmstat workers when necessary again.
      
      Since commit 0eb77e98 ("vmstat: make vmstat_updater deferrable again
      and shut down on idle") quiet_vmstat might update cpu_stat_off and mark
      a particular cpu to be handled by vmstat_shepherd.  This might trigger a
      VM_BUG_ON in vmstat_update because the work item might have been
      sleeping during the idle period and see the cpu_stat_off updated after
      the wake up.  The VM_BUG_ON is therefore misleading and no longer
      appropriate.  Moreover, it doesn't really provide any protection against
      real bugs, because vmstat_shepherd will simply reschedule the vmstat_work
      any time it sees a particular cpu bit set, and vmstat_update would do the
      same from the worker context directly.  Even if the two raced, the result
      wouldn't be incorrect, as the counter update is fully idempotent.
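      In spirit, the change in the "nothing to update" branch of
      vmstat_update() looks like this (a hedged sketch, not the verbatim diff):

      	/* before: asserted that nobody else could have set the bit */
      	/* VM_BUG_ON(cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)); */

      	/* after: just set it; the operation is idempotent and races are
      	 * harmless because the shepherd rechecks the mask anyway */
      	cpumask_set_cpu(smp_processor_id(), cpu_stat_off);
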
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 23 Jan 2016 (6 commits)
    • tree wide: use kvfree() than conditional kfree()/vfree() · 1d5cfdb0
      Committed by Tetsuo Handa
      There are many locations that do
      
        if (memory_was_allocated_by_vmalloc)
          vfree(ptr);
        else
          kfree(ptr);
      
      but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
      using is_vmalloc_addr().  Unless callers have special reasons, we can
      replace this branch with kvfree().  Please check and reply if you find
      any problems.
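      For reference, kvfree() itself is essentially this dispatch (as
      implemented in mm/util.c):

      	void kvfree(const void *addr)
      	{
      		if (is_vmalloc_addr(addr))
      			vfree(addr);
      		else
      			kfree(addr);
      	}
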
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Jan Kara <jack@suse.com>
      Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
      Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Boris Petkov <bp@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: add support for fsync/sync · 9973c98e
      Committed by Ross Zwisler
      To properly handle fsync/msync in an efficient way DAX needs to track
      dirty pages so it is able to flush them durably to media on demand.
      
      The tracking of dirty pages is done via the radix tree in struct
      address_space.  This radix tree is already used by the page writeback
      infrastructure for tracking dirty pages associated with an open file,
      and it already has support for exceptional (non struct page*) entries.
      We build upon these features to add exceptional entries to the radix
      tree for DAX dirty PMD or PTE pages at fault time.
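      A hedged illustration of that mechanism (not the kernel's actual DAX
      code; the helper name and the encoding of the entry argument are
      placeholders): store a non-page "exceptional" entry for the faulted
      offset and tag it dirty so a later fsync can find it.

      	static int dax_track_dirty(struct address_space *mapping,
      				   pgoff_t index, void *entry)
      	{
      		struct radix_tree_root *root = &mapping->page_tree;
      		int err;

      		spin_lock_irq(&mapping->tree_lock);
      		err = radix_tree_insert(root, index, entry);
      		if (!err || err == -EEXIST) {
      			radix_tree_tag_set(root, index, PAGECACHE_TAG_DIRTY);
      			err = 0;
      		}
      		spin_unlock_irq(&mapping->tree_lock);
      		return err;
      	}
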
      
      [dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add find_get_entries_tag() · 7e7f7749
      Committed by Ross Zwisler
      Add find_get_entries_tag() to the family of functions that include
      find_get_entries(), find_get_pages() and find_get_pages_tag().  This is
      needed for DAX dirty page handling because we need a list of both page
      offsets and radix tree entries ('indices' and 'entries' in this
      function) that are marked with the PAGECACHE_TAG_TOWRITE tag.
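      A hedged usage sketch, assuming the signature mirrors find_get_entries()
      with an extra tag argument (the walker function and batch size are made
      up for illustration):

      	#define BATCH 16

      	static void walk_towrite_entries(struct address_space *mapping)
      	{
      		struct page *entries[BATCH];
      		pgoff_t indices[BATCH];
      		pgoff_t index = 0;
      		unsigned int i, n;

      		do {
      			n = find_get_entries_tag(mapping, index,
      					PAGECACHE_TAG_TOWRITE, BATCH,
      					entries, indices);
      			for (i = 0; i < n; i++) {
      				/* entries[i]: page pointer or radix tree
      				 * exceptional entry; indices[i]: its offset */
      				index = indices[i] + 1;
      			}
      		} while (n == BATCH);
      	}
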
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: support dirty DAX entries in radix tree · f9fe48be
      Committed by Ross Zwisler
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • make sure that freeing shmem fast symlinks is RCU-delayed · 3ed47db3
      Committed by Al Viro
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • wrappers for ->i_mutex access · 5955102c
      Committed by Al Viro
      Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}:
      inode_foo(inode) is mutex_foo(&inode->i_mutex).

      Please use these for access to ->i_mutex; over the coming cycle
      ->i_mutex will become an rwsem, with ->lookup() done with it held
      only shared.
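      Per that description, the wrappers are essentially thin static inlines
      in include/linux/fs.h, sketched here:

      	static inline void inode_lock(struct inode *inode)
      	{
      		mutex_lock(&inode->i_mutex);
      	}

      	static inline void inode_unlock(struct inode *inode)
      	{
      		mutex_unlock(&inode->i_mutex);
      	}

      	static inline int inode_trylock(struct inode *inode)
      	{
      		return mutex_trylock(&inode->i_mutex);
      	}

      	static inline int inode_is_locked(struct inode *inode)
      	{
      		return mutex_is_locked(&inode->i_mutex);
      	}

      	static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
      	{
      		mutex_lock_nested(&inode->i_mutex, subclass);
      	}
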
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  9. 22 Jan 2016 (1 commit)