1. 20 April, 2019 (1 commit)
    • mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES · dd862deb
      Committed by Hugh Dickins
      SWAP_UNUSE_MAX_TRIES 3 appeared to work well in earlier testing, but
      further testing has proved it to be a source of unnecessary swapoff
      EBUSY failures (which can then be followed by unmount EBUSY failures).
      
      When mmget_not_zero() or shmem's igrab() fails, there is an mm exiting
      or inode being evicted, freeing up swap independent of try_to_unuse().
      Those typically completed much sooner than the old quadratic swapoff,
      but now it's more common that swapoff may need to wait for them.
      
      It's possible to move those cases from init_mm.mmlist and shmem_swaplist
      to separate "exiting" swaplists, and try_to_unuse() then wait for those
      lists to be emptied; but we've not bothered with that in the past, and
      don't want to risk missing some other forgotten case.  So just revert to
      cycling around until the swap is gone, without any retries limit.
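
      A minimal sketch of the resulting shape of the end of try_to_unuse()
      (simplified and assumed, not verbatim kernel code): the unuse pass is
      simply repeated until no entries of this swap type remain in use,
      aborting only if a signal is pending.

        retry:
                /* ... unuse shmem swap entries, walk every mm on mmlist,
                 * then sweep the swap cache for orphaned entries ... */

                if (si->inuse_pages) {
                        if (!signal_pending(current))
                                goto retry;     /* no SWAP_UNUSE_MAX_TRIES limit any more */
                        retval = -EINTR;
                }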
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081256170.1523@eggly.anvils
      Fixes: b56a2d8a ("mm: rid swapoff of quadratic complexity")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kelley Nielsen <kelleynnn@gmail.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vineeth Pillai <vpillai@digitalocean.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 06 March, 2019 (4 commits)
    • mm/swapfile.c: use struct_size() in kvzalloc() · 96008744
      Committed by Gustavo A. R. Silva
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array.  For example:
      
        struct foo {
            int stuff;
            struct boo entry[];
        };
      
        size = sizeof(struct foo) + count * sizeof(struct boo);
        instance = kvzalloc(size, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
        instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      Notice that, in this case, variable size is not necessary, hence it is
      removed.
      
      This code was detected with the help of Coccinelle.
      
      Link: http://lkml.kernel.org/r/20190221154622.GA19599@embeddedor
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • numa: make "nr_node_ids" unsigned int · b9726c26
      Committed by Alexey Dobriyan
      Number of NUMA nodes can't be negative.
      
      This saves a few bytes on x86_64:
      
      	add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
      	Function                                     old     new   delta
      	hv_synic_alloc.cold                           88     110     +22
      	prealloc_shrinker                            260     262      +2
      	bootstrap                                    249     251      +2
      	sched_init_numa                             1566    1567      +1
      	show_slab_objects                            778     777      -1
      	s_show                                      1201    1200      -1
      	kmem_cache_init                              346     345      -1
      	__alloc_workqueue_key                       1146    1145      -1
      	mem_cgroup_css_alloc                        1614    1612      -2
      	__do_sys_swapon                             4702    4699      -3
      	__list_lru_init                              655     651      -4
      	nic_probe                                   2379    2374      -5
      	store_user_store                             118     111      -7
      	red_zone_store                               106      99      -7
      	poison_store                                 106      99      -7
      	wq_numa_init                                 348     338     -10
      	__kmem_cache_empty                            75      65     -10
      	task_numa_free                               186     173     -13
      	merge_across_nodes_store                     351     336     -15
      	irq_create_affinity_masks                   1261    1246     -15
      	do_numa_crng_init                            343     321     -22
      	task_numa_fault                             4760    4737     -23
      	swapfile_init                                179     156     -23
      	hv_synic_alloc                               536     492     -44
      	apply_wqattrs_prepare                        746     695     -51
      
      Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, swap: bounds check swap_info array accesses to avoid NULL derefs · c10d38cc
      Committed by Daniel Jordan
      Dan Carpenter reports a potential NULL dereference in
      get_swap_page_of_type:
      
        Smatch complains that the NULL checks on "si" aren't consistent.  This
        seems like a real bug because we have not ensured that the type is
        valid and so "si" can be NULL.
      
      Add the missing check for NULL, taking care to use a read barrier to
      ensure CPU1 observes CPU0's updates in the correct order:
      
           CPU0                           CPU1
           alloc_swap_info()              if (type >= nr_swapfiles)
             swap_info[type] = p              /* handle invalid entry */
             smp_wmb()                    smp_rmb()
             ++nr_swapfiles               p = swap_info[type]
      
      Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
      CPU0's write to swap_info[type] and read NULL from swap_info[type].
      
      Ying Huang noticed other places in swapfile.c don't order these reads
      properly.  Introduce swap_type_to_swap_info to encourage correct usage.
      
      Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
      (see tools/memory-model/Documentation/explanation.txt).
      
      This ordering need not be enforced in places where swap_lock is held
      (e.g.  si_swapinfo) because swap_lock serializes updates to nr_swapfiles
      and the swap_info array.
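
      A sketch of that helper and its writer side, following the ordering in
      the diagram above (simplified from the patch; details are illustrative,
      not verbatim):

        /* Reader: used wherever swap_lock is not held. */
        static struct swap_info_struct *swap_type_to_swap_info(int type)
        {
                if (type >= READ_ONCE(nr_swapfiles))
                        return NULL;

                smp_rmb();      /* Pairs with smp_wmb() in alloc_swap_info(). */
                return READ_ONCE(swap_info[type]);
        }

        /* Writer side in alloc_swap_info(): publish the entry first. */
                WRITE_ONCE(swap_info[type], p);
                smp_wmb();      /* Order swap_info[type] before nr_swapfiles. */
                WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);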
      
      Link: http://lkml.kernel.org/r/20190131024410.29859-1-daniel.m.jordan@oracle.com
      Fixes: ec8acf20 ("swap: add per-partition lock for swapfile")
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Suggested-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rid swapoff of quadratic complexity · b56a2d8a
      Committed by Vineeth Remanan Pillai
      This patch was initially posted by Kelley Nielsen.  Reposting the patch
      with all review comments addressed and with minor modifications and
      optimizations.  Also, folding in the fixes offered by Hugh Dickins and
      Huang Ying.  Tests were rerun and commit message updated with new
      results.
      
      try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
      It unuses swap entries one by one, potentially iterating over all the
      page tables for all the processes in the system for each one.
      
      This new implementation of try_to_unuse() reduces its complexity to
      linear.  It iterates over the system's mms once, unusing all the affected
      entries as it walks each set of page tables.  It also makes similar
      changes to shmem_unuse().
      
      Improvement
      
      swapoff was called on a swap partition containing about 6G of data, in a
      VM (8 CPUs, 16G RAM), and calls to unuse_pte_range() were counted.
      
      Present implementation....about 1200M calls (8 min, avg 80% CPU util).
      Prototype.................about  9.0K calls (3 min, avg  5% CPU util).
      
      Details
      
      In shmem_unuse(), iterate over the shmem_swaplist and, for each
      shmem_inode_info that contains a swap entry, pass it to
      shmem_unuse_inode(), along with the swap type.  In shmem_unuse_inode(),
      iterate over its associated xarray, and store the index and value of
      each swap entry in an array for passing to shmem_swapin_page() outside
      of the RCU critical section.
      
      In try_to_unuse(), instead of iterating over the entries in the type and
      unusing them one by one, perhaps walking all the page tables for all the
      processes for each one, iterate over the mmlist, making one pass.  Pass
      each mm to unuse_mm() to begin its page table walk, and during the walk,
      unuse all the ptes that have backing store in the swap type received by
      try_to_unuse().  After the walk, check the type for orphaned swap
      entries with find_next_to_unuse(), and remove them from the swap cache.
      If find_next_to_unuse() starts over at the beginning of the type, repeat
      the check of the shmem_swaplist and the walk a maximum of three times.
      
      Change unuse_mm() and the intervening walk functions down to
      unuse_pte_range() to take the type as a parameter, and to iterate over
      their entire range, calling the next function down on every iteration.
      In unuse_pte_range(), make a swap entry from each pte in the range using
      the passed in type.  If it has backing store in the type, call
      swapin_readahead() to retrieve the page and pass it to unuse_pte().
      
      Pass the count of pages_to_unuse down the page table walks in
      try_to_unuse(), and return from the walk when the desired number of
      pages has been swapped back in.
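
      The overall structure described above, as a rough sketch (argument lists
      trimmed, error handling and the pages_to_unuse accounting omitted; treat
      it as an outline rather than the exact code):

        shmem_unuse(type);                       /* one shmem_swaplist pass    */

        list_for_each_entry(mm, &init_mm.mmlist, mmlist)
                unuse_mm(mm, type);              /* one page-table walk per mm */

        while ((i = find_next_to_unuse(si, i)))  /* swap-cache-only entries    */
                /* look the page up in the swap cache and delete it */;

        /* If find_next_to_unuse() wrapped around, recheck shmem_swaplist and
         * repeat the walk, up to three times. */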
      
      Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com
      Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
      Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 29 December, 2018 (2 commits)
    • mm, swap: fix swapoff with KSM pages · 7af7a8e1
      Committed by Huang Ying
      KSM pages may be mapped into multiple VMAs that cannot all be reached
      from one anon_vma.  So during swapin, a new copy of the page needs to be
      generated if a different anon_vma is needed; please refer to the comments
      of ksm_might_need_to_copy() for details.
      
      During swapoff, unuse_vma() uses the anon_vma (if available) to locate
      the VMA and virtual address mapped to the page, so not all mappings of a
      swapped-out KSM page can be found.  Therefore, in try_to_unuse(), even if
      the swap count of a swap entry isn't zero, the page needs to be deleted
      from the swap cache, so that in the next round a new page can be
      allocated and swapped in for the other mappings of the swapped-out KSM
      page.
      
      But this conflicts with THP swap support, where a THP can be deleted from
      the swap cache only after the swap count of every swap entry in the huge
      swap cluster backing the THP has reached 0.  try_to_unuse() was therefore
      changed in commit e0709829 ("mm, THP, swap: support to reclaim swap
      space for THP swapped out") to check that before deleting a page from the
      swap cache, but this broke KSM swapoff.
      
      Fortunately, KSM is used for normal pages only, so the original behaviour
      for KSM pages can be restored easily by checking PageTransCompound().
      That is how this patch works.
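
      In try_to_unuse() the check then ends up looking roughly like this (a
      sketch based on the description above; the exact condition in the patch
      may differ slightly):

        /* Normal (non-THP) pages: always drop the page from the swap cache,
         * even if swap_count() is still non-zero, so the other KSM mappings
         * can allocate a new copy and swap it in. */
        if (PageSwapCache(page) &&
            (!PageTransCompound(page) ||
             !swap_page_trans_huge_swapped(si, entry)))
                delete_from_swap_cache(compound_head(page));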
      
      The bug was introduced by e0709829 ("mm, THP, swap: support to reclaim
      swap space for THP swapped out"), which was merged in v4.14-rc1, so I
      think we should backport the fix to kernels from 4.14 onwards.  But Hugh
      thinks it may be rare for KSM pages to be in the swap device at swapoff
      time, which is why nobody has reported the bug so far.
      
      Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com
      Fixes: e0709829 ("mm, THP, swap: support to reclaim swap space for THP swapped out")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swap: use nr_node_ids for avail_lists in swap_info_struct · 66f71da9
      Committed by Aaron Lu
      Since a2468cc9 ("swap: choose swap device according to numa node"), the
      avail_lists field of swap_info_struct has been an array with MAX_NUMNODES
      elements.  This increased the size of swap_info_struct to 40KiB, which
      needs an order-4 page to hold it.
      
      This is not optimal in that:
      1. Most systems have far fewer than MAX_NUMNODES (1024) nodes, so it
         is a waste of memory;
      2. It could cause swapon failure if the swap device is swapped on
         after the system has been running for a while, because no order-4
         page is available, as pointed out by Vasily Averin.
      
      Solve the above two issues by using nr_node_ids (the actual number of
      possible nodes on the running system) for avail_lists instead of
      MAX_NUMNODES.
      
      nr_node_ids is unknown at compile time, so it can't be used directly when
      declaring this array.  What I did here is declare avail_lists as a
      zero-element array and allocate space for it when allocating space for
      swap_info_struct.  The reason for keeping an array rather than a pointer
      is that plist_for_each_entry needs the field to be part of the struct, so
      a pointer will not work.
      
      This patch is on top of Vasily Averin's fix commit.  I think the use of
      kvzalloc for swap_info_struct is still needed in case nr_node_ids is
      really big on some systems.
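
      A sketch of the resulting layout and allocation (the zero-element array
      is shown with the flexible-array spelling; together with the
      struct_size() conversion listed earlier in this log, the allocation reads
      roughly):

        struct swap_info_struct {
                /* ... existing fields ... */
                struct plist_node avail_lists[];        /* one entry per node */
        };

        /* Size the allocation for the nodes this system actually has: */
        p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);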
      
      Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 19 November, 2018 (1 commit)
  5. 27 October, 2018 (5 commits)
  6. 23 August, 2018 (8 commits)
  7. 09 July, 2018 (1 commit)
  8. 21 June, 2018 (1 commit)
  9. 15 June, 2018 (1 commit)
  10. 13 June, 2018 (1 commit)
    • treewide: kvzalloc() -> kvcalloc() · 778e1cdd
      Committed by Kees Cook
      The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
      patch replaces cases of:
      
              kvzalloc(a * b, gfp)
      
      with:
              kvcalloc(a, b, gfp)
      
      as well as handling cases of:
      
              kvzalloc(a * b * c, gfp)
      
      with:
      
              kvzalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kvcalloc(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kvzalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kvzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kvzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kvzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kvzalloc
      + kvcalloc
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kvzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kvzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kvzalloc(C1 * C2 * C3, ...)
      |
        kvzalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kvzalloc(sizeof(THING) * C2, ...)
      |
        kvzalloc(sizeof(TYPE) * C2, ...)
      |
        kvzalloc(C1 * C2 * C3, ...)
      |
        kvzalloc(C1 * C2, ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
  11. 26 May, 2018 (1 commit)
  12. 12 April, 2018 (2 commits)
  13. 12 February, 2018 (1 commit)
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Committed by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 28 November, 2017 (1 commit)
  15. 16 November, 2017 (3 commits)
  16. 03 November, 2017 (1 commit)
    • mm, swap: fix race between swap count continuation operations · 2628bd6f
      Committed by Huang Ying
      One page may store a set of entries of the sis->swap_map
      (swap_info_struct->swap_map) belonging to multiple swap clusters.
      
      If some of those entries have sis->swap_map[offset] > SWAP_MAP_MAX,
      multiple pages will be used to store the set of entries of the
      sis->swap_map, and the pages are linked with page->lru.  This is called
      swap count continuation.  Previously, sis->lock was used to serialize
      access to the pages which store the set of entries of the sis->swap_map.
      But to improve the scalability of __swap_duplicate(), the swap cluster
      lock may now be used in swap_count_continued() instead.  This can race
      with add_swap_count_continuation(), which operates on a nearby swap
      cluster whose sis->swap_map entries are stored in the same page.
      
      The race can cause wrong swap counts in practice, and thus unfreeable
      swap entries, software lockups, etc.
      
      To fix the race, a new spin lock called cont_lock is added to struct
      swap_info_struct to protect the swap count continuation page list.  This
      is a lock at the swap device level, so the scalability isn't great, but
      it is still much better than the original sis->lock, because it is only
      acquired/released when swap count continuation is used, which is rare in
      practice.  If scalability turns out to be an issue for some workloads, we
      can split the lock into more fine-grained locks.
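
      A sketch of the fix (only the locking shape is shown; simplified, not
      verbatim):

        struct swap_info_struct {
                /* ... */
                spinlock_t cont_lock;   /* swap count continuation page list */
        };

        /* Both add_swap_count_continuation() and swap_count_continued() take
         * this device-level lock around their walk of the continuation pages
         * linked on page->lru, so the two can no longer race: */
                spin_lock(&si->cont_lock);
                /* ... find or update the continuation page for this offset ... */
                spin_unlock(&si->cont_lock);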
      
      Link: http://lkml.kernel.org/r/20171017081320.28133-1-ying.huang@intel.com
      Fixes: 235b6217 ("mm/swap: add cluster lock")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>	[4.11+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 09 September, 2017 (2 commits)
  18. 07 September, 2017 (4 commits)
    • swap: choose swap device according to numa node · a2468cc9
      Committed by Aaron Lu
      If the system has more than one swap device and the swap devices have
      node information, we can make use of this information in get_swap_pages()
      to decide which swap device to use and get better performance.
      
      The current code uses a priority based list, swap_avail_list, to decide
      which swap device to use; if multiple swap devices share the same
      priority, they are used round robin.  This patch changes the previous
      single global swap_avail_list into a per-numa-node list, i.e. each numa
      node sees its own priority based list of available swap devices.  A swap
      device's priority can be promoted on its matching node's swap_avail_list.
      
      A swap device's priority is set as follows: the user can set a value
      >= 0, or the system will pick one starting from -1 and going downwards.
      The priority value in the swap_avail_list is the negated value of the
      swap device's priority, because plist is sorted from low to high.  The
      new policy doesn't change the semantics for the priority >= 0 cases; the
      previous "starting from -1 then downwards" now becomes "starting from -2
      then downwards", and -1 is reserved as the promoted value.
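
      Sketched from the patch (simplified), get_swap_pages() then starts from
      the current node's list:

        int node = numa_node_id();

        spin_lock(&swap_avail_lock);
        plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
                                  avail_lists[node]) {
                /* The highest-priority usable device for this node comes
                 * first; fall through to lower-priority ones if it is full. */
        }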
      
      Take 4-node EX machine as an example, suppose 4 swap devices are
      available, each sit on a different node:
      swapA on node 0
      swapB on node 1
      swapC on node 2
      swapD on node 3
      
      After they are all swapped on in the sequence of ABCD.
      
      Current behaviour:
      their priorities will be:
      swapA: -1
      swapB: -2
      swapC: -3
      swapD: -4
      And their position in the global swap_avail_list will be:
      swapA   -> swapB   -> swapC   -> swapD
      prio:1     prio:2     prio:3     prio:4
      
      New behaviour:
      their priorities will be (note that -1 is skipped):
      swapA: -2
      swapB: -3
      swapC: -4
      swapD: -5
      And their positions in the 4 swap_avail_lists[nid] will be:
      swap_avail_lists[0]: /* node 0's available swap device list */
      swapA   -> swapB   -> swapC   -> swapD
      prio:1     prio:3     prio:4     prio:5
      swap_avail_lists[1]: /* node 1's available swap device list */
      swapB   -> swapA   -> swapC   -> swapD
      prio:1     prio:2     prio:4     prio:5
      swap_avail_lists[2]: /* node 2's available swap device list */
      swapC   -> swapA   -> swapB   -> swapD
      prio:1     prio:2     prio:3     prio:5
      swap_avail_lists[3]: /* node 3's available swap device list */
      swapD   -> swapA   -> swapB   -> swapC
      prio:1     prio:2     prio:3     prio:4
      
      To see the effect of the patch, a test that starts N process, each mmap
      a region of anonymous memory and then continually write to it at random
      position to trigger both swap in and out is used.
      
      On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
      are used as swap devices with each attached to a different node, the
      result is:
      
      runtime=30m/processes=32/total test size=128G/each process mmap region=4G
      kernel         throughput
      vanilla        13306
      auto-binding   15169 +14%
      
      runtime=30m/processes=64/total test size=128G/each process mmap region=2G
      kernel         throughput
      vanilla        11885
      auto-binding   14879 +25%
      
      [aaron.lu@intel.com: v2]
        Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
        Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
      [akpm@linux-foundation.org: use kmalloc_array()]
      Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
      Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Cc: "Chen, Tim C" <tim.c.chen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, swap: don't use VMA based swap readahead if HDD is used as swap · 81a0298b
      Committed by Huang Ying
      VMA based swap readahead reads ahead the virtual pages that are
      contiguous in the virtual address space, while the original swap
      readahead reads ahead the swap slots that are contiguous in the swap
      device.  Although VMA based swap readahead chooses more appropriate pages
      to read ahead, it triggers more small random reads, which may degrade the
      performance of an HDD (hard disk) heavily and may in the end outweigh the
      benefit.
      
      To avoid that issue, with this patch, if an HDD is used as swap, VMA
      based swap readahead is disabled and the original swap readahead is used
      instead.
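
      A sketch of how that decision is made (names follow the patch as far as
      I recall them, so treat them as assumptions): swapon() counts rotational
      swap devices, and the readahead path checks that count.

        static atomic_t nr_rotate_swap = ATOMIC_INIT(0);

        /* in swapon(), simplified: */
                if (!blk_queue_nonrot(bdev_get_queue(p->bdev)))
                        atomic_inc(&nr_rotate_swap);

        /* in the readahead path: */
        static inline bool swap_use_vma_readahead(void)
        {
                /* Any rotational swap device disables VMA based readahead. */
                return READ_ONCE(swap_vma_readahead) &&
                       !atomic_read(&nr_rotate_swap);
        }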
      
      Link: http://lkml.kernel.org/r/20170807054038.1843-6-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, THP, swap: support splitting THP for THP swap out · 59807685
      Committed by Huang Ying
      After adding swap-out support for THP (Transparent Huge Page), it is
      possible that a THP in the swap cache (partly swapped out) needs to be
      split.  To split such a THP, the swap cluster backing the THP needs to be
      split too, that is, the CLUSTER_FLAG_HUGE flag needs to be cleared for
      the swap cluster.  This patch implements that.  And because writing the
      THP to swap requires it to stay a huge page during writing, the
      PageWriteback flag is checked before splitting.
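
      The core of the cluster-side change, roughly as described above (a
      sketch; lock_cluster()/unlock_cluster() are the swapfile.c-internal
      helpers):

        int split_swap_cluster(swp_entry_t entry)
        {
                struct swap_info_struct *si;
                struct swap_cluster_info *ci;
                unsigned long offset = swp_offset(entry);

                si = _swap_info_get(entry);
                if (!si)
                        return -EBUSY;
                ci = lock_cluster(si, offset);
                cluster_clear_huge(ci);         /* drop CLUSTER_FLAG_HUGE */
                unlock_cluster(ci);
                return 0;
        }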
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-8-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, THP, swap: don't allocate huge cluster for file backed swap device · f0eea189
      Committed by Huang Ying
      It's hard to write a whole transparent huge page (THP) to a file-backed
      swap device during swap-out, and file-backed swap devices aren't very
      popular.  So huge cluster allocation is disabled for file-backed swap
      devices.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-5-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>