1. 09 Oct 2018 (1 commit)
  2. 06 Oct 2018 (9 commits)
  3. 02 Oct 2018 (2 commits)
    • mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration · efaffc5e
      Committed by Mel Gorman
      Rate limiting of page migrations due to automatic NUMA balancing was
      introduced to mitigate the worst-case scenario of migrating at high
      frequency due to false sharing or slowly ping-ponging between nodes.
      Since then, a lot of effort was spent on correctly identifying these
      pages and avoiding unnecessary migrations and the safety net may no longer
      be required.
      
      Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
      avoids spreading STREAM tasks wide prematurely. However, once the task
      was properly placed, it delayed migrating the memory due to rate limiting.
      Increasing the limit fixed the problem for him.
      
      Currently, the limit is hard-coded and does not account for the real
      capabilities of the hardware. Even if an estimate was attempted, it would
      not properly account for the number of memory controllers and it could
      not account for the amount of bandwidth used for normal accesses. Rather
      than fudging, this patch simply eliminates the rate limiting.
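
      As a rough illustration (not the literal diff), the throttling check that
      disappears sat in the misplaced-page migration path in mm/migrate.c; the
      helper and variable names below are best-effort recollections of the code
      at the time:

          /* Before: migration could bail out once the per-node budget for the
           * current rate-limit window was exhausted (nr_pages is the size of
           * the migration being attempted). */
          if (numamigrate_update_ratelimit(pgdat, nr_pages))
                  goto out;       /* too many recent migrations, skip this one */

          /* After: the check is gone entirely and the page proceeds straight
           * to isolation and migration. */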
      
      However, Jirka reports that a STREAM configuration using multiple
      processes achieved similar performance to 4.16. In local tests, this patch
      improved performance of STREAM relative to the baseline but it is somewhat
      machine-dependent. Most workloads show little or no performance difference,
      implying that there is not a heavy reliance on the throttling mechanism
      and it is safe to remove.
      
      STREAM on 2-socket machine
                               4.19.0-rc5             4.19.0-rc5
                               numab-v1r1       noratelimit-v1r1
      MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
      MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
      MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
      MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%)

      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm/migrate: Use spin_trylock() while resetting rate limit · 75346121
      Committed by Srikar Dronamraju
      Since this spinlock only serializes resetting of the migration rate-limit
      window, convert the spin_lock() to a spin_trylock(). If another thread is
      already updating the window, this task can simply move on.
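
      A minimal sketch of the reset path with the trylock conversion applied;
      the field and variable names (numabalancing_migrate_lock,
      numabalancing_migrate_next_window, migrate_interval_millisecs) are as I
      recall them from mm/migrate.c at the time, so treat this as an
      illustration rather than the exact hunk:

          start = READ_ONCE(pgdat->numabalancing_migrate_next_window);
          if (time_after(jiffies, start) &&
              spin_trylock(&pgdat->numabalancing_migrate_lock)) {
                  /* We won the race to reset the window; other tasks skip it. */
                  pgdat->numabalancing_migrate_nr_pages = 0;
                  do {
                          start += msecs_to_jiffies(migrate_interval_millisecs);
                  } while (unlikely(time_after(jiffies, start)));
                  WRITE_ONCE(pgdat->numabalancing_migrate_next_window, start);
                  spin_unlock(&pgdat->numabalancing_migrate_lock);
          }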
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     205332  198512   -3.32145
      1     319785  313559   -1.94693
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      8     74912   74761.9  -0.200368
      1     206585  214874   4.01239
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     189162  180536   -4.56011
      1     213760  210281   -1.62753
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     58736.8  56511.4  -3.78877
      1     105419   104899   -0.49327
      
      Avoiding the stretching of window intervals may be the reason for the
      regression. The code also now uses READ_ONCE/WRITE_ONCE, which may be
      hurting performance to some extent.
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        14,285,708      13,818,546
      migrations                1,180,621       1,149,960
      faults                    339,114         385,583
      cache-misses              55,205,631,894  55,259,546,768
      sched:sched_move_numa     843             2,257
      sched:sched_stick_numa    6               9
      sched:sched_swap_numa     219             512
      migrate:mm_migrate_pages  365             2,225
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        26907   72692
      numa_hint_faults_local  24279   62270
      numa_hit                239771  238762
      numa_huge_pte_updates   0       48
      numa_interleave         68      75
      numa_local              239688  238676
      numa_other              83      86
      numa_pages_migrated     363     2225
      numa_pte_updates        27415   98557
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,202,779       3,173,490
      migrations                37,186          36,966
      faults                    106,076         108,776
      cache-misses              12,024,873,744  12,200,075,320
      sched:sched_move_numa     931             1,264
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     1               0
      migrate:mm_migrate_pages  637             899
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        17409   21109
      numa_hint_faults_local  14367   17120
      numa_hit                73953   72934
      numa_huge_pte_updates   20      42
      numa_interleave         25      33
      numa_local              73892   72866
      numa_other              61      68
      numa_pages_migrated     668     915
      numa_pte_updates        27276   42326
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,474,013    8,312,022
      migrations                254,934      231,705
      faults                    320,506      310,242
      cache-misses              110,580,458  402,324,573
      sched:sched_move_numa     725          193
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     7            3
      migrate:mm_migrate_pages  145          93
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        22797   11838
      numa_hint_faults_local  21539   11216
      numa_hit                89308   90689
      numa_huge_pte_updates   0       0
      numa_interleave         865     1579
      numa_local              88955   89634
      numa_other              353     1055
      numa_pages_migrated     149     92
      numa_pte_updates        22930   12109
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,195,628  2,170,481
      migrations                11,179     10,126
      faults                    149,656    160,962
      cache-misses              8,117,515  10,834,845
      sched:sched_move_numa     49         10
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  5          2
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        3577    403
      numa_hint_faults_local  3476    358
      numa_hit                26142   25898
      numa_huge_pte_updates   0       0
      numa_interleave         358     207
      numa_local              26042   25860
      numa_other              100     38
      numa_pages_migrated     5       2
      numa_pte_updates        3587    400
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        100,602,296      110,339,633
      migrations                4,135,630        4,139,812
      faults                    789,256          863,622
      cache-misses              226,160,621,058  231,838,045,660
      sched:sched_move_numa     1,366            2,196
      sched:sched_stick_numa    16               33
      sched:sched_swap_numa     374              544
      migrate:mm_migrate_pages  1,350            2,469
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        47857   85748
      numa_hint_faults_local  39768   66831
      numa_hit                240165  242213
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              240165  242211
      numa_other              0       2
      numa_pages_migrated     1224    2376
      numa_pte_updates        48354   86233
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        58,515,496      59,331,057
      migrations                564,845         552,019
      faults                    245,807         266,586
      cache-misses              73,603,757,976  73,796,312,990
      sched:sched_move_numa     996             981
      sched:sched_stick_numa    10              54
      sched:sched_swap_numa     193             286
      migrate:mm_migrate_pages  646             713
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        13422   14807
      numa_hint_faults_local  5619    5738
      numa_hit                36118   36230
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              36116   36228
      numa_other              2       2
      numa_pages_migrated     616     703
      numa_pte_updates        13374   14742

      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-6-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 21 Sep 2018 (3 commits)
    • mm: slowly shrink slabs with a relatively small number of objects · 172b06c3
      Committed by Roman Gushchin
      9092c71b ("mm: use sc->priority for slab shrink targets") changed the
      way that the target slab pressure is calculated and made it
      priority-based:
      
          delta = freeable >> priority;
          delta *= 4;
          do_div(delta, shrinker->seeks);
      
      The problem is that on the default priority (which is 12) no pressure is
      applied at all if the number of potentially reclaimable objects is less
      than 4096 (1<<12).
      
      This causes the last objects on slab caches of no longer used cgroups to
      (almost) never get reclaimed.  It's obviously a waste of memory.
      
      It can be especially painful, if these stale objects are holding a
      reference to a dying cgroup.  Slab LRU lists are reparented on memcg
      offlining, but corresponding objects are still holding a reference to the
      dying cgroup.  If we don't scan these objects, the dying cgroup can't go
      away.  Most likely, the parent cgroup doesn't have any directly charged
      objects, only remaining objects from dying child cgroups.  So it can easily
      hold a reference to hundreds of dying cgroups.
      
      If there are no big spikes in memory pressure, and new memory cgroups are
      created and destroyed periodically, this causes the number of dying
      cgroups to grow steadily, causing a slow-ish and hard-to-detect memory
      "leak".  It's not a real leak, as the memory can eventually be reclaimed,
      but in real life that may never actually happen.  I've seen hosts with a
      steadily climbing number of dying cgroups that showed no sign of decline
      for months, despite the hosts being loaded with a production workload.
      
      It is an obvious waste of memory, and to prevent it, let's apply a minimal
      pressure even on small shrinker lists.  E.g.  if there are freeable
      objects, let's scan at least min(freeable, scan_batch) objects.
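
      A sketch of what this looks like in the do_shrink_slab() scan-target
      calculation; batch_size here stands in for the scan_batch mentioned above,
      and the exact upstream hunk may differ slightly:

          delta = freeable >> priority;
          delta *= 4;
          do_div(delta, shrinker->seeks);

          /*
           * Make sure we apply at least some minimal pressure even on small
           * caches, so that objects pinned by long-dead cgroups are eventually
           * scanned and reclaimed.
           */
          delta = max_t(unsigned long long, delta,
                        min(freeable, batch_size));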
      
      This fix significantly improves a chance of a dying cgroup to be
      reclaimed, and together with some previous patches stops the steady growth
      of the dying cgroups number on some of our hosts.
      
      Link: http://lkml.kernel.org/r/20180905230759.12236-1-guro@fb.com
      Fixes: 9092c71b ("mm: use sc->priority for slab shrink targets")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: shmem.c: Correctly annotate new inodes for lockdep · b45d71fb
      Committed by Joel Fernandes (Google)
      Directories and inodes don't necessarily need to be in the same lockdep
      class.  For example, hugetlbfs splits them out too to prevent false
      positives in lockdep.  Annotate correctly after new inode creation.  If
      it's a directory inode, it will be put into a different class.
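
      One plausible shape for the annotation, assuming the
      lockdep_annotate_inode_mutex_key() helper that hugetlbfs uses for the same
      purpose; a sketch rather than necessarily the exact hunk:

          /* In shmem_get_inode(), after the new inode has been set up:
           * re-annotate the inode's lock so that directory inodes land in a
           * different lockdep class than regular file inodes. */
          lockdep_annotate_inode_mutex_key(inode);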
      
      This should fix a lockdep splat reported by syzbot:
      
      > ======================================================
      > WARNING: possible circular locking dependency detected
      > 4.18.0-rc8-next-20180810+ #36 Not tainted
      > ------------------------------------------------------
      > syz-executor900/4483 is trying to acquire lock:
      > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at: inode_lock
      > include/linux/fs.h:765 [inline]
      > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at:
      > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
      >
      > but task is already holding lock:
      > 0000000025208078 (ashmem_mutex){+.+.}, at: ashmem_shrink_scan+0xb4/0x630
      > drivers/staging/android/ashmem.c:448
      >
      > which lock already depends on the new lock.
      >
      > -> #2 (ashmem_mutex){+.+.}:
      >        __mutex_lock_common kernel/locking/mutex.c:925 [inline]
      >        __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
      >        mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
      >        ashmem_mmap+0x55/0x520 drivers/staging/android/ashmem.c:361
      >        call_mmap include/linux/fs.h:1844 [inline]
      >        mmap_region+0xf27/0x1c50 mm/mmap.c:1762
      >        do_mmap+0xa10/0x1220 mm/mmap.c:1535
      >        do_mmap_pgoff include/linux/mm.h:2298 [inline]
      >        vm_mmap_pgoff+0x213/0x2c0 mm/util.c:357
      >        ksys_mmap_pgoff+0x4da/0x660 mm/mmap.c:1585
      >        __do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
      >        __se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline]
      >        __x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91
      >        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      >        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      >
      > -> #1 (&mm->mmap_sem){++++}:
      >        __might_fault+0x155/0x1e0 mm/memory.c:4568
      >        _copy_to_user+0x30/0x110 lib/usercopy.c:25
      >        copy_to_user include/linux/uaccess.h:155 [inline]
      >        filldir+0x1ea/0x3a0 fs/readdir.c:196
      >        dir_emit_dot include/linux/fs.h:3464 [inline]
      >        dir_emit_dots include/linux/fs.h:3475 [inline]
      >        dcache_readdir+0x13a/0x620 fs/libfs.c:193
      >        iterate_dir+0x48b/0x5d0 fs/readdir.c:51
      >        __do_sys_getdents fs/readdir.c:231 [inline]
      >        __se_sys_getdents fs/readdir.c:212 [inline]
      >        __x64_sys_getdents+0x29f/0x510 fs/readdir.c:212
      >        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      >        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      >
      > -> #0 (&sb->s_type->i_mutex_key#9){++++}:
      >        lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
      >        down_write+0x8f/0x130 kernel/locking/rwsem.c:70
      >        inode_lock include/linux/fs.h:765 [inline]
      >        shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
      >        ashmem_shrink_scan+0x236/0x630 drivers/staging/android/ashmem.c:455
      >        ashmem_ioctl+0x3ae/0x13a0 drivers/staging/android/ashmem.c:797
      >        vfs_ioctl fs/ioctl.c:46 [inline]
      >        file_ioctl fs/ioctl.c:501 [inline]
      >        do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685
      >        ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702
      >        __do_sys_ioctl fs/ioctl.c:709 [inline]
      >        __se_sys_ioctl fs/ioctl.c:707 [inline]
      >        __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707
      >        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      >        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      >
      > other info that might help us debug this:
      >
      > Chain exists of:
      >   &sb->s_type->i_mutex_key#9 --> &mm->mmap_sem --> ashmem_mutex
      >
      >  Possible unsafe locking scenario:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(ashmem_mutex);
      >                                lock(&mm->mmap_sem);
      >                                lock(ashmem_mutex);
      >   lock(&sb->s_type->i_mutex_key#9);
      >
      >  *** DEADLOCK ***
      >
      > 1 lock held by syz-executor900/4483:
      >  #0: 0000000025208078 (ashmem_mutex){+.+.}, at:
      > ashmem_shrink_scan+0xb4/0x630 drivers/staging/android/ashmem.c:448
      
      Link: http://lkml.kernel.org/r/20180821231835.166639-1-joel@joelfernandes.org
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Suggested-by: NeilBrown <neilb@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: disable deferred struct page for 32-bit arches · 889c695d
      Committed by Pasha Tatashin
      Deferred struct page init is needed only on systems with a large amount of
      physical memory, where it improves boot performance.  32-bit systems do not
      benefit from this feature.
      
      Jiri reported a problem where deferred struct pages do not work well with
      x86-32:
      
      [    0.035162] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
      [    0.035725] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
      [    0.036269] Initializing CPU#0
      [    0.036513] Initializing HighMem for node 0 (00036ffe:0007ffe0)
      [    0.038459] page:f6780000 is uninitialized and poisoned
      [    0.038460] raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
      [    0.039509] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
      [    0.040038] ------------[ cut here ]------------
      [    0.040399] kernel BUG at include/linux/page-flags.h:293!
      [    0.040823] invalid opcode: 0000 [#1] SMP PTI
      [    0.041166] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc1_pt_jiri #9
      [    0.041694] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
      [    0.042496] EIP: free_highmem_page+0x64/0x80
      [    0.042839] Code: 13 46 d8 c1 e8 18 5d 83 e0 03 8d 04 c0 c1 e0 06 ff 80 ec 5f 44 d8 c3 8d b4 26 00 00 00 00 ba 08 65 28 d8 89 d8 e8 fc 71 02 00 <0f> 0b 8d 76 00 8d bc 27 00 00 00 00 ba d0 b1 26 d8 89 d8 e8 e4 71
      [    0.044338] EAX: 0000003c EBX: f6780000 ECX: 00000000 EDX: d856cbe8
      [    0.044868] ESI: 0007ffe0 EDI: d838df20 EBP: d838df00 ESP: d838defc
      [    0.045372] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210086
      [    0.045913] CR0: 80050033 CR2: 00000000 CR3: 18556000 CR4: 00040690
      [    0.046413] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      [    0.046913] DR6: fffe0ff0 DR7: 00000400
      [    0.047220] Call Trace:
      [    0.047419]  add_highpages_with_active_regions+0xbd/0x10d
      [    0.047854]  set_highmem_pages_init+0x5b/0x71
      [    0.048202]  mem_init+0x2b/0x1e8
      [    0.048460]  start_kernel+0x1d2/0x425
      [    0.048757]  i386_start_kernel+0x93/0x97
      [    0.049073]  startup_32_smp+0x164/0x168
      [    0.049379] Modules linked in:
      [    0.049626] ---[ end trace 337949378db0abbb ]---
      
      We free highmem pages before their struct pages are initialized:
      
      mem_init()
       set_highmem_pages_init()
        add_highpages_with_active_regions()
         free_highmem_page()
          .. Access uninitialized struct page here..
      
      Because there is no reason to have this feature on 32-bit systems, just
      disable it.
      
      Link: http://lkml.kernel.org/r/20180831150506.31246-1-pavel.tatashin@microsoft.com
      Fixes: 2e3ca40f ("mm: relax deferred struct page requirements")
      Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reported-by: Jiri Slaby <jslaby@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 14 Sep 2018 (1 commit)
    • mm: get rid of vmacache_flush_all() entirely · 7a9cdebd
      Committed by Linus Torvalds
      Jann Horn points out that the vmacache_flush_all() function is not only
      potentially expensive, it's buggy too.  It also happens to be entirely
      unnecessary, because the sequence number overflow case can be avoided by
      simply making the sequence number 64-bit.  That doesn't even grow the
      data structures in question, because the other adjacent fields are
      already 64-bit.
      
      So simplify the whole thing by just making the sequence number overflow
      case go away entirely, which gets rid of all the complications and makes
      the code faster too.  Win-win.
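
      A minimal sketch of the resulting validity check, with the per-mm and
      per-task sequence numbers widened to 64 bits; the names follow
      mm/vmacache.c as I recall it, so treat this as an illustration rather
      than the exact diff:

          /* With a 64-bit sequence number, overflow is no longer a practical
           * concern, so the "flush every task's cache on wraparound" path can
           * simply go away. */
          static bool vmacache_valid(struct mm_struct *mm)
          {
                  struct task_struct *curr = current;

                  if (!vmacache_valid_mm(mm))
                          return false;

                  if (mm->vmacache_seqnum != curr->vmacache.seqnum) {
                          /* Stale cache: adopt the new sequence number and
                           * invalidate this task's cached VMA slots. */
                          curr->vmacache.seqnum = mm->vmacache_seqnum;
                          vmacache_flush(curr);
                          return false;
                  }
                  return true;
          }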
      
      [ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistic
        also just goes away entirely with this ]
      Reported-by: Jann Horn <jannh@google.com>
      Suggested-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 05 Sep 2018 (6 commits)
  7. 01 Sep 2018 (1 commit)
    • blkcg: delay blkg destruction until after writeback has finished · 59b57717
      Committed by Dennis Zhou (Facebook)
      Currently, blkcg destruction relies on a sequence of events:
        1. Destruction starts. blkcg_css_offline() is called and blkgs
           release their reference to the blkcg. This immediately destroys
           the cgwbs (writeback).
        2. With blkgs giving up their reference, the blkcg ref count should
           become zero, eventually leading to a call to blkcg_css_free(), which
           finally frees the blkcg.
      
      Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
      and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
      on the completion of all writeback associated with the blkcg. A count of
      the number of cgwbs is maintained and once that goes to zero, blkg
      destruction can follow. This should prevent premature blkg destruction
      related to writeback.
      
      The new process for blkcg cleanup is as follows:
        1. Destruction starts. blkcg_css_offline() is called which offlines
           writeback. Blkg destruction is delayed on the cgwb_refcnt count to
           avoid punting potentially large amounts of outstanding writeback
           to root while maintaining any ongoing policies. Here, the base
           cgwb_refcnt is put back.
        2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
           and handles destruction of blkgs. This is where the css reference
           held by each blkg is released.
        3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
           This finally frees the blkcg.
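
      The refcount handoff in step 2 might look roughly like the helper below;
      a sketch assuming cgwb_refcnt is a refcount_t embedded in struct blkcg,
      not necessarily the exact upstream helper:

          /* Drop one writeback (cgwb) reference; once the last one is gone it
           * is finally safe to start tearing down the blkgs. */
          static inline void blkcg_cgwb_put(struct blkcg *blkcg)
          {
                  if (refcount_dec_and_test(&blkcg->cgwb_refcnt))
                          blkcg_destroy_blkgs(blkcg);
          }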
      
      It seems that in the past blk-throttle didn't do the most understandable
      thing when taking data from a blkg while associating with current.  So,
      the simplification and unification of what blk-throttle is doing caused
      this.
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 31 Aug 2018 (1 commit)
  9. 30 Aug 2018 (2 commits)
  10. 26 Aug 2018 (1 commit)
    • mm/cow: don't bother write protecting already write-protected pages · 1b2de5d0
      Committed by Linus Torvalds
      This is not normally noticeable, but repeated forks are unnecessarily
      expensive because they repeatedly dirty the parent page tables during
      the page table copy operation.
      
      It's trivial to just avoid write protecting the page table entry if it
      was already not writable.
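
      A sketch of the change in the fork-time PTE copy path (copy_one_pte() at
      the time); the added pte_write() test is the point, while the surrounding
      code is abbreviated and from memory:

          /*
           * Only write-protect the source PTE (dirtying the parent's page
           * table) if it is actually writable; an already read-only entry can
           * be copied as-is.
           */
          if (is_cow_mapping(vm_flags) && pte_write(pte)) {
                  ptep_set_wrprotect(src_mm, addr, src_pte);
                  pte = pte_wrprotect(pte);
          }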
      
      This patch was inspired by
      
          https://bugzilla.kernel.org/show_bug.cgi?id=200447
      
      which points to an ancient "waste time re-doing fork" issue in the
      presence of lots of signals.
      
      That bug was fixed by Eric Biederman's signal handling series
      culminating in commit c3ad2c3b ("signal: Don't restart fork when
      signals come in"), but the unnecessary work for repeated forks is still
      worth just fixing, particularly since the fix is trivial.
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 24 Aug 2018 (9 commits)
  12. 23 Aug 2018 (4 commits)