1. 02 Feb 2019 (23 commits)
    • lib/test_kmod.c: potential double free in error handling · db7ddeab
      Dan Carpenter committed
      There is a copy-and-paste bug, so we set "config->test_driver" to NULL
      twice instead of setting "config->test_fs" (a sketch of the pattern
      follows this entry).  Smatch complains that it leads to a double free:
      
        lib/test_kmod.c:840 __kmod_config_init() warn: 'config->test_fs' double freed
      
      Link: http://lkml.kernel.org/r/20190121140011.GA14283@kadam
      Fixes: d9c6a72d ("kmod: add test driver to stress test the module loader")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db7ddeab
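
      A minimal userspace sketch of the copy-and-paste pattern described above
      (illustrative only; the field names follow the commit text, but the
      struct and cleanup logic are simplified and are not lib/test_kmod.c
      itself):

        #include <stdlib.h>
        #include <string.h>

        struct test_config {
            char *test_driver;
            char *test_fs;
        };

        /* Error-path cleanup with the copy-and-paste bug: the second NULL
         * assignment clears the wrong field, so test_fs keeps pointing at
         * freed memory and a later cleanup pass frees it again. */
        static void config_free_buggy(struct test_config *config)
        {
            free(config->test_driver);
            config->test_driver = NULL;

            free(config->test_fs);
            config->test_driver = NULL; /* BUG: should clear config->test_fs */
        }

        int main(void)
        {
            struct test_config c = {
                .test_driver = strdup("test_module"),
                .test_fs = strdup("xfs"),
            };

            config_free_buggy(&c);
            config_free_buggy(&c);  /* double free of c.test_fs */
            return 0;
        }
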
    • mm, oom: fix use-after-free in oom_kill_process · cefc7ef3
      Shakeel Butt committed
      A syzbot instance running on an upstream kernel found a use-after-free
      bug in oom_kill_process.  On further inspection it seems that the
      process selected to be oom-killed has exited even before reaching
      read_lock(&tasklist_lock) in oom_kill_process().  More specifically,
      tsk->usage is 1, held by the get_task_struct() in oom_evaluate_task(),
      so the put_task_struct() within for_each_thread() frees the tsk while
      for_each_thread() is still accessing it.  The easiest fix is to do
      get/put across the for_each_thread() on the selected task (a sketch
      follows this entry).
      
      Now the next question is: should we continue with the oom-kill when the
      previously selected task has exited?  Before adding more complexity and
      heuristics, though, let's ask why we even look at the children of the
      oom-kill selected task.  select_bad_process() has already selected the
      worst process in the system/memcg.  Due to races, the selected process
      might not be the worst at kill time, but does that matter?  Userspace
      can use the oom_score_adj interface to prefer that children be killed
      before the parent.  I looked at the history, but this behavior seems to
      predate git.
      
      Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
      Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
      Fixes: 6b0c81b3 ("mm, oom: reduce dependency on tasklist_lock")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cefc7ef3
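
      A minimal kernel-style sketch of the get/put idea described above
      (assumed shape only; the exact placement inside the oom-kill path
      differs, and mark_victim_threads() is a name made up for this sketch):

        #include <linux/sched.h>
        #include <linux/sched/task.h>
        #include <linux/sched/signal.h>
        #include <linux/rcupdate.h>

        /*
         * Pin the selected victim before walking its threads so that the
         * final put_task_struct() of an already-exited task cannot free the
         * task_struct while for_each_thread() is still dereferencing it.
         */
        static void mark_victim_threads(struct task_struct *victim)
        {
            struct task_struct *t;

            get_task_struct(victim);    /* hold our own reference for the walk */

            rcu_read_lock();
            for_each_thread(victim, t) {
                /* per-thread oom bookkeeping goes here */
            }
            rcu_read_unlock();

            put_task_struct(victim);    /* drop our reference */
        }
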
    • mm/hotplug: invalid PFNs from pfn_to_online_page() · b13bc351
      Qian Cai committed
      On an arm64 ThunderX2 server, the first kmemleak scan would crash [1]
      with CONFIG_DEBUG_VM_PGFLAGS=y because page_to_nid() found a pfn that is
      not directly mapped (MEMBLOCK_NOMAP), so its page->flags is
      uninitialized.
      
      This is because commit 9f1eb38e ("mm, kmemleak: little optimization
      while scanning") started to use pfn_to_online_page() instead of
      pfn_valid().  However, in the CONFIG_MEMORY_HOTPLUG=y case,
      pfn_to_online_page() does not call memblock_is_map_memory() while
      pfn_valid() does.
      
      Historically, since commit 68709f45 ("arm64: only consider memblocks
      with NOMAP cleared for linear mapping"), pages marked as nomap are no
      longer reassigned to the new zone in memmap_init_zone() by calling
      __init_single_page().
      
      Commit 2d070eab ("mm: consider zone which is not fully populated to
      have holes") introduced pfn_to_online_page(), which was designed to
      return only valid pfns, but it is clearly broken on arm64.
      
      Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can
      handle nomap thanks to commit f52bb98f ("arm64: mm: always enable
      CONFIG_HOLES_IN_ZONE"), while it will be optimized away on architectures
      that have no HOLES_IN_ZONE (a sketch follows this entry).
      
      [1]
        Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006
        Mem abort info:
          ESR = 0x96000005
          Exception class = DABT (current EL), IL = 32 bits
          SET = 0, FnV = 0
          EA = 0, S1PTW = 0
        Data abort info:
          ISV = 0, ISS = 0x00000005
          CM = 0, WnR = 0
        Internal error: Oops: 96000005 [#1] SMP
        CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8
        pstate: 60400009 (nZCv daif +PAN -UAO)
        pc : page_mapping+0x24/0x144
        lr : __dump_page+0x34/0x3dc
        sp : ffff00003a5cfd10
        x29: ffff00003a5cfd10 x28: 000000000000802f
        x27: 0000000000000000 x26: 0000000000277d00
        x25: ffff000010791f56 x24: ffff7fe000000000
        x23: ffff000010772f8b x22: ffff00001125f670
        x21: ffff000011311000 x20: ffff000010772f8b
        x19: fffffffffffffffe x18: 0000000000000000
        x17: 0000000000000000 x16: 0000000000000000
        x15: 0000000000000000 x14: ffff802698b19600
        x13: ffff802698b1a200 x12: ffff802698b16f00
        x11: ffff802698b1a400 x10: 0000000000001400
        x9 : 0000000000000001 x8 : ffff00001121a000
        x7 : 0000000000000000 x6 : ffff0000102c53b8
        x5 : 0000000000000000 x4 : 0000000000000003
        x3 : 0000000000000100 x2 : 0000000000000000
        x1 : ffff000010772f8b x0 : ffffffffffffffff
        Process kmemleak (pid: 1408, stack limit = 0x(____ptrval____))
        Call trace:
         page_mapping+0x24/0x144
         __dump_page+0x34/0x3dc
         dump_page+0x28/0x4c
         kmemleak_scan+0x4ac/0x680
         kmemleak_scan_thread+0xb4/0xdc
         kthread+0x12c/0x13c
         ret_from_fork+0x10/0x18
        Code: d503201f f9400660 36000040 d1000413 (f9400661)
        ---[ end trace 4d4bd7f573490c8e ]---
        Kernel panic - not syncing: Fatal exception
        SMP: stopping secondary CPUs
        Kernel Offset: disabled
        CPU features: 0x002,20000c38
        Memory Limit: none
        ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Link: http://lkml.kernel.org/r/20190122132916.28360-1-cai@lca.pw
      Fixes: 9f1eb38e ("mm, kmemleak: little optimization while scanning")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b13bc351
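
      A sketch of the described check (assumed shape; upstream this is a macro
      in include/linux/memory_hotplug.h, shown here as a function for
      readability):

        #include <linux/mm.h>
        #include <linux/mmzone.h>
        #include <linux/memory_hotplug.h>

        /*
         * Only hand back a struct page if the section is online *and* the pfn
         * survives pfn_valid_within(), which compiles away to "true" unless
         * CONFIG_HOLES_IN_ZONE is enabled, as it is on arm64.
         */
        static inline struct page *pfn_to_online_page_sketch(unsigned long pfn)
        {
            unsigned long nr = pfn_to_section_nr(pfn);

            if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
                pfn_valid_within(pfn))
                return pfn_to_page(pfn);

            return NULL;
        }
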
    • mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages · eeb0efd0
      Oscar Salvador committed
      This is the same sort of error we saw in commit 17e2e7d7 ("mm,
      page_alloc: fix has_unmovable_pages for HugePages").
      
      Gigantic hugepages cross several memblocks, so the page we get in
      scan_movable_pages() can be a tail page belonging to a 1G hugepage.  If
      that happens, page_hstate() (via size_to_hstate()) will return NULL, and
      we will blow up in hugepage_migration_supported() (a sketch follows this
      entry).
      
      The splat is as follows:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        #PF error: [normal kernel read fault]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 1 PID: 1350 Comm: bash Tainted: G            E     5.0.0-rc1-mm1-1-default+ #27
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
        RIP: 0010:__offline_pages+0x6ae/0x900
        Call Trace:
         memory_subsys_offline+0x42/0x60
         device_offline+0x80/0xa0
         state_store+0xab/0xc0
         kernfs_fop_write+0x102/0x180
         __vfs_write+0x26/0x190
         vfs_write+0xad/0x1b0
         ksys_write+0x42/0x90
         do_syscall_64+0x5b/0x180
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) bochs_drm(E) ttm(E) aesni_intel(E) drm_kms_helper(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) drm(E) virtio_net(E) syscopyarea(E) sysfillrect(E) net_failover(E) sysimgblt(E) pcspkr(E) failover(E) i2c_piix4(E) fb_sys_fops(E) parport_pc(E) parport(E) button(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) libata(E) crc32c_intel(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) sg(E) scsi_mod(E) autofs4(E)
      
      [akpm@linux-foundation.org: fix brace layout, per David.  Reduce indentation]
      Link: http://lkml.kernel.org/r/20190122154407.18417-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eeb0efd0
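
      A sketch of the kind of handling described above (assumed shape, not the
      exact upstream hunk; scan_hugepage_sketch() is a name made up for this
      sketch): resolve the head page first, query hugepage state only on the
      head, and skip the pfn cursor past the whole compound page.

        #include <linux/mm.h>
        #include <linux/hugetlb.h>

        /*
         * The scan may land on a tail page of a gigantic hugepage that starts
         * in an earlier memblock.  Work on the head page and return either
         * this pfn (movable hugepage) or the first pfn past the hugepage.
         */
        static unsigned long scan_hugepage_sketch(struct page *page,
                                                  unsigned long pfn)
        {
            struct page *head = compound_head(page);
            unsigned long skip;

            if (page_huge_active(head))
                return pfn;                 /* report it as movable */

            skip = (1UL << compound_order(head)) - (pfn - page_to_pfn(head));
            return pfn + skip;              /* continue after the hugepage */
        }
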
    • psi: fix aggregation idle shut-off · 1b69ac6b
      Johannes Weiner committed
      psi has provisions to shut off the periodic aggregation worker when
      there is a period of no task activity - and thus no data that needs
      aggregating.  However, while developing psi monitoring, Suren noticed
      that the aggregation clock currently won't stay shut off for good.
      
      Debugging this revealed a flaw in the idle design: an aggregation run
      will see no task activity and decide to go to sleep; shortly thereafter,
      the kworker thread that executed the aggregation will go idle and cause
      a scheduling change, during which the psi callback will kick the
      !pending worker again.  This will ping-pong forever, and is equivalent
      to having no shut-off logic at all (but with more code!)
      
      Fix this by exempting aggregation workers from psi's clock waking logic
      when the state change is them going to sleep.  To do this, tag workers
      with the last work function they executed, and if in psi we see a worker
      going to sleep after aggregating psi data, we will not reschedule the
      aggregation work item.
      
      What if the worker is also executing other items before or after?
      
      Any psi state times that were incurred by work items preceding the
      aggregation work will have been collected from the per-cpu buckets
      during the aggregation itself.  If there are work items following the
      aggregation work, the worker's last_func tag will be overwritten and the
      aggregator will be kept alive to process this genuine new activity.
      
      If the aggregation work is the last thing the worker does, and we decide
      to go idle, the brief period of non-idle time incurred between the
      aggregation run and the kworker's dequeue will be stranded in the
      per-cpu buckets until the clock is woken by later activity.  But that
      should not be a problem.  The buckets can hold 4s worth of time, and
      future activity will wake the clock with a 2s delay, giving us 2s worth
      of data we can leave behind when disabling aggregation.  If it takes a
      worker more than two seconds to go idle after it finishes its last work
      item, we likely have bigger problems in the system, and won't notice one
      sample that was averaged with a bogus per-CPU weight.
      
      Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
      Fixes: eb414681 ("psi: pressure stall information for CPU, memory, and IO")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b69ac6b
    • mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone · 24feb47c
      Mikhail Zaslonko committed
      If the memory end is not aligned with the sparse memory section
      boundary, the mapping of such a section is only partly initialized.
      This may lead to a VM_BUG_ON due to uninitialized struct page accesses
      from the test_pages_in_a_zone() function, triggered by the
      memory_hotplug sysfs handlers.
      
      Here are the panic examples:
       CONFIG_DEBUG_VM_PGFLAGS=y
       kernel parameter mem=2050M
       --------------------------
       page:000003d082008000 is uninitialized and poisoned
       page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
       Call Trace:
         test_pages_in_a_zone+0xde/0x160
         show_valid_zones+0x5c/0x190
         dev_attr_show+0x34/0x70
         sysfs_kf_seq_show+0xc8/0x148
         seq_read+0x204/0x480
         __vfs_read+0x32/0x178
         vfs_read+0x82/0x138
         ksys_read+0x5a/0xb0
         system_call+0xdc/0x2d8
       Last Breaking-Event-Address:
         test_pages_in_a_zone+0xde/0x160
       Kernel panic - not syncing: Fatal exception: panic_on_oops
      
      Fix this by checking whether the pfn to check is within the zone (a
      sketch follows this entry).
      
      [mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com]
      Link: http://lkml.kernel.org/r/20190128144506.15603-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24feb47c
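
      A sketch of the zone-bound walk described above (assumed shape, not the
      upstream diff; zone_of_range_sketch() is a name made up for this
      sketch):

        #include <linux/mm.h>
        #include <linux/mmzone.h>

        /*
         * Walk a memory block's pfn range and return the single zone it lies
         * in, or NULL.  The key check is done on the pfn alone, so we never
         * dereference a struct page of a partially initialized section past
         * the end of the zone.
         */
        static struct zone *zone_of_range_sketch(unsigned long start_pfn,
                                                 unsigned long end_pfn)
        {
            struct zone *zone = NULL;
            unsigned long pfn;

            for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                struct page *page;

                if (!pfn_valid(pfn))
                    continue;
                if (zone && !zone_spans_pfn(zone, pfn))
                    return NULL;        /* ran past the zone boundary */
                page = pfn_to_page(pfn);
                if (zone && page_zone(page) != zone)
                    return NULL;        /* range straddles two zones */
                zone = page_zone(page);
            }
            return zone;
        }
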
    • mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone · efad4e47
      Michal Hocko committed
      Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2.
      
      Mikhail Zaslonko posted fixes for these two bugs quite some time ago
      [1].  I pushed back on those fixes because I believed that it is much
      better to plug the problem at initialization time rather than play
      whack-a-mole all over the hotplug code and find all the places which
      expect the full memory section to be initialized.
      
      We ended up with commit 2830bf6f ("mm, memory_hotplug: initialize
      struct pages for the full memory section") merged, and it caused a
      regression [2][3].  The reason is that there are memory layouts where
      two NUMA nodes share the same memory section, so the merged fix is
      simply incorrect.
      
      In order to plug this hole we really have to be zone range aware in
      those handlers.  I have split up the original patch into two.  One is
      unchanged (patch 2), and I took a different approach for the
      `removable' crash.
      
      [1] http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com
      [2] https://bugzilla.redhat.com/show_bug.cgi?id=1666948
      [3] http://lkml.kernel.org/r/20190125163938.GA20411@dhcp22.suse.cz
      
      This patch (of 2):
      
      Mikhail has reported the following VM_BUG_ON triggered when reading sysfs
      removable state of a memory block:
      
       page:000003d08300c000 is uninitialized and poisoned
       page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
       Call Trace:
         is_mem_section_removable+0xb4/0x190
         show_mem_removable+0x9a/0xd8
         dev_attr_show+0x34/0x70
         sysfs_kf_seq_show+0xc8/0x148
         seq_read+0x204/0x480
         __vfs_read+0x32/0x178
         vfs_read+0x82/0x138
         ksys_read+0x5a/0xb0
         system_call+0xdc/0x2d8
       Last Breaking-Event-Address:
         is_mem_section_removable+0xb4/0x190
       Kernel panic - not syncing: Fatal exception: panic_on_oops
      
      The reason is that the memory block spans the zone boundary and we are
      stumbling over an uninitialized struct page.  Fix this by enforcing the
      zone range in is_mem_section_removable() so that we never run past the
      end of a zone.
      
      Link: http://lkml.kernel.org/r/20190128144506.15603-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
      Debugged-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
      Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efad4e47
    • oom, oom_reaper: do not enqueue same task twice · 9bcdeb51
      Tetsuo Handa committed
      Arkadiusz reported that enabling memcg's group oom killing causes
      strange memcg statistics where there is no task in a memcg even though
      the reported number of tasks in that memcg is not 0.  It turned out that
      there is a bug in wake_oom_reaper() which allows enqueuing the same task
      twice, and the resulting refcount leak makes it impossible to decrease
      the number of tasks in that memcg.
      
      This bug has existed since the OOM reaper became invokable from the
      task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
      
        T1@P1     |T2@P1     |T3@P1     |OOM reaper
        ----------+----------+----------+------------
                                         # Processing an OOM victim in a different memcg domain.
                              try_charge()
                                mem_cgroup_out_of_memory()
                                  mutex_lock(&oom_lock)
                   try_charge()
                     mem_cgroup_out_of_memory()
                       mutex_lock(&oom_lock)
        try_charge()
          mem_cgroup_out_of_memory()
            mutex_lock(&oom_lock)
                                  out_of_memory()
                                    oom_kill_process(P1)
                                      do_send_sig_info(SIGKILL, @P1)
                                      mark_oom_victim(T1@P1)
                                      wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                                  mutex_unlock(&oom_lock)
                       out_of_memory()
                         mark_oom_victim(T2@P1)
                         wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                       mutex_unlock(&oom_lock)
            out_of_memory()
              mark_oom_victim(T1@P1)
              wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
            mutex_unlock(&oom_lock)
                                         # Completed processing an OOM victim in a different memcg domain.
                                         spin_lock(&oom_reaper_lock)
                                         # T1P1 is dequeued.
                                         spin_unlock(&oom_reaper_lock)
      
      but memcg's group oom killing made it easier to trigger this bug by
      calling wake_oom_reaper() on the same task from one out_of_memory()
      request.
      
      Fix this bug using the approach of commit 855b0183 ("oom,
      oom_reaper: disable oom_reaper for oom_kill_allocating_task"); a sketch
      follows this entry.  As a side effect, this patch also avoids enqueuing
      multiple threads sharing memory via the task_will_free_mem(current)
      path.
      
      Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
      Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
      Fixes: af8e15cc ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
      Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <asarai@suse.de>
      Cc: Jay Kamat <jgkamat@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9bcdeb51
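
      A conceptual sketch of the idempotent enqueue described above (assumed
      shape; the MMF_OOM_REAP_QUEUED flag name is how I recall the change and
      should be treated as illustrative):

        #include <linux/sched.h>
        #include <linux/sched/task.h>
        #include <linux/sched/signal.h>
        #include <linux/spinlock.h>
        #include <linux/wait.h>

        static struct task_struct *oom_reaper_list;
        static DEFINE_SPINLOCK(oom_reaper_lock);
        static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);

        /*
         * The single-linked oom_reaper_list cannot tell "not queued" apart
         * from "queued at the head" (oom_reaper_list == tsk while
         * tsk->oom_reaper_list == NULL), so record "already queued" in a
         * per-victim flag and bail out on any second call.
         */
        static void wake_oom_reaper_once(struct task_struct *tsk)
        {
            /* tsk->signal->oom_mm is pinned by mark_oom_victim() */
            if (test_and_set_bit(MMF_OOM_REAP_QUEUED,
                                 &tsk->signal->oom_mm->flags))
                return;

            get_task_struct(tsk);

            spin_lock(&oom_reaper_lock);
            tsk->oom_reaper_list = oom_reaper_list;
            oom_reaper_list = tsk;
            spin_unlock(&oom_reaper_lock);
            wake_up(&oom_reaper_wait);
        }
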
    • mm: migrate: make buffer_migrate_page_norefs() actually succeed · 80409c65
      Jan Kara committed
      Currently, buffer_migrate_page_norefs() is constantly failing because
      buffer_migrate_lock_buffers() grabs a reference on each buffer.  In
      fact, there's no reason for buffer_migrate_lock_buffers() to grab any
      buffer references, as the page is locked for the whole operation and
      thus nobody can reclaim buffers from the page.

      So remove the grabbing of buffer references, which also makes
      buffer_migrate_page_norefs() succeed (a sketch follows this entry).
      
      Link: http://lkml.kernel.org/r/20190116131217.7226-1-jack@suse.cz
      Fixes: 89cb0888 ("mm: migrate: provide buffer_migrate_page_norefs()")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80409c65
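
      A sketch of the described change (assumed shape): lock the buffers
      without taking extra references, relying on the page lock to keep them
      alive.

        #include <linux/buffer_head.h>
        #include <linux/mm.h>

        /*
         * With the page locked for the whole migration, nobody can strip its
         * buffers, so walking the buffer ring and locking each buffer is
         * enough; no get_bh()/put_bh() pairs are needed (those extra
         * references were what made buffer_migrate_page_norefs() bail out).
         */
        static void lock_page_buffers(struct page *page)
        {
            struct buffer_head *bh, *head;

            bh = head = page_buffers(page);
            do {
                lock_buffer(bh);    /* may sleep; no reference taken */
                bh = bh->b_this_page;
            } while (bh != head);
        }
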
    • kernel/exit.c: release ptraced tasks before zap_pid_ns_processes · 8fb335e0
      Andrei Vagin committed
      Currently, exit_ptrace() adds all ptraced tasks to a dead list, then
      zap_pid_ns_processes() waits on all tasks in the current pidns, and only
      then are tasks from the dead list released.

      zap_pid_ns_processes() can get stuck waiting on tasks from the dead
      list.  In this case, we will have one unkillable process with one or
      more dead children.
      
      Thanks to Oleg for the advice to release tasks in find_child_reaper().
      
      Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
      Fixes: 7c8bd232 ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8fb335e0
    • x86_64: increase stack size for KASAN_EXTRA · a8e911d1
      Qian Cai committed
      If the kernel is configured with KASAN_EXTRA, stack usage increases
      significantly because this option sets "-fstack-reuse" to "none" in GCC
      [1].  As a result, it triggers stack overruns quite often with the 32k
      stack size when compiled with GCC 8.  For example, this reproducer
      
        https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise06.c
      
      triggers a "corrupted stack end detected inside scheduler" very reliably
      with CONFIG_SCHED_STACK_END_CHECK enabled.
      
      There are just too many functions that end up with a large stack under
      KASAN_EXTRA, due to large local variables, and that are called over and
      over again without their stack slots being reused.  Some noticeable ones
      are
      
        size
        7648 shrink_page_list
        3584 xfs_rmap_convert
        3312 migrate_page_move_mapping
        3312 dev_ethtool
        3200 migrate_misplaced_transhuge_page
        3168 copy_process
      
      Another 49 functions are over 2k in size while compiling the kernel with
      "-Wframe-larger-than=" even with a relatively minimal config on this
      machine.  Hence, it is too much work to change the Makefiles for each
      object to compile without "-fsanitize-address-use-after-scope"
      individually.
      
      [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715#c23
      
      Although there is a patch in GCC 9 to help the situation, GCC 9 probably
      won't be released for a few months, and then it will probably take
      another 6 months to a year for all major distros to include it as a
      default.  Hence, the stack usage with KASAN_EXTRA can be revisited again
      in 2020 when GCC 9 is everywhere.  Until then, this patch will help
      users avoid stack overruns (a sketch of the change follows this entry).
      
      This has already been fixed for arm64 for the same reason via
      6e883067 ("arm64: kasan: Increase stack size for KASAN_EXTRA").
      
      Link: http://lkml.kernel.org/r/20190109215209.2903-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8e911d1
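
      A sketch of the kind of change described (assumed shape, mirroring what
      arm64 did in 6e883067; the exact constants in
      arch/x86/include/asm/page_64_types.h may differ):

        /* Header fragment: bump the stack order once more when KASAN_EXTRA
         * prevents stack-slot reuse for out-of-scope locals (values
         * illustrative). */
        #ifdef CONFIG_KASAN_EXTRA
        #define KASAN_STACK_ORDER 2
        #elif defined(CONFIG_KASAN)
        #define KASAN_STACK_ORDER 1
        #else
        #define KASAN_STACK_ORDER 0
        #endif

        #define THREAD_SIZE_ORDER   (2 + KASAN_STACK_ORDER)
        #define THREAD_SIZE         (PAGE_SIZE << THREAD_SIZE_ORDER)
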
    • mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAIT · 1ac25013
      Andrea Arcangeli committed
      hugetlb needs the same fix as faultin_nopage (which was applied in
      commit 96312e61 ("mm/gup.c: teach get_user_pages_unlocked to handle
      FOLL_NOWAIT")), or KVM hangs because it thinks the mmap_sem was already
      released by hugetlb_fault() if it returned VM_FAULT_RETRY, while it
      wasn't in the FOLL_NOWAIT case (a sketch follows this entry).
      
      Link: http://lkml.kernel.org/r/20190109020203.26669-2-aarcange@redhat.com
      Fixes: ce53053c ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ac25013
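
      A sketch of the flag translation described above (assumed shape; in
      follow_hugetlb_page() the real code builds fault_flags just before
      calling hugetlb_fault(), and gup_to_fault_flags() is a name made up for
      this sketch):

        #include <linux/mm.h>

        /*
         * Translate gup's FOLL_NOWAIT into the fault flags that mean "you may
         * drop mmap_sem for I/O, but do not wait".  Without
         * FAULT_FLAG_RETRY_NOWAIT, a VM_FAULT_RETRY return implies mmap_sem
         * was released, which is exactly what confused KVM here.
         */
        static unsigned int gup_to_fault_flags(unsigned int gup_flags)
        {
            unsigned int fault_flags = 0;

            if (gup_flags & FOLL_NOWAIT)
                fault_flags |= FAULT_FLAG_ALLOW_RETRY |
                               FAULT_FLAG_RETRY_NOWAIT;

            return fault_flags;
        }
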
    • arch: unexport asm/shmparam.h for all architectures · 36c0f7f0
      Masahiro Yamada committed
      Most architectures do not export shmparam.h to user-space.
      
        $ find arch -name shmparam.h  | sort
        arch/alpha/include/asm/shmparam.h
        arch/arc/include/asm/shmparam.h
        arch/arm64/include/asm/shmparam.h
        arch/arm/include/asm/shmparam.h
        arch/csky/include/asm/shmparam.h
        arch/ia64/include/asm/shmparam.h
        arch/mips/include/asm/shmparam.h
        arch/nds32/include/asm/shmparam.h
        arch/nios2/include/asm/shmparam.h
        arch/parisc/include/asm/shmparam.h
        arch/powerpc/include/asm/shmparam.h
        arch/s390/include/asm/shmparam.h
        arch/sh/include/asm/shmparam.h
        arch/sparc/include/asm/shmparam.h
        arch/x86/include/asm/shmparam.h
        arch/xtensa/include/asm/shmparam.h
      
      Strangely, some users of the asm-generic wrapper export shmparam.h
      
        $ git grep 'generic-y += shmparam.h'
        arch/c6x/include/uapi/asm/Kbuild:generic-y += shmparam.h
        arch/h8300/include/uapi/asm/Kbuild:generic-y += shmparam.h
        arch/hexagon/include/uapi/asm/Kbuild:generic-y += shmparam.h
        arch/m68k/include/uapi/asm/Kbuild:generic-y += shmparam.h
        arch/microblaze/include/uapi/asm/Kbuild:generic-y += shmparam.h
        arch/openrisc/include/uapi/asm/Kbuild:generic-y += shmparam.h
        arch/riscv/include/asm/Kbuild:generic-y += shmparam.h
        arch/unicore32/include/uapi/asm/Kbuild:generic-y += shmparam.h
      
      The newly added riscv correctly creates the asm-generic wrapper
      in the kernel space, but the others (c6x, h8300, hexagon, m68k,
      microblaze, openrisc, unicore32) create the one in the uapi directory.
      
      Digging into the git history, I now guess that fcc8487d ("uapi:
      export all headers under uapi directories") was the misconversion.
      Prior to that commit, no architecture exported shmparam.h.
      As its commit description says, that commit exported shmparam.h
      for c6x, h8300, hexagon, m68k, openrisc, and unicore32.
      
      83f0124a ("microblaze: remove asm-generic wrapper headers")
      accidentally exported shmparam.h for microblaze.
      
      This commit unexports shmparam.h for those architectures.
      
      There is no more reason to export include/uapi/asm-generic/shmparam.h,
      so it has been moved to include/asm-generic/shmparam.h
      
      Link: http://lkml.kernel.org/r/1546904307-11124-1-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Stafford Horne <shorne@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36c0f7f0
    • proc: fix /proc/net/* after setns(2) · 1fde6f21
      Alexey Dobriyan committed
      /proc entries under /proc/net/* can't be cached in the dcache because
      setns(2) can change the current net namespace (a sketch follows this
      entry).
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: avoid vim miscolorization]
      [adobriyan@gmail.com: write test, add dummy ->d_revalidate hook: necessary if /proc/net/* is pinned at setns time]
        Link: http://lkml.kernel.org/r/20190108192350.GA12034@avx2
      Link: http://lkml.kernel.org/r/20190107162336.GA9239@avx2
      Fixes: 1da4d377 ("proc: revalidate misc dentries")
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reported-by: Mateusz Stępień <mateusz.stepien@netrounds.com>
      Reported-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1fde6f21
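
      A sketch of the dummy ->d_revalidate idea mentioned above (assumed
      shape): always declare the cached dentry invalid so the next lookup
      resolves against the current net namespace.

        #include <linux/dcache.h>

        /*
         * Returning 0 from ->d_revalidate forces the VFS to redo the lookup,
         * so a /proc/net/* dentry cached before setns(2) is never reused in
         * the wrong namespace.
         */
        static int proc_net_d_revalidate(struct dentry *dentry,
                                         unsigned int flags)
        {
            return 0;
        }

        static const struct dentry_operations proc_net_dentry_ops = {
            .d_revalidate   = proc_net_d_revalidate,
            .d_delete       = always_delete_dentry,
        };
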
    • mm, memory_hotplug: don't bail out in do_migrate_range() prematurely · 1723058e
      Oscar Salvador committed
      do_migrate_range() takes a memory range and tries to isolate the pages
      to put them into a list.  This list will be later on used in
      migrate_pages() to know the pages we need to migrate.
      
      Currently, if we fail to isolate a single page, we put all the already
      isolated pages back on their LRU and bail out of the function.  This is
      quite suboptimal, as it forces us to start over again because
      scan_movable_pages() will give us the same range.  If there is no chance
      that we can isolate that page, we will loop here forever.
      
      The issue debugged in [1] proved that.  During the debugging of that
      issue, it was noticed that if do_migrate_range() fails to isolate a
      single page, we just discard the work we have done so far and bail
      out, which means that scan_movable_pages() will find the same set
      of pages again.
      
      Instead, we can just skip the error, keep isolating as many pages as
      possible, and then proceed with the call to migrate_pages() (a sketch
      follows this entry).

      This allows us to do as much work as possible at once.
      
      [1] https://lkml.org/lkml/2018/12/6/324
      
      Michal said:
      
      : I still think that this doesn't give us a whole picture.  Looping for
      : ever is a bug.  Failing the isolation is quite possible and it should
      : be a ephemeral condition (e.g.  a race with freeing the page or
      : somebody else isolating the page for whatever reason).  And here comes
      : the disadvantage of the current implementation.  We simply throw
      : everything on the floor just because of a ephemeral condition.  The
      : racy page_count check is quite dubious to prevent from that.
      
      Link: http://lkml.kernel.org/r/20181211135312.27034-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1723058e
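
      A sketch of the behaviour change described above (assumed shape;
      try_isolate() is a stand-in defined here only so the sketch is
      self-contained, it is not a kernel helper):

        #include <linux/mm.h>
        #include <linux/list.h>
        #include <linux/printk.h>

        /* Stand-in for the LRU/hugepage isolation done by do_migrate_range(). */
        static int try_isolate(struct page *page, struct list_head *pagelist)
        {
            list_add_tail(&page->lru, pagelist);
            return 0;
        }

        /*
         * On isolation failure, warn and move on to the next pfn; everything
         * isolated so far stays on the private list and is later handed to
         * migrate_pages() in one go (previously we would put it all back and
         * bail out, retrying the very same range forever).
         */
        static void isolate_range_sketch(unsigned long start_pfn,
                                         unsigned long end_pfn,
                                         struct list_head *pagelist)
        {
            unsigned long pfn;

            for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                if (!pfn_valid(pfn))
                    continue;
                if (try_isolate(pfn_to_page(pfn), pagelist))
                    pr_warn("failed to isolate pfn %lx\n", pfn);
            }
        }
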
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 5eeb6335
      Linus Torvalds committed
      Pull rdma fixes from Jason Gunthorpe:
       "Still not much going on, the usual set of oops and driver fixes this
        time:
      
         - Fix two uapi breakage regressions in mlx5 drivers
      
         - Various oops fixes in hfi1, mlx4, umem, uverbs, and ipoib
      
         - A protocol bug fix for hfi1 preventing it from implementing the
           verbs API properly, and a compatibility fix for EXEC STACK user
           programs
      
         - Fix missed refcounting in the 'advise_mr' patches merged this
           cycle.
      
         - Fix wrong use of the uABI in the hns SRQ patches merged this cycle"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        IB/uverbs: Fix OOPs in uverbs_user_mmap_disassociate
        IB/ipoib: Fix for use-after-free in ipoib_cm_tx_start
        IB/uverbs: Fix ioctl query port to consider device disassociation
        RDMA/mlx5: Fix flow creation on representors
        IB/uverbs: Fix OOPs upon device disassociation
        RDMA/umem: Add missing initialization of owning_mm
        RDMA/hns: Update the kernel header file of hns
        IB/mlx5: Fix how advise_mr() launches async work
        RDMA/device: Expose ib_device_try_get(()
        IB/hfi1: Add limit test for RC/UC send via loopback
        IB/hfi1: Remove overly conservative VM_EXEC flag check
        IB/{hfi1, qib}: Fix WC.byte_len calculation for UD_SEND_WITH_IMM
        IB/mlx4: Fix using wrong function to destroy sqp AHs under SRIOV
        RDMA/mlx5: Fix check for supported user flags when creating a QP
      5eeb6335
    • Merge tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 9ace868a
      Linus Torvalds committed
      Pull iomap fixes from Darrick Wong:
       "A couple of iomap fixes to eliminate some memory corruption and hang
        problems that were reported:
      
         - fix page migration when using iomap for pagecache management
      
         - fix a use-after-free bug in the directio code"
      
      * tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        iomap: fix a use after free in iomap_dio_rw
        iomap: get/put the page in iomap_page_create/release()
      9ace868a
    • Merge tag 'pm-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 3325254c
      Linus Torvalds committed
      Pull power management fixes from Rafael Wysocki:
       "These fix a PM-runtime framework regression introduced by the recent
        switch-over of device autosuspend to hrtimers and a mistake in the
        "poll idle state" code introduced by a recent change in it.
      
        Specifics:
      
         - Since ktime_get() turns out to be problematic for device
           autosuspend in the PM-runtime framework, make it use
           ktime_get_mono_fast_ns() instead (Vincent Guittot).
      
         - Fix an initial value of a local variable in the "poll idle state"
           code that makes it behave not exactly as expected when all idle
           states except for the "polling" one are disabled (Doug Smythies)"
      
      * tag 'pm-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpuidle: poll_state: Fix default time limit
        PM-runtime: Fix deadlock with ktime_get()
      3325254c
    • Merge tag 'acpi-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 4771eec1
      Linus Torvalds committed
      Pull ACPI Kconfig fixes from Rafael Wysocki:
       "Prevent invalid configurations from being created (e.g. by randconfig)
        due to some ACPI-related Kconfig options' dependencies that are not
        specified directly (Sinan Kaya)"
      
      * tag 'acpi-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        platform/x86: Fix unmet dependency warning for SAMSUNG_Q10
        platform/x86: Fix unmet dependency warning for ACPI_CMPC
        mfd: Fix unmet dependency warning for MFD_TPS68470
      4771eec1
    • Merge tag 'mmc-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · cca2e06a
      Linus Torvalds committed
      Pull MMC host fixes from Ulf Hansson:
      
       - mediatek: Fix incorrect register write for tunings
      
       - bcm2835: Fixup leakage of DMA channel on probe errors
      
      * tag 'mmc-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
        mmc: mediatek: fix incorrect register setting of hs400_cmd_int_delay
        mmc: bcm2835: Fix DMA channel leak on probe error
      cca2e06a
    • Merge tag 'i3c/fixes-for-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux · 520fac05
      Linus Torvalds committed
      Pull i3c fixes from Boris Brezillon:
      
       - Fix a deadlock in the designware driver
      
       - Fix the error path in i3c_master_add_i3c_dev_locked()
      
      * tag 'i3c/fixes-for-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
        i3c: master: dw: fix deadlock
        i3c: fix missing detach if failed to retrieve i3c dev
      520fac05
    • x86: explicitly align IO accesses in memcpy_{to,from}io · c228d294
      Linus Torvalds committed
      In commit 170d13ca ("x86: re-introduce non-generic memcpy_{to,from}io")
      I made our copy from IO space use a separate copy routine rather than
      rely on the generic memcpy.  I did that because our generic memory copy
      isn't actually well-defined when it comes to internal access ordering or
      alignment, and will in fact depend on various CPUID flags.
      
      In particular, the default memcpy() for a modern Intel CPU will
      generally be just a "rep movsb", which works reasonably well for
      medium-sized memory copies of regular RAM, since the CPU will turn it
      into fairly optimized microcode.
      
      However, for non-cached memory and IO, "rep movs" ends up being
      horrendously slow and will just do the architectural "one byte at a
      time" accesses implied by the movsb.
      
      At the other end of the spectrum, if you _don't_ end up using the "rep
      movsb" code, you'd likely fall back to the software copy, which does
      overlapping accesses for the tail, and may copy things backwards.
      Again, for regular memory that's fine, for IO memory not so much.
      
      The thinking was that clearly nobody really cared (because things
      worked), but some people had seen horrible performance due to the byte
      accesses, so let's just revert back to our long-ago version that did
      "rep movsl" for the bulk of the copy, and then fixed up any last few
      bytes of the tail with "movsw/b".
      
      Interestingly (and perhaps not entirely surprisingly), while that was
      our original memory copy implementation, and had been used before for
      IO, in the meantime many new users of memcpy_*io() had come about.  And
      while the access patterns for the memory copy weren't well-defined (so
      arguably _any_ access pattern should work), in practice the "rep movsb"
      case had been very common for the last several years.
      
      In particular, Jarkko Sakkinen reported that the memcpy_*io() change
      resulted in weird errors from his Geminilake NUC TPM module.
      
      And it turns out that the TPM TCG accesses according to spec require
      that the accesses be
      
       (a) done strictly sequentially
      
       (b) be naturally aligned
      
      otherwise the TPM chip will abort the PCI transaction.
      
      And, in fact, the tpm_crb.c driver did this:
      
      	memcpy_fromio(buf, priv->rsp, 6);
      	...
      	memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6);
      
      which really should never have worked in the first place, but back
      before commit 170d13ca it *happened* to work, because the
      memcpy_fromio() would be expanded to a regular memcpy, and
      
       (a) gcc would expand the first memcpy in-line, and turn it into a
           4-byte and a 2-byte read, and they happened to be in the right
           order, and the alignment was right.
      
       (b) gcc would call "memcpy()" for the second one, and the machines that
           had this TPM chip also apparently ended up always having ERMS
           ("Enhanced REP MOVSB/STOSB instructions"), so we'd use the "rep
           movsb" for that copy.
      
      In other words, basically by pure luck, the code happened to use the
      right access sizes in the (two different!) memcpy() implementations to
      make it all work.
      
      But after commit 170d13ca, both of the memcpy_fromio() calls
      resulted in a call to the routine with the consistent memory accesses,
      and in both cases it started out transferring with 4-byte accesses.
      Which worked for the first copy, but resulted in the second copy doing a
      32-bit read at an address that was only 2-byte aligned.
      
      Jarkko is actually fixing the fragile code in the TPM driver, but since
      this is an excellent example of why we absolutely must not use a generic
      memcpy for IO accesses, _and_ an IO-specific one really should strive to
      align the IO accesses, let's do exactly that (a conceptual sketch
      follows this entry).
      
      Side note: Jarkko also noted that the driver had been used on ARM
      platforms, and had worked.  That was because on 32-bit ARM, memcpy_*io()
      ends up always doing byte accesses, and on 64-bit ARM it first does byte
      accesses to align to 8-byte boundaries, and then does 8-byte accesses
      for the bulk.
      
      So ARM actually worked by design, and the x86 case worked by pure luck.
      
      We *might* want to make x86-64 do the 8-byte case too.  That should be a
      pretty straightforward extension, but let's do one thing at a time.  And
      generally MMIO accesses aren't really all that performance-critical, as
      shown by the fact that for a long time we just did them a byte at a
      time, and very few people ever noticed.
      Reported-and-tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Tested-by: Jerry Snitselaar <jsnitsel@redhat.com>
      Cc: David Laight <David.Laight@aculab.com>
      Fixes: 170d13ca ("x86: re-introduce non-generic memcpy_{to,from}io")
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c228d294
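
      A conceptual sketch of the alignment strategy described above
      (illustrative, not the actual x86 implementation): byte/word accesses
      until the MMIO pointer is 4-byte aligned, 32-bit reads for the bulk,
      then word/byte reads for the tail, so every MMIO access is naturally
      aligned and strictly sequential (the RAM side tolerates unaligned
      stores).

        #include <linux/io.h>
        #include <linux/types.h>

        static void aligned_memcpy_fromio(void *to,
                                          const volatile void __iomem *from,
                                          size_t count)
        {
            /* Lead-in: byte/word reads until 'from' is 4-byte aligned. */
            while (count && ((unsigned long)from & 3)) {
                if (count >= 2 && !((unsigned long)from & 1)) {
                    *(u16 *)to = readw(from);
                    to += 2; from += 2; count -= 2;
                } else {
                    *(u8 *)to = readb(from);
                    to += 1; from += 1; count -= 1;
                }
            }

            /* Bulk: naturally aligned 32-bit reads. */
            while (count >= 4) {
                *(u32 *)to = readl(from);
                to += 4; from += 4; count -= 4;
            }

            /* Tail: finish with an aligned word and/or byte read. */
            if (count >= 2) {
                *(u16 *)to = readw(from);
                to += 2; from += 2; count -= 2;
            }
            if (count)
                *(u8 *)to = readb(from);
        }
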
  2. 01 Feb 2019 (14 commits)
  3. 31 Jan 2019 (3 commits)
    • cpuidle: poll_state: Fix default time limit · 1617971c
      Doug Smythies committed
      The default time is declared in units of microseconds,
      but is used as nanoseconds, resulting in significant
      accounting errors for idle state 0 time when all idle
      states deeper than 0 are disabled.
      
      Under these unusual conditions, we don't really care
      about the poll time limit anyhow.
      
      Fixes: 800fb34a ("cpuidle: poll_state: Disregard disable idle states")
      Signed-off-by: Doug Smythies <dsmythies@telus.net>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      1617971c
    • PM-runtime: Fix deadlock with ktime_get() · 15efb47d
      Vincent Guittot committed
      A deadlock has been seen when switching clocksources which use
      PM-runtime.  The call path is:
      
      change_clocksource
          ...
          write_seqcount_begin
          ...
          timekeeping_update
              ...
              sh_cmt_clocksource_enable
                  ...
                  rpm_resume
                      pm_runtime_mark_last_busy
                          ktime_get
                              do
                                  read_seqcount_begin
                              while read_seqcount_retry
          ....
          write_seqcount_end
      
      Although we should be safe because we haven't actually changed the
      clocksource at that point, the seqcount protection means we cannot
      call ktime_get() there.

      Use ktime_get_mono_fast_ns() instead, which is lock-safe for such
      cases (a sketch follows this entry).
      
      With ktime_get_mono_fast_ns(), the timestamp is not guaranteed to be
      monotonic across an update and as a result can go backward.  According
      to the update_fast_timekeeper() description: "In the worst case, this
      can result is a slightly wrong timestamp (a few nanoseconds)".  For
      PM-runtime autosuspend, this means only that the suspend decision may
      be slightly suboptimal.
      
      Fixes: 8234f673 ("PM-runtime: Switch autosuspend over to using hrtimers")
      Reported-by: Biju Das <biju.das@bp.renesas.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      15efb47d
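
      A sketch of the described switch-over (assumed shape; the real helper
      lives in include/linux/pm_runtime.h, and the type of the last_busy
      field was adjusted as part of the hrtimer conversion):

        #include <linux/device.h>
        #include <linux/ktime.h>

        /*
         * Stamp "last busy" with the seqcount-safe fast monotonic clock
         * instead of ktime_get(), so calling it from inside a
         * write_seqcount_begin()/end() section (as in change_clocksource())
         * cannot spin forever on the timekeeper seqcount.
         */
        static inline void mark_last_busy_sketch(struct device *dev)
        {
            WRITE_ONCE(dev->power.last_busy, ktime_get_mono_fast_ns());
        }
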
    • fs/dcache: Track & report number of negative dentries · af0c9af1
      Waiman Long committed
      The current dentry number tracking code doesn't distinguish between
      positive & negative dentries.  It just reports the total number of
      dentries in the LRU lists.
      
      As an excessive number of negative dentries can have an impact on system
      performance, it is wise to track the number of positive and negative
      dentries separately.
      
      This patch adds tracking of the total number of negative dentries in
      the system LRU lists and reports it in the 5th field of the
      /proc/sys/fs/dentry-state file (a small example reader follows this
      entry).  The number does not, however, include negative dentries that
      are in flight but not yet on the LRU, nor those in the shrinker lists
      which are on the way out anyway.
      
      The number of positive dentries in the LRU lists can be roughly found by
      subtracting the number of negative dentries from the unused count.
      
      Matthew Wilcox confirmed that the dummy array has been there since the
      introduction of the dentry_stat structure in 2.1.60, probably for
      future extension; its entries were not replacements of pre-existing
      fields.  So no sane application that reads /proc/sys/fs/dentry-state
      will do anything odd if the last 2 fields of the sysctl parameter are
      not zero.  IOW, it is safe to use one of the dummy array entries for
      the negative dentry count.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af0c9af1
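
      A small userspace reader for the extended dentry-state output described
      above (illustrative; the first four field names follow the usual
      dentry-state layout, the 5th is the new negative-dentry count and the
      6th remains a dummy):

        #include <stdio.h>

        int main(void)
        {
            long nr_dentry, nr_unused, age_limit, want_pages, nr_negative, dummy;
            FILE *f = fopen("/proc/sys/fs/dentry-state", "r");

            if (!f)
                return 1;
            if (fscanf(f, "%ld %ld %ld %ld %ld %ld", &nr_dentry, &nr_unused,
                       &age_limit, &want_pages, &nr_negative, &dummy) != 6) {
                fclose(f);
                return 1;
            }
            fclose(f);

            /* On kernels without this patch the 5th field simply reads 0. */
            printf("dentries: %ld allocated, %ld unused (LRU), %ld negative\n",
                   nr_dentry, nr_unused, nr_negative);
            return 0;
        }
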