1. 27 September 2020 (2 commits)
    • mm: replace memmap_context by meminit_context · c1d0da83
      By Laurent Dufour
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies in the node directories: a memory21 link appears in
      both the node1 and node2 directories.
      
      This is wrong, but it doesn't prevent the system from running.
      However, when one of these memory blocks is later hot-unplugged and
      then hot-plugged, the system detects an inconsistency in the sysfs
      layout and a BUG_ON() is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on a PowerPC LPAR.
      
      The root cause of this issue is that when a node's memory is
      registered, the range used can overlap another node's range, so the
      memory block ends up registered to multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
      To address (a), register_mem_sect_under_node() should not rely on the
      system state to detect whether the link operation is triggered by a
      hot-plug operation or not.  This is addressed by patches 1 and 2 of
      this series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
      The memmap_context enum is used to detect whether a memory operation
      is due to a hot-add operation or is happening at boot time.

      Generalize it to cover hotplug operations and rename it to
      meminit_context.
      
      This patch introduces no functional change.
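
      As a rough sketch, the rename amounts to the following (the exact
      enumerator names are an assumption based on the upstream kernel,
      shown here only to illustrate the change):

        /* was: enum memmap_context { MEMMAP_EARLY, MEMMAP_HOTPLUG }; */
        enum meminit_context {
                MEMINIT_EARLY,          /* boot-time memory initialization */
                MEMINIT_HOTPLUG,        /* memory added via hotplug */
        };
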
      Suggested-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
      Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: fix gup_fast with dynamic page table folding · d3f7b1bb
      By Vasily Gorbik
      Currently, to make sure that every page table entry is read just once,
      gup_fast walks perform READ_ONCE and pass the pXd value down to the
      next gup_pXd_range function by value, e.g.:
      
        static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                                 unsigned int flags, struct page **pages, int *nr)
        ...
                pudp = pud_offset(&p4d, addr);
      
      This function passes a reference to that local value copy to
      pXd_offset, and might get the very same pointer back in return.  This
      happens when the level is folded (on most arches), and that pointer
      should not be iterated.
      
      On s390, each task might use 5-, 4- or 3-level address translation,
      and hence have different levels folded, so the logic is more complex,
      and a non-iterable pointer to a local copy leads to severe problems.
      
      Here is an example of what happens with gup_fast on s390, for a task
      with 3-level paging, crossing a 2 GB pud boundary:
      
        // addr = 0x1007ffff000, end = 0x10080001000
        static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                                 unsigned int flags, struct page **pages, int *nr)
        {
              unsigned long next;
              pud_t *pudp;
      
              // pud_offset returns &p4d itself (a pointer to a value on stack)
              pudp = pud_offset(&p4d, addr);
              do {
                // on the second iteration, this reads a "random" stack value
                      pud_t pud = READ_ONCE(*pudp);
      
                      // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                      next = pud_addr_end(addr, end);
                      ...
              } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
      
              return 1;
        }
      
      This happens since s390 moved to common gup code with commit
      d1874a0c ("s390/mm: make the pxd_offset functions more robust") and
      commit 1a42010c ("s390/mm: convert to the generic
      get_user_pages_fast code").
      
      s390 tried to mimic static level folding by changing the pXd_offset
      primitives to always calculate the top-level page table offset in
      pgd_offset, and to simply return the value passed in when pXd_offset
      has to act as folded.

      What is crucial for gup_fast, and what has been overlooked, is that
      PxD_SIZE/MASK, and thus pXd_addr_end, should also change
      correspondingly.  And the latter is not possible with dynamic folding.
      
      To fix the issue, pass the original pXdp pointers down to the
      gup_pXd_range functions in addition to the pXd values, and introduce
      pXd_offset_lockless helpers, which take an additional pXd entry value
      parameter.  This has already been discussed in
      
        https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
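
      A minimal sketch of the lockless-helper idea (the generic fallback
      shown here is an assumption based on the upstream fix; on a folded
      level the helper simply returns a pointer to the caller's own copy):

        /* Use the entry value the caller already read with READ_ONCE
         * instead of dereferencing pgdp again, so a folded level can
         * safely hand back a pointer into the caller's stack frame. */
        #ifndef p4d_offset_lockless
        #define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
        #endif

        /* gup_p4d_range then receives both the pgdp pointer and the pgd value: */
        p4dp = p4d_offset_lockless(pgdp, pgd, addr);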
      
      Fixes: 1a42010c ("s390/mm: convert to the generic get_user_pages_fast code")
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 21 September 2020 (1 commit)
  3. 20 September 2020 (3 commits)
  4. 19 September 2020 (2 commits)
  5. 18 September 2020 (2 commits)
    • mm: allow a controlled amount of unfairness in the page lock · 5ef64cc8
      By Linus Torvalds
      Commit 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic") made
      the page locking entirely fair, in that if a waiter came in while the
      lock was held, the lock would be transferred to the lockers strictly in
      order.
      
      That was intended to finally get rid of the long-reported watchdog
      failures that involved the page lock under extreme load, where a process
      could end up waiting essentially forever, as other page lockers stole
      the lock from under it.
      
      It also improved some benchmarks, but it ended up causing huge
      performance regressions on others, simply because fair lock behavior
      doesn't give out the lock as aggressively: it yields better worst-case
      latency, but potentially much worse average latency and throughput.
      
      Instead of reverting that change entirely, this introduces a controlled
      amount of unfairness, with a sysctl knob to tune it if somebody needs
      to.  But the default value should hopefully be good for any normal load,
      allowing a few rounds of lock stealing, but enforcing the strict
      ordering before the lock has been stolen too many times.
      
      There is also a hint from Matthieu Baerts that the fair page coloring
      may end up exposing an ABBA deadlock that is hidden by the usual
      optimistic lock stealing, and while the unfairness doesn't fix the
      fundamental issue (and I'm still looking at that), it avoids it in
      practice.
      
      The amount of unfairness can be modified by writing a new value to the
      'sysctl_page_lock_unfairness' variable (default value of 5, exposed
      through /proc/sys/vm/page_lock_unfairness), but that is hopefully
      something we'd use mainly for debugging rather than being necessary for
      any deep system tuning.
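
      As a quick illustration, the knob can be inspected from userspace
      like this (a minimal sketch; the path and the default of 5 come from
      the description above):

        #include <stdio.h>

        int main(void)
        {
                /* Read the current page-lock unfairness setting. */
                FILE *f = fopen("/proc/sys/vm/page_lock_unfairness", "r");
                int val;

                if (f && fscanf(f, "%d", &val) == 1)
                        printf("page_lock_unfairness = %d\n", val); /* default: 5 */
                if (f)
                        fclose(f);
                return 0;
        }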
      
      This whole issue has exposed just how critical the page lock can be,
      and how contended it gets under certain loads.  And the main
      contention doesn't really seem to be anything related to IO (which was
      the origin of this lock), but about things like just verifying that
      the page's file mapping is stable while faulting the page into a page
      table.
      
      Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
      Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
      Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
      Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
      Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arm64: paravirt: Initialize steal time when cpu is online · 75df529b
      By Andrew Jones
      Steal time initialization requires mapping a memory region which
      invokes a memory allocation. Doing this at CPU starting time results
      in the following trace when CONFIG_DEBUG_ATOMIC_SLEEP is enabled:
      
      BUG: sleeping function called from invalid context at mm/slab.h:498
      in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/1
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #1
      Call trace:
       dump_backtrace+0x0/0x208
       show_stack+0x1c/0x28
       dump_stack+0xc4/0x11c
       ___might_sleep+0xf8/0x130
       __might_sleep+0x58/0x90
       slab_pre_alloc_hook.constprop.101+0xd0/0x118
       kmem_cache_alloc_node_trace+0x84/0x270
       __get_vm_area_node+0x88/0x210
       get_vm_area_caller+0x38/0x40
       __ioremap_caller+0x70/0xf8
       ioremap_cache+0x78/0xb0
       memremap+0x9c/0x1a8
       init_stolen_time_cpu+0x54/0xf0
       cpuhp_invoke_callback+0xa8/0x720
       notify_cpu_starting+0xc8/0xd8
       secondary_start_kernel+0x114/0x180
      CPU1: Booted secondary processor 0x0000000001 [0x431f0a11]
      
      However, we don't need to initialize steal time at CPU starting time.
      We can simply wait until CPU online time, sacrificing just a bit of
      accuracy by returning zero for steal time until we know better.
      
      While at it, add __init to the functions that are only called by
      pv_time_init(), which is itself __init.
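
      A rough sketch of the approach (the state and function names here are
      illustrative assumptions, not necessarily the exact ones in the
      patch): register the setup at the CPU online stage, where sleeping
      allocations are allowed, instead of the atomic CPU starting stage.

        /* Runs in a preemptible context once the CPU is fully online, so
         * the memremap() call chain (which may sleep) is safe here. */
        static int stolen_time_cpu_online(unsigned int cpu)
        {
                return init_stolen_time_cpu(cpu); /* maps the stolen-time region */
        }

        /* CPUHP_AP_ONLINE_DYN callbacks run on the CPU after it is online,
         * unlike the *_STARTING states, which run with IRQs disabled. */
        cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "hypervisor/arm/pvtime:online",
                          stolen_time_cpu_online, stolen_time_cpu_down_prepare);
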
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Fixes: e0685fa2 ("arm64: Retrieve stolen time as paravirtualized guest")
      Cc: stable@vger.kernel.org
      Reviewed-by: Steven Price <steven.price@arm.com>
      Link: https://lore.kernel.org/r/20200916154530.40809-1-drjones@redhat.com
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
  6. 17 September 2020 (2 commits)
  7. 16 September 2020 (2 commits)
  8. 11 September 2020 (4 commits)
  9. 10 September 2020 (2 commits)
  10. 09 September 2020 (1 commit)
  11. 08 September 2020 (2 commits)
  12. 07 September 2020 (1 commit)
  13. 06 September 2020 (1 commit)
  14. 05 September 2020 (2 commits)
  15. 04 September 2020 (3 commits)
  16. 03 September 2020 (2 commits)
  17. 01 September 2020 (1 commit)
    • HID: core: Sanitize event code and type when mapping input · 35556bed
      By Marc Zyngier
      When calling into hid_map_usage(), the passed event code is
      blindly stored as is, even if it doesn't fit in the associated bitmap.
      
      This event code can come from a variety of sources, including devices
      masquerading as input devices, only a bit more "programmable".
      
      Instead of taking the event code at face value, check that it actually
      fits the corresponding bitmap, and if it doesn't:
      - spit out a warning so that we know which device is acting up
      - NULLify the bitmap pointer so that we catch unexpected uses
      
      Code paths that can make use of untrusted inputs can now check
      that the mapping was indeed correct and bail out if not.
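
      A minimal sketch of the check described above (simplified; only two
      event types are shown, and the variable names are assumptions for
      illustration):

        unsigned long *bmap = NULL;
        unsigned int limit = 0;

        switch (type) {
        case EV_KEY: bmap = input->keybit; limit = KEY_MAX; break;
        case EV_ABS: bmap = input->absbit; limit = ABS_MAX; break;
        /* ... other event types ... */
        }

        /* Reject codes that don't fit the bitmap for their type: warn so
         * the offending device is visible, and NULL the bitmap pointer so
         * later users of the mapping can detect the failure and bail out. */
        if (c > limit || bmap == NULL) {
                pr_warn_ratelimited("%s: Invalid code %d type %d\n",
                                    input->name, c, type);
                *bit = NULL;
                return;
        }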
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Benjamin Tissoires <benjamin.tissoires@gmail.com>
  18. 29 August 2020 (3 commits)
  19. 27 August 2020 (4 commits)