1. 29 Dec, 2018 15 commits
    • J
      mm/mmu_notifier: use structure for invalidate_range_start/end callback · 5d6527a7
      Jérôme Glisse committed
      Patch series "mmu notifier contextual informations", v2.
      
      This patchset adds contextual information, i.e. why an invalidation is
      happening, to the mmu notifier callbacks.  This is necessary for users of
      mmu notifiers that wish to maintain their own data structures without
      having to add new fields to struct vm_area_struct (vma).

      For instance, a device can have its own page table that mirrors the
      process address space.  When a vma is unmapped (munmap() syscall), the
      device driver can free the device page table for the range.

      Today we do not have any information on why a mmu notifier callback is
      happening, and thus device drivers have to assume that it is always a
      munmap().  This is inefficient, as it means that the driver needs to
      re-allocate the device page table on the next page fault and rebuild the
      whole device driver data structure for the range.

      Other use cases besides munmap() also exist.  For instance, it is
      pointless for a device driver to invalidate the device page table when
      the invalidation is for soft dirtiness tracking.  Likewise, a device
      driver can optimize away an mprotect() that changes the page table access
      permissions for the range.

      This patchset enables all these optimizations for device drivers.  I do
      not include any of them in this series, but another patchset I am posting
      will leverage this.
      
      The patchset is pretty simple from a code point of view.  The first two
      patches consolidate all mmu notifier arguments into a struct so that it is
      easier to add/change arguments.  The last patch adds the contextual
      information (munmap, protection, soft dirty, clear, ...).
      
      This patch (of 3):
      
      To avoid having to change many callback definitions every time we want to
      add a parameter, use a structure to group all parameters for the
      mmu_notifier invalidate_range_start/end callbacks.  No functional changes
      with this patch.
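
      As a rough userspace illustration of the idea (not the kernel code
      itself), the sketch below groups callback arguments into one struct so
      that adding a field later does not change any callback signature.  The
      names notifier_range, invalidate_start_fn, record_range and
      notify_invalidate_start are hypothetical stand-ins chosen for this
      example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the grouped callback arguments. */
struct notifier_range {
    unsigned long start;   /* first address of the invalidated range */
    unsigned long end;     /* one past the last address of the range */
    bool blockable;        /* whether the callback is allowed to sleep */
};

/* Callback type: one struct pointer instead of a long argument list. */
typedef int (*invalidate_start_fn)(const struct notifier_range *range);

/* Length of the most recent range seen by record_range(). */
static unsigned long last_len;

/* Example callback: only records how many bytes the range covers. */
static int record_range(const struct notifier_range *range)
{
    assert(range->end >= range->start);
    last_len = range->end - range->start;
    return 0;
}

/* The caller fills the struct once and hands it to the callback. */
static int notify_invalidate_start(invalidate_start_fn cb,
                                   unsigned long start, unsigned long end,
                                   bool blockable)
{
    struct notifier_range range = {
        .start = start,
        .end = end,
        .blockable = blockable,
    };

    return cb(&range);
}
```

      With this shape, a new field such as an invalidation reason would only
      touch the struct definition and the callers that fill it.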
      
      [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
      Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Jason Gunthorpe <jgg@mellanox.com>	[infiniband]
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zram: writeback throttle · bb416d18
      Minchan Kim committed
      If there is a lot of write IO to a flash device, the storage may suffer
      from wearout.  To overcome the problem, the admin needs to design a
      write limit that guarantees flash health for the entire product life.
      
      This patch creates a new knob "writeback_limit" for zram.
      
      writeback_limit's default value is 0, so that it doesn't limit any
      writeback.  If the admin wants to measure the writeback count over a
      certain period, it is available in the 3rd column of
      /sys/block/zram0/bd_stat.
      
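      If the admin wants to limit writeback to 400M per day, it can be done
      as below.  (Shell variable names cannot start with a digit, so the 4K
      shift is named PAGE_4K_SHIFT here.)
      
      	MB_SHIFT=20
      	PAGE_4K_SHIFT=12
      	echo $((400<<MB_SHIFT>>PAGE_4K_SHIFT)) > \
      		/sys/block/zram0/writeback_limit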
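
      The arithmetic in the example above can be checked with a small sketch.
      It assumes, as the two shifts suggest, that the limit is expressed in
      4K units; mib_to_4k_pages is a hypothetical helper written for this
      illustration and is not part of zram.

```c
#include <assert.h>
#include <stdint.h>

#define MB_SHIFT      20   /* 1 MiB = 2^20 bytes */
#define PAGE_4K_SHIFT 12   /* one 4 KiB unit = 2^12 bytes */

/* Convert a budget given in MiB into a count of 4 KiB units. */
static uint64_t mib_to_4k_pages(uint64_t mib)
{
    /* Guard against overflow of the left shift. */
    assert(mib <= (UINT64_MAX >> MB_SHIFT));
    return (mib << MB_SHIFT) >> PAGE_4K_SHIFT;
}
```

      For a 400 MiB budget this yields 400 * 256 = 102400 units, which is the
      value the shell expression writes to writeback_limit.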
      
      If the admin wants to allow further writes again, it can be done as below:
      
      	echo 0 > /sys/block/zram0/writeback_limit
      
      If the admin wants to see the remaining writeback budget:
      
      	cat /sys/block/zram0/writeback_limit
      
      The writeback_limit count is reset whenever zram is reset (e.g., system
      reboot, echo 1 > /sys/block/zramX/reset), so it is the user's job to keep
      track of how much writeback happened before the reset in order to
      allocate extra writeback budget in the next setting.
      
      [minchan@kernel.org: v4]
        Link: http://lkml.kernel.org/r/20181203024045.153534-8-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20181127055429.251614-8-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zram: add bd_stat statistics · 23eddf39
      Minchan Kim committed
      bd_stat represents things that happened in the backing device.  Currently
      it supports bd_counts, bd_reads and bd_writes which are helpful to
      understand wearout of flash and memory saving.
      
      [minchan@kernel.org: v4]
        Link: http://lkml.kernel.org/r/20181203024045.153534-7-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20181127055429.251614-7-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zram: support idle/huge page writeback · a939888e
      Minchan Kim committed
      Add a new feature "zram idle/huge page writeback".  In the zram-swap use
      case, zram usually has many idle/huge swap pages.  It's pointless to keep
      them in memory (ie, zram).
      
      To solve this problem, this feature introduces idle/huge page writeback to
      the backing device so the goal is to save more memory space on embedded
      systems.
      
      The normal sequence for using the idle/huge page writeback feature is as
      follows:
      
      while (1) {
              # mark every allocated zram slot as idle
              echo all > /sys/block/zram0/idle
              # leave the system working for several hours;
              # blocks that are not accessed in the meantime
              # remain marked as IDLE
      
              echo "idle" > /sys/block/zram0/writeback
              # and/or
              echo "huge" > /sys/block/zram0/writeback
              # write the IDLE and/or huge marked slots to the backing device
              # and free the memory
      }
      
      Per the discussion at
      https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,
      this patch removes the direct incompressible page writeback feature
      (d2afd25114f4 ("zram: write incompressible pages to backing device")).
      
      Below are the concerns from Sergey:
      == &< ==
      
      "IDLE writeback" is superior to "incompressible writeback".
      
      "incompressible writeback" is completely unpredictable and uncontrollable;
      it depends on data patterns and compression algorithms.  While "IDLE
      writeback" is predictable.
      
      I even suspect, that, *ideally*, we can remove "incompressible writeback".
      "IDLE pages" is a super set which also includes "incompressible" pages.
      So, technically, we still can do "incompressible writeback" from "IDLE
      writeback" path; but a much more reasonable one, based on a page idling
      period.
      
      I understand that you want to keep "direct incompressible writeback"
      around.  ZRAM is especially popular on devices which do suffer from flash
      wearout, so I can see "incompressible writeback" path becoming a dead
      code, long term.
      
      == &< ==
      
      Below are the concerns from Minchan:
      == &< ==
      
      My concern is that if we enable CONFIG_ZRAM_WRITEBACK in this
      implementation, both hugepage and idlepage writeback will turn on.
      However, some users want to enable only idlepage writeback, so we would
      need to introduce an on/off knob for hugepage or a new
      CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those use cases.  I don't want to make
      it complicated *if possible*.
      
      Long term, I imagine we need to make the VM aware of a new swap hierarchy
      that is a little bit different from the current one.  For example, the
      first high-priority swap can return -EIO or -ENOCOMP, and swap tries to
      fall back to the next lower-priority swap device.  With that, hugepage
      writeback will work transparently.
      
      So we could regard it as a regression because incompressible pages don't
      go to backing storage automatically.  Instead, the user should do it
      manually via "echo huge > /sys/block/zram/writeback".
      
      == &< ==
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-6-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zram: introduce ZRAM_IDLE flag · e82592c4
      Minchan Kim committed
      To support idle page writeback with upcoming patches, this patch
      introduces a new ZRAM_IDLE flag.
      
      Userspace can mark zram slots as "idle" via
      	"echo all > /sys/block/zramX/idle"
      which marks every allocated zram slot as ZRAM_IDLE.
      Users can see it via /sys/kernel/debug/zram/zram0/block_state.
      
                300    75.033841 ...i
                301    63.806904 s..i
                302    63.806919 ..hi
      
      Once there is IO for the slot, the mark disappears.
      
                300    75.033841 ...
                301    63.806904 s..i
                302    63.806919 ..hi
      
      Therefore, the 300th block is an idle zpage.  With this feature, users
      can see how many idle pages zram has, which are a waste of memory.
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-5-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zram: refactor flags and writeback stuff · 7e529283
      Minchan Kim committed
      Rename some variables and restructure some code for better readability in
      writeback and zs_free_page.
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-4-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zram: fix double free backing device · 5547932d
      Minchan Kim committed
      If blkdev_get fails, we shouldn't do blkdev_put.  Otherwise, the kernel
      emits the log below.  This patch fixes it.
      
        WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 blkdev_put+0x105/0x120
        Modules linked in:
        CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        RIP: 0010:blkdev_put+0x105/0x120
        Call Trace:
          __x64_sys_swapoff+0x46d/0x490
          do_syscall_64+0x5a/0x190
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
        irq event stamp: 4466
        hardirqs last  enabled at (4465):  __free_pages_ok+0x1e3/0x490
        hardirqs last disabled at (4466):  trace_hardirqs_off_thunk+0x1a/0x1c
        softirqs last  enabled at (3420):  __do_softirq+0x333/0x446
        softirqs last disabled at (3407):  irq_exit+0xd1/0xe0
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-3-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zram: fix lockdep warning of free block handling · 3c9959e0
      Minchan Kim committed
      Patch series "zram idle page writeback", v3.
      
      Inherently, a swap device has many idle pages which are rarely touched
      after they are allocated.  This is never a problem if we use a storage
      device as swap.  However, it is just a waste for zram-swap.
      
      This patchset supports the zram idle page writeback feature.
      
      * Admin can define what an idle page is ("no access since X time ago")
      * Admin can define when zram should write them back
      * Admin can define when zram should stop writeback to prevent wearout
      
      Details are in each patch's description.
      
      This patch (of 7):
      
        ================================
        WARNING: inconsistent lock state
        4.19.0+ #390 Not tainted
        --------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
        00000000b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: put_entry_bdev+0x1e/0x50
        {SOFTIRQ-ON-W} state was registered at:
          _raw_spin_lock+0x2c/0x40
          zram_make_request+0x755/0xdc9
          generic_make_request+0x373/0x6a0
          submit_bio+0x6c/0x140
          __swap_writepage+0x3a8/0x480
          shrink_page_list+0x1102/0x1a60
          shrink_inactive_list+0x21b/0x3f0
          shrink_node_memcg.constprop.99+0x4f8/0x7e0
          shrink_node+0x7d/0x2f0
          do_try_to_free_pages+0xe0/0x300
          try_to_free_pages+0x116/0x2b0
          __alloc_pages_slowpath+0x3f4/0xf80
          __alloc_pages_nodemask+0x2a2/0x2f0
          __handle_mm_fault+0x42e/0xb50
          handle_mm_fault+0x55/0xb0
          __do_page_fault+0x235/0x4b0
          page_fault+0x1e/0x30
        irq event stamp: 228412
        hardirqs last  enabled at (228412): [<ffffffff98245846>] __slab_free+0x3e6/0x600
        hardirqs last disabled at (228411): [<ffffffff98245625>] __slab_free+0x1c5/0x600
        softirqs last  enabled at (228396): [<ffffffff98e0031e>] __do_softirq+0x31e/0x427
        softirqs last disabled at (228403): [<ffffffff98072051>] irq_exit+0xd1/0xe0
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(&(&zram->bitmap_lock)->rlock);
          <Interrupt>
            lock(&(&zram->bitmap_lock)->rlock);
      
         *** DEADLOCK ***
      
        no locks held by zram_verify/2095.
      
        stack backtrace:
        CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        Call Trace:
         <IRQ>
         dump_stack+0x67/0x9b
         print_usage_bug+0x1bd/0x1d3
         mark_lock+0x4aa/0x540
         __lock_acquire+0x51d/0x1300
         lock_acquire+0x90/0x180
         _raw_spin_lock+0x2c/0x40
         put_entry_bdev+0x1e/0x50
         zram_free_page+0xf6/0x110
         zram_slot_free_notify+0x42/0xa0
         end_swap_bio_read+0x5b/0x170
         blk_update_request+0x8f/0x340
         scsi_end_request+0x2c/0x1e0
         scsi_io_completion+0x98/0x650
         blk_done_softirq+0x9e/0xd0
         __do_softirq+0xcc/0x427
         irq_exit+0xd1/0xe0
         do_IRQ+0x93/0x120
         common_interrupt+0xf/0xf
         </IRQ>
      
      With the writeback feature, zram_slot_free_notify could be called in
      softirq context by end_swap_bio_read.  However, bitmap_lock is not aware
      of that, so lockdep yells out:
      
        get_entry_bdev
        spin_lock(bitmap->lock);
        irq
        softirq
        end_swap_bio_read
        zram_slot_free_notify
        zram_slot_lock <-- deadlock prone
        zram_free_page
        put_entry_bdev
        spin_lock(bitmap->lock); <-- deadlock prone
      
      With akpm's suggestion (i.e.  bitmap operation is already atomic), we
      could remove bitmap lock.  It might fail to find an empty slot if serious
      contention happens.  However, it's not a severe problem because huge page
      writeback already has the possibility to fail if there is severe memory
      pressure.  The worst case is just keeping the incompressible page in
      memory, not storage.
      
      The other problem is zram_slot_lock in zram_slot_free_notify.  To make it
      safe, this patch introduces zram_slot_trylock, which
      zram_slot_free_notify uses.  Although contention is rare, this patch adds
      a new debug stat "miss_free" to keep monitoring how often it happens.
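
      The trylock-and-count pattern described above can be sketched in
      userspace C11 as follows.  This is only an illustration under stated
      assumptions, not the zram implementation: struct slot, slot_init,
      slot_trylock, slot_unlock and slot_free_notify are hypothetical names,
      and an atomic_flag stands in for the per-slot lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical slot with a one-bit lock, standing in for a zram slot. */
struct slot {
    atomic_flag locked;
    bool allocated;
};

/* Counts how often the free path had to give up (cf. "miss_free"). */
static atomic_ulong miss_free;

/* Put the slot into an unlocked, allocated state. */
static void slot_init(struct slot *s)
{
    atomic_flag_clear(&s->locked);
    s->allocated = true;
}

/* Returns true if the lock was taken; never spins or sleeps. */
static bool slot_trylock(struct slot *s)
{
    return !atomic_flag_test_and_set_explicit(&s->locked,
                                              memory_order_acquire);
}

static void slot_unlock(struct slot *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}

/* Free path that may run in a context where waiting on the lock is unsafe. */
static bool slot_free_notify(struct slot *s)
{
    if (!slot_trylock(s)) {
        atomic_fetch_add(&miss_free, 1);  /* contended: skip and count */
        return false;
    }
    s->allocated = false;
    slot_unlock(s);
    return true;
}
```

      The trade-off is that a contended free is skipped rather than retried,
      which is why a counter is kept to see how often that happens.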
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm/memory_hotplug: drop "online" parameter from add_memory_resource() · f29d8e9c
      David Hildenbrand committed
      Userspace should always be in charge of how to online memory and if memory
      should be onlined automatically in the kernel.  Let's drop the parameter
      to overwrite this - XEN passes memhp_auto_online, just like add_memory(),
      so we can directly use that instead internally.
      
      Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Juergen Gross <jgross@suse.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • W
      drivers/base/memory.c: remove an unnecessary check on NR_MEM_SECTIONS · 3b6fd6ff
      Wei Yang committed
      In cb5e39b8 ("drivers: base: refactor add_memory_section() to
      add_memory_block()"), add_memory_block() is introduced, which is only
      invoked in memory_dev_init().
      
      When combining these two loops in memory_dev_init() and
      add_memory_block(), they look like this:
      
          for (i = 0; i < NR_MEM_SECTIONS; i += sections_per_block)
              for (j = i;
                   (j < i + sections_per_block) && j < NR_MEM_SECTIONS;
                   j++)
      
      Since (i < NR_MEM_SECTIONS) is guaranteed and j stays within its own
      memory block, the check of (j < NR_MEM_SECTIONS) is not necessary.
      
      This patch just removes this check.
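
      A small userspace sketch can illustrate the argument.  It assumes that
      the number of sections is a multiple of the block size, which is the
      condition under which j never leaves its own block; count_sections is a
      hypothetical helper written for this example, not kernel code.

```c
#include <assert.h>

/*
 * Count how many sections the nested loops visit, with or without the
 * extra (j < nr_sections) bound check in the inner loop.
 */
static unsigned long count_sections(unsigned long nr_sections,
                                    unsigned long per_block,
                                    int keep_check)
{
    unsigned long i, j, visited = 0;

    /* Assumption of this sketch: blocks tile the section range exactly. */
    assert(per_block != 0 && nr_sections % per_block == 0);

    for (i = 0; i < nr_sections; i += per_block)
        for (j = i;
             j < i + per_block && (!keep_check || j < nr_sections);
             j++)
            visited++;

    return visited;
}
```

      Under that assumption both variants visit exactly nr_sections entries,
      so dropping the inner check does not change the iteration space.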
      
      Link: http://lkml.kernel.org/r/20181123222811.18216-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL · 02917e9f
      Dan Williams committed
      At Maintainer Summit, Greg brought up a topic I proposed around
      EXPORT_SYMBOL_GPL usage.  The motivation was considerations for when
      EXPORT_SYMBOL_GPL is warranted and the criteria for taking the exceptional
      step of reclassifying an existing export.  Specifically, I wanted to make
      the case that although the line is fuzzy and hard to specify in abstract
      terms, it is nonetheless clear that devm_memremap_pages() and HMM
      (Heterogeneous Memory Management) have crossed it.  The
      devm_memremap_pages() facility should have been EXPORT_SYMBOL_GPL from the
      beginning, and HMM as a derivative of that functionality should have
      naturally picked up that designation as well.
      
      Contrary to typical rules, the HMM infrastructure was merged upstream with
      zero in-tree consumers.  There was a promise at the time that those users
      would be merged "soon", but it has been over a year with no drivers
      arriving.  While the Nouveau driver is about to belatedly make good on
      that promise it is clear that HMM was targeted first and foremost at an
      out-of-tree consumer.
      
      HMM is derived from devm_memremap_pages(), a facility Christoph and I
      spearheaded to support persistent memory.  It combines a device lifetime
      model with a dynamically created 'struct page' / memmap array for any
      physical address range.  It enables coordination and control of the many
      code paths in the kernel built to interact with memory via 'struct page'
      objects.  With HMM the integration goes even deeper by allowing device
      drivers to hook and manipulate page fault and page free events.
      
      One interpretation of when EXPORT_SYMBOL is suitable is when it is
      exporting stable and generic leaf functionality.  The
      devm_memremap_pages() facility continues to see expanding use cases,
      peer-to-peer DMA being the most recent, with no clear end date when it
      will stop attracting reworks and semantic changes.  It is not suitable to
      export devm_memremap_pages() as a stable 3rd party driver API due to the
      fact that it is still changing and manipulates core behavior.  Moreover,
      it is not in the best interest of the long term development of the core
      memory management subsystem to permit any external driver to effectively
      define its own system-wide memory management policies with no
      encouragement to engage with upstream.
      
      I am also concerned that HMM was designed in a way to minimize further
      engagement with the core-MM.  That, with these hooks in place,
      device-drivers are free to implement their own policies without much
      consideration for whether and how the core-MM could grow to meet that
      need.  Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the
      core-MM should be allowed the opportunity and stimulus to change and
      address these new use cases as first class functionality.
      
      Original changelog:
      
      hmm_devmem_add(), and hmm_devmem_add_resource() duplicated
      devm_memremap_pages() and are now simple wrappers around the core
      facility to inject a dev_pagemap instance into the global pgmap_radix and
      hook page-idle events.  The devm_memremap_pages() interface is base
      infrastructure for HMM.  HMM has more and deeper ties into the kernel
      memory management implementation than base ZONE_DEVICE which is itself an
      EXPORT_SYMBOL_GPL facility.
      
      Originally, the HMM page structure creation routines copied the
      devm_memremap_pages() code and reused ZONE_DEVICE.  A cleanup to unify the
      implementations was discussed during the initial review:
      http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html Recent work to
      extend devm_memremap_pages() for the peer-to-peer-DMA facility enabled
      this cleanup to move forward.
      
      In addition to the integration with devm_memremap_pages() HMM depends on
      other GPL-only symbols:
      
          mmu_notifier_unregister_no_release
          percpu_ref
          region_intersects
          __class_create
      
      It goes further to consume / indirectly expose functionality that is not
      exported to any other driver:
      
          alloc_pages_vma
          walk_page_range
      
      HMM is derived from devm_memremap_pages(), and extends deep core-kernel
      fundamentals. Similar to devm_memremap_pages(), mark its entry points
      EXPORT_SYMBOL_GPL().
      
      [logang@deltatee.com: PCI/P2PDMA: match interface changes to devm_memremap_pages()]
        Link: http://lkml.kernel.org/r/20181130225911.2900-1-logang@deltatee.com
      Link: http://lkml.kernel.org/r/154275560565.76910.15919297436557795278.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm, devm_memremap_pages: fix shutdown handling · a95c90f1
      Dan Williams committed
      The last step before devm_memremap_pages() returns success is to allocate
      a release action, devm_memremap_pages_release(), to tear the entire setup
      down.  However, the result from devm_add_action() is not checked.
      
      Checking the error from devm_add_action() is not enough.  The api
      currently relies on the fact that the percpu_ref it is using is killed by
      the time the devm_memremap_pages_release() is run.  Rather than continue
      this awkward situation, offload the responsibility of killing the
      percpu_ref to devm_memremap_pages_release() directly.  This allows
      devm_memremap_pages() to do the right thing relative to init failures and
      shutdown.
      
      Without this change we could fail to register the teardown of
      devm_memremap_pages().  The likelihood of hitting this failure is tiny as
      small memory allocations almost always succeed.  However, the impact of
      the failure is large given any future reconfiguration, or disable/enable,
      of an nvdimm namespace will fail forever as subsequent calls to
      devm_memremap_pages() will fail to setup the pgmap_radix since there will
      be stale entries for the physical address range.
      
      An argument could be made to require that the ->kill() operation be set in
      the @pgmap arg rather than passed in separately.  However, it helps code
      readability, tracking the lifetime of a given instance, to be able to grep
      the kill routine directly at the devm_memremap_pages() call site.
      
      Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Fixes: e8d51348 ("memremap: change devm_memremap_pages interface...")
      Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
      Reported-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm: convert totalram_pages and totalhigh_pages variables to atomic · ca79b0c2
      Arun KS committed
      totalram_pages and totalhigh_pages are made static inline functions.
      
      The main motivation was that managed_page_count_lock handling was
      complicating things.  It was discussed at length here:
      https://lore.kernel.org/patchwork/patch/995739/#1181785
      So it seems better to remove the lock and convert the variables to
      atomic, with prevention of potential store-to-read tearing as a bonus.
      
      [akpm@linux-foundation.org: coding style fixes]
      Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm: convert zone->managed_pages to atomic variable · 9705bea5
      Arun KS committed
      totalram_pages, zone->managed_pages and totalhigh_pages updates are
      protected by managed_page_count_lock, but readers never care about it.
      Convert these variables to atomic to avoid readers potentially seeing a
      store tear.
      
      This patch converts zone->managed_pages.  Subsequent patches will convert
      totalram_pages and totalhigh_pages, and eventually
      managed_page_count_lock will be removed.
      
      The main motivation was that managed_page_count_lock handling was
      complicating things.  It was discussed at length here:
      https://lore.kernel.org/patchwork/patch/995739/#1181785
      So it seems better to remove the lock and convert the variables to
      atomic, with prevention of potential store-to-read tearing as a bonus.
      
      Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm: reference totalram_pages and managed_pages once per function · 3d6357de
      Arun KS committed
      Patch series "mm: convert totalram_pages, totalhigh_pages and managed
      pages to atomic", v5.
      
      This series converts totalram_pages, totalhigh_pages and
      zone->managed_pages to atomic variables.
      
      totalram_pages, zone->managed_pages and totalhigh_pages updates are
      protected by managed_page_count_lock, but readers never care about it.
      Convert these variables to atomic to avoid readers potentially seeing a
      store tear.
      
      The main motivation was that managed_page_count_lock handling was
      complicating things.  It was discussed at length here:
      https://lore.kernel.org/patchwork/patch/995739/#1181785
      It seems better to remove the lock and convert the variables to atomic.
      With the change, preventing potential store-to-read tearing comes as a
      bonus.
      
      This patch (of 4):
      
      This is in preparation for a later patch which converts totalram_pages
      and zone->managed_pages to atomic variables.  Please note that
      re-reading the value might yield a different value each time, and as
      such it could lead to unexpected behavior.  There are no known bugs as
      a result of the current code, but it is better to prevent them in
      principle.
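
      The "reference once per function" pattern can be sketched in userspace
      C as follows.  The variable and function names are hypothetical and
      only illustrate the pattern; they do not appear in the patch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the global page counter. */
static atomic_ulong totalram_pages_sketch;

/*
 * Load the counter once into a local and reuse the local.  Loading it
 * separately for the comparison and for the return value could observe two
 * different values if a writer updates the counter in between.
 */
static unsigned long half_of_ram_clamped(unsigned long limit)
{
	unsigned long nr_pages = atomic_load(&totalram_pages_sketch);

	if (nr_pages / 2 > limit)
		return limit;
	return nr_pages / 2;
}
```

      With a single load, the comparison and the returned value are always
      derived from the same snapshot.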
      
      Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 25 Dec, 2018 7 commits
    • C
      drivers/net: appletalk/cops: remove redundant if statement and mask · bd437c99
      Colin Ian King committed
      The two different assignments for pkt_len are actually the same, so
      the if statement is redundant and can be removed.  Masking a u8
      return value from inb() with 0xFF is also redundant and can also be
      removed.

      Similarly, the two different outb calls are identical, as the mask
      of 0xff on the second outb is redundant since a u8 is being written,
      so that if statement is also redundant and can also be removed.
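
      A small userspace check of the masking argument.  fake_inb() below is
      a hypothetical stand-in for inb() that returns an 8-bit value; it is
      not the driver code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for inb(): always returns an 8-bit value. */
static uint8_t fake_inb(unsigned int raw)
{
	return (uint8_t)raw;
}

/* The value is already limited to 0..255, so the mask changes nothing. */
static int read_masked(unsigned int raw)
{
	return fake_inb(raw) & 0xFF;
}

/* Equivalent read without the redundant mask. */
static int read_plain(unsigned int raw)
{
	return fake_inb(raw);
}
```

      Both helpers return the same result for every input, which is why two
      branches that differ only by such a mask collapse into one.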
      
      Detected by CoverityScan, CID#1475639 ("Identical code for different
      branches")
      
      V2: Remove the if statement for the outb calls, thanks to David
      Miller for spotting this.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw · 38355a5f
      Ivan Mironov committed
      This happened when I tried to boot a normal Fedora 29 system with the
      latest available kernel (from Fedora rawhide, plus some unrelated
      custom patches):
      
      	BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      	PGD 0 P4D 0
      	Oops: 0010 [#1] SMP PTI
      	CPU: 6 PID: 1422 Comm: libvirtd Tainted: G          I       4.20.0-0.rc7.git3.hpsa2.1.fc29.x86_64 #1
      	Hardware name: HP ProLiant BL460c G6, BIOS I24 05/21/2018
      	RIP: 0010:          (null)
      	Code: Bad RIP value.
      	RSP: 0018:ffffa47ccdc9fbe0 EFLAGS: 00010246
      	RAX: 0000000000000000 RBX: 00000000000003e8 RCX: ffffa47ccdc9fbf8
      	RDX: ffffa47ccdc9fc00 RSI: ffff97d9ee7b01f8 RDI: ffff97d9f0150b80
      	RBP: ffff97d9f0150b80 R08: 0000000000000000 R09: 0000000000000000
      	R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
      	R13: ffff97d9ef1e53e8 R14: 0000000000000009 R15: ffff97d9f0ac6730
      	FS:  00007f4d224ef700(0000) GS:ffff97d9fa200000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: ffffffffffffffd6 CR3: 00000011ece52006 CR4: 00000000000206e0
      	Call Trace:
      	 ? bnx2x_chip_cleanup+0x195/0x610 [bnx2x]
      	 ? bnx2x_nic_unload+0x1e2/0x8f0 [bnx2x]
      	 ? bnx2x_reload_if_running+0x24/0x40 [bnx2x]
      	 ? bnx2x_set_features+0x79/0xa0 [bnx2x]
      	 ? __netdev_update_features+0x244/0x9e0
      	 ? netlink_broadcast_filtered+0x136/0x4b0
      	 ? netdev_update_features+0x22/0x60
      	 ? dev_disable_lro+0x1c/0xe0
      	 ? devinet_sysctl_forward+0x1c6/0x211
      	 ? proc_sys_call_handler+0xab/0x100
      	 ? __vfs_write+0x36/0x1a0
      	 ? rcu_read_lock_sched_held+0x79/0x80
      	 ? rcu_sync_lockdep_assert+0x2e/0x60
      	 ? __sb_start_write+0x14c/0x1b0
      	 ? vfs_write+0x159/0x1c0
      	 ? vfs_write+0xba/0x1c0
      	 ? ksys_write+0x52/0xc0
      	 ? do_syscall_64+0x60/0x1f0
      	 ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      After some investigation I figured out that the recently added cleanup
      code tries to call a VLAN filtering de-initialization function that
      exists only for newer hardware. The corresponding function pointer is
      not set (== 0) for older hardware, namely these chips:
      
      	#define CHIP_NUM_57710			0x164e
      	#define CHIP_NUM_57711			0x164f
      	#define CHIP_NUM_57711E			0x1650
      
      And I have one of those in my test system:
      
      	Broadcom Inc. and subsidiaries NetXtreme II BCM57711E 10-Gigabit PCIe [14e4:1650]
      
      The function bnx2x_init_vlan_mac_fp_objs() from
      drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h decides whether or not
      to initialize the relevant pointers in bnx2x_sp_objs.vlan_obj.
      
      This regression was introduced after v4.20-rc7, and still exists in
      the v4.20 release.
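
      The failure mode and a guard against it can be sketched in userspace
      C.  This is not the actual bnx2x change; the struct and function
      names are hypothetical and only illustrate checking an optional
      function pointer before calling it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object whose callback is only set up on newer hardware. */
struct vlan_obj_sketch {
	int (*delete_all)(struct vlan_obj_sketch *obj); /* NULL on older chips */
	int deleted;
};

/* Example callback used on hardware that supports VLAN filtering. */
static int delete_all_impl(struct vlan_obj_sketch *obj)
{
	obj->deleted = 1;
	return 0;
}

/* Cleanup path: skip the call when the callback was never initialized. */
static int del_all_vlans_sketch(struct vlan_obj_sketch *obj)
{
	if (!obj->delete_all)
		return 0; /* nothing to clean up on this hardware */
	return obj->delete_all(obj);
}
```

      Without the NULL check, the cleanup path would jump to address 0,
      which matches the "RIP: (null)" line in the oops above.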
      
      Fixes: 04f05230 ("bnx2x: Remove configured vlans as part of unload sequence.")
      Signed-off-by: Ivan Mironov <mironov.ivan@gmail.com>
      Acked-by: Sudarsana Kalluru <Sudarsana.Kalluru@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      net/mlx4_core: drop useless LIST_HEAD · 61988bd2
      Julia Lawall committed
      Drop LIST_HEAD where the variable it declares has never
      been used.
      
      The semantic patch that fixes this problem is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @@
      identifier x;
      @@
      - LIST_HEAD(x);
        ... when != x
      // </smpl>
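
      For illustration, the kind of code the semantic patch matches can be
      sketched in userspace C.  The list_head definitions mirror the
      kernel's macros; the two functions are hypothetical and are not taken
      from the driver.

```c
#include <assert.h>

/* Userspace copies of the kernel's list head and its declaration macros. */
struct list_head {
	struct list_head *next, *prev;
};
#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

/* Before: the list is declared but never referenced again. */
static int count_before(int n)
{
	LIST_HEAD(unused_list);

	return n + 1;
}

/* After: the declaration is dropped and the behavior is unchanged. */
static int count_after(int n)
{
	return n + 1;
}
```

      The "... when != x" line in the semantic patch restricts the match to
      functions like count_before(), where nothing after the declaration
      mentions the variable.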
      
      Fixes: c82e9aa0 ("mlx4_core: resource tracking for HCA resources used by guests")
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      mlxsw: spectrum: drop useless LIST_HEAD · d0863792
      Julia Lawall committed
      Drop LIST_HEAD where the variable it declares is never used.
      
      The uses were removed in 244cd96a ("net_sched: remove
      list_head from tc_action"), but not the declaration.
      
      The semantic patch that fixes this problem is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @@
      identifier x;
      @@
      - LIST_HEAD(x);
        ... when != x
      // </smpl>
      
      Fixes: 244cd96a ("net_sched: remove list_head from tc_action")
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      net/mlx5e: drop useless LIST_HEAD · 2534f14a
      Julia Lawall committed
      Drop LIST_HEAD where the variable it declares is never used.
      
      These became useless in 244cd96a ("net_sched: remove list_head
      from tc_action")
      
      The semantic patch that fixes this problem is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @@
      identifier x;
      @@
      - LIST_HEAD(x);
        ... when != x
      // </smpl>
      
      Fixes: 244cd96a ("net_sched: remove list_head from tc_action")
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • K
      net/mlx5e: fix semicolon.cocci warnings · 5d1f7354
      kbuild test robot committed
      drivers/net/ethernet/mellanox/mlx5/core/en_rep.c:1339:57-58: Unneeded semicolon
      
       Remove unneeded semicolon.
      
      Generated by: scripts/coccinelle/misc/semicolon.cocci
      
      Fixes: 4c8fb298 ("net/mlx5e: Increase VF representors' SQ size to 128")
      CC: Gavi Teitz <gavi@mellanox.com>
      Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • F
      staging: octeon: fix build failure with XFRM enabled · 8762cdcd
      Florian Westphal committed
      skb->sp doesn't exist anymore in the net-next tree, so mips defconfig
      no longer builds.  Use helper instead to reset the secpath.
      
      Not even compile tested.
      
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Fixes: 4165079b ("net: switch secpath to use skb extension infrastructure")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 24 Dec, 2018 5 commits
  4. 23 Dec, 2018 13 commits