1. 03 Aug, 2017 7 commits
    • H
      mm: take memory hotplug lock within numa_zonelist_order_handler() · 167d0f25
      Heiko Carstens committed
      Andre Wild reported the following warning:
      
        WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
        Modules linked in:
        CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57 #10
        Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
        task: 00000000701d8100 task.stack: 0000000073594000
        Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60)
        ...
        Call Trace:
         lockdep_assert_cpus_held+0x42/0x60)
         stop_machine_cpuslocked+0x62/0xf0
         build_all_zonelists+0x92/0x150
         numa_zonelist_order_handler+0x102/0x150
         proc_sys_call_handler.isra.12+0xda/0x118
         proc_sys_write+0x34/0x48
         __vfs_write+0x3c/0x178
         vfs_write+0xbc/0x1a0
         SyS_write+0x66/0xc0
         system_call+0xc4/0x2b0
         locks held by bash/1205:
         #0:  (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0
         #1:  (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150
         #2:  (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150
        Last Breaking-Event-Address:
          lockdep_assert_cpus_held+0x48/0x60
      
      This can be easily triggered with e.g.
      
          echo n > /proc/sys/vm/numa_zonelist_order
      
      In commit 3f906ba2 ("mm/memory-hotplug: switch locking to a percpu
      rwsem") memory hotplug locking was changed to fix a potential deadlock.
      
      This also switched the stop_machine() invocation within
      build_all_zonelists() to stop_machine_cpuslocked() which now expects
      that online cpus are locked when being called.
      
      This assumption is not true if build_all_zonelists() is being called
      from numa_zonelist_order_handler().
      
      In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done()
      pair to numa_zonelist_order_handler().
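
      The locking rule can be illustrated with a small userspace toy model.
      This is only a sketch: the functions reuse the kernel names from the
      text but are stand-ins that track a flag, and missing_lock_warnings is
      a hypothetical counter standing in for the lockdep warning.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel primitives named above; not kernel code. */
static bool cpus_locked;            /* models "online cpus are locked" */
static int missing_lock_warnings;   /* hypothetical: counts lockdep-style warnings */

static void mem_hotplug_begin(void)
{
    cpus_locked = true;
}

static void mem_hotplug_done(void)
{
    assert(cpus_locked);            /* done must pair with a prior begin */
    cpus_locked = false;
}

/* Models stop_machine_cpuslocked(): the caller must already hold the lock. */
static void stop_machine_cpuslocked(void)
{
    if (!cpus_locked)
        missing_lock_warnings++;
}

static void build_all_zonelists(void)
{
    stop_machine_cpuslocked();
}

/* Handler before the fix: reaches build_all_zonelists() without the lock. */
static void handler_unfixed(void)
{
    build_all_zonelists();
}

/* Handler after the fix: wraps the call in the begin/done pair. */
static void handler_fixed(void)
{
    mem_hotplug_begin();
    build_all_zonelists();
    mem_hotplug_done();
}
```

      In this model only handler_unfixed() increments the warning counter,
      which mirrors the report above.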
      
      Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com
      Fixes: 3f906ba2 ("mm/memory-hotplug: switch locking to a percpu rwsem")
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: Andre Wild <wild@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      167d0f25
    • T
      mm/page_io.c: fix oops during block io poll in swapin path · b0ba2d0f
      Tetsuo Handa committed
      When a thread is OOM-killed during swap_readpage() operation, an oops
      occurs because end_swap_bio_read() is calling wake_up_process() based on
      an assumption that the thread which called swap_readpage() is still
      alive.
      
        Out of memory: Kill process 525 (polkitd) score 0 or sacrifice child
        Killed process 525 (polkitd) total-vm:528128kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB
        oom_reaper: reaped process 525 (polkitd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
        Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon sg shpchp vmw_vmci parport_pc parport i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi vmwgfx ahci libahci drm_kms_helper ata_piix syscopyarea sysfillrect sysimgblt fb_sys_fops mptspi scsi_transport_spi ttm e1000 mptscsih drm mptbase i2c_core libata serio_raw
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-rc2-next-20170725 #129
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
        task: ffffffffb7c16500 task.stack: ffffffffb7c00000
        RIP: 0010:__lock_acquire+0x151/0x12f0
        Call Trace:
         <IRQ>
         lock_acquire+0x59/0x80
         _raw_spin_lock_irqsave+0x3b/0x4f
         try_to_wake_up+0x3b/0x410
         wake_up_process+0x10/0x20
         end_swap_bio_read+0x6f/0xf0
         bio_endio+0x92/0xb0
         blk_update_request+0x88/0x270
         scsi_end_request+0x32/0x1c0
         scsi_io_completion+0x209/0x680
         scsi_finish_command+0xd4/0x120
         scsi_softirq_done+0x120/0x140
         __blk_mq_complete_request_remote+0xe/0x10
         flush_smp_call_function_queue+0x51/0x120
         generic_smp_call_function_single_interrupt+0xe/0x20
         smp_trace_call_function_single_interrupt+0x22/0x30
         smp_call_function_single_interrupt+0x9/0x10
         call_function_single_interrupt+0xa7/0xb0
         </IRQ>
        RIP: 0010:native_safe_halt+0x6/0x10
         default_idle+0xe/0x20
         arch_cpu_idle+0xa/0x10
         default_idle_call+0x1e/0x30
         do_idle+0x187/0x200
         cpu_startup_entry+0x6e/0x70
         rest_init+0xd0/0xe0
         start_kernel+0x456/0x477
         x86_64_start_reservations+0x24/0x26
         x86_64_start_kernel+0xf7/0x11a
         secondary_startup_64+0xa5/0xa5
        Code: c3 49 81 3f 20 9e 0b b8 41 bc 00 00 00 00 44 0f 45 e2 83 fe 01 0f 87 62 ff ff ff 89 f0 49 8b 44 c7 08 48 85 c0 0f 84 52 ff ff ff <f0> ff 80 98 01 00 00 8b 3d 5a 49 c4 01 45 8b b3 18 0c 00 00 85
        RIP: __lock_acquire+0x151/0x12f0 RSP: ffffa01f39e03c50
        ---[ end trace 6c441db499169b1e ]---
        Kernel panic - not syncing: Fatal exception in interrupt
        Kernel Offset: 0x36000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
        ---[ end Kernel panic - not syncing: Fatal exception in interrupt
      
      Fix it by holding a reference to the thread.
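
      The idea of pinning the waiter can be sketched in userspace C.  This is
      a single-threaded toy model, not the kernel code: struct toy_task and
      all toy_* helpers are hypothetical names, and a real implementation
      would need atomic reference counting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy task object with a reference count; a stand-in, not task_struct. */
struct toy_task {
    int refcount;
    bool woken;
    bool *freed;    /* set to true when the last reference is dropped */
};

static struct toy_task *toy_task_new(bool *freed)
{
    struct toy_task *t = calloc(1, sizeof(*t));

    assert(t != NULL);
    t->refcount = 1;    /* reference owned by the running thread itself */
    t->freed = freed;
    *freed = false;
    return t;
}

static void toy_task_get(struct toy_task *t)
{
    t->refcount++;
}

static void toy_task_put(struct toy_task *t)
{
    assert(t->refcount > 0);
    if (--t->refcount == 0) {
        *t->freed = true;
        free(t);
    }
}

/* Submit side: pin the waiter so the completion handler can safely use it. */
static struct toy_task *toy_submit_read(struct toy_task *waiter)
{
    toy_task_get(waiter);
    return waiter;      /* stored with the in-flight I/O in this model */
}

/* Completion side: wake the waiter, then drop the reference taken at submit. */
static void toy_end_read(struct toy_task *waiter)
{
    waiter->woken = true;
    toy_task_put(waiter);
}
```

      Even if the thread exits and drops its own reference while the read is
      in flight, the object stays valid until toy_end_read() has run.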
      
      [akpm@linux-foundation.org: add comment]
      Fixes: 23955622 ("swap: add block io poll in swapin path")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: Shaohua Li <shli@fb.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0ba2d0f
    • M
      zram: do not free pool->size_class · 3189c820
      Minchan Kim committed
      Mike reported that the kernel oopses when running the ltp:zram03 test case.
      
        zram: Added device: zram0
        zram0: detected capacity change from 0 to 107374182400
        BUG: unable to handle kernel paging request at 0000306d61727a77
        IP: zs_map_object+0xb9/0x260
        PGD 0
        P4D 0
        Oops: 0000 [#1] SMP
        Dumping ftrace buffer:
           (ftrace buffer empty)
        Modules linked in: zram(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) loop(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) af_packet(E) br_netfilter(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) intel_powerclamp(E) coretemp(E) cdc_ether(E) kvm_intel(E) usbnet(E) mii(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) iTCO_wdt(E) ghash_clmulni_intel(E) bnx2(E) iTCO_vendor_support(E) pcbc(E) ioatdma(E) ipmi_ssif(E) aesni_intel(E) i5500_temp(E) i2c_i801(E) aes_x86_64(E) lpc_ich(E) shpchp(E) mfd_core(E) crypto_simd(E) i7core_edac(E) dca(E) glue_helper(E) cryptd(E) ipmi_si(E) button(E) acpi_cpufreq(E) ipmi_devintf(E) pcspkr(E) ipmi_msghandler(E)
         nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ext4(E) crc16(E) mbcache(E) jbd2(E) sd_mod(E) ata_generic(E) i2c_algo_bit(E) ata_piix(E) drm_kms_helper(E) ahci(E) syscopyarea(E) sysfillrect(E) libahci(E) sysimgblt(E) fb_sys_fops(E) uhci_hcd(E) ehci_pci(E) ttm(E) ehci_hcd(E) libata(E) drm(E) megaraid_sas(E) usbcore(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E) [last unloaded: zram]
        CPU: 6 PID: 12356 Comm: swapon Tainted: G            E   4.13.0.g87b2c3fc-default #194
        Hardware name: IBM System x3550 M3 -[7944K3G]-/69Y5698     , BIOS -[D6E150AUS-1.10]- 12/15/2010
        task: ffff880158d2c4c0 task.stack: ffffc90001680000
        RIP: 0010:zs_map_object+0xb9/0x260
        Call Trace:
         zram_bvec_rw.isra.26+0xe8/0x780 [zram]
         zram_rw_page+0x6e/0xa0 [zram]
         bdev_read_page+0x81/0xb0
         do_mpage_readpage+0x51a/0x710
         mpage_readpages+0x122/0x1a0
         blkdev_readpages+0x1d/0x20
         __do_page_cache_readahead+0x1b2/0x270
         ondemand_readahead+0x180/0x2c0
         page_cache_sync_readahead+0x31/0x50
         generic_file_read_iter+0x7e7/0xaf0
         blkdev_read_iter+0x37/0x40
         __vfs_read+0xce/0x140
         vfs_read+0x9e/0x150
         SyS_read+0x46/0xa0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        Code: 81 e6 00 c0 3f 00 81 fe 00 00 16 00 0f 85 9f 01 00 00 0f b7 13 65 ff 05 5e 07 dc 7e 66 c1 ea 02 81 e2 ff 01 00 00 49 8b 54 d4 08 <8b> 4a 48 41 0f af ce 81 e1 ff 0f 00 00 41 89 c9 48 c7 c3 a0 70
        RIP: zs_map_object+0xb9/0x260 RSP: ffffc90001683988
        CR2: 0000306d61727a77
      
      He bisected the problem to commit cf8e0fed ("mm/zsmalloc: simplify
      zs_max_alloc_size handling").
      
      After that commit, pool->size_class is no longer a separately allocated
      double pointer in zs_create_pool(), so its counterpart zs_destroy_pool()
      must not free it either.
      
      Otherwise, kfree() is called on a wrong address and the kernel oopses.
      
      Link: http://lkml.kernel.org/r/20170725062650.GA12134@bbox
      Fixes: cf8e0fed ("mm/zsmalloc: simplify zs_max_alloc_size handling")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Mike Galbraith <efault@gmx.de>
      Tested-by: Mike Galbraith <efault@gmx.de>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3189c820
    • A
      kasan: avoid -Wmaybe-uninitialized warning · e7701557
      Arnd Bergmann committed
      gcc-7 produces this warning:
      
        mm/kasan/report.c: In function 'kasan_report':
        mm/kasan/report.c:351:3: error: 'info.first_bad_addr' may be used uninitialized in this function [-Werror=maybe-uninitialized]
           print_shadow_for_address(info->first_bad_addr);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        mm/kasan/report.c:360:27: note: 'info.first_bad_addr' was declared here
      
      The code seems fine as we only print info.first_bad_addr when there is a
      shadow, and we always initialize it in that case, but this is relatively
      hard for gcc to figure out after the latest rework.
      
      Adding an initialization to the most likely value together with the other
      struct members shuts up that warning.
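
      A minimal sketch of that pattern in plain C follows.  The struct, the
      toy_* helpers and the choice of the access address as the "most likely
      value" are assumptions made for illustration, not the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical report descriptor, loosely modelled on the field named above. */
struct toy_access_info {
    const void *access_addr;
    const void *first_bad_addr;
    size_t access_size;
    bool is_write;
};

/* Pretend shadow lookup (illustrative only): the bad byte is one past the start. */
static const void *toy_find_first_bad_addr(const void *addr)
{
    return (const char *)addr + 1;
}

static struct toy_access_info toy_make_info(const void *addr, size_t size,
                                            bool is_write, bool has_shadow)
{
    /*
     * Initialize first_bad_addr together with the other members so that no
     * path can read it uninitialized; the access address is the assumed default.
     */
    struct toy_access_info info = {
        .access_addr = addr,
        .first_bad_addr = addr,
        .access_size = size,
        .is_write = is_write,
    };

    assert(addr != NULL);
    if (has_shadow)
        info.first_bad_addr = toy_find_first_bad_addr(addr);
    return info;
}
```

      With the default in place, the compiler no longer has to prove that
      the shadow path always runs before the field is read.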
      
      Fixes: b235b9808664 ("kasan: unify report headers")
      Link: https://patchwork.kernel.org/patch/9641417/
      Link: http://lkml.kernel.org/r/20170725152739.4176967-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: Alexander Potapenko <glider@google.com>
      Suggested-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7701557
    • M
      userfaultfd: non-cooperative: notify about unmap of destination during mremap · b2282371
      Mike Rapoport committed
      When mremap is called with MREMAP_FIXED it unmaps memory at the
      destination address without notifying the userfaultfd monitor.
      
      If the destination was registered with userfaultfd, the monitor has no
      way to distinguish between the old and new ranges and to properly relate
      the page faults that would occur in the destination region.
      
      Fixes: 897ab3e0 ("userfaultfd: non-cooperative: add event for memory unmaps")
      Link: http://lkml.kernel.org/r/1500276876-3350-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2282371
    • M
      mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries · 3ea27719
      Mel Gorman committed
      Nadav Amit identified a theoretical race between page reclaim and
      mprotect because TLB flushes are batched outside of the PTL.
      
      He described the race as follows:
      
              CPU0                            CPU1
              ----                            ----
                                              user accesses memory using RW PTE
                                              [PTE now cached in TLB]
              try_to_unmap_one()
              ==> ptep_get_and_clear()
              ==> set_tlb_ubc_flush_pending()
                                              mprotect(addr, PROT_READ)
                                              ==> change_pte_range()
                                              ==> [ PTE non-present - no flush ]
      
                                              user writes using cached RW PTE
              ...
      
              try_to_unmap_flush()
      
      The same type of race exists for reads when protecting for PROT_NONE and
      also exists for operations that can leave an old TLB entry behind such
      as munmap, mremap and madvise.
      
      For some operations like mprotect, it's not necessarily a data integrity
      issue but it is a correctness issue as there is a window where an
      mprotect that limits access still allows access.  For munmap, it's
      potentially a data integrity issue, although the race is very hard to
      hit as an munmap, mmap and return to userspace must all complete within
      the window between reclaim dropping the PTL and flushing the TLB.
      However, it's theoretically possible so handle this issue by flushing
      the mm if reclaim is potentially currently batching TLB flushes.
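
      The guard can be modelled in a few lines of userspace C.  This is a
      sketch under simplifying assumptions: struct toy_mm and every toy_*
      function are hypothetical, a counter stands in for the actual TLB
      flush, and memory ordering and locking are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy address space; field and function names are illustrative assumptions. */
struct toy_mm {
    bool tlb_flush_batched;  /* reclaim cleared a PTE and deferred the flush */
    int flush_count;         /* number of flushes performed in this model */
};

/* Reclaim side: unmap under the PTL but only record that a flush is owed. */
static void toy_reclaim_unmap(struct toy_mm *mm)
{
    assert(mm);
    mm->tlb_flush_batched = true;
}

/* mprotect/munmap side: flush first if reclaim may still owe a flush. */
static void toy_flush_batched_pending(struct toy_mm *mm)
{
    if (mm->tlb_flush_batched) {
        mm->flush_count++;              /* stands in for flushing the mm */
        mm->tlb_flush_batched = false;
    }
}

/*
 * Toy protection change. Returns true if a stale TLB entry could still be
 * live afterwards, i.e. the deferred flush has not happened yet.
 */
static bool toy_mprotect(struct toy_mm *mm, bool guarded)
{
    if (guarded)
        toy_flush_batched_pending(mm);
    return mm->tlb_flush_batched;
}
```

      The unguarded call leaves the stale entry in place, while the guarded
      call pays for one extra flush only when a batched flush is pending.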
      
      Other instances where a flush is required for a present pte should be ok
      as either the page lock is held preventing parallel reclaim or a page
      reference count is elevated preventing a parallel free leading to
      corruption.  In the case of page_mkclean there isn't an obvious path
      that userspace could take advantage of without using the operations that
      are guarded by this patch.  Other users such as gup, when racing with
      reclaim, look only at PTEs.  Huge page variants should be ok as they
      don't race with reclaim.  mincore only looks at PTEs.  userfault also
      should be ok as if a parallel reclaim takes place, it will either fault
      the page back in or read some of the data before the flush occurs
      triggering a fault.
      
      Note that a variant of this patch was acked by Andy Lutomirski but this
      was for the x86 parts on top of his PCID work which didn't make the 4.13
      merge window as expected.  His ack is dropped from this version and
      there will be a follow-on patch on top of PCID that will include his
      ack.
      
      [akpm@linux-foundation.org: tweak comments]
      [akpm@linux-foundation.org: fix spello]
      Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: <stable@vger.kernel.org>	[v4.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ea27719
    • D
      mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors · 2be7cfed
      Daniel Jordan committed
      Commit 9a291a7c ("mm/hugetlb: report -EHWPOISON not -EFAULT when
      FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain
      errors from follow_hugetlb_page.  After such an error, __get_user_pages
      subsequently calls faultin_page on the same VMA and start address that
      follow_hugetlb_page failed on instead of returning the error immediately
      as it should.
      
      In follow_hugetlb_page, when hugetlb_fault returns a value covered under
      VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages
      to 0 as __get_user_pages expects in this case, which causes the
      following to happen in __get_user_pages: the "while (nr_pages)" check
      succeeds, we skip the "if (!vma..." check because we got a VMA the last
      time around, we find no page with follow_page_mask, and we call
      faultin_page, which calls hugetlb_fault for the second time.
      
      This issue also slightly changes how __get_user_pages works.  Before, it
      only returned an error if it had made no progress (i = 0).  But now,
      follow_hugetlb_page can clobber "i" with an error code since its new
      return path doesn't check for progress.  So if "i" is nonzero before a
      failing call to follow_hugetlb_page, that indication of progress is lost
      and __get_user_pages can return an error even if some pages were
      successfully pinned.
      
      To fix this, change follow_hugetlb_page so that it updates nr_pages,
      allowing __get_user_pages to fail immediately and restoring the "error
      only if no progress" behavior to __get_user_pages.
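
      The intended contract can be sketched as a userspace toy.  The toy_*
      functions and the fail_at knob are hypothetical and only model the
      control flow described above: on error the helper zeroes nr_pages and
      reports the progress made so far, or the error if there was none.

```c
#include <assert.h>
#include <errno.h>

/*
 * Toy follow_hugetlb_page(): pins up to *nr_pages pages starting at index i.
 * fail_at is a hypothetical knob: the page index at which the fault fails
 * (use -1 for "never fails").
 */
static long toy_follow_hugetlb_page(long i, long *nr_pages, long fail_at)
{
    long remaining = *nr_pages;

    assert(i >= 0 && remaining >= 0);
    while (remaining > 0) {
        if (i == fail_at) {
            /* Tell the caller to stop, and keep any progress made so far. */
            *nr_pages = 0;
            return i ? i : -EFAULT;
        }
        i++;
        remaining--;
    }
    *nr_pages = 0;
    return i;
}

/* Toy __get_user_pages(): loops only while there is work left to do. */
static long toy_get_user_pages(long nr_pages, long fail_at)
{
    long i = 0;

    while (nr_pages)
        i = toy_follow_hugetlb_page(i, &nr_pages, fail_at);
    return i;
}
```

      A failure on the first page yields the error code, while a failure
      after two pinned pages yields 2, so earlier progress is not lost.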
      
      Tested that __get_user_pages returns when expected on error from
      hugetlb_fault in follow_hugetlb_page.
      
      Fixes: 9a291a7c ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified")
      Link: http://lkml.kernel.org/r/1500406795-58462-1-git-send-email-daniel.m.jordan@oracle.com
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Acked-by: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: <stable@vger.kernel.org>	[4.12.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2be7cfed
  2. 26 Jul, 2017 1 commit
  3. 15 Jul, 2017 1 commit
  4. 13 Jul, 2017 5 commits
    • N
      writeback: rework wb_[dec|inc]_stat family of functions · 3e8f399d
      Nikolay Borisov committed
      Currently the writeback statistics code uses percpu counters to hold
      various statistics.  Furthermore we have 2 families of functions - those
      which disable local irqs and those which don't and whose names begin
      with a double underscore.  However, they both end up calling
      __add_wb_stat which in turn calls percpu_counter_add_batch which is
      already irq-safe.
      
      Exploiting this fact allows us to eliminate the __wb_* functions since
      they don't add any further protection than we already have.
      Furthermore, refactor the wb_* functions to call __add_wb_stat directly
      without the irq-disabling dance.  This will likely result in better
      runtime of code which deals with modifying the stat counters.
      
      While at it also document why percpu_counter_add_batch is in fact
      preempt and irq-safe since at least 3 people got confused.
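
      The resulting call structure can be sketched with a userspace toy
      counter.  All toy_* names and the batch size are hypothetical, and the
      model is single-threaded: it shows thin inc/dec wrappers over one
      batched add, not why the kernel primitive is irq-safe.

```c
#include <assert.h>

#define TOY_NR_CPUS 2
#define TOY_BATCH   4

/* Toy batched counter; a userspace model, not the kernel's percpu_counter. */
struct toy_counter {
    long count;                 /* shared, approximate total */
    long local[TOY_NR_CPUS];    /* per-CPU deltas not yet folded in */
};

/*
 * Add amount on the given CPU and fold the local delta into the shared
 * total once it reaches the batch size in either direction.
 */
static void toy_counter_add_batch(struct toy_counter *c, int cpu,
                                  long amount, long batch)
{
    long sum;

    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    sum = c->local[cpu] + amount;
    if (sum >= batch || sum <= -batch) {
        c->count += sum;
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = sum;
    }
}

/* The inc/dec helpers are thin wrappers with no extra irq handling. */
static void toy_inc_wb_stat(struct toy_counter *c, int cpu)
{
    toy_counter_add_batch(c, cpu, 1, TOY_BATCH);
}

static void toy_dec_wb_stat(struct toy_counter *c, int cpu)
{
    toy_counter_add_batch(c, cpu, -1, TOY_BATCH);
}

/* Exact value: shared total plus all deltas that are still local. */
static long toy_counter_sum(const struct toy_counter *c)
{
    long sum = c->count;

    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        sum += c->local[cpu];
    return sum;
}
```

      Three increments on one CPU stay local in this model; the fourth
      reaches the batch size and is folded into the shared total.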
      
      Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e8f399d
    • M
      mm, migration: do not trigger OOM killer when migrating memory · 0f556856
      Michal Hocko committed
      Page migration (for memory hotplug, soft_offline_page or mbind) needs to
      allocate new memory.  This can trigger the oom killer if the target
      memory is depleted.  Although quite unlikely, it is still possible,
      especially for memory hotplug (offlining of memory).
      
      Up to now we didn't really have reasonable means to back off.
      __GFP_NORETRY can fail just too easily and __GFP_THISNODE sticks to a
      single node and that is not suitable for all callers.
      
      But now that we have __GFP_RETRY_MAYFAIL we should use it.  It is
      preferable to fail the migration than disrupt the system by killing some
      processes.
      
      Link: http://lkml.kernel.org/r/20170623085345.11304-7-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f556856
    • M
      mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes · cc965a29
      Michal Hocko committed
      Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
      request size we can drop the hackish implementation for !costly orders.
      __GFP_RETRY_MAYFAIL retries as long as the reclaim makes forward
      progress and backs off when we are out of memory for the requested size.
      Therefore we do not need to enforce __GFP_NORETRY for !costly orders just
      to silence the oom killer anymore.
      
      Link: http://lkml.kernel.org/r/20170623085345.11304-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc965a29
    • M
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko committed
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
      the page allocator.  This has been true but only for allocation
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has always been
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests and they are
      considered too important to fail so they might end up looping in the
      page allocator forever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of a
      success.  This will work independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs. cost options (take GFP_KERNEL as an example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick the background reclaim. Should be used carefully
         because it might deplete the memory and the next user might hit the
         more aggressive reclaim.
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than keep retrying in the page allocator.
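
      The retry behaviors listed above can be condensed into a toy decision
      function.  This is a deliberately simplified sketch, not the logic of
      __alloc_pages_slowpath: the enum and function names are hypothetical,
      and the handling of costly default requests is an assumption that the
      text does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical retry policies named after the flags discussed above. */
enum toy_policy { TOY_DEFAULT, TOY_NORETRY, TOY_RETRY_MAYFAIL, TOY_NOFAIL };

/* What the toy allocator does after a reclaim round that freed no page. */
enum toy_step { TOY_FAIL, TOY_RETRY, TOY_OOM_KILL };

static enum toy_step toy_after_reclaim(enum toy_policy policy, bool costly,
                                       bool made_progress)
{
    switch (policy) {
    case TOY_NORETRY:
        /* Fail early; the OOM killer is not invoked. */
        return TOY_FAIL;
    case TOY_RETRY_MAYFAIL:
        /* Try hard while reclaim progresses, then fail without OOM kill. */
        return made_progress ? TOY_RETRY : TOY_FAIL;
    case TOY_NOFAIL:
        /* Loop until the allocation succeeds. */
        return TOY_RETRY;
    case TOY_DEFAULT:
    default:
        /* Assumption: costly requests give up instead of looping. */
        if (costly)
            return TOY_FAIL;
        /* !costly requests are basically nofail and may reach the OOM killer. */
        return made_progress ? TOY_RETRY : TOY_OOM_KILL;
    }
}
```

      The point of the model is the middle case: only the RETRY_MAYFAIL
      policy both keeps retrying while there is progress and returns a
      failure to the caller instead of invoking the OOM killer.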
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
    • G
      mm/memory.c: mark create_huge_pmd() inline to prevent build failure · 91a90140
      Geert Uytterhoeven committed
      With gcc 4.1.2:
      
          mm/memory.o: In function `create_huge_pmd':
          memory.c:(.text+0x93e): undefined reference to `do_huge_pmd_anonymous_page'
      
      Interestingly, create_huge_pmd() is emitted in the assembler output, but
      never called.
      
      Converting transparent_hugepage_enabled() from a macro to a static
      inline function reduced the ability of the compiler to remove unused
      code.
      
      Fix this by marking create_huge_pmd() inline.
      
      Fixes: 16981d76 ("mm: improve readability of transparent_hugepage_enabled()")
      Link: http://lkml.kernel.org/r/1499842660-10665-1-git-send-email-geert@linux-m68k.org
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91a90140
  5. 11 Jul, 2017 26 commits