1. 25 June 2015, 40 commits
    • D
      sparc: perf: Disable pagefaults while walking userspace stacks · c17af4dd
      Committed by David Ahern
      Page faults generated walking userspace stacks can call schedule to switch
      out the task. When collecting callchains for scheduler tracepoints this
      causes a deadlock as the tracepoints can be hit with the runqueue lock held:
      
      [ 8138.159054] WARNING: CPU: 758 PID: 12488 at /opt/dahern/linux.git/arch/sparc/kernel/nmi.c:80 perfctr_irq+0x1f8/0x2b4()
      
      [ 8138.203152] Watchdog detected hard LOCKUP on cpu 758
      
      [ 8138.410969] CPU: 758 PID: 12488 Comm: perf Not tainted 4.0.0-rc6+ #6
      [ 8138.437146] Call Trace:
      [ 8138.447193]  [000000000045cdd4] warn_slowpath_common+0x7c/0xa0
      [ 8138.471238]  [000000000045ce90] warn_slowpath_fmt+0x30/0x40
      [ 8138.494189]  [0000000000983e38] perfctr_irq+0x1f8/0x2b4
      [ 8138.515716]  [00000000004209f4] tl0_irq15+0x14/0x20
      [ 8138.535791]  [00000000009839ec] _raw_spin_trylock_bh+0x68/0x108
      [ 8138.560180]  [0000000000980018] __schedule+0xcc/0x710
      [ 8138.580981]  [00000000009806dc] preempt_schedule_common+0x10/0x3c
      [ 8138.606082]  [000000000098077c] _cond_resched+0x34/0x44
      [ 8138.627603]  [0000000000565990] kmem_cache_alloc_node+0x24/0x1a0
      [ 8138.652345]  [0000000000450b60] tsb_grow+0xac/0x488
      [ 8138.672429]  [0000000000985040] do_sparc64_fault+0x4dc/0x6e4
      [ 8138.695736]  [0000000000407c2c] sparc64_realfault_common+0x10/0x20
      [ 8138.721202]  [00000000006f2e24] NG4copy_from_user+0xa4/0x3c0
      [ 8138.744510]  [000000000044f900] perf_callchain_user+0x5c/0x6c
      [ 8138.768182]  [0000000000517b5c] perf_callchain+0x16c/0x19c
      [ 8138.790774]  [0000000000515f84] perf_prepare_sample+0x68/0x218
      [ 8138.814801] ---[ end trace 42ca6294b1ff7573 ]---
      
      As with PowerPC (b59a1bfc, "powerpc/perf: Disable pagefaults during
      callchain stack read") disable pagefaults while walking userspace stacks.
      Signed-off-by: NDavid Ahern <david.ahern@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c17af4dd
    • L
      Merge branch 'akpm' (patches from Andrew) · aefbef10
      Committed by Linus Torvalds
      Merge first patchbomb from Andrew Morton:
      
       - a few misc things
      
       - ocfs2 updates
      
       - kernel/watchdog.c feature work (took ages to get right)
      
       - most of MM.  A few tricky bits are held up and probably won't make 4.2.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (91 commits)
        mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()
        mm, thp: respect MPOL_PREFERRED policy with non-local node
        tmpfs: truncate prealloc blocks past i_size
        mm/memory hotplug: print the last vmemmap region at the end of hot add memory
        mm/mmap.c: optimization of do_mmap_pgoff function
        mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan
        mm: kmemleak: avoid deadlock on the kmemleak object insertion error path
        mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup()
        mm: kmemleak: fix delete_object_*() race when called on the same memory block
        mm: kmemleak: allow safe memory scanning during kmemleak disabling
        memcg: convert mem_cgroup->under_oom from atomic_t to int
        memcg: remove unused mem_cgroup->oom_wakeups
        frontswap: allow multiple backends
        x86, mirror: x86 enabling - find mirrored memory ranges
        mm/memblock: allocate boot time data structures from mirrored memory
        mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
        mm: do not ignore mapping_gfp_mask in page cache allocation paths
        mm/cma.c: fix typos in comments
        mm/oom_kill.c: print points as unsigned int
        mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
        ...
      aefbef10
    • L
      Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux · 266da6f1
      Committed by Linus Torvalds
      Pull pstore updates from Tony Luck:
       "Miscellaneous pstore improvements"
      
      * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
        ramoops: make it possible to change mem_type param.
        pstore/ram: verify ramoops header before saving record
        fs/pstore: Optimization function ramoops_init_przs
        fs/pstore: update the backend parameter in pstore module
        pstore: do not use message compression without lock
      266da6f1
    • L
      Merge tag 'for-f2fs-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs · cfcc0ad4
      Committed by Linus Torvalds
      Pull f2fs updates from Jaegeuk Kim:
       "New features:
         - per-file encryption (e.g., ext4)
         - FALLOC_FL_ZERO_RANGE
         - FALLOC_FL_COLLAPSE_RANGE
         - RENAME_WHITEOUT
      
        Major enhancement/fixes:
         - recovering broken superblocks
         - enhance f2fs_trim_fs with a discard_map
         - fix a race condition on dentry block allocation
         - fix a deadlock during summary operation
         - fix a missing fiemap result
      
        .. and many minor bug fixes and clean-ups were done"
      
      * tag 'for-f2fs-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (83 commits)
        f2fs: do not trim preallocated blocks when truncating after i_size
        f2fs crypto: add alloc_bounce_page
        f2fs crypto: fix to handle errors likewise ext4
        f2fs: drop the volatile_write flag only
        f2fs: skip committing valid superblock
        f2fs: setting discard option in parse_options()
        f2fs: fix to return exact trimmed size
        f2fs: support FALLOC_FL_INSERT_RANGE
        f2fs: hide common code in f2fs_replace_block
        f2fs: disable the discard option when device doesn't support
        f2fs crypto: remove alloc_page for bounce_page
        f2fs: fix a deadlock for summary page lock vs. sentry_lock
        f2fs crypto: clean up error handling in f2fs_fname_setup_filename
        f2fs crypto: avoid f2fs_inherit_context for symlink
        f2fs crypto: do not set encryption policy for non-directory by ioctl
        f2fs crypto: allow setting encryption policy once
        f2fs crypto: check context consistent for rename2
        f2fs: avoid duplicated code by reusing f2fs_read_end_io
        f2fs crypto: use per-inode tfm structure
        f2fs: recovering broken superblock during mount
        ...
      cfcc0ad4
    • L
      Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · a7296b49
      Committed by Linus Torvalds
      Pull UDF fixes and cleanups from Jan Kara:
       "The contains some small fixes and improvements in error handling for
        UDF.
      
        Bundled is also one ext3 coding style fix and a fix in quota
        documentation"
      
      * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        udf: fix udf_load_pvoldesc()
        udf: remove double err declaration in udf_file_write_iter()
        UDF: support NFSv2 export
        fs: ext3: super: fixed a space coding style issue
        quota: Update documentation
        udf: Return error from udf_find_entry()
        udf: Make udf_get_filename() return error instead of 0 length file name
        udf: bug on exotic flag in udf_get_filename()
        udf: improve error management in udf_CS0toNLS()
        udf: improve error management in udf_CS0toUTF8()
        udf: unicode: update function name in comments
        udf: remove unnecessary test in udf_build_ustr_exact()
        udf: Return -ENOMEM when allocation fails in udf_get_filename()
      a7296b49
    • L
      Merge tag 'docs-for-linus' of git://git.lwn.net/linux-2.6 · 1e467e68
      Committed by Linus Torvalds
      Pull documentation updates from Jonathan Corbet:
       "The main thing here is Ingo's big subdirectory documenting feature
        support for each architecture.  Beyond that, it's the usual pile of
        fixes, tweaks, and small additions"
      
      * tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (79 commits)
        doc:md: fix typo in md.txt.
        Documentation/mic/mpssd: don't build x86 userspace when cross compiling
        Documentation/prctl: don't build tsc tests when cross compiling
        Documentation/vDSO: don't build tests when cross compiling
        Doc:ABI/testing: Fix typo in sysfs-bus-fcoe
        Doc: Docbook: Change wikipedia's URL from http to https in scsi.tmpl
        Doc: Change wikipedia's URL from http to https
        Documentation/kernel-parameters: add missing pciserial to the earlyprintk
        Doc:pps: Fix typo in pps.txt
        kbuild : Fix documentation of INSTALL_HDR_PATH
        Documentation: filesystems: updated struct file_operations documentation in vfs.txt
        kbuild: edit explanation of clean-files variable
        Doc: ja_JP: Fix typo in HOWTO
        Move freefall program from Documentation/ to tools/
        Documentation: ARM: EXYNOS: Describe boot loaders interface
        Doc:nfc: Fix typo in nfc-hci.txt
        vfs: Minor documentation fix
        Doc: networking: txtimestamp: fix printf format warning
        Documentation, intel_pstate: Improve legacy mode internal governors description
        Documentation: extend use case for EXPORT_SYMBOL_GPL()
        ...
      1e467e68
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 14738e03
      Committed by Linus Torvalds
      Pull input subsystem updates from Dmitry Torokhov:
       "Thanks to Samuel Thibault input device (keyboard) LEDs are no longer
        hardwired within the input core but use LED subsystem and so allow use
        of different triggers; Hans de Goede did a large update for the ALPS
        touchpad driver; we have new TI drv2665 haptics driver and DA9063
        OnKey driver, and host of other drivers got various fixes"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (55 commits)
        Input: pixcir_i2c_ts - fix receive error
        MAINTAINERS: remove non existent input mt git tree
        Input: improve usage of gpiod API
        tty/vt/keyboard: define LED triggers for VT keyboard lock states
        tty/vt/keyboard: define LED triggers for VT LED states
        Input: export LEDs as class devices in sysfs
        Input: cyttsp4 - use swap() in cyttsp4_get_touch()
        Input: goodix - do not explicitly set evbits in input device
        Input: goodix - export id and version read from device
        Input: goodix - fix variable length array warning
        Input: goodix - fix alignment issues
        Input: add OnKey driver for DA9063 MFD part
        Input: elan_i2c - add product IDs FW names
        Input: elan_i2c - add support for multi IC type and iap format
        Input: focaltech - report finger width to userspace
        tty: remove platform_sysrq_reset_seq
        Input: synaptics_i2c - use proper boolean values
        Input: psmouse - use true instead of 1 for boolean values
        Input: cyapa - fix a few typos in comments
        Input: stmpe-ts - enforce device tree only mode
        ...
      14738e03
    • L
      Merge tag 'edac_for_4.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp · 45471cd9
      Committed by Linus Torvalds
      Pull EDAC updates from Borislav Petkov:
      
       - New APM X-Gene SoC EDAC driver (Loc Ho)
      
       - AMD error injection module improvements (Aravind Gopalakrishnan)
      
       - Altera Arria 10 support (Thor Thayer)
      
       - misc fixes and cleanups all over the place
      
      * tag 'edac_for_4.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (28 commits)
        EDAC: Update Documentation/edac.txt
        EDAC: Fix typos in Documentation/edac.txt
        EDAC, mce_amd_inj: Set MISCV on injection
        EDAC, mce_amd_inj: Move bit preparations before the injection
        EDAC, mce_amd_inj: Cleanup and simplify README
        EDAC, altera: Do not allow suspend when EDAC is enabled
        EDAC, mce_amd_inj: Make inj_type static
        arm: socfpga: dts: Add Arria10 SDRAM EDAC DTS support
        EDAC, altera: Add Arria10 EDAC support
        EDAC, altera: Refactor for Altera CycloneV SoC
        EDAC, altera: Generalize driver to use DT Memory size
        EDAC, mce_amd_inj: Add README file
        EDAC, mce_amd_inj: Add individual permissions field to dfs_node
        EDAC, mce_amd_inj: Modify flags attribute to use string arguments
        EDAC, mce_amd_inj: Read out number of MCE banks from the hardware
        EDAC, mce_amd_inj: Use MCE_INJECT_GET macro for bank node too
        EDAC, xgene: Fix cpuid abuse
        EDAC, mpc85xx: Extend error address to 64 bit
        EDAC, mpc8xxx: Adapt for FSL SoC
        EDAC, edac_stub: Drop arch-specific include
        ...
      45471cd9
    • L
      Merge tag 'pinctrl-v4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 93a4b1b9
      Committed by Linus Torvalds
      Pull pin control updates from Linus Walleij:
       "Here is the bulk of pin control changes for the v4.2 series: Quite a
        lot of new SoC subdrivers and two new main drivers this time, apart
        from that business as usual.
      
        Details:
      
        Core functionality:
         - Enable exclusive pin ownership: it is possible to flag a pin
           controller so that GPIO and other functions cannot use a single pin
           simultaneously.
      
        New drivers:
         - NXP LPC18xx System Control Unit pin controller
         - Imagination Pistachio SoC pin controller
      
        New subdrivers:
         - Freescale i.MX7d SoC
         - Intel Sunrisepoint-H PCH
         - Renesas PFC R8A7793
         - Renesas PFC R8A7794
         - Mediatek MT6397, MT8127
         - SiRF Atlas 7
         - Allwinner A33
         - Qualcomm MSM8660
         - Marvell Armada 395
         - Rockchip RK3368
      
        Cleanups:
         - A big cleanup of the Marvell MVEBU driver rectifying it to
           correspond to reality
         - Drop platform device probing from the SH PFC driver, we are now a
           DT only shop for SuperH
         - Drop obsolete multi-platform check for SH PFC
         - Various janitorial: constification, grammar etc
      
        Improvements:
         - The AT91 GPIO portions now support the set_multiple() feature
         - Split out SPI pins on the Xilinx Zynq
         - Support DTs without specific function nodes in the i.MX driver"
      
      * tag 'pinctrl-v4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (99 commits)
        pinctrl: rockchip: add support for the rk3368
        pinctrl: rockchip: generalize perpin driver-strength setting
        pinctrl: sh-pfc: r8a7794: add SDHI pin groups
        pinctrl: sh-pfc: r8a7794: add MMCIF pin groups
        pinctrl: sh-pfc: add R8A7794 PFC support
        pinctrl: make pinctrl_register() return proper error code
        pinctrl: mvebu: armada-39x: add support for Armada 395 variant
        pinctrl: mvebu: armada-39x: add missing SATA functions
        pinctrl: mvebu: armada-39x: add missing PCIe functions
        pinctrl: mvebu: armada-38x: add ptp functions
        pinctrl: mvebu: armada-38x: add ua1 functions
        pinctrl: mvebu: armada-38x: add nand functions
        pinctrl: mvebu: armada-38x: add sata functions
        pinctrl: mvebu: armada-xp: add dram functions
        pinctrl: mvebu: armada-xp: add nand rb function
        pinctrl: mvebu: armada-xp: add spi1 function
        pinctrl: mvebu: armada-39x: normalize ref clock naming
        pinctrl: mvebu: armada-xp: rename spi to spi0
        pinctrl: mvebu: armada-370: align spi1 clock pin naming
        pinctrl: mvebu: armada-370: align VDD cpu-pd pin naming with datasheet
        ...
      93a4b1b9
    • L
      Merge tag 'backlight-for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight · d59b92f9
      Committed by Linus Torvalds
      Pull backlight updates from Lee Jones:
       "Changes to existing drivers:
      
         - supply MODULE_DEVICE_TABLE() to ensure probing
         - constify struct; da9052_bl
         - enable compile test; lcd_l4f00242t03, lcd_lms283fg05, backlight_gpio
         - suspend/resume bugfix; lp855x_bl
         - devm_gpiod_get_optional() API fixup; pwm_bl
         - error handling fixup; backlight"
      
      * tag 'backlight-for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
        backlight: Change the return type of backlight_update_status() to int
        backlight: pwm_bl: Simplify usage of devm_gpiod_get_optional
        backlight: lp855x: Don't clear level on suspend/blank
        backlight: Allow compile test of GPIO consumers if !GPIOLIB
        video: backlight: da9052: Constify platform_device_id
        gpio-backlight: Discover driver during boot time
      d59b92f9
    • L
      mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc() · 8a8c35fa
      Committed by Larry Finger
      Beginning at commit d52d3997 ("ipv6: Create percpu rt6_info"), the
      following INFO splat is logged:
      
        ===============================
        [ INFO: suspicious RCU usage. ]
        4.1.0-rc7-next-20150612 #1 Not tainted
        -------------------------------
        kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section!
        other info that might help us debug this:
        rcu_scheduler_active = 1, debug_locks = 0
         3 locks held by systemd/1:
         #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40
         #1:  (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540
         #2:  (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540
        stack backtrace:
        CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1
        Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20   04/17/2014
        Call Trace:
          dump_stack+0x4c/0x6e
          lockdep_rcu_suspicious+0xe7/0x120
          ___might_sleep+0x1d5/0x1f0
          __might_sleep+0x4d/0x90
          kmem_cache_alloc+0x47/0x250
          create_object+0x39/0x2e0
          kmemleak_alloc_percpu+0x61/0xe0
          pcpu_alloc+0x370/0x630
      
      Additional backtrace lines are truncated.  In addition, the above splat
      is followed by several "BUG: sleeping function called from invalid
      context at mm/slub.c:1268" outputs.  As suggested by Martin KaFai Lau,
      these are the clue to the fix.  Routine kmemleak_alloc_percpu() always
      uses GFP_KERNEL for its allocations, whereas it should follow the gfp
      from its callers.
      Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: NKamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NLarry Finger <Larry.Finger@lwfinger.net>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: <stable@vger.kernel.org>	[3.18+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a8c35fa
    • V
      mm, thp: respect MPOL_PREFERRED policy with non-local node · 0867a57c
      Committed by Vlastimil Babka
      Since commit 077fcf11 ("mm/thp: allocate transparent hugepages on
      local node"), we handle THP allocations on page fault in a special way -
      for non-interleave memory policies, the allocation is only attempted on
      the node local to the current CPU, if the policy's nodemask allows the
      node.
      
      This is motivated by the assumption that THP benefits cannot offset the
      cost of remote accesses, so it's better to fallback to base pages on the
      local node (which might still be available, while huge pages are not due
      to fragmentation) than to allocate huge pages on a remote node.
      
      The nodemask check prevents us from violating e.g.  MPOL_BIND policies
      where the local node is not among the allowed nodes.  However, the
      current implementation can still give surprising results for the
      MPOL_PREFERRED policy when the preferred node is different than the
      current CPU's local node.
      
      In such case we should honor the preferred node and not use the local
      node, which is what this patch does.  If hugepage allocation on the
      preferred node fails, we fall back to base pages and don't try other
      nodes, with the same motivation as is done for the local node hugepage
      allocations.  The patch also moves the MPOL_INTERLEAVE check around to
      simplify the hugepage specific test.
      
      The difference can be demonstrated using in-tree transhuge-stress test
      on the following 2-node machine where half memory on one node was
      occupied to show the difference.
      
      > numactl --hardware
      available: 2 nodes (0-1)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
      node 0 size: 7878 MB
      node 0 free: 3623 MB
      node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
      node 1 size: 8045 MB
      node 1 free: 7818 MB
      node distances:
      node   0   1
        0:  10  21
        1:  21  10
      
      Before the patch:
      > numactl -p0 -C0 ./transhuge-stress
      transhuge-stress: 2.197 s/loop, 0.276 ms/page,   7249.168 MiB/s 7962 succeed,    0 failed, 1786 different pages
      
      > numactl -p0 -C12 ./transhuge-stress
      transhuge-stress: 2.962 s/loop, 0.372 ms/page,   5376.172 MiB/s 7962 succeed,    0 failed, 3873 different pages
      
      The number of successful THP allocations corresponds to free memory on node 0
      in the first case and node 1 in the second case, i.e. the -p parameter is
      ignored and cpu binding "wins".
      
      After the patch:
      > numactl -p0 -C0 ./transhuge-stress
      transhuge-stress: 2.183 s/loop, 0.274 ms/page,   7295.516 MiB/s 7962 succeed,    0 failed, 1760 different pages
      
      > numactl -p0 -C12 ./transhuge-stress
      transhuge-stress: 2.878 s/loop, 0.361 ms/page,   5533.638 MiB/s 7962 succeed,    0 failed, 1750 different pages
      
      > numactl -p1 -C0 ./transhuge-stress
      transhuge-stress: 4.628 s/loop, 0.581 ms/page,   3440.893 MiB/s 7962 succeed,    0 failed, 3918 different pages
      
      The -p parameter is respected regardless of cpu binding.
      
      > numactl -C0 ./transhuge-stress
      transhuge-stress: 2.202 s/loop, 0.277 ms/page,   7230.003 MiB/s 7962 succeed,    0 failed, 1750 different pages
      
      > numactl -C12 ./transhuge-stress
      transhuge-stress: 3.020 s/loop, 0.379 ms/page,   5273.324 MiB/s 7962 succeed,    0 failed, 3916 different pages
      
      Without -p parameter, hugepage restriction to CPU-local node works as before.
      
      Fixes: 077fcf11 ("mm/thp: allocate transparent hugepages on local node")
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0867a57c
    • J
      tmpfs: truncate prealloc blocks past i_size · afa2db2f
      Committed by Josef Bacik
      One of the rocksdb people noticed that when you do something like this
      
          fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 10M)
          pwrite(fd, buf, 5M, 0)
          ftruncate(5M)
      
      on tmpfs, the file would still take up 10M: which led to super fun
      issues because we were getting ENOSPC before we thought we should be
      getting ENOSPC.  This patch fixes the problem, and mirrors what all the
      other fs'es do (and was agreed to be the correct behaviour at LSF).
      
      I tested it locally to make sure it worked properly with the following
      
          xfs_io -f -c "falloc -k 0 10M" -c "pwrite 0 5M" -c "truncate 5M" file
      
      Without the patch we have "Blocks: 20480", with the patch we have the
      correct value of "Blocks: 10240".
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      afa2db2f
    • Z
      mm/memory hotplug: print the last vmemmap region at the end of hot add memory · c435a390
      Committed by Zhu Guihua
      When hot adding two nodes continuously, we found the vmemmap region info was
      a bit messed up.  The last region of node 2 is printed when node 3 is hot
      added, like the following:
      
        Initmem setup node 2 [mem 0x0000000000000000-0xffffffffffffffff]
         On node 2 totalpages: 0
         Built 2 zonelists in Node order, mobility grouping on.  Total pages: 16090539
         Policy zone: Normal
         init_memory_mapping: [mem 0x40000000000-0x407ffffffff]
          [mem 0x40000000000-0x407ffffffff] page 1G
          [ffffea1000000000-ffffea10001fffff] PMD -> [ffff8a077d800000-ffff8a077d9fffff] on node 2
          [ffffea1000200000-ffffea10003fffff] PMD -> [ffff8a077de00000-ffff8a077dffffff] on node 2
        ...
          [ffffea101f600000-ffffea101f9fffff] PMD -> [ffff8a074ac00000-ffff8a074affffff] on node 2
          [ffffea101fa00000-ffffea101fdfffff] PMD -> [ffff8a074a800000-ffff8a074abfffff] on node 2
        Initmem setup node 3 [mem 0x0000000000000000-0xffffffffffffffff]
         On node 3 totalpages: 0
         Built 3 zonelists in Node order, mobility grouping on.  Total pages: 16090539
         Policy zone: Normal
         init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
          [mem 0x60000000000-0x607ffffffff] page 1G
          [ffffea101fe00000-ffffea101fffffff] PMD -> [ffff8a074a400000-ffff8a074a5fffff] on node 2 <=== node 2 ???
          [ffffea1800000000-ffffea18001fffff] PMD -> [ffff8a074a600000-ffff8a074a7fffff] on node 3
          [ffffea1800200000-ffffea18005fffff] PMD -> [ffff8a074a000000-ffff8a074a3fffff] on node 3
          [ffffea1800600000-ffffea18009fffff] PMD -> [ffff8a0749c00000-ffff8a0749ffffff] on node 3
        ...
      
      The cause is that the last region was missed at the end of hot add memory,
      and p_start, p_end, node_start were not reset, so when memory is hot added
      to a new node, the regions are treated as non-contiguous blocks and the
      previous one is printed.  So we print the last vmemmap region at the end of
      hot add memory to avoid the confusion.
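
      For reference, a hedged sketch of the existing x86 helper that the fix
      arranges to be called when hot add finishes (from memory, not a verbatim
      copy; addr_start/addr_end/p_start/p_end/node_start are the static trackers
      mentioned above):

          void __meminit vmemmap_populate_print_last(void)
          {
                  if (p_start) {
                          /* Flush the pending coalesced region line ... */
                          pr_debug(" [%lx-%lx] PMD -> [%p-%p] on node %d\n",
                                   addr_start, addr_end - 1, p_start, p_end - 1,
                                   node_start);
                          /* ... and reset the trackers so the next node starts clean. */
                          p_start = NULL;
                          p_end = NULL;
                          node_start = 0;
                  }
          }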
      Signed-off-by: NZhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c435a390
    • P
      mm/mmap.c: optimization of do_mmap_pgoff function · e37609bb
      Committed by Piotr Kwapulinski
      The simple check for a zero-length memory mapping may be performed earlier,
      so that in the zero-length case some unnecessary code is not executed at
      all.  It does not make the code less readable and saves some CPU cycles.
      Signed-off-by: NPiotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e37609bb
    • C
      mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan · 93ada579
      Committed by Catalin Marinas
      The kmemleak memory scanning uses finer grained object->lock spinlocks
      primarily to avoid races with the memory block freeing.  However, the
      pointer lookup in the rb tree requires the kmemleak_lock to be held.
      This is currently done in the find_and_get_object() function for each
      pointer-like location read during scanning.  While this allows a low
      latency on kmemleak_*() callbacks on other CPUs, the memory scanning is
      slower.
      
      This patch moves the kmemleak_lock outside the scan_block() loop,
      acquiring/releasing it only once per scanned memory block.  The
      allow_resched logic is moved outside scan_block() and a new
      scan_large_block() function is implemented which splits large blocks in
      MAX_SCAN_SIZE chunks with cond_resched() calls in-between.  A redundant
      (object->flags & OBJECT_NO_SCAN) check is also removed from
      scan_object().
      
      With this patch, the kmemleak scanning performance is significantly
      improved: at least 50% with lock debugging disabled and over an order of
      magnitude with lock proving enabled (on an arm64 system).
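
      A sketch of the chunking helper described above (assuming the scan_block()
      and MAX_SCAN_SIZE names from mm/kmemleak.c):

          /* Scan a large memory block in MAX_SCAN_SIZE chunks so that other
           * CPUs are not starved of kmemleak_lock and we can reschedule
           * between chunks. */
          static void scan_large_block(void *start, void *end)
          {
                  void *next;

                  while (start < end) {
                          next = min(start + MAX_SCAN_SIZE, end);
                          scan_block(start, next, NULL);
                          start = next;
                          cond_resched();
                  }
          }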
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      93ada579
    • C
      mm: kmemleak: avoid deadlock on the kmemleak object insertion error path · 9d5a4c73
      Committed by Catalin Marinas
      While very unlikely (usually kmemleak or sl*b bug), the create_object()
      function in mm/kmemleak.c may fail to insert a newly allocated object into
      the rb tree.  When this happens, kmemleak disables itself and prints
      additional information about the object already found in the rb tree.
      Such printing is done with the parent->lock acquired, however the
      kmemleak_lock is already held.  This is a potential race with the scanning
      thread which acquires object->lock and kmemleak_lock in a
      
      This patch removes the locking around the 'parent' object information
      printing.  Such object cannot be freed or removed from object_tree_root
      and object_list since kmemleak_lock is already held.  There is a very
      small risk that some of the object data is being modified on another CPU
      but the only downside is inconsistent information printing.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d5a4c73
    • C
      mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup() · 5f369f37
      Committed by Catalin Marinas
      The kmemleak_do_cleanup() work thread already waits for the kmemleak_scan
      thread to finish via kthread_stop().  Waiting in kthread_stop() while
      scan_mutex is held may lead to deadlock if kmemleak_scan_thread() also
      waits to acquire scan_mutex.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f369f37
    • C
      mm: kmemleak: fix delete_object_*() race when called on the same memory block · e781a9ab
      Committed by Catalin Marinas
      Calling delete_object_*() on the same pointer is not a standard use case
      (unless there is a bug in the code calling kmemleak_free()).  However,
      during kmemleak disabling (error or user triggered via /sys), there is a
      potential race between kmemleak_free() calls on a CPU and
      __kmemleak_do_cleanup() on a different CPU.
      
      The current delete_object_*() implementation first performs a look-up
      holding kmemleak_lock, increments the object->use_count and then
      re-acquires kmemleak_lock to remove the object from object_tree_root and
      object_list.
      
      This patch simplifies the delete_object_*() mechanism to both look up
      and remove an object from the object_tree_root and object_list
      atomically (guarded by kmemleak_lock).  This allows safe concurrent
      calls to delete_object_*() on the same pointer without additional
      locking for synchronising the kmemleak_free_enabled flag.
      
      A side effect is a slight improvement in the delete_object_*() performance
      by avoiding acquiring kmemleak_lock twice and incrementing/decrementing
      object->use_count.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e781a9ab
    • C
      mm: kmemleak: allow safe memory scanning during kmemleak disabling · c5f3b1a5
      Committed by Catalin Marinas
      The kmemleak scanning thread can run for minutes.  Callbacks like
      kmemleak_free() are allowed during this time, the race being taken care
      of by the object->lock spinlock.  Such lock also prevents a memory block
      from being freed or unmapped while it is being scanned by blocking the
      kmemleak_free() -> ...  -> __delete_object() function until the lock is
      released in scan_object().
      
      When a kmemleak error occurs (e.g.  it fails to allocate its metadata),
      kmemleak_enabled is cleared and __delete_object() is no longer called on
      freed objects.  If kmemleak_scan is running at the same time,
      kmemleak_free() no longer waits for the object scanning to complete,
      allowing the corresponding memory block to be freed or unmapped (in the
      case of vfree()).  This leads to kmemleak_scan potentially triggering a
      page fault.
      
      This patch separates the kmemleak_free() enabling/disabling from the
      overall kmemleak_enabled knob so that we can defer the disabling of the
      object freeing tracking until the scanning thread has completed.  The
      kmemleak_free_part() is deliberately ignored by this patch since this is
      only called during boot before the scanning thread started.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Reported-by: NVignesh Radhakrishnan <vigneshr@codeaurora.org>
      Tested-by: NVignesh Radhakrishnan <vigneshr@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5f3b1a5
    • T
      memcg: convert mem_cgroup->under_oom from atomic_t to int · c2b42d3c
      Committed by Tejun Heo
      memcg->under_oom tracks whether the memcg is under OOM conditions and is
      an atomic_t counter managed with mem_cgroup_[un]mark_under_oom().  While
      atomic_t appears to be simple synchronization-wise, when used as a
      synchronization construct like here, it's trickier and more error-prone
      due to weak memory ordering rules, especially around atomic_read(), and it
      gives a false sense of security.
      
      For example, both non-trivial read sites of memcg->under_oom are a bit
      problematic, although not actually broken.
      
      * mem_cgroup_oom_register_event()
      
        It isn't explicit what guarantees the memory ordering between event
        addition and memcg->under_oom check.  This isn't broken only because
        memcg_oom_lock is used for both event list and memcg->oom_lock.
      
      * memcg_oom_recover()
      
        The lockless test doesn't have any explanation why this would be
        safe.
      
      mem_cgroup_[un]mark_under_oom() are very cold paths and there's no point
      in avoiding locking memcg_oom_lock there.  This patch converts
      memcg->under_oom from atomic_t to int, puts their modifications under
      memcg_oom_lock and documents why the lockless test in
      memcg_oom_recover() is safe.
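
      A sketch of what the converted marking path looks like after the change
      (simplified from the changelog; the exact upstream code may differ):

          static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
          {
                  struct mem_cgroup *iter;

                  spin_lock(&memcg_oom_lock);
                  for_each_mem_cgroup_tree(iter, memcg)
                          iter->under_oom++;      /* plain int, protected by memcg_oom_lock */
                  spin_unlock(&memcg_oom_lock);
          }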
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c2b42d3c
    • T
      memcg: remove unused mem_cgroup->oom_wakeups · f4b90b70
      Committed by Tejun Heo
      Since commit 49426420 ("mm: memcg: handle non-error OOM situations
      more gracefully"), nobody uses mem_cgroup->oom_wakeups.  Remove it.
      
      While at it, also fold memcg_wakeup_oom() into memcg_oom_recover() which
      is its only user.  This cleanup was suggested by Michal.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f4b90b70
    • D
      frontswap: allow multiple backends · d1dc6f1b
      Committed by Dan Streetman
      Change the frontswap single pointer to a singly linked list of frontswap
      implementations.  Update the Xen tmem implementation, as register no longer
      returns anything.
      
      Frontswap only keeps track of a single implementation; any
      implementation that registers second (or later) will replace the
      previously registered implementation, and gets a pointer to the previous
      implementation that the new implementation is expected to pass all
      frontswap functions to if it can't handle the function itself.  However
      that method doesn't really make much sense, as passing that work on to
      every implementation adds unnecessary work to implementations; instead,
      frontswap should simply keep a list of all registered implementations
      and try each implementation for any function.  Most importantly, neither
      of the two currently existing frontswap implementations in the kernel
      actually do anything with any previous frontswap implementation that
      they replace when registering.
      
      This allows frontswap to successfully manage multiple implementations by
      keeping a list of them all.
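
      A sketch of the list-based registration and store path (names approximate
      the frontswap API; the real registration and store code handle more corner
      cases, so treat this as illustrative rather than the exact upstream diff):

          static struct frontswap_ops *frontswap_ops;    /* head of the list */

          void frontswap_register_ops(struct frontswap_ops *ops)
          {
                  /* New backends are simply pushed onto the list; nothing is returned. */
                  ops->next = frontswap_ops;
                  frontswap_ops = ops;
          }

          int __frontswap_store(struct page *page)
          {
                  swp_entry_t entry = { .val = page_private(page), };
                  int type = swp_type(entry);
                  pgoff_t offset = swp_offset(entry);
                  struct frontswap_ops *ops;

                  /* Try each registered implementation until one accepts the page. */
                  for (ops = frontswap_ops; ops != NULL; ops = ops->next)
                          if (ops->store(type, offset, page) == 0)
                                  return 0;
                  return -1;
          }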
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1dc6f1b
    • T
      x86, mirror: x86 enabling - find mirrored memory ranges · b05b9f5f
      Committed by Tony Luck
      UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory
      address ranges.  See UEFI 2.5 spec pages 157-158:
      
        http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf
      
      On EFI enabled systems scan the memory map and tell memblock about any
      mirrored ranges.
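
      A sketch of the scan (based on the changelog; EFI_MEMORY_MORE_RELIABLE is
      the UEFI 2.5 attribute bit, and memblock_mark_mirror() comes from the
      companion memblock patches in this series):

          void __init efi_find_mirror(void)
          {
                  efi_memory_desc_t *md;
                  u64 mirror_size = 0, total_size = 0;

                  for_each_efi_memory_desc(&memmap, md) {
                          unsigned long long start = md->phys_addr;
                          unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;

                          total_size += size;
                          if (md->attribute & EFI_MEMORY_MORE_RELIABLE) {
                                  /* Tell memblock this range is mirrored. */
                                  memblock_mark_mirror(start, size);
                                  mirror_size += size;
                          }
                  }
                  if (mirror_size)
                          pr_info("Memory: %lldM/%lldM mirrored memory\n",
                                  mirror_size >> 20, total_size >> 20);
          }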
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b05b9f5f
    • T
      mm/memblock: allocate boot time data structures from mirrored memory · a3f5bafc
      Committed by Tony Luck
      Try to allocate all boot time kernel data structures from mirrored
      memory.
      
      If we run out of mirrored memory print warnings, but fall back to using
      non-mirrored memory to make sure that we still boot.
      
      By number of bytes, most of what we allocate at boot time is the page
      structures.  64 bytes per 4K page on x86_64 ...  or about 1.5% of total
      system memory.  For workloads where the bulk of memory is allocated to
      applications this may represent a useful improvement to system
      availability since 1.5% of total memory might be a third of the memory
      allocated to the kernel.
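
      A hedged sketch of the fallback idea, written as a hypothetical helper
      (the real logic lives inside the memblock allocation internals and also
      retries across nodes and address limits):

          static phys_addr_t __init alloc_mirrored_or_fallback(phys_addr_t size,
                                                               phys_addr_t align,
                                                               int nid)
          {
                  ulong flags = MEMBLOCK_MIRROR;
                  phys_addr_t found;

          again:
                  found = memblock_find_in_range_node(size, align, 0,
                                                      MEMBLOCK_ALLOC_ACCESSIBLE,
                                                      nid, flags);
                  if (!found && (flags & MEMBLOCK_MIRROR)) {
                          /* Mirrored pool exhausted: warn and retry unrestricted
                           * so the system still boots. */
                          flags &= ~MEMBLOCK_MIRROR;
                          pr_warn("Could not allocate %pap bytes of mirrored memory\n",
                                  &size);
                          goto again;
                  }
                  return found;
          }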
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a3f5bafc
    • T
      mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute · fc6daaf9
      Committed by Tony Luck
      Some high end Intel Xeon systems report uncorrectable memory errors as a
      recoverable machine check.  Linux has included code for some time to
      process these and just signal the affected processes (or even recover
      completely if the error was in a read only page that can be replaced by
      reading from disk).
      
      But we have no recovery path for errors encountered during kernel code
      execution.  Except for some very specific cases, we are unlikely to ever be
      able to recover.
      
      Enter memory mirroring.  Actually the 3rd generation of memory mirroring.
      
      Gen1: All memory is mirrored
      	Pro: No s/w enabling - h/w just gets good data from other side of the
      	     mirror
      	Con: Halves effective memory capacity available to OS/applications
      
      Gen2: Partial memory mirror - just mirror memory behind some memory controllers
      	Pro: Keep more of the capacity
      	Con: Nightmare to enable. Have to choose between allocating from
      	     mirrored memory for safety vs. NUMA local memory for performance
      
      Gen3: Address range partial memory mirror - some mirror on each memory
            controller
      	Pro: Can tune the amount of mirror and keep NUMA performance
      	Con: I have to write memory management code to implement
      
      The current plan is just to use mirrored memory for kernel allocations.
      This has been broken into two phases:
      
      1) This patch series - find the mirrored memory, use it for boot time
         allocations
      
      2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
         unused mirrored memory from mm/memblock.c and only give it out to
         select kernel allocations (this is still being scoped because
         page_alloc.c is scary).
      
      This patch (of 3):
      
      Add extra "flags" to memblock to allow selection of memory based on
      attribute.  No functional changes
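
      A sketch of the new selector (MEMBLOCK_HOTPLUG already existed;
      MEMBLOCK_MIRROR is added by this series, set from the EFI scan and honoured
      by the allocation paths):

          /* Definitions of memblock flags. */
          enum {
                  MEMBLOCK_NONE    = 0x0,  /* no special request */
                  MEMBLOCK_HOTPLUG = 0x1,  /* hotpluggable region */
                  MEMBLOCK_MIRROR  = 0x2,  /* mirrored region */
          };

          int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);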
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc6daaf9
    • M
      mm: do not ignore mapping_gfp_mask in page cache allocation paths · 6afdb859
      Committed by Michal Hocko
      page_cache_read, do_generic_file_read, __generic_file_splice_read and
      __ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling
      add_to_page_cache_lru which might cause recursion into fs down in the
      direct reclaim path if the mapping really relies on GFP_NOFS semantic.
      
      This doesn't seem to be the case now because page_cache_read (page fault
      path) doesn't seem to suffer from the reclaim recursion issues and
      do_generic_file_read and __generic_file_splice_read also shouldn't be
      called under fs locks which would deadlock in the reclaim path.  Anyway, it
      is better to obey the mapping gfp mask and prevent later breakage.
      
      [akpm@linux-foundation.org: coding-style fixes]
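
      A sketch of the pattern as applied in page_cache_read() (simplified; the
      other call sites follow the same idea of honouring mapping_gfp_mask()):

          static int page_cache_read(struct file *file, pgoff_t offset)
          {
                  struct address_space *mapping = file->f_mapping;
                  struct page *page;
                  int ret;

                  do {
                          page = page_cache_alloc_cold(mapping);
                          if (!page)
                                  return -ENOMEM;

                          /* Honour the mapping's gfp mask instead of plain GFP_KERNEL. */
                          ret = add_to_page_cache_lru(page, mapping, offset,
                                                      GFP_KERNEL & mapping_gfp_mask(mapping));
                          if (ret == 0)
                                  ret = mapping->a_ops->readpage(file, page);
                          else if (ret == -EEXIST)
                                  ret = 0; /* losing race to add is OK */

                          page_cache_release(page);
                  } while (ret == AOP_TRUNCATED_PAGE);

                  return ret;
          }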
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6afdb859
    • W
      mm/oom_kill.c: print points as unsigned int · f0d6647e
      Committed by Wang Long
      In oom_kill_process(), the variable 'points' is unsigned int.  Print it as
      such.
      Signed-off-by: NWang Long <long.wanglong@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0d6647e
    • M
      mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages · 33039678
      Committed by Mike Kravetz
      alloc_huge_page and hugetlb_reserve_pages use region_chg to calculate the
      number of pages which will be added to the reserve map.  Subpool and
      global reserve counts are adjusted based on the output of region_chg.
      Before the pages are actually added to the reserve map, these routines
      could race and add fewer pages than expected.  If this happens, the
      subpool and global reserve counts are not correct.
      
      Compare the number of pages actually added (region_add) to those expected to
      be added (region_chg).  If fewer pages are actually added, this indicates a
      race; adjust the counters accordingly.
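
      A sketch of the detection logic in hugetlb_reserve_pages() (names follow
      mm/hugetlb.c; the surrounding accounting is simplified):

          long add = region_add(resv_map, from, to);

          if (unlikely(chg > add)) {
                  /*
                   * Pages in this range were added to the reserve map between
                   * region_chg and region_add: a race with alloc_huge_page.
                   * Return the excess reservation to the subpool and the
                   * global pool.
                   */
                  long rsv_adjust = hugepage_subpool_put_pages(spool, chg - add);

                  hugetlb_acct_memory(h, -rsv_adjust);
          }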
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NDavidlohr Bueso <dave@stgolabs.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33039678
    • M
      mm/hugetlb: compute/return the number of regions added by region_add() · cf3ad20b
      Committed by Mike Kravetz
      Modify region_add() to keep track of regions (pages) added to the reserve
      map and return this value.  The return value can be compared to the return
      value of region_chg() to determine if the map was modified between calls.
      
      Make vma_commit_reservation() also pass along the return value of
      region_add().  In the normal case, we want vma_commit_reservation to
      return the same value as the preceding call to vma_needs_reservation.
      Create a common __vma_reservation_common routine to help keep the special
      case return values in sync.
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf3ad20b
    • M
      mm/hugetlb: document the reserve map/region tracking routines · 1dd308a7
      Committed by Mike Kravetz
      While working on hugetlbfs fallocate support, I noticed the following race
      in the existing code.  It is unlikely that this race is hit very often in
      the current code.  However, if more functionality to add and remove pages
      to hugetlbfs mappings (such as fallocate) is added the likelihood of
      hitting this race will increase.
      
      alloc_huge_page and hugetlb_reserve_pages use information from the reserve
      map to determine if there are enough available huge pages to complete the
      operation, as well as adjust global reserve and subpool usage counts.  The
      order of operations is as follows:
      
      - call region_chg() to determine the expected change based on reserve map
      - determine if enough resources are available for this operation
      - adjust global counts based on the expected change
      - call region_add() to update the reserve map
      
      The issue is that reserve map could change between the call to region_chg
      and region_add.  In this case, the counters which were adjusted based on
      the output of region_chg will not be correct.
      
      In order to hit this race today, there must be an existing shared hugetlb
      mmap created with the MAP_NORESERVE flag.  A page fault to allocate a huge
      page via this mapping must occur at the same time another task is mapping the
      same region without the MAP_NORESERVE flag.
      
      The patch set does not prevent the race from happening.  Rather, it adds
      simple functionality to detect when the race has occurred.  If a race is
      detected, then the incorrect counts are adjusted.
      
      Review comments pointed out the need for documentation of the existing
      region/reserve map routines.  This patch set also adds documentation in
      this area.
      
      This patch (of 3):
      
      This is a documentation only patch and does not modify any code.
      Descriptions of the routines used for reserve map/region tracking are
      added.
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1dd308a7
    • M
      Documentation/vm/unevictable-lru.txt: clarify MAP_LOCKED behavior · 9b012a29
      Committed by Michal Hocko
      There is a very subtle difference between mmap()+mlock() vs
      mmap(MAP_LOCKED) semantics.  The former fails if the population of the
      area fails while the latter does not.  This basically means that
      mmap(MAP_LOCKED) areas might see a major fault after the mmap syscall
      returns, which is not the case for mlock.  The mmap man page has already
      been altered but Documentation/vm/unevictable-lru.txt deserves a
      clarification as well.
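
      A small userspace illustration of the difference (hypothetical example, not
      from the patch; the 1 GiB size is only meant to exceed RLIMIT_MEMLOCK):

          #include <stdio.h>
          #include <sys/mman.h>

          int main(void)
          {
                  size_t len = 1UL << 30;

                  /* Variant 1: population/locking failure is reported by mlock(). */
                  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                  if (p != MAP_FAILED && mlock(p, len))
                          perror("mlock");        /* failure is visible here */

                  /* Variant 2: mmap(MAP_LOCKED) may still succeed even if
                   * population failed, so later accesses can take major faults. */
                  void *q = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
                  if (q == MAP_FAILED)
                          perror("mmap(MAP_LOCKED)");
                  return 0;
          }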
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9b012a29
    • L
      mm: nommu: refactor debug and warning prints · 22cc877b
      Committed by Leon Romanovsky
      kenter/kleave/kdebug are wrapper macros to print function flow and debug
      information.  This set was written before pr_devel() was introduced, so it
      was controlled by an "#if 0" construction.  It is questionable if anyone is
      using them [1] now.
      
      This patch removes these macros, converts numerous printk(KERN_WARNING,
      ...) to use general pr_warn(...) and removes debug print line from
      validate_mmap_request() function.
      Signed-off-by: NLeon Romanovsky <leon@leon.nu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22cc877b
    • A
      mm: clarify that the function operates on hugepage pte · 8809aa2d
      Committed by Aneesh Kumar K.V
      We have confusing functions to clear pmd, pmd_clear_* and pmd_clear.  Add
      _huge_ to pmdp_clear functions so that we are clear that they operate on
      hugepage pte.
      
      We don't bother about other functions like pmdp_set_wrprotect,
      pmdp_clear_flush_young, because they operate on PTE bits and hence
      indicate they are operating on hugepage ptes
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8809aa2d
    • A
      powerpc/mm: use generic version of pmdp_clear_flush() · f28b6ff8
      Committed by Aneesh Kumar K.V
      Also move the pmd_trans_huge check to generic code.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f28b6ff8
    • A
      mm/thp: split out pmd collapse flush into separate functions · 15a25b2e
      Committed by Aneesh Kumar K.V
      Architectures like ppc64 [1] need to do special things while clearing pmd
      before a collapse.  For them this operation is largely different from a
      normal hugepage pte clear.  Hence add a separate function to clear pmd
      before collapse.  After this patch pmdp_* functions operate only on
      hugepage pte, and not on regular pmd_t values pointing to page table.
      
      [1] ppc64 needs to invalidate all the normal page pte mappings we already
      have inserted in the hardware hash page table.  But before doing that we
      need to make sure there are no parallel hash page table insert going on.
      So we need to do a kick_all_cpus_sync() before flushing the older hash
      table entries.  By moving this to a separate function we capture these
      details and mention how it is different from a hugepage pte clear.
      
      This patch is a cleanup and only does code movement for clarity.  There
      should not be any change in functionality.
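
      A sketch of the generic helper that the split introduces (assuming the
      pmdp_huge_* naming from the rename earlier in this series; ppc64 overrides
      it to also flush hash table entries after a kick_all_cpus_sync()):

          pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
                                    pmd_t *pmdp)
          {
                  pmd_t pmd;

                  /* The pmd being collapsed points to a page table, not a huge page. */
                  VM_BUG_ON(address & ~HPAGE_PMD_MASK);
                  VM_BUG_ON(pmd_trans_huge(*pmdp));
                  pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
                  flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
                  return pmd;
          }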
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15a25b2e
    • X
      tracing: add trace event for memory-failure · 97f0b134
      Committed by Xie XiuQi
      RAS user space tools like rasdaemon, which are based on trace events, can
      receive the mce error event but get no memory recovery result event.  So, I
      want to add this event to make the scenario complete.

      This patch adds an event to the ras group for memory-failure.

      The output looks like below:
      #  tracer: nop
      #
      #  entries-in-buffer/entries-written: 2/2   #P:24
      #
      #                               _-----=> irqs-off
      #                              / _----=> need-resched
      #                             | / _---=> hardirq/softirq
      #                             || / _--=> preempt-depth
      #                             ||| /     delay
      #            TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
      #               | |       |   ||||       |         |
             mce-inject-13150 [001] ....   277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed
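
      A sketch of what the event definition looks like (field names and the
      symbolic-print tables are assumptions based on the changelog and the output
      above, not a verbatim copy of the upstream header):

          TRACE_EVENT(memory_failure_event,
                  TP_PROTO(unsigned long pfn, int type, int result),
                  TP_ARGS(pfn, type, result),

                  TP_STRUCT__entry(
                          __field(unsigned long, pfn)
                          __field(int, type)
                          __field(int, result)
                  ),

                  TP_fast_assign(
                          __entry->pfn    = pfn;
                          __entry->type   = type;
                          __entry->result = result;
                  ),

                  TP_printk("pfn %#lx: recovery action for %s: %s",
                            __entry->pfn,
                            __print_symbolic(__entry->type, MF_PAGE_TYPE),
                            __print_symbolic(__entry->result, MF_ACTION_RESULT))
          );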
      
      [xiexiuqi@huawei.com: fix build error]
      Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97f0b134
    • X
      memory-failure: change type of action_result's param 3 to enum · cc3e2af4
      Committed by Xie XiuQi
      Change the type of action_result's param 3 to enum for type consistency,
      and rename mf_outcome to mf_result for clarity.
      Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc3e2af4
    • X
      memory-failure: export page_type and action result · cc637b17
      Committed by Xie XiuQi
      Export 'outcome' and 'action_page_type' to mm.h, so we can use
      these enums outside.
      
      This patch is preparation for adding trace events for memory-failure
      recovery action.
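
      A sketch of one of the exported enums (member names follow the recovery
      outcomes printed by memory-failure; the page-type enum is exported
      analogously and is abbreviated here):

          enum mf_result {
                  MF_IGNORED,     /* Error: cannot be handled */
                  MF_FAILED,      /* Error: handling failed */
                  MF_DELAYED,     /* Will be handled later */
                  MF_RECOVERED,   /* Successfully recovered */
          };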
      Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc637b17