1. 20 11月, 2018 1 次提交
    • B
      xfs: fix shared extent data corruption due to missing cow reservation · 59e42931
      Brian Foster 提交于
      Page writeback indirectly handles shared extents via the existence
      of overlapping COW fork blocks. If COW fork blocks exist, writeback
      always performs the associated copy-on-write regardless if the
      underlying blocks are actually shared. If the blocks are shared,
      then overlapping COW fork blocks must always exist.
      
      fstests shared/010 reproduces a case where a buffered write occurs
      over a shared block without performing the requisite COW fork
      reservation.  This ultimately causes writeback to the shared extent
      and data corruption that is detected across md5 checks of the
      filesystem across a mount cycle.
      
      The problem occurs when a buffered write lands over a shared extent
      that crosses an extent size hint boundary and that also happens to
      have a partial COW reservation that doesn't cover the start and end
      blocks of the data fork extent.
      
      For example, a buffered write occurs across the file offset (in FSB
      units) range of [29, 57]. A shared extent exists at blocks [29, 35]
      and COW reservation already exists at blocks [32, 34]. After
      accommodating a COW extent size hint of 32 blocks and the existing
      reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
      blocks of reservation at offset 0 and returns with COW reservation
      across the range of [0, 34]. The associated data fork extent is
      still [29, 35], however, which isn't fully covered by the COW
      reservation.
      
      This leads to a buffered write at file offset 35 over a shared
      extent without associated COW reservation. Writeback eventually
      kicks in, performs an overwrite of the underlying shared block and
      causes the associated data corruption.
      
      Update xfs_reflink_reserve_cow() to accommodate the fact that a
      delalloc allocation request may not fully cover the extent in the
      data fork. Trim the data fork extent appropriately, just as is done
      for shared extent boundaries and/or existing COW reservations that
      happen to overlap the start of the data fork extent. This prevents
      shared/010 failures due to data corruption on reflink enabled
      filesystems.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      59e42931
  2. 06 11月, 2018 3 次提交
    • D
      xfs: fix overflow in xfs_attr3_leaf_verify · 837514f7
      Dave Chinner 提交于
      generic/070 on 64k block size filesystems is failing with a verifier
      corruption on writeback or an attribute leaf block:
      
      [   94.973083] XFS (pmem0): Metadata corruption detected at xfs_attr3_leaf_verify+0x246/0x260, xfs_attr3_leaf block 0x811480
      [   94.975623] XFS (pmem0): Unmount and run xfs_repair
      [   94.976720] XFS (pmem0): First 128 bytes of corrupted metadata buffer:
      [   94.978270] 000000004b2e7b45: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
      [   94.980268] 000000006b1db90b: 00 00 00 00 00 81 14 80 00 00 00 00 00 00 00 00  ................
      [   94.982251] 00000000433f2407: 22 7b 5c 82 2d 5c 47 4c bb 31 1c 37 fa a9 ce d6  "{\.-\GL.1.7....
      [   94.984157] 0000000010dc7dfb: 00 00 00 00 00 81 04 8a 00 0a 18 e8 dd 94 01 00  ................
      [   94.986215] 00000000d5a19229: 00 a0 dc f4 fe 98 01 68 f0 d8 07 e0 00 00 00 00  .......h........
      [   94.988171] 00000000521df36c: 0c 2d 32 e2 fe 20 01 00 0c 2d 58 65 fe 0c 01 00  .-2.. ...-Xe....
      [   94.990162] 000000008477ae06: 0c 2d 5b 66 fe 8c 01 00 0c 2d 71 35 fe 7c 01 00  .-[f.....-q5.|..
      [   94.992139] 00000000a4a6bca6: 0c 2d 72 37 fc d4 01 00 0c 2d d8 b8 f0 90 01 00  .-r7.....-......
      [   94.994789] XFS (pmem0): xfs_do_force_shutdown(0x8) called from line 1453 of file fs/xfs/xfs_buf.c. Return address = ffffffff815365f3
      
      This is failing this check:
      
                      end = ichdr.freemap[i].base + ichdr.freemap[i].size;
                      if (end < ichdr.freemap[i].base)
      >>>>>                   return __this_address;
                      if (end > mp->m_attr_geo->blksize)
                              return __this_address;
      
      And from the buffer output above, the freemap array is:
      
      	freemap[0].base = 0x00a0
      	freemap[0].size = 0xdcf4	end = 0xdd94
      	freemap[1].base = 0xfe98
      	freemap[1].size = 0x0168	end = 0x10000
      	freemap[2].base = 0xf0d8
      	freemap[2].size = 0x07e0	end = 0xf8b8
      
      These all look valid - the block size is 0x10000 and so from the
      last check in the above verifier fragment we know that the end
      of freemap[1] is valid. The problem is that end is declared as:
      
      	uint16_t	end;
      
      And (uint16_t)0x10000 = 0. So we have a verifier bug here, not a
      corruption. Fix the verifier to use uint32_t types for the check and
      hence avoid the overflow.
      
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=201577Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      837514f7
    • D
      xfs: print buffer offsets when dumping corrupt buffers · bdec055b
      Darrick J. Wong 提交于
      Use DUMP_PREFIX_OFFSET when printing hex dumps of corrupt buffers
      because modern Linux now prints a 32-bit hash of our 64-bit pointer when
      using DUMP_PREFIX_ADDRESS:
      
      00000000b4bb4297: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
      00000005ec77e26: 00 00 00 00 02 d0 5a 00 00 00 00 00 00 00 00 00  ......Z.........
      000000015938018: 21 98 e8 b4 fd de 4c 07 bc ea 3c e5 ae b4 7c 48  !.....L...<...|H
      
      This is totally worthless for a sequential dump since we probably only
      care about tracking the buffer offsets and afaik there's no way to
      recover the actual pointer from the hashed value.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      bdec055b
    • C
      xfs: Fix error code in 'xfs_ioc_getbmap()' · 132bf672
      Christophe JAILLET 提交于
      In this function, once 'buf' has been allocated, we unconditionally
      return 0.
      However, 'error' is set to some error codes in several error handling
      paths.
      Before commit 232b5194 ("xfs: simplify the xfs_getbmap interface")
      this was not an issue because all error paths were returning directly,
      but now that some cleanup at the end may be needed, we must propagate the
      error code.
      
      Fixes: 232b5194 ("xfs: simplify the xfs_getbmap interface")
      Signed-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      132bf672
  3. 05 11月, 2018 5 次提交
    • L
      Linux 4.20-rc1 · 65102238
      Linus Torvalds 提交于
      65102238
    • L
      Merge tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs · 42bd06e9
      Linus Torvalds 提交于
      Pull UBIFS updates from Richard Weinberger:
      
       - Full filesystem authentication feature, UBIFS is now able to have the
         whole filesystem structure authenticated plus user data encrypted and
         authenticated.
      
       - Minor cleanups
      
      * tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs: (26 commits)
        ubifs: Remove unneeded semicolon
        Documentation: ubifs: Add authentication whitepaper
        ubifs: Enable authentication support
        ubifs: Do not update inode size in-place in authenticated mode
        ubifs: Add hashes and HMACs to default filesystem
        ubifs: authentication: Authenticate super block node
        ubifs: Create hash for default LPT
        ubfis: authentication: Authenticate master node
        ubifs: authentication: Authenticate LPT
        ubifs: Authenticate replayed journal
        ubifs: Add auth nodes to garbage collector journal head
        ubifs: Add authentication nodes to journal
        ubifs: authentication: Add hashes to index nodes
        ubifs: Add hashes to the tree node cache
        ubifs: Create functions to embed a HMAC in a node
        ubifs: Add helper functions for authentication support
        ubifs: Add separate functions to init/crc a node
        ubifs: Format changes for authentication support
        ubifs: Store read superblock node
        ubifs: Drop write_node
        ...
      42bd06e9
    • L
      Merge tag 'nfs-for-4.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 4710e789
      Linus Torvalds 提交于
      Pull NFS client bugfixes from Trond Myklebust:
       "Highlights include:
      
        Bugfix:
         - Fix build issues on architectures that don't provide 64-bit cmpxchg
      
        Cleanups:
         - Fix a spelling mistake"
      
      * tag 'nfs-for-4.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        NFS: fix spelling mistake, EACCESS -> EACCES
        SUNRPC: Use atomic(64)_t for seq_send(64)
      4710e789
    • L
      Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 35e74524
      Linus Torvalds 提交于
      Pull more timer updates from Thomas Gleixner:
       "A set of commits for the new C-SKY architecture timers"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        dt-bindings: timer: gx6605s SOC timer
        clocksource/drivers/c-sky: Add gx6605s SOC system timer
        dt-bindings: timer: C-SKY Multi-processor timer
        clocksource/drivers/c-sky: Add C-SKY SMP timer
      35e74524
    • L
      Merge tag 'ntb-4.20' of git://github.com/jonmason/ntb · 04578e84
      Linus Torvalds 提交于
      Pull NTB updates from Jon Mason:
       "Fairly minor changes and bug fixes:
      
        NTB IDT thermal changes and hook into hwmon, ntb_netdev clean-up of
        private struct, and a few bug fixes"
      
      * tag 'ntb-4.20' of git://github.com/jonmason/ntb:
        ntb: idt: Alter the driver info comments
        ntb: idt: Discard temperature sensor IRQ handler
        ntb: idt: Add basic hwmon sysfs interface
        ntb: idt: Alter temperature read method
        ntb_netdev: Simplify remove with client device drvdata
        NTB: transport: Try harder to alloc an aligned MW buffer
        ntb: ntb_transport: Mark expected switch fall-throughs
        ntb: idt: Set PCIe bus address to BARLIMITx
        NTB: ntb_hw_idt: replace IS_ERR_OR_NULL with regular NULL checks
        ntb: intel: fix return value for ndev_vec_mask()
        ntb_netdev: fix sleep time mismatch
      04578e84
  4. 04 11月, 2018 29 次提交
    • L
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 71e56028
      Linus Torvalds 提交于
      Pull scheduler fixes from Ingo Molnar:
       "A memory (under-)allocation fix and a comment fix"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/topology: Fix off by one bug
        sched/rt: Update comment in pick_next_task_rt()
      71e56028
    • L
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 601a8807
      Linus Torvalds 提交于
      Pull x86 fixes from Ingo Molnar:
       "A number of fixes and some late updates:
      
         - make in_compat_syscall() behavior on x86-32 similar to other
           platforms, this touches a number of generic files but is not
           intended to impact non-x86 platforms.
      
         - objtool fixes
      
         - PAT preemption fix
      
         - paravirt fixes/cleanups
      
         - cpufeatures updates for new instructions
      
         - earlyprintk quirk
      
         - make microcode version in sysfs world-readable (it is already
           world-readable in procfs)
      
         - minor cleanups and fixes"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        compat: Cleanup in_compat_syscall() callers
        x86/compat: Adjust in_compat_syscall() to generic code under !COMPAT
        objtool: Support GCC 9 cold subfunction naming scheme
        x86/numa_emulation: Fix uniform-split numa emulation
        x86/paravirt: Remove unused _paravirt_ident_32
        x86/mm/pat: Disable preemption around __flush_tlb_all()
        x86/paravirt: Remove GPL from pv_ops export
        x86/traps: Use format string with panic() call
        x86: Clean up 'sizeof x' => 'sizeof(x)'
        x86/cpufeatures: Enumerate MOVDIR64B instruction
        x86/cpufeatures: Enumerate MOVDIRI instruction
        x86/earlyprintk: Add a force option for pciserial device
        objtool: Support per-function rodata sections
        x86/microcode: Make revision and processor flags world-readable
      601a8807
    • L
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 01897f3e
      Linus Torvalds 提交于
      Pull perf updates and fixes from Ingo Molnar:
       "These are almost all tooling updates: 'perf top', 'perf trace' and
        'perf script' fixes and updates, an UAPI header sync with the merge
        window versions, license marker updates, much improved Sparc support
        from David Miller, and a number of fixes"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits)
        perf intel-pt/bts: Calculate cpumode for synthesized samples
        perf intel-pt: Insert callchain context into synthesized callchains
        perf tools: Don't clone maps from parent when synthesizing forks
        perf top: Start display thread earlier
        tools headers uapi: Update linux/if_link.h header copy
        tools headers uapi: Update linux/netlink.h header copy
        tools headers: Sync the various kvm.h header copies
        tools include uapi: Update linux/mmap.h copy
        perf trace beauty: Use the mmap flags table generated from headers
        perf beauty: Wire up the mmap flags table generator to the Makefile
        perf beauty: Add a generator for MAP_ mmap's flag constants
        tools include uapi: Update asound.h copy
        tools arch uapi: Update asm-generic/unistd.h and arm64 unistd.h copies
        tools include uapi: Update linux/fs.h copy
        perf callchain: Honour the ordering of PERF_CONTEXT_{USER,KERNEL,etc}
        perf cs-etm: Correct CPU mode for samples
        perf unwind: Take pgoff into account when reporting elf to libdwfl
        perf top: Do not use overwrite mode by default
        perf top: Allow disabling the overwrite mode
        perf trace: Beautify mount's first pathname arg
        ...
      01897f3e
    • L
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e9ebc215
      Linus Torvalds 提交于
      Pull irq fixes from Ingo Molnar:
       "An irqchip driver fix and a memory (over-)allocation fix"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/irq-mvebu-sei: Fix a NULL vs IS_ERR() bug in probe function
        irq/matrix: Fix memory overallocation
      e9ebc215
    • P
      sched/topology: Fix off by one bug · 993f0b05
      Peter Zijlstra 提交于
      With the addition of the NUMA identity level, we increased @level by
      one and will run off the end of the array in the distance sort loop.
      
      Fixed: 051f3ca0 ("sched/topology: Introduce NUMA identity node sched domain")
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      993f0b05
    • I
      23a12dde
    • L
      Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · d2ff0ff2
      Linus Torvalds 提交于
      Pull ARM SoC fixes from Olof Johansson:
       "A few fixes who have come in near or during the merge window:
      
         - Removal of a VLA usage in Marvell mpp platform code
      
         - Enable some IPMI options for ARM64 servers by default, helps
           testing
      
         - Enable PREEMPT on 32-bit ARMv7 defconfig
      
         - Minor fix for stm32 DT (removal of an unused DMA property)
      
         - Bugfix for TI OMAP1-based ams-delta (-EINVAL -> IRQ_NOTCONNECTED)"
      
      * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        ARM: dts: stm32: update HASH1 dmas property on stm32mp157c
        ARM: orion: avoid VLA in orion_mpp_conf
        ARM: defconfig: Update multi_v7 to use PREEMPT
        arm64: defconfig: Enable some IPMI configs
        soc: ti: QMSS: Fix usage of irq_set_affinity_hint
        ARM: OMAP1: ams-delta: Fix impossible .irq < 0
      d2ff0ff2
    • L
      Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 83650fd5
      Linus Torvalds 提交于
      Pull more arm64 updates from Catalin Marinas:
      
       - fix W+X page (mark RO) allocated by the arm64 kprobes code
      
       - Makefile fix for .i files in out of tree modules
      
      * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: kprobe: make page to RO mode when allocate it
        arm64: kdump: fix small typo
        arm64: makefile fix build of .i file in external module case
      83650fd5
    • L
      Merge tag 'dma-mapping-4.20-2' of git://git.infradead.org/users/hch/dma-mapping · 3308a383
      Linus Torvalds 提交于
      Pull dma-mapping fix from Christoph Hellwig:
       "Avoid compile warnings on non-default arm64 configs"
      
      * tag 'dma-mapping-4.20-2' of git://git.infradead.org/users/hch/dma-mapping:
        arm64: fix warnings without CONFIG_IOMMU_DMA
      3308a383
    • L
      Merge tag 'kbuild-v4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild · 9a12efc5
      Linus Torvalds 提交于
      Pull Kbuild updates from Masahiro Yamada:
      
       - clean-up leftovers in Kconfig files
      
       - remove stale oldnoconfig and silentoldconfig targets
      
       - remove unneeded cc-fullversion and cc-name variables
      
       - improve merge_config script to allow overriding option prefix
      
      * tag 'kbuild-v4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kbuild: remove cc-name variable
        kbuild: replace cc-name test with CONFIG_CC_IS_CLANG
        merge_config.sh: Allow to define config prefix
        kbuild: remove unused cc-fullversion variable
        kconfig: remove silentoldconfig target
        kconfig: remove oldnoconfig target
        powerpc: PCI_MSI needs PCI
        powerpc: remove CONFIG_MCA leftovers
        powerpc: remove CONFIG_PCI_QSPAN
        scsi: aha152x: rename the PCMCIA define
      9a12efc5
    • L
      Merge tag '4.20-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 · 16944728
      Linus Torvalds 提交于
      Pull cifs fixes and updates from Steve French:
       "Three small fixes (one Kerberos related, one for stable, and another
        fixes an oops in xfstest 377), two helpful debugging improvements,
        three patches for cifs directio and some minor cleanup"
      
      * tag '4.20-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: fix signed/unsigned mismatch on aio_read patch
        cifs: don't dereference smb_file_target before null check
        CIFS: Add direct I/O functions to file_operations
        CIFS: Add support for direct I/O write
        CIFS: Add support for direct I/O read
        smb3: missing defines and structs for reparse point handling
        smb3: allow more detailed protocol info on open files for debugging
        smb3: on kerberos mount if server doesn't specify auth type use krb5
        smb3: add trace point for tree connection
        cifs: fix spelling mistake, EACCESS -> EACCES
        cifs: fix return value for cifs_listxattr
      16944728
    • L
      Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · ed61a132
      Linus Torvalds 提交于
      Pull 9p fix from Al Viro:
       "Regression fix for net/9p handling of iov_iter; broken by braino when
        switching to iov_iter_is_kvec() et.al., spotted and fixed by Marc"
      
      * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        iov_iter: Fix 9p virtio breakage
      ed61a132
    • L
      Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · af102b33
      Linus Torvalds 提交于
      Pull more SCSI updates from James Bottomley:
       "This is a set of minor small (and safe changes) that didn't make the
        initial pull request plus some bug fixes"
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: mvsas: Remove set but not used variable 'id'
        scsi: qla2xxx: Remove two arguments from qlafx00_error_entry()
        scsi: qla2xxx: Make sure that qlafx00_ioctl_iosb_entry() initializes 'res'
        scsi: qla2xxx: Remove a set-but-not-used variable
        scsi: qla2xxx: Make qla2x00_sysfs_write_nvram() easier to analyze
        scsi: qla2xxx: Declare local functions 'static'
        scsi: qla2xxx: Improve several kernel-doc headers
        scsi: qla2xxx: Modify fall-through annotations
        scsi: 3w-sas: 3w-9xxx: Use unsigned char for cdb
        scsi: mvsas: Use dma_pool_zalloc
        scsi: target: Don't request modules that aren't even built
        scsi: target: Set response length for REPORT TARGET PORT GROUPS
      af102b33
    • L
      Merge branch 'akpm' (patches from Andrew) · cddfa11a
      Linus Torvalds 提交于
      Merge more updates from Andrew Morton:
      
       - more ocfs2 work
      
       - various leftovers
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        memory_hotplug: cond_resched in __remove_pages
        bfs: add sanity check at bfs_fill_super()
        kernel/sysctl.c: remove duplicated include
        kernel/kexec_file.c: remove some duplicated includes
        mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask
        ocfs2: fix clusters leak in ocfs2_defrag_extent()
        ocfs2: dlmglue: clean up timestamp handling
        ocfs2: don't put and assigning null to bh allocated outside
        ocfs2: fix a misuse a of brelse after failing ocfs2_check_dir_entry
        ocfs2: don't use iocb when EIOCBQUEUED returns
        ocfs2: without quota support, avoid calling quota recovery
        ocfs2: remove ocfs2_is_o2cb_active()
        mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
        include/linux/notifier.h: SRCU: fix ctags
        mm: handle no memcg case in memcg_kmem_charge() properly
      cddfa11a
    • M
      memory_hotplug: cond_resched in __remove_pages · dd33ad7b
      Michal Hocko 提交于
      We have received a bug report that unbinding a large pmem (>1TB) can
      result in a soft lockup:
      
        NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [ndctl:4365]
        [...]
        Supported: Yes
        CPU: 9 PID: 4365 Comm: ndctl Not tainted 4.12.14-94.40-default #1 SLE12-SP4
        Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
        task: ffff9cce7d4410c0 task.stack: ffffbe9eb1bc4000
        RIP: 0010:__put_page+0x62/0x80
        Call Trace:
         devm_memremap_pages_release+0x152/0x260
         release_nodes+0x18d/0x1d0
         device_release_driver_internal+0x160/0x210
         unbind_store+0xb3/0xe0
         kernfs_fop_write+0x102/0x180
         __vfs_write+0x26/0x150
         vfs_write+0xad/0x1a0
         SyS_write+0x42/0x90
         do_syscall_64+0x74/0x150
         entry_SYSCALL_64_after_hwframe+0x3d/0xa2
        RIP: 0033:0x7fd13166b3d0
      
      It has been reported on an older (4.12) kernel but the current upstream
      code doesn't cond_resched in the hot remove code at all and the given
      range to remove might be really large.  Fix the issue by calling
      cond_resched once per memory section.
      
      Link: http://lkml.kernel.org/r/20181031125840.23982-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Cc: Dan Williams <dan.j.williams@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd33ad7b
    • T
      bfs: add sanity check at bfs_fill_super() · 9f2df09a
      Tetsuo Handa 提交于
      syzbot is reporting too large memory allocation at bfs_fill_super() [1].
      Since file system image is corrupted such that bfs_sb->s_start == 0,
      bfs_fill_super() is trying to allocate 8MB of continuous memory. Fix
      this by adding a sanity check on bfs_sb->s_start, __GFP_NOWARN and
      printf().
      
      [1] https://syzkaller.appspot.com/bug?id=16a87c236b951351374a84c8a32f40edbc034e96
      
      Link: http://lkml.kernel.org/r/1525862104-3407-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Nsyzbot <syzbot+71c6b5d68e91149fc8a4@syzkaller.appspotmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f2df09a
    • M
      6f0483d1
    • Z
      kernel/kexec_file.c: remove some duplicated includes · 3383b360
      zhong jiang 提交于
      We include kexec.h and slab.h twice in kexec_file.c. It's unnecessary.
      hence just remove them.
      
      Link: http://lkml.kernel.org/r/1537498098-19171-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com>
      Reviewed-by: NBhupesh Sharma <bhsharma@redhat.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NBaoquan He <bhe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3383b360
    • M
      mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask · 89c83fb5
      Michal Hocko 提交于
      THP allocation mode is quite complex and it depends on the defrag mode.
      This complexity is hidden in alloc_hugepage_direct_gfpmask from a large
      part currently. The NUMA special casing (namely __GFP_THISNODE) is
      however independent and placed in alloc_pages_vma currently. This both
      adds an unnecessary branch to all vma based page allocation requests and
      it makes the code more complex unnecessarily as well. Not to mention
      that e.g. shmem THP used to do the node reclaiming unconditionally
      regardless of the defrag mode until recently. This was not only
      unexpected behavior but it was also hardly a good default behavior and I
      strongly suspect it was just a side effect of the code sharing more than
      a deliberate decision which suggests that such a layering is wrong.
      
      Get rid of the thp special casing from alloc_pages_vma and move the
      logic to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the
      resulting gfp mask only when the direct reclaim is not requested and
      when there is no explicit numa binding to preserve the current logic.
      
      Please note that there's also a slight difference wrt MPOL_BIND now. The
      previous code would avoid using __GFP_THISNODE if the local node was
      outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided
      for all MPOL_BIND policies. So there's a difference that if local node
      is actually allowed by the bind policy's nodemask, previously
      __GFP_THISNODE would be added, but now it won't be. From the behavior
      POV this is still correct because the policy nodemask is used.
      
      Link: http://lkml.kernel.org/r/20180925120326.24392-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89c83fb5
    • L
      ocfs2: fix clusters leak in ocfs2_defrag_extent() · 6194ae42
      Larry Chen 提交于
      ocfs2_defrag_extent() might leak allocated clusters.  When the file
      system has insufficient space, the number of claimed clusters might be
      less than the caller wants.  If that happens, the original code might
      directly commit the transaction without returning clusters.
      
      This patch is based on code in ocfs2_add_clusters_in_btree().
      
      [akpm@linux-foundation.org: include localalloc.h, reduce scope of data_ac]
      Link: http://lkml.kernel.org/r/20180904041621.16874-3-lchen@suse.comSigned-off-by: NLarry Chen <lchen@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6194ae42
    • A
      ocfs2: dlmglue: clean up timestamp handling · 3a3d1e51
      Arnd Bergmann 提交于
      The handling of timestamps outside of the 1970..2038 range in the dlm
      glue is rather inconsistent: on 32-bit architectures, this has always
      wrapped around to negative timestamps in the 1902..1969 range, while on
      64-bit kernels all timestamps are interpreted as positive 34 bit numbers
      in the 1970..2514 year range.
      
      Now that the VFS code handles 64-bit timestamps on all architectures, we
      can make the behavior more consistent here, and return the same result
      that we had on 64-bit already, making the file system y2038 safe in the
      process.  Outside of dlmglue, it already uses 64-bit on-disk timestamps
      anway, so that part is fine.
      
      For consistency, I'm changing ocfs2_pack_timespec() to clamp anything
      outside of the supported range to the minimum and maximum values.  This
      avoids a possible ambiguity of values before 1970 in particular, which
      used to be interpreted as times at the end of the 2514 range previously.
      
      Link: http://lkml.kernel.org/r/20180619155826.4106487-1-arnd@arndb.deSigned-off-by: NArnd Bergmann <arnd@arndb.de>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a3d1e51
    • C
      ocfs2: don't put and assigning null to bh allocated outside · cf76c785
      Changwei Ge 提交于
      ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
      several blocks from disk.  Currently, the input argument *bhs* can be
      NULL or NOT.  It depends on the caller's behavior.  If the function
      fails in reading blocks from disk, the corresponding bh will be assigned
      to NULL and put.
      
      Obviously, above process for non-NULL input bh is not appropriate.
      Because the caller doesn't even know its bhs are put and re-assigned.
      
      If buffer head is managed by caller, ocfs2_read_blocks and
      ocfs2_read_blocks_sync() should not evaluate it to NULL.  It will cause
      caller accessing illegal memory, thus crash.
      
      Link: http://lkml.kernel.org/r/HK2PR06MB045285E0F4FBB561F9F2F9B3D5680@HK2PR06MB0452.apcprd06.prod.outlook.comSigned-off-by: NChangwei Ge <ge.changwei@h3c.com>
      Reviewed-by: NGuozhonghua <guozhonghua@h3c.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf76c785
    • C
      ocfs2: fix a misuse a of brelse after failing ocfs2_check_dir_entry · 29aa3016
      Changwei Ge 提交于
      Somehow, file system metadata was corrupted, which causes
      ocfs2_check_dir_entry() to fail in function ocfs2_dir_foreach_blk_el().
      
      According to the original design intention, if above happens we should
      skip the problematic block and continue to retrieve dir entry.  But
      there is obviouse misuse of brelse around related code.
      
      After failure of ocfs2_check_dir_entry(), current code just moves to
      next position and uses the problematic buffer head again and again
      during which the problematic buffer head is released for multiple times.
      I suppose, this a serious issue which is long-lived in ocfs2.  This may
      cause other file systems which is also used in a the same host insane.
      
      So we should also consider about bakcporting this patch into linux
      -stable.
      
      Link: http://lkml.kernel.org/r/HK2PR06MB045211675B43EED794E597B6D56E0@HK2PR06MB0452.apcprd06.prod.outlook.comSigned-off-by: NChangwei Ge <ge.changwei@h3c.com>
      Suggested-by: NChangkuo Shi <shi.changkuo@h3c.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29aa3016
    • C
      ocfs2: don't use iocb when EIOCBQUEUED returns · 9e985787
      Changwei Ge 提交于
      When -EIOCBQUEUED returns, it means that aio_complete() will be called
      from dio_complete(), which is an asynchronous progress against
      write_iter.  Generally, IO is a very slow progress than executing
      instruction, but we still can't take the risk to access a freed iocb.
      
      And we do face a BUG crash issue.  Using the crash tool, iocb is
      obviously freed already.
      
        crash> struct -x kiocb ffff881a350f5900
        struct kiocb {
          ki_filp = 0xffff881a350f5a80,
          ki_pos = 0x0,
          ki_complete = 0x0,
          private = 0x0,
          ki_flags = 0x0
        }
      
      And the backtrace shows:
        ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
        aio_run_iocb+0x229/0x2f0
        do_io_submit+0x291/0x540
        SyS_io_submit+0x10/0x20
        system_call_fastpath+0x16/0x75
      
      Link: http://lkml.kernel.org/r/1523361653-14439-1-git-send-email-ge.changwei@h3c.comSigned-off-by: NChangwei Ge <ge.changwei@h3c.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9e985787
    • G
      ocfs2: without quota support, avoid calling quota recovery · 21158ca8
      Guozhonghua 提交于
      During one dead node's recovery by other node, quota recovery work will
      be queued.  We should avoid calling quota when it is not supported, so
      check the quota flags.
      
      Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA401071AC9FB@H3CMLB12-EX.srv.huawei-3com.comSigned-off-by: Nguozhonghua <guozhonghua@h3c.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21158ca8
    • G
      ocfs2: remove ocfs2_is_o2cb_active() · a6346447
      Gang He 提交于
      Remove ocfs2_is_o2cb_active().  We have similar functions to identify
      which cluster stack is being used via osb->osb_cluster_stack.
      
      Secondly, the current implementation of ocfs2_is_o2cb_active() is not
      totally safe.  Based on the design of stackglue, we need to get
      ocfs2_stack_lock before using ocfs2_stack related data structures, and
      that active_stack pointer can be NULL in the case of mount failure.
      
      Link: http://lkml.kernel.org/r/1495441079-11708-1-git-send-email-ghe@suse.comSigned-off-by: NGang He <ghe@suse.com>
      Reviewed-by: NJoseph Qi <jiangqi903@gmail.com>
      Reviewed-by: NEric Ren <zren@suse.com>
      Acked-by: NChangwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6346447
    • A
      mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings · ac5b2c18
      Andrea Arcangeli 提交于
      THP allocation might be really disruptive when allocated on NUMA system
      with the local node full or hard to reclaim.  Stefan has posted an
      allocation stall report on 4.12 based SLES kernel which suggests the
      same issue:
      
        kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
        kvm cpuset=/ mems_allowed=0-1
        CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
        Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
        Call Trace:
         dump_stack+0x5c/0x84
         warn_alloc+0xe0/0x180
         __alloc_pages_slowpath+0x820/0xc90
         __alloc_pages_nodemask+0x1cc/0x210
         alloc_pages_vma+0x1e5/0x280
         do_huge_pmd_wp_page+0x83f/0xf00
         __handle_mm_fault+0x93d/0x1060
         handle_mm_fault+0xc6/0x1b0
         __do_page_fault+0x230/0x430
         do_page_fault+0x2a/0x70
         page_fault+0x7b/0x80
         [...]
        Mem-Info:
        active_anon:126315487 inactive_anon:1612476 isolated_anon:5
         active_file:60183 inactive_file:245285 isolated_file:0
         unevictable:15657 dirty:286 writeback:1 unstable:0
         slab_reclaimable:75543 slab_unreclaimable:2509111
         mapped:81814 shmem:31764 pagetables:370616 bounce:0
         free:32294031 free_pcp:6233 free_cma:0
        Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
        Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
      
      The defrag mode is "madvise" and from the above report it is clear that
      the THP has been allocated for MADV_HUGEPAGA vma.
      
      Andrea has identified that the main source of the problem is
      __GFP_THISNODE usage:
      
      : The problem is that direct compaction combined with the NUMA
      : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
      : hard the local node, instead of failing the allocation if there's no
      : THP available in the local node.
      :
      : Such logic was ok until __GFP_THISNODE was added to the THP allocation
      : path even with MPOL_DEFAULT.
      :
      : The idea behind the __GFP_THISNODE addition, is that it is better to
      : provide local memory in PAGE_SIZE units than to use remote NUMA THP
      : backed memory. That largely depends on the remote latency though, on
      : threadrippers for example the overhead is relatively low in my
      : experience.
      :
      : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
      : extremely slow qemu startup with vfio, if the VM is larger than the
      : size of one host NUMA node. This is because it will try very hard to
      : unsuccessfully swapout get_user_pages pinned pages as result of the
      : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
      : allocations and instead of trying to allocate THP on other nodes (it
      : would be even worse without vfio type1 GUP pins of course, except it'd
      : be swapping heavily instead).
      
      Fix this by removing __GFP_THISNODE for THP requests which are
      requesting the direct reclaim.  This effectivelly reverts 5265047a
      on the grounds that the zone/node reclaim was known to be disruptive due
      to premature reclaim when there was memory free.  While it made sense at
      the time for HPC workloads without NUMA awareness on rare machines, it
      was ultimately harmful in the majority of cases.  The existing behaviour
      is similar, if not as widespare as it applies to a corner case but
      crucially, it cannot be tuned around like zone_reclaim_mode can.  The
      default behaviour should always be to cause the least harm for the
      common case.
      
      If there are specialised use cases out there that want zone_reclaim_mode
      in specific cases, then it can be built on top.  Longterm we should
      consider a memory policy which allows for the node reclaim like behavior
      for the specific memory ranges which would allow a
      
      [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
      
      Mel said:
      
      : Both patches look correct to me but I'm responding to this one because
      : it's the fix.  The change makes sense and moves further away from the
      : severe stalling behaviour we used to see with both THP and zone reclaim
      : mode.
      :
      : I put together a basic experiment with usemem configured to reference a
      : buffer multiple times that is 80% the size of main memory on a 2-socket
      : box with symmetric node sizes and defrag set to "always".  The defrag
      : setting is not the default but it would be functionally similar to
      : accessing a buffer with madvise(MADV_HUGEPAGE).  Usemem is configured to
      : reference the buffer multiple times and while it's not an interesting
      : workload, it would be expected to complete reasonably quickly as it fits
      : within memory.  The results were;
      :
      : usemem
      :                                   vanilla           noreclaim-v1
      : Amean     Elapsd-1       42.78 (   0.00%)       26.87 (  37.18%)
      : Amean     Elapsd-3       27.55 (   0.00%)        7.44 (  73.00%)
      : Amean     Elapsd-4        5.72 (   0.00%)        5.69 (   0.45%)
      :
      : This shows the elapsed time in seconds for 1 thread, 3 threads and 4
      : threads referencing buffers 80% the size of memory.  With the patches
      : applied, it's 37.18% faster for the single thread and 73% faster with two
      : threads.  Note that 4 threads showing little difference does not indicate
      : the problem is related to thread counts.  It's simply the case that 4
      : threads gets spread so their workload mostly fits in one node.
      :
      : The overall view from /proc/vmstats is more startling
      :
      :                          4.19.0-rc1  4.19.0-rc1
      :                             vanillanoreclaim-v1r1
      : Minor Faults               35593425      708164
      : Major Faults                 484088          36
      : Swap Ins                    3772837           0
      : Swap Outs                   3932295           0
      :
      : Massive amounts of swap in/out without the patch
      :
      : Direct pages scanned        6013214           0
      : Kswapd pages scanned              0           0
      : Kswapd pages reclaimed            0           0
      : Direct pages reclaimed      4033009           0
      :
      : Lots of reclaim activity without the patch
      :
      : Kswapd efficiency              100%        100%
      : Kswapd velocity               0.000       0.000
      : Direct efficiency               67%        100%
      : Direct velocity           11191.956       0.000
      :
      : Mostly from direct reclaim context as you'd expect without the patch.
      :
      : Page writes by reclaim  3932314.000       0.000
      : Page writes file                 19           0
      : Page writes anon            3932295           0
      : Page reclaim immediate        42336           0
      :
      : Writes from reclaim context is never good but the patch eliminates it.
      :
      : We should never have default behaviour to thrash the system for such a
      : basic workload.  If zone reclaim mode behaviour is ever desired but on a
      : single task instead of a global basis then the sensible option is to build
      : a mempolicy that enforces that behaviour.
      
      This was a severe regression compared to previous kernels that made
      important workloads unusable and it starts when __GFP_THISNODE was
      added to THP allocations under MADV_HUGEPAGE.  It is not a significant
      risk to go to the previous behavior before __GFP_THISNODE was added, it
      worked like that for years.
      
      This was simply an optimization to some lucky workloads that can fit in
      a single node, but it ended up breaking the VM for others that can't
      possibly fit in a single node, so going back is safe.
      
      [mhocko@suse.com: rewrote the changelog based on the one from Andrea]
      Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
      Fixes: 5265047a ("mm, thp: really limit transparent hugepage allocation to local node")
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NStefan Priebe <s.priebe@profihost.ag>
      Debugged-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: NMel Gorman <mgorman@techsingularity.net>
      Tested-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac5b2c18
    • S
      include/linux/notifier.h: SRCU: fix ctags · 94e297c5
      Sam Protsenko 提交于
      ctags indexing ("make tags" command) throws this warning:
      
          ctags: Warning: include/linux/notifier.h:125:
          null expansion of name pattern "\1"
      
      This is the result of DEFINE_PER_CPU() macro expansion.  Fix that by
      getting rid of line break.
      
      Similar fix was already done in commit 25528213 ("tags: Fix
      DEFINE_PER_CPU expansions"), but this one probably wasn't noticed.
      
      Link: http://lkml.kernel.org/r/20181030202808.28027-1-semen.protsenko@linaro.org
      Fixes: 9c80172b ("kernel/SRCU: provide a static initializer")
      Signed-off-by: NSam Protsenko <semen.protsenko@linaro.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94e297c5
    • R
      mm: handle no memcg case in memcg_kmem_charge() properly · e68599a3
      Roman Gushchin 提交于
      Mike Galbraith reported a regression caused by the commit 9b6f7e16
      ("mm: rework memcg kernel stack accounting") on a system with
      "cgroup_disable=memory" boot option: the system panics with the following
      stack trace:
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
        PGD 0 P4D 0
        Oops: 0002 [#1] PREEMPT SMP PTI
        CPU: 0 PID: 1 Comm: systemd Not tainted 4.19.0-preempt+ #410
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fed4
        RIP: 0010:page_counter_try_charge+0x22/0xc0
        Code: 41 5d c3 c3 0f 1f 40 00 0f 1f 44 00 00 48 85 ff 0f 84 a7 00 00 00 41 56 48 89 f8 49 89 fe 49
        Call Trace:
         try_charge+0xcb/0x780
         memcg_kmem_charge_memcg+0x28/0x80
         memcg_kmem_charge+0x8b/0x1d0
         copy_process.part.41+0x1ca/0x2070
         _do_fork+0xd7/0x3d0
         do_syscall_64+0x5a/0x180
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The problem occurs because get_mem_cgroup_from_current() returns the NULL
      pointer if memory controller is disabled.  Let's check if this is a case
      at the beginning of memcg_kmem_charge() and just return 0 if
      mem_cgroup_disabled() returns true.  This is how we handle this case in
      many other places in the memory controller code.
      
      Link: http://lkml.kernel.org/r/20181029215123.17830-1-guro@fb.com
      Fixes: 9b6f7e16 ("mm: rework memcg kernel stack accounting")
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Reported-by: NMike Galbraith <efault@gmx.de>
      Acked-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e68599a3
  5. 03 11月, 2018 2 次提交