1. 11 6月, 2017 1 次提交
    • W
      KVM: async_pf: avoid async pf injection when in guest mode · 9bc1f09f
      Wanpeng Li 提交于
       INFO: task gnome-terminal-:1734 blocked for more than 120 seconds.
             Not tainted 4.12.0-rc4+ #8
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       gnome-terminal- D    0  1734   1015 0x00000000
       Call Trace:
        __schedule+0x3cd/0xb30
        schedule+0x40/0x90
        kvm_async_pf_task_wait+0x1cc/0x270
        ? __vfs_read+0x37/0x150
        ? prepare_to_swait+0x22/0x70
        do_async_page_fault+0x77/0xb0
        ? do_async_page_fault+0x77/0xb0
        async_page_fault+0x28/0x30
      
      This is triggered by running both win7 and win2016 on L1 KVM simultaneously,
      and then gives stress to memory on L1, I can observed this hang on L1 when
      at least ~70% swap area is occupied on L0.
      
      This is due to async pf was injected to L2 which should be injected to L1,
      L2 guest starts receiving pagefault w/ bogus %cr2(apf token from the host
      actually), and L1 guest starts accumulating tasks stuck in D state in
      kvm_async_pf_task_wait() since missing PAGE_READY async_pfs.
      
      This patch fixes the hang by doing async pf when executing L1 guest.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9bc1f09f
  2. 08 6月, 2017 3 次提交
    • P
      Merge tag 'kvm-s390-master-4.12-1' of... · 9e53932d
      Paolo Bonzini 提交于
      Merge tag 'kvm-s390-master-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      KVM: s390: Fix for master (4.12)
      
      - The newly created AIS capability enables the feature unconditionally
        and ignores the cpu model
      9e53932d
    • W
      KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulation · a3641631
      Wanpeng Li 提交于
      If "i" is the last element in the vcpu->arch.cpuid_entries[] array, it
      potentially can be exploited the vulnerability. this will out-of-bounds
      read and write.  Luckily, the effect is small:
      
      	/* when no next entry is found, the current entry[i] is reselected */
      	for (j = i + 1; ; j = (j + 1) % nent) {
      		struct kvm_cpuid_entry2 *ej = &vcpu->arch.cpuid_entries[j];
      		if (ej->function == e->function) {
      
      It reads ej->maxphyaddr, which is user controlled.  However...
      
      			ej->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT;
      
      After cpuid_entries there is
      
      	int maxphyaddr;
      	struct x86_emulate_ctxt emulate_ctxt;  /* 16-byte aligned */
      
      So we have:
      
      - cpuid_entries at offset 1B50 (6992)
      - maxphyaddr at offset 27D0 (6992 + 3200 = 10192)
      - padding at 27D4...27DF
      - emulate_ctxt at 27E0
      
      And it writes in the padding.  Pfew, writing the ops field of emulate_ctxt
      would have been much worse.
      
      This patch fixes it by modding the index to avoid the out-of-bounds
      access. Worst case, i == j and ej->function == e->function,
      the loop can bail out.
      Reported-by: NMoguofang <moguofang@huawei.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Guofang Mo <moguofang@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a3641631
    • P
      Merge tag 'kvm-arm-for-v4.12-rc5-take2' of... · 38a4f43d
      Paolo Bonzini 提交于
      Merge tag 'kvm-arm-for-v4.12-rc5-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/ARM Fixes for v4.12-rc5 - Take 2
      
      Changes include:
       - Fix an issue with migrating GICv2 VMs on GICv3 systems.
       - Squashed a bug for gicv3 when figuring out preemption levels.
       - Fix a potential null pointer derefence in KVM happening under memory
         pressure.
       - Maintain RES1 bits in the SCTLR_EL2 to make sure KVM works on new
         architecture revisions.
       - Allow unaligned accesses at EL2/HYP
      38a4f43d
  3. 07 6月, 2017 3 次提交
  4. 06 6月, 2017 4 次提交
    • M
      KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages · d6dbdd3c
      Marc Zyngier 提交于
      Under memory pressure, we start ageing pages, which amounts to parsing
      the page tables. Since we don't want to allocate any extra level,
      we pass NULL for our private allocation cache. Which means that
      stage2_get_pud() is allowed to fail. This results in the following
      splat:
      
      [ 1520.409577] Unable to handle kernel NULL pointer dereference at virtual address 00000008
      [ 1520.417741] pgd = ffff810f52fef000
      [ 1520.421201] [00000008] *pgd=0000010f636c5003, *pud=0000010f56f48003, *pmd=0000000000000000
      [ 1520.429546] Internal error: Oops: 96000006 [#1] PREEMPT SMP
      [ 1520.435156] Modules linked in:
      [ 1520.438246] CPU: 15 PID: 53550 Comm: qemu-system-aar Tainted: G        W       4.12.0-rc4-00027-g1885c397eaec #7205
      [ 1520.448705] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB12A 10/26/2016
      [ 1520.463726] task: ffff800ac5fb4e00 task.stack: ffff800ce04e0000
      [ 1520.469666] PC is at stage2_get_pmd+0x34/0x110
      [ 1520.474119] LR is at kvm_age_hva_handler+0x44/0xf0
      [ 1520.478917] pc : [<ffff0000080b137c>] lr : [<ffff0000080b149c>] pstate: 40000145
      [ 1520.486325] sp : ffff800ce04e33d0
      [ 1520.489644] x29: ffff800ce04e33d0 x28: 0000000ffff40064
      [ 1520.494967] x27: 0000ffff27e00000 x26: 0000000000000000
      [ 1520.500289] x25: ffff81051ba65008 x24: 0000ffff40065000
      [ 1520.505618] x23: 0000ffff40064000 x22: 0000000000000000
      [ 1520.510947] x21: ffff810f52b20000 x20: 0000000000000000
      [ 1520.516274] x19: 0000000058264000 x18: 0000000000000000
      [ 1520.521603] x17: 0000ffffa6fe7438 x16: ffff000008278b70
      [ 1520.526940] x15: 000028ccd8000000 x14: 0000000000000008
      [ 1520.532264] x13: ffff7e0018298000 x12: 0000000000000002
      [ 1520.537582] x11: ffff000009241b93 x10: 0000000000000940
      [ 1520.542908] x9 : ffff0000092ef800 x8 : 0000000000000200
      [ 1520.548229] x7 : ffff800ce04e36a8 x6 : 0000000000000000
      [ 1520.553552] x5 : 0000000000000001 x4 : 0000000000000000
      [ 1520.558873] x3 : 0000000000000000 x2 : 0000000000000008
      [ 1520.571696] x1 : ffff000008fd5000 x0 : ffff0000080b149c
      [ 1520.577039] Process qemu-system-aar (pid: 53550, stack limit = 0xffff800ce04e0000)
      [...]
      [ 1521.510735] [<ffff0000080b137c>] stage2_get_pmd+0x34/0x110
      [ 1521.516221] [<ffff0000080b149c>] kvm_age_hva_handler+0x44/0xf0
      [ 1521.522054] [<ffff0000080b0610>] handle_hva_to_gpa+0xb8/0xe8
      [ 1521.527716] [<ffff0000080b3434>] kvm_age_hva+0x44/0xf0
      [ 1521.532854] [<ffff0000080a58b0>] kvm_mmu_notifier_clear_flush_young+0x70/0xc0
      [ 1521.539992] [<ffff000008238378>] __mmu_notifier_clear_flush_young+0x88/0xd0
      [ 1521.546958] [<ffff00000821eca0>] page_referenced_one+0xf0/0x188
      [ 1521.552881] [<ffff00000821f36c>] rmap_walk_anon+0xec/0x250
      [ 1521.558370] [<ffff000008220f78>] rmap_walk+0x78/0xa0
      [ 1521.563337] [<ffff000008221104>] page_referenced+0x164/0x180
      [ 1521.569002] [<ffff0000081f1af0>] shrink_active_list+0x178/0x3b8
      [ 1521.574922] [<ffff0000081f2058>] shrink_node_memcg+0x328/0x600
      [ 1521.580758] [<ffff0000081f23f4>] shrink_node+0xc4/0x328
      [ 1521.585986] [<ffff0000081f2718>] do_try_to_free_pages+0xc0/0x340
      [ 1521.592000] [<ffff0000081f2a64>] try_to_free_pages+0xcc/0x240
      [...]
      
      The trivial fix is to handle this NULL pud value early, rather than
      dereferencing it blindly.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: NChristoffer Dall <cdall@linaro.org>
      Signed-off-by: NChristoffer Dall <cdall@linaro.org>
      d6dbdd3c
    • W
      KVM: nVMX: Fix exception injection · d4912215
      Wanpeng Li 提交于
       WARNING: CPU: 3 PID: 2840 at arch/x86/kvm/vmx.c:10966 nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel]
       CPU: 3 PID: 2840 Comm: qemu-system-x86 Tainted: G           OE   4.12.0-rc3+ #23
       RIP: 0010:nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel]
       Call Trace:
        ? kvm_check_async_pf_completion+0xef/0x120 [kvm]
        ? rcu_read_lock_sched_held+0x79/0x80
        vmx_queue_exception+0x104/0x160 [kvm_intel]
        ? vmx_queue_exception+0x104/0x160 [kvm_intel]
        kvm_arch_vcpu_ioctl_run+0x1171/0x1ce0 [kvm]
        ? kvm_arch_vcpu_load+0x47/0x240 [kvm]
        ? kvm_arch_vcpu_load+0x62/0x240 [kvm]
        kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
        ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
        ? __fget+0xf3/0x210
        do_vfs_ioctl+0xa4/0x700
        ? __fget+0x114/0x210
        SyS_ioctl+0x79/0x90
        do_syscall_64+0x81/0x220
        entry_SYSCALL64_slow_path+0x25/0x25
      
      This is triggered occasionally by running both win7 and win2016 in L2, in
      addition, EPT is disabled on both L1 and L2. It can't be reproduced easily.
      
      Commit 0b6ac343 (KVM: nVMX: Correct handling of exception injection) mentioned
      that "KVM wants to inject page-faults which it got to the guest. This function
      assumes it is called with the exit reason in vmcs02 being a #PF exception".
      Commit e011c663 (KVM: nVMX: Check all exceptions for intercept during delivery to
      L2) allows to check all exceptions for intercept during delivery to L2. However,
      there is no guarantee the exit reason is exception currently, when there is an
      external interrupt occurred on host, maybe a time interrupt for host which should
      not be injected to guest, and somewhere queues an exception, then the function
      nested_vmx_check_exception() will be called and the vmexit emulation codes will
      try to emulate the "Acknowledge interrupt on exit" behavior, the warning is
      triggered.
      
      Reusing the exit reason from the L2->L0 vmexit is wrong in this case,
      the reason must always be EXCEPTION_NMI when injecting an exception into
      L1 as a nested vmexit.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Fixes: e011c663 ("KVM: nVMX: Check all exceptions for intercept during delivery to L2")
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      d4912215
    • P
      kvm: async_pf: fix rcu_irq_enter() with irqs enabled · bbaf0e2b
      Paolo Bonzini 提交于
      native_safe_halt enables interrupts, and you just shouldn't
      call rcu_irq_enter() with interrupts enabled.  Reorder the
      call with the following local_irq_disable() to respect the
      invariant.
      Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      bbaf0e2b
    • C
      KVM: arm/arm64: vgic-v3: Fix nr_pre_bits bitfield extraction · d68356cc
      Christoffer Dall 提交于
      We used to extract PRIbits from the ICH_VT_EL2 which was the upper field
      in the register word, so a mask wasn't necessary, but as we switched to
      looking at PREbits, which is bits 26 through 28 with the PRIbits field
      being potentially non-zero, we really need to mask off the field value,
      otherwise fun things may happen.
      Signed-off-by: NChristoffer Dall <cdall@linaro.org>
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      d68356cc
  5. 05 6月, 2017 8 次提交
    • L
      Linux 4.12-rc4 · 3c2993b8
      Linus Torvalds 提交于
      3c2993b8
    • R
      fs/ufs: Set UFS default maximum bytes per file · 239e250e
      Richard Narron 提交于
      This fixes a problem with reading files larger than 2GB from a UFS-2
      file system:
      
          https://bugzilla.kernel.org/show_bug.cgi?id=195721
      
      The incorrect UFS s_maxsize limit became a problem as of commit
      c2a9737f ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
      which started using s_maxbytes to avoid a page index overflow in
      do_generic_file_read().
      
      That caused files to be truncated on UFS-2 file systems because the
      default maximum file size is 2GB (MAX_NON_LFS) and UFS didn't update it.
      
      Here I simply increase the default to a common value used by other file
      systems.
      Signed-off-by: NRichard Narron <comet.berkeley@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Will B <will.brokenbourgh2877@gmail.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: <stable@vger.kernel.org> # v4.9 and backports of c2a9737fSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      239e250e
    • L
      Merge tag 'nfs-for-4.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 125f42b0
      Linus Torvalds 提交于
      Pull NFS client bugfixes from Trond Myklebust:
       "Bugfixes include:
      
         - Fix a typo in commit e0926934 ("NFS append COMMIT after
           synchronous COPY") that breaks copy offload
      
         - Fix the connect error propagation in xs_tcp_setup_socket()
      
         - Fix a lock leak in nfs40_walk_client_list
      
         - Verify that pNFS requests lie within the offset range of the layout
           segment"
      
      * tag 'nfs-for-4.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        nfs: Mark unnecessarily extern functions as static
        SUNRPC: ensure correct error is reported by xs_tcp_setup_socket()
        NFSv4.0: Fix a lock leak in nfs40_walk_client_list
        pnfs: Fix the check for requests in range of layout segment
        xprtrdma: Delete an error message for a failed memory allocation in xprt_rdma_bc_setup()
        pNFS/flexfiles: missing error code in ff_layout_alloc_lseg()
        NFS fix COMMIT after COPY
      125f42b0
    • L
      Merge tag 'tty-4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 3c06e6cb
      Linus Torvalds 提交于
      Pull tty fix from Greg KH:
       "Here is a single tty core fix for 4.12-rc4. It reverts a patch that a
        lot of people reported as causing lockdep and other warnings.
      
        Right after I reverted this in my tree, it seems like another
        "correct" fix might have shown up, but it's too late in the release
        cycle to be messing with tty core locking, so let's just revert this
        for now to go back how things always have been and try it again for
        4.13.
      
        This has not been in linux-next as I only reverted it a few hours ago"
      
      * tag 'tty-4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        Revert "tty: fix port buffer locking"
      3c06e6cb
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · e00811b4
      Linus Torvalds 提交于
      Pull input subsystem fixes from Dmitry Torokhov:
      
       - a couple of regression fixes in synaptics and axp20x-pek drivers
      
       - try to ease transition from PS/2 to RMI for Synaptics touchpad users
         by ensuring we do not try to activate RMI mode when RMI SMBus support
         is not enabled, and nag users a bit to enable it
      
       - plus a couple of other changes that seemed worthwhile for this
         release
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: axp20x-pek - switch to acpi_dev_present and check for ACPI0011 too
        Input: axp20x-pek - only check for "INTCFD9" ACPI device on Cherry Trail
        Input: tm2-touchkey - use LEN_ON as boolean value instead of LED_FULL
        Input: synaptics - tell users to report when they should be using rmi-smbus
        Input: synaptics - warn the users when there is a better mode
        Input: synaptics - keep PS/2 around when RMI4_SMB is not enabled
        Input: synaptics - clear device info before filling in
        Input: silead - disable interrupt during suspend
      e00811b4
    • L
      Merge tag 'rtc-4.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux · 9f03b2c7
      Linus Torvalds 提交于
      Pull RTC fixlet from Alexandre Belloni:
       "A single patch, not really a fix but I don't think there is any reason
        to delay it.
      
        Change the mailing list address"
      
      * tag 'rtc-4.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
        MAINTAINERS: update RTC mailing list
      9f03b2c7
    • L
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 1f915b7f
      Linus Torvalds 提交于
      Pull SCSI fixes from James Bottomley:
       "This is nine fixes, seven of which are for the qedi driver (new as of
        4.10) the other two are a use after free in the cxgbi drivers and a
        potential NULL dereference in the rdac device handler"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: libcxgbi: fix skb use after free
        scsi: qedi: Fix endpoint NULL panic during recovery.
        scsi: qedi: set max_fin_rt default value
        scsi: qedi: Set firmware tcp msl timer value.
        scsi: qedi: Fix endpoint NULL panic in qedi_set_path.
        scsi: qedi: Set dma_boundary to 0xfff.
        scsi: qedi: Correctly set firmware max supported BDs.
        scsi: qedi: Fix bad pte call trace when iscsiuio is stopped.
        scsi: scsi_dh_rdac: Use ctlr directly in rdac_failover_get()
      1f915b7f
    • L
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma · 55cbdaf6
      Linus Torvalds 提交于
      Pull rdma fixes from Doug Ledford:
       "For the most part this is just a minor -rc cycle for the rdma
        subsystem. Even given that this is all of the -rc patches since the
        merge window closed, it's still only about 25 patches:
      
         - Multiple i40iw, nes, iw_cxgb4, hfi1, qib, mlx4, mlx5 fixes
      
         - A few upper layer protocol fixes (IPoIB, iSER, SRP)
      
         - A modest number of core fixes"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (26 commits)
        RDMA/SA: Fix kernel panic in CMA request handler flow
        RDMA/umem: Fix missing mmap_sem in get umem ODP call
        RDMA/core: not to set page dirty bit if it's already set.
        RDMA/uverbs: Declare local function static and add brackets to sizeof
        RDMA/netlink: Reduce exposure of RDMA netlink functions
        RDMA/srp: Fix NULL deref at srp_destroy_qp()
        RDMA/IPoIB: Limit the ipoib_dev_uninit_default scope
        RDMA/IPoIB: Replace netdev_priv with ipoib_priv for ipoib_get_link_ksettings
        RDMA/qedr: add null check before pointer dereference
        RDMA/mlx5: set UMR wqe fence according to HCA cap
        net/mlx5: Define interface bits for fencing UMR wqe
        RDMA/mlx4: Fix MAD tunneling when SRIOV is enabled
        RDMA/qib,hfi1: Fix MR reference count leak on write with immediate
        RDMA/hfi1: Defer setting VL15 credits to link-up interrupt
        RDMA/hfi1: change PCI bar addr assignments to Linux API functions
        RDMA/hfi1: fix array termination by appending NULL to attr array
        RDMA/iw_cxgb4: fix the calculation of ipv6 header size
        RDMA/iw_cxgb4: calculate t4_eq_status_entries properly
        RDMA/iw_cxgb4: Avoid touch after free error in ARP failure handlers
        RDMA/nes: ACK MPA Reply frame
        ...
      55cbdaf6
  6. 04 6月, 2017 2 次提交
  7. 03 6月, 2017 19 次提交
    • L
      Merge tag 'hwmon-for-linus-v4.12-rc4' of... · ea094f3c
      Linus Torvalds 提交于
      Merge tag 'hwmon-for-linus-v4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
      
      Pull hwmon fixes from Guenter Roeck:
       "A couple of patches for the aspeed pwm fan driver"
      
      * tag 'hwmon-for-linus-v4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        hwmon: (aspeed-pwm-tacho) make fan/pwm names start with index 1
        hwmon: (aspeed-pwm-tacho) Call of_node_put() on a node not claimed
        hwmon: (aspeed-pwm-tacho) On read failure return -ETIMEDOUT
        hwmon: (aspeed-pwm-tacho) Select REGMAP
      ea094f3c
    • L
      Merge tag 'for-linus-20170602' of git://git.infradead.org/linux-mtd · cc548740
      Linus Torvalds 提交于
      Pull MTD fixes from Brian Norris:
       "NAND updates from Boris:
      
        tango fixes:
         - Add missing MODULE_DEVICE_TABLE() in tango_nand.c
         - Update the number of corrected bitflips
      
        core fixes:
         - Fix a long standing memory leak in nand_scan_tail()
         - Fix several bugs introduced by the per-vendor init/detection
           infrastructure (introduced in 4.12)
         - Add a static specifier to nand_ooblayout_lp_hamming_ops definition"
      
      * tag 'for-linus-20170602' of git://git.infradead.org/linux-mtd:
        mtd: nand: make nand_ooblayout_lp_hamming_ops static
        mtd: nand: tango: Update ecc_stats.corrected
        mtd: nand: tango: Export OF device ID table as module aliases
        mtd: nand: samsung: warn about un-parseable ECC info
        mtd: nand: free vendor-specific resources in init failure paths
        mtd: nand: drop unneeded module.h include
        mtd: nand: don't leak buffers when ->scan_bbt() fails
      cc548740
    • S
      hwmon: (aspeed-pwm-tacho) make fan/pwm names start with index 1 · 5f348fa3
      Stefan Schaeckeler 提交于
      Make fan and pwm names in sysfs start with index 1 in accordance to
      Documentation/hwmon/sysfs-interface conventions.
      
      Current implementation starts with index 0, making tools such as
      sensors(1) skip the first fan.
      Signed-off-by: NStefan Schaeckeler <sschaeck@cisco.com>
      Fixes: 2d7a548a ("drivers: hwmon: Support for ASPEED PWM/Fan tach")
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      5f348fa3
    • S
      hwmon: (aspeed-pwm-tacho) Call of_node_put() on a node not claimed · 4d58e732
      Stefan Schaeckeler 提交于
      Call of_node_put() on a node claimed with of_node_get() or by any other
      means such as for_each_child_of_node().
      Signed-off-by: NStefan Schaeckeler <sschaeck@cisco.com>
      Fixes: 2d7a548a ("drivers: hwmon: Support for ASPEED PWM/Fan tach")
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      4d58e732
    • H
      Input: axp20x-pek - switch to acpi_dev_present and check for ACPI0011 too · 0fd5f221
      Hans de Goede 提交于
      acpi_dev_found checks that there is a matching ACPI node, but it
      may be disabled (_STA method returns 0) in which case the
      soc_button_array driver will not bind to it and axp20x-pek should
      handle the power-button.
      
      This commit switches from acpi_dev_found to acpi_dev_present to
      avoid not registering an input-dev for the powerbutton when there
      is a disabled PNP0C40 device.
      
      The ACPI-6.0 standard defines a standard gpio button device using
      the ACPI0011 HID replacing the custom PNP0C40 gpio device, many
      newer devices define both PNP0C40 and ACPI0011 devices enabling one
      or the other depending on whether the BIOS thinks it is going to boot
      Android or Windows.
      
      This commit adds a check for the ACPI0011 device, so that if
      either device is present *and* enabled we don't register an input-dev
      for the powerbutton.
      Signed-off-by: NHans de Goede <hdegoede@redhat.com>
      Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      0fd5f221
    • H
      Input: axp20x-pek - only check for "INTCFD9" ACPI device on Cherry Trail · 8d4b3137
      Hans de Goede 提交于
      Commit 9b13a4ca ("Input: axp20x-pek - do not register input device
      on some systems") added a check for the INTCFD9 ACPI device which also
      handles the powerbutton as on some systems the powerbutton is connected
      to both the PMIC, handled by axp20x-pek, and to a gpio on the SoC, handled
      by soc_button_array which attaches itself to the INTCFD9 ACPI device.
      
      Testing + comparing DSDTs has shown that this only happens on Cherry
      Trail devices with an AXP288 PMIC, the AXP288 PMIC is also used on
      Bay Trail devices but there the power button is only connected to
      the PMIC and not handled by soc_button_array.
      
      This means that the INTCFD9 check has caused a regression on Bay Trail
      devices, causing power-button presses to no longer be seen.
      
      This commit fixes this by limiting the check to devices where the ACPI
      node for the AXP288 contains a _HRV (hardware revision) attribute with
      a value of 3 which indicates we are dealing with a Cherry Trail platform.
      
      Fixes: 9b13a4ca ("Input: axp20x-pek - do not register input ...")
      Reported-by: NСергей Трусов <t.rus76@ya.ru>
      Signed-off-by: NHans de Goede <hdegoede@redhat.com>
      Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      8d4b3137
    • D
      Merge tag 'v4.12-rc3' into for-linus · eadcbfa5
      Dmitry Torokhov 提交于
      Merge with mainline to get acpi_dev_present() needed by patches to
      axp20x-pek driver.
      eadcbfa5
    • L
      Merge tag 'acpi-4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 104c08ba
      Linus Torvalds 提交于
      Pull ACPI fixes from Rafael Wysocki:
       "These revert one more problematic commit related to the ACPI-based
        handling of laptop lids and make some unuseful error messages coming
        from ACPICA go away.
      
        Specifics:
      
         - Revert one more commit related to the ACPI-based handling of laptop
           lids that changed the default behavior on laptops that booted with
           closed lids and introduced a regression there (Benjamin Tissoires).
      
         - Add a missing acpi_put_table() to the code implementing the
           /sys/firmware/acpi/tables interface to prevent a counter in the
           ACPICA core from overflowing (Dan Williams).
      
         - Drop error messages printed by ACPICA on acpi_get_table() reference
           counting mismatches as they need not indicate real errors at this
           point (Lv Zheng)"
      
      * tag 'acpi-4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPICA: Tables: Fix regression introduced by a too early mechanism enabling
        Revert "ACPI / button: Change default behavior to lid_init_state=open"
        ACPI / sysfs: fix acpi_get_table() leak / acpi-sysfs denial of service
      104c08ba
    • L
      Merge tag 'pm-4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 89af529a
      Linus Torvalds 提交于
      Pull power management fixes from Rafael Wysocki:
       "These fix two bugs in error code paths in the cpufreq core and in the
        kirkwood-cpufreq driver.
      
        Specifics:
      
         - Make cpufreq_register_driver() return an error if the ->init()
           calls fail for all CPUs to prevent non-functional drivers from
           hanging around for no reason (David Arcari).
      
         - Make kirkwood-cpufreq check the return value of
           clk_prepare_enable() (which may fail) as appropriate (Arvind
           Yadav)"
      
      * tag 'pm-4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpufreq: kirkwood-cpufreq:- Handle return value of clk_prepare_enable()
        cpufreq: cpufreq_register_driver() should return -ENODEV if init fails
      89af529a
    • L
      Merge tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random · 5a4829b5
      Linus Torvalds 提交于
      Pull /dev/random bug fix from Ted Ts'o:
       "Fix a race on architectures with prioritized interrupts (such as m68k)
        which can causes crashes in drivers/char/random.c:get_reg()"
      
      * tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
        fix race in drivers/char/random.c:get_reg()
      5a4829b5
    • L
      Merge branch 'akpm' (patches from Andrew) · f2197649
      Linus Torvalds 提交于
      Merge misc fixes from Andrew Morton:
       "15 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        scripts/gdb: make lx-dmesg command work (reliably)
        mm: consider memblock reservations for deferred memory initialization sizing
        mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified
        mlock: fix mlock count can not decrease in race condition
        mm/migrate: fix refcount handling when !hugepage_migration_supported()
        dax: fix race between colliding PMD & PTE entries
        mm: avoid spurious 'bad pmd' warning messages
        mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once
        pcmcia: remove left-over %Z format
        slub/memcg: cure the brainless abuse of sysfs attributes
        initramfs: fix disabling of initramfs (and its compression)
        mm: clarify why we want kmalloc before falling backto vmallock
        frv: declare jiffies to be located in the .data section
        include/linux/gfp.h: fix ___GFP_NOLOCKDEP value
        ksm: prevent crash after write_protect_page fails
      f2197649
    • A
      scripts/gdb: make lx-dmesg command work (reliably) · d6c97087
      André Draszik 提交于
      lx-dmesg needs access to the log_buf symbol from printk.c.
      Unfortunately, the symbol log_buf also exists in BPF's verifier.c and
      hence gdb can pick one or the other.  If it happens to pick BPF's
      log_buf, lx-dmesg doesn't work:
      
        (gdb) lx-dmesg
        Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x0:
        Error occurred in Python command: Cannot access memory at address 0x0
        (gdb) p log_buf
        $15 = 0x0
      
      Luckily, GDB has a way to deal with this, see
        https://sourceware.org/gdb/onlinedocs/gdb/Symbols.html
      
        (gdb) info variables ^log_buf$
        All variables matching regular expression "^log_buf$":
      
        File <linux.git>/kernel/bpf/verifier.c:
        static char *log_buf;
      
        File <linux.git>/kernel/printk/printk.c:
        static char *log_buf;
        (gdb) p 'verifier.c'::log_buf
        $1 = 0x0
        (gdb) p 'printk.c'::log_buf
        $2 = 0x811a6aa0 <__log_buf> ""
        (gdb) p &log_buf
        $3 = (char **) 0x8120fe40 <log_buf>
        (gdb) p &'verifier.c'::log_buf
        $4 = (char **) 0x8120fe40 <log_buf>
        (gdb) p &'printk.c'::log_buf
        $5 = (char **) 0x8048b7d0 <log_buf>
      
      By being explicit about the location of the symbol, we can make lx-dmesg
      work again.  While at it, do the same for the other symbols we need from
      printk.c
      
      Link: http://lkml.kernel.org/r/20170526112222.3414-1-git@andred.netSigned-off-by: NAndré Draszik <git@andred.net>
      Tested-by: NKieran Bingham <kieran@bingham.xyz>
      Acked-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6c97087
    • M
      mm: consider memblock reservations for deferred memory initialization sizing · 864b9a39
      Michal Hocko 提交于
      We have seen an early OOM killer invocation on ppc64 systems with
      crashkernel=4096M:
      
      	kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
      	kthreadd cpuset=/ mems_allowed=7
      	CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
      	Call Trace:
      	  dump_stack+0xb0/0xf0 (unreliable)
      	  dump_header+0xb0/0x258
      	  out_of_memory+0x5f0/0x640
      	  __alloc_pages_nodemask+0xa8c/0xc80
      	  kmem_getpages+0x84/0x1a0
      	  fallback_alloc+0x2a4/0x320
      	  kmem_cache_alloc_node+0xc0/0x2e0
      	  copy_process.isra.25+0x260/0x1b30
      	  _do_fork+0x94/0x470
      	  kernel_thread+0x48/0x60
      	  kthreadd+0x264/0x330
      	  ret_from_kernel_thread+0x5c/0xa4
      
      	Mem-Info:
      	active_anon:0 inactive_anon:0 isolated_anon:0
      	 active_file:0 inactive_file:0 isolated_file:0
      	 unevictable:0 dirty:0 writeback:0 unstable:0
      	 slab_reclaimable:5 slab_unreclaimable:73
      	 mapped:0 shmem:0 pagetables:0 bounce:0
      	 free:0 free_pcp:0 free_cma:0
      	Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      	lowmem_reserve[]: 0 0 0 0
      	Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
      	0 total pagecache pages
      	0 pages in swap cache
      	Swap cache stats: add 0, delete 0, find 0/0
      	Free swap  = 0kB
      	Total swap = 0kB
      	819200 pages RAM
      	0 pages HighMem/MovableOnly
      	817481 pages reserved
      	0 pages cma reserved
      	0 pages hwpoisoned
      
      the reason is that the managed memory is too low (only 110MB) while the
      rest of the the 50GB is still waiting for the deferred intialization to
      be done.  update_defer_init estimates the initial memoty to initialize
      to 2GB at least but it doesn't consider any memory allocated in that
      range.  In this particular case we've had
      
      	Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)
      
      so the low 2GB is mostly depleted.
      
      Fix this by considering memblock allocations in the initial static
      initialization estimation.  Move the max_initialise to
      reset_deferred_meminit and implement a simple memblock_reserved_memory
      helper which iterates all reserved blocks and sums the size of all that
      start below the given address.  The cumulative size is than added on top
      of the initial estimation.  This is still not ideal because
      reset_deferred_meminit doesn't consider holes and so reservation might
      be above the initial estimation whihch we ignore but let's make the
      logic simpler until we really need to handle more complicated cases.
      
      Fixes: 3a80a7fa ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
      Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Tested-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      864b9a39
    • J
      mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified · 9a291a7c
      James Morse 提交于
      KVM uses get_user_pages() to resolve its stage2 faults.  KVM sets the
      FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
      finds a VM_FAULT_HWPOISON.  KVM handles these hwpoison pages as a
      special case.  (check_user_page_hwpoison())
      
      When huge pages are involved, this doesn't work so well.
      get_user_pages() calls follow_hugetlb_page(), which stops early if it
      receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning
      -EFAULT to the caller.  The step to map this to -EHWPOISON based on the
      FOLL_ flags is missing.  The hwpoison special case is skipped, and
      -EFAULT is returned to user-space, causing Qemu or kvmtool to exit.
      
      Instead, move this VM_FAULT_ to errno mapping code into a header file
      and use it from faultin_page() and follow_hugetlb_page().
      
      With this, KVM works as expected.
      
      This isn't a problem for arm64 today as we haven't enabled
      MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86
      too, so I think this should be a fix.  This doesn't apply earlier than
      stable's v4.11.1 due to all sorts of cleanup.
      
      [james.morse@arm.com: add vm_fault_to_errno() call to faultin_page()]
      suggested.
        Link: http://lkml.kernel.org/r/20170525171035.16359-1-james.morse@arm.com
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170524160900.28786-1-james.morse@arm.comSigned-off-by: NJames Morse <james.morse@arm.com>
      Acked-by: NPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>	[4.11.1+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a291a7c
    • Y
      mlock: fix mlock count can not decrease in race condition · 70feee0e
      Yisheng Xie 提交于
      Kefeng reported that when running the follow test, the mlock count in
      meminfo will increase permanently:
      
       [1] testcase
       linux:~ # cat test_mlockal
       grep Mlocked /proc/meminfo
        for j in `seq 0 10`
        do
       	for i in `seq 4 15`
       	do
       		./p_mlockall >> log &
       	done
       	sleep 0.2
       done
       # wait some time to let mlock counter decrease and 5s may not enough
       sleep 5
       grep Mlocked /proc/meminfo
      
       linux:~ # cat p_mlockall.c
       #include <sys/mman.h>
       #include <stdlib.h>
       #include <stdio.h>
      
       #define SPACE_LEN	4096
      
       int main(int argc, char ** argv)
       {
      	 	int ret;
      	 	void *adr = malloc(SPACE_LEN);
      	 	if (!adr)
      	 		return -1;
      
      	 	ret = mlockall(MCL_CURRENT | MCL_FUTURE);
      	 	printf("mlcokall ret = %d\n", ret);
      
      	 	ret = munlockall();
      	 	printf("munlcokall ret = %d\n", ret);
      
      	 	free(adr);
      	 	return 0;
      	 }
      
      In __munlock_pagevec() we should decrement NR_MLOCK for each page where
      we clear the PageMlocked flag.  Commit 1ebb7cc6 ("mm: munlock: batch
      NR_MLOCK zone state updates") has introduced a bug where we don't
      decrement NR_MLOCK for pages where we clear the flag, but fail to
      isolate them from the lru list (e.g.  when the pages are on some other
      cpu's percpu pagevec).  Since PageMlocked stays cleared, the NR_MLOCK
      accounting gets permanently disrupted by this.
      
      Fix it by counting the number of page whose PageMlock flag is cleared.
      
      Fixes: 1ebb7cc6 (" mm: munlock: batch NR_MLOCK zone state updates")
      Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.comSigned-off-by: NYisheng Xie <xieyisheng1@huawei.com>
      Reported-by: NKefeng Wang <wangkefeng.wang@huawei.com>
      Tested-by: NKefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhongjiang <zhongjiang@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70feee0e
    • P
      mm/migrate: fix refcount handling when !hugepage_migration_supported() · 30809f55
      Punit Agrawal 提交于
      On failing to migrate a page, soft_offline_huge_page() performs the
      necessary update to the hugepage ref-count.
      
      But when !hugepage_migration_supported() , unmap_and_move_hugepage()
      also decrements the page ref-count for the hugepage.  The combined
      behaviour leaves the ref-count in an inconsistent state.
      
      This leads to soft lockups when running the overcommitted hugepage test
      from mce-tests suite.
      
        Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
        soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head)
        INFO: rcu_preempt detected stalls on CPUs/tasks:
         Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
          (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
          thugetlb_overco R  running task        0  2715   2685 0x00000008
          Call trace:
            dump_backtrace+0x0/0x268
            show_stack+0x24/0x30
            sched_show_task+0x134/0x180
            rcu_print_detail_task_stall_rnp+0x54/0x7c
            rcu_check_callbacks+0xa74/0xb08
            update_process_times+0x34/0x60
            tick_sched_handle.isra.7+0x38/0x70
            tick_sched_timer+0x4c/0x98
            __hrtimer_run_queues+0xc0/0x300
            hrtimer_interrupt+0xac/0x228
            arch_timer_handler_phys+0x3c/0x50
            handle_percpu_devid_irq+0x8c/0x290
            generic_handle_irq+0x34/0x50
            __handle_domain_irq+0x68/0xc0
            gic_handle_irq+0x5c/0xb0
      
      Address this by changing the putback_active_hugepage() in
      soft_offline_huge_page() to putback_movable_pages().
      
      This only triggers on systems that enable memory failure handling
      (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration
      (!ARCH_ENABLE_HUGEPAGE_MIGRATION).
      
      I imagine this wasn't triggered as there aren't many systems running
      this configuration.
      
      [akpm@linux-foundation.org: remove dead comment, per Naoya]
      Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.comReported-by: NManoj Iyer <manoj.iyer@canonical.com>
      Tested-by: NManoj Iyer <manoj.iyer@canonical.com>
      Suggested-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NPunit Agrawal <punit.agrawal@arm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[3.14+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      30809f55
    • R
      dax: fix race between colliding PMD & PTE entries · e2093926
      Ross Zwisler 提交于
      We currently have two related PMD vs PTE races in the DAX code.  These
      can both be easily triggered by having two threads reading and writing
      simultaneously to the same private mapping, with the key being that
      private mapping reads can be handled with PMDs but private mapping
      writes are always handled with PTEs so that we can COW.
      
      Here is the first race:
      
        CPU 0					CPU 1
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
          handle_pte_fault()
            passes check for pmd_devmap()
      
      					(private mapping read)
      					__handle_mm_fault()
      					  create_huge_pmd()
      					    dax_iomap_pmd_fault() inserts PMD
      
            dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
            			  installed in our page tables at this spot.
      
      Here's the second race:
      
        CPU 0					CPU 1
      
        (private mapping read)
        __handle_mm_fault()
          passes check for pmd_none()
          create_huge_pmd()
            dax_iomap_pmd_fault() inserts PMD
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
      					(private mapping read)
      					__handle_mm_fault()
      					  passes check for pmd_none()
      					  create_huge_pmd()
      
          handle_pte_fault()
            dax_iomap_pte_fault() inserts PTE
      					    dax_iomap_pmd_fault() inserts PMD,
      					       but we already have a PTE at
      					       this spot.
      
      The core of the issue is that while there is isolation between faults to
      the same range in the DAX fault handlers via our DAX entry locking,
      there is no isolation between faults in the code in mm/memory.c.  This
      means for instance that this code in __handle_mm_fault() can run:
      
      	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
      		ret = create_huge_pmd(&vmf);
      
      But by the time we actually get to run the fault handler called by
      create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
      fault has installed a normal PMD here as a parent.  This is the cause of
      the 2nd race.  The first race is similar - there is the following check
      in handle_pte_fault():
      
      	} else {
      		/* See comment in pte_alloc_one_map() */
      		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
      			return 0;
      
      So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
      will bail and retry the fault.  This is correct, but there is nothing
      preventing the PMD from being installed after this check but before we
      actually get to the DAX PTE fault handlers.
      
      In my testing these races result in the following types of errors:
      
        BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
        BUG: non-zero nr_ptes on freeing mm: 15
      
      Fix this issue by having the DAX fault handlers verify that it is safe
      to continue their fault after they have taken an entry lock to block
      other racing faults.
      
      [ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
        Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: NPawel Lebioda <pawel.lebioda@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2093926
    • R
      mm: avoid spurious 'bad pmd' warning messages · d0f0931d
      Ross Zwisler 提交于
      When the pmd_devmap() checks were added by 5c7fb56e ("mm, dax:
      dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge
      pages, they were all added to the end of if() statements after existing
      pmd_trans_huge() checks.  So, things like:
      
        -       if (pmd_trans_huge(*pmd))
        +       if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
      
      When further checks were added after pmd_trans_unstable() checks by
      commit 7267ec00 ("mm: postpone page table allocation until we have
      page to map") they were also added at the end of the conditional:
      
        +       if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
      
      This ordering is fine for pmd_trans_huge(), but doesn't work for
      pmd_trans_unstable().  This is because DAX huge pages trip the bad_pmd()
      check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
      pmd_trans_unstable()), which prints out a warning and returns 1.  So, we
      do end up doing the right thing, but only after spamming dmesg with
      suspicious looking messages:
      
        mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)
      
      Reorder these checks in a helper so that pmd_devmap() is checked first,
      avoiding the error messages, and add a comment explaining why the
      ordering is important.
      
      Fixes: commit 7267ec00 ("mm: postpone page table allocation until we have page to map")
      Link: http://lkml.kernel.org/r/20170522215749.23516-1-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0f0931d
    • T
      mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once · c288983d
      Tetsuo Handa 提交于
      Roman Gushchin has reported that the OOM killer can trivially selects
      next OOM victim when a thread doing memory allocation from page fault
      path was selected as first OOM victim.
      
          allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           oom_kill_process+0x219/0x3e0
           out_of_memory+0x11d/0x480
           __alloc_pages_slowpath+0xc84/0xd40
           __alloc_pages_nodemask+0x245/0x260
           alloc_pages_vma+0xa2/0x270
           __handle_mm_fault+0xca9/0x10c0
           handle_mm_fault+0xf3/0x210
           __do_page_fault+0x240/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
          Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
          allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           __alloc_pages_slowpath+0xd32/0xd40
           __alloc_pages_nodemask+0x245/0x260
           alloc_pages_vma+0xa2/0x270
           __handle_mm_fault+0xca9/0x10c0
           handle_mm_fault+0xf3/0x210
           __do_page_fault+0x240/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
          ...
          allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           oom_kill_process+0x219/0x3e0
           out_of_memory+0x11d/0x480
           pagefault_out_of_memory+0x68/0x80
           mm_fault_error+0x8f/0x190
           ? handle_mm_fault+0xf3/0x210
           __do_page_fault+0x4b2/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
          Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
      
      There is a race window that the OOM reaper completes reclaiming the
      first victim's memory while nothing but mutex_trylock() prevents the
      first victim from calling out_of_memory() from pagefault_out_of_memory()
      after memory allocation for page fault path failed due to being selected
      as an OOM victim.
      
      This is a side effect of commit 9a67f648 ("mm: consolidate
      GFP_NOFAIL checks in the allocator slowpath") because that commit
      silently changed the behavior from
      
          /* Avoid allocations with no watermarks from looping endlessly */
      
      to
      
          /*
           * Give up allocations without trying memory reserves if selected
           * as an OOM victim
           */
      
      in __alloc_pages_slowpath() by moving the location to check TIF_MEMDIE
      flag.  I have noticed this change but I didn't post a patch because I
      thought it is an acceptable change other than noise by warn_alloc()
      because !__GFP_NOFAIL allocations are allowed to fail.  But we
      overlooked that failing memory allocation from page fault path makes
      difference due to the race window explained above.
      
      While it might be possible to add a check to pagefault_out_of_memory()
      that prevents the first victim from calling out_of_memory() or remove
      out_of_memory() from pagefault_out_of_memory(), changing
      pagefault_out_of_memory() does not suppress noise by warn_alloc() when
      allocating thread was selected as an OOM victim.  There is little point
      with printing similar backtraces and memory information from both
      out_of_memory() and warn_alloc().
      
      Instead, if we guarantee that current thread can try allocations with no
      watermarks once when current thread looping inside
      __alloc_pages_slowpath() was selected as an OOM victim, we can follow "who
      can use memory reserves" rules and suppress noise by warn_alloc() and
      prevent memory allocations from page fault path from calling
      pagefault_out_of_memory().
      
      If we take the comment literally, this patch would do
      
        -    if (test_thread_flag(TIF_MEMDIE))
        -        goto nopage;
        +    if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC))
        +        goto nopage;
      
      because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is
      given.  But if I recall correctly (I couldn't find the message), the
      condition is meant to apply to only OOM victims despite the comment.
      Therefore, this patch preserves TIF_MEMDIE check.
      
      Fixes: 9a67f648 ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
      Link: http://lkml.kernel.org/r/201705192112.IAF69238.OQOHSJLFOFFMtV@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: NRoman Gushchin <guro@fb.com>
      Tested-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.11]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c288983d