1. 27 6月, 2016 8 次提交
    • Q
      KVM: nVMX: VMX instructions: fix segment checks when L1 is in long mode. · ff30ef40
      Quentin Casasnovas 提交于
      I couldn't get Xen to boot a L2 HVM when it was nested under KVM - it was
      getting a GP(0) on a rather unspecial vmread from Xen:
      
           (XEN) ----[ Xen-4.7.0-rc  x86_64  debug=n  Not tainted ]----
           (XEN) CPU:    1
           (XEN) RIP:    e008:[<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
           (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d1v0)
           (XEN) rax: ffff82d0801e6288   rbx: ffff83003ffbfb7c   rcx: fffffffffffab928
           (XEN) rdx: 0000000000000000   rsi: 0000000000000000   rdi: ffff83000bdd0000
           (XEN) rbp: ffff83000bdd0000   rsp: ffff83003ffbfab0   r8:  ffff830038813910
           (XEN) r9:  ffff83003faf3958   r10: 0000000a3b9f7640   r11: ffff83003f82d418
           (XEN) r12: 0000000000000000   r13: ffff83003ffbffff   r14: 0000000000004802
           (XEN) r15: 0000000000000008   cr0: 0000000080050033   cr4: 00000000001526e0
           (XEN) cr3: 000000003fc79000   cr2: 0000000000000000
           (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
           (XEN) Xen code around <ffff82d0801e629e> (vmx_get_segment_register+0x14e/0x450):
           (XEN)  00 00 41 be 02 48 00 00 <44> 0f 78 74 24 08 0f 86 38 56 00 00 b8 08 68 00
           (XEN) Xen stack trace from rsp=ffff83003ffbfab0:
      
           ...
      
           (XEN) Xen call trace:
           (XEN)    [<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
           (XEN)    [<ffff82d0801f3695>] get_page_from_gfn_p2m+0x165/0x300
           (XEN)    [<ffff82d0801bfe32>] hvmemul_get_seg_reg+0x52/0x60
           (XEN)    [<ffff82d0801bfe93>] hvm_emulate_prepare+0x53/0x70
           (XEN)    [<ffff82d0801ccacb>] handle_mmio+0x2b/0xd0
           (XEN)    [<ffff82d0801be591>] emulate.c#_hvm_emulate_one+0x111/0x2c0
           (XEN)    [<ffff82d0801cd6a4>] handle_hvm_io_completion+0x274/0x2a0
           (XEN)    [<ffff82d0801f334a>] __get_gfn_type_access+0xfa/0x270
           (XEN)    [<ffff82d08012f3bb>] timer.c#add_entry+0x4b/0xb0
           (XEN)    [<ffff82d08012f80c>] timer.c#remove_entry+0x7c/0x90
           (XEN)    [<ffff82d0801c8433>] hvm_do_resume+0x23/0x140
           (XEN)    [<ffff82d0801e4fe7>] vmx_do_resume+0xa7/0x140
           (XEN)    [<ffff82d080164aeb>] context_switch+0x13b/0xe40
           (XEN)    [<ffff82d080128e6e>] schedule.c#schedule+0x22e/0x570
           (XEN)    [<ffff82d08012c0cc>] softirq.c#__do_softirq+0x5c/0x90
           (XEN)    [<ffff82d0801602c5>] domain.c#idle_loop+0x25/0x50
           (XEN)
           (XEN)
           (XEN) ****************************************
           (XEN) Panic on CPU 1:
           (XEN) GENERAL PROTECTION FAULT
           (XEN) [error_code=0000]
           (XEN) ****************************************
      
      Tracing my host KVM showed it was the one injecting the GP(0) when
      emulating the VMREAD and checking the destination segment permissions in
      get_vmx_mem_address():
      
           3)               |    vmx_handle_exit() {
           3)               |      handle_vmread() {
           3)               |        nested_vmx_check_permission() {
           3)               |          vmx_get_segment() {
           3)   0.074 us    |            vmx_read_guest_seg_base();
           3)   0.065 us    |            vmx_read_guest_seg_selector();
           3)   0.066 us    |            vmx_read_guest_seg_ar();
           3)   1.636 us    |          }
           3)   0.058 us    |          vmx_get_rflags();
           3)   0.062 us    |          vmx_read_guest_seg_ar();
           3)   3.469 us    |        }
           3)               |        vmx_get_cs_db_l_bits() {
           3)   0.058 us    |          vmx_read_guest_seg_ar();
           3)   0.662 us    |        }
           3)               |        get_vmx_mem_address() {
           3)   0.068 us    |          vmx_cache_reg();
           3)               |          vmx_get_segment() {
           3)   0.074 us    |            vmx_read_guest_seg_base();
           3)   0.068 us    |            vmx_read_guest_seg_selector();
           3)   0.071 us    |            vmx_read_guest_seg_ar();
           3)   1.756 us    |          }
           3)               |          kvm_queue_exception_e() {
           3)   0.066 us    |            kvm_multiple_exception();
           3)   0.684 us    |          }
           3)   4.085 us    |        }
           3)   9.833 us    |      }
           3) + 10.366 us   |    }
      
      Cross-checking the KVM/VMX VMREAD emulation code with the Intel Software
      Developper Manual Volume 3C - "VMREAD - Read Field from Virtual-Machine
      Control Structure", I found that we're enforcing that the destination
      operand is NOT located in a read-only data segment or any code segment when
      the L1 is in long mode - BUT that check should only happen when it is in
      protected mode.
      
      Shuffling the code a bit to make our emulation follow the specification
      allows me to boot a Xen dom0 in a nested KVM and start HVM L2 guests
      without problems.
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Signed-off-by: NQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Eugene Korenevsky <ekorenevsky@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ff30ef40
    • M
      KVM: LAPIC: cap __delay at lapic_timer_advance_ns · b606f189
      Marcelo Tosatti 提交于
      The host timer which emulates the guest LAPIC TSC deadline
      timer has its expiration diminished by lapic_timer_advance_ns
      nanoseconds. Therefore if, at wait_lapic_expire, a difference
      larger than lapic_timer_advance_ns is encountered, delay at most
      lapic_timer_advance_ns.
      
      This fixes a problem where the guest can cause the host
      to delay for large amounts of time.
      Reported-by: NAlan Jenkins <alan.christopher.jenkins@gmail.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b606f189
    • M
      KVM: x86: move nsec_to_cycles from x86.c to x86.h · 8d93c874
      Marcelo Tosatti 提交于
      Move the inline function nsec_to_cycles from x86.c to x86.h, as
      the next patch uses it from lapic.c.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8d93c874
    • M
      pvclock: Get rid of __pvclock_read_cycles in function pvclock_read_flags · ed911b43
      Minfei Huang 提交于
      There is a generic function __pvclock_read_cycles to be used to get both
      flags and cycles. For function pvclock_read_flags, it's useless to get
      cycles value. To make this function be more effective, get this variable
      flags directly in function.
      Signed-off-by: NMinfei Huang <mnghuan@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ed911b43
    • M
      pvclock: Cleanup to remove function pvclock_get_nsec_offset · f7550d07
      Minfei Huang 提交于
      Function __pvclock_read_cycles is short enough, so there is no need to
      have another function pvclock_get_nsec_offset to calculate tsc delta.
      It's better to combine it into function __pvclock_read_cycles.
      
      Remove useless variables in function __pvclock_read_cycles.
      Signed-off-by: NMinfei Huang <mnghuan@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f7550d07
    • M
      pvclock: Add CPU barriers to get correct version value · 749d088b
      Minfei Huang 提交于
      Protocol for the "version" fields is: hypervisor raises it (making it
      uneven) before it starts updating the fields and raises it again (making
      it even) when it is done.  Thus the guest can make sure the time values
      it got are consistent by checking the version before and after reading
      them.
      
      Add CPU barries after getting version value just like what function
      vread_pvclock does, because all of callees in this function is inline.
      
      Fixes: 502dfeff
      Cc: stable@vger.kernel.org
      Signed-off-by: NMinfei Huang <mnghuan@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      749d088b
    • L
      Linux 4.7-rc5 · 4c2e07c6
      Linus Torvalds 提交于
      4c2e07c6
    • L
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 2ac9b973
      Linus Torvalds 提交于
      Pull SCSI fixes from James Bottomley:
       "Two straightforward fixes.
      
        One is a concurrency issue only affecting SAS connected SATA drives,
        but which could hang the storage subsystem if it triggers (because the
        outstanding command count on error never goes back to zero) and the
        other is a NO_TAG fallout from the switch to hostwide tags which
        causes the system to crash on module insertion (we've checked
        carefully and only the 53c700 family of drivers is vulnerable to this
        issue)"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        53c700: fix BUG on untagged commands
        scsi: fix race between simultaneous decrements of ->host_failed
      2ac9b973
  2. 25 6月, 2016 32 次提交
    • L
      Merge branch 'for-linus-4.7-part2' of... · da2f6aba
      Linus Torvalds 提交于
      Merge branch 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
      
      Pull btrfs fixes part 2 from Chris Mason:
       "This has one patch from Omar to bring iterate_shared back to btrfs.
      
        We have a tree of work we queue up for directory items and it doesn't
        lend itself well to shared access.  While we're cleaning it up, Omar
        has changed things to use an exclusive lock when there are delayed
        items"
      
      * 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes
      da2f6aba
    • L
      Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · b971712a
      Linus Torvalds 提交于
      Pull btrfs fixes from Chris Mason:
       "I have a two part pull this time because one of the patches Dave
        Sterba collected needed to be against v4.7-rc2 or higher (we used
        rc4).  I try to make my for-linus-xx branch testable on top of the
        last major so we can hand fixes to people on the list more easily, so
        I've split this pull in two.
      
        This first part has some fixes and two performance improvements that
        we've been testing for some time.
      
        Josef's two performance fixes are most notable.  The transid tracking
        patch makes a big improvement on pretty much every workload"
      
      * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: Force stripesize to the value of sectorsize
        btrfs: fix disk_i_size update bug when fallocate() fails
        Btrfs: fix error handling in map_private_extent_buffer
        Btrfs: fix error return code in btrfs_init_test_fs()
        Btrfs: don't do nocow check unless we have to
        btrfs: fix deadlock in delayed_ref_async_start
        Btrfs: track transid for delayed ref flushing
      b971712a
    • L
      Merge tag 'sound-4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · ca83a55c
      Linus Torvalds 提交于
      Pull sound fixes from Takashi Iwai:
       "Again pretty calm weeks: we've had only a few trivial / stable
        HD-audio fixes in addition to a possible race fix for snd-dummy driver
        spotted by syzkaller"
      
      * tag 'sound-4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: dummy: Fix a use-after-free at closing
        ALSA: hda / realtek - add two more Thinkpad IDs (5050,5053) for tpt460 fixup
        ALSA: hda - Fix the headset mic jack detection on Dell machine
        ALSA: hda/tegra: iomem fixups for sparse warnings
        ALSA: hdac_regmap - fix the register access for runtime PM
      ca83a55c
    • L
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9a949a98
      Linus Torvalds 提交于
      Pull x86 kprobe fix from Thomas Gleixner:
       "A single fix clearing the TF bit when a fault is single stepped"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        kprobes/x86: Clear TF bit in fault on single-stepping
      9a949a98
    • L
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 57801c1b
      Linus Torvalds 提交于
      Pull scheduler fixes from Thomas Gleixner:
       "A couple of scheduler fixes:
      
         - force watchdog reset while processing sysrq-w
      
         - fix a deadlock when enabling trace events in the scheduler
      
         - fixes to the throttled next buddy logic
      
         - fixes for the average accounting (missing serialization and
           underflow handling)
      
         - allow kernel threads for fallback to online but not active cpus"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/core: Allow kthreads to fall back to online && !active cpus
        sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
        sched/fair: Initialize throttle_count for new task-groups lazily
        sched/fair: Fix cfs_rq avg tracking underflow
        kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w
        sched/debug: Fix deadlock when enabling sched events
        sched/fair: Fix post_init_entity_util_avg() serialization
      57801c1b
    • O
      Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99
      Omar Sandoval 提交于
      Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
      backed out the conversion to ->iterate_shared() for Btrfs because the
      delayed inode handling in btrfs_real_readdir() is racy. However, we can
      still do readdir in parallel if there are no delayed nodes.
      
      This is a temporary fix which upgrades the shared inode lock to an
      exclusive lock only when we have delayed items until we come up with a
      more complete solution. While we're here, rename the
      btrfs_{get,put}_delayed_items functions to make it very clear that
      they're just for readdir.
      
      Tested with xfstests and by doing a parallel kernel build:
      
      	while make tinyconfig && make -j4 && git clean dqfx; do
      		:
      	done
      
      along with a bunch of parallel finds in another shell:
      
      	while true; do
      		for ((i=0; i<4; i++)); do
      			find . >/dev/null &
      		done
      		wait
      	done
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      02dbfc99
    • L
      Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e3b22bc3
      Linus Torvalds 提交于
      Pull locking fix from Thomas Gleixner:
       "A single fix to address a race in the static key logic"
      
      * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/static_key: Fix concurrent static_key_slow_inc()
      e3b22bc3
    • L
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2de23071
      Linus Torvalds 提交于
      Pull irq fix from Thomas Gleixner:
       "A single fix for the fallout from the conversion of MIPS GIC to irq
        domains"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/mips-gic: Fix IRQs in gic_dev_domain
      2de23071
    • L
      Merge tag 'powerpc-4.7-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 2f6e9747
      Linus Torvalds 提交于
      Pull powerpc fixes from Michael Ellerman:
       "mm/radix (Aneesh Kumar K.V):
         - Update to tlb functions ric argument
         - Flush page walk cache when freeing page table
         - Update Radix tree size as per ISA 3.0
      
        mm/hash (Aneesh Kumar K.V):
         - Use the correct PPP mask when updating HPTE
         - Don't add memory coherence if cache inhibited is set
      
        eeh (Gavin Shan):
         - Fix invalid cached PE primary bus
      
        bpf/jit (Naveen N. Rao):
         - Disable classic BPF JIT on ppc64le
      
        .. and fix faults caused by radix patching of SLB miss handler
        (Michael Ellerman)"
      
      * tag 'powerpc-4.7-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/bpf/jit: Disable classic BPF JIT on ppc64le
        powerpc: Fix faults caused by radix patching of SLB miss handler
        powerpc/eeh: Fix invalid cached PE primary bus
        powerpc/mm/radix: Update Radix tree size as per ISA 3.0
        powerpc/mm/hash: Don't add memory coherence if cache inhibited is set
        powerpc/mm/hash: Use the correct PPP mask when updating HPTE
        powerpc/mm/radix: Flush page walk cache when freeing page table
        powerpc/mm/radix: Update to tlb functions ric argument
      2f6e9747
    • M
      Fix build break in fork.c when THREAD_SIZE < PAGE_SIZE · 9521d399
      Michael Ellerman 提交于
      Commit b235beea ("Clarify naming of thread info/stack allocators")
      breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE:
      
        kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack'
        kernel/fork.c:355:8: error: assignment from incompatible pointer type
          stack = alloc_thread_stack_node(tsk, node);
          ^
      
      Fix it by renaming free_stack() to free_thread_stack(), and updating the
      return type of alloc_thread_stack_node().
      
      Fixes: b235beea ("Clarify naming of thread info/stack allocators")
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9521d399
    • L
      Merge branch 'akpm' (patches from Andrew) · 086e3eb6
      Linus Torvalds 提交于
      Merge misc fixes from Andrew Morton:
       "Two weeks worth of fixes here"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits)
        init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
        autofs: don't get stuck in a loop if vfs_write() returns an error
        mm/page_owner: avoid null pointer dereference
        tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
        fs/nilfs2: fix potential underflow in call to crc32_le
        oom, suspend: fix oom_reaper vs. oom_killer_disable race
        ocfs2: disable BUG assertions in reading blocks
        mm, compaction: abort free scanner if split fails
        mm: prevent KASAN false positives in kmemleak
        mm/hugetlb: clear compound_mapcount when freeing gigantic pages
        mm/swap.c: flush lru pvecs on compound page arrival
        memcg: css_alloc should return an ERR_PTR value on error
        memcg: mem_cgroup_migrate() may be called with irq disabled
        hugetlb: fix nr_pmds accounting with shared page tables
        Revert "mm: disable fault around on emulated access bit architecture"
        Revert "mm: make faultaround produce old ptes"
        mailmap: add Boris Brezillon's email
        mailmap: add Antoine Tenart's email
        mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
        mm: mempool: kasan: don't poot mempool objects in quarantine
        ...
      086e3eb6
    • L
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma · aebe9bb8
      Linus Torvalds 提交于
      Pull rdma fixes from Doug Ledford:
       "This is the second batch of queued up rdma patches for this rc cycle.
      
        There isn't anything really major in here.  It's passed 0day,
        linux-next, and local testing across a wide variety of hardware.
        There are still a few known issues to be tracked down, but this should
        amount to the vast majority of the rdma RC fixes.
      
        Round two of 4.7 rc fixes:
      
         - A couple minor fixes to the rdma core
         - Multiple minor fixes to hfi1
         - Multiple minor fixes to mlx4/mlx4
         - A few minor fixes to i40iw"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (31 commits)
        IB/srpt: Reduce QP buffer size
        i40iw: Enable level-1 PBL for fast memory registration
        i40iw: Return correct max_fast_reg_page_list_len
        i40iw: Correct status check on i40iw_get_pble
        i40iw: Correct CQ arming
        IB/rdmavt: Correct qp_priv_alloc() return value test
        IB/hfi1: Don't zero out qp->s_ack_queue in rvt_reset_qp
        IB/hfi1: Fix deadlock with txreq allocation slow path
        IB/mlx4: Prevent cross page boundary allocation
        IB/mlx4: Fix memory leak if QP creation failed
        IB/mlx4: Verify port number in flow steering create flow
        IB/mlx4: Fix error flow when sending mads under SRIOV
        IB/mlx4: Fix the SQ size of an RC QP
        IB/mlx5: Fix wrong naming of port_rcv_data counter
        IB/mlx5: Fix post send fence logic
        IB/uverbs: Initialize ib_qp_init_attr with zeros
        IB/core: Fix false search of the IB_SA_WELL_KNOWN_GUID
        IB/core: Fix RoCE v1 multicast join logic issue
        IB/core: Fix no default GIDs when netdevice reregisters
        IB/hfi1: Send a pkey change event on driver pkey update
        ...
      aebe9bb8
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid · 3fb5e59c
      Linus Torvalds 提交于
      Pull HID fix from Jiri Kosina:
       "hiddev ioctl() validation fix from Scott Bauer"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
        HID: hiddev: validate num_values for HIDIOCGUSAGES, HIDIOCSUSAGES commands
      3fb5e59c
    • L
      Merge tag 'hwmon-for-linus-v4.7-rc5' of... · 260eaba4
      Linus Torvalds 提交于
      Merge tag 'hwmon-for-linus-v4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
      
      Pull hwmon fix from Guenter Roeck:
       "Improve fan type detection for dell-smm to prevent kernel hang"
      
      * tag 'hwmon-for-linus-v4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        hwmon: (dell-smm) Cache fan_type() calls and change fan detection
      260eaba4
    • L
      Merge tag 'acpi-4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · ed13fbbf
      Linus Torvalds 提交于
      Pull ACPI fix from Rafael Wysocki:
       "Stable-candidate fix for a deadlock in ACPICA introduced during the
        4.5 development cycle by a commit attempting to improve the handling
        of AML code that doesn't belong to any namespace objects in a given
        definition block (Lv Zheng)"
      
      * tag 'acpi-4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPICA: Namespace: Fix deadlock triggered by MLC support in dynamic table loading
      ed13fbbf
    • L
      Merge tag 'pm-4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 3522b35c
      Linus Torvalds 提交于
      Pull power management fixes from Rafael Wysocki:
       "Fix for a latent cpufreq driver bug uncovered by a recent ACPICA
        change and several fixes for the devfreq framework, including one fix
        for an issue introduced recently.
      
        Specifics:
      
         - Fix a latent initialization issue in the pcc-cpufreq driver
           (incorrect initial value of a structure field) that has been
           uncovered by a recent ACPICA commit (Mike Galbraith).
      
         - Add a missing notification in an update_devfreq() error code path
           forgotten by a recent devfreq commit (Chanwoo Choi).
      
         - Fix devfreq device frequency initialization (Lukasz Luba).
      
         - Fix an incorrect IS_ERR() check in the devfreq framework discovered
           by the Smatch checker (Dan Carpenter).
      
         - Drop two excessive put_device() calls from the devfreq framework
           (MyungJoo Ham, Cai Zhiyong).
      
         - Fix a possible memory leak in the devfreq framework and drop an
           unnecessary kfree() invocation from it (MyungJoo Ham)"
      
      * tag 'pm-4.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        PM / devfreq: Send the DEVFREQ_POSTCHANGE notification when target() is failed
        cpufreq: pcc-cpufreq: Fix doorbell.access_width
        PM / devfreq: fix initialization of current frequency in last status
        PM / devfreq: exynos-nocp: Remove incorrect IS_ERR() check
        PM / devfreq: remove double put_device
        PM / devfreq: fix double call put_device
        PM / devfreq: fix duplicated kfree on devfreq pointer
        PM / devfreq: devm_kzalloc to have dev pointer more precisely
      3522b35c
    • L
      Merge tag 'for-linus-4.7b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 032fd3e5
      Linus Torvalds 提交于
      Pull xen bug fixes from David Vrabel:
      
       - fix x86 PV dom0 crash during early boot on some hardware
      
       - fix two pciback bugs affects certain devices
      
       - fix potential overflow when clearing page tables in x86 PV
      
      * tag 'for-linus-4.7b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen-pciback: return proper values during BAR sizing
        x86/xen: avoid m2p lookup when setting early page table entries
        xen/pciback: Fix conf_space read/write overlap check.
        x86/xen: fix upper bound of pmd loop in xen_cleanhighmap()
        xen/balloon: Fix declared-but-not-defined warning
      032fd3e5
    • L
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · d05be0d7
      Linus Torvalds 提交于
      Pull arm64 fixes from Will Deacon:
       "Here are a few more arm64 fixes, but things do finally appear to be
        slowing down.  The main fix is avoiding hibernation in a previously
        unanticipated situation where we have CPUs parked in the kernel, but
        it's all good stuff.
      
         - Fix icache/dcache sync for anonymous pages under migration
         - Correct the ASID limit check
         - Fix parallel builds of Image and Image.gz
         - Refuse to hibernate when we have CPUs that we can't offline"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: hibernate: Don't hibernate on systems with stuck CPUs
        arm64: smp: Add function to determine if cpus are stuck in the kernel
        arm64: mm: remove page_mapping check in __sync_icache_dcache
        arm64: fix boot image dependencies to not generate invalid images
        arm64: update ASID limit
      d05be0d7
    • R
      init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64 · 0fd5ed8d
      Rasmus Villemoes 提交于
      When I replaced kasprintf("%pf") with a direct call to
      sprint_symbol_no_offset I must have broken the initcall blacklisting
      feature on the arches where dereference_function_descriptor() is
      non-trivial.
      
      Fixes: c8cdd2be (init/main.c: simplify initcall_blacklisted())
      Link: http://lkml.kernel.org/r/1466027283-4065-1-git-send-email-linux@rasmusvillemoes.dkSigned-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0fd5ed8d
    • A
    • S
      mm/page_owner: avoid null pointer dereference · 8285027f
      Sudip Mukherjee 提交于
      We have dereferenced page_ext before checking it.  Lets check it first
      and then used it.
      
      Fixes: f86e4271 ("mm: check the return value of lookup_page_ext for all call sites")
      Link: http://lkml.kernel.org/r/1465249059-7883-1-git-send-email-sudipm.mukherjee@gmail.comSigned-off-by: NSudip Mukherjee <sudip.mukherjee@codethink.co.uk>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8285027f
    • C
    • T
      fs/nilfs2: fix potential underflow in call to crc32_le · 63d2f95d
      Torsten Hilbrich 提交于
      The value `bytes' comes from the filesystem which is about to be
      mounted.  We cannot trust that the value is always in the range we
      expect it to be.
      
      Check its value before using it to calculate the length for the crc32_le
      call.  It value must be larger (or equal) sumoff + 4.
      
      This fixes a kernel bug when accidentially mounting an image file which
      had the nilfs2 magic value 0x3434 at the right offset 0x406 by chance.
      The bytes 0x01 0x00 were stored at 0x408 and were interpreted as a
      s_bytes value of 1.  This caused an underflow when substracting sumoff +
      4 (20) in the call to crc32_le.
      
        BUG: unable to handle kernel paging request at ffff88021e600000
        IP:  crc32_le+0x36/0x100
        ...
        Call Trace:
          nilfs_valid_sb.part.5+0x52/0x60 [nilfs2]
          nilfs_load_super_block+0x142/0x300 [nilfs2]
          init_nilfs+0x60/0x390 [nilfs2]
          nilfs_mount+0x302/0x520 [nilfs2]
          mount_fs+0x38/0x160
          vfs_kern_mount+0x67/0x110
          do_mount+0x269/0xe00
          SyS_mount+0x9f/0x100
          entry_SYSCALL_64_fastpath+0x16/0x71
      
      Link: http://lkml.kernel.org/r/1466778587-5184-2-git-send-email-konishi.ryusuke@lab.ntt.co.jpSigned-off-by: NTorsten Hilbrich <torsten.hilbrich@secunet.com>
      Tested-by: NTorsten Hilbrich <torsten.hilbrich@secunet.com>
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63d2f95d
    • M
      oom, suspend: fix oom_reaper vs. oom_killer_disable race · 74070542
      Michal Hocko 提交于
      Tetsuo has reported the following potential oom_killer_disable vs.
      oom_reaper race:
      
       (1) freeze_processes() starts freezing user space threads.
       (2) Somebody (maybe a kenrel thread) calls out_of_memory().
       (3) The OOM killer calls mark_oom_victim() on a user space thread
           P1 which is already in __refrigerator().
       (4) oom_killer_disable() sets oom_killer_disabled = true.
       (5) P1 leaves __refrigerator() and enters do_exit().
       (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
           exit_oom_victim(P1).
       (7) oom_killer_disable() returns while P1 not yet finished
       (8) P1 perform IO/interfere with the freezer.
      
      This situation is unfortunate.  We cannot move oom_killer_disable after
      all the freezable kernel threads are frozen because the oom victim might
      depend on some of those kthreads to make a forward progress to exit so
      we could deadlock.  It is also far from trivial to teach the oom_reaper
      to not call exit_oom_victim() because then we would lose a guarantee of
      the OOM killer and oom_killer_disable forward progress because
      exit_mm->mmput might block and never call exit_oom_victim.
      
      It seems the easiest way forward is to workaround this race by calling
      try_to_freeze_tasks again after oom_killer_disable.  This will make sure
      that all the tasks are frozen or it bails out.
      
      Fixes: 449d777d ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
      Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      74070542
    • G
      ocfs2: disable BUG assertions in reading blocks · 7186ee06
      Gang He 提交于
      According to some high-load testing, these two BUG assertions were
      encountered, this led system panic.  Actually, there were some
      discussions about removing these two BUG() assertions, it would not
      bring any side effect.
      
      Then, I did the the following changes,
      
      1) use the existing macro CATCH_BH_JBD_RACES to wrap BUG() in the
         ocfs2_read_blocks_sync function like before.
      
      2) disable the macro CATCH_BH_JBD_RACES in Makefile by default.
      
      Link: http://lkml.kernel.org/r/1466574294-26863-1-git-send-email-ghe@suse.comSigned-off-by: NGang He <ghe@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7186ee06
    • D
      mm, compaction: abort free scanner if split fails · a4f04f2c
      David Rientjes 提交于
      If the memory compaction free scanner cannot successfully split a free
      page (only possible due to per-zone low watermark), terminate the free
      scanner rather than continuing to scan memory needlessly.  If the
      watermark is insufficient for a free page of order <= cc->order, then
      terminate the scanner since all future splits will also likely fail.
      
      This prevents the compaction freeing scanner from scanning all memory on
      very large zones (very noticeable for zones > 128GB, for instance) when
      all splits will likely fail while holding zone->lock.
      
      compaction_alloc() iterating a 128GB zone has been benchmarked to take
      over 400ms on some systems whereas any free page isolated and ready to
      be split ends up failing in split_free_page() because of the low
      watermark check and thus the iteration continues.
      
      The next time compaction occurs, the freeing scanner will likely start
      at the end of the zone again since no success was made previously and we
      get the same lengthy iteration until the zone is brought above the low
      watermark.  All thp page faults can take >400ms in such a state without
      this fix.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211820350.97086@chino.kir.corp.google.comSigned-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a4f04f2c
    • D
      mm: prevent KASAN false positives in kmemleak · 5c335fe0
      Dmitry Vyukov 提交于
      When kmemleak dumps contents of leaked objects it reads whole objects
      regardless of user-requested size.  This upsets KASAN.  Disable KASAN
      checks around object dump.
      
      Link: http://lkml.kernel.org/r/1466617631-68387-1-git-send-email-dvyukov@google.comSigned-off-by: NDmitry Vyukov <dvyukov@google.com>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c335fe0
    • G
      mm/hugetlb: clear compound_mapcount when freeing gigantic pages · c8cc708a
      Gerald Schaefer 提交于
      While working on s390 support for gigantic hugepages I ran into the
      following "Bad page state" warning when freeing gigantic pages:
      
        BUG: Bad page state in process bash  pfn:580001
        page:000003d116000040 count:0 mapcount:0 mapping:ffffffff00000000 index:0x0
        flags: 0x7fffc0000000000()
        page dumped because: non-NULL mapping
      
      This is because page->compound_mapcount, which is part of a union with
      page->mapping, is initialized with -1 in prep_compound_gigantic_page(),
      and not cleared again during destroy_compound_gigantic_page().  Fix this
      by clearing the compound_mapcount in destroy_compound_gigantic_page()
      before clearing compound_head.
      
      Interestingly enough, the warning will not show up on x86_64, although
      this should not be architecture specific.  Apparently there is an
      endianness issue, combined with the fact that the union contains both a
      64 bit ->mapping pointer and a 32 bit atomic_t ->compound_mapcount as
      members.  The resulting bogus page->mapping on x86_64 therefore contains
      00000000ffffffff instead of ffffffff00000000 on s390, which will falsely
      trigger the PageAnon() check in free_pages_prepare() because
      page->mapping & PAGE_MAPPING_ANON is true on little-endian architectures
      like x86_64 in this case (the page is not compound anymore,
      ->compound_head was already cleared before).  As a result, page->mapping
      will be cleared before doing the checks in free_pages_check().
      
      Not sure if the bogus "PageAnon() returning true" on x86_64 for the
      first tail page of a gigantic page (at this stage) has other theoretical
      implications, but they would also be fixed with this patch.
      
      Link: http://lkml.kernel.org/r/1466612719-5642-1-git-send-email-gerald.schaefer@de.ibm.comSigned-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c8cc708a
    • L
      mm/swap.c: flush lru pvecs on compound page arrival · 8f182270
      Lukasz Odzioba 提交于
      Currently we can have compound pages held on per cpu pagevecs, which
      leads to a lot of memory unavailable for reclaim when needed.  In the
      systems with hundreads of processors it can be GBs of memory.
      
      On of the way of reproducing the problem is to not call munmap
      explicitly on all mapped regions (i.e.  after receiving SIGTERM).  After
      that some pages (with THP enabled also huge pages) may end up on
      lru_add_pvec, example below.
      
        void main() {
        #pragma omp parallel
        {
      	size_t size = 55 * 1000 * 1000; // smaller than  MEM/CPUS
      	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
      		MAP_PRIVATE | MAP_ANONYMOUS , -1, 0);
      	if (p != MAP_FAILED)
      		memset(p, 0, size);
      	//munmap(p, size); // uncomment to make the problem go away
        }
        }
      
      When we run it with THP enabled it will leave significant amount of
      memory on lru_add_pvec.  This memory will be not reclaimed if we hit
      OOM, so when we run above program in a loop:
      
      	for i in `seq 100`; do ./a.out; done
      
      many processes (95% in my case) will be killed by OOM.
      
      The primary point of the LRU add cache is to save the zone lru_lock
      contention with a hope that more pages will belong to the same zone and
      so their addition can be batched.  The huge page is already a form of
      batched addition (it will add 512 worth of memory in one go) so skipping
      the batching seems like a safer option when compared to a potential
      excess in the caching which can be quite large and much harder to fix
      because lru_add_drain_all is way to expensive and it is not really clear
      what would be a good moment to call it.
      
      Similarly we can reproduce the problem on lru_deactivate_pvec by adding:
      madvise(p, size, MADV_FREE); after memset.
      
      This patch flushes lru pvecs on compound page arrival making the problem
      less severe - after applying it kill rate of above example drops to 0%,
      due to reducing maximum amount of memory held on pvec from 28MB (with
      THP) to 56kB per CPU.
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.comSigned-off-by: NLukasz Odzioba <lukasz.odzioba@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Ming Li <mingli199x@qq.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f182270
    • T
      memcg: css_alloc should return an ERR_PTR value on error · ea3a9645
      Tejun Heo 提交于
      mem_cgroup_css_alloc() was returning NULL on failure while cgroup core
      expected it to return an ERR_PTR value leading to the following NULL
      deref after a css allocation failure.  Fix it by return
      ERR_PTR(-ENOMEM) instead.  I'll also update cgroup core so that it
      can handle NULL returns.
      
        mkdir: page allocation failure: order:6, mode:0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO)
        CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
        ...
        Call Trace:
          dump_stack+0x68/0xa1
          warn_alloc_failed+0xd6/0x130
          __alloc_pages_nodemask+0x4c6/0xf20
          alloc_pages_current+0x66/0xe0
          alloc_kmem_pages+0x14/0x80
          kmalloc_order_trace+0x2a/0x1a0
          __kmalloc+0x291/0x310
          memcg_update_all_caches+0x6c/0x130
          mem_cgroup_css_alloc+0x590/0x610
          cgroup_apply_control_enable+0x18b/0x370
          cgroup_mkdir+0x1de/0x2e0
          kernfs_iop_mkdir+0x55/0x80
          vfs_mkdir+0xb9/0x150
          SyS_mkdir+0x66/0xd0
          do_syscall_64+0x53/0x120
          entry_SYSCALL64_slow_path+0x25/0x25
        ...
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
        IP:  init_and_link_css+0x37/0x220
        PGD 34b1e067 PUD 3a109067 PMD 0
        Oops: 0002 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.2-20160422_131301-anatol 04/01/2014
        task: ffff88007cbc5200 ti: ffff8800666d4000 task.ti: ffff8800666d4000
        RIP: 0010:[<ffffffff810f2ca7>]  [<ffffffff810f2ca7>] init_and_link_css+0x37/0x220
        RSP: 0018:ffff8800666d7d90  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: ffffffff810f2499 RSI: 0000000000000000 RDI: 0000000000000008
        RBP: ffff8800666d7db8 R08: 0000000000000003 R09: 0000000000000000
        R10: 0000000000000001 R11: 0000000000000000 R12: ffff88005a5fb400
        R13: ffffffff81f0f8a0 R14: ffff88005a5fb400 R15: 0000000000000010
        FS:  00007fc944689700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f3aed0d2b80 CR3: 000000003a1e8000 CR4: 00000000000006f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          cgroup_apply_control_enable+0x1ac/0x370
          cgroup_mkdir+0x1de/0x2e0
          kernfs_iop_mkdir+0x55/0x80
          vfs_mkdir+0xb9/0x150
          SyS_mkdir+0x66/0xd0
          do_syscall_64+0x53/0x120
          entry_SYSCALL64_slow_path+0x25/0x25
        Code: 89 f5 48 89 fb 49 89 d4 48 83 ec 08 8b 05 72 3b d8 00 85 c0 0f 85 60 01 00 00 4c 89 e7 e8 72 f7 ff ff 48 8d 7b 08 48 89 d9 31 c0 <48> c7 83 d0 00 00 00 00 00 00 00 48 83 e7 f8 48 29 f9 81 c1 d8
        RIP   init_and_link_css+0x37/0x220
         RSP <ffff8800666d7d90>
        CR2: 00000000000000d0
        ---[ end trace a2d8836ae1e852d1 ]---
      
      Link: http://lkml.kernel.org/r/20160621165740.GJ3262@mtj.duckdns.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea3a9645
    • T
      memcg: mem_cgroup_migrate() may be called with irq disabled · d93c4130
      Tejun Heo 提交于
      mem_cgroup_migrate() uses local_irq_disable/enable() but can be called
      with irq disabled from migrate_page_copy().  This ends up enabling irq
      while holding a irq context lock triggering the following lockdep
      warning.  Fix it by using irq_save/restore instead.
      
        =================================
        [ INFO: inconsistent lock state ]
        4.7.0-rc1+ #52 Tainted: G        W
        ---------------------------------
        inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
        kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
         (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<000000000038fd96>] aio_migratepage+0x156/0x1e8
        {IN-SOFTIRQ-W} state was registered at:
           __lock_acquire+0x5b6/0x1930
           lock_acquire+0xee/0x270
           _raw_spin_lock_irqsave+0x66/0xb0
           aio_complete+0x98/0x328
           dio_complete+0xe4/0x1e0
           blk_update_request+0xd4/0x450
           scsi_end_request+0x48/0x1c8
           scsi_io_completion+0x272/0x698
           blk_done_softirq+0xca/0xe8
           __do_softirq+0xc8/0x518
           irq_exit+0xee/0x110
           do_IRQ+0x6a/0x88
           io_int_handler+0x11a/0x25c
           __mutex_unlock_slowpath+0x144/0x1d8
           __mutex_unlock_slowpath+0x140/0x1d8
           kernfs_iop_permission+0x64/0x80
           __inode_permission+0x9e/0xf0
           link_path_walk+0x6e/0x510
           path_lookupat+0xc4/0x1a8
           filename_lookup+0x9c/0x160
           user_path_at_empty+0x5c/0x70
           SyS_readlinkat+0x68/0x140
           system_call+0xd6/0x270
        irq event stamp: 971410
        hardirqs last  enabled at (971409):  migrate_page_move_mapping+0x3ea/0x588
        hardirqs last disabled at (971410):  _raw_spin_lock_irqsave+0x3c/0xb0
        softirqs last  enabled at (970526):  __do_softirq+0x460/0x518
        softirqs last disabled at (970519):  irq_exit+0xee/0x110
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
      	 CPU0
      	 ----
          lock(&(&ctx->completion_lock)->rlock);
          <Interrupt>
            lock(&(&ctx->completion_lock)->rlock);
      
          *** DEADLOCK ***
      
        3 locks held by kcompactd0/151:
         #0:  (&(&mapping->private_lock)->rlock){+.+.-.}, at:  aio_migratepage+0x42/0x1e8
         #1:  (&ctx->ring_lock){+.+.+.}, at:  aio_migratepage+0x5a/0x1e8
         #2:  (&(&ctx->completion_lock)->rlock){+.?.-.}, at:  aio_migratepage+0x156/0x1e8
      
        stack backtrace:
        CPU: 20 PID: 151 Comm: kcompactd0 Tainted: G        W       4.7.0-rc1+ #52
        Call Trace:
          show_trace+0xea/0xf0
          show_stack+0x72/0xf0
          dump_stack+0x9a/0xd8
          print_usage_bug.part.27+0x2d4/0x2e8
          mark_lock+0x17e/0x758
          mark_held_locks+0xa2/0xd0
          trace_hardirqs_on_caller+0x140/0x1c0
          mem_cgroup_migrate+0x266/0x370
          aio_migratepage+0x16a/0x1e8
          move_to_new_page+0xb0/0x260
          migrate_pages+0x8f4/0x9f0
          compact_zone+0x4dc/0xdc8
          kcompactd_do_work+0x1aa/0x358
          kcompactd+0xba/0x2c8
          kthread+0x10a/0x110
          kernel_thread_starter+0x6/0xc
          kernel_thread_starter+0x0/0xc
        INFO: lockdep is turned off.
      
      Link: http://lkml.kernel.org/r/20160620184158.GO3262@mtj.duckdns.org
      Link: http://lkml.kernel.org/g/5767CFE5.7080904@de.ibm.com
      Fixes: 74485cf2 ("mm: migrate: consolidate mem_cgroup_migrate() calls")
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>	[4.5+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d93c4130
    • K
      hugetlb: fix nr_pmds accounting with shared page tables · c17b1f42
      Kirill A. Shutemov 提交于
      We account HugeTLB's shared page table to all processes who share it.
      The accounting happens during huge_pmd_share().
      
      If somebody populates pud entry under us, we should decrease pagetable's
      refcount and decrease nr_pmds of the process.
      
      By mistake, I increase nr_pmds again in this case.  :-/ It will lead to
      "BUG: non-zero nr_pmds on freeing mm: 2" on process' exit.
      
      Let's fix this by increasing nr_pmds only when we're sure that the page
      table will be used.
      
      Link: http://lkml.kernel.org/r/20160617122506.GC6534@node.shutemov.name
      Fixes: dc6c9a35 ("mm: account pmd page tables to the process")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Nzhongjiang <zhongjiang@huawei.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c17b1f42