1. 05 10月, 2017 1 次提交
    • T
      NFSv4/pnfs: Fix an infinite layoutget loop · e8fa33a6
      Trond Myklebust 提交于
      Since we can now use a lock stateid or a delegation stateid, that
      differs from the context stateid, we need to change the test in
      nfs4_layoutget_handle_exception() to take this into account.
      
      This fixes an infinite layoutget loop in the NFS client whereby
      it keeps retrying the initial layoutget using the same broken
      stateid.
      
      Fixes: 70d2f7b1 ("pNFS: Use the standard I/O stateid when...")
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      e8fa33a6
  2. 02 10月, 2017 14 次提交
    • S
      nfs/filelayout: fix oops when freeing filelayout segment · 0a47df11
      Scott Mayhew 提交于
      Check for a NULL dsaddr in filelayout_free_lseg() before calling
      nfs4_fl_put_deviceid().  This fixes the following oops:
      
      [ 1967.645207] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
      [ 1967.646010] IP: [<ffffffffc06d6aea>] nfs4_put_deviceid_node+0xa/0x90 [nfsv4]
      [ 1967.646010] PGD c08bc067 PUD 915d3067 PMD 0
      [ 1967.753036] Oops: 0000 [#1] SMP
      [ 1967.753036] Modules linked in: nfs_layout_nfsv41_files ext4 mbcache jbd2 loop rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache amd64_edac_mod ipmi_ssif edac_mce_amd edac_core kvm_amd sg kvm ipmi_si ipmi_devintf irqbypass pcspkr k8temp ipmi_msghandler i2c_piix4 shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mptsas ttm scsi_transport_sas mptscsih drm mptbase serio_raw i2c_core bnx2 dm_mirror dm_region_hash dm_log dm_mod
      [ 1967.790031] CPU: 2 PID: 1370 Comm: ls Not tainted 3.10.0-709.el7.test.bz1463784.x86_64 #1
      [ 1967.790031] Hardware name: IBM BladeCenter LS21 -[7971AC1]-/Server Blade, BIOS -[BAE155AUS-1.10]- 06/03/2009
      [ 1967.790031] task: ffff8800c42a3f40 ti: ffff8800c4064000 task.ti: ffff8800c4064000
      [ 1967.790031] RIP: 0010:[<ffffffffc06d6aea>]  [<ffffffffc06d6aea>] nfs4_put_deviceid_node+0xa/0x90 [nfsv4]
      [ 1967.790031] RSP: 0000:ffff8800c4067978  EFLAGS: 00010246
      [ 1967.790031] RAX: ffffffffc062f000 RBX: ffff8801d468a540 RCX: dead000000000200
      [ 1967.790031] RDX: ffff8800c40679f8 RSI: ffff8800c4067a0c RDI: 0000000000000000
      [ 1967.790031] RBP: ffff8800c4067980 R08: ffff8801d468a540 R09: 0000000000000000
      [ 1967.790031] R10: 0000000000000000 R11: ffffffffffffffff R12: ffff8801d468a540
      [ 1967.790031] R13: ffff8800c40679f8 R14: ffff8801d5645300 R15: ffff880126f15ff0
      [ 1967.790031] FS:  00007f11053c9800(0000) GS:ffff88012bd00000(0000) knlGS:0000000000000000
      [ 1967.790031] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 1967.790031] CR2: 0000000000000030 CR3: 0000000094b55000 CR4: 00000000000007e0
      [ 1967.790031] Stack:
      [ 1967.790031]  ffff8801d468a540 ffff8800c4067990 ffffffffc062d2fe ffff8800c40679b0
      [ 1967.790031]  ffffffffc062b5b4 ffff8800c40679f8 ffff8801d468a540 ffff8800c40679d8
      [ 1967.790031]  ffffffffc06d39af ffff8800c40679f8 ffff880126f16078 0000000000000001
      [ 1967.790031] Call Trace:
      [ 1967.790031]  [<ffffffffc062d2fe>] nfs4_fl_put_deviceid+0xe/0x10 [nfs_layout_nfsv41_files]
      [ 1967.790031]  [<ffffffffc062b5b4>] filelayout_free_lseg+0x24/0x90 [nfs_layout_nfsv41_files]
      [ 1967.790031]  [<ffffffffc06d39af>] pnfs_free_lseg_list+0x5f/0x80 [nfsv4]
      [ 1967.790031]  [<ffffffffc06d5a67>] _pnfs_return_layout+0x157/0x270 [nfsv4]
      [ 1967.790031]  [<ffffffffc06c17dd>] nfs4_evict_inode+0x4d/0x70 [nfsv4]
      [ 1967.790031]  [<ffffffff8121de19>] evict+0xa9/0x180
      [ 1967.790031]  [<ffffffff8121e729>] iput+0xf9/0x190
      [ 1967.790031]  [<ffffffffc0652cea>] nfs_dentry_iput+0x3a/0x50 [nfs]
      [ 1967.790031]  [<ffffffff8121ab4f>] shrink_dentry_list+0x20f/0x490
      [ 1967.790031]  [<ffffffff8121b018>] d_invalidate+0xd8/0x150
      [ 1967.790031]  [<ffffffffc065446b>] nfs_readdir_page_filler+0x40b/0x600 [nfs]
      [ 1967.790031]  [<ffffffffc0654bbd>] nfs_readdir_xdr_to_array+0x20d/0x3b0 [nfs]
      [ 1967.790031]  [<ffffffff811f3482>] ? __mem_cgroup_commit_charge+0xe2/0x2f0
      [ 1967.790031]  [<ffffffff81183208>] ? __add_to_page_cache_locked+0x48/0x170
      [ 1967.790031]  [<ffffffffc0654d60>] ? nfs_readdir_xdr_to_array+0x3b0/0x3b0 [nfs]
      [ 1967.790031]  [<ffffffffc0654d82>] nfs_readdir_filler+0x22/0x90 [nfs]
      [ 1967.790031]  [<ffffffff8118351f>] do_read_cache_page+0x7f/0x190
      [ 1967.790031]  [<ffffffff81215d30>] ? fillonedir+0xe0/0xe0
      [ 1967.790031]  [<ffffffff8118366c>] read_cache_page+0x1c/0x30
      [ 1967.790031]  [<ffffffffc0654f9b>] nfs_readdir+0x1ab/0x6b0 [nfs]
      [ 1967.790031]  [<ffffffffc06bd1c0>] ? nfs4_xdr_dec_layoutget+0x270/0x270 [nfsv4]
      [ 1967.790031]  [<ffffffff81215d30>] ? fillonedir+0xe0/0xe0
      [ 1967.790031]  [<ffffffff81215c20>] vfs_readdir+0xb0/0xe0
      [ 1967.790031]  [<ffffffff81216045>] SyS_getdents+0x95/0x120
      [ 1967.790031]  [<ffffffff816b9449>] system_call_fastpath+0x16/0x1b
      [ 1967.790031] Code: 90 31 d2 48 89 d0 5d c3 85 f6 74 f5 8d 4e 01 89 f0 f0 0f b1 0f 39 f0 74 e2 89 c6 eb eb 0f 1f 40 00 66 66 66 66 90 55 48 89 e5 53 <48> 8b 47 30 48 89 fb a8 04 74 3b 8b 57 60 83 fa 02 74 19 8d 4a
      [ 1967.790031] RIP  [<ffffffffc06d6aea>] nfs4_put_deviceid_node+0xa/0x90 [nfsv4]
      [ 1967.790031]  RSP <ffff8800c4067978>
      [ 1967.790031] CR2: 0000000000000030
      Signed-off-by: NScott Mayhew <smayhew@redhat.com>
      Fixes: 1ebf9801 ("NFS/filelayout: Fix racy setting of fl->dsaddr...")
      Cc: stable@vger.kernel.org # v4.13+
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      0a47df11
    • C
      sunrpc: remove redundant initialization of sock · d099b8af
      Colin Ian King 提交于
      sock is being initialized and then being almost immediately updated
      hence the initialized value is not being used and is redundant. Remove
      the initialization. Cleans up clang warning:
      
      warning: Value stored to 'sock' during its initialization is never read
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      d099b8af
    • B
      NFS: Fix uninitialized rpc_wait_queue · 68ebf8fe
      Benjamin Coddington 提交于
      Michael Sterrett reports a NULL pointer dereference on NFSv3 mounts when
      CONFIG_NFS_V4 is not set because the NFS UOC rpc_wait_queue has not been
      initialized.  Move the initialization of the queue out of the CONFIG_NFS_V4
      conditional setion.
      
      Fixes: 7d6ddf88 ("NFS: Add an iocounter wait function for async RPC tasks")
      Cc: stable@vger.kernel.org # 4.11+
      Signed-off-by: NBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      68ebf8fe
    • D
      NFS: Cleanup error handling in nfs_idmap_request_key() · cdb2e53f
      Dan Carpenter 提交于
      nfs_idmap_get_desc() can't actually return zero.  But if it did then
      we would return ERR_PTR(0) which is NULL and the caller,
      nfs_idmap_get_key(), doesn't expect that so it leads to a NULL pointer
      dereference.
      
      I've cleaned this up by changing the "<=" to "<" so it's more clear that
      we don't return ERR_PTR(0).
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      cdb2e53f
    • J
      nfs: RPC_MAX_AUTH_SIZE is in bytes · 35c036ef
      J. Bruce Fields 提交于
      The units of RPC_MAX_AUTH_SIZE is bytes, not 4-byte words.  This causes
      the client to request a larger-than-necessary session replay slot size.
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      35c036ef
    • L
      Linux 4.14-rc3 · 9e66317d
      Linus Torvalds 提交于
      9e66317d
    • L
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 368f8998
      Linus Torvalds 提交于
      Pull x86 fixes from Thomas Gleixner:
       "This contains the following fixes and improvements:
      
         - Avoid dereferencing an unprotected VMA pointer in the fault signal
           generation code
      
         - Fix inline asm call constraints for GCC 4.4
      
         - Use existing register variable to retrieve the stack pointer
           instead of forcing the compiler to create another indirect access
           which results in excessive extra 'mov %rsp, %<dst>' instructions
      
         - Disable branch profiling for the memory encryption code to prevent
           an early boot crash
      
         - Fix a sparse warning caused by casting the __user annotation in
           __get_user_asm_u64() away
      
         - Fix an off by one error in the loop termination of the error patch
           in the x86 sysfs init code
      
         - Add missing CPU IDs to various Intel specific drivers to enable the
           functionality on recent hardware
      
         - More (init) constification in the numachip code"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/asm: Use register variable to get stack pointer value
        x86/mm: Disable branch profiling in mem_encrypt.c
        x86/asm: Fix inline asm call constraints for GCC 4.4
        perf/x86/intel/uncore: Correct num_boxes for IIO and IRP
        perf/x86/intel/rapl: Add missing CPU IDs
        perf/x86/msr: Add missing CPU IDs
        perf/x86/intel/cstate: Add missing CPU IDs
        x86: Don't cast away the __user in __get_user_asm_u64()
        x86/sysfs: Fix off-by-one error in loop termination
        x86/mm: Fix fault error path using unsafe vma pointer
        x86/numachip: Add const and __initconst to numachip2_clockevent
      368f8998
    • L
      Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c42ed9f9
      Linus Torvalds 提交于
      Pull timer fixes from Thomas Gleixner:
       "This adds a new timer wheel function which is required for the
        conversion of the timer callback function from the 'unsigned long
        data' argument to 'struct timer_list *timer'. This conversion has two
        benefits:
      
         1) It makes struct timer_list smaller
      
         2) Many callers hand in a pointer to the timer or to the structure
            containing the timer, which happens via type casting both at setup
            and in the callback. This change gets rid of the typecasts.
      
        Once the conversion is complete, which is planned for 4.15, the old
        setup function and the intermediate typecast in the new setup function
        go away along with the data field in struct timer_list.
      
        Merging this now into mainline allows a smooth queueing of the actual
        conversion in the affected maintainer trees without creating
        dependencies"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        um/time: Fixup namespace collision
        timer: Prepare to change timer callback argument type
      c42ed9f9
    • L
      Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 82513545
      Linus Torvalds 提交于
      Pull smp/hotplug fixes from Thomas Gleixner:
       "This addresses the fallout of the new lockdep mechanism which covers
        completions in the CPU hotplug code.
      
        The lockdep splats are false positives, but there is no way to
        annotate that reliably. The solution is to split the completions for
        CPU up and down, which requires some reshuffling of the failure
        rollback handling as well"
      
      * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        smp/hotplug: Hotplug state fail injection
        smp/hotplug: Differentiate the AP completion between up and down
        smp/hotplug: Differentiate the AP-work lockdep class between up and down
        smp/hotplug: Callback vs state-machine consistency
        smp/hotplug: Rewrite AP state machine core
        smp/hotplug: Allow external multi-instance rollback
        smp/hotplug: Add state diagram
      82513545
    • L
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7e103ace
      Linus Torvalds 提交于
      Pull scheduler fixes from Thomas Gleixner:
       "The scheduler pull request comes with the following updates:
      
         - Prevent a divide by zero issue by validating the input value of
           sysctl_sched_time_avg
      
         - Make task state printing consistent all over the place and have
           explicit state characters for IDLE and PARKED so they wont be
           displayed as 'D' state which confuses tools"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/sysctl: Check user input value of sysctl_sched_time_avg
        sched/debug: Add explicit TASK_PARKED printing
        sched/debug: Ignore TASK_IDLE for SysRq-W
        sched/debug: Add explicit TASK_IDLE printing
        sched/tracing: Use common task-state helpers
        sched/tracing: Fix trace_sched_switch task-state printing
        sched/debug: Remove unused variable
        sched/debug: Convert TASK_state to hex
        sched/debug: Implement consistent task-state printing
      7e103ace
    • L
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1c6f705b
      Linus Torvalds 提交于
      Pull perf fixes from Thomas Gleixner:
      
       - Prevent a division by zero in the perf aux buffer handling
      
       - Sync kernel headers with perf tool headers
      
       - Fix a build failure in the syscalltbl code
      
       - Make the debug messages of perf report --call-graph work correctly
      
       - Make sure that all required perf files are in the MANIFEST for
         container builds
      
       - Fix the atrr.exclude kernel handling so it respects the
         perf_event_paranoid and the user permissions
      
       - Make perf test on s390x work correctly
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/aux: Only update ->aux_wakeup in non-overwrite mode
        perf test: Fix vmlinux failure on s390x part 2
        perf test: Fix vmlinux failure on s390x
        perf tools: Fix syscalltbl build failure
        perf report: Fix debug messages with --call-graph option
        perf evsel: Fix attr.exclude_kernel setting for default cycles:p
        tools include: Sync kernel ABI headers with tooling headers
        perf tools: Get all of tools/{arch,include}/ in the MANIFEST
      1c6f705b
    • L
      Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1de47f3c
      Linus Torvalds 提交于
      Pull  locking fixes from Thomas Gleixner:
       "Two fixes for locking:
      
         - Plug a hole the pi_stat->owner serialization which was changed
           recently and failed to fixup two usage sites.
      
         - Prevent reordering of the rwsem_has_spinner() check vs the
           decrement of rwsem count in up_write() which causes a missed
           wakeup"
      
      * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/rwsem-xadd: Fix missed wakeup due to reordering of load
        futex: Fix pi_state->owner serialization
      1de47f3c
    • L
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3d9d62b9
      Linus Torvalds 提交于
      Pull irq fixes from Thomas Gleixner:
      
       - Add a missing NULL pointer check in free_irq()
      
       - Fix a memory leak/memory corruption in the generic irq chip
      
       - Add missing rcu annotations for radix tree access
      
       - Use ffs instead of fls when extracting data from a chip register in
         the MIPS GIC irq driver
      
       - Fix the unmasking of IPI interrupts in the MIPS GIC driver so they
         end up at the target CPU and not at CPU0
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irq/generic-chip: Don't replace domain's name
        irqdomain: Add __rcu annotations to radix tree accessors
        irqchip/mips-gic: Use effective affinity to unmask
        irqchip/mips-gic: Fix shifts to extract register fields
        genirq: Check __free_irq() return value for NULL
      3d9d62b9
    • L
      Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 156069f8
      Linus Torvalds 提交于
      Pull objtool fixes from Thomas Gleixner:
       "Two small fixes for objtool:
      
         - Support frame pointer setup via 'lea (%rsp), %rbp' which was not
           yet supported and caused build warnings
      
         - Disable unreacahble warnings for GCC4.4 and older to avoid false
           positives caused by the compiler itself"
      
      * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        objtool: Support unoptimized frame pointer setup
        objtool: Skip unreachable warnings for GCC 4.4 and older
      156069f8
  3. 01 10月, 2017 2 次提交
  4. 30 9月, 2017 16 次提交
  5. 29 9月, 2017 7 次提交
    • W
      arm64: fault: Route pte translation faults via do_translation_fault · 760bfb47
      Will Deacon 提交于
      We currently route pte translation faults via do_page_fault, which elides
      the address check against TASK_SIZE before invoking the mm fault handling
      code. However, this can cause issues with the path walking code in
      conjunction with our word-at-a-time implementation because
      load_unaligned_zeropad can end up faulting in kernel space if it reads
      across a page boundary and runs into a page fault (e.g. by attempting to
      read from a guard region).
      
      In the case of such a fault, load_unaligned_zeropad has registered a
      fixup to shift the valid data and pad with zeroes, however the abort is
      reported as a level 3 translation fault and we dispatch it straight to
      do_page_fault, despite it being a kernel address. This results in calling
      a sleeping function from atomic context:
      
        BUG: sleeping function called from invalid context at arch/arm64/mm/fault.c:313
        in_atomic(): 0, irqs_disabled(): 0, pid: 10290
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        [...]
        [<ffffff8e016cd0cc>] ___might_sleep+0x134/0x144
        [<ffffff8e016cd158>] __might_sleep+0x7c/0x8c
        [<ffffff8e016977f0>] do_page_fault+0x140/0x330
        [<ffffff8e01681328>] do_mem_abort+0x54/0xb0
        Exception stack(0xfffffffb20247a70 to 0xfffffffb20247ba0)
        [...]
        [<ffffff8e016844fc>] el1_da+0x18/0x78
        [<ffffff8e017f399c>] path_parentat+0x44/0x88
        [<ffffff8e017f4c9c>] filename_parentat+0x5c/0xd8
        [<ffffff8e017f5044>] filename_create+0x4c/0x128
        [<ffffff8e017f59e4>] SyS_mkdirat+0x50/0xc8
        [<ffffff8e01684e30>] el0_svc_naked+0x24/0x28
        Code: 36380080 d5384100 f9400800 9402566d (d4210000)
        ---[ end trace 2d01889f2bca9b9f ]---
      
      Fix this by dispatching all translation faults to do_translation_faults,
      which avoids invoking the page fault logic for faults on kernel addresses.
      
      Cc: <stable@vger.kernel.org>
      Reported-by: NAnkit Jain <ankijain@codeaurora.org>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      760bfb47
    • W
      arm64: mm: Use READ_ONCE when dereferencing pointer to pte table · f069faba
      Will Deacon 提交于
      On kernels built with support for transparent huge pages, different CPUs
      can access the PMD concurrently due to e.g. fast GUP or page_vma_mapped_walk
      and they must take care to use READ_ONCE to avoid value tearing or caching
      of stale values by the compiler. Unfortunately, these functions call into
      our pgtable macros, which don't use READ_ONCE, and compiler caching has
      been observed to cause the following crash during ext4 writeback:
      
      PC is at check_pte+0x20/0x170
      LR is at page_vma_mapped_walk+0x2e0/0x540
      [...]
      Process doio (pid: 2463, stack limit = 0xffff00000f2e8000)
      Call trace:
      [<ffff000008233328>] check_pte+0x20/0x170
      [<ffff000008233758>] page_vma_mapped_walk+0x2e0/0x540
      [<ffff000008234adc>] page_mkclean_one+0xac/0x278
      [<ffff000008234d98>] rmap_walk_file+0xf0/0x238
      [<ffff000008236e74>] rmap_walk+0x64/0xa0
      [<ffff0000082370c8>] page_mkclean+0x90/0xa8
      [<ffff0000081f3c64>] clear_page_dirty_for_io+0x84/0x2a8
      [<ffff00000832f984>] mpage_submit_page+0x34/0x98
      [<ffff00000832fb4c>] mpage_process_page_bufs+0x164/0x170
      [<ffff00000832fc8c>] mpage_prepare_extent_to_map+0x134/0x2b8
      [<ffff00000833530c>] ext4_writepages+0x484/0xe30
      [<ffff0000081f6ab4>] do_writepages+0x44/0xe8
      [<ffff0000081e5bd4>] __filemap_fdatawrite_range+0xbc/0x110
      [<ffff0000081e5e68>] file_write_and_wait_range+0x48/0xd8
      [<ffff000008324310>] ext4_sync_file+0x80/0x4b8
      [<ffff0000082bd434>] vfs_fsync_range+0x64/0xc0
      [<ffff0000082332b4>] SyS_msync+0x194/0x1e8
      
      This is because page_vma_mapped_walk loads the PMD twice before calling
      pte_offset_map: the first time without READ_ONCE (where it gets all zeroes
      due to a concurrent pmdp_invalidate) and the second time with READ_ONCE
      (where it sees a valid table pointer due to a concurrent pmd_populate).
      However, the compiler inlines everything and caches the first value in
      a register, which is subsequently used in pte_offset_phys which returns
      a junk pointer that is later dereferenced when attempting to access the
      relevant pte.
      
      This patch fixes the issue by using READ_ONCE in pte_offset_phys to ensure
      that a stale value is not used. Whilst this is a point fix for a known
      failure (and simple to backport), a full fix moving all of our page table
      accessors over to {READ,WRITE}_ONCE and consistently using READ_ONCE in
      page_vma_mapped_walk is in the works for a future kernel release.
      
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Timur Tabi <timur@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Fixes: f27176cf ("mm: convert page_mkclean_one() to use page_vma_mapped_walk()")
      Tested-by: NRichard Ruigrok <rruigrok@codeaurora.org>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      f069faba
    • B
      kvm/x86: Handle async PF in RCU read-side critical sections · b862789a
      Boqun Feng 提交于
      Sasha Levin reported a WARNING:
      
      | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
      | rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
      | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
      | rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
      ...
      | CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
      | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      | 1.10.1-1ubuntu1 04/01/2014
      | Call Trace:
      ...
      | RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
      | RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
      | RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
      | RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
      | RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
      | RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
      | R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
      | R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
      | __schedule+0x201/0x2240 kernel/sched/core.c:3292
      | schedule+0x113/0x460 kernel/sched/core.c:3421
      | kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
      | do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
      | async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
      | RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
      | RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
      | RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
      | RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
      | RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
      | R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
      | R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
      | vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
      | sprintf+0xbe/0xf0 lib/vsprintf.c:2386
      | proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
      | get_link fs/namei.c:1047 [inline]
      | link_path_walk+0x1041/0x1490 fs/namei.c:2127
      ...
      
      This happened when the host hit a page fault, and delivered it as in an
      async page fault, while the guest was in an RCU read-side critical
      section.  The guest then tries to reschedule in kvm_async_pf_task_wait(),
      but rcu_preempt_note_context_switch() would treat the reschedule as a
      sleep in RCU read-side critical section, which is not allowed (even in
      preemptible RCU).  Thus the WARN.
      
      To cure this, make kvm_async_pf_task_wait() go to the halt path if the
      PF happens in a RCU read-side critical section.
      Reported-by: NSasha Levin <levinsasha928@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NBoqun Feng <boqun.feng@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b862789a
    • W
      KVM: nVMX: Fix nested #PF intends to break L1's vmlauch/vmresume · 305d0ab4
      Wanpeng Li 提交于
      ------------[ cut here ]------------
       WARNING: CPU: 4 PID: 5280 at /home/kernel/linux/arch/x86/kvm//vmx.c:11394 nested_vmx_vmexit+0xc2b/0xd70 [kvm_intel]
       CPU: 4 PID: 5280 Comm: qemu-system-x86 Tainted: G        W  OE   4.13.0+ #17
       RIP: 0010:nested_vmx_vmexit+0xc2b/0xd70 [kvm_intel]
       Call Trace:
        ? emulator_read_emulated+0x15/0x20 [kvm]
        ? segmented_read+0xae/0xf0 [kvm]
        vmx_inject_page_fault_nested+0x60/0x70 [kvm_intel]
        ? vmx_inject_page_fault_nested+0x60/0x70 [kvm_intel]
        x86_emulate_instruction+0x733/0x810 [kvm]
        vmx_handle_exit+0x2f4/0xda0 [kvm_intel]
        ? kvm_arch_vcpu_ioctl_run+0xd2f/0x1c60 [kvm]
        kvm_arch_vcpu_ioctl_run+0xdab/0x1c60 [kvm]
        ? kvm_arch_vcpu_load+0x62/0x230 [kvm]
        kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? __fget+0xfc/0x210
        do_vfs_ioctl+0xa4/0x6a0
        ? __fget+0x11d/0x210
        SyS_ioctl+0x79/0x90
        entry_SYSCALL_64_fastpath+0x23/0xc2
      
      A nested #PF is triggered during L0 emulating instruction for L2. However, it
      doesn't consider we should not break L1's vmlauch/vmresme. This patch fixes
      it by queuing the #PF exception instead ,requesting an immediate VM exit from
      L2 and keeping the exception for L1 pending for a subsequent nested VM exit.
      
      This should actually work all the time, making vmx_inject_page_fault_nested
      totally unnecessary.  However, that's not working yet, so this patch can work
      around the issue in the meanwhile.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      305d0ab4
    • E
      sched/sysctl: Check user input value of sysctl_sched_time_avg · 5ccba44b
      Ethan Zhao 提交于
      System will hang if user set sysctl_sched_time_avg to 0:
      
        [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0
      
        Stack traceback for pid 0
        0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3
        ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000
        0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8
        ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08
        Call Trace:
        <IRQ> [<ffffffff810c4dd0>] ? update_group_capacity+0x110/0x200
        [<ffffffff810c4fc9>] ? update_sd_lb_stats+0x109/0x600
        [<ffffffff810c5507>] ? find_busiest_group+0x47/0x530
        [<ffffffff810c5b84>] ? load_balance+0x194/0x900
        [<ffffffff810ad5ca>] ? update_rq_clock.part.83+0x1a/0xe0
        [<ffffffff810c6d42>] ? rebalance_domains+0x152/0x290
        [<ffffffff810c6f5c>] ? run_rebalance_domains+0xdc/0x1d0
        [<ffffffff8108a75b>] ? __do_softirq+0xfb/0x320
        [<ffffffff8108ac85>] ? irq_exit+0x125/0x130
        [<ffffffff810b3a17>] ? scheduler_ipi+0x97/0x160
        [<ffffffff81052709>] ? smp_reschedule_interrupt+0x29/0x30
        [<ffffffff8173a1be>] ? reschedule_interrupt+0x6e/0x80
         <EOI> [<ffffffff815bc83c>] ? cpuidle_enter_state+0xcc/0x230
        [<ffffffff815bc80c>] ? cpuidle_enter_state+0x9c/0x230
        [<ffffffff815bc9d7>] ? cpuidle_enter+0x17/0x20
        [<ffffffff810cd6dc>] ? cpu_startup_entry+0x38c/0x420
        [<ffffffff81053373>] ? start_secondary+0x173/0x1e0
      
      Because divide-by-zero error happens in function:
      
      update_group_capacity()
        update_cpu_capacity()
          scale_rt_capacity()
           {
                ...
                total = sched_avg_period() + delta;
                used = div_u64(avg, total);
                ...
           }
      
      To fix this issue, check user input value of sysctl_sched_time_avg, keep
      it unchanged when hitting invalid input, and set the minimum limit of
      sysctl_sched_time_avg to 1 ms.
      Reported-by: NJames Puthukattukaran <james.puthukattukaran@oracle.com>
      Signed-off-by: NEthan Zhao <ethan.zhao@oracle.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: ethan.kernel@gmail.com
      Cc: keescook@chromium.org
      Cc: mcgrof@kernel.org
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5ccba44b
    • J
      x86/asm: Fix inline asm call constraints for GCC 4.4 · 520a13c5
      Josh Poimboeuf 提交于
      The kernel test bot (run by Xiaolong Ye) reported that the following commit:
      
        f5caf621 ("x86/asm: Fix inline asm call constraints for Clang")
      
      is causing double faults in a kernel compiled with GCC 4.4.
      
      Linus subsequently diagnosed the crash pattern and the buggy commit and found that
      the issue is with this code:
      
        register unsigned int __asm_call_sp asm("esp");
        #define ASM_CALL_CONSTRAINT "+r" (__asm_call_sp)
      
      Even on a 64-bit kernel, it's using ESP instead of RSP.  That causes GCC
      to produce the following bogus code:
      
        ffffffff8147461d:       89 e0                   mov    %esp,%eax
        ffffffff8147461f:       4c 89 f7                mov    %r14,%rdi
        ffffffff81474622:       4c 89 fe                mov    %r15,%rsi
        ffffffff81474625:       ba 20 00 00 00          mov    $0x20,%edx
        ffffffff8147462a:       89 c4                   mov    %eax,%esp
        ffffffff8147462c:       e8 bf 52 05 00          callq  ffffffff814c98f0 <copy_user_generic_unrolled>
      
      Despite the absurdity of it backing up and restoring the stack pointer
      for no reason, the bug is actually the fact that it's only backing up
      and restoring the lower 32 bits of the stack pointer.  The upper 32 bits
      are getting cleared out, corrupting the stack pointer.
      
      So change the '__asm_call_sp' register variable to be associated with
      the actual full-size stack pointer.
      
      This also requires changing the __ASM_SEL() macro to be based on the
      actual compiled arch size, rather than the CONFIG value, because
      CONFIG_X86_64 compiles some files with '-m32' (e.g., realmode and vdso).
      Otherwise Clang fails to build the kernel because it complains about the
      use of a 64-bit register (RSP) in a 32-bit file.
      Reported-and-Bisected-and-Tested-by: Nkernel test robot <xiaolong.ye@intel.com>
      Diagnosed-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: LKP <lkp@01.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: f5caf621 ("x86/asm: Fix inline asm call constraints for Clang")
      Link: http://lkml.kernel.org/r/20170928215826.6sdpmwtkiydiytim@trebleSigned-off-by: NIngo Molnar <mingo@kernel.org>
      520a13c5
    • P
      sched/debug: Add explicit TASK_PARKED printing · 8ef9925b
      Peter Zijlstra 提交于
      Currently TASK_PARKED is masqueraded as TASK_INTERRUPTIBLE, give it
      its own print state because it will not in fact get woken by regular
      wakeups and is a long-term state.
      
      This requires moving TASK_PARKED into the TASK_REPORT mask, and since
      that latter needs to be a contiguous bitmask, we need to shuffle the
      bits around a bit.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      8ef9925b