1. 28 3月, 2017 4 次提交
  2. 18 3月, 2017 1 次提交
  3. 17 3月, 2017 3 次提交
  4. 16 3月, 2017 1 次提交
  5. 15 3月, 2017 3 次提交
    • J
      x86/intel_rdt: Put group node in rdtgroup_kn_unlock · 49ec8f5b
      Jiri Olsa 提交于
      The rdtgroup_kn_unlock waits for the last user to release and put its
      node. But it's calling kernfs_put on the node which calls the
      rdtgroup_kn_unlock, which might not be the group's directory node, but
      another group's file node.
      
      This race could be easily reproduced by running 2 instances
      of following script:
      
        mount -t resctrl resctrl /sys/fs/resctrl/
        pushd /sys/fs/resctrl/
        mkdir krava
        echo "krava" > krava/schemata
        rmdir krava
        popd
        umount  /sys/fs/resctrl
      
      It triggers the slub debug error message with following command
      line config: slub_debug=,kernfs_node_cache.
      
      Call kernfs_put on the group's node to fix it.
      
      Fixes: 60cf5e10 ("x86/intel_rdt: Add mkdir to resctrl file system")
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1489501253-20248-1-git-send-email-jolsa@kernel.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      49ec8f5b
    • J
      x86/unwind: Fix last frame check for aligned function stacks · 87a6b297
      Josh Poimboeuf 提交于
      Pavel Machek reported the following warning on x86-32:
      
        WARNING: kernel stack frame pointer at f50cdf98 in swapper/2:0 has bad value   (null)
      
      The warning is caused by the unwinder not realizing that it reached the
      end of the stack, due to an unusual prologue which gcc sometimes
      generates for aligned stacks.  The prologue is based on a gcc feature
      called the Dynamic Realign Argument Pointer (DRAP).  It's almost always
      enabled for aligned stacks when -maccumulate-outgoing-args isn't set.
      
      This issue is similar to the one fixed by the following commit:
      
        8023e0e2 ("x86/unwind: Adjust last frame check for aligned function stacks")
      
      ... but that fix was specific to x86-64.
      
      Make the fix more generic to cover x86-32 as well, and also ensure that
      the return address referred to by the frame pointer is a copy of the
      original return address.
      
      Fixes: acb4608a ("x86/unwind: Create stack frames for saved syscall registers")
      Reported-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/50d4924db716c264b14f1633037385ec80bf89d2.1489465609.git.jpoimboe@redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      87a6b297
    • A
      mm, x86: Fix native_pud_clear build error · d434936e
      Arnd Bergmann 提交于
      We still get a build error in random configurations, after this has been
      modified a few times:
      
      In file included from include/linux/mm.h:68:0,
                       from include/linux/suspend.h:8,
                       from arch/x86/kernel/asm-offsets.c:12:
      arch/x86/include/asm/pgtable.h:66:26: error: redefinition of 'native_pud_clear'
       #define pud_clear(pud)   native_pud_clear(pud)
      
      My interpretation is that the build error comes from a typo in __PAGETABLE_PUD_FOLDED,
      so fix that typo now, and remove the incorrect #ifdef around the native_pud_clear
      definition.
      
      Fixes: 3e761a42 ("mm, x86: fix HIGHMEM64 && PARAVIRT build config for native_pud_clear()")
      Fixes: a00cc7d9 ("mm, x86: add support for PUD-sized transparent hugepages")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NDave Jiang <dave.jiang@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Thomas Garnier <thgarnie@google.com>
      Link: http://lkml.kernel.org/r/20170314121330.182155-1-arnd@arndb.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      d434936e
  6. 14 3月, 2017 6 次提交
  7. 13 3月, 2017 1 次提交
  8. 12 3月, 2017 1 次提交
    • D
      x86/tlb: Fix tlb flushing when lguest clears PGE · 2c4ea6e2
      Daniel Borkmann 提交于
      Fengguang reported random corruptions from various locations on x86-32
      after commits d2852a22 ("arch: add ARCH_HAS_SET_MEMORY config") and
      9d876e79 ("bpf: fix unlocking of jited image when module ronx not set")
      that uses the former. While x86-32 doesn't have a JIT like x86_64, the
      bpf_prog_lock_ro() and bpf_prog_unlock_ro() got enabled due to
      ARCH_HAS_SET_MEMORY, whereas Fengguang's test kernel doesn't have module
      support built in and therefore never had the DEBUG_SET_MODULE_RONX setting
      enabled.
      
      After investigating the crashes further, it turned out that using
      set_memory_ro() and set_memory_rw() didn't have the desired effect, for
      example, setting the pages as read-only on x86-32 would still let
      probe_kernel_write() succeed without error. This behavior would manifest
      itself in situations where the vmalloc'ed buffer was accessed prior to
      set_memory_*() such as in case of bpf_prog_alloc(). In cases where it
      wasn't, the page attribute changes seemed to have taken effect, leading to
      the conclusion that a TLB invalidate didn't happen. Moreover, it turned out
      that this issue reproduced with qemu in "-cpu kvm64" mode, but not for
      "-cpu host". When the issue occurs, change_page_attr_set_clr() did trigger
      a TLB flush as expected via __flush_tlb_all() through cpa_flush_range(),
      though.
      
      There are 3 variants for issuing a TLB flush: invpcid_flush_all() (depends
      on CPU feature bits X86_FEATURE_INVPCID, X86_FEATURE_PGE), cr4 based flush
      (depends on X86_FEATURE_PGE), and cr3 based flush.  For "-cpu host" case in
      my setup, the flush used invpcid_flush_all() variant, whereas for "-cpu
      kvm64", the flush was cr4 based. Switching the kvm64 case to cr3 manually
      worked fine, and further investigating the cr4 one turned out that
      X86_CR4_PGE bit was not set in cr4 register, meaning the
      __native_flush_tlb_global_irq_disabled() wrote cr4 twice with the same
      value instead of clearing X86_CR4_PGE in the first write to trigger the
      flush.
      
      It turned out that X86_CR4_PGE was cleared from cr4 during init from
      lguest_arch_host_init() via adjust_pge(). The X86_FEATURE_PGE bit is also
      cleared from there due to concerns of using PGE in guest kernel that can
      lead to hard to trace bugs (see bff672e6 ("lguest: documentation V:
      Host") in init()). The CPU feature bits are cleared in dynamic
      boot_cpu_data, but they never propagated to __flush_tlb_all() as it uses
      static_cpu_has() instead of boot_cpu_has() for testing which variant of TLB
      flushing to use, meaning they still used the old setting of the host
      kernel.
      
      Clearing via setup_clear_cpu_cap(X86_FEATURE_PGE) so this would propagate
      to static_cpu_has() checks is too late at this point as sections have been
      patched already, so for now, it seems reasonable to switch back to
      boot_cpu_has(X86_FEATURE_PGE) as it was prior to commit c109bf95
      ("x86/cpufeature: Remove cpu_has_pge"). This lets the TLB flush trigger via
      cr3 as originally intended, properly makes the new page attributes visible
      and thus fixes the crashes seen by Fengguang.
      
      Fixes: c109bf95 ("x86/cpufeature: Remove cpu_has_pge")
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: bp@suse.de
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: netdev@vger.kernel.org
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: lkp@01.org
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernrl.org/r/20170301125426.l4nf65rx4wahohyl@wfg-t540p.sh.intel.com
      Link: http://lkml.kernel.org/r/25c41ad9eca164be4db9ad84f768965b7eb19d9e.1489191673.git.daniel@iogearbox.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      2c4ea6e2
  9. 11 3月, 2017 3 次提交
    • D
      x86/acpi: Restore the order of CPU IDs · 2b85b3d2
      Dou Liyang 提交于
      The following commits:
      
        f7c28833 ("x86/acpi: Enable acpi to register all possible cpus at
      boot time") and 8f54969d ("x86/acpi: Introduce persistent storage
      for cpuid <-> apicid mapping")
      
      ... registered all the possible CPUs at boot time via ACPI tables to
      make the mapping of cpuid <-> apicid fixed. Both enabled and disabled
      CPUs could have a logical CPU ID after boot time.
      
      But, ACPI tables are unreliable. the number amd order of Local APIC
      entries which depends on the firmware is often inconsistent with the
      physical devices. Even if they are consistent, The disabled CPUs which
      take up some logical CPU IDs will also make the order discontinuous.
      
      Revert the part of disabled CPUs registration, keep the allocation
      logic of logical CPU IDs and also keep some code location changes.
      Signed-off-by: NDou Liyang <douly.fnst@cn.fujitsu.com>
      Tested-by: NXiaolong Ye <xiaolong.ye@intel.com>
      Cc: rjw@rjwysocki.net
      Cc: linux-acpi@vger.kernel.org
      Cc: guzheng1@huawei.com
      Cc: izumi.taku@jp.fujitsu.com
      Cc: lenb@kernel.org
      Link: http://lkml.kernel.org/r/1488528147-2279-4-git-send-email-douly.fnst@cn.fujitsu.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      2b85b3d2
    • D
      Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting" · c962cff1
      Dou Liyang 提交于
      Revert: dc6db24d ("x86/acpi: Set persistent cpuid <-> nodeid mapping when booting")
      
      The mapping of "cpuid <-> nodeid" is established at boot time via ACPI
      tables to keep associations of workqueues and other node related items
      consistent across cpu hotplug.
      
      But, ACPI tables are unreliable and failures with that boot time mapping
      have been reported on machines where the ACPI table and the physical
      information which is retrieved at actual hotplug is inconsistent.
      
      Revert the mapping implementation so it can be replaced with a less error
      prone approach.
      Signed-off-by: NDou Liyang <douly.fnst@cn.fujitsu.com>
      Tested-by: NXiaolong Ye <xiaolong.ye@intel.com>
      Cc: rjw@rjwysocki.net
      Cc: linux-acpi@vger.kernel.org
      Cc: guzheng1@huawei.com
      Cc: izumi.taku@jp.fujitsu.com
      Cc: lenb@kernel.org
      Link: http://lkml.kernel.org/r/1488528147-2279-2-git-send-email-douly.fnst@cn.fujitsu.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      c962cff1
    • T
      kexec, x86/purgatory: Unbreak it and clean it up · 40c50c1f
      Thomas Gleixner 提交于
      The purgatory code defines global variables which are referenced via a
      symbol lookup in the kexec code (core and arch).
      
      A recent commit addressing sparse warnings made these static and thereby
      broke kexec_file.
      
      Why did this happen? Simply because the whole machinery is undocumented and
      lacks any form of forward declarations. The variable names are unspecific
      and lack a prefix, so adding forward declarations creates shadow variables
      in the core code. Aside of that the code relies on magic constants and
      duplicate struct definitions with no way to ensure that these things stay
      in sync. The section placement of the purgatory variables happened by
      chance and not by design.
      
      Unbreak kexec and cleanup the mess:
      
       - Add proper forward declarations and document the usage
       - Use common struct definition
       - Use the proper common defines instead of magic constants
       - Add a purgatory_ prefix to have a proper name space
       - Use ARRAY_SIZE() instead of a homebrewn reimplementation
       - Add proper sections to the purgatory variables [ From Mike ]
      
      Fixes: 72042a8c ("x86/purgatory: Make functions and variables static")
      Reported-by: NMike Galbraith <&lt;efault@gmx.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Nicholas Mc Guire <der.herr@hofr.at>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Tobin C. Harding" <me@tobin.cc>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1703101315140.3681@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      40c50c1f
  10. 10 3月, 2017 6 次提交
  11. 09 3月, 2017 1 次提交
    • R
      KVM: nVMX: do not warn when MSR bitmap address is not backed · 05d8d346
      Radim Krčmář 提交于
      Before trying to do nested_get_page() in nested_vmx_merge_msr_bitmap(),
      we have already checked that the MSR bitmap address is valid (4k aligned
      and within physical limits).  SDM doesn't specify what happens if the
      there is no memory mapped at the valid address, but Intel CPUs treat the
      situation as if the bitmap was configured to trap all MSRs.
      
      KVM already does that by returning false and a correct handling doesn't
      need the guest-trigerrable warning that was reported by syzkaller:
      (The warning was originally there to catch some possible bugs in nVMX.)
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 7832 at arch/x86/kvm/vmx.c:9709
        nested_vmx_merge_msr_bitmap arch/x86/kvm/vmx.c:9709 [inline]
        WARNING: CPU: 0 PID: 7832 at arch/x86/kvm/vmx.c:9709
        nested_get_vmcs12_pages+0xfb6/0x15c0 arch/x86/kvm/vmx.c:9640
        Kernel panic - not syncing: panic_on_warn set ...
        CPU: 0 PID: 7832 Comm: syz-executor1 Not tainted 4.10.0+ #229
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:15 [inline]
         dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
         panic+0x1fb/0x412 kernel/panic.c:179
         __warn+0x1c4/0x1e0 kernel/panic.c:540
         warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
         nested_vmx_merge_msr_bitmap arch/x86/kvm/vmx.c:9709 [inline]
         nested_get_vmcs12_pages+0xfb6/0x15c0 arch/x86/kvm/vmx.c:9640
         enter_vmx_non_root_mode arch/x86/kvm/vmx.c:10471 [inline]
         nested_vmx_run+0x6186/0xaab0 arch/x86/kvm/vmx.c:10561
         handle_vmlaunch+0x1a/0x20 arch/x86/kvm/vmx.c:7312
         vmx_handle_exit+0xfc0/0x3f00 arch/x86/kvm/vmx.c:8526
         vcpu_enter_guest arch/x86/kvm/x86.c:6982 [inline]
         vcpu_run arch/x86/kvm/x86.c:7044 [inline]
         kvm_arch_vcpu_ioctl_run+0x1418/0x4840 arch/x86/kvm/x86.c:7205
         kvm_vcpu_ioctl+0x673/0x1120 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2570
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NJim Mattson <jmattson@google.com>
      [Jim Mattson explained the bare metal behavior: "I believe this behavior
       would be documented in the chipset data sheet rather than the SDM,
       since the chipset returns all 1s for an unclaimed read."]
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      05d8d346
  12. 07 3月, 2017 2 次提交
    • W
      KVM: nVMX: reset nested_run_pending if the vCPU is going to be reset · 2f707d97
      Wanpeng Li 提交于
      Reported by syzkaller:
      
          WARNING: CPU: 1 PID: 27742 at arch/x86/kvm/vmx.c:11029
          nested_vmx_vmexit+0x5c35/0x74d0 arch/x86/kvm/vmx.c:11029
          CPU: 1 PID: 27742 Comm: a.out Not tainted 4.10.0+ #229
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
          Call Trace:
           __dump_stack lib/dump_stack.c:15 [inline]
           dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
           panic+0x1fb/0x412 kernel/panic.c:179
           __warn+0x1c4/0x1e0 kernel/panic.c:540
           warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
           nested_vmx_vmexit+0x5c35/0x74d0 arch/x86/kvm/vmx.c:11029
           vmx_leave_nested arch/x86/kvm/vmx.c:11136 [inline]
           vmx_set_msr+0x1565/0x1910 arch/x86/kvm/vmx.c:3324
           kvm_set_msr+0xd4/0x170 arch/x86/kvm/x86.c:1099
           do_set_msr+0x11e/0x190 arch/x86/kvm/x86.c:1128
           __msr_io arch/x86/kvm/x86.c:2577 [inline]
           msr_io+0x24b/0x450 arch/x86/kvm/x86.c:2614
           kvm_arch_vcpu_ioctl+0x35b/0x46a0 arch/x86/kvm/x86.c:3497
           kvm_vcpu_ioctl+0x232/0x1120 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2721
           vfs_ioctl fs/ioctl.c:43 [inline]
           do_vfs_ioctl+0x1bf/0x1790 fs/ioctl.c:683
           SYSC_ioctl fs/ioctl.c:698 [inline]
           SyS_ioctl+0x8f/0xc0 fs/ioctl.c:689
           entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      The syzkaller folks reported a nested_run_pending warning during userspace
      clear VMX capability which is exposed to L1 before.
      
      The warning gets thrown while doing
      
      (*(uint32_t*)0x20aecfe8 = (uint32_t)0x1);
      (*(uint32_t*)0x20aecfec = (uint32_t)0x0);
      (*(uint32_t*)0x20aecff0 = (uint32_t)0x3a);
      (*(uint32_t*)0x20aecff4 = (uint32_t)0x0);
      (*(uint64_t*)0x20aecff8 = (uint64_t)0x0);
      r[29] = syscall(__NR_ioctl, r[4], 0x4008ae89ul,
      		0x20aecfe8ul, 0, 0, 0, 0, 0, 0);
      
      i.e. KVM_SET_MSR ioctl with
      
      struct kvm_msrs {
      	.nmsrs = 1,
      		.pad = 0,
      		.entries = {
      			{.index = MSR_IA32_FEATURE_CONTROL,
      			 .reserved = 0,
      			 .data = 0}
      		}
      }
      
      The VMLANCH/VMRESUME emulation should be stopped since the CPU is going to
      reset here. This patch resets the nested_run_pending since the CPU is going
      to be reset hence there should be nothing pending.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Suggested-by: NRadim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      2f707d97
    • J
      kvm: nVMX: VMCLEAR should not cause the vCPU to shut down · 587d7e72
      Jim Mattson 提交于
      VMCLEAR should silently ignore a failure to clear the launch state of
      the VMCS referenced by the operand.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      [Changed "kvm_write_guest(vcpu->kvm" to "kvm_vcpu_write_guest(vcpu".]
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      587d7e72
  13. 06 3月, 2017 2 次提交
  14. 03 3月, 2017 3 次提交
    • D
      statx: Add a system call to make enhanced file info available · a528d35e
      David Howells 提交于
      Add a system call to make extended file information available, including
      file creation and some attribute flags where available through the
      underlying filesystem.
      
      The getattr inode operation is altered to take two additional arguments: a
      u32 request_mask and an unsigned int flags that indicate the
      synchronisation mode.  This change is propagated to the vfs_getattr*()
      function.
      
      Functions like vfs_stat() are now inline wrappers around new functions
      vfs_statx() and vfs_statx_fd() to reduce stack usage.
      
      ========
      OVERVIEW
      ========
      
      The idea was initially proposed as a set of xattrs that could be retrieved
      with getxattr(), but the general preference proved to be for a new syscall
      with an extended stat structure.
      
      A number of requests were gathered for features to be included.  The
      following have been included:
      
       (1) Make the fields a consistent size on all arches and make them large.
      
       (2) Spare space, request flags and information flags are provided for
           future expansion.
      
       (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
           __s64).
      
       (4) Creation time: The SMB protocol carries the creation time, which could
           be exported by Samba, which will in turn help CIFS make use of
           FS-Cache as that can be used for coherency data (stx_btime).
      
           This is also specified in NFSv4 as a recommended attribute and could
           be exported by NFSD [Steve French].
      
       (5) Lightweight stat: Ask for just those details of interest, and allow a
           netfs (such as NFS) to approximate anything not of interest, possibly
           without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
           Dilger] (AT_STATX_DONT_SYNC).
      
       (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
           its cached attributes are up to date [Trond Myklebust]
           (AT_STATX_FORCE_SYNC).
      
      And the following have been left out for future extension:
      
       (7) Data version number: Could be used by userspace NFS servers [Aneesh
           Kumar].
      
           Can also be used to modify fill_post_wcc() in NFSD which retrieves
           i_version directly, but has just called vfs_getattr().  It could get
           it from the kstat struct if it used vfs_xgetattr() instead.
      
           (There's disagreement on the exact semantics of a single field, since
           not all filesystems do this the same way).
      
       (8) BSD stat compatibility: Including more fields from the BSD stat such
           as creation time (st_btime) and inode generation number (st_gen)
           [Jeremy Allison, Bernd Schubert].
      
       (9) Inode generation number: Useful for FUSE and userspace NFS servers
           [Bernd Schubert].
      
           (This was asked for but later deemed unnecessary with the
           open-by-handle capability available and caused disagreement as to
           whether it's a security hole or not).
      
      (10) Extra coherency data may be useful in making backups [Andreas Dilger].
      
           (No particular data were offered, but things like last backup
           timestamp, the data version number and the DOS archive bit would come
           into this category).
      
      (11) Allow the filesystem to indicate what it can/cannot provide: A
           filesystem can now say it doesn't support a standard stat feature if
           that isn't available, so if, for instance, inode numbers or UIDs don't
           exist or are fabricated locally...
      
           (This requires a separate system call - I have an fsinfo() call idea
           for this).
      
      (12) Store a 16-byte volume ID in the superblock that can be returned in
           struct xstat [Steve French].
      
           (Deferred to fsinfo).
      
      (13) Include granularity fields in the time data to indicate the
           granularity of each of the times (NFSv4 time_delta) [Steve French].
      
           (Deferred to fsinfo).
      
      (14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
           Note that the Linux IOC flags are a mess and filesystems such as Ext4
           define flags that aren't in linux/fs.h, so translation in the kernel
           may be a necessity (or, possibly, we provide the filesystem type too).
      
           (Some attributes are made available in stx_attributes, but the general
           feeling was that the IOC flags were to ext[234]-specific and shouldn't
           be exposed through statx this way).
      
      (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
           Michael Kerrisk].
      
           (Deferred, probably to fsinfo.  Finding out if there's an ACL or
           seclabal might require extra filesystem operations).
      
      (16) Femtosecond-resolution timestamps [Dave Chinner].
      
           (A __reserved field has been left in the statx_timestamp struct for
           this - if there proves to be a need).
      
      (17) A set multiple attributes syscall to go with this.
      
      ===============
      NEW SYSTEM CALL
      ===============
      
      The new system call is:
      
      	int ret = statx(int dfd,
      			const char *filename,
      			unsigned int flags,
      			unsigned int mask,
      			struct statx *buffer);
      
      The dfd, filename and flags parameters indicate the file to query, in a
      similar way to fstatat().  There is no equivalent of lstat() as that can be
      emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
      also no equivalent of fstat() as that can be emulated by passing a NULL
      filename to statx() with the fd of interest in dfd.
      
      Whether or not statx() synchronises the attributes with the backing store
      can be controlled by OR'ing a value into the flags argument (this typically
      only affects network filesystems):
      
       (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
           respect.
      
       (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
           its attributes with the server - which might require data writeback to
           occur to get the timestamps correct.
      
       (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
           network filesystem.  The resulting values should be considered
           approximate.
      
      mask is a bitmask indicating the fields in struct statx that are of
      interest to the caller.  The user should set this to STATX_BASIC_STATS to
      get the basic set returned by stat().  It should be noted that asking for
      more information may entail extra I/O operations.
      
      buffer points to the destination for the data.  This must be 256 bytes in
      size.
      
      ======================
      MAIN ATTRIBUTES RECORD
      ======================
      
      The following structures are defined in which to return the main attribute
      set:
      
      	struct statx_timestamp {
      		__s64	tv_sec;
      		__s32	tv_nsec;
      		__s32	__reserved;
      	};
      
      	struct statx {
      		__u32	stx_mask;
      		__u32	stx_blksize;
      		__u64	stx_attributes;
      		__u32	stx_nlink;
      		__u32	stx_uid;
      		__u32	stx_gid;
      		__u16	stx_mode;
      		__u16	__spare0[1];
      		__u64	stx_ino;
      		__u64	stx_size;
      		__u64	stx_blocks;
      		__u64	__spare1[1];
      		struct statx_timestamp	stx_atime;
      		struct statx_timestamp	stx_btime;
      		struct statx_timestamp	stx_ctime;
      		struct statx_timestamp	stx_mtime;
      		__u32	stx_rdev_major;
      		__u32	stx_rdev_minor;
      		__u32	stx_dev_major;
      		__u32	stx_dev_minor;
      		__u64	__spare2[14];
      	};
      
      The defined bits in request_mask and stx_mask are:
      
      	STATX_TYPE		Want/got stx_mode & S_IFMT
      	STATX_MODE		Want/got stx_mode & ~S_IFMT
      	STATX_NLINK		Want/got stx_nlink
      	STATX_UID		Want/got stx_uid
      	STATX_GID		Want/got stx_gid
      	STATX_ATIME		Want/got stx_atime{,_ns}
      	STATX_MTIME		Want/got stx_mtime{,_ns}
      	STATX_CTIME		Want/got stx_ctime{,_ns}
      	STATX_INO		Want/got stx_ino
      	STATX_SIZE		Want/got stx_size
      	STATX_BLOCKS		Want/got stx_blocks
      	STATX_BASIC_STATS	[The stuff in the normal stat struct]
      	STATX_BTIME		Want/got stx_btime{,_ns}
      	STATX_ALL		[All currently available stuff]
      
      stx_btime is the file creation time, stx_mask is a bitmask indicating the
      data provided and __spares*[] are where as-yet undefined fields can be
      placed.
      
      Time fields are structures with separate seconds and nanoseconds fields
      plus a reserved field in case we want to add even finer resolution.  Note
      that times will be negative if before 1970; in such a case, the nanosecond
      fields will also be negative if not zero.
      
      The bits defined in the stx_attributes field convey information about a
      file, how it is accessed, where it is and what it does.  The following
      attributes map to FS_*_FL flags and are the same numerical value:
      
      	STATX_ATTR_COMPRESSED		File is compressed by the fs
      	STATX_ATTR_IMMUTABLE		File is marked immutable
      	STATX_ATTR_APPEND		File is append-only
      	STATX_ATTR_NODUMP		File is not to be dumped
      	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs
      
      Within the kernel, the supported flags are listed by:
      
      	KSTAT_ATTR_FS_IOC_FLAGS
      
      [Are any other IOC flags of sufficient general interest to be exposed
      through this interface?]
      
      New flags include:
      
      	STATX_ATTR_AUTOMOUNT		Object is an automount trigger
      
      These are for the use of GUI tools that might want to mark files specially,
      depending on what they are.
      
      Fields in struct statx come in a number of classes:
      
       (0) stx_dev_*, stx_blksize.
      
           These are local system information and are always available.
      
       (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
           stx_size, stx_blocks.
      
           These will be returned whether the caller asks for them or not.  The
           corresponding bits in stx_mask will be set to indicate whether they
           actually have valid values.
      
           If the caller didn't ask for them, then they may be approximated.  For
           example, NFS won't waste any time updating them from the server,
           unless as a byproduct of updating something requested.
      
           If the values don't actually exist for the underlying object (such as
           UID or GID on a DOS file), then the bit won't be set in the stx_mask,
           even if the caller asked for the value.  In such a case, the returned
           value will be a fabrication.
      
           Note that there are instances where the type might not be valid, for
           instance Windows reparse points.
      
       (2) stx_rdev_*.
      
           This will be set only if stx_mode indicates we're looking at a
           blockdev or a chardev, otherwise will be 0.
      
       (3) stx_btime.
      
           Similar to (1), except this will be set to 0 if it doesn't exist.
      
      =======
      TESTING
      =======
      
      The following test program can be used to test the statx system call:
      
      	samples/statx/test-statx.c
      
      Just compile and run, passing it paths to the files you want to examine.
      The file is built automatically if CONFIG_SAMPLES is enabled.
      
      Here's some example output.  Firstly, an NFS directory that crosses to
      another FSID.  Note that the AUTOMOUNT attribute is set because transiting
      this directory will cause d_automount to be invoked by the VFS.
      
      	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:26           Inode: 1703937     Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
      
      Secondly, the result of automounting on that directory.
      
      	[root@andromeda ~]# /tmp/test-statx /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:27           Inode: 2           Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a528d35e
    • I
      sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h> · 1a79a72c
      Ingo Molnar 提交于
      We want to simplify <linux/sched.h>'s header dependencies, but one
      roadblock to that is <asm/apic.h>'s inclusion of pm.h,
      which brings in other, problematic headers.
      
      Remove it, as it appears to be entirely spurious, apic.h does not
      actually make use of any PM facilities.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1a79a72c
    • I
      sched/headers: Move task->mm handling methods to <linux/sched/mm.h> · 68e21be2
      Ingo Molnar 提交于
      Move the following task->mm helper APIs into a new header file,
      <linux/sched/mm.h>, to further reduce the size and complexity
      of <linux/sched.h>.
      
      Here are how the APIs are used in various kernel files:
      
        # mm_alloc():
        arch/arm/mach-rpc/ecard.c
        fs/exec.c
        include/linux/sched/mm.h
        kernel/fork.c
      
        # __mmdrop():
        arch/arc/include/asm/mmu_context.h
        include/linux/sched/mm.h
        kernel/fork.c
      
        # mmdrop():
        arch/arm/mach-rpc/ecard.c
        arch/m68k/sun3/mmu_emu.c
        arch/x86/mm/tlb.c
        drivers/gpu/drm/amd/amdkfd/kfd_process.c
        drivers/gpu/drm/i915/i915_gem_userptr.c
        drivers/infiniband/hw/hfi1/file_ops.c
        drivers/vfio/vfio_iommu_spapr_tce.c
        fs/exec.c
        fs/proc/base.c
        fs/proc/task_mmu.c
        fs/proc/task_nommu.c
        fs/userfaultfd.c
        include/linux/mmu_notifier.h
        include/linux/sched/mm.h
        kernel/fork.c
        kernel/futex.c
        kernel/sched/core.c
        mm/khugepaged.c
        mm/ksm.c
        mm/mmu_context.c
        mm/mmu_notifier.c
        mm/oom_kill.c
        virt/kvm/kvm_main.c
      
        # mmdrop_async_fn():
        include/linux/sched/mm.h
      
        # mmdrop_async():
        include/linux/sched/mm.h
        kernel/fork.c
      
        # mmget_not_zero():
        fs/userfaultfd.c
        include/linux/sched/mm.h
        mm/oom_kill.c
      
        # mmput():
        arch/arc/include/asm/mmu_context.h
        arch/arc/kernel/troubleshoot.c
        arch/frv/mm/mmu-context.c
        arch/powerpc/platforms/cell/spufs/context.c
        arch/sparc/include/asm/mmu_context_32.h
        drivers/android/binder.c
        drivers/gpu/drm/etnaviv/etnaviv_gem.c
        drivers/gpu/drm/i915/i915_gem_userptr.c
        drivers/infiniband/core/umem.c
        drivers/infiniband/core/umem_odp.c
        drivers/infiniband/core/uverbs_main.c
        drivers/infiniband/hw/mlx4/main.c
        drivers/infiniband/hw/mlx5/main.c
        drivers/infiniband/hw/usnic/usnic_uiom.c
        drivers/iommu/amd_iommu_v2.c
        drivers/iommu/intel-svm.c
        drivers/lguest/lguest_user.c
        drivers/misc/cxl/fault.c
        drivers/misc/mic/scif/scif_rma.c
        drivers/oprofile/buffer_sync.c
        drivers/vfio/vfio_iommu_type1.c
        drivers/vhost/vhost.c
        drivers/xen/gntdev.c
        fs/exec.c
        fs/proc/array.c
        fs/proc/base.c
        fs/proc/task_mmu.c
        fs/proc/task_nommu.c
        fs/userfaultfd.c
        include/linux/sched/mm.h
        kernel/cpuset.c
        kernel/events/core.c
        kernel/events/uprobes.c
        kernel/exit.c
        kernel/fork.c
        kernel/ptrace.c
        kernel/sys.c
        kernel/trace/trace_output.c
        kernel/tsacct.c
        mm/memcontrol.c
        mm/memory.c
        mm/mempolicy.c
        mm/migrate.c
        mm/mmu_notifier.c
        mm/nommu.c
        mm/oom_kill.c
        mm/process_vm_access.c
        mm/rmap.c
        mm/swapfile.c
        mm/util.c
        virt/kvm/async_pf.c
      
        # mmput_async():
        include/linux/sched/mm.h
        kernel/fork.c
        mm/oom_kill.c
      
        # get_task_mm():
        arch/arc/kernel/troubleshoot.c
        arch/powerpc/platforms/cell/spufs/context.c
        drivers/android/binder.c
        drivers/gpu/drm/etnaviv/etnaviv_gem.c
        drivers/infiniband/core/umem.c
        drivers/infiniband/core/umem_odp.c
        drivers/infiniband/hw/mlx4/main.c
        drivers/infiniband/hw/mlx5/main.c
        drivers/infiniband/hw/usnic/usnic_uiom.c
        drivers/iommu/amd_iommu_v2.c
        drivers/iommu/intel-svm.c
        drivers/lguest/lguest_user.c
        drivers/misc/cxl/fault.c
        drivers/misc/mic/scif/scif_rma.c
        drivers/oprofile/buffer_sync.c
        drivers/vfio/vfio_iommu_type1.c
        drivers/vhost/vhost.c
        drivers/xen/gntdev.c
        fs/proc/array.c
        fs/proc/base.c
        fs/proc/task_mmu.c
        include/linux/sched/mm.h
        kernel/cpuset.c
        kernel/events/core.c
        kernel/exit.c
        kernel/fork.c
        kernel/ptrace.c
        kernel/sys.c
        kernel/trace/trace_output.c
        kernel/tsacct.c
        mm/memcontrol.c
        mm/memory.c
        mm/mempolicy.c
        mm/migrate.c
        mm/mmu_notifier.c
        mm/nommu.c
        mm/util.c
      
        # mm_access():
        fs/proc/base.c
        include/linux/sched/mm.h
        kernel/fork.c
        mm/process_vm_access.c
      
        # mm_release():
        arch/arc/include/asm/mmu_context.h
        fs/exec.c
        include/linux/sched/mm.h
        include/uapi/linux/sched.h
        kernel/exit.c
        kernel/fork.c
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      68e21be2
  15. 02 3月, 2017 3 次提交
    • T
      x86/hpet: Prevent might sleep splat on resume · bb1a2c26
      Thomas Gleixner 提交于
      Sergey reported a might sleep warning triggered from the hpet resume
      path. It's caused by the call to disable_irq() from interrupt disabled
      context.
      
      The problem with the low level resume code is that it is not accounted as a
      special system_state like we do during the boot process. Calling the same
      code during system boot would not trigger the warning. That's inconsistent
      at best.
      
      In this particular case it's trivial to replace the disable_irq() with
      disable_hardirq() because this particular code path is solely used from
      system resume and the involved hpet interrupts can never be force threaded.
      Reported-and-tested-by: NSergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1703012108460.3684@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      bb1a2c26
    • P
      sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface · f94c8d11
      Peter Zijlstra 提交于
      Wanpeng Li reported that since the following commit:
      
        acb04058 ("sched/clock: Fix hotplug crash")
      
      ... KVM always runs with unstable sched-clock even though KVM's
      kvm_clock _is_ stable.
      
      The problem is that we've tied clear_sched_clock_stable() to the TSC
      state, and overlooked that sched_clock() is a paravirt function.
      
      Solve this by doing two things:
      
       - tie the sched_clock() stable state more clearly to the TSC stable
         state for the normal (!paravirt) case.
      
       - only call clear_sched_clock_stable() when we mark TSC unstable
         when we use native_sched_clock().
      
      The first means we can actually run with stable sched_clock in more
      situations then before, which is good. And since commit:
      
        12907fbb ("sched/clock, clocksource: Add optional cs::mark_unstable() method")
      
      ... this should be reliable. Since any detection of TSC fail now results
      in marking the TSC unstable.
      Reported-by: NWanpeng Li <kernellwp@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: acb04058 ("sched/clock: Fix hotplug crash")
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      f94c8d11
    • I
      sched/headers: Prepare to move sched_info_on() and force_schedstat_enabled()... · 3905f9ad
      Ingo Molnar 提交于
      sched/headers: Prepare to move sched_info_on() and force_schedstat_enabled() from <linux/sched.h> to <linux/sched/stat.h>
      
      But first update usage sites with the new header dependency.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      3905f9ad