1. 31 10月, 2014 2 次提交
  2. 30 10月, 2014 4 次提交
    • J
      mm: memcontrol: fix missed end-writeback page accounting · d7365e78
      Johannes Weiner 提交于
      Commit 0a31bc97 ("mm: memcontrol: rewrite uncharge API") changed
      page migration to uncharge the old page right away.  The page is locked,
      unmapped, truncated, and off the LRU, but it could race with writeback
      ending, which then doesn't unaccount the page properly:
      
      test_clear_page_writeback()              migration
                                                 wait_on_page_writeback()
        TestClearPageWriteback()
                                                 mem_cgroup_migrate()
                                                   clear PCG_USED
        mem_cgroup_update_page_stat()
          if (PageCgroupUsed(pc))
            decrease memcg pages under writeback
      
        release pc->mem_cgroup->move_lock
      
      The per-page statistics interface is heavily optimized to avoid a
      function call and a lookup_page_cgroup() in the file unmap fast path,
      which means it doesn't verify whether a page is still charged before
      clearing PageWriteback() and it has to do it in the stat update later.
      
      Rework it so that it looks up the page's memcg once at the beginning of
      the transaction and then uses it throughout.  The charge will be
      verified before clearing PageWriteback() and migration can't uncharge
      the page as long as that is still set.  The RCU lock will protect the
      memcg past uncharge.
      
      As far as losing the optimization goes, the following test results are
      from a microbenchmark that maps, faults, and unmaps a 4GB sparse file
      three times in a nested fashion, so that there are two negative passes
      that don't account but still go through the new transaction overhead.
      There is no actual difference:
      
       old:     33.195102545 seconds time elapsed       ( +-  0.01% )
       new:     33.199231369 seconds time elapsed       ( +-  0.03% )
      
      The time spent in page_remove_rmap()'s callees still adds up to the
      same, but the time spent in the function itself seems reduced:
      
           # Children      Self  Command        Shared Object       Symbol
       old:     0.12%     0.11%  filemapstress  [kernel.kallsyms]   [k] page_remove_rmap
       new:     0.12%     0.08%  filemapstress  [kernel.kallsyms]   [k] page_remove_rmap
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: <stable@vger.kernel.org>	[3.17.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7365e78
    • J
      mm: page-writeback: inline account_page_dirtied() into single caller · 3a3c02ec
      Johannes Weiner 提交于
      A follow-up patch would have changed the call signature.  To save the
      trouble, just fold it instead.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: <stable@vger.kernel.org>	[3.17.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a3c02ec
    • D
      mm, thp: fix collapsing of hugepages on madvise · 6d50e60c
      David Rientjes 提交于
      If an anonymous mapping is not allowed to fault thp memory and then
      madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
      collapse this memory into thp memory.
      
      This occurs because the madvise(2) handler for thp, hugepage_madvise(),
      clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
      until the final action of madvise_behavior().  This causes the
      khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
      the vma had previously had VM_NOHUGEPAGE set.
      
      Fix this by passing the correct vma flags to the khugepaged mm slot
      handler.  There's no chance khugepaged can run on this vma until after
      madvise_behavior() returns since we hold mm->mmap_sem.
      
      It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
      in hugepage_advise(), but I didn't want to introduce special case
      behavior into madvise_behavior().  I think it's best to just let it
      always set vma->vm_flags itself.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NSuleiman Souhlal <suleiman@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d50e60c
    • M
      drivers: of: add return value to of_reserved_mem_device_init() · 47f29df7
      Marek Szyprowski 提交于
      Driver calling of_reserved_mem_device_init() might be interested if the
      initialization has been successful or not, so add support for returning
      error code.
      
      This fixes a build warining caused by commit 7bfa5ab6 ("drivers:
      dma-coherent: add initialization from device tree"), which has been
      merged without this change and without fixing function return value.
      
      Fixes: 7bfa5ab6 ("drivers: dma-coherent: add initialization from device tree")
      Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Josh Cartwright <joshc@codeaurora.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      47f29df7
  3. 29 10月, 2014 5 次提交
  4. 28 10月, 2014 1 次提交
  5. 24 10月, 2014 8 次提交
    • W
      kvm: vfio: fix unregister kvm_device_ops of vfio · 571ee1b6
      Wanpeng Li 提交于
      After commit 80ce1639 (KVM: VFIO: register kvm_device_ops dynamically),
      kvm_device_ops of vfio can be registered dynamically. Commit 3c3c29fd
      (kvm-vfio: do not use module_init) move the dynamic register invoked by
      kvm_init in order to fix broke unloading of the kvm module. However,
      kvm_device_ops of vfio is unregistered after rmmod kvm-intel module
      which lead to device type collision detection warning after kvm-intel
      module reinsmod.
      
          WARNING: CPU: 1 PID: 10358 at /root/cathy/kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:3289 kvm_init+0x234/0x282 [kvm]()
          Modules linked in: kvm_intel(O+) kvm(O) nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs fscache lockd sunrpc pci_stub bridge stp llc autofs4 8021q cpufreq_ondemand ipv6 joydev microcode pcspkr igb i2c_algo_bit ehci_pci ehci_hcd e1000e i2c_i801 ixgbe ptp pps_core hwmon mdio tpm_tis tpm ipmi_si ipmi_msghandler acpi_cpufreq isci libsas scsi_transport_sas button dm_mirror dm_region_hash dm_log dm_mod [last unloaded: kvm_intel]
          CPU: 1 PID: 10358 Comm: insmod Tainted: G        W  O   3.17.0-rc1 #2
          Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
           0000000000000cd9 ffff880ff08cfd18 ffffffff814a61d9 0000000000000cd9
           0000000000000000 ffff880ff08cfd58 ffffffff810417b7 ffff880ff08cfd48
           ffffffffa045bcac ffffffffa049c420 0000000000000040 00000000000000ff
          Call Trace:
           [<ffffffff814a61d9>] dump_stack+0x49/0x60
           [<ffffffff810417b7>] warn_slowpath_common+0x7c/0x96
           [<ffffffffa045bcac>] ? kvm_init+0x234/0x282 [kvm]
           [<ffffffff810417e6>] warn_slowpath_null+0x15/0x17
           [<ffffffffa045bcac>] kvm_init+0x234/0x282 [kvm]
           [<ffffffffa016e995>] vmx_init+0x1bf/0x42a [kvm_intel]
           [<ffffffffa016e7d6>] ? vmx_check_processor_compat+0x64/0x64 [kvm_intel]
           [<ffffffff810002ab>] do_one_initcall+0xe3/0x170
           [<ffffffff811168a9>] ? __vunmap+0xad/0xb8
           [<ffffffff8109c58f>] do_init_module+0x2b/0x174
           [<ffffffff8109d414>] load_module+0x43e/0x569
           [<ffffffff8109c6d8>] ? do_init_module+0x174/0x174
           [<ffffffff8109c75a>] ? copy_module_from_user+0x39/0x82
           [<ffffffff8109b7dd>] ? module_sect_show+0x20/0x20
           [<ffffffff8109d65f>] SyS_init_module+0x54/0x81
           [<ffffffff814a9a12>] system_call_fastpath+0x16/0x1b
          ---[ end trace 0626f4a3ddea56f3 ]---
      
      The bug can be reproduced by:
      
          rmmod kvm_intel.ko
          insmod kvm_intel.ko
      
      without rmmod/insmod kvm.ko
      This patch fixes the bug by unregistering kvm_device_ops of vfio when the
      kvm-intel module is removed.
      Reported-by: NLiu Rongrong <rongrongx.liu@intel.com>
      Fixes: 3c3c29fdSigned-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      571ee1b6
    • M
      fs: limit filesystem stacking depth · 69c433ed
      Miklos Szeredi 提交于
      Add a simple read-only counter to super_block that indicates how deep this
      is in the stack of filesystems.  Previously ecryptfs was the only stackable
      filesystem and it explicitly disallowed multiple layers of itself.
      
      Overlayfs, however, can be stacked recursively and also may be stacked
      on top of ecryptfs or vice versa.
      
      To limit the kernel stack usage we must limit the depth of the
      filesystem stack.  Initially the limit is set to 2.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      69c433ed
    • M
      vfs: add whiteout support · 787fb6bc
      Miklos Szeredi 提交于
      Whiteout isn't actually a new file type, but is represented as a char
      device (Linus's idea) with 0/0 device number.
      
      This has several advantages compared to introducing a new whiteout file
      type:
      
       - no userspace API changes (e.g. trivial to make backups of upper layer
         filesystem, without losing whiteouts)
      
       - no fs image format changes (you can boot an old kernel/fsck without
         whiteout support and things won't break)
      
       - implementation is trivial
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      787fb6bc
    • M
      vfs: export check_sticky() · cbdf35bc
      Miklos Szeredi 提交于
      It's already duplicated in btrfs and about to be used in overlayfs too.
      
      Move the sticky bit check to an inline helper and call the out-of-line
      helper only in the unlikly case of the sticky bit being set.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      cbdf35bc
    • M
      vfs: introduce clone_private_mount() · c771d683
      Miklos Szeredi 提交于
      Overlayfs needs a private clone of the mount, so create a function for
      this and export to modules.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      c771d683
    • M
      vfs: export __inode_permission() to modules · bd5d0856
      Miklos Szeredi 提交于
      We need to be able to check inode permissions (but not filesystem implied
      permissions) for stackable filesystems.  Expose this interface for overlayfs.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      bd5d0856
    • M
      vfs: export do_splice_direct() to modules · 1c118596
      Miklos Szeredi 提交于
      Export do_splice_direct() to modules.  Needed by overlay filesystem.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      1c118596
    • M
      vfs: add i_op->dentry_open() · 4aa7c634
      Miklos Szeredi 提交于
      Add a new inode operation i_op->dentry_open().  This is for stacked filesystems
      that want to return a struct file from a different filesystem.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      4aa7c634
  6. 23 10月, 2014 6 次提交
    • B
      uprobes: Remove "weak" from function declarations · 271a9c35
      Bjorn Helgaas 提交于
      For the following interfaces:
      
        set_swbp()
        set_orig_insn()
        is_swbp_insn()
        is_trap_insn()
        uprobe_get_swbp_addr()
        arch_uprobe_ignore()
        arch_uprobe_copy_ixol()
      
      kernel/events/uprobes.c provides default definitions explicitly marked
      "weak".  Some architectures provide their own definitions intended to
      override the defaults, but the "weak" attribute on the declarations applied
      to the arch definitions as well, so the linker chose one based on link
      order (see 10629d71 ("PCI: Remove __weak annotation from
      pcibios_get_phb_of_node decl")).
      
      Remove the "weak" attribute from the declarations so we always prefer a
      non-weak definition over the weak one, independent of link order.
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Acked-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      CC: Victor Kamensky <victor.kamensky@linaro.org>
      CC: Oleg Nesterov <oleg@redhat.com>
      CC: David A. Long <dave.long@linaro.org>
      CC: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      271a9c35
    • B
      memory-hotplug: Remove "weak" from memory_block_size_bytes() declaration · e0a8400c
      Bjorn Helgaas 提交于
      drivers/base/memory.c provides a default memory_block_size_bytes()
      definition explicitly marked "weak".  Several architectures provide their
      own definitions intended to override the default, but the "weak" attribute
      on the declaration applied to the arch definitions as well, so the linker
      chose one based on link order (see 10629d71 ("PCI: Remove __weak
      annotation from pcibios_get_phb_of_node decl")).
      
      Remove the "weak" attribute from the declaration so we always prefer a
      non-weak definition over the weak one, independent of link order.
      
      Fixes: 41f10726 ("drivers: base: Add prototype declaration to the header file")
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      CC: Rashika Kheria <rashika.kheria@gmail.com>
      CC: Nathan Fontenot <nfont@austin.ibm.com>
      CC: Anton Blanchard <anton@au1.ibm.com>
      CC: Heiko Carstens <heiko.carstens@de.ibm.com>
      CC: Yinghai Lu <yinghai@kernel.org>
      e0a8400c
    • B
      kgdb: Remove "weak" from kgdb_arch_pc() declaration · 107bcc6d
      Bjorn Helgaas 提交于
      kernel/debug/debug_core.c provides a default kgdb_arch_pc() definition
      explicitly marked "weak".  Several architectures provide their own
      definitions intended to override the default, but the "weak" attribute on
      the declaration applied to the arch definitions as well, so the linker
      chose one based on link order (see 10629d71 ("PCI: Remove __weak
      annotation from pcibios_get_phb_of_node decl")).
      
      Remove the "weak" attribute from the declaration so we always prefer a
      non-weak definition over the weak one, independent of link order.
      
      Fixes: 688b744d ("kgdb: fix signedness mixmatches, add statics, add declaration to header")
      Tested-by: Vineet Gupta <vgupta@synopsys.com>	# for ARC build
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: NHarvey Harrison <harvey.harrison@gmail.com>
      107bcc6d
    • B
      vmcore: Remove "weak" from function declarations · 5ab03ac5
      Bjorn Helgaas 提交于
      For the following functions:
      
        elfcorehdr_alloc()
        elfcorehdr_free()
        elfcorehdr_read()
        elfcorehdr_read_notes()
        remap_oldmem_pfn_range()
      
      fs/proc/vmcore.c provides default definitions explicitly marked "weak".
      arch/s390 provides its own definitions intended to override the default
      ones, but the "weak" attribute on the declarations applied to the s390
      definitions as well, so the linker chose one based on link order (see
      10629d71 ("PCI: Remove __weak annotation from pcibios_get_phb_of_node
      decl")).
      
      Remove the "weak" attribute from the declarations so we always prefer a
      non-weak definition over the weak one, independent of link order.
      
      Fixes: be8a8d06 ("vmcore: introduce ELF header in new memory feature")
      Fixes: 9cb21813 ("vmcore: introduce remap_oldmem_pfn_range()")
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      CC: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      5ab03ac5
    • B
      clocksource: Remove "weak" from clocksource_default_clock() declaration · 96a2adbc
      Bjorn Helgaas 提交于
      kernel/time/jiffies.c provides a default clocksource_default_clock()
      definition explicitly marked "weak".  arch/s390 provides its own definition
      intended to override the default, but the "weak" attribute on the
      declaration applied to the s390 definition as well, so the linker chose one
      based on link order (see 10629d71 ("PCI: Remove __weak annotation from
      pcibios_get_phb_of_node decl")).
      
      Remove the "weak" attribute from the clocksource_default_clock()
      declaration so we always prefer a non-weak definition over the weak one,
      independent of link order.
      
      Fixes: f1b82746 ("clocksource: Cleanup clocksource selection")
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Acked-by: NJohn Stultz <john.stultz@linaro.org>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      CC: Daniel Lezcano <daniel.lezcano@linaro.org>
      CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
      96a2adbc
    • B
      audit: Remove "weak" from audit_classify_compat_syscall() declaration · 9e8beeb7
      Bjorn Helgaas 提交于
      There's only one audit_classify_compat_syscall() definition, so it doesn't
      need to be weak.
      
      Remove the "weak" attribute from the audit_classify_compat_syscall()
      declaration.
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Acked-by: NRichard Guy Briggs <rgb@redhat.com>
      CC: AKASHI Takahiro <takahiro.akashi@linaro.org>
      9e8beeb7
  7. 22 10月, 2014 1 次提交
    • M
      OOM, PM: OOM killed task shouldn't escape PM suspend · 5695be14
      Michal Hocko 提交于
      PM freezer relies on having all tasks frozen by the time devices are
      getting frozen so that no task will touch them while they are getting
      frozen. But OOM killer is allowed to kill an already frozen task in
      order to handle OOM situtation. In order to protect from late wake ups
      OOM killer is disabled after all tasks are frozen. This, however, still
      keeps a window open when a killed task didn't manage to die by the time
      freeze_processes finishes.
      
      Reduce the race window by checking all tasks after OOM killer has been
      disabled. This is still not race free completely unfortunately because
      oom_killer_disable cannot stop an already ongoing OOM killer so a task
      might still wake up from the fridge and get killed without
      freeze_processes noticing. Full synchronization of OOM and freezer is,
      however, too heavy weight for this highly unlikely case.
      
      Introduce and check oom_kills counter which gets incremented early when
      the allocator enters __alloc_pages_may_oom path and only check all the
      tasks if the counter changes during the freezing attempt. The counter
      is updated so early to reduce the race window since allocator checked
      oom_killer_disabled which is set by PM-freezing code. A false positive
      will push the PM-freezer into a slow path but that is not a big deal.
      
      Changes since v1
      - push the re-check loop out of freeze_processes into
        check_frozen_processes and invert the condition to make the code more
        readable as per Rafael
      
      Fixes: f660daac (oom: thaw threads if oom killed thread is frozen before deferring)
      Cc: 3.2+ <stable@vger.kernel.org> # 3.2+
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      5695be14
  8. 21 10月, 2014 3 次提交
  9. 20 10月, 2014 1 次提交
  10. 17 10月, 2014 1 次提交
    • D
      random: add and use memzero_explicit() for clearing data · d4c5efdb
      Daniel Borkmann 提交于
      zatimend has reported that in his environment (3.16/gcc4.8.3/corei7)
      memset() calls which clear out sensitive data in extract_{buf,entropy,
      entropy_user}() in random driver are being optimized away by gcc.
      
      Add a helper memzero_explicit() (similarly as explicit_bzero() variants)
      that can be used in such cases where a variable with sensitive data is
      being cleared out in the end. Other use cases might also be in crypto
      code. [ I have put this into lib/string.c though, as it's always built-in
      and doesn't need any dependencies then. ]
      
      Fixes kernel bugzilla: 82041
      
      Reported-by: zatimend@hotmail.co.uk
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      d4c5efdb
  11. 16 10月, 2014 4 次提交
  12. 15 10月, 2014 4 次提交