1. 15 Sep, 2012 3 commits
    • ptrace/x86: Partly fix set_task_blockstep()->update_debugctlmsr() logic · 95cf00fa
      Oleg Nesterov committed
      Afaics the usage of update_debugctlmsr() and TIF_BLOCKSTEP in
      step.c was always very wrong.
      
      1. update_debugctlmsr() was simply unneeded. The child sleeps
         TASK_TRACED, __switch_to_xtra(next_p => child) should notice
         TIF_BLOCKSTEP and set/clear DEBUGCTLMSR_BTF after resume if
         needed.
      
      2. It is wrong. The state of DEBUGCTLMSR_BTF bit in CPU register
         should always match the state of current's TIF_BLOCKSTEP bit.
      
      3. Even get_debugctlmsr() + update_debugctlmsr() itself does not
         look right. Irq can change other bits in MSR_IA32_DEBUGCTLMSR
         register or the caller can be preempted in between.
      
      4. It is not safe to play with TIF_BLOCKSTEP if task != current.
         DEBUGCTLMSR_BTF and TIF_BLOCKSTEP should always match each
         other if the task is running. The tracee is stopped but it
         can be SIGKILL'ed right before set/clear_tsk_thread_flag().
      
      However, now that uprobes uses user_enable_single_step(current)
      we can't simply remove update_debugctlmsr(). So this patch adds
      the additional "task == current" check and disables irqs to avoid
      the race with interrupts/preemption.
      
      Unfortunately this patch doesn't solve the last problem, we need
      another fix. Probably we should teach ptrace_stop() to set/clear
      single/block stepping after resume.
      
      And afaics there is yet another problem: perf can play with
      MSR_IA32_DEBUGCTLMSR from nmi, this obviously means that even
      __switch_to_xtra() has problems.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      95cf00fa
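The "task == current" rule from this commit can be illustrated with a small user-space model. This is only a sketch, not the kernel code: `struct task`, `cpu_debugctl`, `current_task`, `MODEL_BTF` and `model_set_blockstep()` are hypothetical stand-ins for the task structure, the MSR_IA32_DEBUGCTLMSR register, `current`, DEBUGCTLMSR_BTF and set_task_blockstep(). Interrupt disabling is left out because a single-threaded model has nothing to race with.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_BTF (1u << 1)       /* stands in for DEBUGCTLMSR_BTF */

struct task {
	bool tif_blockstep;       /* stands in for TIF_BLOCKSTEP */
};

static uint32_t cpu_debugctl;     /* models this CPU's debug-control MSR */
static struct task *current_task; /* models "current" */

static void model_set_blockstep(struct task *task, bool on)
{
	uint32_t debugctl = cpu_debugctl;

	assert(task != NULL);

	if (on)
		debugctl |= MODEL_BTF;
	else
		debugctl &= ~MODEL_BTF;
	task->tif_blockstep = on;

	/* Only write the register when the target runs on this CPU. */
	if (task == current_task)
		cpu_debugctl = debugctl;
}
```

In this model, changing the flag of a stopped tracee leaves the register alone, so the register keeps matching the flag of whichever task is current.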
    • ptrace/x86: Introduce set_task_blockstep() helper · 848e8f5f
      Oleg Nesterov committed
      No functional changes, preparation for the next fix and for uprobes
      single-step fixes.
      
      Move the code playing with TIF_BLOCKSTEP/DEBUGCTLMSR_BTF into the
      new helper, set_task_blockstep().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      848e8f5f
    • uprobes/x86: Implement x86 specific arch_uprobe_*_step · bdc1e472
      Sebastian Andrzej Siewior committed
      The arch specific implementation behaves like user_enable_single_step()
      except that it does not disable single stepping if it was already
      enabled by ptrace. This allows the debugger to single step over an
      uprobe. The state of block stepping is not restored. It only makes sense
      together with TF, and if that was enabled then the debugger is notified.
      
      Note: this is still not correct. For example, TIF_SINGLESTEP check
      is not right, the application itself can set X86_EFLAGS_TF. And otoh
      we leak TIF_SINGLESTEP (set by enable) if the probed insn is "popf".
      See the next patches, we need the changes in arch/x86/kernel/step.c
      first.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      bdc1e472
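The enable/disable behaviour described in this commit can be modeled as follows. This is a sketch under assumptions, not the kernel implementation: `struct model_task` and both functions are hypothetical, and the `singlestep` field merely stands in for TIF_SINGLESTEP. As the commit message itself notes, checking that flag is not fully correct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_task {
	bool singlestep;      /* stands in for TIF_SINGLESTEP */
	bool restore_clear;   /* remembered: turn stepping off afterwards? */
};

static void model_uprobe_enable_step(struct model_task *t)
{
	assert(t != NULL);
	/* Only undo single-stepping later if it was off before the probe. */
	t->restore_clear = !t->singlestep;
	t->singlestep = true;
}

static void model_uprobe_disable_step(struct model_task *t)
{
	assert(t != NULL);
	if (t->restore_clear)
		t->singlestep = false;
}
```

A task that a debugger was already single-stepping keeps stepping after the probe, while an untraced task returns to normal execution.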
  2. 27 Aug, 2012 1 commit
  3. 24 Aug, 2012 14 commits
  4. 23 Aug, 2012 15 commits
    • powerpc/fsl: fix "Failed to mount /dev: No such device" errors · 1267643d
      Kim Phillips committed
      Yocto (Built by Poky 7.0) 1.2 root filesystems fail to boot,
      at least over nfs, with:
      
      Failed to mount /dev: No such device
      
      Configuring DEVTMPFS fixes it.
      Signed-off-by: Kim Phillips <kim.phillips@freescale.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
      1267643d
    • powerpc/fsl: update defconfigs · 823f7473
      Kim Phillips committed
      Run make savedefconfig on the fsl defconfigs.
      Signed-off-by: Kim Phillips <kim.phillips@freescale.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
      823f7473
    • ARM: ux500: don't select LEDS_GPIO for snowball · db43b184
      Arnd Bergmann committed
      Using 'select' in Kconfig is hard: a platform cannot just
      enable a driver without also making sure that its subsystem
      is there. Also, there is no actual code dependency between
      the platform and the gpio leds driver.
      
      Without this patch, building without LEDS_CLASS results in:
      
      drivers/built-in.o: In function `create_gpio_led.part.2':
      governor_userspace.c:(.devinit.text+0x5a58): undefined reference to `led_classdev_register'
      drivers/built-in.o: In function `gpio_led_remove':
      governor_userspace.c:(.devexit.text+0x6b8): undefined reference to `led_classdev_unregister'
      
      This reverts 8733f53c "ARM: ux500: Kconfig: Compile in leds-gpio
      support for Snowball" that introduced the regression and did not
      provide a helpful explanation.
      
      In order to leave the GPIO LED code still present in normal
      builds, this also enables the symbol in u8500_defconfig, in addition
      to the other LED drivers that are already selected there.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Lee Jones <lee.jones@linaro.org>
      db43b184
    • ARM: imx: build i.MX6 functions only when needed · 1fc593fe
      Arnd Bergmann committed
      The head-v7.S contains a call to the generic cpu_suspend function,
      which is only available when selected by the i.MX6 code. As
      pointed out by Shawn Guo, i.MX5 does not actually use any
      functions defined in head-v7.S. The file is needed only for
      the i.MX6 power management code and for the SMP code, so
      we can restrict building this file to situations in which
      at least one of those two is present.
      
      Finally, other platforms with a similar file call it headsmp.S,
      so we can rename it to the same for consistency.
      
      Without this patch, building imx5 standalone results in:
      
      arch/arm/mach-imx/built-in.o: In function `v7_cpu_resume':
      arch/arm/mach-imx/head-v7.S:104: undefined reference to `cpu_resume'
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Shawn Guo <shawn.guo@linaro.org>
      Cc: Eric Miao <eric.miao@linaro.org>
      Cc: stable@vger.kernel.org
      1fc593fe
    • ftrace/x86: Add support for -mfentry to x86_64 · d57c5d51
      Steven Rostedt committed
      If the kernel is compiled with gcc 4.6.0 which supports -mfentry,
      then use that instead of mcount.
      
      With mcount, frame pointers are forced with the -pg option and we
      get something like:
      
      <can_vma_merge_before>:
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             53                      push   %rbx
             41 51                   push   %r9
             e8 fe 6a 39 00          callq  ffffffff81483d00 <mcount>
             31 c0                   xor    %eax,%eax
             48 89 fb                mov    %rdi,%rbx
             48 89 d7                mov    %rdx,%rdi
             48 33 73 30             xor    0x30(%rbx),%rsi
             48 f7 c6 ff ff ff f7    test   $0xfffffffff7ffffff,%rsi
      
      With -mfentry, frame pointers are no longer forced and the call looks
      like this:
      
      <can_vma_merge_before>:
             e8 33 af 37 00          callq  ffffffff81461b40 <__fentry__>
             53                      push   %rbx
             48 89 fb                mov    %rdi,%rbx
             31 c0                   xor    %eax,%eax
             48 89 d7                mov    %rdx,%rdi
             41 51                   push   %r9
             48 33 73 30             xor    0x30(%rbx),%rsi
             48 f7 c6 ff ff ff f7    test   $0xfffffffff7ffffff,%rsi
      
      This adds the ftrace hook at the beginning of the function before a
      frame is set up, and allows the function callbacks to be able to access
      parameters. As kprobes now can use function tracing (at least on x86)
      this speeds up the kprobe hooks that are at the beginning of the
      function.
      
      Link: http://lkml.kernel.org/r/20120807194100.130477900@goodmis.org
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      d57c5d51
    • ARM: imx: select CPU_FREQ_TABLE when needed · f637c4c9
      Arnd Bergmann committed
      The i.MX cpufreq implementation uses the CPU_FREQ_TABLE helpers,
      so it needs to select that code to be built. This problem has
      apparently existed since the i.MX cpufreq code was first merged
      in v2.6.37.
      
      Building IMX without CPU_FREQ_TABLE results in:
      
      arch/arm/plat-mxc/built-in.o: In function `mxc_cpufreq_exit':
      arch/arm/plat-mxc/cpufreq.c:173: undefined reference to `cpufreq_frequency_table_put_attr'
      arch/arm/plat-mxc/built-in.o: In function `mxc_set_target':
      arch/arm/plat-mxc/cpufreq.c:84: undefined reference to `cpufreq_frequency_table_target'
      arch/arm/plat-mxc/built-in.o: In function `mxc_verify_speed':
      arch/arm/plat-mxc/cpufreq.c:65: undefined reference to `cpufreq_frequency_table_verify'
      arch/arm/plat-mxc/built-in.o: In function `mxc_cpufreq_init':
      arch/arm/plat-mxc/cpufreq.c:154: undefined reference to `cpufreq_frequency_table_cpuinfo'
      arch/arm/plat-mxc/cpufreq.c:162: undefined reference to `cpufreq_frequency_table_get_attr'
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Shawn Guo <shawn.guo@linaro.org>
      Cc: Sascha Hauer <s.hauer@pengutronix.de>
      Cc: Yong Shen <yong.shen@linaro.org>
      Cc: stable@vger.kernel.org
      f637c4c9
    • ARM: imx: fix ksz9021rn_phy_fixup · 9f9ba0fd
      Arnd Bergmann committed
      The ksz9021rn_phy_fixup and mx6q_sabrelite functions try to
      set up an ethernet phy if they can. They do check whether
      phylib is enabled, but unfortunately the functions can only
      be called from platform code if phylib is builtin, not
      if it is a module.
      
      Without this patch, building with a modular phylib results in:
      
      arch/arm/mach-imx/mach-imx6q.c: In function 'imx6q_sabrelite_init':
      arch/arm/mach-imx/mach-imx6q.c:120:5: error: 'ksz9021rn_phy_fixup' undeclared (first use in this function)
      arch/arm/mach-imx/mach-imx6q.c:120:5: note: each undeclared identifier is reported only once for each function it appears in
      
      The bug was originally reported by Artem Bityutskiy but only
      partially fixed in ef441806 "ARM: imx6q: register phy fixup only when
      CONFIG_PHYLIB is enabled".
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Shawn Guo <shawn.guo@linaro.org>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Sascha Hauer <s.hauer@pengutronix.de>
      9f9ba0fd
    • ARM: imx: build pm-imx5 code only when PM is enabled · a28eecef
      Arnd Bergmann committed
      This moves the imx5 pm code out of the list of unconditionally
      compiled files for imx5, mirroring what we already do for imx6
      and how it was done before the code was moved from mach-mx5 to
      mach-imx in v3.3.
      
      Without this patch, building with CONFIG_PM disabled results in:
      
      arch/arm/mach-imx/pm-imx5.c:202:116: error: redefinition of 'imx51_pm_init'
      arch/arm/mach-imx/include/mach-imx/common.h:154:91: note: previous definition of 'imx51_pm_init' was here
      arch/arm/mach-imx/pm-imx5.c:209:116: error: redefinition of 'imx53_pm_init'
      arch/arm/mach-imx/include/mach-imx/common.h:155:91: note: previous definition of 'imx53_pm_init' was here
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Shawn Guo <shawn.guo@linaro.org>
      Cc: Sascha Hauer <s.hauer@pengutronix.de>
      Cc: stable@vger.kernel.org
      a28eecef
    • ARM: omap: allow building omap44xx without SMP · c7a9b09b
      Arnd Bergmann committed
      The new omap4 cpuidle implementation currently requires
      ARCH_NEEDS_CPU_IDLE_COUPLED, which only works on SMP.
      
      This patch makes it possible to build a non-SMP kernel
      for that platform. This is not normally desired for
      end-users but can be useful for testing.
      
      Without this patch, building rand-0y2jSKT results in:
      
      drivers/cpuidle/coupled.c: In function 'cpuidle_coupled_poke':
      drivers/cpuidle/coupled.c:317:3: error: implicit declaration of function '__smp_call_function_single' [-Werror=implicit-function-declaration]
      
      It's not clear if this patch is the best solution for
      the problem at hand. I have made sure that we can now
      build the kernel in all configurations, but that does
      not mean it will actually work on an OMAP44xx.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Tony Lindgren <tony@atomide.com>
      c7a9b09b
    • xen/setup: Fix one-off error when adding for-balloon PFNs to the P2M. · c96aae1f
      Konrad Rzeszutek Wilk committed
      When we are finished returning PFNs to the hypervisor, populating
      them back, and marking the E820 MMIO and E820 gaps as
      IDENTITY_FRAMEs, we call P2M to set the areas that can
      be used for ballooning. We were off by one, and ended up
      overwriting a P2M entry that most likely was an IDENTITY_FRAME.
      For example:
      
      1-1 mapping on 40000->40200
      1-1 mapping on bc558->bc5ac
      1-1 mapping on bc5b4->bc8c5
      1-1 mapping on bc8c6->bcb7c
      1-1 mapping on bcd00->100000
      Released 614 pages of unused memory
      Set 277889 page(s) to 1-1 mapping
      Populating 40200-40466 pfn range: 614 pages added
      
      => here we set from 40466 up to bc559 P2M tree to be
      INVALID_P2M_ENTRY. We should have done it up to bc558.
      
      The end result is that if anybody is trying to construct
      a PTE for PFN bc558 they end up with ~PAGE_PRESENT.
      
      CC: stable@vger.kernel.org
      Reported-by-and-Tested-by: Andre Przywara <andre.przywara@amd.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      c96aae1f
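The off-by-one above comes down to treating an exclusive end bound as inclusive. A minimal sketch of the corrected loop shape, assuming a flat array in place of the real P2M tree; `enum p2m_state` and `mark_balloon_range()` are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stddef.h>

enum p2m_state { P2M_POPULATED, P2M_IDENTITY, P2M_INVALID };

/* Mark the half-open range [start, end) as available for ballooning. */
static void mark_balloon_range(enum p2m_state *p2m, size_t start, size_t end)
{
	assert(start <= end);
	/* "<" rather than "<=": the entry at "end" belongs to the next region. */
	for (size_t pfn = start; pfn < end; pfn++)
		p2m[pfn] = P2M_INVALID;
}
```

With an inclusive bound, the entry at `end` (the first identity-mapped PFN in the example log, bc558) would be overwritten as well.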
    • MIPS: pci-ar724x: avoid data bus error due to a missing PCIe module · a1dca315
      Gabor Juhos committed
      If the controller has no PCIe module attached, accessing the device
      configuration space causes a data bus error. Avoid this by checking the
      status of the PCIe link in advance, and indicate an error if the link
      is down.
      Signed-off-by: Gabor Juhos <juhosg@openwrt.org>
      Cc: stable@vger.kernel.org
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/4293/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      a1dca315
    • ARM: dts: imx51-babbage: fix esdhc cd/wp properties · a46d2619
      Shawn Guo committed
      The binding doc and dts use properties "fsl,{cd,wp}-internal" while
      esdhc driver uses "fsl,{cd,wp}-controller".  Fix binding doc and dts
      to make them match the driver code.
      Reported-by: Chris Ball <cjb@laptop.org>
      Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
      Cc: <stable@vger.kernel.org>
      Acked-by: Chris Ball <cjb@laptop.org>
      a46d2619
    • ARM: imx6: spin the cpu until hardware takes it down · c944b0b9
      Shawn Guo committed
      Though commit 602bf409 (ARM: imx6: exit coherency when shutting down
      a cpu) improves the stability of imx6q cpu hotplug a lot, there are
      still hangs seen with more stressful hotplug testing.
      
      It's expected that once imx_enable_cpu(cpu, false) is called, the cpu
      will be taken down by hardware immediately, and the code after that
      will not get any chance to execute.  However, this is not always the
      case from the testing.  The cpu could possibly be alive for a few
      cycles before hardware actually takes it down.  So rather than letting
      cpu execute some code that could cause a hang in these cycles, let's
      make the cpu spin there and wait for hardware to take it down.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
      c944b0b9
    • x86, microcode, AMD: Fix broken ucode patch size check · 36bf50d7
      Andreas Herrmann committed
      This issue was recently observed on an AMD C-50 CPU where a patch of
      maximum size was applied.
      
      Commit be62adb4 ("x86, microcode, AMD: Simplify ucode verification")
      added current_size in get_matching_microcode(). This is calculated as
      size of the ucode patch + 8 (i.e. the size of the header). Later this is
      compared against the maximum possible ucode patch size for a CPU family.
      And of course this fails if the patch already has the maximum size.
      
      Cc: <stable@vger.kernel.org> [3.3+]
      Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
      Link: http://lkml.kernel.org/r/1344361461-10076-1-git-send-email-bp@amd64.org
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      36bf50d7
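One way to read the size-check bug described above, as a sketch rather than the actual driver code: the header length was counted against a limit that applies to the patch payload only. The two helper functions and `MODEL_HDR_SIZE` are hypothetical; the value 8 comes from the commit message, and the limit passed in is up to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_HDR_SIZE 8u   /* header size, as stated in the commit message */

/* Buggy shape: the header is counted against the per-family patch limit. */
static bool size_ok_buggy(uint32_t patch_size, uint32_t max_patch_size)
{
	return patch_size + MODEL_HDR_SIZE <= max_patch_size;
}

/* Fixed shape: only the patch payload is compared with the limit. */
static bool size_ok_fixed(uint32_t patch_size, uint32_t max_patch_size)
{
	return patch_size <= max_patch_size;
}
```

A patch whose payload is exactly the maximum size is rejected by the first check and accepted by the second.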
    • KVM: x86 emulator: use stack size attribute to mask rsp in stack ops · 5ad105e5
      Avi Kivity committed
      The sub-register used to access the stack (sp, esp, or rsp) is not
      determined by the address size attribute like other memory references,
      but by the stack segment's B bit (if not in x86_64 mode).
      
      Fix by using the existing stack_mask() to figure out the correct mask.
      
      This long-existing bug was exposed by a combination of a27685c3
      (emulate invalid guest state by default), which causes many more
      instructions to be emulated, and a seabios change (possibly a bug) which
      causes the high 16 bits of esp to become polluted across calls to real
      mode software interrupts.
      Signed-off-by: Avi Kivity <avi@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      5ad105e5
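The stack-size rule in this commit can be sketched in plain C. The functions below are a hypothetical model, not the emulator's stack_mask(): `long_mode` and `ss_b_bit` are assumed inputs standing in for the CPU mode and the stack segment's B bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask for the stack pointer: rsp in 64-bit mode, else esp or sp per SS.B. */
static uint64_t model_stack_mask(bool long_mode, bool ss_b_bit)
{
	if (long_mode)
		return UINT64_MAX;
	return ss_b_bit ? 0xffffffffu : 0xffffu;
}

/* A push changes only the masked part of the stack pointer register. */
static uint64_t model_push_adjust(uint64_t rsp, uint64_t bytes, uint64_t mask)
{
	return (rsp & ~mask) | ((rsp - bytes) & mask);
}
```

With a 16-bit stack, stale high bits in esp (the situation attributed to the seabios change above) are preserved but do not affect which 16-bit offset is used.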
  5. 22 Aug, 2012 5 commits
    • KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intended · 35f2d16b
      Takuya Yoshikawa committed
      Although the possible race described in
      
        commit 85b70591
        KVM: MMU: fix shrinking page from the empty mmu
      
      was correct, the real cause of that issue was a more trivial bug of
      mmu_shrink() introduced by
      
        commit 19526396
        KVM: MMU: do not iterate over all VMs in mmu_shrink()
      
      Here is the bug:
      
      	if (kvm->arch.n_used_mmu_pages > 0) {
      		if (!nr_to_scan--)
      			break;
      		continue;
      	}
      
      We skip VMs whose n_used_mmu_pages is not zero and try to shrink others:
      in other words we try to shrink empty ones by mistake.
      
      This patch reverses the logic so that mmu_shrink() can free pages from
      the first VM whose n_used_mmu_pages is not zero.  Note that we also add
      comments explaining the role of nr_to_scan which is not practically
      important now, hoping this will be improved in the future.
      Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      35f2d16b
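The reversed selection logic can be sketched with a simplified model. `struct model_vm` and `pick_vm_to_shrink()` are hypothetical; the real function walks the VM list under a lock and frees pages, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

struct model_vm {
	unsigned int n_used_mmu_pages;
};

/*
 * Return the index of the first VM that has pages to free, or -1 if none
 * is found within nr_to_scan entries. Empty VMs are skipped.
 */
static int pick_vm_to_shrink(const struct model_vm *vms, size_t n, int nr_to_scan)
{
	assert(vms != NULL);
	for (size_t i = 0; i < n; i++) {
		if (!nr_to_scan--)
			break;
		if (vms[i].n_used_mmu_pages == 0)
			continue;   /* nothing to free; the buggy test was "> 0" */
		return (int)i;
	}
	return -1;
}
```

The buggy version skipped exactly the VMs this sketch selects, so it only ever tried to shrink empty ones.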
    • x86/alternatives: Fix p6 nops on non-modular kernels · cb09cad4
      Avi Kivity committed
      Probably a leftover from the early days of self-patching, p6nops
      are marked __initconst_or_module, which causes them to be
      discarded in a non-modular kernel.  If something later triggers
      patching, it will overwrite kernel code with garbage.
      Reported-by: Tomas Racek <tracek@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      Cc: Michael Tokarev <mjt@tls.msk.ru>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: qemu-devel@nongnu.org
      Cc: Anthony Liguori <anthony@codemonkey.ws>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Alan Cox <alan@linux.intel.com>
      Link: http://lkml.kernel.org/r/5034AE84.90708@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cb09cad4
    • x86/fixup_irq: Use cpu_online_mask instead of cpu_all_mask · 2530cd4f
      Liu, Chuansheng committed
      When one CPU is going down and this CPU is the last one in irq
      affinity, current code is setting cpu_all_mask as the new
      affinity for that irq.
      
      But on some systems (such as Medfield Android mobile devices) the
      firmware distributes the interrupt evenly across the CPUs in the irq
      affinity mask, and cpu_all_mask includes all potential CPUs,
      i.e. offline ones as well.
      
      So replace cpu_all_mask with cpu_online_mask.
      Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
      Acked-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A137286@SHSMSX101.ccr.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2530cd4f
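The fallback described in this commit can be modeled with plain bitmasks. This is a sketch under assumptions: `cpumask_model_t` and `fixup_affinity()` are hypothetical, with one bit per CPU standing in for the kernel's cpumask type.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t cpumask_model_t;   /* one bit per CPU; hypothetical stand-in */

/*
 * If no CPU in the irq's affinity mask remains online, fall back to the
 * online CPUs rather than to every possible CPU.
 */
static cpumask_model_t fixup_affinity(cpumask_model_t affinity,
				      cpumask_model_t online)
{
	if ((affinity & online) == 0)
		return online;
	return affinity;
}
```

Because the fallback mask contains only online CPUs, firmware that delivers to every CPU in the mask never targets an offline one.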
    • x86/spinlocks: Fix comment in spinlock.h · 83be4ffa
      Richard Weinberger committed
      This comment is no longer true.  We support up to 2^16 CPUs
      because __ticket_t is a u16 if NR_CPUS is larger than 256.
      Signed-off-by: Richard Weinberger <richard@nod.at>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      83be4ffa
    • mm: hugetlbfs: correctly populate shared pmd · eb48c071
      Michal Hocko committed
      Each page mapped in a process's address space must be correctly
      accounted for in _mapcount.  Normally the rules for this are
      straightforward but hugetlbfs page table sharing is different.  The page
      table pages at the PMD level are reference counted while the mapcount
      remains the same.
      
      If this accounting is wrong, it causes bugs like this one reported by
      Larry Woodman:
      
        kernel BUG at mm/filemap.c:135!
        invalid opcode: 0000 [#1] SMP
        CPU 22
        Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
        Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
        RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
        Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
        Call Trace:
          delete_from_page_cache+0x40/0x80
          truncate_hugepages+0x115/0x1f0
          hugetlbfs_evict_inode+0x18/0x30
          evict+0x9f/0x1b0
          iput_final+0xe3/0x1e0
          iput+0x3e/0x50
          d_kill+0xf8/0x110
          dput+0xe2/0x1b0
          __fput+0x162/0x240
      
      During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
      shared page tables with the check dst_pte == src_pte.  The logic is if
      the PMD page is the same, they must be shared.  This assumes that the
      sharing is between the parent and child.  However, if the sharing is
      with a different process entirely then this check fails as in this
      diagram:
      
        parent
          |
          ------------>pmd
                       src_pte----------> data page
                                              ^
        other--------->pmd--------------------|
                        ^
        child-----------|
                       dst_pte
      
      For this situation to occur, it must be possible for Parent and Other to
      have faulted and failed to share page tables with each other.  This is
      possible due to the following style of race.
      
        PROC A                                          PROC B
        copy_hugetlb_page_range                         copy_hugetlb_page_range
          src_pte == huge_pte_offset                      src_pte == huge_pte_offset
          !src_pte so no sharing                          !src_pte so no sharing
      
        (time passes)
      
        hugetlb_fault                                   hugetlb_fault
          huge_pte_alloc                                  huge_pte_alloc
            huge_pmd_share                                 huge_pmd_share
              LOCK(i_mmap_mutex)
              find nothing, no sharing
              UNLOCK(i_mmap_mutex)
                                                            LOCK(i_mmap_mutex)
                                                            find nothing, no sharing
                                                            UNLOCK(i_mmap_mutex)
            pmd_alloc                                       pmd_alloc
            LOCK(instantiation_mutex)
            fault
            UNLOCK(instantiation_mutex)
                                                        LOCK(instantiation_mutex)
                                                        fault
                                                        UNLOCK(instantiation_mutex)
      
      These two processes are now pointing to the same data page but are not
      sharing page tables because the opportunity was missed.  When either
      process later forks, the src_pte == dst_pte check is potentially insufficient.
      As the check falls through, the wrong PTE information is copied in
      (harmless but wrong) and the mapcount is bumped for a page mapped by a
      shared page table leading to the BUG_ON.
      
      This patch addresses the issue by moving pmd_alloc into huge_pmd_share
      which guarantees that the shared pud is populated in the same critical
      section as pmd.  This also means that huge_pte_offset test in
      huge_pmd_share is serialized correctly now which in turn means that the
      success of the sharing will be higher as the racing tasks see the pud
      and pmd populated together.
      
      Race identified and changelog written mostly by Mel Gorman.
      
      [akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
      Reported-by: Larry Woodman <lwoodman@redhat.com>
      Tested-by: Larry Woodman <lwoodman@redhat.com>
      Reviewed-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb48c071
  6. 21 Aug, 2012 1 commit
  7. 19 Aug, 2012 1 commit