1. 12 Mar, 2013 1 commit
  2. 11 Mar, 2013 1 commit
  3. 08 Mar, 2013 5 commits
  4. 06 Mar, 2013 1 commit
  5. 05 Mar, 2013 5 commits
  6. 03 Mar, 2013 1 commit
    • Y
      x86, ACPI, mm: Revert movablemem_map support · 20e6926d
      Yinghai Lu committed
      Tim found:
      
        WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
        Hardware name: S2600CP
        sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
        smpboot: Booting Node   1, Processors  #1
        Modules linked in:
        Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
        Call Trace:
          set_cpu_sibling_map+0x279/0x449
          start_secondary+0x11d/0x1e5
      
      Don Morris reproduced on a HP z620 workstation, and bisected it to
      commit e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock
      is ready")
      
      It turns out movablemem_map has some problems, and it breaks several things:
      
      1. numa_init is called several times, NOT just for srat, so those
      	nodes_clear(numa_nodes_parsed)
      	memset(&numa_meminfo, 0, sizeof(numa_meminfo))
         can not simply be removed.  The sequence to consider is: numaq, srat, amd, dummy,
         and the fallback path must keep working.
      
      2. It simply splits early_parse_srat out of acpi_numa_init.
         a. early_parse_srat is NOT called for ia64, so ia64 is broken.
         b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
      	     set_apicid_to_node(i, NUMA_NO_NODE)
           is still left in numa_init, so it just clears the result from early_parse_srat.
           It should be moved before that....
         c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
             early, before the override from INITRD is settled.
      
      3. the TITLE of that patch is totally misleading: there is NO x86 in the title,
         but it changes critical x86 code. As a result the x86 people did not
         pay attention and could not find the problem early. Those patches really should
         be routed via tip/x86/mm.
      
      4. after that commit, the following ranges can not use movable ram:
        a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
        b. initrd... it will be freed after booting, so it could be on movable...
        c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
      	anymore.
        d. init_mem_mapping: can not put page table high anymore.
        e. initmem_init: vmemmap can not be high local node anymore. That is
           not good.
      
      If a node is hotpluggable, memory-related ranges like the page table and
      vmemmap could be on that node without problem and should be on that
      node.
      
      We have a workaround patch that could fix some problems, but some can not
      be fixed.
      
      So just remove that offending commit and related ones including:
      
       f7210e6c ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
          protect movablecore_map in memblock_overlaps_region().")
      
       01a178a9 ("acpi, memory-hotplug: support getting hotplug info from
          SRAT")
      
       27168d38 ("acpi, memory-hotplug: extend movablemem_map ranges to
          the end of node")
      
       e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock is
          ready")
      
       fb06bc8e ("page_alloc: bootmem limit with movablecore_map")
      
       42f47e27 ("page_alloc: make movablemem_map have higher priority")
      
       6981ec31 ("page_alloc: introduce zone_movable_limit[] to keep
          movable limit for nodes")
      
       34b71f1e ("page_alloc: add movable_memmap kernel parameter")
      
       4d59a751 ("x86: get pg_data_t's memory from other node")
      
      Later we should have patches that make sure the kernel puts the page table
      and vmemmap on local node ram instead of pushing them down to node0.  We also
      need to find a way to put other kernel-used ram on local node ram.
      Reported-by: Tim Gardner <tim.gardner@canonical.com>
      Reported-by: Don Morris <don.morris@hp.com>
      Bisected-by: Don Morris <don.morris@hp.com>
      Tested-by: Don Morris <don.morris@hp.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 01 Mar, 2013 1 commit
    • K
      xen/pci: We don't do multiple MSI's. · 884ac297
      Konrad Rzeszutek Wilk committed
      There is no hypercall to setup multiple MSI per PCI device.
      As such with these two new commits:
      -  08261d87
         PCI/MSI: Enable multiple MSIs with pci_enable_msi_block_auto()
      - 5ca72c4f
         AHCI: Support multiple MSIs
      
      we would call the PHYSDEVOP_map_pirq 'nvec' times with the same
      contents of the PCI device. Sander discovered that we would get
      the same PIRQ value 'nvec' times and return said values to the
      caller. That of course meant that the device was configured only
      with one MSI and AHCI would fail with:
      
      ahci 0000:00:11.0: version 3.0
      xen: registering gsi 19 triggering 0 polarity 1
      xen: --> pirq=19 -> irq=19 (gsi=19)
      (XEN) [2013-02-27 19:43:07] IOAPIC[0]: Set PCI routing entry (6-19 -> 0x99 -> IRQ 19 Mode:1 Active:1)
      ahci 0000:00:11.0: AHCI 0001.0200 32 slots 4 ports 6 Gbps 0xf impl SATA mode
      ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part
      ahci: probe of 0000:00:11.0 failed with error -22
      
      That is because in ahci_host_activate the second call to
      devm_request_threaded_irq would return -EINVAL as we passed in
      (on the second run) an IRQ that was never initialized.
      
      CC: stable@vger.kernel.org
      Reported-and-Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  8. 28 Feb, 2013 8 commits
    • K
      xen/pat: Disable PAT using pat_enabled value. · c79c4982
      Konrad Rzeszutek Wilk committed
      The git commit 8eaffa67
      (xen/pat: Disable PAT support for now) explains in detail why
      we want to disable PAT for right now. However that
      change was not enough and we should have also disabled
      the pat_enabled value. Otherwise we end up with:
      
      mmap-example:3481 map pfn expected mapping type write-back for
      [mem 0x00010000-0x00010fff], got uncached-minus
       ------------[ cut here ]------------
      WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774 untrack_pfn+0xb8/0xd0()
      mem 0x00010000-0x00010fff], got uncached-minus
      ------------[ cut here ]------------
      WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774
      untrack_pfn+0xb8/0xd0()
      ...
      Pid: 3481, comm: mmap-example Tainted: GF 3.8.0-6-generic #13-Ubuntu
      Call Trace:
       [<ffffffff8105879f>] warn_slowpath_common+0x7f/0xc0
       [<ffffffff810587fa>] warn_slowpath_null+0x1a/0x20
       [<ffffffff8104bcc8>] untrack_pfn+0xb8/0xd0
       [<ffffffff81156c1c>] unmap_single_vma+0xac/0x100
       [<ffffffff81157459>] unmap_vmas+0x49/0x90
       [<ffffffff8115f808>] exit_mmap+0x98/0x170
       [<ffffffff810559a4>] mmput+0x64/0x100
       [<ffffffff810560f5>] dup_mm+0x445/0x660
       [<ffffffff81056d9f>] copy_process.part.22+0xa5f/0x1510
       [<ffffffff81057931>] do_fork+0x91/0x350
       [<ffffffff81057c76>] sys_clone+0x16/0x20
       [<ffffffff816ccbf9>] stub_clone+0x69/0x90
       [<ffffffff816cc89d>] ? system_call_fastpath+0x1a/0x1f
      ---[ end trace 4918cdd0a4c9fea4 ]---
      
      (a similar message shows up if you end up launching 'mcelog')
      
      The call chain is (as analyzed by Liu, Jinsong):
      do_fork
        --> copy_process
          --> dup_mm
            --> dup_mmap
             	--> copy_page_range
                --> track_pfn_copy
                  --> reserve_pfn_range
                    --> line 624: flags != want_flags
      It comes from different memory types of page table (_PAGE_CACHE_WB) and MTRR
      (_PAGE_CACHE_UC_MINUS).
      
      Stefan Bader dug in this deep and found out that:
      "That makes it clearer as this will do
      
      reserve_memtype(...)
      --> pat_x_mtrr_type
        --> mtrr_type_lookup
          --> __mtrr_type_lookup
      
      And that can return -1/0xff in case of MTRR not being enabled/initialized. Which
      is not the case (given there are no messages for it in dmesg). This is not equal
      to MTRR_TYPE_WRBACK and thus becomes _PAGE_CACHE_UC_MINUS.
      
      It looks like the problem starts early in reserve_memtype:
      
              if (!pat_enabled) {
                      /* This is identical to page table setting without PAT */
                      if (new_type) {
                              if (req_type == _PAGE_CACHE_WC)
                                      *new_type = _PAGE_CACHE_UC_MINUS;
                              else
                                      *new_type = req_type & _PAGE_CACHE_MASK;
                      }
                      return 0;
              }
      
      This would be what we want, that is clearing the PWT and PCD flags from the
      supported flags - if pat_enabled is disabled."
      
      This patch does that - disabling PAT.
      
      CC: stable@vger.kernel.org # 3.3 and further
      Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
      Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reported-and-Tested-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • J
      KVM: VMX: Pass vcpu to __vmx_complete_interrupts · 3ab66e8a
      Jan Kiszka committed
      Cleanup: __vmx_complete_interrupts has no use for the vmx structure.
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
    • J
      KVM: nVMX: Avoid one redundant vmcs_read in prepare_vmcs12 · 44ceb9d6
      Jan Kiszka committed
      IDT_VECTORING_INFO_FIELD was already read right after vmexit.
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
    • P
      x86/kvm: Fix pvclock vsyscall fixmap · 3d2a80a2
      Peter Hurley committed
      The physical memory fixmapped for the pvclock clock_gettime vsyscall
      was allocated, and thus is not a kernel symbol. __pa() is the proper
      method to use in this case.
      
      Fixes the crash below when booting a next-20130204+ smp guest on a
      3.8-rc5+ KVM host.
      
      [    0.666410] udevd[97]: starting version 175
      [    0.674043] udevd[97]: udevd:[97]: segfault at ffffffffff5fd020
           ip 00007fff069e277f sp 00007fff068c9ef8 error d
      Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin committed
      I'm not sure why, but the hlist for each entry iterators were conceived
      differently from the list ones. The list ones take just three parameters:
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
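      
      As a rough illustration of the new calling convention, the sketch below
      re-creates a tiny user-space version of the hlist types and a three-argument
      hlist_for_each_entry(). The struct item type and the sum_items() helper are
      hypothetical names made up for this example, and the macros are simplified
      stand-ins rather than the exact definitions in linux/list.h.

```c
#include <stddef.h>

/* Minimal stand-ins for the kernel's hlist types (illustration only). */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Map a node pointer to its containing entry, passing NULL through. */
#define hlist_entry_safe(ptr, type, member) \
	((ptr) ? container_of(ptr, type, member) : NULL)

/* New-style iterator: only the entry cursor, no separate node cursor. */
#define hlist_for_each_entry(pos, head, member)                              \
	for (pos = hlist_entry_safe((head)->first, __typeof__(*(pos)), member); \
	     pos;                                                                \
	     pos = hlist_entry_safe((pos)->member.next, __typeof__(*(pos)), member))

/* Insert a node at the front of the list. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Hypothetical entry type embedding an hlist_node. */
struct item {
	int value;
	struct hlist_node link;
};

/* Hypothetical helper: walk the list with the three-argument iterator. */
static int sum_items(struct hlist_head *head)
{
	struct item *it;
	int sum = 0;

	hlist_for_each_entry(it, head, link)
		sum += it->value;
	return sum;
}
```

      With the old four-argument form, sum_items() would also have needed a
      struct hlist_node * cursor that it never used directly.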
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      arch Kconfig: centralise CONFIG_ARCH_NO_VIRT_TO_BUS · 887cbce0
      Stephen Rothwell committed
      Change it to CONFIG_HAVE_VIRT_TO_BUS and set it in all architectures
      that already provide virt_to_bus().
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: James Hogan <james.hogan@imgtec.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: H Hartley Sweeten <hartleys@visionengravers.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      more file_inode() open-coded instances · 6131ffaa
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • H
      x86: Make sure we can boot in the case the BDA contains pure garbage · 7c100936
      H. Peter Anvin committed
      On non-BIOS platforms it is possible that the BIOS data area contains
      garbage instead of being zeroed or something equivalent (firmware
      people: we are talking of 1.5K here, so please do the sane thing.)
      
      We need on the order of 20-30K of low memory in order to boot, which
      may grow up to < 64K in the future.  We probably want to avoid the
      lowest of the low memory.  At the same time, it seems extremely
      unlikely that a legitimate EBDA would ever reach down to the 128K
      (which would require it to be over half a megabyte in size.)  Thus,
      pick 128K as the cutoff for "this is insane, ignore."  We may still
      end up reserving a bunch of extra memory on the low megabyte, but that
      is not really a major issue these days.  In the worst case we lose
      512K of RAM.
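      
      A minimal sketch of the cutoff logic described above, not the actual patch:
      it takes the low-memory size read from the BDA and the EBDA address (both in
      bytes), treats anything below 128K as garbage, and returns the address from
      which memory up to 1M would be reserved. The function name, the parameter
      names and the 0x9f000 cap are assumptions made for this illustration.

```c
/* Assumed upper bound for usable low memory (636K); illustration only. */
#define LOWMEM_CAP    0x9f000UL
/* 128K: any BDA/EBDA value below this is treated as "insane" and ignored. */
#define INSANE_CUTOFF 0x20000UL

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical helper: given the low-memory size reported by the BDA and
 * the EBDA start address, return the lowest address to reserve from.
 */
static unsigned long sane_lowmem_limit(unsigned long bda_lowmem,
				       unsigned long ebda_addr)
{
	/* Garbage (or zero) values fall back to the cap instead of being trusted. */
	if (ebda_addr < INSANE_CUTOFF)
		ebda_addr = LOWMEM_CAP;
	if (bda_lowmem < INSANE_CUTOFF)
		bda_lowmem = LOWMEM_CAP;

	/* Use the lower of the two plausible values, but never above the cap. */
	return min_ul(min_ul(bda_lowmem, ebda_addr), LOWMEM_CAP);
}
```

      The lowest value this sketch can return is 128K (0x20000), so at most the
      range from 128K to 640K, that is 512K, ends up reserved unnecessarily.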
      
      This code really should be merged with trim_bios_range() in
      arch/x86/kernel/setup.c, but that is a bigger patch for a later merge
      window.
      Reported-by: Darren Hart <dvhart@linux.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Matt Fleming <matt.fleming@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/n/tip-oebml055yyfm8yxmria09rja@git.kernel.org
  9. 27 Feb, 2013 6 commits
  10. 26 Feb, 2013 2 commits
  11. 25 Feb, 2013 1 commit
  12. 24 Feb, 2013 8 commits
    • A
      switch lseek to COMPAT_SYSCALL_DEFINE · 561c6731
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • A
      x86/mm/pageattr: Prevent PSE and GLOBAL leftovers to confuse pmd/pte_present and pmd_huge · a8aed3e0
      Andrea Arcangeli committed
      Without this patch any kernel code that reads kernel memory in
      non present kernel pte/pmds (as set by pageattr.c) will crash.
      
      With this kernel code:
      
      static struct page *crash_page;
      static unsigned long *crash_address;
      [..]
      	crash_page = alloc_pages(GFP_KERNEL, 9);
      	crash_address = page_address(crash_page);
      	if (set_memory_np((unsigned long)crash_address, 1))
      		printk("set_memory_np failure\n");
      [..]
      
      The kernel will crash if inside the "crash tool" one would try
      to read the memory at the not present address.
      
      crash> p crash_address
      crash_address = $8 = (long unsigned int *) 0xffff88023c000000
      crash> rd 0xffff88023c000000
      [ *lockup* ]
      
      The lockup happens because _PAGE_GLOBAL and _PAGE_PROTNONE
      share the same bit, and pageattr leaves _PAGE_GLOBAL set on a
      kernel pte which is then mistaken for _PAGE_PROTNONE (so
      pte_present returns true by mistake and the kernel fault then
      gets confused and loops).
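      
      The bit aliasing can be illustrated with a small user-space sketch. The
      flag names below mirror the kernel's _PAGE_* constants without the leading
      underscore, and the bit positions (0 for PRESENT, 7 for PSE, 8 for GLOBAL)
      are the x86 values as I understand them; the two clear_* helpers are
      hypothetical and only model what pageattr does before and after the fix.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_PRESENT  (1ULL << 0)
#define PAGE_PSE      (1ULL << 7)
#define PAGE_GLOBAL   (1ULL << 8)
#define PAGE_PROTNONE PAGE_GLOBAL	/* same bit, as described above */

/* A pte counts as present if either PRESENT or PROTNONE is set. */
static bool pte_present(uint64_t pte)
{
	return (pte & (PAGE_PRESENT | PAGE_PROTNONE)) != 0;
}

/* Model of the old behaviour: only the PRESENT bit is cleared. */
static uint64_t clear_present_only(uint64_t pte)
{
	return pte & ~PAGE_PRESENT;
}

/* Model of the fixed behaviour: GLOBAL and PSE leftovers are cleared too. */
static uint64_t clear_present_and_leftovers(uint64_t pte)
{
	return pte & ~(PAGE_PRESENT | PAGE_GLOBAL | PAGE_PSE);
}
```

      For a kernel pte that started with PRESENT and GLOBAL set, pte_present()
      still reports true after clear_present_only(), which is the confusion the
      commit message describes.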
      
      With THP the same can happen after we taught pmd_present to
      check _PAGE_PROTNONE and _PAGE_PSE in commit
      027ef6c8 ("mm: thp: fix pmd_present for
      split_huge_page and PROT_NONE with THP").  THP has the same
      problem with _PAGE_GLOBAL as the 4k pages, but it also has a
      problem with _PAGE_PSE, which must be cleared too.
      
      After the patch is applied copy_user correctly returns -EFAULT
      and doesn't lock up anymore.
      
      crash> p crash_address
      crash_address = $9 = (long unsigned int *) 0xffff88023c000000
      crash> rd 0xffff88023c000000
      rd: read error: kernel virtual address: ffff88023c000000  type:
      "64-bit KVADDR"
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • A
      Revert "x86, mm: Make spurious_fault check explicitly check the PRESENT bit" · 954f8571
      Andrea Arcangeli committed
      I got a report for a minor regression introduced by commit
      027ef6c8 ("mm: thp: fix pmd_present for split_huge_page and
      PROT_NONE with THP").
      
      So the problem is, pageattr creates kernel pagetables (pte and
      pmds) that break pte_present/pmd_present, and the patch above
      exposed this invariant breakage for pmd_present.
      
      The same problem already existed for the pte and pte_present and
      it was fixed by commit 660a293e ("x86, mm: Make
      spurious_fault check explicitly check the PRESENT bit") (if it
      wasn't for that commit, it wouldn't even be a regression).  That
      fix keeps the page fault path from using pte_present.  I could follow
      through by no longer using pmd_present/pmd_huge either.
      
      However I think it's more robust to fix pageattr and to clear
      the PSE/GLOBAL bitflags too in addition to the present bitflag.
      So the kernel page fault can keep using the regular
      pte_present/pmd_present/pmd_huge.
      
      The confusion arises because _PAGE_GLOBAL and _PAGE_PROTNONE are
      sharing the same bit, and in the pmd case we pretend _PAGE_PSE
      to be set only in present pmds (to facilitate split_huge_page
      final tlb flush).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • W
      x86/mm/numa: Don't check if node is NUMA_NO_NODE · 942670d0
      Wen Congyang committed
      If we aren't debugging per_cpu maps, the cpu's node is stored in
      per_cpu variable numa_node.  If `node' is NUMA_NO_NODE, it means
      the caller wants to clear the cpu's node.  So we should also
      call set_cpu_numa_node() in this case.
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • T
      acpi, memory-hotplug: support getting hotplug info from SRAT · 01a178a9
      Tang Chen committed
      We now provide an option for users who don't want to specify physical
      memory address in kernel commandline.
      
               /*
                * For movablemem_map=acpi:
                *
                * SRAT:                |_____| |_____| |_________| |_________| ......
                * node id:                0       1         1           2
                * hotpluggable:           n       y         y           n
                * movablemem_map:              |_____| |_________|
                *
                * Using movablemem_map, we can prevent memblock from allocating memory
                * on ZONE_MOVABLE at boot time.
                */
      
      So the user just specifies movablemem_map=acpi, and the kernel will use
      the hotpluggable info in SRAT to determine which memory ranges should be set
      as ZONE_MOVABLE.
      
      If all the memory ranges in SRAT are hotpluggable, then no memory can be
      used by the kernel.  But before parsing SRAT, memblock has already reserved
      some memory ranges for other purposes, such as for the kernel image, and so
      on.  We cannot prevent the kernel from using this memory.  So we need to
      exclude these ranges even if this memory is hotpluggable.
      
      Furthermore, there could be several memory ranges in the single node
      which the kernel resides in.  We may skip one range that has memory
      reserved by memblock, but if the rest of the memory is too small, then the
      kernel will fail to boot.  So, make the whole node which the kernel
      resides in un-hotpluggable.  Then the kernel has enough memory to use.
      
      NOTE: Using this option will reduce NUMA performance because the
            whole node will be set as ZONE_MOVABLE, and the kernel cannot use memory
            on it.  If users don't want to lose NUMA performance, just don't use
            it.
      
      [akpm@linux-foundation.org: fix warning]
      [akpm@linux-foundation.org: use strcmp()]
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      acpi, memory-hotplug: extend movablemem_map ranges to the end of node · 27168d38
      Tang Chen committed
      When implementing the movablemem_map boot option, we introduced an array
      movablemem_map.map[] to store the memory ranges to be set as
      ZONE_MOVABLE.
      
      Since ZONE_MOVABLE is the last zone of a node, if the user didn't specify
      the whole node memory range, we need to extend it to the node end so
      that we can use it to prevent memblock from allocating memory in the
      ranges the user didn't specify.
      
      We now implement movablemem_map boot option like this:
      
              /*
               * For movablemem_map=nn[KMG]@ss[KMG]:
               *
               * SRAT:                |_____| |_____| |_________| |_________| ......
               * node id:                0       1         1           2
               * user specified:                |__|                 |___|
               * movablemem_map:                |___| |_________|    |______| ......
               *
               * Using movablemem_map, we can prevent memblock from allocating memory
               * on ZONE_MOVABLE at boot time.
               *
               * NOTE: In this case, SRAT info will be ignored.
               */
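      
      The extension shown in the diagram above can be sketched roughly as
      follows. This is not the code from the patch: struct mem_range and
      extend_to_node_end() are hypothetical names, and the sketch assumes the
      per-node ranges are sorted by address and that only the start of the
      user-specified range matters.

```c
/* Hypothetical description of one memory range: [start, end) on node nid. */
struct mem_range {
	unsigned long start;
	unsigned long end;
	int nid;
};

/*
 * Extend a user-specified start address to the end of its node.
 * `ranges` holds the node memory ranges sorted by address. The result is
 * written to `out`: the rest of the range containing `user_start`, plus
 * every later range that belongs to the same node. Returns the number of
 * ranges written (0 if `user_start` lies in no range).
 */
static int extend_to_node_end(unsigned long user_start,
			      const struct mem_range *ranges, int nr_ranges,
			      struct mem_range *out, int max_out)
{
	int n = 0;
	int nid = -1;

	for (int i = 0; i < nr_ranges && n < max_out; i++) {
		if (nid < 0) {
			/* Still looking for the range that contains the start. */
			if (user_start < ranges[i].start ||
			    user_start >= ranges[i].end)
				continue;
			nid = ranges[i].nid;
			out[n] = ranges[i];
			out[n].start = user_start;
			n++;
		} else if (ranges[i].nid == nid) {
			/* Later ranges of the same node are covered in full. */
			out[n++] = ranges[i];
		}
	}
	return n;
}
```

      A start address inside the first range of node 1 therefore yields the
      tail of that range plus the whole second range of node 1, matching the
      movablemem_map row of the diagram.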
      
      [akpm@linux-foundation.org: clean up code, fix build warning]
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      acpi, memory-hotplug: parse SRAT before memblock is ready · e8d19552
      Tang Chen committed
      On Linux, the pages used by the kernel cannot be migrated.  As a result,
      if a memory range is used by the kernel, it cannot be hot-removed.  So if we
      want to hot-remove memory, we should prevent the kernel from using it.
      
      The way now used to prevent this is to specify a memory range with the
      movablemem_map boot option and set it as ZONE_MOVABLE.
      
      But when the system is booting, memblock will allocate memory, and
      reserve the memory for the kernel.  And memblock is already working before
      we parse SRAT and know the node memory ranges.  So it may allocate memory in
      ranges to be set as ZONE_MOVABLE.  This memory can be used by the kernel,
      and never be freed.
      
      So, let's parse SRAT before memblock is first called, which is early
      enough.
      
      The first call of memblock_find_in_range_node() is in:
      
        setup_arch()
          |-->setup_real_mode()
      
      so, this patch adds a function early_parse_srat() to parse SRAT, and calls
      it before setup_real_mode() is called.
      
      NOTE:
      
      1) early_parse_srat() is called before numa_init(), and has initialized
         numa_meminfo.  So DO NOT clear numa_nodes_parsed in numa_init() and DO
         NOT zero numa_meminfo in numa_init(), otherwise we will lose memory
         numa info.
      
      2) I don't know why the count of memory affinities parsed from SRAT is used
         as the return value in the original acpi_numa_init().  So I add a static
         variable srat_mem_cnt to remember this count and use it as the return
         value of the new acpi_numa_init()
      
      [mhocko@suse.cz: parse SRAT before memblock is ready fix]
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Y
      x86: get pg_data_t's memory from other node · 4d59a751
      Yasuaki Ishimatsu committed
      During the implementation of SRAT support, we met a problem.  In
      setup_arch(), we have the following call series:
      
       1) memblock is ready;
       2) some functions use memblock to allocate memory;
       3) parse ACPI tables, such as SRAT.
      
      Before 3), we don't know which memory is hotpluggable, and as a result,
      we cannot prevent memblock from allocating hotpluggable memory.  So, in
      2), there could be some hotpluggable memory allocated by memblock.
      
      Now, we are trying to parse SRAT earlier, before memblock is ready.  But
      I think we need more investigation on this topic.  So in this v5, I
      dropped all the SRAT support, and v5 is just the same as v3, and it is
      based on 3.8-rc3.
      
      As we planned, we will eventually support getting info from SRAT without
      users' participation.  And we will post another patch-set to do so.
      
      And also, I think for now, we can add this boot option as the first step
      of supporting movable node.  Since Linux cannot migrate the direct
      mapped pages, the only way for now is to limit the whole node to containing
      only movable memory.
      
      Using SRAT is one way.  But even if we can use SRAT, users still need an
      interface to enable/disable this functionality if they don't want to
      lose their NUMA performance.  So I think a user interface is always
      needed.
      
      For now, users can disable this functionality by not specifying the boot
      option.  Later, we will post SRAT support, and add another option value
      "movablecore_map=acpi" to use SRAT.
      
      This patch:
      
      If the system can create a movable node in which all memory of the node is
      allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
      the node's pg_data_t.  So, use memblock_alloc_try_nid() instead of
      memblock_alloc_nid() to retry when the first allocation fails.
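      
      The difference between the strict and the retrying allocation can be
      modelled with the toy allocator below. It is only a sketch of the
      fallback idea: kernel_usable[], alloc_nid() and alloc_try_nid() are
      hypothetical stand-ins and do not reproduce the memblock API or its
      return types.

```c
#include <stddef.h>

#define MAX_NODES 4

/*
 * Hypothetical per-node count of bytes the kernel may allocate from.
 * A node whose memory is entirely ZONE_MOVABLE has 0 here.
 */
static size_t kernel_usable[MAX_NODES];

/* Strict variant: allocate from `nid` only. Returns the node used, or -1. */
static int alloc_nid(size_t size, int nid)
{
	if (nid < 0 || nid >= MAX_NODES || kernel_usable[nid] < size)
		return -1;
	kernel_usable[nid] -= size;
	return nid;
}

/* Retrying variant: prefer `nid`, then fall back to any node that fits. */
static int alloc_try_nid(size_t size, int nid)
{
	int got = alloc_nid(size, nid);

	if (got >= 0)
		return got;
	for (int i = 0; i < MAX_NODES; i++) {
		got = alloc_nid(size, i);
		if (got >= 0)
			return got;
	}
	return -1;
}
```

      In this model a fully movable node 1 makes alloc_nid(size, 1) fail, while
      alloc_try_nid(size, 1) places the allocation on another node, which is
      the behaviour the patch relies on for the node's pg_data_t.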
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>