1. 26 Feb 2010 (1 commit)
  2. 25 Feb 2010 (1 commit)
    • x86, mm: Allow highmem user page tables to be disabled at boot time · 14315592
      Authored by Ian Campbell
      Distros generally (I looked at Debian, RHEL5 and SLES11) seem to
      enable CONFIG_HIGHPTE for any x86 configuration which has highmem
      enabled. This means that the overhead applies even to machines which
      have a fairly modest amount of high memory and which therefore do not
      really benefit from allocating PTEs in high memory but still pay the
      price of the additional mapping operations.
      
      Running kernbench on a 4G box I found that with CONFIG_HIGHPTE=y but
      no actual highptes being allocated there was a reduction in system
      time used from 59.737s to 55.9s.
      
      With CONFIG_HIGHPTE=y and highmem PTEs being allocated:
        Average Optimal load -j 4 Run (std deviation):
        Elapsed Time 175.396 (0.238914)
        User Time 515.983 (5.85019)
        System Time 59.737 (1.26727)
        Percent CPU 263.8 (71.6796)
        Context Switches 39989.7 (4672.64)
        Sleeps 42617.7 (246.307)
      
      With CONFIG_HIGHPTE=y but with no highmem PTEs being allocated:
        Average Optimal load -j 4 Run (std deviation):
        Elapsed Time 174.278 (0.831968)
        User Time 515.659 (6.07012)
        System Time 55.9 (1.07799)
        Percent CPU 263.8 (71.266)
        Context Switches 39929.6 (4485.13)
        Sleeps 42583.7 (373.039)
      
      This patch allows the user to control the allocation of PTEs in
      highmem from the command line ("userpte=nohigh") but retains the
      status-quo as the default.
      
      It is possible that some simple heuristic could be developed which
      allows auto-tuning of this option however I don't have a sufficiently
      large machine available to me to perform any particularly meaningful
      experiments. We could probably handwave up an argument for a threshold
      at 16G of total RAM.
      
      Assuming 768M of lowmem we have 196608 potential lowmem PTE
      pages. Each page can map 2M of RAM in a PAE-enabled configuration,
      meaning a maximum of 384G of RAM could potentially be mapped using
      lowmem PTEs.
      
      Even allowing a generous factor of 10 to account for other required
      lowmem allocations, plus generous slop to account for page sharing
      (which reduces the total amount of RAM mappable by a given number of
      PT pages) and other inaccuracies in the estimates, it would seem that
      even a 32G machine would not have a particularly pressing need for
      highmem PTEs. I think 32G can be considered the upper bound of what
      might be sensible on a 32-bit machine (although in practice 64G is
      still supported).
      
      It seems questionable whether HIGHPTE is even a win for any amount of
      RAM you would sensibly run a 32-bit kernel on rather than going 64-bit.
      Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
      LKML-Reference: <1266403090-20162-1-git-send-email-ian.campbell@citrix.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  3. 23 Feb 2010 (1 commit)
    • x86_64, cpa: Don't work hard in preserving kernel 2M mappings when using 4K already · 281ff33b
      Authored by Suresh Siddha
      We currently enforce the !RW mapping for the kernel mapping that maps
      holes between the text, rodata and data sections. However, the kernel
      identity mappings will have different RWX permissions for the pages
      mapping text and for the (freed) padding pages around the text and
      rodata sections. Hence the kernel identity mappings will be broken
      into smaller pages. For 64-bit, kernel text and kernel identity
      mappings are different, so we can enable the protection checks that
      come with CONFIG_DEBUG_RODATA and still retain 2MB large page
      mappings for kernel text.
      
      Konrad reported a boot failure with the Linux Xen paravirt guest
      because of this. In the paravirt guest case, the kernel text mapping
      and the kernel identity mapping share the same page-table pages, so
      forcing the !RW mapping for some of the kernel mappings also makes
      the kernel identity mappings read-only, resulting in the boot
      failure. The Linux Xen paravirt guest also uses 4k mappings and does
      not use 2M mappings.
      
      Fix this issue, and retain the large-page performance advantage for
      native kernels, by not working hard and not enforcing !RW for the
      kernel text mapping if the current mapping is already using small
      pages.
      Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      LKML-Reference: <1266522700.2909.34.camel@sbs-t61.sc.intel.com>
      Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: stable@kernel.org	[2.6.32, 2.6.33]
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  4. 18 Feb 2010 (2 commits)
  5. 16 Feb 2010 (3 commits)
    • x86, numa: Remove configurable node size support for numa emulation · ca2107c9
      Authored by David Rientjes
      Now that numa=fake=<size>[MG] is implemented, it is possible to remove
      configurable node size support.  The command-line parsing was already
      broken (numa=fake=*128, for example, would not work) and since fake nodes
      are now interleaved over physical nodes, this support is no longer
      required.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1002151343080.26927@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86, numa: Add fixed node size option for numa emulation · 8df5bb34
      Authored by David Rientjes
      numa=fake=N specifies the number of fake nodes, N, to partition the
      system into and then allocates them by interleaving over physical nodes.
      This requires knowledge of the system capacity when attempting to
      allocate nodes of a certain size: either very large nodes to benchmark
      scalability of code that operates on individual nodes, or very small
      nodes to find bugs in the VM.
      
      This patch introduces numa=fake=<size>[MG] so it is possible to specify
      the size of each node to allocate.  When used, nodes of the size
      specified will be allocated and interleaved over the set of physical
      nodes.
      
      FAKE_NODE_MIN_SIZE was also moved to the more-appropriate
      include/asm/numa_64.h.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1002151342510.26927@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86, numa: Fix numa emulation calculation of big nodes · 68fd111e
      Authored by David Rientjes
      numa=fake=N uses split_nodes_interleave() to partition the system into N
      fake nodes.  Each node size must be a multiple of FAKE_NODE_MIN_SIZE,
      otherwise it is possible to get strange alignments.  Because of this,
      the remaining memory from each node, when rounded to
      FAKE_NODE_MIN_SIZE, is consolidated into a number of "big nodes" that
      are bigger than the rest.
      
      The calculation of the number of big nodes is incorrect: it uses a
      logical AND operator where it should multiply the rounded-off portion
      of each node by N.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1002151342230.26927@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  6. 13 Feb 2010 (3 commits)
    • x86: Make 32bit support NO_BOOTMEM · 59be5a8e
      Authored by Yinghai Lu
      Let's make 32bit consistent with 64bit.
      
      -v2: Andrew pointed out for 32bit that we should use -1ULL
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1265793639-15071-25-git-send-email-yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • sparsemem: Put mem map for one node together. · 9bdac914
      Authored by Yinghai Lu
      Add vmemmap_alloc_block_buf for the mem map only.
      
      It falls back to the old way if it cannot get a block that big.
      
      Before this patch, when a node has 128GB of RAM installed, the memmap
      is split into two or more parts:
      [    0.000000]  [ffffea0000000000-ffffea003fffffff] PMD -> [ffff880100600000-ffff88013e9fffff] on node 1
      [    0.000000]  [ffffea0040000000-ffffea006fffffff] PMD -> [ffff88013ec00000-ffff88016ebfffff] on node 1
      [    0.000000]  [ffffea0070000000-ffffea007fffffff] PMD -> [ffff882000600000-ffff8820105fffff] on node 0
      [    0.000000]  [ffffea0080000000-ffffea00bfffffff] PMD -> [ffff882010800000-ffff8820507fffff] on node 0
      [    0.000000]  [ffffea00c0000000-ffffea00dfffffff] PMD -> [ffff882050a00000-ffff8820709fffff] on node 0
      [    0.000000]  [ffffea00e0000000-ffffea00ffffffff] PMD -> [ffff884000600000-ffff8840205fffff] on node 2
      [    0.000000]  [ffffea0100000000-ffffea013fffffff] PMD -> [ffff884020800000-ffff8840607fffff] on node 2
      [    0.000000]  [ffffea0140000000-ffffea014fffffff] PMD -> [ffff884060a00000-ffff8840709fffff] on node 2
      [    0.000000]  [ffffea0150000000-ffffea017fffffff] PMD -> [ffff886000600000-ffff8860305fffff] on node 3
      [    0.000000]  [ffffea0180000000-ffffea01bfffffff] PMD -> [ffff886030800000-ffff8860707fffff] on node 3
      [    0.000000]  [ffffea01c0000000-ffffea01ffffffff] PMD -> [ffff888000600000-ffff8880405fffff] on node 4
      [    0.000000]  [ffffea0200000000-ffffea022fffffff] PMD -> [ffff888040800000-ffff8880707fffff] on node 4
      [    0.000000]  [ffffea0230000000-ffffea023fffffff] PMD -> [ffff88a000600000-ffff88a0105fffff] on node 5
      [    0.000000]  [ffffea0240000000-ffffea027fffffff] PMD -> [ffff88a010800000-ffff88a0507fffff] on node 5
      [    0.000000]  [ffffea0280000000-ffffea029fffffff] PMD -> [ffff88a050a00000-ffff88a0709fffff] on node 5
      [    0.000000]  [ffffea02a0000000-ffffea02bfffffff] PMD -> [ffff88c000600000-ffff88c0205fffff] on node 6
      [    0.000000]  [ffffea02c0000000-ffffea02ffffffff] PMD -> [ffff88c020800000-ffff88c0607fffff] on node 6
      [    0.000000]  [ffffea0300000000-ffffea030fffffff] PMD -> [ffff88c060a00000-ffff88c0709fffff] on node 6
      [    0.000000]  [ffffea0310000000-ffffea033fffffff] PMD -> [ffff88e000600000-ffff88e0305fffff] on node 7
      [    0.000000]  [ffffea0340000000-ffffea037fffffff] PMD -> [ffff88e030800000-ffff88e0707fffff] on node 7
      
      After the patch we get:
      [    0.000000]  [ffffea0000000000-ffffea006fffffff] PMD -> [ffff880100200000-ffff88016e5fffff] on node 0
      [    0.000000]  [ffffea0070000000-ffffea00dfffffff] PMD -> [ffff882000200000-ffff8820701fffff] on node 1
      [    0.000000]  [ffffea00e0000000-ffffea014fffffff] PMD -> [ffff884000200000-ffff8840701fffff] on node 2
      [    0.000000]  [ffffea0150000000-ffffea01bfffffff] PMD -> [ffff886000200000-ffff8860701fffff] on node 3
      [    0.000000]  [ffffea01c0000000-ffffea022fffffff] PMD -> [ffff888000200000-ffff8880701fffff] on node 4
      [    0.000000]  [ffffea0230000000-ffffea029fffffff] PMD -> [ffff88a000200000-ffff88a0701fffff] on node 5
      [    0.000000]  [ffffea02a0000000-ffffea030fffffff] PMD -> [ffff88c000200000-ffff88c0701fffff] on node 6
      [    0.000000]  [ffffea0310000000-ffffea037fffffff] PMD -> [ffff88e000200000-ffff88e0701fffff] on node 7
      
      -v2: change buf to vmemmap_buf instead according to Ingo
           also add CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER according to Ingo
      -v3: according to Andrew, use sizeof(name) instead of hard coded 15
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1265793639-15071-19-git-send-email-yinghai@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86: Make 64 bit use early_res instead of bootmem before slab · 08677214
      Authored by Yinghai Lu
      Finally we can use early_res to replace bootmem for x86_64 now.
      
      CONFIG_NO_BOOTMEM still controls whether this is enabled.
      
      -v2: fix 32-bit compile issue with MAX_DMA32_PFN
      -v3: folded in the bug fix from the LKML message below
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B747239.4070907@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  7. 11 Feb 2010 (2 commits)
    • x86: Make early_node_mem get mem > 4 GB if possible · cef625ee
      Authored by Yinghai Lu
      This lets us put the pgdat for the node high in memory, so the later
      sparse vmemmap setup gets the section numbers it needs.
      
      With this patch, RAM below 4 GB is not consumed by the sparse vmemmap
      allocations.
      
      Before this patch we get (before swiotlb tries to get bootmem):
      [    0.000000] nid=1 start=0 end=2080000 aligned=1
      [    0.000000]   free [10 - 96]
      [    0.000000]   free [b12 - 1000]
      [    0.000000]   free [359f - 38a3]
      [    0.000000]   free [38b5 - 3a00]
      [    0.000000]   free [41e01 - 42000]
      [    0.000000]   free [73dde - 73e00]
      [    0.000000]   free [73fdd - 74000]
      [    0.000000]   free [741dd - 74200]
      [    0.000000]   free [743dd - 74400]
      [    0.000000]   free [745dd - 74600]
      [    0.000000]   free [747dd - 74800]
      [    0.000000]   free [749dd - 74a00]
      [    0.000000]   free [74bdd - 74c00]
      [    0.000000]   free [74ddd - 74e00]
      [    0.000000]   free [74fdd - 75000]
      [    0.000000]   free [751dd - 75200]
      [    0.000000]   free [753dd - 75400]
      [    0.000000]   free [755dd - 75600]
      [    0.000000]   free [757dd - 75800]
      [    0.000000]   free [759dd - 75a00]
      [    0.000000]   free [75bdd - 7bf5f]
      [    0.000000]   free [7f730 - 7f750]
      [    0.000000]   free [100000 - 2080000]
      [    0.000000]   total free 1f87170
      [   93.301474] Placing 64MB software IO TLB between ffff880075bdd000 - ffff880079bdd000
      [   93.311814] software IO TLB at phys 0x75bdd000 - 0x79bdd000
      
      With this patch we get (before swiotlb tries to get bootmem):
      [    0.000000] nid=1 start=0 end=2080000 aligned=1
      [    0.000000]   free [a - 96]
      [    0.000000]   free [702 - 1000]
      [    0.000000]   free [359f - 3600]
      [    0.000000]   free [37de - 3800]
      [    0.000000]   free [39dd - 3a00]
      [    0.000000]   free [3bdd - 3c00]
      [    0.000000]   free [3ddd - 3e00]
      [    0.000000]   free [3fdd - 4000]
      [    0.000000]   free [41dd - 4200]
      [    0.000000]   free [43dd - 4400]
      [    0.000000]   free [45dd - 4600]
      [    0.000000]   free [47dd - 4800]
      [    0.000000]   free [49dd - 4a00]
      [    0.000000]   free [4bdd - 4c00]
      [    0.000000]   free [4ddd - 4e00]
      [    0.000000]   free [4fdd - 5000]
      [    0.000000]   free [51dd - 5200]
      [    0.000000]   free [53dd - 5400]
      [    0.000000]   free [55dd - 7bf5f]
      [    0.000000]   free [7f730 - 7f750]
      [    0.000000]   free [100428 - 100600]
      [    0.000000]   free [13ea01 - 13ec00]
      [    0.000000]   free [170800 - 2080000]
      [    0.000000]   total free 1f87170
      
      [   92.689485] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
      [   92.699799] Placing 64MB software IO TLB between ffff8800055dd000 - ffff8800095dd000
      [   92.710916] software IO TLB at phys 0x55dd000 - 0x95dd000
      
      So swiotlb gets enough space below 4G, i.e. below pfn 0x100000.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1265793639-15071-15-git-send-email-yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86: Call early_res_to_bootmem one time · 1842f90c
      Authored by Yinghai Lu
      Simplify setup_node_mem: don't use bootmem from another node; instead
      just use find_e820_area in early_node_mem.
      
      This keeps the boundary between early_res and bootmem clearer, and
      lets us call early_res_to_bootmem() only once instead of once per
      node.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1265793639-15071-12-git-send-email-yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  8. 03 Feb 2010 (2 commits)
  9. 02 Feb 2010 (2 commits)
  10. 28 Jan 2010 (1 commit)
    • x86: Use helpers for rlimits · 2854e72b
      Authored by Jiri Slaby
      Make sure the compiler won't do weird things with the limits.
      Fetching them twice may return two different values after writable
      limits are implemented.
      
      We can either use the rlimit helpers added in
      3e10e716 or ACCESS_ONCE if they are not
      applicable; this patch uses the helpers.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      LKML-Reference: <1264609942-24621-1-git-send-email-jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  11. 23 Jan 2010 (1 commit)
    • x86: Set hotpluggable nodes in nodes_possible_map · 3a5fc0e4
      Authored by David Rientjes
      nodes_possible_map does not currently include nodes that have SRAT
      entries that are all ACPI_SRAT_MEM_HOT_PLUGGABLE since the bit is
      cleared in nodes_parsed if it does not have an online address range.
      
      Unequivocally setting the bit in nodes_parsed is insufficient since
      existing code, such as acpi_get_nodes(), assumes all nodes in the map
      have online address ranges.  In fact, all code using nodes_parsed
      assumes such nodes represent an address range of online memory.
      
      nodes_possible_map is created by unioning nodes_parsed and
      cpu_nodes_parsed; the former represents nodes with online memory and
      the latter represents memoryless nodes.  We now set the bit for
      hotpluggable nodes in cpu_nodes_parsed so that it also gets set in
      nodes_possible_map.
      
      [ hpa: Haicheng Li points out that this makes the naming of the
        variable cpu_nodes_parsed somewhat counterintuitive.  However, leave
        it as is in the interest of keeping the pure bug fix patch small. ]
      Signed-off-by: David Rientjes <rientjes@google.com>
      Tested-by: Haicheng Li <haicheng.li@linux.intel.com>
      LKML-Reference: <alpine.DEB.2.00.1001201152040.30528@chino.kir.corp.google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  12. 17 Jan 2010 (1 commit)
  13. 12 Jan 2010 (1 commit)
  14. 30 Dec 2009 (1 commit)
    • x86: Lift restriction on the location of FIX_BTMAP_* · 499a5f1e
      Authored by Jan Beulich
      The early ioremap fixmap entries cover half (or for 32-bit
      non-PAE, a quarter) of a page table, yet so far they were
      unconditionally aligned to a 256-entry boundary. This is not
      necessary if the range of page table entries falls into a single
      page table anyway.
      
      This buys back, for (theoretically) 50% of all configurations
      (25% of all non-PAE ones), at least some of the lowmem
      necessarily lost with commit e621bd18.
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4B2BB66F0200007800026AD6@vpn.id2.novell.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  15. 28 Dec 2009 (1 commit)
  16. 17 Dec 2009 (1 commit)
    • x86: Fix checking of SRAT when node 0 ram is not from 0 · 32996250
      Authored by Yinghai Lu
      Found one system that boots from socket1 instead of socket0; its SRAT gets rejected...
      
      [    0.000000] SRAT: Node 1 PXM 0 0-a0000
      [    0.000000] SRAT: Node 1 PXM 0 100000-80000000
      [    0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
      [    0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
      [    0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
      [    0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
      [    0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
      [    0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
      [    0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
      [    0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
      ...
      [    0.000000] NUMA: Allocated memnodemap from 500000 - 701040
      [    0.000000] NUMA: Using 20 for the hash shift.
      [    0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
      [    0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
      [    0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
      [    0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
      [    0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
      [    0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
      [    0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
      [    0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
      [    0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
      [    0.000000] SRAT: SRAT not used.
      
      The early_node_map is not sorted, because node 0, with its non-zero
      start, comes first.
      
      So sort it right away after all regions are registered.
      
      This also fixes a regression introduced by 8716273c (x86: Export srat
      physical topology).
      
      -v2: make it more robust, handling cross-node cases like node0
           [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
      -v3: update comments.
      Reported-and-tested-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B2579D2.3010201@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  17. 14 Dec 2009 (1 commit)
  18. 10 Dec 2009 (3 commits)
    • vfs: Implement proper O_SYNC semantics · 6b2f3d1f
      Authored by Christoph Hellwig
      While Linux has provided an O_SYNC flag basically since day 1, it took
      until Linux 2.4.0-test12pre2 to actually get it implemented for
      filesystems. Since that day we have had generic_osync_around with only
      minor changes, and the great "For now, when the user asks for O_SYNC,
      we'll actually give O_DSYNC" comment.  This patch intends to actually
      give us real O_SYNC semantics in addition to the O_DSYNC semantics.
      After Jan's O_SYNC patches, which are required before this patch, it's
      actually surprisingly simple: we just need to figure out when to set
      the datasync flag to vfs_fsync_range and when not to.
      
      This patch renames the existing O_SYNC flag to O_DSYNC while keeping
      its numerical value to preserve binary compatibility, and adds a new
      real O_SYNC flag.  To guarantee backwards compatibility it is defined
      as expanding to both O_DSYNC and the new additional binary flag
      (__O_SYNC), so we remain backwards-compatible when compiled against
      the new headers.
      
      This also means that all places that don't care about the difference
      can just check O_DSYNC and get the right behaviour for O_SYNC too;
      only places that actually care need to check __O_SYNC in addition.
      Drivers and network filesystems have been updated in a fail-safe way
      to always do the full sync magic if O_DSYNC is set.  The few places
      setting O_SYNC for lower layers are kept that way for now to stay
      fail-safe.
      
      We enforce that O_DSYNC is set when __O_SYNC is set early in the open path
      to make sure we always get these sane options.
      
      Note that parisc really screwed up its headers, as it already defined
      an O_DSYNC that has always been a no-op.  We try to repair this by
      using it for the new O_DSYNC and redefining O_SYNC to comprise both
      the traditional O_SYNC numerical value _and_ the O_DSYNC one.
      
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Grant Grundler <grundler@parisc-linux.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger@sun.com>
      Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Acked-by: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
    • x86: mmio-mod.c: Use pr_fmt · 3a0340be
      Authored by Joe Perches
      - Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
      - Remove #define NAME
      - Remove NAME from pr_<level>
      Signed-off-by: Joe Perches <joe@perches.com>
      LKML-Reference: <009cb214c45ef932df0242856228f4739cc91408.1260383912.git.joe@perches.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: kmmio.c: Add and use pr_fmt(fmt) · 1bd591a5
      Authored by Joe Perches
      - Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
      - Strip "kmmio: " from pr_<level>s
      Signed-off-by: Joe Perches <joe@perches.com>
      LKML-Reference: <7aa509f8a23933036d39f54bd51e9acc52068049.1260383912.git.joe@perches.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  19. 06 Dec 2009 (1 commit)
  20. 04 Dec 2009 (1 commit)
  21. 26 Nov 2009 (1 commit)
  22. 24 Nov 2009 (4 commits)
  23. 23 Nov 2009 (4 commits)
    • x86: Suppress stack overrun message for init_task · 0e7810be
      Authored by Jan Beulich
      init_task doesn't get its stack end location set to
      STACK_END_MAGIC, and hence the message is confusing
      rather than helpful in this case.
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      LKML-Reference: <4B06AEFE02000078000211F4@vpn.id2.novell.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86, numa: Use near(er) online node instead of roundrobin for NUMA · d9c2d5ac
      Authored by Yinghai Lu
      The CPU-to-node mapping is set via the following sequence:
      
       1. numa_init_array(): set up a round-robin mapping from CPU to
          online node.
      
       2. init_cpu_to_node(): set the mapping according to apicid_to_node[]
          from the SRAT, but only for nodes that are online; CPUs on nodes
          without RAM (i.e. not online) are left round-robin.
      
       3. Later, srat_detect_node() for Intel/AMD uses the first online
          node or a nearby node.
      
      The problem is that setup_per_cpu_areas() is not called between steps
      2 and 3, so the per_cpu area for a CPU on a node with RAM may end up
      on a different node, possibly two hops away.
      
      So optimize this: add find_near_online_node() and call it from
      init_cpu_to_node().
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4B07A739.3030104@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86, numa, bootmem: Only free bootmem on NUMA failure path · 021428ad
      Authored by Yinghai Lu
      In the NUMA bootmem setup failure path we freed nodedata_phys
      incorrectly.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4B07A739.3030104@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: apic: Print out SRAT table APIC id in hex · 163d3866
      Authored by Yinghai Lu
      Make it consistent with the APIC MADT printout; on big systems the
      APIC id is more readable in hex.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B07A739.3030104@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  24. 19 Nov 2009 (1 commit)
    • x86: Eliminate redundant/contradicting cache line size config options · 350f8f56
      Authored by Jan Beulich
      Rather than having X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT
      (with inconsistent defaults), just having the latter suffices as
      the former can be easily calculated from it.
      
      To be consistent, also change X86_INTERNODE_CACHE_BYTES to
      X86_INTERNODE_CACHE_SHIFT, and set it to 7 (128 bytes) for NUMA
      to account for last level cache line size (which here matters
      more than L1 cache line size).
      
      Finally, make sure the default value for X86_L1_CACHE_SHIFT,
      when X86_GENERIC is selected, is being seen before that for the
      individual CPU model options (other than on x86-64, where
      GENERIC_CPU is part of the choice construct, X86_GENERIC is a
      separate option on ix86).
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Acked-by: Ravikiran Thirumalai <kiran@scalex86.org>
      Acked-by: Nick Piggin <npiggin@suse.de>
      LKML-Reference: <4AFD5710020000780001F8F0@vpn.id2.novell.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>