1. 07 Apr 2017, 1 commit
    • sparc64: Fix kernel panic due to erroneous #ifdef surrounding pmd_write() · 9ae34dbd
      Committed by Tom Hromatka
      This commit moves sparc64's prototype of pmd_write() outside
      of the CONFIG_TRANSPARENT_HUGEPAGE ifdef.
      
      In 2013, commit a7b9403f ("sparc64: Encode huge PMDs using PTE
      encoding.") exposed a path where pmd_write() could be called without
      CONFIG_TRANSPARENT_HUGEPAGE defined.  This can result in the panic below.
      
      The diff is awkward to read, but the changes are straightforward.
      pmd_write() was moved outside of #ifdef CONFIG_TRANSPARENT_HUGEPAGE.
      Also, __HAVE_ARCH_PMD_WRITE was defined.
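      
      A rough sketch of the shape of this change (the sparc64 body of
      pmd_write() is omitted and the prototype simplified; illustrative
      only, not the verbatim diff):
      
          /* Before: only visible with THP enabled, so the BUG()ing
           * generic fallback in asm-generic/pgtable.h could be used
           * on the gup path.
           */
          #ifdef CONFIG_TRANSPARENT_HUGEPAGE
          static inline int pmd_write(pmd_t pmd);     /* old location */
          #endif
      
          /* After: always visible, and the generic fallback disabled. */
          #define __HAVE_ARCH_PMD_WRITE
          static inline int pmd_write(pmd_t pmd);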
      
      kernel BUG at include/asm-generic/pgtable.h:576!
                    \|/ ____ \|/
                    "@'/ .. \`@"
                    /_| \__/ |_\
                       \__U_/
      oracle_8114_cdb(8114): Kernel bad sw trap 5 [#1]
      CPU: 120 PID: 8114 Comm: oracle_8114_cdb Not tainted
      4.1.12-61.7.1.el6uek.rc1.sparc64 #1
      task: fff8400700a24d60 ti: fff8400700bc4000 task.ti: fff8400700bc4000
      TSTATE: 0000004411e01607 TPC: 00000000004609f8 TNPC: 00000000004609fc Y:
      00000005    Not tainted
      TPC: <gup_huge_pmd+0x198/0x1e0>
      g0: 000000000001c000 g1: 0000000000ef3954 g2: 0000000000000000 g3: 0000000000000001
      g4: fff8400700a24d60 g5: fff8001fa5c10000 g6: fff8400700bc4000 g7: 0000000000000720
      o0: 0000000000bc5058 o1: 0000000000000240 o2: 0000000000006000 o3: 0000000000001c00
      o4: 0000000000000000 o5: 0000048000080000 sp: fff8400700bc6ab1 ret_pc: 00000000004609f0
      RPC: <gup_huge_pmd+0x190/0x1e0>
      l0: fff8400700bc74fc l1: 0000000000020000 l2: 0000000000002000 l3: 0000000000000000
      l4: fff8001f93250950 l5: 000000000113f800 l6: 0000000000000004 l7: 0000000000000000
      i0: fff8400700ca46a0 i1: bd0000085e800453 i2: 000000026a0c4000 i3: 000000026a0c6000
      i4: 0000000000000001 i5: fff800070c958de8 i6: fff8400700bc6b61 i7: 0000000000460dd0
      I7: <gup_pud_range+0x170/0x1a0>
      Call Trace:
       [0000000000460dd0] gup_pud_range+0x170/0x1a0
       [0000000000460e84] get_user_pages_fast+0x84/0x120
       [00000000006f5a18] iov_iter_get_pages+0x98/0x240
       [00000000005fa744] do_direct_IO+0xf64/0x1e00
       [00000000005fbbc0] __blockdev_direct_IO+0x360/0x15a0
       [00000000101f74fc] ext4_ind_direct_IO+0xdc/0x400 [ext4]
       [00000000101af690] ext4_ext_direct_IO+0x1d0/0x2c0 [ext4]
       [00000000101af86c] ext4_direct_IO+0xec/0x220 [ext4]
       [0000000000553bd4] generic_file_read_iter+0x114/0x140
       [00000000005bdc2c] __vfs_read+0xac/0x100
       [00000000005bf254] vfs_read+0x54/0x100
       [00000000005bf368] SyS_pread64+0x68/0x80
      Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 10 Mar 2017, 1 commit
  3. 02 Mar 2017, 1 commit
  4. 24 Feb 2017, 1 commit
    • sparc64: Multi-page size support · c7d9f77d
      Committed by Nitin Gupta
      Add support for using multiple hugepage sizes simultaneously
      on mainline. Currently, support for 256M has been added which
      can be used along with 8M pages.
      
      Page tables are set like this (e.g. for 256M page):
          VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]
      
      and TSB is set similarly:
          VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]
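      
      To make the replication above concrete, here is a small standalone
      illustration (not code from the patch) of how one 256M page fans
      out into 32 page-table entries at 8M steps and 64 TSB entries at
      4M steps, each still tagged with the 256M size bit:
      
          #include <stdio.h>
      
          #define HPAGE_256M (256UL << 20)
          #define HPAGE_8M   (8UL << 20)
          #define HPAGE_4M   (4UL << 20)
      
          int main(void)
          {
                  unsigned long va = 0x100000000UL, pa = 0x200000000UL, off;
      
                  /* page table: VA + 8M*x -> PA + 8M*x, x in [0, 31] */
                  for (off = 0; off < HPAGE_256M; off += HPAGE_8M)
                          printf("PTE: %#lx -> %#lx (sz=256M)\n", va + off, pa + off);
      
                  /* TSB: VA + 4M*x -> PA + 4M*x, x in [0, 63] */
                  for (off = 0; off < HPAGE_256M; off += HPAGE_4M)
                          printf("TSB: %#lx -> %#lx (sz=256M)\n", va + off, pa + off);
                  return 0;
          }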
      
      - Testing
      
      Tested on Sonoma (which supports 256M pages) by running stream
      benchmark instances in parallel: one instance uses 8M pages and
      another uses 256M pages, consuming 48G each.
      
      Boot params used:
      
      default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
      hugepages=10000
      Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 12 Dec 2016, 1 commit
  6. 30 Jul 2016, 1 commit
  7. 21 May 2016, 1 commit
  8. 20 May 2016, 1 commit
    • arch: fix has_transparent_hugepage() · fd8cfd30
      Committed by Hugh Dickins
      I've just discovered that the useful-sounding has_transparent_hugepage()
      is actually an architecture-dependent minefield: on some arches it only
      builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
      not, but on some of those (arm and arm64) it then gives the wrong
      answer; and on mips alone it's marked __init, which would crash if
      called later (but so far it has not been called later).
      
      Straighten this out: make it available to all configs, with a sensible
      default in asm-generic/pgtable.h, removing its definitions from those
      arches (arc, arm, arm64, sparc, tile) which are served by the default,
      adding #define has_transparent_hugepage has_transparent_hugepage to
      those (mips, powerpc, s390, x86) which need to override the default at
      runtime, and removing the __init from mips (but maybe that kind of code
      should be avoided after init: set a static variable the first time it's
      called).
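      
      The resulting pattern is roughly the following (sketched from the
      description above, not copied from the tree; the runtime helper
      name is hypothetical):
      
          /* asm-generic/pgtable.h: a sensible default for every config */
          #ifndef has_transparent_hugepage
          #ifdef CONFIG_TRANSPARENT_HUGEPAGE
          #define has_transparent_hugepage() 1
          #else
          #define has_transparent_hugepage() 0
          #endif
          #endif
      
          /* An arch that must decide at runtime (mips, powerpc, s390, x86)
           * overrides it and defines the marker macro so the generic
           * default stays out of the way:
           */
          #define has_transparent_hugepage has_transparent_hugepage
          static inline int has_transparent_hugepage(void)
          {
                  return cpu_supports_thp();      /* hypothetical helper */
          }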
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>		[arch/arc]
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[arch/s390]
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 21 Mar 2016, 1 commit
  10. 16 Jan 2016, 2 commits
    • arch/sparc/include/asm/pgtable_64.h: add pmd_[dirty|mkclean] for THP · 79cedb8f
      Committed by Minchan Kim
      MADV_FREE needs pmd_dirty and pmd_mkclean to detect a recent overwrite
      of the contents, since the MADV_FREE syscall can be applied to a THP page.
      
      This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
      support.
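      
      Since sparc64 encodes a huge PMD as a real PTE, these helpers can be
      expressed in terms of the existing pte helpers; a sketch of that
      shape (not the verbatim patch):
      
          static inline unsigned long pmd_dirty(pmd_t pmd)
          {
                  pte_t pte = __pte(pmd_val(pmd));
      
                  return pte_dirty(pte);
          }
      
          static inline pmd_t pmd_mkclean(pmd_t pmd)
          {
                  pte_t pte = __pte(pmd_val(pmd));
      
                  pte = pte_mkclean(pte);
                  return __pmd(pte_val(pte));
          }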
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sparc, thp: remove infrastructure for handling splitting PMDs · 99f1bc01
      Committed by Kirill A. Shutemov
      With new refcounting we don't need to mark PMDs splitting.  Let's drop
      code to handle this.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 25 Jun 2015, 1 commit
  12. 01 Jun 2015, 1 commit
    • sparc: Resolve conflict between sparc v9 and M7 on usage of bit 9 of TTE · 494e5b6f
      Committed by Khalid Aziz
      
      Bit 9 of TTE is CV (Cacheable in V-cache) on sparc v9 processor while
      the same bit 9 is MCDE (Memory Corruption Detection Enable) on M7
      processor. This creates a conflicting usage of the same bit. Kernel
      sets TTE.cv bit on all pages for sun4v architecture which works well
      for sparc v9 but enables memory corruption detection on M7 processor
      which is not the intent. This patch adds code to determine if kernel
      is running on M7 processor and takes steps to not enable memory
      corruption detection in TTE erroneously.
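      
      A simplified sketch of the kind of runtime check this implies (the
      chip-type constant and variable names here are assumptions for
      illustration, not necessarily the identifiers the patch uses):
      
          /* Only put the "cacheable in V-cache" bit into the sun4v TTE
           * template when we are not on an M7-class cpu, where bit 9
           * means MCDE instead.
           */
          static unsigned long page_cache4v_flag;
      
          static void __init setup_page_cache4v_flag(void)
          {
                  if (sun4v_chip_type == SUN4V_CHIP_SPARC_M7)  /* name assumed */
                          page_cache4v_flag = 0;        /* keep MCDE disabled */
                  else
                          page_cache4v_flag = _PAGE_CV_4V;
          }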
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 12 Feb 2015, 1 commit
  14. 11 Feb 2015, 1 commit
  15. 11 Dec 2014, 1 commit
  16. 06 Oct 2014, 5 commits
    • sparc64: Kill unnecessary tables and increase MAX_BANKS. · d195b71b
      Committed by David S. Miller
      swapper_low_pmd_dir and swapper_pud_dir are actually completely
      useless and unnecessary.
      
      We just need swapper_pg_dir[].  Naturally the other page table chunks
      will be allocated on an as-needed basis.  Since the kernel actually
      accesses these tables in the PAGE_OFFSET view, there is not even a TLB
      locality advantage of placing them in the kernel image.
      
      Use the hard coded vmlinux.ld.S slot for swapper_pg_dir which is
      naturally page aligned.
      
      Increase MAX_BANKS to 1024 in order to handle heavily fragmented
      virtual guests.
      
      Even with this MAX_BANKS increase, the kernel is 20K+ smaller.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Bob Picco <bob.picco@oracle.com>
    • sparc64: Adjust vmalloc region size based upon available virtual address bits. · bb4e6e85
      Committed by David S. Miller
      In order to accommodate embedded per-cpu allocation with large numbers
      of cpus and numa nodes, we have to use as much virtual address space
      as possible for the vmalloc region.  Otherwise we can get things like:
      
      PERCPU: max_distance=0x380001c10000 too large for vmalloc space 0xff00000000
      
      So, once we select a value for PAGE_OFFSET, derive the size of the
      vmalloc region based upon that.
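      
      For scale, the two numbers in the PERCPU error above differ by more
      than a factor of 50; a trivial standalone comparison (values copied
      from that message):
      
          #include <stdio.h>
      
          int main(void)
          {
                  unsigned long max_distance = 0x380001c10000UL; /* from the error   */
                  unsigned long vmalloc_size = 0xff00000000UL;   /* old fixed window */
      
                  printf("needed : %lu GB\n", max_distance >> 30);
                  printf("have   : %lu GB\n", vmalloc_size >> 30);
                  return 0;   /* prints 57344 GB needed vs 1020 GB available */
          }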
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Bob Picco <bob.picco@oracle.com>
    • sparc64: Increase MAX_PHYS_ADDRESS_BITS to 53. · 7c0fa0f2
      Committed by David S. Miller
      Make sure, at compile time, that the kernel can properly support
      whatever MAX_PHYS_ADDRESS_BITS is defined to.
      
      On M7 chips, use a max_phys_bits value of 49.
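      
      A minimal shape for such a compile-time guarantee (the macro names
      here are illustrative, not the identifiers used by the patch):
      
          #if MAX_PHYS_ADDRESS_BITS > MAX_PHYS_BITS_COVERED_BY_PGTABLES
          #error MAX_PHYS_ADDRESS_BITS cannot be represented by the page tables
          #endif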
      
      Based upon a patch by Bob Picco.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Bob Picco <bob.picco@oracle.com>
    • sparc64: Fix physical memory management regressions with large max_phys_bits. · 0dd5b7b0
      Committed by David S. Miller
      If max_phys_bits needs to be > 43 (e.g. for T4 chips), things like
      DEBUG_PAGEALLOC stop working because the 3-level page tables can
      only cover up to 43 bits.
      
      Another problem is that when we increased MAX_PHYS_ADDRESS_BITS up to
      47, several statically allocated tables became enormous.
      
      Compounding this is that we will need to support up to 49 bits of
      physical addressing for M7 chips.
      
      The two tables in question are sparc64_valid_addr_bitmap and
      kpte_linear_bitmap.
      
      The first holds a bitmap, with 1 bit for each 4MB chunk of physical
      memory, indicating whether that chunk actually exists in the machine
      and is valid.
      
      The second table is a set of 2-bit values which tell how large of a
      mapping (4MB, 256MB, 2GB, 16GB, respectively) we can use at each 256MB
      chunk of ram in the system.
      
      These tables are huge and take up an enormous amount of the BSS
      section of the sparc64 kernel image.  Specifically, the
      sparc64_valid_addr_bitmap is 4MB, and the kpte_linear_bitmap is 128K.
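      
      Those sizes follow directly from the encodings just described; a
      small standalone calculation (assuming 47 physical address bits,
      per the MAX_PHYS_ADDRESS_BITS value mentioned above):
      
          #include <stdio.h>
      
          int main(void)
          {
                  unsigned long phys = 1UL << 47;              /* bytes of phys space */
                  unsigned long valid_bits = phys >> 22;       /* 1 bit per 4MB chunk */
                  unsigned long kpte_bits  = (phys >> 28) * 2; /* 2 bits per 256MB    */
      
                  printf("sparc64_valid_addr_bitmap: %lu KB\n", valid_bits / 8 / 1024);
                  printf("kpte_linear_bitmap:        %lu KB\n", kpte_bits / 8 / 1024);
                  return 0;   /* 4096 KB (4MB) and 128 KB, matching the text */
          }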
      
      So let's solve the space wastage and the DEBUG_PAGEALLOC problem
      at the same time, by using the kernel page tables (as designed) to
      manage this information.
      
      We have to keep using large mappings when DEBUG_PAGEALLOC is disabled,
      and we do this by encoding huge PMDs and PUDs.
      
      On a T4-2 with 256GB of ram the kernel page table takes up 16K with
      DEBUG_PAGEALLOC disabled and 256MB with it enabled.  Furthermore, this
      memory is dynamically allocated at run time rather than coded
      statically into the kernel image.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Bob Picco <bob.picco@oracle.com>
    • sparc64: Switch to 4-level page tables. · ac55c768
      Committed by David S. Miller
      This has become necessary with chips that support more than 43-bits
      of physical addressing.
      
      Based almost entirely upon a patch by Bob Picco.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Bob Picco <bob.picco@oracle.com>
  17. 19 May 2014, 1 commit
  18. 09 May 2014, 1 commit
    • sparc64: Fix huge TSB mapping on pre-UltraSPARC-III cpus. · b18eb2d7
      Committed by David S. Miller
      Access to the TSB hash tables during TLB misses requires that there be
      an atomic 128-bit quad load available so that we fetch a matching TAG
      and DATA field at the same time.
      
      On cpus prior to UltraSPARC-III only virtual address based quad loads
      are available.  UltraSPARC-III and later provide physical address
      based variants which are easier to use.
      
      When we only have virtual address based quad loads available this
      means that we have to lock the TSB into the TLB at a fixed virtual
      address on each cpu when it runs that process.  We can't just access
      the PAGE_OFFSET based aliased mapping of these TSBs because we cannot
      take a recursive TLB miss inside of the TLB miss handler without
      risking running out of hardware trap levels (some trap combinations
      can be deep, such as those generated by register window spill and fill
      traps).
      
      Without huge pages this works perfectly fine, but when the huge TSB
      was added, no additional chunk of fixed virtual address space was
      allocated for this second TSB mapping.
      
      So we were mapping both the 8K and 4MB TSBs to the same exact virtual
      address, causing multiple TLB matches which gives undefined behavior.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 04 May 2014, 8 commits
  20. 19 Dec 2013, 1 commit
    • mm: fix TLB flush race between migration, and change_protection_range · 20841405
      Committed by Rik van Riel
      There are a few subtle races, between change_protection_range (used by
      mprotect and change_prot_numa) on one side, and NUMA page migration and
      compaction on the other side.
      
      The basic race is that there is a time window between when the PTE gets
      made non-present (PROT_NONE or NUMA), and the TLB is flushed.
      
      During that time, a CPU may continue writing to the page.
      
      This is fine most of the time, however compaction or the NUMA migration
      code may come in, and migrate the page away.
      
      When that happens, the CPU may continue writing, through the cached
      translation, to what is no longer the current memory location of the
      process.
      
      This only affects x86, which has a somewhat optimistic pte_accessible.
      All other architectures appear to be safe, and will either always flush,
      or flush whenever there is a valid mapping, even with no permissions
      (SPARC).
      
      The basic race looks like this:
      
      CPU A			CPU B			CPU C
      
      						load TLB entry
      make entry PTE/PMD_NUMA
      			fault on entry
      						read/write old page
      			start migrating page
      			change PTE/PMD to new page
      						read/write old page [*]
      flush TLB
      						reload TLB from new entry
      						read/write new page
      						lose data
      
      [*] the old page may belong to a new user at this point!
      
      The obvious fix is to flush remote TLB entries, by making
      pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory
      may still be accessible while there is a TLB flush pending for the mm.
      
      This should fix both NUMA migration and compaction.
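      
      In essence the fix makes pte_accessible() consider a pending flush;
      a simplified sketch of that shape (condensed, not the exact x86
      code of the patch):
      
          /* A PROT_NONE/NUMA pte must still be treated as accessible
           * while some other CPU may hold a stale, writable TLB entry
           * for it, i.e. while a TLB flush is still pending for the mm.
           */
          static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
          {
                  if (pte_flags(a) & _PAGE_PRESENT)
                          return true;
      
                  if ((pte_flags(a) & _PAGE_PROTNONE) && mm_tlb_flush_pending(mm))
                          return true;
      
                  return false;
          }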
      
      [mgorman@suse.de: fix build]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 14 Nov 2013, 1 commit
    • sparc64: Encode huge PMDs using PTE encoding. · a7b9403f
      Committed by David S. Miller
      Now that we have 64-bits for PMDs we can stop using special encodings
      for the huge PMD values, and just put real PTEs in there.
      
      We allocate a _PAGE_PMD_HUGE bit to distinguish between plain PMDs and
      huge ones.  It is the same for both 4U and 4V PTE layouts.
      
      We also use _PAGE_SPECIAL to indicate the splitting state, since a
      huge PMD cannot also be special.
      
      All of the PMD --> PTE translation code disappears, and most of the
      huge PMD bit modifications and tests just degenerate into the PTE
      operations.  In particular USER_PGTABLE_CHECK_PMD_HUGE becomes
      trivial.
      
      As a side effect, normal PMDs don't shift the physical address around.
      This also speeds up the page table walks in the TLB miss paths since
      they don't have to do the shifts any more.
      
      Another non-trivial aspect is that pte_modify() has to be changed
      to preserve the _PAGE_PMD_HUGE bits as well as the page size field
      of the pte.
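      
      With that encoding most of the huge PMD helpers collapse into the
      pte ones; a sketch of the resulting shape (simplified):
      
          static inline unsigned long pmd_trans_huge(pmd_t pmd)
          {
                  pte_t pte = __pte(pmd_val(pmd));
      
                  return pte_val(pte) & _PAGE_PMD_HUGE;
          }
      
          static inline pmd_t pmd_mkwrite(pmd_t pmd)
          {
                  pte_t pte = __pte(pmd_val(pmd));
      
                  pte = pte_mkwrite(pte);
                  return __pmd(pte_val(pte));
          }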
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 13 Nov 2013, 2 commits
    • sparc64: Move to 64-bit PGDs and PMDs. · 2b77933c
      Committed by David S. Miller
      To make the page tables compact, we were using 32-bit PGDs and PMDs.
      We only had to support <= 43 bits of physical addresses so this was
      quite feasible.
      
      In order to support larger physical addresses we have to move to
      64-bit PGDs and PMDs.
      
      Most of the changes are straight-forward:
      
      1) {pgd,pmd}_t --> unsigned long
      
      2) Anything that tries to use plain "unsigned int" types with pgd/pmd
         values needs to be adjusted.  In particular things like "0U" become
         "0UL".
      
      3) {PGDIR,PMD}_BITS decrease by one.
      
      4) In the assembler page table walkers, use "ldxa" instead of "lduwa"
         and adjust the low bit masks to clear out the low 3 bits instead of
         just the low 2 bits during pgd/pmd address formation.
      
      Also, use PTRS_PER_PGD and PTRS_PER_PMD in the sizing of the
      swapper_{pg_dir,low_pmd_dir} arrays.
      
      This patch does not try to take advantage of having 64-bits in the
      PMDs to simplify the hugepage code, that will come in a subsequent
      change.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sparc64: Move from 4MB to 8MB huge pages. · 37b3a8ff
      Committed by David S. Miller
      The impetus for this is that we would like to move to 64-bit PMDs and
      PGDs, but that would result in only supporting a 42-bit address space
      with the current page table layout.  It'd be nice to support at least
      43-bits.
      
      The reason we'd end up with only 42-bits after making PMDs and PGDs
      64-bit is that we only use half-page sized PTE tables in order to make
      PMDs line up to 4MB, the hardware huge page size we use.
      
      So what we do here is we make huge pages 8MB, and fabricate them using
      4MB hw TLB entries.
      
      Facilitate this by providing a "REAL_HPAGE_SHIFT" which is used in
      places that really need to operate on hardware 4MB pages.
      
      Use full pages (512 entries) for PTE tables, and adjust PMD_SHIFT,
      PGD_SHIFT, and the build time CPP test as needed.  Use a CPP test to
      make sure REAL_HPAGE_SHIFT and the _PAGE_SZHUGE_* we use match up.
      
      This makes the pgtable cache completely unused, so remove the code
      managing it and the state used in mm_context_t.  Now we have less
      spinlocks taken in the page table allocation path.
      
      The technique we use to fabricate the 8MB pages is to transfer bit 22
      from the missing virtual address into the PTEs physical address field.
      That takes care of the transparent huge pages case.
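      
      Conceptually, each 8MB software page becomes two 4MB hardware TLB
      entries whose physical addresses differ only in bit 22, taken from
      the faulting virtual address; a small standalone illustration
      (plain C, not the assembler TLB-miss code):
      
          #include <stdio.h>
      
          #define BIT22  (1UL << 22)          /* selects the 4MB half */
      
          /* Given the physical base of an 8MB page and a faulting VA
           * inside it, pick the 4MB half by copying bit 22 of the VA
           * into the physical address.
           */
          static unsigned long hw_paddr(unsigned long pa_8mb, unsigned long va)
          {
                  return pa_8mb | (va & BIT22);
          }
      
          int main(void)
          {
                  unsigned long pa = 0x40000000UL;   /* 8MB-aligned, bit 22 clear */
      
                  printf("lower 4MB half: %#lx\n", hw_paddr(pa, 0x10000000UL));
                  printf("upper 4MB half: %#lx\n", hw_paddr(pa, 0x10400000UL));
                  return 0;
          }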
      
      For hugetlb, we fill things in at the PTE level and that code already
      puts the sub huge page physical bits into the PTEs, based upon the
      offset, so there is nothing special we need to do.  It all just works
      out.
      
      So, a small amount of complexity in the THP case, but this code is
      about to get much simpler when we move the 64-bit PMDs as we can move
      away from the fancy 32-bit huge PMD encoding and just put a real PTE
      value in there.
      
      With bug fixes and help from Bob Picco.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 29 Jun 2013, 1 commit
  24. 20 Jun 2013, 1 commit
  25. 20 Apr 2013, 1 commit
    • sparc64: Fix race in TLB batch processing. · f36391d2
      Committed by David S. Miller
      As reported by Dave Kleikamp, when we emit cross calls to do batched
      TLB flush processing we have a race because we do not synchronize on
      the sibling cpus completing the cross call.
      
      So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.)
      and either flushes are missed or flushes will flush the wrong
      addresses.
      
      Fix this by using generic infrastructure to synchronize on the
      completion of the cross call.
      
      This first required getting the flush_tlb_pending() call out from
      switch_to() which operates with locks held and interrupts disabled.
      The problem is that smp_call_function_many() cannot be invoked with
      IRQs disabled and this is explicitly checked for with WARN_ON_ONCE().
      
      We get the batch processing outside of locked IRQ disabled sections by
      using some ideas from the powerpc port. Namely, we only batch inside
      of arch_{enter,leave}_lazy_mmu_mode() calls.  If we're not in such a
      region, we flush TLBs synchronously.
      
      1) Get rid of xcall_flush_tlb_pending and per-cpu type
         implementations.
      
      2) Do TLB batch cross calls instead via:
      
      	smp_call_function_many()
      		tlb_pending_func()
      			__flush_tlb_pending()
      
      3) Batch only in lazy mmu sequences:
      
      	a) Add 'active' member to struct tlb_batch
      	b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
      	c) Set 'active' in arch_enter_lazy_mmu_mode()
      	d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode()
      	e) Check 'active' in tlb_batch_add_one() and do a synchronous
                 flush if it's clear.
      
      4) Add infrastructure for synchronous TLB page flushes.
      
      	a) Implement __flush_tlb_page and per-cpu variants, patch
      	   as needed.
      	b) Likewise for xcall_flush_tlb_page.
      	c) Implement smp_flush_tlb_page() to invoke the cross-call.
      	d) Wire up global_flush_tlb_page() to the right routine based
                 upon CONFIG_SMP
      
      5) It turns out that singleton batches are very common, 2 out of every
         3 batch flushes have only a single entry in them.
      
         The batch flush waiting is very expensive, both because of the poll
         on sibling cpu completion, and because passing the tlb batch
         pointer to the sibling cpus involves a shared memory dereference.
      
         Therefore, in flush_tlb_pending(), if there is only one entry in
         the batch perform a completely asynchronous global_flush_tlb_page()
         instead.
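      
      Schematically, the lazy-mmu batching described in point 3 ends up
      looking like this (field and function names condensed from the
      list above; not the literal file contents):
      
          struct tlb_batch {
                  bool active;            /* inside a lazy mmu section? */
                  unsigned long tlb_nr;   /* entries queued so far */
                  /* ... mm pointer, queued virtual addresses ... */
          };
          static DEFINE_PER_CPU(struct tlb_batch, tlb_batch);
      
          static inline void arch_enter_lazy_mmu_mode(void)
          {
                  this_cpu_ptr(&tlb_batch)->active = true;
          }
      
          static inline void arch_leave_lazy_mmu_mode(void)
          {
                  struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
      
                  if (tb->tlb_nr)
                          flush_tlb_pending();    /* run the batch now */
                  tb->active = false;
          }
      
          static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr)
          {
                  struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
      
                  if (!tb->active) {
                          global_flush_tlb_page(mm, vaddr);  /* synchronous flush */
                          return;
                  }
                  /* otherwise queue vaddr into the batch as before */
          }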
      Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
  26. 14 Feb 2013, 1 commit
  27. 19 Dec 2012, 1 commit