- 07 4月, 2017 1 次提交
-
-
由 Tom Hromatka 提交于
This commit moves sparc64's prototype of pmd_write() outside of the CONFIG_TRANSPARENT_HUGEPAGE ifdef. In 2013, commit a7b9403f ("sparc64: Encode huge PMDs using PTE encoding.") exposed a path where pmd_write() could be called without CONFIG_TRANSPARENT_HUGEPAGE defined. This can result in the panic below. The diff is awkward to read, but the changes are straightforward. pmd_write() was moved outside of #ifdef CONFIG_TRANSPARENT_HUGEPAGE. Also, __HAVE_ARCH_PMD_WRITE was defined. kernel BUG at include/asm-generic/pgtable.h:576! \|/ ____ \|/ "@'/ .. \`@" /_| \__/ |_\ \__U_/ oracle_8114_cdb(8114): Kernel bad sw trap 5 [#1] CPU: 120 PID: 8114 Comm: oracle_8114_cdb Not tainted 4.1.12-61.7.1.el6uek.rc1.sparc64 #1 task: fff8400700a24d60 ti: fff8400700bc4000 task.ti: fff8400700bc4000 TSTATE: 0000004411e01607 TPC: 00000000004609f8 TNPC: 00000000004609fc Y: 00000005 Not tainted TPC: <gup_huge_pmd+0x198/0x1e0> g0: 000000000001c000 g1: 0000000000ef3954 g2: 0000000000000000 g3: 0000000000000001 g4: fff8400700a24d60 g5: fff8001fa5c10000 g6: fff8400700bc4000 g7: 0000000000000720 o0: 0000000000bc5058 o1: 0000000000000240 o2: 0000000000006000 o3: 0000000000001c00 o4: 0000000000000000 o5: 0000048000080000 sp: fff8400700bc6ab1 ret_pc: 00000000004609f0 RPC: <gup_huge_pmd+0x190/0x1e0> l0: fff8400700bc74fc l1: 0000000000020000 l2: 0000000000002000 l3: 0000000000000000 l4: fff8001f93250950 l5: 000000000113f800 l6: 0000000000000004 l7: 0000000000000000 i0: fff8400700ca46a0 i1: bd0000085e800453 i2: 000000026a0c4000 i3: 000000026a0c6000 i4: 0000000000000001 i5: fff800070c958de8 i6: fff8400700bc6b61 i7: 0000000000460dd0 I7: <gup_pud_range+0x170/0x1a0> Call Trace: [0000000000460dd0] gup_pud_range+0x170/0x1a0 [0000000000460e84] get_user_pages_fast+0x84/0x120 [00000000006f5a18] iov_iter_get_pages+0x98/0x240 [00000000005fa744] do_direct_IO+0xf64/0x1e00 [00000000005fbbc0] __blockdev_direct_IO+0x360/0x15a0 [00000000101f74fc] ext4_ind_direct_IO+0xdc/0x400 [ext4] [00000000101af690] ext4_ext_direct_IO+0x1d0/0x2c0 [ext4] [00000000101af86c] ext4_direct_IO+0xec/0x220 [ext4] [0000000000553bd4] generic_file_read_iter+0x114/0x140 [00000000005bdc2c] __vfs_read+0xac/0x100 [00000000005bf254] vfs_read+0x54/0x100 [00000000005bf368] SyS_pread64+0x68/0x80 Signed-off-by: NTom Hromatka <tom.hromatka@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 3月, 2017 1 次提交
-
-
由 Kirill A. Shutemov 提交于
If an architecture uses 4level-fixup.h we don't need to do anything as it includes 5level-fixup.h. If an architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK before inclusion of the header. It makes asm-generic code to use 5level-fixup.h. If an architecture has 4-level paging or folds levels on its own, include 5level-fixup.h directly. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 3月, 2017 1 次提交
-
-
由 Ingo Molnar 提交于
Update code that relied on sched.h including various MM types for them. This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>. Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 24 2月, 2017 1 次提交
-
-
由 Nitin Gupta 提交于
Add support for using multiple hugepage sizes simultaneously on mainline. Currently, support for 256M has been added which can be used along with 8M pages. Page tables are set like this (e.g. for 256M page): VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31] and TSB is set similarly: VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63] - Testing Tested on Sonoma (which supports 256M pages) by running stream benchmark instances in parallel: one instance uses 8M pages and another uses 256M pages, consuming 48G each. Boot params used: default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M hugepages=10000 Signed-off-by: NNitin Gupta <nitin.m.gupta@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 12月, 2016 1 次提交
-
-
由 Kirill A. Shutemov 提交于
It really has to be pgdp, not pgd. It just happend to work since all callers have 'pgd' as an argument. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 7月, 2016 1 次提交
-
-
由 Nitin Gupta 提交于
For PMD aligned (8M) hugepages, we currently allocate all four page table levels which is wasteful. We now allocate till PMD level only which saves memory usage from page tables. Also, when freeing page table for 8M hugepage backed region, make sure we don't try to access non-existent PTE level. Orabug: 22630259 Signed-off-by: NNitin Gupta <nitin.m.gupta@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 5月, 2016 1 次提交
-
-
由 Nitin Gupta 提交于
During hugepage map/unmap, TSB and TLB flushes are currently issued at every PAGE_SIZE'd boundary which is unnecessary. We now issue the flush at REAL_HPAGE_SIZE boundaries only. Without this patch workloads which unmap a large hugepage backed VMA region get CPU lockups due to excessive TLB flush calls. Orabug: 22365539, 22643230, 22995196 Signed-off-by: NNitin Gupta <nitin.m.gupta@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 5月, 2016 1 次提交
-
-
由 Hugh Dickins 提交于
I've just discovered that the useful-sounding has_transparent_hugepage() is actually an architecture-dependent minefield: on some arches it only builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when not, but on some of those (arm and arm64) it then gives the wrong answer; and on mips alone it's marked __init, which would crash if called later (but so far it has not been called later). Straighten this out: make it available to all configs, with a sensible default in asm-generic/pgtable.h, removing its definitions from those arches (arc, arm, arm64, sparc, tile) which are served by the default, adding #define has_transparent_hugepage has_transparent_hugepage to those (mips, powerpc, s390, x86) which need to override the default at runtime, and removing the __init from mips (but maybe that kind of code should be avoided after init: set a static variable the first time it's called). Signed-off-by: NHugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Acked-by: Vineet Gupta <vgupta@synopsys.com> [arch/arc] Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [arch/s390] Acked-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 3月, 2016 1 次提交
-
-
由 Adam Buchbinder 提交于
Signed-off-by: NAdam Buchbinder <adam.buchbinder@gmail.com> Reviewed-by: NJulian Calaby <julian.calaby@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 1月, 2016 2 次提交
-
-
由 Minchan Kim 提交于
MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent overwrite of the contents since MADV_FREE syscall is called for THP page. This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE support. Signed-off-by: NMinchan Kim <minchan@kernel.org> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Shaohua Li <shli@kernel.org> Cc: <yalin.wang2010@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jason Evans <je@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttil <mika.penttila@nextfour.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Rik van Riel <riel@redhat.com> Cc: Roland Dreier <roland@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Shaohua Li <shli@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
With new refcounting we don't need to mark PMDs splitting. Let's drop code to handle this. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 6月, 2015 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
We have confusing functions to clear pmd, pmd_clear_* and pmd_clear. Add _huge_ to pmdp_clear functions so that we are clear that they operate on hugepage pte. We don't bother about other functions like pmdp_set_wrprotect, pmdp_clear_flush_young, because they operate on PTE bits and hence indicate they are operating on hugepage ptes Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 6月, 2015 1 次提交
-
-
由 Khalid Aziz 提交于
sparc: Resolve conflict between sparc v9 and M7 on usage of bit 9 of TTE Bit 9 of TTE is CV (Cacheable in V-cache) on sparc v9 processor while the same bit 9 is MCDE (Memory Corruption Detection Enable) on M7 processor. This creates a conflicting usage of the same bit. Kernel sets TTE.cv bit on all pages for sun4v architecture which works well for sparc v9 but enables memory corruption detection on M7 processor which is not the intent. This patch adds code to determine if kernel is running on M7 processor and takes steps to not enable memory corruption detection in TTE erroneously. Signed-off-by: NKhalid Aziz <khalid.aziz@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 2月, 2015 1 次提交
-
-
由 Kirill A. Shutemov 提交于
LKP has triggered a compiler warning after my recent patch "mm: account pmd page tables to the process": mm/mmap.c: In function 'exit_mmap': >> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default] The code: > 2857 WARN_ON(mm_nr_pmds(mm) > 2858 round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT); In this, on tile, we have FIRST_USER_ADDRESS defined as 0. round_up() has the same type -- int. PUD_SHIFT. I think the best way to fix it is to define FIRST_USER_ADDRESS as unsigned long. On every arch for consistency. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 2月, 2015 1 次提交
-
-
由 Kirill A. Shutemov 提交于
We've replaced remap_file_pages(2) implementation with emulation. Nobody creates non-linear mapping anymore. This patch also increase number of bits availble for swap offset. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 12月, 2014 1 次提交
-
-
由 Kirill A. Shutemov 提交于
As a small zero page, huge zero page should not be accounted in smaps report as normal page. For small pages we rely on vm_normal_page() to filter out zero page, but vm_normal_page() is not designed to handle pmds. We only get here due hackish cast pmd to pte in smaps_pte_range() -- pte and pmd format is not necessary compatible on each and every architecture. Let's add separate codepath to handle pmds. follow_trans_huge_pmd() will detect huge zero page for us. We would need pmd_dirty() helper to do this properly. The patch adds it to THP-enabled architectures which don't yet have one. [akpm@linux-foundation.org: use do_div to fix 32-bit build] Signed-off-by: N"Kirill A. Shutemov" <kirill@shutemov.name> Reported-by: NFengguang Wu <fengguang.wu@intel.com> Tested-by: NFengwei Yin <yfw.kernel@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 10月, 2014 5 次提交
-
-
由 David S. Miller 提交于
swapper_low_pmd_dir and swapper_pud_dir are actually completely useless and unnecessary. We just need swapper_pg_dir[]. Naturally the other page table chunks will be allocated on an as-needed basis. Since the kernel actually accesses these tables in the PAGE_OFFSET view, there is not even a TLB locality advantage of placing them in the kernel image. Use the hard coded vmlinux.ld.S slot for swapper_pg_dir which is naturally page aligned. Increase MAX_BANKS to 1024 in order to handle heavily fragmented virtual guests. Even with this MAX_BANKS increase, the kernel is 20K+ smaller. Signed-off-by: NDavid S. Miller <davem@davemloft.net> Acked-by: NBob Picco <bob.picco@oracle.com>
-
由 David S. Miller 提交于
In order to accomodate embedded per-cpu allocation with large numbers of cpus and numa nodes, we have to use as much virtual address space as possible for the vmalloc region. Otherwise we can get things like: PERCPU: max_distance=0x380001c10000 too large for vmalloc space 0xff00000000 So, once we select a value for PAGE_OFFSET, derive the size of the vmalloc region based upon that. Signed-off-by: NDavid S. Miller <davem@davemloft.net> Acked-by: NBob Picco <bob.picco@oracle.com>
-
由 David S. Miller 提交于
Make sure, at compile time, that the kernel can properly support whatever MAX_PHYS_ADDRESS_BITS is defined to. On M7 chips, use a max_phys_bits value of 49. Based upon a patch by Bob Picco. Signed-off-by: NDavid S. Miller <davem@davemloft.net> Acked-by: NBob Picco <bob.picco@oracle.com>
-
由 David S. Miller 提交于
If max_phys_bits needs to be > 43 (f.e. for T4 chips), things like DEBUG_PAGEALLOC stop working because the 3-level page tables only can cover up to 43 bits. Another problem is that when we increased MAX_PHYS_ADDRESS_BITS up to 47, several statically allocated tables became enormous. Compounding this is that we will need to support up to 49 bits of physical addressing for M7 chips. The two tables in question are sparc64_valid_addr_bitmap and kpte_linear_bitmap. The first holds a bitmap, with 1 bit for each 4MB chunk of physical memory, indicating whether that chunk actually exists in the machine and is valid. The second table is a set of 2-bit values which tell how large of a mapping (4MB, 256MB, 2GB, 16GB, respectively) we can use at each 256MB chunk of ram in the system. These tables are huge and take up an enormous amount of the BSS section of the sparc64 kernel image. Specifically, the sparc64_valid_addr_bitmap is 4MB, and the kpte_linear_bitmap is 128K. So let's solve the space wastage and the DEBUG_PAGEALLOC problem at the same time, by using the kernel page tables (as designed) to manage this information. We have to keep using large mappings when DEBUG_PAGEALLOC is disabled, and we do this by encoding huge PMDs and PUDs. On a T4-2 with 256GB of ram the kernel page table takes up 16K with DEBUG_PAGEALLOC disabled and 256MB with it enabled. Furthermore, this memory is dynamically allocated at run time rather than coded statically into the kernel image. Signed-off-by: NDavid S. Miller <davem@davemloft.net> Acked-by: NBob Picco <bob.picco@oracle.com>
-
由 David S. Miller 提交于
This has become necessary with chips that support more than 43-bits of physical addressing. Based almost entirely upon a patch by Bob Picco. Signed-off-by: NDavid S. Miller <davem@davemloft.net> Acked-by: NBob Picco <bob.picco@oracle.com>
-
- 19 5月, 2014 1 次提交
-
-
由 Sam Ravnborg 提交于
Drop extern for all prototypes and adjust alignment of parameters as required after the removal. In a few rare cases adjust linelength to conform to maximum 80 chars, and likewise in a few rare cases adjust alignment of parameters to static functions. Signed-off-by: NSam Ravnborg <sam@ravnborg.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 5月, 2014 1 次提交
-
-
由 David S. Miller 提交于
Access to the TSB hash tables during TLB misses requires that there be an atomic 128-bit quad load available so that we fetch a matching TAG and DATA field at the same time. On cpus prior to UltraSPARC-III only virtual address based quad loads are available. UltraSPARC-III and later provide physical address based variants which are easier to use. When we only have virtual address based quad loads available this means that we have to lock the TSB into the TLB at a fixed virtual address on each cpu when it runs that process. We can't just access the PAGE_OFFSET based aliased mapping of these TSBs because we cannot take a recursive TLB miss inside of the TLB miss handler without risking running out of hardware trap levels (some trap combinations can be deep, such as those generated by register window spill and fill traps). Without huge pages it's working perfectly fine, but when the huge TSB got added another chunk of fixed virtual address space was not allocated for this second TSB mapping. So we were mapping both the 8K and 4MB TSBs to the same exact virtual address, causing multiple TLB matches which gives undefined behavior. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 5月, 2014 8 次提交
-
-
由 David S. Miller 提交于
pte_ERROR() is not used anywhere, delete it. For pgd_ERROR() and pmd_ERROR(), output something similar to x86, giving the address of the pgd/pmd as well as it's value. Also provide the caller, since these macros are invoked from pgd_clear_bad() and pmd_clear_bad() which provides little context as to what high level operation was occuring when the BAD state was detected. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
Instead of returning false we should at least check the most basic things, otherwise page table corruptions will be very difficult to debug. PMD and PTE tables are of size PAGE_SIZE, so none of the sub-PAGE_SIZE bits should be set. We also complement this with a check that the physical address the pud/pmd points to is valid memory. PowerPC was used as a guide while implementating this. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
In commit b2d43834 ("sparc64: Make PAGE_OFFSET variable."), the MAX_PHYS_ADDRESS_BITS value was increased (to 47). This constant reference to '41UL' was missed. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
When _PAGE_SPECIAL and _PAGE_PMD_HUGE were added to the mask, the comment was not updated. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
The large PMD path needs to check _PAGE_VALID not _PAGE_PRESENT, to decide if it needs to bail and return 0. pmd_large() should therefore just check _PAGE_PMD_HUGE. Calls to gup_huge_pmd() are guarded with a check of pmd_large(), so we just need to add a valid bit check. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
On sparc64 "present" and "valid" are seperate PTE bits, this allows us to naturally distinguish between the user explicitly asking for PROT_NONE with mprotect() and other situations. However we weren't handling this properly in the huge PMD paths. First of all, the page table walker in the TSB miss path only checks for _PAGE_PMD_HUGE. So the generic pmdp_invalidate() would clear _PAGE_PRESENT but the TLB miss paths would still load it into the TLB as a valid huge PMD. Fix this by clearing the valid bit in pmdp_invalidate(), and also checking the valid bit in USER_PGTABLE_CHECK_PMD_HUGE using "brgez" since _PAGE_VALID is bit 63 in both the sun4u and sun4v pte layouts. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 12月, 2013 1 次提交
-
-
由 Rik van Riel 提交于
There are a few subtle races, between change_protection_range (used by mprotect and change_prot_numa) on one side, and NUMA page migration and compaction on the other side. The basic race is that there is a time window between when the PTE gets made non-present (PROT_NONE or NUMA), and the TLB is flushed. During that time, a CPU may continue writing to the page. This is fine most of the time, however compaction or the NUMA migration code may come in, and migrate the page away. When that happens, the CPU may continue writing, through the cached translation, to what is no longer the current memory location of the process. This only affects x86, which has a somewhat optimistic pte_accessible. All other architectures appear to be safe, and will either always flush, or flush whenever there is a valid mapping, even with no permissions (SPARC). The basic race looks like this: CPU A CPU B CPU C load TLB entry make entry PTE/PMD_NUMA fault on entry read/write old page start migrating page change PTE/PMD to new page read/write old page [*] flush TLB reload TLB from new entry read/write new page lose data [*] the old page may belong to a new user at this point! The obvious fix is to flush remote TLB entries, by making sure that pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may still be accessible if there is a TLB flush pending for the mm. This should fix both NUMA migration and compaction. [mgorman@suse.de: fix build] Signed-off-by: NRik van Riel <riel@redhat.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Alex Thorlton <athorlton@sgi.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 11月, 2013 1 次提交
-
-
由 David S. Miller 提交于
Now that we have 64-bits for PMDs we can stop using special encodings for the huge PMD values, and just put real PTEs in there. We allocate a _PAGE_PMD_HUGE bit to distinguish between plain PMDs and huge ones. It is the same for both 4U and 4V PTE layouts. We also use _PAGE_SPECIAL to indicate the splitting state, since a huge PMD cannot also be special. All of the PMD --> PTE translation code disappears, and most of the huge PMD bit modifications and tests just degenerate into the PTE operations. In particular USER_PGTABLE_CHECK_PMD_HUGE becomes trivial. As a side effect, normal PMDs don't shift the physical address around. This also speeds up the page table walks in the TLB miss paths since they don't have to do the shifts any more. Another non-trivial aspect is that pte_modify() has to be changed to preserve the _PAGE_PMD_HUGE bits as well as the page size field of the pte. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 11月, 2013 2 次提交
-
-
由 David S. Miller 提交于
To make the page tables compact, we were using 32-bit PGDs and PMDs. We only had to support <= 43 bits of physical addresses so this was quite feasible. In order to support larger physical addresses we have to move to 64-bit PGDs and PMDs. Most of the changes are straight-forward: 1) {pgd,pmd}_t --> unsigned long 2) Anything that tries to use plain "unsigned int" types with pgd/pmd values needs to be adjusted. In particular things like "0U" become "0UL". 3) {PGDIR,PMD}_BITS decrease by one. 4) In the assembler page table walkers, use "ldxa" instead of "lduwa" and adjust the low bit masks to clear out the low 3 bits instead of just the low 2 bits during pgd/pmd address formation. Also, use PTRS_PER_PGD and PTRS_PER_PMD in the sizing of the swapper_{pg_dir,low_pmd_dir} arrays. This patch does not try to take advantage of having 64-bits in the PMDs to simplify the hugepage code, that will come in a subsequent change. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
The impetus for this is that we would like to move to 64-bit PMDs and PGDs, but that would result in only supporting a 42-bit address space with the current page table layout. It'd be nice to support at least 43-bits. The reason we'd end up with only 42-bits after making PMDs and PGDs 64-bit is that we only use half-page sized PTE tables in order to make PMDs line up to 4MB, the hardware huge page size we use. So what we do here is we make huge pages 8MB, and fabricate them using 4MB hw TLB entries. Facilitate this by providing a "REAL_HPAGE_SHIFT" which is used in places that really need to operate on hardware 4MB pages. Use full pages (512 entries) for PTE tables, and adjust PMD_SHIFT, PGD_SHIFT, and the build time CPP test as needed. Use a CPP test to make sure REAL_HPAGE_SHIFT and the _PAGE_SZHUGE_* we use match up. This makes the pgtable cache completely unused, so remove the code managing it and the state used in mm_context_t. Now we have less spinlocks taken in the page table allocation path. The technique we use to fabricate the 8MB pages is to transfer bit 22 from the missing virtual address into the PTEs physical address field. That takes care of the transparent huge pages case. For hugetlb, we fill things in at the PTE level and that code already puts the sub huge page physical bits into the PTEs, based upon the offset, so there is nothing special we need to do. It all just works out. So, a small amount of complexity in the THP case, but this code is about to get much simpler when we move the 64-bit PMDs as we can move away from the fancy 32-bit huge PMD encoding and just put a real PTE value in there. With bug fixes and help from Bob Picco. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 6月, 2013 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 20 6月, 2013 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
This will be later used by powerpc THP support. In powerpc we want to use pgtable for storing the hash index values. So instead of adding them to mm_context list, we would like to store them in the second half of pmd Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
-
- 20 4月, 2013 1 次提交
-
-
由 David S. Miller 提交于
As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses. Fix this by using generic infrastructure to synchonize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE(). We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously. 1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations. 2) Do TLB batch cross calls instead via: smp_call_function_many() tlb_pending_func() __flush_tlb_pending() 3) Batch only in lazy mmu sequences: a) Add 'active' member to struct tlb_batch b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE c) Set 'active' in arch_enter_lazy_mmu_mode() d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode() e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear. 4) Add infrastructure for synchronous TLB page flushes. a) Implement __flush_tlb_page and per-cpu variants, patch as needed. b) Likewise for xcall_flush_tlb_page. c) Implement smp_flush_tlb_page() to invoke the cross-call. d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP 5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completeion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch perform a completely asynchronous global_flush_tlb_page() instead. Reported-by: NDave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Acked-by: NDave Kleikamp <dave.kleikamp@oracle.com>
-
- 14 2月, 2013 1 次提交
-
-
由 David S. Miller 提交于
Mostly mirrors the s390 logic, as unlike x86 we don't need the SetPageReferenced() bits. On sparc64 we also lack a user/privileged bit in the huge PMDs. In order to make this work for THP and non-THP builds, some header file adjustments were necessary. Namely, provide the PMD_HUGE_* bit defines and the pmd_large() inline unconditionally rather than protected by TRANSPARENT_HUGEPAGE. Reported-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 12月, 2012 1 次提交
-
-
由 David S. Miller 提交于
We can elide flush_tlb_*() calls when _PAGE_VALID is clear as that is the test used to determine whether or not to queue up a TLB flush in set_pte_at(). Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-