提交 · d08cd5e0eb632eef7819ba911c09d89a767f2d0c · openanolis / cloud-kernel

30 11月, 2017 1 次提交

device-dax: implement ->split() to catch invalid munmap attempts · 9702cffd

由 Dan Williams 提交于 11月 29, 2017

Similar to how device-dax enforces that the 'address', 'offset', and
'len' parameters to mmap() be aligned to the device's fundamental
alignment, the same constraints apply to munmap().  Implement ->split()
to fail munmap calls that violate the alignment constraint.

Otherwise, we later fail VM_BUG_ON checks in the unmap_page_range() path
with crash signatures of the form:

    vma ffff8800b60c8a88 start 00007f88c0000000 end 00007f88c0e00000
    next           (null) prev           (null) mm ffff8800b61150c0
    prot 8000000000000027 anon_vma           (null) vm_ops ffffffffa0091240
    pgoff 0 file ffff8800b638ef80 private_data           (null)
    flags: 0x380000fb(read|write|shared|mayread|maywrite|mayexec|mayshare|softdirty|mixedmap|hugepage)
    ------------[ cut here ]------------
    kernel BUG at mm/huge_memory.c:2014!
    [..]
    RIP: 0010:__split_huge_pud+0x12a/0x180
    [..]
    Call Trace:
     unmap_page_range+0x245/0xa40
     ? __vma_adjust+0x301/0x990
     unmap_vmas+0x4c/0xa0
     unmap_region+0xae/0x120
     ? __vma_rb_erase+0x11a/0x230
     do_munmap+0x276/0x410
     vm_munmap+0x6a/0xa0
     SyS_munmap+0x1d/0x30

Link: http://lkml.kernel.org/r/151130418681.4029.7118245855057952010.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: dee41079 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Reported-by: NJeff Moyer <jmoyer@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9702cffd

20 10月, 2017 1 次提交

dev/dax: fix uninitialized variable build warning · 0a3ff786

由 Ross Zwisler 提交于 10月 18, 2017

Fix this build warning:

warning: 'phys' may be used uninitialized in this function
[-Wuninitialized]

As reported here:

https://lkml.org/lkml/2017/10/16/152
http://kisskb.ellerman.id.au/kisskb/buildresult/13181373/log/Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

0a3ff786

19 7月, 2017 1 次提交

device-dax: fix sysfs duplicate warnings · bbb3be17

由 Dan Williams 提交于 7月 18, 2017

Fix warnings of the form...

     WARNING: CPU: 10 PID: 4983 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
     sysfs: cannot create duplicate filename '/class/dax/dax12.0'
     Call Trace:
      dump_stack+0x63/0x86
      __warn+0xcb/0xf0
      warn_slowpath_fmt+0x5a/0x80
      ? kernfs_path_from_node+0x4f/0x60
      sysfs_warn_dup+0x62/0x80
      sysfs_do_create_link_sd.isra.2+0x97/0xb0
      sysfs_create_link+0x25/0x40
      device_add+0x266/0x630
      devm_create_dax_dev+0x2cf/0x340 [dax]
      dax_pmem_probe+0x1f5/0x26e [dax_pmem]
      nvdimm_bus_probe+0x71/0x120

...by reusing the namespace id for the device-dax instance name.

Now that we have decided that there will never by more than one
device-dax instance per libnvdimm-namespace parent device [1], we can
directly reuse the namepace ids. There are some possible follow-on
cleanups, but those are saved for a later patch to simplify the -stable
backport.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-December/008266.html

Fixes: 98a29c39 ("libnvdimm, namespace: allow creation of multiple pmem...")
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@vger.kernel.org>
Reported-by: NDariusz Dokupil <dariusz.dokupil@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

bbb3be17

18 7月, 2017 1 次提交

device-dax: fix 'passing zero to ERR_PTR()' warning · 43fe51e1

由 Dan Williams 提交于 7月 12, 2017

Dan Carpenter reports:

    The patch 7b6be844: "dax: refactor dax-fs into a generic provider
    of 'struct dax_device' instances" from Apr 11, 2017, leads to the
    following static checker warning:

        drivers/dax/device.c:643 devm_create_dev_dax()
        warn: passing zero to 'ERR_PTR'

Fix the case where we inadvertently leak 0 to ERR_PTR() by setting at
every error case, and make it clear that 'count' is never 0.
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

43fe51e1

06 7月, 2017 1 次提交

fs: new infrastructure for writeback error handling and reporting · 5660e13d

由 Jeff Layton 提交于 7月 06, 2017

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Reviewed-by: NJan Kara <jack@suse.cz>

5660e13d

20 4月, 2017 2 次提交

dax: introduce dax_operations · 6568b08b

由 Dan Williams 提交于 1月 24, 2017

Track a set of dax_operations per dax_device that can be set at
alloc_dax() time. These operations will be used to stop the abuse of
block_device_operations for communicating dax capabilities to
filesystems. It will also be used to replace the "pmem api" and move
pmem-specific cache maintenance, and other dax-driver-specific
filesystem-dax operations, to dax device methods. In particular this
allows us to stop abusing __copy_user_nocache(), via memcpy_to_pmem(),
with a driver specific replacement.

This is a standalone introduction of the operations. Follow on patches
convert each dax-driver and teach fs/dax.c to use ->direct_access() from
dax_operations instead of block_device_operations.
Suggested-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

6568b08b

dax: add a facility to lookup a dax device by 'host' device name · 72058005

由 Dan Williams 提交于 4月 19, 2017

For the current block_device based filesystem-dax path, we need a way
for it to lookup the dax_device associated with a block_device. Add a
'host' property of a dax_device that can be used for this purpose. It is
a free form string, but for a dax_device associated with a block device
it is the bdev name.

This is a stop-gap until filesystems are able to mount on a dax-inode
directly.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

72058005

13 4月, 2017 6 次提交

dax: refactor dax-fs into a generic provider of 'struct dax_device' instances · 7b6be844

由 Dan Williams 提交于 4月 11, 2017

We want dax capable drivers to be able to publish a set of dax
operations [1]. However, we do not want to further abuse block_devices
to advertise these operations. Instead we will attach these operations
to a dax device and add a lookup mechanism to go from block device path
to a dax device. A dax capable driver like pmem or brd is responsible
for registering a dax device, alongside a block device, and then a dax
capable filesystem is responsible for retrieving the dax device by path
name if it wants to call dax_operations.

For now, we refactor the dax pseudo-fs to be a generic facility, rather
than an implementation detail, of the device-dax use case. Where a "dax
device" is just an inode + dax infrastructure, and "Device DAX" is a
mapping service layered on top of that base 'struct dax_device'.
"Filesystem DAX" is then a mapping service that layers a filesystem on
top of that same base device. Filesystem DAX is associated with a
block_device for now, but perhaps directly to a dax device in the
future, or for new pmem-only filesystems.

[1]: https://lkml.org/lkml/2017/1/19/880Suggested-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

7b6be844

device-dax: rename 'dax_dev' to 'dev_dax' · 5f0694b3

由 Dan Williams 提交于 1月 30, 2017

In preparation for introducing a struct dax_device type to the kernel
global type namespace, rename dax_dev to dev_dax. A 'dax_device'
instance will be a generic device-driver object for any provider of dax
functionality. A 'dev_dax' object is a device-dax-driver local /
internal instance.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

5f0694b3

device-dax: improve fault handler debug output · 76202620

由 Oliver O'Halloran 提交于 4月 12, 2017

A couple of minor improvements to the debug output in the fault handlers:

a) Print the region alignment and fault size when we sent a SIGBUS
because the region alignment is greater than the fault size.
b) Fix the message in the PFN_{DEV|MAP} check.
c) Additionally print the fault size enum value in the huge fault handler.
Signed-off-by: NOliver O'Halloran <oohall@gmail.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

76202620

device-dax: fix dax_dev_huge_fault() unknown fault size handling · 54eafcc9

由 Pushkar Jambhlekar 提交于 4月 11, 2017

The default case for dax_dev_huge_fault() fault size handling mistakenly
returns when it should unlock. This is not a problem in practice since
the only three possible fault sizes are handled. Going forward, if the
core mm adds a new fault size beyond pte, pmd, or pud device-dax should
abort VM_FAULT_SIGBUS requests not VM_FAULT_FALLBACK since device-dax
guarantees a configured fault granularity for all faults.
Signed-off-by: NPushkar Jambhlekar <pushkar.iit@gmail.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

54eafcc9

device-dax, tools/testing/nvdimm: enable device-dax with mock resources · efebc711

由 Dave Jiang 提交于 4月 07, 2017

Provide a replacement pgoff_to_phys() that translates an nfit_test
resource (allocated by vmalloc()) to a pfn.
Signed-off-by: NDave Jiang <dave.jiang@intel.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

efebc711

device-dax: switch to srcu, fix rcu_read_lock() vs pte allocation · 956a4cd2

由 Dan Williams 提交于 4月 07, 2017

The following warning triggers with a new unit test that stresses the
device-dax interface.

 ===============================
 [ ERR: suspicious RCU usage.  ]
 4.11.0-rc4+ #1049 Tainted: G           O
 -------------------------------
 ./include/linux/rcupdate.h:521 Illegal context switch in RCU read-side critical section!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 0
 2 locks held by fio/9070:
  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8d0739d7>] __do_page_fault+0x167/0x4f0
  #1:  (rcu_read_lock){......}, at: [<ffffffffc03fbd02>] dax_dev_huge_fault+0x32/0x620 [dax]

 Call Trace:
  dump_stack+0x86/0xc3
  lockdep_rcu_suspicious+0xd7/0x110
  ___might_sleep+0xac/0x250
  __might_sleep+0x4a/0x80
  __alloc_pages_nodemask+0x23a/0x360
  alloc_pages_current+0xa1/0x1f0
  pte_alloc_one+0x17/0x80
  __pte_alloc+0x1e/0x120
  __get_locked_pte+0x1bf/0x1d0
  insert_pfn.isra.70+0x3a/0x100
  ? lookup_memtype+0xa6/0xd0
  vm_insert_mixed+0x64/0x90
  dax_dev_huge_fault+0x520/0x620 [dax]
  ? dax_dev_huge_fault+0x32/0x620 [dax]
  dax_dev_fault+0x10/0x20 [dax]
  __do_fault+0x1e/0x140
  __handle_mm_fault+0x9af/0x10d0
  handle_mm_fault+0x16d/0x370
  ? handle_mm_fault+0x47/0x370
  __do_page_fault+0x28c/0x4f0
  trace_do_page_fault+0x58/0x2a0
  do_async_page_fault+0x1a/0xa0
  async_page_fault+0x28/0x30

Inserting a page table entry may trigger an allocation while we are
holding a read lock to keep the device instance alive for the duration
of the fault. Use srcu for this keep-alive protection.

Fixes: dee41079 ("/dev/dax, core: file operations and dax-mmap")
Cc: <stable@vger.kernel.org>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

956a4cd2

21 3月, 2017 2 次提交

device-dax: utilize new cdev_device_add helper function · 92a3fa07

由 Logan Gunthorpe 提交于 3月 17, 2017

Replace the open coded registration of the cdev and dev with the
new device_add_cdev() helper. The helper replaces a common pattern by
taking the proper reference against the parent device and adding both
the cdev and the device.
Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
Reviewed-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

92a3fa07

device-dax: fix cdev leak · ed01e50a

由 Dan Williams 提交于 3月 17, 2017

If device_add() fails, cleanup the cdev. Otherwise, we leak a kobj_map()
with a stale device number.

As Jason points out, there is a small possibility that userspace has
opened and mapped the device in the time between cdev_add() and the
device_add() failure. We need a new kill_dax_dev() helper to invalidate
any established mappings.

Fixes: ba09c01d ("dax: convert to the cdev api")
Cc: <stable@vger.kernel.org>
Reported-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

ed01e50a

11 3月, 2017 3 次提交

device-dax: fix debug output typo · 52084f89

由 Dave Jiang 提交于 3月 09, 2017

The debug output for return the return data of pgoff_to_phys() in the
fault handlers has 'phys' and 'pgoff' incorrectly swapped.
Reported-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NDave Jiang <dave.jiang@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

52084f89

device-dax: fix pud fault fallback handling · 70b085b0

由 Dave Jiang 提交于 3月 10, 2017

Jeff Moyer reports:

    With a device dax alignment of 4KB or 2MB, I get sigbus when running
    the attached fio job file for the current kernel (4.11.0-rc1+).  If
    I specify an alignment of 1GB, it works.

    I turned on debug output, and saw that it was failing in the huge
    fault code.

     dax dax1.0: dax_open
     dax dax1.0: dax_mmap
     dax dax1.0: dax_dev_huge_fault: fio: write (0x7f08f0a00000 -
     dax dax1.0: __dax_dev_pud_fault: phys_to_pgoff(0xffffffffcf60)
     dax dax1.0: dax_release

    fio config for reproduce:
    [global]
    ioengine=dev-dax
    direct=0
    filename=/dev/dax0.0
    bs=2m

    [write]
    rw=write

    [read]
    stonewall
    rw=read

The driver fails to fallback when taking a fault that is larger than
the device alignment, or handling a larger fault when a smaller
mapping is already established. While we could support larger
mappings for a device with a smaller alignment, that change is
too large for the immediate fix. The simplest change is to force
fallback until the fault size matches the alignment.
Reported-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NDave Jiang <dave.jiang@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

70b085b0

device-dax: fix pmd/pte fault fallback handling · 0134ed4f

由 Dave Jiang 提交于 3月 10, 2017

Jeff Moyer reports:

    With a device dax alignment of 4KB or 2MB, I get sigbus when running
    the attached fio job file for the current kernel (4.11.0-rc1+).  If
    I specify an alignment of 1GB, it works.

    I turned on debug output, and saw that it was failing in the huge
    fault code.

     dax dax1.0: dax_open
     dax dax1.0: dax_mmap
     dax dax1.0: dax_dev_huge_fault: fio: write (0x7f08f0a00000 -
     dax dax1.0: __dax_dev_pud_fault: phys_to_pgoff(0xffffffffcf60
     dax dax1.0: dax_release

    fio config for reproduce:
    [global]
    ioengine=dev-dax
    direct=0
    filename=/dev/dax0.0
    bs=2m

    [write]
    rw=write

    [read]
    stonewall
    rw=read

The driver fails to fallback when taking a fault that is larger than
the device alignment, or handling a larger fault when a smaller
mapping is already established. While we could support larger
mappings for a device with a smaller alignment, that change is
too large for the immediate fix. The simplest change is to force
fallback until the fault size matches the alignment.

Fixes: dee41079 ("/dev/dax, core: file operations and dax-mmap")
Cc: <stable@vger.kernel.org>
Reported-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NDave Jiang <dave.jiang@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

0134ed4f

02 3月, 2017 1 次提交

sched/headers: Prepare to remove the <linux/magic.h> include from <linux/sched/task_stack.h> · 50d34394

由 Ingo Molnar 提交于 2月 05, 2017

Update files that depend on the magic.h inclusion.
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NIngo Molnar <mingo@kernel.org>

50d34394

25 2月, 2017 4 次提交

mm: replace FAULT_FLAG_SIZE with parameter to huge_fault · c791ace1

由 Dave Jiang 提交于 2月 24, 2017

Since the introduction of FAULT_FLAG_SIZE to the vm_fault flag, it has
been somewhat painful with getting the flags set and removed at the
correct locations.  More than one kernel oops was introduced due to
difficulties of getting the placement correctly.

Remove the flag values and introduce an input parameter to huge_fault
that indicates the size of the page entry.  This makes the code easier
to trace and should avoid the issues we see with the fault flags where
removal of the flag was necessary in the fallback paths.

Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com>
Tested-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c791ace1

dax: support for transparent PUD pages for device DAX · 9557feee

由 Dave Jiang 提交于 2月 24, 2017

Add transparent huge PUD pages support for device DAX by adding a
pud_fault handler.

Link: http://lkml.kernel.org/r/148545060002.17912.6765687780007547551.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9557feee

mm,fs,dax: change ->pmd_fault to ->huge_fault · a2d58167

由 Dave Jiang 提交于 2月 24, 2017

Patch series "1G transparent hugepage support for device dax", v2.

The following series implements support for 1G trasparent hugepage on
x86 for device dax.  The bulk of the code was written by Mathew Wilcox a
while back supporting transparent 1G hugepage for fs DAX.  I have
forward ported the relevant bits to 4.10-rc.  The current submission has
only the necessary code to support device DAX.

Comments from Dan Williams: So the motivation and intended user of this
functionality mirrors the motivation and users of 1GB page support in
hugetlbfs.  Given expected capacities of persistent memory devices an
in-memory database may want to reduce tlb pressure beyond what they can
already achieve with 2MB mappings of a device-dax file.  We have
customer feedback to that effect as Willy mentioned in his previous
version of these patches [1].

[1]: https://lkml.org/lkml/2016/1/31/52

Comments from Nilesh @ Oracle:

There are applications which have a process model; and if you assume
10,000 processes attempting to mmap all the 6TB memory available on a
server; we are looking at the following:

processes         : 10,000
memory            :    6TB
pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB
pmd @ 2M page size: 120,000 / 512 = ~240GB
pud @ 1G page size: 240GB / 512 = ~480MB

As you can see with 2M pages, this system will use up an exorbitant
amount of DRAM to hold the page tables; but the 1G pages finally brings
it down to a reasonable level.  Memory sizes will keep increasing; so
this number will keep increasing.

An argument can be made to convert the applications from process model
to thread model, but in the real world that may not be always practical.
Hopefully this helps explain the use case where this is valuable.

This patch (of 3):

In preparation for adding the ability to handle PUD pages, convert
vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault.  The
vm_fault structure is extended to include a union of the different page
table pointers that may be needed, and three flag bits are reserved to
indicate which type of pointer is in the union.

[ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
  Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
[dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
  Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: NDave Jiang <dave.jiang@intel.com>
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a2d58167

mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf · 11bac800

由 Dave Jiang 提交于 2月 24, 2017

->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.

Remove the vma parameter to simplify things.

[arnd@arndb.de: fix ARM build]
  Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

11bac800

23 2月, 2017 2 次提交

mm, dax: change pmd_fault() to take only vmf parameter · f4200391

由 Dave Jiang 提交于 2月 22, 2017

pmd_fault() and related functions really only need the vmf parameter since
the additional parameters are all included in the vmf struct.  Remove the
additional parameter and simplify pmd_fault() and friends.

Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com>
Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f4200391

mm, dax: make pmd_fault() and friends be the same as fault() · d8a849e1

由 Dave Jiang 提交于 2月 22, 2017

Instead of passing in multiple parameters in the pmd_fault() handler,
a vmf can be passed in just like a fault() handler. This will simplify
code and remove the need for the actual pmd fault handlers to allocate a
vmf. Related functions are also modified to do the same.

[dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off]
  Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com>
Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d8a849e1

18 12月, 2016 1 次提交

dax: add region 'id', 'size', and 'align' attributes · d7fe1a67

由 Dan Williams 提交于 12月 17, 2016

While this information is available by looking at the nvdimm parent
device that may not always be the case when/if we add support for other
memory regions. Tooling should not depend on walking a given ancestor
topology that is not guaranteed by the device's class. For example, a
device-dax instance will always have a dax_region parent, but it may not
always have a libnvdimm "dax" device as a grandparent.
Reported-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

d7fe1a67

15 12月, 2016 1 次提交

mm: use vmf->address instead of of vmf->virtual_address · 1a29d85e

由 Jan Kara 提交于 12月 14, 2016

Every single user of vmf->virtual_address typed that entry to unsigned
long before doing anything with it so the type of virtual_address does
not really provide us any additional safety. Just use masked
vmf->address which already has the appropriate type.

Link: http://lkml.kernel.org/r/1479460644-25076-3-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1a29d85e

07 12月, 2016 1 次提交

device-dax: fix private mapping restriction, permit read-only · 325896ff

由 Dan Williams 提交于 12月 06, 2016

Hugh notes in response to commit 4cb19355 "device-dax: fail all
private mapping attempts":

  "I think that is more restrictive than you intended: haven't tried, but I
  believe it rejects a PROT_READ, MAP_SHARED, O_RDONLY fd mmap, leaving no
  way to mmap /dev/dax without write permission to it."

Indeed it does restrict read-only mappings, switch to checking
VM_MAYSHARE, not VM_SHARED.

Cc: <stable@vger.kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Pawel Lebioda <pawel.lebioda@intel.com>
Fixes: 4cb19355 ("device-dax: fail all private mapping attempts")
Reported-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

325896ff

17 11月, 2016 1 次提交

device-dax: fail all private mapping attempts · 4cb19355

由 Dan Williams 提交于 11月 16, 2016

The device-dax implementation originally tried to be tricky and allow
private read-only mappings, but in the process allowed writable
MAP_PRIVATE + MAP_NORESERVE mappings.  For simplicity and predictability
just fail all private mapping attempts since device-dax memory is
statically allocated and will never support overcommit.

Cc: <stable@vger.kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Fixes: dee41079 ("/dev/dax, core: file operations and dax-mmap")
Reported-by: NPawel Lebioda <pawel.lebioda@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

4cb19355

08 10月, 2016 2 次提交

dax: use correct dev_t value · bc0a0fe9

由 Arnd Bergmann 提交于 9月 08, 2016

The dev_t variable in devm_create_dax_dev() is used before it's
first set:

drivers/dax/dax.c: In function 'devm_create_dax_dev':
drivers/dax/dax.c:205:39: error: 'dev_t' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  inode = iget5_locked(dax_superblock, hash_32(devt + DAXFS_MAGIC, 31),
                                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/dax/dax.c:688:8: note: 'dev_t' was declared here

This reorders the code to how it looks correct to me.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Fixes: 3bc52c45 ("dax: define a unified inode/address_space for device-dax mappings")
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

bc0a0fe9

dax: convert devm_create_dax_dev to PTR_ERR · d76911ee

由 Dan Williams 提交于 7月 19, 2016

For sub-division support we need access to the dax_dev created by
devm_create_dax_dev().
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

d76911ee

04 9月, 2016 1 次提交

dax: fix mapping size check · 4c3cb6e9

由 Dan Williams 提交于 9月 03, 2016

pgoff_to_phys() validates that both the starting address and the length
of the mapping against the resource list. We need to check for a
mapping size of PMD_SIZE not PAGE_SIZE in the pmd fault path.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

4c3cb6e9

24 8月, 2016 8 次提交

dax: check resource alignment at dax region/device create · 9d2d01a0

由 Dan Williams 提交于 7月 19, 2016

All the extents of a dax-device must match the alignment of the region.
Otherwise, we are unable to guarantee fault semantics of a given page
size. The region must be self-consistent itself as well.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

9d2d01a0

dax: unmap/truncate on device shutdown · 9dc1e492

由 Dan Williams 提交于 8月 04, 2016

Invalidate all mappings of a device-dax instance when the device is
unregistered.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

9dc1e492

dax: define a unified inode/address_space for device-dax mappings · 3bc52c45

由 Dan Williams 提交于 7月 24, 2016

In support of enabling resize / truncate of device-dax instances, define
a pseudo-fs to provide a unified inode/address space for vm operations.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

3bc52c45

dax: convert to the cdev api · ba09c01d

由 Dan Williams 提交于 7月 24, 2016

A goal of the device-DAX interface is to be able to support many
exclusive allocations (partitions) of performance / feature
differentiated memory.  This count may exceed the default minors limit
of 256.

As a result of switching to an embedded cdev the inode-to-dax_dev
conversion is simplified, as well as reference counting which can switch
to the cdev kobject lifetime.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

ba09c01d

dax: embed a struct device in dax_dev · ebd84d72

由 Dan Williams 提交于 8月 11, 2016

The kref in dax_dev can be made redundant if the final put_device() on
the device associated with the dax_dev frees the dax_dev. This can be
accomplished by embedding a struct device in struct dax_dev, open coding
device_create() and specifying a custom release method.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

ebd84d72

dax: rename fops from dax_dev_ to dax_ · af69f51e

由 Dan Williams 提交于 8月 11, 2016

Shorten the prefix of the file operations to distinguish them from
operations on the struct device associated with the dax_dev.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

af69f51e

dax: reorder dax_fops function definitions · 043a9255

由 Dan Williams 提交于 8月 07, 2016

In order to convert devm_create_dax_dev() to use cdev, it will need
access to dax_fops. Move dax_fops and related function definitions
before devm_create_dax_dev().
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

043a9255

dax: cleanup needlessly global symbol warnings · ccdb07f6

由 Dan Williams 提交于 8月 06, 2016

drivers/dax/dax.c:75:6: warning: symbol 'dax_region_put' was not declared.
drivers/dax/dax.c:95:19: warning: symbol 'alloc_dax_region' was not declared.
drivers/dax/dax.c:173:5: warning: symbol 'devm_create_dax_dev' was not declared.
drivers/dax/pmem.c:27:17: warning: symbol 'to_dax_pmem' was not declared.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

ccdb07f6

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功