1. 09 Jun, 2023 1 commit
  2. 16 Nov, 2022 1 commit
  3. 28 Jul, 2022 3 commits
  4. 28 Jan, 2022 1 commit
  5. 19 Jan, 2022 1 commit
    • mm/dynamic_hugetlb: establish the dynamic hugetlb feature framework · a8a836a3
      Liu Shixin authored
      hulk inclusion
      category: feature
      bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
      CVE: NA
      
      --------------------------------
      
      Dynamic hugetlb is a self-developed feature built on hugetlb and memcontrol.
      It supports splitting huge pages dynamically within a memory cgroup. A new structure,
      dhugetlb_pool, is added to every mem_cgroup to manage the pages configured for it.
      For a mem_cgroup configured with a dhugetlb_pool, processes in the mem_cgroup will
      preferentially use the pages in the dhugetlb_pool.
      
      Dynamic hugetlb supports three page sizes: 1G and 2M huge pages, and 4K pages.
      For a mem_cgroup configured with a dhugetlb_pool, processes can allocate
      1G/2M huge pages only from the dhugetlb_pool. There is no such constraint for 4K pages:
      if the dhugetlb_pool runs short of 4K pages, they can also be allocated from the
      buddy system. Users must therefore know how many huge pages they need before
      using dynamic hugetlb.
      
      Usage:
      1. Add 'dynamic_hugetlb=on' to the kernel command line to enable the dynamic hugetlb feature.
      2. Preallocate some 1G hugepages through hugetlb.
      3. Create a mem_cgroup and configure a dhugetlb_pool for it.
      4. Configure the count of 1G/2M hugepages; the remaining pages in the dhugetlb_pool will
         be used as basic pages.
      5. Bind a process to the mem_cgroup. Its memory will then be allocated from the dhugetlb_pool.
      
      This patch adds the dhugetlb_pool structure for the dynamic hugetlb feature,
      the 'dhugetlb.nr_pages' interface in mem_cgroup to configure the dhugetlb_pool, and the
      'dynamic_hugetlb=on' command-line option to enable the feature.
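
      The accounting implied by usage steps 2 to 4 can be sketched in userspace C. This is
      an illustrative model only, not code from the patch: basic_pages_left() and its
      parameters are hypothetical names, and the sketch assumes that the pool holds whole
      1G pages, that 2M pages are carved out of pool 1G pages, and that 1G = 512 x 2M and
      2M = 512 x 4K.

```c
#include <stdint.h>

#define PAGES_2M_PER_1G 512ULL /* 1 GiB / 2 MiB */
#define PAGES_4K_PER_2M 512ULL /* 2 MiB / 4 KiB */

/*
 * Hypothetical helper (not part of the kernel patch): for a pool of
 * pool_1g 1G pages, with nr_1g kept as 1G huge pages and nr_2m reserved
 * as 2M huge pages, return how many 4K basic pages remain.
 * Returns -1 if the requested huge pages do not fit in the pool.
 */
int64_t basic_pages_left(uint64_t pool_1g, uint64_t nr_1g, uint64_t nr_2m)
{
	uint64_t total_2m;

	if (nr_1g > pool_1g)
		return -1;

	/* Everything not kept as a 1G page can be split into 2M pages. */
	total_2m = (pool_1g - nr_1g) * PAGES_2M_PER_1G;
	if (nr_2m > total_2m)
		return -1;

	/* Whatever is left after the 2M reservation becomes 4K pages. */
	return (int64_t)((total_2m - nr_2m) * PAGES_4K_PER_2M);
}
```

      Under this model, a pool of four 1G pages with one kept as a 1G page and 512 reserved
      as 2M pages leaves 524288 basic pages.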

      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  6. 02 Sep, 2021 1 commit
  7. 14 Jul, 2021 3 commits
  8. 09 Apr, 2021 2 commits
  9. 09 Mar, 2021 2 commits
  10. 08 Aug, 2020 1 commit
  11. 14 Jun, 2020 1 commit
    • treewide: replace '---help---' in Kconfig files with 'help' · a7f7f624
      Masahiro Yamada authored
      Since commit 84af7a61 ("checkpatch: kconfig: prefer 'help' over
      '---help---'"), the number of '---help---' has been gradually
      decreasing, but there are still more than 2400 instances.
      
      This commit finishes the conversion. While I touched the lines,
      I also fixed the indentation.
      
      There are a variety of indentation styles found.
      
        a) 4 spaces + '---help---'
        b) 7 spaces + '---help---'
        c) 8 spaces + '---help---'
        d) 1 space + 1 tab + '---help---'
        e) 1 tab + '---help---'    (correct indentation)
        f) 1 tab + 1 space + '---help---'
        g) 1 tab + 2 spaces + '---help---'
      
      In order to convert all of them to 1 tab + 'help', I ran the
      following command:
      
        $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
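
      The effect of that sed expression on a single line can be sketched in C. This is
      only an illustration: convert_help_line() is a name made up here, and the sketch
      assumes the input line carries no trailing newline and that the output buffer is
      large enough.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Mimic sed 's/^[[:space:]]*---help---/\thelp/' for one line.
 * Returns 1 if the line was rewritten, 0 if it was copied unchanged.
 * convert_help_line() is a hypothetical name used for this sketch.
 */
int convert_help_line(const char *in, char *out, size_t outsz)
{
	static const char marker[] = "---help---";
	const char *p = in;

	/* Skip any mix of leading spaces and tabs (styles a to g). */
	while (*p && isspace((unsigned char)*p))
		p++;

	if (strncmp(p, marker, sizeof(marker) - 1) != 0) {
		snprintf(out, outsz, "%s", in);
		return 0;
	}

	/* Emit exactly one tab + "help", keeping whatever followed the marker. */
	snprintf(out, outsz, "\thelp%s", p + sizeof(marker) - 1);
	return 1;
}
```

      All seven indentation styles listed above end up as one tab followed by 'help',
      while lines without the marker pass through untouched.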

      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
  12. 21 Apr, 2020 1 commit
  13. 06 Mar, 2020 1 commit
  14. 09 Feb, 2020 1 commit
  15. 07 Feb, 2020 1 commit
    • fs: New zonefs file system · 8dcc1a9d
      Damien Le Moal authored
      zonefs is a very simple file system exposing each zone of a zoned block
      device as a file. Unlike a regular file system with zoned block device
      support (e.g. f2fs), zonefs does not hide the sequential write
      constraint of zoned block devices from the user. Files representing
      sequential write zones of the device must be written sequentially
      starting from the end of the file (append only writes).
      
      As such, zonefs is in essence closer to a raw block device access
      interface than to a full featured POSIX file system. The goal of zonefs
      is to simplify the implementation of zoned block device support in
      applications by replacing raw block device file accesses with a richer
      file API, avoiding relying on direct block device file ioctls which may
      be more obscure to developers. One example of this approach is the
      implementation of LSM (log-structured merge) tree structures (such as
      used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
      to be stored in a zone file similarly to a regular file system rather
      than as a range of sectors of a zoned device. The introduction of the
      higher level construct "one file is one zone" can help reduce the
      amount of changes needed in the application and make it easier to
      support different application programming languages.
      
      Zonefs on-disk metadata is reduced to an immutable super block to
      persistently store a magic number and optional feature flags and
      values. On mount, zonefs uses blkdev_report_zones() to obtain the device
      zone configuration and populates the mount point with a static file tree
      solely based on this information. E.g. file sizes come from the device
      zone type and write pointer offset managed by the device itself.
      
      The zone files created on mount have the following characteristics.
      1) Files representing zones of the same type are grouped together
         under a common sub-directory:
           * For conventional zones, the sub-directory "cnv" is used.
           * For sequential write zones, the sub-directory "seq" is used.
         These two directories are the only directories that exist in zonefs.
         Users cannot create other directories and cannot rename or delete
         the "cnv" and "seq" sub-directories.
      2) The name of zone files is the number of the file within the zone
         type sub-directory, in order of increasing zone start sector.
      3) The size of conventional zone files is fixed to the device zone size.
         Conventional zone files cannot be truncated.
      4) The size of sequential zone files represents the file's zone write
         pointer position relative to the zone start sector. Truncating these
         files is allowed only down to 0, in which case, the zone is reset to
         rewind the zone write pointer position to the start of the zone, or
         up to the zone size, in which case the file's zone is transitioned
         to the FULL state (finish zone operation).
      5) Read and write operations on files are not allowed beyond the
         file zone size. Any access exceeding the zone size fails with
         the -EFBIG error.
      6) Creating, deleting, renaming or modifying any attribute of files and
         sub-directories is not allowed.
      7) There are no restrictions on the type of read and write operations
         that can be issued to conventional zone files. Buffered, direct and
         mmap read & write operations are accepted. For sequential zone files,
         there are no restrictions on read operations, but all write
         operations must be direct IO append writes. mmap write of sequential
         files is not allowed.
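
      Rules 4, 5 and 7 can be condensed into two small checks. The helpers below are
      hypothetical userspace sketches, not zonefs code: the names are invented, direct-IO
      alignment is not modeled, and returning -EINVAL for a non-append offset is an
      assumption of this sketch (the rules above only name -EFBIG).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Rule 4: a sequential zone file may only be truncated to 0 or to the zone size. */
bool seq_truncate_allowed(uint64_t new_size, uint64_t zone_size)
{
	return new_size == 0 || new_size == zone_size;
}

/*
 * Rules 5 and 7: a write to a sequential zone file must start at the
 * current end of the file and must not run past the zone size.
 * Returns 0 if acceptable, -EFBIG if the access exceeds the zone size,
 * and -EINVAL for a non-append offset (error code assumed for this sketch).
 */
int seq_write_check(uint64_t file_size, uint64_t offset, uint64_t len,
		    uint64_t zone_size)
{
	if (offset > zone_size || len > zone_size - offset)
		return -EFBIG;
	if (offset != file_size)
		return -EINVAL;
	return 0;
}
```

      With the 256 MB zones used in the example further down, truncating to 4096 bytes
      would be rejected by the first check, and a write at offset 0 into a file that
      already holds 4096 bytes would be rejected by the second.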
      
      Several optional features of zonefs can be enabled at format time.
      * Conventional zone aggregation: ranges of contiguous conventional
        zones can be aggregated into a single larger file instead of the
        default one file per zone.
      * File ownership: The owner UID and GID of zone files are by default 0
        (root) but can be changed to any valid UID/GID.
      * File access permissions: the default 640 access permissions can be
        changed.
      
      The mkzonefs tool is used to format zoned block devices for use with
      zonefs. This tool is available on GitHub at:
      
      git@github.com:damien-lemoal/zonefs-tools.git.
      
      zonefs-tools also includes a test suite which can be run against any
      zoned block device, including a null_blk block device created in zoned
      mode.
      
      Example: the following formats a 15TB host-managed SMR HDD with 256 MB
      zones with the conventional zones aggregation feature enabled.
      
      $ sudo mkzonefs -o aggr_cnv /dev/sdX
      $ sudo mount -t zonefs /dev/sdX /mnt
      $ ls -l /mnt/
      total 0
      dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
      dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
      
      The size of each zone file sub-directory indicates the number of files
      existing for each type of zone. In this example, there is only one
      conventional zone file (all conventional zones are aggregated under a
      single file).
      
      $ ls -l /mnt/cnv
      total 137101312
      -rw-r----- 1 root root 140391743488 Nov 25 13:23 0
      
      This aggregated conventional zone file can be used as a regular file.
      
      $ sudo mkfs.ext4 /mnt/cnv/0
      $ sudo mount -o loop /mnt/cnv/0 /data
      
      The "seq" sub-directory grouping files for sequential write zones has
      in this example 55356 zones.
      
      $ ls -lv /mnt/seq
      total 14511243264
      -rw-r----- 1 root root 0 Nov 25 13:23 0
      -rw-r----- 1 root root 0 Nov 25 13:23 1
      -rw-r----- 1 root root 0 Nov 25 13:23 2
      ...
      -rw-r----- 1 root root 0 Nov 25 13:23 55354
      -rw-r----- 1 root root 0 Nov 25 13:23 55355
      
      For sequential write zone files, the file size changes as data is
      appended at the end of the file, similarly to any regular file system.
      
      $ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
      1+0 records in
      1+0 records out
      4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s
      
      $ ls -l /mnt/seq/0
      -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
      
      The written file can be truncated to the zone size, preventing any
      further write operation.
      
      $ truncate -s 268435456 /mnt/seq/0
      $ ls -l /mnt/seq/0
      -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
      
      Truncation to 0 size frees the file zone storage space and allows
      append-writes to the file to restart.
      
      $ truncate -s 0 /mnt/seq/0
      $ ls -l /mnt/seq/0
      -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
      
      Since files are statically mapped to zones on the disk, the number of
      blocks of a file as reported by stat() and fstat() indicates the size
      of the file zone.
      
      $ stat /mnt/seq/0
        File: /mnt/seq/0
        Size: 0       Blocks: 524288     IO Block: 4096   regular empty file
      Device: 870h/2160d      Inode: 50431       Links: 1
      Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/  root)
      Access: 2019-11-25 13:23:57.048971997 +0900
      Modify: 2019-11-25 13:52:25.553805765 +0900
      Change: 2019-11-25 13:52:25.553805765 +0900
       Birth: -
      
      The number of blocks of the file ("Blocks") in units of 512B blocks
      gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
      to the device zone size in this example. Of note is that the "IO block"
      field always indicates the minimum IO size for writes and corresponds
      to the device physical sector size.
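
      The same arithmetic can be done programmatically from a stat() result. The sketch
      below is a minimal userspace illustration; zone_bytes_from_stat() is a hypothetical
      helper name, and it relies on st_blocks being reported in 512-byte units as
      described above.

```c
#include <stdint.h>
#include <sys/stat.h>

/* st_blocks is counted in 512-byte units, independent of st_blksize. */
#define STAT_BLOCK_BYTES 512ULL

/*
 * Hypothetical helper: derive the maximum size of a zonefs file
 * (its zone size) from the block count reported by stat()/fstat().
 */
uint64_t zone_bytes_from_stat(const struct stat *st)
{
	return (uint64_t)st->st_blocks * STAT_BLOCK_BYTES;
}
```

      For the stat output shown above (Blocks: 524288), this gives 268435456 bytes, the
      256 MB zone size of the example device.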
      
      This code contains contributions from:
      * Johannes Thumshirn <jthumshirn@suse.de>,
      * Darrick J. Wong <darrick.wong@oracle.com>,
      * Christoph Hellwig <hch@lst.de>,
      * Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> and
      * Ting Yao <tingyao@hust.edu.cn>.

      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  16. 30 Oct, 2019 1 commit
    • io-wq: small threadpool implementation for io_uring · 771b53d0
      Jens Axboe authored
      This adds support for io-wq, a smaller and specialized thread pool
      implementation. This is meant to replace workqueues for io_uring. Among
      the reasons for this addition are:
      
      - We can assign memory context smarter and more persistently if we
        manage the life time of threads.
      
      - We can drop various work-arounds we have in io_uring, like the
        async_list.
      
      - We can implement hashed work insertion, to manage concurrency of
        buffered writes without needing a) an extra workqueue, or b)
        needlessly making the concurrency of said workqueue very low
        which hurts performance of multiple buffered file writers.
      
      - We can implement cancel through signals, for cancelling
        interruptible work like read/write (or send/recv) to/from sockets.
      
      - We need the above cancel for being able to assign and use file tables
        from a process.
      
      - We can implement a more thorough cancel operation in general.
      
      - We need it to move towards a syslet/threadlet model for even faster
        async execution. For that we need to take ownership of the used
        threads.
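
      Hashed work insertion, mentioned in the list above, can be illustrated with a toy
      model: work items that hash to the same bucket (for example, buffered writes to the
      same file) run one at a time, while items in different buckets may run concurrently.
      This is not io-wq's code. The names, the 64-bucket bitmap and the modulo hash are
      assumptions made for the sketch, and it is single-threaded with no locking.

```c
#include <stdbool.h>
#include <stdint.h>

#define HASH_BUCKETS 64u

/* Toy scheduler state: one "busy" bit per hash bucket. */
struct hashed_queue {
	uint64_t busy;
};

/* Map a key (for example an inode number) to a bucket. */
static unsigned int bucket_of(uint64_t key)
{
	return (unsigned int)(key % HASH_BUCKETS);
}

/* Try to start hashed work; fails while same-bucket work is running. */
bool hashed_work_start(struct hashed_queue *q, uint64_t key)
{
	uint64_t bit = UINT64_C(1) << bucket_of(key);

	if (q->busy & bit)
		return false; /* caller keeps the item queued for later */
	q->busy |= bit;
	return true;
}

/* Mark hashed work as finished so the next same-bucket item can run. */
void hashed_work_finish(struct hashed_queue *q, uint64_t key)
{
	q->busy &= ~(UINT64_C(1) << bucket_of(key));
}
```

      In this model a second writer to the same key has to wait until
      hashed_work_finish() is called, whereas a writer to a key in another bucket can
      start immediately.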
      
      This list is just off the top of my head. Performance should be the
      same, or better, at least that's what I've seen in my testing. io-wq
      supports basic NUMA functionality, setting up a pool per node.
      
      io-wq hooks up to the scheduler schedule in/out just like workqueue
      and uses that to drive the need for more/less workers.

      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  17. 24 Aug, 2019 1 commit
    • erofs: move erofs out of staging · 47e4937a
      Gao Xiang authored
      EROFS filesystem has been merged into linux-staging for a year.
      
      EROFS is designed to be a better solution of saving extra storage
      space with guaranteed end-to-end performance for read-only files
      with the help of reduced metadata, fixed-sized output compression
      and decompression inplace technologies.
      
      In the past year, EROFS was greatly improved by many people as
      a staging driver, self-tested, betaed by a large number of our
      internal users, successfully applied to almost all in-service
      HUAWEI smartphones as the part of EMUI 9.1 and proven to be stable
      enough to be moved out of staging.
      
      EROFS is a self-contained filesystem driver. Although there are
      still some TODOs to be more generic, we have a dedicated team
      actively keeping on working on EROFS in order to make it better
      with the evolution of Linux kernel as the other in-kernel filesystems.
      
      As Pavel suggested, it's better to do as one commit since git
      can do moves and all histories will be saved in this way.
      
      Let's promote it from staging and enhance it more actively as
      a "real" part of the kernel for a wider range of scenarios!
      
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Pavel Machek <pavel@denx.de>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J . Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Miao Xie <miaoxie@huawei.com>
      Cc: Li Guifu <bluce.liguifu@huawei.com>
      Cc: Fang Wei <fangwei1@huawei.com>
      Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
      Link: https://lore.kernel.org/r/20190822213659.5501-1-hsiangkao@aol.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  18. 29 Jul, 2019 1 commit
  19. 05 Jul, 2019 1 commit
  20. 21 May, 2019 1 commit
  21. 26 Apr, 2019 1 commit
    • unicode: introduce UTF-8 character database · 955405d1
      Gabriel Krisman Bertazi authored
      The decomposition and casefolding of UTF-8 characters are described in a
      prefix tree in utf8data.h, which is generated from the Unicode
      Character Database (UCD), published by the Unicode Consortium, and
      should not be edited by hand.  The structures in utf8data.h are meant to
      be used for lookup operations by the unicode subsystem, when decoding a
      utf-8 string.
      
      mkutf8data.c is the source for a program that generates utf8data.h. It
      was written by Olaf Weber from SGI and originally proposed to be merged
      into Linux in 2014.  The original proposal performed the compatibility
      decomposition, NFKD, but the current version was modified by me to do
      canonical decomposition, NFD, as suggested by the community.  The
      changes from the original submission are:
      
        * Rebase to mainline.
        * Fix out-of-tree-build.
        * Update makefile to build 11.0.0 ucd files.
        * drop references to xfs.
        * Convert NFKD to NFD.
        * Merge back robustness fixes from original patch. Requested by
          Dave Chinner.
      
      The original submission is archived at:
      
      <https://linux-xfs.oss.sgi.narkive.com/Xx10wjVY/rfc-unicode-utf-8-support-for-xfs>
      
      The utf8data.h file can be regenerated using the instructions in
      fs/unicode/README.utf8data.
      
      - Notes on the update from 8.0.0 to 11.0.0:
      
      The structure of the ucd files and special cases have not experienced
      any changes between versions 8.0.0 and 11.0.0.  8.0.0 saw the addition
      of Cherokee LC characters, which is an interesting case for
      case-folding.  The update is accompanied by new tests on the test_ucd
      module to catch specific cases.  No changes to mkutf8data script were
      required for the updates.

      Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  22. 28 Feb, 2019 1 commit
    • vfs: Add configuration parser helpers · 31d921c7
      David Howells authored
      Because the new API passes in key,value parameters, match_token() cannot be
      used with it.  Instead, provide three new helpers to aid with parsing:
      
       (1) fs_parse().  This takes a parameter and a simple static description of
           all the parameters and maps the key name to an ID.  It returns 1 on a
           match, 0 on no match if unknowns should be ignored and some other
           negative error code on a parse error.
      
           The parameter description includes a list of key names to IDs, desired
           parameter types and a list of enumeration name -> ID mappings.
      
           [!] Note that for the moment I've required the key->ID mapping
           array to be sorted and unterminated.  The size of the
           array is noted in the fsconfig_parser struct.  This allows me to use
           bsearch(), but I'm not sure any performance gain is worth the hassle
           of requiring people to keep the array sorted.
      
           The parameter type array is sized according to the number of parameter
           IDs and is indexed directly.  The optional enum mapping array is an
           unterminated, unsorted list and the size goes into the fsconfig_parser
           struct.
      
           The function can do some additional things:
      
      	(a) If it's not ambiguous and no value is given, the prefix "no" on
      	    a key name is permitted to indicate that the parameter should
      	    be considered negatory.
      
      	(b) If the desired type is a single simple integer, it will perform
      	    an appropriate conversion and store the result in a union in
      	    the parse result.
      
      	(c) If the desired type is an enumeration, {key ID, name} will be
      	    looked up in the enumeration list and the matching value will
      	    be stored in the parse result union.
      
      	(d) Optionally generate an error if the key is unrecognised.
      
           This is called something like:
      
      	enum rdt_param {
      		Opt_cdp,
      		Opt_cdpl2,
      		Opt_mba_mpbs,
      		nr__rdt_params
      	};
      
      	const struct fs_parameter_spec rdt_param_specs[nr__rdt_params] = {
      		[Opt_cdp]	= { fs_param_is_bool },
      		[Opt_cdpl2]	= { fs_param_is_bool },
      		[Opt_mba_mpbs]	= { fs_param_is_bool },
      	};
      
      	const char *const rdt_param_keys[nr__rdt_params] = {
      		[Opt_cdp]	= "cdp",
      		[Opt_cdpl2]	= "cdpl2",
      		[Opt_mba_mpbs]	= "mba_mbps",
      	};
      
      	const struct fs_parameter_description rdt_parser = {
      		.name		= "rdt",
      		.nr_params	= nr__rdt_params,
      		.keys		= rdt_param_keys,
      		.specs		= rdt_param_specs,
      		.no_source	= true,
      	};
      
      	int rdt_parse_param(struct fs_context *fc,
      			    struct fs_parameter *param)
      	{
      		struct fs_parse_result parse;
      		struct rdt_fs_context *ctx = rdt_fc2context(fc);
      		int ret;
      
      		ret = fs_parse(fc, &rdt_parser, param, &parse);
      		if (ret < 0)
      			return ret;
      
      		switch (parse.key) {
      		case Opt_cdp:
      			ctx->enable_cdpl3 = true;
      			return 0;
      		case Opt_cdpl2:
      			ctx->enable_cdpl2 = true;
      			return 0;
      		case Opt_mba_mpbs:
      			ctx->enable_mba_mbps = true;
      			return 0;
      		}
      
      		return -EINVAL;
      	}
      
       (2) fs_lookup_param().  This takes a { dirfd, path, LOOKUP_EMPTY? } or
           string value and performs an appropriate path lookup to convert it
           into a path object, which it will then return.
      
           If the desired type was a blockdev, the type of the looked up inode
           will be checked to make sure it is one.
      
           This can be used like:
      
      	enum foo_param {
      		Opt_source,
      		nr__foo_params
      	};
      
      	const struct fs_parameter_spec foo_param_specs[nr__foo_params] = {
      		[Opt_source]	= { fs_param_is_blockdev },
      	};
      
      	const char *const foo_param_keys[nr__foo_params] = {
      		[Opt_source]	= "source",
      	};
      
      	const struct constant_table foo_param_alt_keys[] = {
      		{ "device",	Opt_source },
      	};
      
      	const struct fs_parameter_description foo_parser = {
      		.name		= "foo",
      		.nr_params	= nr__foo_params,
      		.nr_alt_keys	= ARRAY_SIZE(foo_param_alt_keys),
      		.keys		= foo_param_keys,
      		.alt_keys	= foo_param_alt_keys,
      		.specs		= foo_param_specs,
      	};
      
      	int foo_parse_param(struct fs_context *fc,
      			    struct fs_parameter *param)
      	{
      		struct fs_parse_result parse;
      		struct foo_fs_context *ctx = foo_fc2context(fc);
      		int ret;
      
      		ret = fs_parse(fc, &foo_parser, param, &parse);
      		if (ret < 0)
      			return ret;
      
      		switch (parse.key) {
      		case Opt_source:
      			return fs_lookup_param(fc, &foo_parser, param,
      					       &parse, &ctx->source);
      		default:
      			return -EINVAL;
      		}
      	}
      
       (3) lookup_constant().  This takes a table of named constants and looks up
           the given name within it.  The table is expected to be sorted so
           that bsearch() can be used on it.
      
           Possibly I should require the table be terminated and just use a
           for-loop to scan it instead of using bsearch() to reduce hassle.
      
           Tables look something like:
      
      	static const struct constant_table bool_names[] = {
      		{ "0",		false },
      		{ "1",		true },
      		{ "false",	false },
      		{ "no",		false },
      		{ "true",	true },
      		{ "yes",	true },
      	};
      
           and a lookup is done with something like:
      
      	b = lookup_constant(bool_names, param->string, -1);
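
           The bsearch()-based lookup described in (3) can be sketched in userspace C.
           This is a stand-in written for illustration, not the kernel helper: the
           struct layout, the explicit table length argument and the name
           lookup_constant_sketch() are assumptions, and the table must be sorted in
           strcmp() order for bsearch() to work.

```c
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for the table entries shown above (layout assumed). */
struct constant_table {
	const char *name;
	int value;
};

/* bsearch() comparator: key is the name string, elem is a table entry. */
static int cmp_constant(const void *key, const void *elem)
{
	const struct constant_table *entry = elem;

	return strcmp(key, entry->name);
}

/* Look up name in a strcmp()-sorted table; return not_found on a miss. */
int lookup_constant_sketch(const struct constant_table *tbl, size_t n,
			   const char *name, int not_found)
{
	const struct constant_table *hit;

	hit = bsearch(name, tbl, n, sizeof(*tbl), cmp_constant);
	return hit ? hit->value : not_found;
}

/* Same entries as bool_names above, already in sorted order. */
static const struct constant_table bool_names[] = {
	{ "0", 0 }, { "1", 1 }, { "false", 0 },
	{ "no", 0 }, { "true", 1 }, { "yes", 1 },
};
```

           Keeping the table sorted is the cost of using bsearch(); an entry added out
           of order could make lookups miss silently, which is the hassle the message
           refers to.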
      
      Additionally, optional validation routines for the parameter description
      are provided that can be enabled at compile time.  A later patch will
      invoke these when a filesystem is registered.

      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  23. 06 Feb, 2019 1 commit
  24. 11 Jun, 2018 1 commit
    • autofs: remove left-over autofs4 stubs · a2225d93
      Linus Torvalds authored
      There's no need to retain the fs/autofs4 directory for backward
      compatibility.
      
      Adding an AUTOFS4_FS fragment to the autofs Kconfig and a module alias
      for autofs4 is sufficient for almost all cases. Not keeping fs/autofs4
      remnants will prevent "insmod <path>/autofs4/autofs4.ko" from working,
      but insmod shouldn't be used in automation scripts in place of
      modprobe(8).
      
      There were some comments about things to look out for with the module
      rename in the fs/autofs4/Kconfig that is removed by this patch, see the
      commit patch if you are interested.
      
      One potential problem with this change is that when the
      fs/autofs/Kconfig fragment for AUTOFS4_FS is removed any AUTOFS4_FS
      entries will be removed from the kernel config, resulting in no autofs
      file system being built if there is no AUTOFS_FS entry also.
      
      This would have also happened if the fs/autofs4 remnants had remained
      and is most likely to be a problem with automated builds.
      
      Please check your build configurations before the removal which will
      occur after the next couple of kernel releases.

      Acked-by: Ian Kent <raven@themaw.net>
      [ With edits and commit message from Ian Kent ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 08 Jun, 2018 2 commits
  26. 22 May, 2018 1 commit
    • mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS · e7638488
      Dan Williams authored
      In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
      be able to rely on the fact that they will get wakeups on dev_pagemap
      page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
      generic_dax_page_free() as common indicator / infrastructure for dax
      filesystems to require. With this change there are no users of the
      MEMORY_DEVICE_HOST designation, so remove it.
      
      The HMM sub-system extended dev_pagemap to arrange a callback when a
      dev_pagemap managed page is freed. Since a dev_pagemap page is free /
      idle when its reference count is 1 it requires an additional branch to
      check the page-type at put_page() time. Given put_page() is a hot-path
      we do not want to incur that check if HMM is not in use, so a static
      branch is used to avoid that overhead when not necessary.
      
      Now, the FS_DAX implementation wants to reuse this mechanism for
      receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
      static-key into a generic mechanism that either HMM or FS_DAX code paths
      can enable.
      
      For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
      care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
      However, we still need to support FS_DAX in the FS_DAX_LIMITED case
      implemented by the s390/dcssblk driver.
      
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Reported-by: kbuild test robot <lkp@intel.com>
      Reported-by: Thomas Meyer <thomas@m3y3r.de>
      Reported-by: Dave Jiang <dave.jiang@intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  27. 28 Apr, 2018 1 commit
  28. 17 Apr, 2018 1 commit
  29. 20 Jan, 2018 1 commit
    • dax: require 'struct page' by default for filesystem dax · 569d0365
      Dan Williams authored
      If a dax buffer from a device that does not map pages is passed to
      read(2) or write(2) as a target for direct-I/O it triggers SIGBUS. If
      gdb attempts to examine the contents of a dax buffer from a device that
      does not map pages it triggers SIGBUS. If fork(2) is called on a process
      with a dax mapping from a device that does not map pages it triggers
      SIGBUS. 'struct page' is required otherwise several kernel code paths
      break in surprising ways. Disable filesystem-dax on devices that do not
      map pages.
      
      In addition to needing pfn_to_page() to be valid we also require devmap
      pages.  We need this to detect dax pages in the get_user_pages_fast()
      path and so that we can stop managing the VM_MIXEDMAP flag. For DAX
      drivers that have not supported get_user_pages() to date we allow them
      to opt-in to supporting DAX with the CONFIG_FS_DAX_LIMITED configuration
      option which requires ->direct_access() to return pfn_t_special() pfns.
      This leaves DAX support in brd disabled and scheduled for removal.
      
      Note that when the initial dax support was being merged a few years back
      there was concern that struct page was unsuitable for use with next
      generation persistent memory devices. The theoretical concern was that
      struct page access, being such a hotly used data structure in the
      kernel, would lead to media wear out. While that was a reasonable
      conservative starting position it has not held true in practice. We have
      long since committed to using devm_memremap_pages() to support higher
      order kernel functionality that needs get_user_pages() and
      pfn_to_page().
      
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  30. 02 Jan, 2018 1 commit
  31. 28 Nov, 2017 1 commit
  32. 13 Jul, 2017 1 commit
  33. 09 May, 2017 1 commit
    • block, dax: move "select DAX" from BLOCK to FS_DAX · ef510424
      Dan Williams authored
      For configurations that do not enable DAX filesystems or drivers, do not
      require the DAX core to be built.
      
      Given that the 'direct_access' method has been removed from
      'block_device_operations', we can also go ahead and move the
      block-related dax helper functions from fs/block_dev.c to
      drivers/dax/super.c. This keeps dax details out of the block layer and
      lets the DAX core be built as a module in the FS_DAX=n case.
      
      Filesystems need to include dax.h to call bdev_dax_supported().
      
      Cc: linux-xfs@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>