提交 · f9fe48bece3af2d60e1bad65db4825f5a025dd36 · openanolis / cloud-kernel

23 1月, 2016 1 次提交

dax: support dirty DAX entries in radix tree · f9fe48be

由 Ross Zwisler 提交于 1月 22, 2016

Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new
exceptional entries into the radix tree that represent dirty DAX PTE or
PMD pages.  These exceptional entries will also contain the writeback
addresses for the PTE or PMD faults that we can use at fsync/msync time.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  We rely
on the fact that only one type of exceptional entry can be found in a
given radix tree based on its usage.  This happens for free with DAX vs
shmem but we explicitly prevent shadow entries from being added to radix
trees for DAX mappings.

The only shadow entries that would be generated for DAX radix trees
would be to track zero page mappings that were created for holes.  These
pages would receive minimal benefit from having shadow entries, and the
choice to have only one type of exceptional entry in a given radix tree
makes the logic simpler both in clear_exceptional_entry() and in the
rest of DAX.
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f9fe48be

16 1月, 2016 2 次提交

dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() · b2e0d162

由 Dan Williams 提交于 1月 15, 2016

The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against the driver for the underlying
block device being disabled.  Use blk_queue_enter()/blk_queue_exit() to
hold off blk_cleanup_queue() from proceeding, or otherwise fail new
mapping requests if the request_queue is being torn down.

This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
through dax_map_atomic() to bdev_direct_access().

[willy@linux.intel.com: fix read() of a hole]
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b2e0d162

dax: guarantee page aligned results from bdev_direct_access() · fe683ada

由 Dan Williams 提交于 1月 15, 2016

If a ->direct_access() implementation ever returns a map count less than
PAGE_SIZE, catch the error in bdev_direct_access().  This simplifies
error checking in upper layers.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fe683ada

15 1月, 2016 2 次提交

fs/block_dev.c:bdev_write_page(): use blk_queue_enter(..., GFP_NOIO) · b832861c

由 Andrew Morton 提交于 1月 14, 2016

bdev_write_page() is used by swapout and by writepage where we cannot
use __GFP_FS or __GFP_IO.  So it is misleading to mention GFP_KERNEL
here.

blk_queue_enter() only actually looks at __GFP_DIRECT_RECLAIM, so no
bugs were harmed in the making of this patch.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b832861c

kmemcg: account certain kmem allocations to memcg · 5d097056

由 Vladimir Davydov 提交于 1月 14, 2016

Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5d097056

14 1月, 2016 1 次提交

block: use bd{grab,put}() instead of open-coding · ed8a9d2c

由 Ilya Dryomov 提交于 11月 20, 2015

- bd_acquire() and bd_forget() open-code bdgrab() and bdput()
- raw driver uses igrab() but never checks its return value and always
  holds another ref from bind_set() while calling it, so it's
  equivalent to bdgrab()
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

ed8a9d2c

09 1月, 2016 2 次提交

block: enable dax for raw block devices · 5a023cdb

由 Dan Williams 提交于 11月 30, 2015

If an application wants exclusive access to all of the persistent memory
provided by an NVDIMM namespace it can use this raw-block-dax facility
to forgo establishing a filesystem.  This capability is targeted
primarily to hypervisors wanting to provision persistent memory for
guests.  It can be disabled / enabled dynamically via the new BLKDAXSET
ioctl.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
Reviewed-by: NJan Kara <jack@suse.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

5a023cdb

block: introduce bdev_file_inode() · 4ebb16ca

由 Dan Williams 提交于 10月 28, 2015

Similar to the file_inode() helper, provide a helper to lookup the inode for a
raw block device itself.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: NJan Kara <jack@suse.cz>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

4ebb16ca

07 1月, 2016 1 次提交

fs: use gendisk->disk_name where possible · 424081f3

由 Dmitry Monakhov 提交于 4月 13, 2015

gendisk with part==0 is obviously gendisk->disk_name.
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

424081f3

05 12月, 2015 1 次提交

block: detach bdev inode from its wb in __blkdev_put() · 43d1c0eb

由 Ilya Dryomov 提交于 11月 20, 2015

Since 52ebea74 ("writeback: make backing_dev_info host
cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
gets attached to a wb (struct bdi_writeback).  Detaching happens on
evict, in inode_detach_wb() called from __destroy_inode(), and involves
updating wb.

However, detaching an internal bdev inode from its wb in
__destroy_inode() is too late.  Its bdi and by extension root wb are
embedded into struct request_queue, which has different lifetime rules
and can be freed long before the final bdput() is called (can be from
__fput() of a corresponding /dev inode, through dput() - evict() -
bd_forget().  bdevs hold onto the underlying disk/queue pair only while
opened; as soon as bdev is closed all bets are off.  In fact,
disk/queue can be gone before __blkdev_put() even returns:

1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
1500 {
...
1518         if (bdev->bd_contains == bdev) {
1519                 if (disk->fops->release)
1520                         disk->fops->release(disk, mode);

[ Driver puts its references to disk/queue ]

1521         }
1522         if (!bdev->bd_openers) {
1523                 struct module *owner = disk->fops->owner;
1524
1525                 disk_put_part(bdev->bd_part);
1526                 bdev->bd_part = NULL;
1527                 bdev->bd_disk = NULL;
1528                 if (bdev != bdev->bd_contains)
1529                         victim = bdev->bd_contains;
1530                 bdev->bd_contains = NULL;
1531
1532                 put_disk(disk);

[ We put ours, the queue is gone
  The last bdput() would result in a write to invalid memory ]

1533                 module_put(owner);
...
1539 }

Since bdev inodes are special anyway, detach them in __blkdev_put()
after clearing inode's dirty bits, turning the problematic
inode_detach_wb() in __destroy_inode() into a noop.

add_disk() grabs its disk->queue since 523e1d39 ("block: make
gendisk hold a reference to its queue"), so the old ->release comment
is removed in favor of the new inode_detach_wb() comment.

Cc: stable@vger.kernel.org # 4.2+, needs backporting
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Acked-by: NTejun Heo <tj@kernel.org>
Tested-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

43d1c0eb

02 12月, 2015 1 次提交

blk-mq: add a flags parameter to blk_mq_alloc_request · 6f3b0e8b

由 Christoph Hellwig 提交于 11月 26, 2015

We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t.  Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

6f3b0e8b

20 11月, 2015 1 次提交

block: protect rw_page against device teardown · 2e6edc95

由 Dan Williams 提交于 11月 19, 2015

Fix use after free crashes like the following:

 general protection fault: 0000 [#1] SMP
 Call Trace:
  [<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
  [<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem]
  [<ffffffff8128fd90>] bdev_read_page+0x50/0x60
  [<ffffffff812972f0>] do_mpage_readpage+0x510/0x770
  [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
  [<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50
  [<ffffffff81297657>] mpage_readpages+0x107/0x170
  [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
  [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
  [<ffffffff8129058d>] blkdev_readpages+0x1d/0x20
  [<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310
  [<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310
  [<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0
  [<ffffffff811c76f6>] filemap_fault+0x396/0x530
  [<ffffffff811f816e>] __do_fault+0x4e/0xf0
  [<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50

Cc: <stable@vger.kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reported-by: Nkbuild test robot <lkp@intel.com>
Acked-by: NMatthew Wilcox <willy@linux.intel.com>
[willy: symmetry fixups]
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

2e6edc95

12 11月, 2015 1 次提交

fs/block_dev.c: Remove WARN_ON() when inode writeback fails · dbd3ca50

由 Vivek Goyal 提交于 11月 09, 2015

If a block device is hot removed and later last reference to device
is put, we try to writeback the dirty inode. But device is gone and
that writeback fails.

Currently we do a WARN_ON() which does not seem to be the right thing.
Convert it to a ratelimited kernel warning.
Reported-by: NAndi Kleen <andi@firstfloor.org>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Acked-by: NTejun Heo <tj@kernel.org>
[jmoyer@redhat.com: get rid of unnecessary name initialization, 80 cols]
Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

dbd3ca50

22 10月, 2015 1 次提交

block: Inline blk_integrity in struct gendisk · 25520d55

由 Martin K. Petersen 提交于 10月 21, 2015

Up until now the_integrity profile has been dynamically allocated and
attached to struct gendisk after the disk has been made active.

This causes problems because NVMe devices need to register the profile
prior to the partition table being read due to a mandatory metadata
buffer requirement. In addition, DM goes through hoops to deal with
preallocating, but not initializing integrity profiles.

Since the integrity profile is small (4 bytes + a pointer), Christoph
suggested moving it to struct gendisk proper. This requires several
changes:

 - Moving the blk_integrity definition to genhd.h.

 - Inlining blk_integrity in struct gendisk.

 - Removing the dynamic allocation code.

 - Adding helper functions which allow gendisk to set up and tear down
   the integrity sysfs dir when a disk is added/deleted.

 - Adding a blk_integrity_revalidate() callback for updating the stable
   pages bdi setting.

 - The calls that depend on whether a device has an integrity profile or
   not now key off of the bi->profile pointer.

 - Simplifying the integrity support routines in DM (Mike Snitzer).
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reported-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

25520d55

16 9月, 2015 1 次提交

blockdev: don't set S_DAX for misaligned partitions · f0b2e563

由 Jeff Moyer 提交于 8月 14, 2015

The dax code doesn't currently support misaligned partitions,
so disable O_DIRECT via dax until such time as that support
materializes.

Cc: <stable@vger.kernel.org>
Suggested-by: NBoaz Harrosh <boaz@plexistor.com>
Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

f0b2e563

09 9月, 2015 1 次提交

dax: move DAX-related functions to a new header · c94c2acf

由 Matthew Wilcox 提交于 9月 08, 2015

In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
defined in linux/mm.h.  Given that we don't want to include <linux/mm.h>
in <linux/fs.h>, the easiest solution is to move the DAX-related
functions to a new header, <linux/dax.h>.  We could also have moved
VM_FAULT_* definitions to a new header, or a different header that isn't
quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
the best option.
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c94c2acf

28 8月, 2015 1 次提交

dax: drop size parameter to ->direct_access() · cb389b9c

由 Dan Williams 提交于 8月 07, 2015

None of the implementations currently use it.  The common
bdev_direct_access() entry point handles all the size checks before
calling ->direct_access().
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

cb389b9c

21 8月, 2015 1 次提交

pmem, dax: have direct_access use __pmem annotation · e2e05394

由 Ross Zwisler 提交于 8月 18, 2015

Update the annotation for the kaddr pointer returned by direct_access()
so that it is a __pmem pointer. This is consistent with the PMEM driver
and with how this direct_access() pointer is used in the DAX code.
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

e2e05394

18 8月, 2015 1 次提交

inode: convert inode_sb_list_lock to per-sb · 74278da9

由 Dave Chinner 提交于 3月 04, 2015

The process of reducing contention on per-superblock inode lists
starts with moving the locking to match the per-superblock inode
list. This takes the global lock out of the picture and reduces the
contention problems to within a single filesystem. This doesn't get
rid of contention as the locks still have global CPU scope, but it
does isolate operations on different superblocks form each other.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NDave Chinner <dchinner@redhat.com>

74278da9

05 7月, 2015 2 次提交

dax: bdev_direct_access() may sleep · 43c3dd08

由 Matthew Wilcox 提交于 7月 03, 2015

The brd driver is the only in-tree driver that may sleep currently.
After some discussion on linux-fsdevel, we decided that any driver
may choose to sleep in its ->direct_access method.  To ensure that all
callers of bdev_direct_access() are prepared for this, add a call
to might_sleep().
Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

43c3dd08

block: Add support for DAX reads/writes to block devices · bbab37dd

由 Matthew Wilcox 提交于 7月 03, 2015

If a block device supports the ->direct_access methods, bypass the normal
DIO path and use DAX to go straight to memcpy() instead of allocating
a DIO and a BIO.

Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in
do_blockdev_direct_IO().
Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

bbab37dd

28 6月, 2015 1 次提交

bdi: Remove "inline" keyword from exported I_BDEV() implementation · ff5053f6

由 Geert Uytterhoeven 提交于 6月 26, 2015

With gcc 3.4.6/4.1.2/4.2.4 (not with 4.4.7/4.6.4/4.8.4):

      CC      fs/block_dev.o
    include/linux/fs.h:804: warning: ‘I_BDEV’ declared inline after being called
    include/linux/fs.h:804: warning: previous declaration of ‘I_BDEV’ was here

Commit a212b105 ("bdi: make inode_to_bdi() inline") added a
caller of I_BDEV() in a header file, exposing the bogus "inline" on the
exported implementation.

Drop the "inline" keyword to fix this.
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

ff5053f6

26 6月, 2015 1 次提交

fs/block_dev.c: skip rw_page if bdev has integrity · f68eb1e7

由 Vishal Verma 提交于 5月 12, 2015

If a block device has bio integrity enabled, rw_page will bypass the
integrity payload, which is undesirable. Skip rw_page if this is the
case.

Currently brd and zram provide rw_page, and the proposed 'nd' drivers
will too.

Cc: Jens Axboe <axboe@fb.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Suggested-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: NVishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

f68eb1e7

02 6月, 2015 2 次提交

bdi: make inode_to_bdi() inline · a212b105

由 Tejun Heo 提交于 5月 22, 2015

Now that bdi definitions are moved to backing-dev-defs.h,
backing-dev.h can include blkdev.h and inline inode_to_bdi() without
worrying about introducing circular include dependency.  The function
gets called from hot paths and fairly trivial.

This patch makes inode_to_bdi() and sb_is_blkdev_sb() that the
function calls inline.  blockdev_superblock and noop_backing_dev_info
are EXPORT_GPL'd to allow the inline functions to be used from
modules.

While at it, make sb_is_blkdev_sb() return bool instead of int.

v2: Fixed typo in description as suggested by Jan.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reviewed-by: NJens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

a212b105

writeback: separate out include/linux/backing-dev-defs.h · 66114cad

由 Tejun Heo 提交于 5月 22, 2015

With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NJens Axboe <axboe@fb.com>

66114cad

25 4月, 2015 1 次提交

direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0

由 Jens Axboe 提交于 4月 15, 2015

do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:

clat percentiles (usec):
 |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
 | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
 | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
 | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
 | 99.99th=[  165]

After:

clat percentiles (usec):
 |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
 | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
 | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
 | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
 | 99.99th=[  438]

In other setups, Robert Elliott reported seeing good performance
improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJens Axboe <axboe@fb.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

fe0f07d0

16 4月, 2015 1 次提交

VFS: assorted d_backing_inode() annotations · bb668734

由 David Howells 提交于 3月 17, 2015

Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

bb668734

12 4月, 2015 5 次提交

A
blkdev_write_iter: expand generic_file_checks() call in there · 7ec7b94a
由 Al Viro 提交于 4月 07, 2015
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
7ec7b94a
A
lift generic_write_checks() into callers of __generic_file_write_iter() · 5f380c7f
由 Al Viro 提交于 4月 07, 2015
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
5f380c7f

direct_IO: remove rw from a_ops->direct_IO() · 22c6186e

由 Omar Sandoval 提交于 3月 16, 2015

Now that no one is using rw, remove it completely.
Signed-off-by: NOmar Sandoval <osandov@osandov.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

22c6186e

Remove rw from {,__,do_}blockdev_direct_IO() · 17f8c842

由 Omar Sandoval 提交于 3月 16, 2015

Most filesystems call through to these at some point, so we'll start
here.
Signed-off-by: NOmar Sandoval <osandov@osandov.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

17f8c842

make new_sync_{read,write}() static · 5d5d5689

由 Al Viro 提交于 4月 03, 2015

All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

5d5d5689

26 3月, 2015 1 次提交

fs: move struct kiocb to fs.h · e2e40f2c

由 Christoph Hellwig 提交于 2月 22, 2015

struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e2e40f2c

21 1月, 2015 2 次提交

fs: remove mapping->backing_dev_info · b83ae6d4

由 Christoph Hellwig 提交于 1月 14, 2015

Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: NTejun Heo <tj@kernel.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

b83ae6d4

block_dev: only write bdev inode on close · 564f00f6

由 Christoph Hellwig 提交于 1月 14, 2015

Since 018a17bd ("bdi: reimplement bdev_inode_switch_bdi()") the
block device code writes out all dirty data whenever switching the
backing_dev_info for a block device inode.  But a block device inode can
only be dirtied when it is in use, which means we only have to write it
out on the final blkdev_put, but not when doing a blkdev_get.

Factoring out the write out from the bdi list switch prepares from
removing the list switch later in the series.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NTejun Heo <tj@kernel.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

564f00f6

14 1月, 2015 1 次提交

block: Change direct_access calling convention · dd22f551

由 Matthew Wilcox 提交于 1月 07, 2015

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Add a new helper function, bdev_direct_access(), to handle common
functionality including partition handling, checking the length requested
is positive, checking for the sector being page-aligned, and checking
the length of the request does not pass the end of the partition.
Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NBoaz Harrosh <boaz@plexistor.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

dd22f551

17 11月, 2014 1 次提交

fs: add freeze_super/thaw_super fs hooks · 48b6bca6

由 Benjamin Marzinski 提交于 11月 13, 2014

Currently, freezing a filesystem involves calling freeze_super, which locks
sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it
hard for gfs2 (and potentially other cluster filesystems) to use the vfs
freezing code to do freezes on all the cluster nodes.

In order to communicate that a freeze has been requested, and to make sure
that only one node is trying to freeze at a time, gfs2 uses a glock
(sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire
this lock before calling freeze_super. This means that two nodes can
attempt to freeze the filesystem by both calling freeze_super, acquiring
the sb->s_umount lock, and then attempting to grab the cluster glock
sd_freeze_gl. Only one will succeed, and the other will be stuck in
freeze_super, making it impossible to finish freezing the node.

To solve this problem, this patch adds the freeze_super and thaw_super
hooks. If a filesystem implements these hooks, they are called instead of
the vfs freeze_super and thaw_super functions. This means that every
filesystem that implements these hooks must call the vfs freeze_super and
thaw_super functions itself within the hook function to make use of the vfs
freezing code.
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NBenjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

48b6bca6

31 10月, 2014 1 次提交

Return short read or 0 at end of a raw device, not EIO · b2de525f

由 David Jeffery 提交于 9月 29, 2014

Author: David Jeffery <djeffery@redhat.com>
Changes to the basic direct I/O code have broken the raw driver when reading
to the end of a raw device.  Instead of returning a short read for a read that
extends partially beyond the device's end or 0 when at the end of the device,
these reads now return EIO.

The raw driver needs the same end of device handling as was added for normal
block devices.  Using blkdev_read_iter, which has the needed size checks,
prevents the EIO conditions at the end of the device.
Signed-off-by: NDavid Jeffery <djeffery@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

b2de525f

10 10月, 2014 1 次提交

block_dev: implement readpages() to optimize sequential read · 447f05bb

由 Akinobu Mita 提交于 10月 09, 2014

Sequential read from a block device is expected to be equal or faster than
from the file on a filesystem.  But it is not correct due to the lack of
effective readpages() in the address space operations for block device.

This implements readpages() operation for block device by using
mpage_readpages() which can create multipage BIOs instead of BIOs for each
page and reduce system CPU time consumption.

Install 1GB of RAM disk storage:

	# modprobe scsi_debug dev_size_mb=1024 delay=0

Sequential read from file on a filesystem:

	# mkfs.ext4 /dev/$DEV
	# mount /dev/$DEV /mnt
	# fio --name=t --size=512m --rw=read --filename=/mnt/file
	...
	  read : io=524288KB, bw=2133.4MB/s, iops=546133, runt=   240msec

Sequential read from a block device:
	# fio --name=t --size=512m --rw=read --filename=/dev/$DEV
	...
(Without this commit)
	  read : io=524288KB, bw=1700.2MB/s, iops=435455, runt=   301msec

(With this commit)
	  read : io=524288KB, bw=2160.4MB/s, iops=553046, runt=   237msec
Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

447f05bb

09 9月, 2014 1 次提交

bdi: reimplement bdev_inode_switch_bdi() · 018a17bd

由 Tejun Heo 提交于 9月 08, 2014

A block_device may be attached to different gendisks and thus
different bdis over time.  bdev_inode_switch_bdi() is used to switch
the associated bdi.  The function assumes that the inode could be
dirty and transfers it between bdis if so.  This is a bit nasty in
that it reaches into bdi internals.

This patch reimplements the function so that it writes out the inode
if dirty.  This is a lot simpler and can be implemented without
exposing bdi internals.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJens Axboe <axboe@fb.com>

018a17bd

openanolis / cloud-kernel 12 个月 前同步成功

openanolis / cloud-kernel
12 个月前同步成功