提交 · fd1ae514528bfa7136640301523f646f396134e2 · openeuler / Kernel

25 8月, 2016 1 次提交

fs/block_dev: fix potential NULL ptr deref in freeze_bdev() · 5bb53c0f

由 Andrey Ryabinin 提交于 8月 23, 2016

Calling freeze_bdev() twice on the same block device without mounted
filesystem get_super() will return NULL, which will lead to NULL-ptr
dereference later in drop_super().

Check get_super() result to fix that.

Note, that this is a purely theoretical issue. We have only 3
freeze_bdev() callers. 2 of them are in filesystem code and used on a
device with mounted fs. The third one in lock_fs() has protection in
upper-layer code against freezing block device the second time without
thawing it first.
Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

5bb53c0f

22 8月, 2016 1 次提交

bdev: fix NULL pointer dereference · e9e5e3fa

由 Vegard Nossum 提交于 8月 22, 2016

I got this:

    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    Dumping ftrace buffer:
       (ftrace buffer empty)
    CPU: 0 PID: 5505 Comm: syz-executor Not tainted 4.8.0-rc2+ #161
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    task: ffff880113415940 task.stack: ffff880118350000
    RIP: 0010:[<ffffffff8172cb32>]  [<ffffffff8172cb32>] bd_mount+0x52/0xa0
    RSP: 0018:ffff880118357ca0  EFLAGS: 00010207
    RAX: dffffc0000000000 RBX: ffffffffffffffff RCX: ffffc90000bb6000
    RDX: 0000000000000018 RSI: ffffffff846d6b20 RDI: 00000000000000c7
    RBP: ffff880118357cb0 R08: ffff880115967c68 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801188211e8
    R13: ffffffff847baa20 R14: ffff8801139cb000 R15: 0000000000000080
    FS:  00007fa3ff6c0700(0000) GS:ffff88011aa00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc1d8cc7e78 CR3: 0000000109f20000 CR4: 00000000000006f0
    DR0: 000000000000001e DR1: 000000000000001e DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
    Stack:
     ffff880112cfd6c0 ffff8801188211e8 ffff880118357cf0 ffffffff8167f207
     ffffffff816d7a1e ffff880112a413c0 ffffffff847baa20 ffff8801188211e8
     0000000000000080 ffff880112cfd6c0 ffff880118357d38 ffffffff816dce0a
    Call Trace:
     [<ffffffff8167f207>] mount_fs+0x97/0x2e0
     [<ffffffff816d7a1e>] ? alloc_vfsmnt+0x55e/0x760
     [<ffffffff816dce0a>] vfs_kern_mount+0x7a/0x300
     [<ffffffff83c3247c>] ? _raw_read_unlock+0x2c/0x50
     [<ffffffff816dfc87>] do_mount+0x3d7/0x2730
     [<ffffffff81235fd4>] ? trace_do_page_fault+0x1f4/0x3a0
     [<ffffffff816df8b0>] ? copy_mount_string+0x40/0x40
     [<ffffffff8161ea81>] ? memset+0x31/0x40
     [<ffffffff816df73e>] ? copy_mount_options+0x1ee/0x320
     [<ffffffff816e2a02>] SyS_mount+0xb2/0x120
     [<ffffffff816e2950>] ? copy_mnt_ns+0x970/0x970
     [<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
     [<ffffffff83c3282a>] entry_SYSCALL64_slow_path+0x25/0x25
    Code: 83 e8 63 1b fc ff 48 85 c0 48 89 c3 74 4c e8 56 35 d1 ff 48 8d bb c8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 36 4c 8b a3 c8 00 00 00 48 b8 00 00 00 00 00 fc
    RIP  [<ffffffff8172cb32>] bd_mount+0x52/0xa0
     RSP <ffff880118357ca0>
    ---[ end trace 13690ad962168b98 ]---

mount_pseudo() returns ERR_PTR(), not NULL, on error.

Fixes: 3684aa70 ("block-dev: enable writeback cgroup support")
Cc: Shaohua Li <shli@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: stable@vger.kernel.org
Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

e9e5e3fa

08 8月, 2016 1 次提交

block/mm: make bdev_ops->rw_page() take a bool for read/write · c11f0c0b

由 Jens Axboe 提交于 8月 05, 2016

Commit abf54548 changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.

Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.
Reviewed-by: NMike Christie <mchristi@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

c11f0c0b

05 8月, 2016 1 次提交

mm/block: convert rw_page users to bio op use · abf54548

由 Mike Christie 提交于 8月 04, 2016

The rw_page users were not converted to use bio/req ops. As a result
bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
be sent down as reads.
Signed-off-by: NMike Christie <mchristi@redhat.com>
Fixes: 4e1b2d52 ("block, fs, drivers: remove REQ_OP compat defs and related code")

Modified by me to:

1) Drop op_flags passing into ->rw_page(), as we don't use it.
2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK
Signed-off-by: NJens Axboe <axboe@fb.com>

abf54548

04 8月, 2016 1 次提交

block: remove BLK_DEV_DAX config option · 99a01cdf

由 Ross Zwisler 提交于 8月 03, 2016

The functionality for block device DAX was already removed with commit
acc93d30 ("Revert "block: enable dax for raw block devices"")

However, we still had a config option hanging around that was always
disabled because it depended on CONFIG_BROKEN. This config option was
introduced in commit 03cdadb0 ("block: disable block device DAX by
default")

This change reverts that commit, removing the dead config option.

Link: http://lkml.kernel.org/r/20160729182314.6368-1-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: NDan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

99a01cdf

21 7月, 2016 1 次提交

block: add QUEUE_FLAG_DAX for devices to advertise their DAX support · 163d4baa

由 Toshi Kani 提交于 6月 23, 2016

Currently, presence of direct_access() in block_device_operations
indicates support of DAX on its block device.  Because
block_device_operations is instantiated with 'const', this DAX
capablity may not be enabled conditinally.

In preparation for supporting DAX to device-mapper devices, add
QUEUE_FLAG_DAX to request_queue flags to advertise their DAX
support.  This will allow to set the DAX capability based on how
mapped device is composed.
Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
Acked-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <linux-s390@vger.kernel.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

163d4baa

20 7月, 2016 1 次提交

bdev: get rid of ->bd_inodes · a4a4f943

由 Al Viro 提交于 7月 19, 2016

Since 2006 we have ->i_bdev pinning bdev in question, so there's no
way to get to bdev ->evict_inode() while there's an aliasing inode
anywhere.  In other words, the only place walking the list of aliases
is guaranteed to do it only when the list is empty...

Remove the detritus; it should've been done in "[PATCH] Fix a race
condition between ->i_mapping and iput()", but nobody had noticed it
back then.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

a4a4f943

24 6月, 2016 1 次提交

vfs: Generalize filesystem nodev handling. · a2982cc9

由 Eric W. Biederman 提交于 6月 09, 2016

Introduce a function may_open_dev that tests MNT_NODEV and a new
superblock flab SB_I_NODEV.  Use this new function in all of the
places where MNT_NODEV was previously tested.

Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
filesystems should never support device nodes, and a simple superblock
flags makes that very hard to get wrong.  With SB_I_NODEV set if any
device nodes somehow manage to show up on on a filesystem those
device nodes will be unopenable.
Acked-by: NSeth Forshee <seth.forshee@canonical.com>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

a2982cc9

21 5月, 2016 1 次提交

Revert "block: enable dax for raw block devices" · acc93d30

由 Dan Williams 提交于 5月 07, 2016

This reverts commit 5a023cdb.

The functionality is superseded by the new "Device DAX" facility.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

acc93d30

19 5月, 2016 1 次提交

dax: enable dax in the presence of known media errors (badblocks) · 0a70bd43

由 Dan Williams 提交于 2月 24, 2016

1/ If a mapping overlaps a bad sector fail the request.

2/ Do not opportunistically report more dax-capable capacity than is
   requested when errors present.
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
[vishal: fix a conflict with system RAM collision patches]
[vishal: add a 'size' parameter to ->direct_access]
[vishal: fix a conflict with DAX alignment check patches]
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>

0a70bd43

17 5月, 2016 4 次提交

block: Update blkdev_dax_capable() for consistency · a8078b1f

由 Toshi Kani 提交于 5月 10, 2016

blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
to remain as a separate interface for checking dax capability of
a raw block device.

Rename and relocate blkdev_dax_capable() to keep them maintained
consistently, and call bdev_direct_access() for the dax capability
check.

There is no change in the behavior.

Link: https://lkml.org/lkml/2016/5/9/950Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>

a8078b1f

block: Add bdev_dax_supported() for dax mount checks · 2d96afc8

由 Toshi Kani 提交于 5月 10, 2016

DAX imposes additional requirements to a device.  Add
bdev_dax_supported() which performs all the precondition checks
necessary for filesystem to mount the device with dax option.

Also add a new check to verify if a partition is aligned by 4KB.
When a partition is unaligned, any dax read/write access fails,
except for metadata update.
Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>

2d96afc8

block: Add vfs_msg() interface · 2af3a815

由 Toshi Kani 提交于 5月 10, 2016

In preparation of moving DAX capability checks to the block layer
from filesystem code, add a VFS message interface that aligns with
filesystem's message format.

For instance, a vfs_msg() message followed by XFS messages in case
of a dax mount error may look like:

  VFS (pmem0p1): error: unaligned partition for dax
  XFS (pmem0p1): DAX unsupported by block device. Turning off DAX.
  XFS (pmem0p1): Mounting V5 Filesystem
   :

vfs_msg() is largely based on ext4_msg().
Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>

2af3a815

dax: Remove complete_unwritten argument · 02fbd139

由 Jan Kara 提交于 5月 11, 2016

Fault handlers currently take complete_unwritten argument to convert
unwritten extents after PTEs are updated. However no filesystem uses
this anymore as the code is racy. Remove the unused argument.
Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>

02fbd139

02 5月, 2016 3 次提交

fs: simplify the generic_write_sync prototype · e2592217

由 Christoph Hellwig 提交于 4月 07, 2016

The kiocb already has the new position, so use that.  The only interesting
case is AIO, where we currently don't bother updating ki_pos.  We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e2592217

fs: add IOCB_SYNC and IOCB_DSYNC · dde0c2e7

由 Christoph Hellwig 提交于 4月 07, 2016

This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

dde0c2e7

direct-io: eliminate the offset argument to ->direct_IO · c8b8e32d

由 Christoph Hellwig 提交于 4月 07, 2016

Including blkdev_direct_IO and dax_do_io.  It has to be ki_pos to actually
work, so eliminate the superflous argument.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

c8b8e32d

05 4月, 2016 1 次提交

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf

由 Kirill A. Shutemov 提交于 4月 01, 2016

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

09cbfeaf

04 3月, 2016 1 次提交

block-dev: enable writeback cgroup support · 3684aa70

由 Shaohua Li 提交于 2月 22, 2016

block_dev's .writepages/.writepage already handles
wbc_init_bio/wbc_account_io. We only set the SB_I_CGROUPWB bit to
suppport writeback cgroup support.
Signed-off-by: NShaohua Li <shli@fb.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

3684aa70

28 2月, 2016 2 次提交

dax: move writeback calls into the filesystems · 7f6d5b52

由 Ross Zwisler 提交于 2月 26, 2016

Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

dax_writeback_mapping_range() needs a struct block_device, and it used
to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device.  This also fixes DAX code to properly flush caches in response
to sync(2).
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7f6d5b52

block: disable block device DAX by default · 03cdadb0

由 Dan Williams 提交于 2月 26, 2016

The recent *sync enabling discovered that we are inserting into the
block_device pagecache counter to the expectations of the dirty data
tracking for dax mappings.  This can lead to data corruption.

We want to support DAX for block devices eventually, but it requires
wider changes to properly manage the pagecache.

   dump_stack+0x85/0xc2
   dax_writeback_mapping_range+0x60/0xe0
   blkdev_writepages+0x3f/0x50
   do_writepages+0x21/0x30
   __filemap_fdatawrite_range+0xc6/0x100
   filemap_write_and_wait+0x4a/0xa0
   set_blocksize+0x70/0xd0
   sb_set_blocksize+0x1d/0x50
   ext4_fill_super+0x75b/0x3360
   mount_bdev+0x180/0x1b0
   ext4_mount+0x15/0x20
   mount_fs+0x38/0x170

Mark the support broken so its disabled by default, but otherwise still
available for testing.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: NDave Chinner <david@fromorbit.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

03cdadb0

06 2月, 2016 1 次提交

block: fix pfn_mkwrite() DAX fault handler · 9c5a05bc

由 Ross Zwisler 提交于 2月 05, 2016

Previously the pfn_mkwrite() fault handler for raw block devices called
bldev_dax_fault() -> __dax_fault() to do a full DAX page fault.

Really what the pfn_mkwrite() fault handler needs to do is call
dax_pfn_mkwrite() to make sure that the radix tree entry for the given
PTE is marked as dirty so that a follow-up fsync or msync call will
flush it durably to media.

Fixes: 5a023cdb ("block: enable dax for raw block devices")
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9c5a05bc

31 1月, 2016 1 次提交

block: revert runtime dax control of the raw block device · 9f4736fe

由 Dan Williams 提交于 1月 28, 2016

Dynamically enabling DAX requires that the page cache first be flushed
and invalidated.  This must occur atomically with the change of DAX mode
otherwise we confuse the fsync/msync tracking and violate data
durability guarantees.  Eliminate the possibilty of DAX-disabled to
DAX-enabled transitions for now and revisit this for the next cycle.

Cc: Jan Kara <jack@suse.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

9f4736fe

23 1月, 2016 2 次提交

dax: support dirty DAX entries in radix tree · f9fe48be

由 Ross Zwisler 提交于 1月 22, 2016

Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new
exceptional entries into the radix tree that represent dirty DAX PTE or
PMD pages.  These exceptional entries will also contain the writeback
addresses for the PTE or PMD faults that we can use at fsync/msync time.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  We rely
on the fact that only one type of exceptional entry can be found in a
given radix tree based on its usage.  This happens for free with DAX vs
shmem but we explicitly prevent shadow entries from being added to radix
trees for DAX mappings.

The only shadow entries that would be generated for DAX radix trees
would be to track zero page mappings that were created for holes.  These
pages would receive minimal benefit from having shadow entries, and the
choice to have only one type of exceptional entry in a given radix tree
makes the logic simpler both in clear_exceptional_entry() and in the
rest of DAX.
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f9fe48be

wrappers for ->i_mutex access · 5955102c

由 Al Viro 提交于 1月 22, 2016

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

5955102c

16 1月, 2016 2 次提交

dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() · b2e0d162

由 Dan Williams 提交于 1月 15, 2016

The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against the driver for the underlying
block device being disabled.  Use blk_queue_enter()/blk_queue_exit() to
hold off blk_cleanup_queue() from proceeding, or otherwise fail new
mapping requests if the request_queue is being torn down.

This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
through dax_map_atomic() to bdev_direct_access().

[willy@linux.intel.com: fix read() of a hole]
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b2e0d162

dax: guarantee page aligned results from bdev_direct_access() · fe683ada

由 Dan Williams 提交于 1月 15, 2016

If a ->direct_access() implementation ever returns a map count less than
PAGE_SIZE, catch the error in bdev_direct_access().  This simplifies
error checking in upper layers.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fe683ada

15 1月, 2016 2 次提交

fs/block_dev.c:bdev_write_page(): use blk_queue_enter(..., GFP_NOIO) · b832861c

由 Andrew Morton 提交于 1月 14, 2016

bdev_write_page() is used by swapout and by writepage where we cannot
use __GFP_FS or __GFP_IO.  So it is misleading to mention GFP_KERNEL
here.

blk_queue_enter() only actually looks at __GFP_DIRECT_RECLAIM, so no
bugs were harmed in the making of this patch.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b832861c

kmemcg: account certain kmem allocations to memcg · 5d097056

由 Vladimir Davydov 提交于 1月 14, 2016

Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5d097056

14 1月, 2016 1 次提交

block: use bd{grab,put}() instead of open-coding · ed8a9d2c

由 Ilya Dryomov 提交于 11月 20, 2015

- bd_acquire() and bd_forget() open-code bdgrab() and bdput()
- raw driver uses igrab() but never checks its return value and always
  holds another ref from bind_set() while calling it, so it's
  equivalent to bdgrab()
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

ed8a9d2c

09 1月, 2016 2 次提交

block: enable dax for raw block devices · 5a023cdb

由 Dan Williams 提交于 11月 30, 2015

If an application wants exclusive access to all of the persistent memory
provided by an NVDIMM namespace it can use this raw-block-dax facility
to forgo establishing a filesystem.  This capability is targeted
primarily to hypervisors wanting to provision persistent memory for
guests.  It can be disabled / enabled dynamically via the new BLKDAXSET
ioctl.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
Reviewed-by: NJan Kara <jack@suse.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

5a023cdb

block: introduce bdev_file_inode() · 4ebb16ca

由 Dan Williams 提交于 10月 28, 2015

Similar to the file_inode() helper, provide a helper to lookup the inode for a
raw block device itself.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: NJan Kara <jack@suse.cz>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

4ebb16ca

07 1月, 2016 1 次提交

fs: use gendisk->disk_name where possible · 424081f3

由 Dmitry Monakhov 提交于 4月 13, 2015

gendisk with part==0 is obviously gendisk->disk_name.
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

424081f3

05 12月, 2015 1 次提交

block: detach bdev inode from its wb in __blkdev_put() · 43d1c0eb

由 Ilya Dryomov 提交于 11月 20, 2015

Since 52ebea74 ("writeback: make backing_dev_info host
cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
gets attached to a wb (struct bdi_writeback).  Detaching happens on
evict, in inode_detach_wb() called from __destroy_inode(), and involves
updating wb.

However, detaching an internal bdev inode from its wb in
__destroy_inode() is too late.  Its bdi and by extension root wb are
embedded into struct request_queue, which has different lifetime rules
and can be freed long before the final bdput() is called (can be from
__fput() of a corresponding /dev inode, through dput() - evict() -
bd_forget().  bdevs hold onto the underlying disk/queue pair only while
opened; as soon as bdev is closed all bets are off.  In fact,
disk/queue can be gone before __blkdev_put() even returns:

1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
1500 {
...
1518         if (bdev->bd_contains == bdev) {
1519                 if (disk->fops->release)
1520                         disk->fops->release(disk, mode);

[ Driver puts its references to disk/queue ]

1521         }
1522         if (!bdev->bd_openers) {
1523                 struct module *owner = disk->fops->owner;
1524
1525                 disk_put_part(bdev->bd_part);
1526                 bdev->bd_part = NULL;
1527                 bdev->bd_disk = NULL;
1528                 if (bdev != bdev->bd_contains)
1529                         victim = bdev->bd_contains;
1530                 bdev->bd_contains = NULL;
1531
1532                 put_disk(disk);

[ We put ours, the queue is gone
  The last bdput() would result in a write to invalid memory ]

1533                 module_put(owner);
...
1539 }

Since bdev inodes are special anyway, detach them in __blkdev_put()
after clearing inode's dirty bits, turning the problematic
inode_detach_wb() in __destroy_inode() into a noop.

add_disk() grabs its disk->queue since 523e1d39 ("block: make
gendisk hold a reference to its queue"), so the old ->release comment
is removed in favor of the new inode_detach_wb() comment.

Cc: stable@vger.kernel.org # 4.2+, needs backporting
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Acked-by: NTejun Heo <tj@kernel.org>
Tested-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

43d1c0eb

02 12月, 2015 1 次提交

blk-mq: add a flags parameter to blk_mq_alloc_request · 6f3b0e8b

由 Christoph Hellwig 提交于 11月 26, 2015

We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t.  Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

6f3b0e8b

20 11月, 2015 1 次提交

block: protect rw_page against device teardown · 2e6edc95

由 Dan Williams 提交于 11月 19, 2015

Fix use after free crashes like the following:

 general protection fault: 0000 [#1] SMP
 Call Trace:
  [<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
  [<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem]
  [<ffffffff8128fd90>] bdev_read_page+0x50/0x60
  [<ffffffff812972f0>] do_mpage_readpage+0x510/0x770
  [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
  [<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50
  [<ffffffff81297657>] mpage_readpages+0x107/0x170
  [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
  [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
  [<ffffffff8129058d>] blkdev_readpages+0x1d/0x20
  [<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310
  [<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310
  [<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0
  [<ffffffff811c76f6>] filemap_fault+0x396/0x530
  [<ffffffff811f816e>] __do_fault+0x4e/0xf0
  [<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50

Cc: <stable@vger.kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reported-by: Nkbuild test robot <lkp@intel.com>
Acked-by: NMatthew Wilcox <willy@linux.intel.com>
[willy: symmetry fixups]
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

2e6edc95

12 11月, 2015 1 次提交

fs/block_dev.c: Remove WARN_ON() when inode writeback fails · dbd3ca50

由 Vivek Goyal 提交于 11月 09, 2015

If a block device is hot removed and later last reference to device
is put, we try to writeback the dirty inode. But device is gone and
that writeback fails.

Currently we do a WARN_ON() which does not seem to be the right thing.
Convert it to a ratelimited kernel warning.
Reported-by: NAndi Kleen <andi@firstfloor.org>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Acked-by: NTejun Heo <tj@kernel.org>
[jmoyer@redhat.com: get rid of unnecessary name initialization, 80 cols]
Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

dbd3ca50

22 10月, 2015 1 次提交

block: Inline blk_integrity in struct gendisk · 25520d55

由 Martin K. Petersen 提交于 10月 21, 2015

Up until now the_integrity profile has been dynamically allocated and
attached to struct gendisk after the disk has been made active.

This causes problems because NVMe devices need to register the profile
prior to the partition table being read due to a mandatory metadata
buffer requirement. In addition, DM goes through hoops to deal with
preallocating, but not initializing integrity profiles.

Since the integrity profile is small (4 bytes + a pointer), Christoph
suggested moving it to struct gendisk proper. This requires several
changes:

 - Moving the blk_integrity definition to genhd.h.

 - Inlining blk_integrity in struct gendisk.

 - Removing the dynamic allocation code.

 - Adding helper functions which allow gendisk to set up and tear down
   the integrity sysfs dir when a disk is added/deleted.

 - Adding a blk_integrity_revalidate() callback for updating the stable
   pages bdi setting.

 - The calls that depend on whether a device has an integrity profile or
   not now key off of the bi->profile pointer.

 - Simplifying the integrity support routines in DM (Mike Snitzer).
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reported-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

25520d55

16 9月, 2015 1 次提交

blockdev: don't set S_DAX for misaligned partitions · f0b2e563

由 Jeff Moyer 提交于 8月 14, 2015

The dax code doesn't currently support misaligned partitions,
so disable O_DIRECT via dax until such time as that support
materializes.

Cc: <stable@vger.kernel.org>
Suggested-by: NBoaz Harrosh <boaz@plexistor.com>
Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

f0b2e563

09 9月, 2015 1 次提交

dax: move DAX-related functions to a new header · c94c2acf

由 Matthew Wilcox 提交于 9月 08, 2015

In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
defined in linux/mm.h.  Given that we don't want to include <linux/mm.h>
in <linux/fs.h>, the easiest solution is to move the DAX-related
functions to a new header, <linux/dax.h>.  We could also have moved
VM_FAULT_* definitions to a new header, or a different header that isn't
quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
the best option.
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c94c2acf

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功