Commits · 7e6358d244e4706fe612a77b9c36519a33600ac0 · openanolis / cloud-kernel

04 Dec, 2017 2 commits

dm: fix various targets to dm_register_target after module __init resources created · 7e6358d2

Authored by monty_pavel@sina.com on Nov 25, 2017

A NULL pointer is seen if two concurrent "vgchange -ay -K <vg name>"
processes race to load the dm-thin-pool module:

 PID: 25992 TASK: ffff883cd7d23500 CPU: 4 COMMAND: "vgchange"
  #0 [ffff883cd743d600] machine_kexec at ffffffff81038fa9
  #1 [ffff883cd743d660] crash_kexec at ffffffff810c5992
  #2 [ffff883cd743d730] oops_end at ffffffff81515c90
  #3 [ffff883cd743d760] no_context at ffffffff81049f1b
  #4 [ffff883cd743d7b0] __bad_area_nosemaphore at ffffffff8104a1a5
  #5 [ffff883cd743d800] bad_area at ffffffff8104a2ce
  #6 [ffff883cd743d830] __do_page_fault at ffffffff8104aa6f
  #7 [ffff883cd743d950] do_page_fault at ffffffff81517bae
  #8 [ffff883cd743d980] page_fault at ffffffff81514f95
     [exception RIP: kmem_cache_alloc+108]
     RIP: ffffffff8116ef3c RSP: ffff883cd743da38 RFLAGS: 00010046
     RAX: 0000000000000004 RBX: ffffffff81121b90 RCX: ffff881bf1e78cc0
     RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000
     RBP: ffff883cd743da68 R8: ffff881bf1a4eb00 R9: 0000000080042000
     R10: 0000000000002000 R11: 0000000000000000 R12: 00000000000000d0
     R13: 0000000000000000 R14: 00000000000000d0 R15: 0000000000000246
     ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
  #9 [ffff883cd743da70] mempool_alloc_slab at ffffffff81121ba5
 #10 [ffff883cd743da80] mempool_create_node at ffffffff81122083
 #11 [ffff883cd743dad0] mempool_create at ffffffff811220f4
 #12 [ffff883cd743dae0] pool_ctr at ffffffffa08de049 [dm_thin_pool]
 #13 [ffff883cd743dbd0] dm_table_add_target at ffffffffa0005f2f [dm_mod]
 #14 [ffff883cd743dc30] table_load at ffffffffa0008ba9 [dm_mod]
 #15 [ffff883cd743dc90] ctl_ioctl at ffffffffa0009dc4 [dm_mod]

The race results in a NULL pointer because:

Process A (vgchange -ay -K):
 	a. send DM_LIST_VERSIONS_CMD ioctl;
 	b. pool_target not registered;
 	c. modprobe dm_thin_pool and wait until end.

Process B (vgchange -ay -K):
 	a. send DM_LIST_VERSIONS_CMD ioctl;
 	b. pool_target registered;
 	c. table_load->dm_table_add_target->pool_ctr;
 	d. _new_mapping_cache is NULL, so the kernel panics.

Note:
 	1. process A and process B are two concurrent processes.
 	2. pool_target can be detected by process B before
 	_new_mapping_cache initialization has finished.

To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
with the same problem, simply call dm_register_target() only after all
resources created during module init (as labelled with __init) have been
set up.
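
The ordering rule above can be illustrated with a small userspace sketch. It is not the kernel code: `new_mapping_cache`, `target_registered`, `pool_ctr_sketch()` and `thin_init_sketch()` are hypothetical stand-ins for `_new_mapping_cache`, the DM target registry, `pool_ctr` and the module's `__init` function.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the module cache and the DM target registry. */
static char cache_storage[64];
static void *new_mapping_cache;   /* plays the role of _new_mapping_cache */
static int target_registered;     /* nonzero once the target is visible */

/* Constructor path: only reachable once the target has been registered. */
static int pool_ctr_sketch(void)
{
    if (!target_registered)
        return -1;                      /* target not visible yet */
    return new_mapping_cache ? 0 : -2;  /* -2 models the NULL dereference */
}

/* Module init: set up every resource first, publish the target last. */
static int thin_init_sketch(void)
{
    new_mapping_cache = cache_storage;  /* resource creation comes first */
    target_registered = 1;              /* registration step goes last */
    return 0;
}
```

With this ordering a constructor call can never observe a registered target together with a NULL cache; swapping the two assignments in `thin_init_sketch()` would reopen the window described in the commit message.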

Cc: stable@vger.kernel.org
Signed-off-by: monty <monty_pavel@sina.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm table: fix regression from improper dm_dev_internal.count refcount_t conversion · afc567a4

Authored by Mike Snitzer on Nov 25, 2017

Multiple refcounts are needed if the device was already added. The
micro-optimization of setting the refcount to 1 when the device is first
added (rather than falling through to a common refcount_inc) lost sight
of the fact that the refcount_inc is also needed when the device already
exists and the mode need not be upgraded.
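
A minimal sketch of the intended control flow, using a plain counter instead of `refcount_t`. The struct and function names are hypothetical and the mode handling is reduced to a bitmask; only the shape of the branches follows the description above.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct dm_dev_internal. */
struct dev_entry {
    int in_table;        /* nonzero once the device has been added */
    int mode;            /* bitmask of access modes already granted */
    unsigned int count;  /* plays the role of the refcount_t */
};

/* Take exactly one reference per caller, whichever branch is taken. */
static void get_device_sketch(struct dev_entry *dd, int mode)
{
    if (!dd->in_table) {
        dd->in_table = 1;
        dd->mode = mode;
        dd->count = 1;       /* first user: set rather than increment */
        return;
    }
    if (dd->mode != (dd->mode | mode))
        dd->mode |= mode;    /* stands in for a mode upgrade */
    dd->count++;             /* needed with or without a mode upgrade */
}
```

The regression described above corresponds to performing the final increment only inside the mode-upgrade branch, which leaves a repeated add with an unchanged mode without its reference.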

Fixes: 2a0b4682 ("dm: convert dm_dev_internal.count from atomic_t to refcount_t")
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

25 Nov, 2017 4 commits

bcache: check return value of register_shrinker · 6c4ca1e3

Authored by Michael Lyle on Nov 24, 2017

register_shrinker is now __must_check, so check it to kill a warning.
Caller of bch_btree_cache_alloc in super.c appropriately checks return
value so this is fully plumbed through.

This V2 fixes checkpatch warnings and improves the commit description,
as I was too hasty getting the previous version out.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Vojtech Pavlik <vojtech@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: recover data from backing when data is clean · e393aa24

Authored by Rui Hua on Nov 24, 2017

When we send a read request and hit clean data in the cache device, there
is a situation called a cache read race in bcache (see the comment at the
tail of cache_lookup(); the following explanation is copied from there):
The bucket we're reading from might be reused while our bio is in flight,
and we could then end up reading the wrong data. We guard against this
by checking (in bch_cache_read_endio()) if the pointer is stale again;
if so, we treat it as an error (s->iop.error = -EINTR) and reread from
the backing device (but we don't pass that error up anywhere)

Note that a cache read race happens under normal circumstances, not only
when the SSD has failed; it is counted and shown in
/sys/fs/bcache/XXX/internal/cache_read_races.

Without this patch, in writeback mode we never reread from the backing
device when a cache read race happens until the whole cache device is
clean, because the condition
(s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
cached_dev_read_error(). In this situation s->iop.error (= -EINTR) is
passed up, and the user eventually receives -EINTR when the bio ends,
which is inappropriate and looks weird to the upper-layer application.

In this patch, we use s->read_dirty_data to judge whether the read
request hit dirty data in the cache device; it is safe to reread data
from the backing device when the read request hit clean data. This not
only handles the cache read race, but also recovers data when a read
request to the cache device fails.
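
The change in the retry rule can be sketched as two predicates. This is an illustration under assumptions: the struct is a hypothetical reduction of the request state, the `dc` NULL check from the quoted condition is omitted, and the exact form of the new condition is inferred from the description rather than copied from the patch.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of the per-request state named in the text. */
struct search_sketch {
    bool recoverable;      /* models s->recoverable */
    bool read_dirty_data;  /* models s->read_dirty_data */
};

/* Old rule from the text: retry only if the whole cache device is clean. */
static bool may_reread_old(const struct search_sketch *s, bool has_dirty)
{
    return s->recoverable && !has_dirty;
}

/* Assumed new rule: retry whenever this request hit only clean data. */
static bool may_reread_new(const struct search_sketch *s)
{
    return s->recoverable && !s->read_dirty_data;
}
```

For a recoverable request that hit clean data while the cache device still holds dirty data elsewhere, the old predicate refuses the reread and the new one allows it, which is the case the commit message is concerned with.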

[edited by mlyle to fix up whitespace, commit log title, comment
spelling]

Fixes: d59b2379 ("bcache: only permit to recovery read error when cache device is clean")
Cc: <stable@vger.kernel.org> # 4.14
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: Fix building error on MIPS · cf33c1ee

Authored by Huacai Chen on Nov 24, 2017

This patch tries to fix the build error on MIPS. The reason is that MIPS
has already defined the PTR macro, which conflicts with the PTR macro
in include/uapi/linux/bcache.h.

[fixed by mlyle: corrected a line-length issue]

Cc: stable@vger.kernel.org
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: add a comment in journal bucket reading · bb22cafd

Authored by Tang Junhui on Nov 24, 2017

The journal buckets form a circular buffer, so the valid buckets can
look like YYYNNNYY, which means the first valid journal is in the 7th
bucket and the latest valid journal is in the third bucket. In this
case, if we do not try the zero index first, we may get a valid journal
in the 7th bucket, and then we call
find_next_bit(bitmap, ca->sb.njournal_buckets, l + 1) to get the
first invalid bucket after the 7th bucket. Because all these buckets
are valid, there is no bit 1 in the bitmap, so find_next_bit() returns
ca->sb.njournal_buckets (8). After that, bcache only reads the journal
in the 7th and 8th buckets, and the first to third buckets are lost.

So it is important to let developers know that we need to try the zero
index first in the hash search, and to avoid breaking this in future
code modifications.

[ML: Fixed whitespace & formatting & file permissions]
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

17 Nov, 2017 5 commits

dm bufio: fix integer overflow when limiting maximum cache size · 74d4108d

Authored by Eric Biggers on Nov 15, 2017

The default max_cache_size_bytes for dm-bufio is meant to be the lesser
of 25% of the size of the vmalloc area and 2% of the size of lowmem.
However, on 32-bit systems the intermediate result in the expression

    (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100

overflows, causing the wrong result to be computed.  For example, on a
32-bit system where the vmalloc area is 520093696 bytes, the result is
1174405 rather than the expected 130023424, which makes the maximum
cache size much too small (far less than 2% of lowmem).  This causes
severe performance problems for dm-verity users on affected systems.

Fix this by using mult_frac() to correctly multiply by a percentage.  Do
this for all places in dm-bufio that multiply by a percentage.  Also
replace (VMALLOC_END - VMALLOC_START) with VMALLOC_TOTAL, which contrary
to the comment is now defined in include/linux/vmalloc.h.
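
The arithmetic can be reproduced in userspace with 32-bit unsigned integers. The sketch below is not the kernel's `mult_frac()` macro; `percent_mult_frac()` is a hypothetical helper that follows the same divide-first idea, and `percent_naive()` mirrors the overflowing expression.

```c
#include <stdint.h>

/* Naive percentage: the 32-bit product x * percent can wrap before the divide. */
static uint32_t percent_naive(uint32_t x, uint32_t percent)
{
    return x * percent / 100;
}

/* Divide first, then add back the contribution of the remainder. */
static uint32_t percent_mult_frac(uint32_t x, uint32_t percent)
{
    uint32_t quot = x / 100;
    uint32_t rem = x % 100;

    return quot * percent + rem * percent / 100;
}
```

For the vmalloc size quoted above, `percent_naive(520093696, 25)` yields 1174405 because 520093696 * 25 exceeds 2^32, while `percent_mult_frac(520093696, 25)` yields the expected 130023424.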

Depends-on: 9993bc63 ("sched/x86: Fix overflow in cyc2ns_offset")
Fixes: 95d402f0 ("dm: add bufio")
Cc: <stable@vger.kernel.org> # v3.2+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: clear all discard attributes in queue_limits when discards are disabled · 5d47c89f

Authored by Mike Snitzer on Nov 16, 2017

Otherwise, it can happen that the QUEUE_FLAG_DISCARD isn't set but the
various discard attributes (which get exposed via sysfs) may be set.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: do not set 'discards_supported' in targets that do not need it · 7dea378b

Authored by Mike Snitzer on Nov 16, 2017

The DM target's 'discards_supported' flag is intended to act as an
override.  Meaning, even if the underlying storage doesn't support
discards the DM target will.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: discard support requires all targets in a table support discards · 8a74d29d

Authored by Mike Snitzer on Nov 14, 2017

A DM device with a mix of discard capabilities (due to some underlying
devices not having discard support) _should_ just return -EOPNOTSUPP for
the region of the device that doesn't support discards (even if only by
way of the underlying driver formally not supporting discards).  BUT,
that does ask the underlying driver to handle something that it never
advertised support for.  In doing so we're exposing users to the
potential for an underlying disk driver hanging if/when a discard is
issued to a device that is incapable and never claimed to support
discards.

Fix this by requiring that each DM target in a DM table provide discard
support as a prereq for a DM device to advertise support for discards.

This may cause some configurations that were happily supporting discards
(even in the face of a mix of discard support) to stop supporting
discards -- but the risk of users hitting driver hangs, and forced
reboots, outweighs supporting those fringe mixed discard
configurations.
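
The "every target must support discards" rule is an all-of check over the table. A minimal sketch, assuming a hypothetical `target_sketch` struct with a single flag in place of the real DM table and target structures:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-target flag: does this target provide discard support? */
struct target_sketch {
    bool supports_discard;
};

/* A table advertises discards only if every one of its targets does. */
static bool table_supports_discards(const struct target_sketch *t, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!t[i].supports_discard)
            return false;
    return true;
}
```

A single target without discard support is enough to make the whole device stop advertising discards, which is the trade-off the commit message accepts.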

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm mpath: remove annoying message of 'blk_get_request() returned -11' · 9dc112e2

Authored by Ming Lei on Sep 30, 2017

It is very normal to see allocation failure, especially with blk-mq
request_queues, so it's unnecessary to report this error and annoy
people.

In practice this 'blk_get_request() returned -11' error gets logged
quite frequently when a blk-mq DM multipath device sees heavy IO.

This change is marked for stable@ because the annoying message in
question was included in stable@ commit 7083abbb.

Fixes: 7083abbb ("dm mpath: avoid that path removal can trigger an infinite loop")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

15 Nov, 2017 1 commit

md: Convert timers to use timer_setup() · 8376d3c1

Authored by Kees Cook on Oct 16, 2017

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
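
The pattern behind this conversion is that the callback receives the timer itself and recovers the enclosing object from it. The sketch below imitates that in userspace; every name ending in `_sketch` is hypothetical, and `from_timer_sketch()` takes a type argument, so its signature differs from the kernel's `from_timer()`.

```c
#include <stddef.h>

/* Hypothetical miniature of struct timer_list: the callback gets the timer. */
struct timer_sketch {
    void (*function)(struct timer_sketch *t);
};

/* A driver object that embeds its timer. */
struct device_sketch {
    int ticks;
    struct timer_sketch timer;
};

/* container_of-style recovery of the enclosing object from the member. */
#define from_timer_sketch(type, ptr, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Callback: no opaque data argument, only the timer pointer. */
static void device_timeout(struct timer_sketch *t)
{
    struct device_sketch *dev =
        from_timer_sketch(struct device_sketch, t, timer);

    dev->ticks++;
}

/* Stand-in for timer_setup(): just records the callback. */
static void timer_setup_sketch(struct timer_sketch *t,
                               void (*fn)(struct timer_sketch *))
{
    t->function = fn;
}
```

Because the timer is embedded in the object it belongs to, no separate data pointer has to be stored or cast in the callback.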

Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: linux-bcache@vger.kernel.org
Cc: linux-raid@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

11 Nov, 2017 24 commits

dm cache: lift common migration preparation code to alloc_migration() · ef7afb36

Authored by Mike Snitzer on Nov 09, 2017

Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache: remove usused deferred_cells member from struct cache · ede6507d

Authored by Joe Thornber on Nov 09, 2017

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache policy smq: allocate cache blocks in order · 9768a10d

Authored by Joe Thornber on Nov 09, 2017

Previously, cache blocks were being allocated in reverse order.  Fix
this by pulling the block off the head of the free list.

Shouldn't have any impact on performance or latency but it is more
correct to have the cache blocks allocated/mapped in ascending order.
This fix will slightly increase the chances of two adjacent oblocks
being in adjacent cblocks.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache policy smq: change max background work from 10240 to 4096 blocks · 8ee18ede

Authored by Joe Thornber on Nov 09, 2017

10240 blocks was too much; lowering this reduces the latency of copying
and consumes less memory.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache background tracker: limit amount of background work that may be issued at once · 64748b16

Authored by Joe Thornber on Nov 08, 2017

On large systems the cache policy can be over enthusiastic and queue far
too much dirty data to be written back.  This consumes memory.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache policy smq: take origin idle status into account when queuing writebacks · deb71918

Authored by Joe Thornber on Nov 08, 2017

If the origin device is idle, try to write back more data.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache policy smq: handle races with queuing background_work · 1e72a8e8

Authored by Joe Thornber on Nov 08, 2017

The background_tracker holds a set of promotions/demotions that the
cache policy wishes the core target to implement.

When adding a new operation to the tracker it's possible that an
operation on the same block is already present (but in practise this
doesn't appear to be happening).  Catch these situations and do the
appropriate cleanup.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm raid: fix panic when attempting to force a raid to sync · 23397844

Authored by Heinz Mauelshagen on Nov 02, 2017

Requesting a sync on an active raid device via a table reload
(see 'sync' parameter in Documentation/device-mapper/dm-raid.txt)
skips the super_load() call that defines the superblock size
(rdev->sb_size) -- resulting in an oops if/when super_sync()->memset()
is called.

Fix by moving the initialization of the superblock start and size
out of super_load() to the caller (analyse_superblocks).
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: allow unaligned bv_offset · 95b1369a

Authored by Mikulas Patocka on Nov 07, 2017

When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
this unaligned memory for its buffers (if an unaligned buffer crosses a
page, XFS frees it and allocates a full page instead - see the function
xfs_buf_allocate_memory).

dm-integrity checks if bv_offset is aligned on page size and this check
fails with slub_debug and XFS.

Fix this bug by removing the bv_offset check, leaving only the check for
bv_len.

Fixes: 7eada909 ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Bruno Prémont <bonbons@sysophe.eu>
Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm crypt: allow unaligned bv_offset · 0440d5c0

Authored by Mikulas Patocka on Nov 07, 2017

When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
this unaligned memory for its buffers (if an unaligned buffer crosses a
page, XFS frees it and allocates a full page instead - see the function
xfs_buf_allocate_memory).

dm-crypt checks if bv_offset is aligned on page size and these checks
fail with slub_debug and XFS.

Fix this bug by removing the bv_offset checks. Switch to checking if
bv_len is aligned instead of bv_offset (this check should be sufficient
to prevent overruns if a bio with too small bv_len is received).
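
A length-only check of the kind described above might look like the following sketch. The struct is a hypothetical two-field reduction of a bio vector, and `sector_size` is assumed to be a power of two; the real checks in dm-crypt and dm-integrity use their own size fields.

```c
#include <stdbool.h>

/* Hypothetical, reduced bio vector: only the two fields discussed above. */
struct bvec_sketch {
    unsigned int bv_len;
    unsigned int bv_offset;
};

/* Accept any bv_offset; only require a whole number of sectors in bv_len. */
static bool bvec_ok(const struct bvec_sketch *bv, unsigned int sector_size)
{
    /* sector_size is assumed to be a power of two */
    return (bv->bv_len & (sector_size - 1)) == 0;
}
```

A buffer that starts at an odd offset inside a page still passes as long as its length covers complete sectors, which is what the slub_debug plus XFS case needs.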

Fixes: 8f0009a2 ("dm crypt: optionally support larger encryption sector size")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Bruno Prémont <bonbons@sysophe.eu>
Tested-by: Bruno Prémont <bonbons@sysophe.eu>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: small cleanup in dm_get_md() · 49de5769

Authored by Mike Snitzer on Nov 06, 2017

Makes dm_get_md() and dm_get_from_kobject() have similar code.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: fix race between dm_get_from_kobject() and __dm_destroy() · b9a41d21

Authored by Hou Tao on Nov 01, 2017

The following BUG_ON was hit when testing repeated creation and removal of
DM devices:

    kernel BUG at drivers/md/dm.c:2919!
    CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
    Call Trace:
     [<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a
     [<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e
     [<ffffffff817b46d1>] ? mutex_lock+0x26/0x44
     [<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf
     [<ffffffff811de257>] kernfs_seq_show+0x23/0x25
     [<ffffffff81199118>] seq_read+0x16f/0x325
     [<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f
     [<ffffffff8117b625>] __vfs_read+0x26/0x9d
     [<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44
     [<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9
     [<ffffffff8117be9d>] vfs_read+0x8f/0xcf
     [<ffffffff81193e34>] ? __fdget_pos+0x12/0x41
     [<ffffffff8117c686>] SyS_read+0x4b/0x76
     [<ffffffff817b606e>] system_call_fastpath+0x12/0x71

The bug can be easily triggered, if an extra delay (e.g. 10ms) is added
between the test of DMF_FREEING & DMF_DELETING and dm_get() in
dm_get_from_kobject().

To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
dm_get() are done in an atomic way, so _minor_lock is used.

The other callers of dm_get() have also been checked to be OK: some
callers invoke dm_get() under _minor_lock, some callers invoke it under
_hash_lock, and dm_start_request() invokes it after increasing
md->open_count.
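
The fix boils down to testing the flags and taking the reference inside one critical section. The userspace sketch below uses a pthread mutex where the kernel uses `_minor_lock`; the struct, its fields and both functions are hypothetical simplifications, and the assumption that teardown sets its flag under the same lock is part of the sketch.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical miniature of struct mapped_device. */
struct md_sketch {
    int freeing;   /* models DMF_FREEING / DMF_DELETING */
    int holders;   /* models the count that dm_get() bumps */
};

static pthread_mutex_t minor_lock = PTHREAD_MUTEX_INITIALIZER;

/* Test the flag and take the reference as one step under the lock. */
static struct md_sketch *get_md_sketch(struct md_sketch *md)
{
    pthread_mutex_lock(&minor_lock);
    if (md->freeing)
        md = NULL;
    else
        md->holders++;
    pthread_mutex_unlock(&minor_lock);
    return md;
}

/* Teardown sets the flag under the same lock, so it cannot interleave. */
static void destroy_md_sketch(struct md_sketch *md)
{
    pthread_mutex_lock(&minor_lock);
    md->freeing = 1;
    pthread_mutex_unlock(&minor_lock);
}
```

Without the lock, a delay between the flag test and the increment lets teardown slip in between them, which is the window the 10ms experiment above widens.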

Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: allocate struct mapped_device with kvzalloc · 856eb091

Authored by Mikulas Patocka on Oct 31, 2017

The structure srcu_struct can be very big, its size is proportional to the
value CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS 8192, the field
io_barrier in the struct mapped_device has 84kB in the debugging kernel
and 50kB in the non-debugging kernel. The large size may result in failure
of the function kzalloc_node.

In order to avoid the allocation failure, we use the function
kvzalloc_node, this function falls back to vmalloc if a large contiguous
chunk of memory is not available. This patch also moves the field
io_barrier to the last position of struct mapped_device - the reason is
that on many processor architectures, short memory offsets result in
smaller code than long memory offsets - on x86-64 it reduces code size by
320 bytes.

Note to stable kernel maintainers - the kernels 4.11 and older don't have
the function kvzalloc_node, you can use the function vzalloc_node instead.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm zoned: ignore last smaller runt zone · 114e0259

Authored by Damien Le Moal on Oct 28, 2017

The SCSI layer allows ZBC drives to have a smaller last runt zone. For
such a device, specifying the entire capacity for a dm-zoned target
table entry fails because the specified capacity is not aligned on a
device zone size indicated in the request queue structure of the
device.

Fix this problem by ignoring the last runt zone in the entry length
when setting up the dm-zoned target (ctr method) and when iterating table
entries of the target (iterate_devices method). This allows dm-zoned
users to still easily setup a target using the entire device capacity
(as mandated by dm-zoned) or the aligned capacity excluding the last
runt zone.
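
Ignoring the runt zone amounts to rounding the capacity down to a multiple of the zone size. A minimal sketch, assuming sizes in sectors and a power-of-two zone size; both helper names are hypothetical and not taken from dm-zoned:

```c
#include <stdint.h>

/* Drop a trailing partial ("runt") zone; zone_sectors is assumed a power of two. */
static uint64_t aligned_capacity(uint64_t capacity, uint64_t zone_sectors)
{
    return capacity & ~(zone_sectors - 1);
}

/* Accept either the full capacity or the capacity without the runt zone. */
static int target_len_ok(uint64_t len, uint64_t capacity, uint64_t zone_sectors)
{
    return len == capacity || len == aligned_capacity(capacity, zone_sectors);
}
```

For example, a device of 1000 sectors with 256-sector zones has an aligned capacity of 768 sectors, and a table entry of either 1000 or 768 sectors would be accepted by this check.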

While at it, replace direct references to the device queue chunk_sectors
limit with calls to the accessor blk_queue_zone_sectors().
Reported-by: Peter Desnoyers <pjd@ccs.neu.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm space map metadata: use ARRAY_SIZE · fbc61291

Authored by Jérémy Lefaure on Oct 01, 2017

Using the ARRAY_SIZE macro improves the readability of the code.

Found with Coccinelle with the following semantic patch:
@r depends on (org || report)@
type T;
T[] E;
position p;
@@
(
 (sizeof(E)@p /sizeof(*E))
|
 (sizeof(E)@p /sizeof(E[...]))
|
 (sizeof(E)@p /sizeof(T))
)
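
For readers unfamiliar with the macro, the semantic patch above rewrites open-coded `sizeof` divisions into the form sketched here. `ARRAY_SIZE_SKETCH` is a userspace approximation without the kernel macro's compile-time type check, and the `thresholds` array is made up for the example.

```c
#include <stddef.h>

/* Userspace approximation of the kernel's ARRAY_SIZE(), minus its type check. */
#define ARRAY_SIZE_SKETCH(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Made-up data for the example. */
static const int thresholds[] = { 1, 2, 4, 8, 16 };

/* Sum every element without hard-coding the element count. */
static int sum_thresholds(void)
{
    int sum = 0;

    for (size_t i = 0; i < ARRAY_SIZE_SKETCH(thresholds); i++)
        sum += thresholds[i];
    return sum;
}
```

The loop bound follows the array definition automatically, so adding an element to `thresholds` needs no other change.
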
Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm log writes: add support for DAX · 98d82f48

Authored by Ross Zwisler on Oct 19, 2017

Now that we have the ability to log filesystem writes using a flat buffer, add
support for DAX.

The motivation for this support is the need for an xfstest that can test
the new MAP_SYNC DAX flag.  By logging the filesystem activity with
dm-log-writes we can show that the MAP_SYNC page faults are writing out
their metadata as they happen, instead of requiring an explicit
msync/fsync.

Unfortunately we can't easily track data that has been written via
mmap() now that the dax_flush() abstraction was removed by commit
c3ca015f ("dax: remove the pmem_dax_ops->flush abstraction").
Otherwise we could just treat each flush as a big write, and store the
data that is being synced to media.  It may be worthwhile to add the
dax_flush() entry point back, just as a notifier so we can do this
logging.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm log writes: add support for inline data buffers · e5a20660

Authored by Ross Zwisler on Oct 19, 2017

Currently dm-log-writes supports writing filesystem data via BIOs, and
writing internal metadata from a flat buffer via write_metadata().

For DAX writes, though, we won't have a BIO, but will instead have an
iterator that we'll want to use to fill a flat data buffer.

So, create write_inline_data() which allows us to write filesystem data
using a flat buffer as a source, and wire it up in log_one_block().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache: simplify get_per_bio_data() by removing data_size argument · 693b960e

Authored by Mike Snitzer on Oct 19, 2017

There is only one per_bio_data size now that writethrough-specific data
was removed from the per_bio_data structure.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache: remove all obsolete writethrough-specific code · 9958f1d9

Authored by Mike Snitzer on Oct 19, 2017

Now that the writethrough code is much simpler there is no need to track
so much state or cascade bio submission (as was done, via
writethrough_endio(), to issue origin then cache IO in series).

As such the obsolete writethrough list and workqueue are also removed.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache: submit writethrough writes in parallel to origin and cache · 2df3bae9

Authored by Mike Snitzer on Oct 19, 2017

Discontinue issuing writethrough write IO in series to the origin and
then cache.

Use bio_clone_fast() to create a new origin clone bio that will be
mapped to the origin device and then bio_chain() it to the bio that gets
remapped to the cache device.  The origin clone bio does _not_ have a
copy of the per_bio_data -- as such check_if_tick_bio_needed() will not
be called.

The cache bio (parent bio) will not complete until the origin bio has
completed -- this fulfills bio_clone_fast()'s requirements as well as
the requirement to not complete the original IO until the write IO has
completed to both the origin and cache device.
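
The completion ordering that chaining provides can be modelled with a small counter. This is a userspace illustration only: `bio_sketch` and its helpers are hypothetical and merely imitate the idea that a parent holds one extra pending count per chained child.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature bio: completes only when 'remaining' drops to zero. */
struct bio_sketch {
    int remaining;
    bool done;
    struct bio_sketch *parent;
};

static void bio_init_sketch(struct bio_sketch *b)
{
    b->remaining = 1;
    b->done = false;
    b->parent = NULL;
}

/* Models chaining: the parent now also waits for the child. */
static void bio_chain_sketch(struct bio_sketch *child, struct bio_sketch *parent)
{
    child->parent = parent;
    parent->remaining++;
}

/* Models completion: finishing a child drops one count on its parent. */
static void bio_endio_sketch(struct bio_sketch *b)
{
    if (--b->remaining > 0)
        return;
    b->done = true;
    if (b->parent)
        bio_endio_sketch(b->parent);
}
```

With the origin clone chained to the cache bio, the cache bio is reported complete only after both writes have finished, regardless of which device finishes first.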
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache: pass cache structure to mode functions · 8e3c3827

Authored by Mike Snitzer on Oct 19, 2017

No functional changes, just a bit cleaner than passing cache_features
structure.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache: fix race condition in the writeback mode overwrite_bio optimisation · d1260e2a

Authored by Joe Thornber on Nov 10, 2017

When a DM cache in writeback mode moves data between the slow and fast
device it can often avoid a copy if the triggering bio either:

i) covers the whole block (no point copying if we're about to overwrite it)
ii) the migration is a promotion and the origin block is currently discarded

Prior to this fix there was a race with case (ii).  The discard status
was checked with a shared lock held (rather than exclusive).  This meant
another bio could run in parallel and write data to the origin, removing
the discard state.  After the promotion the parallel write would have
been lost.

With this fix the discard status is re-checked once the exclusive lock
has been acquired.  If the block is no longer discarded it falls back to
the slower full copy path.

Fixes: b29d4986 ("dm cache: significant rework to leverage dm-bio-prison-v2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

md: free unused memory after bitmap resize · 0868b99c

Authored by Zdenek Kabelac on Nov 08, 2017

When a bitmap is resized, the old kmalloc'ed chunks are not released
once the resized bitmap starts to use the new space.

This fixes in particular kmemleak reports like this one:

unreferenced object 0xffff8f4311e9c000 (size 4096):
  comm "lvm", pid 19333, jiffies 4295263268 (age 528.265s)
  hex dump (first 32 bytes):
    02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80  ................
    02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80  ................
  backtrace:
    [<ffffffffa69471ca>] kmemleak_alloc+0x4a/0xa0
    [<ffffffffa628c10e>] kmem_cache_alloc_trace+0x14e/0x2e0
    [<ffffffffa676cfec>] bitmap_checkpage+0x7c/0x110
    [<ffffffffa676d0c5>] bitmap_get_counter+0x45/0xd0
    [<ffffffffa676d6b3>] bitmap_set_memory_bits+0x43/0xe0
    [<ffffffffa676e41c>] bitmap_init_from_disk+0x23c/0x530
    [<ffffffffa676f1ae>] bitmap_load+0xbe/0x160
    [<ffffffffc04c47d3>] raid_preresume+0x203/0x2f0 [dm_raid]
    [<ffffffffa677762f>] dm_table_resume_targets+0x4f/0xe0
    [<ffffffffa6774b52>] dm_resume+0x122/0x140
    [<ffffffffa6779b9f>] dev_suspend+0x18f/0x290
    [<ffffffffa677a3a7>] ctl_ioctl+0x287/0x560
    [<ffffffffa677a693>] dm_ctl_ioctl+0x13/0x20
    [<ffffffffa62d6b46>] do_vfs_ioctl+0xa6/0x750
    [<ffffffffa62d7269>] SyS_ioctl+0x79/0x90
    [<ffffffffa6956d41>] entry_SYSCALL_64_fastpath+0x1f/0xc2
Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md: release allocated bitset sync_set · 0202ce8a

Authored by Zdenek Kabelac on Nov 08, 2017

This patch fixes a kmemleak report on the md_stop() path, which is likely
used only by the dm-raid wrapper. The md code uses mddev_put(), where both
bitsets are released, but this freeing is not shared with md_stop().

Also set the bio_set and sync_set pointers to NULL, just as mddev_put()
does.
Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>

09 Nov, 2017 2 commits

md/bitmap: clear BITMAP_WRITE_ERROR bit before writing it to sb · 97f0eb9f

Authored by Hou Tao on Nov 06, 2017

For a RAID1 device using a file-based bitmap, if a bitmap write error
occurs but the later writes succeed, it's possible that both the
BITMAP_STALE and BITMAP_WRITE_ERROR bits will be written to the bitmap
super block. The BITMAP_STALE bit will be handled properly and cleared,
but the BITMAP_WRITE_ERROR bit in sb->flags will make bitmap_create() fail.

So clear it to protect against the write failure-and-then-recovery case.
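
In code terms the fix is a mask applied before the flags reach the on-disk superblock. The bit positions below are made up for the sketch and are not the values used by the md bitmap code:

```c
/* Hypothetical flag bits; the real definitions live in the md bitmap code. */
enum {
    SKETCH_BITMAP_STALE       = 1u << 1,
    SKETCH_BITMAP_WRITE_ERROR = 1u << 2
};

/* Flags as they should be stored on disk: never persist the write-error bit. */
static unsigned int flags_for_superblock(unsigned int in_memory_flags)
{
    return in_memory_flags & ~(unsigned int)SKETCH_BITMAP_WRITE_ERROR;
}
```

The in-memory state can still record the error, while the next assembly of the array reads a superblock without it.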
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>

md: be cautious about using ->curr_resync_completed for ->recovery_offset · db0505d3

Authored by NeilBrown on Oct 17, 2017

The ->recovery_offset shows how much of a non-InSync device is actually
in sync - how much has been recovered.

When performing a recovery, ->curr_resync and ->curr_resync_completed
follow the device address being recovered and so can be used to update
->recovery_offset.

When performing a reshape, ->curr_resync* might follow the device
addresses (raid5) or might follow array addresses (raid10), so cannot
in general be used to set ->recovery_offset.  When reshaping backwards,
->curr_resync* measures from the *end* of the array-or-device, so is
particularly unhelpful.

So change the common code in md.c to only use ->curr_resync_completed
for the simple recovery case, and add code to raid5.c to update
->recovery_offset during a forwards reshape.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

03 Nov, 2017 1 commit

dm: move dm-verity to generic async completion · 12f1ffc4

Authored by Gilad Ben-Yossef on Oct 18, 2017

dm-verity is starting async. crypto ops and waiting for them to complete.
Move it over to generic code doing the same.

This also avoids a future potential data corruption bug created
by the use of wait_for_completion_interruptible() without dealing
correctly with an interrupt aborting the wait prior to the
async op finishing, should this code ever move to a context
where signals are not masked.
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
CC: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

02 Nov, 2017 1 commit

License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318

Authored by Greg Kroah-Hartman on Nov 01, 2017

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information in it,
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging were:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note"; otherwise it was "GPL-2.0".  The results were:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.
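
The heuristics above can be read as a small decision procedure. The following is only a sketch of that reading, not the tooling actually used: the function names, the `(license, needs_manual_review)` return shape, and the use of `None` for "no license detected" are all assumptions made for illustration. It also leaves out the step where an already-licensed */uapi/* file gets the Linux-syscall-note appended.

```python
from typing import Optional, Tuple

DEFAULT_LICENSE = "GPL-2.0"          # license of the top level COPYING file
SYSCALL_NOTE = "Linux-syscall-note"  # exception applied to */uapi/* files


def is_uapi(path: str) -> bool:
    """True if any directory component of the path is 'uapi'."""
    return "/uapi/" in "/" + path


def conclude_license(
    path: str, scan_a: Optional[str], scan_b: Optional[str]
) -> Tuple[Optional[str], bool]:
    """Return (concluded_license, needs_manual_review).

    scan_a and scan_b are the results of the two independent scanners;
    None means that scanner found no license traces (hypothetical encoding).
    """
    if scan_a is None and scan_b is None:
        # No license traces at all: the default kernel license applies,
        # with the syscall note for */uapi/* paths.
        if is_uapi(path):
            return f"{DEFAULT_LICENSE} WITH {SYSCALL_NOTE}", False
        return DEFAULT_LICENSE, False
    if scan_a == scan_b:
        # Both scanners agree: that becomes the concluded license.
        return scan_a, False
    # Disagreement: a person has to inspect the file.
    return None, True
```

Returning a review flag instead of guessing mirrors the process described above, where any disagreement between the scanners went to manual inspection rather than being resolved automatically.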

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.
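
To illustrate the header versus source distinction, here is a minimal sketch of the tagging step. It is not the script Thomas and Greg wrote: the two-column CSV layout (path, license identifier), the function names, and the example paths are assumptions. The comment styles follow the kernel convention of `//` for .c files and `/* */` for .h files.

```python
import csv
import io
import os


def spdx_line(path: str, license_id: str) -> str:
    """Build the SPDX tag line in the comment style the file type expects."""
    ext = os.path.splitext(path)[1]
    if ext == ".c":
        # C source files take a C++ style comment.
        return f"// SPDX-License-Identifier: {license_id}"
    if ext == ".h":
        # Header files take a C style comment.
        return f"/* SPDX-License-Identifier: {license_id} */"
    # Other file types are deliberately left out of this sketch.
    raise ValueError(f"unhandled file type: {path}")


def tags_from_csv(csv_text: str) -> dict:
    """Map each path in a 'path,license' CSV (assumed layout) to its tag line."""
    reader = csv.reader(io.StringIO(csv_text))
    return {path: spdx_line(path, license_id) for path, license_id in reader}
```

For example, feeding it the rows `foo/bar.c,GPL-2.0` and `include/uapi/foo.h,GPL-2.0 WITH Linux-syscall-note` (made-up paths) gives a `//` tag for the first file and a `/* */` tag for the second.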
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

b2441318

openanolis / cloud-kernel synced successfully over 1 year ago
