Commits · 3b0e6aacbfe04fa144c4732f269b09ce91177566 · openanolis / cloud-kernel

04 Mar, 2015 2 commits

md/bitmap: use sector_div for sector_t divisions · 3b0e6aac

Committed by Stephen Rothwell on Mar 03, 2015

neilb: modified to not corrupt ->resync_max_sectors.

sector_div usage fixed by Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>

3b0e6aac

md/bitmap: fix incorrect DIV_ROUND_UP usage. · 935f3d4f

Committed by NeilBrown on Mar 02, 2015

DIV_ROUND_UP doesn't work on "long long", and it should be
sector_t anyway.
Signed-off-by: NeilBrown <neilb@suse.de>

935f3d4f
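
The two md/bitmap commits above both concern rounding-up division on sector counts. As a rough userspace sketch (not the kernel code itself), the idea can be modelled as below. `model_sector_div` is a hypothetical stand-in that imitates the kernel's `sector_div()` convention of dividing in place and returning the remainder; `sectors_to_chunks` and the 64-bit `sector_t` typedef are likewise assumptions made for illustration. Working on a local copy is what keeps the caller's value (for example a field such as `->resync_max_sectors`) from being overwritten.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;   /* assumption: sectors are 64-bit in this sketch */

/* Hypothetical userspace stand-in for sector_div(): divides *n in place
 * and returns the remainder. */
static uint32_t model_sector_div(sector_t *n, uint32_t base)
{
    assert(base != 0);
    uint32_t rem = (uint32_t)(*n % base);
    *n /= base;
    return rem;
}

/* Round-up division performed on a copy, so the caller's sector count
 * is left untouched. */
static sector_t sectors_to_chunks(sector_t sectors, uint32_t chunk_sectors)
{
    sector_t chunks = sectors;          /* copy: the division is destructive */
    if (model_sector_div(&chunks, chunk_sectors))
        chunks++;                       /* a partial chunk still counts */
    return chunks;
}
```

Calling `sectors_to_chunks(1000, 128)` yields 8 chunks, since seven full chunks plus a partial one are needed.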

25 Feb, 2015 1 commit

md: fix error paths from bitmap_create. · ba599aca

Committed by NeilBrown on Feb 25, 2015

A recent change to bitmap_create mishandles errors.
In particular, a failure doesn't always cause 'err' to be set.
Signed-off-by: NeilBrown <neilb@suse.de>

ba599aca

23 Feb, 2015 23 commits

Add new disk to clustered array · 1aee41f6

Committed by Goldwyn Rodrigues on Oct 29, 2014

Algorithm:
1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY, which issues
   ioctl(ADD_NEW_DISK) with disc.state set to MD_DISK_CLUSTER_ADD.
2. Node 1 sends NEWDISK with the uuid and slot number.
3. Other nodes issue kobject_uevent_env with the uuid and slot number.
   (Steps 4 and 5 could be a udev rule.)
4. In userspace, each node searches for the disk, perhaps
   using blkid -t SUB_UUID=""
5. Other nodes issue one of the following, depending on whether the disk
   was found:
   ioctl(ADD_NEW_DISK) with disc.state set to MD_DISK_CANDIDATE and
   disc.number set to the slot number
   ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop their lock on no-new-devs (CR) if the device is found.
7. Node 1 attempts an EX lock on no-new-devs.
8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
   as SpareLocal.
9. If it does not get the no-new-devs lock, it fails the operation and sends
   METADATA_UPDATED.
10. Other nodes learn whether the device was added by re-reading the
    superblock after receiving the METADATA_UPDATED message.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

1aee41f6

Read from the first device when an area is resyncing · 7d49ffcf

Committed by Goldwyn Rodrigues on Aug 12, 2014

Set choose_first to true for cluster reads in read balance when the area
is resyncing.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

7d49ffcf

Suspend writes in RAID1 if within range · 589a1c49

Committed by Goldwyn Rodrigues on Jun 07, 2014

If there is a resync going on, all nodes must suspend writes to the
range. This is recorded in the suspend_info/suspend_list.

If there is an I/O within the ranges of any of the suspend_info,
should_suspend will return 1.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

589a1c49
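
The range test described in this commit can be pictured with a small userspace sketch. The struct `suspend_range`, the function `write_must_suspend` and the half-open `[lo, hi)` convention are all assumptions made for illustration; the kernel's suspend_info and should_suspend may differ in layout and boundary handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one suspend_info entry. */
struct suspend_range {
    int slot;       /* slot of the node that is resyncing */
    uint64_t lo;    /* first sector of the resync window */
    uint64_t hi;    /* one past the last sector (assumed half-open) */
};

/* Return 1 if the write [lo, hi) overlaps any recorded range, else 0. */
static int write_must_suspend(const struct suspend_range *list, size_t n,
                              uint64_t lo, uint64_t hi)
{
    assert(lo <= hi);
    for (size_t i = 0; i < n; i++) {
        /* Two half-open intervals overlap when each starts before
         * the other one ends. */
        if (lo < list[i].hi && list[i].lo < hi)
            return 1;
    }
    return 0;
}
```

With ranges 100..200 and 1000..2000 recorded, a write to sectors 150..160 would have to wait, while a write to 200..300 would proceed.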

Resync start/Finish actions · e59721cc

Committed by Goldwyn Rodrigues on Jun 07, 2014

When a RESYNC_START message arrives, the node removes the entry
with the current slot number and adds the range to the
suspend_list.

Similarly, when a RESYNC_FINISHED message is received, the node clears
the entry with respect to the bitmap number.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

e59721cc

Send RESYNCING while performing resync start/stop · 965400eb

Committed by Goldwyn Rodrigues on Jun 07, 2014

When a resync is initiated, a RESYNCING message is sent to all active
nodes with the range (lo,hi). When the resync is over, a RESYNCING
message is sent with (0,0). A high sector value of zero indicates
that the resync is over.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

965400eb
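
The receive side of the RESYNCING convention (replace the sender's previous range, treat a high value of zero as "finished") might look roughly like the following. This is a sketch under assumptions: `resync_table`, `apply_resyncing`, `slot_is_resyncing` and the fixed `MAX_SLOTS` are invented names, and a per-slot array stands in for the suspend_list.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SLOTS 8   /* assumption: small fixed cluster size for the sketch */

/* One entry per slot; hi == 0 means no resync is in progress there. */
struct resync_table {
    uint64_t lo[MAX_SLOTS];
    uint64_t hi[MAX_SLOTS];
};

/* Apply a RESYNCING(lo, hi) message received from `slot`.
 * Any earlier range from the same slot is replaced; (0, 0) clears it. */
static void apply_resyncing(struct resync_table *t, int slot,
                            uint64_t lo, uint64_t hi)
{
    assert(slot >= 0 && slot < MAX_SLOTS);
    if (hi == 0) {          /* resync finished on the sending node */
        t->lo[slot] = 0;
        t->hi[slot] = 0;
        return;
    }
    t->lo[slot] = lo;
    t->hi[slot] = hi;
}

/* Return nonzero while `slot` has an active resync range recorded. */
static int slot_is_resyncing(const struct resync_table *t, int slot)
{
    assert(slot >= 0 && slot < MAX_SLOTS);
    return t->hi[slot] != 0;
}
```

Keeping at most one range per slot mirrors the "remove the entry with the current slot number, then add the new range" behaviour described for RESYNC_START.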

Reload superblock if METADATA_UPDATED is received · 1d7e3e96

Committed by Goldwyn Rodrigues on Jun 07, 2014

Re-reads the devices by invalidating the cache.
Since we don't write to faulty devices, a faulty device is detected using
the events recorded in the devices. If a device's event count is old
compared to the mddev's, it is marked as faulty.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

1d7e3e96

metadata_update sends message to other nodes · 293467aa

Committed by Goldwyn Rodrigues on Jun 07, 2014

   - request to send a message
   - make changes to the superblock
   - send messages telling everyone that the superblock has changed
   - other nodes all read the superblock
   - other nodes all ack the messages
   - the updating node releases the "I'm sending a message" resource.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

293467aa

Communication Framework: Sending functions · 601b515c

Committed by Goldwyn Rodrigues on Jun 07, 2014

The sending part is split into two functions to ensure the
atomicity of operations such as the MD superblock update.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

601b515c

Communication Framework: Receiving · 4664680c

Committed by Goldwyn Rodrigues on Jun 07, 2014

1. receive status

   sender                         receiver                   receiver
   ACK:CR                         ACK:CR                     ACK:CR

2. sender gets EX of TOKEN
   sender gets EX of MESSAGE

   sender                         receiver                   receiver
   TOKEN:EX                       ACK:CR                     ACK:CR
   MESSAGE:EX
   ACK:CR

3. sender writes LVB
   sender down-converts MESSAGE from EX to CR
   sender tries to get EX of ACK
   [ wait until all receivers have *processed* the MESSAGE ]

                                  [ triggered by bast of ACK ]
                                  receiver gets CR of MESSAGE
                                  receiver reads LVB
                                  receiver processes the message
                                  [ wait finish ]
                                  receiver releases ACK

   sender                         receiver                   receiver
   TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR
   MESSAGE:CR
   ACK:EX

4. sender down-converts ACK from EX to CR
   sender releases MESSAGE
   sender releases TOKEN
                                  receiver up-converts to EX of MESSAGE
                                  receiver gets CR of ACK
                                  receiver releases MESSAGE

   sender                         receiver                   receiver
   ACK:CR                         ACK:CR                     ACK:CR
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

4664680c
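
Step 3 of the handshake relies on lock-mode compatibility: the sender's EX request on ACK cannot be granted while any receiver still holds ACK in CR. The sketch below encodes the standard six-mode DLM compatibility matrix and checks a request against the modes held by other nodes. The enum names, the `compat` table and `can_grant` are local to this sketch, not DLM API, and a released lock is modelled as NL purely for convenience.

```c
#include <assert.h>
#include <stddef.h>

/* Standard DLM lock modes, weakest to strongest (names local to this sketch). */
enum lock_mode { MODE_NL, MODE_CR, MODE_CW, MODE_PR, MODE_PW, MODE_EX };

/* compat[a][b] == 1 when a lock in mode a can coexist with one in mode b. */
static const int compat[6][6] = {
    /*         NL CR CW PR PW EX */
    /* NL */ { 1, 1, 1, 1, 1, 1 },
    /* CR */ { 1, 1, 1, 1, 1, 0 },
    /* CW */ { 1, 1, 1, 0, 0, 0 },
    /* PR */ { 1, 1, 0, 1, 0, 0 },
    /* PW */ { 1, 1, 0, 0, 0, 0 },
    /* EX */ { 1, 0, 0, 0, 0, 0 },
};

/* Can `wanted` be granted while other nodes hold the modes in `held`? */
static int can_grant(enum lock_mode wanted,
                     const enum lock_mode *held, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(held[i] <= MODE_EX);
        if (!compat[wanted][held[i]])
            return 0;
    }
    return 1;
}
```

With two receivers both holding ACK in CR, `can_grant(MODE_EX, ...)` is 0; it becomes 1 only once every receiver has released ACK, which is how the sender learns that all receivers have processed the message.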

Perform resync for cluster node failure · 4b26a08a

Committed by Goldwyn Rodrigues on Jun 07, 2014

If bitmap_copy_slot returns hi>0, we need to perform resync.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

4b26a08a

Initiate recovery on node failure · e94987db

Committed by Goldwyn Rodrigues on Jun 07, 2014

The DLM informs us in case of node failure with the DLM slot number.
cluster_info->recovery_map sets the bit corresponding to the slot number
and wakes up the recovery thread.

The recovery thread:
1. Derives the slot number from the recovery_map
2. Locks the bitmap corresponding to the slot
3. Copies the set bits to the node-local bitmap
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

e94987db
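
Step 1 of the recovery thread, deriving slot numbers from a bitmap of failed slots, can be sketched in plain C as follows. The helper names `mark_slot_failed` and `next_slot_to_recover` are hypothetical, and a single `unsigned long` stands in for `cluster_info->recovery_map`; locking and the actual bitmap copy are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Record that the node in `slot` failed (bit N set means slot N needs recovery). */
static void mark_slot_failed(unsigned long *recovery_map, int slot)
{
    assert(recovery_map != NULL);
    assert(slot >= 0 && slot < (int)(sizeof(*recovery_map) * CHAR_BIT));
    *recovery_map |= 1UL << slot;
}

/* Claim and return the lowest pending slot, or -1 when none is left. */
static int next_slot_to_recover(unsigned long *recovery_map)
{
    assert(recovery_map != NULL);
    for (int slot = 0; slot < (int)(sizeof(*recovery_map) * CHAR_BIT); slot++) {
        unsigned long bit = 1UL << slot;
        if (*recovery_map & bit) {
            *recovery_map &= ~bit;   /* clear the bit so the slot is handled once */
            return slot;
        }
    }
    return -1;
}
```

A recovery loop would call `next_slot_to_recover` until it returns -1, handling one failed slot per iteration.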

Copy set bits from another slot · 11dd35da

Committed by Goldwyn Rodrigues on Jun 07, 2014

bitmap_copy_from_slot reads the bitmap from the slot mentioned.
It then copies the set bits to the node-local bitmap.

This is a helper function for the resync operation on node failure.

bitmap_set_memory_bits() currently assumes it is only run at startup and that
the bitmap is currently empty. So if it finds that a region is already
marked as dirty, it won't mark it dirty again. Change bitmap_set_memory_bits()
to always set the NEEDED_MASK bit if 'needed' is set.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

11dd35da
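
To illustrate the copy step, and the "hi > 0 means a resync is needed" convention mentioned in the node-failure commit above, here is a simplified model using one bit per chunk. `copy_set_bits` and its parameters are hypothetical; the real md bitmap uses counters and pages rather than a flat bit array, so this only shows the shape of the operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* OR every set bit of `src` into `dst`. On return, *lo is the first dirty
 * chunk and *hi is one past the last dirty chunk; *hi stays 0 when nothing
 * was set in `src`. Returns the number of bits copied. */
static size_t copy_set_bits(uint8_t *dst, const uint8_t *src, size_t nchunks,
                            size_t *lo, size_t *hi)
{
    size_t copied = 0;

    assert(dst && src && lo && hi);
    *lo = 0;
    *hi = 0;
    for (size_t c = 0; c < nchunks; c++) {
        uint8_t mask = (uint8_t)(1u << (c % 8));
        if (!(src[c / 8] & mask))
            continue;
        dst[c / 8] |= mask;     /* mark dirty even if already dirty locally */
        if (copied == 0)
            *lo = c;
        *hi = c + 1;
        copied++;
    }
    return copied;
}
```

A caller would start a resync over chunks `[lo, hi)` only when `hi` comes back greater than zero.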

bitmap_create returns bitmap pointer · f9209a32

Committed by Goldwyn Rodrigues on Jun 06, 2014

This is done to have multiple bitmaps open at the same time.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

f9209a32

Gather on-going resync information of other nodes · 96ae923a

Committed by Goldwyn Rodrigues on Jun 06, 2014

When a node joins, it does not know of other nodes performing resync.
So, each node keeps the resync information in its LVB. When a new
node joins, it reads the LVB of each "online" bitmap.

[TODO] The new node attempts to get the PW lock on the other bitmap; if
it is successful, it reads the bitmap and performs the resync (if
required) on its behalf.

If the node does not get the PW, it requests CR and reads the LVB
for the resync information.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

96ae923a

Lock bitmap while joining the cluster · 54519c5f

Committed by Goldwyn Rodrigues on Jun 06, 2014

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

54519c5f

Use separate bitmaps for each nodes in the cluster · b97e9257

Committed by Goldwyn Rodrigues on Jun 06, 2014

On-disk format:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
| bm bits [3, contd]  |                     |                     |

Bitmap super has a field nodes, which defines the maximum number
of nodes the device can use. While reading the bitmap super, if
the cluster finds out that the number of nodes is > 0:
1. Requests the md-cluster module.
2. Calls md_cluster_ops->join(), which sets up clustering such as
   joining DLM lockspace.

Initially, the first bitmap is read. After the call
to cluster_setup, the bitmap offset is adjusted and the
superblock is re-read. This also ensures the bitmap is read
under the bitmap lock (once the bitmap lock is introduced in later patches).

Questions:
1. The cluster name is repeated in all bitmap supers. Is that okay?
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

b97e9257
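
Reading the on-disk diagram literally, each node's bitmap occupies a fixed run of 4k units, with the first bitmap super at 8k. Under that assumption the byte offset of a given slot's bitmap super is a simple multiplication. The function name and both parameters below are hypothetical, and in practice the per-node size depends on the bitmap size rather than being the constant two pages drawn in the diagram.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u   /* the diagram is drawn in 4k units */

/* Byte offset of the bitmap super for `slot`, assuming every node's bitmap
 * takes `pages_per_bitmap` consecutive 4k pages starting at
 * `first_bitmap_offset`. */
static uint64_t bitmap_super_offset(uint64_t first_bitmap_offset,
                                    uint32_t pages_per_bitmap,
                                    uint32_t slot)
{
    assert(pages_per_bitmap > 0);
    return first_bitmap_offset +
           (uint64_t)slot * pages_per_bitmap * PAGE_BYTES;
}
```

With the diagram's values (first bitmap at 8k, two pages per node), slot 1 lands at 16k and slot 3 at 32k, matching the positions of `bm super[1]` and `bm super[3]`.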

Add node recovery callbacks · cf921cc1

Committed by Goldwyn Rodrigues on Mar 30, 2014

DLM offers callbacks when a node fails and the lock remastery
is performed:

1. recover_prep: called when DLM discovers a node is down
2. recover_slot: called when DLM identifies the node and recovery
   can start
3. recover_done: called when all nodes have completed recover_slot

recover_slot() and recover_done() are also called when the node joins
initially, in order to inform the node of its slot number. These slot
numbers start from one, so we subtract one to make them start from zero,
which is what the cluster-md code uses.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

cf921cc1

Return MD_SB_CLUSTERED if mddev is clustered · ca8895d9

Committed by Goldwyn Rodrigues on Nov 26, 2014

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

ca8895d9

Introduce md_cluster_info · c4ce867f

Committed by Goldwyn Rodrigues on Mar 29, 2014

md_cluster_info stores the cluster information in the MD device.

join() is called when mddev detects it is a clustered device.
Its main responsibilities are:
   1. Set up a DLM lockspace
   2. Set up all initial locks, such as the superblock locks and the bitmap lock (will come later)

leave() cleans up the lockspace and all the locks held.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

c4ce867f

Introduce md_cluster_operations to handle cluster functions · edb39c9d

Committed by Goldwyn Rodrigues on Mar 29, 2014

This allows dynamic registering of cluster hooks.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

edb39c9d

DLM lock and unlock functions · 47741b7c

Committed by Goldwyn Rodrigues on Mar 07, 2014

A dlm_lock_resource is a structure which contains all information
required for locking using DLM. The init function allocates the
lock and acquires the lock in NL mode. The unlock function
converts the lock resource to NL mode. This is done to preserve
the LVB and for faster processing of locks. The lock resource is
DLM unlocked only in the lockres_free function, which is the end
of life of the lock resource.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

47741b7c

Create a separate module for clustering support · 8e854e9c

Committed by Goldwyn Rodrigues on Mar 07, 2014

Tagged as EXPERIMENTAL for now.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

8e854e9c

Add number of nodes to bitmap structure for clustering · 183bdf51

Committed by Goldwyn Rodrigues on Mar 07, 2014

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

183bdf51

18 Feb, 2015 3 commits

dm snapshot: fix a possible invalid memory access on unload · 22aa66a3

Committed by Mikulas Patocka on Feb 17, 2015

When the snapshot target is unloaded, snapshot_dtr() waits until
pending_exceptions_count drops to zero.  Then, it destroys the snapshot.
Therefore, the function that decrements pending_exceptions_count
should not touch the snapshot structure after the decrement.

pending_complete() calls free_pending_exception(), which decrements
pending_exceptions_count, and then it performs up_write(&s->lock) and
calls retry_origin_bios(), which dereferences s->origin.  These two
memory accesses to the fields of the snapshot may touch the dm_snapshot
structure after it is freed.

This patch moves the call to free_pending_exception() to the end of
pending_complete(), so that the snapshot will not be destroyed while
pending_complete() is in progress.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org

22aa66a3

dm: fix a race condition in dm_get_md · 2bec1f4a

Committed by Mikulas Patocka on Feb 17, 2015

The function dm_get_md finds a device mapper device with a given dev_t,
increases the reference count and returns the pointer.

dm_get_md calls dm_find_md. dm_find_md takes _minor_lock, finds the
device, tests that the device doesn't have the DMF_DELETING or DMF_FREEING
flag, drops _minor_lock and returns a pointer to the device. dm_get_md then
calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag,
otherwise it increments the reference count.

There is a possible race condition - after dm_find_md exits and before
dm_get is called, there are no locks held, so the device may disappear or
DMF_FREEING flag may be set, which results in BUG.

To fix this bug, we need to call dm_get while we hold _minor_lock. This
patch renames dm_find_md to dm_get_md and changes it so that it calls
dm_get while holding the lock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org

2bec1f4a
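
The shape of this fix, performing the lookup and taking the reference inside one critical section, can be shown with a userspace analogue. Everything here is an assumption for illustration: `struct device`, `table`, `minor_lock` and `get_device` are invented names, and a pthread mutex stands in for the kernel's _minor_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a mapped device. */
struct device {
    int minor;
    int freeing;    /* models the DMF_FREEING / DMF_DELETING flags */
    int refcount;
};

#define NDEV 4
static struct device *table[NDEV];
static pthread_mutex_t minor_lock = PTHREAD_MUTEX_INITIALIZER;

/* Look up a device and take a reference while still holding the lock,
 * so the device cannot start being freed between the two steps. */
static struct device *get_device(int minor)
{
    struct device *d = NULL;

    pthread_mutex_lock(&minor_lock);
    if (minor >= 0 && minor < NDEV && table[minor] && !table[minor]->freeing) {
        d = table[minor];
        assert(d->refcount >= 0);
        d->refcount++;          /* reference taken under the lock */
    }
    pthread_mutex_unlock(&minor_lock);
    return d;
}
```

The racy variant would return the pointer after unlocking and increment the count later, leaving a window in which the flag can be set or the device can disappear.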

md/raid5: Fix livelock when array is both resyncing and degraded. · 26ac1073

Committed by NeilBrown on Feb 18, 2015

Commit a7854487:
  md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.

Causes an RCW cycle to be forced even when the array is degraded.
A degraded array cannot support RCW as that requires reading all data
blocks, and one may be missing.

Forcing an RCW when it is not possible causes a live-lock and the code
spins, repeatedly deciding to do something that cannot succeed.

So change the condition to only force RCW on non-degraded arrays.
Reported-by: Manibalan P <pmanibalan@amiindia.co.in>
Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: a7854487
Cc: stable@vger.kernel.org (v3.7+)

26ac1073

17 Feb, 2015 7 commits

dm crypt: sort writes · b3c5fd30

Committed by Mikulas Patocka on Feb 13, 2015

Write requests are sorted in a red-black tree structure and are
submitted in the sorted order.

In theory the sorting should be performed by the underlying disk
scheduler, however, in practice the disk scheduler only accepts and
sorts a finite number of requests.  To allow the sorting of all
requests, dm-crypt needs to implement its own sorting.

The overhead associated with rbtree-based sorting is considered
negligible so it is not used conditionally.  Even on SSD sorting can be
beneficial since in-order request dispatch promotes lower latency IO
completion to the upper layers.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

b3c5fd30
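
The effect of submitting writes in sector order can be demonstrated without a red-black tree. The sketch below deliberately swaps in the C library's `qsort` on a plain array, which is not what the commit does (it keeps pending writes in an rbtree), and `struct pending_write` and `sort_writes` are hypothetical names. It only shows the ordering that results.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queued write request. */
struct pending_write {
    uint64_t sector;   /* starting sector of the write */
    int id;            /* arrival order, for illustration only */
};

/* Compare by starting sector without risking integer overflow. */
static int cmp_sector(const void *a, const void *b)
{
    const struct pending_write *x = a, *y = b;
    return (x->sector > y->sector) - (x->sector < y->sector);
}

/* Put the queued writes into ascending sector order before submission. */
static void sort_writes(struct pending_write *w, size_t n)
{
    assert(w != NULL || n == 0);
    if (n > 1)
        qsort(w, n, sizeof(*w), cmp_sector);
}
```

An rbtree gives the same ordering while allowing new writes to be inserted as they finish encryption, which a one-shot array sort does not.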

dm crypt: add 'submit_from_crypt_cpus' option · 0f5d8e6e

Committed by Mikulas Patocka on Feb 13, 2015

Make it possible to disable offloading writes by setting the optional
'submit_from_crypt_cpus' table argument.

There are some situations where offloading write bios from the
encryption threads to a single thread degrades performance
significantly.

The default is to offload write bios to the same thread because it
benefits CFQ to have writes submitted using the same IO context.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

0f5d8e6e

dm crypt: offload writes to thread · dc267621

Committed by Mikulas Patocka on Feb 13, 2015

Submitting write bios directly in the encryption thread caused serious
performance degradation.  On a multiprocessor machine, encryption requests
finish in a different order than they were submitted.  Consequently, write
requests would be submitted in a different order and it could cause severe
performance degradation.

Move the submission of write requests to a separate thread so that the
requests can be sorted before submitting.  But this commit improves
dm-crypt performance even without having dm-crypt perform request
sorting (in particular it enables IO schedulers like CFQ to sort more
effectively).

Note: it is required that a previous commit ("dm crypt: don't allocate
pages for a partial request") be applied before applying this patch.
Otherwise, this commit could introduce a crash.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dc267621

dm crypt: remove unused io_pool and _crypt_io_pool · 94f5e024

Committed by Mikulas Patocka on Feb 13, 2015

The previous commit ("dm crypt: don't allocate pages for a partial
request") stopped using the io_pool slab mempool and backing
_crypt_io_pool kmem cache.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

94f5e024

dm crypt: avoid deadlock in mempools · 7145c241

Committed by Mikulas Patocka on Feb 13, 2015

Fix a theoretical deadlock introduced in the previous commit ("dm crypt:
don't allocate pages for a partial request").

The function crypt_alloc_buffer may be called concurrently.  If we allocate
from the mempool concurrently, there is a possibility of deadlock.  For
example, if we have a mempool of 256 pages and two processes, each wanting
256 pages, allocate from the mempool concurrently, it may deadlock in a
situation where both processes have allocated 128 pages and the mempool
is exhausted.

To avoid such a scenario we allocate the pages under a mutex.  In order
to not degrade performance with excessive locking, we try non-blocking
allocations without a mutex first and if that fails, we fall back to
blocking allocations with a mutex.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

7145c241

dm crypt: don't allocate pages for a partial request · cf2f1abf

Committed by Mikulas Patocka on Feb 13, 2015

Change crypt_alloc_buffer so that it only ever allocates pages for a
full request.  This is a prerequisite for the commit "dm crypt: offload
writes to thread".

This change simplifies the dm-crypt code at the expense of reduced
throughput in low memory conditions (where allocation for a partial
request is most useful).

Note: the next commit ("dm crypt: avoid deadlock in mempools") is needed
to fix a theoretical deadlock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

cf2f1abf

dm crypt: use unbound workqueue for request processing · f3396c58

Committed by Mikulas Patocka on Feb 13, 2015

Use an unbound workqueue by default so that work is automatically balanced
between available CPUs.  The original behavior of encrypting using the
same cpu that IO was submitted on can still be enabled by setting the
optional 'same_cpu_crypt' table argument.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

f3396c58

16 Feb, 2015 2 commits

md/raid10: round up to bdev_logical_block_size in narrow_write_error. · f04ebb0b

Committed by NeilBrown on Feb 16, 2015

RAID10 version of the earlier fix for RAID1.  We must never initiate
IO with sizes less than logical_block_size.
Signed-off-by: NeilBrown <neilb@suse.de>

f04ebb0b

md/raid1: round up to bdev_logical_block_size in narrow_write_error · ab713cdc

Committed by Nate Dailey on Feb 12, 2015

This modifies raid1's narrow_write_error to round up block_sectors to the
device's logical block size.

This prevents sd complaining about "Bad block number requested" for non-512-byte
sector disks.
Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>

ab713cdc
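
The rounding performed by these two commits amounts to aligning a size in 512-byte sectors up to a multiple of the device's logical block size. A minimal sketch, with a hypothetical helper name and plain integers in place of the block-device structures:

```c
#include <assert.h>
#include <stdint.h>

/* Round `block_sectors` (in 512-byte sectors) up to a whole number of
 * logical blocks, given the logical block size in bytes. */
static uint32_t round_up_block_sectors(uint32_t block_sectors,
                                       uint32_t logical_block_bytes)
{
    uint32_t lbs_sectors = logical_block_bytes >> 9;   /* bytes to sectors */

    assert(lbs_sectors > 0);
    return ((block_sectors + lbs_sectors - 1) / lbs_sectors) * lbs_sectors;
}
```

On a 4096-byte-sector disk a one-sector retry size becomes 8 sectors, so no IO smaller than a logical block is issued; on a 512-byte-sector disk the value is unchanged.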

14 Feb, 2015 2 commits

dm io: reject unsupported DISCARD requests with EOPNOTSUPP · 37527b86

Committed by Darrick J. Wong on Feb 13, 2015

I created a dm-raid1 device backed by a device that supports DISCARD
and another device that does NOT support DISCARD with the following
dm configuration:

 # echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
 # lsblk -D
 NAME         DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
 sda                 0        4K       1G         0
 `-moo (dm-0)        0        4K       1G         0
 sdb                 0        0B       0B         0
 `-moo (dm-0)        0        4K       1G         0

Notice that the mirror device /dev/mapper/moo advertises DISCARD
support even though one of the mirror halves doesn't.

If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
loop in do_region() when it tries to issue a DISCARD request to sdb.
The problem is that when we call do_region() against sdb, num_sectors
is set to zero because q->limits.max_discard_sectors is zero.
Therefore, "remaining" never decreases and the loop never terminates.

To fix this: before entering the loop, check for the combination of
REQ_DISCARD and no discard and return -EOPNOTSUPP to avoid hanging up
the mirror device.

This bug was found by the unfortunate coincidence of pvmove and a
discard operation in the RHEL 6.5 kernel; upstream is also affected.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org

37527b86

dm mirror: do not degrade the mirror on discard error · f2ed51ac

Committed by Mikulas Patocka on Feb 12, 2015

It may be possible that a device claims discard support but rejects
discards with -EOPNOTSUPP.  This happens when using loopback on an ext2/ext3
filesystem driven by the ext4 driver.  It may also happen if the
underlying devices are moved from one disk to another.

If a discard error happens, we reject the bio with -EOPNOTSUPP, but we do
not degrade the array.

This patch fixes the failed test shell/lvconvert-repair-transient.sh in the
lvm2 testsuite if the testsuite is extracted on an ext2 or ext3
filesystem and it is being driven by the ext4 driver.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org

f2ed51ac

openanolis / cloud-kernel, last synced successfully more than 1 year ago