1. 17 3月, 2017 18 次提交
    • G
      md: move bitmap_destroy to the beginning of __md_stop · 48df498d
      Guoqing Jiang 提交于
      Since we have switched to sync way to handle METADATA_UPDATED
      msg for md-cluster, then process_metadata_update is depended
      on mddev->thread->wqueue.
      
      With the new change, clustered raid could possible hang if
      array received a METADATA_UPDATED msg after array unregistered
      mddev->thread, so we need to stop clustered raid (bitmap_destroy
      -> bitmap_free -> md_cluster_stop) earlier than unregister
      thread (mddev_detach -> md_unregister_thread).
      
      And this change should be safe for non-clustered raid since
      all writes are stopped before the destroy. Also in md_run,
      we activate the personality (pers->run()) before activating
      the bitmap (bitmap_create()). So it is pleasingly symmetric
      to stop the bitmap (bitmap_destroy()) before stopping the
      personality (__md_stop() calls pers->free()), we achieve this
      by move bitmap_destroy to the beginning of __md_stop.
      
      But we don't want to break the codes for waiting behind IO as
      Shaohua mentioned, so introduce bitmap_wait_behind_writes to
      call the codes, and call the new fun in both mddev_detach and
      bitmap_destroy, then we will not break original behind IO code
      and also fit the new condition well.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      48df498d
    • S
      md/r5cache: generate R5LOG_PAYLOAD_FLUSH · ea17481f
      Song Liu 提交于
      In r5c_finish_stripe_write_out(), R5LOG_PAYLOAD_FLUSH is append to
      log->current_io.
      
      Appending R5LOG_PAYLOAD_FLUSH in quiesce needs extra writes to
      journal. To simplify the logic, we just skip R5LOG_PAYLOAD_FLUSH in
      quiesce.
      
      Even R5LOG_PAYLOAD_FLUSH supports multiple stripes per payload.
      However, current implementation is one stripe per R5LOG_PAYLOAD_FLUSH,
      which is simpler.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ea17481f
    • S
      md/r5cache: handle R5LOG_PAYLOAD_FLUSH in recovery · 2d4f4687
      Song Liu 提交于
      This patch adds handling of R5LOG_PAYLOAD_FLUSH in journal recovery.
      Next patch will add logic that generate R5LOG_PAYLOAD_FLUSH on flush
      finish.
      
      When R5LOG_PAYLOAD_FLUSH is seen in recovery, pending data and parity
      will be dropped from recovery. This will reduce the number of stripes
      to replay, and thus accelerate the recovery process.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2d4f4687
    • A
      raid5-ppl: runtime PPL enabling or disabling · ba903a3e
      Artur Paszkiewicz 提交于
      Allow writing to 'consistency_policy' attribute when the array is
      active. Add a new function 'change_consistency_policy' to the
      md_personality operations structure to handle the change in the
      personality code. Values "ppl" and "resync" are accepted and
      turn PPL on and off respectively.
      
      When enabling PPL its location and size should first be set using
      'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
      written at this location on each member device.
      
      Enabling or disabling PPL is performed under a suspended array.  The
      raid5_reset_stripe_cache function frees the stripe cache and allocates
      it again in order to allocate or free the ppl_pages for the stripes in
      the stripe cache.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ba903a3e
    • A
      raid5-ppl: support disk hot add/remove with PPL · 6358c239
      Artur Paszkiewicz 提交于
      Add a function to modify the log by removing an rdev when a drive fails
      or adding when a spare/replacement is activated as a raid member.
      
      Removing a disk just clears the child log rdev pointer. No new stripes
      will be accepted for this child log in ppl_write_stripe() and running io
      units will be processed without writing PPL to the device.
      
      Adding a disk sets the child log rdev pointer and writes an empty PPL
      header.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      6358c239
    • A
      raid5-ppl: load and recover the log · 4536bf9b
      Artur Paszkiewicz 提交于
      Load the log from each disk when starting the array and recover if the
      array is dirty.
      
      The initial empty PPL is written by mdadm. When loading the log we
      verify the header checksum and signature. For external metadata arrays
      the signature is verified in userspace, so here we read it from the
      header, verifying only if it matches on all disks, and use it later when
      writing PPL.
      
      In addition to the header checksum, each header entry also contains a
      checksum of its partial parity data. If the header is valid, recovery is
      performed for each entry until an invalid entry is found. If the array
      is not degraded and recovery using PPL fully succeeds, there is no need
      to resync the array because data and parity will be consistent, so in
      this case resync will be disabled.
      
      Due to compatibility with IMSM implementations on other systems, we
      can't assume that the recovery data block size is always 4K. Writes
      generated by MD raid5 don't have this issue, but when recovering PPL
      written in other environments it is possible to have entries with
      512-byte sector granularity. The recovery code takes this into account
      and also the logical sector size of the underlying drives.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      4536bf9b
    • A
      md: add sysfs entries for PPL · 664aed04
      Artur Paszkiewicz 提交于
      Add 'consistency_policy' attribute for array. It indicates how the array
      maintains consistency in case of unexpected shutdown.
      
      Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
      and size of the PPL space on the device. They can't be changed for
      active members if the array is started and PPL is enabled, so in the
      setter functions only basic checks are performed. More checks are done
      in ppl_validate_rdev() when starting the log.
      
      These attributes are writable to allow enabling PPL for external
      metadata arrays and (later) to enable/disable PPL for a running array.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      664aed04
    • A
      raid5-ppl: Partial Parity Log write logging implementation · 3418d036
      Artur Paszkiewicz 提交于
      Implement the calculation of partial parity for a stripe and PPL write
      logging functionality. The description of PPL is added to the
      documentation. More details can be found in the comments in raid5-ppl.c.
      
      Attach a page for holding the partial parity data to stripe_head.
      Allocate it only if mddev has the MD_HAS_PPL flag set.
      
      Partial parity is the xor of not modified data chunks of a stripe and is
      calculated as follows:
      
      - reconstruct-write case:
        xor data from all not updated disks in a stripe
      
      - read-modify-write case:
        xor old data and parity from all updated disks in a stripe
      
      Implement it using the async_tx API and integrate into raid_run_ops().
      It must be called when we still have access to old data, so do it when
      STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
      stored into sh->ppl_page.
      
      Partial parity is not meaningful for full stripe write and is not stored
      in the log or used for recovery, so don't attempt to calculate it when
      stripe has STRIPE_FULL_WRITE.
      
      Put the PPL metadata structures to md_p.h because userspace tools
      (mdadm) will also need to read/write PPL.
      
      Warn about using PPL with enabled disk volatile write-back cache for
      now. It can be removed once disk cache flushing before writing PPL is
      implemented.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      3418d036
    • A
      raid5: separate header for log functions · ff875738
      Artur Paszkiewicz 提交于
      Move raid5-cache declarations from raid5.h to raid5-log.h, add inline
      wrappers for functions which will be shared with ppl and use them in
      raid5 core instead of direct calls to raid5-cache.
      
      Remove unused parameter from r5c_cache_data(), move two duplicated
      pr_debug() calls to r5l_init_log().
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ff875738
    • A
      md: superblock changes for PPL · ea0213e0
      Artur Paszkiewicz 提交于
      Include information about PPL location and size into mdp_superblock_1
      and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
      put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
      'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
      to mddev->flags to indicate that PPL is enabled on an array.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ea0213e0
    • S
      md/r5cache: improve recovery with read ahead page pool · effe6ee7
      Song Liu 提交于
      In r5cache recovery, the journal device is scanned page by page.
      Currently, we use sync_page_io() to read journal device. This is
      not efficient when we have to recovery many stripes from the journal.
      
      To improve the speed of recovery, this patch introduces a read ahead
      page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive
      pages are read in one IO. Then the recovery code read the journal from
      ra_pool.
      
      With ra_pool, r5l_recovery_ctx has become much bigger. Therefore,
      r5l_recovery_log() is refactored so r5l_recovery_ctx is not using
      stack space.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      effe6ee7
    • S
      md/raid5: sort bios · aaf9f12e
      Shaohua Li 提交于
      Previous patch (raid5: only dispatch IO from raid5d for harddisk raid)
      defers IO dispatching. The goal is to create better IO pattern. At that
      time, we don't sort the deffered IO and hope the block layer can do IO
      merge and sort. Now the raid5-cache writeback could create large amount
      of bios. And if we enable muti-thread for stripe handling, we can't
      control when to dispatch IO to raid disks. In a lot of time, we are
      dispatching IO which block layer can't do merge effectively.
      
      This patch moves further for the IO dispatching defer. We accumulate
      bios, but we don't dispatch all the bios after a threshold is met. This
      'dispatch partial portion of bios' stragety allows bios coming in a
      large time window are sent to disks together. At the dispatching time,
      there is large chance the block layer can merge the bios. To make this
      more effective, we dispatch IO in ascending order. This increases
      request merge chance and reduces disk seek.
      Signed-off-by: NShaohua Li <shli@fb.com>
      aaf9f12e
    • S
      md/raid5-cache: bump flush stripe batch size · 84890c03
      Shaohua Li 提交于
      Bump the flush stripe batch size to 2048. For my 12 disks raid
      array, the stripes takes:
      12 * 4k * 2048 = 96MB
      
      This is still quite small. A hardware raid card generally has 1GB size,
      which we suggest the raid5-cache has similar cache size.
      
      The advantage of a big batch size is we can dispatch a lot of IO in the
      same time, then we can do some scheduling to make better IO pattern.
      
      Last patch prioritizes stripes, so we don't worry about a big flush
      stripe batch will starve normal stripes.
      Signed-off-by: NShaohua Li <shli@fb.com>
      84890c03
    • S
      md/raid5: prioritize stripes for writeback · 535ae4eb
      Shaohua Li 提交于
      In raid5-cache writeback mode, we have two types of stripes to handle.
      - stripes which aren't cached yet
      - stripes which are cached and flushing out to raid disks
      
      Upperlayer is more sensistive to latency of the first type of stripes
      generally. But we only one handle list for all these stripes, where the
      two types of stripes are mixed together. When reclaim flushes a lot of
      stripes, the first type of stripes could be noticeably delayed. On the
      other hand, if the log space is tight, we'd like to handle the second
      type of stripes faster and free log space.
      
      This patch destinguishes the two types stripes. They are added into
      different handle list. When we try to get a stripe to handl, we prefer
      the first type of stripes unless log space is tight.
      
      This should have no impact for !writeback case.
      Signed-off-by: NShaohua Li <shli@fb.com>
      535ae4eb
    • G
      md-cluster: add the support for resize · 818da59f
      Guoqing Jiang 提交于
      To update size for cluster raid, we need to make
      sure all nodes can perform the change successfully.
      However, it is possible that some of them can't do
      it due to failure (bitmap_resize could fail). So
      we need to consider the issue before we set the
      capacity unconditionally, and we use below steps
      to perform sanity check.
      
      1. A change the size, then broadcast METADATA_UPDATED
         msg.
      2. B and C receive METADATA_UPDATED change the size
         excepts call set_capacity, sync_size is not update
         if the change failed. Also call bitmap_update_sb
         to sync sb to disk.
      3. A checks other node's sync_size, if sync_size has
         been updated in all nodes, then send CHANGE_CAPACITY
         msg otherwise send msg to revert previous change.
      4. B and C call set_capacity if receive CHANGE_CAPACITY
         msg, otherwise pers->resize will be called to restore
         the old value.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      818da59f
    • G
      md-cluster: introduce cluster_check_sync_size · b98938d1
      Guoqing Jiang 提交于
      Support resize is a little complex for clustered
      raid, since we need to ensure all the nodes share
      the same knowledge about the size of raid.
      
      We achieve the goal by check the sync_size which
      is in each node's bitmap, we can only change the
      capacity after cluster_check_sync_size returns 0.
      
      Also, get_bitmap_from_slot is added to get a slot's
      bitmap. And we exported some funcs since they are
      used in cluster_check_sync_size().
      
      We can also reuse get_bitmap_from_slot to remove
      redundant code existed in bitmap_copy_from_slot.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b98938d1
    • G
      md-cluster: add CHANGE_CAPACITY message type · 7da3d203
      Guoqing Jiang 提交于
      The msg type CHANGE_CAPACITY is introduced to support
      resize clustered raid in later patch, and it is sent
      after all the nodes have the same sync_size, receiver
      node just need to set new capacity once received this
      msg.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      7da3d203
    • G
      md-cluster: use sync way to handle METADATA_UPDATED msg · 0ba95977
      Guoqing Jiang 提交于
      Previously, when node received METADATA_UPDATED msg, it just
      need to wakeup mddev->thread, then md_reload_sb will be called
      eventually.
      
      We taken the asynchronous way to avoid a deadlock issue, the
      deadlock issue could happen when one node is receiving the
      METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
      the path:
      
      md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                        -> md_update_sb-metadata_update_start
      		     (want EX on token however token is
      		      got by the sending node)
      
      Since we will support resizing for clustered raid, and we
      need the metadata update handling to be synchronous so that
      the initiating node can detect failure, so we need to change
      the way for handling METADATA_UPDATED msg.
      
      But, we obviously need to avoid above deadlock with the
      sync way. To make this happen, we considered to not hold
      reconfig_mutex to call md_reload_sb, if some other thread
      has already taken reconfig_mutex and waiting for the 'token',
      then process_recvd_msg() can safely call md_reload_sb()
      without taking the mutex. This is because we can be certain
      that no other thread will take the mutex, and we also certain
      that the actions performed by md_reload_sb() won't interfere
      with anything that the other thread is in the middle of.
      
      To make this more concrete, we added a new cinfo->state bit
              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      
      Which is set in lock_token() just before dlm_lock_sync() is
      called, and cleared just after. As lock_token() is always
      called with reconfig_mutex() held (the specific case is the
      resync_info_update which is distinguished well in previous
      patch), if process_recvd_msg() finds that the new bit is set,
      then the mutex must be held by some other thread, and it will
      keep waiting.
      
      So process_metadata_update() can call md_reload_sb() if either
      mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      is set. The tricky bit is what to do if neither of these apply.
      We need to wait. Fortunately mddev_unlock() always calls wake_up()
      on mddev->thread->wqueue. So we can get lock_token() to call
      wake_up() on that when it sets the bit.
      
      There are also some related changes inside this commit:
      1. remove RELOAD_SB related codes since there are not valid anymore.
      2. mddev is added into md_cluster_info then we can get mddev inside
         lock_token.
      3. add new parameter for lock_token to distinguish reconfig_mutex
         is held or not.
      
      And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below:
      1. set it before unregister thread, otherwise a deadlock could
         appear if stop a resyncing array.
         This is because md_unregister_thread(&cinfo->recv_thread) is
         blocked by recv_daemon -> process_recvd_msg
      			  -> process_metadata_update.
         To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
         also need to be set before unregister thread.
      2. set it in metadata_update_start to fix another deadlock.
      	a. Node A sends METADATA_UPDATED msg (held Token lock).
      	b. Node B wants to do resync, and is blocked since it can't
      	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
      	   not set since the callchain
      	   (md_do_sync -> sync_request
              	       -> resync_info_update
      		       -> sendmsg
      		       -> lock_comm -> lock_token)
      	   doesn't hold reconfig_mutex.
      	c. Node B trys to update sb (held reconfig_mutex), but stopped
      	   at wait_event() in metadata_update_start since we have set
      	   MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
      	d. Then Node B receives METADATA_UPDATED msg from A, of course
      	   recv_daemon is blocked forever.
         Since metadata_update_start always calls lock_token with reconfig_mutex,
         we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
         lock_token don't need to set it twice unless lock_token is invoked from
         lock_comm.
      
      Finally, thanks to Neil for his great idea and help!
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0ba95977
  2. 15 3月, 2017 2 次提交
  3. 12 3月, 2017 1 次提交
    • N
      blk: Ensure users for current->bio_list can see the full list. · f5fe1b51
      NeilBrown 提交于
      Commit 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      changed current->bio_list so that it did not contain *all* of the
      queued bios, but only those submitted by the currently running
      make_request_fn.
      
      There are two places which walk the list and requeue selected bios,
      and others that check if the list is empty.  These are no longer
      correct.
      
      So redefine current->bio_list to point to an array of two lists, which
      contain all queued bios, and adjust various code to test or walk both
      lists.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Fixes: 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f5fe1b51
  4. 11 3月, 2017 2 次提交
  5. 10 3月, 2017 8 次提交
  6. 02 3月, 2017 8 次提交
    • I
      sched/headers: Prepare to move the get_task_struct()/put_task_struct() and... · 0881e7bd
      Ingo Molnar 提交于
      sched/headers: Prepare to move the get_task_struct()/put_task_struct() and related APIs from <linux/sched.h> to <linux/sched/task.h>
      
      But first update usage sites with the new header dependency.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      0881e7bd
    • I
      sched/headers: Prepare to use <linux/rcuupdate.h> instead of <linux/rculist.h> in <linux/sched.h> · b2d09103
      Ingo Molnar 提交于
      We don't actually need the full rculist.h header in sched.h anymore,
      we will be able to include the smaller rcupdate.h header instead.
      
      But first update code that relied on the implicit header inclusion.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      b2d09103
    • I
      sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h> · 68db0cf1
      Ingo Molnar 提交于
      We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and a couple of .c files.
      
      Create a trivial placeholder <linux/sched/task_stack.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      Include the new header in the files that are going to need it.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      68db0cf1
    • I
      sched/headers: Prepare to move the memalloc_noio_*() APIs to <linux/sched/mm.h> · 5b3cc15a
      Ingo Molnar 提交于
      Update the .c files that depend on these APIs.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      5b3cc15a
    • I
      sched/headers: Prepare to move signal wakeup & sigpending methods from... · 174cd4b1
      Ingo Molnar 提交于
      sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
      
      Fix up affected files that include this signal functionality via sched.h.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      174cd4b1
    • I
      sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> · 3f07c014
      Ingo Molnar 提交于
      We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and a couple of .c files.
      
      Create a trivial placeholder <linux/sched/signal.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      Include the new header in the files that are going to need it.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      3f07c014
    • I
      sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h> · e6017571
      Ingo Molnar 提交于
      We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and .c files.
      
      Create a trivial placeholder <linux/sched/clock.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      Include the new header in the files that are going to need it.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      e6017571
    • D
      KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload() · 0837e49a
      David Howells 提交于
      rcu_dereference_key() and user_key_payload() are currently being used in
      two different, incompatible ways:
      
       (1) As a wrapper to rcu_dereference() - when only the RCU read lock used
           to protect the key.
      
       (2) As a wrapper to rcu_dereference_protected() - when the key semaphor is
           used to protect the key and the may be being modified.
      
      Fix this by splitting both of the key wrappers to produce:
      
       (1) RCU accessors for keys when caller has the key semaphore locked:
      
      	dereference_key_locked()
      	user_key_payload_locked()
      
       (2) RCU accessors for keys when caller holds the RCU read lock:
      
      	dereference_key_rcu()
      	user_key_payload_rcu()
      
      This should fix following warning in the NFS idmapper
      
        ===============================
        [ INFO: suspicious RCU usage. ]
        4.10.0 #1 Tainted: G        W
        -------------------------------
        ./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
        other info that might help us debug this:
        rcu_scheduler_active = 2, debug_locks = 0
        1 lock held by mount.nfs/5987:
          #0:  (rcu_read_lock){......}, at: [<d000000002527abc>] nfs_idmap_get_key+0x15c/0x420 [nfsv4]
        stack backtrace:
        CPU: 1 PID: 5987 Comm: mount.nfs Tainted: G        W       4.10.0 #1
        Call Trace:
          dump_stack+0xe8/0x154 (unreliable)
          lockdep_rcu_suspicious+0x140/0x190
          nfs_idmap_get_key+0x380/0x420 [nfsv4]
          nfs_map_name_to_uid+0x2a0/0x3b0 [nfsv4]
          decode_getfattr_attrs+0xfac/0x16b0 [nfsv4]
          decode_getfattr_generic.constprop.106+0xbc/0x150 [nfsv4]
          nfs4_xdr_dec_lookup_root+0xac/0xb0 [nfsv4]
          rpcauth_unwrap_resp+0xe8/0x140 [sunrpc]
          call_decode+0x29c/0x910 [sunrpc]
          __rpc_execute+0x140/0x8f0 [sunrpc]
          rpc_run_task+0x170/0x200 [sunrpc]
          nfs4_call_sync_sequence+0x68/0xa0 [nfsv4]
          _nfs4_lookup_root.isra.44+0xd0/0xf0 [nfsv4]
          nfs4_lookup_root+0xe0/0x350 [nfsv4]
          nfs4_lookup_root_sec+0x70/0xa0 [nfsv4]
          nfs4_find_root_sec+0xc4/0x100 [nfsv4]
          nfs4_proc_get_rootfh+0x5c/0xf0 [nfsv4]
          nfs4_get_rootfh+0x6c/0x190 [nfsv4]
          nfs4_server_common_setup+0xc4/0x260 [nfsv4]
          nfs4_create_server+0x278/0x3c0 [nfsv4]
          nfs4_remote_mount+0x50/0xb0 [nfsv4]
          mount_fs+0x74/0x210
          vfs_kern_mount+0x78/0x220
          nfs_do_root_mount+0xb0/0x140 [nfsv4]
          nfs4_try_mount+0x60/0x100 [nfsv4]
          nfs_fs_mount+0x5ec/0xda0 [nfs]
          mount_fs+0x74/0x210
          vfs_kern_mount+0x78/0x220
          do_mount+0x254/0xf70
          SyS_mount+0x94/0x100
          system_call+0x38/0xe0
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NJames Morris <james.l.morris@oracle.com>
      0837e49a
  7. 01 3月, 2017 1 次提交
    • M
      dm raid: bump the target version · 2664f3c9
      Mike Snitzer 提交于
      This version bump reflects that the reshape corruption fix (commit
      92a39f6cc "dm raid: fix data corruption on reshape request") is
      present.
      
      Done as a separate fix because the above referenced commit is marked for
      stable and target version bumps in a stable@ fix are a recipe for the
      fix to never get backported to stable@ kernels (because of target
      version number conflicts).
      
      Also, move RESUME_STAY_FROZEN_FLAGS up with the reset the the _FLAGS
      definitions now that we don't need to worry about stable@ conflicts as a
      result of missing context.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      2664f3c9