Commits · aa373cf550994623efb5d49a4d8775bafd10bbc1 · gsplhtlxg / clone-Linux

14 Jan, 2011 40 commits

writeback: stop background/kupdate works from livelocking other works · aa373cf5

Committed by Jan Kara on Jan 13, 2011

Background writeback is easily livelockable in a loop in wb_writeback() by
a process continuously re-dirtying pages (or continuously appending to a
file).  This is in fact intended as the target of background writeback is
to write dirty pages it can find as long as we are over
dirty_background_threshold.

But the above behavior gets inconvenient at times because no other work
queued in the flusher thread's queue gets processed.  In particular, since
e.g.  sync(1) relies on flusher thread to do all the IO for it, sync(1)
can hang forever waiting for flusher thread to do the work.

Generally, when a flusher thread has some work queued, someone submitted
the work to achieve a goal more specific than what background writeback
does.  Moreover, by working on the specific work, we also reduce the amount
of dirty pages, which is exactly the target of background writeout.  So it
makes sense to give specific work a priority over a generic page cleaning.

Thus we interrupt background writeback if there is some other work to do.
We return to the background writeback after completing all the queued
work.

This may delay the writeback of expired inodes for a while, however the
expired inodes will eventually be flushed to disk as long as the other
works won't livelock.

[fengguang.wu@intel.com: update comment]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
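
The prioritisation described in this commit can be illustrated with a toy model. The following Python sketch is not kernel code: `run_flusher`, `arrivals`, and the one-page-per-step accounting are assumptions made purely for illustration. It shows background writeback yielding to queued work and resuming once the queue is empty.

```python
from collections import deque


def run_flusher(queue, dirty, threshold, arrivals=None):
    """Toy flusher loop (illustrative only, not the kernel implementation).

    queue     -- deque of specific work items (e.g. queued on behalf of sync)
    dirty     -- number of dirty pages
    threshold -- background writeback runs while dirty > threshold
    arrivals  -- hypothetical map of step number -> work item queued by
                 another task while the flusher is running
    """
    arrivals = arrivals or {}
    log = []
    step = 0
    while True:
        if step in arrivals:
            queue.append(arrivals[step])
        step += 1
        if queue:
            # Specific work always takes priority over generic cleaning.
            log.append(queue.popleft())
        elif dirty > threshold:
            # Background writeback handles one page per step, so the queue
            # is re-checked before any further background work is done.
            dirty -= 1
            log.append("background")
        else:
            return log


print(run_flusher(deque(), dirty=4, threshold=1, arrivals={1: "sync"}))
```

In this model the work item that arrives during background writeback is handled on the very next step, and background writeback then continues until the dirty count drops to the threshold.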


writeback: trace wakeup event for background writeback · 71927e84

Committed by Wu Fengguang on Jan 13, 2011

This tracks when balance_dirty_pages() tries to wake up the flusher thread
for background writeback (if it was not started already).
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


writeback: integrated background writeback work · 6585027a

Committed by Jan Kara on Jan 13, 2011

Check whether background writeback is needed after finishing each work.

When the bdi flusher thread finishes doing some work, check whether any kind of
background writeback needs to be done (either because
dirty_background_ratio is exceeded or because we need to start flushing
old inodes).  If so, just do background write back.

This way, bdi_start_background_writeback() just needs to wake up the
flusher thread.  It will do background writeback as soon as there is no
other work.

This is a preparatory patch for the next patch which stops background
writeback as soon as there is other work to do.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


mm: vmstat: use a single setter function and callback for adjusting percpu thresholds · b44129b3

Committed by Mel Gorman on Jan 13, 2011

reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
errors due to counter drift.  The functions duplicate some code so this
patch replaces them with a single set_pgdat_percpu_threshold() that takes
a callback function to calculate the desired threshold as a parameter.

[akpm@linux-foundation.org: readability tweak]
[kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
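
The refactoring pattern described above, one setter that takes a callback computing the threshold, might look roughly like the following Python sketch. The data layout and both policy functions (`normal_threshold`, `pressure_threshold`) are hypothetical stand-ins, not the kernel's actual calculations.

```python
def set_percpu_threshold(zones, calculate_threshold):
    """Apply one threshold policy, supplied as a callback, to every zone."""
    for zone in zones:
        threshold = calculate_threshold(zone)
        for cpu in zone["percpu"]:
            cpu["stat_threshold"] = threshold


def normal_threshold(zone):
    # Hypothetical policy: a fixed, generous threshold for normal operation.
    return 64


def pressure_threshold(zone):
    # Hypothetical policy: keep the summed per-cpu drift within a margin.
    ncpus = len(zone["percpu"])
    return max(1, min(64, zone["max_drift"] // ncpus))


zones = [{"percpu": [{}, {}, {}, {}], "max_drift": 40}]
set_percpu_threshold(zones, pressure_threshold)  # e.g. while kswapd is awake
print([cpu["stat_threshold"] for cpu in zones[0]["percpu"]])
```

Passing the policy as a parameter removes the need for two near-identical reduce/restore functions that differ only in how the threshold is computed.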


mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8

Committed by Mel Gorman on Jan 13, 2011

Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
avoid synchronization overhead, these counters are maintained on a per-cpu
basis and drained both periodically and when a per-cpu delta is above a
threshold.  On large CPU systems, the difference between the estimate and
real value of NR_FREE_PAGES can be very high.  The system can get into a
case where pages are allocated far below the min watermark potentially
causing livelock issues.  The commit solved the problem by taking a better
reading of NR_FREE_PAGES when memory was low.

Unfortunately, as reported by Shaohua Li, this accurate reading can consume a
large amount of CPU time on systems with many sockets due to cache line
bouncing.  This patch takes a different approach.  For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift to
what should be a safe level.  This incurs a performance penalty in heavy
memory pressure by a factor that depends on the workload and the machine
but the machine should function correctly without accidentally exhausting
all memory on a node.  There is an additional cost when kswapd wakes and
sleeps but the event is not expected to be frequent - in Shaohua's test
case, there was one recorded sleep and wake event at least.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
introduced that takes a more accurate reading of NR_FREE_PAGES when called
from wakeup_kswapd, when deciding whether it is really safe to go back to
sleep in sleeping_prematurely() and when deciding if a zone is really
balanced or not in balance_pgdat().  We are still using an expensive
function but limiting how often it is called.

When the test case is reproduced, the time spent in the watermark
functions is reduced.  The following report is on the percentage of time
cumulatively spent in the functions zone_nr_free_pages(),
zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
zone_page_state_snapshot(), zone_page_state().

vanilla                      11.6615%
disable-threshold            0.2584%

David said:

: We had to pull aa454840 "mm: page allocator: calculate a better estimate
: of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
: internally because tests showed that it would cause the machine to stall
: as the result of heavy kswapd activity.  I merged it back with this fix as
: it is pending in the -mm tree and it solves the issue we were seeing, so I
: definitely think this should be pushed to -stable (and I would seriously
: consider it for 2.6.37 inclusion even at this late date).
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reported-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Tested-by: Nicolas Bareil <nico@chdir.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
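
The counter drift discussed in this commit can be shown with a small model. The Python sketch below is a simplification under stated assumptions: `PerCpuCounter` and its methods are hypothetical names, and the kernel's real vmstat code differs in detail. It shows why the cheap reading can be off by up to the number of CPUs times the threshold, and why lowering the threshold bounds that error.

```python
class PerCpuCounter:
    """Toy per-cpu counter: updates stay local until they exceed a threshold."""

    def __init__(self, ncpus, threshold, initial=0):
        self.total = initial           # shared value, cheap to read
        self.deltas = [0] * ncpus      # per-cpu updates not yet folded in
        self.threshold = threshold

    def add(self, cpu, n):
        self.deltas[cpu] += n
        if abs(self.deltas[cpu]) > self.threshold:
            # Fold the local delta into the shared total.
            self.total += self.deltas[cpu]
            self.deltas[cpu] = 0

    def read_fast(self):
        # Cheap estimate: ignores pending per-cpu deltas.
        return self.total

    def read_snapshot(self):
        # Accurate but expensive: touches every CPU's delta.
        return self.total + sum(self.deltas)

    def max_drift(self):
        return len(self.deltas) * self.threshold


free_pages = PerCpuCounter(ncpus=4, threshold=10, initial=1000)
for cpu in range(4):
    free_pages.add(cpu, -10)           # each CPU allocates 10 pages
print(free_pages.read_fast(), free_pages.read_snapshot())
```

With a threshold of 10 on 4 CPUs, the cheap reading still reports 1000 free pages while only 960 remain; with a much smaller threshold the same updates are folded in immediately, at the cost of more frequent writes to the shared value.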


sched: remove long deprecated CLONE_STOPPED flag · 43bb40c9

Committed by Dave Jones on Jan 13, 2011

This warning was added in commit bdff746a ("clone: prepare to recycle
CLONE_STOPPED") three years ago.  2.6.26 came and went.  As far as I know,
no-one is actually using CLONE_STOPPED.
Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


atmel_serial: fix RTS high after initialization in RS485 mode · 5dfbd1d7

Committed by Claudio Scordino on Jan 13, 2011

When working in RS485 mode, the atmel_serial driver keeps RTS high after
the initialization of the serial port.  It goes low only after the first
character has been sent.

[akpm@linux-foundation.org: simplify code]
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Arkadiusz Bubala <arkadiusz.bubala@gmail.com>
Tested-by: Arkadiusz Bubala <arkadiusz.bubala@gmail.com>
Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


irq: use per_cpu kstat_irqs · 6c9ae009

Committed by Eric Dumazet on Jan 13, 2011

Use modern per_cpu API to increment {soft|hard}irq counters, and use
per_cpu allocation for (struct irq_desc)->kstat_irqs instead of an array.

This gives better SMP/NUMA locality and saves a few instructions per irq.

With small nr_cpu_ids values (8 for example), kstat_irqs was a small array
(less than L1_CACHE_BYTES), a potential source of false sharing.

In the !CONFIG_SPARSE_IRQ case, remove the huge, NUMA/cache unfriendly
kstat_irqs_all[NR_IRQS][NR_CPUS] array.

Note: we still populate kstat_irqs for all possible irqs in
early_irq_init().  We probably could use on-demand allocations.  (Code
included in alloc_descs()).  The problem is that not all IRQs are used
with a prior alloc_descs() call.

kstat_irqs_this_cpu() is not used anymore, remove it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


MAINTAINERS: update entries affecting VIA Technologies · 558bbb2f

Committed by Bruce Chang on Jan 13, 2011

Since the original maintainer, Joseph Chan (josephchan@via.com.tw), no longer
handles the Linux driver for VIA, I would like to request an update of the
maintainer for the SD/MMC CARD CONTROLLER DRIVER and VIA
UNICHROME(PRO)/CHROME9 FRAMEBUFFER DRIVER until we find a better one.
Signed-off-by: Bruce Chang <brucechang@via.com.tw>
Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Cc: Joseph Chan <JosephChan@via.com.tw>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Harald Welte <HaraldWelte@viatech.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm · f6bcfd94

Committed by Linus Torvalds on Jan 13, 2011

* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (32 commits)
  dm: raid456 basic support
  dm: per target unplug callback support
  dm: introduce target callbacks and congestion callback
  dm mpath: delay activate_path retry on SCSI_DH_RETRY
  dm: remove superfluous irq disablement in dm_request_fn
  dm log: use PTR_ERR value instead of ENOMEM
  dm snapshot: avoid storing private suspended state
  dm snapshot: persistent make metadata_wq multithreaded
  dm: use non reentrant workqueues if equivalent
  dm: convert workqueues to alloc_ordered
  dm stripe: switch from local workqueue to system_wq
  dm: dont use flush_scheduled_work
  dm snapshot: remove unused dm_snapshot queued_bios_work
  dm ioctl: suppress needless warning messages
  dm crypt: add loop aes iv generator
  dm crypt: add multi key capability
  dm crypt: add post iv call to iv generator
  dm crypt: use io thread for reads only if mempool exhausted
  dm crypt: scale to multiple cpus
  dm crypt: simplify compatible table output
  ...


Merge branch 'for-linus' of git://neil.brown.name/md · 509e4aef

Committed by Linus Torvalds on Jan 13, 2011

* 'for-linus' of git://neil.brown.name/md:
  md: Fix removal of extra drives when converting RAID6 to RAID5
  md: range check slot number when manually adding a spare.
  md/raid5: handle manually-added spares in start_reshape.
  md: fix sync_completed reporting for very large drives (>2TB)
  md: allow suspend_lo and suspend_hi to decrease as well as increase.
  md: Don't let implementation detail of curr_resync leak out through sysfs.
  md: separate meta and data devs
  md-new-param-to_sync_page_io
  md-new-param-to-calc_dev_sboffset
  md: Be more careful about clearing flags bit in ->recovery
  md: md_stop_writes requires mddev_lock.
  md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer
  md: Ensure no IO request to get md device before it is properly initialised.
  md: Fix single printks with multiple KERN_<level>s
  md: fix regression resulting in delays in clearing bits in a bitmap
  md: fix regression with re-adding devices to arrays with no metadata


Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 · 375b6f5a

Committed by Linus Torvalds on Jan 13, 2011

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] fix build error - arch/ia64/kernel/perfmon.c

Revert "gpiolib: annotate gpio-intialization with __must_check" · d8a3515e

Committed by Linus Torvalds on Jan 13, 2011

This reverts commit 0fdae42d, which
wasn't really supposed to go in, and causes lots of annoying warnings.

Quoth Andrew:
  "Complete brainfart - I meant to drop that patch ages ago."

Quoth Greg:
  "Ick, yeah, that patch isn't ok to go in as-is, all of the callers
   need to be fixed up first, which is what I thought we had agreed on..."
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


ecryptfs: fix broken build · 6254b32b

Committed by Linus Torvalds on Jan 13, 2011

Stephen Rothwell reports that the vfs merge broke the build of ecryptfs.
The breakage comes from commit 66cb7666 ("sanitize ecryptfs
->mount()") which was obviously not even build tested. Tssk, tssk, Al.

This is the minimal build fixup for the situation, although I don't have
a filesystem to actually test it with.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


[IA64] fix build error - arch/ia64/kernel/perfmon.c · 09579770

Committed by Tony Luck on Jan 13, 2011

arch/ia64/kernel/perfmon.c:621: error: duplicate 'static'

Introduced by commit c74a1cbb

    pass default dentry_operations to mount_pseudo()
Signed-off-by: Tony Luck <tony.luck@intel.com>


md: Fix removal of extra drives when converting RAID6 to RAID5 · bf2cb0da

Committed by NeilBrown on Jan 14, 2011

When a RAID6 is converted to a RAID5, the extra drive should
be discarded.  However it isn't due to a typo in a comparison.

This bug was introduced in commit e93f68a1 in 2.6.35-rc4,
and the fix is suitable for any -stable since then.

As the extra drive is not removed, the 'degraded' counter is wrong and
so the RAID5 will not respond correctly to a subsequent failure.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


md: range check slot number when manually adding a spare. · ba1b41b6

Committed by NeilBrown on Jan 14, 2011

When adding a spare to an active array, we should check the slot
number, but allow it to be larger than raid_disks if a reshape
is being prepared.

Apply the same test when adding a device to an
array-under-construction.  It already had most of the test in place,
but not quite all.
Signed-off-by: NeilBrown <neilb@suse.de>


md/raid5: handle manually-added spares in start_reshape. · 1a940fce

Committed by NeilBrown on Jan 14, 2011

It is possible to manually add spares to specific slots before
starting a reshape.
raid5_start_reshape should recognise this possibility and include
it in the accounting.
Signed-off-by: NeilBrown <neilb@suse.de>


md: fix sync_completed reporting for very large drives (>2TB) · 13ae864b

Committed by Rémi Rérolle on Jan 14, 2011

The values exported in the sync_completed file are unsigned long, which
overflows with very large drives, resulting in wrong values reported.

Since sync_completed uses sectors as unit, we'll start getting wrong
values with components larger than 2TB.

This patch simply replaces the use of unsigned long by unsigned long long.
Signed-off-by: Rémi Rérolle <rrerolle@lacie.com>
Signed-off-by: NeilBrown <neilb@suse.de>
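
The 2TB limit follows from simple arithmetic, sketched below in Python. The sketch assumes 512-byte sectors and a 32-bit `unsigned long` (as on 32-bit platforms), and it treats "2TB" as 2 TiB; the helper names are illustrative.

```python
SECTOR_BYTES = 512
ULONG32_MASK = (1 << 32) - 1     # assumption: unsigned long is 32 bits wide
TIB = 1 << 40


def to_sectors(nbytes):
    return nbytes // SECTOR_BYTES


def as_ulong32(value):
    # Model the truncation that happens when a value is stored in 32 bits.
    return value & ULONG32_MASK


for size in (1 * TIB, 2 * TIB, 3 * TIB):
    exact = to_sectors(size)
    print(size // TIB, "TiB:", exact, "sectors, stored as", as_ulong32(exact))
```

A 2 TiB component is exactly 2^32 sectors, which is the first value that no longer fits in 32 bits, so the truncated value wraps to 0. A 64-bit `unsigned long long` holds these values without wrapping.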


md: allow suspend_lo and suspend_hi to decrease as well as increase. · 23ddff37

Committed by NeilBrown on Jan 14, 2011

The sysfs attributes 'suspend_lo' and 'suspend_hi' describe a region
to which reads/writes are suspended so that the underlying data can be
manipulated without user-space noticing.
Currently the window they describe can only move forwards along the
device.  However this is an unnecessary restriction which will cause
problems with planned developments.
So relax this restriction and allow these endpoints to move
arbitrarily.
Signed-off-by: NeilBrown <neilb@suse.de>


md: Don't let implementation detail of curr_resync leak out through sysfs. · 75d3da43

Committed by NeilBrown on Jan 14, 2011

mddev->curr_resync has artificial values of '1' and '2' which are used
by the code which ensures only one resync is happening at a time on
any given device.

These values are internal and should never be exposed to user-space
(except when translated appropriately as in the 'pending' status in
/proc/mdstat).

Unfortunately they are exposed, as ->curr_resync is assigned to
->curr_resync_completed and that value is directly visible through
sysfs.

So change the assignments to ->curr_resync_completed to get the same
value from elsewhere in a form that doesn't have the magic '1' or '2'
values.
Signed-off-by: NeilBrown <neilb@suse.de>


md: separate meta and data devs · a6ff7e08

Committed by Jonathan Brassow on Jan 14, 2011

Allow the metadata to be on a separate device from the
data.

This doesn't mean the data and metadata will be on separate
physical devices - it simply gives device-mapper and userspace
tools more flexibility.
Signed-off-by: NeilBrown <neilb@suse.de>


md-new-param-to_sync_page_io · ccebd4c4

Committed by Jonathan Brassow on Jan 14, 2011

Add new parameter to 'sync_page_io'.

The new parameter allows us to distinguish between metadata and data
operations.  This becomes important later when we add the ability to
use separate devices for data and metadata.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>


md-new-param-to-calc_dev_sboffset · 57b2caa3

Committed by Jonathan Brassow on Jan 14, 2011

When we allow for separate devices for data and metadata
in a later patch, we will need to be able to calculate
the superblock offset based on more than the bdev.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>


md: Be more careful about clearing flags bit in ->recovery · 7ebc0be7

Committed by NeilBrown on Jan 14, 2011

Setting ->recovery to 0 is generally not a good idea as it could clear
bits that shouldn't be cleared.  In particular, MD_RECOVERY_FROZEN
should only be cleared on explicit request from user-space.

So when we need to clear things, just clear the bits that need
clearing.

As there are a few different places which reap a resync process - and
some do an incomplete job - factor out the code for doing this from
md_check_recovery and call that function instead of open coding part
of it.
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
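
The difference between zeroing the flags word and clearing only selected bits can be seen in a short Python sketch. The flag names mirror those used by md, but the bit positions here are made up for illustration and `clear_bits` is a hypothetical helper.

```python
# Bit positions are illustrative assumptions, not the kernel's values.
MD_RECOVERY_RUNNING = 1 << 0
MD_RECOVERY_SYNC = 1 << 1
MD_RECOVERY_FROZEN = 1 << 2


def clear_bits(flags, mask):
    """Clear only the bits in mask, leaving every other bit untouched."""
    return flags & ~mask


recovery = MD_RECOVERY_RUNNING | MD_RECOVERY_SYNC | MD_RECOVERY_FROZEN

zeroed = 0  # "recovery = 0" also drops FROZEN, which user-space had set
selective = clear_bits(recovery, MD_RECOVERY_RUNNING | MD_RECOVERY_SYNC)

print(bool(zeroed & MD_RECOVERY_FROZEN), bool(selective & MD_RECOVERY_FROZEN))
```

Only the selective version preserves the frozen state until user-space explicitly asks for it to be cleared.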


md: md_stop_writes requires mddev_lock. · defad61a

Committed by NeilBrown on Jan 14, 2011

As md_stop_writes manipulates the sync_thread and calls md_update_sb,
it needs to be called with mddev_lock held.

In all internal cases it is, but the symbol is exported for dm-raid to
call and in that case the lock won't be held.
So make an exported version which takes the lock, and an internal
version which does not.
Signed-off-by: NeilBrown <neilb@suse.de>


md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer · 43c73ca4

Committed by Jonathan Brassow on Jan 14, 2011

With the module parameter 'start_dirty_degraded' set,
raid5_spare_active() previously called sysfs_notify_dirent() with a NULL
argument (rdev->sysfs_state) when a rebuild finished.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


md: Ensure no IO request to get md device before it is properly initialised. · 0ca69886

Committed by NeilBrown on Jan 14, 2011

When an md device is in the process of coming on line it is possible
for an IO request (typically a partition table probe) to get through
before the array is fully initialised, which can cause unexpected
behaviour (e.g. a crash).

So explicitly record when the array is ready for IO and don't allow IO
through until then.

There is no possibility for a similar problem when the array is going
off-line as there must only be one 'open' at that time, and it is busy
off-lining the array and so cannot send IO requests.  So no memory
barrier is needed in md_stop().

This has been a bug since commit 409c57f3 in 2.6.30 which
introduced md_make_request.  Before then, each personality would
register its own make_request_fn when it was ready.
This is suitable for any stable kernel from 2.6.30.y onwards.

Cc: <stable@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>


md: Fix single printks with multiple KERN_<level>s · 067032bc

Committed by Joe Perches on Jan 14, 2011

Noticed-by: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: NeilBrown <neilb@suse.de>


md: fix regression resulting in delays in clearing bits in a bitmap · 6c987910

Committed by NeilBrown on Jan 14, 2011

commit 589a594b (2.6.37-rc4) fixed a problem where md_thread would
sometimes call the ->run function at a bad time.

If an error is detected during array start up after the md_thread has
been started, the md_thread is killed.  This resulted in the ->run
function being called once.  However the array may not be in a state
that it is safe to call ->run.

However the fix imposed meant that ->run was not called on a timeout.
This means that when an array goes idle, bitmap bits do not get
cleared promptly.  While the array is busy the bits will still be
cleared when appropriate so this is not very serious.  There is no
risk to data.

Change the test so that we only avoid calling ->run when the thread
is being stopped.  This more explicitly addresses the problem situation.

This is suitable for 2.6.37-stable and any -stable kernel to which
589a594b was applied.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>


Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/avr32-2.6 · 2a86cb7c

Committed by Linus Torvalds on Jan 13, 2011

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/avr32-2.6:
  avr32: update default configuration files for Atmel boards
  avr32: Convert to clocksource_register_hz
  avr32: make architecture sys_clone prototype match asm-generic prototype
  avr32: use syscall prototypes from asm-generic instead of arch
  avr32: disable kprobes for all default configurations
  avr32: boards: setup: use IS_ERR() instead of NULL check


NFS: Fix NFSv3 exclusive open semantics · 8a0eebf6

Committed by Trond Myklebust on Jan 13, 2011

Commit c0204fd2 (NFS: Clean up
nfs4_proc_create()) broke NFSv3 exclusive open by removing the code
that passes the O_EXCL flag down to nfs3_proc_create(). This patch
reverts that offending hunk from the original commit.
Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org    [2.6.37]
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


dm: raid456 basic support · 9d09e663

Committed by NeilBrown on Jan 13, 2011

This patch is the skeleton for the DM target that will be
the bridge from DM to MD (initially RAID456 and later RAID1).  It
provides a way to use device-mapper interfaces to the MD RAID456
drivers.

As with all device-mapper targets, the nominal public interfaces are the
constructor (CTR) tables and the status outputs (both STATUSTYPE_INFO
and STATUSTYPE_TABLE).  The CTR table looks like the following:

1: <s> <l> raid \
2:	<raid_type> <#raid_params> <raid_params> \
3:	<#raid_devs> <meta_dev1> <dev1> .. <meta_devN> <devN>

Line 1 contains the standard first three arguments to any device-mapper
target - the start, length, and target type fields.  The target type in
this case is "raid".

Line 2 contains the arguments that define the particular raid
type/personality/level, the required arguments for that raid type, and
any optional arguments.  Possible raid types include: raid4, raid5_la,
raid5_ls, raid5_rs, raid6_zr, raid6_nr, and raid6_nc.  (again, raid1 is
planned for the future.)  The list of required and optional parameters
is the same for all the current raid types.  The required parameters are
positional, while the optional parameters are given as key/value pairs.
The possible parameters are as follows:
 <chunk_size>		Chunk size in sectors.
 [[no]sync]		Force/Prevent RAID initialization
 [rebuild <idx>]	Rebuild the drive indicated by the index
 [daemon_sleep <ms>]	Time between bitmap daemon work to clear bits
 [min_recovery_rate <kB/sec/disk>]	Throttle RAID initialization
 [max_recovery_rate <kB/sec/disk>]	Throttle RAID initialization
 [max_write_behind <value>]		See '-write-behind=' (man mdadm)
 [stripe_cache <sectors>]		Stripe cache size for higher RAIDs

Line 3 contains the list of devices that compose the array in
metadata/data device pairs.  If the metadata is stored separately, a '-'
is given for the metadata device position.  If a drive has failed or is
missing at creation time, a '-' can be given for both the metadata and
data drives for a given position.

Examples:
# RAID4 - 4 data drives, 1 parity
# No metadata devices specified to hold superblock/bitmap info
# Chunk size of 1MiB
# (Lines separated for easy reading)
0 1960893648 raid \
	raid4 1 2048 \
	5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

# RAID4 - 4 data drives, 1 parity (no metadata devices)
# Chunk size of 1MiB, force RAID initialization,
#	min recovery rate at 20 kiB/sec/disk
0 1960893648 raid \
        raid4 4 2048 min_recovery_rate 20 sync \
        5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

Performing a 'dmsetup table' should display the CTR table used to
construct the mapping (with possible reordering of optional
parameters).

Performing a 'dmsetup status' will yield information on the state and
health of the array.  The output is as follows:
1: <s> <l> raid \
2:	<raid_type> <#devices> <1 health char for each dev> <resync_ratio>

Line 1 is standard DM output.  Line 2 is best shown by example:
	0 1960893648 raid raid4 5 AAAAA 2/490221568
Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 complete with recovery.

Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
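
As a rough illustration of the status format described above, the following Python sketch splits a 'dmsetup status' line into its fields. `parse_raid_status` and `chunk_bytes` are hypothetical helpers written for this example, not part of any dm tooling, and the parser assumes the exact seven-field layout shown in this commit message.

```python
SECTOR_BYTES = 512


def chunk_bytes(chunk_sectors):
    """Convert a chunk size given in 512-byte sectors to bytes."""
    return chunk_sectors * SECTOR_BYTES


def parse_raid_status(line):
    """Parse '<s> <l> raid <raid_type> <#devices> <health> <resync_ratio>'."""
    fields = line.split()
    if len(fields) != 7 or fields[2] != "raid":
        raise ValueError("not a dm-raid status line")
    start, length, _target, raid_type, ndevs, health, ratio = fields
    done, total = (int(part) for part in ratio.split("/"))
    return {
        "start": int(start),
        "length": int(length),
        "raid_type": raid_type,
        "devices": int(ndevs),
        "alive": health.count("A"),   # 'A' marks a device that is alive
        "resync_done": done,
        "resync_total": total,
    }


status = parse_raid_status("0 1960893648 raid raid4 5 AAAAA 2/490221568")
print(status["raid_type"], status["alive"], "of", status["devices"], "alive")
print(chunk_bytes(2048))  # chunk size used in the examples, in bytes
```

The chunk size of 2048 sectors used in both table examples corresponds to 1048576 bytes, i.e. the 1MiB stated in the example comments.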


dm: per target unplug callback support · 99d03c14

Committed by NeilBrown on Jan 13, 2011

Add per-target unplug callback support.

Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


dm: introduce target callbacks and congestion callback · 9d357b07

Committed by NeilBrown on Jan 13, 2011

DM currently implements congestion checking by checking on congestion
in each component device.  For raid456 we need to also check if the
stripe cache is congested.

Add per-target congestion checker callback support.

Extending the target_callbacks structure with additional callback
functions allows for establishing multiple callbacks per-target (a
callback is also needed for unplug).

Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


dm mpath: delay activate_path retry on SCSI_DH_RETRY · 4e2d19e4

Committed by Chandra Seetharaman on Jan 13, 2011

This patch adds a user-configurable 'pg_init_delay_msecs' feature.  Use
this feature to specify the number of milliseconds to delay before
retrying scsi_dh_activate, when SCSI_DH_RETRY is returned.

SCSI Device Handlers return SCSI_DH_IMM_RETRY if we could retry
activation immediately and SCSI_DH_RETRY in cases where it is better to
retry after some delay.

Currently we immediately retry scsi_dh_activate irrespective of
SCSI_DH_IMM_RETRY and SCSI_DH_RETRY.

The 'pg_init_delay_msecs' feature may be provided during table create or
load, e.g.:
    dmsetup create --table "0 20971520 multipath 3 queue_if_no_path \
	pg_init_delay_msecs 2500 ..." mpatha

The default for 'pg_init_delay_msecs' is 2000 milliseconds.
Maximum configurable delay is 60000 milliseconds.  Specifying a
'pg_init_delay_msecs' of 0 will cause immediate retry.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Acked-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
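
The retry policy described above can be summarised in a few lines of Python. This is only a model of the stated rules (default 2000 ms, maximum 60000 ms, 0 meaning immediate retry); the function names and the use of strings for the handler results are assumptions made for the example.

```python
DEFAULT_PG_INIT_DELAY_MSECS = 2000
MAX_PG_INIT_DELAY_MSECS = 60000


def check_pg_init_delay(msecs):
    """Reject delays outside the configurable range stated in the commit."""
    if not 0 <= msecs <= MAX_PG_INIT_DELAY_MSECS:
        raise ValueError("pg_init_delay_msecs must be between 0 and 60000")
    return msecs


def retry_delay_msecs(dh_result, pg_init_delay_msecs=DEFAULT_PG_INIT_DELAY_MSECS):
    """Return how long to wait before retrying path activation."""
    if dh_result == "SCSI_DH_IMM_RETRY":
        return 0                                   # retry immediately
    if dh_result == "SCSI_DH_RETRY":
        return check_pg_init_delay(pg_init_delay_msecs)
    raise ValueError("result does not request a retry: " + dh_result)


print(retry_delay_msecs("SCSI_DH_RETRY"))          # default delay
print(retry_delay_msecs("SCSI_DH_RETRY", 2500))    # value from the table example
print(retry_delay_msecs("SCSI_DH_IMM_RETRY"))
```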


dm: remove superfluous irq disablement in dm_request_fn · 052189a2

Committed by Kiyoshi Ueda on Jan 13, 2011

This patch changes spin_lock_irq() to spin_lock() in dm_request_fn().
This patch is just a clean-up and no functional change.

The spin_lock_irq() was leftover from the early request-based dm code,
where map_request() used to enable interrupts.
Since current map_request() never enables interrupts, we can change it
to spin_lock() to match the prior spin_unlock().

Auditing the dm and block-layer code called from
map_request(), I confirmed that all functions save/restore interrupt
status, so no function returns with interrupts enabled.
I also haven't observed any problems in my test environment, which
uses scsi and the lpfc driver, after heavy I/O testing with occasional
path down/up.

Added BUG_ON() to detect breakage in future.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


dm log: use PTR_ERR value instead of ENOMEM · dbc883f1

Committed by Dan Carpenter on Jan 13, 2011

It's nicer to return the PTR_ERR() value instead of just returning
-ENOMEM.  In the current code the PTR_ERR() value is always equal to
-ENOMEM so this doesn't actually affect anything, but still...

In addition, dm_dirty_log_create() doesn't check for a specific -ENOMEM
return.  So this change is safe relative to potential for a non -ENOMEM
return in the future.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


dm snapshot: avoid storing private suspended state · b83b2f29

Committed by Mike Snitzer on Jan 13, 2011

Use dm_suspended() rather than having each snapshot target maintain a
private 'suspended' flag in struct dm_snapshot.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


dm snapshot: persistent make metadata_wq multithreaded · 239c8dd5

Committed by Tejun Heo on Jan 13, 2011

metadata_wq serves on-stack work items from chunk_io().  Even if
multiple chunk_io() are simultaneously in progress, each is
independent and queued only once, so multithreaded workqueue can be
safely used.

Switch metadata_wq to multithread and flush the work item instead of
the workqueue in chunk_io().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
