Commits · 83cd4fe27ad8446619b2e030b171b858501de87d · gsplhtlxg / clone-Linux

09 Jun, 2010 10 commits

sched: Change nohz idle load balancing logic to push model · 83cd4fe2

Authored by Venkatesh Pallipadi on May 21, 2010

In the new push model, all idle CPUs indeed go into nohz mode. There is
still the concept of an idle load balancer (performing the load balancing
on behalf of all the idle CPUs in the system). A busy CPU kicks the nohz
balancer when any of the nohz CPUs need idle load balancing.
The kickee CPU does the idle load balancing on behalf of all idle CPUs
instead of the normal idle balance.

This addresses the following two problems with the current nohz ilb logic:
* the idle load balancer continued to have periodic ticks during idle and
  woke up frequently, even though it did not have any rebalancing to do on
  behalf of any of the idle CPUs.
* On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
  periodic wakeup can result in a periodic additional interrupt on a CPU
  doing the timer broadcast.

Also, currently we migrate the unpinned timers from an idle cpu to the cpu
doing idle load balancing. (When all the cpus in the system are idle,
there is no idle load balancing cpu and timers get added to the same idle cpu
where the request was made, so the existing optimization works only on a
semi-idle system.)

And in a semi-idle system, we no longer have periodic ticks on the idle load
balancer CPU. Using that cpu will add more delay to the timers than intended
(as that cpu's timer base may not be up to date wrt jiffies etc). This was
causing mysterious slowdowns during boot etc.

For now, in the semi idle case, use the nearest busy cpu for migrating timers
from an idle cpu.  This is good for power-savings anyway.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Avoid side-effect of tickless idle on update_cpu_load · fdf3e95d

Authored by Venkatesh Pallipadi on May 17, 2010

tickless idle has a negative side effect on update_cpu_load(), which
in turn can affect load balancing behavior.

update_cpu_load() is supposed to be called every tick, to keep track
of various load indices. With tickless idle, there are no scheduler
ticks called on the idle CPUs. Idle CPUs may still do load balancing
(with idle_load_balance CPU) using the stale cpu_load. It will also
cause problems when all CPUs go idle for a while and become active
again. In this case loads would not degrade as expected.

This is what the rq->nr_load_updates change looks like under different
conditions:

<cpu_num> <nr_load_updates change>
All CPUS idle for 10 seconds (HZ=1000)
0 1621
10 496
11 139
12 875
13 1672
14 12
15 21
1 1472
2 2426
3 1161
4 2108
5 1525
6 701
7 249
8 766
9 1967

One CPU busy rest idle for 10 seconds
0 10003
10 601
11 95
12 966
13 1597
14 114
15 98
1 3457
2 93
3 6679
4 1425
5 1479
6 595
7 193
8 633
9 1687

All CPUs busy for 10 seconds
0 10026
10 10026
11 10026
12 10026
13 10025
14 10025
15 10025
1 10026
2 10026
3 10026
4 10026
5 10026
6 10026
7 10026
8 10026
9 10026

That is, update_cpu_load() works properly only when all CPUs are busy.
If all are idle, all the CPUs get far fewer updates.  And when a few
CPUs are busy and the rest are idle, only the busy and ilb CPUs do proper
updates and the rest of the idle CPUs get fewer updates.

The patch keeps track of when the last update was done and fixes up
the load avg based on current time.
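
The catch-up idea can be sketched as follows. This is only a minimal
illustration, not the patch itself: the demo_* names, the struct layout and
the assumption that a CPU carried zero load during every missed tick are all
hypothetical, and the moving-average formula is a simplified stand-in.

```c
#define DEMO_LOAD_IDX_MAX 5

/* Hypothetical per-CPU bookkeeping; not the kernel's struct rq. */
struct demo_rq {
    unsigned long cpu_load[DEMO_LOAD_IDX_MAX];
    unsigned long last_update_tick;   /* tick at which cpu_load was last refreshed */
};

/* One tick of the moving average at index idx:
 * new = (old * (2^idx - 1) + cur) / 2^idx. Index 0 simply tracks cur. */
static unsigned long demo_step(unsigned long old, unsigned long cur, int idx)
{
    unsigned long scale = 1UL << idx;

    return (old * (scale - 1) + cur) >> idx;
}

/* Apply `missed` ticks during which the CPU is assumed to have had zero load. */
static unsigned long demo_decay_missed(unsigned long load, unsigned long missed, int idx)
{
    while (missed > 0 && load > 0) {
        load = demo_step(load, 0, idx);
        missed--;
    }
    return load;
}

/* Fold in all ticks since the last update: decay for the missed ones,
 * then apply one regular step with the current load. */
static void demo_update_cpu_load(struct demo_rq *rq, unsigned long now_tick,
                                 unsigned long cur_load)
{
    unsigned long pending = now_tick - rq->last_update_tick;
    int idx;

    if (pending == 0)
        return;
    rq->last_update_tick = now_tick;

    for (idx = 0; idx < DEMO_LOAD_IDX_MAX; idx++) {
        unsigned long old = demo_decay_missed(rq->cpu_load[idx], pending - 1, idx);

        rq->cpu_load[idx] = demo_step(old, cur_load, idx);
    }
}
```

With this shape, a CPU that slept through many ticks sees its stale load
values degrade on the next update instead of staying frozen at their
pre-idle level.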

On one of my test systems, running SPECjbb with warehouses 1..numcpus, the
patch improves throughput numbers by ~1% (average of 6 runs).  On another
test system (with a different domain hierarchy) there is no noticeable
change in perf.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <AANLkTilLtDWQsAUrIxJ6s04WTgmw9GuOODc5AOrYsaR5@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Simplify the reacquire_kernel_lock() logic · 246d86b5

Authored by Oleg Nesterov on May 19, 2010

- Contrary to what 6d558c3a says, there is no need to reload
  prev = rq->curr after the context switch. You always schedule
  back to where you came from, prev must be equal to current
  even if cpu/rq was changed.

- This also means reacquire_kernel_lock() can use prev instead
  of current.

- No need to reassign switch_count if reacquire_kernel_lock()
  reports need_resched(), we can just move the initial assignment
  down, under the "need_resched_nonpreemptible:" label.

- Try to update the comment after context_switch().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100519125711.GA30199@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched_clock: Add local_clock() API and improve documentation · c676329a

Authored by Peter Zijlstra on May 25, 2010

For people who otherwise get to write: cpu_clock(smp_processor_id()),
there is now: local_clock().

Also, as per suggestion from Andrew, provide some documentation on
the various clock interfaces, and minimize the unsigned long long vs
u64 mess.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
LKML-Reference: <1275052414.1645.52.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Merge branch 'sched-wq' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into sched/core · 95ae3c59
Authored by Ingo Molnar on Jun 08, 2010

sched: add hooks for workqueue · 21aa9af0

Authored by Tejun Heo on Jun 08, 2010

Concurrency managed workqueue needs to know when workers are going to
sleep and waking up.  Using these two hooks, cmwq keeps track of the
current concurrency level and throttles execution of new works if it's
too high and wakes up another worker from the sleep hook if it becomes
too low.

This patch introduces PF_WQ_WORKER to identify workqueue workers and
adds the following two hooks.

* wq_worker_waking_up(): called when a worker is woken up.

* wq_worker_sleeping(): called when a worker is going to sleep and may
  return a pointer to a local task which should be woken up.  The
  returned task is woken up using try_to_wake_up_local() which is
  simplified ttwu which is called under rq lock and can only wake up
  local tasks.
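
How the two hooks could drive a concurrency counter can be sketched as
follows. This is a minimal model under stated assumptions, not the cmwq
implementation: the demo_* types and functions are hypothetical, the pool
tracks a single idle worker, and "wake another worker" is reduced to
returning a pointer.

```c
#include <stddef.h>

/* Hypothetical worker and pool types, for illustration only. */
struct demo_worker {
    const char *name;
    int running;                /* nonzero while counted as runnable */
};

struct demo_pool {
    int nr_running;             /* current concurrency level */
    int pending_works;          /* work items waiting to be executed */
    struct demo_worker *idle;   /* an idle worker that could be woken, or NULL */
};

/* Hook invoked when a worker is woken up: raise the concurrency level. */
static void demo_worker_waking_up(struct demo_pool *pool, struct demo_worker *w)
{
    if (!w->running) {
        w->running = 1;
        pool->nr_running++;
    }
}

/* Hook invoked when a worker is about to sleep: lower the concurrency level
 * and, if nobody is left running while work is pending, hand back a worker
 * that the caller should wake up. */
static struct demo_worker *demo_worker_sleeping(struct demo_pool *pool,
                                                struct demo_worker *w)
{
    if (!w->running)
        return NULL;
    w->running = 0;
    pool->nr_running--;
    if (pool->nr_running == 0 && pool->pending_works > 0)
        return pool->idle;
    return NULL;
}
```

Returning the task instead of waking it inside the hook mirrors the
description above, where the scheduler performs the wakeup itself under the
rq lock.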

Both hooks are currently defined as noop in kernel/workqueue_sched.h.
Later cmwq implementation will replace them with proper
implementation.

These hooks are hard coded as they'll always be enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>

sched: refactor try_to_wake_up() · 9ed3811a

Authored by Tejun Heo on Dec 03, 2009

Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().
The factoring out doesn't affect try_to_wake_up() much
code-generation-wise.  Depending on configuration options, it ends up
generating the same object code as before or slightly different one
due to different register assignment.

This is to help future implementation of try_to_wake_up_local().

Mike Galbraith suggested renaming ttwu_woken_up() to
ttwu_post_activation() and updating the comment in try_to_wake_up().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>

sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining · 3a101d05

Authored by Tejun Heo on Jun 08, 2010

Currently, when a cpu goes down, cpu_active is cleared before
CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
default priority cpu notifier.  When a cpu is coming up, it's set
before CPU_ONLINE but cpuset configuration again is updated from the
same cpu notifier.

For cpu notifiers, this presents an inconsistent state.  Threads which
a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
migrated to other cpus because the cpu is no longer active.

Fix it by updating cpu_active in the highest priority cpu notifier and
cpuset configuration in the second highest when a cpu is coming up.
Down path is updated similarly.  This guarantees that all other cpu
notifiers see consistent cpu_active and cpuset configuration.

cpuset_track_online_cpus() notifier is converted to
cpuset_update_active_cpus() which just updates the configuration and
is now called from cpuset_cpu_[in]active() notifiers registered from
sched_init_smp().  If cpuset is disabled, cpuset_update_active_cpus()
degenerates into partition_sched_domains() making separate notifier
for !CONFIG_CPUSETS unnecessary.

This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
callback creates a kthread and kthread_bind()s it to the target cpu,
and the thread is expected to run on that cpu.

* Ingo's test discovered __cpuinit/exit markups were incorrect.
  Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Menage <menage@google.com>

sched: define and use CPU_PRI_* enums for cpu notifier priorities · 50a323b7

Authored by Tejun Heo on Jun 08, 2010

Instead of hardcoding priority 10 and 20 in sched and perf, collect
them into CPU_PRI_* enums.
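
The effect of naming the priorities can be sketched as follows. Only the
values 10 and 20 come from the message above; the DEMO_* enumerator names,
the demo_notifier type and the sorting helper are hypothetical and stand in
for the kernel's notifier chain, which calls higher-priority notifiers first.

```c
#include <stdlib.h>

/* Illustrative names; only the values 10 and 20 are taken from the commit message. */
enum {
    DEMO_CPU_PRI_PERF      = 20,
    DEMO_CPU_PRI_MIGRATION = 10,
};

/* Hypothetical stand-in for a registered cpu notifier. */
struct demo_notifier {
    const char *name;
    int priority;
};

/* qsort() comparator: higher priority sorts first. */
static int demo_notifier_cmp(const void *a, const void *b)
{
    const struct demo_notifier *x = a;
    const struct demo_notifier *y = b;

    return (y->priority > x->priority) - (y->priority < x->priority);
}

/* Order notifiers the way a priority-sorted chain would call them. */
static void demo_sort_notifiers(struct demo_notifier *n, size_t count)
{
    qsort(n, count, sizeof(*n), demo_notifier_cmp);
}
```

Named constants make the relative ordering visible at the registration site,
which a bare 10 or 20 does not.
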
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>

sched: Fix PROVE_RCU vs cpu_cgroup · dc61b1d6

Authored by Peter Zijlstra on Jun 08, 2010

PROVE_RCU has a few issues with the cpu_cgroup because the scheduler
typically holds rq->lock around the css rcu derefs but the generic
cgroup code doesn't (and can't) know about that lock.

Provide means to add extra checks to the css dereference and use that
in the scheduler to annotate its users.

The addition of rq->lock to these checks is correct because the
cgroup_subsys::attach() method takes the rq->lock for each task it
moves, therefore by holding that lock, we ensure the task is pinned to
the current cgroup and the RCU dereference is valid.

That leaves one genuine race in __sched_setscheduler() where we used
task_group() without holding any of the required locks and thus raced
with the cgroup code. Solve this by moving the check under the
appropriate lock.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

08 Jun, 2010 6 commits

Merge git://git.infradead.org/~dwmw2/mtd-2.6.35 · 3975d167

Authored by Linus Torvalds on Jun 07, 2010

* git://git.infradead.org/~dwmw2/mtd-2.6.35:
  jffs2: update ctime when changing the file's permission by setfacl
  jffs2: Fix NFS race by using insert_inode_locked()
  jffs2: Fix in-core inode leaks on error paths
  mtd: Fix NAND submenu
  mtd/r852: update card detect early.
  mtd/r852: Fixes in case of DMA timeout
  mtd/r852: register IRQ as last step
  drivers/mtd: Use memdup_user
  docbook: make mtd nand module init static

Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev · 4d3d769c

Authored by Linus Torvalds on Jun 07, 2010

* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
  ahci: redo stopping DMA engines on empty ports
  sata_sil24: fix kernel panic on ARM caused by unaligned access in sata_sil24
  ahci: add pci quirk for JMB362
  sata_via: explain the magic fix

ahci: redo stopping DMA engines on empty ports · 0ee71952

Authored by Tejun Heo on Jun 07, 2010

Commit 96d60303 (ahci: Turn off DMA engines when there's no device)
implemented stopping DMA engines on empty ports but it used single
sampling of status registers to determine device presence which led to
disabling of DMA engines on occupied ports.  Do it after all EH
actions are complete using device presence state determined by EH.
This avoids spurious disabling of DMA engines and simplifies the code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Robert Hancock <hancockrwd@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

sata_sil24: fix kernel panic on ARM caused by unaligned access in sata_sil24 · 7a4f876b

Authored by Colin Tuckley on Jun 04, 2010

The sata_sil24 driver has six 16-bit registers that are initialised with
32-bit writes. This causes a kernel panic on ARM due to the unaligned
accesses which result.

This patch changes the accesses to the correct 16-bit ones.
Signed-off-by: Colin Tuckley <colin.tuckley@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

ahci: add pci quirk for JMB362 · 4daedcfe

Authored by Tejun Heo on Jun 03, 2010

JMB362 is a new variant of jmicron controller which is similar to
JMB360 but has two SATA ports instead of one.  As there is no PATA
port, single function AHCI mode can be used as in JMB360.  Add pci
quirk for JMB362.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Aries Lee <arieslee@jmicron.com>
Cc: stable@kernel.org
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

sata_via: explain the magic fix · b475a3b8

Authored by Tejun Heo on Jun 03, 2010

Add Joseph Chan's explanation of the problem and workaround to the
VT6421 magic fix.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Joseph Chan <JosephChan@via.com.tw>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

07 Jun, 2010 2 commits

[PATCH 2/11] drivers/watchdog: Eliminate a NULL pointer dereference · cfca31ce

Authored by Julia Lawall on May 27, 2010

At the point of the call to dev_err, wm8350 is NULL.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r exists@
expression E,E1;
identifier f;
statement S1,S2,S3;
@@

if ((E == NULL && ...) || ...)
{
  ... when != if (...) S1 else S2
      when != E = E1
* E->f
  ... when any
  return ...;
}
else S3
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>

Revert "tty: fix a little bug in scrup, vt.c" · 386f40c8

Authored by Linus Torvalds on Jun 06, 2010

This reverts commit 962400e8, which was
entirely bogus.

The code used to multiply the character offset by "vc->vc_cols", and
that's actually correct, because 'd' itself is an 'unsigned short'.  So
the pointer arithmetic already takes the size of a VGA character into
account.  Changing it to use vc_size_row (which is just "vc_cols"
shifted up to take the size of the character into account) ends up
multiplying with the VGA character size twice.

This got reported as bugs for various other subsystems, because what it
actually results in is writing the 16-bit vc_video_erase_char pattern
(usually 0x0720: 0x07 is the default attribute, 0x20 is ASCII space)
into some random other allocation.
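
The double scaling can be reproduced in a few lines. This is a standalone
sketch, not the vt.c code: the demo_* helpers are hypothetical, and the
caller is assumed to pass a buffer large enough for both offsets.

```c
#include <stddef.h>
#include <stdint.h>

/* Correct form: `d` points at 16-bit cells, so adding rows * cols already
 * advances by rows * cols * 2 bytes. */
static uint16_t *demo_advance_cols(uint16_t *d, unsigned int rows, unsigned int cols)
{
    return d + rows * cols;
}

/* Reverted form: size_row is already a byte count (cols << 1), so the
 * pointer arithmetic applies the cell size a second time. */
static uint16_t *demo_advance_size_row(uint16_t *d, unsigned int rows, unsigned int cols)
{
    unsigned int size_row = cols << 1;

    return d + rows * size_row;
}

/* Distance in bytes between two pointers into the same buffer. */
static ptrdiff_t demo_byte_offset(const uint16_t *base, const uint16_t *p)
{
    return (const char *)p - (const char *)base;
}
```

For 3 rows of 80 columns the first helper lands 480 bytes into the buffer
and the second 960 bytes, i.e. twice as far as intended.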

So Markus ended up reporting this as an ext4 bug, while to Torsten Kaiser
it looked like a problem with KMS or libata.  Jeff Chua saw it in
different places.

And finally - Justin Mattock had slab poisoning enabled, and saw it as a
slab poison overwritten.  And bisected and reverted this to verify the
buggy commit.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reported-by: Jeff Chua <jeff.chua.linux@gmail.com>
Reported-by: Justin P. Mattock <justinmattock@gmail.com>
Reported-bisected-and-tested-by: Justin P. Mattock <justinmattock@gmail.com>
Acked-by: Dave Airlie <airlied@redhat.com>
Cc: Frank Pan <frankpzh@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

06 Jun, 2010 4 commits

jffs2: update ctime when changing the file's permission by setfacl · 1c24d06f

Authored by Jan Kara on Jun 04, 2010

jffs2 didn't update the ctime of the file when its permission was changed.

Steps to reproduce:
 # touch aaa
 # stat -c %Z aaa
 1275289822
 # setfacl -m  'u::x,g::x,o::x' aaa
 # stat -c %Z aaa
 1275289822                         <- unchanged

But, according to the spec of the ctime, jffs2 must update it.

Port of ext3 patch by Miao Xie <miaox@cn.fujitsu.com>.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>

Linux 2.6.35-rc2 · e44a21b7
Authored by Linus Torvalds on Jun 05, 2010

drm/i915: Move non-phys cursors into the GTT · e7b526bb

Authored by Chris Wilson on Jun 02, 2010

Cursors need to be in the GTT domain when being accessed by the GPU.
Previously this was a fortuitous byproduct of userspace using pwrite()
to upload the image data into the cursor. The redundant clflush was
removed in commit 9b8c4a and so the image was no longer being flushed
out of the caches into main memory. One could also devise a scenario
where the cursor was rendered by the GPU, prior to being attached as the
cursor, resulting in similar corruption due to the missing MI_FLUSH.

Fixes:

Bug 28335 - Cursor corruption caused by commit 9b8c4a0b
https://bugs.freedesktop.org/show_bug.cgi?id=28335
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reported-and-tested-by: Jeff Chua <jeff.chua.linux@gmail.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Andy Isaacson <adi@hexapodia.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 78b36558

Authored by Linus Torvalds on Jun 05, 2010

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix remaining racy updates of EXT4_I(inode)->i_flags
  ext4: Make sure the MOVE_EXT ioctl can't overwrite append-only files

05 Jun, 2010 18 commits

ext4: Fix remaining racy updates of EXT4_I(inode)->i_flags · 84a8dce2

Authored by Dmitry Monakhov on Jun 05, 2010

A few functions were still modifying i_flags in a racy manner.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs · 6c5de280

Authored by Linus Torvalds on Jun 05, 2010

* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: improve xfs_isilocked
  xfs: skip writeback from reclaim context
  xfs: remove done roadmap item from xfs-delayed-logging-design.txt
  xfs: fix race in inode cluster freeing failing to stale inodes
  xfs: fix access to upper inodes without inode64
  xfs: fix might_sleep() warning when initialising per-ag tree
  fs/xfs/quota: Add missing mutex_unlock
  xfs: remove duplicated #include
  xfs: convert more trace events to DEFINE_EVENT
  xfs: xfs_trace.c: remove duplicated #include
  xfs: Check new inode size is OK before preallocating
  xfs: clean up xlog_align
  xfs: cleanup log reservation calculactions
  xfs: be more explicit if RT mount fails due to config
  xfs: replace E2BIG with EFBIG where appropriate

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 · ed7dc1df

Authored by Linus Torvalds on Jun 05, 2010

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (30 commits)
  X25: remove duplicated #include
  tcp: use correct net ns in cookie_v4_check()
  rps: tcp: fix rps_sock_flow_table table updates
  ppp_generic: fix multilink fragment sizes
  syncookies: remove Kconfig text line about disabled-by-default
  ixgbe: only check pfc bits in hang logic if pfc is enabled
  net: check for refcount if pop a stacked dst_entry
  ixgbe: return IXGBE_ERR_RAR_INDEX when out of range
  act_pedit: access skb->data safely
  sfc: Store port number in net_device::dev_id
  epic100: Test __BIG_ENDIAN instead of (non-existent) CONFIG_BIG_ENDIAN
  tehuti: return -EFAULT on copy_to_user errors
  isdn/kcapi: return -EFAULT on copy_from_user errors
  e1000e: change logical negate to bitwise
  sfc: Get port number from CS_PORT_NUM, not PCI function number
  cls_u32: use skb_header_pointer() to dereference data safely
  TCP: tcp_hybla: Fix integer overflow in slow start increment
  act_nat: fix the wrong checksum when addr isn't in old_addr/mask
  net/fec: fix pm to survive to suspend/resume
  korina: count RX DMA OVR as rx_fifo_error
  ...

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 · 7926e0bf

Authored by Linus Torvalds on Jun 05, 2010

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: remove obsolete declarations of cache constructor and destructor
  nilfs2: fix style issue in nilfs_destroy_cachep

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 · 7f0d384c

Authored by Linus Torvalds on Jun 04, 2010

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  Minix: Clean up left over label
  fix truncate inode time modification breakage
  fix setattr error handling in sysfs, configfs
  fcntl: return -EFAULT if copy_to_user fails
  wrong type for 'magic' argument in simple_fill_super()
  fix the deadlock in qib_fs
  mqueue doesn't need make_bad_inode()

Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus · 90ec7819

Authored by Linus Torvalds on Jun 04, 2010

* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  module: fix bne2 "gave up waiting for init of module libcrc32c"
  module: verify_export_symbols under the lock
  module: move find_module check to end
  module: make locking more fine-grained.
  module: Make module sysfs functions private.
  module: move sysfs exposure to end of load_module
  module: fix kdb's illicit use of struct module_use.
  module: Make the 'usage' lists be two-way

module: fix bne2 "gave up waiting for init of module libcrc32c" · 9bea7f23

Authored by Rusty Russell on Jun 05, 2010

Problem: it's hard to avoid an init routine stumbling over a
request_module these days.  And it's not clear it's always a bad idea:
for example, a module like kvm with dynamic dependencies on kvm-intel
or kvm-amd would be neater if it could simply request_module the right
one.

In this particular case, it's libcrc32c:

	libcrc32c_mod_init
	 crypto_alloc_shash
	  crypto_alloc_tfm
	   crypto_find_alg
	    crypto_alg_mod_lookup
	     crypto_larval_lookup
	      request_module

If another module is waiting inside resolve_symbol() for libcrc32c to
finish initializing (ie. bne2 depends on libcrc32c) then it does so
holding the module lock, and our request_module() can't make progress
until that is released.

Waiting inside resolve_symbol() without the lock isn't all that hard:
we just need to pass the -EBUSY up the call chain so we can sleep
where we don't hold the lock.  Error reporting is a bit trickier: we
need to copy the name of the unfinished module before releasing the
lock.

Other notes:
1) This also fixes a theoretical issue where a weak dependency would allow
   symbol version mismatches to be ignored.
2) We rename use_module to ref_module to make life easier for the only
   external user (the out-of-tree ksplice patches).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tim Abbot <tabbott@ksplice.com>
Tested-by: Brandon Philips <bphilips@suse.de>

module: verify_export_symbols under the lock · be593f4c

Authored by Rusty Russell on Jun 05, 2010

It disabled preempt so it was "safe", but nothing stops another module
slipping in before this module is added to the global list now we don't
hold the lock the whole time.

So we check this just after we check for duplicate modules, and just
before we put the module in the global list.

(find_symbol finds symbols in coming and going modules, too).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

module: move find_module check to end · 3bafeb62

Authored by Linus Torvalds on Jun 05, 2010

I think Rusty may have made the lock a bit _too_ finegrained there, and
didn't add it to some places that needed it. It looks, for example, like
PATCH 1/2 actually drops the lock in places where it's needed
("find_module()" is documented to need it, but now load_module() didn't
hold it at all when it did the find_module()).

Rather than adding a new "module_loading" list, I think we should be able
to just use the existing "modules" list, and just fix up the locking a
bit.

In fact, maybe we could just move the "look up existing module" a bit
later - optimistically assuming that the module doesn't exist, and then
just undoing the work if it turns out that we were wrong, just before
adding ourselves to the list.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

module: make locking more fine-grained. · 75676500

Authored by Rusty Russell on Jun 05, 2010

Kay Sievers <kay.sievers@vrfy.org> reports that we still have some
contention over module loading which is slowing boot.

Linus also disliked a previous "drop lock and regrab" patch to fix the
bne2 "gave up waiting for init of module libcrc32c" message.

This is more ambitious: we only grab the lock where we need it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Brandon Philips <brandon@ifup.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>

module: Make module sysfs functions private. · 6407ebb2

Authored by Rusty Russell on Jun 05, 2010

These were placed in the header in ef665c1a to get the various
SYSFS/MODULE config combinations to compile.

That may have been necessary then, but it's not now.  These functions
are all local to module.c.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Randy Dunlap <randy.dunlap@oracle.com>

module: move sysfs exposure to end of load_module · 80a3d1bb

Authored by Rusty Russell on Jun 05, 2010

This means a little extra work, but is more logical: we don't put
anything in sysfs until we're about to put the module into the
global list and parse its parameters.

This also gives us a logical place to put duplicate module detection
in the next patch.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

module: fix kdb's illicit use of struct module_use. · c8e21ced

Authored by Rusty Russell on Jun 05, 2010

Linus changed the structure, and luckily this didn't compile any more.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Martin Hicks <mort@sgi.com>

module: Make the 'usage' lists be two-way · 2c02dfe7

Authored by Linus Torvalds on May 31, 2010

When adding a module that depends on another one, we used to create a
one-way list of "modules_which_use_me", so that module unloading could
see who needs a module.

It's actually quite simple to make that list go both ways: so that we
not only can see "who uses me", but also see a list of modules that are
"used by me".

In fact, we always wanted that list in "module_unload_free()": when we
unload a module, we want to also release all the other modules that are
used by that module.  But because we didn't have that list, we used to
first iterate over all modules, and then iterate over each "used by me"
list of that module.

By making the list two-way, we simplify module_unload_free(), and it
allows for some trivial fixes later too.
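
The two-way bookkeeping can be sketched as follows. This is a simplified
model, not the kernel's data structure: the demo_* names are hypothetical
and plain singly linked lists stand in for the kernel's list helpers. Each
"source uses target" edge sits on two lists at once, so unloading only has
to walk the unloaded module's own "used by me" list.

```c
#include <stdlib.h>

struct demo_module;

/* One "source uses target" edge, linked into both modules' lists. */
struct demo_use {
    struct demo_module *source;       /* the module that depends on target */
    struct demo_module *target;
    struct demo_use *next_in_source;  /* next edge in source->uses ("used by me") */
    struct demo_use *next_in_target;  /* next edge in target->used_by ("who uses me") */
};

struct demo_module {
    const char *name;
    int refcount;
    struct demo_use *uses;
    struct demo_use *used_by;
};

/* Record that `source` depends on `target`. Returns 0 on success, -1 on allocation failure. */
static int demo_add_use(struct demo_module *source, struct demo_module *target)
{
    struct demo_use *use = malloc(sizeof(*use));

    if (!use)
        return -1;
    use->source = source;
    use->target = target;
    use->next_in_source = source->uses;
    source->uses = use;
    use->next_in_target = target->used_by;
    target->used_by = use;
    target->refcount++;
    return 0;
}

/* Release every module that `mod` uses by walking only mod's own list. */
static void demo_unload_free(struct demo_module *mod)
{
    while (mod->uses) {
        struct demo_use *use = mod->uses;
        struct demo_use **pp = &use->target->used_by;

        mod->uses = use->next_in_source;
        while (*pp != use)            /* unlink the edge from the target's list */
            pp = &(*pp)->next_in_target;
        *pp = use->next_in_target;
        use->target->refcount--;
        free(use);
    }
}
```

Without the "used by me" list, the unload path has to visit every module and
search its "who uses me" list, which is the iteration the message describes
removing.
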
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleaned & rebased)

X25: remove duplicated #include · ca733594

Authored by Huang Weiyi on Jun 04, 2010

Remove duplicated #include('s) in drivers/net/wan/x25_asy.c
Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: use correct net ns in cookie_v4_check() · c4464921

Authored by Eric Dumazet on Jun 03, 2010

It's better to make a route lookup in the appropriate namespace.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rps: tcp: fix rps_sock_flow_table table updates · ca55158c

Authored by Eric Dumazet on Jun 03, 2010

I believe a moderate SYN flood attack can corrupt RFS flow table
(rps_sock_flow_table), making RPS/RFS much less effective.

Even in a normal situation, servers handling short-lived sessions suffer
from bad steering for the first data packet of a session, if another SYN
packet is received for another session.

We do the following in tcp_v4_rcv():

	sock_rps_save_rxhash(sk, skb->rxhash);

We should _not_ do this if sk is a LISTEN socket, as nearly every
packet received on a LISTEN socket has a different rxhash than
the previous one.
 -> RPS_NO_CPU markers are spread all over rps_sock_flow_table.

Also, it makes sense to protect sk->rxhash field changes with socket
lock (We currently can change it even if user thread owns the lock
and might use rxhash)

This patch moves sock_rps_save_rxhash() to a sock locked section,
and only for non LISTEN sockets.
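
The resulting rule can be sketched as follows. This is a toy model, not the
tcp_v4_rcv() code: the demo_* types and functions are hypothetical and a
simple counter stands in for the socket lock.

```c
#include <stdint.h>

enum demo_sk_state { DEMO_ESTABLISHED, DEMO_LISTEN };

/* Hypothetical, heavily reduced socket. */
struct demo_sock {
    enum demo_sk_state state;
    uint32_t rxhash;
    int lock_depth;             /* stand-in for the socket lock: nonzero while held */
};

static void demo_lock_sock(struct demo_sock *sk)   { sk->lock_depth++; }
static void demo_unlock_sock(struct demo_sock *sk) { sk->lock_depth--; }

/* Record the flow hash only for non-LISTEN sockets, and only while the
 * lock is held. Returns 1 if the hash was recorded, 0 otherwise. */
static int demo_rcv(struct demo_sock *sk, uint32_t skb_rxhash)
{
    int saved = 0;

    demo_lock_sock(sk);
    if (sk->state != DEMO_LISTEN) {
        sk->rxhash = skb_rxhash;
        saved = 1;
    }
    demo_unlock_sock(sk);
    return saved;
}
```

A listener receives SYNs from many unrelated flows, so skipping it keeps
those hashes from overwriting steering state that belongs to established
sessions.
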
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ppp_generic: fix multilink fragment sizes · 536e00e5

Authored by Ben McKeegan on Jun 02, 2010

Fix bug in multilink fragment size calculation introduced by
commit 9c705260
"ppp: ppp_mp_explode() redesign"
Signed-off-by: Ben McKeegan <ben@netservers.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>