- 09 Oct 2013, 12 commits
-
-
By Peter Zijlstra
Introduce stop_two_cpus() in order to allow controlled swapping of two tasks. It repurposes the stop_machine() state machine but only stops the two CPUs involved, which lets us use on-stack structures and avoid machine-wide synchronization issues. The ordering of CPUs is important to avoid deadlocks: if unordered, two CPUs calling stop_two_cpus() on each other simultaneously would attempt to queue in the opposite order on each CPU, causing an AB-BA style deadlock. By always having the lowest-numbered CPU do the queueing of works, we guarantee that works are always queued in the same order, and deadlocks are avoided. Signed-off-by: Peter Zijlstra <peterz@infradead.org> [ Implemented deadlock avoidance. ] Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/1381141781-10992-38-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
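A minimal sketch of the ordering idea described above, not the kernel's actual stop_two_cpus() implementation; the queue_stop_work() and wait_for_both_cpus() helpers are assumptions used only for illustration:

    /* Illustrative sketch only; helper names are assumptions. */
    int stop_two_cpus_sketch(unsigned int cpu1, unsigned int cpu2,
                             int (*fn)(void *), void *arg)
    {
            if (cpu2 < cpu1) {
                    /* Canonicalise so the lower-numbered CPU is always queued
                     * first; two racing callers then agree on the order and
                     * the AB-BA deadlock cannot form. */
                    unsigned int tmp = cpu1;
                    cpu1 = cpu2;
                    cpu2 = tmp;
            }

            queue_stop_work(cpu1, fn, arg);         /* assumed helper */
            queue_stop_work(cpu2, fn, arg);         /* assumed helper */
            return wait_for_both_cpus(cpu1, cpu2);  /* assumed helper */
    }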
-
By Mel Gorman
When a preferred node is selected for a task there is an attempt to migrate the task to a CPU there. This may fail, in which case the task will only migrate if the active load balancer takes action. That may never happen if the conditions are not right. This patch checks at NUMA hinting fault time whether another attempt should be made to migrate the task; it makes an attempt at most once every five seconds. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-34-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
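A sketch of the rate-limited retry, assuming illustrative field and helper names (numa_migrate_retry, try_migrate_task_to_node()) rather than the exact kernel code:

    #define NUMA_MIGRATE_RETRY_INTERVAL (5 * HZ)   /* at most once every five seconds */

    static void numa_migrate_retry_sketch(struct task_struct *p, int preferred_nid)
    {
            /* Nothing to do if the task already runs on its preferred node. */
            if (cpu_to_node(task_cpu(p)) == preferred_nid)
                    return;

            /* Rate limit: only retry once the previous window has expired. */
            if (time_before(jiffies, p->numa_migrate_retry))
                    return;

            p->numa_migrate_retry = jiffies + NUMA_MIGRATE_RETRY_INTERVAL;
            try_migrate_task_to_node(p, preferred_nid);  /* assumed helper */
    }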
-
By Mel Gorman
There is a 90% regression observed with a large Oracle performance test on a 4 node system. Profiles indicated that the overhead was due to contention on sp_lock when looking up shared memory policies. These policies do not have the appropriate flags to allow them to be automatically balanced so trapping faults on them is pointless. This patch skips VMAs that do not have MPOL_F_MOF set. [riel@redhat.com: Initial patch] Signed-off-by: NMel Gorman <mgorman@suse.de> Reported-and-tested-by: NJoe Mario <jmario@redhat.com> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-32-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
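A hedged sketch of the filter this describes; the policy lookup helper is an assumption, but MPOL_F_MOF is the flag being tested:

    /* Only trap hinting faults on VMAs whose policy allows "migrate on
     * fault"; shared-memory policies without MPOL_F_MOF cannot be balanced
     * automatically, so scanning them only costs sp_lock contention. */
    static bool vma_policy_allows_numa_faults(struct vm_area_struct *vma)
    {
            struct mempolicy *pol = get_vma_policy_sketch(vma);  /* assumed helper */

            return pol && (pol->flags & MPOL_F_MOF);
    }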
-
By Mel Gorman
Ideally it would be possible to distinguish between NUMA hinting faults that are private to a task and those that are shared. If treated identically there is a risk that shared pages bounce between nodes depending on the order they are referenced by tasks. Ultimately what is desirable is that task private pages remain local to the task while shared pages are interleaved between sharing tasks running on different nodes to give good average performance. This is further complicated by THP as even applications that partition their data may not be partitioning on a huge page boundary. To start with, this patch assumes that multi-threaded or multi-process applications partition their data and that, in general, the private accesses are more important for cpu->memory locality. Also, no new infrastructure is required to treat private pages properly but interleaving for shared pages requires additional infrastructure. To detect private accesses the pid of the last accessing task is required but the storage requirements are high. This patch borrows heavily from Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking" to encode some bits from the last accessing task in the page flags as well as the node information. Collisions will occur but it is better than just depending on the node information. Node information is then used to determine if a page needs to migrate. The PID information is used to detect private/shared accesses. The preferred NUMA node is selected based on where the maximum number of approximately private faults were measured. Shared faults are not taken into consideration for a few reasons. First, if there are many tasks sharing the page then they'll all move towards the same node. The node will be compute overloaded and then scheduled away later only to bounce back again. Alternatively the shared tasks would just bounce around nodes because the fault information is effectively noise. Either way, accounting for shared faults the same as private faults can result in lower performance overall. The second reason is based on a hypothetical workload that has a small number of very important, heavily accessed private pages but a large shared array. The shared array would dominate the number of faults and be selected as the preferred node even though it's the wrong decision. The third reason is that multiple threads in a process will race each other to fault the shared page, making the fault information unreliable. Signed-off-by: Mel Gorman <mgorman@suse.de> [ Fix compilation error when !NUMA_BALANCING. ] Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
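A simplified sketch of the node+PID encoding; the real kernel packs these bits into page->flags with config-dependent widths, so the shifts and masks below are purely illustrative assumptions:

    #define SKETCH_PID_BITS 8
    #define SKETCH_PID_MASK ((1 << SKETCH_PID_BITS) - 1)

    static inline int nidpid_encode(int nid, int pid)
    {
            /* Store the node plus only the low bits of the last accessing pid. */
            return (nid << SKETCH_PID_BITS) | (pid & SKETCH_PID_MASK);
    }

    static inline int nidpid_to_nid(int nidpid)
    {
            return nidpid >> SKETCH_PID_BITS;
    }

    static inline bool nidpid_fault_is_private(int nidpid, int current_pid)
    {
            /* Matching low pid bits is treated as a private access; collisions
             * are possible but preferable to having no pid information at all. */
            return (nidpid & SKETCH_PID_MASK) == (current_pid & SKETCH_PID_MASK);
    }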
-
By Mel Gorman
Currently automatic NUMA balancing is unable to distinguish between falsely shared and private pages except by ignoring pages with an elevated page_mapcount entirely. This avoids shared pages bouncing between the nodes whose tasks are using them, but it ignores quite a lot of data. This patch kicks away those training wheels in preparation for adding support for identifying shared/private pages. The ordering is so that the impact of the shared/private detection can be easily measured. Note that the patch does not migrate shared, file-backed pages within VMAs marked VM_EXEC, as these are generally shared library pages. Migrating such pages is not beneficial as there is an expectation that they are read-shared between caches and iTLB/iCache pressure is generally low. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-28-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
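A sketch of the VM_EXEC exception mentioned above (the function name is an assumption): shared library text is typically file-backed and executable, so such VMAs are left out of the hinting-fault scan entirely.

    static bool vma_is_shared_library_text_sketch(struct vm_area_struct *vma)
    {
            /* File-backed and executable: almost certainly read-shared text
             * whose migration would only add iTLB and iCache churn. */
            return vma->vm_file && (vma->vm_flags & VM_EXEC);
    }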
-
By Mel Gorman
Ideally it would be possible to distinguish between NUMA hinting faults that are private to a task and those that are shared. This patch prepares infrastructure for separately accounting shared and private faults by allocating the necessary buffers and passing in relevant information. For now, all faults are treated as private and detection will be introduced later. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-26-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Mel Gorman
This patch favours moving tasks towards the NUMA node that recorded a higher number of NUMA faults during active load balancing. Ideally this is self-reinforcing: the longer the task runs on that node, the more faults it should incur, causing task_numa_placement to keep the task running on that node. In reality a big weakness is that the node's CPUs can be overloaded, and it would be more efficient to queue tasks on an idle node and migrate to the new node. That would require additional smarts in the balancer, so for now the balancer simply prefers to place the task on the preferred node for a number of PTE scans, which is controlled by the numa_balancing_settle_count sysctl. Once the settle_count number of scans has completed, the scheduler is free to place the task on an alternative node if the load is imbalanced. [srikar@linux.vnet.ibm.com: Fixed statistics] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [ Tunable and use higher faults instead of preferred. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
By Mel Gorman
NUMA hinting fault counts and placement decisions are both recorded in the same array which distorts the samples in an unpredictable fashion. The values linearly accumulate during the scan and then decay, creating a sawtooth-like pattern in the per-node counts. It also means that placement decisions are time sensitive. At best it means that it is very difficult to state that the buffer holds a decaying average of past faulting behaviour. At worst, it can confuse the load balancer if it sees one node with an artificially high count due to very recent faulting activity and may create a bouncing effect. This patch adds a second array. numa_faults stores the historical data which is used for placement decisions. numa_faults_buffer holds the fault activity during the current scan window. When the scan completes, numa_faults decays and the values from numa_faults_buffer are copied across. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-22-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
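A sketch of the end-of-window update, with an assumed per-node array layout: decay the historical statistics, then fold in and clear the buffer gathered during the scan window that just completed.

    static void numa_faults_fold_sketch(struct task_struct *p, int nr_node_ids)
    {
            int nid;

            for (nid = 0; nid < nr_node_ids; nid++) {
                    /* Halve the historical count so old behaviour decays away... */
                    p->numa_faults[nid] >>= 1;
                    /* ...then add this window's faults and reset the buffer. */
                    p->numa_faults[nid] += p->numa_faults_buffer[nid];
                    p->numa_faults_buffer[nid] = 0;
            }
    }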
-
By Mel Gorman
This patch selects a preferred node for a task to run on based on the NUMA hinting faults. This information is later used to migrate tasks towards the node during balancing. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-21-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
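A sketch of the placement decision, assuming the per-node fault array introduced by the related patches: simply prefer the node that accumulated the most hinting faults.

    static void task_numa_placement_sketch(struct task_struct *p, int nr_node_ids)
    {
            int nid, max_nid = -1;
            unsigned long max_faults = 0;

            for (nid = 0; nid < nr_node_ids; nid++) {
                    unsigned long faults = p->numa_faults[nid];  /* assumed field */

                    if (faults > max_faults) {
                            max_faults = faults;
                            max_nid = nid;
                    }
            }

            if (max_nid != -1)
                    p->numa_preferred_nid = max_nid;  /* assumed field */
    }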
-
By Mel Gorman
This patch tracks which nodes NUMA hinting faults were incurred on. This information is later used to schedule a task on the node storing the pages most frequently faulted on by the task. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
By Mel Gorman
The NUMA PTE scan rate is controlled with a combination of the numa_balancing_scan_period_min, numa_balancing_scan_period_max and numa_balancing_scan_size tunables. This scan rate is independent of the size of the task, and as an aside it is further complicated by the fact that numa_balancing_scan_size controls how many pages are marked pte_numa and not how much virtual memory is scanned. In combination, it is almost impossible to meaningfully tune the min and max scan periods, and reasoning about performance is complex when the time to complete a full scan is partially a function of the task's memory size. This patch alters the semantics of the min and max tunables to be about the length of time it takes to complete a scan of a task's occupied virtual address space. Conceptually this is a lot easier to understand. There is a "sanity" check to ensure the scan rate is never extremely fast, based on the amount of virtual memory that should be scanned in a second. The default of 2.5G seems arbitrary, but it was chosen so that the maximum scan rate after the patch roughly matches the maximum scan rate before the patch was applied. On a similar note, numa_scan_period is in milliseconds and not jiffies. Properly placed pages slow the scanning rate, but adding 10 jiffies to numa_scan_period means that the rate at which scanning slows depends on HZ, which is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
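A rough sketch of the new semantics in plain C; all names and constants are illustrative, not the kernel's: the tunable is treated as "milliseconds to cover the task's occupied address space once", spread evenly over the scan windows needed to cover it, with a floor that keeps the effective rate below roughly 2.5GB of virtual memory per second.

    /* full_scan_ms: the tunable, now meaning "ms per full pass over the task".
     * task_mb: occupied virtual address space; window_mb: size of one scan window. */
    static unsigned int scan_window_period_ms(unsigned long task_mb,
                                              unsigned long window_mb,
                                              unsigned int full_scan_ms)
    {
            unsigned long windows = task_mb / window_mb;
            unsigned int period, floor;

            if (windows == 0)
                    windows = 1;

            /* Spread the requested full-scan time evenly across the windows. */
            period = full_scan_ms / windows;

            /* Sanity floor: each window must take long enough that no more
             * than ~2500MB of virtual memory gets scanned per second. */
            floor = (unsigned int)(window_mb * 1000 / 2500);

            return period > floor ? period : floor;
    }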
-
By Mel Gorman
PTE scanning and NUMA hinting fault handling is expensive, so commit 5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node") deferred the PTE scan until a task had been scheduled on another node. The problem is that in the purely shared memory case this may never happen and no NUMA hinting fault information will be captured. We are not ruling out the possibility that something better can be done here, but for now this patch needs to be reverted and depend entirely on the scan_delay to avoid punishing short-lived processes. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 04 Oct 2013, 17 commits
-
-
By Ingo Molnar
Since we are changing wait.h profoundly, use the opportunity to: - add a sentence to explain what this file is about - remove whitespace noise - prettify weird looking line break fixup attempts - standardize type definition and initialization sequences - use consistent style details No code is changed. Acked-by: NPeter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-O8dIie5swnctqpupakatvqyq@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
Change all __wait_event*() implementations to match the corresponding wait_event*() signature for convenience. In particular this does away with the weird 'ret' logic. Since there are __wait_event*() users this requires we update them too. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092529.042563462@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
While not a whole-sale replacement like the others we can still reduce the size of __wait_event_hrtimeout() considerably by noting that the actual core of __wait_event_hrtimeout() is identical to what ___wait_event() generates. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.972793648@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
Reduce macro complexity by using the new ___wait_event() helper. No change in behaviour, identical generated code. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.898691966@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
Reduce macro complexity by using the new ___wait_event() helper. No change in behaviour, identical generated code. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.831085521@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
Reduce macro complexity by using the new ___wait_event() helper. No change in behaviour, identical generated code. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.759956109@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
Reduce macro complexity by using the new ___wait_event() helper. No change in behaviour, identical generated code. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.686006009@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
Reduce macro complexity by using the new ___wait_event() helper. No change in behaviour, identical generated code. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.612813379@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
Reduce macro complexity by using the new ___wait_event() helper. No change in behaviour, identical generated code. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.541716442@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
Reduce macro complexity by using the new ___wait_event() helper. No change in behaviour, identical generated code. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.469616907@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
Reduce macro complexity by using the new ___wait_event() helper. No change in behaviour, identical generated code. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.396949919@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
Reduce macro complexity by using the new ___wait_event() helper. No change in behaviour, identical generated code. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.325264677@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
Reduce macro complexity by using the new ___wait_event() helper. No change in behaviour, identical generated code. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.254863348@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
There's far too much duplication in the __wait_event macros; in order to fix this introduce ___wait_event() a macro with the capability to replace most other macros. With the previous patches changing the various __wait_event*() implementations to be more uniform; we can now collapse the lot without also changing generated code. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.181897111@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
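A heavily simplified sketch of the consolidation idea; this is not the real ___wait_event() macro (whose parameter list and exclusive-wait handling are richer), just an illustration of one core loop parameterised by the condition, the task state and the "what to do while waiting" step, which the individual wait_event_*() flavours then instantiate:

    #define ___wait_event_sketch(wq, condition, state, cmd)            \
    ({                                                                 \
            long __ret = 0;                                            \
            DEFINE_WAIT(__wait);                                       \
                                                                       \
            for (;;) {                                                 \
                    prepare_to_wait(&wq, &__wait, state);              \
                    if (condition)                                     \
                            break;                                     \
                    if (state == TASK_INTERRUPTIBLE &&                 \
                        signal_pending(current)) {                     \
                            __ret = -ERESTARTSYS;                      \
                            break;                                     \
                    }                                                  \
                    cmd;                                               \
            }                                                          \
            finish_wait(&wq, &__wait);                                 \
            __ret;                                                     \
    })

    /* A plain interruptible wait would instantiate cmd as schedule(). */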
-
By Peter Zijlstra
Purely a preparatory patch; it changes the control flow to match what will soon be generated by generic code so that that patch can be a unity transform. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.107994763@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
Commit 4c663cfc ("wait: fix false timeouts when using wait_event_timeout()") introduced an additional condition check after a timeout, but there are a few issues: it forgot one site, and it put the check after the main loop instead of at the actual timeout check. Cure both by wrapping the condition (as suggested by Oleg); this also avoids double evaluation of 'condition', which could be quite big. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092528.028892896@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
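A sketch of the "wrap the condition" fix, simplified from the real helper: evaluate the condition exactly once, and if it became true just as the timeout expired, convert the would-be timeout return into success.

    #define wait_cond_timeout_sketch(condition, __ret)                 \
    ({                                                                 \
            bool __cond = (condition);      /* evaluated only once */  \
            if (__cond && !__ret)                                      \
                    __ret = 1;              /* met right at timeout */ \
            __cond || __ret;                                           \
    })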
-
By Peter Zijlstra
There's two patterns to check signals in the __wait_event*() macros: if (!signal_pending(current)) { schedule(); continue; } ret = -ERESTARTSYS; break; And the more natural: if (signal_pending(current)) { ret = -ERESTARTSYS; break; } schedule(); Change them all into the latter form. Reviewed-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092527.956416254@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 01 Oct 2013, 2 commits
-
-
By Nicolas Dichtel
Since commit c93bdd0e ("netvm: allow skb allocation to use PFMEMALLOC reserves"), hole size is one bit less than what is written in the comment. Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
By Rafael Aquini
Isolated balloon pages can wrongly end up in LRU lists when migrate_pages() finishes its round without draining all the isolated page list. The same issue can happen when reclaim_clean_pages_from_list() tries to reclaim pages from an isolated page list, before migration, in the CMA path. Such balloon page leak opens a race window against LRU lists shrinkers that leads us to the following kernel panic: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 IP: [<ffffffff810c2625>] shrink_page_list+0x24e/0x897 PGD 3cda2067 PUD 3d713067 PMD 0 Oops: 0000 [#1] SMP CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 RIP: shrink_page_list+0x24e/0x897 RSP: 0000:ffff88003da499b8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5 RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40 RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45 R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40 R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58 FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0 Call Trace: shrink_inactive_list+0x240/0x3de shrink_lruvec+0x3e0/0x566 __shrink_zone+0x94/0x178 shrink_zone+0x3a/0x82 balance_pgdat+0x32a/0x4c2 kswapd+0x2f0/0x372 kthread+0xa2/0xaa ret_from_fork+0x7c/0xb0 Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 <48> 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb RIP [<ffffffff810c2625>] shrink_page_list+0x24e/0x897 RSP <ffff88003da499b8> CR2: 0000000000000028 ---[ end trace 703d2451af6ffbfd ]--- Kernel panic - not syncing: Fatal exception This patch fixes the issue, by assuring the proper tests are made at putback_movable_pages() & reclaim_clean_pages_from_list() to avoid isolated balloon pages being wrongly reinserted in LRU lists. [akpm@linux-foundation.org: clarify awkward comment text] Signed-off-by: NRafael Aquini <aquini@redhat.com> Reported-by: NLuiz Capitulino <lcapitulino@redhat.com> Tested-by: NLuiz Capitulino <lcapitulino@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 Sep 2013, 1 commit
-
-
By Ming Lei
Commit 638c5115 ("USBNET: support DMA SG") introduced DMA SG when the USB host controller is capable of building packets from discontinuous buffers, but missed handling the padding packet when building the DMA SG list. This patch attaches the pre-allocated padding packet at the end of the sg list, so the padding packet can be sent to the device if the driver requires it. Reported-by: David Laight <David.Laight@aculab.com> Acked-by: Oliver Neukum <oliver@neukum.org> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 28 Sep 2013, 2 commits
-
-
By Heiko Carstens
Linus suggested to replace #ifndef CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX #define arch_mutex_cpu_relax() cpu_relax() #endif with just a simple #ifndef arch_mutex_cpu_relax # define arch_mutex_cpu_relax() cpu_relax() #endif to get rid of CONFIG_HAVE_CPU_RELAX_SIMPLE. So architectures can simply define arch_mutex_cpu_relax if they want an architecture specific function instead of having to add a select statement in their Kconfig in addition. Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
-
By Peter Zijlstra
Yuanhan reported a serious throughput regression in his pigz benchmark. Using the ftrace patch I found that several idle paths need more TLC before we can switch the generic need_resched() over to preempt_need_resched. The preemption paths benefit most from preempt_need_resched and do indeed use it; all other need_resched() users don't really care that much so reverting need_resched() back to tif_need_resched() is the simple and safe solution. Reported-by: NYuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: lkp@linux.intel.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20130927153003.GF15690@laptop.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 27 Sep 2013, 2 commits
-
-
By K. Y. Srinivasan
The current code does not correctly negotiate the version numbers for the util driver when hosted on earlier hosts. The version numbers presented by this driver were not compatible with the version numbers supported by Windows Server 2008. Fix this problem. I would like to thank Olaf Hering (ohering@suse.com) for identifying the problem. Reported-by: NOlaf Hering <ohering@suse.com> Signed-off-by: NK. Y. Srinivasan <kys@microsoft.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
By Arend van Spriel
This patch removes the bcma_core_pci_power_save() call from the bcma_core_pci_{up,down}() functions as it tries to schedule thus requiring to call them from non-atomic context. The function bcma_core_pci_power_save() is now exported so the calling module can explicitly use it in non-atomic context. This fixes the 'scheduling while atomic' issue reported by Tod Jackson and Joe Perches. [ 13.210710] BUG: scheduling while atomic: dhcpcd/1800/0x00000202 [ 13.210718] Modules linked in: brcmsmac nouveau coretemp kvm_intel kvm cordic brcmutil bcma dell_wmi atl1c ttm mxm_wmi wmi [ 13.210756] CPU: 2 PID: 1800 Comm: dhcpcd Not tainted 3.11.0-wl #1 [ 13.210762] Hardware name: Alienware M11x R2/M11x R2, BIOS A04 11/23/2010 [ 13.210767] ffff880177c92c40 ffff880170fd1948 ffffffff8169af5b 0000000000000007 [ 13.210777] ffff880170fd1ab0 ffff880170fd1958 ffffffff81697ee2 ffff880170fd19d8 [ 13.210785] ffffffff816a19f5 00000000000f4240 000000000000d080 ffff880170fd1fd8 [ 13.210794] Call Trace: [ 13.210813] [<ffffffff8169af5b>] dump_stack+0x4f/0x84 [ 13.210826] [<ffffffff81697ee2>] __schedule_bug+0x43/0x51 [ 13.210837] [<ffffffff816a19f5>] __schedule+0x6e5/0x810 [ 13.210845] [<ffffffff816a1c34>] schedule+0x24/0x70 [ 13.210855] [<ffffffff816a04fc>] schedule_hrtimeout_range_clock+0x10c/0x150 [ 13.210867] [<ffffffff810684e0>] ? update_rmtp+0x60/0x60 [ 13.210877] [<ffffffff8106915f>] ? hrtimer_start_range_ns+0xf/0x20 [ 13.210887] [<ffffffff816a054e>] schedule_hrtimeout_range+0xe/0x10 [ 13.210897] [<ffffffff8104f6fb>] usleep_range+0x3b/0x40 [ 13.210910] [<ffffffffa00371af>] bcma_pcie_mdio_set_phy.isra.3+0x4f/0x80 [bcma] [ 13.210921] [<ffffffffa003729f>] bcma_pcie_mdio_write.isra.4+0xbf/0xd0 [bcma] [ 13.210932] [<ffffffffa0037498>] bcma_pcie_mdio_writeread.isra.6.constprop.13+0x18/0x30 [bcma] [ 13.210942] [<ffffffffa00374ee>] bcma_core_pci_power_save+0x3e/0x80 [bcma] [ 13.210953] [<ffffffffa003765d>] bcma_core_pci_up+0x2d/0x60 [bcma] [ 13.210975] [<ffffffffa03dc17c>] brcms_c_up+0xfc/0x430 [brcmsmac] [ 13.210989] [<ffffffffa03d1a7d>] brcms_up+0x1d/0x20 [brcmsmac] [ 13.211003] [<ffffffffa03d2498>] brcms_ops_start+0x298/0x340 [brcmsmac] [ 13.211020] [<ffffffff81600a12>] ? cfg80211_netdev_notifier_call+0xd2/0x5f0 [ 13.211030] [<ffffffff815fa53d>] ? packet_notifier+0xad/0x1d0 [ 13.211064] [<ffffffff81656e75>] ieee80211_do_open+0x325/0xf80 [ 13.211076] [<ffffffff8106ac09>] ? __raw_notifier_call_chain+0x9/0x10 [ 13.211086] [<ffffffff81657b41>] ieee80211_open+0x71/0x80 [ 13.211101] [<ffffffff81526267>] __dev_open+0x87/0xe0 [ 13.211109] [<ffffffff8152650c>] __dev_change_flags+0x9c/0x180 [ 13.211117] [<ffffffff815266a3>] dev_change_flags+0x23/0x70 [ 13.211127] [<ffffffff8158cd68>] devinet_ioctl+0x5b8/0x6a0 [ 13.211136] [<ffffffff8158d5c5>] inet_ioctl+0x75/0x90 [ 13.211147] [<ffffffff8150b38b>] sock_do_ioctl+0x2b/0x70 [ 13.211155] [<ffffffff8150b681>] sock_ioctl+0x71/0x2a0 [ 13.211169] [<ffffffff8114ed47>] do_vfs_ioctl+0x87/0x520 [ 13.211180] [<ffffffff8113f159>] ? ____fput+0x9/0x10 [ 13.211198] [<ffffffff8106228c>] ? 
task_work_run+0x9c/0xd0 [ 13.211202] [<ffffffff8114f271>] SyS_ioctl+0x91/0xb0 [ 13.211208] [<ffffffff816aa252>] system_call_fastpath+0x16/0x1b [ 13.211217] NOHZ: local_softirq_pending 202 The issue was introduced in v3.11 kernel by following commit: commit aa51e598 Author: Hauke Mehrtens <hauke@hauke-m.de> Date: Sat Aug 24 00:32:31 2013 +0200 brcmsmac: use bcma PCIe up and down functions replace the calls to bcma_core_pci_extend_L1timer() by calls to the newly introduced bcma_core_pci_ip() and bcma_core_pci_down() Signed-off-by: NHauke Mehrtens <hauke@hauke-m.de> Cc: Arend van Spriel <arend@broadcom.com> Signed-off-by: NJohn W. Linville <linville@tuxdriver.com> This fix has been discussed with Hauke Mehrtens [1] selection option 3) and is intended for v3.12. Ref: [1] http://mid.gmane.org/5239B12D.3040206@hauke-m.de Cc: <stable@vger.kernel.org> # 3.11.x Cc: Tod Jackson <tod.jackson@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Rafal Milecki <zajec5@gmail.com> Cc: Hauke Mehrtens <hauke@hauke-m.de> Reviewed-by: NHante Meuleman <meuleman@broadcom.com> Signed-off-by: NArend van Spriel <arend@broadcom.com> Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
-
- 26 Sep 2013, 1 commit
-
-
By Trond Myklebust
Determine if we've created a new file by examining the directory change attribute and/or the O_EXCL flag. This fixes a regression when doing a non-exclusive create of a new file. If the FILE_CREATED flag is not set, the atomic_open() command will perform full file access permissions checks instead of just checking for MAY_OPEN. Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
-
- 25 Sep 2013, 3 commits
-
-
By Peter Zijlstra
Remove the bloat of the C calling convention out of the preempt_enable() sites by creating an ASM wrapper which allows us to do an asm("call ___preempt_schedule") instead. calling.h bits by Andi Kleen Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-tk7xdi1cvvxewixzke8t8le1@git.kernel.org [ Fixed build error. ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
By Peter Zijlstra
When using per-cpu preempt_count variables we need to save/restore the preempt_count on context switch (into per task storage; for instance the old thread_info::preempt_count variable) because of PREEMPT_ACTIVE. However, this means that on fork() the preempt_count value of the last context switch gets copied and if we had a PREEMPT_ACTIVE switch right before cloning a child task the child task will now too have PREEMPT_ACTIVE set and start its life with an extra PREEMPT_ACTIVE count. Therefore we need to make init_task_preempt_count() unconditional; this resets whatever preempt_count we inherited from our parent process. Doing so for !per-cpu implementations is harmless. For !PREEMPT_COUNT kernels we need to be careful not to start life with an increased preempt_count. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-4k0b7oy1rcdyzochwiixuwi9@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
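A sketch of the unconditional fork-time reset, using an illustrative literal rather than the kernel's named constants: on PREEMPT_COUNT kernels the child starts at "preemption disabled once" regardless of whatever count, including PREEMPT_ACTIVE, the parent had at the moment of the clone; a !PREEMPT_COUNT kernel would reset to 0 instead.

    #define init_task_preempt_count_sketch(p) do {                     \
            /* Discard whatever the parent had; never let a freshly    \
             * forked child inherit PREEMPT_ACTIVE. */                 \
            task_thread_info(p)->preempt_count = 1;                    \
    } while (0)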
-
By Peter Zijlstra
Rewrite the preempt_count macros in order to extract the 3 basic preempt_count value modifiers: __preempt_count_add() __preempt_count_sub() and the new: __preempt_count_dec_and_test() And since we're at it anyway, replace the unconventional $op_preempt_count names with the more conventional preempt_count_$op. Since these basic operators are equivalent to the previous _notrace() variants, do away with the _notrace() versions. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-ewbpdbupy9xpsjhg960zwbv8@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
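A sketch of the three basic operators on a generic thread_info-based preempt_count; the real definitions are per-arch, and on x86 they soon operate on a per-cpu variable instead:

    static __always_inline void __preempt_count_add_sketch(int val)
    {
            current_thread_info()->preempt_count += val;
    }

    static __always_inline void __preempt_count_sub_sketch(int val)
    {
            current_thread_info()->preempt_count -= val;
    }

    static __always_inline bool __preempt_count_dec_and_test_sketch(void)
    {
            /* True when the count hits zero and a reschedule is pending,
             * i.e. the point where preempt_enable() should reschedule. */
            return !--current_thread_info()->preempt_count &&
                    test_thread_flag(TIF_NEED_RESCHED);
    }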
-