提交 · f2e91042a807cbf9b0b0d9776bf37d1ef0bd7ebe · openeuler / Kernel

30 5月, 2014 1 次提交

KVM: PPC: Add CAP to indicate hcall fixes · f2e91042

由 Alexander Graf 提交于 5月 22, 2014

We worked around some nasty KVM magic page hcall breakages:

  1) NX bit not honored, so ignore NX when we detect it
  2) LE guests swizzle hypercall instruction

Without these fixes in place, there's no way it would make sense to expose kvm
hypercalls to a guest. Chances are immensely high it would trip over and break.

So add a new CAP that gives user space a hint that we have workarounds for the
bugs above in place. It can use those as hint to disable PV hypercalls when
the guest CPU is anything POWER7 or higher and the host does not have fixes
in place.
Signed-off-by: NAlexander Graf <agraf@suse.de>

f2e91042

06 5月, 2014 1 次提交

KVM: s390: Add clock comparator and CPU timer IRQ injection · e029ae5b

由 Thomas Huth 提交于 3月 26, 2014

Add an interface to inject clock comparator and CPU timer interrupts
into the guest. This is needed for handling the external interrupt
interception.
Signed-off-by: NThomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: NCornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>

e029ae5b

05 5月, 2014 1 次提交

kvm/irqchip: Speed up KVM_SET_GSI_ROUTING · 719d93cd

由 Christian Borntraeger 提交于 1月 16, 2014

When starting lots of dataplane devices the bootup takes very long on
Christian's s390 with irqfd patches. With larger setups he is even
able to trigger some timeouts in some components. Turns out that the
KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec)
when having multiple CPUs. This is caused by the synchronize_rcu and
the HZ=100 of s390. By changing the code to use a private srcu we can
speed things up. This patch reduces the boot time till mounting root
from 8 to 2 seconds on my s390 guest with 100 disks.

Uses of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu
are fine because they do not have lockdep checks (hlist_for_each_entry_rcu
uses rcu_dereference_raw rather than rcu_dereference, and write-sides
do not do rcu lockdep at all).

Note that we're hardly relying on the "sleepable" part of srcu. We just
want SRCU's faster detection of grace periods.

Testing was done by Andrew Theurer using netperf tests STREAM, MAERTS
and RR. The difference between results "before" and "after" the patch
has mean -0.2% and standard deviation 0.6%. Using a paired t-test on the
data points says that there is a 2.5% probability that the patch is the
cause of the performance difference (rather than a random fluctuation).

(Restricting the t-test to RR, which is the most likely to be affected,
changes the numbers to respectively -0.3% mean, 0.7% stdev, and 8%
probability that the numbers actually say something about the patch.
The probability increases mostly because there are fewer data points).

Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # s390
Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

719d93cd

29 4月, 2014 1 次提交

KVM: s390: enable IBS for single running VCPUs · 8ad35755

由 David Hildenbrand 提交于 3月 14, 2014

This patch enables the IBS facility when a single VCPU is running.
The facility is dynamically turned on/off as soon as other VCPUs
enter/leave the stopped state.

When this facility is operating, some instructions can be executed
faster for single-cpu guests.
Signed-off-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: NDominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: NCornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>

8ad35755

24 4月, 2014 1 次提交

Revert "KVM: Simplify kvm->tlbs_dirty handling" · a086f6a1

由 Xiao Guangrong 提交于 4月 17, 2014

This reverts commit 5befdc38.

Since we will allow flush tlb out of mmu-lock in the later
patch
Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

a086f6a1

22 4月, 2014 2 次提交

KVM: add kvm_is_error_gpa() helper · dfeec843

由 Heiko Carstens 提交于 1月 01, 2014

It's quite common (in the s390 guest access code) to test if a guest
physical address points to a valid guest memory area or not.
So add a simple helper function in common code, since this might be
of interest for other architectures as well.
Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: NThomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: NCornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>

dfeec843

KVM: s390: Per-vm kvm device controls · f2061656

由 Dominik Dingel 提交于 4月 09, 2014

We sometimes need to get/set attributes specific to a virtual machine
and so need something else than ONE_REG.

Let's copy the KVM_DEVICE approach, and define the respective ioctls
for the vm file descriptor.
Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: NCornelia Huck <cornelia.huck@de.ibm.com>
Acked-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>

f2061656

18 4月, 2014 2 次提交

KVM: VMX: speed up wildcard MMIO EVENTFD · 68c3b4d1

由 Michael S. Tsirkin 提交于 3月 31, 2014

With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.

Unfortunately, this only works if userspace does not need to match on
access length and data.  The implementation adds a separate FAST_MMIO
bus internally. This serves two purposes:
    - minimize overhead for old userspace that does not use eventfd with lengtth = 0
    - minimize disruption in other code (since we don't know the length,
      devices on the MMIO bus only get a valid address in write, this
      way we don't need to touch all devices to teach them to handle
      an invalid length)

At the moment, this optimization only has effect for EPT on x86.

It will be possible to speed up MMIO for NPT and MMU using the same
idea in the future.

With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO.
I was unable to detect any measureable slowdown to non-eventfd MMIO.

Making MMIO faster is important for the upcoming virtio 1.0 which
includes an MMIO signalling capability.

The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
pre-review and suggestions.
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

68c3b4d1

KVM: support any-length wildcard ioeventfd · f848a5a8

由 Michael S. Tsirkin 提交于 3月 31, 2014

It is sometimes benefitial to ignore IO size, and only match on address.
In hindsight this would have been a better default than matching length
when KVM_IOEVENTFD_FLAG_DATAMATCH is not set, In particular, this kind
of access can be optimized on VMX: there no need to do page lookups.
This can currently be done with many ioeventfds but in a suboptimal way.

However we can't change kernel/userspace ABI without risk of breaking
some applications.
Use len = 0 to mean "ignore length for matching" in a more optimal way.
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

f848a5a8

16 4月, 2014 2 次提交

ipv4: add a sock pointer to dst->output() path. · aad88724

由 Eric Dumazet 提交于 4月 15, 2014

In the dst->output() path for ipv4, the code assumes the skb it has to
transmit is attached to an inet socket, specifically via
ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
provider of the packet is an AF_PACKET socket.

The dst->output() method gets an additional 'struct sock *sk'
parameter. This needs a cascade of changes so that this parameter can
be propagated from vxlan to final consumer.

Fixes: 8f646c92 ("vxlan: keep original skb ownership")
Reported-by: Nlucien xin <lucien.xin@gmail.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aad88724

ipv4: add a sock pointer to ip_queue_xmit() · b0270e91

由 Eric Dumazet 提交于 4月 15, 2014

ip_queue_xmit() assumes the skb it has to transmit is attached to an
inet socket. Commit 31c70d59 ("l2tp: keep original skb ownership")
changed l2tp to not change skb ownership and thus broke this assumption.

One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
so that we do not assume skb->sk points to the socket used by l2tp
tunnel.

Fixes: 31c70d59 ("l2tp: keep original skb ownership")
Reported-by: NZhan Jianyu <nasa4836@gmail.com>
Tested-by: NZhan Jianyu <nasa4836@gmail.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b0270e91

15 4月, 2014 3 次提交

Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" · 362d5204

由 Daniel Borkmann 提交于 4月 14, 2014

This reverts commit ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management
to reflect real state of the receiver's buffer") as it introduced a
serious performance regression on SCTP over IPv4 and IPv6, though a not
as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.

Current state:

[root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 17:56:21 GMT
Connecting to host 192.168.241.3, port 5201
      Cookie: Lab200slot2.1397238981.812898.548918
[  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
[  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
[  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
[  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
[  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
[  4]   6.21-6.21   sec  0.00 Bytes    0.00 bits/sec
[  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
[  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
[  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
[  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
[  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
[  4]  16.79-16.79  sec  0.00 Bytes    0.00 bits/sec
[  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
(etc)

[root@Lab200slot2 ~]#  iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 19:08:41 GMT
Connecting to host 2001:db8:0:f101::1, port 5201
      Cookie: Lab200slot2.1397243321.714295.2b3f7c
[  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
[  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
[  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
[  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
[  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
[  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
[  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
[  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
[  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
[  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
[  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
[  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
(etc)

After patch:

[root@Lab200slot2 ~]#  iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
Time: Mon, 14 Apr 2014 16:40:48 GMT
Connecting to host 192.168.240.3, port 5201
      Cookie: Lab200slot2.1397493648.413274.65e131
[  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
[  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
[  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
[  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec

With the reverted patch applied, the SCTP/IPv4 performance is back
to normal on latest upstream for IPv4 and IPv6 and has same throughput
as 3.4.2 test kernel, steady and interval reports are smooth again.

Fixes: ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
Reported-by: NPeter Butler <pbutler@sonusnet.com>
Reported-by: NDongsheng Song <dongsheng.song@gmail.com>
Reported-by: NFengguang Wu <fengguang.wu@intel.com>
Tested-by: NPeter Butler <pbutler@sonusnet.com>
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: NVlad Yasevich <vyasevich@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

362d5204

net: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W · 8c482cdc

由 Daniel Borkmann 提交于 4月 14, 2014

While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has
been wrongly decoded by commit a8fc9277 ("sk-filter: Add ability to
get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS
although it should have been decoded as BPF_LD|BPF_W|BPF_ABS.

In practice, this should not have much side-effect though, as such
conversion is/was being done through prctl(2) PR_SET_SECCOMP. Reverse
operation PR_GET_SECCOMP will only return the current seccomp mode, but
not the filter itself. Since the transition to the new BPF infrastructure,
it's also not used anymore, so we can simply remove this as it's
unreachable.

Fixes: a8fc9277 ("sk-filter: Add ability to get socket filter program (v2)")
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8c482cdc

ipv6: Limit mtu to 65575 bytes · 30f78d8e

由 Eric Dumazet 提交于 4月 10, 2014

Francois reported that setting big mtu on loopback device could prevent
tcp sessions making progress.

We do not support (yet ?) IPv6 Jumbograms and cook corrupted packets.

We must limit the IPv6 MTU to (65535 + 40) bytes in theory.

Tested:

ifconfig lo mtu 70000
netperf -H ::1

Before patch : Throughput :   0.05 Mbits

After patch : Throughput : 35484 Mbits
Reported-by: NFrancois WELLENREITER <f.wellenreiter@gmail.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

30f78d8e

14 4月, 2014 1 次提交

netfilter: nf_tables: fix nft_cmp_fast failure on big endian for size < 4 · b855d416

由 Patrick McHardy 提交于 4月 12, 2014

nft_cmp_fast is used for equality comparisions of size <= 4. For
comparisions of size < 4 byte a mask is calculated that is applied to
both the data from userspace (during initialization) and the register
value (during runtime). Both values are stored using (in effect) memcpy
to a memory area that is then interpreted as u32 by nft_cmp_fast.

This works fine on little endian since smaller types have the same base
address, however on big endian this is not true and the smaller types
are interpreted as a big number with trailing zero bytes.

The mask therefore must not include the lower bytes, but the higher bytes
on big endian. Add a helper function that does a cpu_to_le32 to switch
the bytes on big endian. Since we're dealing with a mask of just consequitive
bits, this works out fine.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

b855d416

12 4月, 2014 1 次提交

net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369

由 David S. Miller 提交于 4月 11, 2014

Several spots in the kernel perform a sequence like:

	skb_queue_tail(&sk->s_receive_queue, skb);
	sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up.  So this skb->len access is potentially
to freed up memory.

Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.

And finally, no actual implementation of this callback actually uses
the length argument.  And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.

So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.

Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

676d2369

11 4月, 2014 6 次提交

mm: slab/slub: use page->list consistently instead of page->lru · 34bf6ef9

由 Dave Hansen 提交于 4月 08, 2014

'struct page' has two list_head fields: 'lru' and 'list'.  Conveniently,
they are unioned together.  This means that code can use them
interchangably, which gets horribly confusing like with this nugget from
slab.c:

>	list_del(&page->lru);
>	if (page->active == cachep->num)
>		list_add(&page->list, &n->slabs_full);

This patch makes the slab and slub code use page->lru universally instead
of mixing ->list and ->lru.

So, the new rule is: page->lru is what the you use if you want to keep
your page on a list.  Don't like the fact that it's not called ->list?
Too bad.
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Acked-by: NChristoph Lameter <cl@linux.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

34bf6ef9

AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC · 312103d6

由 Chris Metcalf 提交于 3月 25, 2014

On systems with CONFIG_COMPAT we introduced the new requirement that
audit_classify_compat_syscall() exists.  This wasn't true for everything
(apparently not for "tilegx", which I know less that nothing about.)

Instead of wrapping the preprocessor optomization with CONFIG_COMPAT we
should have used the new CONFIG_AUDIT_COMPAT_GENERIC.  This patch uses
that config option to make sure only arches which intend to implement
this have the requirement.

This works fine for tilegx according to Chris Metcalf
Signed-off-by: NEric Paris <eparis@redhat.com>

312103d6

NVMe: Retry failed commands with non-fatal errors · edd10d33

由 Keith Busch 提交于 4月 03, 2014

For commands returned with failed status, queue these for resubmission
and continue retrying them until success or for a limited amount of
time. The final timeout was arbitrarily chosen so requests can't be
retried indefinitely.

Since these are requeued on the nvmeq that submitted the command, the
callbacks have to take an nvmeq instead of an nvme_dev as a parameter
so that we can use the locked queue to append the iod to retry later.

The nvme_iod conviently can be used to track how long we've been trying
to successfully complete an iod request. The nvme_iod also provides the
nvme prp dma mappings, so I had to move a few things around so we can
keep those mappings.
Signed-off-by: NKeith Busch <keith.busch@intel.com>
[fixed checkpatch issue with long line]
Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>

edd10d33

NVMe: Make I/O timeout a module parameter · b355084a

由 Keith Busch 提交于 4月 04, 2014

Increase the default timeout to 30 seconds to match SCSI.
Signed-off-by: NKeith Busch <keith.busch@intel.com>
[use byte instead of ushort]
Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>

b355084a

NVMe: CPU hot plug notification · 33b1e95c

由 Keith Busch 提交于 3月 24, 2014

Registers with hot cpu notification to rebalance, and potentially allocate
additional, io queues.
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>

33b1e95c

NVMe: per-cpu io queues · 42f61420

由 Keith Busch 提交于 3月 24, 2014

The device's IO queues are associated with CPUs, so we can use a per-cpu
variable to map the a qid to a cpu. This provides a convienient way
to optimally assign queues to multiple cpus when the device supports
fewer queues than the host has cpus. The previous implementation may
have assigned these poorly in these situations. This patch addresses
this by sharing queues among cpus that are "close" together and should
have a lower lock contention penalty.
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>

42f61420

10 4月, 2014 6 次提交

block: fix regression with block enabled tagging · 360f92c2

由 Jens Axboe 提交于 4月 09, 2014

Martin reported that his test system would not boot with
current git, it oopsed with this:

BUG: unable to handle kernel paging request at ffff88046c6c9e80
IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
Workqueue: events_unbound async_run_entry_fn
task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
RIP: 0010:[<ffffffff812971e0>]  [<ffffffff812971e0>]
blk_queue_start_tag+0x90/0x150
RSP: 0018:ffff880273d03a58  EFLAGS: 00010092
RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
FS:  0000000000000000(0000) GS:ffff880277b00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
Stack:
 ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
 ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
 ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
Call Trace:
 [<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510
 [<ffffffff81293167>] __blk_run_queue+0x37/0x50
 [<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130
 [<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0
 [<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0
 [<ffffffff813bba35>] scsi_execute+0xe5/0x180
 [<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110
 [<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod]
 [<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0
 [<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
 [<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod]
 [<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140
 [<ffffffff8106a1c5>] process_one_work+0x175/0x410
 [<ffffffff8106b373>] worker_thread+0x123/0x400
 [<ffffffff8106b250>] ? manage_workers+0x160/0x160
 [<ffffffff8107104e>] kthread+0xce/0xf0
 [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0
 [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89
58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
RIP  [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
 RSP <ffff880273d03a58>
CR2: ffff88046c6c9e80

Martin bisected and found this to be the problem patch;

	commit 6d113398
	Author: Jan Kara <jack@suse.cz>
	Date:   Mon Feb 24 16:39:54 2014 +0100

	    block: Stop abusing rq->csd.list in blk-softirq

and the problem was immediately apparent. The patch states that
it is safe to reuse queuelist at completion time, since it is
no longer used. However, that is not true if a device is using
block enabled tagging. If that is the case, then the queuelist
is reused to keep track of busy tags. If a device also ended
up using softirq completions, we'd reuse ->queuelist for the
IPI handling while block tagging was still using it. Boom.

Fix this by adding a new ipi_list list head, and share the
memory used with the request hash table. The hash table is
never used after the request is moved to the dispatch list,
which happens long before any potential completion of the
request. Add a new request bit for this, so we don't have
cases that check rq->hash while it could potentially have
been reused for the IPI completion.
Reported-by: NMartin K. Petersen <martin.petersen@oracle.com>
Tested-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

360f92c2

scsi: Make sure cmd_flags are 64-bit · 2bfad21e

由 Martin K. Petersen 提交于 4月 09, 2014

cmd_flags in struct request is now 64 bits wide but the scsi_execute
functions truncated arguments passed to int leading to errors. Make sure
the flags parameters are u64.
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Cc: Jens Axboe <axboe@fb.com>
CC: Jan Kara <jack@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

2bfad21e

tracing: Fix anonymous unions in struct ftrace_event_call · abb43f69

由 Mathieu Desnoyers 提交于 4月 09, 2014

gcc <= 4.5.x has significant limitations with respect to initialization
of anonymous unions within structures. They need to be surrounded by
brackets, _and_ they need to be initialized in the same order in which
they appear in the structure declaration.

Link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
Link: http://lkml.kernel.org/r/1397077568-3156-1-git-send-email-mathieu.desnoyers@efficios.comSigned-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

abb43f69

x86: LLVMLinux: Fix "incomplete type const struct x86cpu_device_id" · c4586256

由 Behan Webster 提交于 2月 13, 2014

Similar to the fix in 40413dcb

MODULE_DEVICE_TABLE(x86cpu, ...) expects the struct to be called struct
x86cpu_device_id, and not struct x86_cpu_id which is what is used in the rest
of the kernel code.  Although gcc seems to ignore this error, clang fails
without this define to fix the name.

Code from drivers/thermal/x86_pkg_temp_thermal.c
static const struct x86_cpu_id __initconst pkg_temp_thermal_ids[] = { ... };
MODULE_DEVICE_TABLE(x86cpu, pkg_temp_thermal_ids);

Error from clang:
drivers/thermal/x86_pkg_temp_thermal.c:577:1: error: variable has
      incomplete type 'const struct x86cpu_device_id'
MODULE_DEVICE_TABLE(x86cpu, pkg_temp_thermal_ids);
^
include/linux/module.h:145:3: note: expanded from macro
      'MODULE_DEVICE_TABLE'
  MODULE_GENERIC_TABLE(type##_device, name)
  ^
include/linux/module.h:87:32: note: expanded from macro
      'MODULE_GENERIC_TABLE'
extern const struct gtype##_id __mod_##gtype##_table            \
                               ^
<scratch space>:143:1: note: expanded from here
__mod_x86cpu_device_table
^
drivers/thermal/x86_pkg_temp_thermal.c:577:1: note: forward declaration of
      'struct x86cpu_device_id'
include/linux/module.h:145:3: note: expanded from macro
      'MODULE_DEVICE_TABLE'
  MODULE_GENERIC_TABLE(type##_device, name)
  ^
include/linux/module.h:87:21: note: expanded from macro
      'MODULE_GENERIC_TABLE'
extern const struct gtype##_id __mod_##gtype##_table            \
                    ^
<scratch space>:141:1: note: expanded from here
x86cpu_device_id
^
1 error generated.
Signed-off-by: NBehan Webster <behanw@converseincode.com>
Signed-off-by: NJan-Simon Möller <dl9pf@gmx.de>
Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

c4586256

LLVMLinux: Add support for clang to compiler.h and new compiler-clang.h · 565cbdc2

由 Mark Charlebois 提交于 11月 20, 2012

Add a compiler-clang.h file to add specific macros needed for compiling the
kernel with clang.

Initially the only override required is the macro for silencing the
compiler for a purposefully uninintialized variable.

Author: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: NMark Charlebois <charlebm@gmail.com>
Signed-off-by: NBehan Webster <behanw@converseincode.com>

565cbdc2

LLVMLinux: Remove warning about returning an uninitialized variable · aa93685a

由 Behan Webster 提交于 11月 20, 2012

Fix uninitialized return code in default case in cmpxchg-local.h

This patch fixes the code to prevent an uninitialized return value that is detected
when compiling with clang. The bug produces numerous warnings when compiling the
Linux kernel with clang.
Signed-off-by: NBehan Webster <behanw@converseincode.com>
Signed-off-by: NMark Charlebois <charlebm@gmail.com>
Acked-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>

aa93685a

09 4月, 2014 5 次提交

tracepoint: Fix sparse warnings in tracepoint.c · b725dfea

由 Mathieu Desnoyers 提交于 4月 09, 2014

Fix the following sparse warnings:

CHECK kernel/tracepoint.c
kernel/tracepoint.c:184:18: warning: incorrect type in assignment (different address spaces)
kernel/tracepoint.c:184:18: expected struct tracepoint_func *tp_funcs
kernel/tracepoint.c:184:18: got struct tracepoint_func [noderef] <asn:4>*funcs
kernel/tracepoint.c:216:18: warning: incorrect type in assignment (different address spaces)
kernel/tracepoint.c:216:18: expected struct tracepoint_func *tp_funcs
kernel/tracepoint.c:216:18: got struct tracepoint_func [noderef] <asn:4>*funcs
kernel/tracepoint.c:392:24: error: return expression in void function
CC kernel/tracepoint.o
kernel/tracepoint.c: In function tracepoint_module_going:
kernel/tracepoint.c:491:6: warning: symbol 'syscall_regfunc' was not declared. Should it be static?
kernel/tracepoint.c:508:6: warning: symbol 'syscall_unregfunc' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/1397049883-28692-1-git-send-email-mathieu.desnoyers@efficios.comSigned-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

b725dfea

block: Fix for_each_bvec() · b7aa84d9

由 Martin K. Petersen 提交于 4月 08, 2014

Commit 4550dd6c introduced for_each_bvec() which iterates over each
bvec attached to a bio or bip. However, the macro fails to check bi_size
before dereferencing which can lead to crashes while counting/mapping
integrity scatterlist segments.
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Cc: <stable@vger.kernel.org> # v3.14+
Signed-off-by: NJens Axboe <axboe@fb.com>

b7aa84d9

tracepoint: Simplify tracepoint module search · eb7d035c

由 Steven Rostedt (Red Hat) 提交于 4月 08, 2014

Instead of copying the num_tracepoints and tracepoints_ptrs from
the module structure to the tp_mod structure, which only uses it to
find the module associated to tracepoints of modules that are coming
and going, simply copy the pointer to the module struct to the tracepoint
tp_module structure.

Also removed un-needed brackets around an if statement.

Link: http://lkml.kernel.org/r/20140408201705.4dad2c4a@gandalf.local.homeAcked-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

eb7d035c

tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints · de7b2973

由 Mathieu Desnoyers 提交于 4月 08, 2014

Register/unregister tracepoint probes with struct tracepoint pointer
rather than tracepoint name.

This change, which vastly simplifies tracepoint.c, has been proposed by
Steven Rostedt. It also removes 8.8kB (mostly of text) to the vmlinux
size.

From this point on, the tracers need to pass a struct tracepoint pointer
to probe register/unregister. A probe can now only be connected to a
tracepoint that exists. Moreover, tracers are responsible for
unregistering the probe before the module containing its associated
tracepoint is unloaded.

   text    data     bss     dec     hex filename
10443444        4282528 10391552        25117524        17f4354 vmlinux.orig
10434930        4282848 10391552        25109330        17f2352 vmlinux

Link: http://lkml.kernel.org/r/1396992381-23785-2-git-send-email-mathieu.desnoyers@efficios.com

CC: Ingo Molnar <mingo@kernel.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Frank Ch. Eigler <fche@redhat.com>
CC: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
[ SDR - fixed return val in void func in tracepoint_module_going() ]
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

de7b2973

include/linux/syscalls.h: add sys_renameat2() prototype · 5fb6b953

由 Heiko Carstens 提交于 4月 08, 2014

Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5fb6b953

08 4月, 2014 7 次提交

netfilter: nf_conntrack: flush net_gre->keymap_list only from gre helper · 8142b227

由 Andrey Vagin 提交于 3月 31, 2014

nf_ct_gre_keymap_flush() removes a nf_ct_gre_keymap object from
net_gre->keymap_list and frees the object. But it doesn't clean
a reference on this object from ct_pptp_info->keymap[dir].
Then nf_ct_gre_keymap_destroy() may release the same object again.

So nf_ct_gre_keymap_flush() can be called only when we are sure that
when nf_ct_gre_keymap_destroy will not be called.

nf_ct_gre_keymap is created by nf_ct_gre_keymap_add() and the right way
to destroy it is to call nf_ct_gre_keymap_destroy().

This patch marks nf_ct_gre_keymap_flush() as static, so this patch can
break compilation of third party modules, which use
nf_ct_gre_keymap_flush. I'm not sure this is the right way to deprecate
this function.

[  226.540793] general protection fault: 0000 [#1] SMP
[  226.541750] Modules linked in: nf_nat_pptp nf_nat_proto_gre
nf_conntrack_pptp nf_conntrack_proto_gre ip_gre ip_tunnel gre
ppp_deflate bsd_comp ppp_async crc_ccitt ppp_generic slhc xt_nat
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
nf_conntrack veth tun bridge stp llc ppdev microcode joydev pcspkr
serio_raw virtio_console virtio_balloon floppy parport_pc parport
pvpanic i2c_piix4 virtio_net drm_kms_helper ttm ata_generic virtio_pci
virtio_ring virtio drm i2c_core pata_acpi [last unloaded: ip_tunnel]
[  226.541776] CPU: 0 PID: 49 Comm: kworker/u4:2 Not tainted 3.14.0-rc8+ #101
[  226.541776] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  226.541776] Workqueue: netns cleanup_net
[  226.541776] task: ffff8800371e0000 ti: ffff88003730c000 task.ti: ffff88003730c000
[  226.541776] RIP: 0010:[<ffffffff81389ba9>]  [<ffffffff81389ba9>] __list_del_entry+0x29/0xd0
[  226.541776] RSP: 0018:ffff88003730dbd0  EFLAGS: 00010a83
[  226.541776] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8800374e6c40 RCX: dead000000200200
[  226.541776] RDX: 6b6b6b6b6b6b6b6b RSI: ffff8800371e07d0 RDI: ffff8800374e6c40
[  226.541776] RBP: ffff88003730dbd0 R08: 0000000000000000 R09: 0000000000000000
[  226.541776] R10: 0000000000000001 R11: ffff88003730d92e R12: 0000000000000002
[  226.541776] R13: ffff88007a4c42d0 R14: ffff88007aef0000 R15: ffff880036cf0018
[  226.541776] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[  226.541776] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  226.541776] CR2: 00007f07f643f7d0 CR3: 0000000036fd2000 CR4: 00000000000006f0
[  226.541776] Stack:
[  226.541776]  ffff88003730dbe8 ffffffff81389c5d ffff8800374ffbe4 ffff88003730dc28
[  226.541776]  ffffffffa0162a43 ffffffffa01627c5 ffff88007a4c42d0 ffff88007aef0000
[  226.541776]  ffffffffa01651c0 ffff88007a4c45e0 ffff88007aef0000 ffff88003730dc40
[  226.541776] Call Trace:
[  226.541776]  [<ffffffff81389c5d>] list_del+0xd/0x30
[  226.541776]  [<ffffffffa0162a43>] nf_ct_gre_keymap_destroy+0x283/0x2d0 [nf_conntrack_proto_gre]
[  226.541776]  [<ffffffffa01627c5>] ? nf_ct_gre_keymap_destroy+0x5/0x2d0 [nf_conntrack_proto_gre]
[  226.541776]  [<ffffffffa0162ab7>] gre_destroy+0x27/0x70 [nf_conntrack_proto_gre]
[  226.541776]  [<ffffffffa0117de3>] destroy_conntrack+0x83/0x200 [nf_conntrack]
[  226.541776]  [<ffffffffa0117d87>] ? destroy_conntrack+0x27/0x200 [nf_conntrack]
[  226.541776]  [<ffffffffa0117d60>] ? nf_conntrack_hash_check_insert+0x2e0/0x2e0 [nf_conntrack]
[  226.541776]  [<ffffffff81630142>] nf_conntrack_destroy+0x72/0x180
[  226.541776]  [<ffffffff816300d5>] ? nf_conntrack_destroy+0x5/0x180
[  226.541776]  [<ffffffffa011ef80>] ? kill_l3proto+0x20/0x20 [nf_conntrack]
[  226.541776]  [<ffffffffa011847e>] nf_ct_iterate_cleanup+0x14e/0x170 [nf_conntrack]
[  226.541776]  [<ffffffffa011f74b>] nf_ct_l4proto_pernet_unregister+0x5b/0x90 [nf_conntrack]
[  226.541776]  [<ffffffffa0162409>] proto_gre_net_exit+0x19/0x30 [nf_conntrack_proto_gre]
[  226.541776]  [<ffffffff815edf89>] ops_exit_list.isra.1+0x39/0x60
[  226.541776]  [<ffffffff815eecc0>] cleanup_net+0x100/0x1d0
[  226.541776]  [<ffffffff810a608a>] process_one_work+0x1ea/0x4f0
[  226.541776]  [<ffffffff810a6028>] ? process_one_work+0x188/0x4f0
[  226.541776]  [<ffffffff810a64ab>] worker_thread+0x11b/0x3a0
[  226.541776]  [<ffffffff810a6390>] ? process_one_work+0x4f0/0x4f0
[  226.541776]  [<ffffffff810af42d>] kthread+0xed/0x110
[  226.541776]  [<ffffffff8173d4dc>] ? _raw_spin_unlock_irq+0x2c/0x40
[  226.541776]  [<ffffffff810af340>] ? kthread_create_on_node+0x200/0x200
[  226.541776]  [<ffffffff8174747c>] ret_from_fork+0x7c/0xb0
[  226.541776]  [<ffffffff810af340>] ? kthread_create_on_node+0x200/0x200
[  226.541776] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de
48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48
39 c8 74 7a <4c> 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89
42 08
[  226.541776] RIP  [<ffffffff81389ba9>] __list_del_entry+0x29/0xd0
[  226.541776]  RSP <ffff88003730dbd0>
[  226.612193] ---[ end trace 985ae23ddfcc357c ]---

Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: NAndrey Vagin <avagin@openvz.org>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

8142b227

mm: create generic early_ioremap() support · 9e5c33d7

由 Mark Salter 提交于 4月 07, 2014

This patch creates a generic implementation of early_ioremap() support
based on the existing x86 implementation.  early_ioremp() is useful for
early boot code which needs to temporarily map I/O or memory regions
before normal mapping functions such as ioremap() are available.

Some architectures have optional MMU.  In the no-MMU case, the remap
functions simply return the passed in physical address and the unmap
functions do nothing.
Signed-off-by: NMark Salter <msalter@redhat.com>
Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
Acked-by: NH. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9e5c33d7

lglock: map to spinlock when !CONFIG_SMP · 64b47e8f

由 Josh Triplett 提交于 4月 07, 2014

When the system has only one CPU, lglock is effectively a spinlock; map
it directly to spinlock to eliminate the indirection and duplicate code.

In addition to removing overhead, this drops 1.6k of code with a
defconfig modified to have !CONFIG_SMP, and 1.1k with a minimal config.
Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

64b47e8f

percpu: add preemption checks to __this_cpu ops · 188a8140

由 Christoph Lameter 提交于 4月 07, 2014

We define a check function in order to avoid trouble with the include
files.  Then the higher level __this_cpu macros are modified to invoke
the preemption check.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: NChristoph Lameter <cl@linux.com>
Acked-by: NIngo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Tested-by: NGrygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

188a8140

vmstat: use raw_cpu_ops to avoid false positives on preemption checks · 293b6a4c

由 Christoph Lameter 提交于 4月 07, 2014

vm counters are allowed to be racy.  Use raw_cpu_ops to avoid the
local_irq_disable overhead and to avoid preemption checks which will be
added to the __this_cpu operations.

[akpm@linux-foundation.org: Add comment.  Again.]
Signed-off-by: NChristoph Lameter <cl@linux.com>
Reported-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

293b6a4c

mm: use raw_cpu ops for determining current NUMA node · dc322a99

由 Christoph Lameter 提交于 4月 07, 2014

With the preempt checking logic for __this_cpu_ops we will get false
positives from locations in the code that use numa_node_id.

Before the __this_cpu ops where introduced there were no checks for
preemption present either.  smp_raw_processor_id() was used.  See

  http://www.spinics.net/lists/linux-numa/msg00641.html

Therefore we need to use raw_cpu_read here to avoid false postives.

Note that this issue has been discussed in prior years.  If the process
changes nodes after retrieving the current numa node then that is
acceptable since most uses of numa_node etc are for optimization and not
for correctness.

There were suggestions to implement a raw_numa_node_id in order to do
preempt checks for numa_node_id as well.  But I think we better defer
that to another patch since that would mean investigating how
numa_node_id() is used throughout the kernel which would increase the
scope of this patchset significantly.  After all preemption was never
checked before when numa_node_id() was used.

Some sample traces:

__this_cpu_read operation in preemptible [00000000] code: login/1456
caller is __this_cpu_preempt_check+0x2b/0x2d
CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3b-dirty #185
Call Trace:
  dump_stack+0x4e/0x82
  check_preemption_disabled+0xc5/0xe0
  __this_cpu_preempt_check+0x2b/0x2d
  get_task_policy+0x1d/0x49
  get_vma_policy+0x14/0x76
  alloc_pages_vma+0x35/0xff
  handle_mm_fault+0x290/0x73b
  __do_page_fault+0x3fe/0x44d
  do_page_fault+0x9/0xc
  page_fault+0x22/0x30
  generic_file_aio_read+0x38e/0x624
  do_sync_read+0x54/0x73
  vfs_read+0x9d/0x12a
  SyS_read+0x47/0x7e
  cstar_dispatch+0x7/0x23

caller is __this_cpu_preempt_check+0x2b/0x2d
CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3b-dirty #185
Call Trace:
  dump_stack+0x4e/0x82
  check_preemption_disabled+0xc5/0xe0
  __this_cpu_preempt_check+0x2b/0x2d
  alloc_pages_current+0x8f/0xbc
  __page_cache_alloc+0xb/0xd
  __do_page_cache_readahead+0xf4/0x219
  ra_submit+0x1c/0x20
  ondemand_readahead+0x28c/0x2b4
  page_cache_sync_readahead+0x38/0x3a
  generic_file_aio_read+0x261/0x624
  do_sync_read+0x54/0x73
  vfs_read+0x9d/0x12a
  SyS_read+0x47/0x7e
  cstar_dispatch+0x7/0x23
Signed-off-by: NChristoph Lameter <cl@linux.com>
Acked-by: NIngo Molnar <mingo@kernel.org>
Cc: Alex Shi <alex.shi@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dc322a99

percpu: add raw_cpu_ops · b3ca1c10

由 Christoph Lameter 提交于 4月 07, 2014

The kernel has never been audited to ensure that this_cpu operations are
consistently used throughout the kernel.  The code generated in many
places can be improved through the use of this_cpu operations (which
uses a segment register for relocation of per cpu offsets instead of
performing address calculations).

The patch set also addresses various consistency issues in general with
the per cpu macros.

A. The semantics of __this_cpu_ptr() differs from this_cpu_ptr only
   because checks are skipped. This is typically shown through a raw_
   prefix. So this patch set changes the places where __this_cpu_ptr()
   is used to raw_cpu_ptr().

B. There has been the long term wish by some that __this_cpu operations
   would check for preemption. However, there are cases where preemption
   checks need to be skipped. This patch set adds raw_cpu operations that
   do not check for preemption and then adds preemption checks to the
   __this_cpu operations.

C. The use of __get_cpu_var is always a reference to a percpu variable
   that can also be handled via a this_cpu operation. This patch set
   replaces all uses of __get_cpu_var with this_cpu operations.

D. We can then use this_cpu RMW operations in various places replacing
   sequences of instructions by a single one.

E. The use of this_cpu operations throughout will allow other arches than
   x86 to implement optimized references and RMV operations to work with
   per cpu local data.

F. The use of this_cpu operations opens up the possibility to
   further optimize code that relies on synchronization through
   per cpu data.

The patch set works in a couple of stages:

I. Patch 1 adds the additional raw_cpu operations and raw_cpu_ptr().
    Also converts the existing __this_cpu_xx_# primitive in the x86
    code to raw_cpu_xx_#.

II. Patch 2-4 use the raw_cpu operations in places that would give
     us false positives once they are enabled.

III. Patch 5 adds preemption checks to __this_cpu operations to allow
    checking if preemption is properly disabled when these functions
    are used.

IV. Patches 6-20 are patches that simply replace uses of __get_cpu_var
   with this_cpu_ptr. They do not depend on any changes to the percpu
   code. No preemption tests are skipped if they are applied.

V. Patches 21-46 are conversion patches that use this_cpu operations
   in various kernel subsystems/drivers or arch code.

VI.  Patches 47/48 (not included in this series) remove no longer used
    functions (__this_cpu_ptr and __get_cpu_var).  These should only be
    applied after all the conversion patches have made it and after we
    have done additional passes through the kernel to ensure that none of
    the uses of these functions remain.

This patch (of 46):

The patches following this one will add preemption checks to __this_cpu
ops so we need to have an alternative way to use this_cpu operations
without preemption checks.

raw_cpu_ops will be the basis for all other ops since these will be the
operations that do not implement any checks.

Primitive operations are renamed by this patch from __this_cpu_xxx to
raw_cpu_xxxx.

Also change the uses of the x86 percpu primitives in preempt.h.
These depend directly on asm/percpu.h (header #include nesting issue).
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Acked-by: NIngo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Alex Shi <alex.shi@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bryan Wu <cooloney@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: David Daney <david.daney@cavium.com>
Cc: David Miller <davem@davemloft.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Hedi Berriche <hedi@sgi.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Mike Travis <travis@sgi.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Robert Richter <rric@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wim Van Sebroeck <wim@iguana.be>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b3ca1c10

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功