Commits · 4a633a602c26497b8285a202830829d3be007c7b · openeuler / Kernel

22 Jan, 2013 6 commits

virtio-net: introduce a new control to set macaddr · 7e58d5ae

Committed by Amos Kong on Jan 21, 2013

Currently we write the MAC address to PCI config space byte by byte,
which means there is an intermediate step where the MAC is wrong.
This patch introduces a new control command that sets the MAC address
atomically.

VIRTIO_NET_F_CTRL_MAC_ADDR is a new feature bit for compatibility.
Signed-off-by: Amos Kong <akong@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7e58d5ae

net: split eth_mac_addr for better error handling · fa0879e3

Committed by Stefan Hajnoczi on Jan 21, 2013

When we set a MAC address, both the software MAC address in the system
and the hardware MAC address need to be updated. The current
eth_mac_addr() doesn't allow callers to implement error handling nicely.

This patch splits eth_mac_addr() into a prepare part and a real commit
part, so that we can prepare first, try to change the hardware address,
and do the real commit only if the hardware address was set successfully.
Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Amos Kong <akong@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fa0879e3
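
The prepare/commit split described in fa0879e3 can be illustrated in user space. This is only a minimal sketch of the pattern, not the kernel code: `struct fake_dev`, `prepare_mac_change()`, `hw_set_mac()`, `commit_mac_change()` and `set_mac()` are hypothetical names invented for the example, and the validation rule is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical device: a software copy of the MAC, a simulated hardware
 * register, and a flag that makes the simulated hardware write fail. */
struct fake_dev {
    uint8_t sw_mac[MAC_LEN];
    uint8_t hw_mac[MAC_LEN];
    bool hw_broken;
};

/* Prepare step: validate only, change nothing. */
static int prepare_mac_change(const struct fake_dev *dev, const uint8_t *mac)
{
    static const uint8_t zero[MAC_LEN];

    (void)dev;
    /* Assumed rule: reject multicast and all-zero addresses. */
    if ((mac[0] & 1) || memcmp(mac, zero, MAC_LEN) == 0)
        return -1;
    return 0;
}

/* Simulated hardware update; this is the step that may fail. */
static int hw_set_mac(struct fake_dev *dev, const uint8_t *mac)
{
    if (dev->hw_broken)
        return -1;
    memcpy(dev->hw_mac, mac, MAC_LEN);
    return 0;
}

/* Commit step: update the software copy; it cannot fail. */
static void commit_mac_change(struct fake_dev *dev, const uint8_t *mac)
{
    memcpy(dev->sw_mac, mac, MAC_LEN);
}

/* Driver-style handler: prepare, touch hardware, then commit. */
static int set_mac(struct fake_dev *dev, const uint8_t *mac)
{
    int err = prepare_mac_change(dev, mac);

    if (err)
        return err;
    err = hw_set_mac(dev, mac);
    if (err)
        return err;   /* software copy is left untouched */
    commit_mac_change(dev, mac);
    return 0;
}
```

Because the software copy is written only after the hardware step succeeds, a failed hardware update leaves the two addresses consistent.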

mcast: add multicast proxy support (IPv4 and IPv6) · 660b26dc

Committed by Nicolas Dichtel on Jan 21, 2013

This patch adds support for proxy multicast, i.e. being able to build a static
multicast tree. It adds support for (*,*) and (*,G) entries.

The user should define a (*,*) entry, which is not used for real forwarding.
This entry defines the upstream in iif and contains all interfaces from the
static tree in its oifs. It will be used to forward packets upstream when they
come from an interface belonging to the static tree.
The user should then define (*,G) entries to build the static tree. Note that
the upstream interface must be part of oifs: packets are sent to all oifs
interfaces except the input interface. This ensures that the whole static tree
is always joined, even if the packet is not coming from the upstream interface.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

660b26dc

mcast: define and use MRT[6]_MAX in ip[6]_mroute_opt() · bbb923a4

Committed by Nicolas Dichtel on Jan 21, 2013

This will ease further addition of new MRT[6]_* values and avoid having to
update in6.h each time.
Note that we reduce the maximum value from 210 to 209, but 210 does not match
any known value in ip[6]_mroute_setsockopt().
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bbb923a4

ipv6: Unshare ip6_nd_hdr() and change return type to void. · 2576f17d

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 21, 2013

- move ip6_nd_hdr() to its users' source files.
  In net/ipv6/mcast.c, it will be called ip6_mc_hdr().
- change the return type to void since this function never fails.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2576f17d

ndisc: Move ndisc_opt_addr_space() to include/net/ndisc.h. · c558e9fc

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 21, 2013

This also makes ndisc_opt_addr_data() and ndisc_fill_addr_option()
use ndisc_opt_addr_space().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

c558e9fc

21 Jan, 2013 4 commits

ipv6: Optimize ipv6_addr_is_ll_all_{nodes,routers}(). · d1641565

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 20, 2013

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

d1641565

ipv6: Optimize ipv6_addr_is_solict_mult(). · 9d100774

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 20, 2013

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9d100774

ipv6: Introduce ipv6_addr_is_solict_mult() to check Solicited Node Multicast Addresses. · ca97a644

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 20, 2013

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ca97a644

ipv6: Make ipv6_addr_is_XXX() return boolean. · b27b28cb

Committed by YOSHIFUJI Hideaki on Jan 21, 2013

ipv6_addr_is_{multicast,ll_all_nodes,ll_all_routers,isatap}()
return boolean.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

b27b28cb

19 Jan, 2013 1 commit

ipv6: Remove unused neigh argument for icmp6_dst_alloc() and its callers. · 12fd84f4

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 18, 2013

Because of the rt->n removal, we do not need the neigh argument any more.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

12fd84f4

18 Jan, 2013 6 commits

ipv6: Complete neighbour entry removal from dst_entry. · 887c95cc

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 17, 2013

CC: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

887c95cc

ipv6: Introduce rt6_nexthop() to select nexthop address. · 9bb5a148

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 17, 2013

For an RTF_GATEWAY route, return rt->rt6i_gateway.
Otherwise, return the second argument (the destination address).

This will be used by the following patches, which remove the rt->n
dependency in ip6_dst_lookup_tail() and ip6_finish_output2().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9bb5a148
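
The selection rule in 9bb5a148 is small enough to sketch in user space. The following is an illustration under assumptions, not the kernel definition: `struct sketch_rt6`, `SKETCH_RTF_GATEWAY` and `sketch_rt6_nexthop()` are hypothetical stand-ins for `struct rt6_info`, `RTF_GATEWAY` and `rt6_nexthop()`.

```c
#include <netinet/in.h>
#include <stdint.h>

/* Stand-in for the RTF_GATEWAY route flag (assumed value for the sketch). */
#define SKETCH_RTF_GATEWAY 0x0002u

/* Hypothetical, cut-down route entry: only the fields the rule needs. */
struct sketch_rt6 {
    uint32_t flags;
    struct in6_addr gateway;
};

/* Gateway routes resolve the gateway address; on-link routes resolve
 * the destination address itself. */
static const struct in6_addr *
sketch_rt6_nexthop(const struct sketch_rt6 *rt, const struct in6_addr *daddr)
{
    if (rt->flags & SKETCH_RTF_GATEWAY)
        return &rt->gateway;
    return daddr;
}
```

The returned address is the one a neighbour lookup would then be performed on, which is why the helper can replace a cached neighbour pointer in the route.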

ndisc: Introduce __ipv6_neigh_lookup_noref(). · ac3175fe

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 17, 2013

This function, which looks up the neighbour entry for an IPv6 address
without touching refcnt, will be used by patches that remove the
dependency on rt->n (neighbour entry in rt6_info).
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac3175fe

ndisc: Remove tbl argument for __ipv6_neigh_lookup(). · 8e022ee6

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 17, 2013

We can refer to nd_tbl directly.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8e022ee6

ipv6: fix ipv6_prefix_equal64_half mask conversion · 512613d7

Committed by Fabio Baltieri on Jan 16, 2013

Fix the 64bit optimized version of ipv6_prefix_equal to convert the
bitmask to network byte order only after the bit-shift.

The bug was introduced in:

38675170 ipv6: 64bit version of ipv6_prefix_equal().
Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

512613d7
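
The ordering that 512613d7 describes (shift first, then convert to network byte order) can be sketched portably. This is a user-space illustration, not the kernel function: `to_be64()` is a hypothetical stand-in for `cpu_to_be64()`, and `prefix_equal64_half()` takes raw bytes instead of `__be64` pointers.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for cpu_to_be64(): lay the value out MSB first. */
static uint64_t to_be64(uint64_t v)
{
    uint8_t b[8];
    uint64_t out;

    for (int i = 0; i < 8; i++)
        b[i] = (uint8_t)(v >> (56 - 8 * i));
    memcpy(&out, b, sizeof(out));
    return out;
}

/* Compare the first len bits (0..64) of two 8-byte halves that are
 * stored in network byte order. */
static bool prefix_equal64_half(const uint8_t a1[8], const uint8_t a2[8],
                                unsigned int len)
{
    uint64_t x, y;

    memcpy(&x, a1, sizeof(x));
    memcpy(&y, a2, sizeof(y));
    /* Build the mask in host order (shift), then convert it, so the
     * set bits line up with the leading bytes in memory. */
    if (len && ((x ^ y) & to_be64(~UINT64_C(0) << (64 - len))))
        return false;
    return true;
}
```

Converting before shifting would, on a little-endian host, move the mask bits toward the trailing bytes in memory, so the comparison would cover the wrong part of the address.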

net: increase fragment memory usage limits · c2a93660

Committed by Jesper Dangaard Brouer on Jan 15, 2013

Increase the memory usage limits for incomplete
IP fragments.

Arguing for new thresh high/low values:

 High threshold = 4 MBytes
 Low  threshold = 3 MBytes

The fragmentation memory accounting code tries to account for the
real memory usage by measuring both the size of the frag queue struct
(inet_frag_queue (ipv4:ipq/ipv6:frag_queue)) and the SKB's truesize.

We want to be able to hold on to enough fragments to ensure
good performance, without letting incomplete fragments hurt
scalability by causing the number of inet_frag_queue to grow too much
(resulting in longer searches for frag queues).

For IPv4, how much memory does the largest frag consume?

Maximum size fragment is 64K, which is approx 44 fragments with
MTU(1500) sized packets. Sizeof(struct ipq) is 200.  A 1500 byte
packet results in a truesize of 2944 (not 2048 as I first assumed)

  (44*2944)+200 = 129736 bytes

The current default high thresh of 262144 bytes is obviously
problematic, as only two 64K fragments can fit in the queue at the
same time.

How many 64K fragments can we fit into 4 MBytes?

  4*2^20/((44*2944)+200) = 32.34 fragments in queues

An attacker could send a separate/distinct fake fragment packet per
queue, causing us to allocate one inet_frag_queue per packet, and thus
attacking the hash table and its lists.

How many frag queues do we need to store and, given a current hash size
of 64, what is the average list length?

Using one MTU sized fragment per inet_frag_queue, each consuming
(2944+200) 3144 bytes.

  4*2^20/(2944+200) = 1334 frag queues -> 21 avg list length

An attacker could send small fragments; the smallest packet I could send
resulted in a truesize of 896 bytes (I'm a little surprised by this).

  4*2^20/(896+200)  = 3827 frag queues -> 59 avg list length

When increasing these numbers, we also need to follow up with
improvements that are going to help scalability.  Simply increasing
the hash size is not enough, as the current implementation does not
have per hash bucket locking.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c2a93660
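
The arithmetic in c2a93660 can be reproduced with a few lines of C. This is only a sketch using the figures quoted in the message (200-byte `struct ipq`, truesize of 2944 or 896 bytes, 44 fragments per 64K datagram, 64 hash buckets); the function names are hypothetical. Integer division truncates, so some results differ by one from the rounded figures in the message.

```c
#include <stddef.h>

enum {
    IPQ_SIZE     = 200, /* sizeof(struct ipq) as quoted in the message */
    HASH_BUCKETS = 64   /* hash size quoted in the message */
};

/* Memory charged to one frag queue holding `frags` fragments. */
static size_t queue_bytes(size_t frags, size_t truesize)
{
    return frags * truesize + IPQ_SIZE;
}

/* Number of such queues that fit below the high threshold. */
static size_t queues_that_fit(size_t thresh, size_t frags, size_t truesize)
{
    return thresh / queue_bytes(frags, truesize);
}

/* Average hash chain length for a given number of queues (truncated). */
static size_t avg_list_len(size_t queues)
{
    return queues / HASH_BUCKETS;
}
```

For example, `queues_that_fit(262144, 44, 2944)` evaluates to 2 with the old threshold, while `queues_that_fit(4u << 20, 44, 2944)` evaluates to 32 with the new 4 MByte threshold.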

17 Jan, 2013 2 commits

sk-filter: Add ability to lock a socket filter program · d59577b6

Committed by Vincent Bernat on Jan 16, 2013

While a privileged program can open a raw socket, attach some
restrictive filter and drop its privileges (or send the socket to an
unprivileged program through some Unix socket), the filter can still
be removed or modified by the unprivileged program. This commit adds a
socket option to lock the filter (SO_LOCK_FILTER) preventing any
modification of a socket filter program.

This is similar to the OpenBSD BIOCLOCK ioctl on bpf sockets, except even
root is not allowed to change/drop the filter.

The state of the lock can be read with getsockopt(). No error is
triggered if the state is not changed. -EPERM is returned when a user
tries to remove the lock or to change/remove the filter while the lock
is active. The check is done directly in sk_attach_filter() and
sk_detach_filter() and does not affect only the setsockopt() syscall.
Signed-off-by: Vincent Bernat <bernat@luffy.cx>
Signed-off-by: David S. Miller <davem@davemloft.net>

d59577b6

ipv6: Fix endianess warning in ip6_flow_hdr(). · 07f623d3

Committed by YOSHIFUJI Hideaki on Jan 17, 2013

Commit 3e4e4c1f ("ipv6: Introduce ip6_flow_hdr() to fill version,
tclass and flowlabel.") uses ntohl(), which should be htonl().

Found by Fengguang Wu <fengguang.wu@intel.com>.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

07f623d3

15 Jan, 2013 10 commits

tun: fix LSM/SELinux labeling of tun/tap devices · 5dbbaf2d

Committed by Paul Moore on Jan 14, 2013

This patch corrects some problems with LSM/SELinux that were introduced
with the multiqueue patchset. The problem stems from the fact that the
multiqueue work changed the relationship between the tun device and its
associated socket; before the socket persisted for the life of the
device, however after the multiqueue changes the socket only persisted
for the life of the userspace connection (fd open). For non-persistent
devices this is not an issue, but for persistent devices this can cause
the tun device to lose its SELinux label.

We correct this problem by adding an opaque LSM security blob to the
tun device struct which allows us to have the LSM security state, e.g.
SELinux labeling information, persist for the lifetime of the tun
device. In the process we tweak the LSM hooks to work with this new
approach to TUN device/socket labeling and introduce a new LSM hook,
security_tun_dev_attach_queue(), to approve requests to attach to a
TUN queue via TUNSETQUEUE.

The SELinux code has been adjusted to match the new LSM hooks, the
other LSMs do not make use of the LSM TUN controls. This patch makes
use of the recently added "tun_socket:attach_queue" permission to
restrict access to the TUNSETQUEUE operation. On older SELinux
policies which do not define the "tun_socket:attach_queue" permission
the access control decision for TUNSETQUEUE will be handled according
to the SELinux policy's unknown permission setting.
Signed-off-by: Paul Moore <pmoore@redhat.com>
Acked-by: Eric Paris <eparis@parisplace.org>
Tested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5dbbaf2d

net: phy: remove flags argument from phy_{attach, connect, connect_direct} · f9a8f83b

Committed by Florian Fainelli on Jan 14, 2013

The flags argument of the phy_{attach,connect,connect_direct} functions
is used to assign its value to a struct phy_device's dev_flags.
All callers but the tg3 driver pass the flag 0, which results in the
underlying PHY drivers in drivers/net/phy/ not being able to actually
use any of the flags they would set in dev_flags. This patch gets rid of
the flags argument, and passes phydev->dev_flags to the internal PHY
library call phy_attach_direct() such that drivers which actually modify
a phy device's dev_flags get the value preserved for use by the underlying
phy driver.
Acked-by: Kosta Zertsekel <konszert@marvell.com>
Signed-off-by: Florian Fainelli <florian@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f9a8f83b

pkt_sched: namespace aware act_mirred · c1b52739

Committed by Benjamin LaHaise on Jan 14, 2013

Eric Dumazet pointed out that act_mirred needs to find the current net_ns,
and the struct net pointer is not provided in the call chain. His original
patch made use of current->nsproxy->net_ns to find the network namespace,
but this fails to work correctly for userspace code that makes use of
netlink sockets in different network namespaces. Instead, pass the
"struct net *" down along the call chain to where it is needed.

This version removes the ifb changes as Eric has submitted that patch
separately, but is otherwise identical to the previous version.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c1b52739

ipv6 netevent: Remove old_neigh from netevent_redirect. · 60592833

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 14, 2013

The only user is the cxgb3 driver.

old_neigh is used to check for a device change, but that must not happen
on redirect.  In this sense, we can remove the old_neigh argument.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

60592833

ipv6: 64bit version of ipv6_prefix_equal(). · 38675170

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 14, 2013

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

38675170

ipv6: Remove __ipv6_prefix_equal(). · 2ef97332

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 14, 2013

ipv6_prefix_equal() just casts its arguments and it is the only
user of __ipv6_prefix_equal().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2ef97332

ipv6: 64bit version of ipv6_addr_set(). · 5206c579

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 14, 2013

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5206c579

ipv6: 64bit version of ipv6_addr_v4mapped(). · a04d40b8

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 14, 2013

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a04d40b8

ipv6: 64bit version of ipv6_addr_loopback(). · e287656b

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 14, 2013

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

e287656b

ipv6: 64bit version of ipv6_addr_diff(). · 9f2e7334

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 14, 2013

Introduce __ipv6_addr_diff64() to find the first different
bit between two addresses on 64bit architectures.

The 32bit version is still available as __ipv6_addr_diff32(),
and __ipv6_addr_diff() automatically selects the appropriate
version.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9f2e7334
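
Finding the first differing bit 64 bits at a time, as 9f2e7334 describes, can be sketched in user space as follows. This is an illustration rather than the kernel implementation: `load_be64()` and `addr_diff64()` are hypothetical names, and the GCC/Clang builtin `__builtin_clzll()` is assumed to be available.

```c
#include <stdint.h>

/* Read 8 bytes stored in network byte order into a host integer. */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Index (counted from the most significant bit) of the first bit that
 * differs between two 16-byte addresses; 128 if they are identical. */
static int addr_diff64(const uint8_t a1[16], const uint8_t a2[16])
{
    for (int i = 0; i < 2; i++) {
        uint64_t x = load_be64(a1 + 8 * i) ^ load_be64(a2 + 8 * i);

        if (x)
            return i * 64 + __builtin_clzll(x);
    }
    return 128;
}
```

Two XORs and at most one count-leading-zeros replace the four 32-bit steps a word-by-word version would need.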

14 Jan, 2013 6 commits

ipv6: Move comment to right place. · 25d46f43

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 13, 2013

IN6ADDR_* and in6addr_* are not exported to userspace, and are defined
in include/linux/in6.h.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

25d46f43

ipv6: Store Router Alert option in IP6CB directly. · dd3332bf

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 13, 2013

The Router Alert option is very small and we can store the value
itself in the skb.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

dd3332bf

ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input(). · daad1512

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 13, 2013

Move the generalized version of ipv6_is_mld() to the header,
and use it from ip6_mc_input().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

daad1512

ipv6: Use ipv6_get_dsfield() instead of ipv6_tclass(). · e7219858

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 13, 2013

Commit 7a3198a8 ("ipv6: helper function to get tclass") introduced
ipv6_tclass(), but a similar function is already available as
ipv6_get_dsfield().

We might be able to call ipv6_tclass() from ipv6_get_dsfield(),
but it is confusing to have two versions.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

e7219858

ipv6: Introduce ip6_flowinfo() to extract flowinfo (tclass + flowlabel). · 6502ca52

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 13, 2013

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

6502ca52

ipv6: Introduce ip6_flow_hdr() to fill version, tclass and flowlabel. · 3e4e4c1f

Committed by YOSHIFUJI Hideaki / 吉藤英明 on Jan 13, 2013

This is not only for readability but also for optimization.
What we do here is build the 32bit word at the beginning of the ipv6
header (the "ip6_flow" virtual member of struct ip6_hdr in RFC3542), and
we do not need to read the tclass portion of the target buffer.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

3e4e4c1f
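
Building the first 32-bit word of the IPv6 header in one store, as 3e4e4c1f describes, might look like the sketch below. It is a user-space illustration, not the kernel helper: `sketch_ip6_flow_hdr()` is a hypothetical name, the header is represented by its first four bytes, and the flow label is assumed to be passed already in network byte order.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Fill version (6), traffic class and flow label in a single write.
 * The version/tclass part is assembled in host order and converted
 * with htonl(); flowlabel_be is already in network byte order. */
static void sketch_ip6_flow_hdr(uint8_t hdr[4], unsigned int tclass,
                                uint32_t flowlabel_be)
{
    uint32_t word = htonl(0x60000000u | ((uint32_t)tclass << 20)) | flowlabel_be;

    memcpy(hdr, &word, sizeof(word));
}
```

Since the whole word is written at once, nothing has to be read back from the target buffer. The conversion must be host-to-network (`htonl()`), because the constant is assembled in host order and stored for the wire.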

12 Jan, 2013 5 commits

netfilter: nf_conntrack: fix BUG_ON while removing nf_conntrack with netns · 1e47ee83

Committed by Pablo Neira Ayuso on Jan 10, 2013

canqun zhang reported that we're hitting BUG_ON in the
nf_conntrack_destroy path when calling kfree_skb while
rmmod'ing the nf_conntrack module.

Currently, the nf_ct_destroy hook is being set to NULL in the
destroy path of conntrack.init_net. However, this is a problem
since init_net may be destroyed before any other existing netns
(we cannot assume any specific ordering while releasing existing
netns according to what I read in recent emails).

Thanks to Gao feng for the initial patch to address this issue.
Reported-by: canqun zhang <canqunzhang@gmail.com>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

1e47ee83

net, wireless: overwrite default_ethtool_ops · d07d7507

Committed by Stanislaw Gruszka on Jan 10, 2013

Since:

commit 2c60db03
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Sep 16 09:17:26 2012 +0000

    net: provide a default dev->ethtool_ops

wireless core does not correctly assign ethtool_ops.

After the alloc_netdev*() call, some cfg80211 drivers provide their own
ethtool_ops, but some do not. For them, the wireless core provides the generic
cfg80211_ethtool_ops, which is assigned in the NETDEV_REGISTER notify call:

        if (!dev->ethtool_ops)
                dev->ethtool_ops = &cfg80211_ethtool_ops;

But after Eric's commit, dev->ethtool_ops is no longer NULL (on cfg80211
drivers without custom ethtool_ops), but points to &default_ethtool_ops.

In order to fix the problem, provide a function which will overwrite
default_ethtool_ops and use it in the wireless core.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d07d7507
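
The problem and fix in d07d7507 can be modelled with a few stand-in types. This is a sketch only: `struct sketch_ops`, `struct sketch_dev` and the three functions are hypothetical and merely mirror the shape of the check quoted in the commit message.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the ops table and the net device. */
struct sketch_ops { const char *name; };
struct sketch_dev { const struct sketch_ops *ops; };

static const struct sketch_ops default_ops  = { "default" };
static const struct sketch_ops wireless_ops = { "cfg80211" };

/* After the earlier change, allocation installs a non-NULL default. */
static void sketch_alloc(struct sketch_dev *dev)
{
    dev->ops = &default_ops;
}

/* Old wireless-core check: never fires once a default is installed. */
static void assign_if_null(struct sketch_dev *dev, const struct sketch_ops *ops)
{
    if (!dev->ops)
        dev->ops = ops;
}

/* Fixed approach: replace the ops only if they are still the default,
 * so driver-provided ops are left alone. */
static void set_default_ops(struct sketch_dev *dev, const struct sketch_ops *ops)
{
    if (dev->ops == &default_ops)
        dev->ops = ops;
}
```

Comparing against the default table rather than NULL keeps the original intent: drivers with custom ops are untouched, and everyone else gets the wireless ones.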

lib/rbtree.c: avoid the use of non-static __always_inline · 3cb7a563

Committed by Michel Lespinasse on Jan 11, 2013

lib/rbtree.c declared __rb_erase_color() as __always_inline void, and
then exported it with EXPORT_SYMBOL.

This was because __rb_erase_color() must be exported for augmented
rbtree users, but it must also be inlined into rb_erase() so that the
dummy callback can get optimized out of that call site.

(Actually with a modern compiler, none of the dummy callback functions
should even be generated as separate text functions).

The above usage is legal C, but it was unusual enough for some compilers
to warn about it.  This change makes things more explicit, with a static
__always_inline ____rb_erase_color function for use in rb_erase(), and a
separate non-inline __rb_erase_color function for use in
rb_erase_augmented call sites.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3cb7a563

mm: compaction: partially revert capture of suitable high-order page · 8fb74b9f

Committed by Mel Gorman on Jan 11, 2013

Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
waiting for POLLIN on a local TCP socket. It was easier to trigger if
there was disk IO and dirty pages at the same time and he bisected it to
commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page
immediately when it is made available").

The intention of that patch was to improve high-order allocations under
memory pressure after changes made to reclaim in 3.6 drastically hurt
THP allocations but the approach was flawed. For Eric, the problem was
that page->pfmemalloc was not being cleared for captured pages leading
to a poor interaction with swap-over-NFS support causing the packets to
be dropped. However, I identified a few more problems with the patch
including the fact that it can increase contention on zone->lock in some
cases which could result in async direct compaction being aborted early.

In retrospect the capture patch took the wrong approach. What it should
have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
was allocating for THP and avoided races that way. While the patch was
shown to improve allocation success rates at the time, the benefit is
marginal given the relative complexity and it should be revisited from
scratch in the context of the other reclaim-related changes that have
taken place since the patch was first written and tested. This patch
partially reverts commit 1fb3f8ca ("mm: compaction: capture a
suitable high-order page immediately when it is made available").
Reported-and-tested-by: Eric Wong <normalperson@yhbt.net>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8fb74b9f

linux/audit.h: move ptrace.h include to kernel header · c0a3a20b

Committed by Mike Frysinger on Jan 11, 2013

While the kernel internals want pt_regs (and so it includes
linux/ptrace.h), the user version of audit.h does not need it.  So move
the include out of the uapi version.

This avoids issues where people want the audit defines and userland
ptrace api.  Including both the kernel ptrace and the userland ptrace
headers can easily lead to failure.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c0a3a20b

openeuler / Kernel synced successfully nearly 2 years ago