Commits · 24025c465f77c3585f73450bab19501b2edd6fba · openanolis / cloud-kernel

05 Apr, 2016 5 commits

ipv4: process socket-level control messages in IPv4 · 24025c46

Committed by Soheil Hassas Yeganeh on Apr 02, 2016

Process socket-level control messages by invoking
__sock_cmsg_send in ip_cmsg_send for control messages on
the SOL_SOCKET layer.

This makes sure whenever ip_cmsg_send is called in udp, icmp,
and raw, we also process socket-level control messages.

Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv4 control messages.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock: accept SO_TIMESTAMPING flags in socket cmsg · 3dd17e63

Committed by Soheil Hassas Yeganeh on Apr 02, 2016

Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level
as a basis to accept timestamping requests per write.

This implementation only accepts TX recording flags (i.e.,
SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE,
SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in
control messages. Users need to set reporting flags (e.g.,
SOF_TIMESTAMPING_OPT_ID) per socket via socket options.

This commit adds a tsflags field in sockcm_cookie which is
set in __sock_cmsg_send. It only overrides the SOF_TIMESTAMPING_TX_*
bits in sockcm_cookie.tsflags, allowing the control message
to override the recording behavior per write while preserving
the value of other flags.

This patch implements validating the control message and setting
tsflags in struct sockcm_cookie. Next commits in this series will
actually implement timestamping per write for different protocols.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
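
The override rule described in this commit can be modeled outside the kernel. The sketch below is not the kernel code: `merge_cmsg_tsflags` and `TX_RECORD_MASK` are hypothetical names chosen for illustration, and the flag values are assumed to match include/uapi/linux/net_tstamp.h.

```python
import errno

# Flag values assumed to match include/uapi/linux/net_tstamp.h.
SOF_TIMESTAMPING_TX_HARDWARE = 1 << 0
SOF_TIMESTAMPING_TX_SOFTWARE = 1 << 1
SOF_TIMESTAMPING_OPT_ID = 1 << 7
SOF_TIMESTAMPING_TX_SCHED = 1 << 8
SOF_TIMESTAMPING_TX_ACK = 1 << 9

# The four TX recording flags named in the commit message.
TX_RECORD_MASK = (SOF_TIMESTAMPING_TX_HARDWARE | SOF_TIMESTAMPING_TX_SOFTWARE
                  | SOF_TIMESTAMPING_TX_SCHED | SOF_TIMESTAMPING_TX_ACK)


def merge_cmsg_tsflags(sk_tsflags, cmsg_tsflags):
    """Hypothetical helper: combine socket-level flags with per-write flags."""
    if cmsg_tsflags & ~TX_RECORD_MASK:
        # Reporting flags such as OPT_ID are not accepted in control messages.
        raise OSError(errno.EINVAL, "only TX recording flags are accepted")
    # Replace the TX recording bits and keep everything else from the socket.
    return (sk_tsflags & ~TX_RECORD_MASK) | cmsg_tsflags
```

With a socket configured for OPT_ID and software TX timestamps, a write that carries TX_ACK in its control message ends up with OPT_ID and TX_ACK only.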

tcp: use one bit in TCP_SKB_CB to mark ACK timestamps · 6b084928

Committed by Soheil Hassas Yeganeh on Apr 02, 2016

Currently, to avoid a cache line miss for accessing skb_shinfo,
tcp_ack_tstamp skips sockets that do not have the
SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
implemented based on an implicit assumption that
SOF_TIMESTAMPING_TX_ACK is set via socket options for the
duration that ACK timestamps are needed.

To implement per-write timestamps, this check should be
removed and replaced with a per-packet alternative that
quickly skips packets missing ACK timestamp marks without
a cache-line miss.

To enable per-packet marking without a cache line miss, use
one bit in TCP_SKB_CB to mark whether an SKB might need an
ACK TX timestamp or not. Further checks in tcp_ack_tstamp are not
modified and work as before.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO · 6db8b963

Committed by Soheil Hassas Yeganeh on Apr 02, 2016

SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs
to associate timestamps with send calls. For TCP connections,
tp->snd_una is used as the starting point to calculate
relative IDs.

This socket option will fail if set before the handshake on a
passive TCP fast open connection with data in SYN or SYN/ACK,
since setsockopt requires the connection to be in the
ESTABLISHED state.

To address this, instead of limiting the option to the
ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as
long as the connection is not in LISTEN or CLOSE states.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
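
The change in the state check can be illustrated with a small model. The two predicate names below are hypothetical, and the state numbers are assumed to match the kernel's TCP state enumeration; a passive fast open socket that received data in the SYN is assumed to be in SYN_RECV when the option is set.

```python
from enum import IntEnum


class TcpState(IntEnum):
    # Values assumed to match the kernel's TCP_* state constants.
    ESTABLISHED = 1
    SYN_SENT = 2
    SYN_RECV = 3
    CLOSE = 7
    LISTEN = 10


def opt_id_allowed_before(state):
    """Old rule: only an ESTABLISHED connection may set OPT_ID."""
    return state == TcpState.ESTABLISHED


def opt_id_allowed_after(state):
    """New rule: anything except LISTEN and CLOSE may set OPT_ID."""
    return state not in (TcpState.LISTEN, TcpState.CLOSE)
```

Under the old rule the SYN_RECV case is rejected; under the new rule it is accepted, while LISTEN and CLOSE are still refused.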

sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop · 39771b12

Committed by Willem de Bruijn on Apr 02, 2016

To process cmsgs of the SOL_SOCKET level in addition to
cmsgs of another level, protocols can call sock_cmsg_send().
This causes a double walk of the cmsghdr list, one for SOL_SOCKET
and one for the other level.

Extract the inner demultiplex logic from the loop that walks the list,
to allow having this called directly from a walker in the protocol
specific code.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
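
The single-walk structure that this refactoring enables might look roughly like the model below. It is only a sketch: control messages are represented as (level, type, value) tuples, `sock_cmsg_one` and `ip_cmsg_walk` are hypothetical stand-ins for the inner demultiplexer and a protocol-specific walker, and the constants are assumed to be the usual Linux values.

```python
# Constants assumed to match the common Linux definitions.
SOL_IP = 0
SOL_SOCKET = 1
IP_TTL = 2
SO_MARK = 36


def sock_cmsg_one(cookie, cmsg_type, value):
    """Inner demultiplexer: handle exactly one SOL_SOCKET control message."""
    if cmsg_type == SO_MARK:
        cookie["mark"] = value
    else:
        raise ValueError("unsupported SOL_SOCKET control message")


def ip_cmsg_walk(cmsgs):
    """Protocol walker: one pass over the list, delegating SOL_SOCKET items."""
    cookie, ipc = {}, {}
    for level, cmsg_type, value in cmsgs:
        if level == SOL_SOCKET:
            sock_cmsg_one(cookie, cmsg_type, value)
        elif level == SOL_IP:
            if cmsg_type == IP_TTL:
                ipc["ttl"] = value
            else:
                raise ValueError("unsupported SOL_IP control message")
        # Messages for other levels are skipped by this walker.
    return cookie, ipc
```

The point of the split is that the walker owns the loop, so socket-level and protocol-level messages are handled in the same pass instead of two.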

03 Apr, 2016 2 commits

netlink: use nla_get_in_addr and nla_put_in_addr for ipv4 address · 7822ce73

Committed by Haishuang Yan on Mar 31, 2016

nla_get_in_addr and nla_put_in_addr have been implemented,
so use them where appropriate.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: remove cwnd moderation after recovery · 23492623

Committed by Yuchung Cheng on Mar 30, 2016

For non-SACK connections, cwnd is lowered to inflight plus 3 packets
when the recovery ends. This is an optional feature in the NewReno
RFC 2582 to reduce the potential burst when cwnd is "re-opened"
after recovery and inflight is low.

This feature is questionably effective because of PRR: when
the recovery ends (i.e., snd_una == high_seq) NewReno holds the
CA_Recovery state for another round trip to prevent false fast
retransmits. But if the inflight is low, PRR will overwrite the
moderated cwnd in tcp_cwnd_reduction() later regardless. So if a
receiver responds with bogus ACKs (i.e., acking future data) to speed up
transfer after recovery, it can only induce a burst of up to a window's
worth of data packets by acking up to SND.NXT. A restart from (short)
idle or receiving stretched ACKs can both cause such bursts as well.

On the other hand, if the recovery ends because the sender
detects the losses were spurious (e.g., reordering), this feature
unconditionally lowers a reverted cwnd even though nothing
was lost.

In principle, the loss recovery module should not update cwnd. Furthermore,
pacing is much more effective at reducing bursts. Hence this patch
removes the cwnd moderation feature.

v2 changes: revised commit message on bogus ACKs and burst, and
            missing signature
Signed-off-by: Matt Mathis <mattmathis@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
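
The moderation step that this commit removes amounts to a single clamp. The sketch below takes "inflight plus 3 packets" from the commit message at face value; `moderate_cwnd` and `MAX_BURST` are hypothetical names, and units are packets.

```python
MAX_BURST = 3  # "inflight plus 3 packets", per the commit message


def moderate_cwnd(cwnd, inflight):
    """Clamp cwnd so that at most MAX_BURST packets can be sent at once."""
    return min(cwnd, inflight + MAX_BURST)
```

In the spurious-loss case described above, a cwnd reverted to 40 with only 5 packets in flight would be cut to 8 even though nothing was lost, which is the behavior the commit argues against.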

02 Apr, 2016 1 commit

tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter · 5a5abb1f

Committed by Daniel Borkmann on Mar 31, 2016

Sasha Levin reported a suspicious rcu_dereference_protected() warning
found while fuzzing with trinity that is similar to this one:

  [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
  [   52.765688] other info that might help us debug this:
  [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
  [   52.765701] 1 lock held by a.out/1525:
  [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
  [   52.765721] stack backtrace:
  [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
  [...]
  [   52.765768] Call Trace:
  [   52.765775]  [<ffffffff813e488d>] dump_stack+0x85/0xc8
  [   52.765784]  [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
  [   52.765792]  [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
  [   52.765801]  [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
  [   52.765810]  [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
  [   52.765818]  [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
  [   52.765827]  [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
  [   52.765834]  [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
  [   52.765843]  [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
  [   52.765850]  [<ffffffff81261519>] SyS_ioctl+0x79/0x90
  [   52.765858]  [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
  [   52.765866]  [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25

Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.

Since the fix in f91ff5b9 ("net: sk_{detach|attach}_filter() rcu
fixes"), sk_attach_filter()/sk_detach_filter() now dereference the
filter with rcu_dereference_protected(), checking whether the socket lock
is held in the control path.

Since its introduction in 99405162 ("tun: socket filter support"),
tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
sock_owned_by_user(sk) doesn't apply in this specific case and therefore
triggers the false positive.

Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
that is used by tap filters and pass in lockdep_rtnl_is_held() for the
rcu_dereference_protected() checks instead.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

01 Apr, 2016 1 commit

rtnl: fix msg size calculation in if_nlmsg_size() · c57c7a95

Committed by Nicolas Dichtel on Mar 31, 2016

Size of the attribute IFLA_PHYS_PORT_NAME was missing.

Fixes: db24a904 ("net: add support for phys_port_name")
CC: David Ahern <dsahern@gmail.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
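
Why a missing term matters can be seen from how netlink attribute sizes add up. The sketch below mirrors the usual 4-byte attribute header and 4-byte alignment; it assumes, for illustration only, that the port name attribute is budgeted at IFNAMSIZ (16) bytes, and `message_size` is a hypothetical helper rather than the kernel's if_nlmsg_size().

```python
NLA_ALIGNTO = 4   # netlink attributes are padded to 4-byte boundaries
NLA_HDRLEN = 4    # length and type fields, 2 bytes each
IFNAMSIZ = 16     # assumed payload budget for the port name attribute


def nla_align(length):
    """Round length up to the next multiple of NLA_ALIGNTO."""
    return (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def nla_total_size(payload):
    """Space one attribute needs: header plus payload, padded."""
    return nla_align(NLA_HDRLEN + payload)


def message_size(attr_payloads):
    """Hypothetical size estimate: the sum over all expected attributes."""
    return sum(nla_total_size(p) for p in attr_payloads)
```

Leaving one attribute out of the sum makes the estimate too small by that attribute's padded size, so a buffer allocated from the estimate may not fit the message that is actually built.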

31 Mar, 2016 5 commits

bpf: make padding in bpf_tunnel_key explicit · c0e760c9

Committed by Daniel Borkmann on Mar 30, 2016

Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl
and tunnel_label members explicit. No issue has been observed, and
gcc/llvm does padding for the old struct already, where tunnel_label
was not yet present, so the current code works, but since it's part
of uapi, make sure we don't introduce holes in structs.

Therefore, add tunnel_ext that we can use generically in future
(e.g. to flag OAM messages for backends, etc). Also add the offset
to the compat tests to be sure should some compilers not pad the
tail of the old version of bpf_tunnel_key.

Fixes: 4018ab18 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
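
The implicit hole can be made visible with ctypes, which applies the platform's native alignment rules. The layouts below are a simplified assumption: the remote address union is modeled as its largest member (four 32-bit words) and only the fields named in the commit message are included, so this is not the real uapi definition.

```python
import ctypes


class TunnelKeyImplicit(ctypes.Structure):
    # Layout with the compiler-inserted hole before tunnel_label.
    _fields_ = [
        ("tunnel_id", ctypes.c_uint32),
        ("remote_ipv6", ctypes.c_uint32 * 4),  # stands in for the address union
        ("tunnel_tos", ctypes.c_uint8),
        ("tunnel_ttl", ctypes.c_uint8),
        ("tunnel_label", ctypes.c_uint32),
    ]


class TunnelKeyExplicit(ctypes.Structure):
    # Same layout, with the 2-byte hole named as tunnel_ext.
    _fields_ = [
        ("tunnel_id", ctypes.c_uint32),
        ("remote_ipv6", ctypes.c_uint32 * 4),
        ("tunnel_tos", ctypes.c_uint8),
        ("tunnel_ttl", ctypes.c_uint8),
        ("tunnel_ext", ctypes.c_uint16),
        ("tunnel_label", ctypes.c_uint32),
    ]
```

On common ABIs both variants place tunnel_label at the same offset and have the same total size, which is why naming the padding does not change the binary layout.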

ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates · 2d421226

Committed by Eric Dumazet on Mar 29, 2016

IPv6 counter updates use a different macro than IPv4.

Fixes: 36cbb245 ("udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gro: Allow tunnel stacking in the case of FOU/GUE · c3483384

Committed by Alexander Duyck on Mar 29, 2016

This patch should fix the issues seen with a recent fix to prevent
tunnel-in-tunnel frames from being generated with GRO. The fix itself is
correct for now as long as we do not add any devices that support
NETIF_F_GSO_GRE_CSUM. When such a device is added it could have the
potential to mess things up due to the fact that the outer transport header
points to the outer UDP header and not the GRE header as would be expected.

Fixes: fac8e0f5 ("tunnels: Don't apply GRO to multiple layers of encapsulation.")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: really allow using GFP_KERNEL on sctp_packet_transmit · 28fd3498

Committed by Marcelo Ricardo Leitner on Mar 29, 2016

Somehow my patch for commit cea8768f ("sctp: allow
sctp_transmit_packet and others to use gfp") missed two important
chunks, which are now added.

Fixes: cea8768f ("sctp: allow sctp_transmit_packet and others to use gfp")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-By: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: Allow set bridge ageing time when switchdev disabled · 5e263f71

Committed by Haishuang Yan on Mar 29, 2016

When NET_SWITCHDEV=n, switchdev_port_attr_set returns -EOPNOTSUPP;
we should ignore this error code and continue to set the ageing time.

Fixes: c62987bb ("bridge: push bridge setting ageing_time down to switchdev")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
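
The error handling described here follows a common pattern: treat "not supported" from an optional offload layer as non-fatal. A minimal sketch, in which `set_ageing_time`, the `bridge` dictionary and the `switchdev_set` callback are all hypothetical stand-ins for the kernel structures:

```python
import errno


def set_ageing_time(bridge, ageing_time, switchdev_set):
    """Apply the ageing time even when the offload layer is not available."""
    err = switchdev_set(ageing_time)
    if err and err != -errno.EOPNOTSUPP:
        return err  # a real failure from the offload layer still aborts
    bridge["ageing_time"] = ageing_time
    return 0
```

Any other error from the offload layer is still propagated, so only the "switchdev disabled" case changes behavior.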

28 Mar, 2016 11 commits

netfilter: ipv4: fix NULL dereference · 29421198

Committed by Liping Zhang on Mar 26, 2016

Commit fa50d974 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
uses sock_net(skb->sk) to get the net namespace, but we can't assume
that sk_buff->sk always exists; when it is NULL, an oops will occur.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: x_tables: enforce nul-terminated table name from getsockopt GET_ENTRIES · b301f253

Committed by Pablo Neira Ayuso on Mar 24, 2016

Make sure the table name passed via getsockopt GET_ENTRIES is nul-terminated
in ebtables and all the x_tables variants and their respective compat
code. Uncovered by KASAN.
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink_queue: honor NFQA_CFG_F_FAIL_OPEN when netlink unicast fails · 93140113

Committed by Pablo Neira Ayuso on Mar 23, 2016

When netlink unicast fails to deliver the message to userspace, we
should also check if the NFQA_CFG_F_FAIL_OPEN flag is set so we reinject
the packet back to the stack.

I think the user expects no packet drops when this flag is set due to
queueing to userspace errors, no matter if related to the internal queue
or when sending the netlink message to userspace.

The userspace application will still get the ENOBUFS error via recvmsg()
so the user still knows that, with the current configuration that is in
place, the userspace application is not consuming the messages at the
pace that the kernel needs.
Reported-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com>

netfilter: x_tables: fix unconditional helper · 54d83fc7

Committed by Florian Westphal on Mar 22, 2016

Ben Hawkes says:

 In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
 is possible for a user-supplied ipt_entry structure to have a large
 next_offset field. This field is not bounds checked prior to writing a
 counter value at the supplied offset.

The problem is that mark_source_chains should not have been called --
the rule doesn't have a next entry, so it's supposed to return
an absolute verdict of either ACCEPT or DROP.

However, the function unconditional() doesn't work as the name implies.
It only checks that the rule is using wildcard address matching.

However, an unconditional rule must also not be using any matches
(no -m args).

The underflow validator only checked the addresses, therefore
passing the 'unconditional absolute verdict' test, while
mark_source_chains also tested for presence of matches, and thus
proceeded to the next (non-existent) rule.

Unify this so that all the callers have same idea of 'unconditional rule'.
Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
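
The two notions of "unconditional" that this commit unifies can be contrasted in a small model. Everything here is illustrative: `Rule`, `ENTRY_HEADER_SIZE` and both predicates are hypothetical, and a rule is assumed to have no matches exactly when its target starts right after the fixed entry header.

```python
from dataclasses import dataclass

ENTRY_HEADER_SIZE = 112  # illustrative stand-in for the fixed entry header size


@dataclass
class Rule:
    wildcard_addresses: bool  # True if the address part matches everything
    target_offset: int        # where the target starts within the entry


def unconditional_addresses_only(rule):
    """Check that only looks at the address part (the inconsistent variant)."""
    return rule.wildcard_addresses


def unconditional_unified(rule):
    """Unified check: wildcard addresses and no matches before the target."""
    return rule.target_offset == ENTRY_HEADER_SIZE and rule.wildcard_addresses
```

A rule with wildcard addresses plus one match passes the address-only check but fails the unified one, which is the disagreement between callers that the commit describes.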

netfilter: x_tables: make sure e->next_offset covers remaining blob size · 6e94e0cf

Committed by Florian Westphal on Mar 22, 2016

Otherwise this function may read data beyond the ruleset blob.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: x_tables: validate e->target_offset early · bdf533de

Committed by Florian Westphal on Mar 22, 2016

We should check that e->target_offset is sane before
mark_source_chains gets called since it will fetch the target entry
for loop detection.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
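
The offset checks in this commit and the preceding one can be sketched as plain arithmetic on a user-supplied blob. The sizes and the `offsets_sane` helper are illustrative assumptions, not the kernel's validation code; positions are byte offsets from the start of the ruleset blob.

```python
ENTRY_SIZE = 112   # illustrative size of the fixed entry header
TARGET_SIZE = 32   # illustrative size of the fixed target header


def offsets_sane(entry_start, target_offset, next_offset, blob_len):
    """Return True if the entry's offsets stay inside the entry and the blob."""
    if entry_start + ENTRY_SIZE > blob_len:
        return False  # the fixed header itself would run past the blob
    if entry_start + next_offset > blob_len:
        return False  # next_offset points beyond the remaining blob
    if target_offset + TARGET_SIZE > next_offset:
        return False  # the target header would not fit inside this entry
    return True
```

Running such checks before any code follows the offsets is the point of doing the validation early.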

openvswitch: call only into reachable nf-nat code · 99b7248e

Committed by Arnd Bergmann on Mar 18, 2016

The openvswitch code has gained support for calling into the
nf-nat-ipv4/ipv6 modules, however those can be loadable modules
in a configuration in which openvswitch is built-in, leading
to link errors:

net/built-in.o: In function `__ovs_ct_lookup':
:(.text+0x2cc2c8): undefined reference to `nf_nat_icmp_reply_translation'
:(.text+0x2cc66c): undefined reference to `nf_nat_icmpv6_reply_translation'

The dependency on (!NF_NAT || NF_NAT) prevents similar issues,
but NF_NAT is set to 'y' if any of the symbols selecting
it are built-in, but the link error happens when any of them
are modular.

A second issue is that even if CONFIG_NF_NAT_IPV6 is built-in,
CONFIG_NF_NAT_IPV4 might be completely disabled. This is unlikely
to be useful in practice, but the driver currently only handles
IPv6 being optional.

This patch improves the Kconfig dependency so that openvswitch
cannot be built-in if either of the two other symbols are set
to 'm', and it replaces the incorrect #ifdef in ovs_ct_nat_execute()
with two "if (IS_ENABLED())" checks that should catch all corner
cases and also make the code more readable.

The same #ifdef exists in ovs_ct_nat_to_attr(), where it does not
cause a link error, but for consistency I'm changing it the same
way.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 05752523 ("openvswitch: Interface with NAT.")
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

openvswitch: Fix checking for new expected connections. · 5745b0be

Committed by Jarno Rajahalme on Mar 21, 2016

OVS should call into CT NAT for packets of new expected connections only
when the conntrack state is persisted with the 'commit' option to the
OVS CT action. The test for this condition is doubly wrong, as the CT
status field is ANDed with the bit number (IPS_EXPECTED_BIT) rather
than the mask (IPS_EXPECTED), and due to the wrong assumption that the
expected bit would apply only for the first (i.e., 'new') packet of a
connection, while in fact the expected bit remains on for the lifetime of
an expected connection. The 'ctinfo' value IP_CT_RELATED derived from
the ct status can be used instead, as it is only ever applicable to
the 'new' packets of the expected connection.

Fixes: 05752523 ('openvswitch: Interface with NAT.')
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
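
The first half of the bug, ANDing with a bit number instead of a mask, is easy to reproduce in isolation. The constants below are assumed to follow the conntrack convention where the expected bit is bit 0; the two helper names are hypothetical.

```python
IPS_EXPECTED_BIT = 0                    # bit number
IPS_EXPECTED = 1 << IPS_EXPECTED_BIT    # mask derived from the bit number
IPS_SEEN_REPLY = 1 << 1                 # another status flag, for contrast


def is_expected_buggy(status):
    """ANDs with the bit number; with bit number 0 this is always False."""
    return bool(status & IPS_EXPECTED_BIT)


def is_expected(status):
    """ANDs with the mask, which is what a flag test needs."""
    return bool(status & IPS_EXPECTED)
```

Even the corrected mask test is not sufficient for the purpose described in the commit, because the expected bit stays set for the lifetime of the connection; that is why the commit switches to the ctinfo value instead.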

netfilter: ipset: fix race condition in ipset save, swap and delete · 596cf3fe

Committed by Vishwanath Pai on Mar 16, 2016

This fix adds a new reference counter (ref_netlink) for the struct ip_set.
The other reference counter (ref) can be swapped out by ip_set_swap and we
need a separate counter to keep track of references for netlink events
like dump. Using the same ref counter for dump causes a race condition
which can be demonstrated by the following script:

ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \
counters
ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \
counters
ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \
counters

ipset save &

ipset swap hash_ip3 hash_ip2
ipset destroy hash_ip3 /* will crash the machine */

Swap will exchange the values of ref so destroy will see ref = 0 instead of
ref = 1. With this fix in place swap will not succeed because ipset save
still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink).

Both delete and swap will error out if ref_netlink != 0 on the set.

Note: The changes to the *_head functions are needed because previously
we would increment ref whenever we called these functions; we don't do
that anymore.
Reviewed-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
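
The effect of the second counter can be modeled with two plain integers. This is a toy model under stated assumptions: `IpSet`, `swap` and `destroy` are hypothetical, success and failure are reported as booleans rather than kernel error codes, and swapping is reduced to exchanging the ref values as described in the commit message.

```python
class IpSet:
    def __init__(self, name):
        self.name = name
        self.ref = 0          # ordinary references; exchanged by swap
        self.ref_netlink = 0  # references held by netlink events such as dump


def swap(a, b):
    """Exchange ref between two sets; refuse while either is being dumped."""
    if a.ref_netlink or b.ref_netlink:
        return False
    a.ref, b.ref = b.ref, a.ref
    return True


def destroy(s):
    """Destroying is only allowed when nothing references the set."""
    return s.ref == 0 and s.ref_netlink == 0
```

While a dump holds ref_netlink on a set, both the swap and the destroy are refused, so the dump can no longer be left pointing at freed memory.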

openvswitch: Use proper buffer size in nla_memcpy · ac71b46e

Committed by Haishuang Yan on Mar 28, 2016

For the input parameter count, it's better to use the size
of the destination buffer, as nla_memcpy already takes into
account the length of the source netlink attribute when
data is copied from an attribute.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
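
The reasoning rests on the copy being bounded by the smaller of the two lengths. A rough Python model of that behavior follows; the function is a sketch that reuses the nla_memcpy name for readability and operates on byte strings instead of netlink attributes.

```python
def nla_memcpy(dest, attr_payload, count):
    """Copy at most count bytes, and never more than the attribute holds."""
    minlen = min(count, len(attr_payload))
    dest[:minlen] = attr_payload[:minlen]
    return minlen
```

Passing the destination size as count therefore protects the destination, while the attribute length already protects the source:

```python
dest = bytearray(4)
copied = nla_memcpy(dest, bytes(range(8)), len(dest))  # oversized attribute
```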

Fix returned tc and hoplimit values for route with IPv6 encapsulation · 995096a0

Committed by Quentin Armitage on Mar 27, 2016

For a route with IPv6 encapsulation, the traffic class and hop limit
values are interchanged when returned to userspace by the kernel.
For example, see below.

># ip route add 192.168.0.1 dev eth0.2 encap ip6 dst 0x50 tc 0x50 hoplimit 100 table 1000
># ip route show table 1000
192.168.0.1  encap ip6 id 0 src :: dst fe83::1 hoplimit 80 tc 100 dev eth0.2  scope link
Signed-off-by: Quentin Armitage <quentin@armitage.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

26 Mar, 2016 15 commits

libceph: use KMEM_CACHE macro · 5ee61e95

Committed by Geliang Tang on Mar 13, 2016

Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: use sizeof_footer() more · 89f08173

Committed by Ilya Dryomov on Feb 20, 2016

Don't open-code sizeof_footer() in read_partial_message() and
ceph_msg_revoke().  Also, after switching to sizeof_footer(), it's now
possible to use con_out_kvec_add() in prepare_write_message_footer().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>

libceph: add helper that duplicates last extent operation · 2c63f49a

Committed by Yan, Zheng on Jan 07, 2016

This helper duplicates the last extent operation in an OSD request, then
adjusts the new extent operation's offset and length. The helper
is for scattered page writeback, which adds nonconsecutive dirty
pages to a single OSD request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: enable large, variable-sized OSD requests · 3f1af42a

Committed by Ilya Dryomov on Feb 09, 2016

Turn r_ops into a flexible array member to enable large, consisting of
up to 16 ops, OSD requests.  The use case is scattered writeback in
cephfs and, as far as the kernel client is concerned, 16 is just a made
up number.

r_ops had size 3 for copyup+hint+write, but copyup is really a special
case - it can only happen once.  ceph_osd_request_cache is therefore
stuffed with num_ops=2 requests, anything bigger than that is allocated
with kmalloc().  req_mempool is backed by ceph_osd_request_cache, which
means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
that.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: osdc->req_mempool should be backed by a slab pool · 9e767adb

Committed by Ilya Dryomov on Feb 09, 2016

ceph_osd_request_cache was introduced a long time ago.  Also, osd_req
is about to get a flexible array member, which ceph_osd_request_cache
is going to be aware of.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: make r_request msg_size calculation clearer · ae458f5a

Committed by Ilya Dryomov on Feb 11, 2016

Although msg_size is calculated correctly, the terms are grouped in
a misleading way - snaps appears to not have room for a u32 length.
Move calculation closer to its use and regroup terms.

No functional change.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op · 7665d85b

Committed by Yan, Zheng on Jan 07, 2016

This avoids defining a large array of r_reply_op_{len,result} in
struct ceph_osd_request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: rename ceph_osd_req_op::payload_len to indata_len · de2aa102

Committed by Ilya Dryomov on Feb 08, 2016

Follow userspace nomenclature on this - the next commit adds
outdata_len.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: behave in mon_fault() if cur_mon < 0 · b5d91704

Committed by Ilya Dryomov on Jan 23, 2016

This can happen if __close_session() in ceph_monc_stop() races with
a connection reset.  We need to ignore such faults, otherwise it's
likely we would take !hunting, call __schedule_delayed() and end up
with delayed_work() executing on invalid memory, among other things.

The (two!) con->private tests are useless, as nothing ever clears
con->private.  Nuke them.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: reschedule tick in mon_fault() · bee3a37c

Committed by Ilya Dryomov on Jan 22, 2016

Doing __schedule_delayed() in the hunting branch is pointless, as the
tick will have already been scheduled by then.

What we need to do instead is *reschedule* it in the !hunting branch,
after reopen_session() changes hunt_mult, which affects the delay.
This helps with spacing out connection attempts and avoiding things
like two back-to-back attempts followed by a longer period of waiting
around.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: introduce and switch to reopen_session() · 1752b50c

Committed by Ilya Dryomov on Jan 21, 2016

hunting is now set in __open_session() and cleared in finish_hunting(),
instead of all around.  The "session lost" message is printed not only
on connection resets, but also on keepalive timeouts.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: monc hunt rate is 3s with backoff up to 30s · 168b9090

Committed by Ilya Dryomov on Jan 21, 2016

Unless we are in the process of setting up a client (i.e. connecting to
the monitor cluster for the first time), apply a backoff: every time we
want to reopen a session, increase our timeout by a multiple (currently
2); when we complete the connection, reduce that multiplier by 50%.

Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
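
The backoff described here can be sketched as a multiplier applied to a base interval. The 3 s base and the 30 s ceiling come from the commit title; the cap of 10 on the multiplier is derived from those two numbers, and the class, method and constant names are hypothetical.

```python
HUNT_INTERVAL = 3.0   # seconds, from the commit title
HUNT_BACKOFF = 2      # "increase our timeout by a multiple (currently 2)"
HUNT_MAX_MULT = 10    # assumed cap so that the delay tops out at 30 s


class HuntTimer:
    def __init__(self):
        self.mult = 1

    def delay(self):
        """Current hunt timeout in seconds."""
        return HUNT_INTERVAL * self.mult

    def on_reopen(self):
        """Called whenever a session has to be reopened."""
        self.mult = min(self.mult * HUNT_BACKOFF, HUNT_MAX_MULT)

    def on_connected(self):
        """Called when a connection completes: halve the multiplier."""
        self.mult = max(self.mult // 2, 1)
```

Repeated failures lengthen the timeout until it reaches the ceiling, and each successful connection steps it back down rather than resetting it at once.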

libceph: monc ping rate is 10s · 58d81b12

Committed by Ilya Dryomov on Jan 21, 2016

Split ping interval and ping timeout: ping interval is 10s; keepalive
timeout is 30s.

Make monc_ping_timeout a constant while at it - it's not actually
exported as a mount option (and the rest of tick-related settings won't
be either), so it's got no place in ceph_options.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: pick a different monitor when reconnecting · 0e04dc26

Committed by Ilya Dryomov on Jan 20, 2016

Don't try to reconnect to the same monitor when we fail to establish
a session within a timeout or it's lost.

For that, pick_new_mon() needs to see the old value of cur_mon, so
don't clear it in __close_session() - all calls to __close_session()
but one are followed by __open_session() anyway. __open_session() is
only called when a new session needs to be established, so the "already
open?" branch, which is now in the way, is simply dropped.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
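
One common way to pick a random monitor other than the current one is to draw from a range that is one shorter and skip over the old index. The function below illustrates that technique under its own assumptions (monitors are indexed from 0, a negative cur_mon means "none yet"); it is not taken from the kernel source.

```python
import random


def pick_new_mon(num_mon, cur_mon, rng=random):
    """Return a monitor index that differs from cur_mon when possible."""
    if num_mon == 1:
        return 0  # nothing else to choose from
    if cur_mon < 0:
        return rng.randrange(num_mon)  # no previous monitor to avoid
    n = rng.randrange(num_mon - 1)
    if n >= cur_mon:
        n += 1  # skip over the monitor we were just connected to
    return n
```

This only works if the old value of cur_mon is still available when the choice is made, which is why the commit stops clearing it in __close_session().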

libceph: revamp subs code, switch to SUBSCRIBE2 protocol · 82dcabad

Committed by Ilya Dryomov on Jan 19, 2016

It is currently hard-coded in the mon_client that mdsmap and monmap
subs are continuous, while osdmap sub is always "onetime". To better
handle full clusters/pools in the osd_client, we need to be able to
issue continuous osdmap subs. Revamp subs code to allow us to specify
for each sub whether it should be continuous or not.

Although not strictly required for the above, switch to SUBSCRIBE2
protocol while at it, eliminating the ambiguity between a request for
"every map since X" and a request for "just the latest" when we don't
have a map yet (i.e. have epoch 0). SUBSCRIBE2 feature bit is now
required - it's been supported since pre-argonaut (2010).

Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
it before we validate the epoch and successfully install the new map
can mess up mon_client sub state.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
