提交 · a72c9512bf2bef12c5e66a4d910c4b348fe31d61 · openeuler / raspberrypi-kernel

23 10月, 2015 5 次提交

mpls: flow-based multipath selection · 1c78efa8

由 Robert Shearman 提交于 10月 23, 2015

Change the selection of a multipath route to use a flow-based
hash. This more suitable for traffic sensitive to reordering within a
flow (e.g. TCP, L2VPN) and whilst still allowing a good distribution
of traffic given enough flows.

Selection of the path for a multipath route is done using a hash of:
1. Label stack up to MAX_MP_SELECT_LABELS labels or up to and
   including entropy label, whichever is first.
2. 3-tuple of (L3 src, L3 dst, proto) from IPv4/IPv6 header in MPLS
   payload, if present.

Naturally, a 5-tuple hash using L4 information in addition would be
possible and be better in some scenarios, but there is a tradeoff
between looking deeper into the packet to achieve good distribution,
and packet forwarding performance, and I have erred on the side of the
latter as the default.
Signed-off-by: NRobert Shearman <rshearma@brocade.com>
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1c78efa8

mpls: multipath route support · f8efb73c

由 Roopa Prabhu 提交于 10月 23, 2015

This patch adds support for MPLS multipath routes.

Includes following changes to support multipath:
- splits struct mpls_route into 'struct mpls_route + struct mpls_nh'

- 'struct mpls_nh' represents a mpls nexthop label forwarding entry

- moves mpls route and nexthop structures into internal.h

- A mpls_route can point to multiple mpls_nh structs

- the nexthops are maintained as a array (similar to ipv4 fib)

- In the process of restructuring, this patch also consistently changes
  all labels to u8

- Adds support to parse/fill RTA_MULTIPATH netlink attribute for
multipath routes similar to ipv4/v6 fib

- In this patch, the multipath route nexthop selection algorithm
simply returns the first nexthop. It is replaced by a
hash based algorithm from Robert Shearman in the next patch

- mpls_route_update cleanup: remove 'dev' handling in mpls_route_update.
mpls_route_update though implemented to update based on dev, it was
never used that way. And the dev handling gets tricky with multiple
nexthops. Cannot match against any single nexthops dev. So, this patch
removes the unused 'dev' handling in mpls_route_update.

- dead route/path handling will be implemented in a subsequent patch

Example:

$ip -f mpls route add 100 nexthop as 200 via inet 10.1.1.2 dev swp1 \
                nexthop as 700 via inet 10.1.1.6 dev swp2 \
                nexthop as 800 via inet 40.1.1.2 dev swp3

$ip  -f mpls route show
100
        nexthop as to 200 via inet 10.1.1.2  dev swp1
        nexthop as to 700 via inet 10.1.1.6  dev swp2
        nexthop as to 800 via inet 40.1.1.2  dev swp3
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: NRobert Shearman <rshearma@brocade.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f8efb73c

if_link: Add control trust VF · dd461d6a

由 Hiroshi Shimamoto 提交于 8月 28, 2015

Add netlink directives and ndo entry to trust VF user.

This controls the special permission of VF user.
The administrator will dedicatedly trust VF user to use some features
which impacts security and/or performance.

The administrator never turn it on unless VF user is fully trusted.

CC: Sy Jong Choi <sy.jong.choi@intel.com>
Signed-off-by: NHiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Acked-by: NGreg Rose <gregory.v.rose@intel.com>
Tested-by: NKrishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

dd461d6a

tcp/dccp: fix hashdance race for passive sessions · 5e0724d0

由 Eric Dumazet 提交于 10月 22, 2015

Multiple cpus can process duplicates of incoming ACK messages
matching a SYN_RECV request socket. This is a rare event under
normal operations, but definitely can happen.

Only one must win the race, otherwise corruption would occur.

To fix this without adding new atomic ops, we use logic in
inet_ehash_nolisten() to detect the request was present in the same
ehash bucket where we try to insert the new child.

If request socket was not found, we have to undo the child creation.

This actually removes a spin_lock()/spin_unlock() pair in
reqsk_queue_unlink() for the fast path.

Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5e0724d0

ipv4: implement support for NOPREFIXROUTE ifa flag for ipv4 address · 7b131180

由 Paolo Abeni 提交于 10月 20, 2015

Currently adding a new ipv4 address always cause the creation of the
related network route, with default metric. When a host has multiple
interfaces on the same network, multiple routes with the same metric
are created.

If the userspace wants to set specific metric on each routes, i.e.
giving better metric to ethernet links in respect to Wi-Fi ones,
the network routes must be deleted and recreated, which is error-prone.

This patch implements the support for IFA_F_NOPREFIXROUTE for ipv4
address. When an address is added with such flag set, no associated
network route is created, no network route is deleted when
said IP is gone and it's up to the user space manage such route.
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7b131180

22 10月, 2015 20 次提交

net: dsa: remove port_fdb_getnext · 1a49a2fb

由 Vivien Didelot 提交于 10月 22, 2015

No driver implements port_fdb_getnext anymore, and port_fdb_dump is
preferred anyway, so remove this function from DSA.
Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1a49a2fb

net: dsa: add port_fdb_dump function · ea70ba98

由 Vivien Didelot 提交于 10月 22, 2015

Not all switch chips support a Get Next operation to iterate on its FDB.
So add a more simple port_fdb_dump function for them.
Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ea70ba98

openvswitch: Use dev_queue_xmit for vport send. · aec15924

由 Pravin B Shelar 提交于 10月 20, 2015

With use of lwtunnel, we can directly call dev_queue_xmit()
rather than calling netdev vport send operation.
Following change make tunnel vport code bit cleaner.
Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Acked-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aec15924

openvswitch: Fix incorrect type use. · 99e28f18

由 Pravin B Shelar 提交于 10月 20, 2015

Patch fixes following sparse warning.
net/openvswitch/flow_netlink.c:583:30: warning: incorrect type in assignment (different base types)
net/openvswitch/flow_netlink.c:583:30:    expected restricted __be16 [usertype] ipv4
net/openvswitch/flow_netlink.c:583:30:    got int

Fixes: 6b26ba3a ("openvswitch: netlink attributes for IPv6 tunneling")
Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Acked-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

99e28f18

Bluetooth: Increase minor version of core module · 13972adc

由 Marcel Holtmann 提交于 10月 22, 2015

With the addition of support for diagnostic feature, it makes sense to
increase the minor version of the Bluetooth core module.

The module version is not used anywhere, but it gives a nice extra
hint for debugging purposes.
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>

13972adc

ieee802154: 6lowpan: fix memory leak · aeedebff

由 Alexander Aring 提交于 10月 22, 2015

Looking at current situation of memory management in 6lowpan receive
function I detected some invalid handling. After calling
lowpan_invoke_rx_handlers we will do a kfree_skb and then NET_RX_DROP on
error handling. We don't do this before, also on
skb_share_check/skb_unshare which might manipulate the reference
counters.

After running some 'grep -r "dev_add_pack" net/' to look how others
packet-layer receive callbacks works I detected that every subsystem do
a kfree_skb, then NET_RX_DROP without calling skb functions which
might manipulate the skb reference counters. This is the reason why we
should do the same here like all others subsystems. I didn't find any
documentation how the packet-layer receive callbacks handle NET_RX_DROP
return values either.

This patch will add a kfree_skb, then NET_RX_DROP handling for the
"trivial checks", in case of skb_share_check/skb_unshare the kfree_skb
call will be done inside these functions.
Signed-off-by: NAlexander Aring <alex.aring@gmail.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

aeedebff

Bluetooth: Make hci_disconnect() behave correctly for all states · 88d07feb

由 Johan Hedberg 提交于 10月 22, 2015

There are a few places that don't explicitly check the connection
state before calling hci_disconnect(). To make this API do the right
thing take advantage of the new hci_abort_conn() API and also make
sure to only read the clock offset if we're really connected.
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

88d07feb

Bluetooth: Take advantage of connection abort helpers · 89e0ccc8

由 Johan Hedberg 提交于 10月 22, 2015

Convert the various places mapping connection state to
disconnect/cancel HCI command to use the new hci_abort_conn helper
API.
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

89e0ccc8

Bluetooth: Introduce hci_req helper to abort a connection · dcc0f0d9

由 Johan Hedberg 提交于 10月 22, 2015

There are several different places needing to make sure that a
connection gets disconnected or canceled. The exact action needed
depends on the connection state, so centralizing this logic can save
quite a lot of code duplication.
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

dcc0f0d9

Bluetooth: Fix crash in SMP when unpairing · c81d555a

由 Johan Hedberg 提交于 10月 22, 2015

When unpairing the keys stored in hci_dev are removed. If SMP is
ongoing the SMP context will also have references to these keys, so
removing them from the hci_dev lists will make the pointers invalid.
This can result in the following type of crashes:

 BUG: unable to handle kernel paging request at 6b6b6b6b
 IP: [<c11f26be>] __list_del_entry+0x44/0x71
 *pde = 00000000
 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in: hci_uart btqca btusb btintel btbcm btrtl hci_vhci rfcomm bluetooth_6lowpan bluetooth
 CPU: 0 PID: 723 Comm: kworker/u5:0 Not tainted 4.3.0-rc3+ #1379
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
 Workqueue: hci0 hci_rx_work [bluetooth]
 task: f19da940 ti: f1a94000 task.ti: f1a94000
 EIP: 0060:[<c11f26be>] EFLAGS: 00010202 CPU: 0
 EIP is at __list_del_entry+0x44/0x71
 EAX: c0088d20 EBX: f30fcac0 ECX: 6b6b6b6b EDX: 6b6b6b6b
 ESI: f4b60000 EDI: c0088d20 EBP: f1a95d90 ESP: f1a95d8c
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
 CR0: 8005003b CR2: 6b6b6b6b CR3: 319e5000 CR4: 00000690
 Stack:
  f30fcac0 f1a95db0 f82dc3e1 f1bfc000 00000000 c106524f f1bfc000 f30fd020
  f1a95dc0 f1a95dd0 f82dcbdb f1a95de0 f82dcbdb 00000067 f1bfc000 f30fd020
  f1a95de0 f1a95df0 f82d1126 00000067 f82d1126 00000006 f30fd020 f1bfc000
 Call Trace:
  [<f82dc3e1>] smp_chan_destroy+0x192/0x240 [bluetooth]
  [<c106524f>] ? trace_hardirqs_on_caller+0x14e/0x169
  [<f82dcbdb>] smp_teardown_cb+0x47/0x64 [bluetooth]
  [<f82dcbdb>] ? smp_teardown_cb+0x47/0x64 [bluetooth]
  [<f82d1126>] l2cap_chan_del+0x5d/0x14d [bluetooth]
  [<f82d1126>] ? l2cap_chan_del+0x5d/0x14d [bluetooth]
  [<f82d40ef>] l2cap_conn_del+0x109/0x17b [bluetooth]
  [<f82d40ef>] ? l2cap_conn_del+0x109/0x17b [bluetooth]
  [<f82c0205>] ? hci_event_packet+0x5b1/0x2092 [bluetooth]
  [<f82d41aa>] l2cap_disconn_cfm+0x49/0x50 [bluetooth]
  [<f82d41aa>] ? l2cap_disconn_cfm+0x49/0x50 [bluetooth]
  [<f82c0228>] hci_event_packet+0x5d4/0x2092 [bluetooth]
  [<c1332c16>] ? skb_release_data+0x6a/0x95
  [<f82ce5d4>] ? hci_send_to_monitor+0xe7/0xf4 [bluetooth]
  [<c1409708>] ? _raw_spin_unlock_irqrestore+0x44/0x57
  [<f82b3bb0>] hci_rx_work+0xf1/0x28b [bluetooth]
  [<f82b3bb0>] ? hci_rx_work+0xf1/0x28b [bluetooth]
  [<c10635a0>] ? __lock_is_held+0x2e/0x44
  [<c104772e>] process_one_work+0x232/0x432
  [<c1071ddc>] ? rcu_read_lock_sched_held+0x50/0x5a
  [<c104772e>] ? process_one_work+0x232/0x432
  [<c1047d48>] worker_thread+0x1b8/0x255
  [<c1047b90>] ? rescuer_thread+0x23c/0x23c
  [<c104bb71>] kthread+0x91/0x96
  [<c14096a7>] ? _raw_spin_unlock_irq+0x27/0x44
  [<c1409d61>] ret_from_kernel_thread+0x21/0x30
  [<c104bae0>] ? kthread_parkme+0x1e/0x1e

To solve the issue, introduce a new smp_cancel_pairing() API that can
be used to clean up the SMP state before touching the hci_dev lists.
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

c81d555a

Bluetooth: Disable auto-connection parameters when unpairing · fc64361a

由 Johan Hedberg 提交于 10月 22, 2015

For connection parameters that are left around until a disconnection
we should at least clear any auto-connection properties. This way a
new Add Device call is required to re-set them after calling Unpair
Device.
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

fc64361a

ipv6: gro: support sit protocol · feec0cb3

由 Eric Dumazet 提交于 10月 19, 2015

Tom Herbert added SIT support to GRO with commit
19424e05 ("sit: Add gro callbacks to sit_offload"),
later reverted by Herbert Xu.

The problem came because Tom patch was building GRO
packets without proper meta data : If packets were locally
delivered, we would not care.

But if packets needed to be forwarded, GSO engine was not
able to segment individual segments.

With the following patch, we correctly set skb->encapsulation
and inner network header. We also update gso_type.

Tested:

Server :
netserver
modprobe dummy
ifconfig dummy0 8.0.0.1 netmask 255.255.255.0 up
arp -s 8.0.0.100 4e:32:51:04:47:e5
iptables -I INPUT -s 10.246.7.151 -j TEE --gateway 8.0.0.100
ifconfig sixtofour0
sixtofour0 Link encap:IPv6-in-IPv4
          inet6 addr: 2002:af6:798::1/128 Scope:Global
          inet6 addr: 2002:af6:798::/128 Scope:Global
          UP RUNNING NOARP  MTU:1480  Metric:1
          RX packets:411169 errors:0 dropped:0 overruns:0 frame:0
          TX packets:409414 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:20319631739 (20.3 GB)  TX bytes:29529556 (29.5 MB)

Client :
netperf -H 2002:af6:798::1 -l 1000 &

Checked on server traffic copied on dummy0 and verify segments were
properly rebuilt, with proper IP headers, TCP checksums...

tcpdump on eth0 shows proper GRO aggregation takes place.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

feec0cb3

netlink: Rightsize IFLA_AF_SPEC size calculation · b1974ed0

由 Arad, Ronen 提交于 10月 19, 2015

if_nlmsg_size() overestimates the minimum allocation size of netlink
dump request (when called from rtnl_calcit()) or the size of the
message (when called from rtnl_getlink()). This is because
ext_filter_mask is not supported by rtnl_link_get_af_size() and
rtnl_link_get_size().

The over-estimation is significant when at least one netdev has many
VLANs configured (8 bytes for each configured VLAN).

This patch-set "rightsizes" the protocol specific attribute size
calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
and adding this a argument to get_link_af_size op in rtnl_af_ops.

Bridge module already used filtering aware sizing for notifications.
br_get_link_af_size_filtered() is consistent with the modified
get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
br_get_link_af_size() becomes unused and thus removed.
Signed-off-by: NRonen Arad <ronen.arad@intel.com>
Acked-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b1974ed0

Bluetooth: Remove unnecessary hci_explicit_connect_lookup function · 17bc08f0

由 Johan Hedberg 提交于 10月 21, 2015

There's only one user of this helper which can be replaces with a call
to hci_pend_le_action_lookup() and a check for params->explicit_connect.
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

17bc08f0

Bluetooth: Remove redundant (and possibly wrong) flag clearing · 1ede9868

由 Johan Hedberg 提交于 10月 21, 2015

There's no need to clear the HCI_CONN_ENCRYPT_PEND flag in
smp_failure. In fact, this may cause the encryption tracking to get
out of sync as this has nothing to do with HCI activity.
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

1ede9868

Bluetooth: Add hdev helper variable to hci_le_create_connection_cancel · b5c2b621

由 Johan Hedberg 提交于 10月 21, 2015

The hci_le_create_connection_cancel() function needs to use the hdev
pointer in many places so add a variable for it to avoid the need to
dereference the hci_conn every time.
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

b5c2b621

Bluetooth: Remove unnecessary indentation in unpair_device() · ec182f03

由 Johan Hedberg 提交于 10月 21, 2015

Instead of doing all of the LE-specific handling in an else-branch in
unpair_device() create a 'done' label for the BR/EDR branch to jump to
and then remove the else-branch completely.
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

ec182f03

Bluetooth: 6lowpan: Use hci_conn_hash_lookup_le() when possible · f5ad4ffc

由 Johan Hedberg 提交于 10月 21, 2015

Use the new hci_conn_hash_lookup_le() API to look up LE connections.
This way we're guaranteed exact matches that also take into account
the address type.
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

f5ad4ffc

Bluetooth: Use hci_conn_hash_lookup_le() when possible · 9d4c1cc1

由 Johan Hedberg 提交于 10月 21, 2015

Use the new hci_conn_hash_lookup_le() API to look up LE connections.
This way we're guaranteed exact matches that also take into account
the address type.
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

9d4c1cc1

Bluetooth: Add le_addr_type() helper function · 85813a7e

由 Johan Hedberg 提交于 10月 21, 2015

The mgmt code needs to convert from mgmt/L2CAP address types to HCI in
many places. Having a dedicated helper function for this simplifies
code by shortening it and removing unnecessary 'addr_type' variables.
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

85813a7e

21 10月, 2015 15 次提交

Adding switchdev ageing notification on port bridged · 6ac311ae

由 Elad Raz 提交于 10月 19, 2015

Configure ageing time to the HW for newly bridged device

CC: Scott Feldman <sfeldma@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: NElad Raz <eladr@mellanox.com>
Acked-by: NJiri Pirko <jiri@mellanox.com>
Acked-by: NScott Feldman <sfeldma@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6ac311ae

tcp: use RACK to detect losses · 4f41b1c5

由 Yuchung Cheng 提交于 10月 16, 2015

This patch implements the second half of RACK that uses the the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if an not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(<3 pkts) on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property but SRTT can be inflated by
self-inflicated congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.

Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.

We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments of loss recoveries. For now RACK can be disabled by
setting it to 0.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4f41b1c5

tcp: track the packet timings in RACK · 659a8ad5

由 Yuchung Cheng 提交于 10月 16, 2015

This patch is the first half of the RACK loss recovery.

RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must been lost,
since at least one round trip time has elapsed.

But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery ut not timeout recovery

RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering.

This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.

Consider an example that the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.

We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.

The second half is implemented in the next patch that marks packet
lost using RACK timestamp.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

659a8ad5

tcp: add tcp_tsopt_ecr_before helper · 77c63127

由 Yuchung Cheng 提交于 10月 16, 2015

a helper to prepare the main RACK patch
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

77c63127

tcp: remove tcp_mark_lost_retrans() · af82f4e8

由 Yuchung Cheng 提交于 10月 16, 2015

Remove the existing lost retransmit detection because RACK subsumes
it completely. This also stops the overloading the ack_seq field of
the skb control block.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

af82f4e8

tcp: track min RTT using windowed min-filter · f6722583

由 Yuchung Cheng 提交于 10月 16, 2015

Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.

Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.

Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code

The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.

The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f6722583

tcp: apply Kern's check on RTTs used for congestion control · 9e45a3e3

由 Yuchung Cheng 提交于 10月 16, 2015

Currently ca_seq_rtt_us does not use Kern's check. Fix that by
checking if any packet acked is a retransmit, for both RTT used
for RTT estimation and congestion control.

Fixes: 5b08e47c ("tcp: prefer packet timing to TS-ECR for RTT")
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9e45a3e3

Bluetooth: Fix missing hdev locking for LE scan cleanup · 8ce783dc

由 Johan Hedberg 提交于 10月 21, 2015

The hci_conn objects don't have a dedicated lock themselves but rely
on the caller to hold the hci_dev lock for most types of access. The
hci_conn_timeout() function has so far sent certain HCI commands based
on the hci_conn state which has been possible without holding the
hci_dev lock.

The recent changes to do LE scanning before connect attempts added
even more operations to hci_conn and hci_dev from hci_conn_timeout,
thereby exposing potential race conditions with the hci_dev and
hci_conn states.

As an example of such a race, here there's a timeout but an
l2cap_sock_connect() call manages to race with the cleanup routine:

[Oct21 08:14] l2cap_chan_timeout: chan ee4b12c0 state BT_CONNECT
[  +0.000004] l2cap_chan_close: chan ee4b12c0 state BT_CONNECT
[  +0.000002] l2cap_chan_del: chan ee4b12c0, conn f3141580, err 111, state BT_CONNECT
[  +0.000002] l2cap_sock_teardown_cb: chan ee4b12c0 state BT_CONNECT
[  +0.000005] l2cap_chan_put: chan ee4b12c0 orig refcnt 4
[  +0.000010] hci_conn_drop: hcon f53d56e0 orig refcnt 1
[  +0.000013] l2cap_chan_put: chan ee4b12c0 orig refcnt 3
[  +0.000063] hci_conn_timeout: hcon f53d56e0 state BT_CONNECT
[  +0.000049] hci_conn_params_del: addr ee:0d:30:09:53:1f (type 1)
[  +0.000002] hci_chan_list_flush: hcon f53d56e0
[  +0.000001] hci_chan_del: hci0 hcon f53d56e0 chan f4e7ccc0
[  +0.004528] l2cap_sock_create: sock e708fc00
[  +0.000023] l2cap_chan_create: chan ee4b1770
[  +0.000001] l2cap_chan_hold: chan ee4b1770 orig refcnt 1
[  +0.000002] l2cap_sock_init: sk ee4b3390
[  +0.000029] l2cap_sock_bind: sk ee4b3390
[  +0.000010] l2cap_sock_setsockopt: sk ee4b3390
[  +0.000037] l2cap_sock_connect: sk ee4b3390
[  +0.000002] l2cap_chan_connect: 00:02:72:d9:e5:8b -> ee:0d:30:09:53:1f (type 2) psm 0x00
[  +0.000002] hci_get_route: 00:02:72:d9:e5:8b -> ee:0d:30:09:53:1f
[  +0.000001] hci_dev_hold: hci0 orig refcnt 8
[  +0.000003] hci_conn_hold: hcon f53d56e0 orig refcnt 0

Above the l2cap_chan_connect() shouldn't have been able to reach the
hci_conn f53d56e0 anymore but since hci_conn_timeout didn't do proper
locking that's not the case. The end result is a reference to hci_conn
that's not in the conn_hash list, resulting in list corruption when
trying to remove it later:

[Oct21 08:15] l2cap_chan_timeout: chan ee4b1770 state BT_CONNECT
[  +0.000004] l2cap_chan_close: chan ee4b1770 state BT_CONNECT
[  +0.000003] l2cap_chan_del: chan ee4b1770, conn f3141580, err 111, state BT_CONNECT
[  +0.000001] l2cap_sock_teardown_cb: chan ee4b1770 state BT_CONNECT
[  +0.000005] l2cap_chan_put: chan ee4b1770 orig refcnt 4
[  +0.000002] hci_conn_drop: hcon f53d56e0 orig refcnt 1
[  +0.000015] l2cap_chan_put: chan ee4b1770 orig refcnt 3
[  +0.000038] hci_conn_timeout: hcon f53d56e0 state BT_CONNECT
[  +0.000003] hci_chan_list_flush: hcon f53d56e0
[  +0.000002] hci_conn_hash_del: hci0 hcon f53d56e0
[  +0.000001] ------------[ cut here ]------------
[  +0.000461] WARNING: CPU: 0 PID: 1782 at lib/list_debug.c:56 __list_del_entry+0x3f/0x71()
[  +0.000839] list_del corruption, f53d56e0->prev is LIST_POISON2 (00000200)

The necessary fix is unfortunately more complicated than just adding
hci_dev_lock/unlock calls to the hci_conn_timeout() call path.
Particularly, the hci_conn_del() API, which expects the hci_dev lock to
be held, performs a cancel_delayed_work_sync(&hcon->disc_work) which
would lead to a deadlock if the hci_conn_timeout() call path tries to
acquire the same lock.

This patch solves the problem by deferring the cleanup work to a
separate work callback. To protect against the hci_dev or hci_conn
going away meanwhile temporary references are taken with the help of
hci_dev_hold() and hci_conn_get().
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.3

8ce783dc

mac80211: move station statistics into sub-structs · e5a9f8d0

由 Johannes Berg 提交于 10月 16, 2015

Group station statistics by where they're (mostly) updated
(TX, RX and TX-status) and group them into sub-structs of
the struct sta_info.

Also rename the variables since the grouping now makes it
obvious where they belong.

This makes it easier to identify where the statistics are
updated in the code, and thus easier to think about them.
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

e5a9f8d0

mac80211: move beacon_loss_count into ifmgd · 976bd9ef

由 Johannes Berg 提交于 10月 16, 2015

There's little point in keeping (and even sending to userspace)
the beacon_loss_count value per station, since it can only apply
to the AP on a managed-mode connection. Move the value to ifmgd,
advertise it only in managed mode, and remove it from ethtool as
it's available through better interfaces.
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

976bd9ef

mac80211: remove sta->last_ack_signal · 763aa27a

由 Johannes Berg 提交于 10月 16, 2015

This file only feeds a debugfs file that isn't very useful, so remove
it. If necessary, we can add other ways to get this information, for
example in the NL80211_CMD_PROBE_CLIENT response.
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

763aa27a

Bluetooth: Introduce driver specific post init callback · 98a63aaf

由 Marcel Holtmann 提交于 10月 20, 2015

Some drivers might have to restore certain settings after the init
procedure has been completed. This driver callback allows them to hook
into that stage. This callback is run just before the controller is
declared as powered up.
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>

98a63aaf

Bluetooth: l2cap_disconnection_req priority over shutdown · 9f7378a9

由 Dean Jenkins 提交于 10月 14, 2015

There is a L2CAP protocol race between the local peer and
the remote peer demanding disconnection of the L2CAP link.

When L2CAP ERTM is used, l2cap_sock_shutdown() can be called
from userland to disconnect L2CAP. However, there can be a
delay introduced by waiting for ACKs. During this waiting
period, the remote peer may have sent a Disconnection Request.
Therefore, recheck the shutdown status of the socket
after waiting for ACKs because there is no need to do
further processing if the connection has gone.
Signed-off-by: NDean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: NHarish Jenny K N <harish_kandiga@mentor.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

9f7378a9

Bluetooth: Reorganize mutex lock in l2cap_sock_shutdown() · 04ba72e6

由 Dean Jenkins 提交于 10月 14, 2015

This commit reorganizes the mutex lock and is now
only protecting l2cap_chan_close(). This is now consistent
with other places where l2cap_chan_close() is called.

If a conn connection exists, call
mutex_lock(&conn->chan_lock) before calling l2cap_chan_close()
to ensure other L2CAP protocol operations do not interfere.

Note that the conn structure has to be protected from being
freed as it is possible for the connection to be disconnected
whilst the locks are not held. This solution allows the mutex
lock to be used even when the connection has just been
disconnected.

This commit also reduces the scope of chan locking.

The only place where chan locking is needed is the call to
l2cap_chan_close(chan, 0) which if necessary closes the channel.
Therefore, move the l2cap_chan_lock(chan) and
l2cap_chan_lock(chan) locking calls to around
l2cap_chan_close(chan, 0).

This allows __l2cap_wait_ack(sk, chan) to be called with no
chan locks being held so L2CAP messaging over the ACL link
can be done unimpaired.
Signed-off-by: NDean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: NHarish Jenny K N <harish_kandiga@mentor.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

04ba72e6

Bluetooth: Unwind l2cap_sock_shutdown() · e7456437

由 Dean Jenkins 提交于 10月 14, 2015

l2cap_sock_shutdown() is designed to only action shutdown
of the channel when shutdown is not already in progress.
Therefore, reorganise the code flow by adding a goto
to jump to the end of function handling when shutdown is
already being actioned. This removes one level of code
indentation and make the code more readable.
Signed-off-by: NDean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: NHarish Jenny K N <harish_kandiga@mentor.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

e7456437