提交 · e86abc689c5cb963f713c1bab9c37775421a6a96 · openanolis / cloud-kernel

03 11月, 2015 11 次提交

cfg80211/mac80211: clarify RSSI CQM reporting requirements · e86abc68

由 Johannes Berg 提交于 10月 22, 2015

The previous patch changed mac80211 to always report an event
after a CQM RSSI reconfiguration. Document that as expected
behaviour in both the cfg80211 and mac80211 API.

Currently, iwlmvm already implements that behaviour; the other
drivers implementing CQM RSSI events may have to be changed.

This behaviour lets userspace know what the current state is
without relying on querying the data which is racy.
Reviewed-by: NSharon, Sara <sara.sharon@intel.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

e86abc68

mac80211: fix crash on mesh local link ID generation with VIFs · 520c75dc

由 Matthias Schiffer 提交于 10月 24, 2015

llid_in_use needs to be limited to stations of the same VIF, otherwise it
will cause a NULL deref as the sta_info of non-mesh-VIFs don't have
sta->mesh set.

Steps to reproduce:

   modprobe mac80211_hwsim channels=2
   iw phy phy0 interface add ibss0 type ibss
   iw phy phy0 interface add mesh0 type mp
   iw phy phy1 interface add ibss1 type ibss
   iw phy phy1 interface add mesh1 type mp
   ip link set ibss0 up
   ip link set mesh0 up
   ip link set ibss1 up
   ip link set mesh1 up
   iw dev ibss0 ibss join foo 2412
   iw dev ibss1 ibss join foo 2412
   # Ensure that ibss0 and ibss1 are actually associated; I often need to
   # leave and join the cell on ibss1 a second time.
   iw dev mesh0 mesh join bar
   iw dev mesh1 mesh join bar # crash
Signed-off-by: NMatthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

520c75dc

mac80211: TDLS: add proper HT-oper IE · 57f255f5

由 Arik Nemtsov 提交于 10月 25, 2015

When 11n peers performs a TDLS connection on a legacy BSS, the HT
operation IE must be specified according to IEEE802.11-2012 section
9.23.3.2. Otherwise HT-protection is compromised and the medium becomes
noisy for both the TDLS and the BSS links.
Signed-off-by: NArik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: NEmmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

57f255f5

mac80211: don't reconfigure sched scan in case of wowlan · 0d440ea2

由 Eliad Peller 提交于 10月 25, 2015

Scheduled scan has to be reconfigured only if wowlan wasn't
configured, since otherwise it should continue to run (with
the 'any' trigger) or be aborted.

The current code will end up asking the driver to start a new
scheduled scan without stopping the previous one, and leaking
some memory (from the previous request.)

Fix this by doing the abort/restart under the proper conditions.
Signed-off-by: NEliad Peller <eliadx.peller@intel.com>
Signed-off-by: NEmmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

0d440ea2

mac80211: call drv_stop only if driver is started · 968a76ce

由 Eliad Peller 提交于 10月 25, 2015

If drv_start() fails during hw_restart, all the running
interfaces are being closed/stopped, which results in
drv_stop() being called, although the driver was never
started successfully.

This might cause drivers to perform operations on uninitialized
memory (as they assume it was initialized on drv_start)

Consider the local->started flag, and call the driver's stop()
op only if drv_start() succeeded before.

Move drv_start() and drv_stop() to driver-ops.c, as they are no
longer simple wrappers.
Signed-off-by: NEliad Peller <eliadx.peller@intel.com>
Signed-off-by: NEmmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

968a76ce

mac80211: Remove WARN_ON_ONCE in ieee80211_recalc_smps · c189a685

由 Andrei Otcheretianski 提交于 10月 25, 2015

The recalc_smps work can run after the station disassociates.
At this stage we already released the channel, but the work
will be cancelled only when the interface stops.
In this scenario we can hit the warning in ieee80211_recalc_smps, so
just remove it.
Signed-off-by: NAndrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: NEmmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

c189a685

mac80211: use freezable workqueue for restart work · 43d6df00

由 Eliad Peller 提交于 10月 25, 2015

Requesting hw restart during suspend might result
in the restart work being executed after mac80211
and the hw are suspended.

Solve the race by simply scheduling the restart
work on a freezable workqueue.

Note that there can be some cases of reconfiguration
on resume (besides the hardware restart):

* wowlan is not configured -
    All the interfaces removed were removed on suspend,
    and drv_stop() was called. At this point the driver
    shouldn't expect for hw_restart anyway, so we can
    simply cancel it (on resume).

* wowlan is configured, drv_resume() == 1
    There is no definitive expected behavior in this case,
    as each driver might have different expectations (e.g.
    setting some flags on suspend/restart vs. not handling
    spurious recovery).
    For now, simply let the hw_restart work run again after
    resume, and hope the driver will handle it well (or at
    least initiate another hw restart).
Signed-off-by: NEliad Peller <eliadx.peller@intel.com>
Signed-off-by: NEmmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

43d6df00

mac80211: Fix local deauth while associating · a64cba3c

由 Andrei Otcheretianski 提交于 10月 25, 2015

Local request to deauthenticate wasn't handled while associating, thus
the association could continue even when the user space required to
disconnect.

Cc: stable@vger.kernel.org
Signed-off-by: NAndrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: NEmmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

a64cba3c

mac80211: allow null chandef in tracing · 254d3dfe

由 Arik Nemtsov 提交于 10月 25, 2015

In TDLS channel-switch operations the chandef can sometimes be NULL.
Avoid an oops in the trace code for these cases and just print a
chandef full of zeros.

Cc: stable@vger.kernel.org
Fixes: a7a6bdd0 ("mac80211: introduce TDLS channel switch ops")
Signed-off-by: NArik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: NEmmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

254d3dfe

nl80211: Fix potential memory leak from parse_acl_data · 4baf6bea

由 Ola Olsson 提交于 10月 29, 2015

If parse_acl_data succeeds but the subsequent parsing of smps
attributes fails, there will be a memory leak due to early returns.
Fix that by moving the ACL parsing later.

Cc: stable@vger.kernel.org
Fixes: 18998c38 ("cfg80211: allow requesting SMPS mode on ap start")
Signed-off-by: NOla Olsson <ola.olsson@sonymobile.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

4baf6bea

mac80211: fix divide by zero when NOA update · 519ee691

由 Janusz.Dziedzic@tieto.com 提交于 10月 27, 2015

In case of one shot NOA the interval can be 0, catch that
instead of potentially (depending on the driver) crashing
like this:

divide error: 0000 [#1] SMP
[...]
Call Trace:
<IRQ>
[<ffffffffc08e891c>] ieee80211_extend_absent_time+0x6c/0xb0 [mac80211]
[<ffffffffc08e8a17>] ieee80211_update_p2p_noa+0xb7/0xe0 [mac80211]
[<ffffffffc069cc30>] ath9k_p2p_ps_timer+0x170/0x190 [ath9k]
[<ffffffffc070adf8>] ath_gen_timer_isr+0xc8/0xf0 [ath9k_hw]
[<ffffffffc0691156>] ath9k_tasklet+0x296/0x2f0 [ath9k]
[<ffffffff8107ad65>] tasklet_action+0xe5/0xf0
[...]

Cc: stable@vger.kernel.org [3.16+, due to d463af4a using it]
Signed-off-by: NJanusz Dziedzic <janusz.dziedzic@tieto.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

519ee691

22 10月, 2015 25 次提交

Merge tag 'mac80211-next-for-davem-2015-10-21' of... · e9829b97

由 David S. Miller 提交于 10月 22, 2015

Merge tag 'mac80211-next-for-davem-2015-10-21' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Here's another set of patches for the current cycle:
 * I merged net-next back to avoid a conflict with the
 * cfg80211 scheduled scan API extensions
 * preparations for better scan result timestamping
 * regulatory cleanups
 * mac80211 statistics cleanups
 * a few other small cleanups and fixes
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e9829b97

net: hisilicon: deals with the sub ctrl by syscon · c7fc9eb7

由 yankejian 提交于 10月 21, 2015

the global Soc configuration is treated by syscon, and sub ctrl bus is
Soc bus. it has to be treated by syscon.
Signed-off-by: Nyankejian <yankejian@huawei.com>
Signed-off-by: Nlisheng <lisheng011@huawei.com>
Signed-off-by: Nlipeng <lipeng321@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c7fc9eb7

Merge branch 'cxgb4-trivial-fixes' · e9e6d79c

由 David S. Miller 提交于 10月 22, 2015

Hariprasad Shenai says:

====================
Trivial fixes for cxgb4 driver

This patch series updates driver description for next gen. adapters, updates
firmware info., returns error for setup_rss error case, restores L1
configuration in case of FW rejects new config, updates and aligns ethtool
get stats settings, etc

 This patch series has been created against net-next tree and includes
patches on cxgb4 and cxgb4vf driver.

We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e9e6d79c

cxgb4: Update ethtool get_drvinfo to get regdump len · b08f2b35

由 Hariprasad Shenai 提交于 10月 21, 2015

Update ethtool get_drvinfo to display regdump len and also update
firmware string version print to display N/A in case FW isn't present
Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b08f2b35

cxgb4: Use vmalloc, if kmalloc fails · 9c673d15

由 Hariprasad Shenai 提交于 10月 21, 2015

Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9c673d15

H
cxgb4: Return error if setup_rss is called before probe · 6ac5fe75
由 Hariprasad Shenai 提交于 10月 21, 2015
```
Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
6ac5fe75
H
cxgb4/cxgb4vf: Update driver desc. to include Chelsio T6 adapter · 52a5f846
由 Hariprasad Shenai 提交于 10月 21, 2015
```
Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
52a5f846
H
cxgb4: Add info print to display number of MSI-X vectors allocated · 43eb4e82
由 Hariprasad Shenai 提交于 10月 21, 2015
```
Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
43eb4e82

cxgb4: Restore L1 cfg, if FW rejects new L1 cfg settings · 41165428

由 Hariprasad Shenai 提交于 10月 21, 2015

In the ethtool set_settings() routine we need to remember our old L1
Configuration in case the firmware rejects the request and then restore
that.
Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

41165428

cxgb4: Don't disallow turning off auto-negotiation · 9bfdad5e

由 Hariprasad Shenai 提交于 10月 21, 2015

For {1, 10, 40} Gb/s. Prohibiting turning off autonegotiation isn't anywhere
in the standard.
Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9bfdad5e

cxgb4: Align ethtool get stat settings · eed7342d

由 Hariprasad Shenai 提交于 10月 21, 2015

Align the ethtool get stats settings with the rest so it looks uniform
Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eed7342d

openvswitch: Use dev_queue_xmit for vport send. · aec15924

由 Pravin B Shelar 提交于 10月 20, 2015

With use of lwtunnel, we can directly call dev_queue_xmit()
rather than calling netdev vport send operation.
Following change make tunnel vport code bit cleaner.
Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Acked-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aec15924

openvswitch: Fix incorrect type use. · 99e28f18

由 Pravin B Shelar 提交于 10月 20, 2015

Patch fixes following sparse warning.
net/openvswitch/flow_netlink.c:583:30: warning: incorrect type in assignment (different base types)
net/openvswitch/flow_netlink.c:583:30:    expected restricted __be16 [usertype] ipv4
net/openvswitch/flow_netlink.c:583:30:    got int

Fixes: 6b26ba3a ("openvswitch: netlink attributes for IPv6 tunneling")
Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Acked-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

99e28f18

Merge branch 'bpf-perf' · 721daebb

由 David S. Miller 提交于 10月 22, 2015

Alexei Starovoitov says:

====================
bpf_perf_event_output helper

Over the last year there were multiple attempts to let eBPF programs
output data into perf events by He Kuang and Wangnan.
The last one was:
https://lkml.org/lkml/2015/7/20/736
It was almost perfect with exception that all bpf programs would sent
data into one global perf_event.
This patch set takes different approach by letting user space
open independent PERF_COUNT_SW_BPF_OUTPUT events, so that program
output won't collide.

Wangnan is working on corresponding perf patches.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

721daebb

samples: bpf: add bpf_perf_event_output example · 39111695

由 Alexei Starovoitov 提交于 10月 20, 2015

Performance test and example of bpf_perf_event_output().
kprobe is attached to sys_write() and trivial bpf program streams
pid+cookie into userspace via PERF_COUNT_SW_BPF_OUTPUT event.

Usage:
$ sudo ./bld_x64/samples/bpf/trace_output
recv 2968913 events per sec
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

39111695

bpf: introduce bpf_perf_event_output() helper · a43eec30

由 Alexei Starovoitov 提交于 10月 20, 2015

This helper is used to send raw data from eBPF program into
special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
User space needs to perf_event_open() it (either for one or all cpus) and
store FD into perf_event_array (similar to bpf_perf_event_read() helper)
before eBPF program can send data into it.

Today the programs triggered by kprobe collect the data and either store
it into the maps or print it via bpf_trace_printk() where latter is the debug
facility and not suitable to stream the data. This new helper replaces
such bpf_trace_printk() usage and allows programs to have dedicated
channel into user space for post-processing of the raw data collected.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a43eec30

perf: pad raw data samples automatically · fa128e6a

由 Alexei Starovoitov 提交于 10月 20, 2015

Instead of WARN_ON in perf_event_output() on unpaded raw samples,
pad them automatically.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fa128e6a

ipvlan: read direct ifindex instead of iflink · 63b11e75

由 Brenden Blanco 提交于 10月 20, 2015

In the ipv4 outbound path of an ipvlan device in l3 mode, the ifindex is
being grabbed from dev_get_iflink. This works for the physical device
case, since as the documentation of that function notes: "Physical
interfaces have the same 'ifindex' and 'iflink' values.".  However, if
the master device is a veth, and the pairs are in separate net
namespaces, the route lookup will fail with -ENODEV due to outer veth
pair being in a separate namespace from the ipvlan master/routing
namespace.

  ns0    |   ns1    |   ns2
 veth0a--|--veth0b--|--ipvl0

In ipvlan_process_v4_outbound(), a packet sent from ipvl0 in the above
configuration will pass fl.flowi4_oif == veth0a to
ip_route_output_flow(), but *net == ns1.

Notice also that ipv6 processing is not using iflink. Since there is a
discrepancy in usage, fixup both v4 and v6 case to use local dev
variable.

Tested this with l3 ipvlan on top of veth, as well as with single
physical interface in the top namespace.
Signed-off-by: NBrenden Blanco <bblanco@plumgrid.com>
Reviewed-by: NJiri Benc <jbenc@redhat.com>
Acked-by: NMahesh Bandewar <maheshb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

63b11e75

tcp: fastopen: limit max_qlen · dbf650b6

由 Eric Dumazet 提交于 10月 20, 2015

Allowing an application to set whatever limit for
the list of recently RST fastopen sessions [1] is not wise,
as it open ways to deplete kernel memory.

Cap the user provided limit by somaxconn sysctl,
like listen() backlog.

[1] https://tools.ietf.org/html/rfc7413#section-5.1Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dbf650b6

net: mdio-gpio: move platform data header · e2aacd96

由 Vivien Didelot 提交于 10月 20, 2015

This header file only contains the platform data structure definition,
so move it to the include/linux/platform_data/ directory.
Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e2aacd96

ARM: gemini: remove unnecessary mdio-gpio includes · 844338e5

由 Vivien Didelot 提交于 10月 20, 2015

Remove the inclusion of linux/mdio-gpio.h in nas4220b, wbd111 and wbd222
boards since mdio-gpio is not used.
Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

844338e5

net: hisilicon: fix ptr_ret.cocci warnings · c6aa74d5

由 Wu Fengguang 提交于 10月 20, 2015

drivers/net/ethernet/hisilicon/hns/hnae.c:442:1-3: WARNING: PTR_ERR_OR_ZERO can be used

 Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci

CC: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c6aa74d5

ipv6: gro: support sit protocol · feec0cb3

由 Eric Dumazet 提交于 10月 19, 2015

Tom Herbert added SIT support to GRO with commit
19424e05 ("sit: Add gro callbacks to sit_offload"),
later reverted by Herbert Xu.

The problem came because Tom patch was building GRO
packets without proper meta data : If packets were locally
delivered, we would not care.

But if packets needed to be forwarded, GSO engine was not
able to segment individual segments.

With the following patch, we correctly set skb->encapsulation
and inner network header. We also update gso_type.

Tested:

Server :
netserver
modprobe dummy
ifconfig dummy0 8.0.0.1 netmask 255.255.255.0 up
arp -s 8.0.0.100 4e:32:51:04:47:e5
iptables -I INPUT -s 10.246.7.151 -j TEE --gateway 8.0.0.100
ifconfig sixtofour0
sixtofour0 Link encap:IPv6-in-IPv4
          inet6 addr: 2002:af6:798::1/128 Scope:Global
          inet6 addr: 2002:af6:798::/128 Scope:Global
          UP RUNNING NOARP  MTU:1480  Metric:1
          RX packets:411169 errors:0 dropped:0 overruns:0 frame:0
          TX packets:409414 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:20319631739 (20.3 GB)  TX bytes:29529556 (29.5 MB)

Client :
netperf -H 2002:af6:798::1 -l 1000 &

Checked on server traffic copied on dummy0 and verify segments were
properly rebuilt, with proper IP headers, TCP checksums...

tcpdump on eth0 shows proper GRO aggregation takes place.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

feec0cb3

net: dummy: add more features · 8f3af277

由 Eric Dumazet 提交于 10月 19, 2015

While testing my SIT/GRO patch using netfilter TEE module and a dummy
device, I found some features were missing :

TSO IPv6, UFO, and encapsulated traffic.

ethtool -k dummy0 now gives :
...
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: on
	tx-tcp6-segmentation: on
udp-fragmentation-offload: on
...
tx-gre-segmentation: on
tx-ipip-segmentation: on
tx-sit-segmentation: on
tx-udp_tnl-segmentation: on
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8f3af277

netlink: Rightsize IFLA_AF_SPEC size calculation · b1974ed0

由 Arad, Ronen 提交于 10月 19, 2015

if_nlmsg_size() overestimates the minimum allocation size of netlink
dump request (when called from rtnl_calcit()) or the size of the
message (when called from rtnl_getlink()). This is because
ext_filter_mask is not supported by rtnl_link_get_af_size() and
rtnl_link_get_size().

The over-estimation is significant when at least one netdev has many
VLANs configured (8 bytes for each configured VLAN).

This patch-set "rightsizes" the protocol specific attribute size
calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
and adding this a argument to get_link_af_size op in rtnl_af_ops.

Bridge module already used filtering aware sizing for notifications.
br_get_link_af_size_filtered() is consistent with the modified
get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
br_get_link_af_size() becomes unused and thus removed.
Signed-off-by: NRonen Arad <ronen.arad@intel.com>
Acked-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b1974ed0

21 10月, 2015 4 次提交

Adding switchdev ageing notification on port bridged · 6ac311ae

由 Elad Raz 提交于 10月 19, 2015

Configure ageing time to the HW for newly bridged device

CC: Scott Feldman <sfeldma@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: NElad Raz <eladr@mellanox.com>
Acked-by: NJiri Pirko <jiri@mellanox.com>
Acked-by: NScott Feldman <sfeldma@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6ac311ae

Merge branch 'tcp-rack' · eb9fae32

由 David S. Miller 提交于 10月 21, 2015

Yuchung Cheng says:

====================
RACK loss detection

RACK (Recent ACK) loss recovery uses the notion of time instead of
packet sequence (FACK) or counts (dupthresh).

It's inspired by the FACK heuristic in tcp_mark_lost_retrans(): when a
limited transmit (new data packet) is sacked in recovery, then any
retransmission sent before that newly sacked packet was sent must have
been lost, since at least one round trip time has elapsed.

But that existing heuristic from tcp_mark_lost_retrans()
has several limitations:
  1) it can't detect tail drops since it depends on limited transmit
  2) it's disabled upon reordering (assumes no reordering)
  3) it's only enabled in fast recovery but not timeout recovery

RACK addresses these limitations with a core idea: an unacknowledged
packet P1 is deemed lost if a packet P2 that was sent later is is
s/acked, since at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when a later retransmission is
s/acked, while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering, similar to the delayed
Early Retransmit.

In the current patch set RACK is only a supplemental loss detection
and does not trigger fast recovery. However we are developing RACK
to replace or consolidate FACK/dupthresh, early retransmit, and
thin-dupack. These heuristics all implicitly bear the time notion.
For example, the delayed Early Retransmit is simply applying RACK
to trigger the fast recovery with small inflight.

RACK requires measuring the minimum RTT. Tracking a global min is less
robust due to traffic engineering pathing changes. Therefore it uses a
windowed filter by Kathleen Nichols. The min RTT can also be useful
for various other purposes like congestion control or stat monitoring.

This patch has been used on Google servers for well over 1 year. RACK
has also been implemented in the QUIC protocol. We are submitting an
IETF draft as well.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eb9fae32

tcp: use RACK to detect losses · 4f41b1c5

由 Yuchung Cheng 提交于 10月 16, 2015

This patch implements the second half of RACK that uses the the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if an not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(<3 pkts) on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property but SRTT can be inflated by
self-inflicated congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.

Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.

We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments of loss recoveries. For now RACK can be disabled by
setting it to 0.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4f41b1c5

tcp: track the packet timings in RACK · 659a8ad5

由 Yuchung Cheng 提交于 10月 16, 2015

This patch is the first half of the RACK loss recovery.

RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must been lost,
since at least one round trip time has elapsed.

But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery ut not timeout recovery

RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering.

This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.

Consider an example that the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.

We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.

The second half is implemented in the next patch that marks packet
lost using RACK timestamp.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

659a8ad5

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功