1. 14 Jan 2014, 2 commits
    • ipv4: introduce hardened ip_no_pmtu_disc mode · 8ed1dc44
      Authored by Hannes Frederic Sowa
      This new ip_no_pmtu_disc mode only allows fragmentation-needed errors
      to be honored by protocols which do more stringent validation of the
      ICMP packet's payload. This knob is useful for people who, e.g., want to
      run an unmodified DNS server in a namespace where they need to use PMTU
      for TCP connections (as those are used for zone transfers or as a fallback
      for requests) but don't want to trust possibly spoofed UDP PMTU information.
      
      Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
      if the returned packet is in the window or if the association is valid.
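
      As a hedged illustration, the knob could be flipped per namespace much
      like the other pmtu sysctls (assuming the hardened behaviour is the
      mode documented as value 3 in ip-sysctl.txt):

        # inside the namespace running the unmodified DNS server
        echo 3 > /proc/sys/net/ipv4/ip_no_pmtu_disc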
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Suggested-by: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ed1dc44
    • ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing · f87c10a8
      Authored by Hannes Frederic Sowa
      While forwarding, we should not use the protocol path MTU to calculate
      the MTU for a forwarded packet; instead we should use the interface MTU.
      
      We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
      introduced for multicast forwarding. But as it does not conflict with
      our usage in the unicast code path, it is perfect for reuse.
      
      I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
      along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
      dependencies because of IPSKB_FORWARDED.
      
      Because someone might have written software which probes destinations
      manually and expects the kernel to honour those path MTUs, I introduced
      a new per-namespace "ip_forward_use_pmtu" knob so that this new
      behaviour can be disabled. We also still use MTUs which are locked on a
      route for forwarding.
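
      As a sketch (assuming the knob lands at the usual per-namespace sysctl
      path), the old behaviour of honouring path MTUs while forwarding could
      be restored with:

        # honour learned/injected path MTUs on the forwarding path again
        echo 1 > /proc/sys/net/ipv4/ip_forward_use_pmtu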
      
      The reason for this change is that path MTU information can be injected
      into the kernel via e.g. the icmp_err protocol handler without validation
      against local sockets. As such, this could cause the IPv4 forwarding path
      to wrongfully emit fragmentation-needed notifications or start to
      fragment packets along a path.
      
      Tunnel and IPsec output paths clear the IPCB again, thus IPSKB_FORWARDED
      won't be set and further fragmentation logic will use the path MTU to
      determine the fragmentation size. They also recheck packet size with
      the help of path MTU discovery and report appropriate errors.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f87c10a8
  2. 09 Jan 2014, 1 commit
    • batman-adv: add isolation_mark sysfs attribute · c42edfe3
      Authored by Antonio Quartulli
      This attribute can be used to set and read the value and the
      mask of the skb mark which will be used to classify the
      source non-mesh client as ISOLATED. In this way a client can
      be advertised as such and the mark can potentially be
      restored at the receiving node before delivering the skb.
      
      This can be helpful for creating network-wide netfilter
      policies.
      
      This sysfs file expects a string of the form "$mark/$mask",
      where $mark is a 32-bit number in any base and $mask is a
      32-bit mask expressed in hex. Only bits in $mark covered
      by the mask are actually stored.
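
      As a hedged example (assuming the attribute sits in the usual
      batman-adv mesh sysfs directory and the mesh interface is named bat0):

        # advertise clients marked with bit 0x40 as isolated
        echo "0x40/0x40" > /sys/class/net/bat0/mesh/isolation_mark
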
      Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
      c42edfe3
  3. 08 Jan 2014, 1 commit
  4. 04 Jan 2014, 1 commit
    • netfilter: x_tables: lightweight process control group matching · 82a37132
      Authored by Daniel Borkmann
      It would be useful, e.g. in a server or desktop environment, to have
      a facility for fine-grained "per application" or "per application
      group" firewall policies. Users in the mobile/embedded area (e.g.
      Android based) with different security policy requirements for
      application groups could benefit greatly from that as well. For
      example, with a little bit of configuration effort, an admin could
      whitelist well-known applications, and thus block otherwise unwanted
      "hard-to-track" applications like [1] from a user's machine. Blocking
      is just one example; netfilter allows us many scenarios/policies
      beyond that, e.g. fine-grained settings for where applications are
      allowed to connect/send traffic to, application traffic
      marking/conntracking, application-specific packet mangling, and so on.
      
      Implementing PID-based matching would not be appropriate as PIDs
      frequently change, and child tracking would make that even more
      complex and ugly. Cgroups are a perfect candidate for accomplishing
      this, as they associate a set of tasks with a set of parameters for
      one or more subsystems, in our case the netfilter subsystem, which,
      of course, can be combined with other cgroup subsystems into
      something more complex if needed.
      
      As mentioned, to overcome this constraint, such processes could
      be placed into one or multiple cgroups where different fine-grained
      rules can be defined depending on the application scenario, while
      e.g. everything else that is not part of that could be dropped (or
      vice versa), thus making life harder for unwanted processes to
      communicate with the outside world. So we make use of cgroups here
      to track jobs and limit their resources in terms of iptables
      policies; in other words, limiting and tracking what they are
      allowed to communicate.
      
      In our case we're working on outgoing traffic based on the local
      socket it originated from. Also, one doesn't even need a priori
      knowledge of the application internals regarding their particular
      use of ports or protocols. Matching is *extremely* lightweight as
      we just test for the sk_classid marker of sockets, originating from
      net_cls. net_cls and netfilter do not contradict each other; in
      fact, each construct can live standalone or they can be used in
      combination with each other, which is perfectly fine, plus it
      serves Tejun's requirement not to introduce a new cgroups
      subsystem. The result is a very minimal and efficient module that
      adds nothing except netfilter code.
      
      One possible, minimal usage example (many other iptables options
      can obviously be applied):
      
       1) Configuring cgroups if not already done, e.g.:
      
        mkdir /sys/fs/cgroup/net_cls
        mount -t cgroup -o net_cls net_cls /sys/fs/cgroup/net_cls
        mkdir /sys/fs/cgroup/net_cls/0
        echo 1 > /sys/fs/cgroup/net_cls/0/net_cls.classid
        (resp. a real flow handle id for tc)
      
       2) Configuring netfilter (iptables-nftables), e.g.:
      
        iptables -A OUTPUT -m cgroup ! --cgroup 1 -j DROP
      
       3) Running applications, e.g.:
      
        ping 208.67.222.222  <pid:1799>
        echo 1799 > /sys/fs/cgroup/net_cls/0/tasks
        64 bytes from 208.67.222.222: icmp_seq=44 ttl=49 time=11.9 ms
        [...]
        ping 208.67.220.220  <pid:1804>
        ping: sendmsg: Operation not permitted
        [...]
        echo 1804 > /sys/fs/cgroup/net_cls/0/tasks
        64 bytes from 208.67.220.220: icmp_seq=89 ttl=56 time=19.0 ms
        [...]
      
      Of course, real-world deployments would make use of the cgroups
      userspace tool suite, or their own custom policy daemons that
      dynamically move applications from/to various cgroups.
      
        [1] http://www.blackhat.com/presentations/bh-europe-06/bh-eu-06-biondi/bh-eu-06-biondi-up.pdf
      
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: cgroups@vger.kernel.org
      Acked-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      82a37132
  5. 01 Jan 2014, 1 commit
  6. 31 Dec 2013, 1 commit
  7. 22 Dec 2013, 2 commits
  8. 21 Dec 2013, 2 commits
  9. 19 Dec 2013, 2 commits
  10. 18 Dec 2013, 1 commit
  11. 17 Dec 2013, 2 commits
  12. 16 Dec 2013, 1 commit
  13. 13 Dec 2013, 1 commit
  14. 12 Dec 2013, 2 commits
  15. 11 Dec 2013, 1 commit
  16. 10 Dec 2013, 4 commits
    • net: phy: consolidate PHY reset in phy_init_hw() · 87aa9f9c
      Authored by Florian Fainelli
      Quite a lot of drivers touch a PHY device's MII_BMCR register to
      reset the PHY without taking care of:
      
      1) ensuring that BMCR_RESET is cleared within a given timeout
      2) the PHY state machine resuming to the proper state and re-applying
      potentially changed settings such as auto-negotiation
      
      Introduce phy_poll_reset(), which polls MII_BMCR until the BMCR_RESET
      bit is cleared, or returns a timeout error code after a given
      timeout.
      
      In order to make sure the PHY is in a correct state, phy_init_hw() first
      issues a software reset through MII_BMCR and then applies any fixups.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      87aa9f9c
    • packet: introduce PACKET_QDISC_BYPASS socket option · d346a3fa
      Authored by Daniel Borkmann
      This patch introduces a PACKET_QDISC_BYPASS socket option that
      allows using an xmit() function similar to pktgen's instead
      of taking the dev_queue_xmit() path. This can be very useful when
      PF_PACKET applications need to be used in a scenario similar to
      pktgen's, but with a fully flexible packet payload that needs to
      be provided, for example.
      
      By default, nothing changes in behaviour for normal PF_PACKET
      TX users, so everything stays as is for applications. New users,
      however, can now set PACKET_QDISC_BYPASS if needed to i) prevent
      their own packets from re-entering packet_rcv() and ii) push the
      frame directly to the driver.
      
      In doing so we can increase pps (here with 64-byte packets) for
      PF_PACKET a bit:
      
        # CPUs -- QDISC_BYPASS   -- qdisc path -- qdisc path[**]
        1 CPU  ==  1,509,628 pps --  1,208,708 --  1,247,436
        2 CPUs ==  3,198,659 pps --  2,536,012 --  1,605,779
        3 CPUs ==  4,787,992 pps --  3,788,740 --  1,735,610
        4 CPUs ==  6,173,956 pps --  4,907,799 --  1,909,114
        5 CPUs ==  7,495,676 pps --  5,956,499 --  2,014,422
        6 CPUs ==  9,001,496 pps --  7,145,064 --  2,155,261
        7 CPUs == 10,229,776 pps --  8,190,596 --  2,220,619
        8 CPUs == 11,040,732 pps --  9,188,544 --  2,241,879
        9 CPUs == 12,009,076 pps -- 10,275,936 --  2,068,447
       10 CPUs == 11,380,052 pps -- 11,265,337 --  1,578,689
       11 CPUs == 11,672,676 pps -- 11,845,344 --  1,297,412
       [...]
       20 CPUs == 11,363,192 pps -- 11,014,933 --  1,245,081
      
       [**]: qdisc path with packet_rcv(), which is probably how most
             people use it (hopefully not anymore if not needed)
      
      The test was done using a modified trafgen, sending a simple
      static 64-byte packet, on all CPUs. The trick in the fast
      "qdisc path" case is to avoid re-entering packet_rcv() by
      setting the RAW socket protocol to zero, like:
      socket(PF_PACKET, SOCK_RAW, 0);
      
      Tradeoffs are documented in this patch as well: clearly, if
      queues are busy we will drop more packets, tc disciplines are
      ignored, and these packets are no longer visible to taps. For
      a pktgen-like scenario, we argue that this is acceptable.
      
      The pointer to the xmit function has been placed in a packet
      socket structure hole between cached_dev and prot_hook, which
      is hot anyway as we're working on cached_dev in each send path.
      
      Done in joint work with Jesper Dangaard Brouer.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d346a3fa
    • packet: fix send path when running with proto == 0 · 66e56cd4
      Authored by Daniel Borkmann
      Commit e40526cb introduced a cached dev pointer that gets
      updated in register_prot_hook()/__unregister_prot_hook() to
      track the device used for the send path.
      
      We need to fix this up, as otherwise it will not work with
      sockets created with protocol = 0 and with sll_protocol = 0
      passed via sockaddr_ll when doing the bind.
      
      So instead, assign the pointer directly. The compiler can inline
      these helper functions automagically.
      
      While at it, also mark the cached dev fast path as likely(),
      and document this variant of socket creation, as it seems it is
      not widely used (it seems not even the author of TX_RING was aware
      of it in his reference example [1]). Tested with the reproducer
      from e40526cb.
      
       [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap#Example
      
      Fixes: e40526cb ("packet: fix use after free race in send path when dev is released")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Tested-by: Salam Noureddine <noureddine@aristanetworks.com>
      Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      66e56cd4
    • [media] videobuf2: Add support for file access mode flags for DMABUF exporting · c1b96a23
      Authored by Philipp Zabel
      Currently it is not possible for userspace to map a DMABUF-exported
      buffer with write permissions. This patch allows O_RDONLY/O_RDWR to
      also be passed when exporting the buffer, so that userspace may map
      it with write permissions.
      Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
      Signed-off-by: Sylwester Nawrocki <s.nawrocki@samsung.com>
      Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
      c1b96a23
  17. 07 Dec 2013, 3 commits
    • ether_addr_equal: Optimize implementation, remove unused compare_ether_addr · 0d74c42f
      Authored by Joe Perches
      Add a new check for CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to reduce
      the number of ORs used in the ether_addr_equal comparison and very
      slightly improve function performance.
      
      Simplify the ether_addr_equal_64bits implementation.
      Integrate and remove the zap_last_2bytes helper as it's now
      used only once.
      
      Remove the now unused compare_ether_addr function.
      
      Update the unaligned-memory-access documentation to remove the
      compare_ether_addr description and show how unaligned accesses
      could occur with ether_addr_equal.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d74c42f
    • Documentation: update Ethernet PHY devices binding with 'max-speed' · 9f2b0936
      Authored by Florian Fainelli
      The 'max-speed' property is optional but defined in the ePAPR
      specification and now supported by the Linux Device Tree parsing
      infrastructure.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f2b0936
    • tcp: auto corking · f54b3111
      Authored by Eric Dumazet
      With the introduction of TCP Small Queues, TSO auto sizing, and TCP
      pacing, we can implement Automatic Corking in the kernel, to help
      applications doing small write()/sendmsg() to TCP sockets.
      
      The idea is to change tcp_push() to check whether the current skb
      payload is under the skb's optimal size (a multiple of MSS bytes).
      
      If under 'size_goal', and at least one packet is still in Qdisc or
      NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
      will be delayed up to TX completion time.
      
      This delay might allow the application to coalesce more bytes
      in the skb in following write()/sendmsg()/sendfile() system calls.
      
      The exact duration of the delay depends on the dynamics
      of the system, and might be zero if no packet for this flow
      is actually held in the Qdisc or NIC TX ring.
      
      Using FQ/pacing is a way to increase the probability of
      autocorking being triggered.
      
      Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
      this feature, defaulting to 1 (enabled).
      
      Add a new SNMP counter: nstat -a | grep TcpExtTCPAutoCorking
      This counter is incremented every time we detect that an skb was
      underused and its flush was deferred.
      
      Tested:
      
      There are interesting effects when using line-buffered commands under ssh.
      
      Excellent performance results in terms of CPU usage and total throughput.
      
      lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
      lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
      9410.39
      
       Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
      
            35209.439626 task-clock                #    2.901 CPUs utilized
                   2,294 context-switches          #    0.065 K/sec
                     101 CPU-migrations            #    0.003 K/sec
                   4,079 page-faults               #    0.116 K/sec
          97,923,241,298 cycles                    #    2.781 GHz                     [83.31%]
          51,832,908,236 stalled-cycles-frontend   #   52.93% frontend cycles idle    [83.30%]
          25,697,986,603 stalled-cycles-backend    #   26.24% backend  cycles idle    [66.70%]
         102,225,978,536 instructions              #    1.04  insns per cycle
                                                   #    0.51  stalled cycles per insn [83.38%]
          18,657,696,819 branches                  #  529.906 M/sec                   [83.29%]
              91,679,646 branch-misses             #    0.49% of all branches         [83.40%]
      
            12.136204899 seconds time elapsed
      
      lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
      lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
      6624.89
      
       Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
            40045.864494 task-clock                #    3.301 CPUs utilized
                     171 context-switches          #    0.004 K/sec
                      53 CPU-migrations            #    0.001 K/sec
                   4,080 page-faults               #    0.102 K/sec
         111,340,458,645 cycles                    #    2.780 GHz                     [83.34%]
          61,778,039,277 stalled-cycles-frontend   #   55.49% frontend cycles idle    [83.31%]
          29,295,522,759 stalled-cycles-backend    #   26.31% backend  cycles idle    [66.67%]
         108,654,349,355 instructions              #    0.98  insns per cycle
                                                   #    0.57  stalled cycles per insn [83.34%]
          19,552,170,748 branches                  #  488.244 M/sec                   [83.34%]
             157,875,417 branch-misses             #    0.81% of all branches         [83.34%]
      
            12.130267788 seconds time elapsed
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f54b3111
  18. 06 Dec 2013, 1 commit
    • net: davinci_emac: Fix platform data handling and make usable for am3517 · dd0df47d
      Authored by Tony Lindgren
      When booted with device tree, we may still have platform data passed
      as auxdata. For am3517 this is needed for passing the interrupt_enable
      and interrupt_disable callbacks that access the omap system control module
      registers. These callback functions will eventually go away when we have
      a separate system control module driver.
      
      Some of the things that are currently passed as platform data don't
      need to be set up as device tree properties, as they are always the
      same on am3517. So let's use a new compatible flag for those so we
      can get them from the device tree match data.
      
      Also note that we need to fix the setting of phy_dev to NULL instead
      of an empty string, as the code later on uses that to find the first
      PHY on the MDIO bus. This seems to have been caused by 5d69e007
      ("net: davinci_emac: switch to new mdio").
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd0df47d
  19. 03 Dec 2013, 10 commits
  20. 02 Dec 2013, 1 commit
    • KEYS: Fix multiple key add into associative array · 23fd78d7
      Authored by David Howells
      If sufficient keys (or keyrings) are added into a keyring such that a node in
      the associative array's tree overflows (each node has a capacity N, currently
      16) and such that all N+1 keys have the same index key segment for that level
      of the tree (the level'th nibble of the index key), then assoc_array_insert()
      calls ops->diff_objects() to indicate at which bit position the two index keys
      vary.
      
      However, __key_link_begin() passes a NULL object to assoc_array_insert()
      with the intention of supplying the correct pointer later, before we
      commit the change. This means that keyring_diff_objects() is given a
      NULL pointer as one of its arguments, which it does not expect. This
      results in an oops like the one attached.
      
      With the previous patch to fix the keyring hash function, this can be forced
      much more easily by creating a keyring and only adding keyrings to it.  Add any
      other sort of key and a different insertion path is taken - all 16+1 objects
      must want to cluster in the same node slot.
      
      This can be tested by:
      
      	r=`keyctl newring sandbox @s`
      	for ((i=0; i<=16; i++)); do keyctl newring ring$i $r; done
      
      This should work fine, but oopses when the 17th keyring is added.
      
      Since ops->diff_objects() is always called with the first pointer pointing to
      the object to be inserted (ie. the NULL pointer), we can fix the problem by
      changing the to-be-inserted object pointer to point to the index key passed
      into assoc_array_insert() instead.
      
      Whilst we're at it, we also switch the arguments so that they are the same as
      for ->compare_object().
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
      IP: [<ffffffff81191ee4>] hash_key_type_and_desc+0x18/0xb0
      ...
      RIP: 0010:[<ffffffff81191ee4>] hash_key_type_and_desc+0x18/0xb0
      ...
      Call Trace:
       [<ffffffff81191f9d>] keyring_diff_objects+0x21/0xd2
       [<ffffffff811f09ef>] assoc_array_insert+0x3b6/0x908
       [<ffffffff811929a7>] __key_link_begin+0x78/0xe5
       [<ffffffff81191a2e>] key_create_or_update+0x17d/0x36a
       [<ffffffff81192e0a>] SyS_add_key+0x123/0x183
       [<ffffffff81400ddb>] tracesys+0xdd/0xe2
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Stephen Gallagher <sgallagh@redhat.com>
      23fd78d7