- 09 6月, 2014 1 次提交
-
-
由 Eric Dumazet 提交于
unregister_netdevice_many() API is error prone and we had too many bugs because of dangling LIST_HEAD on stacks. See commit f87e6f47 ("net: dont leave active on stack LIST_HEAD") In fact, instead of making sure no caller leaves an active list_head, just force a list_del() in the callee. No one seems to need to access the list after unregister_netdevice_many() Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 6月, 2014 1 次提交
-
-
由 Alexei Starovoitov 提交于
BPF classic->internal converter broke SKF_AD_PKTTYPE extension, since pkt_type_offset() was failing to find skb->pkt_type field which is defined as: __u8 pkt_type:3, fclone:2, ipvs_property:1, peeked:1, nf_trace:1; Fix it by searching for 3 most significant bits and shift them by 5 at run-time Fixes: bd4cf0ed ("net: filter: rework/optimize internal BPF interpreter's instruction set") Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Acked-by: NDaniel Borkmann <dborkman@redhat.com> Tested-by: NDaniel Borkmann <dborkman@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 6月, 2014 1 次提交
-
-
由 Cong Wang 提交于
It is possible that ->newlink() fails before registering the device, in this case we should just free it, it's safe to call free_netdev(). Fixes: commit 0e0eee24 (net: correct error path in rtnl_newlink()) Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NCong Wang <cwang@twopensource.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 6月, 2014 1 次提交
-
-
由 Leon Yu 提交于
__sk_prepare_filter() was reworked in commit bd4cf0ed (net: filter: rework/optimize internal BPF interpreter's instruction set) so that it should have uncharged memory once things went wrong. However that work isn't complete. Error is handled only in __sk_migrate_filter() while memory can still leak in the error path right after sk_chk_filter(). Fixes: bd4cf0ed ("net: filter: rework/optimize internal BPF interpreter's instruction set") Signed-off-by: NLeon Yu <chianglungyu@gmail.com> Acked-by: NAlexei Starovoitov <ast@plumgrid.com> Tested-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 6月, 2014 1 次提交
-
-
由 Nikolay Aleksandrov 提交于
After 1e785f48 ("net: Start with correct mac_len in skb_network_protocol") skb->mac_len is used as a start of the calculation in skb_network_protocol() but that is not always correct. If skb->protocol == 8021Q/AD, usually the vlan header is already inserted in the skb (i.e. vlan reorder hdr == 0). Usually when the packet enters dev_hard_xmit it has mac_len == 0 so we take 2 bytes from the destination mac address (skb->data + VLAN_HLEN) as a type in skb_network_protocol() and return vlan_depth == 4. In the case where TSO is off, then the mac_len is set but it's == 18 (ETH_HLEN + VLAN_HLEN), so skb_network_protocol() returns a type from inside the packet and offset == 22. Also make vlan_depth unsigned as suggested before. As suggested by Eric Dumazet, move the while() loop in the if() so we can avoid additional testing in fast path. Here are few netperf tests + debug printk's to illustrate: cat netperf.tso-on.reorder-on.bugged - Vlan -> device (reorder on, default, this case is okay) MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.1 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 7111.54 [ 81.605435] skb->len 65226 skb->gso_size 1448 skb->proto 0x800 skb->mac_len 0 vlan_depth 0 type 0x800 - Vlan -> device (reorder off, bad) cat netperf.tso-on.reorder-off.bugged MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.1 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 241.35 [ 204.578332] skb->len 1518 skb->gso_size 0 skb->proto 0x8100 skb->mac_len 0 vlan_depth 4 type 0x5301 0x5301 are the last two bytes of the destination mac. And if we stop TSO, we may get even the following: [ 83.343156] skb->len 2966 skb->gso_size 1448 skb->proto 0x8100 skb->mac_len 18 vlan_depth 22 type 0xb84 Because mac_len already accounts for VLAN_HLEN. After the fix: cat netperf.tso-on.reorder-off.fixed MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.1 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.01 5001.46 [ 81.888489] skb->len 65230 skb->gso_size 1448 skb->proto 0x8100 skb->mac_len 0 vlan_depth 18 type 0x800 CC: Vlad Yasevich <vyasevic@redhat.com> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Daniel Borkman <dborkman@redhat.com> CC: David S. Miller <davem@davemloft.net> Fixes:1e785f48 ("net: Start with correct mac_len in skb_network_protocol") Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 5月, 2014 4 次提交
-
-
由 Vlad Yasevich 提交于
Prior to commit fbd929f2 bonding: support QinQ for bond arp interval the arp monitoring code allowed for proper detection of devices stacked on top of vlans. Since the above commit, the code can still detect a device stacked on top of single vlan, but not a device stacked on top of Q-in-Q configuration. The search will only set the inner vlan tag if the route device is the vlan device. However, this is not always the case, as it is possible to extend the stacked configuration. With this patch it is possible to provision devices on top Q-in-Q vlan configuration that should be used as a source of ARP monitoring information. For example: ip link add link bond0 vlan10 type vlan proto 802.1q id 10 ip link add link vlan10 vlan100 type vlan proto 802.1q id 100 ip link add link vlan100 type macvlan Note: This patch limites the number of stacked VLANs to 2, just like before. The original, however had another issue in that if we had more then 2 levels of VLANs, we would end up generating incorrectly tagged traffic. This is no longer possible. Fixes: fbd929f2 (bonding: support QinQ for bond arp interval) CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@redhat.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: Ding Tianhong <dingtianhong@huawei.com> CC: Patric McHardy <kaber@trash.net> Signed-off-by: NVlad Yasevich <vyasevic@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vlad Yasevich 提交于
This reverts commit dc8eaaa0. vlan: Fix lockdep warning when vlan dev handle notification Instead we use the new new API to find the lock subclass of our vlan device. This way we can support configurations where vlans are interspersed with other devices: bond -> vlan -> macvlan -> vlan Signed-off-by: NVlad Yasevich <vyasevic@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vlad Yasevich 提交于
Multiple devices in the kernel can be stacked/nested and they need to know their nesting level for the purposes of lockdep. This patch provides a generic function that determines a nesting level of a particular device by its type (ex: vlan, macvlan, etc). We only care about nesting of the same type of devices. For example: eth0 <- vlan0.10 <- macvlan0 <- vlan1.20 The nesting level of vlan1.20 would be 1, since there is another vlan in the stack under it. Signed-off-by: NVlad Yasevich <vyasevic@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Starting from linux-3.13, GRO attempts to build full size skbs. Problem is the commit assumed one particular field in skb->cb[] was clean, but it is not the case on some stacked devices. Timo reported a crash in case traffic is decrypted before reaching a GRE device. Fix this by initializing NAPI_GRO_CB(skb)->last at the right place, this also removes one conditional. Thanks a lot to Timo for providing full reports and bisecting this. Fixes: 8a29111c ("net: gro: allow to build full sized skb") Bisected-by: NTimo Teras <timo.teras@iki.fi> Signed-off-by: NEric Dumazet <edumazet@google.com> Tested-by: NTimo Teräs <timo.teras@iki.fi> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 5月, 2014 1 次提交
-
-
由 Cong Wang 提交于
From: Cong Wang <cwang@twopensource.com> commit 50624c93 (net: Delay default_device_exit_batch until no devices are unregistering) introduced rtnl_lock_unregistering() for default_device_exit_batch(). Same race could happen we when rmmod a driver which calls rtnl_link_unregister() as we call dev->destructor without rtnl lock. For long term, I think we should clean up the mess of netdev_run_todo() and net namespce exit code. Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NCong Wang <cwang@twopensource.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 5月, 2014 2 次提交
-
-
由 Hannes Frederic Sowa 提交于
net_get_random_once depends on the static keys infrastructure to patch up the branch to the slow path during boot. This was realized by abusing the static keys api and defining a new initializer to not enable the call site while still indicating that the branch point should get patched up. This was needed to have the fast path considered likely by gcc. The static key initialization during boot up normally walks through all the registered keys and either patches in ideal nops or enables the jump site but omitted that step on x86 if ideal nops where already placed at static_key branch points. Thus net_get_random_once branches not always became active. This patch switches net_get_random_once to the ordinary static_key api and thus places the kernel fast path in the - by gcc considered - unlikely path. Microbenchmarks on Intel and AMD x86-64 showed that the unlikely path actually beats the likely path in terms of cycle cost and that different nop patterns did not make much difference, thus this switch should not be noticeable. Fixes: a48e4292 ("net: introduce new macro net_get_random_once") Reported-by: NTuomas Räsänen <tuomasjjrasanen@tjjr.fi> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Duan Jiong 提交于
Since commit 7e980569("ipv6: router reachability probing"), a router falls into NUD_FAILED will be probed. Now if function rt6_select() selects a router which neighbour state is NUD_FAILED, and at the same time function rt6_probe() changes the neighbour state to NUD_PROBE, then function dst_neigh_output() can directly send packets, but actually the neighbour still is unreachable. If we set nud_state to NUD_INCOMPLETE instead NUD_PROBE, packets will not be sent out until the neihbour is reachable. In addition, because the route should be probes with a single NS, so we must set neigh->probes to neigh_max_probes(), then the neigh timer timeout and function neigh_timer_handler() will not send other NS Messages. Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 5月, 2014 1 次提交
-
-
由 Florian Westphal 提交于
This reverts commit d2069403, there are no more callers. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 4月, 2014 5 次提交
-
-
由 David Gibson 提交于
Since 115c9b81 (rtnetlink: Fix problem with buffer allocation), RTM_NEWLINK messages only contain the IFLA_VFINFO_LIST attribute if they were solicited by a GETLINK message containing an IFLA_EXT_MASK attribute with the RTEXT_FILTER_VF flag. That was done because some user programs broke when they received more data than expected - because IFLA_VFINFO_LIST contains information for each VF it can become large if there are many VFs. However, the IFLA_VF_PORTS attribute, supplied for devices which implement ndo_get_vf_port (currently the 'enic' driver only), has the same problem. It supplies per-VF information and can therefore become large, but it is not currently conditional on the IFLA_EXT_MASK value. Worse, it interacts badly with the existing EXT_MASK handling. When IFLA_EXT_MASK is not supplied, the buffer for netlink replies is fixed at NLMSG_GOODSIZE. If the information for IFLA_VF_PORTS exceeds this, then rtnl_fill_ifinfo() returns -EMSGSIZE on the first message in a packet. netlink_dump() will misinterpret this as having finished the listing and omit data for this interface and all subsequent ones. That can cause getifaddrs(3) to enter an infinite loop. This patch addresses the problem by only supplying IFLA_VF_PORTS when IFLA_EXT_MASK is supplied with the RTEXT_FILTER_VF flag set. Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au> Reviewed-by: NJiri Pirko <jiri@resnulli.us> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Gibson 提交于
Without IFLA_EXT_MASK specified, the information reported for a single interface in response to RTM_GETLINK is expected to fit within a netlink packet of NLMSG_GOODSIZE. If it doesn't, however, things will go badly wrong, When listing all interfaces, netlink_dump() will incorrectly treat -EMSGSIZE on the first message in a packet as the end of the listing and omit information for that interface and all subsequent ones. This can cause getifaddrs(3) to enter an infinite loop. This patch won't fix the problem, but it will WARN_ON() making it easier to track down what's going wrong. Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au> Reviewed-by: NJiri Pirko <jpirko@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric W. Biederman 提交于
It is possible by passing a netlink socket to a more privileged executable and then to fool that executable into writing to the socket data that happens to be valid netlink message to do something that privileged executable did not intend to do. To keep this from happening replace bare capable and ns_capable calls with netlink_capable, netlink_net_calls and netlink_ns_capable calls. Which act the same as the previous calls except they verify that the opener of the socket had the desired permissions as well. Reported-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric W. Biederman 提交于
sk_net_capable - The common case, operations that are safe in a network namespace. sk_capable - Operations that are not known to be safe in a network namespace sk_ns_capable - The general case for special cases. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric W. Biederman 提交于
The permission check in sock_diag_put_filterinfo is wrong, and it is so removed from it's sources it is not clear why it is wrong. Move the computation into packet_diag_dump and pass a bool of the result into sock_diag_filterinfo. This does not yet correct the capability check but instead simply moves it to make it clear what is going on. Reported-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 4月, 2014 1 次提交
-
-
由 Alexei Starovoitov 提交于
exisiting BPF verifier allows uninitialized access to registers, 'ret A' is considered to be a valid filter. So initialize A and X to zero to prevent leaking kernel memory In the future BPF verifier will be rejecting such filters Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Cc: Daniel Borkmann <dborkman@redhat.com> Acked-by: NDaniel Borkmann <dborkman@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 23 4月, 2014 1 次提交
-
-
由 Andrew Lutomirski 提交于
The caller needs capabilities on the namespace being queried, not on their own namespace. This is a security bug, although it likely has only a minor impact. Cc: stable@vger.kernel.org Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 4月, 2014 1 次提交
-
-
由 dingtianhong 提交于
When I open the LOCKDEP config and run these steps: modprobe 8021q vconfig add eth2 20 vconfig add eth2.20 30 ifconfig eth2 xx.xx.xx.xx then the Call Trace happened: [32524.386288] ============================================= [32524.386293] [ INFO: possible recursive locking detected ] [32524.386298] 3.14.0-rc2-0.7-default+ #35 Tainted: G O [32524.386302] --------------------------------------------- [32524.386306] ifconfig/3103 is trying to acquire lock: [32524.386310] (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff814275f4>] dev_mc_sync+0x64/0xb0 [32524.386326] [32524.386326] but task is already holding lock: [32524.386330] (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff8141af83>] dev_set_rx_mode+0x23/0x40 [32524.386341] [32524.386341] other info that might help us debug this: [32524.386345] Possible unsafe locking scenario: [32524.386345] [32524.386350] CPU0 [32524.386352] ---- [32524.386354] lock(&vlan_netdev_addr_lock_key/1); [32524.386359] lock(&vlan_netdev_addr_lock_key/1); [32524.386364] [32524.386364] *** DEADLOCK *** [32524.386364] [32524.386368] May be due to missing lock nesting notation [32524.386368] [32524.386373] 2 locks held by ifconfig/3103: [32524.386376] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff81431d42>] rtnl_lock+0x12/0x20 [32524.386387] #1: (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff8141af83>] dev_set_rx_mode+0x23/0x40 [32524.386398] [32524.386398] stack backtrace: [32524.386403] CPU: 1 PID: 3103 Comm: ifconfig Tainted: G O 3.14.0-rc2-0.7-default+ #35 [32524.386409] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [32524.386414] ffffffff81ffae40 ffff8800d9625ae8 ffffffff814f68a2 ffff8800d9625bc8 [32524.386421] ffffffff810a35fb ffff8800d8a8d9d0 00000000d9625b28 ffff8800d8a8e5d0 [32524.386428] 000003cc00000000 0000000000000002 ffff8800d8a8e5f8 0000000000000000 [32524.386435] Call Trace: [32524.386441] [<ffffffff814f68a2>] dump_stack+0x6a/0x78 [32524.386448] [<ffffffff810a35fb>] __lock_acquire+0x7ab/0x1940 [32524.386454] [<ffffffff810a323a>] ? __lock_acquire+0x3ea/0x1940 [32524.386459] [<ffffffff810a4874>] lock_acquire+0xe4/0x110 [32524.386464] [<ffffffff814275f4>] ? dev_mc_sync+0x64/0xb0 [32524.386471] [<ffffffff814fc07a>] _raw_spin_lock_nested+0x2a/0x40 [32524.386476] [<ffffffff814275f4>] ? dev_mc_sync+0x64/0xb0 [32524.386481] [<ffffffff814275f4>] dev_mc_sync+0x64/0xb0 [32524.386489] [<ffffffffa0500cab>] vlan_dev_set_rx_mode+0x2b/0x50 [8021q] [32524.386495] [<ffffffff8141addf>] __dev_set_rx_mode+0x5f/0xb0 [32524.386500] [<ffffffff8141af8b>] dev_set_rx_mode+0x2b/0x40 [32524.386506] [<ffffffff8141b3cf>] __dev_open+0xef/0x150 [32524.386511] [<ffffffff8141b177>] __dev_change_flags+0xa7/0x190 [32524.386516] [<ffffffff8141b292>] dev_change_flags+0x32/0x80 [32524.386524] [<ffffffff8149ca56>] devinet_ioctl+0x7d6/0x830 [32524.386532] [<ffffffff81437b0b>] ? dev_ioctl+0x34b/0x660 [32524.386540] [<ffffffff814a05b0>] inet_ioctl+0x80/0xa0 [32524.386550] [<ffffffff8140199d>] sock_do_ioctl+0x2d/0x60 [32524.386558] [<ffffffff81401a52>] sock_ioctl+0x82/0x2a0 [32524.386568] [<ffffffff811a7123>] do_vfs_ioctl+0x93/0x590 [32524.386578] [<ffffffff811b2705>] ? rcu_read_lock_held+0x45/0x50 [32524.386586] [<ffffffff811b39e5>] ? __fget_light+0x105/0x110 [32524.386594] [<ffffffff811a76b1>] SyS_ioctl+0x91/0xb0 [32524.386604] [<ffffffff815057e2>] system_call_fastpath+0x16/0x1b ======================================================================== The reason is that all of the addr_lock_key for vlan dev have the same class, so if we change the status for vlan dev, the vlan dev and its real dev will hold the same class of addr_lock_key together, so the warning happened. we should distinguish the lock depth for vlan dev and its real dev. v1->v2: Convert the vlan_netdev_addr_lock_key to an array of eight elements, which could support to add 8 vlan id on a same vlan dev, I think it is enough for current scene, because a netdev's name is limited to IFNAMSIZ which could not hold 8 vlan id, and the vlan dev would not meet the same class key with its real dev. The new function vlan_dev_get_lockdep_subkey() will return the subkey and make the vlan dev could get a suitable class key. v2->v3: According David's suggestion, I use the subclass to distinguish the lock key for vlan dev and its real dev, but it make no sense, because the difference for subclass in the lock_class_key doesn't mean that the difference class for lock_key, so I use lock_depth to distinguish the different depth for every vlan dev, the same depth of the vlan dev could have the same lock_class_key, I import the MAX_LOCK_DEPTH from the include/linux/sched.h, I think it is enough here, the lockdep should never exceed that value. v3->v4: Add a huge array of locking keys will waste static kernel memory and is not a appropriate method, we could use _nested() variants to fix the problem, calculate the depth for every vlan dev, and use the depth as the subclass for addr_lock_key. Signed-off-by: NDing Tianhong <dingtianhong@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 4月, 2014 1 次提交
-
-
由 Eric Dumazet 提交于
In the dst->output() path for ipv4, the code assumes the skb it has to transmit is attached to an inet socket, specifically via ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the provider of the packet is an AF_PACKET socket. The dst->output() method gets an additional 'struct sock *sk' parameter. This needs a cascade of changes so that this parameter can be propagated from vxlan to final consumer. Fixes: 8f646c92 ("vxlan: keep original skb ownership") Reported-by: Nlucien xin <lucien.xin@gmail.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 4月, 2014 2 次提交
-
-
由 Vlad Yasevich 提交于
Sometimes, when the packet arrives at skb_mac_gso_segment() its skb->mac_len already accounts for some of the mac lenght headers in the packet. This seems to happen when forwarding through and OpenSSL tunnel. When we start looking for any vlan headers in skb_network_protocol() we seem to ignore any of the already known mac headers and start with an ETH_HLEN. This results in an incorrect offset, dropped TSO frames and general slowness of the connection. We can start counting from the known skb->mac_len and return at least that much if all mac level headers are known and accounted for. Fixes: 53d6471c (net: Account for all vlan headers in skb_mac_gso_segment) CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Daniel Borkman <dborkman@redhat.com> Tested-by: NMartin Filip <nexus+kernel@smoula.net> Signed-off-by: NVlad Yasevich <vyasevic@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has been wrongly decoded by commit a8fc9277 ("sk-filter: Add ability to get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS although it should have been decoded as BPF_LD|BPF_W|BPF_ABS. In practice, this should not have much side-effect though, as such conversion is/was being done through prctl(2) PR_SET_SECCOMP. Reverse operation PR_GET_SECCOMP will only return the current seccomp mode, but not the filter itself. Since the transition to the new BPF infrastructure, it's also not used anymore, so we can simply remove this as it's unreachable. Fixes: a8fc9277 ("sk-filter: Add ability to get socket filter program (v2)") Signed-off-by: NDaniel Borkmann <dborkman@redhat.com> Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 4月, 2014 1 次提交
-
-
由 Mathias Krause 提交于
The BPF_S_ANC_NLATTR and BPF_S_ANC_NLATTR_NEST extensions fail to check for a minimal message length before testing the supplied offset to be within the bounds of the message. This allows the subtraction of the nla header to underflow and therefore -- as the data type is unsigned -- allowing far to big offset and length values for the search of the netlink attribute. The remainder calculation for the BPF_S_ANC_NLATTR_NEST extension is also wrong. It has the minuend and subtrahend mixed up, therefore calculates a huge length value, allowing to overrun the end of the message while looking for the netlink attribute. The following three BPF snippets will trigger the bugs when attached to a UNIX datagram socket and parsing a message with length 1, 2 or 3. ,-[ PoC for missing size check in BPF_S_ANC_NLATTR ]-- | ld #0x87654321 | ldx #42 | ld #nla | ret a `--- ,-[ PoC for the same bug in BPF_S_ANC_NLATTR_NEST ]-- | ld #0x87654321 | ldx #42 | ld #nlan | ret a `--- ,-[ PoC for wrong remainder calculation in BPF_S_ANC_NLATTR_NEST ]-- | ; (needs a fake netlink header at offset 0) | ld #0 | ldx #42 | ld #nlan | ret a `--- Fix the first issue by ensuring the message length fulfills the minimal size constrains of a nla header. Fix the second bug by getting the math for the remainder calculation right. Fixes: 4738c1db ("[SKFILTER]: Add SKF_ADF_NLATTR instruction") Fixes: d214c753 ("filter: add SKF_AD_NLATTR_NEST to look for nested..") Cc: Patrick McHardy <kaber@trash.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NMathias Krause <minipli@googlemail.com> Acked-by: NDaniel Borkmann <dborkman@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 4月, 2014 2 次提交
-
-
由 Daniel Borkmann 提交于
Similarly to commit 43279500 ("packet: respect devices with LLTX flag in direct xmit"), we can basically apply the very same to pktgen. This will help testing against LLTX devices such as dummy driver (or others), which only have a single netdevice txq and would otherwise require locking their txq from pktgen side while e.g. in dummy case, we would not need any locking. Fix this by making use of HARD_TX_{UN,}LOCK API, so that NETIF_F_LLTX will be respected. Signed-off-by: NDaniel Borkmann <dborkman@redhat.com> Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
Several spots in the kernel perform a sequence like: skb_queue_tail(&sk->s_receive_queue, skb); sk->sk_data_ready(sk, skb->len); But at the moment we place the SKB onto the socket receive queue it can be consumed and freed up. So this skb->len access is potentially to freed up memory. Furthermore, the skb->len can be modified by the consumer so it is possible that the value isn't accurate. And finally, no actual implementation of this callback actually uses the length argument. And since nobody actually cared about it's value, lots of call sites pass arbitrary values in such as '0' and even '1'. So just remove the length argument from the callback, that way there is no confusion whatsoever and all of these use-after-free cases get fixed as a side effect. Based upon a patch by Eric Dumazet and his suggestion to audit this issue tree-wide. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 4月, 2014 1 次提交
-
-
由 Florian Westphal 提交于
In case of tcp, gso_size contains the tcpmss. For UFO (udp fragmentation offloading) skbs, gso_size is the fragment payload size, i.e. we must not account for udp header size. Otherwise, when using virtio drivers, a to-be-forwarded UFO GSO packet will be needlessly fragmented in the forward path, because we think its individual segments are too large for the outgoing link. Fixes: fe6cc55f ("net: ip, ipv6: handle gso skbs in forwarding path") Cc: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: NTobias Brunner <tobias@strongswan.org> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 4月, 2014 3 次提交
-
-
由 Veaceslav Falico 提交于
Currently we're checking a variable for != NULL after actually dereferencing it, in netdev_lower_get_next_private*(). It's counter-intuitive at best, and can lead to faulty usage (as it implies that the variable can be NULL), so fix it by removing the useless checks. Reported-by: NDaniel Borkmann <dborkman@redhat.com> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Jiri Pirko <jiri@resnulli.us> CC: stephen hemminger <stephen@networkplumber.org> CC: Jerry Chu <hkchu@google.com> Signed-off-by: NVeaceslav Falico <vfalico@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Similarly as in commit 8e2f1a63 ("packet: fix packet_direct_xmit for BQL enabled drivers"), we test for __QUEUE_STATE_STACK_XOFF bit in pktgen's xmit, which would not fully fill the device's TX ring for BQL drivers that use netdev_tx_sent_queue(). Fix is to use, similarly as we do in packet sockets, netif_xmit_frozen_or_drv_stopped() test. Signed-off-by: NDaniel Borkmann <dborkman@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
The old interpreter behaviour was that we returned with 0 whenever we found a division by 0 would take place. In the new interpreter we would currently just skip that instead and continue execution. It's true that a value of 0 as return might not be appropriate in all cases, but current users (socket filters -> drop packet, seccomp -> SECCOMP_RET_KILL, cls_bpf -> unclassified, etc) seem fine with that behaviour. Better this than undefined BPF program behaviour as it's expected that A contains the result of the division. In future, as more use cases open up, we could further adapt this return value to our needs, if necessary. So reintroduce return of 0 for division by 0 as in the old interpreter. Also in case of K which is guaranteed to be 32bit wide, sk_chk_filter() already takes care of preventing division by 0 invoked through K, so we can generally spare us these tests. Signed-off-by: NDaniel Borkmann <dborkman@redhat.com> Reviewed-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 4月, 2014 2 次提交
-
-
由 Eric Dumazet 提交于
Recycling skb always had been very tough... This time it appears GRO layer can accumulate skb->truesize adjustments made by drivers when they attach a fragment to skb. skb_gro_receive() can only subtract from skb->truesize the used part of a fragment. I spotted this problem seeing TcpExtPruneCalled and TcpExtTCPRcvCollapsed that were unexpected with a recent kernel, where TCP receive window should be sized properly to accept traffic coming from a driver not overshooting skb->truesize. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Pirko 提交于
Currently there is no way how to find out if a device supports busy polling. So add a feature and make it dependent on ndo_busy_poll existence. Signed-off-by: NJiri Pirko <jiri@resnulli.us> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 4月, 2014 3 次提交
-
-
由 Eric W. Biederman 提交于
Replace the test in zap_completion_queue to test when it is safe to free skbs in hard irq context with skb_irq_freeable ensuring we only free skbs when it is safe, and removing the possibility of subtle problems. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
This commit fixes a build error reported by Fengguang, that is triggered when CONFIG_NETWORK_PHY_TIMESTAMPING is not set: ERROR: "ptp_classify_raw" [drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.ko] undefined! The fix is to introduce its own file for the PTP BPF classifier, so that PTP_1588_CLOCK and/or NETWORK_PHY_TIMESTAMPING can select it independently from each other. IXP4xx driver on ARM needs to select it as well since it does not seem to select PTP_1588_CLOCK or similar that would pull it in automatically. This also allows for hiding all of the internals of the BPF PTP program inside that file, and only exporting relevant API bits to drivers. This patch also adds a kdoc documentation of ptp_classify_raw() API to make it clear that it can return PTP_CLASS_* defines. Also, the BPF program has been translated into bpf_asm code, so that it can be more easily read and altered (extensively documented in [1]). In the kernel tree under tools/net/ we have bpf_asm and bpf_dbg tools, so the commented program can simply be translated via `./bpf_asm -c prog` where prog is a file that contains the commented code. This makes it easily readable/verifiable and when there's a need to change something, jump offsets etc do not need to be replaced manually which can be very error prone. Instead, a newly translated version via bpf_asm can simply replace the old code. I have checked opcode diffs before/after and it's the very same filter. [1] Documentation/networking/filter.txt Fixes: 164d8c66 ("net: ptp: do not reimplement PTP/BPF classifier") Reported-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NDaniel Borkmann <dborkman@redhat.com> Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Jiri Benc <jbenc@redhat.com> Acked-by: NRichard Cochran <richardcochran@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
This minor patch fixes the following warning when doing a `make htmldocs`: DOCPROC Documentation/DocBook/networking.xml Warning(.../net/core/filter.c:135): No description found for parameter 'insn' Warning(.../net/core/filter.c:135): Excess function parameter 'fentry' description in '__sk_run_filter' HTML Documentation/DocBook/networking.html Reported-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NDaniel Borkmann <dborkman@redhat.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 4月, 2014 3 次提交
-
-
由 Eric Dumazet 提交于
Main difference between napi_frags_skb() and napi_gro_receive() is that the later is called while ethernet header was already pulled by the NIC driver (eth_type_trans() was called before napi_gro_receive()) Jerry Chu in commit 299603e8 ("net-gro: Prepare GRO stack for the upcoming tunneling support") tried to remove this difference by calling eth_type_trans() from napi_frags_skb() instead of doing this later from napi_frags_finish() Goal was that napi_gro_complete() could call ptype->callbacks.gro_complete(skb, 0) (offset of first network header = 0) Also, xxx_gro_receive() handlers all use off = skb_gro_offset(skb) to point to their own header, for the current skb and ones held in gro_list Problem is this cleanup work defeated the frag0 optimization: It turns out the consecutive pskb_may_pull() calls are too expensive. This patch brings back the frag0 stuff in napi_frags_skb(). As all skb have their mac header in skb head, we no longer need skb_gro_mac_header() Reported-by: NMichal Schmidt <mschmidt@redhat.com> Fixes: 299603e8 ("net-gro: Prepare GRO stack for the upcoming tunneling support") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Jerry Chu <hkchu@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 david decotigny 提交于
This allows to monitor carrier on/off transitions and detect link flapping issues: - new /sys/class/net/X/carrier_changes - new rtnetlink IFLA_CARRIER_CHANGES (getlink) Tested: - grep . /sys/class/net/*/carrier_changes + ip link set dev X down/up + plug/unplug cable - updated iproute2: prints IFLA_CARRIER_CHANGES - iproute2 20121211-2 (debian): unchanged behavior Signed-off-by: NDavid Decotigny <decot@googlers.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vlad Yasevich 提交于
Signed-off-by: NVlad Yasevich <vyasevic@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 3月, 2014 1 次提交
-
-
由 Alexei Starovoitov 提交于
This patch replaces/reworks the kernel-internal BPF interpreter with an optimized BPF instruction set format that is modelled closer to mimic native instruction sets and is designed to be JITed with one to one mapping. Thus, the new interpreter is noticeably faster than the current implementation of sk_run_filter(); mainly for two reasons: 1. Fall-through jumps: BPF jump instructions are forced to go either 'true' or 'false' branch which causes branch-miss penalty. The new BPF jump instructions have only one branch and fall-through otherwise, which fits the CPU branch predictor logic better. `perf stat` shows drastic difference for branch-misses between the old and new code. 2. Jump-threaded implementation of interpreter vs switch statement: Instead of single table-jump at the top of 'switch' statement, gcc will now generate multiple table-jump instructions, which helps CPU branch predictor logic. Note that the verification of filters is still being done through sk_chk_filter() in classical BPF format, so filters from user- or kernel space are verified in the same way as we do now, and same restrictions/constraints hold as well. We reuse current BPF JIT compilers in a way that this upgrade would even be fine as is, but nevertheless allows for a successive upgrade of BPF JIT compilers to the new format. The internal instruction set migration is being done after the probing for JIT compilation, so in case JIT compilers are able to create a native opcode image, we're going to use that, and in all other cases we're doing a follow-up migration of the BPF program's instruction set, so that it can be transparently run in the new interpreter. In short, the *internal* format extends BPF in the following way (more details can be taken from the appended documentation): - Number of registers increase from 2 to 10 - Register width increases from 32-bit to 64-bit - Conditional jt/jf targets replaced with jt/fall-through - Adds signed > and >= insns - 16 4-byte stack slots for register spill-fill replaced with up to 512 bytes of multi-use stack space - Introduction of bpf_call insn and register passing convention for zero overhead calls from/to other kernel functions - Adds arithmetic right shift and endianness conversion insns - Adds atomic_add insn - Old tax/txa insns are replaced with 'mov dst,src' insn Performance of two BPF filters generated by libpcap resp. bpf_asm was measured on x86_64, i386 and arm32 (other libpcap programs have similar performance differences): fprog #1 is taken from Documentation/networking/filter.txt: tcpdump -i eth0 port 22 -dd fprog #2 is taken from 'man tcpdump': tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' -dd Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call, smaller is better: --x86_64-- fprog #1 fprog #1 fprog #2 fprog #2 cache-hit cache-miss cache-hit cache-miss old BPF 90 101 192 202 new BPF 31 71 47 97 old BPF jit 12 34 17 44 new BPF jit TBD --i386-- fprog #1 fprog #1 fprog #2 fprog #2 cache-hit cache-miss cache-hit cache-miss old BPF 107 136 227 252 new BPF 40 119 69 172 --arm32-- fprog #1 fprog #1 fprog #2 fprog #2 cache-hit cache-miss cache-hit cache-miss old BPF 202 300 475 540 new BPF 180 270 330 470 old BPF jit 26 182 37 202 new BPF jit TBD Thus, without changing any userland BPF filters, applications on top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf classifier, netfilter's xt_bpf, team driver's load-balancing mode, and many more will have better interpreter filtering performance. While we are replacing the internal BPF interpreter, we also need to convert seccomp BPF in the same step to make use of the new internal structure since it makes use of lower-level API details without being further decoupled through higher-level calls like sk_unattached_filter_{create,destroy}(), for example. Just as for normal socket filtering, also seccomp BPF experiences a time-to-verdict speedup: 05-sim-long_jumps.c of libseccomp was used as micro-benchmark: seccomp_rule_add_exact(ctx,... seccomp_rule_add_exact(ctx,... rc = seccomp_load(ctx); for (i = 0; i < 10000000; i++) syscall(199, 100); 'short filter' has 2 rules 'large filter' has 200 rules 'short filter' performance is slightly better on x86_64/i386/arm32 'large filter' is much faster on x86_64 and i386 and shows no difference on arm32 --x86_64-- short filter old BPF: 2.7 sec 39.12% bench libc-2.15.so [.] syscall 8.10% bench [kernel.kallsyms] [k] sk_run_filter 6.31% bench [kernel.kallsyms] [k] system_call 5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller 4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller 3.70% bench [kernel.kallsyms] [k] __secure_computing 3.67% bench [kernel.kallsyms] [k] lock_is_held 3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load new BPF: 2.58 sec 42.05% bench libc-2.15.so [.] syscall 6.91% bench [kernel.kallsyms] [k] system_call 6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller 6.07% bench [kernel.kallsyms] [k] __secure_computing 5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp --arm32-- short filter old BPF: 4.0 sec 39.92% bench [kernel.kallsyms] [k] vector_swi 16.60% bench [kernel.kallsyms] [k] sk_run_filter 14.66% bench libc-2.17.so [.] syscall 5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load 5.10% bench [kernel.kallsyms] [k] __secure_computing new BPF: 3.7 sec 35.93% bench [kernel.kallsyms] [k] vector_swi 21.89% bench libc-2.17.so [.] syscall 13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp 6.25% bench [kernel.kallsyms] [k] __secure_computing 3.96% bench [kernel.kallsyms] [k] syscall_trace_exit --x86_64-- large filter old BPF: 8.6 seconds 73.38% bench [kernel.kallsyms] [k] sk_run_filter 10.70% bench libc-2.15.so [.] syscall 5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load 1.97% bench [kernel.kallsyms] [k] system_call new BPF: 5.7 seconds 66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp 16.75% bench libc-2.15.so [.] syscall 3.31% bench [kernel.kallsyms] [k] system_call 2.88% bench [kernel.kallsyms] [k] __secure_computing --i386-- large filter old BPF: 5.4 sec new BPF: 3.8 sec --arm32-- large filter old BPF: 13.5 sec 73.88% bench [kernel.kallsyms] [k] sk_run_filter 10.29% bench [kernel.kallsyms] [k] vector_swi 6.46% bench libc-2.17.so [.] syscall 2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load 1.19% bench [kernel.kallsyms] [k] __secure_computing 0.87% bench [kernel.kallsyms] [k] sys_getuid new BPF: 13.5 sec 76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp 10.98% bench [kernel.kallsyms] [k] vector_swi 5.87% bench libc-2.17.so [.] syscall 1.77% bench [kernel.kallsyms] [k] __secure_computing 0.93% bench [kernel.kallsyms] [k] sys_getuid BPF filters generated by seccomp are very branchy, so the new internal BPF performance is better than the old one. Performance gains will be even higher when BPF JIT is committed for the new structure, which is planned in future work (as successive JIT migrations). BPF has also been stress-tested with trinity's BPF fuzzer. Joint work with Daniel Borkmann. Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDaniel Borkmann <dborkman@redhat.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: Kees Cook <keescook@chromium.org> Cc: Paul Moore <pmoore@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: linux-kernel@vger.kernel.org Acked-by: NKees Cook <keescook@chromium.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-