- 22 5月, 2015 5 次提交
-
-
由 Alexei Starovoitov 提交于
bpf_tail_call() arguments: ctx - context pointer jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table index - index in the jump table In this implementation x64 JIT bypasses stack unwind and jumps into the callee program after prologue, so the callee program reuses the same stack. The logic can be roughly expressed in C like: u32 tail_call_cnt; void *jumptable[2] = { &&label1, &&label2 }; int bpf_prog1(void *ctx) { label1: ... } int bpf_prog2(void *ctx) { label2: ... } int bpf_prog1(void *ctx) { ... if (tail_call_cnt++ < MAX_TAIL_CALL_CNT) goto *jumptable[index]; ... and pass my 'ctx' to callee ... ... fall through if no entry in jumptable ... } Note that 'skip current program epilogue and next program prologue' is an optimization. Other JITs don't have to do it the same way. >From safety point of view it's valid as well, since programs always initialize the stack before use, so any residue in the stack left by the current program is not going be read. The same verifier checks are done for the calls from the kernel into all bpf programs. Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexei Starovoitov 提交于
introduce bpf_tail_call(ctx, &jmp_table, index) helper function which can be used from BPF programs like: int bpf_prog(struct pt_regs *ctx) { ... bpf_tail_call(ctx, &jmp_table, index); ... } that is roughly equivalent to: int bpf_prog(struct pt_regs *ctx) { ... if (jmp_table[index]) return (*jmp_table[index])(ctx); ... } The important detail that it's not a normal call, but a tail call. The kernel stack is precious, so this helper reuses the current stack frame and jumps into another BPF program without adding extra call frame. It's trivially done in interpreter and a bit trickier in JITs. In case of x64 JIT the bigger part of generated assembler prologue is common for all programs, so it is simply skipped while jumping. Other JITs can do similar prologue-skipping optimization or do stack unwind before jumping into the next program. bpf_tail_call() arguments: ctx - context pointer jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table index - index in the jump table Since all BPF programs are idenitified by file descriptor, user space need to populate the jmp_table with FDs of other BPF programs. If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere and program execution continues as normal. New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can populate this jmp_table array with FDs of other bpf programs. Programs can share the same jmp_table array or use multiple jmp_tables. The chain of tail calls can form unpredictable dynamic loops therefore tail_call_cnt is used to limit the number of calls and currently is set to 32. Use cases: Acked-by: NDaniel Borkmann <daniel@iogearbox.net> ========== - simplify complex programs by splitting them into a sequence of small programs - dispatch routine For tracing and future seccomp the program may be triggered on all system calls, but processing of syscall arguments will be different. It's more efficient to implement them as: int syscall_entry(struct seccomp_data *ctx) { bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */); ... default: process unknown syscall ... } int sys_write_event(struct seccomp_data *ctx) {...} int sys_read_event(struct seccomp_data *ctx) {...} syscall_jmp_table[__NR_write] = sys_write_event; syscall_jmp_table[__NR_read] = sys_read_event; For networking the program may call into different parsers depending on packet format, like: int packet_parser(struct __sk_buff *skb) { ... parse L2, L3 here ... __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol)); bpf_tail_call(skb, &ipproto_jmp_table, ipproto); ... default: process unknown protocol ... } int parse_tcp(struct __sk_buff *skb) {...} int parse_udp(struct __sk_buff *skb) {...} ipproto_jmp_table[IPPROTO_TCP] = parse_tcp; ipproto_jmp_table[IPPROTO_UDP] = parse_udp; - for TC use case, bpf_tail_call() allows to implement reclassify-like logic - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table are atomic, so user space can build chains of BPF programs on the fly Implementation details: ======================= - high performance of bpf_tail_call() is the goal. It could have been implemented without JIT changes as a wrapper on top of BPF_PROG_RUN() macro, but with two downsides: . all programs would have to pay performance penalty for this feature and tail call itself would be slower, since mandatory stack unwind, return, stack allocate would be done for every tailcall. . tailcall would be limited to programs running preempt_disabled, since generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would need to be either global per_cpu variable accessed by helper and by wrapper or global variable protected by locks. In this implementation x64 JIT bypasses stack unwind and jumps into the callee program after prologue. - bpf_prog_array_compatible() ensures that prog_type of callee and caller are the same and JITed/non-JITed flag is the same, since calling JITed program from non-JITed is invalid, since stack frames are different. Similarly calling kprobe type program from socket type program is invalid. - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map' abstraction, its user space API and all of verifier logic. It's in the existing arraymap.c file, since several functions are shared with regular array map. Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Reduce ifdef pollution slightly, no functional change. We can simply remove the extra alternative definition of handle_ing() and nf_ingress(). Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
In commit 8e4d980a ("tcp: fix behavior for epoll edge trigger") we fixed a possible hang of TCP sockets under memory pressure, by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule() if no packet is in socket write queue. It turns out there are other cases where we want to force memory schedule : tcp_fragment() & tso_fragment() need to split a big TSO packet into two smaller ones. If we block here because of TCP memory pressure, we can effectively block TCP socket from sending new data. If no further ACK is coming, this hang would be definitive, and socket has no chance to effectively reduce its memory usage. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Erik Kline 提交于
[1] When entering NUD_PROBE state via neigh_update(), perhaps received from userspace, correctly (re)initialize the probes count to zero. This is useful for forcing revalidation of a neighbor (for example if the host is attempting to do DNA [IPv4 4436, IPv6 6059]). [2] Notify listeners when a neighbor goes into NUD_PROBE state. By sending notifications on entry to NUD_PROBE state listeners get more timely warnings of imminent connectivity issues. The current notifications on entry to NUD_STALE have somewhat limited usefulness: NUD_STALE is a perfectly normal state, as is NUD_DELAY, whereas notifications on entry to NUD_FAILURE come after a neighbor reachability problem has been confirmed (typically after three probes). Signed-off-by: NErik Kline <ek@google.com> Acked-By: NLorenzo Colitti <lorenzo@google.com> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 5月, 2015 9 次提交
-
-
由 Andy Zhou 提交于
ip_do_nat() function was removed prior to kernel 3.4. Remove the unnecessary function prototype as well. Reported-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NAndy Zhou <azhou@nicira.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
This work as a follow-up of commit f7b3bec6 ("net: allow setting ecn via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing ECN connections. In other words, this work adds a retry with a non-ECN setup SYN packet, as suggested from the RFC on the first timeout: [...] A host that receives no reply to an ECN-setup SYN within the normal SYN retransmission timeout interval MAY resend the SYN and any subsequent SYN retransmissions with CWR and ECE cleared. [...] Schematic client-side view when assuming the server is in tcp_ecn=2 mode, that is, Linux default since 2009 via commit 255cac91 ("tcp: extend ECN sysctl to allow server-side only ECN"): 1) Normal ECN-capable path: SYN ECE CWR -----> <----- SYN ACK ECE ACK -----> 2) Path with broken middlebox, when client has fallback: SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) SYN -----> <----- SYN ACK ACK -----> In case we would not have the fallback implemented, the middlebox drop point would basically end up as: SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) In any case, it's rather a smaller percentage of sites where there would occur such additional setup latency: it was found in end of 2014 that ~56% of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the fallback would mitigate with a slight latency trade-off. Recent related paper on this topic: Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth, Gorry Fairhurst, and Richard Scheffenegger: "Enabling Internet-Wide Deployment of Explicit Congestion Notification." Proc. PAM 2015, New York. http://ecn.ethz.ch/ecn-pam15.pdf Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168, section 6.1.1.1. fallback on timeout. For users explicitly not wanting this which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that allows for disabling the fallback. tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but rather we let tcp_ecn_rcv_synack() take that over on input path in case a SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent ECN being negotiated eventually in that case. Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdfSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NMirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch> Signed-off-by: NBrian Trammell <trammell@tik.ee.ethz.ch> Cc: Eric Dumazet <edumazet@google.com> Cc: Dave That <dave.taht@gmail.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
Hariprasad Shenai says: ==================== cxgb4: Remove dead code and replace byte-oder functions This series removes dead fn t4_read_edc and t4_read_mc, also replaces ntoh{s,l} and hton{s,l} calls with the generic byteorder. PATCH 2/2 was sent as a single PATCH, but had some byte-ordering issues in t4_read_edc and t4_read_mc function. Found that t4_read_edc and t4_read_mc is unused, so PATCH 1/2 is added to remove it. This patch series is created against net-next tree and includes patches on cxgb4 driver. We have included all the maintainers of respective drivers. Kindly review the change and let us know in case of any review comments. ==================== Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Hariprasad Shenai 提交于
replace ntoh{s,l} and hton{s,l} calls with the generic byteorder in cxgb4/t4_hw.c file Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Hariprasad Shenai 提交于
Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
Merge tag 'mac80211-next-for-davem-2015-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== This just has a few fixes: * LED throughput trigger was crashing * fast-xmit wasn't treating QoS changes in IBSS correctly * TDLS could use the wrong channel definition * using a reserved channel context could use the wrong channel width ==================== Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Arnd Bergmann 提交于
The hwmon interface in the be2net driver causes a link error when be2net is built-in while the hwmon subsystem is a loadable module: drivers/built-in.o: In function `be_probe': drivers/net/ethernet/emulex/benet/be_main.c:5761: undefined reference to `devm_hwmon_device_register_with_groups' This adds a new Kconfig symbol, following the example of multiple other drivers that have the same problem. The new CONFIG_BE2NET_HWMON will not be available when (BE2NET=y && HWMON=m) to avoid this problem. We have to also mark be_hwmon_show_temp as 'static' to ensure the compiler can optimize out all the unused code. Signed-off-by: NArnd Bergmann <arnd@arndb.de> Fixes: 29e9122b ("be2net: Export board temperature using hwmon-sysfs interface.") Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric B Munson 提交于
Currently the getsockopt() requesting the cached contents of the syn packet headers will fail silently if the caller uses a buffer that is too small to contain the requested data. Rather than fail silently and discard the headers, getsockopt() should return an error and report the required size to hold the data. Signed-off-by: NEric B Munson <emunson@akamai.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Parav Pandit 提交于
Signed-off-by: NParav Pandit <parav.pandit@avagotech.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 5月, 2015 16 次提交
-
-
由 David S. Miller 提交于
Andy Zhou says: ==================== fragmentation ICMP Currently, we send ICMP packets when errors occur during fragmentation or de-fragmentation. However, it is a bug when sending those ICMP packets in the context of using netfilter for bridging. Those ICMP packets are only expected in the context of routing, not in bridging mode. The local stack is not involved in bridging forward decisions, thus should be not used for deciding the reverse path for those ICMP messages. This bug only affects IPV4, not in IPv6. v1->v2: restructure the patches into two patches that fix defragmentation and fragmentation respectively. A bit is add in IPCB to control whether ICMP packet should be generated for defragmentation. Fragmentation ICMP is now removed by restructuring the ip_fragment() API. v2->v3: Add droping icmp for bridging contrack users drop exporting ip_fragment() API. v3->v4: Remove unnecessary parentheses in 'return' statements v4->v5: Drop the patch that sets and checks a bit in IPCB that prevents ip_defrag to send ICMP. ==================== Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Andy Zhou 提交于
When bridge netfilter re-fragments an IP packet for output, all packets that can not be re-fragmented to their original input size should be silently discarded. However, current bridge netfilter output path generates an ICMP packet with 'size exceeded MTU' message for such packets, this is a bug. This patch refactors the ip_fragment() API to allow two separate use cases. The bridge netfilter user case will not send ICMP, the routing output will, as before. Signed-off-by: NAndy Zhou <azhou@nicira.com> Acked-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Andy Zhou 提交于
users in [IP_DEFRAG_CONNTRACK_BRIDGE_IN, __IP_DEFRAG_CONNTRACK_BR_IN] should not ICMP message also. Reported-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NAndy Zhou <azhou@nicira.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Andy Zhou 提交于
Improve readability of skip ICMP for de-fragmentation expiration logic. This change will also make the logic easier to maintain when the following patches in this series are applied. Signed-off-by: NAndy Zhou <azhou@nicira.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Willem de Bruijn 提交于
psock_fanout tests the various fanout modes. Change the test for rollover mode to expect early rollover due to socket pressure as implemented in 2ccdbaa6 ("packet: rollover lock contention avoidance"). Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Edward Cree 提交于
We expect that MC_CMD_SRIOV will fail if the card has no VFs configured. So output a readable message instead of a cryptic MCDI error. Signed-off-by: NEdward Cree <ecree@solarflare.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next由 David S. Miller 提交于
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next. Briefly speaking, cleanups and minor fixes for ipset from Jozsef Kadlecsik and Serget Popovich, more incremental updates to make br_netfilter a better place from Florian Westphal, ARP support to the x_tables mark match / target from and context Zhang Chunyu and the addition of context to know that the x_tables runs through nft_compat. More specifically, they are: 1) Fix sparse warning in ipset/ip_set_hash_ipmark.c when fetching the IPSET_ATTR_MARK netlink attribute, from Jozsef Kadlecsik. 2) Rename STREQ macro to STRNCMP in ipset, also from Jozsef. 3) Use skb->network_header to calculate the transport offset in ip_set_get_ip{4,6}_port(). From Alexander Drozdov. 4) Reduce memory consumption per element due to size miscalculation, this patch and follow up patches from Sergey Popovich. 5) Expand nomatch field from 1 bit to 8 bits to allow to simplify mtype_data_reset_flags(), also from Sergey. 6) Small clean for ipset macro trickery. 7) Fix error reporting when both ip_set_get_hostipaddr4() and ip_set_get_extensions() from per-set uadt functions. 8) Simplify IPSET_ATTR_PORT netlink attribute validation. 9) Introduce HOST_MASK instead of hardcoded 32 in ipset. 10) Return true/false instead of 0/1 in functions that return boolean in the ipset code. 11) Validate maximum length of the IPSET_ATTR_COMMENT netlink attribute. 12) Allow to dereference from ext_*() ipset macros. 13) Get rid of incorrect definitions of HKEY_DATALEN. 14) Include linux/netfilter/ipset/ip_set.h in the x_tables set match. 15) Reduce nf_bridge_info size in br_netfilter, from Florian Westphal. 16) Release nf_bridge_info after POSTROUTING since this is only needed from the physdev match, also from Florian. 17) Reduce size of ipset code by deinlining ip_set_put_extensions(), from Denys Vlasenko. 18) Oneliner to add ARP support to the x_tables mark match/target, from Zhang Chunyu. 19) Add context to know if the x_tables extension runs from nft_compat, to address minor problems with three existing extensions. 20) Correct return value in several seqfile *_show() functions in the netfilter tree, from Joe Perches. ==================== Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
Ursula Braun says: ==================== s390: network patches for net-next here are s390 related patches for net-next ==================== Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Peter Oberparleiter 提交于
An attempt to configure a CTC device as LCS results in the following error message: (null): Detecting a network adapter for LCS devices failed with rc=-5 (0xfffffffb) "(null)" results from access to &card->dev->dev in lcs_new_device() which is only initialized later in the function. Fix this by using &ccwgdev->dev instead which is initialized before lcs_new_device() is called. Signed-off-by: NPeter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: NUrsula Braun <ursula.braun@de.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eugene Crosser 提交于
Since recently, `checkpatch.pl` advices that ENOSYS should not be used for anything other than "invalid syscall nr". This patch replaces ENOSYS return code with EOPNOTSUPP for the "unsupported function" conditions. Signed-off-by: NEugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: NUrsula Braun <ursula.braun@de.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eugene Crosser 提交于
Forbid enabling IFF_PROMISC reflection to BRIDGEPORT when a role is already assigned, and forbid direct manipulation of the role when reflection mode is engaged. Reviewed-by: NThomas Richter <tmricht@linux.vnet.ibm.com> Signed-off-by: NEugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: NUrsula Braun <ursula.braun@de.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eugene Crosser 提交于
OSA Ethernet hardware is introducing BRIDGEPORT functionality similar (but not identical) to HiperSockets BRIDGEPORT. This patch makes HiperSockets BRIDGEPORT related sysfs attributes and udev events work with OSA hardware too. Reviewed-by: NThomas Richter <tmricht@linux.vnet.ibm.com> Signed-off-by: NEugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: NUrsula Braun <ursula.braun@de.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eugene Crosser 提交于
OSA and HiperSocket devices do not support promiscuous mode proper, but they support "BRIDGE PORT" mode that is functionally similar. This update introduces sysfs attribute that, when set, makes the driver try to "reflect" setting and resetting of the IFF_PROMISC flag on the interface into setting and resetting PRIMARY or SECONDARY bridge port role on the underlying OSA or HiperSocket device. Reviewed-by: NThomas Richter <tmricht@linux.vnet.ibm.com> Signed-off-by: NEugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: NUrsula Braun <ursula.braun@de.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eugene Crosser 提交于
Locking is probably unnecessary in this case, and the rest of the qeth sysfs code does not use locks in the *_show() functions. Remove locks from the layer2 *_show() functions in which they where accidentally introduced. Signed-off-by: NEugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: NUrsula Braun <ursula.braun@de.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eugene Crosser 提交于
Function that executes IPA commands returns the result code from the IPA response block. If non-negative, it needs to be transformed into errno-compatible code before returning to the caller. Signed-off-by: NEugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: NUrsula Braun <ursula.braun@de.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Thomas Richter 提交于
ethtool is used to change some device driver features such as RX/TX hardware checksum offloading. The qeth device driver callback function to turn on/off RX hardware check sum handling never changes the hardware configuration. The NETIF_F_RXCSUM bit is cleared when the feature bitset type netdev_features_t(64bit) is assigned to 32 a bit variable. This patch fixes the NETIF_F_RXCSUM handling. Also there is no need to manipulate the device's features bit set as this is done by the caller when no error occurs. Signed-off-by: NThomas Richter <tmricht@linux.vnet.ibm.com> Signed-off-by: NUrsula Braun <ursula.braun@de.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 5月, 2015 10 次提交
-
-
由 Herbert Xu 提交于
Currently we use a global rover to select a port ID that is unique. This used to work consistently when it was protected with a global lock. However as we're now lockless, the global rover can exhibit pathological behaviour should multiple threads all stomp on it at the same time. Granted this will eventually resolve itself but the process is suboptimal. This patch replaces the global rover with a pseudorandom starting point to avoid this issue. Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 WANG Cong 提交于
The spinlock is used to protect netns_ids which is per net, so there is no need to use a global spinlock. Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Fainelli 提交于
MoCA interfaces require the use of an user-space daemon (mocad) which will typically use cmd->autoneg to force the link. This is causing other network manager applications not to get proper carrier down notifications because of the following sequence of events: - link down interrupt is received, link is set to 0 by the interrupt handler - fixed_link update callback runs and updates the BMSR register accordingly - PHY library polls the PHY for link status, sees the link is down, proceeds with reporting that - mocad gets notified of the link state and call phy_ethtool_sset() with cmd->autoneg set to the link status (0) - phy_start_aneg() is called at the end of phy_ethtool_sset() and sets the PHY state to PHY_FORCING Just make sure we notify the interface carrier appropriately when we detect that the link is down in our fixed_link update callback. This is made local to the bcm_sf2 driver as the PHY library does the right thing in any case. This is similar to the GENET change introduced in 54d7c01d ("net: bcmgenet: enable MoCA link state change detection"). Fixes: 246d7f77 ("net: dsa: add Broadcom SF2 switch driver") Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Pirko 提交于
Fixes: 06635a35 ("flow_dissect: use programable dissector in skb_flow_dissect and friends") Reported-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NJiri Pirko <jiri@resnulli.us> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 sixiao@microsoft.com 提交于
Currently the struct netvsc_stats has a member s_sync of type u64_stats_sync. This definition will break kernel build as the macro netdev_alloc_pcpu_stats requires this member name to be syncp. (see netdev_alloc_pcpu_stats definition in ./include/linux/netdevice.h) This patch changes netvsc_stats's member name from s_sync to syncp to fix the build break. Signed-off-by: NSimon Xiao <sixiao@microsoft.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Samudrala, Sridhar 提交于
- introduce port fdb obj and generic switchdev_port_fdb_add/del/dump() - use switchdev_port_fdb_add/del/dump in rocker/team/bonding ndo ops. - add support for fdb obj in switchdev_port_obj_add/del/dump() - switch rocker to implement fdb ops via switchdev_ops v3: updated to sync with named union changes. Signed-off-by: NSridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: NScott Feldman <sfeldma@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
Eric Dumazet says: ==================== tcp: better handling of memory pressure When testing commit 790ba456 ("tcp: set SOCK_NOSPACE under memory pressure") using edge triggered epoll applications, I found various issues under memory pressure and thousands of active sockets. This patch series is a first round to solve these issues, in send and receive paths. There are probably other fixes needed, but with this series, my tests now all succeed. v2: fix typo in "allow one skb to be received per socket under memory pressure", as spotted by Jason Baron. ==================== Acked-by: NJason Baron <jbaron@akamai.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Allowing tcp to use ~19% of physical memory is way too much, and allowed bugs to be hidden. Add to this that some drivers use a full page per incoming frame, so real cost can be twice the advertized one. Reduce tcp_mem by 50 % as a first step to sanity. tcp_mem[0,1,2] defaults are now 4.68%, 6.25%, 9.37% of physical memory. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
While testing tight tcp_mem settings, I found tcp sessions could be stuck because we do not allow even one skb to be received on them. By allowing one skb to be received, we introduce fairness and eventuallu force memory hogs to release their allocation. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Under memory pressure, tcp_sendmsg() can fail to queue a packet while no packet is present in write queue. If we return -EAGAIN with no packet in write queue, no ACK packet will ever come to raise EPOLLOUT. We need to allow one skb per TCP socket, and make sure that tcp sockets can release their forward allocations under pressure. This is a followup to commit 790ba456 ("tcp: set SOCK_NOSPACE under memory pressure") Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-