1. 14 1月, 2016 1 次提交
    • J
      mac80211: recalculate SW ROC only when needed · e9db4557
      Johannes Berg 提交于
      The current (new) code recalculates the new work timeout
      for software remain-on-channel whenever any item started.
      In two of the callers of ieee80211_handle_roc_started(),
      this is completely pointless since they're for hardware
      and will skip the recalculation entirely; it's necessary
      only in the case of having just added a new item to the
      list, as in the last remaining case the recalculation had
      just been done.
      
      This last case, however, is also problematic - if one of
      the items on the list actually expires during the recalc
      the list iteration outside becomes corrupted and crashes.
      
      Fix this by moving the recalculation to the only place
      where it's required.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      e9db4557
  2. 13 1月, 2016 7 次提交
  3. 12 1月, 2016 7 次提交
  4. 11 1月, 2016 18 次提交
    • A
      net/rtnetlink: remove unused sz_idx variable · 617cfc75
      Alexander Kuleshov 提交于
      The sz_idx variable is defined in the rtnetlink_rcv_msg(), but
      not used anywhere. Let's remove it.
      Signed-off-by: NAlexander Kuleshov <kuleshovmail@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      617cfc75
    • W
      unix: properly account for FDs passed over unix sockets · 712f4aad
      willy tarreau 提交于
      It is possible for a process to allocate and accumulate far more FDs than
      the process' limit by sending them over a unix socket then closing them
      to keep the process' fd count low.
      
      This change addresses this problem by keeping track of the number of FDs
      in flight per user and preventing non-privileged processes from having
      more FDs in flight than their configured FD limit.
      
      Reported-by: socketpair@gmail.com
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Mitigates: CVE-2013-4312 (Linux 2.0+)
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NWilly Tarreau <w@1wt.eu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      712f4aad
    • J
      openvswitch: update kernel doc for struct vport · c5420eb1
      Jean Sacren 提交于
      commit be4ace6e ("openvswitch: Move dev pointer into vport itself")
      
      The commit above added @dev and moved @rcu to the bottom of struct
      vport, but the change was not reflected in the kernel doc. So let's
      update the kernel doc as well.
      Signed-off-by: NJean Sacren <sakiwit@gmail.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c5420eb1
    • J
      openvswitch: fix struct geneve_port member name · 2f7066ad
      Jean Sacren 提交于
      commit 6b001e68 ("openvswitch: Use Geneve device.")
      
      The commit above introduced 'port_no' as the name for the member of
      struct geneve_port. The correct name should be 'dst_port' as described
      in the kernel doc. Let's fix that member name and all the pertinent
      instances so that both doc and code would be consistent.
      Signed-off-by: NJean Sacren <sakiwit@gmail.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f7066ad
    • J
      openvswitch: clean up unused function · 5ea03042
      Jean Sacren 提交于
      commit 6b001e68 ("openvswitch: Use Geneve device.")
      
      The commit above deleted the only call site of ovs_tunnel_route_lookup()
      and now that function is not used any more. So let's delete the function
      definition as well.
      Signed-off-by: NJean Sacren <sakiwit@gmail.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ea03042
    • E
      ipv6: tcp: add rcu locking in tcp_v6_send_synack() · 3e4006f0
      Eric Dumazet 提交于
      When first SYNACK is sent, we already hold rcu_read_lock(), but this
      is not true if a SYNACK is retransmitted, as a timer (soft) interrupt
      does not hold rcu_read_lock()
      
      Fixes: 45f6fad8 ("ipv6: add complete rcu protection around np->opt")
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e4006f0
    • E
      net: add scheduling point in recvmmsg/sendmmsg · a78cb84c
      Eric Dumazet 提交于
      Applications often have to reduce number of datagrams
      they receive or send per system call to avoid starvation problems.
      
      Really the kernel should take care of this by using cond_resched(),
      so that applications can experiment bigger batch sizes.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a78cb84c
    • L
      ipv6: always add flag an address that failed DAD with DADFAILED · 3d171f39
      Lubomir Rintel 提交于
      The userspace needs to know why is the address being removed so that it can
      perhaps obtain a new address.
      
      Without the DADFAILED flag it's impossible to distinguish removal of a
      temporary and tentative address due to DAD failure from other reasons (device
      removed, manual address removal).
      Signed-off-by: NLubomir Rintel <lkundrak@v3.sk>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3d171f39
    • D
      net, sched: add clsact qdisc · 1f211a1b
      Daniel Borkmann 提交于
      This work adds a generalization of the ingress qdisc as a qdisc holding
      only classifiers. The clsact qdisc works on ingress, but also on egress.
      In both cases, it's execution happens without taking the qdisc lock, and
      the main difference for the egress part compared to prior version of [1]
      is that this can be applied with _any_ underlying real egress qdisc (also
      classless ones).
      
      Besides solving the use-case of [1], that is, allowing for more programmability
      on assigning skb->priority for the mqprio case that is supported by most
      popular 10G+ NICs, it also opens up a lot more flexibility for other tc
      applications. The main work on classification can already be done at clsact
      egress time if the use-case allows and state stored for later retrieval
      f.e. again in skb->priority with major/minors (which is checked by most
      classful qdiscs before consulting tc_classify()) and/or in other skb fields
      like skb->tc_index for some light-weight post-processing to get to the
      eventual classid in case of a classful qdisc. Another use case is that
      the clsact egress part allows to have a central egress counterpart to
      the ingress classifiers, so that classifiers can easily share state (e.g.
      in cls_bpf via eBPF maps) for ingress and egress.
      
      Currently, default setups like mq + pfifo_fast would require for this to
      use, for example, prio qdisc instead (to get a tc_classify() run) and to
      duplicate the egress classifier for each queue. With clsact, it allows
      for leaving the setup as is, it can additionally assign skb->priority to
      put the skb in one of pfifo_fast's bands and it can share state with maps.
      Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
      w/o the need to perform a skb_dst_force() to hold on to it any longer. In
      lwt case, we can also use this facility to setup dst metadata via cls_bpf
      (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
      that (case of IFF_NO_QUEUE devices, for example).
      
      The realization can be done without any changes to the scheduler core
      framework. All it takes is that we have two a-priori defined minors/child
      classes, where we can mux between ingress and egress classifier list
      (dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
      dev->_tx to avoid extra cacheline miss for moderate loads). The egress
      part is a bit similar modelled to handle_ing() and patched to a noop in
      case the functionality is not used. Both handlers are now called
      sch_handle_ingress() and sch_handle_egress(), code sharing among the two
      doesn't seem practical as there are various minor differences in both
      paths, so that making them conditional in a single handler would rather
      slow things down.
      
      Full compatibility to ingress qdisc is provided as well. Since both
      piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
      per netdevice, and thus ingress qdisc specific behaviour can be retained
      for user space. This means, either a user does 'tc qdisc add dev foo ingress'
      and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
      alternative, where both, ingress and egress classifier can be configured
      as in the below example. ingress qdisc supports attaching classifier to any
      minor number whereas clsact has two fixed minors for muxing between the
      lists, therefore to not break user space setups, they are better done as
      two separate qdiscs.
      
      I decided to extend the sch_ingress module with clsact functionality so
      that commonly used code can be reused, the module is being aliased with
      sch_clsact so that it can be auto-loaded properly. Alternative would have been
      to add a flag when initializing ingress to alter its behaviour plus aliasing
      to a different name (as it's more than just ingress). However, the first would
      end up, based on the flag, choosing the new/old behaviour by calling different
      function implementations to handle each anyway, the latter would require to
      register ingress qdisc once again under different alias. So, this really begs
      to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
      by its own that share callbacks used by both.
      
      Example, adding qdisc:
      
         # tc qdisc add dev foo clsact
         # tc qdisc show dev foo
         qdisc mq 0: root
         qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc clsact ffff: parent ffff:fff1
      
      Adding filters (deleting, etc works analogous by specifying ingress/egress):
      
         # tc filter add dev foo ingress bpf da obj bar.o sec ingress
         # tc filter add dev foo egress  bpf da obj bar.o sec egress
         # tc filter show dev foo ingress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
         # tc filter show dev foo egress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
      
      A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
      show an empty list for clsact. Either using the parent names (ingress/egress)
      or specifying the full major/minor will then show the related filter lists.
      
      Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
      
        [1] http://patchwork.ozlabs.org/patch/512949/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f211a1b
    • S
      net: sctp: prevent writes to cookie_hmac_alg from accessing invalid memory · 320f1a4a
      Sasha Levin 提交于
      proc_dostring() needs an initialized destination string, while the one
      provided in proc_sctp_do_hmac_alg() contains stack garbage.
      
      Thus, writing to cookie_hmac_alg would strlen() that garbage and end up
      accessing invalid memory.
      
      Fixes: 3c68198e ("sctp: Make hmac algorithm selection for cookie generation dynamic")
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      320f1a4a
    • D
      bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions · f8ffad69
      Daniel Borkmann 提交于
      Add a small helper skb_postpush_rcsum() and fix up redirect locations
      that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
      a proper csum that covers also Ethernet header, f.e. since 2c26d34b
      ("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we
      also do skb_postpull_rcsum() after pulling Ethernet header off via
      eth_type_trans().
      
      When using eBPF in a netns setup f.e. with vxlan in collect metadata mode,
      I can trigger the following csum issue with an IPv6 setup:
      
        [  505.144065] dummy1: hw csum failure
        [...]
        [  505.144108] Call Trace:
        [  505.144112]  <IRQ>  [<ffffffff81372f08>] dump_stack+0x44/0x5c
        [  505.144134]  [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40
        [  505.144142]  [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0
        [  505.144149]  [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120
        [  505.144161]  [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
        [  505.144170]  [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
        [  505.144177]  [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
        [  505.144189]  [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
        [  505.144196]  [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
        [  505.144204]  [<ffffffff8164385d>] nf_iterate+0x5d/0x70
        [  505.144210]  [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0
        [  505.144218]  [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0
        [  505.144225]  [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0
        [  505.144232]  [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0
        [  505.144239]  [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60
        [  505.144245]  [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60
        [  505.144252]  [<ffffffff8160ccff>] process_backlog+0x9f/0x140
        [  505.144259]  [<ffffffff8160c4a5>] net_rx_action+0x145/0x320
        [...]
      
      What happens is that on ingress, we push Ethernet header back in, either
      from cls_bpf or right before skb_do_redirect(), but without updating csum.
      The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
      helper for the dev_forward_skb() case to correct the csum diff again.
      
      Thanks to Hannes Frederic Sowa for the csum_partial() idea!
      
      Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper")
      Fixes: 27b29f63 ("bpf: add bpf_redirect() helper")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8ffad69
    • D
      net, sched: add skb_at_tc_ingress helper · fdc5432a
      Daniel Borkmann 提交于
      Add a skb_at_tc_ingress() as this will be needed elsewhere as well and
      can hide the ugly ifdef.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fdc5432a
    • N
      ipv4: Namespecify the tcp_keepalive_intvl sysctl knob · b840d15d
      Nikolay Borisov 提交于
      This is the final part required to namespaceify the tcp
      keep alive mechanism.
      Signed-off-by: NNikolay Borisov <kernel@kyup.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b840d15d
    • N
      ipv4: Namespecify tcp_keepalive_probes sysctl knob · 9bd6861b
      Nikolay Borisov 提交于
      This is required to have full tcp keepalive mechanism namespace
      support.
      Signed-off-by: NNikolay Borisov <kernel@kyup.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9bd6861b
    • N
      ipv4: Namespaceify tcp_keepalive_time sysctl knob · 13b287e8
      Nikolay Borisov 提交于
      Different net namespaces might have different requirements as to
      the keepalive time of tcp sockets. This might be required in cases
      where different firewall rules are in place which require tcp
      timeout sockets to be increased/decreased independently of the host.
      Signed-off-by: NNikolay Borisov <kernel@kyup.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      13b287e8
    • H
      udp: restrict offloads to one namespace · 787d7ac3
      Hannes Frederic Sowa 提交于
      udp tunnel offloads tend to aggregate datagrams based on inner
      headers. gro engine gets notified by tunnel implementations about
      possible offloads. The match is solely based on the port number.
      
      Imagine a tunnel bound to port 53, the offloading will look into all
      DNS packets and tries to aggregate them based on the inner data found
      within. This could lead to data corruption and malformed DNS packets.
      
      While this patch minimizes the problem and helps an administrator to find
      the issue by querying ip tunnel/fou, a better way would be to match on
      the specific destination ip address so if a user space socket is bound
      to the same address it will conflict.
      
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      787d7ac3
    • E
      bridge: Reflect MDB entries to hardware · f1fecb1d
      Elad Raz 提交于
      Offload MDB changes per port to hardware
      Signed-off-by: NElad Raz <eladr@mellanox.com>
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Reviewed-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f1fecb1d
    • E
      switchdev: Adding MDB entry offload · 4d41e125
      Elad Raz 提交于
      Define HW multicast entry: MAC and VID.
      Using a MAC address simplifies support for both IPV4 and IPv6.
      Signed-off-by: NElad Raz <eladr@mellanox.com>
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4d41e125
  5. 09 1月, 2016 7 次提交