1. 05 6月, 2014 5 次提交
    • T
      tcp: Call gso_make_checksum · e9c3a24b
      Tom Herbert 提交于
      Call common gso_make_checksum when calculating checksum for a
      TCP GSO segment.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e9c3a24b
    • T
      net: Support for multiple checksums with gso · 7e2b10c1
      Tom Herbert 提交于
      When creating a GSO packet segment we may need to set more than
      one checksum in the packet (for instance a TCP checksum and
      UDP checksum for VXLAN encapsulation). To be efficient, we want
      to do checksum calculation for any part of the packet at most once.
      
      This patch adds csum_start offset to skb_gso_cb. This tracks the
      starting offset for skb->csum which is initially set in skb_segment.
      When a protocol needs to compute a transport checksum it calls
      gso_make_checksum which computes the checksum value from the start
      of transport header to csum_start and then adds in skb->csum to get
      the full checksum. skb->csum and csum_start are then updated to reflect
      the checksum of the resultant packet starting from the transport header.
      
      This patch also adds a flag to skbuff, encap_hdr_csum, which is set
      in *gso_segment fucntions to indicate that a tunnel protocol needs
      checksum calculation
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e2b10c1
    • T
      l2tp: call udp{6}_set_csum · 77157e19
      Tom Herbert 提交于
      Call common functions to set checksum for UDP tunnel.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77157e19
    • T
      udp: Generic functions to set checksum · af5fcba7
      Tom Herbert 提交于
      Added udp_set_csum and udp6_set_csum functions to set UDP checksums
      in packets. These are for simple UDP packets such as those that might
      be created in UDP tunnels.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af5fcba7
    • S
      net: Revert "fib_trie: use seq_file_net rather than seq->private" · f830b022
      Sasha Levin 提交于
      This reverts commit 30f38d2f.
      
      fib_triestat is surrounded by a big lie: while it claims that it's a
      seq_file (fib_triestat_seq_open, fib_triestat_seq_show), it isn't:
      
      	static const struct file_operations fib_triestat_fops = {
      	        .owner  = THIS_MODULE,
      	        .open   = fib_triestat_seq_open,
      	        .read   = seq_read,
      	        .llseek = seq_lseek,
      	        .release = single_release_net,
      	};
      
      Yes, fib_triestat is just a regular file.
      
      A small detail (assuming CONFIG_NET_NS=y) is that while for seq_files
      you could do seq_file_net() to get the net ptr, doing so for a regular
      file would be wrong and would dereference an invalid pointer.
      
      The fib_triestat lie claimed a victim, and trying to show the file would
      be bad for the kernel. This patch just reverts the issue and fixes
      fib_triestat, which still needs a rewrite to either be a seq_file or
      stop claiming it is.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f830b022
  2. 04 6月, 2014 3 次提交
  3. 03 6月, 2014 14 次提交
    • B
      ethtool: Check that reserved fields of struct ethtool_rxfh are 0 · f062a384
      Ben Hutchings 提交于
      We should fail rather than silently ignoring use of these extensions.
      Signed-off-by: NBen Hutchings <ben@decadent.org.uk>
      f062a384
    • B
      ethtool: Replace ethtool_ops::{get,set}_rxfh_indir() with {get,set}_rxfh() · fe62d001
      Ben Hutchings 提交于
      ETHTOOL_{G,S}RXFHINDIR and ETHTOOL_{G,S}RSSH should work for drivers
      regardless of whether they expose the hash key, unless you try to
      set a hash key for a driver that doesn't expose it.
      Signed-off-by: NBen Hutchings <ben@decadent.org.uk>
      Acked-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      fe62d001
    • R
      bridge: Add bridge ifindex to bridge fdb notify msgs · 41c389d7
      Roopa Prabhu 提交于
      (This patch was previously posted as RFC at
      http://patchwork.ozlabs.org/patch/352677/)
      
      This patch adds NDA_MASTER attribute to neighbour attributes enum for
      bridge/master ifindex. And adds NDA_MASTER to bridge fdb notify msgs.
      
      Today bridge fdb notifications dont contain bridge information.
      Userspace can derive it from the port information in the fdb
      notification. However this is tricky in some scenarious.
      
      Example, bridge port delete notification comes before bridge fdb
      delete notifications. And we have seen problems in userspace
      when using libnl where, the bridge fdb delete notification handling code
      does not understand which bridge this fdb entry is part of because
      the bridge and port association has already been deleted.
      And these notifications (port membership and fdb) are generated on
      separate rtnl groups.
      
      Fixing the order of notifications could possibly solve the problem
      for some cases (I can submit a separate patch for that).
      
      This patch chooses to add NDA_MASTER to bridge fdb notify msgs
      because it not only solves the problem described above, but also helps
      userspace avoid another lookup into link msgs to derive the master index.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      41c389d7
    • L
      net: filter: fix possible memory leak in __sk_prepare_filter() · 418c96ac
      Leon Yu 提交于
      __sk_prepare_filter() was reworked in commit bd4cf0ed (net: filter:
      rework/optimize internal BPF interpreter's instruction set) so that it should
      have uncharged memory once things went wrong. However that work isn't complete.
      Error is handled only in __sk_migrate_filter() while memory can still leak in
      the error path right after sk_chk_filter().
      
      Fixes: bd4cf0ed ("net: filter: rework/optimize internal BPF interpreter's instruction set")
      Signed-off-by: NLeon Yu <chianglungyu@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Tested-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      418c96ac
    • Y
      tcp: fix cwnd undo on DSACK in F-RTO · 0cfa5c07
      Yuchung Cheng 提交于
      This bug is discovered by an recent F-RTO issue on tcpm list
      https://www.ietf.org/mail-archive/web/tcpm/current/msg08794.html
      
      The bug is that currently F-RTO does not use DSACK to undo cwnd in
      certain cases: upon receiving an ACK after the RTO retransmission in
      F-RTO, and the ACK has DSACK indicating the retransmission is spurious,
      the sender only calls tcp_try_undo_loss() if some never retransmisted
      data is sacked (FLAG_ORIG_DATA_SACKED).
      
      The correct behavior is to unconditionally call tcp_try_undo_loss so
      the DSACK information is used properly to undo the cwnd reduction.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0cfa5c07
    • D
      fib_trie: use seq_file_net rather than seq->private · 30f38d2f
      David Ahern 提交于
      Make fib_triestat_seq_show consistent with other /proc/net files and
      use seq_file_net.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30f38d2f
    • E
      netlink: Only check file credentials for implicit destinations · 2d7a85f4
      Eric W. Biederman 提交于
      It was possible to get a setuid root or setcap executable to write to
      it's stdout or stderr (which has been set made a netlink socket) and
      inadvertently reconfigure the networking stack.
      
      To prevent this we check that both the creator of the socket and
      the currentl applications has permission to reconfigure the network
      stack.
      
      Unfortunately this breaks Zebra which always uses sendto/sendmsg
      and creates it's socket without any privileges.
      
      To keep Zebra working don't bother checking if the creator of the
      socket has privilege when a destination address is specified.  Instead
      rely exclusively on the privileges of the sender of the socket.
      
      Note from Andy: This is exactly Eric's code except for some comment
      clarifications and formatting fixes.  Neither I nor, I think, anyone
      else is thrilled with this approach, but I'm hesitant to wait on a
      better fix since 3.15 is almost here.
      
      Note to stable maintainers: This is a mess.  An earlier series of
      patches in 3.15 fix a rather serious security issue (CVE-2014-0181),
      but they did so in a way that breaks Zebra.  The offending series
      includes:
      
          commit aa4cf945
          Author: Eric W. Biederman <ebiederm@xmission.com>
          Date:   Wed Apr 23 14:28:03 2014 -0700
      
              net: Add variants of capable for use on netlink messages
      
      If a given kernel version is missing that series of fixes, it's
      probably worth backporting it and this patch.  if that series is
      present, then this fix is critical if you care about Zebra.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d7a85f4
    • E
      net: fix inet_getid() and ipv6_select_ident() bugs · 39c36094
      Eric Dumazet 提交于
      I noticed we were sending wrong IPv4 ID in TCP flows when MTU discovery
      is disabled.
      Note how GSO/TSO packets do not have monotonically incrementing ID.
      
      06:37:41.575531 IP (id 14227, proto: TCP (6), length: 4396)
      06:37:41.575534 IP (id 14272, proto: TCP (6), length: 65212)
      06:37:41.575544 IP (id 14312, proto: TCP (6), length: 57972)
      06:37:41.575678 IP (id 14317, proto: TCP (6), length: 7292)
      06:37:41.575683 IP (id 14361, proto: TCP (6), length: 63764)
      
      It appears I introduced this bug in linux-3.1.
      
      inet_getid() must return the old value of peer->ip_id_count,
      not the new one.
      
      Lets revert this part, and remove the prevention of
      a null identification field in IPv6 Fragment Extension Header,
      which is dubious and not even done properly.
      
      Fixes: 87c48fa3 ("ipv6: make fragment identifications less predictable")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      39c36094
    • T
      bridge: Prevent insertion of FDB entry with disallowed vlan · e0d7968a
      Toshiaki Makita 提交于
      br_handle_local_finish() is allowing us to insert an FDB entry with
      disallowed vlan. For example, when port 1 and 2 are communicating in
      vlan 10, and even if vlan 10 is disallowed on port 3, port 3 can
      interfere with their communication by spoofed src mac address with
      vlan id 10.
      
      Note: Even if it is judged that a frame should not be learned, it should
      not be dropped because it is destined for not forwarding layer but higher
      layer. See IEEE 802.1Q-2011 8.13.10.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: NVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e0d7968a
    • E
      inetpeer: get rid of ip_id_count · 73f156a6
      Eric Dumazet 提交于
      Ideally, we would need to generate IP ID using a per destination IP
      generator.
      
      linux kernels used inet_peer cache for this purpose, but this had a huge
      cost on servers disabling MTU discovery.
      
      1) each inet_peer struct consumes 192 bytes
      
      2) inetpeer cache uses a binary tree of inet_peer structs,
         with a nominal size of ~66000 elements under load.
      
      3) lookups in this tree are hitting a lot of cache lines, as tree depth
         is about 20.
      
      4) If server deals with many tcp flows, we have a high probability of
         not finding the inet_peer, allocating a fresh one, inserting it in
         the tree with same initial ip_id_count, (cf secure_ip_id())
      
      5) We garbage collect inet_peer aggressively.
      
      IP ID generation do not have to be 'perfect'
      
      Goal is trying to avoid duplicates in a short period of time,
      so that reassembly units have a chance to complete reassembly of
      fragments belonging to one message before receiving other fragments
      with a recycled ID.
      
      We simply use an array of generators, and a Jenkin hash using the dst IP
      as a key.
      
      ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
      belongs (it is only used from this file)
      
      secure_ip_id() and secure_ipv6_id() no longer are needed.
      
      Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
      unnecessary decrement/increment of the number of segments.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73f156a6
    • A
      net: Add support for device specific address syncing · 670e5b8e
      Alexander Duyck 提交于
      This change provides a function to be used in order to break the
      ndo_set_rx_mode call into a set of address add and remove calls.  The code
      is based on the implementation of dev_uc_sync/dev_mc_sync.  Since they
      essentially do the same thing but with only one dev I simply named my
      functions __dev_uc_sync/__dev_mc_sync.
      
      I also implemented an unsync version of the functions as well to allow for
      cleanup on close.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      670e5b8e
    • A
      6lowpan_rtnl: fix off by one while fragmentation · eb06481d
      Alexander Aring 提交于
      This patch fix a off by one error while fragmentation. If the frag_cap
      value is equal to skb_unprocessed value we need to stop the
      fragmentation loop because the last fragment which has a size of
      skb_unprocessed fits into the frag capability size.
      
      This issue was introduced by commit d4b2816d
      ("6lowpan: fix fragmentation").
      Signed-off-by: NAlexander Aring <alex.aring@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb06481d
    • A
      6lowpan_rtnl: fix fragmentation with two fragments · 51263fff
      Alexander Aring 提交于
      This patch fix the 6LoWPAN fragmentation for the case if we have exactly
      two fragments. The problem is that the (skb_unprocessed >= frag_cap)
      condition is always false on the second fragment after sending the first
      fragment. A fragmentation with only one fragment doesn't make any sense.
      The solution is that we use a do while loop here, that ensures we sending
      always a minimum of two fragments if we need a fragmentation.
      
      This issue was introduced by commit d4b2816d
      ("6lowpan: fix fragmentation").
      Signed-off-by: NAlexander Aring <alex.aring@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51263fff
    • D
      genetlink: remove superfluous assignment · 2f91abd4
      Denis ChengRq 提交于
      the local variable ops and n_ops were just read out from family,
      and not changed, hence no need to assign back.
      
      Validation functions should operate on const parameters and not
      change anything.
      Signed-off-by: NCheng Renquan <crquan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f91abd4
  4. 02 6月, 2014 5 次提交
    • D
      net: filter: improve filter block macros · f8f6d679
      Daniel Borkmann 提交于
      Commit 9739eef1 ("net: filter: make BPF conversion more readable")
      started to introduce helper macros similar to BPF_STMT()/BPF_JUMP()
      macros from classic BPF.
      
      However, quite some statements in the filter conversion functions
      remained in the old style which gives a mixture of block macros and
      non block macros in the code. This patch makes the block macros itself
      more readable by using explicit member initialization, and converts
      the remaining ones where possible to remain in a more consistent state.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8f6d679
    • D
      net: filter: get rid of BPF_S_* enum · 34805931
      Daniel Borkmann 提交于
      This patch finally allows us to get rid of the BPF_S_* enum.
      Currently, the code performs unnecessary encode and decode
      workarounds in seccomp and filter migration itself when a filter
      is being attached in order to overcome BPF_S_* encoding which
      is not used anymore by the new interpreter resp. JIT compilers.
      
      Keeping it around would mean that also in future we would need
      to extend and maintain this enum and related encoders/decoders.
      We can get rid of all that and save us these operations during
      filter attaching. Naturally, also JIT compilers need to be updated
      by this.
      
      Before JIT conversion is being done, each compiler checks if A
      is being loaded at startup to obtain information if it needs to
      emit instructions to clear A first. Since BPF extensions are a
      subset of BPF_LD | BPF_{W,H,B} | BPF_ABS variants, case statements
      for extensions can be removed at that point. To ease and minimalize
      code changes in the classic JITs, we have introduced bpf_anc_helper().
      
      Tested with test_bpf on x86_64 (JIT, int), s390x (JIT, int),
      arm (JIT, int), i368 (int), ppc64 (JIT, int); for sparc we
      unfortunately didn't have access, but changes are analogous to
      the rest.
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mircea Gherzan <mgherzan@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: NChema Gonzalez <chemag@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      34805931
    • J
      bridge: notify user space after fdb update · c65c7a30
      Jon Maxwell 提交于
      There has been a number incidents recently where customers running KVM have
      reported that VM hosts on different Hypervisors are unreachable. Based on
      pcap traces we found that the bridge was broadcasting the ARP request out
      onto the network. However some NICs have an inbuilt switch which on occasions
      were broadcasting the VMs ARP request back through the physical NIC on the
      Hypervisor. This resulted in the bridge changing ports and incorrectly learning
      that the VMs mac address was external. As a result the ARP reply was directed
      back onto the external network and VM never updated it's ARP cache. This patch
      will notify the bridge command, after a fdb has been updated to identify such
      port toggling.
      Signed-off-by: NJon Maxwell <jmaxwell37@gmail.com>
      Reviewed-by: NJiri Pirko <jiri@resnulli.us>
      Acked-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c65c7a30
    • W
      bridge: fix the unbalanced promiscuous count when add_if failed · 019ee792
      wangweidong 提交于
      As commit 2796d0c6 ("bridge: Automatically manage port
      promiscuous mode."), make the add_if use dev_set_allmulti
      instead of dev_set_promiscuous, so when add_if failed, we
      should do dev_set_allmulti(dev, -1).
      Signed-off-by: NWang Weidong <wangweidong1@huawei.com>
      Reviewed-by: NAmos Kong <akong@redhat.com>
      Acked-by: NVlad Yasevich <vyasevic@redhat.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      019ee792
    • N
      net: fix wrong mac_len calculation for vlans · 4b9b1cdf
      Nikolay Aleksandrov 提交于
      After 1e785f48 ("net: Start with correct mac_len in
      skb_network_protocol") skb->mac_len is used as a start of the
      calculation in skb_network_protocol() but that is not always correct. If
      skb->protocol == 8021Q/AD, usually the vlan header is already inserted
      in the skb (i.e. vlan reorder hdr == 0). Usually when the packet enters
      dev_hard_xmit it has mac_len == 0 so we take 2 bytes from the
      destination mac address (skb->data + VLAN_HLEN) as a type in
      skb_network_protocol() and return vlan_depth == 4. In the case where TSO is
      off, then the mac_len is set but it's == 18 (ETH_HLEN + VLAN_HLEN), so
      skb_network_protocol() returns a type from inside the packet and
      offset == 22. Also make vlan_depth unsigned as suggested before.
      As suggested by Eric Dumazet, move the while() loop in the if() so we
      can avoid additional testing in fast path.
      
      Here are few netperf tests + debug printk's to illustrate:
      cat netperf.tso-on.reorder-on.bugged
      - Vlan -> device (reorder on, default, this case is okay)
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
      192.168.3.1 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    7111.54
      [   81.605435] skb->len 65226 skb->gso_size 1448 skb->proto 0x800
      skb->mac_len 0 vlan_depth 0 type 0x800
      
      - Vlan -> device (reorder off, bad)
      cat netperf.tso-on.reorder-off.bugged
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
      192.168.3.1 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00     241.35
      [  204.578332] skb->len 1518 skb->gso_size 0 skb->proto 0x8100
      skb->mac_len 0 vlan_depth 4 type 0x5301
      0x5301 are the last two bytes of the destination mac.
      
      And if we stop TSO, we may get even the following:
      [   83.343156] skb->len 2966 skb->gso_size 1448 skb->proto 0x8100
      skb->mac_len 18 vlan_depth 22 type 0xb84
      Because mac_len already accounts for VLAN_HLEN.
      
      After the fix:
      cat netperf.tso-on.reorder-off.fixed
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
      192.168.3.1 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.01    5001.46
      [   81.888489] skb->len 65230 skb->gso_size 1448 skb->proto 0x8100
      skb->mac_len 0 vlan_depth 18 type 0x800
      
      CC: Vlad Yasevich <vyasevic@redhat.com>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Daniel Borkman <dborkman@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      
      Fixes:1e785f48 ("net: Start with correct mac_len in
      skb_network_protocol")
      Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b9b1cdf
  5. 31 5月, 2014 9 次提交
  6. 28 5月, 2014 4 次提交