1. 28 8月, 2014 5 次提交
    • F
      net: dsa: allow switches to work without tagging · 5aed85ce
      Florian Fainelli 提交于
      In case switch port tagging is disabled (voluntarily, or the switch just
      does not support it), allow us to continue using the defined set of
      dsa_device_ops in net/dsa/slave.c.
      
      We introduce dsa_protocol_is_tagged() to check whether we need to
      override skb->protocol and go through the DSA-specifif packet_type
      function, or if we just go on and receive the SKB through the normal
      path.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5aed85ce
    • F
      net: dsa: allow for more complex PHY setups · 0d8bcdd3
      Florian Fainelli 提交于
      Modify the DSA slave interface to be bound to an arbitray PHY, not just
      the ones that are available as child PHY devices of the switch MDIO bus.
      
      This allows us for instance to have external PHYs connected to a
      separate MDIO bus, but yet also connected to a given switch port.
      
      Under certain configurations, the physical port mask might not be a 1:1
      mapping to the MII PHYs mask. This is the case, if e.g: Port 1 of the
      switch is used and connects to a PHY at a MDIO address different than 1.
      
      Introduce a phys_mii_mask variable which allows driver to implement and
      divert their own MDIO read/writes operations for a subset of the MDIO
      PHY addresses.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d8bcdd3
    • F
      net: dsa: retain a per-port device_node pointer · bd47497a
      Florian Fainelli 提交于
      We will later use the per-port device_node pointer to fetch a bunch of
      port-specific properties.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd47497a
    • F
      net: dsa: provide a switch device device tree node pointer · fa981d9a
      Florian Fainelli 提交于
      We might need to fetch additional resources from the device tree node
      pointer, such as register ranges or other properties. Keep a device_node
      pointer around for this.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa981d9a
    • F
      net: dsa: reduce number of protocol hooks · 3e8a72d1
      Florian Fainelli 提交于
      DSA is currently registering one packet_type function per EtherType it
      needs to intercept in the receive path of a DSA-enabled Ethernet device.
      Right now we have three of them: trailer, DSA and eDSA, and there might
      be more in the future, this will not scale to the addition of new
      protocols.
      
      This patch proceeds with adding a new layer of abstraction and two new
      functions:
      
      dsa_switch_rcv() which will dispatch into the tag-protocol specific
      receive function implemented by net/dsa/tag_*.c
      
      dsa_slave_xmit() which will dispatch into the tag-protocol specific
      transmit function implemented by net/dsa/tag_*.c
      
      When we do create the per-port slave network devices, we iterate over
      the switch protocol to assign the DSA-specific receive and transmit
      operations.
      
      A new fake ethertype value is used: ETH_P_XDSA to illustrate the fact
      that this is no longer going to look like ETH_P_DSA or ETH_P_TRAILER
      like it used to be.
      
      This allows us to greatly simplify the check in eth_type_trans() and
      always override the skb->protocol with ETH_P_XDSA for Ethernet switches
      tagged protocol, while also reducing the number repetitive slave
      netdevice_ops assignments.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e8a72d1
  2. 26 8月, 2014 1 次提交
  3. 25 8月, 2014 2 次提交
  4. 24 8月, 2014 1 次提交
    • D
      net: Allow raw buffers to be passed into the flow dissector. · 690e36e7
      David S. Miller 提交于
      Drivers, and perhaps other entities we have not yet considered,
      sometimes want to know how deep the protocol headers go before
      deciding how large of an SKB to allocate and how much of the packet to
      place into the linear SKB area.
      
      For example, consider a driver which has a device which DMAs into
      pools of pages and then tells the driver where the data went in the
      DMA descriptor(s).  The driver can then build an SKB and reference
      most of the data via SKB fragments (which are page/offset/length
      triplets).
      
      However at least some of the front of the packet should be placed into
      the linear SKB area, which comes before the fragments, so that packet
      processing can get at the headers efficiently.  The first thing each
      protocol layer is going to do is a "pskb_may_pull()" so we might as
      well aggregate as much of this as possible while we're building the
      SKB in the driver.
      
      Part of supporting this is that we don't have an SKB yet, so we want
      to be able to let the flow dissector operate on a raw buffer in order
      to compute the offset of the end of the headers.
      
      So now we have a __skb_flow_dissect() which takes an explicit data
      pointer and length.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      690e36e7
  5. 23 8月, 2014 1 次提交
  6. 15 8月, 2014 3 次提交
    • H
      tcp: don't allow syn packets without timestamps to pass tcp_tw_recycle logic · a26552af
      Hannes Frederic Sowa 提交于
      tcp_tw_recycle heavily relies on tcp timestamps to build a per-host
      ordering of incoming connections and teardowns without the need to
      hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for
      the last measured RTO. To do so, we keep the last seen timestamp in a
      per-host indexed data structure and verify if the incoming timestamp
      in a connection request is strictly greater than the saved one during
      last connection teardown. Thus we can verify later on that no old data
      packets will be accepted by the new connection.
      
      During moving a socket to time-wait state we already verify if timestamps
      where seen on a connection. Only if that was the case we let the
      time-wait socket expire after the RTO, otherwise normal TCP_TIMEWAIT_LEN
      will be used. But we don't verify this on incoming SYN packets. If a
      connection teardown was less than TCP_PAWS_MSL seconds in the past we
      cannot guarantee to not accept data packets from an old connection if
      no timestamps are present. We should drop this SYN packet. This patch
      closes this loophole.
      
      Please note, this patch does not make tcp_tw_recycle in any way more
      usable but only adds another safety check:
      Sporadic drops of SYN packets because of reordering in the network or
      in the socket backlog queues can happen. Users behing NAT trying to
      connect to a tcp_tw_recycle enabled server can get caught in blackholes
      and their connection requests may regullary get dropped because hosts
      behind an address translator don't have synchronized tcp timestamp clocks.
      tcp_tw_recycle cannot work if peers don't have tcp timestamps enabled.
      
      In general, use of tcp_tw_recycle is disadvised.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a26552af
    • N
      tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced() · 4fab9071
      Neal Cardwell 提交于
      Make sure we use the correct address-family-specific function for
      handling MTU reductions from within tcp_release_cb().
      
      Previously AF_INET6 sockets were incorrectly always using the IPv6
      code path when sometimes they were handling IPv4 traffic and thus had
      an IPv4 dst.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Diagnosed-by: NWillem de Bruijn <willemb@google.com>
      Fixes: 563d34d0 ("tcp: dont drop MTU reduction indications")
      Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4fab9071
    • A
      tcp: don't use timestamp from repaired skb-s to calculate RTT (v2) · 9d186cac
      Andrey Vagin 提交于
      We don't know right timestamp for repaired skb-s. Wrong RTT estimations
      isn't good, because some congestion modules heavily depends on it.
      
      This patch adds the TCPCB_REPAIRED flag, which is included in
      TCPCB_RETRANS.
      
      Thanks to Eric for the advice how to fix this issue.
      
      This patch fixes the warning:
      [  879.562947] WARNING: CPU: 0 PID: 2825 at net/ipv4/tcp_input.c:3078 tcp_ack+0x11f5/0x1380()
      [  879.567253] CPU: 0 PID: 2825 Comm: socket-tcpbuf-l Not tainted 3.16.0-next-20140811 #1
      [  879.567829] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  879.568177]  0000000000000000 00000000c532680c ffff880039643d00 ffffffff817aa2d2
      [  879.568776]  0000000000000000 ffff880039643d38 ffffffff8109afbd ffff880039d6ba80
      [  879.569386]  ffff88003a449800 000000002983d6bd 0000000000000000 000000002983d6bc
      [  879.569982] Call Trace:
      [  879.570264]  [<ffffffff817aa2d2>] dump_stack+0x4d/0x66
      [  879.570599]  [<ffffffff8109afbd>] warn_slowpath_common+0x7d/0xa0
      [  879.570935]  [<ffffffff8109b0ea>] warn_slowpath_null+0x1a/0x20
      [  879.571292]  [<ffffffff816d0a05>] tcp_ack+0x11f5/0x1380
      [  879.571614]  [<ffffffff816d10bd>] tcp_rcv_established+0x1ed/0x710
      [  879.571958]  [<ffffffff816dc9da>] tcp_v4_do_rcv+0x10a/0x370
      [  879.572315]  [<ffffffff81657459>] release_sock+0x89/0x1d0
      [  879.572642]  [<ffffffff816c81a0>] do_tcp_setsockopt.isra.36+0x120/0x860
      [  879.573000]  [<ffffffff8110a52e>] ? rcu_read_lock_held+0x6e/0x80
      [  879.573352]  [<ffffffff816c8912>] tcp_setsockopt+0x32/0x40
      [  879.573678]  [<ffffffff81654ac4>] sock_common_setsockopt+0x14/0x20
      [  879.574031]  [<ffffffff816537b0>] SyS_setsockopt+0x80/0xf0
      [  879.574393]  [<ffffffff817b40a9>] system_call_fastpath+0x16/0x1b
      [  879.574730] ---[ end trace a17cbc38eb8c5c00 ]---
      
      v2: moving setting of skb->when for repaired skb-s in tcp_write_xmit,
          where it's set for other skb-s.
      
      Fixes: 431a9124 ("tcp: timestamp SYN+DATA messages")
      Fixes: 740b0f18 ("tcp: switch rtt estimations to usec resolution")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9d186cac
  7. 08 8月, 2014 1 次提交
  8. 07 8月, 2014 1 次提交
  9. 06 8月, 2014 4 次提交
    • W
      net-timestamp: add key to disambiguate concurrent datagrams · 09c2d251
      Willem de Bruijn 提交于
      Datagrams timestamped on transmission can coexist in the kernel stack
      and be reordered in packet scheduling. When reading looped datagrams
      from the socket error queue it is not always possible to unique
      correlate looped data with original send() call (for application
      level retransmits). Even if possible, it may be expensive and complex,
      requiring packet inspection.
      
      Introduce a data-independent ID mechanism to associate timestamps with
      send calls. Pass an ID alongside the timestamp in field ee_data of
      sock_extended_err.
      
      The ID is a simple 32 bit unsigned int that is associated with the
      socket and incremented on each send() call for which software tx
      timestamp generation is enabled.
      
      The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
      avoid changing ee_data for existing applications that expect it 0.
      The counter is reset each time the flag is reenabled. Reenabling
      does not change the ID of already submitted data. It is possible
      to receive out of order IDs if the timestamp stream is not quiesced
      first.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      09c2d251
    • W
      net-timestamp: move timestamp flags out of sk_flags · b9f40e21
      Willem de Bruijn 提交于
      sk_flags is reaching its limit. New timestamping options will not fit.
      Move all of them into a new field sk->sk_tsflags.
      
      Added benefit is that this removes boilerplate code to convert between
      SOF_TIMESTAMPING_.. and SOCK_TIMESTAMPING_.. in getsockopt/setsockopt.
      
      SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive
      timestamp logic (netstamp_needed). That can be simplified and this
      last key removed, but will leave that for a separate patch.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs,
      though that scatters tstamp fields throughout the struct.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b9f40e21
    • W
      net-timestamp: extend SCM_TIMESTAMPING ancillary data struct · f24b9be5
      Willem de Bruijn 提交于
      Applications that request kernel tx timestamps with SO_TIMESTAMPING
      read timestamps as recvmsg() ancillary data. The response is defined
      implicitly as timespec[3].
      
      1) define struct scm_timestamping explicitly and
      
      2) add support for new tstamp types. On tx, scm_timestamping always
         accompanies a sock_extended_err. Define previously unused field
         ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to
         define the existing behavior.
      
      The reception path is not modified. On rx, no struct similar to
      sock_extended_err is passed along with SCM_TIMESTAMPING.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f24b9be5
    • N
      tcp: reduce spurious retransmits due to transient SACK reneging · 5ae344c9
      Neal Cardwell 提交于
      This commit reduces spurious retransmits due to apparent SACK reneging
      by only reacting to SACK reneging that persists for a short delay.
      
      When a sequence space hole at snd_una is filled, some TCP receivers
      send a series of ACKs as they apparently scan their out-of-order queue
      and cumulatively ACK all the packets that have now been consecutiveyly
      received. This is essentially misbehavior B in "Misbehaviors in TCP
      SACK generation" ACM SIGCOMM Computer Communication Review, April
      2011, so we suspect that this is from several common OSes (Windows
      2000, Windows Server 2003, Windows XP). However, this issue has also
      been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
      into spurious retransmissions by lack of timestamps?" from March 2014,
      where the receiver was thought to be a BSD box.
      
      Since snd_una would temporarily be adjacent to a previously SACKed
      range in these scenarios, this receiver behavior triggered the Linux
      SACK reneging code path in the sender. This led the sender to clear
      the SACK scoreboard, enter CA_Loss, and spuriously retransmit
      (potentially) every packet from the entire write queue at line rate
      just a few milliseconds before the ACK for each packet arrives at the
      sender.
      
      To avoid such situations, now when a sender sees apparent reneging it
      does not yet retransmit, but rather adjusts the RTO timer to give the
      receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
      that will restore sanity to the SACK scoreboard. If the reneging
      persists until this RTO then, as before, we clear the SACK scoreboard
      and enter CA_Loss.
      
      A 10ms delay tolerates a receiver sending such a stream of ACKs at
      56Kbit/sec. And to allow for receivers with slower or more congested
      paths, we wait for at least RTT/2.
      
      We validated the resulting max(RTT/2, 10ms) delay formula with a mix
      of North American and South American Google web server traffic, and
      found that for ACKs displaying transient reneging:
      
       (1) 90% of inter-ACK delays were less than 10ms
       (2) 99% of inter-ACK delays were less than RTT/2
      
      In tests on Google web servers this commit reduced reneging events by
      75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
      any measurable impact on latency for user HTTP and SPDY requests.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ae344c9
  10. 03 8月, 2014 3 次提交
  11. 01 8月, 2014 4 次提交
    • P
      netlabel: shorter names for the NetLabel catmap funcs/structs · 4fbe63d1
      Paul Moore 提交于
      Historically the NetLabel LSM secattr catmap functions and data
      structures have had very long names which makes a mess of the NetLabel
      code and anyone who uses NetLabel.  This patch renames the catmap
      functions and structures from "*_secattr_catmap_*" to just "*_catmap_*"
      which improves things greatly.
      
      There are no substantial code or logic changes in this patch.
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      Tested-by: NCasey Schaufler <casey@schaufler-ca.com>
      4fbe63d1
    • P
      netlabel: fix the horribly broken catmap functions · 4b8feff2
      Paul Moore 提交于
      The NetLabel secattr catmap functions, and the SELinux import/export
      glue routines, were broken in many horrible ways and the SELinux glue
      code fiddled with the NetLabel catmap structures in ways that we
      probably shouldn't allow.  At some point this "worked", but that was
      likely due to a bit of dumb luck and sub-par testing (both inflicted
      by yours truly).  This patch corrects these problems by basically
      gutting the code in favor of something less obtuse and restoring the
      NetLabel abstractions in the SELinux catmap glue code.
      
      Everything is working now, and if it decides to break itself in the
      future this code will be much easier to debug than the code it
      replaces.
      
      One noteworthy side effect of the changes is that it is no longer
      necessary to allocate a NetLabel catmap before calling one of the
      NetLabel APIs to set a bit in the catmap.  NetLabel will automatically
      allocate the catmap nodes when needed, resulting in less allocations
      when the lowest bit is greater than 255 and less code in the LSMs.
      
      Cc: stable@vger.kernel.org
      Reported-by: NChristian Evans <frodox@zoho.com>
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      Tested-by: NCasey Schaufler <casey@schaufler-ca.com>
      4b8feff2
    • P
      netlabel: fix a problem when setting bits below the previously lowest bit · 41c3bd20
      Paul Moore 提交于
      The NetLabel category (catmap) functions have a problem in that they
      assume categories will be set in an increasing manner, e.g. the next
      category set will always be larger than the last.  Unfortunately, this
      is not a valid assumption and could result in problems when attempting
      to set categories less than the startbit in the lowest catmap node.
      In some cases kernel panics and other nasties can result.
      
      This patch corrects the problem by checking for this and allocating a
      new catmap node instance and placing it at the front of the list.
      
      Cc: stable@vger.kernel.org
      Reported-by: NChristian Evans <frodox@zoho.com>
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      Tested-by: NCasey Schaufler <casey@schaufler-ca.com>
      41c3bd20
    • J
      sctp: Fixup v4mapped behaviour to comply with Sock API · 299ee123
      Jason Gunthorpe 提交于
      The SCTP socket extensions API document describes the v4mapping option as
      follows:
      
      8.1.15.  Set/Clear IPv4 Mapped Addresses (SCTP_I_WANT_MAPPED_V4_ADDR)
      
         This socket option is a Boolean flag which turns on or off the
         mapping of IPv4 addresses.  If this option is turned on, then IPv4
         addresses will be mapped to V6 representation.  If this option is
         turned off, then no mapping will be done of V4 addresses and a user
         will receive both PF_INET6 and PF_INET type addresses on the socket.
         See [RFC3542] for more details on mapped V6 addresses.
      
      This description isn't really in line with what the code does though.
      
      Introduce addr_to_user (renamed addr_v4map), which should be called
      before any sockaddr is passed back to user space. The new function
      places the sockaddr into the correct format depending on the
      SCTP_I_WANT_MAPPED_V4_ADDR option.
      
      Audit all places that touched v4mapped and either sanely construct
      a v4 or v6 address then call addr_to_user, or drop the
      unnecessary v4mapped check entirely.
      
      Audit all places that call addr_to_user and verify they are on a sycall
      return path.
      
      Add a custom getname that formats the address properly.
      
      Several bugs are addressed:
       - SCTP_I_WANT_MAPPED_V4_ADDR=0 often returned garbage for
         addresses to user space
       - The addr_len returned from recvmsg was not correct when
         returning AF_INET on a v6 socket
       - flowlabel and scope_id were not zerod when promoting
         a v4 to v6
       - Some syscalls like bind and connect behaved differently
         depending on v4mapped
      
      Tested bind, getpeername, getsockname, connect, and recvmsg for proper
      behaviour in v4mapped = 1 and 0 cases.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Tested-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      299ee123
  12. 31 7月, 2014 7 次提交
  13. 30 7月, 2014 2 次提交
  14. 29 7月, 2014 2 次提交
    • E
      ip: make IP identifiers less predictable · 04ca6973
      Eric Dumazet 提交于
      In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
      Jedidiah describe ways exploiting linux IP identifier generation to
      infer whether two machines are exchanging packets.
      
      With commit 73f156a6 ("inetpeer: get rid of ip_id_count"), we
      changed IP id generation, but this does not really prevent this
      side-channel technique.
      
      This patch adds a random amount of perturbation so that IP identifiers
      for a given destination [1] are no longer monotonically increasing after
      an idle period.
      
      Note that prandom_u32_max(1) returns 0, so if generator is used at most
      once per jiffy, this patch inserts no hole in the ID suite and do not
      increase collision probability.
      
      This is jiffies based, so in the worst case (HZ=1000), the id can
      rollover after ~65 seconds of idle time, which should be fine.
      
      We also change the hash used in __ip_select_ident() to not only hash
      on daddr, but also saddr and protocol, so that ICMP probes can not be
      used to infer information for other protocols.
      
      For IPv6, adds saddr into the hash as well, but not nexthdr.
      
      If I ping the patched target, we can see ID are now hard to predict.
      
      21:57:11.008086 IP (...)
          A > target: ICMP echo request, seq 1, length 64
      21:57:11.010752 IP (... id 2081 ...)
          target > A: ICMP echo reply, seq 1, length 64
      
      21:57:12.013133 IP (...)
          A > target: ICMP echo request, seq 2, length 64
      21:57:12.015737 IP (... id 3039 ...)
          target > A: ICMP echo reply, seq 2, length 64
      
      21:57:13.016580 IP (...)
          A > target: ICMP echo request, seq 3, length 64
      21:57:13.019251 IP (... id 3437 ...)
          target > A: ICMP echo reply, seq 3, length 64
      
      [1] TCP sessions uses a per flow ID generator not changed by this patch.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJeffrey Knockel <jeffk@cs.unm.edu>
      Reported-by: NJedidiah R. Crandall <crandall@cs.unm.edu>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hannes Frederic Sowa <hannes@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04ca6973
    • M
      netlink: Fix shadow warning on jiffies · d87de1f3
      Mark Rustad 提交于
      Change formal parameter name to not shadow the global jiffies.
      Signed-off-by: NMark Rustad <mark.d.rustad@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d87de1f3
  15. 28 7月, 2014 3 次提交