1. 28 May, 2013 2 commits
    • D
      netpoll: remove return value from netpoll_rx_disable() · da6e378b
      dingtianhong committed
      netpoll_rx_disable() always returns 0, so its return value is useless and
      checking it only adds clutter. Remove the return value and drop the
      checks in __dev_open() and __dev_close().
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      da6e378b
    • S
      MPLS: Add limited GSO support · 0d89d203
      Simon Horman committed
      In the case where a non-MPLS packet is received and an MPLS stack is
      added it may well be the case that the original skb is GSO but the
      NIC used for transmit does not support GSO of MPLS packets.
      
      The aim of this code is to provide GSO in software for MPLS packets
      whose skbs are GSO.
      
      SKB Usage:
      
      When an implementation adds an MPLS stack to a non-MPLS packet it should do
      the following to skb metadata:
      
      * Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
        skb->inner_protocol is added by this patch.
      
      * Set skb->protocol to the new MPLS ethertype of the packet.
      
      * Set skb->network_header to correspond to the
        end of the L3 header, including the MPLS label stack.
      
      I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
      kernel" which adds MPLS support to the kernel datapath of Open vSwitch.
      That patch sets the above requirements in datapath/actions.c:push_mpls()
      and was used to exercise this code.  The datapath patch is against the Open
      vSwitch tree but it is intended that it be added to the Open vSwitch code
      present in the mainline Linux kernel at some point.
      
      Features:
      
      I believe that the approach that I have taken is at least partially
      consistent with the handling of other protocols.  Jesse, I understand that
      you have some ideas here.  I am more than happy to change my implementation.
      
      This patch adds dev->mpls_features which may be used by devices
      to advertise features supported for MPLS packets.
      
      A new NETIF_F_MPLS_GSO feature is added for devices which support
      hardware MPLS GSO offload.  Currently no devices support this
      and MPLS GSO always falls back to software.
      
      Alternate Implementation:
      
      One possible alternate implementation is to teach netif_skb_features()
      and skb_network_protocol() about MPLS, in a similar way to their
      understanding of VLANs. I believe this would avoid the need
      for net/mpls/mpls_gso.c and in particular the calls to
      __skb_push() and __skb_push() in mpls_gso_segment().
      
      I have decided on the implementation in this patch as it should
      not introduce any overhead in the case where mpls_gso is not compiled
      into the kernel or inserted as a module.
      
      MPLS GSO suggested by Jesse Gross.
      Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
      by Pravin B Shelar.
      
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d89d203
  2. 26 May, 2013 5 commits
  3. 24 May, 2013 2 commits
  4. 23 May, 2013 9 commits
    • T
      xfrm: properly handle invalid states as an error · 497574c7
      Timo Teräs committed
      The error exit path needs err explicitly set. Otherwise it
      returns success and the only caller, xfrm_output_resume(),
      would oops in skb_dst(skb)->ops dereference as skb_dst(skb) is
      NULL.
      
      Bug introduced in commit bb65a9cb (xfrm: removes a superfluous
      check and add a statistic).
      Signed-off-by: Timo Teräs <timo.teras@iki.fi>
      Cc: Li RongQing <roy.qing.li@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      497574c7
    • C
      ipv6: use ipv6_addr_scope() helper · 88924753
      Cong Wang committed
      ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK could be replaced
      by ipv6_addr_scope(), which is slightly faster.
      
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88924753
    • C
      ipv6: use ipv6_addr_any() helper · 7996c799
      Cong Wang committed
      ipv6_addr_any() is a faster way to determine if an addr
      is ipv6 any addr, no need to compute the addr type.
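The idea can be illustrated in user space: checking for the any address (::) only requires testing that all 16 bytes are zero, with no address classification involved. This is a minimal sketch, not the kernel helper; `addr_is_any()` is a hypothetical name chosen for illustration.

```c
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace analogue of the check described above:
 * an IPv6 address is the "any" address (::) when all 16 bytes are zero.
 * No address-type classification is needed. */
static bool addr_is_any(const struct in6_addr *a)
{
    static const struct in6_addr zero; /* static storage is zero-initialised */

    return memcmp(a, &zero, sizeof(zero)) == 0;
}
```

The standard macro `IN6_IS_ADDR_UNSPECIFIED()` from `<netinet/in.h>` performs the same test in portable user-space code.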
      
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7996c799
    • N
      tcp: bug fix in proportional rate reduction. · 35f079eb
      Nandita Dukkipati committed
      This patch is a fix for a bug triggering newly_acked_sacked < 0
      in tcp_ack(.).
      
      The bug is triggered by sacked_out decreasing relative to prior_sacked,
      but packets_out remaining the same as prior_packets. This is because the
      snapshot of prior_packets is taken after tcp_sacktag_write_queue() while
      prior_sacked is captured before tcp_sacktag_write_queue(). The problem
      is: tcp_sacktag_write_queue (tcp_match_skb_to_sack() -> tcp_fragment)
      adjusts the pcount for packets_out and sacked_out (MSS change or other
      reason). As a result, this delta in pcount is reflected in
      (prior_sacked - sacked_out) but not in (prior_packets - packets_out).
      
      This patch does the following:
      1) initializes prior_packets at the start of tcp_ack() so as to
      capture the delta in packets_out created by tcp_fragment.
      2) introduces a new "previous_packets_out" variable that snapshots
      packets_out right before tcp_clean_rtx_queue, so pkts_acked can be
      correctly computed as before.
      3) Computes pkts_acked using previous_packets_out, and computes
      newly_acked_sacked using prior_packets.
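The snapshot-ordering problem can be reproduced with plain arithmetic. The sketch below is a reduced model under stated assumptions: `struct tcp_counters` and `simulate_ack()` are hypothetical, the ACK is assumed to acknowledge nothing new, and the pcount adjustment is modelled as subtracting the same `delta` from both counters.

```c
/* Hypothetical, reduced model of the two counters named in the message. */
struct tcp_counters {
    int packets_out;  /* segments in flight */
    int sacked_out;   /* segments marked as SACKed */
};

/* Both deltas must be measured from snapshots taken at the same time. */
static int newly_acked_sacked(int prior_packets, int prior_sacked,
                              const struct tcp_counters *now)
{
    return (prior_packets - now->packets_out) -
           (prior_sacked - now->sacked_out);
}

/* Simulate one ACK during which a pcount adjustment removes `delta` from
 * both counters and nothing is newly acked or SACKed.
 * snapshot_early != 0 models the fixed ordering (prior_packets captured at
 * the start); snapshot_early == 0 models the buggy ordering (captured after
 * the adjustment). tp is passed by value, so the caller's copy is unchanged. */
static int simulate_ack(struct tcp_counters tp, int delta, int snapshot_early)
{
    int prior_sacked = tp.sacked_out;    /* always captured first */
    int prior_packets = tp.packets_out;  /* fixed ordering: captured first too */

    tp.packets_out -= delta;             /* pcount adjustment */
    tp.sacked_out -= delta;

    if (!snapshot_early)
        prior_packets = tp.packets_out;  /* buggy ordering: captured too late */

    return newly_acked_sacked(prior_packets, prior_sacked, &tp);
}
```

Starting from packets_out = 10 and sacked_out = 4 with delta = 1, the late snapshot yields (9 - 9) - (4 - 3) = -1, while the early snapshot yields (10 - 9) - (4 - 3) = 0.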
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35f079eb
    • E
      sch_tbf: segment too big GSO packets · e43ac79a
      Eric Dumazet committed
      If a GSO packet has a length above tbf burst limit, the packet
      is currently silently dropped.
      
      The current way to handle this is to put the device in non-GSO/TSO mode
      or to set high bursts, which is suboptimal.
      
      We can actually segment too big GSO packets, and send individual
      segments as tbf parameters allow, allowing for better interoperability.
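A rough sketch of the idea, assuming a simplified packet model: instead of dropping an oversized GSO packet as a whole, split it into MSS-sized segments and test each one against the limit individually. `tbf_segment_sketch()` and its parameters are hypothetical and do not mirror the kernel function.

```c
#include <stddef.h>

/* Hypothetical simplification: a GSO packet carries `payload` bytes, split
 * into segments of at most `mss` payload bytes, each prefixed by `hdr` bytes
 * of headers. A segment is accepted when its total length fits max_size.
 * Returns the number of accepted segments; *dropped receives the rest. */
static size_t tbf_segment_sketch(size_t payload, size_t mss, size_t hdr,
                                 size_t max_size, size_t *dropped)
{
    size_t sent = 0;

    *dropped = 0;
    if (mss == 0)               /* nothing sensible to segment */
        return 0;

    while (payload > 0) {
        size_t chunk = payload < mss ? payload : mss;

        if (hdr + chunk <= max_size)
            sent++;             /* segment fits: would be enqueued */
        else
            (*dropped)++;       /* segment still too big: dropped */
        payload -= chunk;
    }
    return sent;
}
```

With a 4000-byte payload, an MSS of 1448, 52 bytes of headers and a 1514-byte limit, the whole packet (4052 bytes) exceeds the limit but each of its three segments fits.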
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e43ac79a
    • S
      net: Loosen constraints for recalculating checksum in skb_segment() · 1cdbcb79
      Simon Horman committed
      This is a generic solution to resolve a specific problem that I have observed.
      
      If the encapsulation of an skb changes then ability to offload checksums
      may also change. In particular it may be necessary to perform checksumming
      in software.
      
      An example of such a case is where a non-GRE packet is received but
      is to be encapsulated and transmitted as GRE.
      
      Another example relates to my proposed support for packets
      that are non-MPLS when received but MPLS when transmitted.
      
      The cost of this change is that the value of the csum variable may be
      checked when it previously was not. In the case where the csum variable is
      true this is pure overhead. In the case where the csum variable is false it
      leads to software checksumming, which I believe also leads to correct
      checksums in transmitted packets for the cases described above.
      
      Further analysis:
      
      This patch relies on the return value of can_checksum_protocol()
      being correct and in turn the return value of skb_network_protocol(),
      used to provide the protocol parameter of can_checksum_protocol(),
      being correct. It also relies on the features passed to skb_segment()
      and in turn to can_checksum_protocol() being correct.
      
      I believe that this problem has not been observed for VLANs because it
      appears that almost all drivers, the exception being xgbe, set
      vlan_features such that the checksum offload support for VLAN packets
      is greater than or equal to that of non-VLAN packets.
      
      I wonder if the code in xgbe may be an oversight and the hardware does
      support checksumming of VLAN packets.  If so it may be worth updating the
      vlan_features of the driver as this patch will force such checksums to be
      performed in software rather than hardware.
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1cdbcb79
    • C
      bridge: send query as soon as leave is received · 6b7df111
      Cong Wang committed
      Continue sending queries when leave is received if the user marks
      it as a querier.
      
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Adam Baker <linux@baker-net.org.uk>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b7df111
    • C
      bridge: only expire the mdb entry when query is received · 9f00b2e7
      Cong Wang committed
      Currently we arm the expire timer when the mdb entry is added;
      however, this causes a problem when no query is sent out
      after that.
      
      So we should only arm the timer when a corresponding query is
      received, as suggested by Herbert.
      
      And he also mentioned "if there is no querier then group
      subscriptions shouldn't expire. There has to be at least one querier
      in the network for this thing to work.  Otherwise it just degenerates
      into a non-snooping switch, which is OK."
      
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Adam Baker <linux@baker-net.org.uk>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f00b2e7
    • C
      bridge: use the bridge IP addr as source addr for querier · 1c8ad5bf
      Cong Wang committed
      Quote from Adam:
      "If it is believed that the use of 0.0.0.0
      as the IP address is what is causing strange behaviour on other devices
      then is there a good reason that a bridge rather than a router shouldn't
      be the active querier? If not then using the bridge IP address and
      having the querier enabled by default may be a reasonable solution
      (provided that our querier obeys the election rules and shuts up if it
      sees a query from a lower IP address that isn't 0.0.0.0). Just because a
      device is the elected querier for IGMP doesn't appear to mean it is
      required to perform any other routing functions."
      
      And introduce a new toggle for it, as suggested by Herbert.
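The election rule mentioned in the quote (stay quiet when a query arrives from a lower source address that is not 0.0.0.0) can be sketched as below. This only illustrates the rule as quoted; whether and how the patch itself implements it is not stated here, and `should_yield_querier()` is a hypothetical name.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* Decide whether our querier should stop sending queries after seeing a
 * query from another source. Both arguments are IPv4 addresses in network
 * byte order, as stored in struct in_addr. */
static bool should_yield_querier(uint32_t own_be, uint32_t seen_be)
{
    uint32_t own = ntohl(own_be);
    uint32_t seen = ntohl(seen_be);

    if (seen == 0)       /* a 0.0.0.0 source never wins the election */
        return false;

    return seen < own;   /* the lower non-zero address wins */
}
```

For example, a bridge at 192.168.1.10 would yield to a querier at 192.168.1.1 but not to one at 192.168.1.20 or to a query sourced from 0.0.0.0.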
      Suggested-by: Adam Baker <linux@baker-net.org.uk>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Adam Baker <linux@baker-net.org.uk>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1c8ad5bf
  5. 22 May, 2013 1 commit
  6. 21 May, 2013 2 commits
    • E
      tcp: md5: remove spinlock usage in fast path · 71cea17e
      Eric Dumazet committed
      TCP md5 code uses per cpu variables but protects access to them with
      a shared spinlock, which is a contention point.
      
      [ tcp_md5sig_pool_lock is locked twice per incoming packet ]
      
      Makes things much simpler, by allocating crypto structures once, first
      time a socket needs md5 keys, and not deallocating them as they are
      really small.
      
      Next step would be to allow crypto allocations being done in a NUMA
      aware way.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      71cea17e
    • W
      rps: selective flow shedding during softnet overflow · 99bbc707
      Willem de Bruijn committed
      A cpu executing the network receive path sheds packets when its input
      queue grows to netdev_max_backlog. A single high rate flow (such as a
      spoofed source DoS) can exceed a single cpu processing rate and will
      degrade throughput of other flows hashed onto the same cpu.
      
      This patch adds a more fine grained hashtable. If the netdev backlog
      is above a threshold, IRQ cpus track the ratio of total traffic of
      each flow (using 4096 buckets, configurable). The ratio is measured
      by counting the number of packets per flow over the last 256 packets
      from the source cpu. Any flow that occupies a large fraction of this
      (set at 50%) will see packet drop while above the threshold.
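The accounting described above can be sketched in user space as a ring buffer of recent bucket indices plus a per-bucket counter. This is a minimal sketch under assumptions, not the kernel code: the structure and function names are hypothetical, the sizes are taken from the numbers in the message, and the backlog threshold (half of the maximum) is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define HISTORY_LEN 256   /* "last 256 packets" from the message; power of two */
#define NUM_BUCKETS 4096  /* bucket count from the message; power of two */

/* Hypothetical per-CPU state. */
struct flow_limit_sketch {
    unsigned int head;              /* next history slot to overwrite */
    uint16_t history[HISTORY_LEN];  /* bucket index of each recent packet */
    uint16_t buckets[NUM_BUCKETS];  /* packets per bucket within the history */
};

static void flow_limit_init(struct flow_limit_sketch *fl)
{
    memset(fl, 0, sizeof(*fl));
}

/* Returns true when the packet should be dropped. */
static bool flow_limit_drop(struct flow_limit_sketch *fl, uint32_t flow_hash,
                            unsigned int qlen, unsigned int max_backlog)
{
    uint16_t new_flow, old_flow;

    /* Only track flows while the backlog is above the threshold
     * (assumed here to be half of the maximum backlog). */
    if (qlen < max_backlog / 2)
        return false;

    new_flow = flow_hash & (NUM_BUCKETS - 1);

    /* Replace the oldest history entry with the new packet's bucket. */
    old_flow = fl->history[fl->head];
    fl->history[fl->head] = new_flow;
    fl->head = (fl->head + 1) & (HISTORY_LEN - 1);

    /* The guard keeps counters sane while the history is still filling. */
    if (fl->buckets[old_flow] > 0)
        fl->buckets[old_flow]--;

    /* Drop when this flow holds more than half of the recent history. */
    return ++fl->buckets[new_flow] > HISTORY_LEN / 2;
}
```

In this sketch a single flow sending back to back under a high backlog starts being dropped at its 129th packet, while a packet from a different flow arriving at that moment is still accepted.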
      
      Tested:
      Setup is a multi-threaded UDP echo server with network rx IRQ on cpu0,
      kernel receive (RPS) on cpu0 and application threads on cpus 2--7
      each handling 20k req/s. Throughput halves when hit with a 400 kpps
      antagonist storm. With this patch applied, antagonist overload is
      dropped and the server processes its complete load.
      
      The patch is effective when kernel receive processing is the
      bottleneck. The above RPS scenario is an extreme, but the same is
      reached with RFS and sufficient kernel processing (iptables, packet
      socket tap, ..).
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      99bbc707
  7. 20 May, 2013 6 commits
    • E
      ip_gre: fix a possible crash in ipgre_err() · 96f5a846
      Eric Dumazet committed
      Another fix needed in ipgre_err(), as parse_gre_header() might change
      skb->head.
      
      Bug added in commit c5441932 (GRE: Refactor GRE tunneling code.)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96f5a846
    • Y
      tcp: remove bad timeout logic in fast recovery · 3e59cb0d
      Yuchung Cheng committed
      tcp_timeout_skb() was intended to trigger fast recovery on timeout,
      unfortunately in reality it often causes spurious retransmission
      storms during fast recovery. The particular sign is a fast retransmit
      over the highest sacked sequence (SND.FACK).
      
      Currently the RTO timer re-arming (as in RFC6298) offers a nice cushion
      to avoid spurious timeout: when SND.UNA advances the sender re-arms
      RTO and extends the timeout by icsk_rto. The sender does not offset
      the time elapsed since the packet at SND.UNA was sent.
      
      But if the next (DUP)ACK arrives later than ~RTTVAR and triggers
      tcp_fastretrans_alert(), then tcp_timeout_skb() will mark any packet
      sent before the icsk_rto interval lost, including one that's above the
      highest sacked sequence. Most likely a large part of the scoreboard will be
      marked.
      
      If most packets are not lost then the subsequent DUPACKs with new SACK
      blocks will cause the sender to continue to retransmit packets beyond
      SND.FACK spuriously. Even if only one packet is lost the sender may
      falsely retransmit almost the entire window.
      
      The situation becomes common in the world of bufferbloat: the RTT
      continues to grow as the queue builds up but RTTVAR remains small and
      close to the minimum 200ms. If a data packet is lost and the DUPACK
      triggered by the next data packet is slightly delayed, then a spurious
      retransmission storm forms.
      
      As the original comment on tcp_timeout_skb() suggests: the usefulness
      of this feature is questionable. It also wastes cycles walking the
      sack scoreboard and is actually harmful because of false recovery.
      
      It's time to remove this.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3e59cb0d
    • R
      Hoist memcpy_fromiovec/memcpy_toiovec into lib/ · d2f83e90
      Rusty Russell committed
      ERROR: "memcpy_fromiovec" [drivers/vhost/vhost_scsi.ko] undefined!
      
      That function is only present with CONFIG_NET.  Turns out that
      crypto/algif_skcipher.c also uses that outside net, but it actually
      needs sockets anyway.
      
      In addition, commit 6d4f0139 added
      CONFIG_NET dependency to CONFIG_VMCI for memcpy_toiovec, so hoist
      that function and revert that commit too.
      
      socket.h already includes uio.h, so no callers need updating; trying
      only broke things for x86_64 randconfig (thanks Fengguang!).
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      d2f83e90
    • C
      net: irda: using kzalloc() instead of kmalloc() to avoid strncpy() issue. · ff0102ee
      Chen Gang committed
      The length of 'discovery->data.info' is 22 and NICKNAME_MAX_LEN is 21, so
      strncpy() always leaves the last byte of 'discovery->data.info'
      uninitialized.
      
      When 'text' is longer than 21 bytes (NICKNAME_MAX_LEN), that
      uninitialized last byte means the string may not be NUL-terminated, so
      the following strlen() can misbehave.
      
      Also, 'discovery->data' is a 'struct irda_device_info', which is defined in
      "include/uapi/..." and may be copied to user space, so it needs to be
      fully initialized.
      
      Altogether, use kzalloc() instead of kmalloc() so that all members are
      initialized first.
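The effect of zeroing the allocation can be shown in user space, with calloc() standing in for kzalloc(). This is a minimal sketch: `struct discovery_sketch` and `discovery_new()` are hypothetical, and only the two sizes come from the message.

```c
#include <stdlib.h>
#include <string.h>

#define INFO_LEN         22  /* size of the info field, per the message */
#define NICKNAME_MAX_LEN 21  /* copy limit, per the message */

/* Hypothetical stand-in for the structure that holds the nickname. */
struct discovery_sketch {
    char info[INFO_LEN];
};

/* calloc() plays the role of kzalloc() here: every byte starts as zero, so
 * info[21] is '\0' and info stays NUL-terminated even when strncpy() copies
 * the full 21 bytes without writing a terminator.
 * The caller owns the result and releases it with free(). */
static struct discovery_sketch *discovery_new(const char *text)
{
    struct discovery_sketch *d = calloc(1, sizeof(*d));

    if (!d)
        return NULL;

    strncpy(d->info, text, NICKNAME_MAX_LEN);
    return d;
}
```

With malloc() in place of calloc(), the final byte would hold whatever was in memory, and a later strlen() on `info` could run past the end of the field.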
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff0102ee
    • N
      ipv6: add support of peer address · caeaba79
      Nicolas Dichtel committed
      This patch adds the support of peer address for IPv6. For example, it is
      possible to specify the remote end of a 6inY tunnel.
      This was already possible in IPv4:
       ip addr add ip1 peer ip2 dev dev1
      
      The peer address is specified with IFA_ADDRESS and the local address with
      IFA_LOCAL (like explained in include/uapi/linux/if_addr.h).
      Note that the API is not changed, because before this patch, it was not
      possible to specify two different addresses in IFA_LOCAL and IFA_ADDRESS.
      There is a small change for the dump: if the peer is different from ::,
      IFA_ADDRESS will contain the peer address instead of the local address.
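The dump rule in the last sentence can be written as a one-line selection. This is a user-space sketch of the rule only, not the kernel's netlink code; `dump_ifa_address()` is a hypothetical helper.

```c
#include <netinet/in.h>

/* Pick the address that a dump would report in IFA_ADDRESS, following the
 * rule above: the peer address when one is set (peer != ::), otherwise the
 * local address. */
static const struct in6_addr *dump_ifa_address(const struct in6_addr *local,
                                               const struct in6_addr *peer)
{
    return IN6_IS_ADDR_UNSPECIFIED(peer) ? local : peer;
}
```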
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      caeaba79
    • P
      netlabel: improve domain mapping validation · 6b21e1b7
      Paul Moore committed
      The net/netlabel/netlabel_domainhash.c:netlbl_domhsh_add() function
      does not properly validate new domain hash entries resulting in
      potential problems when an administrator attempts to add an invalid
      entry.  One such problem, as reported by Vlad Halilov, is a kernel
      BUG (found in netlabel_domainhash.c:netlbl_domhsh_audit_add()) when
      adding an IPv6 outbound mapping with a CIPSO configuration.
      
      This patch corrects this problem by adding the necessary validation
      code to netlbl_domhsh_add() via the newly created
      netlbl_domhsh_validate() function.
      
      Ideally this patch should also be pushed to the currently active
      -stable trees.
      Reported-by: Vlad Halilov <vlad.halilov@gmail.com>
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b21e1b7
  8. 19 May, 2013 1 commit
  9. 18 May, 2013 1 commit
  10. 17 May, 2013 11 commits