1. 29 Nov, 2012 2 commits
    • ip6tnl/sit: drop packet if ECN present with not-ECT · f4e0b4c5
      Nicolas Dichtel committed
      This patch applies to ip6tnl and sit the change made by Stephen Hemminger in
      ipip and gre[6] in commit eccc1bb8 (tunnel: drop packet if ECN present with
      not-ECT).
      
      Goal is to handle RFC6040, Section 4.2:
      
      Default Tunnel Egress Behaviour.
       o If the inner ECN field is Not-ECT, the decapsulator MUST NOT
            propagate any other ECN codepoint onwards.  This is because the
            inner Not-ECT marking is set by transports that rely on dropped
            packets as an indication of congestion and would not understand or
            respond to any other ECN codepoint [RFC4774].  Specifically:
      
            *  If the inner ECN field is Not-ECT and the outer ECN field is
               CE, the decapsulator MUST drop the packet.
      
            *  If the inner ECN field is Not-ECT and the outer ECN field is
               Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the
               outgoing packet with the ECN field cleared to Not-ECT.
      
      The patch makes use of the common function added in net/inet_ecn.h.
      
      As was done for Xin4 tunnels, it adds logging to allow detecting broken
      systems that set ECN bits incorrectly when tunneling (or an intermediate
      router might be changing the header). Errors are also tracked via
      rx_frame_error.
      
      CC: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • pkt_sched: QFQ Plus: fair-queueing service at DRR cost · 462dbc91
      Paolo Valente committed
      This patch turns QFQ into QFQ+, a variant of QFQ that provides the
      following two benefits: 1) QFQ+ is faster than QFQ, 2) unlike QFQ,
      QFQ+ also correctly schedules non-leaf classes in a hierarchical
      setting. A detailed description of QFQ+, plus a performance
      comparison with DRR and QFQ, can be found in [1].
      
      [1] P. Valente, "Reducing the Execution Time of Fair-Queueing Schedulers"
      http://algo.ing.unimo.it/people/paolo/agg-sched/agg-sched.pdf
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 27 Nov, 2012 4 commits
    • ip6mr: Add sizeof verification to MRT6_ASSERT and MT6_PIM · 03f52a0a
      Joe Perches committed
      Verify the length of the user-space arguments.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name · c91f6df2
      Brian Haley committed
      Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which
      will then require another call like if_indextoname() to get the actual interface
      name, have it return the name directly.
      
      This also matches the existing man page description in socket(7), which
      describes the argument as an interface name.
      
      If the value has not been set, zero is returned and optlen will be set to zero
      to indicate there is no interface name present.
      
      Added a seqlock to protect this code path, and dev_ifname(), from someone
      changing the device name via dev_change_name().
      
      v2: Added seqlock protection while copying device name.
      
      v3: Fixed word wrap in patch.
      Signed-off-by: Brian Haley <brian.haley@hp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • atm: br2684: Fix excessive queue bloat · ae088d66
      David Woodhouse committed
      There's really no excuse for an additional wmem_default of buffering
      between the netdev queue and the ATM device. Two packets (one in-flight,
      and one ready to send) ought to be fine. It's not as if it should take
      long to get another from the netdev queue when we need it.
      
      If necessary we can make the queue space configurable later, but I don't
      think it's likely to be necessary.
      
      cf. commit 9d02daf7 (pppoatm: Fix
      excessive queue bloat) which did something very similar for PPPoATM.
      
      Note that there is a tremendously unlikely race condition which may
      result in qspace temporarily going negative. If a CPU running the
      br2684_pop() function goes off into the weeds for a long period of time
      after incrementing qspace to 1, but before calling netdev_wake_queue()...
      and another CPU ends up calling br2684_start_xmit() and *stopping* the
      queue again before the first CPU comes back, the netdev queue could
      end up being woken when qspace has already reached zero.
      
      An alternative approach to coping with this race would be to check in
      br2684_start_xmit() for qspace==0 and return NETDEV_TX_BUSY, but just
      using '> 0' and '< 1' for comparison instead of '== 0' and '!= 0' is
      simpler. It just warranted a mention of *why* we do it that way...
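
The comparison choice can be illustrated with a user-space model. This is a sketch, not the driver code: `struct vcc_queue`, `sketch_xmit()` and `sketch_pop()` are hypothetical names, C11 atomics stand in for the kernel's atomic_t, and mapping '< 1' to stopping and '> 0' to waking is an assumption drawn from the description above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-VCC state; not the kernel structure. */
struct vcc_queue {
    atomic_int qspace;    /* packets that may still be handed to the ATM device */
    atomic_bool stopped;  /* models the stopped/woken state of the netdev queue */
};

/* Transmit path: account for the packet, stop the queue when no space is left. */
void sketch_xmit(struct vcc_queue *q)
{
    /* atomic_fetch_sub() returns the old value; subtract 1 for the new one. */
    if (atomic_fetch_sub(&q->qspace, 1) - 1 < 1)
        atomic_store(&q->stopped, true);
}

/* Completion path: return the slot, wake the queue only once space is positive. */
void sketch_pop(struct vcc_queue *q)
{
    if (atomic_fetch_add(&q->qspace, 1) + 1 > 0)
        atomic_store(&q->stopped, false);
}
```

With these range comparisons, a transient qspace of -1 still keeps the queue stopped and a later pop back to 0 does not wake it, whereas tests for exactly 0 would miss the negative value.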
      
      Move the call to atmvcc->send() to happen *after* the accounting and
      potentially stopping the netdev queue, in br2684_xmit_vcc(). This matters
      if the ->send() call suffers an immediate failure, because it'll call
      br2684_pop() with the offending skb before returning. We want that to
      happen *after* we've done the initial accounting for the packet in
      question. Also make it return an appropriate success/failure indication
      while we're at it.
      
      Tested by running 'ping -l 1000 bottomless.aaisp.net.uk' from within my
      network, with only a single PPPoE-over-BR2684 link running. And after
      setting txqueuelen on the nas0 interface to something low (5, in fact).
      Before the patch, we'd see about 15 packets being queued and a resulting
      latency of ~56ms being reached. After the patch, we see only about 8,
      which is fairly much what we expect. And a max latency of ~36ms. On this
      OpenWRT box, wmem_default is 163840.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Reviewed-by: Krzysztof Mazur <krzysiek@podlesie.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dsa: Hide core config options; make drivers select what they need · b3422a31
      Ben Hutchings committed
      Commit 82167cb8 ('net: dsa/slave: Fix
      compilation warnings') fixed one possible invalid configuration
      (NET_DSA enabled with no trailer formats) but added others: drivers
      can select NET_DSA without its dependencies being met.
      
      It's not very useful to make either the DSA core or the tagging
      formats manually selectable without a driver to use them, so:
      
      1. Define a hidden HAVE_NET_DSA option and move the dependencies of
         NET_DSA to that.  While we're at it, drop the deprecated
         EXPERIMENTAL dependency.
      2. Make NET_DSA and the drivers dependent on HAVE_NET_DSA.
      3. Hide the tagging format options again.
      4. Make drivers select both NET_DSA and the appropriate tagging format
         option.
      Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 26 Nov, 2012 4 commits
    • ipv4/ipmr and ipv6/ip6mr: Convert int mroute_do_<foo> to bool · 53d6841d
      Joe Perches committed
      Save a few bytes per table by converting mroute_do_assert and
      mroute_do_pim from int to bool.
      
      Remove !! as the compiler does that when assigning int to bool.
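
The conversion rule this relies on can be shown with a tiny stand-alone example. The function names are hypothetical; the behaviour is that of the C99 `_Bool` type, which the kernel's `bool` is a typedef of.

```c
#include <stdbool.h>

/* Explicit normalisation, needed when the destination is an int. */
int flag_as_int(int v)
{
    return !!v;
}

/* Conversion to bool already yields 0 or 1 for any scalar value. */
bool flag_as_bool(int v)
{
    bool b = v;
    return b;
}
```

Both functions map any non-zero input to 1, which is why the `!!` becomes redundant once the field itself is a bool.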
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: ipmr: various fixes and cleanups · 5e1859fb
      Eric Dumazet committed
      1) ip_mroute_setsockopt() & ip_mroute_getsockopt() should not
         access/set raw_sk(sk)->ipmr_table before making sure the socket
         is a raw socket, and protocol is IGMP
      
      2) MRT_INIT should return -EINVAL if optlen != sizeof(int), not
         -ENOPROTOOPT
      
      3) MRT_ASSERT & MRT_PIM should validate optlen
      
      4) " (v) ? 1 : 0 " can be written as " !!v "
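
Items 2 to 4 amount to a length check followed by a normalised store. A user-space sketch of that pattern is shown below; `parse_flag_option()` is a hypothetical helper, and a plain memcpy() stands in for the kernel's user-copy primitives.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not the kernel function: parse an int-sized
 * option value into a flag, rejecting any other length with -EINVAL.
 */
int parse_flag_option(const void *optval, size_t optlen, bool *flag)
{
    int v;

    if (optlen != sizeof(int))
        return -EINVAL;

    memcpy(&v, optval, sizeof(v));
    *flag = !!v;   /* same result as (v) ? 1 : 0 */
    return 0;
}
```

Checking the length before touching the buffer means a short or oversized argument never reaches the copy.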
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa/slave: Fix compilation warnings · 82167cb8
      viresh kumar committed
      Currently, when none of CONFIG_NET_DSA_TAG_DSA, CONFIG_NET_DSA_TAG_EDSA and
      CONFIG_NET_DSA_TAG_TRAILER is defined, we get the following compilation warnings:
      
      net/dsa/slave.c:51:12: warning: 'dsa_slave_init' defined but not used [-Wunused-function]
      net/dsa/slave.c:60:12: warning: 'dsa_slave_open' defined but not used [-Wunused-function]
      net/dsa/slave.c:98:12: warning: 'dsa_slave_close' defined but not used [-Wunused-function]
      net/dsa/slave.c:116:13: warning: 'dsa_slave_change_rx_flags' defined but not used [-Wunused-function]
      net/dsa/slave.c:127:13: warning: 'dsa_slave_set_rx_mode' defined but not used [-Wunused-function]
      net/dsa/slave.c:136:12: warning: 'dsa_slave_set_mac_address' defined but not used [-Wunused-function]
      net/dsa/slave.c:164:12: warning: 'dsa_slave_ioctl' defined but not used [-Wunused-function]
      
      An earlier approach to fixing this was discussed here:
      
      lkml.org/lkml/2012/10/29/549
      
      This is another approach to fixing it, through changes to the config
      options that make more sense than the earlier approach. Since at least one
      tagging option must always be selected to use the net/dsa/ infrastructure, this
      patch selects NET_DSA from the tagging configs instead of making it a selectable
      config.
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: enable CAN Identifier to be build into kernel · a303fbf3
      Marc Kleine-Budde committed
      This patch makes it possible to build the CAN Identifier into the kernel, even
      if the CAN support is built as a module.
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 23 Nov, 2012 5 commits
    • ipv4: do not cache looped multicasts · 63617421
      Julian Anastasov committed
      	Starting from 3.6 we cache output routes for
      multicasts only when using route to 224/4. For local receivers
      we can set RTCF_LOCAL flag depending on the membership but
      in such case we use maddr and saddr which are not caching
      keys as before. Additionally, we can not use same place to
      cache routes that differ in RTCF_LOCAL flag value.
      
      	Fix it by caching only RTCF_MULTICAST entries
      without RTCF_LOCAL (send-only, no loopback). As a side effect,
      we avoid unneeded lookup for fnhe when not caching because
      multicasts are not redirected and they do not learn PMTU.
      
      	Thanks to Maxime Bizon for showing the caching
      problems in __mkroute_output for 3.6 kernels: different
      RTCF_LOCAL flag in cache can lead to wrong ip_mc_output or
      ip_output call and the visible problem is that traffic can
      not reach local receivers via loopback.
      Reported-by: Maxime Bizon <mbizon@freebox.fr>
      Tested-by: Maxime Bizon <mbizon@freebox.fr>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: adapt connect for repair move · 2b916477
      Andrey Vagin committed
      This works the same as for ipv4.
      
      All other hacks about tcp repair are in common code for ipv4 and ipv6,
      so this patch is enough for repairing ipv6 connections.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: delete TIPC_ADVANCED Kconfig variable · 94fc9c47
      Paul Gortmaker committed
      There used to be a time when TIPC had lots of Kconfig knobs the
      end user could alter, but they have all been made automatic or
      obsolete, with the exception of CONFIG_TIPC_PORTS.  This
      previously existing set of options was all hidden under the
      TIPC_ADVANCED setting, which does not exist in any code, but
      only in Kconfig scope.
      
      Having this now, just to hide the one remaining "advanced"
      option no longer makes sense.  Remove it.  Also get rid of the
      ifdeffery in the TIPC code that allowed for TIPC_PORTS to be
      possibly undefined.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
    • tipc: eliminate an unnecessary cast of node variable · 4cb7d55a
      Ying Xue committed
      As the variable 'node' is currently defined as type u32, it is
      unnecessary to cast it to u32 again when using it.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
    • tipc: introduce message to synchronize broadcast link · c64f7a6a
      Jon Maloy committed
      Upon establishing a first link between two nodes, there is
      currently a risk that the two endpoints will disagree on exactly
      which sequence number reception and acknowledging of broadcast
      packets should start from.
      
      The following scenarios may happen:
      
      1: Node A sends an ACTIVATE message to B, telling it to start acking
         packets from sequence number N.
      2: Node A sends out broadcast N, but does not expect an acknowledge
         from B, since B is not yet in its broadcast receiver's list.
      3: Node A receives ACK for N from all nodes except B, and releases
         packet N.
      4: Node B receives the ACTIVATE, activates its link endpoint, and
         stores the value N as sequence number of first expected packet.
      5: Node B sends a NAME_DISTR message to A.
      6: Node A receives the NAME_DISTR message, and activates its endpoint.
         At this moment B is added to A's broadcast receiver's set.
         Node A also sets sequence number 0 as the first broadcast packet
         to be received from B.
      7: Node A sends broadcast N+1.
      8: B receives N+1, determines there is a gap in the sequence, since
         it is expecting N, and sends a NACK for N back to A.
      9: Node A has already released N, so no retransmission is possible.
         The broadcast link in direction A->B is stale.
      
      In addition to, or instead of, 7-9 above, the following may happen:
      
      10: Node B sends broadcast M > 0 to A.
      11: Node A receives M, falsely decides there must be a gap, since
          it is expecting packet 0, and asks for retransmission of packets
          [0,M-1].
      12: Node B has already released these packets, so the broadcast
          link is stale in direction B->A.
      
      We solve this problem by introducing a new unicast message type,
      BCAST_PROTOCOL/STATE, to convey the sequence number of the next
      sent broadcast packet to the other endpoint, at exactly the moment
      that endpoint is added to the own node's broadcast receivers list,
      and before any other unicast messages are permitted to be sent.
      
      Furthermore, we don't allow any node to start receiving and
      processing broadcast packets until this new synchronization
      message has been received.
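
The receive-side rule described above can be modelled as a small state machine. This is a hypothetical sketch, not the TIPC implementation: the type and function names are invented for illustration, the 16-bit sequence number width is an assumption, and any packet other than the expected one is simply reported as out of sequence.

```c
#include <stdbool.h>
#include <stdint.h>

enum bc_result { BC_IGNORED, BC_ACCEPTED, BC_OUT_OF_SEQUENCE };

/* Hypothetical per-peer receive state; field names are illustrative. */
struct bc_peer {
    bool recv_permitted;     /* set once the sync message has arrived */
    uint16_t next_expected;  /* first broadcast sequence number to accept */
};

/* Sync message received: learn where the peer's broadcast stream starts. */
void bc_on_sync(struct bc_peer *p, uint16_t next_sent)
{
    p->next_expected = next_sent;
    p->recv_permitted = true;
}

/* Broadcast packet received from the peer. */
enum bc_result bc_on_broadcast(struct bc_peer *p, uint16_t seqno)
{
    if (!p->recv_permitted)
        return BC_IGNORED;           /* not synchronized yet: do not process */
    if (seqno != p->next_expected)
        return BC_OUT_OF_SEQUENCE;   /* a real stack would request retransmission */
    p->next_expected++;
    return BC_ACCEPTED;
}
```

Because the expected sequence number comes from the sender itself, the receiver no longer has to guess a starting point such as 0 or a value carried in an earlier ACTIVATE message.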
      
      To maintain backwards compatibility, we still open up for
      broadcast reception if we receive a NAME_DISTR message without
      any preceding broadcast sync message. In this case, we must
      assume that the other end has an older code version, and will
      never send out the new synchronization message. Hence, for mixed
      old and new nodes, the issue arising in 7-12 of the above may
      happen with the same probability as before.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  5. 22 Nov, 2012 8 commits
    • tipc: rename supported flag to recv_permitted · 389dd9bc
      Ying Xue committed
      Rename the "supported" flag in bclink structure to "recv_permitted"
      to better reflect what it is used for. When this flag is set for a
      given node, we are permitted to receive and acknowledge broadcast
      messages from that node.  Convert it to a bool at the same time,
      since it is not used to store any numerical values.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
    • tipc: remove supportable flag from bclink structure · 818f4da5
      Ying Xue committed
      The "supportable" flag in bclink structure is a compatibility flag
      indicating whether a peer node is capable of receiving TIPC broadcast
      messages. However, all TIPC versions since tipc-1.5, and after the
      inclusion in the upstream Linux kernel in 2006, support this capability.
      It is highly unlikely that anybody is still using such an old
      version of TIPC, let alone that they want to mix it with TIPC-2.0
      nodes. Therefore, we now remove the "supportable" flag.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
    • tipc: remove the bearer congestion mechanism · 3c294cb3
      Ying Xue committed
      Currently at the TIPC bearer layer there is the following congestion
      mechanism:
      
      Once sending packets has failed via that bearer, the bearer will be
      flagged as being in congested state at once. During bearer congestion,
      all packets arriving at link will be queued on the link's outgoing
      buffer.  When we detect that the state of bearer congestion has
      relaxed (e.g. some packets are received from the bearer) we will try
      our best to push all packets in the link's outgoing buffer until the
      buffer is empty, or until the bearer is congested again.
      
      However, in fact the TIPC bearer never receives any feedback from the
      device layer whether a send was successful or not, so it must always
      assume it was successful. Therefore, the bearer congestion mechanism
      as it exists currently is of no value.
      
      But the bearer blocking state is still useful for us. For example,
      when the physical media goes down/up, we need to change the state of
      the links bound to the bearer.  So the code maintaining the state
      information is not removed.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
    • tipc: wake up all waiting threads at socket shutdown · 75031151
      Ying Xue committed
      When a socket is shut down, we should wake up all threads sleeping on
      it, instead of just one of them. Otherwise, when several threads are
      polling the same socket, and one of them does shutdown(), the
      remaining threads may end up sleeping forever.
      
      Also, to align socket usage with common practice in other stacks, we
      use one of the common socket callback handlers, sk_state_change(),
      to wake up pending users. This is similar to the usage in e.g.
      inet_shutdown(). [net/ipv4/af_inet.c].
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
    • netfilter: cttimeout: fix buffer overflow · e93b5f9f
      Florian Westphal committed
      Chen Gang reports:
      the length of nla_data(cda[CTA_TIMEOUT_NAME]) is not limited in server side.
      
      And indeed, it's used in a strcpy to a fixed-size buffer.
      
      Fortunately, nfnetlink users need CAP_NET_ADMIN.
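
The class of bug reported here, an unbounded strcpy() into a fixed-size buffer, is typically avoided by validating the attribute length first. The sketch below shows that pattern in plain C; it is not the actual fix, and `NAME_MAX_LEN`, `struct timeout_entry` and `set_timeout_name()` are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define NAME_MAX_LEN 32   /* hypothetical buffer size for this sketch */

struct timeout_entry {
    char name[NAME_MAX_LEN];
};

/*
 * Copy an attribute payload of attr_len bytes into the fixed buffer.
 * Payloads that are empty, too long, or not NUL-terminated are refused.
 */
int set_timeout_name(struct timeout_entry *e, const char *attr, size_t attr_len)
{
    if (attr_len == 0 || attr_len > sizeof(e->name))
        return -EINVAL;
    if (attr[attr_len - 1] != '\0')
        return -EINVAL;

    memcpy(e->name, attr, attr_len);
    return 0;
}
```

The copy is bounded by the length that was just validated, so an oversized name from the sender is rejected instead of overwriting adjacent memory.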
      Reported-by: Chen Gang <gang.chen@asianux.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: ipset: Fix range bug in hash:ip,port,net · 4fe198e6
      Jozsef Kadlecsik committed
      Due to the missing initialization at adding/deleting entries, when
      a plain_ip,port,net element was the object, multiple elements were
      added/deleted instead. The bug came from the missing dangling
      default initialization.
      
      The error-prone default initialization is corrected in all hash:* types.
      Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • tipc: return POLLOUT for sockets in an unconnected state · c4fc298a
      Erik Hugne committed
      If an implied connect is attempted on a nonblocking STREAM/SEQPACKET
      socket during link congestion, the connect message will be discarded
      and sendmsg will return EAGAIN. This is normal behavior, and the
      application is expected to poll the socket until POLLOUT is set,
      after which the connection attempt can be retried.
      However, the POLLOUT flag is never set for unconnected sockets and
      poll() always returns a zero mask. The application is then left without
      a trigger for when it can make another attempt at sending the message.
      
      The solution is to check if we're polling on an unconnected socket
      and set the POLLOUT flag if the TIPC port owned by this socket
      is not congested. The TIPC ports waiting on a specific link will be
      marked as 'not congested' when the link congestion has abated.
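
On the application side, the retry pattern this fix enables looks roughly like the helper below: after sendmsg() fails with EAGAIN, wait for POLLOUT and then try again. `wait_writable()` is a hypothetical name and the code uses only the standard poll() interface, so it is not specific to TIPC.

```c
#include <poll.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become writable.
 * Returns 1 if POLLOUT is set, 0 if it is not (for example on timeout),
 * and -1 if poll() itself fails.
 */
int wait_writable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    return (n > 0 && (pfd.revents & POLLOUT)) ? 1 : 0;
}
```

Before this change, such a loop would never see POLLOUT on an unconnected TIPC socket and the application would have no signal for when to retry.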
      Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
    • tipc: fix race/inefficiencies in poll/wait behaviour · f288bef4
      Ying Xue committed
      When an application blocks at poll/select on a TIPC socket
      while requesting a specific event mask, both the filter_rcv() and
      wakeupdispatch() case will wake it up unconditionally whenever
      the state changes (i.e. an incoming message arrives, or congestion
      has subsided).  No mask is used.
      
      To avoid this, we populate sk->sk_data_ready and sk->sk_write_space
      with tipc_data_ready and tipc_write_space respectively, which makes
      tipc more in alignment with the rest of the networking code.  These
      pass the exact set of possible events to the waker in fs/select.c
      hence avoiding waking up blocked processes unnecessarily.
      
      In doing so, we uncover another issue -- that there needs to be a
      memory barrier in these poll/receive callbacks, otherwise we are
      subject to the same race as documented above wq_has_sleeper()
      [in commit a57de0b4 "net: adding memory barrier to the poll and
      receive callbacks"].  So we need to replace poll_wait() with
      sock_poll_wait() and use rcu protection for the sk->sk_wq pointer
      in these two new functions.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  6. 21 Nov, 2012 13 commits
  7. 20 Nov, 2012 4 commits