1. 29 September 2014, 1 commit
  2. 27 September 2014, 1 commit
    • net: optimise csum_replace4() · 4565af0d
      Committed by LEROY Christophe
      csum_partial() is a generic function that is not optimised for small
      fixed-length calculations, and using it requires storing the "from"
      and "to" values in memory while we already have them available in
      registers. This is also costly, especially on RISC processors. In the
      same spirit as the change done by Eric Dumazet on csum_replace2(),
      this patch rewrites csum_replace4() taking RFC1624 into account.

      I spotted during a NATted TCP transfer that csum_partial() is one of
      the top five consuming functions (around 8%), and that the second
      user of csum_partial() is csum_replace4().

      I have proposed the same modification to inet_proto_csum_replace4()
      in another patch.
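
      As a worked illustration of the RFC1624 technique (a standalone
      sketch with local helper names, not the kernel implementation), the
      checksum can be updated from the old and new 32-bit values alone,
      with no csum_partial() pass over the buffer:

          #include <stdint.h>
          #include <stdio.h>

          /* Fold a 32-bit one's-complement sum into 16 bits. */
          static uint16_t csum_fold(uint32_t sum)
          {
              while (sum >> 16)
                  sum = (sum & 0xffff) + (sum >> 16);
              return (uint16_t)sum;
          }

          /* Internet checksum over 32-bit words, summed as 16-bit halves. */
          static uint16_t checksum32(const uint32_t *p, int n)
          {
              uint32_t sum = 0;
              while (n--) {
                  sum += *p >> 16;
                  sum += *p & 0xffff;
                  p++;
              }
              return (uint16_t)~csum_fold(sum);
          }

          /* RFC1624, Eqn. 3: HC' = ~(~HC + ~m + m'). Update the checksum
           * when one 32-bit word changes from 'from' to 'to'. */
          static uint16_t checksum_update32(uint16_t check, uint32_t from,
                                            uint32_t to)
          {
              uint32_t sum = (uint16_t)~check;   /* ~HC */
              sum += csum_fold(~from);           /* + ~m */
              sum += csum_fold(to);              /* + m' */
              return (uint16_t)~csum_fold(sum);
          }

          int main(void)
          {
              uint32_t buf[] = { 0xc0a80001, 0x0a000001 };
              uint16_t old = checksum32(buf, 2);
              uint32_t from = buf[1], to = 0x0a0000fe;

              buf[1] = to;                       /* e.g. a NAT rewrite */
              printf("full=%04x incremental=%04x\n",
                     checksum32(buf, 2), checksum_update32(old, from, to));
              return 0;
          }

      Both printed values agree: the incremental update touches only the
      changed word instead of re-summing the whole header.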
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 24 September 2014, 1 commit
    • icmp: add a global rate limitation · 4cdf507d
      Committed by Eric Dumazet
      Current ICMP rate limiting uses the inetpeer cache, which is an AVL
      tree protected by a lock, meaning that hosts can be stuck hard if
      all cpus want to check ICMP limits.

      When, say, a DNS or NTP server process is restarted, the inetpeer
      tree grows quickly and the machine comes to its knees.

      iptables cannot help because the bottleneck happens before ICMP
      messages are even cooked and sent.

      This patch adds a new global limitation, using a token bucket
      filter, controlled by two new sysctls:
      
      icmp_msgs_per_sec - INTEGER
          Limit the maximal number of ICMP packets sent per second from this host.
          Only messages whose type matches icmp_ratemask are
          controlled by this limit.
          Default: 1000
      
      icmp_msgs_burst - INTEGER
          icmp_msgs_per_sec controls number of ICMP packets sent per second,
          while icmp_msgs_burst controls the burst size of these packets.
          Default: 50
      
      Note that if we really want to send millions of ICMP messages per
      second, we might extend the idea and infrastructure added in commit
      04ca6973 ("ip: make IP identifiers less predictable"):
      add a token bucket in the ip_idents hash and no longer rely on inetpeer.
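
      A minimal token-bucket filter of the kind described above might look
      as follows (a standalone sketch with illustrative names; the kernel
      version additionally needs a lock to serialize cpus):

          #include <stdbool.h>
          #include <stdint.h>

          struct token_bucket {
              uint64_t tokens;    /* currently available tokens        */
              uint64_t last_ns;   /* timestamp of the last refill      */
              uint64_t rate;      /* refill rate: icmp_msgs_per_sec    */
              uint64_t burst;     /* bucket depth: icmp_msgs_burst     */
          };

          /* Returns true if one more ICMP message may be sent now. */
          static bool tb_allow(struct token_bucket *tb, uint64_t now_ns)
          {
              uint64_t refill = (now_ns - tb->last_ns) * tb->rate
                                / 1000000000ull;

              if (refill) {       /* add earned tokens, clamp at burst */
                  tb->tokens += refill;
                  if (tb->tokens > tb->burst)
                      tb->tokens = tb->burst;
                  tb->last_ns = now_ns;
              }
              if (tb->tokens) {
                  tb->tokens--;
                  return true;
              }
              return false;       /* rate-limited: drop the message */
          }

      With the defaults above (rate 1000, burst 50), at most 50 messages
      can go out back to back, while sustained output converges to 1000
      per second.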
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 23 September 2014, 5 commits
    • tcp: avoid possible arithmetic overflows · fcdd1cf4
      Committed by Eric Dumazet
      icsk_rto is a 32-bit field, and icsk_backoff can reach 15 by
      default, or more if some sysctls (e.g. tcp_retries2) are changed.

      Better to use 64-bit arithmetic to perform icsk_rto << icsk_backoff
      operations.

      As Joe Perches suggested, add a helper for this.
      
      Yuchung spotted the tcp_v4_err() case.
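
      The pattern is simply to widen before shifting and clamp the result
      (a sketch with illustrative names, not the kernel helper). With
      HZ=1000, an RTO of 120 seconds is 120000 jiffies (about 2^17), so
      shifting it left by a backoff of 16 no longer fits in 32 bits:

          #include <stdint.h>

          /* Compute rto << backoff without 32-bit overflow, clamped to a
           * maximum (all values in jiffies). */
          static inline uint64_t rto_backoff(uint32_t rto, uint8_t backoff,
                                             uint64_t max)
          {
              uint64_t when = (uint64_t)rto << backoff;

              return when > max ? max : when;
          }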
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: mld: answer mldv2 queries with mldv1 reports in mldv1 fallback · 35f7aa53
      Committed by Daniel Borkmann
      RFC2710 (MLDv1), section 3.7. says:
      
        The length of a received MLD message is computed by taking the
        IPv6 Payload Length value and subtracting the length of any IPv6
        extension headers present between the IPv6 header and the MLD
        message. If that length is greater than 24 octets, that indicates
        that there are other fields present *beyond* the fields described
        above, perhaps belonging to a *future backwards-compatible* version
        of MLD. An implementation of the version of MLD specified in this
        document *MUST NOT* send an MLD message longer than 24 octets and
        MUST ignore anything past the first 24 octets of a received MLD
        message.
      
      RFC3810 (MLDv2), section 8.2.1. states for *listeners* regarding
      presence of MLDv1 routers:
      
        In order to be compatible with MLDv1 routers, MLDv2 hosts MUST
        operate in version 1 compatibility mode. [...] When Host
        Compatibility Mode is MLDv2, a host acts using the MLDv2 protocol
        on that interface. When Host Compatibility Mode is MLDv1, a host
        acts in MLDv1 compatibility mode, using *only* the MLDv1 protocol,
        on that interface. [...]
      
      While section 8.3.1. specifies *router* behaviour regarding presence
      of MLDv1 routers:
      
        MLDv2 routers may be placed on a network where there is at least
        one MLDv1 router. The following requirements apply:
      
        If an MLDv1 router is present on the link, the Querier MUST use
        the *lowest* version of MLD present on the network. This must be
        administratively assured. Routers that desire to be compatible
        with MLDv1 MUST have a configuration option to act in MLDv1 mode;
        if an MLDv1 router is present on the link, the system administrator
        must explicitly configure all MLDv2 routers to act in MLDv1 mode.
        When in MLDv1 mode, the Querier MUST send periodic General Queries
        truncated at the Multicast Address field (i.e., 24 bytes long),
        and SHOULD also warn about receiving an MLDv2 Query (such warnings
        must be rate-limited). The Querier MUST also fill in the Maximum
        Response Delay in the Maximum Response Code field, i.e., the
        exponential algorithm described in section 5.1.3. is not used. [...]
      
      That means that we should not get queries from different versions
      of MLD. When there's an MLDv1 router present, MLDv2 enforces
      truncation and MRC == MRD (both fields overlap within the 24-octet
      range).
      
      Section 8.3.2. specifies behaviour in the presence of MLDv1 multicast
      address *listeners*:
      
        MLDv2 routers may be placed on a network where there are hosts
        that have not yet been upgraded to MLDv2. In order to be compatible
        with MLDv1 hosts, MLDv2 routers MUST operate in version 1 compatibility
        mode. MLDv2 routers keep a compatibility mode per multicast address
        record. The compatibility mode of a multicast address is determined
        from the Multicast Address Compatibility Mode variable, which can be
        in one of the two following states: MLDv1 or MLDv2.
      
        The Multicast Address Compatibility Mode of a multicast address
        record is set to MLDv1 whenever an MLDv1 Multicast Listener Report is
        *received* for that multicast address. At the same time, the Older
        Version Host Present timer for the multicast address is set to Older
        Version Host Present Timeout seconds. The timer is re-set whenever a
        new MLDv1 Report is received for that multicast address. If the Older
        Version Host Present timer expires, the router switches back to
        Multicast Address Compatibility Mode of MLDv2 for that multicast
        address. [...]
      
      That means the following scenario can happen: hosts act in MLDv1
      compatibility mode because they previously received an MLDv1 query
      (or they simply operate in MLDv1-only mode), and at the same time
      an MLDv2 router starts up and transmits MLDv2 startup query
      messages, unaware of the current operational mode.

      Given RFC2710, section 3.7, we would need to answer that with an
      MLDv1 listener report, so that the router, according to RFC3810,
      section 8.3.2., would receive it and internally switch to MLDv1
      compatibility as well.

      Right now, and I believe ever since the initial implementation of
      MLDv2, Linux hosts would just silently drop such MLDv2 queries
      instead of replying with an MLDv1 listener report, which would
      prevent an MLDv2 router from going into fallback mode (until it
      receives other MLDv1 queries).
      
      Since the mapping of MRC to MRD in exactly such cases can make use
      of the exponential algorithm from section 5.1.3, we cannot,
      strictly speaking, know in MLDv1 which encoding was used for the
      MRC; the RFC does not seem to mention this case either. Since the
      encodings are identical up to 32767, we assume that value as a hard
      upper limit and clamp to it. We have asked one of the RFC authors
      about this, and he mentioned that there do not seem to be any
      implementations that make use of the exponential algorithm in
      startup messages. In any case, this patch fixes this MLD
      interoperability issue.
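
      The clamping described above amounts to something like this (a
      sketch; the function name is illustrative):

          #include <stdint.h>

          /* When answering an MLDv2 query in MLDv1 fallback mode, the
           * Maximum Response Code is only known to be encoded linearly up
           * to 32767 (larger MLDv2 values use the exponential form from
           * RFC3810, section 5.1.3), so treat 32767 as a hard ceiling. */
          static uint16_t mldv1_max_delay(uint16_t mrc)
          {
              return mrc < 32768 ? mrc : 32767;  /* milliseconds */
          }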
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: add {get, set}_wol callbacks to slave devices · 19e57c4e
      Committed by Florian Fainelli
      Allow switch drivers to implement per-port Wake-on-LAN getters and
      setters.
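
      The callbacks can be imagined along these lines, mirroring the
      corresponding ethtool_ops entries (a sketch; the exact signatures in
      the kernel may differ):

          struct dsa_switch;                  /* opaque here */
          struct ethtool_wolinfo;

          struct dsa_switch_driver_wol_sketch {
              void (*get_wol)(struct dsa_switch *ds, int port,
                              struct ethtool_wolinfo *w);
              int  (*set_wol)(struct dsa_switch *ds, int port,
                              struct ethtool_wolinfo *w);
          };

      The slave device's ethtool handlers would then presumably forward to
      the switch driver with their port index.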
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: allow switch drivers to implement suspend/resume hooks · 24462549
      Committed by Florian Fainelli
      Add an abstraction layer to suspend/resume switch devices, doing the
      following split:
      
      - suspend/resume the slave network devices and their corresponding PHY
        devices
      - suspend/resume the switch hardware using switch driver callbacks
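
      In kernel-style pseudocode, the suspend side of that split could
      look like this (helper and field names are illustrative, not the
      exact kernel API):

          static int dsa_switch_suspend_sketch(struct dsa_switch *ds)
          {
              int i, ret = 0;

              /* 1) suspend slave netdevs and their PHY devices */
              for (i = 0; i < DSA_MAX_PORTS; i++) {
                  if (!ds->ports[i])
                      continue;
                  ret = dsa_slave_suspend(ds->ports[i]);
                  if (ret)
                      return ret;
              }

              /* 2) suspend the switch hardware via the driver callback */
              if (ds->drv->suspend)
                  ret = ds->drv->suspend(ds);

              return ret;
          }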
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: shrink struct qdisc_skb_cb to 28 bytes · 25711786
      Committed by Eric Dumazet
      We cannot make struct qdisc_skb_cb bigger without impacting IPoIB,
      or increasing skb->cb[] size.
      
      Commit e0f31d84 ("flow_keys: Record IP layer protocol in
      skb_flow_dissect()") broke IPoIB.
      
      The only current offender is sch_choke, and it does not need an
      absolutely precise flow key.

      Storing 17 bytes of flow key is more than enough. (That is the
      actual size of flow_keys if it were a packed structure, but we
      might add new fields at the end of it later.)
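
      The size budget can be made explicit with a compile-time check (a
      standalone sketch; the field layout here is illustrative, only the
      28-byte total and the 17-byte key come from the text above):

          #include <assert.h>

          struct qdisc_skb_cb_sketch {
              unsigned int   pkt_len;
              unsigned short slave_dev_queue_mapping;
              unsigned short _pad;
              unsigned char  data[20];   /* private area for qdiscs */
          };

          struct choke_skb_cb_sketch {
              unsigned short classid;
              unsigned char  keys_valid;
              unsigned char  keys[17];   /* truncated flow key */
          };

          static_assert(sizeof(struct qdisc_skb_cb_sketch) == 28,
                        "qdisc cb layout drifted");
          static_assert(sizeof(struct choke_skb_cb_sketch) <= 20,
                        "choke private cb must fit in data[]");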
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: e0f31d84 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 20 September 2014, 6 commits
  6. 16 September 2014, 4 commits
    • xfrm: Generate queueing routes only from route lookup functions · b8c203b2
      Committed by Steffen Klassert
      Currently we generate a queueing route if we have matching policies
      but cannot resolve the states and the sysctl xfrm_larval_drop is
      disabled. Here we assume that dst_output() is called to kill the
      queued packets. Unfortunately this assumption is not true in all
      cases, so it is possible that these packets leave the system, which
      is unwanted.

      We fix this by generating queueing routes only from the route
      lookup functions, where we can guarantee a call to dst_output()
      afterwards.
      
      Fixes: a0073fe1 ("xfrm: Add a state resolution packet queue")
      Reported-by: Konstantinos Kolelis <k.kolelis@sirrix.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
    • xfrm: Generate blackhole routes only from route lookup functions · f92ee619
      Committed by Steffen Klassert
      Currently we generate a blackhole route whenever we have matching
      policies but cannot resolve the states. Here we assume that
      dst_output() is called to kill the blackholed packets.
      Unfortunately this assumption is not true in all cases, so it is
      possible that these packets leave the system, which is unwanted.

      We fix this by generating blackhole routes only from the route
      lookup functions, where we can guarantee a call to dst_output()
      afterwards.
      
      Fixes: 2774c131 ("xfrm: Handle blackhole route creation via afinfo.")
      Reported-by: Konstantinos Kolelis <k.kolelis@sirrix.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
    • dsa: Replace mii_bus with a generic host device · b4d2394d
      Committed by Alexander Duyck
      This change makes it so that instead of passing and storing a
      mii_bus, we pass and store a host_dev. From there we can test to
      determine the exact type of device, and can verify it is the
      correct device for our switch.

      So, for example, it would be possible to pass a device pointer from
      a pci_dev, and instead of checking for a PHY ID we could check for
      a vendor and/or device ID.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dsa: Split ops up, and avoid assigning tag_protocol and receive separately · 5075314e
      Committed by Alexander Duyck
      This change addresses several issues.

      First, it was possible to set tag_protocol without setting the ops
      pointer. To correct that, I have reordered things so that rcv is
      now populated before we set tag_protocol.

      Second, it didn't make much sense to keep setting the device ops
      each time a new slave was registered, so moving the receive portion
      out into root switch initialization addresses that issue.

      Third, I wanted to avoid sending tags if the rcv pointer was not
      registered, so I changed the tag check to verify whether the rcv
      function pointer is set on the root tree. If it is, then we start
      sending DSA-tagged frames.

      Finally, I split the device ops pointer in the structures into two
      spots. I placed the rcv function pointer in the root switch, since
      this makes it easiest to access from there, and I placed the xmit
      function pointer in the slave for the same reason.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 14 September 2014, 6 commits
  8. 13 September 2014, 1 commit
  9. 11 September 2014, 5 commits
  10. 10 September 2014, 2 commits
  11. 09 September 2014, 8 commits
    • netfilter: nf_tables: add new nft_masq expression · 9ba1f726
      Committed by Arturo Borrero
      The nft_masq expression is intended to perform NAT in the
      masquerade flavour.

      We decided to have the masquerade functionality in a separate
      expression rather than in nft_nat.
      Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_nat: generalize IPv6 masquerading support for nf_tables · be6b635c
      Committed by Arturo Borrero
      Let's refactor the code so we can reach the masquerade
      functionality from outside the xt context (i.e. nftables).

      The patch includes the addition of an atomic counter to the
      masquerade notifier: the work to be done by the notifier is the
      same for xt and nftables, so only one notification handler is
      needed.
      
      This factorization only involves IPv6; a similar patch exists to
      handle IPv4.
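
      The counter idea reduces to "install on first user, remove on last"
      (a standalone sketch; the two handler functions are hypothetical
      stand-ins for the actual notifier registration):

          #include <stdatomic.h>

          void install_masq_notifier(void);   /* hypothetical */
          void remove_masq_notifier(void);    /* hypothetical */

          static atomic_int masq_refcnt;

          void masq_register(void)
          {
              /* first registration (0 -> 1) installs the shared handler */
              if (atomic_fetch_add(&masq_refcnt, 1) == 0)
                  install_masq_notifier();
          }

          void masq_unregister(void)
          {
              /* last unregistration (1 -> 0) removes it again */
              if (atomic_fetch_sub(&masq_refcnt, 1) == 1)
                  remove_masq_notifier();
          }

      Both the xt and nftables masquerade expressions call masq_register()
      at module load, so the single handler is shared between them.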
      Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_nat: generalize IPv4 masquerading support for nf_tables · 8dd33cc9
      Committed by Arturo Borrero
      Let's refactor the code so we can reach the masquerade
      functionality from outside the xt context (i.e. nftables).

      The patch includes the addition of an atomic counter to the
      masquerade notifier: the work to be done by the notifier is the
      same for xt and nftables, so only one notification handler is
      needed.
      
      This factorization only involves IPv4; a similar patch follows to
      handle IPv6.
      Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nat: move specific NAT IPv6 to core · 2a5538e9
      Committed by Pablo Neira Ayuso
      Move the specific NAT IPv6 core functions that are called from the
      hooks from ip6table_nat.c to nf_nat_l3proto_ipv6.c. This prepares the
      ground to allow iptables and nft to use the same NAT engine code that
      comes in a follow-up patch.
      
      This also renames nf_nat_ipv6_fn to nft_nat_ipv6_fn in
      net/ipv6/netfilter/nft_chain_nat_ipv6.c to avoid a compilation breakage.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • Bluetooth: Fix mgmt pairing failure when authentication fails · e1e930f5
      Committed by Johan Hedberg
      Whether through HCI with BR/EDR or through SMP with LE, when
      authentication fails we should also notify any pending Pair Device
      mgmt command. This patch
      updates the mgmt_auth_failed function to take the actual hci_conn object
      and makes sure that any pending pairing command is notified and cleaned
      up appropriately.
      Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    • inet: remove dead inetpeer sequence code · a7f26b7e
      Committed by Willem de Bruijn
      inetpeer sequence numbers are no longer incremented, so there is no
      need to check and flush the tree. The function that increments the
      sequence number was already dead code and was removed in "ipv4:
      remove unused function" (068a6e18). Remove the code that checks for
      a change, too.

      Verifying that v4_seq and v6_seq are never incremented, and thus
      that flush_check compares bp->flush_seq to 0, is trivial.

      The second part of the change removes flush_check completely, even
      though bp->flush_seq is non-zero exactly once, at initialization.
      This change is correct because the only time this branch is true is
      when bp->root == peer_avl_empty_rcu, in which case both the branch
      and inetpeer_invalidate_tree are a NOOP.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Bluetooth: Fix locking of the SMP context · fc75cc86
      Committed by Johan Hedberg
      Before the move to l2cap_chan, the SMP context (smp_chan) didn't
      have any kind of proper locking. The best that existed was the
      HCI_CONN_LE_SMP_PEND flag, which was used to enable mutual
      exclusion for potential multiple creators of the SMP context.
      
      Now that SMP has been converted to use the l2cap_chan infrastructure and
      since the SMP context is directly mapped to a corresponding l2cap_chan
      we get the SMP context locking essentially for free through the
      l2cap_chan lock. For all callbacks that l2cap_core.c makes for each
      channel implementation (smp.c in the case of SMP), the l2cap_chan lock is
      held through l2cap_chan_lock(chan).
      
      Since the calls from l2cap_core.c to smp.c are covered, the only missing
      piece to have the locking implemented properly is to ensure that the
      lock is held for any other call path that may access the SMP context.
      This means user responses through mgmt.c, requests to elevate the
      security of a connection through hci_conn.c, as well as any deferred
      work through workqueues.
      
      This patch adds the necessary locking to all these other code paths
      that try to access the SMP context. Since mutual exclusion for the
      l2cap_chan access is now covered from all directions, the patch
      also removes the now unnecessary HCI_CONN_LE_SMP_PEND flag (once
      we've acquired the chan lock, we can simply check whether chan->smp
      is set to know if there's an SMP context).
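
      The resulting access pattern for those outside paths can be modeled
      like this (a standalone sketch with a pthread mutex standing in for
      l2cap_chan_lock()):

          #include <pthread.h>

          struct smp_chan;                    /* the SMP context */

          struct l2cap_chan_sketch {
              pthread_mutex_t lock;           /* l2cap_chan lock */
              struct smp_chan *smp;           /* set iff a context exists */
          };

          void smp_user_response(struct l2cap_chan_sketch *chan)
          {
              pthread_mutex_lock(&chan->lock);
              if (chan->smp) {
                  /* the SMP context may be read and modified here;
                   * no HCI_CONN_LE_SMP_PEND flag is needed */
              }
              pthread_mutex_unlock(&chan->lock);
          }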
      Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    • Bluetooth: Move identity address update behind a workqueue · f3d82d0c
      Committed by Johan Hedberg
      The identity address update of all channels for an l2cap_conn needs
      to take the lock for each channel, i.e. it's safest to do this from
      a separate workqueue callback.

      Previously this was partially solved by moving the entire SMP key
      distribution behind a workqueue. However, if we want SMP context
      locking to be correct and safe, we should always use the l2cap_chan
      lock when accessing it, meaning even smp_distribute_keys needs to
      take that lock, which would once again create a deadlock when
      updating the identity address.
      
      The simplest way to solve this is to have l2cap_conn manage the deferred
      work, which is what this patch does. A subsequent patch will remove the
      now unnecessary SMP key distribution work struct.
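
      In kernel-style pseudocode, the deferral looks roughly like this
      (field and function names are illustrative):

          struct l2cap_conn_sketch {
              struct work_struct id_addr_update_work;
              /* ... channel list, hcon, ... */
          };

          static void id_addr_update_work_fn(struct work_struct *work)
          {
              struct l2cap_conn_sketch *conn = container_of(work,
                      struct l2cap_conn_sketch, id_addr_update_work);

              /* for each channel on conn:
               *   l2cap_chan_lock(chan);
               *   point the channel at the new identity address;
               *   l2cap_chan_unlock(chan);
               */
              (void)conn;
          }

          static void smp_notify_id_addr(struct l2cap_conn_sketch *conn)
          {
              /* INIT_WORK(..., id_addr_update_work_fn) was done once at
               * connection setup; SMP merely queues the work and never
               * takes channel locks itself. */
              schedule_work(&conn->id_addr_update_work);
          }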
      Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>