1. 27 9月, 2013 11 次提交
    • V
      net: create sysfs symlinks for neighbour devices · 5831d66e
      Veaceslav Falico 提交于
      Also, remove the same functionality from bonding - it will be already done
      for any device that links to its lower/upper neighbour.
      
      The links will be created for dev's kobject, and will look like
      lower_eth0 for lower device eth0 and upper_bridge0 for upper device
      bridge0.
      
      CC: Jay Vosburgh <fubar@us.ibm.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5831d66e
    • V
      net: expose the master link to sysfs, and remove it from bond · 842d67a7
      Veaceslav Falico 提交于
      Currently, we can have only one master upper neighbour, so it would be
      useful to create a symlink to it in the sysfs device directory, the way
      that bonding now does it, for every device. Lower devices from
      bridge/team/etc will automagically get it, so we could rely on it.
      
      Also, remove the same functionality from bonding.
      
      CC: Jay Vosburgh <fubar@us.ibm.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      842d67a7
    • V
      vlan: unlink the upper neighbour before unregistering · 47701a36
      Veaceslav Falico 提交于
      On netdev unregister we're removing also all of its sysfs-associated stuff,
      including the sysfs symlinks that are controlled by netdev neighbour code.
      Also, it's a subtle race condition - cause we can still access it after
      unregistering.
      
      Move the unlinking right before the unregistering to fix both.
      
      CC: Patrick McHardy <kaber@trash.net>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47701a36
    • V
      vlan: link the upper neighbour only after registering · 5df27e6c
      Veaceslav Falico 提交于
      Otherwise users might access it without being fully registered, as per
      sysfs - it only inits in register_netdevice(), so is unusable till it is
      called.
      
      CC: Patrick McHardy <kaber@trash.net>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5df27e6c
    • V
      net: add a possibility to get private from netdev_adjacent->list · b6ccba4c
      Veaceslav Falico 提交于
      It will be useful to get first/last element.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b6ccba4c
    • V
      net: add for_each iterators through neighbour lower link's private · 31088a11
      Veaceslav Falico 提交于
      Add a possibility to iterate through netdev_adjacent's private, currently
      only for lower neighbours.
      
      Add both RCU and RTNL/other locking variants of iterators, and make the
      non-rcu variant to be safe from removal.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31088a11
    • V
      net: add netdev_adjacent->private and allow to use it · 402dae96
      Veaceslav Falico 提交于
      Currently, even though we can access any linked device, we can't attach
      anything to it, which is vital to properly manage them.
      
      To fix this, add a new void *private to netdev_adjacent and functions
      setting/getting it (per link), so that we can save, per example, bonding's
      slave structures there, per slave device.
      
      netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to
      upper dev and populates the neighbour link only with private.
      
      netdev_lower_dev_get_private{,_rcu}() returns the private, if found.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      402dae96
    • V
      net: add RCU variant to search for netdev_adjacent link · 5249dec7
      Veaceslav Falico 提交于
      Currently we have only the RTNL flavour, however we can traverse it while
      holding only RCU, so add the RCU search. Add an RCU variant that uses
      list_head * as an argument, so that it can be universally used afterwards.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5249dec7
    • V
      net: add adj_list to save only neighbours · 2f268f12
      Veaceslav Falico 提交于
      Currently, we distinguish neighbours (first-level linked devices) from
      non-neighbours by the neighbour bool in the netdev_adjacent. This could be
      quite time-consuming in case we would like to traverse *only* through
      neighbours - cause we'd have to traverse through all devices and check for
      this flag, and in a (quite common) scenario where we have lots of vlans on
      top of bridge, which is on top of a bond - the bonding would have to go
      through all those vlans to get its upper neighbour linked devices.
      
      This situation is really unpleasant, cause there are already a lot of cases
      when a device with slaves needs to go through them in hot path.
      
      To fix this, introduce a new upper/lower device lists structure -
      adj_list, which contains only the neighbours. It works always in
      pair with the all_adj_list structure (renamed from upper/lower_dev_list),
      i.e. both of them contain the same links, only that all_adj_list contains
      also non-neighbour device links. It's really a small change visible,
      currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't
      change the main linked logic at all.
      
      Also, add some comments a fix a name collision in
      netdev_for_each_upper_dev_rcu() and rework the naming by the following
      rules:
      
      netdev_(all_)(upper|lower)_*
      
      If "all_" is present, then we work with the whole list of upper/lower
      devices, otherwise - only with direct neighbours. Uninline functions - to
      get better stack traces.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f268f12
    • V
      net: use lists as arguments instead of bool upper · 7863c054
      Veaceslav Falico 提交于
      Currently we make use of bool upper when we want to specify if we want to
      work with upper/lower list. It's, however, harder to read, debug and
      occupies a lot more code.
      
      Fix this by just passing the correct upper/lower_dev_list list_head pointer
      instead of bool upper, and work internally with it.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7863c054
    • H
      net: neighbour: use source address of last enqueued packet for solicitation · 4ed377e3
      Hannes Frederic Sowa 提交于
      Currently we always use the first member of the arp_queue to determine
      the sender ip address of the arp packet (or in case of IPv6 - source
      address of the ndisc packet). This skb is fixed as long as the queue is
      not drained by a complete purge because of a timeout or by a successful
      response.
      
      If the first packet enqueued on the arp_queue is from a local application
      with a manually set source address and the to be discovered system
      does some kind of uRPF checks on the source address in the arp packet
      the resolving process hangs until a timeout and restarts. This hurts
      communication with the participating network node.
      
      This could be mitigated a bit if we use the latest enqueued skb's
      source address for the resolving process, which is not as static as
      the arp_queue's head. This change of the source address could result in
      better recovery of a failed solicitation.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Julian Anastasov <ja@ssi.bg>
      Reviewed-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ed377e3
  2. 24 9月, 2013 4 次提交
    • C
      ipv6: do not allow ipv6 module to be removed · 8ce44061
      Cong Wang 提交于
      There was some bug report on ipv6 module removal path before.
      Also, as Stephen pointed out, after vxlan module gets ipv6 support,
      the ipv6 stub it used is not safe against this module removal either.
      So, let's just remove inet6_exit() so that ipv6 module will not be
      able to be unloaded.
      
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NCong Wang <amwang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8ce44061
    • E
      tcp: fix dynamic right sizing · b0983d3c
      Eric Dumazet 提交于
      Dynamic Right Sizing (DRS) is supposed to open TCP receive window
      automatically, but suffers from two bugs, presented by order
      of importance.
      
      1) tcp_rcv_space_adjust() fix :
      
      Using twice the last received amount is very pessimistic,
      because it doesn't allow fast recovery or proper slow start
      ramp up, if sender wants to increase cwin by 100% every RTT.
      
      copied = bytes received in previous RTT
      
      2*copied = bytes we expect to receive in next RTT
      
      4*copied = bytes we need to advertise in rwin at end of next RTT
      
      DRS is one RTT late, it needs a 4x factor.
      
      If sender is not using ABC, and increases cwin by 50% every rtt,
      then we needed 1.5*1.5 = 2.25 factor.
      This is probably why this bug was not really noticed.
      
      2) There is no window adjustment after first RTT. DRS triggers only
        after the second RTT.
        DRS needs two RTT to initialize, so tcp_fixup_rcvbuf() should setup
        sk_rcvbuf to allow proper window grow for first two RTT.
      
      This patch increases TCP efficiency particularly for large RTT flows
      when autotuning is used at the receiver, and more particularly
      in presence of packet losses.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b0983d3c
    • F
      tcp: syncookies: reduce mss table to four values · 08629354
      Florian Westphal 提交于
      Halve mss table size to make blind cookie guessing more difficult.
      This is sad since the tables were already small, but there
      is little alternative except perhaps adding more precise mss information
      in the tcp timestamp.  Timestamps are unfortunately not ubiquitous.
      
      Guessing all possible cookie values still has 8-in 2**32 chance.
      Reported-by: NJakob Lell <jakob@jakoblell.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      08629354
    • F
      tcp: syncookies: reduce cookie lifetime to 128 seconds · 8c27bd75
      Florian Westphal 提交于
      We currently accept cookies that were created less than 4 minutes ago
      (ie, cookies with counter delta 0-3).  Combined with the 8 mss table
      values, this yields 32 possible values (out of 2**32) that will be valid.
      
      Reducing the lifetime to < 2 minutes halves the guessing chance while
      still providing a large enough period.
      
      While at it, get rid of jiffies value -- they overflow too quickly on
      32 bit platforms.
      
      getnstimeofday is used to create a counter that increments every 64s.
      perf shows getnstimeofday cost is negible compared to sha_transform;
      normal tcp initial sequence number generation uses getnstimeofday, too.
      Reported-by: NJakob Lell <jakob@jakoblell.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c27bd75
  3. 21 9月, 2013 3 次提交
  4. 20 9月, 2013 3 次提交
  5. 19 9月, 2013 1 次提交
  6. 18 9月, 2013 2 次提交
  7. 17 9月, 2013 7 次提交
  8. 16 9月, 2013 2 次提交
  9. 13 9月, 2013 7 次提交
    • M
      Remove GENERIC_HARDIRQ config option · 0244ad00
      Martin Schwidefsky 提交于
      After the last architecture switched to generic hard irqs the config
      options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
      for !CONFIG_GENERIC_HARDIRQS can be removed.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      0244ad00
    • P
      netfilter: nf_nat_proto_icmpv6:: fix wrong comparison in icmpv6_manip_pkt · d830f0fa
      Phil Oester 提交于
      In commit 58a317f1 (netfilter: ipv6: add IPv6 NAT support), icmpv6_manip_pkt
      was added with an incorrect comparison of ICMP codes to types.  This causes
      problems when using NAT rules with the --random option.  Correct the
      comparison.
      
      This closes netfilter bugzilla #851, reported by Alexander Neumann.
      Signed-off-by: NPhil Oester <kernel@linuxace.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      d830f0fa
    • H
      bridge: Clamp forward_delay when enabling STP · be4f154d
      Herbert Xu 提交于
      At some point limits were added to forward_delay.  However, the
      limits are only enforced when STP is enabled.  This created a
      scenario where you could have a value outside the allowed range
      while STP is disabled, which then stuck around even after STP
      is enabled.
      
      This patch fixes this by clamping the value when we enable STP.
      
      I had to move the locking around a bit to ensure that there is
      no window where someone could insert a value outside the range
      while we're in the middle of enabling STP.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      
      Cheers,
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      be4f154d
    • C
      resubmit bridge: fix message_age_timer calculation · 9a062013
      Chris Healy 提交于
      This changes the message_age_timer calculation to use the BPDU's max age as
      opposed to the local bridge's max age.  This is in accordance with section
      8.6.2.3.2 Step 2 of the 802.1D-1998 sprecification.
      
      With the current implementation, when running with very large bridge
      diameters, convergance will not always occur even if a root bridge is
      configured to have a longer max age.
      
      Tested successfully on bridge diameters of ~200.
      Signed-off-by: NChris Healy <cphealy@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9a062013
    • S
      memcg: rename RESOURCE_MAX to RES_COUNTER_MAX · 6de5a8bf
      Sha Zhengju 提交于
      RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX.
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Signed-off-by: NQiang Huang <h.huangqiang@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Jeff Liu <jeff.liu@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6de5a8bf
    • D
      net: sctp: fix ipv6 ipsec encryption bug in sctp_v6_xmit · 95ee6208
      Daniel Borkmann 提交于
      Alan Chester reported an issue with IPv6 on SCTP that IPsec traffic is not
      being encrypted, whereas on IPv4 it is. Setting up an AH + ESP transport
      does not seem to have the desired effect:
      
      SCTP + IPv4:
      
        22:14:20.809645 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto AH (51), length 116)
          192.168.0.2 > 192.168.0.5: AH(spi=0x00000042,sumlen=16,seq=0x1): ESP(spi=0x00000044,seq=0x1), length 72
        22:14:20.813270 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto AH (51), length 340)
          192.168.0.5 > 192.168.0.2: AH(spi=0x00000043,sumlen=16,seq=0x1):
      
      SCTP + IPv6:
      
        22:31:19.215029 IP6 (class 0x02, hlim 64, next-header SCTP (132) payload length: 364)
          fe80::222:15ff:fe87:7fc.3333 > fe80::92e6:baff:fe0d:5a54.36767: sctp
          1) [INIT ACK] [init tag: 747759530] [rwnd: 62464] [OS: 10] [MIS: 10]
      
      Moreover, Alan says:
      
        This problem was seen with both Racoon and Racoon2. Other people have seen
        this with OpenSwan. When IPsec is configured to encrypt all upper layer
        protocols the SCTP connection does not initialize. After using Wireshark to
        follow packets, this is because the SCTP packet leaves Box A unencrypted and
        Box B believes all upper layer protocols are to be encrypted so it drops
        this packet, causing the SCTP connection to fail to initialize. When IPsec
        is configured to encrypt just SCTP, the SCTP packets are observed unencrypted.
      
      In fact, using `socat sctp6-listen:3333 -` on one end and transferring "plaintext"
      string on the other end, results in cleartext on the wire where SCTP eventually
      does not report any errors, thus in the latter case that Alan reports, the
      non-paranoid user might think he's communicating over an encrypted transport on
      SCTP although he's not (tcpdump ... -X):
      
        ...
        0x0030: 5d70 8e1a 0003 001a 177d eb6c 0000 0000  ]p.......}.l....
        0x0040: 0000 0000 706c 6169 6e74 6578 740a 0000  ....plaintext...
      
      Only in /proc/net/xfrm_stat we can see XfrmInTmplMismatch increasing on the
      receiver side. Initial follow-up analysis from Alan's bug report was done by
      Alexey Dobriyan. Also thanks to Vlad Yasevich for feedback on this.
      
      SCTP has its own implementation of sctp_v6_xmit() not calling inet6_csk_xmit().
      This has the implication that it probably never really got updated along with
      changes in inet6_csk_xmit() and therefore does not seem to invoke xfrm handlers.
      
      SCTP's IPv4 xmit however, properly calls ip_queue_xmit() to do the work. Since
      a call to inet6_csk_xmit() would solve this problem, but result in unecessary
      route lookups, let us just use the cached flowi6 instead that we got through
      sctp_v6_get_dst(). Since all SCTP packets are being sent through sctp_packet_transmit(),
      we do the route lookup / flow caching in sctp_transport_route(), hold it in
      tp->dst and skb_dst_set() right after that. If we would alter fl6->daddr in
      sctp_v6_xmit() to np->opt->srcrt, we possibly could run into the same effect
      of not having xfrm layer pick it up, hence, use fl6_update_dst() in sctp_v6_get_dst()
      instead to get the correct source routed dst entry, which we assign to the skb.
      
      Also source address routing example from 62503411 ("sctp: fix sctp to work with
      ipv6 source address routing") still works with this patch! Nevertheless, in RFC5095
      it is actually 'recommended' to not use that anyway due to traffic amplification [1].
      So it seems we're not supposed to do that anyway in sctp_v6_xmit(). Moreover, if
      we overwrite the flow destination here, the lower IPv6 layer will be unable to
      put the correct destination address into IP header, as routing header is added in
      ipv6_push_nfrag_opts() but then probably with wrong final destination. Things aside,
      result of this patch is that we do not have any XfrmInTmplMismatch increase plus on
      the wire with this patch it now looks like:
      
      SCTP + IPv6:
      
        08:17:47.074080 IP6 2620:52:0:102f:7a2b:cbff:fe27:1b0a > 2620:52:0:102f:213:72ff:fe32:7eba:
          AH(spi=0x00005fb4,seq=0x1): ESP(spi=0x00005fb5,seq=0x1), length 72
        08:17:47.074264 IP6 2620:52:0:102f:213:72ff:fe32:7eba > 2620:52:0:102f:7a2b:cbff:fe27:1b0a:
          AH(spi=0x00003d54,seq=0x1): ESP(spi=0x00003d55,seq=0x1), length 296
      
      This fixes Kernel Bugzilla 24412. This security issue seems to be present since
      2.6.18 kernels. Lets just hope some big passive adversary in the wild didn't have
      its fun with that. lksctp-tools IPv6 regression test suite passes as well with
      this patch.
      
       [1] http://www.secdev.org/conf/IPv6_RH_security-csw07.pdfReported-by: NAlan Chester <alan.chester@tekelec.com>
      Reported-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95ee6208
    • S
      netpoll: Should handle ETH_P_ARP other than ETH_P_IP in netpoll_neigh_reply · b0dd663b
      Sonic Zhang 提交于
      The received ARP request type in the Ethernet packet head is ETH_P_ARP other than ETH_P_IP.
      
      [ Bug introduced by commit b7394d24
        ("netpoll: prepare for ipv6") ]
      Signed-off-by: NSonic Zhang <sonic.zhang@analog.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b0dd663b