1. 01 Oct, 2013 2 commits
  2. 29 Sep, 2013 3 commits
    • net: introduce SO_MAX_PACING_RATE · 62748f32
      Eric Dumazet committed
      As mentioned in commit afe4fd06 ("pkt_sched: fq: Fair Queue packet
      scheduler"), this patch adds a new socket option.
      
      SO_MAX_PACING_RATE offers the application the ability to cap the
      rate computed by the transport layer. The value is in bytes per second.
      
      u32 val = 1000000;
      setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));
      
      To be effectively paced, a flow must use the FQ packet scheduler.
      
      Note that a packet scheduler takes into account the headers for its
      computations. The effective payload rate depends on MSS and retransmits
      if any.
      
      I chose to make this pacing rate a SOL_SOCKET option instead of a
      TCP one because this can be used by other protocols.
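
A minimal userspace sketch of the option described above, under a few assumptions: the helper names `set_max_pacing_rate` and `estimate_payload_rate` are hypothetical, the fallback value 47 for `SO_MAX_PACING_RATE` applies to most Linux architectures but not all, and the payload estimate assumes the scheduler charges a fixed number of header bytes per MSS-sized segment and ignores retransmits.

```c
#include <stdint.h>
#include <sys/socket.h>

/* Fallback for older libc headers. 47 is the value used on most Linux
 * architectures; some architectures define a different number. */
#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47
#endif

/* Hypothetical helper: cap the pacing rate of socket `fd`.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_max_pacing_rate(int fd, uint32_t bytes_per_sec)
{
    return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                      &bytes_per_sec, sizeof(bytes_per_sec));
}

/* Hypothetical helper: rough payload rate for a given cap, assuming the
 * scheduler counts (mss + hdr_bytes) for every segment and nothing is
 * retransmitted. Returns bytes of payload per second. */
uint64_t estimate_payload_rate(uint32_t pacing_rate, uint32_t mss,
                               uint32_t hdr_bytes)
{
    if (mss == 0)
        return 0;
    return (uint64_t)pacing_rate * mss / ((uint64_t)mss + hdr_bytes);
}
```

For example, with a cap of 1000000 bytes per second, an MSS of 1448 and an assumed 66 bytes of headers per segment, the estimate comes to roughly 956 KB/s of payload. The cap only takes effect when the flow goes through the FQ packet scheduler.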
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Steinar H. Gunderson <sesse@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: processing ancillary IP_TOS or IP_TTL · aa661581
      Francesco Fusco committed
      If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
      packets with the specified TTL or TOS overriding the socket values specified
      with the traditional setsockopt().
      
      The struct inet_cork stores the values of TOS, TTL and priority that are
      passed through the struct ipcm_cookie. If there are user-specified TOS
      (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
      used to override the per-socket values. When the TOS is overridden, the
      priority is changed accordingly as well.
      
      Two helper functions get_rttos and get_rtconn_flags are defined to take
      into account the presence of a user specified TOS value when computing
      RT_TOS and RT_CONN_FLAGS.
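
From userspace, the per-packet override might look like the sketch below. It is only an illustration: the helper names `build_tos_ttl_cmsg` and `send_with_tos_ttl` are hypothetical, and it assumes each value is passed as an `int`-sized control message at level `IPPROTO_IP`. The range checks mirror the allowed values given for the ipcm_cookie fields (TOS in [0,255], TTL in [1,255]).

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: attach IP_TOS and IP_TTL control messages to `msg`.
 * `cbuf` must be suitably aligned and at least 2 * CMSG_SPACE(sizeof(int))
 * bytes long. Returns 0 on success, -1 on bad arguments. */
int build_tos_ttl_cmsg(struct msghdr *msg, void *cbuf, size_t cbuf_len,
                       int tos, int ttl)
{
    size_t need = 2 * CMSG_SPACE(sizeof(int));
    struct cmsghdr *cmsg;

    if (cbuf_len < need || tos < 0 || tos > 255 || ttl < 1 || ttl > 255)
        return -1;

    /* Zero the buffer so CMSG_NXTHDR sees a clean second header. */
    memset(cbuf, 0, need);
    msg->msg_control = cbuf;
    msg->msg_controllen = need;

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = IPPROTO_IP;
    cmsg->cmsg_type = IP_TOS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &tos, sizeof(tos));

    cmsg = CMSG_NXTHDR(msg, cmsg);
    if (cmsg == NULL)
        return -1;
    cmsg->cmsg_level = IPPROTO_IP;
    cmsg->cmsg_type = IP_TTL;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &ttl, sizeof(ttl));
    return 0;
}

/* Hypothetical helper: send one datagram whose TOS and TTL override the
 * socket-level values for this packet only. */
ssize_t send_with_tos_ttl(int fd, const void *buf, size_t len,
                          const struct sockaddr *dst, socklen_t dst_len,
                          int tos, int ttl)
{
    union {
        char buf[2 * CMSG_SPACE(sizeof(int))];
        struct cmsghdr align; /* forces cmsghdr alignment */
    } ctrl;
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = (void *)dst;
    msg.msg_namelen = dst_len;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    if (build_tos_ttl_cmsg(&msg, ctrl.buf, sizeof(ctrl.buf), tos, ttl) < 0)
        return -1;
    return sendmsg(fd, &msg, 0);
}
```

Values set earlier with setsockopt() stay in place on the socket; only the packet sent by this call would carry the ancillary TOS and TTL.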
      Signed-off-by: Francesco Fusco <ffusco@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: IP_TOS and IP_TTL can be specified as ancillary data · f02db315
      Francesco Fusco committed
      This patch enables the IP_TTL and IP_TOS values passed from userspace to
      be stored in the ipcm_cookie struct. Three fields are added to the struct:
      
      - the TTL, expressed as __u8.
        The allowed values are in the range [1,255].
        A value of 0 means that the TTL is not specified.
      
      - the TOS, expressed as __s16.
        The allowed values are in the range [0,255].
        A value of -1 means that the TOS is not specified.
      
      - the priority, expressed as a char and computed when
        handling the ancillary data.
      Signed-off-by: Francesco Fusco <ffusco@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 28 Sep, 2013 2 commits
  4. 27 Sep, 2013 11 commits
    • net: create sysfs symlinks for neighbour devices · 5831d66e
      Veaceslav Falico committed
      Also, remove the same functionality from bonding - it will be already done
      for any device that links to its lower/upper neighbour.
      
      The links will be created for dev's kobject, and will look like
      lower_eth0 for lower device eth0 and upper_bridge0 for upper device
      bridge0.
      
      CC: Jay Vosburgh <fubar@us.ibm.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: expose the master link to sysfs, and remove it from bond · 842d67a7
      Veaceslav Falico committed
      Currently, we can have only one master upper neighbour, so it would be
      useful to create a symlink to it in the sysfs device directory, the way
      that bonding now does it, for every device. Lower devices from
      bridge/team/etc will automagically get it, so we could rely on it.
      
      Also, remove the same functionality from bonding.
      
      CC: Jay Vosburgh <fubar@us.ibm.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vlan: unlink the upper neighbour before unregistering · 47701a36
      Veaceslav Falico committed
      On netdev unregister we also remove all of its sysfs-associated state,
      including the sysfs symlinks that are controlled by the netdev neighbour
      code. There is also a subtle race condition, because we can still access
      it after unregistering.
      
      Move the unlinking right before the unregistering to fix both.
      
      CC: Patrick McHardy <kaber@trash.net>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vlan: link the upper neighbour only after registering · 5df27e6c
      Veaceslav Falico committed
      Otherwise users might access it before it is fully registered. sysfs, for
      instance, is only initialized in register_netdevice(), so it is unusable
      until that function is called.
      
      CC: Patrick McHardy <kaber@trash.net>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add a possibility to get private from netdev_adjacent->list · b6ccba4c
      Veaceslav Falico committed
      It will be useful to get first/last element.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add for_each iterators through neighbour lower link's private · 31088a11
      Veaceslav Falico committed
      Add a possibility to iterate through netdev_adjacent's private, currently
      only for lower neighbours.
      
      Add both RCU and RTNL/other locking variants of the iterators, and make
      the non-RCU variant safe against removal.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add netdev_adjacent->private and allow to use it · 402dae96
      Veaceslav Falico committed
      Currently, even though we can access any linked device, we can't attach
      anything to it, which is vital to properly manage them.
      
      To fix this, add a new void *private to netdev_adjacent and functions
      setting/getting it (per link), so that we can save, for example, bonding's
      slave structures there, per slave device.
      
      netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to
      upper dev and populates the neighbour link only with private.
      
      netdev_lower_dev_get_private{,_rcu}() returns the private, if found.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add RCU variant to search for netdev_adjacent link · 5249dec7
      Veaceslav Falico committed
      Currently we have only the RTNL flavour, however we can traverse it while
      holding only RCU, so add the RCU search. Add an RCU variant that uses
      list_head * as an argument, so that it can be universally used afterwards.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add adj_list to save only neighbours · 2f268f12
      Veaceslav Falico committed
      Currently, we distinguish neighbours (first-level linked devices) from
      non-neighbours by the neighbour bool in the netdev_adjacent. This could be
      quite time-consuming if we would like to traverse *only* the
      neighbours, because we'd have to traverse all devices and check for
      this flag. In a (quite common) scenario where we have lots of vlans on
      top of a bridge, which is on top of a bond, the bonding would have to go
      through all those vlans to get its upper neighbour linked devices.
      
      This situation is really unpleasant, because there are already a lot of
      cases where a device with slaves needs to go through them in the hot path.
      
      To fix this, introduce a new upper/lower device lists structure -
      adj_list, which contains only the neighbours. It works always in
      pair with the all_adj_list structure (renamed from upper/lower_dev_list),
      i.e. both of them contain the same links, only that all_adj_list contains
      also non-neighbour device links. It's really a small change visible,
      currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't
      change the main linked logic at all.
      
      Also, add some comments and fix a name collision in
      netdev_for_each_upper_dev_rcu() and rework the naming by the following
      rules:
      
      netdev_(all_)(upper|lower)_*
      
      If "all_" is present, then we work with the whole list of upper/lower
      devices, otherwise - only with direct neighbours. Uninline functions - to
      get better stack traces.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: use lists as arguments instead of bool upper · 7863c054
      Veaceslav Falico committed
      Currently we make use of bool upper when we want to specify whether to
      work with the upper or the lower list. This, however, is harder to read
      and debug, and takes a lot more code.
      
      Fix this by just passing the correct upper/lower_dev_list list_head pointer
      instead of bool upper, and work internally with it.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: neighbour: use source address of last enqueued packet for solicitation · 4ed377e3
      Hannes Frederic Sowa committed
      Currently we always use the first member of the arp_queue to determine
      the sender ip address of the arp packet (or in case of IPv6 - source
      address of the ndisc packet). This skb is fixed as long as the queue is
      not drained by a complete purge because of a timeout or by a successful
      response.
      
      If the first packet enqueued on the arp_queue is from a local application
      with a manually set source address, and the system to be discovered
      performs some kind of uRPF check on the source address in the arp packet,
      the resolving process hangs until a timeout and then restarts. This hurts
      communication with the participating network node.
      
      This could be mitigated a bit if we use the latest enqueued skb's
      source address for the resolving process, which is not as static as
      the arp_queue's head. This change of the source address could result in
      better recovery of a failed solicitation.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Julian Anastasov <ja@ssi.bg>
      Reviewed-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 24 Sep, 2013 4 commits
    • ipv6: do not allow ipv6 module to be removed · 8ce44061
      Cong Wang committed
      There have been bug reports on the ipv6 module removal path before.
      Also, as Stephen pointed out, after vxlan module gets ipv6 support,
      the ipv6 stub it used is not safe against this module removal either.
      So, let's just remove inet6_exit() so that ipv6 module will not be
      able to be unloaded.
      
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix dynamic right sizing · b0983d3c
      Eric Dumazet committed
      Dynamic Right Sizing (DRS) is supposed to open the TCP receive window
      automatically, but suffers from two bugs, presented in order
      of importance.
      
      1) tcp_rcv_space_adjust() fix :
      
      Using twice the last received amount is very pessimistic,
      because it doesn't allow fast recovery or proper slow start
      ramp up, if sender wants to increase cwin by 100% every RTT.
      
      copied = bytes received in previous RTT
      
      2*copied = bytes we expect to receive in next RTT
      
      4*copied = bytes we need to advertise in rwin at end of next RTT
      
      DRS is one RTT late, it needs a 4x factor.
      
      If sender is not using ABC, and increases cwin by 50% every rtt,
      then we needed 1.5*1.5 = 2.25 factor.
      This is probably why this bug was not really noticed.
      
      2) There is no window adjustment after first RTT. DRS triggers only
        after the second RTT.
        DRS needs two RTT to initialize, so tcp_fixup_rcvbuf() should setup
        sk_rcvbuf to allow proper window grow for first two RTT.
      
      This patch increases TCP efficiency particularly for large RTT flows
      when autotuning is used at the receiver, and more particularly
      in presence of packet losses.
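
The factor arithmetic in point 1 can be written out as a small sketch. The function names are hypothetical and this is not the kernel code; it only restates the reasoning that DRS runs one RTT late, so the advertised window has to cover two rounds of sender growth.

```c
/* Hypothetical helper: window multiplier needed when the sender grows its
 * congestion window by `growth` per RTT (2.0 for +100%, 1.5 for +50%).
 * One factor covers the next RTT, the second covers the RTT after it,
 * because the receiver adjusts one RTT late. */
double drs_rwin_factor(double growth)
{
    return growth * growth;
}

/* Hypothetical helper: bytes to advertise at the end of the next RTT,
 * given the bytes copied to the application in the previous RTT. */
unsigned long drs_rwin_bytes(unsigned long copied, double growth)
{
    return (unsigned long)(copied * drs_rwin_factor(growth));
}
```

With a growth of 2.0 the multiplier is 4, matching the 4*copied figure above, and with 1.5 it is 2.25, the case where the old 2x factor was only slightly too small.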
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: syncookies: reduce mss table to four values · 08629354
      Florian Westphal committed
      Halve mss table size to make blind cookie guessing more difficult.
      This is sad since the tables were already small, but there
      is little alternative except perhaps adding more precise mss information
      in the tcp timestamp.  Timestamps are unfortunately not ubiquitous.
      
      Guessing all possible cookie values still has an 8-in-2**32 chance.
      Reported-by: Jakob Lell <jakob@jakoblell.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: syncookies: reduce cookie lifetime to 128 seconds · 8c27bd75
      Florian Westphal committed
      We currently accept cookies that were created less than 4 minutes ago
      (ie, cookies with counter delta 0-3).  Combined with the 8 mss table
      values, this yields 32 possible values (out of 2**32) that will be valid.
      
      Reducing the lifetime to < 2 minutes halves the guessing chance while
      still providing a large enough period.
      
      While at it, get rid of jiffies value -- they overflow too quickly on
      32 bit platforms.
      
      getnstimeofday is used to create a counter that increments every 64s.
      perf shows getnstimeofday cost is negligible compared to sha_transform;
      normal tcp initial sequence number generation uses getnstimeofday, too.
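
The guessing odds quoted in the two syncookie commits follow from simple counting, sketched below with hypothetical names (this is not the kernel implementation). It assumes a cookie is accepted when its counter is fewer than `max_age` ticks old, and that every accepted counter value can be paired with every mss table entry.

```c
#include <stdint.h>

/* Counter granularity stated in the commit message: one tick per 64 s. */
#define COOKIE_PERIOD_SECONDS 64u

/* Hypothetical helper: map wall-clock seconds to the cookie counter. */
uint32_t cookie_counter(uint64_t seconds)
{
    return (uint32_t)(seconds / COOKIE_PERIOD_SECONDS);
}

/* Hypothetical helper: a cookie is fresh if its counter delta is below
 * max_age ticks (2 ticks of 64 s gives the 128 second lifetime). */
int cookie_is_fresh(uint32_t now_counter, uint32_t cookie_counter_value,
                    uint32_t max_age)
{
    return (uint32_t)(now_counter - cookie_counter_value) < max_age;
}

/* Hypothetical helper: number of cookie values (out of 2**32) that would
 * be accepted at any instant. */
uint32_t valid_cookie_count(uint32_t accepted_deltas, uint32_t mss_entries)
{
    return accepted_deltas * mss_entries;
}
```

Four accepted counter deltas with eight mss entries give the 32 valid values mentioned above. Two deltas with eight entries give 16, and two deltas with the four-entry table from 08629354 give the 8-in-2**32 figure.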
      Reported-by: Jakob Lell <jakob@jakoblell.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 21 Sep, 2013 3 commits
  7. 20 Sep, 2013 3 commits
  8. 19 Sep, 2013 1 commit
  9. 18 Sep, 2013 2 commits
  10. 17 Sep, 2013 7 commits
  11. 16 Sep, 2013 2 commits