1. 08 4月, 2015 1 次提交
  2. 12 3月, 2015 1 次提交
  3. 06 3月, 2015 1 次提交
  4. 04 3月, 2015 2 次提交
    • E
      mpls: Multicast route table change notifications · 8de147dc
      Eric W. Biederman 提交于
      Unlike IPv4 this code notifies on all cases where mpls routes
      are added or removed and it never automatically removes routes.
      Avoiding both the userspace confusion that is caused by omitting
      route updates and the possibility of a flood of netlink traffic
      when an interface goes doew.
      
      For now reserved labels are handled automatically and userspace
      is not notified.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8de147dc
    • E
      mpls: Netlink commands to add, remove, and dump routes · 03c05665
      Eric W. Biederman 提交于
      This change adds two new netlink routing attributes:
      RTA_VIA and RTA_NEWDST.
      
      RTA_VIA specifies the specifies the next machine to send a packet to
      like RTA_GATEWAY.  RTA_VIA differs from RTA_GATEWAY in that it
      includes the address family of the address of the next machine to send
      a packet to.  Currently the MPLS code supports addresses in AF_INET,
      AF_INET6 and AF_PACKET.  For AF_INET and AF_INET6 the destination mac
      address is acquired from the neighbour table.  For AF_PACKET the
      destination mac_address is specified in the netlink configuration.
      
      I think raw destination mac address support with the family AF_PACKET
      will prove useful.  There is MPLS-TP which is defined to operate
      on machines that do not support internet packets of any flavor.  Further
      seem to be corner cases where it can be useful.  At this point
      I don't care much either way.
      
      RTA_NEWDST specifies the destination address to forward the packet
      with.  MPLS typically changes it's destination address at every hop.
      For a swap operation RTA_NEWDST is specified with a length of one label.
      For a push operation RTA_NEWDST is specified with two or more labels.
      For a pop operation RTA_NEWDST is not specified or equivalently an emtpy
      RTAN_NEWDST is specified.
      
      Those new netlink attributes are used to implement handling of rt-netlink
      RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE messages, to maintain the
      MPLS label table.
      
      rtm_to_route_config parses a netlink RTM_NEWROUTE or RTM_DELROUTE message,
      verify no unhandled attributes or unhandled values are present and sets
      up the data structures for mpls_route_add and mpls_route_del.
      
      I did my best to match up with the existing conventions with the caveats
      that MPLS addresses are all destination-specific-addresses, and so
      don't properly have a scope.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03c05665
  5. 20 1月, 2015 1 次提交
  6. 13 1月, 2015 1 次提交
  7. 06 1月, 2015 1 次提交
    • D
      net: tcp: add RTAX_CC_ALGO fib handling · ea697639
      Daniel Borkmann 提交于
      This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
      control metric to be set up and dumped back to user space.
      
      While the internal representation of RTAX_CC_ALGO is handled as a u32
      key, we avoided to expose this implementation detail to user space, thus
      instead, we chose the netlink attribute that is being exchanged between
      user space to be the actual congestion control algorithm name, similarly
      as in the setsockopt(2) API in order to allow for maximum flexibility,
      even for 3rd party modules.
      
      It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
      it should have been stored in RTAX_FEATURES instead, we first thought
      about reusing it for the congestion control key, but it brings more
      complications and/or confusion than worth it.
      
      Joint work with Florian Westphal.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ea697639
  8. 11 11月, 2014 1 次提交
  9. 20 6月, 2013 1 次提交
    • C
      tcp: introduce a per-route knob for quick ack · bcefe17c
      Cong Wang 提交于
      In previous discussions, I tried to find some reasonable heuristics
      for delayed ACK, however this seems not possible, according to Eric:
      
      	"ACKS might also be delayed because of bidirectional
      	traffic, and is more controlled by the application
      	response time. TCP stack can not easily estimate it."
      
      	"ACK can be incredibly useful to recover from losses in
      	a short time.
      
      	The vast majority of TCP sessions are small lived, and we
      	send one ACK per received segment anyway at beginning or
      	retransmits to let the sender smoothly increase its cwnd,
      	so an auto-tuning facility wont help them that much."
      
      and according to David:
      
      	"ACKs are the only information we have to detect loss.
      
      	And, for the same reasons that TCP VEGAS is fundamentally
      	broken, we cannot measure the pipe or some other
      	receiver-side-visible piece of information to determine
      	when it's "safe" to stretch ACK.
      
      	And even if it's "safe", we should not do it so that losses are
      	accurately detected and we don't spuriously retransmit.
      
      	The only way to know when the bandwidth increases is to
      	"test" it, by sending more and more packets until drops happen.
      	That's why all successful congestion control algorithms must
      	operate on explicited tested pieces of information.
      
      	Similarly, it's not really possible to universally know if
      	it's safe to stretch ACK or not."
      
      It still makes sense to enable or disable quick ack mode like
      what TCP_QUICK_ACK does.
      
      Similar to TCP_QUICK_ACK option, but for people who can't
      modify the source code and still wants to control
      TCP delayed ACK behavior. As David suggested, this should belong
      to per-path scope, since different pathes may want different
      behaviors.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Rick Jones <rick.jones2@hp.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      CC: David Laight <David.Laight@ACULAB.COM>
      Signed-off-by: NCong Wang <amwang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bcefe17c
  10. 14 2月, 2013 1 次提交
  11. 13 12月, 2012 1 次提交
  12. 08 12月, 2012 1 次提交
    • C
      bridge: export multicast database via netlink · ee07c6e7
      Cong Wang 提交于
      V5: fix two bugs pointed out by Thomas
          remove seq check for now, mark it as TODO
      
      V4: remove some useless #include
          some coding style fix
      
      V3: drop debugging printk's
          update selinux perm table as well
      
      V2: drop patch 1/2, export ifindex directly
          Redesign netlink attributes
          Improve netlink seq check
          Handle IPv6 addr as well
      
      This patch exports bridge multicast database via netlink
      message type RTM_GETMDB. Similar to fdb, but currently bridge-specific.
      We may need to support modify multicast database too (RTM_{ADD,DEL}MDB).
      
      (Thanks to Thomas for patient reviews)
      
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NCong Wang <amwang@redhat.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee07c6e7
  13. 05 12月, 2012 2 次提交
  14. 29 10月, 2012 3 次提交
  15. 13 10月, 2012 1 次提交
  16. 11 7月, 2012 1 次提交
  17. 28 6月, 2012 1 次提交
  18. 16 4月, 2012 1 次提交
    • J
      net: add generic PF_BRIDGE:RTM_ FDB hooks · 77162022
      John Fastabend 提交于
      This adds two new flags NTF_MASTER and NTF_SELF that can
      now be used to specify where PF_BRIDGE netlink commands should
      be sent. NTF_MASTER sends the commands to the 'dev->master'
      device for parsing. Typically this will be the linux net/bridge,
      or open-vswitch devices. Also without any flags set the command
      will be handled by the master device as well so that current user
      space tools continue to work as expected.
      
      The NTF_SELF flag will push the PF_BRIDGE commands to the
      device. In the basic example below the commands are then parsed
      and programmed in the embedded bridge.
      
      Note if both NTF_SELF and NTF_MASTER bits are set then the
      command will be sent to both 'dev->master' and 'dev' this allows
      user space to easily keep the embedded bridge and software bridge
      in sync.
      
      There is a slight complication in the case with both flags set
      when an error occurs. To resolve this the rtnl handler clears
      the NTF_ flag in the netlink ack to indicate which sets completed
      successfully. The add/del handlers will abort as soon as any
      error occurs.
      
      To support this new net device ops were added to call into
      the device and the existing bridging code was refactored
      to use these. There should be no required changes in user space
      to support the current bridge behavior.
      
      A basic setup with a SR-IOV enabled NIC looks like this,
      
                veth0  veth2
                  |      |
                ------------
                |  bridge0 |   <---- software bridging
                ------------
                     /
                     /
        ethx.y      ethx
          VF         PF
           \         \          <---- propagate FDB entries to HW
           \         \
        --------------------
        |  Embedded Bridge |    <---- hardware offloaded switching
        --------------------
      
      In this case the embedded bridge must be managed to allow 'veth0'
      to communicate with 'ethx.y' correctly. At present drivers managing
      the embedded bridge either send frames onto the network which
      then get dropped by the switch OR the embedded bridge will flood
      these frames. With this patch we have a mechanism to manage the
      embedded bridge correctly from user space. This example is specific
      to SR-IOV but replacing the VF with another PF or dropping this
      into the DSA framework generates similar management issues.
      
      Examples session using the 'br'[1] tool to add, dump and then
      delete a mac address with a new "embedded" option and enabled
      ixgbe driver:
      
      # br fdb add 22:35:19:ac:60:59 dev eth3
      # br fdb
      port    mac addr                flags
      veth0   22:35:19:ac:60:58       static
      veth0   9a:5f:81:f7:f6:ec       local
      eth3    00:1b:21:55:23:59       local
      eth3    22:35:19:ac:60:59       static
      veth0   22:35:19:ac:60:57       static
      #br fdb add 22:35:19:ac:60:59 embedded dev eth3
      #br fdb
      port    mac addr                flags
      veth0   22:35:19:ac:60:58       static
      veth0   9a:5f:81:f7:f6:ec       local
      eth3    00:1b:21:55:23:59       local
      eth3    22:35:19:ac:60:59       static
      veth0   22:35:19:ac:60:57       static
      eth3    22:35:19:ac:60:59       local embedded
      #br fdb del 22:35:19:ac:60:59 embedded dev eth3
      
      I added a couple lines to 'br' to set the flags correctly is all. It
      is my opinion that the merit of this patch is now embedded and SW
      bridges can both be modeled correctly in user space using very nearly
      the same message passing.
      
      [1] 'br' tool was published as an RFC here and will be renamed 'bridge'
          http://patchwork.ozlabs.org/patch/117664/
      
      Thanks to Jamal Hadi Salim, Stephen Hemminger and Ben Hutchings for
      valuable feedback, suggestions, and review.
      
      v2: fixed api descriptions and error case with both NTF_SELF and
          NTF_MASTER set plus updated patch description.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77162022
  19. 22 2月, 2012 1 次提交
    • G
      rtnetlink: Fix problem with buffer allocation · 115c9b81
      Greg Rose 提交于
      Implement a new netlink attribute type IFLA_EXT_MASK.  The mask
      is a 32 bit value that can be used to indicate to the kernel that
      certain extended ifinfo values are requested by the user application.
      At this time the only mask value defined is RTEXT_FILTER_VF to
      indicate that the user wants the ifinfo dump to send information
      about the VFs belonging to the interface.
      
      This patch fixes a bug in which certain applications do not have
      large enough buffers to accommodate the extra information returned
      by the kernel with large numbers of SR-IOV virtual functions.
      Those applications will not send the new netlink attribute with
      the interface info dump request netlink messages so they will
      not get unexpectedly large request buffers returned by the kernel.
      
      Modifies the rtnl_calcit function to traverse the list of net
      devices and compute the minimum buffer size that can hold the
      info dumps of all matching devices based upon the filter passed
      in via the new netlink attribute filter mask.  If no filter
      mask is sent then the buffer allocation defaults to NLMSG_GOODSIZE.
      
      With this change it is possible to add yet to be defined netlink
      attributes to the dump request which should make it fairly extensible
      in the future.
      Signed-off-by: NGreg Rose <gregory.v.rose@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      115c9b81
  20. 09 7月, 2011 1 次提交
  21. 22 6月, 2011 1 次提交
    • J
      net: dcbnl, add multicast group for DCB · 314b4778
      John Fastabend 提交于
      Now that dcbnl is being used in many cases by more
      than a single agent it is beneficial to be notified
      when some entity either driver or user space has
      changed the DCB attributes.
      
      Today applications either end up polling the interface
      or relying on a user space database to maintain the DCB
      state and post events. Polling is a poor solution for
      obvious reasons. And relying on a user space database
      has its own downside. Namely it has created strange
      boot dependencies requiring the database be populated
      before any applications dependent on DCB attributes
      starts or the application goes into a polling loop.
      Populating the database requires negotiating link
      setting with the peer and can take anywhere from less
      than a second up to a few seconds depending on the switch
      implementation.
      
      Perhaps more importantly if another application or an
      embedded agent sets a DCB link attribute the database
      has no way of knowing other than polling the kernel.
      This prevents applications from responding quickly to
      changes in link events which at least in the FCoE case
      and probably any other protocols expecting a lossless
      link may result in IO errors.
      
      By adding a multicast group for DCB we have clean way
      to disseminate kernel DCB link attributes up to user
      space. Avoiding the need for user space to maintain
      a coherant database and disperse events that potentially
      do not reflect the current link state.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      314b4778
  22. 16 11月, 2010 1 次提交
    • A
      net: rtnetlink.h -- only include linux/netdevice.h when used by the kernel · 3b42a96d
      Andy Whitcroft 提交于
      The commit below added a new helper dev_ingress_queue to cleanly obtain the
      ingress queue pointer.  This necessitated including 'linux/netdevice.h':
      
        commit 24824a09
        Author: Eric Dumazet <eric.dumazet@gmail.com>
        Date:   Sat Oct 2 06:11:55 2010 +0000
      
          net: dynamic ingress_queue allocation
      
      However this include triggers issues for applications in userspace
      which use the rtnetlink interfaces.  Commonly this requires they include
      'net/if.h' and 'linux/rtnetlink.h' leading to a compiler error as below:
      
        In file included from /usr/include/linux/netdevice.h:28:0,
                         from /usr/include/linux/rtnetlink.h:9,
                         from t.c:2:
        /usr/include/linux/if.h:135:8: error: redefinition of ‘struct ifmap’
        /usr/include/net/if.h:112:8: note: originally defined here
        /usr/include/linux/if.h:169:8: error: redefinition of ‘struct ifreq’
        /usr/include/net/if.h:127:8: note: originally defined here
        /usr/include/linux/if.h:218:8: error: redefinition of ‘struct ifconf’
        /usr/include/net/if.h:177:8: note: originally defined here
      
      The new helper is only defined for the kernel and protected by __KERNEL__
      therefore we can simply pull the include down into the same protected
      section.
      Signed-off-by: NAndy Whitcroft <apw@canonical.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b42a96d
  23. 05 10月, 2010 2 次提交
  24. 16 9月, 2010 1 次提交
  25. 09 9月, 2010 1 次提交
  26. 23 7月, 2010 1 次提交
  27. 11 5月, 2010 1 次提交
    • P
      ipv6: ip6mr: support multiple tables · d1db275d
      Patrick McHardy 提交于
      This patch adds support for multiple independant multicast routing instances,
      named "tables".
      
      Userspace multicast routing daemons can bind to a specific table instance by
      issuing a setsockopt call using a new option MRT6_TABLE. The table number is
      stored in the raw socket data and affects all following ip6mr setsockopt(),
      getsockopt() and ioctl() calls. By default, a single table (RT6_TABLE_DFLT)
      is created with a default routing rule pointing to it. Newly created pim6reg
      devices have the table number appended ("pim6regX"), with the exception of
      devices created in the default table, which are named just "pim6reg" for
      compatibility reasons.
      
      Packets are directed to a specific table instance using routing rules,
      similar to how regular routing rules work. Currently iif, oif and mark
      are supported as keys, source and destination addresses could be supported
      additionally.
      
      Example usage:
      
      - bind pimd/xorp/... to a specific table:
      
      uint32_t table = 123;
      setsockopt(fd, SOL_IPV6, MRT6_TABLE, &table, sizeof(table));
      
      - create routing rules directing packets to the new table:
      
      # ip -6 mrule add iif eth0 lookup 123
      # ip -6 mrule add oif eth0 lookup 123
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      d1db275d
  28. 26 4月, 2010 1 次提交
    • P
      net: rtnetlink: decouple rtnetlink address families from real address families · 25239cee
      Patrick McHardy 提交于
      Decouple rtnetlink address families from real address families in socket.h to
      be able to add rtnetlink interfaces to code that is not a real address family
      without increasing AF_MAX/NPROTO.
      
      This will be used to add support for multicast route dumping from all tables
      as the proc interface can't be extended to support anything but the main table
      without breaking compatibility.
      
      This partialy undoes the patch to introduce independant families for routing
      rules and converts ipmr routing rules to a new rtnetlink family. Similar to
      that patch, values up to 127 are reserved for real address families, values
      above that may be used arbitrarily.
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      25239cee
  29. 25 2月, 2010 1 次提交
    • P
      net: Add checking to rcu_dereference() primitives · a898def2
      Paul E. McKenney 提交于
      Update rcu_dereference() primitives to use new lockdep-based
      checking. The rcu_dereference() in __in6_dev_get() may be
      protected either by rcu_read_lock() or RTNL, per Eric Dumazet.
      The rcu_dereference() in __sk_free() is protected by the fact
      that it is never reached if an update could change it.  Check
      for this by using rcu_dereference_check() to verify that the
      struct sock's ->sk_wmem_alloc counter is zero.
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-5-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a898def2
  30. 24 12月, 2009 1 次提交
    • L
      net: Add rtnetlink init_rcvwnd to set the TCP initial receive window · 31d12926
      laurent chavey 提交于
      Add rtnetlink init_rcvwnd to set the TCP initial receive window size
      advertised by passive and active TCP connections.
      The current Linux TCP implementation limits the advertised TCP initial
      receive window to the one prescribed by slow start. For short lived
      TCP connections used for transaction type of traffic (i.e. http
      requests), bounding the advertised TCP initial receive window results
      in increased latency to complete the transaction.
      Support for setting initial congestion window is already supported
      using rtnetlink init_cwnd, but the feature is useless without the
      ability to set a larger TCP initial receive window.
      The rtnetlink init_rcvwnd allows increasing the TCP initial receive
      window, allowing TCP connection to advertise larger TCP receive window
      than the ones bounded by slow start.
      Signed-off-by: NLaurent Chavey <chavey@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31d12926
  31. 16 12月, 2009 1 次提交
    • D
      tcp: Revert per-route SACK/DSACK/TIMESTAMP changes. · bb5b7c11
      David S. Miller 提交于
      It creates a regression, triggering badness for SYN_RECV
      sockets, for example:
      
      [19148.022102] Badness at net/ipv4/inet_connection_sock.c:293
      [19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000
      [19148.023035] REGS: eeecbd30 TRAP: 0700   Not tainted  (2.6.32)
      [19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR>  CR: 24002442  XER: 00000000
      [19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000
      
      This is likely caused by the change in the 'estab' parameter
      passed to tcp_parse_options() when invoked by the functions
      in net/ipv4/tcp_minisocks.c
      
      But even if that is fixed, the ->conn_request() changes made in
      this patch series is fundamentally wrong.  They try to use the
      listening socket's 'dst' to probe the route settings.  The
      listening socket doesn't even have a route, and you can't
      get the right route (the child request one) until much later
      after we setup all of the state, and it must be done by hand.
      
      This stuff really isn't ready, so the best thing to do is a
      full revert.  This reverts the following commits:
      
      f55017a9
      022c3f7d
      1aba721e
      cda42ebd
      345cda2f
      dc343475
      05eaade2
      6a2a2d6bSigned-off-by: NDavid S. Miller <davem@davemloft.net>
      bb5b7c11
  32. 05 11月, 2009 1 次提交
  33. 29 10月, 2009 3 次提交