1. 04 Jan, 2014 1 commit
  2. 13 Dec, 2013 1 commit
    • net-gro: Prepare GRO stack for the upcoming tunneling support · 299603e8
      Committed by Jerry Chu
      This patch modifies the GRO stack to avoid the use of "network_header"
      and associated macros like ip_hdr() and ipv6_hdr() in order to allow
      an arbitrary number of IP hdrs (v4 or v6) to be used in the
      encapsulation chain. This lays the foundation for various IP
      tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later.
      
      With this patch, GRO stack traversal is now mostly based on
      skb_gro_offset rather than special hdr offsets saved in skb (e.g.,
      skb->network_header). As a result all but the top layer (i.e., the
      transport layer) must have hdrs of the same length in order for
      a pkt to be considered for aggregation. Therefore when adding a new
      encap layer (e.g., for tunneling), one must check and skip flows
      (e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a
      different hdr length.
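
      The header-length rule above can be illustrated with a minimal
      userspace sketch. It is not kernel code: struct gro_cb,
      encap_gro_check() and the encap_hlen field are hypothetical
      stand-ins, and only the same_flow name comes from the commit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-packet GRO control block. */
struct gro_cb {
	bool same_flow;      /* still a candidate for aggregation? */
	size_t encap_hlen;   /* length of this encap layer's header */
};

/*
 * Walk the held packets and clear same_flow on every candidate whose
 * encapsulation header length differs from the new packet's.
 * Returns the number of candidates that remain.
 */
static size_t encap_gro_check(struct gro_cb *held, size_t n, size_t new_hlen)
{
	size_t remaining = 0;

	for (size_t i = 0; i < n; i++) {
		if (!held[i].same_flow)
			continue;   /* already ruled out by a lower layer */
		if (held[i].encap_hlen != new_hlen) {
			held[i].same_flow = false;
			continue;
		}
		remaining++;
	}
	return remaining;
}
```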
      
      Note that unlike the network header, the transport header can and
      will continue to be set by the GRO code since there will be at
      most one "transport layer" in the encap chain.
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 11 Dec, 2013 1 commit
    • tipc: remove TIPC usage of field af_packet_priv in struct net_device · 37cb0620
      Committed by Ying Xue
      TIPC is currently using the field 'af_packet_priv' in struct net_device
      as a handle to find the bearer instance associated to the given network
      device. But, by doing so it is blocking other networking cleanups, such
      as the one discussed here:
      
      http://patchwork.ozlabs.org/patch/178044/
      
      This commit removes this usage from TIPC. Instead, we introduce a new
      field, 'tipc_ptr', to the net_device structure, to serve this purpose.
      When a TIPC bearer is enabled, the bearer object is associated to
      'tipc_ptr'. When a TIPC packet arrives in the recv_msg() upcall
      from a networking device, the bearer object can now be obtained from
      'tipc_ptr'. When a bearer is disabled, the bearer object is detached
      from its underlying network device by setting 'tipc_ptr' to NULL.
      
      Additionally, an RCU lock is used to protect the new pointer.
      Henceforth, the existing tipc_net_lock is used in write mode to
      serialize write accesses to this pointer, while the new RCU lock is
      applied on the read side to ensure that the pointer is 100% valid
      within its wrapped area for all readers.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 07 Dec, 2013 1 commit
    • net: introduce dev_consume_skb_any() · e6247027
      Committed by Eric Dumazet
      Some network drivers use dev_kfree_skb_any() and dev_kfree_skb_irq()
      helpers to free skbs, both for dropped packets and TX completed ones.
      
      We need to separate the two causes to get better diagnostics
      from dropwatch or "perf record -e skb:kfree_skb".
      
      This patch provides two new helpers, dev_consume_skb_any() and
      dev_consume_skb_irq() to be used for consumed skbs.
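
      The reason for splitting the helpers can be pictured with a small
      userspace sketch. Every name in it (struct skb_free_stats, enum
      free_reason, account_skb_free()) is hypothetical. It only shows how
      recording the cause at free time keeps drops and normal TX
      completions apart.

```c
/* Hypothetical counters; these are not kernel structures. */
struct skb_free_stats {
	unsigned long dropped;    /* freed because the packet was dropped */
	unsigned long consumed;   /* freed after normal TX completion */
};

enum free_reason {
	FREE_DROPPED,
	FREE_CONSUMED,
};

/* Record why a buffer is being freed so the two causes stay separate. */
static void account_skb_free(struct skb_free_stats *st,
			     enum free_reason reason)
{
	if (reason == FREE_CONSUMED)
		st->consumed++;
	else
		st->dropped++;
}
```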
      
      __dev_kfree_skb_irq() is slightly optimized to remove one
      atomic_dec_and_test() in fast path, and use this_cpu_{r|w} accessors.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 08 Nov, 2013 2 commits
  6. 04 Nov, 2013 1 commit
  7. 29 Oct, 2013 1 commit
    • net: add might_sleep() call to napi_disable · 80c33ddd
      Committed by Jacob Keller
      napi_disable uses an msleep() call to wait for outstanding napi work to be
      finished after setting the disable bit. It does not sleep when there
      was no outstanding work. This resulted in a rare bug in ixgbe_down operation
      where a napi_disable call took place inside of a local_bh_disable()d context.
      In order to enable easier detection of future sleep-while-atomic BUGs, this
      patch adds a might_sleep() call, so that every use of napi_disable during
      atomic context will be visible.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
      Cc: Alexander Duyck <alexander.duyck@intel.com>
      Cc: Hyong-Youb Kim <hykim@myri.com>
      Cc: Amir Vadai <amirv@mellanox.com>
      Cc: Dmitry Kravkov <dmitry@broadcom.com>
      Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  8. 18 Oct, 2013 1 commit
  9. 14 Oct, 2013 1 commit
  10. 08 Oct, 2013 2 commits
  11. 01 Oct, 2013 1 commit
  12. 27 Sep, 2013 6 commits
    • [networking]device.h: Remove extern from function prototypes · f629d208
      Committed by Joe Perches
      There is a mix of function prototypes with and without extern
      in the kernel sources.  Standardize on not using extern for
      function prototypes.
      
      Function prototypes don't need to be written with extern.
      extern is assumed by the compiler.  Its use is as unnecessary as
      using auto to declare automatic/local variables in a block.
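
      As a small illustration, the two declarations below are equivalent
      in C: both declare a function with external linkage. The function
      add_one() is a made-up example and is not taken from the header
      being changed.

```c
/* Both declarations name the same function with external linkage. */
extern int add_one(int x);   /* old style: explicit extern */
int add_one(int x);          /* preferred style: extern is implied */

int add_one(int x)
{
	return x + 1;
}
```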
      Signed-off-by: Joe Perches <joe@perches.com>
    • sysfs: make attr namespace interface less convoluted · 58292cbe
      Committed by Tejun Heo
      sysfs ns (namespace) implementation became more convoluted than
      necessary while trying to hide ns information from visible interface.
      The relatively recent attr ns support is a good example.
      
      * attr ns tag is determined by sysfs_ops->namespace() callback while
        dir tag is determined by kobj_type->namespace().  The placement is
        arbitrary.
      
      * Instead of performing operations with explicit ns tag, the namespace
        callback is routed through sysfs_attr_ns(), sysfs_ops->namespace(),
        class_attr_namespace(), class_attr->namespace().  It's not simpler
        in any sense.  The only thing this convolution does is traversing
        the whole stack backwards.
      
      The namespace callbacks are unnecessary because the operations involved
      are inherently synchronous.  The information can be provided in a
      straightforward top-down direction and reversing that direction is
      unnecessary and against basic design principles.
      
      This backward interface is unnecessarily convoluted and hinders
      properly separating out sysfs from driver model / kobject for proper
      layering.  This patch updates attr ns support such that
      
      * sysfs_ops->namespace() and class_attr->namespace() are dropped.
      
      * sysfs_{create|remove}_file_ns(), which take explicit @ns param, are
        added and sysfs_{create|remove}_file() are now simple wrappers
        around the ns aware functions.
      
      * ns handling is dropped from sysfs_chmod_file().  Nobody uses it at
        this point.  sysfs_chmod_file_ns() can be added later if necessary.
      
      * Explicit @ns is propagated through class_{create|remove}_file_ns()
        and netdev_class_{create|remove}_file_ns().
      
      * drivers/net/bonding which is currently the only user of attr
        namespace is updated to use netdev_class_{create|remove}_file_ns()
        with @bh->net as the ns tag instead of using the namespace callback.
      
      This patch should be an equivalent conversion without any functional
      difference.  It makes the code easier to follow, reduces lines of code
      a bit and helps proper separation and layering.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: add a possibility to get private from netdev_adjacent->list · b6ccba4c
      Committed by Veaceslav Falico
      It will be useful to get first/last element.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add for_each iterators through neighbour lower link's private · 31088a11
      Committed by Veaceslav Falico
      Add a possibility to iterate through netdev_adjacent's private, currently
      only for lower neighbours.
      
      Add both RCU and RTNL/other locking variants of iterators, and make the
      non-RCU variant safe against removal.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add netdev_adjacent->private and allow to use it · 402dae96
      Committed by Veaceslav Falico
      Currently, even though we can access any linked device, we can't attach
      anything to it, which is vital to properly manage them.
      
      To fix this, add a new void *private to netdev_adjacent and functions
      setting/getting it (per link), so that we can save, for example, bonding's
      slave structures there, per slave device.
      
      netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to
      upper dev and populates the neighbour link only with private.
      
      netdev_lower_dev_get_private{,_rcu}() returns the private, if found.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add adj_list to save only neighbours · 2f268f12
      Committed by Veaceslav Falico
      Currently, we distinguish neighbours (first-level linked devices) from
      non-neighbours by the neighbour bool in the netdev_adjacent. This could be
      quite time-consuming in case we would like to traverse *only* through
      neighbours, because we'd have to traverse through all devices and check for
      this flag, and in a (quite common) scenario where we have lots of vlans on
      top of bridge, which is on top of a bond - the bonding would have to go
      through all those vlans to get its upper neighbour linked devices.
      
      This situation is really unpleasant, because there are already a lot of cases
      when a device with slaves needs to go through them in hot path.
      
      To fix this, introduce a new upper/lower device lists structure -
      adj_list, which contains only the neighbours. It works always in
      pair with the all_adj_list structure (renamed from upper/lower_dev_list),
      i.e. both of them contain the same links, only that all_adj_list contains
      also non-neighbour device links. It's really a small change visible,
      currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't
      change the main linked logic at all.
      
      Also, add some comments, fix a name collision in
      netdev_for_each_upper_dev_rcu() and rework the naming by the following
      rules:
      
      netdev_(all_)(upper|lower)_*
      
      If "all_" is present, then we work with the whole list of upper/lower
      devices, otherwise - only with direct neighbours. Uninline functions - to
      get better stack traces.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 16 Sep, 2013 1 commit
  14. 07 Sep, 2013 1 commit
  15. 06 Sep, 2013 1 commit
    • vxlan: Notify drivers for listening UDP port changes · 53cf5275
      Committed by Joseph Gasparakis
      This patch adds two more ndo ops: ndo_add_rx_vxlan_port() and
      ndo_del_rx_vxlan_port().
      
      Drivers can get notifications through the above functions about changes
      of the UDP listening port of VXLAN. Also, when physical ports come up,
      now they can call vxlan_get_rx_port() in order to obtain the port number(s)
      of the existing VXLAN interfaces in case those were already up before them.
      
      This information about the listening UDP port would be used for VXLAN
      related offloads.
      
      A big thank you to John Fastabend (john.r.fastabend@intel.com) for his
      input and his suggestions on this patch set.
      
      CC: John Fastabend <john.r.fastabend@intel.com>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 30 Aug, 2013 2 commits
    • net: add netdev_for_each_upper_dev_rcu() · 8b5be856
      Committed by Veaceslav Falico
      The new macro netdev_for_each_upper_dev_rcu(dev, upper, iter) iterates
      through the dev->upper_dev_list starting from the first element, using
      the netdev_upper_get_next_dev_rcu(dev, &iter).
      
      Must be called under RCU read lock.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add lower_dev_list to net_device and make a full mesh · 5d261913
      Committed by Veaceslav Falico
      This patch adds lower_dev_list list_head to net_device, which is the same
      as upper_dev_list, only for lower devices, and begins to use it in the same
      way as the upper list.
      
      It also changes the way the whole adjacent device lists work - now they
      contain *all* of upper/lower devices, not only the first level. The first
      level devices are distinguished by the bool neighbour field in
      netdev_adjacent, also added by this patch.
      
      There are cases when a device can be added several times to the adjacent
      list, the simplest would be:
      
           /---- eth0.10 ---\
      eth0-		       --- bond0
           \---- eth0.20 ---/
      
      where both bond0 and eth0 'see' each other in the adjacent lists two times.
      To avoid duplication of netdev_adjacent structures ref_nr is being kept as
      the number of times the device was added to the list.
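
      The reference counting described above can be sketched in
      userspace as follows. This is a simplified model under stated
      assumptions, not the kernel implementation: struct adj_entry,
      struct adj_list, MAX_ADJ and the adj_*() helpers are hypothetical,
      and a fixed array replaces the kernel's linked list. Only the
      ref_nr name comes from the commit.

```c
#include <stddef.h>

#define MAX_ADJ 8

/* Hypothetical, simplified model of one adjacency entry. */
struct adj_entry {
	int dev_id;            /* identifies the adjacent device */
	unsigned int ref_nr;   /* how many times the device was added */
};

struct adj_list {
	struct adj_entry e[MAX_ADJ];
	size_t len;
};

static struct adj_entry *adj_find(struct adj_list *l, int dev_id)
{
	for (size_t i = 0; i < l->len; i++)
		if (l->e[i].dev_id == dev_id)
			return &l->e[i];
	return NULL;
}

/* Add dev_id, or bump ref_nr if it is already listed. -1 if full. */
static int adj_insert(struct adj_list *l, int dev_id)
{
	struct adj_entry *a = adj_find(l, dev_id);

	if (a) {
		a->ref_nr++;
		return 0;
	}
	if (l->len == MAX_ADJ)
		return -1;
	l->e[l->len].dev_id = dev_id;
	l->e[l->len].ref_nr = 1;
	l->len++;
	return 0;
}

/* Drop one reference; the entry goes away with the last one. */
static void adj_remove(struct adj_list *l, int dev_id)
{
	struct adj_entry *a = adj_find(l, dev_id);

	if (!a)
		return;
	if (--a->ref_nr > 0)
		return;
	*a = l->e[--l->len];   /* fill the hole with the last entry */
}
```

      In the eth0/bond0 example, bond0 would be inserted twice into
      eth0's list (once per vlan path) and would stay listed until both
      paths are unlinked.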
      
      The 'full view' is achieved by adding, on link creation, all of the
      upper_dev's upper_dev_list devices as upper devices to all of the
      lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice
      versa. On unlink they are removed using the same logic.
      
      I've tested it with thousands of vlans/bonds/bridges, everything works ok and
      no observable lags even on a huge number of interfaces.
      
      Memory footprint for 128 devices interconnected with each other via both
      upper and lower (which is impossible, but for the comparison) lists would be:
      
      128*128*2*sizeof(netdev_adjacent) = 1.5MB
      
      but in the real world we usually have at most several devices with slaves
      and a lot of vlans, so the footprint will be much lower.
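
      The worst-case figure above can be checked with a few lines of C.
      The commit does not state sizeof(netdev_adjacent). The 48-byte
      entry size used in the example is inferred from the arithmetic
      (128 * 128 * 2 * 48 = 1572864 bytes, i.e. 1.5 MiB) and should be
      read as an assumption. adj_footprint_bytes() is a hypothetical
      helper.

```c
#include <stddef.h>

/*
 * Worst-case bytes used by adjacency entries when each of ndev devices
 * lists ndev entries in both its upper and its lower list, as in the
 * formula from the commit message.
 */
static size_t adj_footprint_bytes(size_t ndev, size_t entry_size)
{
	return ndev * ndev * 2 * entry_size;
}
```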
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 02 Aug, 2013 1 commit
  18. 31 Jul, 2013 1 commit
  19. 25 Jul, 2013 1 commit
  20. 24 Jul, 2013 1 commit
  21. 11 Jul, 2013 1 commit
  22. 27 Jun, 2013 1 commit
    • net: fix kernel deadlock with interface rename and netdev name retrieval. · 5dbe7c17
      Committed by Nicolas Schichan
      When the kernel (compiled with CONFIG_PREEMPT=n) is performing the
      rename of a network interface, it can end up waiting for a workqueue
      to complete. If userland is able to invoke a SIOCGIFNAME ioctl or a
      SO_BINDTODEVICE getsockopt in between, the kernel will deadlock due to
      the fact that read_seqcount_begin() will spin forever waiting for the
      writer process (the one doing the interface rename) to update the
      devnet_rename_seq sequence.
      
      This patch fixes the problem by adding a helper (netdev_get_name())
      and using it in the code handling the SIOCGIFNAME ioctl and
      SO_BINDTODEVICE getsockopt.
      
      The netdev_get_name() helper uses raw_seqcount_begin() to avoid
      spinning forever, waiting for devnet_rename_seq->sequence to become
      even. cond_resched() is used in the contended case, before retrying
      the access to give the writer process a chance to finish.
      
      The use of raw_seqcount_begin() will incur some unneeded work in the
      reader process in the contended case, but this is better than
      deadlocking the system.
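
      The reader pattern described above can be sketched in userspace
      with C11 atomics. This is a model under stated assumptions, not
      the kernel helper: rename_seq, dev_name, seq_begin(), seq_retry()
      and get_name() are hypothetical stand-ins, sched_yield() plays the
      role of cond_resched(), and the plain memcpy() of shared data is
      only acceptable because this is a single-threaded illustration.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define NAME_LEN 16

/* Hypothetical stand-ins for devnet_rename_seq and dev->name. */
static atomic_uint rename_seq;
static char dev_name[NAME_LEN] = "eth0";

/* Like raw_seqcount_begin(): never waits, just clears the low bit. */
static unsigned int seq_begin(void)
{
	return atomic_load_explicit(&rename_seq, memory_order_acquire) & ~1u;
}

/* True if a writer was active during, or finished since, seq_begin(). */
static bool seq_retry(unsigned int start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&rename_seq, memory_order_relaxed) != start;
}

/* Copy the name into out (NAME_LEN bytes), retrying on contention. */
static void get_name(char *out)
{
	for (;;) {
		unsigned int seq = seq_begin();

		memcpy(out, dev_name, NAME_LEN);
		if (!seq_retry(seq))
			return;
		sched_yield();   /* let the writer run before retrying */
	}
}
```

      Because seq_begin() masks off the low bit instead of waiting for
      an even value, a reader that starts while a writer is active does
      one wasted copy, fails the retry check, and yields. That is the
      "unneeded work" the commit message accepts in exchange for not
      spinning.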
      Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 14 Jun, 2013 2 commits
    • net/core: Add VF link state control · 1d8faf48
      Committed by Rony Efraim
      Add netlink directives and ndo entry to allow for controlling
      VF link, which can be in one of three states:
      
      Auto - VF link state reflects the PF link state (default)
      
      Up - VF link state is up, traffic from VF to VF works even if
      the actual PF link is down
      
      Down - VF link state is down, no traffic from/to this VF, can be of
      use while configuring the VF
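
      The three states can be summarised in a short sketch. The enum
      and function names below are hypothetical and chosen for
      illustration; they are not the identifiers added by this patch.

```c
#include <stdbool.h>

/* Hypothetical names for the three states described above. */
enum vf_link_state {
	VF_LINK_AUTO,   /* follow the PF link (default) */
	VF_LINK_UP,     /* forced up, even if the PF link is down */
	VF_LINK_DOWN,   /* forced down, no traffic from/to the VF */
};

/* Returns true if the VF should see its link as up. */
static bool vf_link_is_up(enum vf_link_state state, bool pf_link_up)
{
	switch (state) {
	case VF_LINK_UP:
		return true;
	case VF_LINK_DOWN:
		return false;
	case VF_LINK_AUTO:
	default:
		return pf_link_up;
	}
}
```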
      Signed-off-by: Rony Efraim <ronye@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-rps: fixes for rps flow limit · 5f121b9a
      Committed by Willem de Bruijn
      Caught by sparse:
      - __rcu: missing annotation to sd->flow_limit
      - __user: direct access in cpumask_scnprintf
      
      Also
      - add endline character when printing bitmap if room in buffer
      - avoid bucket overflow by reducing FLOW_LIMIT_HISTORY
      
      The last item warrants some explanation. The hashtable buckets are
      subject to overflow if FLOW_LIMIT_HISTORY is larger than or equal
      to bucket size, since all packets may end up in a single bucket. The
      current (rather arbitrary) history value of 256 happens to match the
      buffer size (u8).
      
      As a result, with a single flow, the first 128 packets are accepted
      (correct), the second 128 packets dropped (correct) and then the
      history[] array has filled, so that each subsequent new packet
      causes an increment in the bucket for new_flow plus a decrement
      for old_flow: a steady state.
      
      This is fine if packets are dropped, as the steady state goes away
      as soon as a mix of traffic reappears. But, because the 256th packet
      overflowed the bucket to 0: no packets are dropped.
      
      Instead of explicitly adding an overflow check, this patch changes
      FLOW_LIMIT_HISTORY to never be able to overflow a single bucket.
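
      The overflow can be reproduced with a small userspace model. This
      is a simplified sketch, not the kernel function:
      simulate_single_flow(), NUM_BUCKETS, MAX_HISTORY and the flow
      index are hypothetical, and the decrement-only-if-nonzero step is
      an assumption about how old history entries are aged out.

```c
#include <stdint.h>
#include <string.h>

#define NUM_BUCKETS 16
#define MAX_HISTORY 256

/*
 * Feed npackets packets of one flow through a simplified flow-limit
 * filter with the given history length (a power of two, at most
 * MAX_HISTORY). Returns how many packets were dropped.
 */
static unsigned int simulate_single_flow(unsigned int history_len,
					 unsigned int npackets)
{
	uint8_t buckets[NUM_BUCKETS];    /* u8 counters, as in the commit */
	uint16_t history[MAX_HISTORY];
	unsigned int head = 0, dropped = 0;
	const uint16_t flow = 5;         /* arbitrary bucket index */

	memset(buckets, 0, sizeof(buckets));
	memset(history, 0, sizeof(history));

	for (unsigned int i = 0; i < npackets; i++) {
		uint16_t old_flow = history[head];

		history[head] = flow;
		head = (head + 1) & (history_len - 1);

		if (buckets[old_flow])
			buckets[old_flow]--;

		buckets[flow]++;         /* 255 + 1 wraps to 0 */
		if (buckets[flow] > (history_len >> 1))
			dropped++;
	}
	return dropped;
}
```

      Under these assumptions, a history of 256 stops dropping once the
      u8 counter wraps to 0, while a history of 128 keeps dropping the
      single flow after the first 64 packets.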
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      (first item)
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 12 Jun, 2013 1 commit
  25. 11 Jun, 2013 2 commits
  26. 29 May, 2013 3 commits
  27. 28 May, 2013 1 commit
    • MPLS: Add limited GSO support · 0d89d203
      Committed by Simon Horman
      In the case where a non-MPLS packet is received and an MPLS stack is
      added it may well be the case that the original skb is GSO but the
      NIC used for transmit does not support GSO of MPLS packets.
      
      The aim of this code is to provide GSO in software for MPLS packets
      whose skbs are GSO.
      
      SKB Usage:
      
      When an implementation adds an MPLS stack to a non-MPLS packet it should do
      the following to skb metadata:
      
      * Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
        skb->inner_protocol is added by this patch.
      
      * Set skb->protocol to the new MPLS ethertype of the packet.
      
      * Set skb->network_header to correspond to the
        end of the L3 header, including the MPLS label stack.
      
      I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
      kernel" which adds MPLS support to the kernel datapath of Open vSwitch.
      That patch sets the above requirements in datapath/actions.c:push_mpls()
      and was used to exercise this code.  The datapath patch is against the Open
      vSwitch tree but it is intended that it be added to the Open vSwitch code
      present in the mainline Linux kernel at some point.
      
      Features:
      
      I believe that the approach that I have taken is at least partially
      consistent with the handling of other protocols.  Jesse, I understand that
      you have some ideas here.  I am more than happy to change my implementation.
      
      This patch adds dev->mpls_features which may be used by devices
      to advertise features supported for MPLS packets.
      
      A new NETIF_F_MPLS_GSO feature is added for devices which support
      hardware MPLS GSO offload.  Currently no devices support this
      and MPLS GSO always falls back to software.
      
      Alternate Implementation:
      
      One possible alternate implementation is to teach netif_skb_features()
      and skb_network_protocol() about MPLS, in a similar way to their
      understanding of VLANs. I believe this would avoid the need
      for net/mpls/mpls_gso.c and in particular the calls to
      __skb_push() and __skb_push() in mpls_gso_segment().
      
      I have decided on the implementation in this patch as it should
      not introduce any overhead in the case where mpls_gso is not compiled
      into the kernel or inserted as a module.
      
      MPLS GSO suggested by Jesse Gross.
      Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
      by Pravin B Shelar.
      
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 26 May, 2013 1 commit