1. 15 Mar, 2016 1 commit
  2. 10 Mar, 2016 1 commit
  3. 09 Mar, 2016 4 commits
    • bpf: pre-allocate hash map elements · 6c905981
      Alexei Starovoitov committed
      If a kprobe is placed on spin_unlock then calling kmalloc/kfree from
      bpf programs is not safe, since the following deadlock is possible:
      kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
      bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
      and deadlocks.
      
      The following solutions were considered, and some were implemented, but
      all were eventually discarded:
      - kmem_cache_create for every map
      - add recursion check to slow-path of slub
      - use reserved memory in bpf_map_update for in_irq or in preempt_disabled
      - kmalloc via irq_work
      
      In the end, pre-allocation of all map elements turned out to be the simplest
      solution and, since the user is charged upfront for all the memory, such
      pre-allocation doesn't affect the user space visible behavior.
      
      Since it's impossible to tell whether kprobe is triggered in a safe
      location from kmalloc point of view, use pre-allocation by default
      and introduce new BPF_F_NO_PREALLOC flag.
      
      While testing per-cpu hash maps it was discovered
      that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
      fails to allocate memory even when 90% of it is free.
      The pre-allocation of per-cpu hash elements solves this problem as well.
      
      It turned out that bpf_map_update() quickly followed by
      bpf_map_lookup()+bpf_map_delete() is a very common pattern used
      in many of iovisor/bcc/tools, so there is an additional benefit of
      pre-allocation, since such use cases are much faster.
      
      Since all hash map elements are now pre-allocated we can remove
      the atomic increment of htab->count and save a few more cycles.
      
      Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
      large malloc/free done by users who don't have sufficient limits.
      
      Pre-allocation is done with vmalloc and alloc/free is done
      via percpu_freelist. Here are performance numbers for different
      pre-allocation algorithms that were implemented, but discarded
      in favor of percpu_freelist:
      
      1 cpu:
      pcpu_ida         2.1M
      pcpu_ida nolock  2.3M
      bt               2.4M
      kmalloc          1.8M
      hlist+spinlock   2.3M
      pcpu_freelist    2.6M
      
      4 cpu:
      pcpu_ida         1.5M
      pcpu_ida nolock  1.8M
      bt w/smp_align   1.7M
      bt no/smp_align  1.1M
      kmalloc          0.7M
      hlist+spinlock   0.2M
      pcpu_freelist    2.0M
      
      8 cpu:
      pcpu_ida         0.7M
      bt w/smp_align   0.8M
      kmalloc          0.4M
      pcpu_freelist    1.5M
      
      32 cpu:
      kmalloc          0.13M
      pcpu_freelist    0.49M
      
      pcpu_ida nolock is a modified percpu_ida algorithm without
      percpu_ida_cpu locks and without cross-cpu tag stealing.
      It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
      
      bt is a variant of block/blk-mq-tag.c simplified and customized
      for the bpf use case. bt w/smp_align uses a cache line for every 'long'
      (similar to blk-mq-tag). bt no/smp_align allocates 'long'
      bitmasks contiguously to save memory. It's comparable to percpu_ida
      and in some cases faster, but slower than percpu_freelist.
      
      hlist+spinlock is the simplest free list with a single spinlock.
      As expected, it scales very badly on SMP.
      
      kmalloc is the existing implementation, which is still available via
      the BPF_F_NO_PREALLOC flag. It's significantly slower on a single cpu and
      in the 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
      but it saves memory, so in cases where map->max_entries can be large
      and the number of map updates/deletes per second is low, it may make
      sense to use it.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
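The per-cpu free list idea described in the commit above can be illustrated with a small toy model. This is only a sketch, not the kernel's percpu_freelist code: the class name `PerCpuFreelist` and its methods are hypothetical, locking is left out entirely, and the "steal from another CPU when the local list is empty" policy is an assumption about how such a list would typically behave.

```python
class PerCpuFreelist:
    """Toy model of a per-CPU free list: one LIFO list of free elements per CPU."""

    def __init__(self, num_cpus, num_elems):
        self.lists = [[] for _ in range(num_cpus)]
        # Pre-allocate every element up front and spread them over the CPUs.
        for elem in range(num_elems):
            self.lists[elem % num_cpus].append(elem)

    def pop(self, cpu):
        """Take a free element, preferring the calling CPU's own list."""
        n = len(self.lists)
        for i in range(n):
            candidate = self.lists[(cpu + i) % n]
            if candidate:
                return candidate.pop()
        return None  # every pre-allocated element is in use

    def push(self, cpu, elem):
        """Give an element back to the list of the CPU that frees it."""
        self.lists[cpu].append(elem)


freelist = PerCpuFreelist(num_cpus=2, num_elems=3)
taken = [freelist.pop(0) for _ in range(4)]  # the fourth pop finds nothing left
```

Because no allocator is called after setup, an update path built on such a list never needs kmalloc, which is the property the commit relies on to avoid the deadlock.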
    • bpf: support for access to tunnel options · 14ca0751
      Daniel Borkmann committed
      After eBPF being able to programmatically access/manage tunnel key meta
      data via commit d3aa45ce ("bpf: add helpers to access tunnel metadata")
      and more recently also for IPv6 through c6c33454 ("bpf: support ipv6
      for bpf_skb_{set,get}_tunnel_key"), this work adds two complementary
      helpers to generically access their auxiliary tunnel options.
      
      Geneve and vxlan support this facility. For geneve, TLVs can be pushed,
      and for vxlan, its GBP extension. I.e. setting the tunnel key for the geneve
      case only makes sense if we can also read/write TLVs into it. In the GBP
      case, it provides the flexibility to easily map the group policy ID in
      combination with other helpers or maps.
      
      I chose to model this as two separate helpers, bpf_skb_{set,get}_tunnel_opt(),
      for a couple of reasons. bpf_skb_{set,get}_tunnel_key() is already rather
      complex by itself, and there may be cases for tunnel key backends where
      tunnel options are not always needed. If we had integrated this
      into bpf_skb_{set,get}_tunnel_key() nevertheless, we would be very limited with
      remaining helper arguments, so keeping compatibility on structs in case of
      passing in a flat buffer gets more cumbersome. Separating both also allows
      for more flexibility and future extensibility, e.g. options could be fed
      directly from a map, etc.
      
      Moreover, change geneve's xmit path to test only for info->options_len
      instead of TUNNEL_GENEVE_OPT flag. This makes it more consistent with vxlan's
      xmit path and allows for avoiding to specify a protocol flag in the API on
      xmit, so it can be protocol agnostic. Having info->options_len is enough
      information that is needed. Tested with vxlan and geneve.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: allow to propagate df in bpf_skb_set_tunnel_key · 22080870
      Daniel Borkmann committed
      Added by 9a628224 ("ip_tunnel: Add dont fragment flag."), allow to
      feed df flag into tunneling facilities (currently supported on TX by
      vxlan, geneve and gre) as a hint from eBPF's bpf_skb_set_tunnel_key()
      helper.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add flags to bpf_skb_store_bytes for clearing hash · 8afd54c8
      Daniel Borkmann committed
      When overwriting parts of the packet with bpf_skb_store_bytes() that
      were fed previously into skb->hash calculation, we should clear the
      current hash with skb_clear_hash(), so that a next skb_get_hash() call
      can determine the correct hash related to this skb.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 05 Mar, 2016 2 commits
  5. 04 Mar, 2016 2 commits
  6. 03 Mar, 2016 2 commits
  7. 02 Mar, 2016 6 commits
    • introduce IFE action · ef6980b6
      Jamal Hadi Salim committed
      This action allows for a sending side to encapsulate arbitrary metadata
      which is decapsulated by the receiving end.
      The sender runs in encoding mode and the receiver in decode mode.
      Both sender and receiver must specify the same ethertype.
      At some point we hope to have a registered ethertype and we'll
      then provide a default so the user doesn't have to specify it.
      For now we require the user to specify it.
      
      Let's show example usage where we encode icmp from a sender towards
      a receiver with an skbmark of 17; both sender and receiver use
      ethertype of 0xdead to interop.
      
      YYYY: Let's start with the receiver-side policy config:
      xxx: add an ingress qdisc
      sudo tc qdisc add dev $ETH ingress
      
      xxx: any packets with ethertype 0xdead will be subjected to ife decoding
      xxx: we then restart the classification so we can match on icmp at prio 3
      sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xdead \
      u32 match u32 0 0 flowid 1:1 \
      action ife decode reclassify
      
      xxx: on restarting the classification from above if it was an icmp
      xxx: packet, then match it here and continue to the next rule at prio 4
      xxx: which will match based on skb mark of 17
      sudo tc filter add dev $ETH parent ffff: prio 3 protocol ip \
      u32 match ip protocol 1 0xff flowid 1:1 \
      action continue
      
      xxx: match on skbmark of 0x11 (decimal 17) and accept
      sudo tc filter add dev $ETH parent ffff: prio 4 protocol ip \
      handle 0x11 fw flowid 1:1 \
      action ok
      
      xxx: Let's show the decoding policy
      sudo tc -s filter ls dev $ETH parent ffff: protocol 0xdead
      xxx:
      filter pref 2 u32
      filter pref 2 u32 fh 800: ht divisor 1
      filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit 0 success 0)
        match 00000000/00000000 at 0 (success 0 )
              action order 1: ife decode action reclassify
               index 1 ref 1 bind 1 installed 14 sec used 14 sec
               type: 0x0
               Metadata: allow mark allow hash allow prio allow qmap
              Action statistics:
              Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
      xxx:
      Observe that the above lists all the metadata it can decode. Typically these
      submodules will already be compiled into a monolithic kernel or
      loaded as modules.
      
      YYYY: Let's show the sender side now.
      
      xxx: Add an egress qdisc on the sender netdev
      sudo tc qdisc add dev $ETH root handle 1: prio
      xxx:
      xxx: Match all icmp packets to 192.168.122.237/24, then
      xxx: tag the packet with skb mark of decimal 17, then
      xxx: Encode it with:
      xxx:	ethertype 0xdead
      xxx:	add skb->mark to whitelist of metadatum to send
      xxx:	rewrite target dst MAC address to 02:15:15:15:15:15
      xxx:
      sudo $TC filter add dev $ETH parent 1: protocol ip prio 10  u32 \
      match ip dst 192.168.122.237/24 \
      match ip protocol 1 0xff \
      flowid 1:2 \
      action skbedit mark 17 \
      action ife encode \
      type 0xDEAD \
      allow mark \
      dst 02:15:15:15:15:15
      
      xxx: Let's show the encoding policy
      sudo tc -s filter ls dev $ETH parent 1: protocol ip
      xxx:
      filter pref 10 u32
      filter pref 10 u32 fh 800: ht divisor 1
      filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2  (rule hit 0 success 0)
        match c0a87aed/ffffffff at 16 (success 0 )
        match 00010000/00ff0000 at 8 (success 0 )
      
      	action order 1:  skbedit mark 17
      	 index 6 ref 1 bind 1
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      	action order 2: ife encode action pipe
      	 index 3 ref 1 bind 1
      	 dst MAC: 02:15:15:15:15:15 type: 0xDEAD
       	 Metadata: allow mark
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      xxx:
      
      Test by sending a ping from sender to destination.
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: mcast: add support for more router port information dumping · 59f78f9f
      Nikolay Aleksandrov committed
      Allow for more multicast router port information to be dumped, such as
      timer and type attributes. For that purpose we need to extend the
      MDBA_ROUTER_PORT attribute, similar to how it was done for the mdb entries
      recently. The new format is thus:
      [MDBA_ROUTER_PORT] = { <- nested attribute
          u32 ifindex <- router port ifindex for user-space compatibility
          [MDBA_ROUTER_PATTR attributes]
      }
      This way it remains compatible with older users (they'll simply retrieve
      the u32 in the beginning) and new users can parse the remaining
      attributes. It would also allow to add future extensions to the router
      port without breaking compatibility.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
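The compatibility argument for the nested MDBA_ROUTER_PORT format can be made concrete with a small parser sketch. It is a user-space illustration only: the numeric attribute types, the payload sizes (a 32-bit timer and an 8-bit type) and all helper names are assumptions made for this example, and the byte layout simply follows the usual netlink attribute convention of a 16-bit length, a 16-bit type and padding to 4 bytes.

```python
import struct

# Assumed numeric values for the nested attribute types (illustration only).
MDBA_ROUTER_PATTR_TIMER = 1
MDBA_ROUTER_PATTR_TYPE = 2


def nla_align(length):
    """Round a length up to the 4-byte netlink attribute alignment."""
    return (length + 3) & ~3


def put_attr(attr_type, payload):
    """Encode one attribute: u16 length, u16 type, payload, padding."""
    length = 4 + len(payload)
    padding = b"\0" * (nla_align(length) - length)
    return struct.pack("=HH", length, attr_type) + payload + padding


def build_router_port(ifindex, timer, router_type):
    """Payload of the nested attribute: leading u32 ifindex, then attributes."""
    return (struct.pack("=I", ifindex)
            + put_attr(MDBA_ROUTER_PATTR_TIMER, struct.pack("=I", timer))
            + put_attr(MDBA_ROUTER_PATTR_TYPE, struct.pack("=B", router_type)))


def parse_legacy(payload):
    """An older reader only looks at the u32 at the start."""
    return struct.unpack_from("=I", payload, 0)[0]


def parse_extended(payload):
    """A newer reader takes the ifindex, then walks the remaining attributes."""
    attrs = {}
    offset = 4
    while offset + 4 <= len(payload):
        length, attr_type = struct.unpack_from("=HH", payload, offset)
        if length < 4 or offset + length > len(payload):
            break  # malformed attribute, stop parsing
        attrs[attr_type] = payload[offset + 4:offset + length]
        offset += nla_align(length)
    return parse_legacy(payload), attrs


payload = build_router_port(ifindex=3, timer=250, router_type=2)
```

Both readers agree on the ifindex; only the extended reader sees the timer and type, which is why older tools keep working unchanged.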
    • bridge: mcast: add support for temporary port router · a55d8246
      Nikolay Aleksandrov committed
      Add support for a temporary router port which doesn't depend only on the
      incoming query. It can be refreshed if set to the same value, which is
      a no-op for the rest.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: mcast: use names for the different multicast_router types · 7f0aec7a
      Nikolay Aleksandrov committed
      Using raw values makes the code difficult to extend and to understand,
      so give them names and do explicit per-option manipulation in
      br_multicast_set_port_router.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Introduce devlink infrastructure · bfcd3a46
      Jiri Pirko committed
      Introduce devlink infrastructure for drivers to register and expose to
      userspace via generic Netlink interface.
      
      There are two basic objects defined:
      devlink - one instance for every "parent device", for example switch ASIC
      devlink port - one instance for every physical port of the device.
      
      This initial portion implements basic get/dump of objects to userspace.
      Also, port splitting and port type setting are implemented.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_u32 add bit to specify software only rules · 9e8ce79c
      John Fastabend committed
      In the initial implementation the only way to stop a rule from being
      inserted into the hardware table was via the device feature flag.
      However, this doesn't work well when working on an end host system
      where packets are expected to hit both the hardware and software
      datapaths.
      
      For example we can imagine a rule that will match an IP address and
      increment a field. If we install this rule in both hardware and
      software we may increment the field twice. To date we have only
      added support for the drop action so we have been able to ignore
      these cases. But as we extend the action support we will hit this
      example plus more such cases. Arguably these are not even corner
      cases; in many working systems these cases will be common.
      
      To avoid forcing the driver to always abort (i.e. the above example)
      this patch adds a flag to add a rule in software only. A careful
      user can use this flag to build software and hardware datapaths
      that work together. One example we have found particularly useful
      is to use hardware resources to set the skb->mark on the skb when
      the match may be expensive to run in software but a mark lookup
      in a hash table is cheap. The idea here is hardware can do in one
      lookup what the u32 classifier may need to traverse multiple lists
      and hash tables to compute. The flag is only passed down on inserts.
      On deletion to avoid stale references in hardware we always try
      to remove a rule if it exists.
      
      The flags field is part of the classifier specific options. Although
      it is tempting to lift this into the generic structure, doing this
      proves difficult due to how the tc netlink attributes are implemented
      along with how the dump/change routines are called. There is also
      precedent for putting seemingly generic pieces in the specific
      classifier options such as TCA_U32_POLICE, TCA_U32_ACT, etc. So
      although not ideal, I've left FLAGS in the u32 options as well, as it
      simplifies the code greatly and user space has already learned how
      to manage these bits, as the 'tc' tool does.
      
      Also, when trying to update a rule we require the flags to
      be unchanged. This is to force user space, software u32 and
      the hardware u32 to keep in sync. Thanks to Simon Horman for
      catching this case.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 29 Feb, 2016 1 commit
  9. 26 Feb, 2016 2 commits
    • net: ethtool: add new ETHTOOL_xLINKSETTINGS API · 3f1ac7a7
      David Decotigny committed
      This patch defines a new ETHTOOL_GLINKSETTINGS/SLINKSETTINGS API,
      handled by the new get_link_ksettings/set_link_ksettings callbacks.
      This API provides support for most legacy ethtool_cmd fields, adds
      support for larger link mode masks (up to 4064 bits, variable length),
      and removes the deprecated ethtool_cmd fields
      (transceiver/maxrxpkt/maxtxpkt).
      
      This API is deprecating the legacy ETHTOOL_GSET/SSET API and provides
      the following backward compatibility properties:
       - legacy ethtool with legacy drivers: no change, still using the
         get_settings/set_settings callbacks.
       - legacy ethtool with new get/set_link_ksettings drivers: the new
         driver callbacks are used, data internally converted to legacy
         ethtool_cmd. ETHTOOL_GSET will return only the 1st 32b of each link
         mode mask. ETHTOOL_SSET will fail if the user tries to set the
         deprecated ethtool_cmd fields (transceiver/maxrxpkt/maxtxpkt) to
         non-0. A kernel warning is logged if the driver sets higher bits.
       - future ethtool with legacy drivers: no change, still using the
         get_settings/set_settings callbacks, internally converted to new data
         structure. Deprecated fields (transceiver/maxrxpkt/maxtxpkt) will be
         ignored and seen as 0 from user space. Note that the "future"
         ethtool tool will not allow changes to these deprecated fields.
       - future ethtool with new drivers: direct call to the new callbacks.
      
      By "future" ethtool, what is meant is:
       - query: first try ETHTOOL_GLINKSETTINGS, and fall back to ETHTOOL_GSET
         if that fails
       - set: query first and remember which of ETHTOOL_GLINKSETTINGS or
         ETHTOOL_GSET was successful
         + if ETHTOOL_GLINKSETTINGS was successful, then change config with
           ETHTOOL_SLINKSETTINGS. A failure there is final (do not try
           ETHTOOL_SSET).
         + otherwise ETHTOOL_GSET was successful, change config with
           ETHTOOL_SSET. A failure there is final (do not try
           ETHTOOL_SLINKSETTINGS).
      
      The user/kernel interaction via the new API requires a small
      ETHTOOL_GLINKSETTINGS handshake first to agree on the length of the link
      mode bitmaps. If the kernel doesn't agree with the user, it returns the
      bitmap length it is expecting from the user as a negative length (and the
      cmd field is 0). When kernel and user agree, the kernel returns valid info
      in all fields (i.e. link mode length > 0 and cmd is ETHTOOL_GLINKSETTINGS).
      
      Data structure crossing user/kernel boundary is 32/64-bit
      agnostic. Converted internally to a legal kernel bitmap.
      
      The internal __ethtool_get_settings kernel helper will gradually be
      replaced by __ethtool_get_link_ksettings by the time the first
      "link_settings" drivers start to appear. So this patch doesn't change
      it; it will be removed before it needs to be changed.
      Signed-off-by: David Decotigny <decot@googlers.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
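The length handshake described in the commit above can be modelled in a few lines. This is a sketch against a mock, not real ioctl code: `kernel_glinksettings`, `query_link_settings` and `KERNEL_NWORDS` are hypothetical names, the mask contents are made up, and the reply shape (cmd of 0 plus a negative word count on disagreement) follows the wording of the commit message.

```python
ETHTOOL_GLINKSETTINGS = 0x4C  # command number as defined in the uapi ethtool header

# Assumption for this sketch: the mock kernel uses one 32-bit word per mask.
KERNEL_NWORDS = 1


def kernel_glinksettings(cmd, nwords):
    """Mock of the kernel side; returns (cmd, nwords, masks)."""
    if cmd != ETHTOOL_GLINKSETTINGS:
        raise ValueError("unexpected command")
    if nwords != KERNEL_NWORDS:
        # No agreement: report the expected length as a negative number.
        return 0, -KERNEL_NWORDS, None
    masks = {
        "supported": [0x2F] * KERNEL_NWORDS,    # made-up bitmap contents
        "advertising": [0x2F] * KERNEL_NWORDS,
    }
    return ETHTOOL_GLINKSETTINGS, KERNEL_NWORDS, masks


def query_link_settings():
    """User side: ask with length 0 first, then retry with the reported length."""
    cmd, nwords, masks = kernel_glinksettings(ETHTOOL_GLINKSETTINGS, 0)
    if cmd == 0 and nwords < 0:
        cmd, nwords, masks = kernel_glinksettings(ETHTOOL_GLINKSETTINGS, -nwords)
    if cmd != ETHTOOL_GLINKSETTINGS or nwords <= 0:
        raise OSError("link settings handshake failed")
    return nwords, masks


nwords, masks = query_link_settings()
```

A real tool would issue the two requests through the ethtool ioctl and, per the rules listed in the commit, fall back to ETHTOOL_GSET only if the first ETHTOOL_GLINKSETTINGS query fails.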
    • net: ipv6: Make address flushing on ifdown optional · f1705ec1
      David Ahern committed
      Currently, all ipv6 addresses are flushed when the interface is configured
      down, including global, static addresses:
      
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
              inet6 2100:1::2/120 scope global
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
                 valid_lft forever preferred_lft forever
          $ ip link set dev eth1 down
          $ ip -6 addr show dev eth1
          << nothing; all addresses have been flushed>>
      
      Add a new sysctl to make this behavior optional. The new setting defaults to
      flushing all addresses to maintain backwards compatibility. When it is set,
      global addresses with no expire times are not flushed on an admin down. The
      sysctl is per-interface or system-wide for all interfaces:
      
          $ sysctl -w net.ipv6.conf.eth1.keep_addr_on_down=1
      or
          $ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1
      
      Will keep addresses on eth1 on an admin down.
      
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
              inet6 2100:1::2/120 scope global
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
                 valid_lft forever preferred_lft forever
          $ ip link set dev eth1 down
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST> mtu 1500 state DOWN qlen 1000
              inet6 2100:1::2/120 scope global tentative
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link tentative
                 valid_lft forever preferred_lft forever
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 25 Feb, 2016 1 commit
    • bpf: fix csum setting for bpf_set_tunnel_key · 2da897e5
      Daniel Borkmann committed
      The fix in 35e2d115 ("tunnels: Allow IPv6 UDP checksums to be correctly
      controlled.") changed behavior for bpf_set_tunnel_key() when in use with
      IPv6 and thus uncovered a bug that TUNNEL_CSUM needed to be set but wasn't.
      As a result, the stack dropped ingress vxlan IPv6 packets, that have been
      sent via eBPF through collect meta data mode due to checksum now being zero.
      
      Since LCO, we enable the IPv4 checksum by default, so make this analogous
      and only provide a flag, BPF_F_ZERO_CSUM_TX, for the user to turn it off in
      the IPv4 case.
      
      Fixes: 35e2d115 ("tunnels: Allow IPv6 UDP checksums to be correctly controlled.")
      Fixes: c6c33454 ("bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 24 Feb, 2016 4 commits
    • cfg80211: Add global RRM capability · 0c9ca11b
      Beni Lev committed
      Today, the supplicant will add the RRM capabilities
      Information Element in the association request only if
      Quiet period is supported (NL80211_FEATURE_QUIET).
      
      Quiet is one of many RRM features, and there are other RRM
      features that are not related to Quiet (e.g. neighbor
      report). Therefore, requiring Quiet to enable RRM is too
      restrictive.
      Some of the features, like neighbor report, can be
      supported by user space without any help from the kernel.
      Hence adding the RRM capabilities IE to the association request
      should be solely user space's decision.
      Removing the RRM dependency on Quiet in the driver solves
      this problem, but using an old driver with a user space
      tool that would not require the Quiet feature would be
      problematic: user space would add NL80211_ATTR_USE_RRM
      to the association request even if the kernel doesn't
      advertise NL80211_FEATURE_QUIET, and the association would
      be denied by the kernel.
      
      This solution adds a global RRM capability that tells user
      space that it can request publishing of the RRM capabilities IE
      without any specific feature support in the kernel.
      Signed-off-by: Beni Lev <beni.lev@intel.com>
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • cfg80211: basic support for PBSS network type · 34d50519
      Lior David committed
      PBSS (Personal Basic Service Set) is a new BSS type for DMG
      networks. It is similar to infrastructure BSS, having an AP-like
      entity called PCP (PBSS Control Point), but it has a few differences.
      PBSS support is mandatory for 11ad devices.
      
      Add support for PBSS by introducing a new PBSS flag attribute.
      The PBSS flag is used in the START_AP command to request starting
      a PCP instead of an AP, and in the CONNECT command to request
      connecting to a PCP instead of an AP.
      Signed-off-by: Lior David <liord@codeaurora.org>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • rfkill: Update userspace API documentation · d4634e8d
      João Paulo Rechi Vita committed
      Add a note to userspace on the effect of RFKILL_OP_CHANGE_ALL also
      updating the default state for hotplugged devices.
      Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>
      [reword a bit]
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • nfit: update address range scrub commands to the acpi 6.1 format · 4577b066
      Dan Williams committed
      The original format of these commands from the "NVDIMM DSM Interface
      Example" [1] is superseded by the ACPI 6.1 definition of the "NVDIMM Root
      Device _DSMs" [2].
      
      [1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
      [2]: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
           "9.20.7 NVDIMM Root Device _DSMs"
      
      Changes include:
      1/ New 'restart' fields in ars_status, unfortunately these are
         implemented in the middle of the existing definition so this change
         is not backwards compatible.  The expectation is that shipping
         platforms will only ever support the ACPI 6.1 definition.
      
      2/ New status values for ars_start ('busy') and ars_status ('overflow').
      
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Linda Knippers <linda.knippers@hpe.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  12. 22 Feb, 2016 2 commits
    • bpf: fix csum update in bpf_l4_csum_replace helper for udp · 2f72959a
      Daniel Borkmann committed
      When using this helper for updating UDP checksums, we need to extend
      it to write CSUM_MANGLED_0 for csum computations that result
      in 0 as the sum. The reason is that packets with a checksum
      could otherwise become incorrectly marked as packets without a checksum.
      Likewise, if the user indicates BPF_F_MARK_MANGLED_0, then we should
      not turn packets without a checksum into ones with a checksum.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
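The two rules in the commit above fit into one small decision function. This is a simplified model rather than the helper itself: `l4_csum_field_after_update` is a hypothetical name, the freshly computed checksum is passed in instead of being derived from packet data, and 0xFFFF is used for CSUM_MANGLED_0 because that is the value UDP transmits in place of a computed zero.

```python
CSUM_MANGLED_0 = 0xFFFF  # sent on the wire when a computed UDP checksum is 0


def l4_csum_field_after_update(old_field, new_csum, mark_mangled_0):
    """Return the 16-bit value to store in the checksum field.

    old_field      -- value currently in the packet (0 means "no checksum" for UDP)
    new_csum       -- checksum produced by the update computation
    mark_mangled_0 -- models the BPF_F_MARK_MANGLED_0 flag
    """
    if mark_mangled_0 and old_field == 0:
        # The packet carried no checksum; do not give it one.
        return 0
    if mark_mangled_0 and new_csum == 0:
        # A computed zero would read as "no checksum", so store the mangled form.
        return CSUM_MANGLED_0
    return new_csum
```

Without the flag the function simply stores what was computed, which is the pre-existing behaviour for protocols where zero has no special meaning.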
    • bpf: add generic bpf_csum_diff helper · 7d672345
      Daniel Borkmann committed
      For L4 checksums, we currently have bpf_l4_csum_replace() helper. It's
      currently limited to handle 2 and 4 byte changes in a header and feeds the
      from/to into inet_proto_csum_replace{2,4}() helpers of the kernel. When
      working with IPv6, for example, this makes it rather cumbersome to deal
      with, similarly when editing larger parts of a header.
      
      Instead, extend the API in a more generic way: For bpf_l4_csum_replace(),
      add a case for header field mask of 0 to change the checksum at a given
      offset through inet_proto_csum_replace_by_diff(), and provide a helper
      bpf_csum_diff() that can generically calculate a from/to diff for arbitrary
      amounts of data.
      
      This can be used in multiple ways: for the bpf_l4_csum_replace() only
      part, this even provides us with the option to insert precalculated diffs
      from user space, e.g. from a map, or from bpf_csum_diff() during runtime.
      
      bpf_csum_diff() has an optional from/to stack buffer input, so we can
      calculate a diff by using a scratch buffer for scenarios where we're
      inserting (from is NULL), removing (to is NULL) or diffing (from/to buffers
      don't need to be of equal size) data. Also, bpf_csum_diff() allows feeding
      a previous csum into csum_partial(), so the function can also be
      cascaded.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
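The arithmetic behind a from/to checksum diff can be shown with plain ones' complement sums. The functions below are a user-space sketch that borrows the kernel helper names only for readability; they are not the kernel implementations, the seed handling is simplified, and the requirement that buffer sizes be multiples of 4 is treated here as an assumption of the sketch.

```python
def csum_partial(data, seed=0):
    """16-bit ones' complement sum of `data` (big-endian words) added to `seed`."""
    if len(data) % 2:
        data += b"\0"
    total = seed
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum_diff(frm, to, seed=0):
    """Value that converts a sum over `frm` into a sum over `to`."""
    if len(frm) % 4 or len(to) % 4:
        raise ValueError("sizes must be multiples of 4 in this sketch")
    inverted = bytes(b ^ 0xFF for b in frm)  # ones' complement negation of `frm`
    return csum_partial(to, csum_partial(inverted, seed))


def csum_replace_by_diff(check, diff):
    """Apply `diff` to a stored checksum field (which holds the inverted sum)."""
    return ~csum_partial(b"", (~check & 0xFFFF) + diff) & 0xFFFF


old_payload = bytes([0, 1, 0, 2, 0, 3, 0, 4])
new_payload = bytes([0, 1, 0, 2, 0, 0x10, 0, 0x20])
old_check = ~csum_partial(old_payload) & 0xFFFF
diff = csum_diff(old_payload[4:], new_payload[4:])
new_check = csum_replace_by_diff(old_check, diff)
```

Passing an empty `frm` models insertion and an empty `to` models removal, and because the seed is threaded through, the result of one call can be fed into the next to cascade diffs.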
  13. 20 Feb, 2016 3 commits
    • bpf: introduce BPF_MAP_TYPE_STACK_TRACE · d5a3b1f6
      Alexei Starovoitov committed
      Add a new map type to store stack traces and a corresponding helper:
      bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
      @ctx: struct pt_regs*
      @map: pointer to stack_trace map
      @flags: bits 0-7 - number of stack frames to skip
              bit 8 - collect user stack instead of kernel
              bit 9 - compare stacks by hash only
              bit 10 - if two different stacks hash into the same stackid
                       discard old
              other bits - reserved
      Return: >= 0 stackid on success or negative error
      
      stackid is a 32-bit integer handle that can be further combined with
      other data (including other stackid) and used as a key into maps.
      
      Userspace will access stackmap using standard lookup/delete syscall commands to
      retrieve full stack trace for given stackid.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
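The flag layout listed in the commit above can be packed and unpacked as follows. The bit positions come from the commit message; the constant names match those used in the kernel's uapi header, while the two helper functions are hypothetical conveniences written for this sketch.

```python
BPF_F_SKIP_FIELD_MASK = 0xFF    # bits 0-7: number of stack frames to skip
BPF_F_USER_STACK = 1 << 8       # bit 8: collect user stack instead of kernel
BPF_F_FAST_STACK_CMP = 1 << 9   # bit 9: compare stacks by hash only
BPF_F_REUSE_STACKID = 1 << 10   # bit 10: on hash collision, discard the old stack

_KNOWN_BITS = (BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK
               | BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)


def make_stackid_flags(skip=0, user_stack=False, fast_cmp=False, reuse=False):
    """Pack the options into a flags word for bpf_get_stackid()."""
    if not 0 <= skip <= BPF_F_SKIP_FIELD_MASK:
        raise ValueError("skip must fit in bits 0-7")
    flags = skip
    if user_stack:
        flags |= BPF_F_USER_STACK
    if fast_cmp:
        flags |= BPF_F_FAST_STACK_CMP
    if reuse:
        flags |= BPF_F_REUSE_STACKID
    return flags


def decode_stackid_flags(flags):
    """Unpack a flags word; reserved bits must be zero."""
    if flags & ~_KNOWN_BITS:
        raise ValueError("reserved bits set")
    return {
        "skip": flags & BPF_F_SKIP_FIELD_MASK,
        "user_stack": bool(flags & BPF_F_USER_STACK),
        "fast_cmp": bool(flags & BPF_F_FAST_STACK_CMP),
        "reuse": bool(flags & BPF_F_REUSE_STACKID),
    }


flags = make_stackid_flags(skip=2, user_stack=True)
```

Rejecting unknown bits in the decoder mirrors the "other bits - reserved" line, so a caller notices early if it relies on a flag this layout does not define.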
    • net/ethtool: introduce a new ioctl for per queue setting · ac2c7ad0
      Kan Liang committed
      Introduce a new ioctl ETHTOOL_PERQUEUE for per queue parameters setting.
      The following patches will enable some SUB_COMMANDs for per queue
      setting.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: mdb: add support for more attributes and export timer · 21257156
      Nikolay Aleksandrov committed
      Currently mdb entries are exported directly as a structure inside the
      MDBA_MDB_ENTRY_INFO attribute, so we can't really extend it without
      breaking user-space. In order to export new mdb fields, I've converted
      MDBA_MDB_ENTRY_INFO into a nested attribute which starts like before
      with struct br_mdb_entry (without header, as it's cast directly in
      iproute2) and continues with MDBA_MDB_EATTR_ attributes. This way we
      keep compatibility with older users and can export new data.
      I've tested this with iproute2, both with and without support for the
      added attribute and it works fine.
      So basically we again have MDBA_MDB_ENTRY_INFO with struct br_mdb_entry
      inside but it may contain also some additional MDBA_MDB_EATTR_ attributes
      such as MDBA_MDB_EATTR_TIMER which can be parsed by user-space.
      
      So the new structure is:
      [MDBA_MDB] = {
           [MDBA_MDB_ENTRY] = {
               [MDBA_MDB_ENTRY_INFO]
               [MDBA_MDB_ENTRY_INFO] { <- Nested attribute
                   struct br_mdb_entry <- nla_put_nohdr()
                   [MDBA_MDB_ENTRY attributes] <- normal netlink attributes
               }
           }
      }
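The compatibility argument above can be sketched from the user-space side: an old reader copies only the leading struct, while a new reader skips the aligned struct and walks the attributes that follow. This is a minimal sketch under stated assumptions, not the iproute2 code. struct demo_mdb_entry is a stand-in that does not match the real struct br_mdb_entry layout, struct demo_attr_hdr only has the same shape as struct nlattr, and DEMO_EATTR_TIMER and all demo_* functions are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Netlink pads attributes to 4-byte boundaries. */
#define DEMO_ALIGN(len) (((size_t)(len) + 3) & ~(size_t)3)

struct demo_attr_hdr { uint16_t len; uint16_t type; };   /* shaped like struct nlattr */
struct demo_mdb_entry { uint32_t ifindex; uint8_t state; uint8_t pad[3]; }; /* stand-in */

#define DEMO_EATTR_TIMER 1   /* stand-in for MDBA_MDB_EATTR_TIMER */

/* Build the nested payload: headerless entry, then one timer attribute.
 * Returns the number of bytes written, or 0 if buf is too small. */
static size_t demo_build_payload(uint8_t *buf, size_t cap,
                                 const struct demo_mdb_entry *e, uint32_t timer)
{
    struct demo_attr_hdr h = { (uint16_t)(sizeof(h) + sizeof(timer)),
                               DEMO_EATTR_TIMER };
    size_t off = DEMO_ALIGN(sizeof(*e));
    size_t total = off + DEMO_ALIGN(h.len);

    if (total > cap)
        return 0;
    memset(buf, 0, total);
    memcpy(buf, e, sizeof(*e));                        /* no attribute header */
    memcpy(buf + off, &h, sizeof(h));                  /* normal attribute */
    memcpy(buf + off + sizeof(h), &timer, sizeof(timer));
    return total;
}

/* Old-style reader: only looks at the leading struct. */
static int demo_read_entry(const uint8_t *buf, size_t len,
                           struct demo_mdb_entry *out)
{
    if (len < sizeof(*out))
        return -1;
    memcpy(out, buf, sizeof(*out));
    return 0;
}

/* New-style reader: skip the struct, then walk the trailing attributes. */
static int demo_find_timer(const uint8_t *buf, size_t len, uint32_t *timer)
{
    size_t off = DEMO_ALIGN(sizeof(struct demo_mdb_entry));

    while (off + sizeof(struct demo_attr_hdr) <= len) {
        struct demo_attr_hdr h;

        memcpy(&h, buf + off, sizeof(h));
        if (h.len < sizeof(h) || off + h.len > len)
            return -1;                                 /* malformed attribute */
        if (h.type == DEMO_EATTR_TIMER && h.len >= sizeof(h) + sizeof(*timer)) {
            memcpy(timer, buf + off + sizeof(h), sizeof(*timer));
            return 0;
        }
        off += DEMO_ALIGN(h.len);
    }
    return -1;                                         /* no timer attribute */
}
```

Because the struct still sits at the start of the payload, demo_read_entry() works on both old and new payloads, and demo_find_timer() simply reports "not found" when the payload ends right after the struct.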
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 19 Feb, 2016 1 commit
    • F
      netlink: remove mmapped netlink support · d1b4c689
      Committed by Florian Westphal
      mmapped netlink has a number of unresolved issues:
      
      - TX zerocopy support had to be disabled more than a year ago via
        commit 4682a035 ("netlink: Always copy on mmap TX.")
        because the content of the mmapped area can change after netlink
        attribute validation but before message processing.
      
      - RX support was implemented mainly to speed up nfqueue dumping packet
        payload to userspace.  However, since commit ae08ce00
        ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
        with the socket-based interface too (via the skb_zerocopy helper).
      
      The other problem is that skbs attached to an mmapped netlink socket
      behave differently from normal skbs:
      
      - they don't have a shinfo area, so all functions that use skb_shinfo()
        (e.g. skb_clone) cannot be used.
      
      - reserving headroom prevents userspace from seeing the content, as
        it expects the message to start at skb->head.
        See for instance
        commit aa3a0220 ("netlink: not trim skb for mmaped socket when dump").
      
      - skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
        crash because it needs the sk to check if a tx ring is attached.
      
      This is also not obvious and leads to non-intuitive bug fixes such as 7c7bdf35
      ("netfilter: nfnetlink: use original skbuff when acking batches").
      
      mmapped netlink also didn't play nicely with the skb_zerocopy helper
      used by nfqueue and openvswitch.  Daniel Borkmann fixed this via
      commit 6bb0fef4 ("netlink, mmap: fix edge-case leakages in nf queue
      zero-copy"), but at the cost of also needing to provide the remaining
      length to the allocation function.
      
      nfqueue also has problems when used with mmapped rx netlink:
      - mmapped netlink doesn't allow use of nfqueue batch verdict messages.
        The problem is that in the mmap case, the allocation time also determines
        the order in which the frame will be seen by userspace (A
        allocating before B means that A is located in an earlier ring slot),
        but this also means that B might get a lower sequence number than A
        since seqno is decided later.  To fix this we would need to extend the
        spinlocked region to also cover the allocation and message setup, which
        isn't desirable.
      - nfqueue can now be configured to queue large (GSO) skbs to userspace.
        Queueing GSO packets is faster than having to force a software segmentation
        in the kernel, so this is a desirable option.  However, with an mmap-based
        ring one has to use 64 KB per ring slot element, else mmap has to fall back
        to the socket path (NL_MMAP_STATUS_COPY) for all large packets.
      
      To use the mmap interface, userspace not only has to probe for mmap netlink
      support, it also has to implement a recv/socket receive path in order to
      handle messages that exceed the size of an rx ring element.
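The size trade-off described above can be made concrete with a short sketch: a message that does not fit into one ring slot has to take the copy path, and sizing every slot for 64 KB packets multiplies the ring's memory footprint. This is an illustration only. The demo_* names are hypothetical, and the per-slot header length is left as a parameter because the message above does not state its size.

```c
#include <assert.h>
#include <stddef.h>

enum demo_rx_path {
    DEMO_RX_RING,   /* message fits into one mmapped ring slot */
    DEMO_RX_COPY    /* fall back to the recv()/socket path */
};

/* Hypothetical helper: decide which receive path a message would take,
 * given the slot size and the bytes reserved per slot for a header. */
static enum demo_rx_path demo_pick_rx_path(size_t msg_len, size_t frame_size,
                                           size_t hdr_len)
{
    if (frame_size <= hdr_len || msg_len > frame_size - hdr_len)
        return DEMO_RX_COPY;
    return DEMO_RX_RING;
}

/* Hypothetical helper: total memory needed for a ring of equal-sized slots. */
static size_t demo_ring_bytes(size_t frame_size, size_t frame_count)
{
    return frame_size * frame_count;
}
```

With an assumed 16 KB slot and a 32-byte header, a 1500-byte message stays on the ring while a 65000-byte GSO packet falls back to the copy path. Raising the slot size to 64 KB for an assumed 256-slot ring costs demo_ring_bytes(65536, 256) bytes, i.e. 16 MiB for a single socket.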
      
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 17 Feb, 2016 1 commit
  16. 16 Feb, 2016 2 commits
  17. 12 Feb, 2016 1 commit
  18. 11 Feb, 2016 4 commits