1. 11 11月, 2017 3 次提交
    • D
      bpf: Revert bpf_overrid_function() helper changes. · f3edacbd
      David S. Miller 提交于
      NACK'd by x86 maintainer.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f3edacbd
    • M
      net: ipv6: sysctl to specify IPv6 ND traffic class · 2210d6b2
      Maciej Żenczykowski 提交于
      Add a per-device sysctl to specify the default traffic class to use for
      kernel originated IPv6 Neighbour Discovery packets.
      
      Currently this includes:
      
        - Router Solicitation (ICMPv6 type 133)
          ndisc_send_rs() -> ndisc_send_skb() -> ip6_nd_hdr()
      
        - Neighbour Solicitation (ICMPv6 type 135)
          ndisc_send_ns() -> ndisc_send_skb() -> ip6_nd_hdr()
      
        - Neighbour Advertisement (ICMPv6 type 136)
          ndisc_send_na() -> ndisc_send_skb() -> ip6_nd_hdr()
      
        - Redirect (ICMPv6 type 137)
          ndisc_send_redirect() -> ndisc_send_skb() -> ip6_nd_hdr()
      
      and if the kernel ever gets around to generating RA's,
      it would presumably also include:
      
        - Router Advertisement (ICMPv6 type 134)
          (radvd daemon could pick up on the kernel setting and use it)
      
      Interface drivers may examine the Traffic Class value and translate
      the DiffServ Code Point into a link-layer appropriate traffic
      prioritization scheme.  An example of mapping IETF DSCP values to
      IEEE 802.11 User Priority values can be found here:
      
          https://tools.ietf.org/html/draft-ietf-tsvwg-ieee-802-11
      
      The expected primary use case is to properly prioritize ND over wifi.
      
      Testing:
        jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        0
        jzem22:~# echo -1 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        -bash: echo: write error: Invalid argument
        jzem22:~# echo 256 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        -bash: echo: write error: Invalid argument
        jzem22:~# echo 0 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        jzem22:~# echo 255 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        255
        jzem22:~# echo 34 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        34
      
        jzem22:~# echo $[0xDC] > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        jzem22:~# tcpdump -v -i eth0 icmp6 and src host jzem22.pgc and dst host fe80::1
        tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
        IP6 (class 0xdc, hlim 255, next-header ICMPv6 (58) payload length: 24)
        jzem22.pgc > fe80::1: [icmp6 sum ok] ICMP6, neighbor advertisement,
        length 24, tgt is jzem22.pgc, Flags [solicited]
      
      (based on original change written by Erik Kline, with minor changes)
      
      v2: fix 'suspicious rcu_dereference_check() usage'
          by explicitly grabbing the rcu_read_lock.
      
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NErik Kline <ek@google.com>
      Signed-off-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2210d6b2
    • J
      bpf: add a bpf_override_function helper · dd0bb688
      Josef Bacik 提交于
      Error injection is sloppy and very ad-hoc.  BPF could fill this niche
      perfectly with it's kprobe functionality.  We could make sure errors are
      only triggered in specific call chains that we care about with very
      specific situations.  Accomplish this with the bpf_override_funciton
      helper.  This will modify the probe'd callers return value to the
      specified value and set the PC to an override function that simply
      returns, bypassing the originally probed function.  This gives us a nice
      clean way to implement systematic error injection for all of our code
      paths.
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dd0bb688
  2. 08 11月, 2017 5 次提交
    • Y
      openvswitch: enable NSH support · b2d0f5d5
      Yi Yang 提交于
      v16->17
       - Fixed disputed check code: keep them in nsh_push and nsh_pop
         but also add them in __ovs_nla_copy_actions
      
      v15->v16
       - Add csum recalculation for nsh_push, nsh_pop and set_nsh
         pointed out by Pravin
       - Move nsh key into the union with ipv4 and ipv6 and add
         check for nsh key in match_validate pointed out by Pravin
       - Add nsh check in validate_set and __ovs_nla_copy_actions
      
      v14->v15
       - Check size in nsh_hdr_from_nlattr
       - Fixed four small issues pointed out By Jiri and Eric
      
      v13->v14
       - Rename skb_push_nsh to nsh_push per Dave's comment
       - Rename skb_pop_nsh to nsh_pop per Dave's comment
      
      v12->v13
       - Fix NSH header length check in set_nsh
      
      v11->v12
       - Fix missing changes old comments pointed out
       - Fix new comments for v11
      
      v10->v11
       - Fix the left three disputable comments for v9
         but not fixed in v10.
      
      v9->v10
       - Change struct ovs_key_nsh to
             struct ovs_nsh_key_base base;
             __be32 context[NSH_MD1_CONTEXT_SIZE];
       - Fix new comments for v9
      
      v8->v9
       - Fix build error reported by daily intel build
         because nsh module isn't selected by openvswitch
      
      v7->v8
       - Rework nested value and mask for OVS_KEY_ATTR_NSH
       - Change pop_nsh to adapt to nsh kernel module
       - Fix many issues per comments from Jiri Benc
      
      v6->v7
       - Remove NSH GSO patches in v6 because Jiri Benc
         reworked it as another patch series and they have
         been merged.
       - Change it to adapt to nsh kernel module added by NSH
         GSO patch series
      
      v5->v6
       - Fix the rest comments for v4.
       - Add NSH GSO support for VxLAN-gpe + NSH and
         Eth + NSH.
      
      v4->v5
       - Fix many comments by Jiri Benc and Eric Garver
         for v4.
      
      v3->v4
       - Add new NSH match field ttl
       - Update NSH header to the latest format
         which will be final format and won't change
         per its author's confirmation.
       - Fix comments for v3.
      
      v2->v3
       - Change OVS_KEY_ATTR_NSH to nested key to handle
         length-fixed attributes and length-variable
         attriubte more flexibly.
       - Remove struct ovs_action_push_nsh completely
       - Add code to handle nested attribute for SET_MASKED
       - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
         to transfer NSH header data.
       - Fix comments and coding style issues by Jiri and Eric
      
      v1->v2
       - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
       - Dynamically allocate struct ovs_action_push_nsh for
         length-variable metadata.
      
      OVS master and 2.8 branch has merged NSH userspace
      patch series, this patch is to enable NSH support
      in kernel data path in order that OVS can support
      NSH in compat mode by porting this.
      Signed-off-by: NYi Yang <yi.y.yang@intel.com>
      Acked-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NEric Garver <e@erig.me>
      Acked-by: NPravin Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2d0f5d5
    • N
      net_sch: red: Add offload ability to RED qdisc · 602f3baf
      Nogah Frankel 提交于
      Add the ability to offload RED qdisc by using ndo_setup_tc.
      There are four commands for RED offloading:
      * TC_RED_SET: handles set and change.
      * TC_RED_DESTROY: handle qdisc destroy.
      * TC_RED_STATS: update the qdiscs counters (given as reference)
      * TC_RED_XSTAT: returns red xstats.
      
      Whether RED is being offloaded is being determined every time dump action
      is being called because parent change of this qdisc could change its
      offload state but doesn't require any RED function to be called.
      Signed-off-by: NNogah Frankel <nogahf@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      602f3baf
    • T
      ila: Add a hook type for LWT routes · fddb231e
      Tom Herbert 提交于
      In LWT tunnels both an input and output route method is defined.
      If both of these are executed in the same path then double translation
      happens and the effect is not correct.
      
      This patch adds a new attribute that indicates the hook type. Two
      values are defined for route output and route output. ILA
      translation is only done for the one that is set. The default is
      to enable ILA on route output.
      Signed-off-by: NTom Herbert <tom@quantonium.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fddb231e
    • T
      ila: allow configuration of identifier type · 70d5aef4
      Tom Herbert 提交于
      Allow identifier to be explicitly configured for a mapping.
      This can either be one of the identifier types specified in the
      ILA draft or a value of ILA_ATYPE_USE_FORMAT which means the
      identifier type is inferred from the identifier type field.
      If a value other than ILA_ATYPE_USE_FORMAT is set for a
      mapping then it is assumed that the identifier type field is
      not present in an identifier.
      Signed-off-by: NTom Herbert <tom@quantonium.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70d5aef4
    • T
      ila: add checksum neutral map auto · 84287bb3
      Tom Herbert 提交于
      Add checksum neutral auto that performs checksum neutral mapping
      without using the C-bit. This is enabled by configuration of
      a mapping.
      
      The checksum neutral function has been split into
      ila_csum_do_neutral_fmt and ila_csum_do_neutral_nofmt. The former
      handles the C-bit and includes it in the adjustment value. The latter
      just sets the adjustment value on the locator diff only.
      
      Added configuration for checksum neutral map aut in ila_lwt
      and ila_xlat.
      Signed-off-by: NTom Herbert <tom@quantonium.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84287bb3
  3. 05 11月, 2017 6 次提交
    • R
      bpf, cgroup: implement eBPF-based device controller for cgroup v2 · ebc614f6
      Roman Gushchin 提交于
      Cgroup v2 lacks the device controller, provided by cgroup v1.
      This patch adds a new eBPF program type, which in combination
      of previously added ability to attach multiple eBPF programs
      to a cgroup, will provide a similar functionality, but with some
      additional flexibility.
      
      This patch introduces a BPF_PROG_TYPE_CGROUP_DEVICE program type.
      A program takes major and minor device numbers, device type
      (block/character) and access type (mknod/read/write) as parameters
      and returns an integer which defines if the operation should be
      allowed or terminated with -EPERM.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ebc614f6
    • J
      bpf: report offload info to user space · bd601b6a
      Jakub Kicinski 提交于
      Extend struct bpf_prog_info to contain information about program
      being bound to a device.  Since the netdev may get destroyed while
      program still exists we need a flag to indicate the program is
      loaded for a device, even if the device is gone.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Reviewed-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd601b6a
    • J
      bpf: offload: add infrastructure for loading programs for a specific netdev · ab3f0063
      Jakub Kicinski 提交于
      The fact that we don't know which device the program is going
      to be used on is quite limiting in current eBPF infrastructure.
      We have to reverse or limit the changes which kernel makes to
      the loaded bytecode if we want it to be offloaded to a networking
      device.  We also have to invent new APIs for debugging and
      troubleshooting support.
      
      Make it possible to load programs for a specific netdev.  This
      helps us to bring the debug information closer to the core
      eBPF infrastructure (e.g. we will be able to reuse the verifer
      log in device JIT).  It allows device JITs to perform translation
      on the original bytecode.
      
      __bpf_prog_get() when called to get a reference for an attachment
      point will now refuse to give it if program has a device assigned.
      Following patches will add a version of that function which passes
      the expected netdev in. @type argument in __bpf_prog_get() is
      renamed to attach_type to make it clearer that it's only set on
      attachment.
      
      All calls to ndo_bpf are protected by rtnl, only verifier callbacks
      are not.  We need a wait queue to make sure netdev doesn't get
      destroyed while verifier is still running and calling its driver.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Reviewed-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab3f0063
    • J
      rtnetlink: use netnsid to query interface · 79e1ad14
      Jiri Benc 提交于
      Currently, when an application gets netnsid from the kernel (for example as
      the result of RTM_GETLINK call on one end of the veth pair), it's not much
      useful. There's no reliable way to get to the netns fd from the netnsid, nor
      does any kernel API accept netnsid.
      
      Extend the RTM_GETLINK call to also accept netnsid. It will operate on the
      netns with the given netnsid in such case. Of course, the calling process
      needs to have enough capabilities in the target name space; for now, require
      CAP_NET_ADMIN. This can be relaxed in the future.
      
      To signal to the calling process that the kernel understood the new
      IFLA_IF_NETNSID attribute in the query, it will include it in the response.
      This is needed to detect older kernels, as they will just ignore
      IFLA_IF_NETNSID and query in the current name space.
      
      This patch implemetns IFLA_IF_NETNSID only for get and dump. For set
      operations, this can be extended later.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79e1ad14
    • J
      openvswitch: reliable interface indentification in port dumps · 9354d452
      Jiri Benc 提交于
      This patch allows reliable identification of netdevice interfaces connected
      to openvswitch bridges. In particular, user space queries the netdev
      interfaces belonging to the ports for statistics, up/down state, etc.
      Datapath dump needs to provide enough information for the user space to be
      able to do that.
      
      Currently, only interface names are returned. This is not sufficient, as
      openvswitch allows its ports to be in different name spaces and the
      interface name is valid only in its name space. What is needed and generally
      used in other netlink APIs, is the pair ifindex+netnsid.
      
      The solution is addition of the ifindex+netnsid pair (or only ifindex if in
      the same name space) to vport get/dump operation.
      
      On request side, ideally the ifindex+netnsid pair could be used to
      get/set/del the corresponding vport. This is not implemented by this patch
      and can be added later if needed.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9354d452
    • H
      net/dcb: Add dscp to priority selector type · ee205981
      Huy Nguyen 提交于
      IEEE specification P802.1Qcd/D2.1 defines priority selector 5.
      This APP TLV selector defines DSCP to priority map.
      This patch defines such DSCP selector.
      Signed-off-by: NHuy Nguyen <huyn@mellanox.com>
      Reviewed-by: NParav Pandit <parav@mellanox.com>
      Reviewed-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      ee205981
  4. 04 11月, 2017 1 次提交
    • I
      tools/headers: Synchronize kernel ABI headers · fb7df12d
      Ingo Molnar 提交于
      After the SPDX license tags were added a number of tooling headers got out of
      sync with their kernel variants, generating lots of build warnings.
      
      Sync them:
      
       - tools/arch/x86/include/asm/disabled-features.h,
         tools/arch/x86/include/asm/required-features.h,
         tools/include/linux/hash.h:
      
           Remove the SPDX tag where the kernel version does not have it.
      
       - tools/include/asm-generic/bitops/__fls.h,
         tools/include/asm-generic/bitops/arch_hweight.h,
         tools/include/asm-generic/bitops/const_hweight.h,
         tools/include/asm-generic/bitops/fls.h,
         tools/include/asm-generic/bitops/fls64.h,
         tools/include/uapi/asm-generic/ioctls.h,
         tools/include/uapi/asm-generic/mman-common.h,
         tools/include/uapi/sound/asound.h,
         tools/include/uapi/linux/kvm.h,
         tools/include/uapi/linux/perf_event.h,
         tools/include/uapi/linux/sched.h,
         tools/include/uapi/linux/vhost.h,
         tools/include/uapi/sound/asound.h:
      
           Add the SPDX tag of the respective kernel header.
      
       - tools/include/uapi/linux/bpf_common.h,
         tools/include/uapi/linux/fcntl.h,
         tools/include/uapi/linux/hw_breakpoint.h,
         tools/include/uapi/linux/mman.h,
         tools/include/uapi/linux/stat.h,
      
           Change the tag to the kernel header version:
      
             -/* SPDX-License-Identifier: GPL-2.0 */
             +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
      
      Also sync other header details:
      
       - include/uapi/sound/asound.h:
      
           Fix pointless end of line whitespace noise the header grew in this cycle.
      
       - tools/arch/x86/lib/memcpy_64.S:
      
           Sync the code and add tools/include/asm/export.h with dummy wrappers
           to support building the kernel side code in a tooling header environment.
      
       - tools/include/uapi/asm-generic/mman.h,
         tools/include/uapi/linux/bpf.h:
      
           Sync other details that don't impact tooling's use of the ABIs.
      Acked-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      fb7df12d
  5. 02 11月, 2017 2 次提交
    • G
      License cleanup: add SPDX license identifier to uapi header files with a license · e2be04c7
      Greg Kroah-Hartman 提交于
      Many user space API headers have licensing information, which is either
      incomplete, badly formatted or just a shorthand for referring to the
      license under which the file is supposed to be.  This makes it hard for
      compliance tools to determine the correct license.
      
      Update these files with an SPDX license identifier.  The identifier was
      chosen based on the license information in the file.
      
      GPL/LGPL licensed headers get the matching GPL/LGPL SPDX license
      identifier with the added 'WITH Linux-syscall-note' exception, which is
      the officially assigned exception identifier for the kernel syscall
      exception:
      
         NOTE! This copyright does *not* cover user programs that use kernel
         services by normal system calls - this is merely considered normal use
         of the kernel, and does *not* fall under the heading of "derived work".
      
      This exception makes it possible to include GPL headers into non GPL
      code, without confusing license compliance tools.
      
      Headers which have either explicit dual licensing or are just licensed
      under a non GPL license are updated with the corresponding SPDX
      identifier and the GPLv2 with syscall exception identifier.  The format
      is:
              ((GPL-2.0 WITH Linux-syscall-note) OR SPDX-ID-OF-OTHER-LICENSE)
      
      SPDX license identifiers are a legally binding shorthand, which can be
      used instead of the full boiler plate text.  The update does not remove
      existing license information as this has to be done on a case by case
      basis and the copyright holders might have to be consulted. This will
      happen in a separate step.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.  See the previous patch in this series for the
      methodology of how this patch was researched.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e2be04c7
    • G
      License cleanup: add SPDX license identifier to uapi header files with no license · 6f52b16c
      Greg Kroah-Hartman 提交于
      Many user space API headers are missing licensing information, which
      makes it hard for compliance tools to determine the correct license.
      
      By default are files without license information under the default
      license of the kernel, which is GPLV2.  Marking them GPLV2 would exclude
      them from being included in non GPLV2 code, which is obviously not
      intended. The user space API headers fall under the syscall exception
      which is in the kernels COPYING file:
      
         NOTE! This copyright does *not* cover user programs that use kernel
         services by normal system calls - this is merely considered normal use
         of the kernel, and does *not* fall under the heading of "derived work".
      
      otherwise syscall usage would not be possible.
      
      Update the files which contain no license information with an SPDX
      license identifier.  The chosen identifier is 'GPL-2.0 WITH
      Linux-syscall-note' which is the officially assigned identifier for the
      Linux syscall exception.  SPDX license identifiers are a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.  See the previous patch in this series for the
      methodology of how this patch was researched.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6f52b16c
  6. 01 11月, 2017 1 次提交
  7. 29 10月, 2017 5 次提交
    • M
      ipvlan: implement VEPA mode · fe89aa6b
      Mahesh Bandewar 提交于
      This is very similar to the Macvlan VEPA mode, however, there is some
      difference. IPvlan uses the mac-address of the lower device, so the VEPA
      mode has implications of ICMP-redirects for packets destined for its
      immediate neighbors sharing same master since the packets will have same
      source and dest mac. The external switch/router will send redirect msg.
      
      Having said that, this will be useful tool in terms of debugging
      since IPvlan will not switch packets within its slaves and rely completely
      on the external entity as intended in 802.1Qbg.
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe89aa6b
    • M
      ipvlan: introduce 'private' attribute for all existing modes. · a190d04d
      Mahesh Bandewar 提交于
      IPvlan has always operated in bridge mode. However there are scenarios
      where each slave should be able to talk through the master device but
      not necessarily across each other. Think of an environment where each
      of a namespace is a private and independant customer. In this scenario
      the machine which is hosting these namespaces neither want to tell who
      their neighbor is nor the individual namespaces care to talk to neighbor
      on short-circuited network path.
      
      This patch implements the mode that is very similar to the 'private' mode
      in macvlan where individual slaves can send and receive traffic through
      the master device, just that they can not talk among slave devices.
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a190d04d
    • X
      sctp: fix some type cast warnings introduced since very beginning · 978aa047
      Xin Long 提交于
      These warnings were found by running 'make C=2 M=net/sctp/'.
      They are there since very beginning.
      
      Note after this patch, there still one warning left in
      sctp_outq_flush():
        sctp_chunk_fail(chunk, SCTP_ERROR_INV_STRM)
      
      Since it has been moved to sctp_stream_outq_migrate on net-next,
      to avoid the extra job when merging net-next to net, I will post
      the fix for it after the merging is done.
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      978aa047
    • W
      ipv6: prevent user from adding cached routes · 2ea2352e
      Wei Wang 提交于
      Cached routes should only be created by the system when receiving pmtu
      discovery or ip redirect msg. Users should not be allowed to create
      cached routes.
      
      Furthermore, after the patch series to move cached routes into exception
      table, user added cached routes will trigger the following warning in
      fib6_add():
      
      WARNING: CPU: 0 PID: 2985 at net/ipv6/ip6_fib.c:1137
      fib6_add+0x20d9/0x2c10 net/ipv6/ip6_fib.c:1137
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 0 PID: 2985 Comm: syzkaller320388 Not tainted 4.14.0-rc3+ #74
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       panic+0x1e4/0x417 kernel/panic.c:181
       __warn+0x1c4/0x1d9 kernel/panic.c:542
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
       do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
       do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
      RIP: 0010:fib6_add+0x20d9/0x2c10 net/ipv6/ip6_fib.c:1137
      RSP: 0018:ffff8801cf09f6a0 EFLAGS: 00010297
      RAX: ffff8801ce45e340 RBX: 1ffff10039e13eec RCX: ffff8801d749c814
      RDX: 0000000000000000 RSI: ffff8801d749c700 RDI: ffff8801d749c780
      RBP: ffff8801cf09fa08 R08: 0000000000000000 R09: ffff8801cf09f360
      R10: ffff8801cf09f2d8 R11: 1ffff10039c8befb R12: 0000000000000001
      R13: dffffc0000000000 R14: ffff8801d749c700 R15: ffffffff860655c0
       __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1011
       ip6_route_add+0x148/0x1a0 net/ipv6/route.c:2782
       ipv6_route_ioctl+0x4d5/0x690 net/ipv6/route.c:3291
       inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:521
       sock_do_ioctl+0x65/0xb0 net/socket.c:961
       sock_ioctl+0x2c2/0x440 net/socket.c:1058
       vfs_ioctl fs/ioctl.c:45 [inline]
       do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
       SYSC_ioctl fs/ioctl.c:700 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      So we fix this by failing the attemp to add cached routes from userspace
      with returning EINVAL error.
      
      Fixes: 2b760fcf ("ipv6: hook up exception table to store dst cache")
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ea2352e
    • J
      bpf: rename sk_actions to align with bpf infrastructure · bfa64075
      John Fastabend 提交于
      Recent additions to support multiple programs in cgroups impose
      a strict requirement, "all yes is yes, any no is no". To enforce
      this the infrastructure requires the 'no' return code, SK_DROP in
      this case, to be 0.
      
      To apply these rules to SK_SKB program types the sk_actions return
      codes need to be adjusted.
      
      This fix adds SK_PASS and makes 'SK_DROP = 0'. Finally, remove
      SK_ABORTED to remove any chance that the API may allow aborted
      program flows to be passed up the stack. This would be incorrect
      behavior and allow programs to break existing policies.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bfa64075
  8. 28 10月, 2017 1 次提交
  9. 25 10月, 2017 1 次提交
    • S
      ip6_tunnel: Allow rcv/xmit even if remote address is a local address · 908d140a
      Shmulik Ladkani 提交于
      Currently, ip6_tnl_xmit_ctl drops tunneled packets if the remote
      address (outer v6 destination) is one of host's locally configured
      addresses.
      Same applies to ip6_tnl_rcv_ctl: it drops packets if the remote address
      (outer v6 source) is a local address.
      
      This prevents using ipxip6 (and ip6_gre) tunnels whose local/remote
      endpoints are on same host; OTOH v4 tunnels (ipip or gre) allow such
      configurations.
      
      An example where this proves useful is a system where entities are
      identified by their unique v6 addresses, and use tunnels to encapsulate
      traffic between them. The limitation prevents placing several entities
      on same host.
      
      Introduce IP6_TNL_F_ALLOW_LOCAL_REMOTE which allows to bypass this
      restriction.
      Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      908d140a
  10. 24 10月, 2017 1 次提交
    • C
      tcp: Configure TFO without cookie per socket and/or per route · 71c02379
      Christoph Paasch 提交于
      We already allow to enable TFO without a cookie by using the
      fastopen-sysctl and setting it to TFO_SERVER_COOKIE_NOT_REQD (or
      TFO_CLIENT_NO_COOKIE).
      This is safe to do in certain environments where we know that there
      isn't a malicous host (aka., data-centers) or when the
      application-protocol already provides an authentication mechanism in the
      first flight of data.
      
      A server however might be providing multiple services or talking to both
      sides (public Internet and data-center). So, this server would want to
      enable cookie-less TFO for certain services and/or for connections that
      go to the data-center.
      
      This patch exposes a socket-option and a per-route attribute to enable such
      fine-grained configurations.
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Reviewed-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      71c02379
  11. 23 10月, 2017 1 次提交
  12. 22 10月, 2017 2 次提交
    • L
      bpf: Adding helper function bpf_getsockops · cd86d1fd
      Lawrence Brakmo 提交于
      Adding support for helper function bpf_getsockops to socket_ops BPF
      programs. This patch only supports TCP_CONGESTION.
      Signed-off-by: NVlad Vysotsky <vlad@cs.ucla.edu>
      Acked-by: NLawrence Brakmo <brakmo@fb.com>
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd86d1fd
    • L
      bpf: add support for BPF_SOCK_OPS_BASE_RTT · e6546ef6
      Lawrence Brakmo 提交于
      A congestion control algorithm can make a call to the BPF socket_ops
      program to request the base RTT. The base RTT can be congestion control
      dependent and is meant to represent a congestion threshold such that
      RTTs above it indicate congestion. This is especially useful for flows
      within a DC where the base RTT is easy to obtain.
      
      Being provided a base RTT solves a basic problem in RTT based congestion
      avoidance algorithms (such as Vegas, NV and BBR). Although it is easy
      to get the base RTT when the network is not congested, it is very
      diffcult to do when it is very congested. Newer connections get an
      inflated value of the base RTT leading to unfariness (newer flows with a
      larger base RTT get more bandwidth). As a result, RTT based congestion
      avoidance algorithms tend to update their base RTTs to improve fairness.
      In very congested networks this can lead to base RTT inflation, reducing
      the ability of these RTT based congestion control algorithms to prevent
      congestion.
      
      Note that in my experiments with TCP-NV, the base RTT provided can be
      much larger than the actual hardware RTT. For example, experimenting
      with hosts within a rack where the hardware RTT is 16-20us, I've used
      base RTTs up to 150us. The effect of using a larger base RTT is that the
      congestion avoidance algorithm will allow more queueing. When there are
      only a few flows the main effect is larger measured RTTs and RPC
      latencies due to the increased queueing. When there are a lot of flows,
      a larger base RTT can lead to more congestion and more packet drops.
      For this case, where the hardware RTT is 20us, a base RTT of 80us
      produces good results.
      
      This patch only introduces BPF_SOCK_OPS_BASE_RTT, a later patch in this
      set adds support for using it in TCP-NV. Further study and testing is
      needed before support can be added to other delay based congestion
      avoidance algorithms.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6546ef6
  13. 20 10月, 2017 3 次提交
  14. 18 10月, 2017 1 次提交
    • J
      bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP · 6710e112
      Jesper Dangaard Brouer 提交于
      The 'cpumap' is primarily used as a backend map for XDP BPF helper
      call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
      
      This patch implement the main part of the map.  It is not connected to
      the XDP redirect system yet, and no SKB allocation are done yet.
      
      The main concern in this patch is to ensure the datapath can run
      without any locking.  This adds complexity to the setup and tear-down
      procedure, which assumptions are extra carefully documented in the
      code comments.
      
      V2:
       - make sure array isn't larger than NR_CPUS
       - make sure CPUs added is a valid possible CPU
      
      V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl>
      
      V5:
       - Restrict map allocation to root / CAP_SYS_ADMIN
       - WARN_ON_ONCE if queue is not empty on tear-down
       - Return -EPERM on memlock limit instead of -ENOMEM
       - Error code in __cpu_map_entry_alloc() also handle ptr_ring_cleanup()
       - Moved cpu_map_enqueue() to next patch
      
      V6: all notice by Daniel Borkmann
       - Fix err return code in cpu_map_alloc() introduced in V5
       - Move cpu_possible() check after max_entries boundary check
       - Forbid usage initially in check_map_func_compatibility()
      
      V7:
       - Fix alloc error path spotted by Daniel Borkmann
       - Did stress test adding+removing CPUs from the map concurrently
       - Fixed refcnt issue on cpu_map_entry, kthread started too soon
       - Make sure packets are flushed during tear-down, involved use of
         rcu_barrier() and kthread_run only exit after queue is empty
       - Fix alloc error path in __cpu_map_entry_alloc() for ptr_ring
      
      V8:
       - Nitpicking comments and gramma by Edward Cree
       - Fix missing semi-colon introduced in V7 due to rebasing
       - Move struct bpf_cpu_map_entry members cpu+map_id to tracepoint patch
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6710e112
  15. 17 10月, 2017 1 次提交
  16. 14 10月, 2017 1 次提交
    • A
      mqprio: Introduce new hardware offload mode and shaper in mqprio · 4e8b86c0
      Amritha Nambiar 提交于
      The offload types currently supported in mqprio are 0 (no offload) and
      1 (offload only TCs) by setting these values for the 'hw' option. If
      offloads are supported by setting the 'hw' option to 1, the default
      offload mode is 'dcb' where only the TC values are offloaded to the
      device. This patch introduces a new hardware offload mode called
      'channel' with 'hw' set to 1 in mqprio which makes full use of the
      mqprio options, the TCs, the queue configurations and the QoS parameters
      for the TCs. This is achieved through a new netlink attribute for the
      'mode' option which takes values such as 'dcb' (default) and 'channel'.
      The 'channel' mode also supports QoS attributes for traffic class such as
      minimum and maximum values for bandwidth rate limits.
      
      This patch enables configuring additional HW shaper attributes associated
      with a traffic class. Currently the shaper for bandwidth rate limiting is
      supported which takes options such as minimum and maximum bandwidth rates
      and are offloaded to the hardware in the 'channel' mode. The min and max
      limits for bandwidth rates are provided by the user along with the TCs
      and the queue configurations when creating the mqprio qdisc. The interface
      can be extended to support new HW shapers in future through the 'shaper'
      attribute.
      
      Introduces a new data structure 'tc_mqprio_qopt_offload' for offloading
      mqprio queue options and use this to be shared between the kernel and
      device driver. This contains a copy of the existing data structure
      for mqprio queue options. This new data structure can be extended when
      adding new attributes for traffic class such as mode, shaper, shaper
      parameters (bandwidth rate limits). The existing data structure for mqprio
      queue options will be shared between the kernel and userspace.
      
      Example:
        queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
        min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit
      
      To dump the bandwidth rates:
      
      qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
                   queues:(0:3) (4:7)
                   mode:channel
                   shaper:bw_rlimit   min_rate:1Gbit 2Gbit   max_rate:4Gbit 5Gbit
      Signed-off-by: NAmritha Nambiar <amritha.nambiar@intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      4e8b86c0
  17. 13 10月, 2017 3 次提交
    • J
      tipc: receive group membership events via member socket · ae236fb2
      Jon Maloy 提交于
      Like with any other service, group members' availability can be
      subscribed for by connecting to be topology server. However, because
      the events arrive via a different socket than the member socket, there
      is a real risk that membership events my arrive out of synch with the
      actual JOIN/LEAVE action. I.e., it is possible to receive the first
      messages from a new member before the corresponding JOIN event arrives,
      just as it is possible to receive the last messages from a leaving
      member after the LEAVE event has already been received.
      
      Since each member socket is internally also subscribing for membership
      events, we now fix this problem by passing those events on to the user
      via the member socket. We leverage the already present member synch-
      ronization protocol to guarantee correct message/event order. An event
      is delivered to the user as an empty message where the two source
      addresses identify the new/lost member. Furthermore, we set the MSG_OOB
      bit in the message flags to mark it as an event. If the event is an
      indication about a member loss we also set the MSG_EOR bit, so it can
      be distinguished from a member addition event.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae236fb2
    • J
      tipc: introduce communication groups · 75da2163
      Jon Maloy 提交于
      As a preparation for introducing flow control for multicast and datagram
      messaging we need a more strictly defined framework than we have now. A
      socket must be able keep track of exactly how many and which other
      sockets it is allowed to communicate with at any moment, and keep the
      necessary state for those.
      
      We therefore introduce a new concept we have named Communication Group.
      Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
      The call takes four parameters: 'type' serves as group identifier,
      'instance' serves as an logical member identifier, and 'scope' indicates
      the visibility of the group (node/cluster/zone). Finally, 'flags' makes
      it possible to set certain properties for the member. For now, there is
      only one flag, indicating if the creator of the socket wants to receive
      a copy of broadcast or multicast messages it is sending via the socket,
      and if wants to be eligible as destination for its own anycasts.
      
      A group is closed, i.e., sockets which have not joined a group will
      not be able to send messages to or receive messages from members of
      the group, and vice versa.
      
      Any member of a group can send multicast ('group broadcast') messages
      to all group members, optionally including itself, using the primitive
      send(). The messages are received via the recvmsg() primitive. A socket
      can only be member of one group at a time.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      75da2163
    • F
      sched: tc_mirred: Remove whitespaces · ad2d116c
      Florian Fainelli 提交于
      This file contains unnecessary whitespaces as newlines, remove them,
      found by looking at what struct tc_mirred looks like.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad2d116c
  18. 12 10月, 2017 2 次提交