1. 05 Oct 2017, 5 commits
    • net: Add extack to ndo_add_slave · 33eaf2a6
      David Ahern authored
      Pass extack to do_set_master and down to ndo_add_slave
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33eaf2a6
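      A minimal sketch of what a driver-side ndo_add_slave could look like with the extended
      signature introduced here, assuming the three-argument form and the existing
      NL_SET_ERR_MSG() helper; the foo_add_slave name and the ARPHRD_ETHER check are
      illustrative only, not taken from this patch.

      #include <linux/if_arp.h>
      #include <linux/netdevice.h>
      #include <linux/netlink.h>

      /* hypothetical driver callback, matching the extended ndo_add_slave signature */
      static int foo_add_slave(struct net_device *dev,
                               struct net_device *slave_dev,
                               struct netlink_ext_ack *extack)
      {
              if (slave_dev->type != ARPHRD_ETHER) {
                      NL_SET_ERR_MSG(extack, "Slave must be an Ethernet device");
                      return -EINVAL;
              }
              /* driver-specific enslave work would go here */
              return 0;
      }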
    • net: Add extack to netdev_notifier_info · 51d0c047
      David Ahern authored
      Add netlink_ext_ack to netdev_notifier_info to allow notifier
      handlers to return errors to userspace.
      
      Clean up the initialization in dev.c such that extack is easily
      added in subsequent patches where relevant. Specifically, remove
      the init call in call_netdevice_notifiers_info and have callers
      initialize it on the stack when info is declared.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      51d0c047
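      A hedged sketch of how a notifier handler could use the new field once callers fill it
      in. Accessing info->extack directly is an assumption about this stage of the series,
      and the foo_netdev_event handler and its MTU veto are purely illustrative.

      #include <linux/netdevice.h>
      #include <linux/netlink.h>
      #include <linux/notifier.h>

      static int foo_netdev_event(struct notifier_block *nb,
                                  unsigned long event, void *ptr)
      {
              struct netdev_notifier_info *info = ptr;
              struct net_device *dev = netdev_notifier_info_to_dev(ptr);

              /* illustrative veto that reports the reason back to userspace */
              if (event == NETDEV_PRE_UP && dev->mtu < 1280) {
                      if (info->extack)
                              NL_SET_ERR_MSG(info->extack,
                                             "MTU too small for this feature");
                      return notifier_from_errno(-EINVAL);
              }
              return NOTIFY_DONE;
      }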
    • dev: advertise the new nsid when the netns iface changes · 6621dd29
      Nicolas Dichtel authored
      x-netns interfaces are bound to two netns: the link netns and the upper
      netns. Usually, this kind of interface is created in the link netns and
      then moved to the upper netns. In the end, the interface is visible only
      in the upper netns. The link nsid is advertised via netlink in the upper
      netns, so the user always knows where the link part is.
      
      There is no such mechanism in the link netns. When the interface is moved
      to another netns, the user cannot "follow" it.
      This patch adds a new netlink attribute which helps to follow an interface
      that moves to another netns. When the interface is unregistered, the new
      nsid is advertised. If the interface is an x-netns interface (i.e.
      rtnl_link_ops->get_link_net is defined), the nsid is allocated if needed.
      
      CC: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6621dd29
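      A hedged userspace sketch of consuming the new advertisement: listen for RTM_DELLINK in
      the link netns and look for the IFLA_NEW_NETNSID attribute carrying the destination
      nsid. The fallback attribute value is an assumption for headers that predate this
      patch, and error handling is kept minimal.

      #include <stdio.h>
      #include <sys/socket.h>
      #include <linux/netlink.h>
      #include <linux/rtnetlink.h>
      #include <linux/if_link.h>

      #ifndef IFLA_NEW_NETNSID
      #define IFLA_NEW_NETNSID 45   /* assumed value for older uapi headers */
      #endif

      static void handle_dellink(struct nlmsghdr *nh)
      {
              struct ifinfomsg *ifi = NLMSG_DATA(nh);
              struct rtattr *rta = (struct rtattr *)((char *)ifi + NLMSG_ALIGN(sizeof(*ifi)));
              int alen = nh->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));

              for (; RTA_OK(rta, alen); rta = RTA_NEXT(rta, alen))
                      if (rta->rta_type == IFLA_NEW_NETNSID)
                              printf("ifindex %d moved to nsid %d\n",
                                     ifi->ifi_index, *(int *)RTA_DATA(rta));
      }

      int main(void)
      {
              struct sockaddr_nl sa = { .nl_family = AF_NETLINK, .nl_groups = RTMGRP_LINK };
              char buf[8192];
              struct nlmsghdr *nh;
              int fd, len;

              fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
              if (fd < 0 || bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
                      return 1;

              for (;;) {
                      len = recv(fd, buf, sizeof(buf), 0);
                      for (nh = (struct nlmsghdr *)buf; len > 0 && NLMSG_OK(nh, len);
                           nh = NLMSG_NEXT(nh, len))
                              if (nh->nlmsg_type == RTM_DELLINK)
                                      handle_dellink(nh);
              }
      }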
    • bpf: introduce BPF_PROG_QUERY command · 468e2f64
      Alexei Starovoitov authored
      Introduce the BPF_PROG_QUERY command to retrieve either the set of
      programs attached to a given cgroup or the set of effective programs
      that will execute for events within that cgroup.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      468e2f64
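      A hedged userspace sketch of the new command, assuming the bpf(2) query fields this
      patch adds (target_fd, attach_type, query_flags, attach_flags, prog_ids, prog_cnt) and
      the BPF_F_QUERY_EFFECTIVE flag; the attach type and buffer size are arbitrary choices.

      #include <errno.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>

      static int query_cgroup_progs(int cgroup_fd)
      {
              __u32 prog_ids[64];
              union bpf_attr attr;
              __u32 i;

              memset(&attr, 0, sizeof(attr));
              attr.query.target_fd   = cgroup_fd;
              attr.query.attach_type = BPF_CGROUP_INET_INGRESS;
              attr.query.query_flags = BPF_F_QUERY_EFFECTIVE; /* effective set, not just attached */
              attr.query.prog_ids    = (__u64)(unsigned long)prog_ids;
              attr.query.prog_cnt    = 64;

              if (syscall(__NR_bpf, BPF_PROG_QUERY, &attr, sizeof(attr)))
                      return -errno;

              /* prog_cnt is updated by the kernel to the number of ids returned */
              for (i = 0; i < attr.query.prog_cnt; i++)
                      printf("prog id %u\n", prog_ids[i]);
              return 0;
      }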
    • bpf: multi program support for cgroup+bpf · 324bda9e
      Alexei Starovoitov authored
      Introduce the BPF_F_ALLOW_MULTI flag, which can be used to attach multiple
      bpf programs to a cgroup.
      
      The difference between three possible flags for BPF_PROG_ATTACH command:
      - NONE(default): No further bpf programs allowed in the subtree.
      - BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
        the program in this cgroup yields to sub-cgroup program.
      - BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
        that cgroup program gets run in addition to the program in this cgroup.
      
      NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
      change their behavior; it only clarifies the semantics in relation
      to the new flag.
      
      Only one program is allowed to be attached to a cgroup with the
      NONE or BPF_F_ALLOW_OVERRIDE flag.
      Multiple programs are allowed to be attached to a cgroup with the
      BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
      (those that were attached first run first).
      The programs of a sub-cgroup are executed first, then the programs of
      this cgroup, and then the programs of the parent cgroup.
      All eligible programs are executed regardless of the return code of
      earlier programs.
      
      To allow efficient execution of multiple programs attached to a cgroup,
      and to avoid penalizing cgroups without any programs attached,
      introduce 'struct bpf_prog_array', which is an RCU-protected array
      of pointers to bpf programs.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      324bda9e
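      A hedged sketch of attaching more than one program to the same cgroup with the new
      flag, using the existing BPF_PROG_ATTACH fields; program loading and opening the
      cgroup fd are assumed to happen elsewhere, and the egress attach type is arbitrary.

      #include <string.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>

      static int attach_multi(int cgroup_fd, int prog_fd)
      {
              union bpf_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.target_fd     = cgroup_fd;
              attr.attach_bpf_fd = prog_fd;
              attr.attach_type   = BPF_CGROUP_INET_EGRESS;
              attr.attach_flags  = BPF_F_ALLOW_MULTI;  /* keep previously attached programs */

              return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
      }

      /* e.g. attach_multi(cgroup_fd, prog_fd1); attach_multi(cgroup_fd, prog_fd2);
       * both programs then run in FIFO attach order for egress events. */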
  2. 04 Oct 2017, 3 commits
    • net: core: decouple ifalias get/set from rtnl lock · 6c557001
      Florian Westphal authored
      The device alias can be set by either rtnetlink (rtnl is held) or sysfs;
      rtnetlink holds the rtnl mutex, while sysfs acquires it only for this purpose.
      Add an extra mutex for it and use rcu to protect concurrent accesses.
      
      This allows the sysfs path to avoid taking rtnl, and will later allow
      us to not hold it when dumping ifalias.
      
      Based on suggestion from Eric Dumazet.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c557001
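      A hedged sketch of the pattern this patch moves to: writers publish the alias through
      an RCU pointer while serializing on a dedicated mutex, so readers and the sysfs path
      never need rtnl. The struct and mutex names follow the changelog's intent, but the
      body is illustrative, not a literal copy of the patch.

      #include <linux/netdevice.h>
      #include <linux/rcupdate.h>
      #include <linux/slab.h>

      struct dev_ifalias {
              struct rcu_head rcuhead;
              char ifalias[];
      };

      static DEFINE_MUTEX(ifalias_mutex);

      static int set_alias(struct net_device *dev, const char *alias, size_t len)
      {
              struct dev_ifalias *new = kmalloc(sizeof(*new) + len + 1, GFP_KERNEL);
              struct dev_ifalias *old;

              if (!new)
                      return -ENOMEM;
              memcpy(new->ifalias, alias, len);
              new->ifalias[len] = 0;

              mutex_lock(&ifalias_mutex);             /* writers serialized here, not by rtnl */
              old = rcu_dereference_protected(dev->ifalias,
                                              mutex_is_locked(&ifalias_mutex));
              rcu_assign_pointer(dev->ifalias, new);  /* publish for lockless readers */
              mutex_unlock(&ifalias_mutex);

              if (old)
                      kfree_rcu(old, rcuhead);        /* freed once readers are done */
              return 0;
      }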
    • ipv4: ipmr: Add the parent ID field to VIF struct · 5d8b3e69
      Yotam Gigi authored
      In order to allow the ipmr module to do partial multicast forwarding
      according to the device parent ID, add the device parent ID field to the
      VIF struct. This way, the forwarding path can use the parent ID field
      without invoking switchdev calls, which require the RTNL lock.
      
      When a new VIF is added, set the device parent ID field in it by invoking
      the switchdev_port_attr_get call.
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5d8b3e69
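      A hedged sketch of populating the new field when a VIF is added, using the switchdev
      attribute API as it existed in this series; the dev_parent_id field name is taken from
      the changelog and should be treated as an assumption.

      #include <linux/mroute.h>
      #include <linux/netdevice.h>
      #include <net/switchdev.h>

      static void vif_set_parent_id(struct vif_device *vif, struct net_device *dev)
      {
              struct switchdev_attr attr = {
                      .orig_dev = dev,
                      .id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
              };

              if (!switchdev_port_attr_get(dev, &attr))
                      vif->dev_parent_id = attr.u.ppid;  /* struct netdev_phys_item_id */
              else
                      vif->dev_parent_id.id_len = 0;     /* no parent ID available */
      }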
    • skbuff: Add the offload_mr_fwd_mark field · abf4bb6b
      Yotam Gigi authored
      Similarly to the offload_fwd_mark field, the offload_mr_fwd_mark field is
      used to allow partial offloading of MFC multicast routes.
      
      Switchdev drivers can offload MFC multicast routes to the hardware by
      registering to the FIB notification chain. When one of the route output
      interfaces is not offload-able, i.e. has a different parent ID, the route
      cannot be fully offloaded by the hardware. Examples of non-offload-able
      devices are a management NIC, a dummy device, a pimreg device, etc.
      
      Similar problem exists in the bridge module, as one bridge can hold
      interfaces with different parent IDs. At the bridge, the problem is solved
      by the offload_fwd_mark skb field.
      
      Currently, when a route cannot go through full offload, the only solution
      for a switchdev driver is not to offload it at all and let the packet go
      through the slow path.
      
      Using the offload_mr_fwd_mark field, a driver can indicate that a packet
      was already forwarded by hardware to all the devices with the same parent
      ID as the input device. Further patches in this patch-set are going to
      enhance ipmr to skip multicast forwarding to devices with the same parent
      ID if a packet is marked with that field.
      
      The reason why the already existing "offload_fwd_mark" bit cannot be used
      is that a switchdev driver would want to make the distinction between a
      packet that has already gone through L2 forwarding but did not go through
      multicast forwarding, and a packet that has already gone through both L2
      and multicast forwarding.
      
      For example: when a packet is ingressing from a switchport enslaved to a
      bridge, which is configured with multicast forwarding, the following
      scenarios are possible:
       - The packet can be trapped to the CPU due to an exception during multicast
         forwarding (for example, an MTU error). In that case, it has already gone
         through L2 forwarding in the hardware, thus a switchdev driver would
         want to set the skb->offload_fwd_mark and not the
         skb->offload_mr_fwd_mark.
       - The packet can also be trapped due to a pimreg/dummy device used as one
         of the output interfaces. In that case, it can go through both L2 and
         (partial) multicast forwarding inside the hardware, thus a switchdev
         driver would want to set both the skb->offload_fwd_mark and
         skb->offload_mr_fwd_mark.
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      abf4bb6b
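      A hedged sketch of the check the later ipmr patches could perform: skip an output VIF
      whose device shares a parent ID with the already-offloaded input port. The
      dev_parent_id field name comes from the previous commit's changelog; the helper name
      and the exact placement of the check are assumptions, not the final ipmr patch.

      #include <linux/mroute.h>
      #include <linux/netdevice.h>
      #include <linux/skbuff.h>
      #include <linux/string.h>

      static bool mr_output_already_offloaded(const struct sk_buff *skb,
                                              const struct vif_device *out_vif,
                                              const struct netdev_phys_item_id *in_ppid)
      {
              if (!skb->offload_mr_fwd_mark)                  /* HW did not multicast-forward it */
                      return false;
              if (!out_vif->dev_parent_id.id_len || !in_ppid->id_len)
                      return false;                           /* no parent ID to compare */
              return out_vif->dev_parent_id.id_len == in_ppid->id_len &&
                     !memcmp(out_vif->dev_parent_id.id, in_ppid->id, in_ppid->id_len);
      }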
  3. 03 Oct 2017, 12 commits
  4. 01 Oct 2017, 1 commit
  5. 30 Sep 2017, 1 commit
    • i40e: Enable VF to negotiate number of allocated queues · a3f5aa90
      Alan Brady authored
      Currently the PF allocates a default number of queues for each VF and
      this cannot be changed.  This patch enables the VF to request a different
      number of queues allocated to it.  This patch also adds a new virtchnl
      op and capability flag to facilitate this negotiation.
      
      After the PF receives a request message, it will set a requested number
      of queues for that VF.  Then when the VF resets, its VSI will get a new
      number of queues allocated to it.
      
      This is a best-effort request, and since we only allocate a guaranteed
      default number, if the VF tries to ask for more than the guaranteed
      number, there may not be enough in HW to accommodate it unless other
      queues for other VFs are freed. It should also be noted that decreasing the
      number of queues allocated to a VF to below the default will NOT enable the
      allocation of more than 32 VFs per PF and will not free queues guaranteed
      to each VF by default.
      Signed-off-by: Alan Brady <alan.brady@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      a3f5aa90
  6. 29 Sep 2017, 3 commits
  7. 28 Sep 2017, 4 commits
  8. 27 Sep 2017, 3 commits
    • bpf: add meta pointer for direct access · de8f3a83
      Daniel Borkmann authored
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out such that we first point
      to data_hard_start, then data_meta directly prepended to data, followed
      by data_end marking the end of the packet. bpf_xdp_adjust_head() takes into
      account whether we have meta data already prepended and, if so, memmove()s
      this along with the given offset, provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each such new member). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be a multiple of 4 bytes, up to 32 bytes large. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1, as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from its
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons,
      though.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de8f3a83
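      A hedged sketch of an XDP program using the new helper: reserve 4 bytes of metadata in
      front of the packet and stamp a value that a clsact program on the same device could
      pick up later. The include path and the SEC()/license boilerplate follow common libbpf
      conventions and are not part of this patch.

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      SEC("xdp")
      int xdp_stamp_meta(struct xdp_md *ctx)
      {
              __u32 *meta;
              void *data;

              /* grow the meta area by 4 bytes (negative delta moves data_meta down) */
              if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
                      return XDP_PASS;        /* driver may not support data_meta yet */

              data = (void *)(long)ctx->data;
              meta = (void *)(long)ctx->data_meta;
              if ((void *)(meta + 1) > data)  /* verifier-required bounds check */
                      return XDP_PASS;

              *meta = 0xcafe;                 /* scratch value for later post-processing */
              return XDP_PASS;
      }

      char _license[] SEC("license") = "GPL";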
    • bpf: rename bpf_compute_data_end into bpf_compute_data_pointers · 6aaae2b6
      Daniel Borkmann authored
      Just do the rename into bpf_compute_data_pointers() as we'll add
      one more pointer here to recompute.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6aaae2b6
    • qed: iWARP - Add check for errors on a SYN packet · 1e99c497
      Michal Kalderon authored
      A SYN packet which arrives with errors from FW should be dropped.
      This required adding an additional field to the ll2
      rx completion data.
      Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e99c497
  9. 23 Sep 2017, 1 commit
  10. 22 Sep 2017, 2 commits
    • Input: uinput - avoid FF flush when destroying device · e8b95728
      Dmitry Torokhov authored
      Normally, when an input device supporting force feedback effects is being
      destroyed, we try to "flush" currently playing effects so that the
      physical device does not continue vibrating (or executing other effects).
      Unfortunately this does not work well for uinput, as flushing of the effects
      deadlocks with the destroy action:
      
      - if the device is being destroyed because the file descriptor is being closed,
        then there is no one to even service FF requests;
      
      - if the device is being destroyed because userspace sent UI_DEV_DESTROY,
        while theoretically it could be possible to service FF requests,
        userspace is unlikely to do so (they'd need to make sure FF handling
        happens on a separate thread) even if kernel solves the issue with FF
        ioctls deadlocking with UI_DEV_DESTROY ioctl on udev->mutex.
      
      To avoid lockups like the one below, let's install a custom input device
      flush handler and avoid trying to flush force feedback effects when we
      are destroying the device, instead relying on uinput to shut off the device
      properly.
      
      NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
      ...
       <<EOE>>  [<ffffffff817a0307>] _raw_spin_lock_irqsave+0x37/0x40
       [<ffffffff810e633d>] complete+0x1d/0x50
       [<ffffffffa00ba08c>] uinput_request_done+0x3c/0x40 [uinput]
       [<ffffffffa00ba587>] uinput_request_submit.part.7+0x47/0xb0 [uinput]
       [<ffffffffa00bb62b>] uinput_dev_erase_effect+0x5b/0x76 [uinput]
       [<ffffffff815d91ad>] erase_effect+0xad/0xf0
       [<ffffffff815d929d>] flush_effects+0x4d/0x90
       [<ffffffff815d4cc0>] input_flush_device+0x40/0x60
       [<ffffffff815daf1c>] evdev_cleanup+0xac/0xc0
       [<ffffffff815daf5b>] evdev_disconnect+0x2b/0x60
       [<ffffffff815d74ac>] __input_unregister_device+0xac/0x150
       [<ffffffff815d75f7>] input_unregister_device+0x47/0x70
       [<ffffffffa00bac45>] uinput_destroy_device+0xb5/0xc0 [uinput]
       [<ffffffffa00bb2de>] uinput_ioctl_handler.isra.9+0x65e/0x740 [uinput]
       [<ffffffff811231ab>] ? do_futex+0x12b/0xad0
       [<ffffffffa00bb3f8>] uinput_ioctl+0x18/0x20 [uinput]
       [<ffffffff81241248>] do_vfs_ioctl+0x298/0x480
       [<ffffffff81337553>] ? security_file_ioctl+0x43/0x60
       [<ffffffff812414a9>] SyS_ioctl+0x79/0x90
       [<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71
      Reported-by: Rodrigo Rivas Costa <rodrigorivascosta@gmail.com>
      Reported-by: Clément VUCHENER <clement.vuchener@gmail.com>
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=193741
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      e8b95728
    • net: avoid a full fib lookup when rp_filter is disabled. · 6e617de8
      Paolo Abeni authored
      Since commit 1dced6a8 ("ipv4: Restore accept_local behaviour
      in fib_validate_source()") a full fib lookup is needed even if
      the rp_filter is disabled, if accept_local is false - which is
      the default.
      
      What we really need in the above scenario is just checking
      that the source IP address is not local, and in most cases we
      can do that in a cheaper way, by looking up the ifaddr hash table.
      
      This commit adds a helper for such a lookup, and uses it to
      validate the src address when rp_filter is disabled and no
      'local' routes have been created by user space in the relevant
      namespace.
      
      A new ipv4 netns flag is added to account for such routes.
      We need that to preserve the same behavior we had before this
      patch.
      
      It also drops the checks to bail early from __fib_validate_source,
      added by commit 1dced6a8 ("ipv4: Restore accept_local
      behaviour in fib_validate_source()"), since they do not give any
      measurable performance improvement: if we do the lookup, we are
      on a slower path anyway.
      
      This improves UDP performance for unconnected sockets
      by 5% when rp_filter is disabled, and also gives a small but
      measurable performance improvement for TCP flood scenarios.
      
      v1 -> v2:
       - use the ifaddr lookup helper in __ip_dev_find(), as suggested
         by Eric
       - fall-back to full lookup if custom local routes are present
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6e617de8
  11. 21 Sep 2017, 1 commit
    • bpf: one perf event close won't free bpf program attached by another perf event · ec9dd352
      Yonghong Song authored
      This patch fixes a bug exhibited by the following scenario:
        1. fd1 = perf_event_open with attr.config = ID1
        2. attach bpf program prog1 to fd1
        3. fd2 = perf_event_open with attr.config = ID1
           <this will be successful>
        4. user program closes fd2 and prog1 is detached from the tracepoint.
        5. user program with fd1 does not work properly as the tracepoint
           no longer produces any output.
      
      The issue happens at step 4. Multiple perf_event_open calls can succeed,
      but there is only one bpf prog pointer in the tp_event. With the
      current logic, any fd release for the same tp_event will free
      the tp_event->prog.
      
      The fix is to free tp_event->prog only when the closing fd
      corresponds to the one which registered the program.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec9dd352
  12. 20 Sep 2017, 1 commit
    • net: sk_buff rbnode reorg · bffa72cf
      Eric Dumazet authored
      skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
      
      Current uses (TCP receive ofo queue and netem) need to save/restore
      tstamp, while skb->dev is either NULL (TCP) or a constant for a given
      queue (netem).
      
      Since we plan to use an RB tree for the TCP retransmit queue to speed up
      SACK processing with large BDP, this patch exchanges skb->dev and
      skb->tstamp.
      
      This saves some overhead in both TCP and netem.
      
      v2: removes the swtstamp field from struct tcp_skb_cb
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bffa72cf
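      A hedged, simplified sketch of the layout after the exchange described above: skb->dev
      joins next/prev inside the union shared with rbnode, while tstamp moves out of it so
      RB-tree users no longer need to save/restore it. Illustrative only, not a verbatim
      copy of struct sk_buff.

      #include <linux/ktime.h>
      #include <linux/rbtree.h>

      struct sk_buff_layout_sketch {
              union {
                      struct {
                              struct sk_buff    *next;  /* list pointers ... */
                              struct sk_buff    *prev;
                              struct net_device *dev;   /* ... now share space with rbnode */
                      };
                      struct rb_node rbnode;            /* used by netem and the TCP ofo queue */
              };
              /* ... */
              ktime_t tstamp;                           /* no longer clobbered while rbnode is in use */
      };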
  13. 18 Sep 2017, 1 commit
  14. 15 Sep 2017, 2 commits