1. 14 4月, 2017 1 次提交
  2. 12 4月, 2017 3 次提交
  3. 10 4月, 2017 1 次提交
  4. 08 4月, 2017 1 次提交
  5. 05 4月, 2017 2 次提交
    • V
      rtnl: Add support for netdev event to link messages · def12888
      Vlad Yasevich 提交于
      When netdev events happen, a rtnetlink_event() handler will send
      messages for every event in it's white list.  These messages contain
      current information about a particular device, but they do not include
      the iformation about which event just happened.  The consumer of
      the message has to try to infer this information.  In some cases
      (ex: NETDEV_NOTIFY_PEERS), that is not possible.
      
      This patch adds a new extension to RTM_NEWLINK message called IFLA_EVENT
      that would have an encoding of the which event triggered this
      message.  This would allow the the message consumer to easily determine
      if it is interested in a particular event or not.
      Signed-off-by: NVladislav Yasevich <vyasevic@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      def12888
    • V
      rtnetlink: Convert rtnetlink_event to white list · 5138e86f
      Vlad Yasevich 提交于
      The rtnetlink_event currently functions as a blacklist where
      we block cerntain netdev events from being sent to user space.
      As a result, events have been added to the system that userspace
      probably doesn't care about.
      
      This patch converts the implementation to the white list so that
      newly events would have to be specifically added to the list to
      be sent to userspace.  This would force new event implementers to
      consider whether a given event is usefull to user space or if it's
      just a kernel event.
      Signed-off-by: NVladislav Yasevich <vyasevic@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5138e86f
  6. 04 4月, 2017 5 次提交
  7. 02 4月, 2017 1 次提交
    • A
      bpf: introduce BPF_PROG_TEST_RUN command · 1cf1cae9
      Alexei Starovoitov 提交于
      development and testing of networking bpf programs is quite cumbersome.
      Despite availability of user space bpf interpreters the kernel is
      the ultimate authority and execution environment.
      Current test frameworks for TC include creation of netns, veth,
      qdiscs and use of various packet generators just to test functionality
      of a bpf program. XDP testing is even more complicated, since
      qemu needs to be started with gro/gso disabled and precise queue
      configuration, transferring of xdp program from host into guest,
      attaching to virtio/eth0 and generating traffic from the host
      while capturing the results from the guest.
      
      Moreover analyzing performance bottlenecks in XDP program is
      impossible in virtio environment, since cost of running the program
      is tiny comparing to the overhead of virtio packet processing,
      so performance testing can only be done on physical nic
      with another server generating traffic.
      
      Furthermore ongoing changes to user space control plane of production
      applications cannot be run on the test servers leaving bpf programs
      stubbed out for testing.
      
      Last but not least, the upstream llvm changes are validated by the bpf
      backend testsuite which has no ability to test the code generated.
      
      To improve this situation introduce BPF_PROG_TEST_RUN command
      to test and performance benchmark bpf programs.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1cf1cae9
  8. 31 3月, 2017 1 次提交
  9. 29 3月, 2017 2 次提交
    • A
      net: break include loop netdevice.h, dsa.h, devlink.h · c6e970a0
      Andrew Lunn 提交于
      There is an include loop between netdevice.h, dsa.h, devlink.h because
      of NETDEV_ALIGN, making it impossible to use devlink structures in
      dsa.h.
      
      Break this loop by taking dsa.h out of netdevice.h, add a forward
      declaration of dsa_switch_tree and netdev_set_default_ethtool_ops()
      function, which is what netdevice.h requires.
      
      No longer having dsa.h in netdevice.h means the includes in dsa.h no
      longer get included. This breaks a few other files which depend on
      these includes. Add these directly in the affected file.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c6e970a0
    • A
      devlink: Support for pipeline debug (dpipe) · 1555d204
      Arkadi Sharshevsky 提交于
      The pipeline debug is used to export the pipeline abstractions for the
      main objects - tables, headers and entries. The only support for set is
      for changing the counter parameter on specific table.
      
      The basic structures:
      
      Header - can represent a real protocol header information or internal
               metadata. Generic protocol headers like IPv4 can be shared
               between drivers. Each driver can add local headers.
      
      Field - part of a header. Can represent protocol field or specific ASIC
              metadata field. Hardware special metadata fields can be mapped
              to different resources, for example switch ASIC ports can have
              internal number which from the systems point of view is mapped
              to netdeivce ifindex.
      
      Match - represent specific match rule. Can describe match on specific
              field or header. The header index should be specified as well
              in order to support several header instances of the same type
              (tunneling).
      
      Action - represents specific action rule. Actions can describe operations
               on specific field values for example like set, increment, etc.
               And header operation like add and delete.
      
      Value - represents value which can be associated with specific match or
              action.
      
      Table - represents a hardware block which can be described with match/
              action behavior. The match/action can be done on the packets
              data or on the internal metadata that it gathered along the
              packets traversal throw the pipeline which is vendor specific
              and should be exported in order to provide understanding of
              ASICs behavior.
      
      Entry - represents single record in a specific table. The entry is
              identified by specific combination of values for match/action.
      
      Prior to accessing the tables/entries the drivers provide the header/
      field data base which is used by driver to user-space. The data base
      is split between the shared headers and unique headers.
      Signed-off-by: NArkadi Sharshevsky <arkadis@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1555d204
  10. 25 3月, 2017 8 次提交
  11. 24 3月, 2017 3 次提交
  12. 23 3月, 2017 6 次提交
    • D
      socket, bpf: fix sk_filter use after free in sk_clone_lock · a97e50cc
      Daniel Borkmann 提交于
      In sk_clone_lock(), we create a new socket and inherit most of the
      parent's members via sock_copy() which memcpy()'s various sections.
      Now, in case the parent socket had a BPF socket filter attached,
      then newsk->sk_filter points to the same instance as the original
      sk->sk_filter.
      
      sk_filter_charge() is then called on the newsk->sk_filter to take a
      reference and should that fail due to hitting max optmem, we bail
      out and release the newsk instance.
      
      The issue is that commit 278571ba ("net: filter: simplify socket
      charging") wrongly combined the dismantle path with the failure path
      of xfrm_sk_clone_policy(). This means, even when charging failed, we
      call sk_free_unlock_clone() on the newsk, which then still points to
      the same sk_filter as the original sk.
      
      Thus, sk_free_unlock_clone() calls into __sk_destruct() eventually
      where it tests for present sk_filter and calls sk_filter_uncharge()
      on it, which potentially lets sk_omem_alloc wrap around and releases
      the eBPF prog and sk_filter structure from the (still intact) parent.
      
      Fix it by making sure that when sk_filter_charge() failed, we reset
      newsk->sk_filter back to NULL before passing to sk_free_unlock_clone(),
      so that we don't mess with the parents sk_filter.
      
      Only if xfrm_sk_clone_policy() fails, we did reach the point where
      either the parent's filter was NULL and as a result newsk's as well
      or where we previously had a successful sk_filter_charge(), thus for
      that case, we do need sk_filter_uncharge() to release the prior taken
      reference on sk_filter.
      
      Fixes: 278571ba ("net: filter: simplify socket charging")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a97e50cc
    • D
      rtnetlink: Add dump all for netconf · a7678c70
      David Ahern 提交于
      Use the rtnl_dump_all to dump all netconf handlers that have been
      registered. Allows userspace to send a dump request for PF_UNSPEC
      and get all families.
      
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a7678c70
    • R
      net: convert sk_filter.refcnt from atomic_t to refcount_t · 4c355cdf
      Reshetova, Elena 提交于
      refcount_t type and corresponding API should be
      used instead of atomic_t when the variable is used as
      a reference counter. This allows to avoid accidental
      refcounter overflows that might lead to use-after-free
      situations.
      Signed-off-by: NElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: NHans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NDavid Windsor <dwindsor@gmail.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c355cdf
    • J
      sock: introduce SO_MEMINFO getsockopt · a2d133b1
      Josh Hunt 提交于
      Allows reading of SK_MEMINFO_VARS via socket option. This way an
      application can get all meminfo related information in single socket
      option call instead of multiple calls.
      
      Adds helper function, sk_get_meminfo(), and uses that for both
      getsockopt and sock_diag_put_meminfo().
      
      Suggested by Eric Dumazet.
      Signed-off-by: NJosh Hunt <johunt@akamai.com>
      Reviewed-by: NJason Baron <jbaron@akamai.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a2d133b1
    • R
      neighbour: fix nlmsg_pid in notifications · 7b8f7a40
      Roopa Prabhu 提交于
      neigh notifications today carry pid 0 for nlmsg_pid
      in all cases. This patch fixes it to carry calling process
      pid when available. Applications (eg. quagga) rely on
      nlmsg_pid to ignore notifications generated by their own
      netlink operations. This patch follows the routing subsystem
      which already sets this correctly.
      Reported-by: NVivek Venkatraman <vivek@cumulusnetworks.com>
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7b8f7a40
    • T
      cgroup, net_cls: iterate the fds of only the tasks which are being migrated · a05d4fd9
      Tejun Heo 提交于
      The net_cls controller controls the classid field of each socket which
      is associated with the cgroup.  Because the classid is per-socket
      attribute, when a task migrates to another cgroup or the configured
      classid of the cgroup changes, the controller needs to walk all
      sockets and update the classid value, which was implemented by
      3b13758f ("cgroups: Allow dynamically changing net_classid").
      
      While the approach is not scalable, migrating tasks which have a lot
      of fds attached to them is rare and the cost is born by the ones
      initiating the operations.  However, for simplicity, both the
      migration and classid config change paths call update_classid() which
      scans all fds of all tasks in the target css.  This is an overkill for
      the migration path which only needs to cover a much smaller subset of
      tasks which are actually getting migrated in.
      
      On cgroup v1, this can lead to unexpected scalability issues when one
      tries to migrate a task or process into a net_cls cgroup which already
      contains a lot of fds.  Even if the migration traget doesn't have many
      to get scanned, update_classid() ends up scanning all fds in the
      target cgroup which can be extremely numerous.
      
      Unfortunately, on cgroup v2 which doesn't use net_cls, the problem is
      even worse.  Before bfc2cf6f ("cgroup: call subsys->*attach() only
      for subsystems which are actually affected by migration"), cgroup core
      would call the ->css_attach callback even for controllers which don't
      see actual migration to a different css.
      
      As net_cls is always disabled but still mounted on cgroup v2, whenever
      a process is migrated on the cgroup v2 hierarchy, net_cls sees
      identity migration from root to root and cgroup core used to call
      ->css_attach callback for those.  The net_cls ->css_attach ends up
      calling update_classid() on the root net_cls css to which all
      processes on the system belong to as the controller isn't used.  This
      makes any cgroup v2 migration O(total_number_of_fds_on_the_system)
      which is horrible and easily leads to noticeable stalls triggering RCU
      stall warnings and so on.
      
      The worst symptom is already fixed in upstream by bfc2cf6f
      ("cgroup: call subsys->*attach() only for subsystems which are
      actually affected by migration"); however, backporting that commit is
      too invasive and we want to avoid other cases too.
      
      This patch updates net_cls's cgrp_attach() to iterate fds of only the
      processes which are actually getting migrated.  This removes the
      surprising migration cost which is dependent on the total number of
      fds in the target cgroup.  As this leaves write_classid() the only
      user of update_classid(), open-code the helper into write_classid().
      Reported-by: NDavid Goode <dgoode@fb.com>
      Fixes: 3b13758f ("cgroups: Allow dynamically changing net_classid")
      Cc: stable@vger.kernel.org # v4.4+
      Cc: Nina Schiff <ninasc@fb.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a05d4fd9
  13. 22 3月, 2017 2 次提交
  14. 17 3月, 2017 1 次提交
    • I
      ipv4: fib_rules: Check if rule is a default rule · 3c71006d
      Ido Schimmel 提交于
      Currently, when non-default (custom) FIB rules are used, devices capable
      of layer 3 offloading flush their tables and let the kernel do the
      forwarding instead.
      
      When these devices' drivers are loaded they register to the FIB
      notification chain, which lets them know about the existence of any
      custom FIB rules. This is done by sending a RULE_ADD notification based
      on the value of 'net->ipv4.fib_has_custom_rules'.
      
      This approach is problematic when VRF offload is taken into account, as
      upon the creation of the first VRF netdev, a l3mdev rule is programmed
      to direct skbs to the VRF's table.
      
      Instead of merely reading the above value and sending a single RULE_ADD
      notification, we should iterate over all the FIB rules and send a
      detailed notification for each, thereby allowing offloading drivers to
      sanitize the rules they don't support and potentially flush their
      tables.
      
      While l3mdev rules are uniquely marked, the default rules are not.
      Therefore, when they are being notified they might invoke offloading
      drivers to unnecessarily flush their tables.
      
      Solve this by adding an helper to check if a FIB rule is a default rule.
      Namely, its selector should match all packets and its action should
      point to the local, main or default tables.
      
      As noted by David Ahern, uniquely marking the default rules is
      insufficient. When using VRFs, it's common to avoid false hits by moving
      the rule for the local table to just before the main table:
      
      Default configuration:
      $ ip rule show
      0:      from all lookup local
      32766:  from all lookup main
      32767:  from all lookup default
      
      Common configuration with VRFs:
      $ ip rule show
      1000:   from all lookup [l3mdev-table]
      32765:  from all lookup local
      32766:  from all lookup main
      32767:  from all lookup default
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c71006d
  15. 16 3月, 2017 1 次提交
  16. 15 3月, 2017 1 次提交
  17. 14 3月, 2017 1 次提交
新手
引导
客服 返回
顶部