1. 16 1月, 2020 4 次提交
    • Y
      bpf: Add batch ops to all htab bpf map · 05799638
      Yonghong Song 提交于
      htab can't use generic batch support due some problematic behaviours
      inherent to the data structre, i.e. while iterating the bpf map  a
      concurrent program might delete the next entry that batch was about to
      use, in that case there's no easy solution to retrieve the next entry,
      the issue has been discussed multiple times (see [1] and [2]).
      
      The only way hmap can be traversed without the problem previously
      exposed is by making sure that the map is traversing entire buckets.
      This commit implements those strict requirements for hmap, the
      implementation follows the same interaction that generic support with
      some exceptions:
      
       - If keys/values buffer are not big enough to traverse a bucket,
         ENOSPC will be returned.
       - out_batch contains the value of the next bucket in the iteration, not
         the next key, but this is transparent for the user since the user
         should never use out_batch for other than bpf batch syscalls.
      
      This commits implements BPF_MAP_LOOKUP_BATCH and adds support for new
      command BPF_MAP_LOOKUP_AND_DELETE_BATCH. Note that for update/delete
      batch ops it is possible to use the generic implementations.
      
      [1] https://lore.kernel.org/bpf/20190724165803.87470-1-brianvv@google.com/
      [2] https://lore.kernel.org/bpf/20190906225434.3635421-1-yhs@fb.com/Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NBrian Vazquez <brianvv@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115184308.162644-6-brianvv@google.com
      05799638
    • B
      bpf: Add generic support for update and delete batch ops · aa2e93b8
      Brian Vazquez 提交于
      This commit adds generic support for update and delete batch ops that
      can be used for almost all the bpf maps. These commands share the same
      UAPI attr that lookup and lookup_and_delete batch ops use and the
      syscall commands are:
      
        BPF_MAP_UPDATE_BATCH
        BPF_MAP_DELETE_BATCH
      
      The main difference between update/delete and lookup batch ops is that
      for update/delete keys/values must be specified for userspace and
      because of that, neither in_batch nor out_batch are used.
      Suggested-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NBrian Vazquez <brianvv@google.com>
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115184308.162644-4-brianvv@google.com
      aa2e93b8
    • B
      bpf: Add generic support for lookup batch op · cb4d03ab
      Brian Vazquez 提交于
      This commit introduces generic support for the bpf_map_lookup_batch.
      This implementation can be used by almost all the bpf maps since its core
      implementation is relying on the existing map_get_next_key and
      map_lookup_elem. The bpf syscall subcommand introduced is:
      
        BPF_MAP_LOOKUP_BATCH
      
      The UAPI attribute is:
      
        struct { /* struct used by BPF_MAP_*_BATCH commands */
               __aligned_u64   in_batch;       /* start batch,
                                                * NULL to start from beginning
                                                */
               __aligned_u64   out_batch;      /* output: next start batch */
               __aligned_u64   keys;
               __aligned_u64   values;
               __u32           count;          /* input/output:
                                                * input: # of key/value
                                                * elements
                                                * output: # of filled elements
                                                */
               __u32           map_fd;
               __u64           elem_flags;
               __u64           flags;
        } batch;
      
      in_batch/out_batch are opaque values use to communicate between
      user/kernel space, in_batch/out_batch must be of key_size length.
      
      To start iterating from the beginning in_batch must be null,
      count is the # of key/value elements to retrieve. Note that the 'keys'
      buffer must be a buffer of key_size * count size and the 'values' buffer
      must be value_size * count, where value_size must be aligned to 8 bytes
      by userspace if it's dealing with percpu maps. 'count' will contain the
      number of keys/values successfully retrieved. Note that 'count' is an
      input/output variable and it can contain a lower value after a call.
      
      If there's no more entries to retrieve, ENOENT will be returned. If error
      is ENOENT, count might be > 0 in case it copied some values but there were
      no more entries to retrieve.
      
      Note that if the return code is an error and not -EFAULT,
      count indicates the number of elements successfully processed.
      Suggested-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NBrian Vazquez <brianvv@google.com>
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115184308.162644-3-brianvv@google.com
      cb4d03ab
    • Y
      bpf: Add bpf_send_signal_thread() helper · 8482941f
      Yonghong Song 提交于
      Commit 8b401f9e ("bpf: implement bpf_send_signal() helper")
      added helper bpf_send_signal() which permits bpf program to
      send a signal to the current process. The signal may be
      delivered to any threads in the process.
      
      We found a use case where sending the signal to the current
      thread is more preferable.
        - A bpf program will collect the stack trace and then
          send signal to the user application.
        - The user application will add some thread specific
          information to the just collected stack trace for
          later analysis.
      
      If bpf_send_signal() is used, user application will need
      to check whether the thread receiving the signal matches
      the thread collecting the stack by checking thread id.
      If not, it will need to send signal to another thread
      through pthread_kill().
      
      This patch proposed a new helper bpf_send_signal_thread(),
      which sends the signal to the thread corresponding to
      the current kernel task. This way, user space is guaranteed that
      bpf_program execution context and user space signal handling
      context are the same thread.
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115035002.602336-1-yhs@fb.com
      8482941f
  2. 11 1月, 2020 1 次提交
    • A
      bpf: Introduce function-by-function verification · 51c39bb1
      Alexei Starovoitov 提交于
      New llvm and old llvm with libbpf help produce BTF that distinguish global and
      static functions. Unlike arguments of static function the arguments of global
      functions cannot be removed or optimized away by llvm. The compiler has to use
      exactly the arguments specified in a function prototype. The argument type
      information allows the verifier validate each global function independently.
      For now only supported argument types are pointer to context and scalars. In
      the future pointers to structures, sizes, pointer to packet data can be
      supported as well. Consider the following example:
      
      static int f1(int ...)
      {
        ...
      }
      
      int f3(int b);
      
      int f2(int a)
      {
        f1(a) + f3(a);
      }
      
      int f3(int b)
      {
        ...
      }
      
      int main(...)
      {
        f1(...) + f2(...) + f3(...);
      }
      
      The verifier will start its safety checks from the first global function f2().
      It will recursively descend into f1() because it's static. Then it will check
      that arguments match for the f3() invocation inside f2(). It will not descend
      into f3(). It will finish f2() that has to be successfully verified for all
      possible values of 'a'. Then it will proceed with f3(). That function also has
      to be safe for all possible values of 'b'. Then it will start subprog 0 (which
      is main() function). It will recursively descend into f1() and will skip full
      check of f2() and f3(), since they are global. The order of processing global
      functions doesn't affect safety, since all global functions must be proven safe
      based on their arguments only.
      
      Such function by function verification can drastically improve speed of the
      verification and reduce complexity.
      
      Note that the stack limit of 512 still applies to the call chain regardless whether
      functions were static or global. The nested level of 8 also still applies. The
      same recursion prevention checks are in place as well.
      
      The type information and static/global kind is preserved after the verification
      hence in the above example global function f2() and f3() can be replaced later
      by equivalent functions with the same types that are loaded and verified later
      without affecting safety of this main() program. Such replacement (re-linking)
      of global functions is a subject of future patches.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200110064124.1760511-3-ast@kernel.org
      51c39bb1
  3. 10 1月, 2020 5 次提交
    • A
      bpf: Document BPF_F_QUERY_EFFECTIVE flag · f5bfcd95
      Andrey Ignatov 提交于
      Document BPF_F_QUERY_EFFECTIVE flag, mostly to clarify how it affects
      attach_flags what may not be obvious and what may lead to confision.
      
      Specifically attach_flags is returned only for target_fd but if programs
      are inherited from an ancestor cgroup then returned attach_flags for
      current cgroup may be confusing. For example, two effective programs of
      same attach_type can be returned but w/o BPF_F_ALLOW_MULTI in
      attach_flags.
      
      Simple repro:
        # bpftool c s /sys/fs/cgroup/path/to/task
        ID       AttachType      AttachFlags     Name
        # bpftool c s /sys/fs/cgroup/path/to/task effective
        ID       AttachType      AttachFlags     Name
        95043    ingress                         tw_ipt_ingress
        95048    ingress                         tw_ingress
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200108014006.938363-1-rdna@fb.com
      f5bfcd95
    • M
      bpf: Add BPF_FUNC_tcp_send_ack helper · 206057fe
      Martin KaFai Lau 提交于
      Add a helper to send out a tcp-ack.  It will be used in the later
      bpf_dctcp implementation that requires to send out an ack
      when the CE state changed.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109004551.3900448-1-kafai@fb.com
      206057fe
    • M
      bpf: tcp: Support tcp_congestion_ops in bpf · 0baf26b0
      Martin KaFai Lau 提交于
      This patch makes "struct tcp_congestion_ops" to be the first user
      of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
      in bpf.
      
      The BPF implemented tcp_congestion_ops can be used like
      regular kernel tcp-cc through sysctl and setsockopt.  e.g.
      [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
      net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
      net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
      net.ipv4.tcp_congestion_control = bpf_cubic
      
      There has been attempt to move the TCP CC to the user space
      (e.g. CCP in TCP).   The common arguments are faster turn around,
      get away from long-tail kernel versions in production...etc,
      which are legit points.
      
      BPF has been the continuous effort to join both kernel and
      userspace upsides together (e.g. XDP to gain the performance
      advantage without bypassing the kernel).  The recent BPF
      advancements (in particular BTF-aware verifier, BPF trampoline,
      BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
      possible in BPF.  It allows a faster turnaround for testing algorithm
      in the production while leveraging the existing (and continue growing)
      BPF feature/framework instead of building one specifically for
      userspace TCP CC.
      
      This patch allows write access to a few fields in tcp-sock
      (in bpf_tcp_ca_btf_struct_access()).
      
      The optional "get_info" is unsupported now.  It can be added
      later.  One possible way is to output the info with a btf-id
      to describe the content.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com
      0baf26b0
    • M
      bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS · 85d33df3
      Martin KaFai Lau 提交于
      The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
      is a kernel struct with its func ptr implemented in bpf prog.
      This new map is the interface to register/unregister/introspect
      a bpf implemented kernel struct.
      
      The kernel struct is actually embedded inside another new struct
      (or called the "value" struct in the code).  For example,
      "struct tcp_congestion_ops" is embbeded in:
      struct bpf_struct_ops_tcp_congestion_ops {
      	refcount_t refcnt;
      	enum bpf_struct_ops_state state;
      	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
      }
      The map value is "struct bpf_struct_ops_tcp_congestion_ops".
      The "bpftool map dump" will then be able to show the
      state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
      number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
      is created automatically by a macro.  Having a separate "value" struct
      will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
      "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
      initialization works before registering the struct_ops to the kernel
      subsystem).  The libbpf will take care of finding and populating the
      "struct bpf_struct_ops_XYZ" from "struct XYZ".
      
      Register a struct_ops to a kernel subsystem:
      1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
      2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
         set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
         running kernel.
         Instead of reusing the attr->btf_value_type_id,
         btf_vmlinux_value_type_id s added such that attr->btf_fd can still be
         used as the "user" btf which could store other useful sysadmin/debug
         info that may be introduced in the furture,
         e.g. creation-date/compiler-details/map-creator...etc.
      3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
         in the running kernel btf.  Populate the value of this object.
         The function ptr should be populated with the prog fds.
      4. Call BPF_MAP_UPDATE with the object created in (3) as
         the map value.  The key is always "0".
      
      During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
      args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
      the specific struct_ops to do some final checks in "st_ops->init_member()"
      (e.g. ensure all mandatory func ptrs are implemented).
      If everything looks good, it will register this kernel struct
      to the kernel subsystem.  The map will not allow further update
      from this point.
      
      Unregister a struct_ops from the kernel subsystem:
      BPF_MAP_DELETE with key "0".
      
      Introspect a struct_ops:
      BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
      have the prog _id_ populated as the func ptr.
      
      The map value state (enum bpf_struct_ops_state) will transit from:
      INIT (map created) =>
      INUSE (map updated, i.e. reg) =>
      TOBEFREE (map value deleted, i.e. unreg)
      
      The kernel subsystem needs to call bpf_struct_ops_get() and
      bpf_struct_ops_put() to manage the "refcnt" in the
      "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
      for the purose of tracking the subsystem usage.  Another approach
      is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
      the subsystem's usage by doing map->refcnt - map->usercnt to filter out
      the map-fd/pinned-map usage.  However, that will also tie down the
      future semantics of map->refcnt and map->usercnt.
      
      The very first subsystem's refcnt (during reg()) holds one
      count to map->refcnt.  When the very last subsystem's refcnt
      is gone, it will also release the map->refcnt.  All bpf_prog will be
      freed when the map->refcnt reaches 0 (i.e. during map_free()).
      
      Here is how the bpftool map command will look like:
      [root@arch-fb-vm1 bpf]# bpftool map show
      6: struct_ops  name dctcp  flags 0x0
      	key 4B  value 256B  max_entries 1  memlock 4096B
      	btf_id 6
      [root@arch-fb-vm1 bpf]# bpftool map dump id 6
      [{
              "value": {
                  "refcnt": {
                      "refs": {
                          "counter": 1
                      }
                  },
                  "state": 1,
                  "data": {
                      "list": {
                          "next": 0,
                          "prev": 0
                      },
                      "key": 0,
                      "flags": 2,
                      "init": 24,
                      "release": 0,
                      "ssthresh": 25,
                      "cong_avoid": 30,
                      "set_state": 27,
                      "cwnd_event": 28,
                      "in_ack_event": 26,
                      "undo_cwnd": 29,
                      "pkts_acked": 0,
                      "min_tso_segs": 0,
                      "sndbuf_expand": 0,
                      "cong_control": 0,
                      "get_info": 0,
                      "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                      ],
                      "owner": 0
                  }
              }
          }
      ]
      
      Misc Notes:
      * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
        It does an inplace update on "*value" instead returning a pointer
        to syscall.c.  Otherwise, it needs a separate copy of "zero" value
        for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
      
      * The bpf_struct_ops_map_delete_elem() is also called without
        preempt_disable() from map_delete_elem().  It is because
        the "->unreg()" may requires sleepable context, e.g.
        the "tcp_unregister_congestion_control()".
      
      * "const" is added to some of the existing "struct btf_func_model *"
        function arg to avoid a compiler warning caused by this patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com
      85d33df3
    • M
      bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS · 27ae7997
      Martin KaFai Lau 提交于
      This patch allows the kernel's struct ops (i.e. func ptr) to be
      implemented in BPF.  The first use case in this series is the
      "struct tcp_congestion_ops" which will be introduced in a
      latter patch.
      
      This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
      The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
      func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
      of a kernel struct.  The attr->expected_attach_type is the member
      "index" of that kernel struct.  The first member of a struct starts
      with member index 0.  That will avoid ambiguity when a kernel struct
      has multiple func ptrs with the same func signature.
      
      For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
      to implement the "init" func ptr of the "struct tcp_congestion_ops".
      The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
      of the _running_ kernel.  The attr->expected_attach_type is 3.
      
      The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
      by arch_prepare_bpf_trampoline that will be done in the next
      patch when introducing BPF_MAP_TYPE_STRUCT_OPS.
      
      "struct bpf_struct_ops" is introduced as a common interface for the kernel
      struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
      struct will need to implement an instance of the "struct bpf_struct_ops".
      
      The supporting kernel struct also needs to implement a bpf_verifier_ops.
      During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
      bpf_verifier_ops by searching the attr->attach_btf_id.
      
      A new "btf_struct_access" is also added to the bpf_verifier_ops such
      that the supporting kernel struct can optionally provide its own specific
      check on accessing the func arg (e.g. provide limited write access).
      
      After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
      to initialize some values (e.g. the btf id of the supporting kernel
      struct) and it can only be done once the btf_vmlinux is available.
      
      The R0 checks at BPF_EXIT is excluded for the BPF_PROG_TYPE_STRUCT_OPS prog
      if the return type of the prog->aux->attach_func_proto is "void".
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003503.3855825-1-kafai@fb.com
      27ae7997
  4. 06 1月, 2020 10 次提交
    • V
      net: mscc: ocelot: export ANA, DEV and QSYS registers to include/soc/mscc · 964ee5c8
      Vladimir Oltean 提交于
      Since the Felix DSA driver is implementing its own PHYLINK instance due
      to SoC differences, it needs access to the few registers that are
      common, mainly for flow control.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      964ee5c8
    • V
      net: mscc: ocelot: make phy_mode a member of the common struct ocelot_port · ee50d07c
      Vladimir Oltean 提交于
      The Ocelot switchdev driver and the Felix DSA one need it for different
      reasons. Felix (or at least the VSC9959 instantiation in NXP LS1028A) is
      integrated with the traditional NXP Layerscape PCS design which does not
      support runtime configuration of SerDes protocol. So it needs to
      pre-validate the phy-mode from the device tree and prevent PHYLINK from
      attempting to change it. For this, it needs to cache it in a private
      variable.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee50d07c
    • C
      enetc: Make MDIO accessors more generic and export to include/linux/fsl · 6517798d
      Claudiu Manoil 提交于
      Within the LS1028A SoC, the register map for the ENETC MDIO controller
      is instantiated a few times: for the central (external) MDIO controller,
      for the internal bus of each standalone ENETC port, and for the internal
      bus of the Felix switch.
      
      Refactoring is needed to support multiple MDIO buses from multiple
      drivers. The enetc_hw structure is made an opaque type and a smaller
      enetc_mdio_priv is created.
      
      'mdio_base' - MDIO registers base address - is being parameterized, to
      be able to work with different MDIO register bases.
      
      The ENETC MDIO bus operations are exported from the fsl-enetc-mdio
      kernel object, the same that registers the central MDIO controller (the
      dedicated PF). The ENETC main driver has been changed to select it, and
      use its exported helpers to further register its private MDIO bus. The
      DSA Felix driver will do the same.
      Signed-off-by: NClaudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6517798d
    • V
      net: dsa: Pass pcs_poll flag from driver to PHYLINK · 787cac3f
      Vladimir Oltean 提交于
      The DSA drivers that implement .phylink_mac_link_state should normally
      register an interrupt for the PCS, from which they should call
      phylink_mac_change(). However not all switches implement this, and those
      who don't should set this flag in dsa_switch in the .setup callback, so
      that PHYLINK will poll for a few ms until the in-band AN link timer
      expires and the PCS state settles.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      787cac3f
    • V
      net: phylink: add support for polling MAC PCS · 1511ed0a
      Vladimir Oltean 提交于
      Some MAC PCS blocks are unable to provide interrupts when their status
      changes. As we already have support in phylink for polling status, use
      this to provide a hook for MACs to enable polling mode.
      
      The patch idea was picked up from Russell King's suggestion on the macb
      phylink patch thread here [0] but the implementation was changed.
      Instead of introducing a new phylink_start_poll() function, which would
      make the implementation cumbersome for common PHYLINK implementations
      for multiple types of devices, like DSA, just add a boolean property to
      the phylink_config structure, which is just as backwards-compatible.
      
      https://lkml.org/lkml/2019/12/16/603Suggested-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1511ed0a
    • V
      mii: Add helpers for parsing SGMII auto-negotiation · 6c930994
      Vladimir Oltean 提交于
      Typically a MAC PCS auto-configures itself after it receives the
      negotiated copper-side link settings from the PHY, but some MAC devices
      are more special and need manual interpretation of the SGMII AN result.
      
      In other cases, the PCS exposes the entire tx_config_reg base page as it
      is transmitted on the wire during auto-negotiation, so it makes sense to
      be able to decode the equivalent lp_advertised bit mask from the raw u16
      (of course, "lp" considering the PCS to be the local PHY).
      
      Therefore, add the bit definitions for the SGMII registers 4 and 5
      (local device ability, link partner ability), as well as a link_mode
      conversion helper that can be used to feed the AN results into
      phy_resolve_aneg_linkmode.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6c930994
    • V
      net: dsa: Make deferred_xmit private to sja1105 · a68578c2
      Vladimir Oltean 提交于
      There are 3 things that are wrong with the DSA deferred xmit mechanism:
      
      1. Its introduction has made the DSA hotpath ever so slightly more
         inefficient for everybody, since DSA_SKB_CB(skb)->deferred_xmit needs
         to be initialized to false for every transmitted frame, in order to
         figure out whether the driver requested deferral or not (a very rare
         occasion, rare even for the only driver that does use this mechanism:
         sja1105). That was necessary to avoid kfree_skb from freeing the skb.
      
      2. Because L2 PTP is a link-local protocol like STP, it requires
         management routes and deferred xmit with this switch. But as opposed
         to STP, the deferred work mechanism needs to schedule the packet
         rather quickly for the TX timstamp to be collected in time and sent
         to user space. But there is no provision for controlling the
         scheduling priority of this deferred xmit workqueue. Too bad this is
         a rather specific requirement for a feature that nobody else uses
         (more below).
      
      3. Perhaps most importantly, it makes the DSA core adhere a bit too
         much to the NXP company-wide policy "Innovate Where It Doesn't
         Matter". The sja1105 is probably the only DSA switch that requires
         some frames sent from the CPU to be routed to the slave port via an
         out-of-band configuration (register write) rather than in-band (DSA
         tag). And there are indeed very good reasons to not want to do that:
         if that out-of-band register is at the other end of a slow bus such
         as SPI, then you limit that Ethernet flow's throughput to effectively
         the throughput of the SPI bus. So hardware vendors should definitely
         not be encouraged to design this way. We do _not_ want more
         widespread use of this mechanism.
      
      Luckily we have a solution for each of the 3 issues:
      
      For 1, we can just remove that variable in the skb->cb and counteract
      the effect of kfree_skb with skb_get, much to the same effect. The
      advantage, of course, being that anybody who doesn't use deferred xmit
      doesn't need to do any extra operation in the hotpath.
      
      For 2, we can create a kernel thread for each port's deferred xmit work.
      If the user switch ports are named swp0, swp1, swp2, the kernel threads
      will be named swp0_xmit, swp1_xmit, swp2_xmit (there appears to be a 15
      character length limit on kernel thread names). With this, the user can
      change the scheduling priority with chrt $(pidof swp2_xmit).
      
      For 3, we can actually move the entire implementation to the sja1105
      driver.
      
      So this patch deletes the generic implementation from the DSA core and
      adds a new one, more adequate to the requirements of PTP TX
      timestamping, in sja1105_main.c.
      Suggested-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a68578c2
    • V
      net: dsa: sja1105: Always send through management routes in slot 0 · 0a51826c
      Vladimir Oltean 提交于
      I finally found out how the 4 management route slots are supposed to
      be used, but.. it's not worth it.
      
      The description from the comment I've just deleted in this commit is
      still true: when more than 1 management slot is active at the same time,
      the switch will match frames incoming [from the CPU port] on the lowest
      numbered management slot that matches the frame's DMAC.
      
      My issue was that one was not supposed to statically assign each port a
      slot. Yes, there are 4 slots and also 4 non-CPU ports, but that is a
      mere coincidence.
      
      Instead, the switch can be used like this: every management frame gets a
      slot at the right of the most recently assigned slot:
      
      Send mgmt frame 1 through S0:    S0 x  x  x
      Send mgmt frame 2 through S1:    S0 S1 x  x
      Send mgmt frame 3 through S2:    S0 S1 S2 x
      Send mgmt frame 4 through S3:    S0 S1 S2 S3
      
      The difference compared to the old usage is that the transmission of
      frames 1-4 doesn't need to wait until the completion of the management
      route. It is safe to use a slot to the right of the most recently used
      one, because by protocol nobody will program a slot to your left and
      "steal" your route towards the correct egress port.
      
      So there is a potential throughput benefit here.
      
      But mgmt frame 5 has no more free slot to use, so it has to wait until
      _all_ of S0, S1, S2, S3 are full, in order to use S0 again.
      
      And that's actually exactly the problem: I was looking for something
      that would bring more predictable transmission latency, but this is
      exactly the opposite: 3 out of 4 frames would be transmitted quicker,
      but the 4th would draw the short straw and have a worse worst-case
      latency than before.
      
      Useless.
      
      Things are made even worse by PTP TX timestamping, which is something I
      won't go deeply into here. Suffice to say that the fact there is a
      driver-level lock on the SPI bus offsets any potential throughput gains
      that parallelism might bring.
      
      So there's no going back to the multi-slot scheme, remove the
      "mgmt_slot" variable from sja1105_port and the dummy static assignment
      made at probe time.
      
      While passing by, also remove the assignment to casc_port altogether.
      Don't pretend that we support cascaded setups.
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a51826c
    • R
      net: phy: add PHY_INTERFACE_MODE_10GBASER · c114574e
      Russell King 提交于
      Recent discussion has revealed that the use of PHY_INTERFACE_MODE_10GKR
      is incorrect. Add a 10GBASE-R definition, document both the -R and -KR
      versions, and the fact that 10GKR was used incorrectly.
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c114574e
    • K
      net: ethernet: sxgbe: Rename Samsung to lowercase · 14a65084
      Krzysztof Kozlowski 提交于
      Fix up inconsistent usage of upper and lowercase letters in "Samsung"
      name.
      
      "SAMSUNG" is not an abbreviation but a regular trademarked name.
      Therefore it should be written with lowercase letters starting with
      capital letter.
      
      Although advertisement materials usually use uppercase "SAMSUNG", the
      lowercase version is used in all legal aspects (e.g. on Wikipedia and in
      privacy/legal statements on
      https://www.samsung.com/semiconductor/privacy-global/).
      Signed-off-by: NKrzysztof Kozlowski <krzk@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      14a65084
  5. 04 1月, 2020 1 次提交
  6. 03 1月, 2020 2 次提交
    • D
      net: Add device index to tcp_md5sig · 6b102db5
      David Ahern 提交于
      Add support for userspace to specify a device index to limit the scope
      of an entry via the TCP_MD5SIG_EXT setsockopt. The existing __tcpm_pad
      is renamed to tcpm_ifindex and the new field is only checked if the new
      TCP_MD5SIG_FLAG_IFINDEX is set in tcpm_flags. For now, the device index
      must point to an L3 master device (e.g., VRF). The API and error
      handling are setup to allow the constraint to be relaxed in the future
      to any device index.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b102db5
    • D
      tcp: Add l3index to tcp_md5sig_key and md5 functions · dea53bb8
      David Ahern 提交于
      Add l3index to tcp_md5sig_key to represent the L3 domain of a key, and
      add l3index to tcp_md5_do_add and tcp_md5_do_del to fill in the key.
      
      With the key now based on an l3index, add the new parameter to the
      lookup functions and consider the l3index when looking for a match.
      
      The l3index comes from the skb when processing ingress packets leveraging
      the helpers created for socket lookups, tcp_v4_sdif and inet_iif (and the
      v6 variants). When the sdif index is set it means the packet ingressed a
      device that is part of an L3 domain and inet_iif points to the VRF device.
      For egress, the L3 domain is determined from the socket binding and
      sk_bound_dev_if.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dea53bb8
  7. 31 12月, 2019 4 次提交
    • D
      net/sched: add delete_empty() to filters and use it in cls_flower · a5b72a08
      Davide Caratti 提交于
      Revert "net/sched: cls_u32: fix refcount leak in the error path of
      u32_change()", and fix the u32 refcount leak in a more generic way that
      preserves the semantic of rule dumping.
      On tc filters that don't support lockless insertion/removal, there is no
      need to guard against concurrent insertion when a removal is in progress.
      Therefore, for most of them we can avoid a full walk() when deleting, and
      just decrease the refcount, like it was done on older Linux kernels.
      This fixes situations where walk() was wrongly detecting a non-empty
      filter, like it happened with cls_u32 in the error path of change(), thus
      leading to failures in the following tdc selftests:
      
       6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
       6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
       74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id
      
      On cls_flower, and on (future) lockless filters, this check is necessary:
      move all the check_empty() logic in a callback so that each filter
      can have its own implementation. For cls_flower, it's sufficient to check
      if no IDRs have been allocated.
      
      This reverts commit 275c44aa.
      
      Changes since v1:
       - document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
         is used, thanks to Vlad Buslov
       - implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
       - squash revert and new fix in a single patch, to be nice with bisect
         tests that run tdc on u32 filter, thanks to Dave Miller
      
      Fixes: 275c44aa ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
      Fixes: 6676d5e4 ("net: sched: set dedicated tcf_walker flag when tp is empty")
      Suggested-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Suggested-by: NVlad Buslov <vladbu@mellanox.com>
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Reviewed-by: NVlad Buslov <vladbu@mellanox.com>
      Tested-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a5b72a08
    • V
      net: dsa: sja1105: Use PTP core's dedicated kernel thread for RX timestamping · 1e762bd2
      Vladimir Oltean 提交于
      And move the queue of skb's waiting for RX timestamps into the ptp_data
      structure, since it isn't needed if PTP is not compiled.
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e762bd2
    • V
      ptp: introduce ptp_cancel_worker_sync · 544fed47
      Vladimir Oltean 提交于
      In order to effectively use the PTP kernel thread for tasks such as
      timestamping packets, allow the user control over stopping it, which is
      needed e.g. when the timestamping queues must be drained.
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      544fed47
    • V
      ptp: fix the race between the release of ptp_clock and cdev · a33121e5
      Vladis Dronov 提交于
      In a case when a ptp chardev (like /dev/ptp0) is open but an underlying
      device is removed, closing this file leads to a race. This reproduces
      easily in a kvm virtual machine:
      
      ts# cat openptp0.c
      int main() { ... fp = fopen("/dev/ptp0", "r"); ... sleep(10); }
      ts# uname -r
      5.5.0-rc3-46cf053e
      ts# cat /proc/cmdline
      ... slub_debug=FZP
      ts# modprobe ptp_kvm
      ts# ./openptp0 &
      [1] 670
      opened /dev/ptp0, sleeping 10s...
      ts# rmmod ptp_kvm
      ts# ls /dev/ptp*
      ls: cannot access '/dev/ptp*': No such file or directory
      ts# ...woken up
      [   48.010809] general protection fault: 0000 [#1] SMP
      [   48.012502] CPU: 6 PID: 658 Comm: openptp0 Not tainted 5.5.0-rc3-46cf053e #25
      [   48.014624] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
      [   48.016270] RIP: 0010:module_put.part.0+0x7/0x80
      [   48.017939] RSP: 0018:ffffb3850073be00 EFLAGS: 00010202
      [   48.018339] RAX: 000000006b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffff89a476c00ad0
      [   48.018936] RDX: fffff65a08d3ea08 RSI: 0000000000000247 RDI: 6b6b6b6b6b6b6b6b
      [   48.019470] ...                                              ^^^ a slub poison
      [   48.023854] Call Trace:
      [   48.024050]  __fput+0x21f/0x240
      [   48.024288]  task_work_run+0x79/0x90
      [   48.024555]  do_exit+0x2af/0xab0
      [   48.024799]  ? vfs_write+0x16a/0x190
      [   48.025082]  do_group_exit+0x35/0x90
      [   48.025387]  __x64_sys_exit_group+0xf/0x10
      [   48.025737]  do_syscall_64+0x3d/0x130
      [   48.026056]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   48.026479] RIP: 0033:0x7f53b12082f6
      [   48.026792] ...
      [   48.030945] Modules linked in: ptp i6300esb watchdog [last unloaded: ptp_kvm]
      [   48.045001] Fixing recursive fault but reboot is needed!
      
      This happens in:
      
      static void __fput(struct file *file)
      {   ...
          if (file->f_op->release)
              file->f_op->release(inode, file); <<< cdev is kfree'd here
          if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
                   !(mode & FMODE_PATH))) {
              cdev_put(inode->i_cdev); <<< cdev fields are accessed here
      
      Namely:
      
      __fput()
        posix_clock_release()
          kref_put(&clk->kref, delete_clock) <<< the last reference
            delete_clock()
              delete_ptp_clock()
                kfree(ptp) <<< cdev is embedded in ptp
        cdev_put
          module_put(p->owner) <<< *p is kfree'd, bang!
      
      Here cdev is embedded in posix_clock which is embedded in ptp_clock.
      The race happens because ptp_clock's lifetime is controlled by two
      refcounts: kref and cdev.kobj in posix_clock. This is wrong.
      
      Make ptp_clock's sysfs device a parent of cdev with cdev_device_add()
      created especially for such cases. This way the parent device with its
      ptp_clock is not released until all references to the cdev are released.
      This adds a requirement that an initialized but not exposed struct
      device should be provided to posix_clock_register() by a caller instead
      of a simple dev_t.
      
      This approach was adopted from the commit 72139dfa ("watchdog: Fix
      the race between the release of watchdog_core_data and cdev"). See
      details of the implementation in the commit 233ed09d ("chardev: add
      helper function to register char devs with a struct device").
      
      Link: https://lore.kernel.org/linux-fsdevel/20191125125342.6189-1-vdronov@redhat.com/T/#uAnalyzed-by: NStephen Johnston <sjohnsto@redhat.com>
      Analyzed-by: NVern Lovejoy <vlovejoy@redhat.com>
      Signed-off-by: NVladis Dronov <vdronov@redhat.com>
      Acked-by: NRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a33121e5
  8. 28 12月, 2019 12 次提交
    • M
      ethtool: provide link state with LINKSTATE_GET request · 3d2b847f
      Michal Kubecek 提交于
      Implement LINKSTATE_GET netlink request to get link state information.
      
      At the moment, only link up flag as provided by ETHTOOL_GLINK ioctl command
      is returned.
      
      LINKSTATE_GET request can be used with NLM_F_DUMP (without device
      identification) to request the information for all devices in current
      network namespace providing the data.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3d2b847f
    • M
      ethtool: add LINKMODES_NTF notification · 1b1b1847
      Michal Kubecek 提交于
      Send ETHTOOL_MSG_LINKMODES_NTF notification message whenever device link
      settings or advertised modes are modified using ETHTOOL_MSG_LINKMODES_SET
      netlink message or ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands.
      
      The notification message has the same format as reply to LINKMODES_GET
      request. ETHTOOL_MSG_LINKMODES_SET netlink request only triggers the
      notification if there is a change but the ioctl command handlers do not
      check if there is an actual change and trigger the notification whenever
      the commands are executed.
      
      As all work is done by ethnl_default_notify() handler and callback
      functions introduced to handle LINKMODES_GET requests, all that remains is
      adding entries for ETHTOOL_MSG_LINKMODES_NTF into ethnl_notify_handlers and
      ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where
      needed.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b1b1847
    • M
      ethtool: set link modes related data with LINKMODES_SET request · bfbcfe20
      Michal Kubecek 提交于
      Implement LINKMODES_SET netlink request to set advertised linkmodes and
      related attributes as ETHTOOL_SLINKSETTINGS and ETHTOOL_SSET commands do.
      
      The request allows setting autonegotiation flag, speed, duplex and
      advertised link modes.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bfbcfe20
    • M
      ethtool: provide link mode information with LINKMODES_GET request · f625aa9b
      Michal Kubecek 提交于
      Implement LINKMODES_GET netlink request to get link modes related
      information provided by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl
      commands.
      
      This request provides supported, advertised and peer advertised link modes,
      autonegotiation flag, speed and duplex.
      
      LINKMODES_GET request can be used with NLM_F_DUMP (without device
      identification) to request the information for all devices in current
      network namespace providing the data.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f625aa9b
    • M
      ethtool: add LINKINFO_NTF notification · 73286734
      Michal Kubecek 提交于
      Send ETHTOOL_MSG_LINKINFO_NTF notification message whenever device link
      settings are modified using ETHTOOL_MSG_LINKINFO_SET netlink message or
      ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands.
      
      The notification message has the same format as reply to LINKINFO_GET
      request. ETHTOOL_MSG_LINKINFO_SET netlink request only triggers the
      notification if there is a change but the ioctl command handlers do not
      check if there is an actual change and trigger the notification whenever
      the commands are executed.
      
      As all work is done by ethnl_default_notify() handler and callback
      functions introduced to handle LINKINFO_GET requests, all that remains is
      adding entries for ETHTOOL_MSG_LINKINFO_NTF into ethnl_notify_handlers and
      ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where
      needed.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73286734
    • M
      ethtool: set link settings with LINKINFO_SET request · a53f3d41
      Michal Kubecek 提交于
      Implement LINKINFO_SET netlink request to set link settings queried by
      LINKINFO_GET message.
      
      Only physical port, phy MDIO address and MDI(-X) control can be set,
      attempt to modify MDI(-X) status and transceiver is rejected.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a53f3d41
    • M
      ethtool: provide link settings with LINKINFO_GET request · 459e0b81
      Michal Kubecek 提交于
      Implement LINKINFO_GET netlink request to get basic link settings provided
      by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl commands.
      
      This request provides settings not directly related to autonegotiation and
      link mode selection: physical port, phy MDIO address, MDI(-X) status,
      MDI(-X) control and transceiver.
      
      LINKINFO_GET request can be used with NLM_F_DUMP (without device
      identification) to request the information for all devices in current
      network namespace providing the data.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      459e0b81
    • M
      ethtool: provide string sets with STRSET_GET request · 71921690
      Michal Kubecek 提交于
      Requests a contents of one or more string sets, i.e. indexed arrays of
      strings; this information is provided by ETHTOOL_GSSET_INFO and
      ETHTOOL_GSTRINGS commands of ioctl interface. Unlike ioctl interface, all
      information can be retrieved with one request and mulitple string sets can
      be requested at once.
      
      There are three types of requests:
      
        - no NLM_F_DUMP, no device: get "global" stringsets
        - no NLM_F_DUMP, with device: get string sets related to the device
        - NLM_F_DUMP, no device: get device related string sets for all devices
      
      Client can request either all string sets of given type (global or device
      related) or only specific sets. With ETHTOOL_A_STRSET_COUNTS flag set, only
      set sizes (numbers of strings) are returned.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      71921690
    • M
      ethtool: support for netlink notifications · 6b08d6c1
      Michal Kubecek 提交于
      Add infrastructure for ethtool netlink notifications. There is only one
      multicast group "monitor" which is used to notify userspace about changes
      and actions performed. Notification messages (types using suffix _NTF)
      share the format with replies to GET requests.
      
      Notifications are supposed to be broadcasted on every configuration change,
      whether it is done using the netlink interface or ioctl one. Netlink SET
      requests only trigger a notification if some data is actually changed.
      
      To trigger an ethtool notification, both ethtool netlink and external code
      use ethtool_notify() helper. This helper requires RTNL to be held and may
      sleep. Handlers sending messages for specific notification message types
      are registered in ethnl_notify_handlers array. As notifications can be
      triggered from other code, ethnl_ok flag is used to prevent an attempt to
      send notification before genetlink family is registered.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b08d6c1
    • M
      ethtool: netlink bitset handling · 10b518d4
      Michal Kubecek 提交于
      The ethtool netlink code uses common framework for passing arbitrary
      length bit sets to allow future extensions. A bitset can be a list (only
      one bitmap) or can consist of value and mask pair (used e.g. when client
      want to modify only some bits). A bitset can use one of two formats:
      verbose (bit by bit) or compact.
      
      Verbose format consists of bitset size (number of bits), list flag and
      an array of bit nests, telling which bits are part of the list or which
      bits are in the mask and which of them are to be set. In requests, bits
      can be identified by index (position) or by name. In replies, kernel
      provides both index and name. Verbose format is suitable for "one shot"
      applications like standard ethtool command as it avoids the need to
      either keep bit names (e.g. link modes) in sync with kernel or having to
      add an extra roundtrip for string set request (e.g. for private flags).
      
      Compact format uses one (list) or two (value/mask) arrays of 32-bit
      words to store the bitmap(s). It is more suitable for long running
      applications (ethtool in monitor mode or network management daemons)
      which can retrieve the names once and then pass only compact bitmaps to
      save space.
      
      Userspace requests can use either format; ETHTOOL_FLAG_COMPACT_BITSETS
      flag in request header tells kernel which format to use in reply.
      Notifications always use compact format.
      
      As some code uses arrays of unsigned long for internal representation and
      some arrays of u32 (or even a single u32), two sets of parse/compose
      helpers are introduced. To avoid code duplication, helpers for unsigned
      long arrays are implemented as wrappers around helpers for u32 arrays.
      There are two reasons for this choice: (1) u32 arrays are more frequent in
      ethtool code and (2) unsigned long array can be always interpreted as an
      u32 array on little endian 64-bit and all 32-bit architectures while we
      would need special handling for odd number of u32 words in the opposite
      direction.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10b518d4
    • M
      ethtool: helper functions for netlink interface · 041b1c5d
      Michal Kubecek 提交于
      Add common request/reply header definition and helpers to parse request
      header and fill reply header. Provide ethnl_update_* helpers to update
      structure members from request attributes (to be used for *_SET requests).
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      041b1c5d
    • M
      ethtool: introduce ethtool netlink interface · 2b4a8990
      Michal Kubecek 提交于
      Basic genetlink and init infrastructure for the netlink interface, register
      genetlink family "ethtool". Add CONFIG_ETHTOOL_NETLINK Kconfig option to
      make the build optional. Add initial overall interface description into
      Documentation/networking/ethtool-netlink.rst, further patches will add more
      detailed information.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b4a8990
  9. 27 12月, 2019 1 次提交