1. 07 Aug 2020 (1 commit)
    • bpf: Change uapi for bpf iterator map elements · 5e7b3020
      Authored by Yonghong Song
      Commit a5cbe05a ("bpf: Implement bpf iterator for
      map elements") added bpf iterator support for
      map elements. The map element bpf iterator requires
      info to identify a particular map. In the above
      commit, attr->link_create.target_fd is used
      to carry the map_fd, and an enum bpf_iter_link_info
      is added to the uapi to specify that the target_fd
      actually represents a map_fd:
          enum bpf_iter_link_info {
      	BPF_ITER_LINK_UNSPEC = 0,
      	BPF_ITER_LINK_MAP_FD = 1,
      
      	MAX_BPF_ITER_LINK_INFO,
          };
      
      This approach is extensible: the enumerator can
      grow to cover pid, cgroup_id, etc., and target_fd
      can become a union carrying a pid, cgroup_id, etc.
      But more complex customization may be needed in the
      future; for tasks, for example, iteration could be
      filtered by both cgroup_id and user_id.
      
      This patch changes the uapi to add the fields
      	__aligned_u64	iter_info;
      	__u32		iter_info_len;
      to link_create for additional iterator info.
      The iter_info is defined as
      	union bpf_iter_link_info {
      		struct {
      			__u32   map_fd;
      		} map;
      	};
      
      So future extensions for additional customization
      will be easier. The bpf_iter_link_info is passed to
      the target callback for validation, so the generic
      bpf_iter framework no longer needs to deal with it.
      
      Note that map_fd = 0 will be considered invalid
      and -EBADF will be returned to user space.
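      
      For illustration, a minimal user-space sketch of the new
      interface (not part of this patch; assumes prog_fd is an
      already-loaded iterator program and map_fd an existing map,
      and uses the raw bpf(2) syscall with <linux/bpf.h>,
      <sys/syscall.h> and <unistd.h>):
      
      	static int create_map_iter_link(int prog_fd, int map_fd)
      	{
      		union bpf_iter_link_info linfo = {};
      		union bpf_attr attr = {};
      
      		/* map_fd == 0 is rejected with -EBADF */
      		linfo.map.map_fd = map_fd;
      
      		attr.link_create.prog_fd = prog_fd;
      		attr.link_create.attach_type = BPF_TRACE_ITER;
      		attr.link_create.iter_info = (__u64)(unsigned long)&linfo;
      		attr.link_create.iter_info_len = sizeof(linfo);
      
      		return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
      	}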
      
      Fixes: a5cbe05a ("bpf: Implement bpf iterator for map elements")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200805055056.1457463-1-yhs@fb.com
      5e7b3020
  2. 04 Aug 2020 (3 commits)
  3. 02 Aug 2020 (1 commit)
  4. 01 Aug 2020 (2 commits)
    • rtnetlink: add support for protodown reason · 829eb208
      Authored by Roopa Prabhu
      netdev protodown is a mechanism that allows protocols to
      hold an interface down. It was initially introduced in
      the kernel so that a multihoming protocol could hold
      links down. There was also an attempt to introduce a
      protodown reason at the time, but it was rejected.
      protodown and a protodown reason are supported by almost
      every switching and routing platform. It was OK for a
      while to live without a protodown reason, but it has
      become more critical now that more than one protocol may
      need to keep a link down on a system at the same time,
      e.g. a vrrp peer node, port security, or a multihoming
      protocol. It is common for network operators and protocol
      developers to look for such a reason on a networking box
      (it is also known as errDisable by most network operators).
      
      This patch adds support for link protodown reason
      attribute. There are two ways to maintain protodown
      reasons.
      (a) enumerate every possible reason code in kernel
          - A protocol developer has to make a request and
            have that appear in a certain kernel version
      (b) provide the bits in the kernel, and allow user-space
      (sysadmin or NOS distributions) to manage the bit-to-reasonname
      map.
      	- This makes extending reason codes easier (kind of like
            the iproute2 table to vrf-name map /etc/iproute2/rt_tables.d/)
      
      This patch takes approach (b).
      
      A few notes about the patch:
      - It treats the protodown reason bits as a counter of
      active protodown users
      - Since the protodown attribute is already exposed UAPI,
      the reason is not enforced on a protodown set; it is a
      no-op if not used.
      The patch follows this algorithm:
        - presence of set reason bits indicates protodown
          is in use
        - the user can set protodown and a protodown reason in
          a single or in multiple setlink operations
        - a setlink operation to clear protodown will return
          -EBUSY if there are active protodown reason bits
        - the reason is not included in link dumps if not used
      
      example with patched iproute2:
      $cat /etc/iproute2/protodown_reasons.d/r.conf
      0 mlag
      1 evpn
      2 vrrp
      3 psecurity
      
      $ip link set dev vxlan0 protodown on protodown_reason vrrp on
      $ip link set dev vxlan0 protodown_reason mlag on
      $ip link show
      14: vxlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
      DEFAULT group default qlen 1000
          link/ether f6:06:be:17:91:e7 brd ff:ff:ff:ff:ff:ff protodown on <mlag,vrrp>
      
      $ip link set dev vxlan0 protodown_reason mlag off
      $ip link set dev vxlan0 protodown off protodown_reason vrrp off
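      
      Below the CLI, the same operation maps onto an RTM_SETLINK
      request; a rough sketch of the attribute layout (illustrative
      only; it assumes the nested attributes IFLA_PROTO_DOWN_REASON,
      IFLA_PROTO_DOWN_REASON_MASK and IFLA_PROTO_DOWN_REASON_VALUE
      introduced by this patch, and iproute2-style addattr helpers):
      
      	struct rtattr *nest;
      	__u32 mask  = 1 << 2;	/* bit 2 == "vrrp" in the example map above */
      	__u32 value = 1 << 2;	/* set the bit; value 0 under the mask clears it */
      
      	addattr8(&req.n, sizeof(req), IFLA_PROTO_DOWN, 1);
      	nest = addattr_nest(&req.n, sizeof(req), IFLA_PROTO_DOWN_REASON);
      	addattr32(&req.n, sizeof(req), IFLA_PROTO_DOWN_REASON_MASK, mask);
      	addattr32(&req.n, sizeof(req), IFLA_PROTO_DOWN_REASON_VALUE, value);
      	addattr_nest_end(&req.n, nest);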
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      829eb208
    • tcp: add earliest departure time to SCM_TIMESTAMPING_OPT_STATS · 48040793
      Authored by Yousuk Seung
      This change adds TCP_NLA_EDT to SCM_TIMESTAMPING_OPT_STATS, reporting
      the earliest departure time (EDT) of the timestamped skb. By tracking
      EDT values of the skb across different timestamps, we can observe when
      and by how much the value changed. This allows measuring the precise
      delay injected on the sender host, e.g. by a BPF-based throttler.
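      
      A hedged sketch of how a receiver of the stats cmsg could pull the
      new attribute out (illustrative only; TCP_NLA_EDT is the attribute
      added here, assumed to carry a 64-bit nanosecond value, and the
      walking code is plain netlink-attribute parsing with NLA_HDRLEN /
      NLA_ALIGN from <linux/netlink.h> plus TCP_NLA_* from <linux/tcp.h>):
      
      	/* inside the recvmsg() cmsg loop */
      	if (cmsg->cmsg_level == SOL_SOCKET &&
      	    cmsg->cmsg_type == SCM_TIMESTAMPING_OPT_STATS) {
      		struct nlattr *nla = (struct nlattr *)CMSG_DATA(cmsg);
      		int len = cmsg->cmsg_len - CMSG_LEN(0);
      
      		while (len >= NLA_HDRLEN && nla->nla_len >= NLA_HDRLEN &&
      		       nla->nla_len <= len) {
      			if (nla->nla_type == TCP_NLA_EDT) {
      				__u64 edt_ns;
      
      				memcpy(&edt_ns, (char *)nla + NLA_HDRLEN,
      				       sizeof(edt_ns));
      				printf("EDT: %llu ns\n",
      				       (unsigned long long)edt_ns);
      			}
      			len -= NLA_ALIGN(nla->nla_len);
      			nla = (struct nlattr *)((char *)nla +
      						NLA_ALIGN(nla->nla_len));
      		}
      	}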
      Signed-off-by: Yousuk Seung <ysseung@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48040793
  5. 31 Jul 2020 (8 commits)
  6. 30 Jul 2020 (1 commit)
  7. 28 Jul 2020 (2 commits)
  8. 27 Jul 2020 (5 commits)
  9. 26 Jul 2020 (3 commits)
    • bpf: Implement BPF XDP link-specific introspection APIs · c1931c97
      Authored by Andrii Nakryiko
      Implement XDP link-specific show_fdinfo and link_info to emit ifindex.
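      
      A sketch of querying that info from user space (illustrative;
      uses libbpf's bpf_obj_get_info_by_fd() and the xdp member of
      struct bpf_link_info added for this link type):
      
      	struct bpf_link_info info = {};
      	__u32 info_len = sizeof(info);
      
      	if (!bpf_obj_get_info_by_fd(link_fd, &info, &info_len) &&
      	    info.type == BPF_LINK_TYPE_XDP)
      		printf("XDP link %u on ifindex %u\n",
      		       info.id, info.xdp.ifindex);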
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200722064603.3350758-7-andriin@fb.com
      c1931c97
    • bpf, xdp: Add bpf_link-based XDP attachment API · aa8d3a71
      Authored by Andrii Nakryiko
      Add bpf_link-based API (bpf_xdp_link) to attach BPF XDP program through
      BPF_LINK_CREATE command.
      
      bpf_xdp_link is mutually exclusive with direct BPF program attachment;
      a previously attached BPF program should be detached before attempting
      to create a new bpf_xdp_link attachment (for a given XDP mode). Once a
      BPF link is attached, it can't be replaced by another BPF program or
      link attachment. It is detached only when the last BPF link FD is closed.
      
      bpf_xdp_link is auto-detached when the net_device is shut down, similarly
      to how other BPF links behave (cgroup, flow_dissector). At that point the
      bpf_link becomes defunct, but it won't be destroyed until the last FD is
      closed.
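      
      A user-space sketch of creating such a link with the raw bpf(2)
      syscall (illustrative only; assumes an already-loaded XDP program
      fd and a target ifindex):
      
      	union bpf_attr attr = {};
      	int link_fd;
      
      	attr.link_create.prog_fd = prog_fd;
      	attr.link_create.target_ifindex = ifindex;	/* not target_fd */
      	attr.link_create.attach_type = BPF_XDP;
      	attr.link_create.flags = 0;	/* XDP mode flags, if any, go here */
      
      	link_fd = syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));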
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200722064603.3350758-5-andriin@fb.com
      aa8d3a71
    • bpf: Implement bpf iterator for map elements · a5cbe05a
      Authored by Yonghong Song
      This patch implements the bpf iterator for map elements.
      The bpf program will receive four parameters:
        bpf_iter_meta *meta: the meta data
        bpf_map *map:        the bpf_map whose elements are traversed
        void *key:           the key of one element
        void *value:         the value of the same element
      
      Here, meta and map pointers are always valid, and
      key has register type PTR_TO_RDONLY_BUF_OR_NULL and
      value has register type PTR_TO_RDWR_BUF_OR_NULL.
      The kernel will track the access range of key and value
      during verification time. Later, these values will be compared
      against the values in the actual map to ensure all accesses
      are within range.
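      
      For illustration, a map-element iterator program would look roughly
      like this (a sketch, assuming a map with 4-byte keys and 8-byte
      values, built with the libbpf helpers; the context type and section
      prefix follow the bpf_iter conventions used by the other iterators):
      
      	SEC("iter/bpf_map_elem")
      	int dump_map_elem(struct bpf_iter__bpf_map_elem *ctx)
      	{
      		struct seq_file *seq = ctx->meta->seq;
      		__u32 *key = ctx->key;		/* PTR_TO_RDONLY_BUF_OR_NULL */
      		__u64 *val = ctx->value;	/* PTR_TO_RDWR_BUF_OR_NULL */
      
      		if (!key || !val)	/* NULL on the post-iteration call */
      			return 0;
      
      		BPF_SEQ_PRINTF(seq, "key %u val %llu\n", *key, *val);
      		return 0;
      	}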
      
      A new field iter_seq_info is added to bpf_map_ops which
      is used to add map type specific information, i.e., seq_ops,
      init/fini seq_file func and seq_file private data size.
      Subsequent patches will have actual implementation
      for bpf_map_ops->iter_seq_info.
      
      In user space, BPF_ITER_LINK_MAP_FD needs to be
      specified in prog attr->link_create.flags, which indicates
      that attr->link_create.target_fd is a map_fd.
      The reason for such an explicit flag is for possible
      future cases where one bpf iterator may allow more than
      one possible customization, e.g., pid and cgroup id for
      task_file.
      
      The current kernel-internal implementation only allows the
      target to register at most one required bpf_iter_link_info.
      To support the above case, optional bpf_iter_link_info's are
      needed; the target can be extended to register such link
      infos, and the user-provided link_info must match one of
      the ones the target supports.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200723184112.590360-1-yhs@fb.com
      a5cbe05a
  10. 25 Jul 2020 (5 commits)
    • bcache: add bucket_size_hi into struct cache_sb_disk for large bucket · ffa47032
      Authored by Coly Li
      The large bucket feature is to extend bucket_size from 16bit to 32bit.
      
      When creating a cache device on a zoned device (e.g. a zoned NVMe SSD),
      making a single bucket cover one or more zones of the zoned device is
      the simplest way for bcache to support a zoned device as cache.
      
      But the current maximum bucket size is 16MB while a typical zone size
      of a zoned device is 256MB; this is the major motivation for extending
      the bucket size to a larger bit width.
      
      This patch is the first, basic change to support large bucket sizes.
      The major changes it makes are:
      - Add BCH_FEATURE_INCOMPAT_LARGE_BUCKET for the large bucket feature;
        INCOMPAT means it introduces an incompatible on-disk format change.
      - Add the BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET) routines.
      - Add __le16 bucket_size_hi into struct cache_sb_disk at offset 0x8d0
        of the on-disk super block format.
      - For the in-memory super block struct cache_sb, extend member
        bucket_size from __u16 to __u32.
      - Add get_bucket_size() to combine bucket_size and bucket_size_hi
        from struct cache_sb_disk into an unsigned int value.
      
      Since we already have the large bucket size helpers meta_bucket_pages(),
      meta_bucket_bytes() and alloc_meta_bucket_pages(), memory allocation for
      bcache metadata buckets won't fail when the bucket size exceeds 8MB, no
      matter how far the bucket size is extended. So these metadata buckets
      are handled properly when the bucket size width increases from 16bit to
      32bit; we don't need to worry about them.
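      
      Conceptually, the combination done by get_bucket_size() is just a
      16-bit shift (a sketch of the idea, not necessarily the exact code;
      it assumes bucket_size_hi carries the upper 16 bits and that the
      large_bucket feature bit gates its use):
      
      	static inline unsigned int get_bucket_size(struct cache_sb *sb,
      						   struct cache_sb_disk *s)
      	{
      		unsigned int bucket_size = le16_to_cpu(s->bucket_size);
      
      		if (bch_has_feature_large_bucket(sb))
      			bucket_size |= le16_to_cpu(s->bucket_size_hi) << 16;
      
      		return bucket_size;
      	}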
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ffa47032
    • bcache: struct cache_sb is only for in-memory super block now · 4c1ccd08
      Authored by Coly Li
      We already have struct cache_sb_disk for the on-disk super block, so it
      is unnecessary to keep the in-memory super block format exactly mapped
      to the on-disk struct layout.
      
      This patch adds code comments to note that struct cache_sb no longer
      maps exactly to cache_sb_disk, and removes the useless members csum
      and pad[5].
      
      Although struct cache_sb does not belong to the uapi, some on-disk
      format related macros still reference it, and it is unnecessary to get
      rid of that dependency now. So struct cache_sb will continue to stay
      in include/uapi/linux/bcache.h for now.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4c1ccd08
    • bcache: increase super block version for cache device and backing device · d721a43f
      Authored by Coly Li
      The newly added super block versions BCACHE_SB_VERSION_BDEV_WITH_FEATURES
      (value 5) and BCACHE_SB_VERSION_CDEV_WITH_FEATURES (value 6) are for the
      feature set bits.
      
      Devices whose super block version equals one of the new versions will
      have three new members for feature set bits in the on-disk super block,
              __le64                  feature_compat;
              __le64                  feature_incompat;
              __le64                  feature_ro_compat;
      
      They are used for future new features which may introduce on-disk
      format changes, and avoid unnecessary super block version increases.
      
      A very basic feature handling code skeleton is also added in this
      patch.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d721a43f
    • icmp6: support rfc 4884 · 01370434
      Authored by Willem de Bruijn
      Extend the rfc 4884 read interface introduced for ipv4 in
      commit eba75c58 ("icmp: support rfc 4884") to ipv6.
      
      Add socket option SOL_IPV6/IPV6_RECVERR_RFC4884.
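      
      Enabling the new option from user space is a one-liner (a sketch;
      assumes fd is an IPv6 datagram socket that already has IPV6_RECVERR
      enabled):
      
      	int on = 1;
      
      	if (setsockopt(fd, SOL_IPV6, IPV6_RECVERR_RFC4884, &on, sizeof(on)))
      		perror("setsockopt IPV6_RECVERR_RFC4884");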
      
      Changes v1->v2:
        - make ipv6_icmp_error_rfc4884 static (file scope)
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      01370434
    • net/sched: cls_flower: Add hash info to flow classification · 5923b8f7
      Authored by Ariel Levkovich
      Add new cls_flower keys for the hash value and hash mask, and
      dissect the hash info from the skb into the flow key for flow
      classification.
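      
      A sketch of supplying the new keys in the netlink request
      (illustrative only; it assumes the attribute names added by this
      patch are TCA_FLOWER_KEY_HASH / TCA_FLOWER_KEY_HASH_MASK and uses
      iproute2-style addattr32() helpers):
      
      	__u32 hash = 0x0000ff00;	/* match skb->hash against this value */
      	__u32 mask = 0x0000ff00;	/* ... under this mask */
      
      	addattr32(n, MAX_MSG, TCA_FLOWER_KEY_HASH, hash);
      	addattr32(n, MAX_MSG, TCA_FLOWER_KEY_HASH_MASK, mask);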
      Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5923b8f7
  11. 24 Jul 2020 (1 commit)
  12. 23 Jul 2020 (1 commit)
  13. 22 Jul 2020 (2 commits)
  14. 21 Jul 2020 (1 commit)
    • audit: report audit wait metric in audit status reply · b43870c7
      Authored by Max Englander
      In environments where the preservation of audit events and predictable
      usage of system memory are prioritized, admins may use a combination of
      --backlog_wait_time and -b options at the risk of degraded performance
      resulting from backlog waiting. In some cases, this risk may be
      preferred to lost events or unbounded memory usage. Ideally, this risk
      can be mitigated by making adjustments when backlog waiting is detected.
      
      However, detection can be difficult using the currently available
      metrics. For example, an admin attempting to debug degraded performance
      may falsely believe a full backlog indicates backlog waiting. It may
      turn out the backlog frequently fills up but drains quickly.
      
      To make it easier to reliably track degraded performance to backlog
      waiting, this patch makes the following changes:
      
      Add a new field backlog_wait_time_total to the audit status reply.
      Initialize this field to zero. Add to this field the total time spent
      by the current task on scheduled timeouts while the backlog limit is
      exceeded. Reset field to zero upon request via AUDIT_SET.
      
      Tested on Ubuntu 18.04 using complementary changes to the
      audit-userspace and audit-testsuite:
      - https://github.com/linux-audit/audit-userspace/pull/134
      - https://github.com/linux-audit/audit-testsuite/pull/97
      Signed-off-by: Max Englander <max.englander@gmail.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      b43870c7
  15. 20 Jul 2020 (4 commits)
    • perf: Add perf_event_mmap_page::cap_user_time_short ABI · 6c0246a4
      Authored by Peter Zijlstra
      In order to support short clock counters, provide an ABI extension.
      
      As a whole:
      
          u64 time, delta, cyc = read_cycle_counter();
      
      +   if (cap_user_time_short)
      +	cyc = time_cycle + ((cyc - time_cycle) & time_mask);
      
          delta = mul_u64_u32_shr(cyc, time_mult, time_shift);
      
          if (cap_user_time_zero)
      	time = time_zero + delta;
      
          delta += time_offset;
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Link: https://lore.kernel.org/r/20200716051130.4359-6-leo.yan@linaro.org
      Signed-off-by: Will Deacon <will@kernel.org>
      6c0246a4
    • ptp: introduce a phase offset in the periodic output request · b6bd4136
      Authored by Vladimir Oltean
      Some PHCs like the ocelot/felix switch cannot emit generic periodic
      output, but just PPS (pulse per second) signals, which:
      - don't start from arbitrary absolute times, but are rather
        phase-aligned to the beginning of [the closest next] second.
      - have an optional phase offset relative to that beginning of the
        second.
      
      For those, it was initially established that they should reject any
      other absolute time for the PTP_PEROUT_REQUEST than 0.000000000 [1].
      
      But when it actually came to writing an application [2] that makes use
      of this functionality, we realized that we can't really deal generically
      with PHCs that support absolute start time, and with PHCs that don't,
      without an explicit interface. Namely, in an ideal world, PHC drivers
      would ensure that the "perout.start" value written to hardware will
      result in a functional output. This means that if the PTP time has
      become in the past of this PHC's current time, it should be
      automatically fast-forwarded by the driver into a close enough future
      time that is known to work (note: this is necessary only if the hardware
      doesn't do this fast-forward by itself). But we don't really know the
      status of the PHC drivers in use today, so in general, user space would
      risk getting a non-functional periodic output if it simply asked for a
      start time of 0.000000000.
      
      So let's introduce a flag for this type of reduced-functionality
      hardware, named PTP_PEROUT_PHASE. The start time is just "soon", the
      only thing we know for sure about this signal is that its rising edge
      events, Rn, occur at:
      
      Rn = perout.phase + n * perout.period
      
      The "phase" in the periodic output structure is simply an alias to the
      "start" time, since both cannot logically be specified at the same time.
      Therefore, the binary layout of the structure is not affected.
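      
      In terms of the ioctl interface, a phase-aligned 1 PPS request would
      then look roughly like this (a sketch; assumes a /dev/ptpN fd and the
      flag-validating PTP_PEROUT_REQUEST2 ioctl):
      
      	struct ptp_perout_request req = {};
      
      	req.index = 0;
      	req.flags = PTP_PEROUT_PHASE;
      	req.period.sec = 1;		/* one pulse per second ... */
      	req.period.nsec = 0;
      	req.phase.sec = 0;		/* ... rising edge 100 ms after */
      	req.phase.nsec = 100000000;	/* each second boundary */
      
      	if (ioctl(ptp_fd, PTP_PEROUT_REQUEST2, &req))
      		perror("PTP_PEROUT_REQUEST2");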
      
      [1]: https://patchwork.ozlabs.org/project/netdev/patch/20200320103726.32559-7-yangbo.lu@nxp.com/
      [2]: https://www.mail-archive.com/linuxptp-devel@lists.sourceforge.net/msg04142.html
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6bd4136
    • ptp: add ability to configure duty cycle for periodic output · f65b71aa
      Authored by Vladimir Oltean
      There are external event timestampers (PHCs with support for
      PTP_EXTTS_REQUEST) that timestamp both event edges.
      
      When those edges are very close (such as in the case of a short pulse),
      there is a chance that the collected timestamp might be of the rising
      or of the falling edge; we never know.
      
      There are also PHCs capable of generating periodic output with a
      configurable duty cycle. This is good news, because we can space the
      rising and falling edges far enough apart in time that the risk of
      overrunning the 1-entry timestamp FIFO of the extts PHC is lower
      (example: the perout PHC can be configured for a period of 1 second
      and an "on" time of 0.5 seconds, resulting in a duty cycle of 50%).
      
      A flag is introduced for signaling that an on time is present in the
      perout request structure, for preserving compatibility. Logically
      speaking, the duty cycle cannot exceed 100% and the PTP core checks for
      this.
      
      PHC drivers that don't support this flag emit a periodic output of an
      unspecified duty cycle, same as before.
      
      The duty cycle is encoded as an "on" time, similar to the "start" and
      "period" times, and reuses the reserved space while preserving overall
      binary layout.
      
      Pahole reported before:
      
      struct ptp_perout_request {
              struct ptp_clock_time start;                     /*     0    16 */
              struct ptp_clock_time period;                    /*    16    16 */
              unsigned int               index;                /*    32     4 */
              unsigned int               flags;                /*    36     4 */
              unsigned int               rsv[4];               /*    40    16 */
      
              /* size: 56, cachelines: 1, members: 5 */
              /* last cacheline: 56 bytes */
      };
      
      And now:
      
      struct ptp_perout_request {
              struct ptp_clock_time start;                     /*     0    16 */
              struct ptp_clock_time period;                    /*    16    16 */
              unsigned int               index;                /*    32     4 */
              unsigned int               flags;                /*    36     4 */
              union {
                      struct ptp_clock_time on;                /*    40    16 */
                      unsigned int       rsv[4];               /*    40    16 */
              };                                               /*    40    16 */
      
              /* size: 56, cachelines: 1, members: 5 */
              /* last cacheline: 56 bytes */
      };
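      
      A usage sketch for a 1 s period with a 50% duty cycle (illustrative;
      the flag name PTP_PEROUT_DUTY_CYCLE is assumed here, together with
      the flag-validating PTP_PEROUT_REQUEST2 ioctl):
      
      	struct ptp_perout_request req = {};
      
      	req.index = 0;
      	req.flags = PTP_PEROUT_DUTY_CYCLE;
      	req.period.sec = 1;	/* 1 s period ... */
      	req.period.nsec = 0;
      	req.on.sec = 0;		/* ... with 0.5 s "on" time, i.e. 50% */
      	req.on.nsec = 500000000;
      
      	if (ioctl(ptp_fd, PTP_PEROUT_REQUEST2, &req))
      		perror("PTP_PEROUT_REQUEST2");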
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f65b71aa
    • icmp: support rfc 4884 · eba75c58
      Authored by Willem de Bruijn
      Add setsockopt SOL_IP/IP_RECVERR_RFC4884 to return the offset to an
      extension struct if present.
      
      ICMP messages may include an extension structure after the original
      datagram. RFC 4884 standardized this behavior. It stores the offset
      in words to the extension header in u8 icmphdr.un.reserved[1].
      
      The field is valid only for the ICMP types destination unreachable,
      time exceeded and parameter problem, if the length is at least 128
      bytes and the entire packet does not exceed 576 bytes.
      
      Return the offset to the start of the extension struct when reading an
      ICMP error from the error queue, if it matches the above constraints.
      
      Do not return the raw u8 field. Return the offset from the start of
      the user buffer, in bytes. The kernel does not return the network and
      transport headers, so subtract those.
      
      Also validate the headers. Return the offset regardless of validation,
      as an invalid extension must still not be misinterpreted as part of
      the original datagram. Note that !invalid does not imply valid. If
      the extension version does not match, no validation can take place,
      for instance.
      
      For backward compatibility, make this optional, set by setsockopt
      SOL_IP/IP_RECVERR_RFC4884. For API example and feature test, see
      github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp_v2.c
      
      For forward compatibility, reserve only setsockopt value 1, leaving
      other bits for additional icmp extensions.
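      
      A hedged user-space sketch of the receive path (illustrative only;
      the sock_extended_err ee_rfc4884 member and SO_EE_RFC4884_FLAG_INVALID
      flag names are assumptions about the final uapi in linux/errqueue.h,
      not quoted from this patch):
      
      	int on = 1;
      
      	setsockopt(fd, SOL_IP, IP_RECVERR_RFC4884, &on, sizeof(on));
      
      	/* after recvmsg(fd, &msg, MSG_ERRQUEUE) and locating the
      	 * IP_RECVERR cmsg: */
      	struct sock_extended_err *ee =
      		(struct sock_extended_err *)CMSG_DATA(cmsg);
      
      	if (ee->ee_rfc4884.len &&
      	    !(ee->ee_rfc4884.flags & SO_EE_RFC4884_FLAG_INVALID)) {
      		/* the extension structure starts ee->ee_rfc4884.len
      		 * bytes into the payload returned in msg's iov */
      	}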
      
      Changes
        v1->v2:
        - convert word offset to byte offset from start of user buffer
          - return in ee_data as u8 may be insufficient
        - define extension struct and object header structs
        - return len only if constraints met
        - if returning len, also validate
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eba75c58