1. 20 7月, 2018 1 次提交
  2. 19 7月, 2018 1 次提交
    • S
      net: Move skb decrypted field, avoid explicity copy · a48d189e
      Stefano Brivio 提交于
      Commit 784abe24 ("net: Add decrypted field to skb")
      introduced a 'decrypted' field that is explicitly copied on skb
      copy and clone.
      
      Move it between headers_start[0] and headers_end[0], so that we
      don't need to copy it explicitly as it's copied by the memcpy()
      in __copy_skb_header().
      
      While at it, drop the assignment in __skb_clone(), it was
      already redundant.
      
      This doesn't change the size of sk_buff or cacheline boundaries.
      
      The 15-bits hole before tc_index becomes a 14-bits hole, and
      will be again a 15-bits hole when this change is merged with
      commit 8b700862 ("net: Don't copy pfmemalloc flag in
      __copy_skb_header()").
      
      v2: as reported by kbuild test robot (oops, I forgot to build
          with CONFIG_TLS_DEVICE it seems), we can't use
          CHECK_SKB_FIELD() on a bit-field member. Just drop the
          check for the moment being, perhaps we could think of some
          magic to also check bit-field members one day.
      
      Fixes: 784abe24 ("net: Add decrypted field to skb")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a48d189e
  3. 18 7月, 2018 4 次提交
    • A
      net: phy: sfp: Add HWMON support for module sensors · 1323061a
      Andrew Lunn 提交于
      SFP modules can contain a number of sensors. The EEPROM also contains
      recommended alarm and critical values for each sensor, and indications
      of if these have been exceeded. Export this information via
      HWMON. Currently temperature, VCC, bias current, transmit power, and
      possibly receiver power is supported.
      
      The sensors in the modules can either return calibrate or uncalibrated
      values. Uncalibrated values need to be manipulated, using coefficients
      provided in the SFP EEPROM. Uncalibrated receive power values require
      floating point maths in order to calibrate them. Performing this in
      the kernel is hard. So if the SFP module indicates it uses
      uncalibrated values, RX power is not made available.
      
      With this hwmon device, it is possible to view the sensor values using
      lm-sensors programs:
      
      in0:          +3.29 V  (crit min =  +2.90 V, min =  +3.00 V)
                             (max =  +3.60 V, crit max =  +3.70 V)
      temp1:        +33.0°C  (low  =  -5.0°C, high = +80.0°C)
                             (crit low = -10.0°C, crit = +85.0°C)
      power1:      1000.00 nW (max = 794.00 uW, min =  50.00 uW)  ALARM (LCRIT)
                             (lcrit =  40.00 uW, crit = 1000.00 uW)
      curr1:        +0.00 A  (crit min =  +0.00 A, min =  +0.00 A)  ALARM (LCRIT, MIN)
                             (max =  +0.01 A, crit max =  +0.01 A)
      
      The scaling sensors performs on the bias current is not particularly
      good. The raw values are more useful:
      
      curr1:
        curr1_input: 0.000
        curr1_min: 0.002
        curr1_max: 0.010
        curr1_lcrit: 0.000
        curr1_crit: 0.011
        curr1_min_alarm: 1.000
        curr1_max_alarm: 0.000
        curr1_lcrit_alarm: 1.000
        curr1_crit_alarm: 0.000
      
      In order to keep the I2C overhead to a minimum, the constant values,
      such as limits and calibration coefficients are read once at module
      insertion time. Thus only reading *_input and *_alarm properties
      requires i2c read operations.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Acked-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1323061a
    • A
      hwmon: Add helper to tell if a char is invalid in a name · dcb5d0fc
      Andrew Lunn 提交于
      HWMON device names are not allowed to contain "-* \t\n". Add a helper
      which will return true if passed an invalid character. It can be used
      to massage a string into a hwmon compatible name by replacing invalid
      characters with '_'.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Acked-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dcb5d0fc
    • A
      hwmon: Add support for power min, lcrit, min_alarm and lcrit_alarm · aa7f29b0
      Andrew Lunn 提交于
      Some sensors support reporting minimal and lower critical power, as
      well as alarms when these thresholds are reached. Add support for
      these attributes to the hwmon core.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Acked-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa7f29b0
    • A
      hwmon: Add missing HWMON_T_LCRIT_ALARM define · 2fe31e43
      Andrew Lunn 提交于
      The enum hwmon_temp_lcrit_alarm exists, but the BIT definition is
      missing.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Acked-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2fe31e43
  4. 17 7月, 2018 2 次提交
  5. 16 7月, 2018 4 次提交
  6. 14 7月, 2018 3 次提交
    • N
      net: ipmr: add support for passing full packet on wrong vif · c921c207
      Nikolay Aleksandrov 提交于
      This patch adds support for IGMPMSG_WRVIFWHOLE which is used to pass
      full packet and real vif id when the incoming interface is wrong.
      While the RP and FHR are setting up state we need to be sending the
      registers encapsulated with all the data inside otherwise we lose it.
      The RP then decapsulates it and forwards it to the interested parties.
      Currently with WRONGVIF we can only be sending empty register packets
      and will lose that data.
      This behaviour can be enabled by using MRT_PIM with
      val == IGMPMSG_WRVIFWHOLE. This doesn't prevent IGMPMSG_WRONGVIF from
      happening, it happens in addition to it, also it is controlled by the same
      throttling parameters as WRONGVIF (i.e. 1 packet per 3 seconds currently).
      Both messages are generated to keep backwards compatibily and avoid
      breaking someone who was enabling MRT_PIM with val == 4, since any
      positive val is accepted and treated the same.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c921c207
    • J
      xdp: support simultaneous driver and hw XDP attachment · a25717d2
      Jakub Kicinski 提交于
      Split the query of HW-attached program from the software one.
      Introduce new .ndo_bpf command to query HW-attached program.
      This will allow drivers to install different programs in HW
      and SW at the same time.  Netlink can now also carry multiple
      programs on dump (in which case mode will be set to
      XDP_ATTACHED_MULTI and user has to check per-attachment point
      attributes, IFLA_XDP_PROG_ID will not be present).  We reuse
      IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size()
      doesn't need to be updated.
      
      Note that the installation side is still not there, since all
      drivers currently reject installing more than one program at
      the time.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a25717d2
    • J
      xdp: don't make drivers report attachment mode · 6b867589
      Jakub Kicinski 提交于
      prog_attached of struct netdev_bpf should have been superseded
      by simply setting prog_id long time ago, but we kept it around
      to allow offloading drivers to communicate attachment mode (drv
      vs hw).  Subsequently drivers were also allowed to report back
      attachment flags (prog_flags), and since nowadays only programs
      attached will XDP_FLAGS_HW_MODE can get offloaded, we can tell
      the attachment mode from the flags driver reports.  Remove
      prog_attached member.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      6b867589
  7. 13 7月, 2018 4 次提交
  8. 12 7月, 2018 2 次提交
    • P
      net: Add lag.h, net_lag_port_dev_txable() · eeed992b
      Petr Machata 提交于
      LAG devices (team or bond) recognize for each one of their slave devices
      whether LAG traffic is going to be sent through that device. Bond calls
      such devices "active", team calls them "txable". When this state
      changes, a NETDEV_CHANGELOWERSTATE notification is distributed, together
      with a netdev_notifier_changelowerstate_info structure that for LAG
      devices includes a tx_enabled flag that refers to the new state. The
      notification thus makes it possible to react to the changes in txability
      in drivers.
      
      However there's no way to query txability from the outside on demand.
      That is problematic namely for mlxsw, which when resolving ERSPAN packet
      path, may encounter a LAG device, and needs to determine which of the
      slaves it should choose.
      
      To that end, introduce a new function, net_lag_port_dev_txable(), which
      determines whether a given slave device is "active" or
      "txable" (depending on the flavor of the LAG device). That function then
      dispatches to per-LAG-flavor helpers, bond_is_active_slave_dev() resp.
      team_port_dev_txable().
      
      Because there currently is no good place where net_lag_port_dev_txable()
      should be added, introduce a new header file, lag.h, which should from
      now on hold any logic common to both team and bond. (But keep
      netif_is_lag_master() together with the rest of netif_is_*_master()
      functions).
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eeed992b
    • P
      team: Publish team_port_get_rcu() · 3443b00e
      Petr Machata 提交于
      A follow-up patch adds a new entry point, team_port_dev_txable(). Making
      it an ordinary exported function would mean that any module that may
      need the service in one of the supported configurations also
      unconditionally needs to pull in the team module, whether or not the
      user actually intends to create team interfaces.
      
      To prevent that, team_port_dev_txable() is defined in if_team.h, and
      therefore all dependencies of that function also need to be
      publicly-visible.
      
      Therefore move team_port_get_rcu() from team.c to if_team.h.
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3443b00e
  9. 11 7月, 2018 1 次提交
    • T
      netfilter: Add nf_ct_get_tuple_skb global lookup function · b60a6040
      Toke Høiland-Jørgensen 提交于
      This adds a global netfilter function to extract a conntrack tuple from an
      skb. The function uses a new function added to nf_ct_hook, which will try
      to get the tuple from skb->_nfct, and do a full lookup if that fails. This
      makes it possible to use the lookup function before the skb has passed
      through the conntrack init hooks (e.g., in an ingress qdisc). The tuple is
      copied to the caller to avoid issues with reference counting.
      
      The function returns false if conntrack is not loaded, allowing it to be
      used without incurring a module dependency on conntrack. This is used by
      the NAT mode in sch_cake.
      
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: NToke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b60a6040
  10. 10 7月, 2018 6 次提交
  11. 08 7月, 2018 1 次提交
  12. 07 7月, 2018 1 次提交
    • J
      lib: reciprocal_div: implement the improved algorithm on the paper mentioned · 06ae4826
      Jiong Wang 提交于
      The new added "reciprocal_value_adv" implements the advanced version of the
      algorithm described in Figure 4.2 of the paper except when
      "divisor > (1U << 31)" whose ceil(log2(d)) result will be 32 which then
      requires u128 divide on host. The exception case could be easily handled
      before calling "reciprocal_value_adv".
      
      The advanced version requires more complex calculation to get the
      reciprocal multiplier and other control variables, but then could reduce
      the required emulation operations.
      
      It makes no sense to use this advanced version for host divide emulation,
      those extra complexities for calculating multiplier etc could completely
      waive our saving on emulation operations.
      
      However, it makes sense to use it for JIT divide code generation (for
      example eBPF JIT backends) for which we are willing to trade performance of
      JITed code with that of host. As shown by the following pseudo code, the
      required emulation operations could go down from 6 (the basic version) to 3
      or 4.
      
      To use the result of "reciprocal_value_adv", suppose we want to calculate
      n/d, the C-style pseudo code will be the following, it could be easily
      changed to real code generation for other JIT targets.
      
        struct reciprocal_value_adv rvalue;
        u8 pre_shift, exp;
      
        // handle exception case.
        if (d >= (1U << 31)) {
          result = n >= d;
          return;
        }
        rvalue = reciprocal_value_adv(d, 32)
        exp = rvalue.exp;
        if (rvalue.is_wide_m && !(d & 1)) {
          // floor(log2(d & (2^32 -d)))
          pre_shift = fls(d & -d) - 1;
          rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift);
        } else {
          pre_shift = 0;
        }
      
        // code generation starts.
        if (imm == 1U << exp) {
          result = n >> exp;
        } else if (rvalue.is_wide_m) {
          // pre_shift must be zero when reached here.
          t = (n * rvalue.m) >> 32;
          result = n - t;
          result >>= 1;
          result += t;
          result >>= rvalue.sh - 1;
        } else {
          if (pre_shift)
            result = n >> pre_shift;
          result = ((u64)result * rvalue.m) >> 32;
          result >>= rvalue.sh;
        }
      Signed-off-by: NJiong Wang <jiong.wang@netronome.com>
      Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      06ae4826
  13. 05 7月, 2018 1 次提交
  14. 04 7月, 2018 5 次提交
    • V
      net/sched: Introduce the ETF Qdisc · 25db26a9
      Vinicius Costa Gomes 提交于
      The ETF (Earliest TxTime First) qdisc uses the information added
      earlier in this series (the socket option SO_TXTIME and the new
      role of sk_buff->tstamp) to schedule packets transmission based
      on absolute time.
      
      For some workloads, just bandwidth enforcement is not enough, and
      precise control of the transmission of packets is necessary.
      
      Example:
      
      $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
                 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
      
      $ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \
                 clockid CLOCK_TAI
      
      In this example, the Qdisc will provide SW best-effort for the control
      of the transmission time to the network adapter, the time stamp in the
      socket will be in reference to the clockid CLOCK_TAI and packets
      will leave the qdisc "delta" (100000) nanoseconds before its transmission
      time.
      
      The ETF qdisc will buffer packets sorted by their txtime. It will drop
      packets on enqueue() if their skbuff clockid does not match the clock
      reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
      if it expires while being enqueued.
      
      The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
      will dequeue a packet as soon as possible and change the skb timestamp
      to 'now' during etf_dequeue().
      
      Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
      to be configured, but it's been decided that usage of CLOCK_TAI should
      be enforced until we decide to allow for other clockids to be used.
      The rationale here is that PTP times are usually in the TAI scale, thus
      no other clocks should be necessary. For now, the qdisc will return
      EINVAL if any clocks other than CLOCK_TAI are used.
      Signed-off-by: NJesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25db26a9
    • E
      net: ipv4: listified version of ip_rcv · 17266ee9
      Edward Cree 提交于
      Also involved adding a way to run a netfilter hook over a list of packets.
       Rather than attempting to make netfilter know about lists (which would be
       a major project in itself) we just let it call the regular okfn (in this
       case ip_rcv_finish()) for any packets it steals, and have it give us back
       a list of packets it's synchronously accepted (which normally NF_HOOK
       would automatically call okfn() on, but we want to be able to potentially
       pass the list to a listified version of okfn().)
      The netfilter hooks themselves are indirect calls that still happen per-
       packet (see nf_hook_entry_hookfn()), but again, changing that can be left
       for future work.
      
      There is potential for out-of-order receives if the netfilter hook ends up
       synchronously stealing packets, as they will be processed before any
       accepts earlier in the list.  However, it was already possible for an
       asynchronous accept to cause out-of-order receives, so presumably this is
       considered OK.
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      17266ee9
    • E
      net: core: another layer of lists, around PF_MEMALLOC skb handling · 4ce0017a
      Edward Cree 提交于
      First example of a layer splitting the list (rather than merely taking
       individual packets off it).
      Involves new list.h function, list_cut_before(), like list_cut_position()
       but cuts on the other side of the given entry.
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ce0017a
    • E
      net: core: trivial netif_receive_skb_list() entry point · f6ad8c1b
      Edward Cree 提交于
      Just calls netif_receive_skb() in a loop.
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6ad8c1b
    • X
      sctp: add support for dscp and flowlabel per transport · 8a9c58d2
      Xin Long 提交于
      Like some other per transport params, flowlabel and dscp are added
      in transport, asoc and sctp_sock. By default, transport sets its
      value from asoc's, and asoc does it from sctp_sock. flowlabel
      only works for ipv6 transport.
      
      Other than that they need to be passed down in sctp_xmit, flow4/6
      also needs to set them before looking up route in get_dst.
      
      Note that it uses '& 0x100000' to check if flowlabel is set and
      '& 0x1' (tos 1st bit is unused) to check if dscp is set by users,
      so that they could be set to 0 by sockopt in next patch.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a9c58d2
  15. 02 7月, 2018 2 次提交
  16. 30 6月, 2018 1 次提交
    • D
      bpf: undo prog rejection on read-only lock failure · 85782e03
      Daniel Borkmann 提交于
      Partially undo commit 9facc336 ("bpf: reject any prog that failed
      read-only lock") since it caused a regression, that is, syzkaller was
      able to manage to cause a panic via fault injection deep in set_memory_ro()
      path by letting an allocation fail: In x86's __change_page_attr_set_clr()
      it was able to change the attributes of the primary mapping but not in
      the alias mapping via cpa_process_alias(), so the second, inner call
      to the __change_page_attr() via __change_page_attr_set_clr() had to split
      a larger page and failed in the alloc_pages() with the artifically triggered
      allocation error which is then propagated down to the call site.
      
      Thus, for set_memory_ro() this means that it returned with an error, but
      from debugging a probe_kernel_write() revealed EFAULT on that memory since
      the primary mapping succeeded to get changed. Therefore the subsequent
      hdr->locked = 0 reset triggered the panic as it was performed on read-only
      memory, so call-site assumptions were infact wrong to assume that it would
      either succeed /or/ not succeed at all since there's no such rollback in
      set_memory_*() calls from partial change of mappings, in other words, we're
      left in a state that is "half done". A later undo via set_memory_rw() is
      succeeding though due to matching permissions on that part (aka due to the
      try_preserve_large_page() succeeding). While reproducing locally with
      explicitly triggering this error, the initial splitting only happens on
      rare occasions and in real world it would additionally need oom conditions,
      but that said, it could partially fail. Therefore, it is definitely wrong
      to bail out on set_memory_ro() error and reject the program with the
      set_memory_*() semantics we have today. Shouldn't have gone the extra mile
      since no other user in tree today infact checks for any set_memory_*()
      errors, e.g. neither module_enable_ro() / module_disable_ro() for module
      RO/NX handling which is mostly default these days nor kprobes core with
      alloc_insn_page() / free_insn_page() as examples that could be invoked long
      after bootup and original 314beb9b ("x86: bpf_jit_comp: secure bpf jit
      against spraying attacks") did neither when it got first introduced to BPF
      so "improving" with bailing out was clearly not right when set_memory_*()
      cannot handle it today.
      
      Kees suggested that if set_memory_*() can fail, we should annotate it with
      __must_check, and all callers need to deal with it gracefully given those
      set_memory_*() markings aren't "advisory", but they're expected to actually
      do what they say. This might be an option worth to move forward in future
      but would at the same time require that set_memory_*() calls from supporting
      archs are guaranteed to be "atomic" in that they provide rollback if part
      of the range fails, once that happened, the transition from RW -> RO could
      be made more robust that way, while subsequent RO -> RW transition /must/
      continue guaranteeing to always succeed the undo part.
      
      Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com
      Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com
      Fixes: 9facc336 ("bpf: reject any prog that failed read-only lock")
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      85782e03
  17. 29 6月, 2018 1 次提交