1. 19 Jan, 2019 5 commits
    • mac80211: Add TXQ scheduling API · 18667600
      Toke Høiland-Jørgensen authored
      This adds an API to mac80211 to handle scheduling of TXQs. The interface
      between driver and mac80211 for TXQ handling is changed by adding two new
      functions: ieee80211_next_txq(), which will return the next TXQ to schedule
      in the current round-robin rotation, and ieee80211_return_txq(), which the
      driver uses to indicate that it has finished scheduling a TXQ (which will
      then be put back in the scheduling rotation if it isn't empty).
      
      The driver must call ieee80211_txq_schedule_start() at the start of each
      scheduling session, and ieee80211_txq_schedule_end() at the end. The API
      then guarantees that the same TXQ is not returned twice in the same
      session (so a driver can loop on ieee80211_next_txq() without worrying
      about breaking the loop).
      
      Usage of the new API is optional, so drivers can be ported one at a time.
      In this patch, the actual scheduling performed by mac80211 is simple
      round-robin, but a subsequent commit adds airtime fairness awareness to the
      scheduler.
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      [minor kernel-doc fix, propagate sparse locking checks out]
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
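The scheduling contract described in this commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every `toy_*` name is hypothetical, the ring buffer and per-queue session counter are one possible way to get the "never returned twice in a session" guarantee, and none of this is the mac80211 implementation or its signatures.

```c
/* Toy userspace model of the TXQ scheduling contract; NOT mac80211 code.
 * All toy_* names are hypothetical and exist only for illustration. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TXQS 8

struct toy_txq {
    int id;
    int backlog;        /* packets waiting in this queue */
    unsigned int seen;  /* last session in which this queue was handed out */
    bool queued;        /* currently in the rotation? */
};

struct toy_sched {
    struct toy_txq *ring[MAX_TXQS]; /* FIFO of schedulable queues */
    size_t head, len;
    unsigned int session;           /* bumped at every schedule_start */
};

/* Put a queue at the tail of the rotation (no-op if already queued). */
void toy_enqueue(struct toy_sched *s, struct toy_txq *q)
{
    if (q->queued || s->len == MAX_TXQS)
        return;
    s->ring[(s->head + s->len) % MAX_TXQS] = q;
    s->len++;
    q->queued = true;
}

void toy_schedule_start(struct toy_sched *s) { s->session++; }
void toy_schedule_end(struct toy_sched *s) { (void)s; }

/* Return the next queue in round-robin order, or NULL once the head of the
 * rotation has already been handed out in the current session. */
struct toy_txq *toy_next_txq(struct toy_sched *s)
{
    struct toy_txq *q;

    if (s->len == 0)
        return NULL;
    q = s->ring[s->head];
    if (q->seen == s->session)
        return NULL;
    s->head = (s->head + 1) % MAX_TXQS;
    s->len--;
    q->queued = false;
    q->seen = s->session;
    return q;
}

/* Driver is done with the queue; re-add it only if it still has packets. */
void toy_return_txq(struct toy_sched *s, struct toy_txq *q)
{
    if (q->backlog > 0)
        toy_enqueue(s, q);
}

/* A driver-side session: loop on next_txq() until it returns NULL. */
int toy_driver_run(struct toy_sched *s, int budget_per_txq)
{
    struct toy_txq *q;
    int sent = 0;

    toy_schedule_start(s);
    while ((q = toy_next_txq(s)) != NULL) {
        int n = q->backlog < budget_per_txq ? q->backlog : budget_per_txq;

        q->backlog -= n;
        sent += n;
        toy_return_txq(s, q);
    }
    toy_schedule_end(s);
    return sent;
}
```

Because a returned queue goes to the tail and carries the current session number, the loop in `toy_driver_run()` terminates even when queues still hold packets; the leftover backlog is served in the next session.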
    • devlink: Add health report functionality · c7af343b
      Eran Ben Elisha authored
      Upon discovering an error, a driver can report it to the devlink health
      mechanism via the devlink_health_report function, using the appropriate
      reporter registered to it. The driver can pass error-specific context, which
      will be delivered to it as part of the dump / recovery callbacks.
      
      Once an error is reported, devlink health performs the following actions:
      * A log is sent to the kernel trace events buffer
      * Health status and statistics are updated for the reporter instance
      * An object dump is taken and stored at the reporter instance (as long
        as no other dump is already stored)
      * An auto recovery attempt is made, depending on:
        - Auto Recovery configuration
        - Grace period vs. time since last recovery
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
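The four actions listed in this commit message can be walked through with a small userspace model. This is a sketch only: the `toy_*` types, fields, return values and the millisecond clock passed in by the caller are assumptions made for illustration, and the real devlink code may order or gate these steps differently.

```c
/* Toy model of the report flow described above; NOT the devlink code.
 * All toy_* names, fields and return values are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

enum toy_state { TOY_HEALTHY, TOY_ERROR };

struct toy_reporter {
    enum toy_state state;
    unsigned int error_count;
    unsigned int recovery_count;
    bool dump_stored;               /* at most one dump is kept */
    bool auto_recover;              /* auto recovery configuration */
    unsigned long grace_period_ms;
    unsigned long last_recovery_ms;
    int (*recover)(void *ctx);      /* driver callback, 0 on success */
};

/* Sample driver callback that always succeeds. */
int toy_recover_ok(void *ctx)
{
    (void)ctx;
    return 0;
}

int toy_health_report(struct toy_reporter *r, const char *msg, void *ctx,
                      unsigned long now_ms)
{
    /* 1. Log the error (the kernel uses trace events; stderr stands in). */
    fprintf(stderr, "health report: %s\n", msg);

    /* 2. Update status and statistics. */
    r->error_count++;
    r->state = TOY_ERROR;

    /* 3. Store a dump only if no earlier dump is still stored. */
    if (!r->dump_stored)
        r->dump_stored = true;

    /* 4. Attempt auto recovery, subject to configuration and grace period. */
    if (!r->auto_recover || !r->recover)
        return 0;
    if (r->recovery_count > 0 &&
        now_ms - r->last_recovery_ms < r->grace_period_ms)
        return -1;                  /* too soon after the last recovery */
    if (r->recover(ctx) != 0)
        return -1;
    r->recovery_count++;
    r->last_recovery_ms = now_ms;
    r->state = TOY_HEALTHY;
    return 0;
}
```

With a 500 ms grace period, a report at t=1000 recovers, a second report at t=1200 is counted but left unrecovered, and a third at t=1600 recovers again.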
    • devlink: Add health reporter create/destroy functionality · 880ee82f
      Eran Ben Elisha authored
      Devlink health reporter is an instance for reporting, diagnosing and
      recovering from run time errors discovered by the reporters.
      Define its data structure and supported operations.
      In addition, expose devlink API to create and destroy a reporter.
      Each devlink instance will hold its own reporters list.
      
      As part of the allocation, driver shall provide a set of callbacks which
      will be used by devlink in order to handle health reports and user
      commands related to this reporter. In addition, driver is entitled to
      provide some priv pointer, which can be fetched from the reporter by
      devlink_health_reporter_priv function.
      
      For each reporter, devlink will hold a metadata of statistics,
      buffers and status.
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Add health buffer support · cb5ccfbe
      Eran Ben Elisha authored
      Devlink health buffer is a mechanism to pass descriptors between drivers
      and devlink. The API allows the driver to add objects, object pair,
      value array (nested attributes), value and name.
      
      Driver can use this API to fill the buffers in a format which can be
      translated by the devlink to the netlink message.
      
      To achieve this, an internal buffer descriptor is defined. It
      holds the data and metadata for each attribute and is used to pass
      actual commands to netlink.
      
      This mechanism will later be used in devlink health for dump and diagnose
      data stored by the drivers.
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: declare tcp_mmap() only when CONFIG_MMU is set · 340a6f3d
      Yafang Shao authored
      tcp_mmap() is only defined when CONFIG_MMU is set, so declare it only
      in that case.
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 18 Jan, 2019 4 commits
    • switchdev: Add extack argument to call_switchdev_notifiers() · 6685987c
      Petr Machata authored
      A follow-up patch will enable vetoing of FDB entries. Make it possible
      to communicate details of why an FDB entry is not acceptable back to the
      user.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: Add extack to switchdev operations · 4c59b7d1
      Petr Machata authored
      There are four sources of VXLAN switchdev notifier calls:
      
      - the changelink() link operation, which already supports extack,
      - ndo_fdb_add() which got extack support in a previous patch,
      - FDB updates due to packet forwarding,
      - and vxlan_fdb_replay().
      
      Extend vxlan_fdb_switchdev_call_notifiers() to include extack in the
      switchdev message that it sends, and propagate the argument upwards to
      the callers. For the first two cases, pass in the extack gotten through
      the operation. For case #3, pass in NULL.
      
      To cover the last case, extend vxlan_fdb_replay() to take extack
      argument, which might come from whatever operation necessitated the FDB
      replay.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: Fix recvmsg() to be able to peek across multiple records · 692d7b5d
      Vakul Garg authored
      This fixes recvmsg() to be able to peek across multiple tls records.
      Without this patch, the tls's selftests test case
      'recv_peek_large_buf_mult_recs' fails. Each tls receive context now
      maintains a 'rx_list' to retain incoming skb carrying tls records. If a
      tls record needs to be retained e.g. for peek case or for the case when
      the buffer passed to recvmsg() has a length smaller than decrypted
      record length, then it is added to 'rx_list'. Additionally, records are
      added in 'rx_list' if the crypto operation runs in async mode. The
      records are dequeued from 'rx_list' after the decrypted data is consumed
      by copying into the buffer passed to recvmsg(). If the MSG_PEEK
      flag is used in recvmsg(), records are not consumed or removed
      from the 'rx_list'.
      Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Split platform data to header file · ecfc9372
      Florian Fainelli authored
      Instead of having net/dsa.h contain both the internal switch tree/driver
      structures, split the relevant platform_data parts into
      include/linux/platform_data/dsa.h and make that header be included by
      net/dsa.h in order not to break any setup. A subsequent set of patches
      will update code including net/dsa.h to include only the platform_data
      header.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 17 Jan, 2019 1 commit
  4. 04 Jan, 2019 1 commit
    • Remove 'type' argument from access_ok() function · 96d4f267
      Linus Torvalds authored
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 02 Jan, 2019 2 commits
    • ip: validate header length on virtual device xmit · cb9f1b78
      Willem de Bruijn authored
      KMSAN detected read beyond end of buffer in vti and sit devices when
      passing truncated packets with PF_PACKET. The issue affects additional
      ip tunnel devices.
      
      Extend commit 76c0ddd8 ("ip6_tunnel: be careful when accessing the
      inner header") and commit ccfec9e5 ("ip_tunnel: be careful when
      accessing the inner header").
      
      Move the check to a separate helper and call at the start of each
      ndo_start_xmit function in net/ipv4 and net/ipv6.
      
      Minor changes:
      - convert dev_kfree_skb to kfree_skb on error path,
        as dev_kfree_skb calls consume_skb which is not for error paths.
      - use pskb_network_may_pull even though that is pedantic here,
        as the same as pskb_may_pull for devices without llheaders.
      - do not cache ipv6 hdrs if used only once
        (unsafe across pskb_may_pull, was more relevant to earlier patch)
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
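The idea of checking the header length once, at the start of the transmit path, can be sketched in plain C. This is an illustration under assumptions: the `toy_*` names are hypothetical, a flat byte buffer stands in for an skb, and the 20- and 40-byte figures are the fixed IPv4 and IPv6 header sizes; the helper added by the patch itself is not reproduced here.

```c
/* Toy illustration of validating header length before touching the header.
 * NOT kernel code: toy_* names are hypothetical, and a flat buffer replaces
 * the skb that the real transmit functions operate on. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_P_IP     0x0800
#define TOY_ETH_P_IPV6   0x86DD
#define TOY_IPV4_HDR_LEN 20   /* minimum IPv4 header */
#define TOY_IPV6_HDR_LEN 40   /* fixed IPv6 header */

/* True if the buffer is long enough to hold the header its protocol implies. */
bool toy_inet_may_pull(uint16_t proto, size_t len)
{
    size_t need;

    switch (proto) {
    case TOY_ETH_P_IP:
        need = TOY_IPV4_HDR_LEN;
        break;
    case TOY_ETH_P_IPV6:
        need = TOY_IPV6_HDR_LEN;
        break;
    default:
        need = 0;             /* no inner IP header to inspect */
        break;
    }
    return len >= need;
}

/* Stand-in for a tunnel transmit function: returns the IP version nibble,
 * 0 for non-IP payloads, or -1 if the packet is truncated and dropped. */
int toy_tunnel_xmit(uint16_t proto, const uint8_t *pkt, size_t len)
{
    if (!toy_inet_may_pull(proto, len))
        return -1;            /* drop before any header access */
    if (proto != TOY_ETH_P_IP && proto != TOY_ETH_P_IPV6)
        return 0;
    return pkt[0] >> 4;
}
```

A truncated packet, such as one injected through a packet socket, is rejected before the first header byte is read.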
    • sock: Make sock->sk_stamp thread-safe · 3a0ed3e9
      Deepa Dinamani authored
      Al Viro mentioned (Message-ID
      <20170626041334.GZ10672@ZenIV.linux.org.uk>)
      that there is probably a race condition
      lurking in accesses of sk_stamp on 32-bit machines.
      
      sock->sk_stamp is of type ktime_t which is always an s64.
      On a 32 bit architecture, we might run into situations of
      unsafe access as the access to the field becomes non atomic.
      
      Use seqlocks for synchronization.
      This allows us to avoid using spinlocks for readers as
      readers do not need mutual exclusion.
      
      Another approach to solve this is to require sk_lock for all
      modifications of the timestamps. The current approach allows
      for timestamps to have their own lock: sk_stamp_lock.
      This allows for the patch to not compete with already
      existing critical sections, and side effects are limited
      to the paths in the patch.
      
      The addition of the new field maintains the data locality
      optimizations from
      commit 9115e8cd ("net: reorganize struct sock for better data
      locality")
      
      Note that all the instances of the sk_stamp accesses
      are either through the ioctl or the syscall recvmsg.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
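Why a sequence lock helps here can be shown with a userspace sketch in C11 atomics. It is a model under assumptions, not the kernel's seqlock API: the `toy_*` names are hypothetical, the 64-bit stamp is deliberately stored as two 32-bit halves to mimic a 32-bit machine, and only a single writer is supported (a real seqlock also serializes writers).

```c
/* Userspace sketch of a single-writer sequence lock guarding a 64-bit stamp
 * that is stored as two 32-bit halves. NOT the kernel seqlock API; all
 * toy_* names are hypothetical. */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct toy_stamp {
    atomic_uint seq;          /* even: stable, odd: write in progress */
    _Atomic uint32_t lo, hi;  /* the two halves of the 64-bit value */
};

/* Writer: make seq odd, update both halves, make seq even again. */
void toy_stamp_write(struct toy_stamp *s, int64_t v)
{
    unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

    atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->lo, (uint32_t)v, memory_order_relaxed);
    atomic_store_explicit(&s->hi, (uint32_t)((uint64_t)v >> 32),
                          memory_order_relaxed);
    atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader: takes no lock; retries if a write overlapped the two loads. */
int64_t toy_stamp_read(struct toy_stamp *s)
{
    unsigned int begin, end;
    uint32_t lo, hi;

    do {
        begin = atomic_load_explicit(&s->seq, memory_order_acquire);
        lo = atomic_load_explicit(&s->lo, memory_order_relaxed);
        hi = atomic_load_explicit(&s->hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        end = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((begin & 1u) || begin != end);

    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

Readers never block the writer and never see a torn value: if the counter was odd or changed between the two reads, the halves may belong to different stamps, so the reader simply tries again.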
  6. 29 Dec, 2018 2 commits
  7. 25 Dec, 2018 1 commit
    • net: dccp: fix kernel crash on module load · c92c81df
      Peter Oskolkov authored
      Patch eedbbb0d "net: dccp: initialize (addr,port) ..."
      added calling to inet_hashinfo2_init() from dccp_init().
      
      However, inet_hashinfo2_init() is marked as __init(), and
      thus the kernel panics when dccp is loaded as module. Removing
      __init() tag from inet_hashinfo2_init() is not feasible because
      it calls into __init functions in mm.
      
      This patch adds inet_hashinfo2_init_mod() function that can
      be called after the init phase is done; changes dccp_init() to
      call the new function; un-marks inet_hashinfo2_init() as
      exported.
      
      Fixes: eedbbb0d ("net: dccp: initialize (addr,port) ...")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 21 Dec, 2018 6 commits
    • net: seg6.h: remove an unused #include · a6ae520d
      Peter Oskolkov authored
      A minor code cleanup.
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: netns: shrink netns_ct struct · 8527f9df
      Florian Westphal authored
      remove the obsolete sysctl anchors and move auto_assign_helper_warned
      to avoid/cover a hole.  Reduces size by 40 bytes on 64 bit.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: remove empty pernet fini stubs · fc3893fd
      Florian Westphal authored
      after moving sysctl handling into single place, the init functions
      can't fail anymore and some of the fini functions are empty.
      
      Remove them and change return type to void.
      This also simplifies error unwinding in conntrack module init path.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: un-export seq_print_acct · 4b216e21
      Florian Westphal authored
      Only one caller, just place it where it's needed.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: udp: only extend timeout to stream mode after 2s · d535c8a6
      Florian Westphal authored
      Currently DNS resolvers that send both A and AAAA queries from same source port
      can trigger stream mode prematurely, which results in non-early-evictable conntrack entry
      for three minutes, even though DNS requests are done in a few milliseconds.
      
      Add a two second grace period where we continue to use the ordinary
      30-second default timeout.  It's enough for DNS request/response traffic,
      even if two request/reply packets are involved.
      
      ASSURED is still set, else conntrack (and thus a possible
      NAT mapping ...) gets zapped too in case conntrack table runs full.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
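The timeout rule described in this commit message fits in a few lines of C. The sketch below is a model under assumptions: the `toy_*` names and the seconds-based clock are hypothetical, and the 30-second, two-second and three-minute values are taken from the message text rather than from the kernel source.

```c
/* Toy model of the UDP timeout selection described above; NOT the
 * conntrack code. toy_* names are hypothetical, and the constants come
 * from the commit message text. */
#include <assert.h>
#include <stdbool.h>

#define TOY_UDP_TIMEOUT_S        30   /* ordinary default timeout */
#define TOY_UDP_STREAM_TIMEOUT_S 180  /* stream mode: three minutes */
#define TOY_STREAM_GRACE_S       2    /* grace period before stream mode */

struct toy_udp_ct {
    long created_s;    /* when the entry was created */
    bool seen_reply;   /* traffic seen in both directions */
    bool assured;      /* protects the entry from early eviction */
};

/* Pick the timeout to apply when a packet refreshes the entry. */
long toy_udp_timeout(struct toy_udp_ct *ct, long now_s)
{
    if (!ct->seen_reply)
        return TOY_UDP_TIMEOUT_S;

    /* ASSURED is still set as soon as a reply has been seen. */
    ct->assured = true;

    /* Only flows still active after the grace period get stream mode. */
    if (now_s - ct->created_s > TOY_STREAM_GRACE_S)
        return TOY_UDP_STREAM_TIMEOUT_S;
    return TOY_UDP_TIMEOUT_S;
}
```

A DNS exchange that finishes within milliseconds therefore keeps the short timeout, while a flow that is still exchanging packets after two seconds is promoted to the long one.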
    • bpf: sk_msg, sock{map|hash} redirect through ULP · 0608c69c
      John Fastabend authored
      A sockmap program that redirects through a kTLS ULP enabled socket
      will not work correctly because the ULP layer is skipped. This
      fixes the behavior to call through the ULP layer on redirect to
      ensure any operations required on the data stream at the ULP layer
      continue to be applied.
      
      To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
      calling the BPF layer on a redirected message. This is
      required to avoid calling the BPF layer multiple times (possibly
      recursively) which is not the current/expected behavior without
      ULPs. In the future we may add a redirect flag if users _do_
      want the policy applied again but this would need to work for both
      ULP and non-ULP sockets and be opt-in to avoid breaking existing
      programs.
      
      Also to avoid polluting the flag space with an internal flag we
      reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
      MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
      SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
      to verify is user space API is masked correctly to ensure the flag
      can not be set by user. (Note this needs to be true regardless
      because we have internal flags already in-use that user space
      should not be able to set). But for completeness we have two UAPI
      paths into sendpage, sendfile and splice.
      
      In the sendfile case the function do_sendfile() zeroes the flags,
      
      ./fs/read_write.c:
       static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
      		   	    size_t count, loff_t max)
       {
         ...
         fl = 0;
      #if 0
         /*
          * We need to debate whether we can enable this or not. The
          * man page documents EAGAIN return for the output at least,
          * and the application is arguably buggy if it doesn't expect
          * EAGAIN on a non-blocking file descriptor.
          */
          if (in.file->f_flags & O_NONBLOCK)
      	fl = SPLICE_F_NONBLOCK;
      #endif
          file_start_write(out.file);
          retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
       }
      
      In the splice case the pipe_to_sendpage "actor" is used which
      masks flags with SPLICE_F_MORE.
      
      ./fs/splice.c:
       static int pipe_to_sendpage(struct pipe_inode_info *pipe,
      			    struct pipe_buffer *buf, struct splice_desc *sd)
       {
         ...
         more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
         ...
       }
      
      This confirms what we expect: internal flags are in fact internal
      to the socket side.
      
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  9. 20 Dec, 2018 9 commits
  10. 18 Dec, 2018 9 commits