1. 04 7月, 2019 4 次提交
  2. 03 7月, 2019 3 次提交
    • X
      tipc: remove ub->ubsock checks · d2c3a4ba
      Xin Long 提交于
      Both tipc_udp_enable and tipc_udp_disable are called under rtnl_lock,
      ub->ubsock could never be NULL in tipc_udp_disable and cleanup_bearer,
      so remove the check.
      
      Also remove the one in tipc_udp_enable by adding "free" label.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2c3a4ba
    • S
      ipv4: Fix off-by-one in route dump counter without netlink strict checking · 885b8b4d
      Stefano Brivio 提交于
      In commit ee28906f ("ipv4: Dump route exceptions if requested") I
      added a counter of per-node dumped routes (including actual routes and
      exceptions), analogous to the existing counter for dumped nodes. Dumping
      exceptions means we need to also keep track of how many routes are dumped
      for each node: this would be just one route per node, without exceptions.
      
      When netlink strict checking is not enabled, we dump both routes and
      exceptions at the same time: the RTM_F_CLONED flag is not used as a
      filter. In this case, the per-node counter 'i_fa' is incremented by one
      to track the single dumped route, then also incremented by one for each
      exception dumped, and then stored as netlink callback argument as skip
      counter, 's_fa', to be used when a partial dump operation restarts.
      
      The per-node counter needs to be increased by one also when we skip a
      route (exception) due to a previous non-zero skip counter, because it
      needs to match the existing skip counter, if we are dumping both routes
      and exceptions. I missed this, and only incremented the counter, for
      regular routes, if the previous skip counter was zero. This means that,
      in case of a mixed dump, partial dump operations after the first one
      will start with a mismatching skip counter value, one less than expected.
      
      This means in turn that the first exception for a given node is skipped
      every time a partial dump operation restarts, if netlink strict checking
      is not enabled (iproute < 5.0).
      
      It turns out I didn't repeat the test in its final version, commit
      de755a85 ("selftests: pmtu: Introduce list_flush_ipv4_exception test
      case"), which also counts the number of route exceptions returned, with
      iproute2 versions < 5.0 -- I was instead using the equivalent of the IPv6
      test as it was before commit b964641e ("selftests: pmtu: Make
      list_flush_ipv6_exception test more demanding").
      
      Always increment the per-node counter by one if we previously dumped
      a regular route, so that it matches the current skip counter.
      
      Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      885b8b4d
    • D
      rxrpc: Fix uninitialized error code in rxrpc_send_data_packet() · 3427beb6
      David Howells 提交于
      With gcc 4.1:
      
          net/rxrpc/output.c: In function ‘rxrpc_send_data_packet’:
          net/rxrpc/output.c:338: warning: ‘ret’ may be used uninitialized in this function
      
      Indeed, if the first jump to the send_fragmentable label is made, and
      the address family is not handled in the switch() statement, ret will be
      used uninitialized.
      
      Fix this by BUG()'ing as is done in other places in rxrpc where internal
      support for future address families will need adding.  It should not be
      possible to reach this normally as the address families are checked
      up-front.
      
      Fixes: 5a924b89 ("rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs")
      Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3427beb6
  3. 02 7月, 2019 6 次提交
  4. 30 6月, 2019 4 次提交
  5. 29 6月, 2019 9 次提交
    • V
      taprio: Adjust timestamps for TCP packets · 54002066
      Vedang Patel 提交于
      When the taprio qdisc is running in "txtime offload" mode, it will
      set the launchtime value (in skb->tstamp) for all the packets which do
      not have the SO_TXTIME socket option. But, the TCP packets already have
      this value set and it indicates the earliest departure time represented
      in CLOCK_MONOTONIC clock.
      
      We need to respect the timestamp set by the TCP subsystem. So, convert
      this time to the clock which taprio is using and ensure that the packet
      is not transmitted before the deadline set by TCP.
      Signed-off-by: NVedang Patel <vedang.patel@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      54002066
    • V
      taprio: make clock reference conversions easier · 7ede7b03
      Vedang Patel 提交于
      Later in this series we will need to transform from
      CLOCK_MONOTONIC (used in TCP) to the clock reference used in TAPRIO.
      Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: NVedang Patel <vedang.patel@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ede7b03
    • V
      taprio: Add support for txtime-assist mode · 4cfd5779
      Vedang Patel 提交于
      Currently, we are seeing non-critical packets being transmitted outside of
      their timeslice. We can confirm that the packets are being dequeued at the
      right time. So, the delay is induced in the hardware side.  The most likely
      reason is the hardware queues are starving the lower priority queues.
      
      In order to improve the performance of taprio, we will be making use of the
      txtime feature provided by the ETF qdisc. For all the packets which do not
      have the SO_TXTIME option set, taprio will set the transmit timestamp (set
      in skb->tstamp) in this mode. TAPrio Qdisc will ensure that the transmit
      time for the packet is set to when the gate is open. If SO_TXTIME is set,
      the TAPrio qdisc will validate whether the timestamp (in skb->tstamp)
      occurs when the gate corresponding to skb's traffic class is open.
      
      Following two parameters added to support this mode:
      - flags: used to enable txtime-assist mode. Will also be used to enable
        other modes (like hardware offloading) later.
      - txtime-delay: This indicates the minimum time it will take for the packet
        to hit the wire. This is useful in determining whether we can transmit
      the packet in the remaining time if the gate corresponding to the packet is
      currently open.
      
      An example configuration for enabling txtime-assist:
      
      tc qdisc replace dev eth0 parent root handle 100 taprio \\
            num_tc 3 \\
            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
            queues 1@0 1@0 1@0 \\
            base-time 1558653424279842568 \\
            sched-entry S 01 300000 \\
            sched-entry S 02 300000 \\
            sched-entry S 04 400000 \\
            flags 0x1 \\
            txtime-delay 40000 \\
            clockid CLOCK_TAI
      
      tc qdisc replace dev $IFACE parent 100:1 etf skip_sock_check \\
            offload delta 200000 clockid CLOCK_TAI
      
      Note that all the traffic classes are mapped to the same queue.  This is
      only possible in taprio when txtime-assist is enabled. Also, note that the
      ETF Qdisc is enabled with offload mode set.
      
      In this mode, if the packet's traffic class is open and the complete packet
      can be transmitted, taprio will try to transmit the packet immediately.
      This will be done by setting skb->tstamp to current_time + the time delta
      indicated in the txtime-delay parameter. This parameter indicates the time
      taken (in software) for packet to reach the network adapter.
      
      If the packet cannot be transmitted in the current interval or if the
      packet's traffic is not currently transmitting, the skb->tstamp is set to
      the next available timestamp value. This is tracked in the next_launchtime
      parameter in the struct sched_entry.
      
      The behaviour w.r.t admin and oper schedules is not changed from what is
      present in software mode.
      
      The transmit time is already known in advance. So, we do not need the HR
      timers to advance the schedule and wakeup the dequeue side of taprio.  So,
      HR timer won't be run when this mode is enabled.
      Signed-off-by: NVedang Patel <vedang.patel@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4cfd5779
    • V
      taprio: Remove inline directive · 566af331
      Vedang Patel 提交于
      Remove inline directive from length_to_duration(). We will let the compiler
      make the decisions.
      Signed-off-by: NVedang Patel <vedang.patel@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      566af331
    • V
      taprio: calculate cycle_time when schedule is installed · 037be037
      Vedang Patel 提交于
      cycle time for a particular schedule is calculated only when it is first
      installed. So, it makes sense to just calculate it once right after the
      'cycle_time' parameter has been parsed and store it in cycle_time.
      Signed-off-by: NVedang Patel <vedang.patel@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      037be037
    • V
      etf: Add skip_sock_check · d14d2b20
      Vedang Patel 提交于
      Currently, etf expects a socket with SO_TXTIME option set for each packet
      it encounters. So, it will drop all other packets. But, in the future
      commits we are planning to add functionality where tstamp value will be set
      by another qdisc. Also, some packets which are generated from within the
      kernel (e.g. ICMP packets) do not have any socket associated with them.
      
      So, this commit adds support for skip_sock_check. When this option is set,
      etf will skip checking for a socket and other associated options for all
      skbs.
      Signed-off-by: NVedang Patel <vedang.patel@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d14d2b20
    • J
      net: sched: protect against stack overflow in TC act_mirred · e2ca070f
      John Hurley 提交于
      TC hooks allow the application of filters and actions to packets at both
      ingress and egress of the network stack. It is possible, with poor
      configuration, that this can produce loops whereby an ingress hook calls
      a mirred egress action that has an egress hook that redirects back to
      the first ingress etc. The TC core classifier protects against loops when
      doing reclassifies but there is no protection against a packet looping
      between multiple hooks and recursively calling act_mirred. This can lead
      to stack overflow panics.
      
      Add a per CPU counter to act_mirred that is incremented for each recursive
      call of the action function when processing a packet. If a limit is passed
      then the packet is dropped and CPU counter reset.
      
      Note that this patch does not protect against loops in TC datapaths. Its
      aim is to prevent stack overflow kernel panics that can be a consequence
      of such loops.
      Signed-off-by: NJohn Hurley <john.hurley@netronome.com>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2ca070f
    • J
      net: sched: refactor reinsert action · 720f22fe
      John Hurley 提交于
      The TC_ACT_REINSERT return type was added as an in-kernel only option to
      allow a packet ingress or egress redirect. This is used to avoid
      unnecessary skb clones in situations where they are not required. If a TC
      hook returns this code then the packet is 'reinserted' and no skb consume
      is carried out as no clone took place.
      
      This return type is only used in act_mirred. Rather than have the reinsert
      called from the main datapath, call it directly in act_mirred. Instead of
      returning TC_ACT_REINSERT, change the type to the new TC_ACT_CONSUMED
      which tells the caller that the packet has been stolen by another process
      and that no consume call is required.
      
      Moving all redirect calls to the act_mirred code is in preparation for
      tracking recursion created by act_mirred.
      Signed-off-by: NJohn Hurley <john.hurley@netronome.com>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      720f22fe
    • C
      ipv4: enable route flushing in network namespaces · 5cdda5f1
      Christian Brauner 提交于
      Tools such as vpnc try to flush routes when run inside network
      namespaces by writing 1 into /proc/sys/net/ipv4/route/flush. This
      currently does not work because flush is not enabled in non-initial
      network namespaces.
      Since routes are per network namespace it is safe to enable
      /proc/sys/net/ipv4/route/flush in there.
      
      Link: https://github.com/lxc/lxd/issues/4257Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5cdda5f1
  6. 28 6月, 2019 10 次提交
  7. 27 6月, 2019 4 次提交
    • N
      af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET · 89ed5b51
      Neil Horman 提交于
      When an application is run that:
      a) Sets its scheduler to be SCHED_FIFO
      and
      b) Opens a memory mapped AF_PACKET socket, and sends frames with the
      MSG_DONTWAIT flag cleared, its possible for the application to hang
      forever in the kernel.  This occurs because when waiting, the code in
      tpacket_snd calls schedule, which under normal circumstances allows
      other tasks to run, including ksoftirqd, which in some cases is
      responsible for freeing the transmitted skb (which in AF_PACKET calls a
      destructor that flips the status bit of the transmitted frame back to
      available, allowing the transmitting task to complete).
      
      However, when the calling application is SCHED_FIFO, its priority is
      such that the schedule call immediately places the task back on the cpu,
      preventing ksoftirqd from freeing the skb, which in turn prevents the
      transmitting task from detecting that the transmission is complete.
      
      We can fix this by converting the schedule call to a completion
      mechanism.  By using a completion queue, we force the calling task, when
      it detects there are no more frames to send, to schedule itself off the
      cpu until such time as the last transmitted skb is freed, allowing
      forward progress to be made.
      
      Tested by myself and the reporter, with good results
      
      Change Notes:
      
      V1->V2:
      	Enhance the sleep logic to support being interruptible and
      allowing for honoring to SK_SNDTIMEO (Willem de Bruijn)
      
      V2->V3:
      	Rearrage the point at which we wait for the completion queue, to
      avoid needing to check for ph/skb being null at the end of the loop.
      Also move the complete call to the skb destructor to avoid needing to
      modify __packet_set_status.  Also gate calling complete on
      packet_read_pending returning zero to avoid multiple calls to complete.
      (Willem de Bruijn)
      
      	Move timeo computation within loop, to re-fetch the socket
      timeout since we also use the timeo variable to record the return code
      from the wait_for_complete call (Neil Horman)
      
      V3->V4:
      	Willem has requested that the control flow be restored to the
      previous state.  Doing so lets us eliminate the need for the
      po->wait_on_complete flag variable, and lets us get rid of the
      packet_next_frame function, but introduces another complexity.
      Specifically, but using the packet pending count, we can, if an
      applications calls sendmsg multiple times with MSG_DONTWAIT set, each
      set of transmitted frames, when complete, will cause
      tpacket_destruct_skb to issue a complete call, for which there will
      never be a wait_on_completion call.  This imbalance will lead to any
      future call to wait_for_completion here to return early, when the frames
      they sent may not have completed.  To correct this, we need to re-init
      the completion queue on every call to tpacket_snd before we enter the
      loop so as to ensure we wait properly for the frames we send in this
      iteration.
      
      	Change the timeout and interrupted gotos to out_put rather than
      out_status so that we don't try to free a non-existant skb
      	Clean up some extra newlines (Willem de Bruijn)
      Reviewed-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Reported-by: NMatteo Croce <mcroce@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      89ed5b51
    • X
      sctp: change to hold sk after auth shkey is created successfully · 25bff6d5
      Xin Long 提交于
      Now in sctp_endpoint_init(), it holds the sk then creates auth
      shkey. But when the creation fails, it doesn't release the sk,
      which causes a sk defcnf leak,
      
      Here to fix it by only holding the sk when auth shkey is created
      successfully.
      
      Fixes: a29a5bd4 ("[SCTP]: Implement SCTP-AUTH initializations.")
      Reported-by: syzbot+afabda3890cc2f765041@syzkaller.appspotmail.com
      Reported-by: syzbot+276ca1c77a19977c0130@syzkaller.appspotmail.com
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NNeil Horman <nhorman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25bff6d5
    • N
      ipv6: fix neighbour resolution with raw socket · 2c6b55f4
      Nicolas Dichtel 提交于
      The scenario is the following: the user uses a raw socket to send an ipv6
      packet, destinated to a not-connected network, and specify a connected nh.
      Here is the corresponding python script to reproduce this scenario:
      
       import socket
       IPPROTO_RAW = 255
       send_s = socket.socket(socket.AF_INET6, socket.SOCK_RAW, IPPROTO_RAW)
       # scapy
       # p = IPv6(src='fd00:100::1', dst='fd00:200::fa')/ICMPv6EchoRequest()
       # str(p)
       req = b'`\x00\x00\x00\x00\x08:@\xfd\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xfd\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xfa\x80\x00\x81\xc0\x00\x00\x00\x00'
       send_s.sendto(req, ('fd00:175::2', 0, 0, 0))
      
      fd00:175::/64 is a connected route and fd00:200::fa is not a connected
      host.
      
      With this scenario, the kernel starts by sending a NS to resolve
      fd00:175::2. When it receives the NA, it flushes its queue and try to send
      the initial packet. But instead of sending it, it sends another NS to
      resolve fd00:200::fa, which obvioulsy fails, thus the packet is dropped. If
      the user sends again the packet, it now uses the right nh (fd00:175::2).
      
      The problem is that ip6_dst_lookup_neigh() uses the rt6i_gateway, which is
      :: because the associated route is a connected route, thus it uses the dst
      addr of the packet. Let's use rt6_nexthop() to choose the right nh.
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c6b55f4
    • N
      ipv6: constify rt6_nexthop() · 9b1c1ef1
      Nicolas Dichtel 提交于
      There is no functional change in this patch, it only prepares the next one.
      
      rt6_nexthop() will be used by ip6_dst_lookup_neigh(), which uses const
      variables.
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NNick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b1c1ef1