1. 10 Jan, 2017 5 commits
  2. 06 Jan, 2017 1 commit
    • xfrm: state: do not acquire lock in get_mtu helpers · b3b73b8e
      Florian Westphal committed
      Once flow cache gets removed the mtu initialisation happens for every skb
      that gets an xfrm attached, so this lock starts to show up in perf.
      
      It is not obvious why this lock is required -- the caller holds a
      reference on the state struct, and type->destructor is only called from
      the state gc worker (all state structs on the gc list must have refcount 0).
      
      xfrm_init_state has already been called (else the private data accessed
      by type->get_mtu() would not be set up).
      
      So just remove the lock -- the race on the state (DEAD?) doesn't
      matter (could change right after dropping the lock too).
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
  3. 04 Jan, 2017 1 commit
  4. 31 Dec, 2016 1 commit
    • net: Allow IP_MULTICAST_IF to set index to L3 slave · 7bb387c5
      David Ahern committed
      IP_MULTICAST_IF fails if sk_bound_dev_if is already set and the new index
      does not match it. e.g.,
      
          ntpd[15381]: setsockopt IP_MULTICAST_IF 192.168.1.23 fails: Invalid argument
      
      Relax the check in setsockopt to allow setting mc_index to an L3 slave if
      sk_bound_dev_if points to an L3 master.
      
      Make a similar change for IPv6. In this case change the device lookup to
      take the rcu_read_lock avoiding a refcnt. The rcu lock is also needed for
      the lookup of a potential L3 master device.
      
      This really only silences a setsockopt failure since uses of mc_index are
      secondary to sk_bound_dev_if if it is set. In both cases, if either index
      is an L3 slave or master, lookups are directed to the same FIB table so
      relaxing the check at setsockopt time causes no harm.
      
      Patch is based on a suggested change by Darwin for a problem noted in
      their code base.
      Suggested-by: Darwin Dingel <darwin.dingel@alliedtelesis.co.nz>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
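The relaxed check described in this commit can be sketched as a small standalone predicate. This is a simplified userspace model, not the kernel code: the function name `mc_index_allowed` and its parameters are hypothetical, and the L3 master ifindex of the new device is passed in directly (0 meaning the device has no L3 master).

```c
/*
 * Simplified model of the setsockopt-time check (hypothetical names).
 *
 * bound_dev_if:       ifindex the socket is bound to, 0 if unbound
 * new_ifindex:        ifindex requested via IP_MULTICAST_IF
 * new_master_ifindex: ifindex of the L3 master that new_ifindex is
 *                     enslaved to, 0 if it has none
 *
 * Returns 1 if the request would be accepted, 0 if it would be
 * rejected (the "Invalid argument" case).
 */
static int mc_index_allowed(int bound_dev_if, int new_ifindex,
			    int new_master_ifindex)
{
	/* An unbound socket may pick any multicast interface. */
	if (!bound_dev_if)
		return 1;

	/* Same device as the bound one: always fine. */
	if (new_ifindex == bound_dev_if)
		return 1;

	/* Relaxed case: the socket is bound to the L3 master of the new device. */
	return new_master_ifindex != 0 && new_master_ifindex == bound_dev_if;
}
```

With a socket bound to a master with ifindex 10, a slave with ifindex 3 under that master is accepted in this model, while an unrelated device is still rejected.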
  5. 30 Dec, 2016 6 commits
  6. 29 Dec, 2016 6 commits
  7. 28 Dec, 2016 3 commits
  8. 27 Dec, 2016 1 commit
    • net, sched: fix soft lockup in tc_classify · 628185cf
      Daniel Borkmann committed
      Shahar reported a soft lockup in tc_classify(), where we run into an
      endless loop when walking the classifier chain due to tp->next == tp
      which is a state we should never run into. The issue only seems to
      trigger under load in the tc control path.
      
      What happens is that in tc_ctl_tfilter(), thread A allocates a new
      tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
      with it. In that classifier callback we had to unlock/lock the rtnl
      mutex and returned with -EAGAIN. One reason why we need to drop there
      is, for example, that we need to request an action module to be loaded.
      
      This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
      after we loaded and found the requested action, we need to redo the
      whole request so we don't race against others. While we had to unlock
      rtnl in that time, thread B's request was processed next on that CPU.
      Thread B added a new tp instance successfully to the classifier chain.
      When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
      and destroying its tp instance which never got linked, we goto replay
      and redo A's request.
      
      This time when walking the classifier chain in tc_ctl_tfilter() for
      checking for existing tp instances we had a priority match and found
      the tp instance that was created and linked by thread B. Now calling
      again into tp->ops->change() with that tp was successful and returned
      without error.
      
      tp_created was never cleared in the second round, thus the kernel thinks
      that we need to link it into the classifier chain (once again). tp and
      *back point to the same object due to the match we had earlier on. Thus
      for thread B's already public tp, we reset tp->next to tp itself and
      link it into the chain, which eventually causes the mentioned endless
      loop in tc_classify() once a packet hits the data path.
      
      Fix is to clear tp_created at the beginning of each request, also when
      we replay it. On the paths that can cause -EAGAIN we already destroy
      the original tp instance we had and on replay we really need to start
      from scratch. It seems that this issue was first introduced in commit
      12186be7 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
      and avoid kernel panic when we use cls_cgroup").
      
      Fixes: 12186be7 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup")
      Reported-by: Shahar Klein <shahark@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Shahar Klein <shahark@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
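The self-loop described in this commit can be reproduced in a tiny userspace model. This is only a sketch under simplifying assumptions: `struct tp`, `link_tp()` and `replay_creates_self_loop()` are hypothetical stand-ins, there is no RCU or rtnl locking, and the chain holds a single entry.

```c
#include <stddef.h>

/* Minimal stand-in for a classifier chain entry (hypothetical). */
struct tp {
	int prio;
	struct tp *next;
};

/* Link tp in front of whatever *back currently points to. */
static void link_tp(struct tp **back, struct tp *tp)
{
	tp->next = *back;
	*back = tp;
}

/*
 * Model of the replayed request. Thread A's first attempt set
 * tp_created = 1 and failed with -EAGAIN; meanwhile thread B linked
 * its own tp. On replay, A finds B's tp, so tp == *back.
 *
 * clear_on_replay selects between the old behaviour (0) and the
 * fix (1). Returns 1 if the chain ends up with tp->next == tp.
 */
static int replay_creates_self_loop(int clear_on_replay)
{
	struct tp b = { .prio = 1, .next = NULL };
	struct tp *chain = NULL;
	struct tp **back = &chain;
	struct tp *tp;
	int tp_created = 1;		/* stale value from A's first attempt */

	link_tp(back, &b);		/* thread B's tp is now public */

	if (clear_on_replay)
		tp_created = 0;		/* the fix: start from scratch */

	tp = *back;			/* priority match finds B's tp */
	if (tp_created)
		link_tp(back, tp);	/* sets tp->next = tp */

	return chain->next == chain;
}
```

In this model the stale flag yields a one-element cycle, which is the state that makes a chain walk spin forever; clearing the flag on replay leaves the chain intact.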
  9. 26 Dec, 2016 3 commits
    • ktime: Get rid of ktime_equal() · 1f3a8e49
      Thomas Gleixner committed
      No point in going through loops and hoops instead of just comparing the
      values.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • ktime: Cleanup ktime_set() usage · 8b0e1953
      Thomas Gleixner committed
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where the seconds and nanoseconds parts of a time
      value need to be converted. For anything where the seconds argument is 0,
      this is pointless and can be replaced with a simple assignment.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
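A rough userspace illustration of why ktime_set(0, N) is redundant once ktime_t is a scalar nanosecond count. The `ktime_set()` below is a simplified model (the kernel helper additionally saturates very large second values), and `timeout_old()` / `timeout_new()` are hypothetical call sites invented for the example.

```c
#include <stdint.h>

typedef int64_t ktime_t;		/* stand-in for the kernel's s64 typedef */

#define NSEC_PER_SEC 1000000000LL

/* Simplified model: combine seconds and nanoseconds into scalar ns. */
static inline ktime_t ktime_set(int64_t secs, unsigned long nsecs)
{
	return secs * NSEC_PER_SEC + (int64_t)nsecs;
}

/* Old style: a helper call even though the seconds part is 0. */
static ktime_t timeout_old(void)
{
	return ktime_set(0, 500000);
}

/* New style: plain assignment of the nanosecond value. */
static ktime_t timeout_new(void)
{
	return 500000;
}
```

Both functions return the same value; the helper only earns its keep when a non-zero seconds part has to be folded in, as in `ktime_set(2, 5)`.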
    • ktime: Get rid of the union · 2456e855
      Thomas Gleixner committed
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64-bit machines and in an endianness-optimized
      timespec variant on 32-bit machines. The Y2038 cleanup removed the timespec
      variant and switched everything to scalar nanoseconds. The union remained,
      but became completely pointless.
      
      Get rid of the union and just keep ktime_t as a simple typedef of type s64.
      
      The conversion was done with coccinelle and some manual mopping up.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
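The effect of this conversion on callers can be sketched in userspace as follows. The type and function names (`union ktime_old`, `ktime_new_t`, `old_before()`, `new_before()`) are hypothetical and only model the shape of the change: a single-member union whose field has to be spelled out, versus a plain 64-bit scalar.

```c
#include <stdint.h>

/* Before: the scalar nanosecond value wrapped in a one-member union. */
union ktime_old {
	int64_t tv64;
};

/* After: a plain scalar typedef. */
typedef int64_t ktime_new_t;

/* Callers had to reach into the union member for every operation. */
static int old_before(union ktime_old a, union ktime_old b)
{
	return a.tv64 < b.tv64;
}

/* With the typedef, ordinary integer operators apply directly. */
static int new_before(ktime_new_t a, ktime_new_t b)
{
	return a < b;
}
```

A mechanical rewrite of `x.tv64` into `x` is the kind of transformation that a semantic patch tool such as coccinelle handles well, which fits the note in the commit message.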
  10. 25 Dec, 2016 1 commit
  11. 24 Dec, 2016 8 commits
    • tipc: don't send FIN message from connectionless socket · 693c5649
      Jon Paul Maloy committed
      In commit 6f00089c ("tipc: remove SS_DISCONNECTING state") the
      check for socket type is in the wrong place, causing a closing socket
      to always send out a FIN message even when the socket was never
      connected. This is normally harmless, since the destination node for
      such messages most often is zero, and the message will be dropped, but
      it is still a wrong and confusing behavior.
      
      We fix this in this commit.
      Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: fix recovering from 0 win with small data chunks · 1636098c
      Marcelo Ricardo Leitner committed
      Currently, if SCTP closes the receive window under window pressure, mostly
      caused by excessive skb overhead in the payload/overhead ratio, it closes
      the window abruptly while saving the delta in rwnd_press. It starts
      recovering rwnd as chunks are consumed by the application, and rwnd_press
      is only recovered after rwnd reaches the same value as rwnd_press, mostly
      to prevent silly window syndrome.
      
      Thing is, this is very inefficient with small data chunks: with those,
      rwnd never gets back to that value, and thus never recovers from such
      pressure. This means that we will not issue window updates when
      recovering from 0 window and will rely on a sender retransmit to notice
      it.
      
      The fix here is to remove such threshold, as no value is good enough: it
      depends on the (avg) chunk sizes being used.
      
      Test with netperf -t SCTP_STREAM -- -m 1, and trigger 0 window by
      sending SIGSTOP to netserver, sleep 1.2, and SIGCONT.
      Rate limited to 845kbps, for visibility. Capture done at netserver side.
      
      Previously:
      01.500751 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632372996] [a_rwnd 99153] [
      01.500752 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632372997] [SID: 0] [SS
      01.517471 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
      01.517483 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
      01.517485 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373083] [SID: 0] [SS
      01.517488 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
      01.534168 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373096] [SID: 0] [SS
      01.534180 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
      01.534181 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373169] [SID: 0] [SS
      01.534185 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
      02.525978 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
      02.526021 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
        (window update missed)
      04.573807 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
      04.779370 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373082] [a_rwnd 859] [#g
      04.789162 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373083] [SID: 0] [SS
      04.789323 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373156] [SID: 0] [SS
      04.789372 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373228] [a_rwnd 786] [#g
      
      After:
      02.568957 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098728] [a_rwnd 99153]
      02.568961 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098729] [SID: 0] [S
      02.585631 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
      02.585666 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
      02.585671 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098815] [SID: 0] [S
      02.585683 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
      02.602330 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098828] [SID: 0] [S
      02.602359 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
      02.602363 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098901] [SID: 0] [S
      02.602372 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
      03.600788 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
      03.600830 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
      03.619455 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 13508]
      03.619479 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 27017]
      03.619497 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 40526]
      03.619516 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 54035]
      03.619533 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 67544]
      03.619552 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 81053]
      03.619570 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 94562]
        (following data transmission triggered by window updates above)
      03.633504 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
      03.836445 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098814] [a_rwnd 100000]
      03.843125 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098815] [SID: 0] [S
      03.843285 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098888] [SID: 0] [S
      03.843345 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098960] [a_rwnd 99894]
      03.856546 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098961] [SID: 0] [S
      03.866450 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490099011] [SID: 0] [S
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: do not loose window information if in rwnd_over · 58b94d88
      Marcelo Ricardo Leitner committed
      It's possible that we receive a packet that is larger than the current
      window. The first such packet increases rwnd_over. Then, if we receive
      another data chunk (especially as SCTP allows you to have one data chunk
      in flight even during 0 window), rwnd_over will be overwritten instead of
      added to.
      
      In the long run, this could cause the window to grow bigger than its
      initial size, as rwnd_over would be charged only for the last received
      data chunk while the code will try to open the window for all packets that
      were received and had their value in rwnd_over overwritten. This, then,
      can worsen the payload/buffer ratio and cause rwnd_press to kick in more
      often.
      
      The fix is to sum it too, same as is done for rwnd_press, so that if we
      receive 3 chunks after closing the window, we still have to release that
      same amount before re-opening it.
      
      Log snippet from sctp_test exhibiting the issue:
      [  146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
      rwnd decreased by 1 to (0, 1, 114221)
      [  146.209232] sctp: sctp_assoc_rwnd_decrease:
      association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
      [  146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
      rwnd decreased by 1 to (0, 1, 114221)
      [  146.209232] sctp: sctp_assoc_rwnd_decrease:
      association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
      [  146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
      rwnd decreased by 1 to (0, 1, 114221)
      [  146.209232] sctp: sctp_assoc_rwnd_decrease:
      association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
      [  146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
      rwnd decreased by 1 to (0, 1, 114221)
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
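The difference between overwriting and accumulating rwnd_over can be shown with a small userspace model. This is a sketch only: `struct win`, `rwnd_decrease()` and `over_after_three_chunks()` are hypothetical names, and the model ignores rwnd_press and everything else the real accounting does.

```c
/* Minimal model of the two counters involved (hypothetical struct). */
struct win {
	unsigned int rwnd;	/* currently advertised window */
	unsigned int rwnd_over;	/* bytes accepted beyond the window */
};

/*
 * Charge a received chunk of len bytes against the window.
 * accumulate == 0 models the old behaviour (overwrite rwnd_over),
 * accumulate == 1 models the fix (add to rwnd_over).
 */
static void rwnd_decrease(struct win *w, unsigned int len, int accumulate)
{
	if (w->rwnd >= len) {
		w->rwnd -= len;
		return;
	}

	if (accumulate)
		w->rwnd_over += len - w->rwnd;
	else
		w->rwnd_over = len - w->rwnd;
	w->rwnd = 0;
}

/* Receive three 1-byte chunks while the window is already closed. */
static unsigned int over_after_three_chunks(int accumulate)
{
	struct win w = { .rwnd = 0, .rwnd_over = 0 };
	int i;

	for (i = 0; i < 3; i++)
		rwnd_decrease(&w, 1, accumulate);

	return w.rwnd_over;
}
```

In the overwrite variant the counter stays at 1 no matter how many chunks arrive, which matches the repeated `rwnd_over:1` in the log snippet; the accumulating variant ends at 3, so all three chunks must be released before the window reopens.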
    • neigh: Send netevent after marking neigh as dead · 53f800e3
      Ido Schimmel committed
      neigh_cleanup_and_release() is always called after marking a neighbour
      as dead, but it only notifies user space and not in-kernel listeners of
      the netevent notification chain.
      
      This can cause multiple problems. In my specific use case, it causes the
      listener (a switch driver capable of L3 offloads) to believe a neighbour
      entry is still valid, and is thus erroneously kept in the device's
      table.
      
      Fix that by sending a netevent after marking the neighbour as dead.
      
      Fixes: a6bf9e93 ("mlxsw: spectrum_router: Offload neighbours based on NUD state change")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: handle -EFAULT from skb_copy_bits · a98f9175
      Dave Jones committed
      By setting certain socket options on ipv6 raw sockets, we can confuse the
      length calculation in rawv6_push_pending_frames triggering a BUG_ON.
      
      RIP: 0010:[<ffffffff817c6390>] [<ffffffff817c6390>] rawv6_sendmsg+0xc30/0xc40
      RSP: 0018:ffff881f6c4a7c18  EFLAGS: 00010282
      RAX: 00000000fffffff2 RBX: ffff881f6c681680 RCX: 0000000000000002
      RDX: ffff881f6c4a7cf8 RSI: 0000000000000030 RDI: ffff881fed0f6a00
      RBP: ffff881f6c4a7da8 R08: 0000000000000000 R09: 0000000000000009
      R10: ffff881fed0f6a00 R11: 0000000000000009 R12: 0000000000000030
      R13: ffff881fed0f6a00 R14: ffff881fee39ba00 R15: ffff881fefa93a80
      
      Call Trace:
       [<ffffffff8118ba23>] ? unmap_page_range+0x693/0x830
       [<ffffffff81772697>] inet_sendmsg+0x67/0xa0
       [<ffffffff816d93f8>] sock_sendmsg+0x38/0x50
       [<ffffffff816d982f>] SYSC_sendto+0xef/0x170
       [<ffffffff816da27e>] SyS_sendto+0xe/0x10
       [<ffffffff81002910>] do_syscall_64+0x50/0xa0
       [<ffffffff817f7cbc>] entry_SYSCALL64_slow_path+0x25/0x25
      
      Handle by jumping to the failure path if skb_copy_bits gets an EFAULT.
      
      Reproducer:
      
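      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/types.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      
      #define LEN 504
      
      int main(int argc, char* argv[])
      {
      	int fd;
      	int zero = 0;
      	char buf[LEN];
      
      	/* All-zero buffer, reused as option data, payload and address. */
      	memset(buf, 0, LEN);
      
      	/* Raw IPv6 socket with protocol number 7. */
      	fd = socket(AF_INET6, SOCK_RAW, 7);
      
      	/* Checksum offset 0, then LEN bytes of destination options. */
      	setsockopt(fd, SOL_IPV6, IPV6_CHECKSUM, &zero, 4);
      	setsockopt(fd, SOL_IPV6, IPV6_DSTOPTS, &buf, LEN);
      
      	/* Send one byte to the address taken from the zeroed buffer. */
      	sendto(fd, buf, 1, 0, (struct sockaddr *) buf, 110);
      
      	return 0;
      }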
      Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: fix IP(V6)_RECVORIGDSTADDR for udp sockets · 39b2dd76
      Willem de Bruijn committed
      Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within
      the packet. For sockets that have transport headers pulled, transport
      offset can be negative. Use signed comparison to avoid overflow.
      
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Reported-by: Nisar Jagabar <njagabar@cloudmark.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
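The signedness problem described in this commit comes from C's usual arithmetic conversions: when a negative int is compared with an unsigned int, it is converted to a very large unsigned value first. The sketch below models that with two hypothetical helpers, `ports_fit_unsigned()` and `ports_fit_signed()`; the 4 stands for the two 16-bit ports, and the explicit cast in the first helper spells out what the implicit conversion would do.

```c
/*
 * Does a 4-byte port range starting at transport_offset lie within a
 * packet of len bytes? transport_offset may be negative once the
 * transport header has been pulled.
 */

/* Unsigned comparison: a negative offset wraps to a huge value. */
static int ports_fit_unsigned(int transport_offset, unsigned int len)
{
	return !((unsigned int)(transport_offset + 4) > len);
}

/* Signed comparison: a negative offset stays negative. */
static int ports_fit_signed(int transport_offset, unsigned int len)
{
	return !(transport_offset + 4 > (int)len);
}
```

For an offset of -8 and a length of 100, the unsigned variant reports that the range does not fit, while the signed variant accepts it; for non-negative offsets the two agree.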
    • net/sched: cls_flower: Mandate mask when matching on flags · d9724772
      Or Gerlitz committed
      When matching on flags, we should require the user to provide the
      mask and avoid using an all-ones mask. Not doing so causes a match
      on flags provided without a mask to hit on the value being unset for
      all flags, which may not be what the user wanted to happen.
      
      Fixes: faa3ffce ('net/sched: cls_flower: Add support for matching on flags')
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reported-by: Paul Blakey <paulb@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: act_tunnel_key: Fix setting UDP dst port in metadata under IPv6 · dc594ecd
      Or Gerlitz committed
      The UDP dst port was passed to the helper function that sets the
      IPv6 IP tunnel meta-data in the wrong parameter order; fix that.
      
      Fixes: 75bfbca0 ('net/sched: act_tunnel_key: Add UDP dst port option')
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 23 Dec, 2016 1 commit
  13. 22 Dec, 2016 1 commit
  14. 21 Dec, 2016 2 commits