1. 18 February 2014 (12 commits)
  2. 17 February 2014 (6 commits)
    • packet: check for ndo_select_queue during queue selection · 0fd5d57b
      Authored by Daniel Borkmann
      Mathias reported that on an AMD Geode LX embedded board (ALiX)
      with the ath9k driver, PACKET_QDISC_BYPASS, introduced in commit
      d346a3fa ("packet: introduce PACKET_QDISC_BYPASS socket
      option"), triggers a WARN_ON() coming from the driver itself
      via 066dae93 ("ath9k: rework tx queue selection and fix
      queue stopping/waking").
      
      The reason this happens is that the ndo_select_queue() callback
      is not invoked from the direct xmit path, i.e. for the ieee80211
      subsystem, which sets the queue and TID (similar to an 802.1d tag)
      that is put into the frame through 802.11e (WMM, QoS). If that is
      not set, the pending frame counter for e.g. ath9k can get messed up.
      
      So the WARN_ON() in ath9k is absolutely legitimate. Generally,
      the hw queue selection in ieee80211 depends on the type of
      traffic, and priorities are set according to the ieee80211_ac_numbers
      mapping; this works in a similar way to DiffServ, only on a lower
      layer, so that the AP can favour frames that have "real-time"
      requirements like voice or video data frames.
      
      Therefore, check for the presence of ndo_select_queue() in the netdev
      ops and, if available, invoke it with __packet_pick_tx_queue() as the
      fallback handler, so that drivers such as bnx2x, ixgbe,
      or mlx4 can still select a hw queue for transmission in
      relation to the current CPU while e.g. the ieee80211 subsystem
      can make its own choice.
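      
      As a rough illustration, a minimal sketch of what such a queue-selection
      helper in af_packet could look like is given below; it assumes the
      fallback-handler signature from the later ndo_select_queue commit and is
      not the literal patch:
      
      	static u16 packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
      	{
      		const struct net_device_ops *ops = dev->netdev_ops;
      		u16 queue_index;
      
      		/* Prefer the driver's own selection (e.g. ieee80211 TID/AC
      		 * mapping), falling back to the CPU-based pick otherwise. */
      		if (ops->ndo_select_queue) {
      			queue_index = ops->ndo_select_queue(dev, skb, NULL,
      							    __packet_pick_tx_queue);
      			queue_index = netdev_cap_txqueue(dev, queue_index);
      		} else {
      			queue_index = __packet_pick_tx_queue(dev, skb);
      		}
      
      		return queue_index;
      	}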
      Reported-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0fd5d57b
    • netdevice: move netdev_cap_txqueue for shared usage to header · b9507bda
      Authored by Daniel Borkmann
      In order to allow users to invoke netdev_cap_txqueue, it needs to
      be moved into the netdevice.h header file. While at it, also add a
      kernel doc header to document the API.
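      
      For orientation, a hedged sketch of what such a capping helper typically
      looks like (the name comes from the text above, the body is an
      assumption rather than the verbatim kernel code):
      
      	static inline u16 netdev_cap_txqueue(struct net_device *dev, u16 queue_index)
      	{
      		if (unlikely(queue_index >= dev->real_num_tx_queues)) {
      			/* A buggy caller picked a non-existent queue: warn and
      			 * fall back to queue 0 instead of misbehaving later. */
      			net_warn_ratelimited("%s selects TX queue %d, but real number of TX queues is %d\n",
      					     dev->name, queue_index,
      					     dev->real_num_tx_queues);
      			return 0;
      		}
      		return queue_index;
      	}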
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9507bda
    • netdevice: add queue selection fallback handler for ndo_select_queue · 99932d4f
      Authored by Daniel Borkmann
      Add a new argument to the ndo_select_queue() callback that passes a
      fallback handler. This gets invoked through netdev_pick_tx(); the
      fallback handler is currently __netdev_pick_tx(), as most drivers
      invoke this function within their customized implementation for
      skbs that don't need any special handling. This fallback
      handler can then be replaced at other call sites with different
      queue selection methods (e.g. in packet sockets, pktgen etc).
      
      This also has the nice side effect that __netdev_pick_tx() is
      then only invoked from netdev_pick_tx(), and the export of that
      function to modules can be undone.
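      
      A hedged sketch of the resulting callback shape; the accel_priv argument
      is assumed to be the pre-existing one, so this is illustrative rather
      than the verbatim header:
      
      	typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
      					       struct sk_buff *skb);
      
      	/* in struct net_device_ops */
      	u16 (*ndo_select_queue)(struct net_device *dev,
      				struct sk_buff *skb,
      				void *accel_priv,
      				select_queue_fallback_t fallback);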
      Suggested-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      99932d4f
    • net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer · ef2820a7
      Authored by Matija Glavinic Pecotic
      The implementation of the (a_)rwnd calculation can lead to severe performance issues
      and to associations stalling completely. These problems are described below and a
      solution is proposed which improves lksctp's robustness in a congestion state.
      
      1) Sudden drop of a_rwnd and incomplete window recovery afterwards
      
      Data accounted in sctp_assoc_rwnd_decrease takes only the payload size (sctp data)
      into account, but the size of the sk_buff, which is charged against the receiver
      buffer, is not accounted for in rwnd. In theory this should not be a problem, as the
      actual buffer size is double the amount requested on the socket (SO_RCVBUF). The
      problem is that this scales badly for data smaller than sizeof(struct sk_buff).
      E.g. in 4G (LTE) networks, a link interfacing the radio side will carry a large
      portion of traffic of this size (less than 100B).
      
      An example of a sudden drop and incomplete window recovery is given below. Node B
      exhibits the problematic behavior. Node A initiates the association and B is
      configured to advertise an rwnd of 10000. A sends messages of size 43B (the size of
      a typical sctp message in a 4G (LTE) network). On B, data is left in the buffer by
      not reading the socket in userspace.
      
      Let's examine when we will hit the pressure state and declare rwnd to be 0 for a
      scenario with the parameters stated above (rwnd == 10000, chunk size == 43, each
      chunk sent in a separate sctp packet).
      
      Logic is implemented in sctp_assoc_rwnd_decrease:
      
      socket_buffer (see below) is the maximum size which can be held in the socket buffer
      (sk_rcvbuf). currently_alloced is the amount of data currently allocated (rx_count).
      
      A simple expression is derived below to determine after how many packets, with the
      above parameters, we enter the pressure state:
      
      We start with the condition that has to be met in order to enter the pressure state:
      
      	socket_buffer < currently_alloced;
      
      currently_alloced is represented as the size of the sctp packets received so far and
      not yet delivered to userspace. x is the number of chunks/packets (since there is no
      bundling, and each chunk is delivered in a separate packet, we can treat each chunk
      as an sctp packet and, importantly, as having its own sk_buff):
      
      	socket_buffer < x*each_sctp_packet;
      
      each_sctp_packet is the sctp chunk size + sizeof(struct sk_buff). socket_buffer is
      twice the initially requested size of the socket buffer, which in the case of sctp
      is twice the a_rwnd requested:
      
      	2*rwnd < x*(payload+sizeof(struct sk_buff));
      
      sizeof(struct sk_buff) is 190 (3.13.0-rc4+). As stated above, rwnd is 10000 and each
      payload is 43 bytes:
      
      	20000 < x(43+190);
      
      	x > 20000/233;
      
      	x ~> 84;
      
      After ~84 messages, the pressure state is entered and an rwnd of 0 is advertised,
      while only 84*43B ~= 3612B of sctp data has been received. This is why an external
      observer notices a sudden drop from 6474 to 0, as shown in the example below:
      
      IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
      IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
      IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
      IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
      <...>
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]
      
      --> Sudden drop
      
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      At this point, rwnd_press stores the current rwnd value so it can later be restored
      in sctp_assoc_rwnd_increase. This however doesn't happen, as the condition to start
      slowly increasing rwnd until rwnd_press is returned to rwnd is never met. The
      condition is not met since rwnd, after it hits 0, must first reach rwnd_press by
      adding the amount which is read from userspace. Let us look at the values in the
      above example. The initial a_rwnd is 10000, pressure was hit when rwnd was ~6500,
      and the amount of actual sctp data currently waiting to be delivered to userspace
      is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed
      only for the sctp data, which is ~3500. The condition is never met, and when
      userspace reads all data, rwnd stays at 3569.
      
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
      
      --> At this point userspace read everything, rwnd recovered only to 3569
      
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
      
      Reproduction is straightforward: it is enough for the sender to send packets of
      size less than sizeof(struct sk_buff) while the receiver keeps them in its buffers.
      
      2) Minute size window for associations sharing the same socket buffer
      
      In case multiple associations share the same socket, and the same socket buffer
      (sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
      of the associations can permanently drop the rwnd of other association(s).
      
      The situation will typically be observed as one association suddenly having its
      rwnd dropped to the size of the last packet received and never recovering beyond
      that point. Different scenarios lead to it, but all have in common that one of the
      associations (let it be the association from 1)) has nearly depleted the socket
      buffer, and the other association charges the socket buffer with just enough to
      trigger the pressure. This association will enter the pressure state, set
      rwnd_press and announce an rwnd of 0.
      When data is read by userspace, a situation similar to 1) occurs: rwnd will
      increase just by the size read by userspace, but rwnd_press will be high enough
      that the association doesn't have enough credit to reach rwnd_press and restore
      its previous state. This case is a special case of 1), being worse as there is, in
      the worst case, only one packet in the buffer by whose size rwnd will be increased.
      The consequence is an association with a very low maximum rwnd ('minute size', in
      our case down to 43B, the size of the packet which caused the pressure) and as such
      unusable.
      
      This scenario happened frequently in the field and in labs after a congestion state
      (link breaks, different probabilities of packet drop, packet reordering), with
      scenario 1) preceding it. A deterministic scenario for reproduction is given here:
      
      From node A, establish two associations on the same socket, with rcvbuf_policy
      set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
      repeat the scenario from 1), that is, bring rwnd down to 0 and restore it. Observe
      scenario 1). Use a small payload size (here we use 43). Once rwnd is 'recovered',
      bring it down close to 0, so that just one more packet would close it. As a
      consequence, association number 2 is able to receive (at least) one more packet,
      which will bring it into the pressure state. E.g. if association 2 had an rwnd of
      10000 and the packet received was 43, and we enter the pressure state at this
      point, rwnd_press will hold 9957. Once the payload is delivered to userspace, rwnd
      will increase by 43, but the conditions to restore rwnd to its original state,
      just as in 1), will never be satisfied.
      
      --> Association 1, between A.y and B.12345
      
      IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
      IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
      IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
      IP B.12345 > A.55915: sctp (1) [COOKIE ACK]
      
      --> Association 2, between A.z and B.12346
      
      IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
      IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
      IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
      IP B.12346 > A.55915: sctp (1) [COOKIE ACK]
      
      --> Deplete socket buffer by sending messages of size 43B over association 1
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      
      <...>
      
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]
      
      --> Sudden drop on 1
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Here userspace reads, rwnd 'recovered' to 3698; now deplete again using
          association 1 so there is room in the buffer for only one more packet
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
      
      --> The socket buffer is almost depleted, but there is space for one more packet;
          send it over association 2, size 43B
      
      IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Immediate drop
      
      IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Read everything from the socket; both associations recover up to the maximum
          rwnd they are capable of reaching. Note that association 1 recovered up to
          3698, and association 2 recovered only to 43
      
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
      IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
      IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
      
      A careful reader might wonder why it is necessary to reproduce 1) prior to
      reproducing 2). It is simply easier to observe when to send the packet over
      association 2 which will push the association into the pressure state.
      
      Proposed solution:
      
      Both problems share the same root cause, and that is improper scaling of the socket
      buffer with rwnd. A solution in which sizeof(struct sk_buff) is taken into account
      while calculating rwnd is not possible, due to the fact that there is no linear
      relationship between the amount of data charged on increase/decrease and the IP
      packet in which the payload arrived. Even if such a solution were followed, the
      complexity of the code would increase. Due to the nature of the current rwnd
      handling, the slow increase (in sctp_assoc_rwnd_increase) of rwnd after the
      pressure state is entered has a rationale, but it gives the sender a false
      representation of the current buffer space. Furthermore, it implements an
      additional congestion control mechanism which is defined by the implementation,
      and not by the standard.
      
      The proposed solution simplifies the whole algorithm, keeping in mind the definition from the RFC:
      
      o  Receiver Window (rwnd): This gives the sender an indication of the space
         available in the receiver's inbound buffer.
      
      Core of the proposed solution is given with these lines:
      
      sctp_assoc_rwnd_update:
      	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
      		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
      	else
      		asoc->rwnd = 0;
      
      We advertise to the sender (half of) the actual space we have. 'Half' is in
      parentheses because it depends on whether you regard the size of the socket buffer
      as the SO_RCVBUF value (the size visible from userspace) or as twice that amount
      (the size used in kernelspace).
      In this way the sender is given a good approximation of our buffer space,
      regardless of the buffer policy - we always advertise what we have. The proposed
      solution fixes the described problems and removes the need for the rwnd restoration
      algorithm. Finally, as the proposed solution is a simplification, some lines of
      code, along with some bytes in struct sctp_association, are saved.
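      
      For context, a hedged sketch of how such an update function might look once the
      per-association vs. shared buffer accounting from problem 2) is folded in; the
      rcvbuf_policy handling shown is an assumption based on the description above, not
      the verbatim patch:
      
      	static void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
      	{
      		int rx_count;
      
      		/* With a shared receive buffer (rcvbuf_policy == 0) the whole
      		 * socket's allocation counts; otherwise only this association's. */
      		if (asoc->ep->rcvbuf_policy)
      			rx_count = atomic_read(&asoc->rmem_alloc);
      		else
      			rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
      
      		/* Advertise (half of) the space actually left in the buffer. */
      		if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
      			asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
      		else
      			asoc->rwnd = 0;
      
      		/* A SACK advertising the new window could be sent here when
      		 * update_peer is set (omitted in this sketch). */
      	}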
      
      Version 2 of the patch addressed comments from Vlad. The name of the function was
      made more descriptive, and two parts of the code were changed: one removing a
      superfluous call to sctp_assoc_rwnd_update, since the call would not result in an
      update of rwnd, and the other reordering the code in such a way that the call to
      sctp_assoc_rwnd_update does update rwnd. Version 3 corrected the change introduced
      in v2 so that the existing function is not reordered/copied inline, but is
      correctly called. Thanks to Vlad for suggesting this.
      Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
      Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef2820a7
    • ipv4: distinguish EHOSTUNREACH from the ENETUNREACH · cd0f0b95
      Authored by Duan Jiong
      Since commit 251da413 ("ipv4: Cache ip_error() routes even when not forwarding."),
      the counter IPSTATS_MIB_INADDRERRORS can't work correctly, because the value of
      err was always set to ENETUNREACH.
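      
      To illustrate why the err value matters, here is a hedged sketch of the
      ip_error()-style accounting that consumes it (assumed shape, not the literal
      code touched by this fix):
      
      	switch (rt->dst.error) {
      	case EHOSTUNREACH:
      		IP_INC_STATS_BH(net, IPSTATS_MIB_INADDRERRORS);
      		break;
      	case ENETUNREACH:
      		IP_INC_STATS_BH(net, IPSTATS_MIB_INNOROUTES);
      		break;
      	}
      	/* If err is always ENETUNREACH, the INADDRERRORS branch is never taken. */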
      Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd0f0b95
    • dccp: re-enable debug macro · 09db3080
      Authored by Gerrit Renker
      dccp tfrc: revert
      
      This reverts 6aee49c5 ("dccp: make local variable static") since
      the variable tfrc_debug is referenced by the tfrc_pr_debug(fmt, ...)
      macro when TFRC debugging is enabled. If it is enabled, use of the
      macro produces a compilation error.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      09db3080
  3. 14 February 2014 (8 commits)
  4. 11 February 2014 (12 commits)
    • 6lowpan: fix lockdep splats · 20e7c4e8
      Authored by Eric Dumazet
      When a device's ndo_start_xmit() calls dev_queue_xmit() again,
      lockdep can complain because dev_queue_xmit() is re-entered and the
      spinlocks protecting the tx queues share a common lockdep class.
      
      The same issue was fixed for bonding/l2tp/ppp in the commits below; a sketch of the common fix pattern follows the list.
      
      0daa2303 ("[PATCH] bonding: lockdep annotation")
      49ee4920 ("bonding: set qdisc_tx_busylock to avoid LOCKDEP splat")
      23d3b8bf ("net: qdisc busylock needs lockdep annotations ")
      303c07db ("ppp: set qdisc_tx_busylock to avoid LOCKDEP splat ")
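      
      A hedged sketch of that fix pattern, applied to a 6lowpan device at init time
      (function and key names here are assumptions for illustration):
      
      	static struct lock_class_key lowpan_tx_busylock;
      	static struct lock_class_key lowpan_netdev_xmit_lock_key;
      
      	static void lowpan_set_lockdep_class_one(struct net_device *dev,
      						 struct netdev_queue *txq,
      						 void *_unused)
      	{
      		/* Give this device's tx queue locks their own lockdep class so
      		 * a nested dev_queue_xmit() is not reported as recursion. */
      		lockdep_set_class(&txq->_xmit_lock, &lowpan_netdev_xmit_lock_key);
      	}
      
      	static int lowpan_dev_init(struct net_device *dev)
      	{
      		netdev_for_each_tx_queue(dev, lowpan_set_lockdep_class_one, NULL);
      		dev->qdisc_tx_busylock = &lowpan_tx_busylock;
      		return 0;
      	}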
      Reported-by: Alexander Aring <alex.aring@gmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Alexander Aring <alex.aring@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      20e7c4e8
    • 9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers · b6f52ae2
      Authored by Richard Yao
      The 9p-virtio transport does zero copy on things larger than 1024 bytes
      in size. It accomplishes this by returning the physical addresses of
      pages to the virtio-pci device. At present, the translation is usually a
      bit shift.
      
      That approach produces an invalid page address when we read/write to
      vmalloc buffers, such as those used for Linux kernel modules. Any
      attempt to load a Linux kernel module from 9p-virtio produces the
      following stack.
      
      [<ffffffff814878ce>] p9_virtio_zc_request+0x45e/0x510
      [<ffffffff814814ed>] p9_client_zc_rpc.constprop.16+0xfd/0x4f0
      [<ffffffff814839dd>] p9_client_read+0x15d/0x240
      [<ffffffff811c8440>] v9fs_fid_readn+0x50/0xa0
      [<ffffffff811c84a0>] v9fs_file_readn+0x10/0x20
      [<ffffffff811c84e7>] v9fs_file_read+0x37/0x70
      [<ffffffff8114e3fb>] vfs_read+0x9b/0x160
      [<ffffffff81153571>] kernel_read+0x41/0x60
      [<ffffffff810c83ab>] copy_module_from_fd.isra.34+0xfb/0x180
      
      Subsequently, QEMU will die printing:
      
      qemu-system-x86_64: virtio: trying to map MMIO memory
      
      This patch enables 9p-virtio to correctly handle this case. This not
      only enables us to load Linux kernel modules off virtfs, but also
      enables ZFS file-based vdevs on virtfs to be used without killing QEMU.
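      
      A hedged sketch of the core idea, using a hypothetical helper name to show how
      the page should be looked up before handing addresses to virtio (the actual
      patch applies this where the scatter/gather list is packed):
      
      	/* Hypothetical helper: vmalloc memory is not in the linear map, so
      	 * virt_to_page() gives garbage there; use vmalloc_to_page() instead. */
      	static struct page *p9_data_to_page(void *data)
      	{
      		if (is_vmalloc_addr(data))
      			return vmalloc_to_page(data);
      		return virt_to_page(data);
      	}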
      
      Special thanks to both Avi Kivity and Alexander Graf for their
      interpretation of QEMU backtraces. Without their guidance, tracking down
      this bug would have taken much longer. Also, special thanks to Linus
      Torvalds for his insightful explanation of why this should use
      is_vmalloc_addr() instead of is_vmalloc_or_module_addr():
      
      https://lkml.org/lkml/2014/2/8/272
      Signed-off-by: Richard Yao <ryao@gentoo.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6f52ae2
    • tcp: tsq: fix nonagle handling · bf06200e
      Authored by John Ogness
      Commit 46d3ceab ("tcp: TCP Small Queues") introduced a possible
      regression for applications using TCP_NODELAY.
      
      If a TCP session is throttled because of tsq, we should consult
      tp->nonagle when TX completion is done and allow ourselves to send an
      additional segment, especially if this segment is not a full MSS.
      Otherwise this segment is only sent after an RTO.
      
      [edumazet] : Cooked the changelog, added another fix about testing
      sk_wmem_alloc twice because TX completion can happen right before
      setting TSQ_THROTTLED bit.
      
      This problem is particularly visible with recent auto corking,
      but might also be triggered with low tcp_limit_output_bytes
      values or NIC drivers delaying TX completion by hundreds of usec,
      and very low rtt.
      
      Thomas Glanzmann, for example, reported an iscsi regression caused
      by tcp auto corking, which made this bug quite visible.
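      
      A hedged sketch of the first idea: the tasklet-side handler pushes pending data
      while honouring tp->nonagle instead of waiting for the RTO. This assumes the
      handler can simply restart tcp_write_xmit(); the second fix (re-testing
      sk_wmem_alloc after setting TSQ_THROTTLED) lives in the xmit path and is not
      shown:
      
      	static void tcp_tsq_handler(struct sock *sk)
      	{
      		struct tcp_sock *tp = tcp_sk(sk);
      
      		/* Restart transmission honouring the nagle setting, so a
      		 * sub-MSS segment is not left waiting for an RTO. */
      		tcp_write_xmit(sk, tcp_current_mss(sk), tp->nonagle,
      			       0, GFP_ATOMIC);
      	}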
      
      Fixes: 46d3ceab ("tcp: TCP Small Queues")
      Signed-off-by: John Ogness <john.ogness@linutronix.de>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Thomas Glanzmann <thomas@glanzmann.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf06200e
    • bridge: Prevent possible race condition in br_fdb_change_mac_address · ac4c8868
      Authored by Toshiaki Makita
      br_fdb_change_mac_address() calls fdb_insert()/fdb_delete() without
      br->hash_lock.
      
      These hash list updates are racy with br_fdb_update()/br_fdb_cleanup().
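      
      A hedged sketch of the fix pattern: take br->hash_lock around the fdb hash
      updates (the argument values shown are illustrative assumptions):
      
      	spin_lock_bh(&br->hash_lock);
      	/* the fdb_delete()/fdb_insert() calls mentioned above go here,
      	 * so they cannot race with br_fdb_update()/br_fdb_cleanup() */
      	fdb_delete(br, f);
      	fdb_insert(br, NULL, newaddr, 0);
      	spin_unlock_bh(&br->hash_lock);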
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac4c8868
    • bridge: Properly check if local fdb entry can be deleted when deleting vlan · 424bb9c9
      Authored by Toshiaki Makita
      The vlan code unconditionally deletes local fdb entries.
      We should consider the possibility that other ports have the same
      address and vlan.
      
      Example of problematic case:
        ip link set eth0 address 12:34:56:78:90:ab
        ip link set eth1 address aa:bb:cc:dd:ee:ff
        brctl addif br0 eth0
        brctl addif br0 eth1 # br0 will have mac address 12:34:56:78:90:ab
        bridge vlan add dev eth0 vid 10
        bridge vlan add dev eth1 vid 10
        bridge vlan add dev br0 vid 10 self
      We will have fdb entry such that f->dst == eth0, f->vlan_id == 10 and
      f->addr == 12:34:56:78:90:ab at this time.
      Next, delete eth0 vlan 10.
        bridge vlan del dev eth0 vid 10
      In this case, we still need the entry for br0, but it will be deleted.
      
      Note that br0 needs the entry even though its mac address is not set
      manually. To delete the entry with proper condition checking,
      fdb_delete_local() is suitable to use.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      424bb9c9
    • bridge: Properly check if local fdb entry can be deleted in br_fdb_delete_by_port · a778e6d1
      Authored by Toshiaki Makita
      br_fdb_delete_by_port() doesn't care about vlan and mac address of the
      bridge device.
      
      As the check is almost the same as mac address changing, slightly modify
      fdb_delete_local() and use it.
      
      Note that we can always set added_by_user to 0 in fdb_delete_local() because
      - br_fdb_delete_by_port() calls fdb_delete_local() for local entries
        regardless of added_by_user. In this case, we have to check if another
        port has the same address and vlan, and if found, we have to create the
        entry (by changing dst). This is a kernel-added entry, not a user-added one.
      - br_fdb_changeaddr() doesn't call fdb_delete_local() for user-added entries.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a778e6d1
    • bridge: Properly check if local fdb entry can be deleted in br_fdb_change_mac_address · 960b589f
      Authored by Toshiaki Makita
      br_fdb_change_mac_address() doesn't check if the local entry has the
      same address as any of the bridge ports.
      Although I'm not sure when it is beneficial, the current implementation allows
      the bridge device to receive frames destined to any mac address of its ports.
      To preserve this behavior, we have to check if the mac address of the
      entry being deleted is identical to that of any port.
      
      As this check is almost the same as that in br_fdb_changeaddr(), create
      a common function fdb_delete_local() and call it from
      br_fdb_changeaddr() and br_fdb_change_mac_address().
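      
      A hedged sketch of what such a common helper could look like, based purely on
      the behaviour described in this patch series (the exact checks in the merged
      code may differ):
      
      	static void fdb_delete_local(struct net_bridge *br,
      				     const struct net_bridge_port *p,
      				     struct net_bridge_fdb_entry *f)
      	{
      		const unsigned char *addr = f->addr.addr;
      		struct net_bridge_port *op;
      
      		/* If another port still owns this address/vlan, re-point the
      		 * entry to that port instead of deleting it. */
      		list_for_each_entry(op, &br->port_list, list) {
      			if (op != p && ether_addr_equal(op->dev->dev_addr, addr) &&
      			    (!f->vlan_id || nbp_vlan_find(op, f->vlan_id))) {
      				f->dst = op;
      				f->added_by_user = 0;
      				return;
      			}
      		}
      
      		fdb_delete(br, f);
      	}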
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      960b589f
    • bridge: Fix the way to check if a local fdb entry can be deleted · 2b292fb4
      Authored by Toshiaki Makita
      We should take the following into account when deleting a local fdb
      entry.
      
      - nbp_vlan_find() can be used only when vid != 0 to check if an entry is
        deletable, because an fdb entry with vid 0 can exist at any time while
        nbp_vlan_find() always returns false with vid 0.
      
        Example of problematic case:
          ip link set eth0 address 12:34:56:78:90:ab
          ip link set eth1 address 12:34:56:78:90:ab
          brctl addif br0 eth0
          brctl addif br0 eth1
          ip link set eth0 address aa:bb:cc:dd:ee:ff
        Then, the fdb entry 12:34:56:78:90:ab will be deleted even though the
        bridge port eth1 still has that address.
      
      - The port to which the bridge device is attached might need a local entry
        if its mac address is set manually.
      
        Example of problematic case:
          ip link set eth0 address 12:34:56:78:90:ab
          brctl addif br0 eth0
          ip link set br0 address 12:34:56:78:90:ab
          ip link set eth0 address aa:bb:cc:dd:ee:ff
        Then, the fdb still must have the entry 12:34:56:78:90:ab, but it will be
        deleted.
      
      We can use br->dev->addr_assign_type to check if the address is manually
      set or not, but I propose another approach.
      
      Since we delete and insert local entries whenever changing the mac address
      of the bridge device, we can change dst of the entry to NULL, regardless of
      addr_assign_type, when deleting an entry associated with a certain port,
      and if it is later found to be unnecessary, then delete it.
      That is, if the mac address of a port changes, the entry might first be changed
      to have a NULL dst, but it is eventually deleted when recalculating
      and changing the bridge id.
      
      This approach is especially useful when we want to share the code with
      vlan deletion, where the bridge device might want such an entry regardless
      of addr_assign_type, and it makes things easy because we don't have to consider
      whether the mac address of the bridge device will be changed or not at the time
      we delete a local entry of a port, which means the fdb code will not be bothered
      even if the bridge id calculation logic is changed in the future.
      
      Also, this change reduces an inconsistent state, where frames whose dst is the
      mac address of the bridge can't reach the bridge because of premature fdb
      entry deletion. This change reduces the possibility that the bridge device
      replies to arp requests with an unreachable mac address, which could occur during
      the short window between calling del_nbp() and br_stp_recalculate_bridge_id()
      in br_del_if(). This will be effective after br_fdb_delete_by_port() starts to
      use the same code in the following patch.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2b292fb4
    • bridge: Change local fdb entries whenever mac address of bridge device changes · a4b816d8
      Authored by Toshiaki Makita
      The vlan code may need an fdb change when the mac address of the bridge device
      changes, even if the change is caused by the mac address of a bridge port changing.
      
      Example configuration:
        ip link set eth0 address 12:34:56:78:90:ab
        ip link set eth1 address aa:bb:cc:dd:ee:ff
        brctl addif br0 eth0
        brctl addif br0 eth1 # br0 will have mac address 12:34:56:78:90:ab
        bridge vlan add dev br0 vid 10 self
        bridge vlan add dev eth0 vid 10
      We will have fdb entry such that f->dst == NULL, f->vlan_id == 10 and
      f->addr == 12:34:56:78:90:ab at this time.
      Next, change the mac address of eth0 to a greater value.
        ip link set eth0 address ee:ff:12:34:56:78
      Then, the mac address of br0 will be recalculated and set to aa:bb:cc:dd:ee:ff.
      However, an entry aa:bb:cc:dd:ee:ff will not be created and we will not be
      able to communicate using br0 on vlan 10.
      
      Address this issue by deleting and adding local entries whenever
      changing the mac address of the bridge device.
      
      If there already exists an entry that has the same address, for example
      because br_fdb_changeaddr() has already inserted it,
      br_fdb_change_mac_address() will simply fail to insert it and no
      duplicate entry will be made, as before.
      
      This approach also needs br_add_if() to call br_fdb_insert() before
      br_stp_recalculate_bridge_id() so that we don't create an entry whose
      dst == NULL in this function to preserve previous behavior.
      
      Note that this is a slight change in behavior, in that the bridge device can
      receive traffic to the new address before br_stp_recalculate_bridge_id() is
      called in br_add_if().
      However, it is not a problem, because we already have the address on the
      new port, and the same approach of inserting the new entry before recalculating
      the bridge id is taken in br_device_event() as well.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4b816d8
    • bridge: Fix the way to find old local fdb entries in br_fdb_change_mac_address · a3ebb7ef
      Authored by Toshiaki Makita
      We have always failed to delete the old entry in
      br_fdb_change_mac_address(), because br_set_mac_address() updates
      dev->dev_addr before calling br_fdb_change_mac_address() and
      br_fdb_change_mac_address() uses dev->dev_addr to find the old entry.
      
      That update of dev_addr is completely unnecessary, because the same work
      is done in br_stp_change_bridge_id(), which is called right after
      br_fdb_change_mac_address().
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3ebb7ef
    • bridge: Fix the way to insert new local fdb entries in br_fdb_changeaddr · 2836882f
      Authored by Toshiaki Makita
      Since commit bc9a25d2 ("bridge: Add vlan support for local fdb entries"),
      br_fdb_changeaddr() has inserted a new local fdb entry only if it can
      find the old one. But if we have two ports with the same address,
      or the user has deleted a local entry, there will be no entry for one of
      the ports.
      
      Example of problematic case:
        ip link set eth0 address aa:bb:cc:dd:ee:ff
        ip link set eth1 address aa:bb:cc:dd:ee:ff
        brctl addif br0 eth0
        brctl addif br0 eth1 # eth1 will not have a local entry due to dup.
        ip link set eth1 address 12:34:56:78:90:ab
      Then, the new entry for the address 12:34:56:78:90:ab will not be
      created, and the bridge device will not be able to communicate.
      
      Insert new entries regardless of whether we can find old entries or not.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2836882f
    • bridge: Fix the way to find old local fdb entries in br_fdb_changeaddr · a5642ab4
      Authored by Toshiaki Makita
      br_fdb_changeaddr() assumes that there is at most one local entry per port
      per vlan. That used to be true, but since commit 36fd2b63 ("bridge: allow
      creating/deleting fdb entries via netlink"), it has not been so.
      Therefore, the function might fail to find the correct previous address
      to be deleted, and might delete an arbitrary local entry, if the user has
      added local entries manually.
      
      Example of problematic case:
        ip link set eth0 address ee:ff:12:34:56:78
        brctl addif br0 eth0
        bridge fdb add 12:34:56:78:90:ab dev eth0 master
        ip link set eth0 address aa:bb:cc:dd:ee:ff
      Then, the address 12:34:56:78:90:ab might be deleted instead of
      ee:ff:12:34:56:78, the original mac address of eth0.
      
      Address this issue by introducing a new flag, added_by_user, to struct
      net_bridge_fdb_entry.
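      
      A hedged sketch of where the flag would live; the field layout is an assumption
      for illustration, not the verbatim br_private.h:
      
      	struct net_bridge_fdb_entry {
      		struct hlist_node	hlist;
      		struct net_bridge_port	*dst;
      		struct rcu_head		rcu;
      		unsigned long		updated;
      		unsigned long		used;
      		mac_addr		addr;
      		unsigned char		is_local;
      		unsigned char		is_static;
      		unsigned char		added_by_user;	/* entry came via netlink */
      		__u16			vlan_id;
      	};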
      
      Note that br_fdb_delete_by_port() has to set added_by_user to 0 in cases
      like:
        ip link set eth0 address 12:34:56:78:90:ab
        ip link set eth1 address aa:bb:cc:dd:ee:ff
        brctl addif br0 eth0
        bridge fdb add aa:bb:cc:dd:ee:ff dev eth0 master
        brctl addif br0 eth1
        brctl delif br0 eth0
      In this case, the kernel should delete the user-added entry aa:bb:cc:dd:ee:ff,
      but it would also have been added by "brctl addif br0 eth1" originally,
      so we don't delete it and instead treat it as a new kernel-created entry.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a5642ab4
  5. 10 February 2014 (2 commits)