1. 10 8月, 2017 1 次提交
  2. 01 7月, 2017 1 次提交
  3. 21 6月, 2017 1 次提交
    • W
      igmp: add a missing spin_lock_init() · b4846fc3
      WANG Cong 提交于
      Andrey reported a lockdep warning on non-initialized
      spinlock:
      
       INFO: trying to register non-static key.
       the code is fine but needs lockdep annotation.
       turning off the locking correctness validator.
       CPU: 1 PID: 4099 Comm: a.out Not tainted 4.12.0-rc6+ #9
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       Call Trace:
        __dump_stack lib/dump_stack.c:16
        dump_stack+0x292/0x395 lib/dump_stack.c:52
        register_lock_class+0x717/0x1aa0 kernel/locking/lockdep.c:755
        ? 0xffffffffa0000000
        __lock_acquire+0x269/0x3690 kernel/locking/lockdep.c:3255
        lock_acquire+0x22d/0x560 kernel/locking/lockdep.c:3855
        __raw_spin_lock_bh ./include/linux/spinlock_api_smp.h:135
        _raw_spin_lock_bh+0x36/0x50 kernel/locking/spinlock.c:175
        spin_lock_bh ./include/linux/spinlock.h:304
        ip_mc_clear_src+0x27/0x1e0 net/ipv4/igmp.c:2076
        igmpv3_clear_delrec+0xee/0x4f0 net/ipv4/igmp.c:1194
        ip_mc_destroy_dev+0x4e/0x190 net/ipv4/igmp.c:1736
      
      We miss a spin_lock_init() in igmpv3_add_delrec(), probably
      because previously we never use it on this code path. Since
      we already unlink it from the global mc_tomb list, it is
      probably safe not to acquire this spinlock here. It does not
      harm to have it although, to avoid conditional locking.
      
      Fixes: c38b7d32 ("igmp: acquire pmc lock for ip_mc_clear_src()")
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4846fc3
  4. 16 6月, 2017 1 次提交
    • J
      networking: make skb_put & friends return void pointers · 4df864c1
      Johannes Berg 提交于
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions (skb_put, __skb_put and pskb_put) return void *
      and remove all the casts across the tree, adding a (u8 *) cast only
      where the unsigned char pointer was used directly, all done with the
      following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = { skb_put, __skb_put };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = { skb_put, __skb_put };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      
      which actually doesn't cover pskb_put since there are only three
      users overall.
      
      A handful of stragglers were converted manually, notably a macro in
      drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
      instances in net/bluetooth/hci_sock.c. In the former file, I also
      had to fix one whitespace problem spatch introduced.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4df864c1
  5. 14 6月, 2017 1 次提交
    • W
      igmp: acquire pmc lock for ip_mc_clear_src() · c38b7d32
      WANG Cong 提交于
      Andrey reported a use-after-free in add_grec():
      
              for (psf = *psf_list; psf; psf = psf_next) {
      		...
                      psf_next = psf->sf_next;
      
      where the struct ip_sf_list's were already freed by:
      
       kfree+0xe8/0x2b0 mm/slub.c:3882
       ip_mc_clear_src+0x69/0x1c0 net/ipv4/igmp.c:2078
       ip_mc_dec_group+0x19a/0x470 net/ipv4/igmp.c:1618
       ip_mc_drop_socket+0x145/0x230 net/ipv4/igmp.c:2609
       inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:411
       sock_release+0x8d/0x1e0 net/socket.c:597
       sock_close+0x16/0x20 net/socket.c:1072
      
      This happens because we don't hold pmc->lock in ip_mc_clear_src()
      and a parallel mr_ifc_timer timer could jump in and access them.
      
      The RCU lock is there but it is merely for pmc itself, this
      spinlock could actually ensure we don't access them in parallel.
      
      Thanks to Eric and Long for discussion on this bug.
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c38b7d32
  6. 10 2月, 2017 1 次提交
  7. 03 1月, 2017 1 次提交
    • M
      igmp: Make igmp group member RFC 3376 compliant · 7ababb78
      Michal Tesar 提交于
      5.2. Action on Reception of a Query
      
       When a system receives a Query, it does not respond immediately.
       Instead, it delays its response by a random amount of time, bounded
       by the Max Resp Time value derived from the Max Resp Code in the
       received Query message.  A system may receive a variety of Queries on
       different interfaces and of different kinds (e.g., General Queries,
       Group-Specific Queries, and Group-and-Source-Specific Queries), each
       of which may require its own delayed response.
      
       Before scheduling a response to a Query, the system must first
       consider previously scheduled pending responses and in many cases
       schedule a combined response.  Therefore, the system must be able to
       maintain the following state:
      
       o A timer per interface for scheduling responses to General Queries.
      
       o A per-group and interface timer for scheduling responses to Group-
         Specific and Group-and-Source-Specific Queries.
      
       o A per-group and interface list of sources to be reported in the
         response to a Group-and-Source-Specific Query.
      
       When a new Query with the Router-Alert option arrives on an
       interface, provided the system has state to report, a delay for a
       response is randomly selected in the range (0, [Max Resp Time]) where
       Max Resp Time is derived from Max Resp Code in the received Query
       message.  The following rules are then used to determine if a Report
       needs to be scheduled and the type of Report to schedule.  The rules
       are considered in order and only the first matching rule is applied.
      
       1. If there is a pending response to a previous General Query
          scheduled sooner than the selected delay, no additional response
          needs to be scheduled.
      
       2. If the received Query is a General Query, the interface timer is
          used to schedule a response to the General Query after the
          selected delay.  Any previously pending response to a General
          Query is canceled.
      --8<--
      
      Currently the timer is rearmed with new random expiration time for
      every incoming query regardless of possibly already pending report.
      Which is not aligned with the above RFE.
      It also might happen that higher rate of incoming queries can
      postpone the report after the expiration time of the first query
      causing group membership loss.
      
      Now the per interface general query timer is rearmed only
      when there is no pending report already scheduled on that interface or
      the newly selected expiration time is before the already pending
      scheduled report.
      Signed-off-by: NMichal Tesar <mtesar@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ababb78
  8. 25 12月, 2016 1 次提交
  9. 16 11月, 2016 1 次提交
    • H
      igmp: do not remove igmp souce list info when set link down · 24803f38
      Hangbin Liu 提交于
      In commit 24cf3af3 ("igmp: call ip_mc_clear_src..."), we forgot to remove
      igmpv3_clear_delrec() in ip_mc_down(), which also called ip_mc_clear_src().
      This make us clear all IGMPv3 source filter info after NETDEV_DOWN.
      Move igmpv3_clear_delrec() to ip_mc_destroy_dev() and then no need
      ip_mc_clear_src() in ip_mc_destroy_dev().
      
      On the other hand, we should restore back instead of free all source filter
      info in igmpv3_del_delrec(). Or we will not able to restore IGMPv3 source
      filter info after NETDEV_UP and NETDEV_POST_TYPE_CHANGE.
      
      Fixes: 24cf3af3 ("igmp: call ip_mc_clear_src() only when ...")
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      24803f38
  10. 09 8月, 2016 1 次提交
  11. 04 3月, 2016 1 次提交
    • B
      mld, igmp: Fix reserved tailroom calculation · 1837b2e2
      Benjamin Poirier 提交于
      The current reserved_tailroom calculation fails to take hlen and tlen into
      account.
      
      skb:
      [__hlen__|__data____________|__tlen___|__extra__]
      ^                                               ^
      head                                            skb_end_offset
      
      In this representation, hlen + data + tlen is the size passed to alloc_skb.
      "extra" is the extra space made available in __alloc_skb because of
      rounding up by kmalloc. We can reorder the representation like so:
      
      [__hlen__|__data____________|__extra__|__tlen___]
      ^                                               ^
      head                                            skb_end_offset
      
      The maximum space available for ip headers and payload without
      fragmentation is min(mtu, data + extra). Therefore,
      reserved_tailroom
      = data + extra + tlen - min(mtu, data + extra)
      = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
      = skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)
      
      Compare the second line to the current expression:
      reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
      and we can see that hlen and tlen are not taken into account.
      
      The min() in the third line can be expanded into:
      if mtu < skb_tailroom - tlen:
      	reserved_tailroom = skb_tailroom - mtu
      else:
      	reserved_tailroom = tlen
      
      Depending on hlen, tlen, mtu and the number of multicast address records,
      the current code may output skbs that have less tailroom than
      dev->needed_tailroom or it may output more skbs than needed because not all
      space available is used.
      
      Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
      Signed-off-by: NBenjamin Poirier <bpoirier@suse.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1837b2e2
  12. 17 2月, 2016 1 次提交
  13. 11 2月, 2016 4 次提交
  14. 04 12月, 2015 1 次提交
    • A
      ipv4: igmp: Allow removing groups from a removed interface · 4eba7bb1
      Andrew Lunn 提交于
      When a multicast group is joined on a socket, a struct ip_mc_socklist
      is appended to the sockets mc_list containing information about the
      joined group.
      
      If the interface is hot unplugged, this entry becomes stale. Prior to
      commit 52ad353a ("igmp: fix the problem when mc leave group") it
      was possible to remove the stale entry by performing a
      IP_DROP_MEMBERSHIP, passing either the old ifindex or ip address on
      the interface. However, this fix enforces that the interface must
      still exist. Thus with time, the number of stale entries grows, until
      sysctl_igmp_max_memberships is reached and then it is not possible to
      join and more groups.
      
      The previous patch fixes an issue where a IP_DROP_MEMBERSHIP is
      performed without specifying the interface, either by ifindex or ip
      address. However here we do supply one of these. So loosen the
      restriction on device existence to only apply when the interface has
      not been specified. This then restores the ability to clean up the
      stale entries.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Fixes: 52ad353a "(igmp: fix the problem when mc leave group")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4eba7bb1
  15. 05 11月, 2015 1 次提交
  16. 08 10月, 2015 2 次提交
  17. 30 9月, 2015 1 次提交
  18. 29 8月, 2015 1 次提交
    • P
      IGMP: Inhibit reports for local multicast groups · df2cf4a7
      Philip Downey 提交于
      The range of addresses between 224.0.0.0 and 224.0.0.255 inclusive, is
      reserved for the use of routing protocols and other low-level topology
      discovery or maintenance protocols, such as gateway discovery and
      group membership reporting.  Multicast routers should not forward any
      multicast datagram with destination addresses in this range,
      regardless of its TTL.
      
      Currently, IGMP reports are generated for this reserved range of
      addresses even though a router will ignore this information since it
      has no purpose.  However, the presence of reserved group addresses in
      an IGMP membership report uses up network bandwidth and can also
      obscure addresses of interest when inspecting membership reports using
      packet inspection or debug messages.
      
      Although the RFCs for the various version of IGMP (e.g.RFC 3376 for
      v3) do not specify that the reserved addresses be excluded from
      membership reports, it should do no harm in doing so.  In particular
      there should be no adverse effect in any IGMP snooping functionality
      since 224.0.0.x is specifically excluded as per RFC 4541 (IGMP and MLD
      Snooping Switches Considerations) section 2.1.2. Data Forwarding
      Rules:
      
          2) Packets with a destination IP (DIP) address in the 224.0.0.X
             range which are not IGMP must be forwarded on all ports.
      
      IGMP reports for local multicast groups can now be optionally
      inhibited by means of a system control variable (by setting the value
      to zero) e.g.:
          echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports
      
      To retain backwards compatibility the previous behaviour is retained
      by default on system boot or reverted by setting the value back to
      non-zero e.g.:
          echo 1 >  /proc/sys/net/ipv4/igmp_link_local_mcast_reports
      Signed-off-by: NPhilip Downey <pdowney@brocade.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df2cf4a7
  19. 14 8月, 2015 1 次提交
  20. 05 5月, 2015 1 次提交
    • L
      net: Export IGMP/MLD message validation code · 9afd85c9
      Linus Lüssing 提交于
      With this patch, the IGMP and MLD message validation functions are moved
      from the bridge code to IPv4/IPv6 multicast files. Some small
      refactoring was done to enhance readibility and to iron out some
      differences in behaviour between the IGMP and MLD parsing code (e.g. the
      skb-cloning of MLD messages is now only done if necessary, just like the
      IGMP part always did).
      
      Finally, these IGMP and MLD message validation functions are exported so
      that not only the bridge can use it but batman-adv later, too.
      Signed-off-by: NLinus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9afd85c9
  21. 04 4月, 2015 2 次提交
  22. 26 3月, 2015 1 次提交
  23. 19 3月, 2015 1 次提交
  24. 28 2月, 2015 1 次提交
    • M
      multicast: Extend ip address command to enable multicast group join/leave on · 93a714d6
      Madhu Challa 提交于
      Joining multicast group on ethernet level via "ip maddr" command would
      not work if we have an Ethernet switch that does igmp snooping since
      the switch would not replicate multicast packets on ports that did not
      have IGMP reports for the multicast addresses.
      
      Linux vxlan interfaces created via "ip link add vxlan" have the group option
      that enables then to do the required join.
      
      By extending ip address command with option "autojoin" we can get similar
      functionality for openvswitch vxlan interfaces as well as other tunneling
      mechanisms that need to receive multicast traffic. The kernel code is
      structured similar to how the vxlan driver does a group join / leave.
      
      example:
      ip address add 224.1.1.10/24 dev eth5 autojoin
      ip address del 224.1.1.10/24 dev eth5
      Signed-off-by: NMadhu Challa <challa@noironetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93a714d6
  25. 21 2月, 2015 1 次提交
  26. 17 11月, 2014 1 次提交
    • D
      ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs · feb91a02
      Daniel Borkmann 提交于
      It has been reported that generating an MLD listener report on
      devices with large MTUs (e.g. 9000) and a high number of IPv6
      addresses can trigger a skb_over_panic():
      
      skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
      head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
      dev:port1
       ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:100!
      invalid opcode: 0000 [#1] SMP
      Modules linked in: ixgbe(O)
      CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
      [...]
      Call Trace:
       <IRQ>
       [<ffffffff80578226>] ? skb_put+0x3a/0x3b
       [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
       [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
       [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
       [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
       [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
       [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
       [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
       [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
       [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
       [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
      
      mld_newpack() skb allocations are usually requested with dev->mtu
      in size, since commit 72e09ad1 ("ipv6: avoid high order allocations")
      we have changed the limit in order to be less likely to fail.
      
      However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
      macros, which determine if we may end up doing an skb_put() for
      adding another record. To avoid possible fragmentation, we check
      the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
      assumption as the actual max allocation size can be much smaller.
      
      The IGMP case doesn't have this issue as commit 57e1ab6e
      ("igmp: refine skb allocations") stores the allocation size in
      the cb[].
      
      Set a reserved_tailroom to make it fit into the MTU and use
      skb_availroom() helper instead. This also allows to get rid of
      igmp_skb_size().
      Reported-by: NWei Liu <lw1a2.jing@gmail.com>
      Fixes: 72e09ad1 ("ipv6: avoid high order allocations")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: David L Stevens <david.stevens@oracle.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      feb91a02
  27. 06 11月, 2014 2 次提交
    • D
      ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs · 4c672e4b
      Daniel Borkmann 提交于
      It has been reported that generating an MLD listener report on
      devices with large MTUs (e.g. 9000) and a high number of IPv6
      addresses can trigger a skb_over_panic():
      
      skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
      head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
      dev:port1
       ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:100!
      invalid opcode: 0000 [#1] SMP
      Modules linked in: ixgbe(O)
      CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
      [...]
      Call Trace:
       <IRQ>
       [<ffffffff80578226>] ? skb_put+0x3a/0x3b
       [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
       [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
       [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
       [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
       [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
       [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
       [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
       [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
       [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
       [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
      
      mld_newpack() skb allocations are usually requested with dev->mtu
      in size, since commit 72e09ad1 ("ipv6: avoid high order allocations")
      we have changed the limit in order to be less likely to fail.
      
      However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
      macros, which determine if we may end up doing an skb_put() for
      adding another record. To avoid possible fragmentation, we check
      the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
      assumption as the actual max allocation size can be much smaller.
      
      The IGMP case doesn't have this issue as commit 57e1ab6e
      ("igmp: refine skb allocations") stores the allocation size in
      the cb[].
      
      Set a reserved_tailroom to make it fit into the MTU and use
      skb_availroom() helper instead. This also allows to get rid of
      igmp_skb_size().
      Reported-by: NWei Liu <lw1a2.jing@gmail.com>
      Fixes: 72e09ad1 ("ipv6: avoid high order allocations")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: David L Stevens <david.stevens@oracle.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c672e4b
    • J
      net: Convert SEQ_START_TOKEN/seq_printf to seq_puts · 1744bea1
      Joe Perches 提交于
      Using a single fixed string is smaller code size than using
      a format and many string arguments.
      
      Reduces overall code size a little.
      
      $ size net/ipv4/igmp.o* net/ipv6/mcast.o* net/ipv6/ip6_flowlabel.o*
         text	   data	    bss	    dec	    hex	filename
        34269	   7012	  14824	  56105	   db29	net/ipv4/igmp.o.new
        34315	   7012	  14824	  56151	   db57	net/ipv4/igmp.o.old
        30078	   7869	  13200	  51147	   c7cb	net/ipv6/mcast.o.new
        30105	   7869	  13200	  51174	   c7e6	net/ipv6/mcast.o.old
        11434	   3748	   8580	  23762	   5cd2	net/ipv6/ip6_flowlabel.o.new
        11491	   3748	   8580	  23819	   5d0b	net/ipv6/ip6_flowlabel.o.old
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1744bea1
  28. 05 11月, 2014 1 次提交
  29. 07 10月, 2014 1 次提交
  30. 05 9月, 2014 1 次提交
  31. 23 8月, 2014 1 次提交
  32. 25 7月, 2014 1 次提交
  33. 08 7月, 2014 1 次提交
    • D
      igmp: fix the problem when mc leave group · 52ad353a
      dingtianhong 提交于
      The problem was triggered by these steps:
      
      1) create socket, bind and then setsockopt for add mc group.
         mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
         mreq.imr_interface.s_addr = inet_addr("192.168.1.2");
         setsockopt(sockfd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
      
      2) drop the mc group for this socket.
         mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
         mreq.imr_interface.s_addr = inet_addr("0.0.0.0");
         setsockopt(sockfd, IPPROTO_IP, IP_DROP_MEMBERSHIP, &mreq, sizeof(mreq));
      
      3) and then drop the socket, I found the mc group was still used by the dev:
      
         netstat -g
      
         Interface       RefCnt Group
         --------------- ------ ---------------------
         eth2		   1	  255.0.0.37
      
      Normally even though the IP_DROP_MEMBERSHIP return error, the mc group still need
      to be released for the netdev when drop the socket, but this process was broken when
      route default is NULL, the reason is that:
      
      The ip_mc_leave_group() will choose the in_dev by the imr_interface.s_addr, if input addr
      is NULL, the default route dev will be chosen, then the ifindex is got from the dev,
      then polling the inet->mc_list and return -ENODEV, but if the default route dev is NULL,
      the in_dev and ifIndex is both NULL, when polling the inet->mc_list, the mc group will be
      released from the mc_list, but the dev didn't dec the refcnt for this mc group, so
      when dropping the socket, the mc_list is NULL and the dev still keep this group.
      
      v1->v2: According Hideaki's suggestion, we should align with IPv6 (RFC3493) and BSDs,
      	so I add the checking for the in_dev before polling the mc_list, make sure when
      	we remove the mc group, dec the refcnt to the real dev which was using the mc address.
      	The problem would never happened again.
      Signed-off-by: NDing Tianhong <dingtianhong@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52ad353a
  34. 03 6月, 2014 1 次提交
    • E
      inetpeer: get rid of ip_id_count · 73f156a6
      Eric Dumazet 提交于
      Ideally, we would need to generate IP ID using a per destination IP
      generator.
      
      linux kernels used inet_peer cache for this purpose, but this had a huge
      cost on servers disabling MTU discovery.
      
      1) each inet_peer struct consumes 192 bytes
      
      2) inetpeer cache uses a binary tree of inet_peer structs,
         with a nominal size of ~66000 elements under load.
      
      3) lookups in this tree are hitting a lot of cache lines, as tree depth
         is about 20.
      
      4) If server deals with many tcp flows, we have a high probability of
         not finding the inet_peer, allocating a fresh one, inserting it in
         the tree with same initial ip_id_count, (cf secure_ip_id())
      
      5) We garbage collect inet_peer aggressively.
      
      IP ID generation do not have to be 'perfect'
      
      Goal is trying to avoid duplicates in a short period of time,
      so that reassembly units have a chance to complete reassembly of
      fragments belonging to one message before receiving other fragments
      with a recycled ID.
      
      We simply use an array of generators, and a Jenkin hash using the dst IP
      as a key.
      
      ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
      belongs (it is only used from this file)
      
      secure_ip_id() and secure_ipv6_id() no longer are needed.
      
      Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
      unnecessary decrement/increment of the number of segments.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73f156a6