1. 27 5月, 2015 1 次提交
    • J
      tipc: fix bug in link protocol message create function · f3903bcc
      Jon Paul Maloy 提交于
      In commit dd3f9e70
      ("tipc: add packet sequence number at instant of transmission") we
      made a change with the consequence that packets in the link backlog
      queue don't contain valid sequence numbers.
      
      However, when we create a link protocol message, we still use the
      sequence number of the first packet in the backlog, if there is any,
      as "next_sent" indicator in the message. This may entail unnecessary
      retransissions or stale packet transmission when there is very low
      traffic on the link.
      
      This commit fixes this issue by only using the current value of
      tipc_link::snd_nxt as indicator.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f3903bcc
  2. 26 5月, 2015 19 次提交
  3. 25 5月, 2015 5 次提交
  4. 23 5月, 2015 9 次提交
    • J
      pktgen: make /proc/net/pktgen/pgctrl report fail on invalid input · 40207264
      Jesper Dangaard Brouer 提交于
      Giving /proc/net/pktgen/pgctrl an invalid command just returns shell
      success and prints a warning in dmesg.  This is not very useful for
      shell scripting, as it can only detect the error by parsing dmesg.
      
      Instead return -EINVAL when the command is unknown, as this provides
      userspace shell scripting a way of detecting this.
      
      Also bump version tag to 2.75, because (1) reading /proc/net/pktgen/pgctrl
      output this version number which would allow to detect this small
      semantic change, and (2) because the pktgen version tag have not been
      updated since 2010.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      40207264
    • J
      pktgen: adjust spacing in proc file interface output · d079abd1
      Jesper Dangaard Brouer 提交于
      Too many spaces were introduced in commit 63adc6fb ("pktgen: cleanup
      checkpatch warnings"), thus misaligning "src_min:" to other columns.
      
      Fixes: 63adc6fb ("pktgen: cleanup checkpatch warnings")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d079abd1
    • E
      bridge: fix lockdep splat · 93a33a58
      Eric Dumazet 提交于
      Following lockdep splat was reported :
      
      [   29.382286] ===============================
      [   29.382315] [ INFO: suspicious RCU usage. ]
      [   29.382344] 4.1.0-0.rc0.git11.1.fc23.x86_64 #1 Not tainted
      [   29.382380] -------------------------------
      [   29.382409] net/bridge/br_private.h:626 suspicious
      rcu_dereference_check() usage!
      [   29.382455]
                     other info that might help us debug this:
      
      [   29.382507]
                     rcu_scheduler_active = 1, debug_locks = 0
      [   29.382549] 2 locks held by swapper/0/0:
      [   29.382576]  #0:  (((&p->forward_delay_timer))){+.-...}, at:
      [<ffffffff81139f75>] call_timer_fn+0x5/0x4f0
      [   29.382660]  #1:  (&(&br->lock)->rlock){+.-...}, at:
      [<ffffffffa0450dc1>] br_forward_delay_timer_expired+0x31/0x140
      [bridge]
      [   29.382754]
                     stack backtrace:
      [   29.382787] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
      4.1.0-0.rc0.git11.1.fc23.x86_64 #1
      [   29.382838] Hardware name: LENOVO 422916G/LENOVO, BIOS A1KT53AUS 04/07/2015
      [   29.382882]  0000000000000000 3ebfc20364115825 ffff880666603c48
      ffffffff81892d4b
      [   29.382943]  0000000000000000 ffffffff81e124e0 ffff880666603c78
      ffffffff8110bcd7
      [   29.383004]  ffff8800785c9d00 ffff88065485ac58 ffff880c62002800
      ffff880c5fc88ac0
      [   29.383065] Call Trace:
      [   29.383084]  <IRQ>  [<ffffffff81892d4b>] dump_stack+0x4c/0x65
      [   29.383130]  [<ffffffff8110bcd7>] lockdep_rcu_suspicious+0xe7/0x120
      [   29.383178]  [<ffffffffa04520f9>] br_fill_ifinfo+0x4a9/0x6a0 [bridge]
      [   29.383225]  [<ffffffffa045266b>] br_ifinfo_notify+0x11b/0x4b0 [bridge]
      [   29.383271]  [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge]
      [   29.383320]  [<ffffffffa0450de8>]
      br_forward_delay_timer_expired+0x58/0x140 [bridge]
      [   29.383371]  [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge]
      [   29.383416]  [<ffffffff8113a033>] call_timer_fn+0xc3/0x4f0
      [   29.383454]  [<ffffffff81139f75>] ? call_timer_fn+0x5/0x4f0
      [   29.383493]  [<ffffffff8110a90f>] ? lock_release_holdtime.part.29+0xf/0x200
      [   29.383541]  [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge]
      [   29.383587]  [<ffffffff8113a6a4>] run_timer_softirq+0x244/0x490
      [   29.383629]  [<ffffffff810b68cc>] __do_softirq+0xec/0x670
      [   29.383666]  [<ffffffff810b70d5>] irq_exit+0x145/0x150
      [   29.383703]  [<ffffffff8189f506>] smp_apic_timer_interrupt+0x46/0x60
      [   29.383744]  [<ffffffff8189d523>] apic_timer_interrupt+0x73/0x80
      [   29.383782]  <EOI>  [<ffffffff816f131f>] ? cpuidle_enter_state+0x5f/0x2f0
      [   29.383832]  [<ffffffff816f131b>] ? cpuidle_enter_state+0x5b/0x2f0
      
      Problem here is that br_forward_delay_timer_expired() is a timer
      handler, calling br_ifinfo_notify() which assumes either rcu_read_lock()
      or RTNL are held.
      
      Simplest fix seems to add rcu read lock section.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJosh Boyer <jwboyer@fedoraproject.org>
      Reported-by: NDominick Grift <dac.override@gmail.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93a33a58
    • A
      net: core: 'ethtool' issue with querying phy settings · f96dee13
      Arun Parameswaran 提交于
      When trying to configure the settings for PHY1, using commands
      like 'ethtool -s eth0 phyad 1 speed 100', the 'ethtool' seems to
      modify other settings apart from the speed of the PHY1, in the
      above case.
      
      The ethtool seems to query the settings for PHY0, and use this
      as the base to apply the new settings to the PHY1. This is
      causing the other settings of the PHY 1 to be wrongly
      configured.
      
      The issue is caused by the '_ethtool_get_settings()' API, which
      gets called because of the 'ETHTOOL_GSET' command, is clearing
      the 'cmd' pointer (of type 'struct ethtool_cmd') by calling
      memset. This clears all the parameters (if any) passed for the
      'ETHTOOL_GSET' cmd. So the driver's callback is always invoked
      with 'cmd->phy_address' as '0'.
      
      The '_ethtool_get_settings()' is called from other files in the
      'net/core'. So the fix is applied to the 'ethtool_get_settings()'
      which is only called in the context of the 'ethtool'.
      Signed-off-by: NArun Parameswaran <aparames@broadcom.com>
      Reviewed-by: NRay Jui <rjui@broadcom.com>
      Reviewed-by: NScott Branden <sbranden@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f96dee13
    • T
      bridge: fix parsing of MLDv2 reports · 47cc84ce
      Thadeu Lima de Souza Cascardo 提交于
      When more than a multicast address is present in a MLDv2 report, all but
      the first address is ignored, because the code breaks out of the loop if
      there has not been an error adding that address.
      
      This has caused failures when two guests connected through the bridge
      tried to communicate using IPv6. Neighbor discoveries would not be
      transmitted to the other guest when both used a link-local address and a
      static address.
      
      This only happens when there is a MLDv2 querier in the network.
      
      The fix will only break out of the loop when there is a failure adding a
      multicast address.
      
      The mdb before the patch:
      
      dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
      dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
      dev ovirtmgmt port bond0.86 grp ff02::2 temp
      
      After the patch:
      
      dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
      dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
      dev ovirtmgmt port bond0.86 grp ff02::fb temp
      dev ovirtmgmt port bond0.86 grp ff02::2 temp
      dev ovirtmgmt port bond0.86 grp ff02::d temp
      dev ovirtmgmt port vnet0 grp ff02::1:ff00:76 temp
      dev ovirtmgmt port bond0.86 grp ff02::16 temp
      dev ovirtmgmt port vnet1 grp ff02::1:ff00:77 temp
      dev ovirtmgmt port bond0.86 grp ff02::1:ff00:def temp
      dev ovirtmgmt port bond0.86 grp ff02::1:ffa1:40bf temp
      
      Fixes: 08b202b6 ("bridge br_multicast: IPv6 MLD support.")
      Reported-by: NRik Theys <Rik.Theys@esat.kuleuven.be>
      Signed-off-by: NThadeu Lima de Souza Cascardo <cascardo@redhat.com>
      Tested-by: NRik Theys <Rik.Theys@esat.kuleuven.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47cc84ce
    • M
      ipv4: fill in table id when replacing a route · d4e64c29
      Michal Kubeček 提交于
      When replacing an IPv4 route, tb_id member of the new fib_alias
      structure is not set in the replace code path so that the new route is
      ignored.
      
      Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4e64c29
    • E
      ipv4: Avoid crashing in ip_error · 381c759d
      Eric W. Biederman 提交于
      ip_error does not check if in_dev is NULL before dereferencing it.
      
      IThe following sequence of calls is possible:
      CPU A                          CPU B
      ip_rcv_finish
          ip_route_input_noref()
              ip_route_input_slow()
                                     inetdev_destroy()
          dst_input()
      
      With the result that a network device can be destroyed while processing
      an input packet.
      
      A crash was triggered with only unicast packets in flight, and
      forwarding enabled on the only network device.   The error condition
      was created by the removal of the network device.
      
      As such it is likely the that error code was -EHOSTUNREACH, and the
      action taken by ip_error (if in_dev had been accessible) would have
      been to not increment any counters and to have tried and likely failed
      to send an icmp error as the network device is going away.
      
      Therefore handle this weird case by just dropping the packet if
      !in_dev.  It will result in dropping the packet sooner, and will not
      result in an actual change of behavior.
      
      Fixes: 251da413 ("ipv4: Cache ip_error() routes even when not forwarding.")
      Reported-by: NVittorio Gambaletta <linuxbugs@vittgam.net>
      Tested-by: NVittorio Gambaletta <linuxbugs@vittgam.net>
      Signed-off-by: NVittorio Gambaletta <linuxbugs@vittgam.net>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      381c759d
    • J
      flow_dissector: do not break if ports are not needed in flowlabel · 12c227ec
      Jiri Pirko 提交于
      This restored previous behaviour. If caller does not want ports to be
      filled, we should not break.
      
      Fixes: 06635a35 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
      Signed-off-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      12c227ec
    • E
      tcp: fix a potential deadlock in tcp_get_info() · d654976c
      Eric Dumazet 提交于
      Taking socket spinlock in tcp_get_info() can deadlock, as
      inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
      while packet processing can use the reverse locking order.
      
      We could avoid this locking for TCP_LISTEN states, but lockdep would
      certainly get confused as all TCP sockets share same lockdep classes.
      
      [  523.722504] ======================================================
      [  523.728706] [ INFO: possible circular locking dependency detected ]
      [  523.734990] 4.1.0-dbg-DEV #1676 Not tainted
      [  523.739202] -------------------------------------------------------
      [  523.745474] ss/18032 is trying to acquire lock:
      [  523.750002]  (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
      [  523.758129]
      [  523.758129] but task is already holding lock:
      [  523.763968]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
      [  523.774661]
      [  523.774661] which lock already depends on the new lock.
      [  523.774661]
      [  523.782850]
      [  523.782850] the existing dependency chain (in reverse order) is:
      [  523.790326]
      -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
      [  523.796599]        [<ffffffff811126bb>] lock_acquire+0xbb/0x270
      [  523.802565]        [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
      [  523.808628]        [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
      [  523.815273]        [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
      [  523.822067]        [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
      [  523.828199]        [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
      [  523.834331]        [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
      [  523.840202]        [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
      [  523.847214]        [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
      [  523.853440]        [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
      [  523.859624]        [<ffffffff81659db7>] ip_rcv+0x307/0x420
      
      Lets use u64_sync infrastructure instead. As a bonus, 64bit
      arches get optimized, as these are nop for them.
      
      Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d654976c
  5. 22 5月, 2015 6 次提交
    • M
      tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info · 2efd055c
      Marcelo Ricardo Leitner 提交于
      This patch tracks the total number of inbound and outbound segments on a
      TCP socket. One may use this number to have an idea on connection
      quality when compared against the retransmissions.
      
      RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut
      
      These are a 32bit field each and can be fetched both from TCP_INFO
      getsockopt() if one has a handle on a TCP socket, or from inet_diag
      netlink facility (iproute2/ss patch will follow)
      
      Note that tp->segs_out was placed near tp->snd_nxt for good data
      locality and minimal performance impact, while tp->segs_in was placed
      near tp->bytes_received for the same reason.
      
      Join work with Eric Dumazet.
      
      Note that received SYN are accounted on the listener, but sent SYNACK
      are not accounted.
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2efd055c
    • F
      ipv6: reject locally assigned nexthop addresses · 48ed7b26
      Florian Westphal 提交于
      ip -6 addr add dead::1/128 dev eth0
      sleep 5
      ip -6 route add default via dead::1/128
      -> fails
      ip -6 addr add dead::1/128 dev eth0
      ip -6 route add default via dead::1/128
      -> succeeds
      
      reason is that if (nonsensensical) route above is added,
      dead::1 is still subject to DAD, so the route lookup will
      pick eth0 as outdev due to the prefix route that is added before
      DAD work is started.
      
      Add explicit test that checks if nexthop gateway is a local address.
      
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1167969Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48ed7b26
    • E
      tcp: improve REUSEADDR/NOREUSEADDR cohabitation · 946f9eb2
      Eric Dumazet 提交于
      inet_csk_get_port() randomization effort tends to spread
      sockets on all the available range (ip_local_port_range)
      
      This is unfortunate because SO_REUSEADDR sockets have
      less requirements than non SO_REUSEADDR ones.
      
      If an application uses SO_REUSEADDR hint, it is to try to
      allow source ports being shared.
      
      So instead of picking a random port number in ip_local_port_range,
      lets try first in first half of the range.
      
      This gives more chances to use upper half of the range for the
      sockets with strong requirements (not using SO_REUSEADDR)
      
      Note this patch does not add a new sysctl, and only changes
      the way we try to pick port number.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Flavio Leitner <fbl@redhat.com>
      Acked-by: NFlavio Leitner <fbl@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      946f9eb2
    • E
      inet_hashinfo: remove bsocket counter · f5af1f57
      Eric Dumazet 提交于
      We no longer need bsocket atomic counter, as inet_csk_get_port()
      calls bind_conflict() regardless of its value, after commit
      2b05ad33 ("tcp: bind() fix autoselection to share ports")
      
      This patch removes overhead of maintaining this counter and
      double inet_csk_get_port() calls under pressure.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Flavio Leitner <fbl@redhat.com>
      Acked-by: NFlavio Leitner <fbl@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5af1f57
    • J
      tcp: ensure epoll edge trigger wakeup when write queue is empty · ce5ec440
      Jason Baron 提交于
      We currently rely on the setting of SOCK_NOSPACE in the write()
      path to ensure that we wake up any epoll edge trigger waiters when
      acks return to free space in the write queue. However, if we fail
      to allocate even a single skb in the write queue, we could end up
      waiting indefinitely.
      
      Fix this by explicitly issuing a wakeup when we detect the condition
      of an empty write queue and a return value of -EAGAIN. This allows
      userspace to re-try as we expect this to be a temporary failure.
      
      I've tested this approach by artificially making
      sk_stream_alloc_skb() return NULL periodically. In that case,
      epoll edge trigger waiters will hang indefinitely in epoll_wait()
      without this patch.
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce5ec440
    • D
      net: sched: fix call_rcu() race on classifier module unloads · c78e1746
      Daniel Borkmann 提交于
      Vijay reported that a loop as simple as ...
      
        while true; do
          tc qdisc add dev foo root handle 1: prio
          tc filter add dev foo parent 1: u32 match u32 0 0  flowid 1
          tc qdisc del dev foo root
          rmmod cls_u32
        done
      
      ... will panic the kernel. Moreover, he bisected the change
      apparently introducing it to 78fd1d0a ("netlink: Re-add
      locking to netlink_lookup() and seq walker").
      
      The removal of synchronize_net() from the netlink socket
      triggering the qdisc to be removed, seems to have uncovered
      an RCU resp. module reference count race from the tc API.
      Given that RCU conversion was done after e341694e ("netlink:
      Convert netlink_lookup() to use RCU protected hash table")
      which added the synchronize_net() originally, occasion of
      hitting the bug was less likely (not impossible though):
      
      When qdiscs that i) support attaching classifiers and,
      ii) have at least one of them attached, get deleted, they
      invoke tcf_destroy_chain(), and thus call into ->destroy()
      handler from a classifier module.
      
      After RCU conversion, all classifier that have an internal
      prio list, unlink them and initiate freeing via call_rcu()
      deferral.
      
      Meanhile, tcf_destroy() releases already reference to the
      tp->ops->owner module before the queued RCU callback handler
      has been invoked.
      
      Subsequent rmmod on the classifier module is then not prevented
      since all module references are already dropped.
      
      By the time, the kernel invokes the RCU callback handler from
      the module, that function address is then invalid.
      
      One way to fix it would be to add an rcu_barrier() to
      unregister_tcf_proto_ops() to wait for all pending call_rcu()s
      to complete.
      
      synchronize_rcu() is not appropriate as under heavy RCU
      callback load, registered call_rcu()s could be deferred
      longer than a grace period. In case we don't have any pending
      call_rcu()s, the barrier is allowed to return immediately.
      
      Since we came here via unregister_tcf_proto_ops(), there
      are no users of a given classifier anymore. Further nested
      call_rcu()s pointing into the module space are not being
      done anywhere.
      
      Only cls_bpf_delete_prog() may schedule a work item, to
      unlock pages eventually, but that is not in the range/context
      of cls_bpf anymore.
      
      Fixes: 25d8c0d5 ("net: rcu-ify tcf_proto")
      Fixes: 9888faef ("net: sched: cls_basic use RCU")
      Reported-by: NVijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Tested-by: NVijay Subramanian <subramanian.vijay@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c78e1746