1. 26 May 2015, 6 commits
  2. 25 May 2015, 5 commits
  3. 23 May 2015, 9 commits
    • pktgen: make /proc/net/pktgen/pgctrl report fail on invalid input · 40207264
      Committed by Jesper Dangaard Brouer
      Giving /proc/net/pktgen/pgctrl an invalid command just returns shell
      success and prints a warning in dmesg. This is not very useful for
      shell scripting, as the error can only be detected by parsing dmesg.
      
      Instead, return -EINVAL when the command is unknown, as this gives
      userspace shell scripts a way of detecting the error.
      
      Also bump the version tag to 2.75, because (1) reading
      /proc/net/pktgen/pgctrl outputs this version number, which allows this
      small semantic change to be detected, and (2) the pktgen version tag
      has not been updated since 2010.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • pktgen: adjust spacing in proc file interface output · d079abd1
      Committed by Jesper Dangaard Brouer
      Too many spaces were introduced in commit 63adc6fb ("pktgen: cleanup
      checkpatch warnings"), misaligning "src_min:" with the other columns.
      
      Fixes: 63adc6fb ("pktgen: cleanup checkpatch warnings")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: fix lockdep splat · 93a33a58
      Committed by Eric Dumazet
      The following lockdep splat was reported:
      
      [   29.382286] ===============================
      [   29.382315] [ INFO: suspicious RCU usage. ]
      [   29.382344] 4.1.0-0.rc0.git11.1.fc23.x86_64 #1 Not tainted
      [   29.382380] -------------------------------
      [   29.382409] net/bridge/br_private.h:626 suspicious
      rcu_dereference_check() usage!
      [   29.382455]
                     other info that might help us debug this:
      
      [   29.382507]
                     rcu_scheduler_active = 1, debug_locks = 0
      [   29.382549] 2 locks held by swapper/0/0:
      [   29.382576]  #0:  (((&p->forward_delay_timer))){+.-...}, at:
      [<ffffffff81139f75>] call_timer_fn+0x5/0x4f0
      [   29.382660]  #1:  (&(&br->lock)->rlock){+.-...}, at:
      [<ffffffffa0450dc1>] br_forward_delay_timer_expired+0x31/0x140
      [bridge]
      [   29.382754]
                     stack backtrace:
      [   29.382787] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
      4.1.0-0.rc0.git11.1.fc23.x86_64 #1
      [   29.382838] Hardware name: LENOVO 422916G/LENOVO, BIOS A1KT53AUS 04/07/2015
      [   29.382882]  0000000000000000 3ebfc20364115825 ffff880666603c48
      ffffffff81892d4b
      [   29.382943]  0000000000000000 ffffffff81e124e0 ffff880666603c78
      ffffffff8110bcd7
      [   29.383004]  ffff8800785c9d00 ffff88065485ac58 ffff880c62002800
      ffff880c5fc88ac0
      [   29.383065] Call Trace:
      [   29.383084]  <IRQ>  [<ffffffff81892d4b>] dump_stack+0x4c/0x65
      [   29.383130]  [<ffffffff8110bcd7>] lockdep_rcu_suspicious+0xe7/0x120
      [   29.383178]  [<ffffffffa04520f9>] br_fill_ifinfo+0x4a9/0x6a0 [bridge]
      [   29.383225]  [<ffffffffa045266b>] br_ifinfo_notify+0x11b/0x4b0 [bridge]
      [   29.383271]  [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge]
      [   29.383320]  [<ffffffffa0450de8>]
      br_forward_delay_timer_expired+0x58/0x140 [bridge]
      [   29.383371]  [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge]
      [   29.383416]  [<ffffffff8113a033>] call_timer_fn+0xc3/0x4f0
      [   29.383454]  [<ffffffff81139f75>] ? call_timer_fn+0x5/0x4f0
      [   29.383493]  [<ffffffff8110a90f>] ? lock_release_holdtime.part.29+0xf/0x200
      [   29.383541]  [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge]
      [   29.383587]  [<ffffffff8113a6a4>] run_timer_softirq+0x244/0x490
      [   29.383629]  [<ffffffff810b68cc>] __do_softirq+0xec/0x670
      [   29.383666]  [<ffffffff810b70d5>] irq_exit+0x145/0x150
      [   29.383703]  [<ffffffff8189f506>] smp_apic_timer_interrupt+0x46/0x60
      [   29.383744]  [<ffffffff8189d523>] apic_timer_interrupt+0x73/0x80
      [   29.383782]  <EOI>  [<ffffffff816f131f>] ? cpuidle_enter_state+0x5f/0x2f0
      [   29.383832]  [<ffffffff816f131b>] ? cpuidle_enter_state+0x5b/0x2f0
      
      The problem here is that br_forward_delay_timer_expired() is a timer
      handler that calls br_ifinfo_notify(), which assumes either
      rcu_read_lock() or RTNL is held.
      
      The simplest fix is to add an RCU read-side critical section.
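      A sketch of the shape of the fix (kernel context, abridged, not a
      standalone program; the rest of the timer handler is elided):
      
        static void br_forward_delay_timer_expired(unsigned long arg)
        {
            ...
            /* Timer handlers hold neither rcu_read_lock() nor RTNL, so
             * satisfy br_ifinfo_notify()'s locking assumption here. */
            rcu_read_lock();
            br_ifinfo_notify(RTM_NEWLINK, p);
            rcu_read_unlock();
            ...
        }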
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
      Reported-by: Dominick Grift <dac.override@gmail.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: 'ethtool' issue with querying phy settings · f96dee13
      Committed by Arun Parameswaran
      When trying to configure the settings for PHY1, using commands
      like 'ethtool -s eth0 phyad 1 speed 100', 'ethtool' ends up
      modifying settings other than the speed of PHY1.
      
      'ethtool' first queries the current settings, but the query returns
      the settings of PHY0, which are then used as the base when applying
      the new settings to PHY1. This causes the other settings of PHY1 to
      be wrongly configured.
      
      The issue is that '_ethtool_get_settings()', which is called to
      handle the 'ETHTOOL_GSET' command, clears the 'cmd' structure (of
      type 'struct ethtool_cmd') by calling memset. This discards any
      parameters passed in with the 'ETHTOOL_GSET' cmd, so the driver's
      callback is always invoked with 'cmd->phy_address' as '0'.
      
      '_ethtool_get_settings()' is called from other files in 'net/core',
      so the fix is applied to 'ethtool_get_settings()', which is only
      called in the context of 'ethtool'.
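      For reference, a userspace sketch of the ETHTOOL_GSET round trip that
      exposed the bug; the device name and the printed field are
      illustrative:
      
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <net/if.h>
        #include <sys/ioctl.h>
        #include <sys/socket.h>
        #include <linux/ethtool.h>
        #include <linux/sockios.h>
        
        int main(void)
        {
            struct ifreq ifr;
            struct ethtool_cmd ecmd;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);
        
            memset(&ifr, 0, sizeof(ifr));
            strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        
            memset(&ecmd, 0, sizeof(ecmd));
            ecmd.cmd = ETHTOOL_GSET;
            ecmd.phy_address = 1;           /* ask about PHY1 ... */
            ifr.ifr_data = (char *)&ecmd;
        
            /* ... but before the fix, the kernel memset() wiped
             * phy_address, so the driver reported PHY0's settings. */
            if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
                printf("phy_address used: %u\n", ecmd.phy_address);
            close(fd);
            return 0;
        }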
      Signed-off-by: Arun Parameswaran <aparames@broadcom.com>
      Reviewed-by: Ray Jui <rjui@broadcom.com>
      Reviewed-by: Scott Branden <sbranden@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: fix parsing of MLDv2 reports · 47cc84ce
      Committed by Thadeu Lima de Souza Cascardo
      When more than one multicast address is present in an MLDv2 report,
      all but the first address are ignored, because the code breaks out of
      the loop if there has not been an error adding that address.
      
      This has caused failures when two guests connected through the bridge
      tried to communicate using IPv6. Neighbor discoveries would not be
      transmitted to the other guest when both used a link-local address and a
      static address.
      
      This only happens when there is an MLDv2 querier in the network.
      
      The fix breaks out of the loop only when there is a failure adding a
      multicast address.
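      The fix is essentially an inversion of the loop-exit condition (kernel
      context, abridged from the MLDv2 report parser):
      
        for (i = 0; i < num; i++) {
            /* ... extract the i-th group record ... */
            err = br_ip6_multicast_add_group(br, port, &grec->grec_mca,
                                             vid);
            if (err)
                break;      /* before the fix: if (!err) break; */
        }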
      
      The mdb before the patch:
      
      dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
      dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
      dev ovirtmgmt port bond0.86 grp ff02::2 temp
      
      After the patch:
      
      dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
      dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
      dev ovirtmgmt port bond0.86 grp ff02::fb temp
      dev ovirtmgmt port bond0.86 grp ff02::2 temp
      dev ovirtmgmt port bond0.86 grp ff02::d temp
      dev ovirtmgmt port vnet0 grp ff02::1:ff00:76 temp
      dev ovirtmgmt port bond0.86 grp ff02::16 temp
      dev ovirtmgmt port vnet1 grp ff02::1:ff00:77 temp
      dev ovirtmgmt port bond0.86 grp ff02::1:ff00:def temp
      dev ovirtmgmt port bond0.86 grp ff02::1:ffa1:40bf temp
      
      Fixes: 08b202b6 ("bridge br_multicast: IPv6 MLD support.")
      Reported-by: Rik Theys <Rik.Theys@esat.kuleuven.be>
      Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
      Tested-by: Rik Theys <Rik.Theys@esat.kuleuven.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: fill in table id when replacing a route · d4e64c29
      Committed by Michal Kubeček
      When replacing an IPv4 route, the tb_id member of the new fib_alias
      structure is not set in the replace code path, so the new route is
      ignored.
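      The fix amounts to a one-line assignment in the replace path (kernel
      context, abridged sketch):
      
        /* Carry the table id over to the alias that replaces the old
         * one, so lookups in this table still match the new route. */
        new_fa->tb_id = tb->tb_id;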
      
      Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Avoid crashing in ip_error · 381c759d
      Committed by Eric W. Biederman
      ip_error does not check if in_dev is NULL before dereferencing it.
      
      The following sequence of calls is possible:
      CPU A                          CPU B
      ip_rcv_finish
          ip_route_input_noref()
              ip_route_input_slow()
                                     inetdev_destroy()
          dst_input()
      
      With the result that a network device can be destroyed while processing
      an input packet.
      
      A crash was triggered with only unicast packets in flight, and
      forwarding enabled on the only network device. The error condition
      was created by the removal of the network device.
      
      As such, it is likely that the error code was -EHOSTUNREACH, and the
      action taken by ip_error (if in_dev had been accessible) would have
      been to not increment any counters and to try, and likely fail, to
      send an ICMP error as the network device is going away.
      
      Therefore handle this weird case by just dropping the packet if
      !in_dev.  It will result in dropping the packet sooner, and will not
      result in an actual change of behavior.
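      A sketch of the added check (kernel context, abridged from ip_error()
      in net/ipv4/route.c):
      
        static int ip_error(struct sk_buff *skb)
        {
            struct in_device *in_dev = __in_dev_get_rcu(skb->dev);
            ...
            /* The device is being destroyed: counters and ICMP replies
             * are pointless, so just drop the packet a little sooner. */
            if (!in_dev)
                goto out;
            ...
        out:
            kfree_skb(skb);
            return 0;
        }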
      
      Fixes: 251da413 ("ipv4: Cache ip_error() routes even when not forwarding.")
      Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
      Tested-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
      Signed-off-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • flow_dissector: do not break if ports are not needed in flowlabel · 12c227ec
      Committed by Jiri Pirko
      This restores the previous behaviour: if the caller does not want the
      ports to be filled in, we should not break out early.
      
      Fixes: 06635a35 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix a potential deadlock in tcp_get_info() · d654976c
      Committed by Eric Dumazet
      Taking the socket spinlock in tcp_get_info() can deadlock, as
      inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
      while packet processing can use the reverse locking order.
      
      We could avoid this locking for TCP_LISTEN states, but lockdep would
      certainly get confused, as all TCP sockets share the same lockdep
      classes.
      
      [  523.722504] ======================================================
      [  523.728706] [ INFO: possible circular locking dependency detected ]
      [  523.734990] 4.1.0-dbg-DEV #1676 Not tainted
      [  523.739202] -------------------------------------------------------
      [  523.745474] ss/18032 is trying to acquire lock:
      [  523.750002]  (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
      [  523.758129]
      [  523.758129] but task is already holding lock:
      [  523.763968]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
      [  523.774661]
      [  523.774661] which lock already depends on the new lock.
      [  523.774661]
      [  523.782850]
      [  523.782850] the existing dependency chain (in reverse order) is:
      [  523.790326]
      -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
      [  523.796599]        [<ffffffff811126bb>] lock_acquire+0xbb/0x270
      [  523.802565]        [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
      [  523.808628]        [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
      [  523.815273]        [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
      [  523.822067]        [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
      [  523.828199]        [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
      [  523.834331]        [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
      [  523.840202]        [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
      [  523.847214]        [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
      [  523.853440]        [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
      [  523.859624]        [<ffffffff81659db7>] ip_rcv+0x307/0x420
      
      Let's use the u64_sync infrastructure instead. As a bonus, 64-bit
      arches get optimized, as these operations are a nop for them.
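      A sketch of the u64_stats_sync pattern used (kernel context, abridged;
      on 64-bit arches the begin/retry pair compiles away to plain loads):
      
        /* Reader side, as in tcp_get_info(): no spinlock, just retry if
         * a writer updated the counters while we were reading them. */
        unsigned int start;
        
        do {
            start = u64_stats_fetch_begin_irq(&tp->syncp);
            info->tcpi_bytes_acked = tp->bytes_acked;
            info->tcpi_bytes_received = tp->bytes_received;
        } while (u64_stats_fetch_retry_irq(&tp->syncp, start));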
      
      Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 22 May 2015, 10 commits
    • tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info · 2efd055c
      Committed by Marcelo Ricardo Leitner
      This patch tracks the total number of inbound and outbound segments on a
      TCP socket. One may use these numbers to get an idea of connection
      quality when compared against the retransmission counters.
      
      RFC 4898 names these tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut.
      
      Each is a 32-bit field and can be fetched either from the TCP_INFO
      getsockopt(), if one has a handle on a TCP socket, or from the
      inet_diag netlink facility (an iproute2/ss patch will follow).
      
      Note that tp->segs_out was placed near tp->snd_nxt for good data
      locality and minimal performance impact, while tp->segs_in was placed
      near tp->bytes_received for the same reason.
      
      Joint work with Eric Dumazet.
      
      Note that received SYNs are accounted on the listener, but sent
      SYNACKs are not.
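      A userspace sketch of fetching the new counters (assumes kernel
      headers new enough to expose the tcpi_segs_* fields, and a connected
      TCP socket fd):
      
        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <linux/tcp.h>
        
        static void print_seg_counters(int fd)
        {
            struct tcp_info info;
            socklen_t len = sizeof(info);
        
            memset(&info, 0, sizeof(info));
            /* Compare segment counters against total retransmissions
             * to get a rough feel for connection quality. */
            if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
                printf("segs_in=%u segs_out=%u retrans=%u\n",
                       info.tcpi_segs_in, info.tcpi_segs_out,
                       info.tcpi_total_retrans);
        }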
      Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: reject locally assigned nexthop addresses · 48ed7b26
      Committed by Florian Westphal
      ip -6 addr add dead::1/128 dev eth0
      sleep 5
      ip -6 route add default via dead::1/128
      -> fails
      ip -6 addr add dead::1/128 dev eth0
      ip -6 route add default via dead::1/128
      -> succeeds
      
      The reason is that if the (nonsensical) route above is added,
      dead::1 is still subject to DAD, so the route lookup will
      pick eth0 as the output device due to the prefix route that is added
      before DAD work is started.
      
      Add an explicit test that checks whether the nexthop gateway is a
      local address.
      
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1167969
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: improve REUSEADDR/NOREUSEADDR cohabitation · 946f9eb2
      Committed by Eric Dumazet
      The inet_csk_get_port() randomization effort tends to spread
      sockets over the whole available range (ip_local_port_range).
      
      This is unfortunate because SO_REUSEADDR sockets have fewer
      requirements than non-SO_REUSEADDR ones.
      
      If an application uses the SO_REUSEADDR hint, it is to allow
      source ports to be shared.
      
      So instead of picking a random port number anywhere in
      ip_local_port_range, let's first try the lower half of the range.
      
      This gives more chances to use the upper half of the range for the
      sockets with strong requirements (those not using SO_REUSEADDR).
      
      Note this patch does not add a new sysctl; it only changes the way
      we try to pick a port number.
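      For context, the hint in question on the application side (userspace
      sketch; error handling omitted):
      
        #include <string.h>
        #include <sys/socket.h>
        #include <netinet/in.h>
        
        int make_shared_port_socket(void)
        {
            int on = 1;
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            struct sockaddr_in addr;
        
            /* Declare willingness to share a source port; port
             * autoselection (sin_port == 0) now tries the lower half of
             * ip_local_port_range first for such sockets. */
            setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
        
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_port = 0;          /* let the kernel pick a port */
            bind(fd, (struct sockaddr *)&addr, sizeof(addr));
            return fd;
        }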
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Flavio Leitner <fbl@redhat.com>
      Acked-by: Flavio Leitner <fbl@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet_hashinfo: remove bsocket counter · f5af1f57
      Committed by Eric Dumazet
      We no longer need the bsocket atomic counter, as inet_csk_get_port()
      calls bind_conflict() regardless of its value, after commit
      2b05ad33 ("tcp: bind() fix autoselection to share ports").
      
      This patch removes the overhead of maintaining this counter and the
      double inet_csk_get_port() calls under pressure.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Flavio Leitner <fbl@redhat.com>
      Acked-by: Flavio Leitner <fbl@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: ensure epoll edge trigger wakeup when write queue is empty · ce5ec440
      Committed by Jason Baron
      We currently rely on the setting of SOCK_NOSPACE in the write()
      path to ensure that we wake up any epoll edge trigger waiters when
      acks return to free space in the write queue. However, if we fail
      to allocate even a single skb in the write queue, we could end up
      waiting indefinitely.
      
      Fix this by explicitly issuing a wakeup when we detect the condition
      of an empty write queue and a return value of -EAGAIN. This allows
      userspace to re-try as we expect this to be a temporary failure.
      
      I've tested this approach by artificially making
      sk_stream_alloc_skb() return NULL periodically. In that case,
      epoll edge trigger waiters will hang indefinitely in epoll_wait()
      without this patch.
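      The userspace pattern this keeps safe (sketch; assumes a non-blocking
      socket already registered with EPOLLOUT | EPOLLET):
      
        #include <errno.h>
        #include <sys/epoll.h>
        #include <sys/socket.h>
        #include <sys/types.h>
        
        /* Edge-triggered writer: on EAGAIN, park in epoll_wait() and
         * rely on the kernel wakeup, which is now issued even when skb
         * allocation failed with an empty write queue. */
        static ssize_t et_send(int epfd, int fd, const void *buf,
                               size_t len)
        {
            for (;;) {
                ssize_t n = send(fd, buf, len, 0);
                struct epoll_event ev;
        
                if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
                    return n;
                /* Without the fix, this wait could block forever. */
                if (epoll_wait(epfd, &ev, 1, -1) < 0)
                    return -1;
            }
        }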
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: fix call_rcu() race on classifier module unloads · c78e1746
      Committed by Daniel Borkmann
      Vijay reported that a loop as simple as ...
      
        while true; do
          tc qdisc add dev foo root handle 1: prio
          tc filter add dev foo parent 1: u32 match u32 0 0  flowid 1
          tc qdisc del dev foo root
          rmmod cls_u32
        done
      
      ... will panic the kernel. Moreover, he bisected the change
      apparently introducing it to 78fd1d0a ("netlink: Re-add
      locking to netlink_lookup() and seq walker").
      
      The removal of synchronize_net() from the netlink socket path that
      triggers removal of the qdisc seems to have uncovered an RCU and
      module reference count race in the tc API. Given that the RCU
      conversion was done after e341694e ("netlink: Convert
      netlink_lookup() to use RCU protected hash table"), which originally
      added the synchronize_net(), the occasion of hitting the bug was
      less likely (not impossible though):
      
      When qdiscs that i) support attaching classifiers and,
      ii) have at least one of them attached, get deleted, they
      invoke tcf_destroy_chain(), and thus call into ->destroy()
      handler from a classifier module.
      
      After the RCU conversion, all classifiers that have an internal
      prio list unlink it and initiate freeing via call_rcu()
      deferral.
      
      Meanwhile, tcf_destroy() already releases its reference to the
      tp->ops->owner module before the queued RCU callback handler
      has been invoked.
      
      Subsequent rmmod on the classifier module is then not prevented
      since all module references are already dropped.
      
      By the time the kernel invokes the RCU callback handler from
      the module, that function address is invalid.
      
      One way to fix it would be to add an rcu_barrier() to
      unregister_tcf_proto_ops() to wait for all pending call_rcu()s
      to complete.
      
      synchronize_rcu() is not appropriate as under heavy RCU
      callback load, registered call_rcu()s could be deferred
      longer than a grace period. In case we don't have any pending
      call_rcu()s, the barrier is allowed to return immediately.
      
      Since we came here via unregister_tcf_proto_ops(), there
      are no users of a given classifier anymore. Further nested
      call_rcu()s pointing into the module space are not being
      done anywhere.
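      A sketch of that approach (kernel context, abridged):
      
        void unregister_tcf_proto_ops(struct tcf_proto_ops *ops)
        {
            /* ... unlink ops from the classifier list under the
             * write lock ... */
        
            /* Flush any call_rcu() callbacks that may still point into
             * this module; returns immediately if none are pending. */
            rcu_barrier();
        }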
      
      Only cls_bpf_delete_prog() may schedule a work item, to
      unlock pages eventually, but that is not in the range/context
      of cls_bpf anymore.
      
      Fixes: 25d8c0d5 ("net: rcu-ify tcf_proto")
      Fixes: 9888faef ("net: sched: cls_basic use RCU")
      Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
      Committed by Alexei Starovoitov
      Introduce the bpf_tail_call(ctx, &jmp_table, index) helper function,
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail is that it's not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding
      an extra call frame.
      It's trivially done in interpreter and a bit trickier in JITs.
      In case of x64 JIT the bigger part of generated assembler prologue
      is common for all programs, so it is simply skipped while jumping.
      Other JITs can do similar prologue-skipping optimization or
      do stack unwind before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      Since all BPF programs are identified by file descriptor, user space
      needs to populate the jmp_table with FDs of other BPF programs.
      If jmp_table[index] is empty, bpf_tail_call() doesn't jump anywhere
      and program execution continues as normal.
      
      New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
      populate this jmp_table array with FDs of other bpf programs.
      Programs can share the same jmp_table array or use multiple jmp_tables.
      
      The chain of tail calls can form unpredictable dynamic loops, therefore
      tail_call_cnt is used to limit the number of calls; it is currently set
      to 32.
      
      Use cases:
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      
      ==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
      - for TC use case, bpf_tail_call() allows to implement reclassify-like logic
      
      - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
        are atomic, so user space can build chains of BPF programs on the fly
      
      Implementation details:
      =======================
      - high performance of bpf_tail_call() is the goal.
        It could have been implemented without JIT changes as a wrapper on top of
        BPF_PROG_RUN() macro, but with two downsides:
        . all programs would have to pay performance penalty for this feature and
          tail call itself would be slower, since mandatory stack unwind, return,
          stack allocate would be done for every tailcall.
        . tailcall would be limited to programs running preempt_disabled, since
          generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
          need to be either global per_cpu variable accessed by helper and by wrapper
          or global variable protected by locks.
      
        In this implementation x64 JIT bypasses stack unwind and jumps into the
        callee program after prologue.
      
      - bpf_prog_array_compatible() ensures that the prog_type of callee and
        caller are the same and that the JITed/non-JITed flag is the same,
        since calling a JITed program from a non-JITed one is invalid because
        their stack frames are different. Similarly, calling a kprobe type
        program from a socket type program is invalid.
      
      - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
        abstraction, its user space API and all of verifier logic.
        It's in the existing arraymap.c file, since several functions are
        shared with regular array map.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dev: reduce both ingress hook ifdefs · e7582bab
      Committed by Daniel Borkmann
      Reduce ifdef pollution slightly, no functional change. We can simply
      remove the extra alternative definition of handle_ing() and nf_ingress().
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add a force_schedule argument to sk_stream_alloc_skb() · eb934478
      Committed by Eric Dumazet
      In commit 8e4d980a ("tcp: fix behavior for epoll edge trigger")
      we fixed a possible hang of TCP sockets under memory pressure,
      by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
      if no packet is in socket write queue.
      
      It turns out there are other cases where we want to force memory
      schedule :
      
      tcp_fragment() & tso_fragment() need to split a big TSO packet into
      two smaller ones. If we block here because of TCP memory pressure,
      we can effectively block the TCP socket from sending new data.
      If no further ACK is coming, this hang would be definitive, and the
      socket would have no chance to effectively reduce its memory usage.
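      The resulting helper signature (kernel context; fragmenting callers
      pass force_schedule=true so they are never stalled by the memory
      accounting):
      
        /* force_schedule lets the allocation fall back to
         * sk_forced_mem_schedule() under memory pressure. */
        struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size,
                                            gfp_t gfp,
                                            bool force_schedule);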
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neigh: Better handling of transition to NUD_PROBE state · 765c9c63
      Committed by Erik Kline
      [1] When entering NUD_PROBE state via neigh_update(), perhaps received
          from userspace, correctly (re)initialize the probes count to zero.
      
          This is useful for forcing revalidation of a neighbor (for example
          if the host is attempting to do DNA [IPv4 4436, IPv6 6059]).
      
      [2] Notify listeners when a neighbor goes into NUD_PROBE state.
      
          By sending notifications on entry to NUD_PROBE state listeners get
          more timely warnings of imminent connectivity issues.
      
          The current notifications on entry to NUD_STALE have somewhat
          limited usefulness: NUD_STALE is a perfectly normal state, as is
          NUD_DELAY, whereas notifications on entry to NUD_FAILURE come after
          a neighbor reachability problem has been confirmed (typically after
          three probes).
      Signed-off-by: Erik Kline <ek@google.com>
      Acked-by: Lorenzo Colitti <lorenzo@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 21 May 2015, 2 commits
    • ipv6: fix ECMP route replacement · 27596472
      Committed by Michal Kubeček
      When replacing an IPv6 multipath route with "ip route replace", i.e.
      NLM_F_CREATE | NLM_F_REPLACE, fib6_add_rt2node() replaces only the
      first matching route without fixing its siblings, resulting in a
      corrupted siblings linked list; removing one of the siblings can then
      end in an infinite loop.
      
      IPv6 ECMP implementation is a bit different from IPv4 so that route
      replacement cannot work in exactly the same way. This should be a
      reasonable approximation:
      
      1. If the new route is ECMP-able and there is a matching ECMP-able one
      already, replace it and all its siblings (if any).
      
      2. If the new route is ECMP-able and no matching ECMP-able route exists,
      replace first matching non-ECMP-able (if any) or just add the new one.
      
      3. If the new route is not ECMP-able, replace first matching
      non-ECMP-able route (if any) or add the new route.
      
      We also need to remove the NLM_F_REPLACE flag after replacing the old
      route(s) with the first nexthop of an ECMP route, so that each
      subsequent nexthop does not replace the previous one.
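      A sketch of the flag handling in the multipath add loop (kernel
      context, abridged; the exact field access may differ from the patch):
      
        err = ip6_route_add(&r_cfg);
        if (err)
            goto cleanup;
        /* The first nexthop has done the replacement; subsequent
         * nexthops must be appended as siblings, not replace it. */
        cfg->fc_nlinfo.nlh->nlmsg_flags &= ~NLM_F_REPLACE;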
      
      Fixes: 51ebd318 ("ipv6: add support of equal cost multipath (ECMP)")
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: do not delete previously existing ECMP routes if add fails · 35f1b4e9
      Committed by Michal Kubeček
      If adding a nexthop of an IPv6 multipath route fails, a comment in
      ip6_route_multipath() says we are going to delete all nexthops already
      added. However, the current implementation deletes even the routes it
      hasn't tried to add yet. For example, running
      
        ip route add 1234:5678::/64 \
            nexthop via fe80::aa dev dummy1 \
            nexthop via fe80::bb dev dummy1 \
            nexthop via fe80::cc dev dummy1
      
      twice results in removing all the routes the first command added.
      
      Limit the second (delete) run to nexthops that succeeded in the first
      (add) run.
      
      Fixes: 51ebd318 ("ipv6: add support of equal cost multipath (ECMP)")
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 20 May 2015, 8 commits
    • Revert "netfilter: bridge: query conntrack about skb dnat" · faecbb45
      Committed by Florian Westphal
      This reverts commit c055d5b0.
      
      There are two issues:
      'dnat_took_place' made me think that this is related to
      -j DNAT/MASQUERADE.
      
      But that's only one part of the story. This is also relevant for SNAT
      when we undo the SNAT translation in the reverse/reply direction.
      
      Furthermore, I originally wanted to do this mainly to avoid
      storing ipv6 addresses once we make DNAT/REDIRECT work
      for ipv6 on bridges.
      
      However, I forgot about SNPT/DNPT, which are stateless.
      
      So we can't escape storing addresses for ipv6 anyway. Might as
      well do it for ipv4 too.
      Reported-and-tested-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: ensure number of counters is >0 in do_replace() · 1086bbe9
      Committed by Dave Jones
      After improving setsockopt() coverage in trinity, I started triggering
      vmalloc failures pretty reliably from this code path:
      
      warn_alloc_failed+0xe9/0x140
      __vmalloc_node_range+0x1be/0x270
      vzalloc+0x4b/0x50
      __do_replace+0x52/0x260 [ip_tables]
      do_ipt_set_ctl+0x15d/0x1d0 [ip_tables]
      nf_setsockopt+0x65/0x90
      ip_setsockopt+0x61/0xa0
      raw_setsockopt+0x16/0x60
      sock_common_setsockopt+0x14/0x20
      SyS_setsockopt+0x71/0xd0
      
      It turns out we don't validate that the num_counters field in the
      struct we pass in from userspace is initialized.
      
      The same problem also exists in ebtables, arptables, ipv6, and the
      compat variants.
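      The shape of the added validation (kernel context, abridged sketch;
      the equivalent check is repeated in each do_replace() variant):
      
        struct ipt_replace tmp;
        
        if (copy_from_user(&tmp, user, sizeof(tmp)) != 0)
            return -EFAULT;
        
        /* An uninitialized/zero count would lead to a zero-sized
         * vzalloc() of the counter array below; reject it up front. */
        if (tmp.num_counters == 0)
            return -EINVAL;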
      Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nfnetlink_{log,queue}: Register pernet in first place · 3bfe0498
      Committed by Francesco Ruggeri
      nfnetlink_{log,queue}_init() register the netlink callback nf*_rcv_nl_event
      before registering the pernet_subsys, but the callback relies on data
      structures allocated by pernet init functions.
      
      When nfnetlink_{log,queue} is loaded, if a netlink message is received after
      the netlink callback is registered but before the pernet_subsys is registered,
      the kernel will panic in the sequence
      
      nfulnl_rcv_nl_event
        nfnl_log_pernet
          net_generic
            BUG_ON(id == 0)  where id is nfnl_log_net_id.
      
      The panic can be easily reproduced in 4.0.3 by:
      
      while true ;do modprobe nfnetlink_log ; rmmod nfnetlink_log ; done &
      while true ;do ip netns add dummy ; ip netns del dummy ; done &
      
      This patch moves register_pernet_subsys() earlier in
      nfnetlink_log_init() (see the sketch below).
      
      Notice that the BUG_ON hit in 4.0.3 was recently removed in 2591ffd3
      ["netns: remove BUG_ONs from net_generic()"].
      Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • tcp: add rfc3168, section 6.1.1.1. fallback · 49213555
      Committed by Daniel Borkmann
      This work is a follow-up to commit f7b3bec6 ("net: allow setting ecn
      via routing table") and adds the RFC 3168, section 6.1.1.1. fallback
      for outgoing ECN connections. In other words, this work adds a retry
      with a non-ECN setup SYN packet on the first timeout, as suggested by
      the RFC:
      
        [...] A host that receives no reply to an ECN-setup SYN within the
        normal SYN retransmission timeout interval MAY resend the SYN and
        any subsequent SYN retransmissions with CWR and ECE cleared. [...]
      
      Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
      that is, Linux default since 2009 via commit 255cac91 ("tcp: extend
      ECN sysctl to allow server-side only ECN"):
      
       1) Normal ECN-capable path:
      
          SYN ECE CWR ----->
                      <----- SYN ACK ECE
                  ACK ----->
      
       2) Path with broken middlebox, when client has fallback:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
                  SYN ----->
                      <----- SYN ACK
                  ACK ----->
      
      In case we would not have the fallback implemented, the middlebox drop
      point would basically end up as:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
      
      In any case, it's a rather small percentage of sites where such
      additional setup latency would occur: at the end of 2014 it was found
      that ~56% of IPv4 and 65% of IPv6 servers on the Alexa 1 million list
      would negotiate ECN (aka the tcp_ecn=2 default), while 0.42% of these
      webservers will fail to connect when trying to negotiate with ECN
      (tcp_ecn=1) due to timeouts, which the fallback would mitigate with a
      slight latency trade-off. A recent related paper on this topic:
      
        Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
        Gorry Fairhurst, and Richard Scheffenegger:
          "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
          Proc. PAM 2015, New York.
        http://ecn.ethz.ch/ecn-pam15.pdf
      
      Thus, when net.ipv4.tcp_ecn=1 is set, the patch will perform the
      RFC 3168, section 6.1.1.1. fallback on timeout. For users explicitly
      not wanting this, which can be the case in DC deployments, we add a
      net.ipv4.tcp_ecn_fallback knob that allows disabling the fallback.
      
      tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
      rather we let tcp_ecn_rcv_synack() take that over on input path in case a
      SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
      ECN being negotiated eventually in that case.
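      A sketch of the fallback on the output side (kernel context, abridged;
      invoked on the first SYN retransmission):
      
        static void tcp_ecn_clear_syn(struct sock *sk, struct sk_buff *skb)
        {
            /* With the fallback knob enabled, retransmit the SYN without
             * ECE|CWR; tp->ecn_flags are left for tcp_ecn_rcv_synack(). */
            if (sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback)
                TCP_SKB_CB(skb)->tcp_flags &= ~(TCPHDR_ECE | TCPHDR_CWR);
        }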
      
      Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
      Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
      Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Dave That <dave.taht@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: don't over-send F-RTO probes · b7b0ed91
      Committed by Yuchung Cheng
      After sending the new data packets to probe (step 2), F-RTO may
      incorrectly send more probes if the next ACK advances SND_UNA and
      does not SACK new packets. However, per RFC 5682, F-RTO probes at
      most once. This bug may cause the sender to always send new data
      instead of repairing holes, inducing longer HoL blocking on the
      receiver for the application.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: only undo on partial ACKs in CA_Loss · da34ac76
      Committed by Yuchung Cheng
      Undo based on TCP timestamps should only happen on ACKs that advance
      SND_UNA, according to the Eifel algorithm in RFC 3522:
      
      Section 3.2:
      
        (4) If the value of the Timestamp Echo Reply field of the
            acceptable ACK's Timestamps option is smaller than the
            value of RetransmitTS, then proceed to step (5),
      
      Section Terminology:
         We use the term 'acceptable ACK' as defined in [RFC793].  That is an
         ACK that acknowledges previously unacknowledged data.
      
      This is because upon receiving an out-of-order packet, the receiver
      returns the last timestamp that advances RCV_NXT, not the current
      timestamp of the packet in the DUPACK. Without checking the flag,
      the DUPACK will cause tcp_packet_delayed() to return true and
      tcp_try_undo_loss() will revert cwnd reduction.
      
      Note that we check the condition in CA_Recovery already by only
      calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or
      tcp_try_undo_recovery() if snd_una crosses high_seq.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6/udp: Fix ipv6 multicast socket filter regression · 33b4b015
      Committed by Henning Rogge
      Commit 5cf3d461 ("udp: Simplify __udp*_lib_mcast_deliver")
      simplified the filter for incoming IPv6 multicast but removed
      the check of the local socket address against the UDP destination
      address.
      
      This patch restores the filter to prevent sockets bound to an IPv6
      multicast IP from receiving other UDP traffic, like unicast.
      Signed-off-by: Henning Rogge <hrogge@gmail.com>
      Fixes: 5cf3d461 ("udp: Simplify __udp*_lib_mcast_deliver")
      Cc: "David S. Miller" <davem@davemloft.net>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Return error instead of partial read for saved syn headers · aea0929e
      Committed by Eric B Munson
      Currently, a getsockopt() call requesting the cached contents of the
      SYN packet headers will fail silently if the caller uses a buffer that
      is too small to contain the requested data. Rather than fail silently
      and discard the headers, getsockopt() should return an error and
      report the required size to hold the data.
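      A userspace sketch of the resulting contract (assumes the
      TCP_SAVED_SYN socket option from <linux/tcp.h> introduced by this
      series; helper name is illustrative):
      
        #include <errno.h>
        #include <stdio.h>
        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <linux/tcp.h>
        
        static int read_saved_syn(int fd)
        {
            unsigned char buf[512];
            socklen_t len = sizeof(buf);
        
            /* On a too-small buffer this now fails with an error and
             * reports the required size in len, instead of silently
             * discarding the saved headers. */
            if (getsockopt(fd, IPPROTO_TCP, TCP_SAVED_SYN, buf, &len) < 0) {
                fprintf(stderr, "saved SYN read failed: %d (need %u bytes?)\n",
                        errno, (unsigned)len);
                return -1;
            }
            printf("saved SYN is %u bytes\n", (unsigned)len);
            return 0;
        }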
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>