1. 18 9月, 2015 3 次提交
    • E
      netfilter: Pass net into okfn · 0c4b51f0
      Eric W. Biederman 提交于
      This is immediately motivated by the bridge code that chains functions that
      call into netfilter.  Without passing net into the okfns the bridge code would
      need to guess about the best expression for the network namespace to process
      packets in.
      
      As net is frequently one of the first things computed in continuation functions
      after netfilter has done it's job passing in the desired network namespace is in
      many cases a code simplification.
      
      To support this change the function dst_output_okfn is introduced to
      simplify passing dst_output as an okfn.  For the moment dst_output_okfn
      just silently drops the struct net.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c4b51f0
    • E
      netfilter: Pass struct net into the netfilter hooks · 29a26a56
      Eric W. Biederman 提交于
      Pass a network namespace parameter into the netfilter hooks.  At the
      call site of the netfilter hooks the path a packet is taking through
      the network stack is well known which allows the network namespace to
      be easily and reliabily.
      
      This allows the replacement of magic code like
      "dev_net(state->in?:state->out)" that appears at the start of most
      netfilter hooks with "state->net".
      
      In almost all cases the network namespace passed in is derived
      from the first network device passed in, guaranteeing those
      paths will not see any changes in practice.
      
      The exceptions are:
      xfrm/xfrm_output.c:xfrm_output_resume()         xs_net(skb_dst(skb)->xfrm)
      ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont()      ip_vs_conn_net(cp)
      ipvs/ip_vs_xmit.c:ip_vs_send_or_cont()          ip_vs_conn_net(cp)
      ipv4/raw.c:raw_send_hdrinc()                    sock_net(sk)
      ipv6/ip6_output.c:ip6_xmit()			sock_net(sk)
      ipv6/ndisc.c:ndisc_send_skb()                   dev_net(skb->dev) not dev_net(dst->dev)
      ipv6/raw.c:raw6_send_hdrinc()                   sock_net(sk)
      br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev
      
      In all cases these exceptions seem to be a better expression for the
      network namespace the packet is being processed in then the historic
      "dev_net(in?in:out)".  I am documenting them in case something odd
      pops up and someone starts trying to track down what happened.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29a26a56
    • E
      arp: Introduce arp_xmit_finish · f9e4306f
      Eric W. Biederman 提交于
      The function dev_queue_xmit_skb_sk is unncessary and very confusing.
      Introduce arp_xmit_finish to remove the need for dev_queue_xmit_skb_sk,
      and have arp_xmit_finish call dev_queue_xmit.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f9e4306f
  2. 14 8月, 2015 1 次提交
    • D
      net: Fix up inet_addr_type checks · 30bbaa19
      David Ahern 提交于
      Currently inet_addr_type and inet_dev_addr_type expect local addresses
      to be in the local table. With the VRF device local routes for devices
      associated with a VRF will be in the table associated with the VRF.
      Provide an alternate inet_addr lookup to use a specific table rather
      than defaulting to the local table.
      
      inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
      if the passed in device is enslaved to a VRF then the table for that VRF
      is used for the lookup.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30bbaa19
  3. 29 7月, 2015 1 次提交
    • E
      arp: filter NOARP neighbours for SIOCGARP · 11c91ef9
      Eric Dumazet 提交于
      When arp is off on a device, and ioctl(SIOCGARP) is queried,
      a buggy answer is given with MAC address of the device, instead
      of the mac address of the destination/gateway.
      
      We filter out NUD_NOARP neighbours for /proc/net/arp,
      we must do the same for SIOCGARP ioctl.
      
      Tested:
      
      lpaa23:~# ./arp 10.246.7.190
      MAC=00:01:e8:22:cb:1d      // correct answer
      
      lpaa23:~# ip link set dev eth0 arp off
      lpaa23:~# cat /proc/net/arp   # check arp table is now 'empty'
      IP address       HW type     Flags       HW address    Mask     Device
      lpaa23:~# ./arp 10.246.7.190
      MAC=00:1a:11:c3:0d:7f   // buggy answer before patch (this is eth0 mac)
      
      After patch :
      
      lpaa23:~# ip link set dev eth0 arp off
      lpaa23:~# ./arp 10.246.7.190
      ioctl(SIOCGARP) failed: No such device or address
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NVytautas Valancius <valas@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      11c91ef9
  4. 22 7月, 2015 1 次提交
  5. 08 4月, 2015 1 次提交
    • D
      netfilter: Pass socket pointer down through okfn(). · 7026b1dd
      David Miller 提交于
      On the output paths in particular, we have to sometimes deal with two
      socket contexts.  First, and usually skb->sk, is the local socket that
      generated the frame.
      
      And second, is potentially the socket used to control a tunneling
      socket, such as one the encapsulates using UDP.
      
      We do not want to disassociate skb->sk when encapsulating in order
      to fix this, because that would break socket memory accounting.
      
      The most extreme case where this can cause huge problems is an
      AF_PACKET socket transmitting over a vxlan device.  We hit code
      paths doing checks that assume they are dealing with an ipv4
      socket, but are actually operating upon the AF_PACKET one.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7026b1dd
  6. 04 4月, 2015 2 次提交
  7. 04 3月, 2015 1 次提交
    • E
      neigh: Factor out ___neigh_lookup_noref · 60395a20
      Eric W. Biederman 提交于
      While looking at the mpls code I found myself writing yet another
      version of neigh_lookup_noref.  We currently have __ipv4_lookup_noref
      and __ipv6_lookup_noref.
      
      So to make my work a little easier and to make it a smidge easier to
      verify/maintain the mpls code in the future I stopped and wrote
      ___neigh_lookup_noref.  Then I rewote __ipv4_lookup_noref and
      __ipv6_lookup_noref in terms of this new function.  I tested my new
      version by verifying that the same code is generated in
      ip_finish_output2 and ip6_finish_output2 where these functions are
      inlined.
      
      To get to ___neigh_lookup_noref I added a new neighbour cache table
      function key_eq.  So that the static size of the key would be
      available.
      
      I also added __neigh_lookup_noref for people who want to to lookup
      a neighbour table entry quickly but don't know which neibhgour table
      they are going to look up.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60395a20
  8. 03 3月, 2015 3 次提交
  9. 12 11月, 2014 1 次提交
  10. 29 9月, 2014 1 次提交
  11. 02 1月, 2014 1 次提交
  12. 29 12月, 2013 1 次提交
  13. 12 12月, 2013 1 次提交
  14. 10 12月, 2013 2 次提交
  15. 04 9月, 2013 1 次提交
    • T
      net: neighbour: Remove CONFIG_ARPD · 3e25c65e
      Tim Gardner 提交于
      This config option is superfluous in that it only guards a call
      to neigh_app_ns(). Enabling CONFIG_ARPD by default has no
      change in behavior. There will now be call to __neigh_notify()
      for each ARP resolution, which has no impact unless there is a
      user space daemon waiting to receive the notification, i.e.,
      the case for which CONFIG_ARPD was designed anyways.
      Suggested-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NTim Gardner <tim.gardner@canonical.com>
      Reviewed-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e25c65e
  16. 29 5月, 2013 2 次提交
  17. 27 3月, 2013 1 次提交
    • Y
      firewire net, ipv4 arp: Extend hardware address and remove driver-level packet inspection. · 6752c8db
      YOSHIFUJI Hideaki / 吉藤英明 提交于
      Inspection of upper layer protocol is considered harmful, especially
      if it is about ARP or other stateful upper layer protocol; driver
      cannot (and should not) have full state of them.
      
      IPv4 over Firewire module used to inspect ARP (both in sending path
      and in receiving path), and record peer's GUID, max packet size, max
      speed and fifo address.  This patch removes such inspection by extending
      our "hardware address" definition to include other information as well:
      max packet size, max speed and fifo.  By doing this, The neighbour
      module in networking subsystem can cache them.
      
      Note: As we have started ignoring sspd and max_rec in ARP/NDP, those
            information will not be used in the driver when sending.
      
      When a packet is being sent, the IP layer fills our pseudo header with
      the extended "hardware address", including GUID and fifo.  The driver
      can look-up node-id (the real but rather volatile low-level address)
      by GUID, and then the module can send the packet to the wire using
      parameters provided in the extendedn hardware address.
      
      This approach is realistic because IP over IEEE1394 (RFC2734) and IPv6
      over IEEE1394 (RFC3146) share same "hardware address" format
      in their address resolution protocols.
      
      Here, extended "hardware address" is defined as follows:
      
      union fwnet_hwaddr {
      	u8 u[16];
      	struct {
      		__be64 uniq_id;		/* EUI-64			*/
      		u8 max_rec;		/* max packet size		*/
      		u8 sspd;		/* max speed			*/
      		__be16 fifo_hi;		/* hi 16bits of FIFO addr	*/
      		__be32 fifo_lo;		/* lo 32bits of FIFO addr	*/
      	} __packed uc;
      };
      
      Note that Hardware address is declared as union, so that we can map full
      IP address into this, when implementing MCAP (Multicast Cannel Allocation
      Protocol) for IPv6, but IP and ARP subsystem do not need to know this
      format in detail.
      
      One difference between original ARP (RFC826) and 1394 ARP (RFC2734)
      is that 1394 ARP Request/Reply do not contain the target hardware address
      field (aka ar$tha).  This difference is handled in the ARP subsystem.
      
      CC: Stephan Gatzka <stephan.gatzka@gmail.com>
      Signed-off-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6752c8db
  18. 19 2月, 2013 2 次提交
  19. 11 2月, 2013 1 次提交
  20. 25 12月, 2012 1 次提交
  21. 22 12月, 2012 1 次提交
    • E
      ipv4: arp: fix a lockdep splat in arp_solicit() · 9650388b
      Eric Dumazet 提交于
      Yan Burman reported following lockdep warning :
      
      =============================================
      [ INFO: possible recursive locking detected ]
      3.7.0+ #24 Not tainted
      ---------------------------------------------
      swapper/1/0 is trying to acquire lock:
        (&n->lock){++--..}, at: [<ffffffff8139f56e>] __neigh_event_send
      +0x2e/0x2f0
      
      but task is already holding lock:
        (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit+0x1d4/0x280
      
      other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&n->lock);
         lock(&n->lock);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
      4 locks held by swapper/1/0:
        #0:  (((&n->timer))){+.-...}, at: [<ffffffff8104b350>]
      call_timer_fn+0x0/0x1c0
        #1:  (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit
      +0x1d4/0x280
        #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff81395400>]
      dev_queue_xmit+0x0/0x5d0
        #3:  (rcu_read_lock_bh){.+....}, at: [<ffffffff813cb41e>]
      ip_finish_output+0x13e/0x640
      
      stack backtrace:
      Pid: 0, comm: swapper/1 Not tainted 3.7.0+ #24
      Call Trace:
        <IRQ>  [<ffffffff8108c7ac>] validate_chain+0xdcc/0x11f0
        [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
        [<ffffffff81120565>] ? kmem_cache_free+0xe5/0x1c0
        [<ffffffff8108d570>] __lock_acquire+0x440/0xc30
        [<ffffffff813c3570>] ? inet_getpeer+0x40/0x600
        [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
        [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
        [<ffffffff8108ddf5>] lock_acquire+0x95/0x140
        [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
        [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
        [<ffffffff81448d4b>] _raw_write_lock_bh+0x3b/0x50
        [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
        [<ffffffff8139f56e>] __neigh_event_send+0x2e/0x2f0
        [<ffffffff8139f99b>] neigh_resolve_output+0x16b/0x270
        [<ffffffff813cb62d>] ip_finish_output+0x34d/0x640
        [<ffffffff813cb41e>] ? ip_finish_output+0x13e/0x640
        [<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
        [<ffffffff813cb9a0>] ip_output+0x80/0xf0
        [<ffffffff813ca368>] ip_local_out+0x28/0x80
        [<ffffffffa046f25a>] vxlan_xmit+0x66a/0xbec [vxlan]
        [<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
        [<ffffffff81394a50>] ? skb_gso_segment+0x2b0/0x2b0
        [<ffffffff81449355>] ? _raw_spin_unlock_irqrestore+0x65/0x80
        [<ffffffff81394c57>] ? dev_queue_xmit_nit+0x207/0x270
        [<ffffffff813950c8>] dev_hard_start_xmit+0x298/0x5d0
        [<ffffffff813956f3>] dev_queue_xmit+0x2f3/0x5d0
        [<ffffffff81395400>] ? dev_hard_start_xmit+0x5d0/0x5d0
        [<ffffffff813f5788>] arp_xmit+0x58/0x60
        [<ffffffff813f59db>] arp_send+0x3b/0x40
        [<ffffffff813f6424>] arp_solicit+0x204/0x280
        [<ffffffff813a1a70>] ? neigh_add+0x310/0x310
        [<ffffffff8139f515>] neigh_probe+0x45/0x70
        [<ffffffff813a1c10>] neigh_timer_handler+0x1a0/0x2a0
        [<ffffffff8104b3cf>] call_timer_fn+0x7f/0x1c0
        [<ffffffff8104b350>] ? detach_if_pending+0x120/0x120
        [<ffffffff8104b748>] run_timer_softirq+0x238/0x2b0
        [<ffffffff813a1a70>] ? neigh_add+0x310/0x310
        [<ffffffff81043e51>] __do_softirq+0x101/0x280
        [<ffffffff814518cc>] call_softirq+0x1c/0x30
        [<ffffffff81003b65>] do_softirq+0x85/0xc0
        [<ffffffff81043a7e>] irq_exit+0x9e/0xc0
        [<ffffffff810264f8>] smp_apic_timer_interrupt+0x68/0xa0
        [<ffffffff8145122f>] apic_timer_interrupt+0x6f/0x80
        <EOI>  [<ffffffff8100a054>] ? mwait_idle+0xa4/0x1c0
        [<ffffffff8100a04b>] ? mwait_idle+0x9b/0x1c0
        [<ffffffff8100a6a9>] cpu_idle+0x89/0xe0
        [<ffffffff81441127>] start_secondary+0x1b2/0x1b6
      
      Bug is from arp_solicit(), releasing the neigh lock after arp_send()
      In case of vxlan, we eventually need to write lock a neigh lock later.
      
      Its a false positive, but we can get rid of it without lockdep
      annotations.
      
      We can instead use neigh_ha_snapshot() helper.
      Reported-by: NYan Burman <yanb@mellanox.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9650388b
  22. 19 11月, 2012 1 次提交
    • E
      net: Allow userns root to control ipv4 · 52e804c6
      Eric W. Biederman 提交于
      Allow an unpriviled user who has created a user namespace, and then
      created a network namespace to effectively use the new network
      namespace, by reducing capable(CAP_NET_ADMIN) and
      capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
      CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.
      
      Settings that merely control a single network device are allowed.
      Either the network device is a logical network device where
      restrictions make no difference or the network device is hardware NIC
      that has been explicity moved from the initial network namespace.
      
      In general policy and network stack state changes are allowed
      while resource control is left unchanged.
      
      Allow creating raw sockets.
      Allow the SIOCSARP ioctl to control the arp cache.
      Allow the SIOCSIFFLAG ioctl to allow setting network device flags.
      Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address.
      Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address.
      Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address.
      Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask.
      Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
      adding, changing and deleting gre tunnels.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
      adding, changing and deleting ipip tunnels.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
      adding, changing and deleting ipsec virtual tunnel interfaces.
      
      Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
      MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
      sockets.
      
      Allow setting and receiving IPOPT_CIPSO, IP_OPT_SEC, IP_OPT_SID and
      arbitrary ip options.
      
      Allow setting IP_SEC_POLICY/IP_XFRM_POLICY ipv4 socket option.
      Allow setting the IP_TRANSPARENT ipv4 socket option.
      Allow setting the TCP_REPAIR socket option.
      Allow setting the TCP_CONGESTION socket option.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52e804c6
  23. 19 9月, 2012 1 次提交
  24. 08 9月, 2012 1 次提交
  25. 27 7月, 2012 1 次提交
  26. 21 7月, 2012 2 次提交
  27. 28 6月, 2012 2 次提交
    • D
      Revert "ipv4: tcp: dont cache unconfirmed intput dst" · c10237e0
      David S. Miller 提交于
      This reverts commit c074da28.
      
      This change has several unwanted side effects:
      
      1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
         thus never create a real cached route.
      
      2) All TCP traffic will use DST_NOCACHE and never use the routing
         cache at all.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c10237e0
    • E
      ipv4: tcp: dont cache unconfirmed intput dst · c074da28
      Eric Dumazet 提交于
      DDOS synflood attacks hit badly IP route cache.
      
      On typical machines, this cache is allowed to hold up to 8 Millions dst
      entries, 256 bytes for each, for a total of 2GB of memory.
      
      rt_garbage_collect() triggers and tries to cleanup things.
      
      Eventually route cache is disabled but machine is under fire and might
      OOM and crash.
      
      This patch exploits the new TCP early demux, to set a nocache
      boolean in case incoming TCP frame is for a not yet ESTABLISHED or
      TIMEWAIT socket.
      
      This 'nocache' boolean is then used in case dst entry is not found in
      route cache, to create an unhashed dst entry (DST_NOCACHE)
      
      SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache
      output dst for syncookies), so after this patch, a machine is able to
      absorb a DDOS synflood attack without polluting its IP route cache.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c074da28
  28. 13 6月, 2012 1 次提交
    • T
      ipv4: Add interface option to enable routing of 127.0.0.0/8 · d0daebc3
      Thomas Graf 提交于
      Routing of 127/8 is tradtionally forbidden, we consider
      packets from that address block martian when routing and do
      not process corresponding ARP requests.
      
      This is a sane default but renders a huge address space
      practically unuseable.
      
      The RFC states that no address within the 127/8 block should
      ever appear on any network anywhere but it does not forbid
      the use of such addresses outside of the loopback device in
      particular. For example to address a pool of virtual guests
      behind a load balancer.
      
      This patch adds a new interface option 'route_localnet'
      enabling routing of the 127/8 address block and processing
      of ARP requests on a specific interface.
      
      Note that for the feature to work, the default local route
      covering 127/8 dev lo needs to be removed.
      
      Example:
        $ sysctl -w net.ipv4.conf.eth0.route_localnet=1
        $ ip route del 127.0.0.0/8 dev lo table local
        $ ip addr add 127.1.0.1/16 dev eth0
        $ ip route flush cache
      
      V2: Fix invalid check to auto flush cache (thanks davem)
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d0daebc3
  29. 16 5月, 2012 2 次提交