1. 29 1月, 2008 2 次提交
  2. 21 1月, 2008 1 次提交
    • P
      [NET]: rtnl_link: fix use-after-free · 68365458
      Patrick McHardy 提交于
      When unregistering the rtnl_link_ops, all existing devices using
      the ops are destroyed. With nested devices this may lead to a
      use-after-free despite the use of for_each_netdev_safe() in case
      the upper device is next in the device list and is destroyed
      by the NETDEV_UNREGISTER notifier.
      
      The easy fix is to restart scanning the device list after removing
      a device. Alternatively we could add new devices to the front of
      the list to avoid having dependant devices follow the device they
      depend on. A third option would be to only restart scanning if
      dev->iflink of the next device matches dev->ifindex of the current
      one. For now this seems like the safest solution.
      
      With this patch, the veth rtnl_link_ops unregistration can use
      rtnl_link_unregister() directly since it now also handles destruction
      of multiple devices at once.
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      68365458
  3. 27 10月, 2007 1 次提交
  4. 20 10月, 2007 1 次提交
    • P
      Make access to task's nsproxy lighter · cf7b708c
      Pavel Emelyanov 提交于
      When someone wants to deal with some other taks's namespaces it has to lock
      the task and then to get the desired namespace if the one exists.  This is
      slow on read-only paths and may be impossible in some cases.
      
      E.g.  Oleg recently noticed a race between unshare() and the (sent for
      review in cgroups) pid namespaces - when the task notifies the parent it
      has to know the parent's namespace, but taking the task_lock() is
      impossible there - the code is under write locked tasklist lock.
      
      On the other hand switching the namespace on task (daemonize) and releasing
      the namespace (after the last task exit) is rather rare operation and we
      can sacrifice its speed to solve the issues above.
      
      The access to other task namespaces is proposed to be performed
      like this:
      
           rcu_read_lock();
           nsproxy = task_nsproxy(tsk);
           if (nsproxy != NULL) {
                   / *
                     * work with the namespaces here
                     * e.g. get the reference on one of them
                     * /
           } / *
               * NULL task_nsproxy() means that this task is
               * almost dead (zombie)
               * /
           rcu_read_unlock();
      
      This patch has passed the review by Eric and Oleg :) and,
      of course, tested.
      
      [clg@fr.ibm.com: fix unshare()]
      [ebiederm@xmission.com: Update get_net_ns_by_pid]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: NCedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf7b708c
  5. 11 10月, 2007 9 次提交
    • D
      [NET]: make netlink user -> kernel interface synchronious · cd40b7d3
      Denis V. Lunev 提交于
      This patch make processing netlink user -> kernel messages synchronious.
      This change was inspired by the talk with Alexey Kuznetsov about current
      netlink messages processing. He says that he was badly wrong when introduced 
      asynchronious user -> kernel communication.
      
      The call netlink_unicast is the only path to send message to the kernel
      netlink socket. But, unfortunately, it is also used to send data to the
      user.
      
      Before this change the user message has been attached to the socket queue
      and sk->sk_data_ready was called. The process has been blocked until all
      pending messages were processed. The bad thing is that this processing
      may occur in the arbitrary process context.
      
      This patch changes nlk->data_ready callback to get 1 skb and force packet
      processing right in the netlink_unicast.
      
      Kernel -> user path in netlink_unicast remains untouched.
      
      EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock
      drop, but the process remains in the cycle until the message will be fully
      processed. So, there is no need to use this kludges now.
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      Acked-by: NAlexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd40b7d3
    • D
      [NET]: rtnl_unlock cleanups · 1536cc0d
      Denis V. Lunev 提交于
      There is no need to process outstanding netlink user->kernel packets
      during rtnl_unlock now. There is no rtnl_trylock in the rtnetlink_rcv
      anymore.
      
      Normal code path is the following:
      netlink_sendmsg
         netlink_unicast
             netlink_sendskb
                 skb_queue_tail
                 netlink_data_ready
                     rtnetlink_rcv
                         mutex_lock(&rtnl_mutex);
                         netlink_run_queue(sk, qlen, &rtnetlink_rcv_msg);
                         mutex_unlock(&rtnl_mutex);
      
      So, it is possible, that packets can be present in the rtnl->sk_receive_queue
      during rtnl_unlock, but there is no need to process them at that moment as
      rtnetlink_rcv for that packet is pending.
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      Acked-by: NAlexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1536cc0d
    • H
      [NETLINK]: Avoid pointer in netlink_run_queue · 0cfad075
      Herbert Xu 提交于
      I was looking at Patrick's fix to inet_diag and it occured
      to me that we're using a pointer argument to return values
      unnecessarily in netlink_run_queue.  Changing it to return
      the value will allow the compiler to generate better code
      since the value won't have to be memory-backed.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0cfad075
    • E
      [NET]: netlink support for moving devices between network namespaces. · d8a5ec67
      Eric W. Biederman 提交于
      The simplest thing to implement is moving network devices between
      namespaces.  However with the same attribute IFLA_NET_NS_PID we can
      easily implement creating devices in the destination network
      namespace as well.  However that is a little bit trickier so this
      patch sticks to what is simple and easy.
      
      A pid is used to identify a process that happens to be a member
      of the network namespace we want to move the network device to.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d8a5ec67
    • E
      [NET]: Make the device list and device lookups per namespace. · 881d966b
      Eric W. Biederman 提交于
      This patch makes most of the generic device layer network
      namespace safe.  This patch makes dev_base_head a
      network namespace variable, and then it picks up
      a few associated variables.  The functions:
      dev_getbyhwaddr
      dev_getfirsthwbytype
      dev_get_by_flags
      dev_get_by_name
      __dev_get_by_name
      dev_get_by_index
      __dev_get_by_index
      dev_ioctl
      dev_ethtool
      dev_load
      wireless_process_ioctl
      
      were modified to take a network namespace argument, and
      deal with it.
      
      vlan_ioctl_set and brioctl_set were modified so their
      hooks will receive a network namespace argument.
      
      So basically anthing in the core of the network stack that was
      affected to by the change of dev_base was modified to handle
      multiple network namespaces.  The rest of the network stack was
      simply modified to explicitly use &init_net the initial network
      namespace.  This can be fixed when those components of the network
      stack are modified to handle multiple network namespaces.
      
      For now the ifindex generator is left global.
      
      Fundametally ifindex numbers are per namespace, or else
      we will have corner case problems with migration when
      we get that far.
      
      At the same time there are assumptions in the network stack
      that the ifindex of a network device won't change.  Making
      the ifindex number global seems a good compromise until
      the network stack can cope with ifindex changes when
      you change namespaces, and the like.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      881d966b
    • E
      [NET]: Support multiple network namespaces with netlink · b4b51029
      Eric W. Biederman 提交于
      Each netlink socket will live in exactly one network namespace,
      this includes the controlling kernel sockets.
      
      This patch updates all of the existing netlink protocols
      to only support the initial network namespace.  Request
      by clients in other namespaces will get -ECONREFUSED.
      As they would if the kernel did not have the support for
      that netlink protocol compiled in.
      
      As each netlink protocol is updated to be multiple network
      namespace safe it can register multiple kernel sockets
      to acquire a presence in the rest of the network namespaces.
      
      The implementation in af_netlink is a simple filter implementation
      at hash table insertion and hash table look up time.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4b51029
    • E
      [NET]: Make device event notification network namespace safe · e9dc8653
      Eric W. Biederman 提交于
      Every user of the network device notifiers is either a protocol
      stack or a pseudo device.  If a protocol stack that does not have
      support for multiple network namespaces receives an event for a
      device that is not in the initial network namespace it quite possibly
      can get confused and do the wrong thing.
      
      To avoid problems until all of the protocol stacks are converted
      this patch modifies all netdev event handlers to ignore events on
      devices that are not in the initial network namespace.
      
      As the rest of the code is made network namespace aware these
      checks can be removed.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e9dc8653
    • P
      [RTNETLINK]: Introduce generic rtnl_create_link(). · e7199288
      Pavel Emelianov 提交于
      This routine gets the parsed rtnl attributes and creates a new
      link with generic info (IFLA_LINKINFO policy). Its intention
      is to help the drivers, that need to create several links at
      once (like VETH).
      
      This is nothing but a copy-paste-ed part of rtnl_newlink() function
      that is responsible for creation of new device.
      Signed-off-by: NPavel Emelianov <xemul@openvz.org>
      Acked-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7199288
    • S
      [NET]: Make NAPI polling independent of struct net_device objects. · bea3348e
      Stephen Hemminger 提交于
      Several devices have multiple independant RX queues per net
      device, and some have a single interrupt doorbell for several
      queues.
      
      In either case, it's easier to support layouts like that if the
      structure representing the poll is independant from the net
      device itself.
      
      The signature of the ->poll() call back goes from:
      
      	int foo_poll(struct net_device *dev, int *budget)
      
      to
      
      	int foo_poll(struct napi_struct *napi, int budget)
      
      The caller is returned the number of RX packets processed (or
      the number of "NAPI credits" consumed if you want to get
      abstract).  The callee no longer messes around bumping
      dev->quota, *budget, etc. because that is all handled in the
      caller upon return.
      
      The napi_struct is to be embedded in the device driver private data
      structures.
      
      Furthermore, it is the driver's responsibility to disable all NAPI
      instances in it's ->stop() device close handler.  Since the
      napi_struct is privatized into the driver's private data structures,
      only the driver knows how to get at all of the napi_struct instances
      it may have per-device.
      
      With lots of help and suggestions from Rusty Russell, Roland Dreier,
      Michael Chan, Jeff Garzik, and Jamal Hadi Salim.
      
      Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra,
      Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan.
      
      [ Ported to current tree and all drivers converted.  Integrated
        Stephen's follow-on kerneldoc additions, and restored poll_list
        handling to the old style to fix mutual exclusion issues.  -DaveM ]
      Signed-off-by: NStephen Hemminger <shemminger@linux-foundation.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bea3348e
  6. 01 8月, 2007 1 次提交
  7. 19 7月, 2007 1 次提交
  8. 12 7月, 2007 2 次提交
  9. 11 7月, 2007 4 次提交
  10. 08 6月, 2007 2 次提交
  11. 23 5月, 2007 2 次提交
  12. 04 5月, 2007 1 次提交
  13. 26 4月, 2007 13 次提交