1. 13 Feb, 2018 40 commits
    • D
      Merge branch 'Replacing-net_mutex-with-rw_semaphore' · 885842d8
      David S. Miller authored
      Kirill Tkhai says:
      
      ====================
      Replacing net_mutex with rw_semaphore
      
      This is the third version of the patchset introducing net_sem
      instead of net_mutex. The patchset adds net_sem in addition
      to net_mutex and allows pernet_operations to be "async". This
      flag means that the pernet_operations methods are safe to
      execute in parallel with any other pernet_operations
      (un)initializing another net.
      
      If there are only async pernet_operations in the system,
      net_mutex is not used either for setup_net() or for cleanup_net().
      
      The pernet_operations converted in this patchset are enough
      to build a minimal .config with working networking, and
      the changes improve performance, as you can see
      below:
      
          %for i in {1..10000}; do unshare -n bash -c exit; done
      
          *before*
          real 1m40,377s
          user 0m9,672s
          sys 0m19,928s
      
          *after*
          real 0m17,007s
          user 0m5,311s
          sys 0m11,779s
      
          (5.8 times faster)
      
      In the future, when all pernet_operations become async,
      we'll just remove this "async" field tree-wide.
      
      All the new logic is concentrated in patches [1-5/32].
      The rest of the patches convert specific operations: each contains
      a review, the rationale for why the operations can be converted,
      and the setting of the async flag.
      
      Kirill
      
      v3: Improved patches descriptions. Added comment into [5/32].
      Added [32/32] converting netlink_tap_net_ops (new pernet operations
      introduced in 2018).
      
      v2: Single patch -> patchset with rationale of every conversion
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      885842d8
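As a quick check of the figures quoted in the cover letter above, the wall-clock ratio can be recomputed from the two "real" readings. This is a minimal sketch in C; the helper names to_ms and speedup_x100 are made up for illustration and appear nowhere in the patchset.

```c
#include <assert.h>

/* Convert a time(1)-style "XmY,ZZZs" reading, given as whole minutes
 * plus the remaining milliseconds, to total milliseconds. */
static long to_ms(long minutes, long millis)
{
    assert(minutes >= 0 && millis >= 0);
    return minutes * 60 * 1000 + millis;
}

/* Ratio before/after, scaled by 100 and truncated (590 means 5.90x).
 * Integer math keeps the result exact and easy to compare. */
static long speedup_x100(long before_ms, long after_ms)
{
    assert(after_ms > 0);
    return before_ms * 100 / after_ms;
}
```

With the quoted readings, speedup_x100(to_ms(1, 40377), to_ms(0, 17007)) evaluates to 590, a wall-clock ratio of about 5.9, close to the "5.8 times faster" stated in the cover letter.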
    • K
      net: Convert netlink_tap_net_ops · b86b47a3
      Kirill Tkhai authored
      These pernet_operations only initialize just-allocated net memory,
      and they can obviously be executed in parallel with any
      others.
      
      v3: New
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b86b47a3
    • K
      net: Convert diag_net_ops · 59a51358
      Kirill Tkhai authored
      These pernet operations just create and destroy a netlink
      socket. The socket is pernet, and other operations don't
      touch it.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59a51358
    • K
      net: Convert default_device_ops · 2608e6b7
      Kirill Tkhai authored
      These pernet operations consist of exit() and exit_batch() methods.
      
      default_device_exit() moves non-local and virtual devices to init_net.
      There is nothing special here, because this may happen at any time
      on a working system, and rtnl_lock() and synchronize_net() protect
      us from all cases of external dereference.
      
      The same goes for default_device_exit_batch(). Similar unregistration
      may happen at any time on a system. There are several lists (like todo_list),
      which are accessed under rtnl_lock(). After rtnl_unlock() and
      netdev_run_todo() all the devices are flushed.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2608e6b7
    • K
      net: Convert loopback_net_ops · 9a4d105d
      Kirill Tkhai authored
      These pernet_operations have only init() method. It allocates
      memory for net_device, calls register_netdev() and assigns
      net::loopback_dev.
      
      register_netdev() is allowed to be used without additional locks,
      as it's synchronized on rtnl_lock(). There are many examples
      of using this function directly from ioctl().
      
      The only difference, compared to ioctl(), is that net is not
      completely alive at this moment. But it looks like there is
      no way for parallel pernet_operations to dereference
      the net_device, as most of the struct net_device lists
      where it's linked are related to net, and the net is not linked.
      
      The exceptions are net_device::unreg_list, close_list, todo_list,
      used for unregistration, and ::link_watch_list, where net_device
      may be linked to global lists.
      
      Unregistration of loopback_dev obviously can't happen when
      loopback_net_init() is executing, as the net is alive. It occurs
      in default_device_ops, which currently requires net_mutex,
      and it behaves as a barrier at the moment. It will be considered
      in the next patch.
      
      Speaking about link_watch_list, it seems, there is no way
      for loopback_dev at time of registration to be linked in lweventlist
      and be available for another pernet_operations.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a4d105d
    • K
      net: Convert addrconf_ops · 0bc9be67
      Kirill Tkhai authored
      These pernet_operations (un)register sysctl, which
      are not touched by anybody else.
      
      So, it's safe to make them async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0bc9be67
    • K
      net: Convert ipv4_sysctl_ops · 22769a2a
      Kirill Tkhai authored
      These pernet_operations create and destroy sysctl,
      which are not touched by anybody else.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22769a2a
    • K
      net: Convert packet_net_ops · cb5e3400
      Kirill Tkhai authored
      These pernet_operations just create and destroy a /proc entry,
      and other operations do not touch it.
      
      Also, nobody else is interested in a foreign net::packet::sklist.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cb5e3400
    • K
      net: Convert unix_net_ops · 167f7ac7
      Kirill Tkhai authored
      These pernet_operations just create and destroy
      /proc and sysctl entries, which are not touched by
      foreign pernet_operations.
      
      So, we are able to make them async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      167f7ac7
    • K
      net: Convert pernet_subsys, registered from inet_init() · f84c6821
      Kirill Tkhai authored
      arp_net_ops just adds/removes a /proc entry.
      
      devinet_ops allocates and frees duplicate of init_net tables
      and (un)registers sysctl entries.
      
      fib_net_ops allocates and frees pernet tables, creates/destroys
      netlink socket and (un)initializes /proc entries. Foreign
      pernet_operations do not touch them.
      
      ip_rt_proc_ops only modifies pernet /proc entries.
      
      xfrm_net_ops creates/destroys /proc entries, allocates/frees
      pernet statistics, hashes and tables, and (un)initializes
      sysctl files. These are not touched by foreign pernet_operations.
      
      xfrm4_net_ops allocates/frees private pernet memory, and
      configures sysctls.
      
      sysctl_route_ops creates/destroys sysctls.
      
      rt_genid_ops only initializes fields of just allocated net.
      
      ipv4_inetpeer_ops allocates/frees net private memory.
      
      igmp_net_ops just creates/destroys /proc files and a socket
      that no one else is interested in.
      
      tcp_sk_ops seems to be safe, because tcp_sk_init() does not
      depend on any other pernet_operations modifications. Iteration
      over the hash table in inet_twsk_purge() is made under the RCU lock,
      and it's safe to iterate the table this way. Removal from
      the table happens in inet_twsk_deschedule_put(), but this
      function is safe without any external locks, as it's synchronized
      internally. There are many examples of it being used in different
      contexts. So, it's safe to leave tcp_sk_exit_batch() unlocked.
      
      tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.
      
      udplite4_net_ops only creates/destroys pernet /proc file.
      
      icmp_sk_ops creates percpu sockets, not touched by foreign
      pernet_operations.
      
      ipmr_net_ops creates/destroys pernet fib tables, (un)registers
      fib rules and /proc files. This seems to be safe to execute
      in parallel with foreign pernet_operations.
      
      af_inet_ops just sets up default parameters of newly created net.
      
      ipv4_mib_ops creates and destroys pernet percpu statistics.
      
      raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops
      and ip_proc_ops only create/destroy pernet /proc files.
      
      ip4_frags_ops creates and destroys sysctl file.
      
      So, it's safe to make the pernet_operations async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f84c6821
    • K
      net: Convert sysctl_core_ops · 232cf06c
      Kirill Tkhai authored
      These pernet_operations register and destroy sysctl
      directory, and it's not interesting for foreign
      pernet_operations.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      232cf06c
    • K
      net: Convert wext_pernet_ops · 6c0075d0
      Kirill Tkhai authored
      These pernet_operations initialize and purge net::wext_nlevents
      queue, and are not touched by foreign pernet_operations.
      
      Mark them async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c0075d0
    • K
      net: Convert genl_pernet_ops · 83caf62c
      Kirill Tkhai authored
      These pernet_operations create and destroy net::genl_sock.
      Foreign pernet_operations don't touch it.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      83caf62c
    • K
      net: Convert subsys_initcall() registered pernet_operations from net/sched · 13da199c
      Kirill Tkhai authored
      psched_net_ops only creates and destroys a /proc entry,
      and is safe to be executed in parallel with any foreign
      pernet_operations.
      
      tcf_action_net_ops initializes and destructs tcf_action_net::egdev_ht,
      which is not touched by foreign pernet_operations.
      
      So, make them async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13da199c
    • K
      net: Convert fib_* pernet_operations, registered via subsys_initcall · 86b63418
      Kirill Tkhai authored
      Both of them create and initialize lists, which are not touched
      by other foreign pernet_operations.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86b63418
    • K
      net: Convert pernet_subsys ops, registered via net_dev_init() · 88b8ffeb
      Kirill Tkhai authored
      There are:
      1) dev_proc_ops and dev_mc_net_ops, which create and destroy
      a pernet proc file and are not interesting for other net namespaces;
      2) netdev_net_ops, which creates pernet hashes, which are not
      touched by other pernet_operations.
      
      So, make them async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88b8ffeb
    • K
      net: Convert proto_net_ops · 36b0068e
      Kirill Tkhai authored
      This patch starts to convert pernet_subsys, registered
      from subsys initcalls.
      
      It seems safe to be executed in parallel with others,
      as it only creates/destroys a proc entry,
      which nobody else is interested in.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36b0068e
    • K
      net: Convert uevent_net_ops · 15898a01
      Kirill Tkhai authored
      uevent_net_init() and uevent_net_exit() create and
      destroy a netlink socket, and these actions are serialized
      in netlink code.
      
      Parallel execution with other pernet_operations
      makes the socket disappear earlier from uevent_sock_list
      on ->exit. As userspace can't be interested in broadcast
      messages of a dying net, and, as far as I can see, no one in the kernel
      listens to them, we may safely make uevent_net_ops async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15898a01
    • K
      net: Convert audit_net_ops · 906f63ec
      Kirill Tkhai authored
      This patch starts to convert pernet_subsys, registered
      from postcore initcalls.
      
      audit_net_init() creates netlink socket, while audit_net_exit()
      destroys it. The rest of the pernet_list are not interested
      in the socket, so we make audit_net_ops async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      906f63ec
    • K
      net: Convert rtnetlink_net_ops · 46456675
      Kirill Tkhai authored
      rtnetlink_net_init() and rtnetlink_net_exit()
      create and destroy netlink socket net::rtnl.
      
      The socket is used to send rtnl notifications via
      rtnl_net_notifyid(). There is no problem
      creating and destroying it in parallel with other
      pernet operations, as we link net in setup_net()
      after the socket is created, and destroy it
      in cleanup_net() after net is unhashed from all
      the lists and there are no RCU references to it.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      46456675
    • K
      net: Convert netlink_net_ops · 194b95d2
      Kirill Tkhai authored
      The methods of netlink_net_ops create and destroy the "netlink"
      file, which is not interesting for foreign pernet_operations.
      So, netlink_net_ops may safely be made async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      194b95d2
    • K
      net: Convert net_defaults_ops · ff291d00
      Kirill Tkhai authored
      net_defaults_ops introduce only net_defaults_init_net method,
      and it acts on net::core::sysctl_somaxconn, which
      is not interesting for the rest of pernet_subsys and
      pernet_device lists. Then, make them async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff291d00
    • K
      net: Convert net_inuse_ops · 604da74e
      Kirill Tkhai authored
      net_inuse_ops methods expose statistics in /proc.
      No one from the rest of pernet_subsys or pernet_device
      lists touch net::core::inuse.
      
      So, it's safe to make net_inuse_ops async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      604da74e
    • K
      net: Convert nf_log_net_ops · c9d8fb91
      Kirill Tkhai authored
      The pernet_operations would have had a problem in parallel
      execution with others if init_net could be released.
      But it cannot, and the rest is safe.
      There is memory allocation, which nobody else is interested in,
      and sysctl registration. So, we make them async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c9d8fb91
    • K
      net: Convert netfilter_net_ops · 95499299
      Kirill Tkhai authored
      Methods netfilter_net_init() and netfilter_net_exit()
      initialize net::nf::hooks and change the net-related proc
      directory of net. Other pernet_operations are not
      interested in foreign net::nf::hooks or proc entries,
      so it's safe to execute them in parallel with
      methods of other pernet operations.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95499299
    • K
      net: Convert sysctl_pernet_ops · 93d230fe
      Kirill Tkhai authored
      This patch starts to convert pernet_subsys, registered
      from core initcalls.
      
      Methods sysctl_net_init() and sysctl_net_exit() initialize
      net::sysctls table of a namespace.
      
      pernet_operations::init()/exit() methods from the rest
      of the list do not touch net::sysctls of strangers,
      so it's safe to execute sysctl_pernet_ops's methods
      in parallel with any other pernet_operations.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93d230fe
    • K
      net: Convert net_ns_ops methods · 3fc3b827
      Kirill Tkhai authored
      This patch starts to convert pernet_subsys, registered
      from pure initcalls.
      
      The net_ns_ops::net_ns_net_init/net_ns_net_exit methods use only
      ida_simple_* functions, which do not need external synchronization.
      They are synchronized by the idr subsystem.
      
      So, net_ns_ops methods are able to be executed
      in parallel with methods of other pernet operations.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3fc3b827
    • K
      net: Convert proc_net_ns_ops · f039e184
      Kirill Tkhai authored
      This patch starts to convert pernet_subsys, registered
      before initcalls.
      
      proc_net_ns_ops::proc_net_ns_init()/proc_net_ns_exit()
      {un,}register pernet net->proc_net and ->proc_net_stat.
      
      Constructors and destructors of other pernet_operations
      are not interested in a foreign net's proc_net and proc_net_stat.
      Proc filesystem primitives are synchronized on proc_subdir_lock.
      
      So, proc_net_ns_ops methods are able to be executed
      in parallel with methods of any other pernet operations.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f039e184
    • K
      net: Allow pernet_operations to be executed in parallel · 447cd7a0
      Kirill Tkhai authored
      This adds a new pernet_operations::async flag to indicate operations
      whose ->init(), ->exit() and ->exit_batch() methods are allowed
      to be executed in parallel with the methods of any other pernet_operations.
      
      When there are only asynchronous pernet_operations in the system,
      net_mutex won't be taken for a net construction and destruction.
      
      Also, remove BUG_ON(mutex_is_locked()) from net_assign_generic()
      without replacing with the equivalent net_sem check, as there is
      one more lockdep assert below.
      
      v3: Add comment near net_mutex.
      Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      447cd7a0
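The rule stated in this commit (net_mutex is skipped only when every registered pernet_operations is async) can be illustrated with a small userspace sketch. The struct and function names below are hypothetical stand-ins: the kernel's real struct pernet_operations has more members, and the way the kernel tracks non-async entries may well differ from the linear scan used here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct pernet_operations. */
struct toy_pernet_ops {
    const char *name;
    bool async;   /* safe to run in parallel with other pernet_operations */
};

/* The global mutex is needed only while at least one registered
 * entry has not been marked async. An empty list needs no mutex. */
static bool toy_need_net_mutex(const struct toy_pernet_ops *ops, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!ops[i].async)
            return true;
    return false;
}
```

A single unconverted entry is enough to bring the mutex back, which is why the series converts the operations one by one before the flag can be dropped tree-wide.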
    • K
      net: Move mutex_unlock() in cleanup_net() up · bcab1ddd
      Kirill Tkhai authored
      net_sem protects from pernet_list changing, while
      ops_free_list() makes simple kfree(), and it can't
      race with other pernet_operations callbacks.
      
      So we may release net_mutex earlier than before.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bcab1ddd
    • K
      net: Introduce net_sem for protection of pernet_list · 1a57feb8
      Kirill Tkhai authored
      Currently, the mutex is mostly used to protect the pernet operations
      list. It orders setup_net() and cleanup_net() with parallel
      {un,}register_pernet_operations() calls, so that the ->exit{,batch}
      methods executed for a dying net belong to the same pernet operations
      whose ->init methods were called, even after the net namespace
      is unlinked from net_namespace_list in cleanup_net().
      
      But there are several problems with scalability. The first one
      is that more than one net can't be created or destroyed
      at the same moment on the node. For big machines with many cpus
      running many containers this is very noticeable.
      
      The second one is that synchronize_rcu() is needed after net
      is removed from net_namespace_list:
      
      Destroy net_ns:
      cleanup_net()
        mutex_lock(&net_mutex)
        list_del_rcu(&net->list)
        synchronize_rcu()                                  <--- Sleep there for ages
        list_for_each_entry_reverse(ops, &pernet_list, list)
          ops_exit_list(ops, &net_exit_list)
        list_for_each_entry_reverse(ops, &pernet_list, list)
          ops_free_list(ops, &net_exit_list)
        mutex_unlock(&net_mutex)
      
      This primitive is not fast, especially on systems with many processors
      and/or when preemptible RCU is enabled in config. So, all the time
      cleanup_net() is waiting for the RCU grace period, creation of new net namespaces
      is not possible, and the tasks that attempt it sleep on the same mutex:
      
      Create net_ns:
      copy_net_ns()
        mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
      
      I observed 20-30 seconds hangs of "unshare -n" on ordinary 8-cpu laptop
      with preemptible RCU enabled after CRIU tests round is finished.
      
      The solution is to convert net_mutex to the rw_semaphore and add fine-grained
      locks to the really small number of pernet_operations that really need them.
      
      Then, pernet_operations::init/::exit methods, modifying the net-related data,
      will require down_read() locking only, while down_write() will be used
      for changing pernet_list (i.e., when modules are being loaded and unloaded).
      
      This gives a significant performance increase after the whole patch set is applied,
      as you can see here:
      
      %for i in {1..10000}; do unshare -n bash -c exit; done
      
      *before*
      real 1m40,377s
      user 0m9,672s
      sys 0m19,928s
      
      *after*
      real 0m17,007s
      user 0m5,311s
      sys 0m11,779s
      
      (5.8 times faster)
      
      This patch starts replacing net_mutex with net_sem. It adds the rw_semaphore,
      describes the variables it protects, and uses it where appropriate.
      net_mutex is still present, and the next patches will kick it out step by step.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a57feb8
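The admission rule behind the reader/writer split described in this commit can be sketched with a toy, single-threaded model. It is not a lock (nothing blocks and nothing is atomic), and every name in it is hypothetical. It only shows who may hold the semaphore together: the read side stands for setup_net()/cleanup_net() walking pernet_list, the write side for changes to pernet_list.

```c
#include <stdbool.h>

/* Toy model of a reader/writer semaphore's admission rule.
 * NOT a real lock: no blocking, no atomicity, single-threaded only. */
struct toy_rwsem {
    int  readers;  /* current holders of the read side */
    bool writer;   /* true while the write side is held */
};

/* Read side: any number of holders, as long as no writer is active. */
static bool toy_down_read_trylock(struct toy_rwsem *s)
{
    if (s->writer)
        return false;
    s->readers++;
    return true;
}

static void toy_up_read(struct toy_rwsem *s)
{
    s->readers--;
}

/* Write side: exclusive, refused while anyone else holds the semaphore. */
static bool toy_down_write_trylock(struct toy_rwsem *s)
{
    if (s->writer || s->readers > 0)
        return false;
    s->writer = true;
    return true;
}

static void toy_up_write(struct toy_rwsem *s)
{
    s->writer = false;
}
```

Two readers are admitted together in this model, which corresponds to two net namespaces being created or destroyed at the same time; a writer is admitted only when the semaphore is otherwise idle.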
    • K
      net: Cleanup in copy_net_ns() · 5ba049a5
      Kirill Tkhai authored
      Line up the destructors' actions in the reverse order
      of the constructors. Next patches will add more actions,
      and this will be convenient if such an order is
      in place.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5ba049a5
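Undoing constructors in reverse order is commonly written in C as a ladder of goto labels. The sketch below illustrates that general idiom only; it is not the code of copy_net_ns(), and all names in it are hypothetical.

```c
#include <stddef.h>

/* Records the order in which the toy destructors run. */
struct toy_log {
    int steps[3];
    size_t len;
};

static void toy_undo(struct toy_log *log, int step)
{
    log->steps[log->len++] = step;
}

/* Runs constructors 1, 2, 3 in order. `fail_at` (1..3) makes that step
 * fail; 0 lets all succeed. On failure, the steps that had already
 * completed are undone in reverse order by falling through the labels.
 * Returns 0 on success, -1 on failure. */
static int toy_setup(int fail_at, struct toy_log *log)
{
    log->len = 0;
    if (fail_at == 1)
        goto fail;      /* nothing constructed yet */
    if (fail_at == 2)
        goto undo_1;    /* only step 1 completed */
    if (fail_at == 3)
        goto undo_2;    /* steps 1 and 2 completed */
    return 0;

undo_2:
    toy_undo(log, 2);
undo_1:
    toy_undo(log, 1);
fail:
    return -1;
}
```

Because the labels mirror the constructors in reverse, adding a fourth constructor only requires one new label at the top of the ladder.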
    • K
      net: Assign net to net_namespace_list in setup_net() · 98f6c533
      Kirill Tkhai authored
      This patch merges two repeating pieces of code in one,
      and they will live in setup_net() now.
      
      The only change is that assignment:
      
      	init_net_initialized = true;
      
      becomes reordered with:
      
      	list_add_tail_rcu(&net->list, &net_namespace_list);
      
      The reordering has no visible effect; it is a simple
      cleanup, for the following reason:
      
      init_net_initialized is used in the !CONFIG_NET_NS case
      to order proc_net_ns_ops registration occurring at boot time:
      
      	start_kernel()->proc_root_init()->proc_net_init(),
      with
      	net_ns_init()->setup_net(&init_net, &init_user_ns)
      
      also occurring at boot time from the same init_task.
      
      When there are no other tasks to race with them,
      for the single task it does not matter in which order
      two sequential independent loads are made.
      So we reorder them.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98f6c533
    • D
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · cf19e5e2
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2018-02-12
      
      This series contains updates to i40e and i40evf.
      
      Alan fixes a spelling mistake in code comments.  Fixes an issue on older
      firmware versions or NPAR enabled PFs which do not support the
      I40E_FLAG_DISABLE_FW_LLDP flag and would get into a situation where any
      attempt to change any priv flag would be forbidden.
      
      Alex got busy with the ITR code and made several cleanups and fixes so
      that we can more easily understand what is going on.  The fixes included
      a computational fix when determining the register offset, as well as a
      fix for unnecessarily toggling the CLEARPBA bit which could lead to
      potential lost events if auto-masking is not enabled.
      
      Filip adds a necessary delay to recover after a EMP reset when using
      firmware version 4.33.
      
      Paweł adds a warning message for MFP devices when the link-down-on-close
      flag is set because it may affect other partitions.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf19e5e2
    • A
      i40e/i40evf: Add support for new mechanism of updating adaptive ITR · a0073a4b
      Alexander Duyck authored
      This patch replaces the existing mechanism for determining the correct
      value to program for adaptive ITR with yet another new and more
      complicated approach.
      
      The basic idea from a 30K foot view is that this new approach will push the
      Rx interrupt moderation up so that by default it starts in low latency and
      is gradually pushed up into a higher latency setup as long as doing so
      increases the number of packets processed, if the number of packets drops
      to 4 to 1 per packet we will reset and just base our ITR on the size of the
      packets being received. For Tx we leave it floating at a high interrupt
      delay and do not pull it down unless we start processing more than 112
      packets per interrupt. If we start exceeding that we will cut our interrupt
      rates in half until we are back below 112.
      
      The side effect of these patches is that we will be processing more
      packets per interrupt. This is both a good and a bad thing as it means we
      will not be blocking processing in the case of things like pktgen and XDP,
      but we will also be consuming a bit more CPU in the cases of things such as
      network throughput tests using netperf.
      
      One delta from this versus the ixgbe version of the changes is that I have
      made the interrupt moderation a bit more aggressive when we are in bulk
      mode by moving our "goldilocks zone" up from "48 to 96" to "56 to 112". The
      main motivation behind moving this is to address the fact that we need to
      update less frequently, and have more fine grained control due to the
      separate Tx and Rx ITR times.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      a0073a4b
    • A
      i40e/i40evf: Split container ITR into current_itr and target_itr · 556fdfd6
      Alexander Duyck authored
      This patch is mostly prep-work for replacing the current approach to
      programming the dynamic aka adaptive ITR. Specifically here what we are
      doing is splitting the Tx and Rx ITR each into two separate values.
      
      The first value current_itr represents the current value of the register.
      
      The second value target_itr represents the desired value of the register.
      
      The general plan by doing this is to allow for deferring the update of the
      ITR value under certain circumstances. For now we will work with what we
      have, but in the future I hope to change the behavior so that we always
      only update one ITR at a time using some simple logic to determine which
      ITR requires an update.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      556fdfd6
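The current/target split described in this commit can be pictured with a minimal sketch. Only the field names current_itr and target_itr come from the commit title; the struct name, the helper, and the simulated register are assumptions made for illustration, not the driver's actual code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the split: one value tracks what the register
 * holds now, the other what we would like it to hold. */
struct toy_ring_itr {
    uint16_t current_itr;  /* value last written to the register */
    uint16_t target_itr;   /* desired value of the register */
};

/* Write the (simulated) register only when the two values differ.
 * Returns true if a write happened, so callers can defer or skip
 * updates that would change nothing. */
static bool toy_itr_sync(struct toy_ring_itr *rc, uint16_t *fake_reg)
{
    if (rc->current_itr == rc->target_itr)
        return false;
    *fake_reg = rc->target_itr;
    rc->current_itr = rc->target_itr;
    return true;
}
```

Keeping the desired value separate from the programmed one is what makes deferral possible: the adaptive logic can keep adjusting target_itr while the register write is postponed.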
    • A
      i40evf: Correctly populate rxitr_idx and txitr_idx · d4942d58
      Alexander Duyck authored
      While testing code for the recent ITR changes I found that updating the Tx
      ITR appeared to have no effect with everything defaulting to the Rx ITR. A
      bit of digging narrowed it down to the fact that we were asking the PF to
      associate all causes with ITR 0 as we weren't populating the itr_idx values
      for either Rx or Tx.
      
      To correct it I have added the configuration for these values to this
      patch. In addition I did some minor clean-up to just add a local pointer
      for the vector map instead of dereferencing it based off of the index
      repeatedly. In my opinion this makes the resultant code a bit more readable
      and saves us a few characters.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      d4942d58
    • A
      i40e/i40evf: Use usec value instead of reg value for ITR defines · 92418fb1
      Alexander Duyck authored
      Instead of using the register value for the defines when setting up the
      ring ITR we can just use the actual values and avoid the use of shifts and
      macros to translate between the values we have and the values we want.
      
      This helps to make the code more readable as we can quickly translate from
      one value to the other.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      92418fb1
    • D
      net: make getname() functions return length rather than use int* parameter · 9b2c45d4
      Denys Vlasenko authored
      Changes since v1:
      Added changes in these files:
          drivers/infiniband/hw/usnic/usnic_transport.c
          drivers/staging/lustre/lnet/lnet/lib-socket.c
          drivers/target/iscsi/iscsi_target_login.c
          drivers/vhost/net.c
          fs/dlm/lowcomms.c
          fs/ocfs2/cluster/tcp.c
          security/tomoyo/network.c
      
      Before:
      All these functions either return a negative error indicator,
      or store length of sockaddr into "int *socklen" parameter
      and return zero on success.
      
      "int *socklen" parameter is awkward. For example, if caller does not
      care, it still needs to provide on-stack storage for the value
      it does not need.
      
      None of the many FOO_getname() functions of various protocols
      ever used old value of *socklen. They always just overwrite it.
      
      This change drops this parameter, and makes all these functions, on success,
      return length of sockaddr. It's always >= 0 and can be differentiated
      from an error.
      
      Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
      
      rpc_sockname() lost "int buflen" parameter, since its only use was
      to be passed to kernel_getsockname() as &buflen and subsequently
      not used in any way.
      
      Userspace API is not changed.
      
          text    data     bss      dec     hex filename
      30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
      30108109 2633612  873672 33615393 200ee21 vmlinux.o
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: linux-bluetooth@vger.kernel.org
      CC: linux-decnet-user@lists.sourceforge.net
      CC: linux-wireless@vger.kernel.org
      CC: linux-rdma@vger.kernel.org
      CC: linux-sctp@vger.kernel.org
      CC: linux-nfs@vger.kernel.org
      CC: linux-x25@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9b2c45d4
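The before/after calling conventions described in this commit can be compared side by side in a small userspace sketch. The address type and both functions are hypothetical; they are not the kernel's struct sockaddr or any real protocol's getname() implementation.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical fixed-size address buffer used only in this sketch. */
struct toy_addr {
    char data[14];
};

/* Old convention: returns 0 on success and stores the length through
 * *len, or returns a negative errno. The caller must always supply
 * storage for *len, even if it does not care about the value. */
static int toy_getname_old(const char *name, struct toy_addr *out, int *len)
{
    size_t n = strlen(name);

    if (n >= sizeof(out->data))
        return -EINVAL;
    memcpy(out->data, name, n + 1);
    *len = (int)(n + 1);
    return 0;
}

/* New convention: returns the length (always >= 0) on success,
 * or a negative errno on failure. */
static int toy_getname_new(const char *name, struct toy_addr *out)
{
    size_t n = strlen(name);

    if (n >= sizeof(out->data))
        return -EINVAL;
    memcpy(out->data, name, n + 1);
    return (int)(n + 1);
}
```

With the new form a successful call returns a positive length, so a caller that still tests "if (err)" would treat success as failure; this is why the commit changes such tests to "if (err < 0)".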
    • A
      i40e/i40evf: Don't bother setting the CLEARPBA bit · 4ff17929
      Alexander Duyck authored
      The CLEARPBA bit in the dynamic interrupt control register actually has
      no effect either way on the hardware. As per errata 28 in the XL710
      specification update the interrupt is actually cleared any time the
      register is written with the INTENA_MSK bit set to 0. As such the act of
      toggling the enable bit actually will trigger the interrupt being
      cleared and could lead to potential lost events if auto-masking is
      not enabled.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      4ff17929