1. 17 Nov 2012 (1 commit)
  2. 02 Oct 2012 (3 commits)
  3. 28 Sep 2012 (2 commits)
    • net: use bigger pages in __netdev_alloc_frag · 69b08f62
      Committed by Eric Dumazet
      We currently use percpu order-0 pages in __netdev_alloc_frag
      to deliver fragments used by __netdev_alloc_skb()
      
      Depending on the NIC driver and the arch being 32 or 64 bit, it allows a page
      to be split into several fragments (between 1 and 8), assuming PAGE_SIZE=4096.
      
      Switching to bigger pages (32768 bytes for the PAGE_SIZE=4096 case) allows:
      
      - Better filling of space (the ending hole overhead is less of an issue)
      
      - Fewer calls to the page allocator or accesses to page->_count
      
      - Could allow future struct skb_shared_info changes without major
        performance impact.
      
      This patch implements a transparent fallback to smaller
      pages in case of memory pressure.
      
      It also uses a standard "struct page_frag" instead of a custom one.
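
      A minimal sketch of the allocation strategy described above, assuming
      PAGE_SIZE=4096; the function name and constant are illustrative, not
      the exact merged code:

        #include <linux/gfp.h>
        #include <linux/mm_types.h>

        #define FRAG_PAGE_MAX_ORDER 3   /* 32768 bytes when PAGE_SIZE=4096 */

        /* Illustrative: prefer an order-3 compound page, but fall back
         * transparently to an order-0 page under memory pressure. */
        static struct page *frag_alloc_page(gfp_t gfp, unsigned int *size)
        {
                struct page *page;

                page = alloc_pages(gfp | __GFP_COMP | __GFP_NOWARN |
                                   __GFP_NORETRY, FRAG_PAGE_MAX_ORDER);
                if (page) {
                        *size = PAGE_SIZE << FRAG_PAGE_MAX_ORDER;
                        return page;
                }
                /* memory pressure: settle for a single page */
                *size = PAGE_SIZE;
                return alloc_pages(gfp, 0);
        }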
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: remove sk_init() helper · e2bcabec
      Committed by Eric Dumazet
      It seems sk_init() has no value today and even does strange things:
      
      # grep . /proc/sys/net/core/?mem_*
      /proc/sys/net/core/rmem_default:212992
      /proc/sys/net/core/rmem_max:131071
      /proc/sys/net/core/wmem_default:212992
      /proc/sys/net/core/wmem_max:131071
      
      We can remove it completely.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Shan Wei <davidshan@tencent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 25 Sep 2012 (3 commits)
    • net: guard tcp_set_keepalive() to tcp sockets · 3e10986d
      Committed by Eric Dumazet
      It's possible to use RAW sockets to get a crash in
      tcp_set_keepalive() / sk_reset_timer().
      
      The fix is to make sure the socket is a SOCK_STREAM one.
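
      A sketch of the shape of the guard in sock_setsockopt(); close to,
      but not guaranteed to match, the merged code:

        case SO_KEEPALIVE:
        #ifdef CONFIG_INET
                /* Only TCP sockets own a keepalive timer; a RAW socket
                 * reaching tcp_set_keepalive() would crash in
                 * sk_reset_timer(). */
                if (sk->sk_protocol == IPPROTO_TCP &&
                    sk->sk_type == SOCK_STREAM)
                        tcp_set_keepalive(sk, valbool);
        #endif
                sock_valbool_flag(sk, SOCK_KEEPOPEN, valbool);
                break;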
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • filter: add XOR instruction for use with X/K · 9e49e889
      Committed by Daniel Borkmann
      SKF_AD_ALU_XOR_X was added a while ago, but as an 'ancillary'
      operation that is invoked through a negative offset in K within BPF
      load operations. Since BPF_MOD has recently been added, BPF_XOR should
      also be part of the common ALU operations. Removing SKF_AD_ALU_XOR_X
      might not be an option since this is exposed to user space.
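
      A hedged userspace sketch of what the new opcode enables in a classic
      BPF program (the ancillary SKF_AD_ALU_XOR_X path remains available
      for compatibility):

        #include <linux/filter.h>

        /* Sketch: XOR the accumulator with an immediate via the new
         * BPF_ALU | BPF_XOR opcode. */
        struct sock_filter prog[] = {
                BPF_STMT(BPF_LD  | BPF_W   | BPF_ABS, 0),          /* A = pkt[0..3]  */
                BPF_STMT(BPF_ALU | BPF_XOR | BPF_K,   0xcafef00d), /* A ^= K         */
                BPF_STMT(BPF_RET | BPF_A, 0),                      /* accept A bytes */
        };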
      Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: use a per task frag allocator · 5640f768
      Committed by Eric Dumazet
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      operations.
      
      This page is used to build fragments for skbs.
      
      It's done to increase the probability of coalescing small write()s into
      single segments in skbs still in the write queue (not yet sent).
      
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page
      
      It's also quite inefficient to build TSO 64KB packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
      the page allocator more than wanted.
      
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      
      (up to 32768 bytes per frag, that's order-3 pages on x86)
      
      This increases TCP stream performance by 20% on loopback device,
      but also benefits on other network devices, since 8x less frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      
      It's possible some SG-enabled hardware can't cope with bigger fragments,
      but their ndo_start_xmit() should already handle this, splitting a
      fragment into sub-fragments, since some arches have PAGE_SIZE=65536.
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
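
      A sketch of the structures involved, following the description above;
      treat the details as illustrative rather than the exact merged code:

        /* The standard fragment cache, now also embedded in task_struct. */
        struct page_frag {
                struct page *page;
                __u32        offset;
                __u32        size;
        };

        /* Roughly how the allocator is chosen: sockets allocating from
         * process context use the per-task frag; others fall back to a
         * per-socket one. */
        static inline struct page_frag *sk_page_frag(struct sock *sk)
        {
                if (sk->sk_allocation & __GFP_WAIT)
                        return &current->task_frag;
                return &sk->sk_frag;
        }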
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 21 Sep 2012 (1 commit)
    • net: do not disable sg for packets requiring no checksum · c0d680e5
      Committed by Ed Cashin
      A change in a series of VLAN-related changes appears to have
      inadvertently disabled the use of the scatter gather feature of
      network cards for transmission of non-IP ethernet protocols like ATA
      over Ethernet (AoE).  Below is a reference to the commit that
      introduces a "harmonize_features" function that turns off scatter
      gather when the NIC does not support hardware checksumming for the
      ethernet protocol of an sk buff.
      
        commit f01a5236
        Author: Jesse Gross <jesse@nicira.com>
        Date:   Sun Jan 9 06:23:31 2011 +0000
      
            net offloading: Generalize netif_get_vlan_features().
      
      The can_checksum_protocol function is not equipped to consider a
      protocol that does not require checksumming.  Calling it for a
      protocol that requires no checksum is inappropriate.
      
      The patch below has harmonize_features call can_checksum_protocol when
      the protocol needs a checksum, so that the network layer is not forced
      to perform unnecessary skb linearization on the transmission of AoE
      packets.  Unnecessary linearization results in decreased performance
      and increased memory pressure, as reported here:
      
        http://www.spinics.net/lists/linux-mm/msg15184.html
      
      The problem has probably not been widely experienced yet, because
      only recently has the kernel.org-distributed aoe driver acquired the
      ability to use payloads of over a page in size, with the patchset
      recently included in the mm tree:
      
        https://lkml.org/lkml/2012/8/28/140
      
      The coraid.com-distributed aoe driver already could use payloads of
      greater than a page in size, but its users generally do not use the
      newest kernels.
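
      The shape of the fix, per the description above; a sketch that may
      differ in detail from the merged code:

        static netdev_features_t harmonize_features(struct sk_buff *skb,
                __be16 protocol, netdev_features_t features)
        {
                /* Only packets that actually need a checksum may lose
                 * scatter/gather; CHECKSUM_NONE traffic such as AoE
                 * keeps SG. */
                if (skb->ip_summed != CHECKSUM_NONE &&
                    !can_checksum_protocol(features, protocol)) {
                        features &= ~NETIF_F_ALL_CSUM;
                        features &= ~NETIF_F_SG;
                } else if (illegal_highdma(skb->dev, skb)) {
                        features &= ~NETIF_F_SG;
                }

                return features;
        }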
      Signed-off-by: Ed Cashin <ecashin@coraid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 20 Sep 2012 (6 commits)
  7. 19 Sep 2012 (1 commit)
  8. 18 Sep 2012 (1 commit)
    • userns: Convert the audit loginuid to be a kuid · e1760bd5
      Committed by Eric W. Biederman
      Always store audit loginuids in type kuid_t.
      
      Print loginuids by converting them into uids in the appropriate user
      namespace, and then printing the resulting uid.
      
      Modify audit_get_loginuid to return a kuid_t.
      
      Modify audit_set_loginuid to take a kuid_t.
      
      Modify /proc/<pid>/loginuid on read to convert the loginuid into the
      user namespace of the opener of the file.
      
      Modify /proc/<pid>/loginuid on write to convert the loginuid
      from the user namespace of the opener of the file.
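
      A sketch of the read-side conversion (variable names are
      illustrative):

        /* Sketch: /proc/<pid>/loginuid read path.  The kuid is mapped into
         * the user namespace of whoever opened the file. */
        kuid_t kloginuid = audit_get_loginuid(task);
        uid_t uid = from_kuid(file->f_cred->user_ns, kloginuid);

        length = scnprintf(tmpbuf, sizeof(tmpbuf), "%u", uid);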
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  9. 17 Sep 2012 (3 commits)
  10. 15 Sep 2012 (4 commits)
    • cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them · 8c7f6edb
      Committed by Tejun Heo
      Currently, cgroup hierarchy support is a mess.  cpu related subsystems
      behave correctly - configuration, accounting and control on a parent
      properly cover its children.  blkio and freezer completely ignore
      hierarchy and treat all cgroups as if they're directly under the root
      cgroup.  Others show yet different behaviors.
      
      These differing interpretations of cgroup hierarchy make using cgroup
      confusing and make it impossible to co-mount controllers into the same
      hierarchy and obtain sane behavior.
      
      Eventually, we want full hierarchy support from all subsystems and
      probably a unified hierarchy.  Users using separate hierarchies
      expecting completely different behaviors depending on the mounted
      subsystem is detrimental to making any progress on this front.
      
      This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
      for controllers which are lacking in hierarchy support.  The goal of
      this patch is two-fold.
      
      * Move users away from using hierarchy on currently non-hierarchical
        subsystems, so that implementing proper hierarchy support on those
        doesn't surprise them.
      
      * Keep track of which controllers are broken how and nudge the
        subsystems to implement proper hierarchy support.
      
      For now, start with a single warning message.  We can whine louder
      later on.
      
      v2: Fixed a typo spotted by Michal. Warning message updated.
      
      v3: Updated memcg part so that it doesn't generate warning in the
          cases where .use_hierarchy=false doesn't make the behavior
          different from root.use_hierarchy=true.  Fixed a typo spotted by
          Glauber.
      
      v4: Check ->broken_hierarchy after cgroup creation is complete so that
          ->create() can affect the result per Michal.  Dropped unnecessary
          memcg root handling per Michal.
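
      A sketch of the mechanism: a flag on cgroup_subsys plus a one-time
      warning after a nested cgroup is created; the exact condition and
      message text here are illustrative:

        /* In struct cgroup_subsys (excerpt): */
        bool broken_hierarchy;          /* controller lacks hierarchy support */
        bool warned_broken_hierarchy;   /* whine only once per subsystem */

        /* After cgroup creation completes, for each attached subsystem: */
        if (ss->broken_hierarchy && !ss->warned_broken_hierarchy &&
            parent->parent) {   /* the new cgroup is nested */
                pr_warning("cgroup: \"%s\" has incomplete hierarchy support; nested cgroups may change behavior in the future\n",
                           ss->name);
                ss->warned_broken_hierarchy = true;
        }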
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
    • cgroup: Assign subsystem IDs during compile time · 8a8e04df
      Committed by Daniel Wagner
      WARNING: With this change it is impossible to load externally built
      controllers anymore.
      
      In the case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m
      are set, the corresponding subsys_ids should also be constants. Up to
      now, net_prio_subsys_id and net_cls_subsys_id were of type int and
      their values were assigned at runtime.
      
      By switching the macro definition of IS_SUBSYS_ENABLED from IS_BUILTIN
      to IS_ENABLED, all *_subsys_ids will have constant values. That means we
      need to remove all the code which assumes a value can be assigned to
      net_prio_subsys_id and net_cls_subsys_id.
      
      A close look is necessary at the RCU part, which was introduced by the
      following patch:
      
        commit f8451725
        Author:	Herbert Xu <herbert@gondor.apana.org.au>  Mon May 24 09:12:34 2010
        Committer:	David S. Miller <davem@davemloft.net>  Mon May 24 09:12:34 2010
      
        cls_cgroup: Store classid in struct sock
      
        This code was added to init_cgroup_cls()
      
      	  /* We can't use rcu_assign_pointer because this is an int. */
      	  smp_wmb();
      	  net_cls_subsys_id = net_cls_subsys.subsys_id;
      
        respectively to exit_cgroup_cls()
      
      	  net_cls_subsys_id = -1;
      	  synchronize_rcu();
      
        and in module version of task_cls_classid()
      
      	  rcu_read_lock();
      	  id = rcu_dereference(net_cls_subsys_id);
      	  if (id >= 0)
      		  classid = container_of(task_subsys_state(p, id),
      					 struct cgroup_cls_state, css)->classid;
      	  rcu_read_unlock();
      
      Without an explicit explanation of why the RCU part is needed. (The
      rcu_dereference was fixed by exchanging it for rcu_dereference_index_check()
      in a later commit, but that is a minor detail.)
      
      So here is my pondering on why it was introduced and why it is safe to
      remove it now. Note that this code was copied over to net_prio, so the
      reasoning holds for that subsystem too.
      
      The idea behind the RCU use for net_cls_subsys_id is to make sure we
      get a valid pointer back from task_subsys_state(). task_subsys_state()
      is just blindly accessing the subsys array and returning the
      pointer. Obviously, passing in -1 as id into task_subsys_state()
      returns an invalid value (out of lower bound).
      
      So this code makes sure that the id is assigned only after the module
      is loaded and the subsystem registered.
      
      Before unregistering the module, all old readers must have left the
      critical section. This is done by assigning -1 to the id and issuing a
      synchronize_rcu(). Any new readers won't call task_subsys_state()
      anymore, and therefore it is safe to unregister the subsystem.
      
      The new code relies on the same trick, but it looks at the subsys
      pointer returned by task_subsys_state() (remember the id is constant
      and therefore we always have a valid index into the subsys
      array).
      
      No precautions need to be taken during module loading. Eventually,
      all CPUs will get a valid pointer back from task_subsys_state(),
      because rebind_subsystems(), which is called after the module init()
      function, will assign subsys[net_cls_subsys_id] the newly loaded
      module's subsystem pointer.
      
      When the subsystem is about to be removed, rebind_subsystems() will be
      called before the module exit() function. In this case,
      rebind_subsystems() will assign subsys[net_cls_subsys_id] a NULL
      pointer and then call synchronize_rcu(). All old readers will have
      left the critical section by then. Any new reader won't access the
      subsystem anymore.  At this point we are safe to unregister the
      subsystem. No further synchronize_rcu() call is needed.
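
      A sketch of the module-side lookup after this change: the id is a
      compile-time constant, and validity hinges on the css pointer that
      rebind_subsystems() installs or clears:

        static inline u32 task_cls_classid(struct task_struct *p)
        {
                struct cgroup_subsys_state *css;
                u32 classid = 0;

                if (in_interrupt())
                        return 0;

                rcu_read_lock();
                /* net_cls_subsys_id is constant now; the subsys pointer is
                 * NULL until the module is loaded and bound, and NULL again
                 * (after synchronize_rcu()) before it is removed. */
                css = task_subsys_state(p, net_cls_subsys_id);
                if (css)
                        classid = container_of(css,
                                        struct cgroup_cls_state, css)->classid;
                rcu_read_unlock();

                return classid;
        }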
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
    • cgroup: net_prio: Do not define task_netprioidx() when not selected · 51e4e7fa
      Committed by Daniel Wagner
      task_netprioidx() should not be defined in case the configuration is
      CONFIG_NETPRIO_CGROUP=n. The reason is that in a following patch the
      net_prio_subsys_id will only be defined if CONFIG_NETPRIO_CGROUP!=n.
      When net_prio is not built at all, any callee should only get an empty
      task_netprioidx() without any references to net_prio_subsys_id.
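
      The resulting shape is the familiar config-gated stub; a sketch (the
      net_cls commit below follows the same pattern):

        #if !IS_ENABLED(CONFIG_NETPRIO_CGROUP)
        /* With net_prio not built at all, callers see only this empty
         * stub; net_prio_subsys_id is never referenced. */
        static inline u32 task_netprioidx(struct task_struct *p)
        {
                return 0;
        }
        #endif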
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
    • cgroup: net_cls: Do not define task_cls_classid() when not selected · 8fb974c9
      Committed by Daniel Wagner
      task_cls_classid() should not be defined in case the configuration is
      CONFIG_NET_CLS_CGROUP=n. The reason is that in a following patch the
      net_cls_subsys_id will only be defined if CONFIG_NET_CLS_CGROUP!=n.
      When net_cls is not built at all, a callee should only get an empty
      task_cls_classid() without any references to net_cls_subsys_id.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
  11. 14 Sep 2012 (3 commits)
  12. 11 Sep 2012 (3 commits)
  13. 09 Sep 2012 (3 commits)
  14. 08 Sep 2012 (1 commit)
    • scm: Don't use struct ucred in NETLINK_CB and struct scm_cookie. · dbe9a417
      Committed by Eric W. Biederman
      Passing uids and gids on NETLINK_CB from a process in one user
      namespace to a process in another user namespace can result in the
      wrong uid or gid being presented to userspace.  Avoid that problem by
      passing kuids and kgids instead.
      
      - define struct scm_creds for use in scm_cookie and netlink_skb_parms
        that holds uid and gid information in kuid_t and kgid_t.
      
      - Modify scm_set_cred to fill out scm_creds by hand instead of using
        cred_to_ucred to fill out struct ucred.  This conversion ensures
        userspace does not get incorrect uid or gid values to look at.
      
      - Modify scm_recv to convert from struct scm_creds to struct ucred
        before copying credential values to userspace.
      
      - Modify __scm_send to populate struct scm_creds in the scm_cookie,
        instead of just copying struct ucred from userspace.
      
      - Modify netlink_sendmsg to copy scm_creds instead of struct ucred
        into the NETLINK_CB.
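
      A sketch of the new carrier struct and the recv-side conversion
      described above (field layout illustrative):

        /* Kernel-internal credentials for scm_cookie / netlink_skb_parms. */
        struct scm_creds {
                u32    pid;
                kuid_t uid;
                kgid_t gid;
        };

        /* In scm_recv(), convert back to a struct ucred in the receiver's
         * user namespace before copying to userspace; roughly: */
        struct ucred ucreds = {
                .pid = scm->creds.pid,
                .uid = from_kuid_munged(current_user_ns(), scm->creds.uid),
                .gid = from_kgid_munged(current_user_ns(), scm->creds.gid),
        };
        put_cmsg(msg, SOL_SOCKET, SCM_CREDENTIALS, sizeof(ucreds), &ucreds);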
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 06 Sep 2012 (1 commit)
  16. 04 Sep 2012 (1 commit)
  17. 01 Sep 2012 (2 commits)
    • tcp: TCP Fast Open Server - support TFO listeners · 8336886f
      Committed by Jerry Chu
      This patch builds on top of the previous patch to add the support
      for TFO listeners. This includes -
      
      1. allocating, properly initializing, and managing the per listener
      fastopen_queue structure when TFO is enabled
      
      2. changes to the inet_csk_accept code to support TFO. E.g., the
      request_sock can no longer be freed upon accept(), not until 3WHS
      finishes
      
      3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
      if it's a TFO socket
      
      4. properly closing a TFO listener, and a TFO socket before 3WHS
      finishes
      
      5. supporting TCP_FASTOPEN socket option
      
      6. modifying tcp_check_req() to check a TFO socket as well as a
      request_sock
      
      7. supporting TCP's TFO cookie option
      
      8. adding a new SYN-ACK retransmit handler to use the timer directly
      off the TFO socket rather than the listener socket. Note that TFO
      server side will not retransmit anything other than SYN-ACK until
      the 3WHS is completed.
      
      The patch also contains an important function,
      "reqsk_fastopen_remove()", to manage the somewhat complex relation
      between a listener, its request_sock, and the corresponding child
      socket. See the comment above the function for details.
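
      From userspace, enabling a TFO listener is a single socket option on
      the listening socket; a sketch, assuming a libc that exposes
      TCP_FASTOPEN (the value bounds the fastopen queue length):

        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <sys/socket.h>
        #include <stdio.h>

        /* Sketch: turn a regular listener into a TFO listener.  qlen caps
         * the number of pending TFO requests (the per-listener
         * fastopen_queue described above). */
        int qlen = 16;
        if (setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
                       &qlen, sizeof(qlen)) < 0)
                perror("setsockopt(TCP_FASTOPEN)");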
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix documentation of skb_needs_linearize(). · d1a53dfd
      Committed by Rami Rosen
      skb_needs_linearize() does not check highmem DMA as it does not call
      illegal_highdma() anymore, so there is no need to mention highmem DMA here.
      
      (Indeed, the NETIF_F_SG flag, which is checked in skb_needs_linearize(),
      can be cleared when illegal_highdma() returns true, and we are assured
      that illegal_highdma() is invoked prior to skb_needs_linearize(), as
      skb_needs_linearize() is a static function called only once.
      But NETIF_F_SG can be cleared not only there on this invocation path;
      it can also be cleared when can_checksum_protocol() returns false.)
      
      see commit 02932ce9,
      Convert skb_need_linearize() to use precomputed features.
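
      For reference, a sketch of skb_needs_linearize() consistent with the
      corrected documentation: only fraglist and scatter/gather support are
      consulted, not highmem DMA:

        static inline int skb_needs_linearize(struct sk_buff *skb,
                                              netdev_features_t features)
        {
                /* No illegal_highdma() check here anymore; NETIF_F_SG may
                 * already have been cleared by harmonize_features(). */
                return skb_is_nonlinear(skb) &&
                       ((skb_has_frag_list(skb) &&
                                !(features & NETIF_F_FRAGLIST)) ||
                        (skb_shinfo(skb)->nr_frags &&
                                !(features & NETIF_F_SG)));
        }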
      Signed-off-by: Rami Rosen <rosenr@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 31 Aug 2012 (1 commit)