1. 29 1月, 2013 1 次提交
  2. 19 12月, 2012 2 次提交
    • G
      memcg: infrastructure to match an allocation to the right cache · d7f25f8a
      Glauber Costa 提交于
      The page allocator is able to bind a page to a memcg when it is
      allocated.  But for the caches, we'd like to have as many objects as
      possible in a page belonging to the same cache.
      
      This is done in this patch by calling memcg_kmem_get_cache in the
      beginning of every allocation function.  This function is patched out by
      static branches when kernel memory controller is not being used.
      
      It assumes that the task allocating, which determines the memcg in the
      page allocator, belongs to the same cgroup throughout the whole process.
      Misaccounting can happen if the task calls memcg_kmem_get_cache() while
      belonging to a cgroup, and later on changes.  This is considered
      acceptable, and should only happen upon task migration.
      
      Before the cache is created by the memcg core, there is also a possible
      imbalance: the task belongs to a memcg, but the cache being allocated from
      is the global cache, since the child cache is not yet guaranteed to be
      ready.  This case is also fine, since in this case the GFP_KMEMCG will not
      be passed and the page allocator will not attempt any cgroup accounting.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7f25f8a
    • G
      memcg: kmem accounting basic infrastructure · 510fc4e1
      Glauber Costa 提交于
      Add the basic infrastructure for the accounting of kernel memory.  To
      control that, the following files are created:
      
       * memory.kmem.usage_in_bytes
       * memory.kmem.limit_in_bytes
       * memory.kmem.failcnt
       * memory.kmem.max_usage_in_bytes
      
      They have the same meaning of their user memory counterparts.  They
      reflect the state of the "kmem" res_counter.
      
      Per cgroup kmem memory accounting is not enabled until a limit is set for
      the group.  Once the limit is set the accounting cannot be disabled for
      that group.  This means that after the patch is applied, no behavioral
      changes exists for whoever is still using memcg to control their memory
      usage, until memory.kmem.limit_in_bytes is set for the first time.
      
      We always account to both user and kernel resource_counters.  This
      effectively means that an independent kernel limit is in place when the
      limit is set to a lower value than the user memory.  A equal or higher
      value means that the user limit will always hit first, meaning that kmem
      is effectively unlimited.
      
      People who want to track kernel memory but not limit it, can set this
      limit to a very high number (like RESOURCE_MAX - 1page - that no one will
      ever hit, or equal to the user memory)
      
      [akpm@linux-foundation.org: MEMCG_MMEM only works with slab and slub]
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      510fc4e1
  3. 11 12月, 2012 2 次提交
    • M
      mm: sched: numa: Control enabling and disabling of NUMA balancing · 1a687c2e
      Mel Gorman 提交于
      This patch adds Kconfig options and kernel parameters to allow the
      enabling and disabling of automatic NUMA balancing. The existance
      of such a switch was and is very important when debugging problems
      related to transparent hugepages and we should have the same for
      automatic NUMA placement.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      1a687c2e
    • A
      mm: numa: pte_numa() and pmd_numa() · be3a7284
      Andrea Arcangeli 提交于
      Implement pte_numa and pmd_numa.
      
      We must atomically set the numa bit and clear the present bit to
      define a pte_numa or pmd_numa.
      
      Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
      a thread touches a virtual address in the corresponding virtual range,
      a NUMA hinting page fault will trigger. The NUMA hinting page fault
      will clear the NUMA bit and set the present bit again to resolve the
      page fault.
      
      The expectation is that a NUMA hinting page fault is used as part
      of a placement policy that decides if a page should remain on the
      current node or migrated to a different node.
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      be3a7284
  4. 01 12月, 2012 1 次提交
    • F
      context_tracking: New context tracking susbsystem · 91d1aa43
      Frederic Weisbecker 提交于
      Create a new subsystem that probes on kernel boundaries
      to keep track of the transitions between level contexts
      with two basic initial contexts: user or kernel.
      
      This is an abstraction of some RCU code that use such tracking
      to implement its userspace extended quiescent state.
      
      We need to pull this up from RCU into this new level of indirection
      because this tracking is also going to be used to implement an "on
      demand" generic virtual cputime accounting. A necessary step to
      shutdown the tick while still accounting the cputime.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      [ paulmck: fix whitespace error and email address. ]
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      91d1aa43
  5. 17 11月, 2012 1 次提交
    • P
      rcu: Add callback-free CPUs · 3fbfbf7a
      Paul E. McKenney 提交于
      RCU callback execution can add significant OS jitter and also can
      degrade both scheduling latency and, in asymmetric multiprocessors,
      energy efficiency.  This commit therefore adds the ability for selected
      CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
      to kthreads.  If the "rcu_nocb_poll" boot parameter is also specified,
      these kthreads will do polling, removing the need for the offloaded
      CPUs to do wakeups.  At least one CPU must be doing normal callback
      processing: currently CPU 0 cannot be selected as a no-CBs CPU.
      In addition, attempts to offline the last normal-CBs CPU will fail.
      
      This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
      this commit includes fixes to problems located by Fengguang Wu's
      kbuild test robot.
      
      [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      3fbfbf7a
  6. 15 11月, 2012 2 次提交
    • E
      userns: Support fuse interacting with multiple user namespaces · 499dcf20
      Eric W. Biederman 提交于
      Use kuid_t and kgid_t in struct fuse_conn and struct fuse_mount_data.
      
      The connection between between a fuse filesystem and a fuse daemon is
      established when a fuse filesystem is mounted and provided with a file
      descriptor the fuse daemon created by opening /dev/fuse.
      
      For now restrict the communication of uids and gids between the fuse
      filesystem and the fuse daemon to the initial user namespace.  Enforce
      this by verifying the file descriptor passed to the mount of fuse was
      opened in the initial user namespace.  Ensuring the mount happens in
      the initial user namespace is not necessary as mounts from non-initial
      user namespaces are not yet allowed.
      
      In fuse_req_init_context convert the currrent fsuid and fsgid into the
      initial user namespace for the request that will be sent to the fuse
      daemon.
      
      In fuse_fill_attr convert the uid and gid passed from the fuse daemon
      from the initial user namespace into kuids and kgids.
      
      In iattr_to_fattr called from fuse_setattr convert kuids and kgids
      into the uids and gids in the initial user namespace before passing
      them to the fuse filesystem.
      
      In fuse_change_attributes_common called from fuse_dentry_revalidate,
      fuse_permission, fuse_geattr, and fuse_setattr, and fuse_iget convert
      the uid and gid from the fuse daemon into a kuid and a kgid to store
      on the fuse inode.
      
      By default fuse mounts are restricted to task whose uid, suid, and
      euid matches the fuse user_id and whose gid, sgid, and egid matches
      the fuse group id.  Convert the user_id and group_id mount options
      into kuids and kgids at mount time, and use uid_eq and gid_eq to
      compare the in fuse_allow_task.
      
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      499dcf20
    • E
      userns: Support autofs4 interacing with multiple user namespaces · 45634cd8
      Eric W. Biederman 提交于
      Use kuid_t and kgid_t in struct autofs_info and struct autofs_wait_queue.
      
      When creating directories and symlinks default the uid and gid of
      the mount requester to the global root uid and gid.  autofs4_wait
      will update these fields when a mount is requested.
      
      When generating autofsv5 packets report the uid and gid of the mount
      requestor in user namespace of the process that opened the pipe,
      reporting unmapped uids and gids as overflowuid and overflowgid.
      
      In autofs_dev_ioctl_requester return the uid and gid of the last mount
      requester converted into the calling processes user namespace.  When the
      uid or gid don't map return overflowuid and overflowgid as appropriate,
      allowing failure to find a mount requester to be distinguished from
      failure to map a mount requester.
      
      The uid and gid mount options specifying the user and group of the
      root autofs inode are converted into kuid and kgid as they are parsed
      defaulting to the current uid and current gid of the process that
      mounts autofs.
      
      Mounting of autofs for the present remains confined to processes in
      the initial user namespace.
      
      Cc: Ian Kent <raven@themaw.net>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      45634cd8
  7. 25 10月, 2012 1 次提交
  8. 24 10月, 2012 1 次提交
  9. 11 10月, 2012 1 次提交
  10. 10 10月, 2012 3 次提交
    • D
      MODSIGN: Implement module signature checking · 48ba2462
      David Howells 提交于
      Check the signature on the module against the keys compiled into the kernel or
      available in a hardware key store.
      
      Currently, only RSA keys are supported - though that's easy enough to change,
      and the signature is expected to contain raw components (so not a PGP or
      PKCS#7 formatted blob).
      
      The signature blob is expected to consist of the following pieces in order:
      
       (1) The binary identifier for the key.  This is expected to match the
           SubjectKeyIdentifier from an X.509 certificate.  Only X.509 type
           identifiers are currently supported.
      
       (2) The signature data, consisting of a series of MPIs in which each is in
           the format of a 2-byte BE word sizes followed by the content data.
      
       (3) A 12 byte information block of the form:
      
      	struct module_signature {
      		enum pkey_algo		algo : 8;
      		enum pkey_hash_algo	hash : 8;
      		enum pkey_id_type	id_type : 8;
      		u8			__pad;
      		__be32			id_length;
      		__be32			sig_length;
      	};
      
           The three enums are defined in crypto/public_key.h.
      
           'algo' contains the public-key algorithm identifier (0->DSA, 1->RSA).
      
           'hash' contains the digest algorithm identifier (0->MD4, 1->MD5, 2->SHA1,
            etc.).
      
           'id_type' contains the public-key identifier type (0->PGP, 1->X.509).
      
           '__pad' should be 0.
      
           'id_length' should contain in the binary identifier length in BE form.
      
           'sig_length' should contain in the signature data length in BE form.
      
           The lengths are in BE order rather than CPU order to make dealing with
           cross-compilation easier.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (minor Kconfig fix)
      48ba2462
    • D
      MODSIGN: Provide Kconfig options · ea0b6dcf
      David Howells 提交于
      Provide kernel configuration options for module signing.
      
      The following configuration options are added:
      
           CONFIG_MODULE_SIG_SHA1
           CONFIG_MODULE_SIG_SHA224
           CONFIG_MODULE_SIG_SHA256
           CONFIG_MODULE_SIG_SHA384
           CONFIG_MODULE_SIG_SHA512
      
      These select the cryptographic hash used to digest the data prior to signing.
      Additionally, the crypto module selected will be built into the kernel as it
      won't be possible to load it as a module without incurring a circular
      dependency when the kernel tries to check its signature.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      ea0b6dcf
    • R
      module: signature checking hook · 106a4ee2
      Rusty Russell 提交于
      We do a very simple search for a particular string appended to the module
      (which is cache-hot and about to be SHA'd anyway).  There's both a config
      option and a boot parameter which control whether we accept or fail with
      unsigned modules and modules that are signed with an unknown key.
      
      If module signing is enabled, the kernel will be tainted if a module is
      loaded that is unsigned or has a signature for which we don't have the
      key.
      
      (Useful feedback and tweaks by David Howells <dhowells@redhat.com>)
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      106a4ee2
  11. 09 10月, 2012 2 次提交
  12. 08 10月, 2012 1 次提交
    • D
      X.509: Add simple ASN.1 grammar compiler · 4520c6a4
      David Howells 提交于
      Add a simple ASN.1 grammar compiler.  This produces a bytecode output that can
      be fed to a decoder to inform the decoder how to interpret the ASN.1 stream it
      is trying to parse.
      
      Action functions can be specified in the grammar by interpolating:
      
      	({ foo })
      
      after a type, for example:
      
      	SubjectPublicKeyInfo ::= SEQUENCE {
      		algorithm		AlgorithmIdentifier,
      		subjectPublicKey	BIT STRING ({ do_key_data })
      		}
      
      The decoder is expected to call these after matching this type and parsing the
      contents if it is a constructed type.
      
      The grammar compiler does not currently support the SET type (though it does
      support SET OF) as I can't see a good way of tracking which members have been
      encountered yet without using up extra stack space.
      
      Currently, the grammar compiler will fail if more than 256 bytes of bytecode
      would be produced or more than 256 actions have been specified as it uses
      8-bit jump values and action indices to keep space usage down.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      4520c6a4
  13. 06 10月, 2012 2 次提交
  14. 26 9月, 2012 2 次提交
    • F
      rcu: Userspace RCU extended QS selftest · 1fd2b442
      Frederic Weisbecker 提交于
      Provide a config option that enables the userspace
      RCU extended quiescent state on every CPUs by default.
      
      This is for testing purpose.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      1fd2b442
    • F
      rcu: Settle config for userspace extended quiescent state · 2b1d5024
      Frederic Weisbecker 提交于
      Create a new config option under the RCU menu that put
      CPUs under RCU extended quiescent state (as in dynticks
      idle mode) when they run in userspace. This require
      some contribution from architectures to hook into kernel
      and userspace boundaries.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      2b1d5024
  15. 25 9月, 2012 2 次提交
    • F
      cputime: Make finegrained irqtime accounting generally available · fdf9c356
      Frederic Weisbecker 提交于
      There is no known reason for this option to be unavailable on other
      archs than x86. They just need to call enable_sched_clock_irqtime()
      if they have a sufficiently finegrained clock to make it working.
      
      Move it to the general option and let the user choose between
      it and pure tick based or virtual cputime accounting.
      
      Note that virtual cputime accounting already performs a finegrained
      irqtime accounting. CONFIG_IRQ_TIME_ACCOUNTING is a kind of middle ground
      between tick and virtual based accounting. So CONFIG_IRQ_TIME_ACCOUNTING
      and CONFIG_VIRT_CPU_ACCOUNTING are mutually exclusive choices.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      fdf9c356
    • F
      cputime: Gather time/stats accounting config options into a single menu · 391dc69c
      Frederic Weisbecker 提交于
      This debloats a bit the general config menu and make these
      config options easier to find.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      391dc69c
  16. 21 9月, 2012 16 次提交