1. 18 11月, 2014 1 次提交
    • D
      integrity: provide a hook to load keys when rootfs is ready · c9cd2ce2
      Dmitry Kasatkin 提交于
      Keys can only be loaded once the rootfs is mounted. Initcalls
      are not suitable for that. This patch defines a special hook
      to load the x509 public keys onto the IMA keyring, before
      attempting to access any file. The keys are required for
      verifying the file's signature. The hook is called after the
      root filesystem is mounted and before the kernel calls 'init'.
      
      Changes in v3:
      * added more explanation to the patch description (Mimi)
      
      Changes in v2:
      * Hook renamed as 'integrity_load_keys()' to handle both IMA and EVM
        keys by integrity subsystem.
      * Hook patch moved after defining loading functions
      Signed-off-by: NDmitry Kasatkin <d.kasatkin@samsung.com>
      Signed-off-by: NMimi Zohar <zohar@linux.vnet.ibm.com>
      c9cd2ce2
  2. 17 9月, 2014 5 次提交
  3. 09 9月, 2014 1 次提交
  4. 02 9月, 2014 1 次提交
  5. 31 7月, 2014 1 次提交
    • A
      kexec: export free_huge_page to VMCOREINFO · 8f1d26d0
      Atsushi Kumagai 提交于
      PG_head_mask was added into VMCOREINFO to filter huge pages in b3acc56b
      ("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still need
      another symbol to filter *hugetlbfs* pages.
      
      If a user hope to filter user pages, makedumpfile tries to exclude them by
      checking the condition whether the page is anonymous, but hugetlbfs pages
      aren't anonymous while they also be user pages.
      
      We know it's possible to detect them in the same way as PageHuge(),
      so we need the start address of free_huge_page():
      
          int PageHuge(struct page *page)
          {
                  if (!PageCompound(page))
                          return 0;
      
                  page = compound_head(page);
                  return get_compound_page_dtor(page) == free_huge_page;
          }
      
      For that reason, this patch changes free_huge_page() into public
      to export it to VMCOREINFO.
      Signed-off-by: NAtsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Acked-by: NBaoquan He <bhe@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f1d26d0
  6. 30 7月, 2014 5 次提交
  7. 28 7月, 2014 1 次提交
  8. 26 7月, 2014 2 次提交
  9. 24 7月, 2014 2 次提交
    • E
      CAPABILITIES: remove undefined caps from all processes · 7d8b6c63
      Eric Paris 提交于
      This is effectively a revert of 7b9a7ec5
      plus fixing it a different way...
      
      We found, when trying to run an application from an application which
      had dropped privs that the kernel does security checks on undefined
      capability bits.  This was ESPECIALLY difficult to debug as those
      undefined bits are hidden from /proc/$PID/status.
      
      Consider a root application which drops all capabilities from ALL 4
      capability sets.  We assume, since the application is going to set
      eff/perm/inh from an array that it will clear not only the defined caps
      less than CAP_LAST_CAP, but also the higher 28ish bits which are
      undefined future capabilities.
      
      The BSET gets cleared differently.  Instead it is cleared one bit at a
      time.  The problem here is that in security/commoncap.c::cap_task_prctl()
      we actually check the validity of a capability being read.  So any task
      which attempts to 'read all things set in bset' followed by 'unset all
      things set in bset' will not even attempt to unset the undefined bits
      higher than CAP_LAST_CAP.
      
      So the 'parent' will look something like:
      CapInh:	0000000000000000
      CapPrm:	0000000000000000
      CapEff:	0000000000000000
      CapBnd:	ffffffc000000000
      
      All of this 'should' be fine.  Given that these are undefined bits that
      aren't supposed to have anything to do with permissions.  But they do...
      
      So lets now consider a task which cleared the eff/perm/inh completely
      and cleared all of the valid caps in the bset (but not the invalid caps
      it couldn't read out of the kernel).  We know that this is exactly what
      the libcap-ng library does and what the go capabilities library does.
      They both leave you in that above situation if you try to clear all of
      you capapabilities from all 4 sets.  If that root task calls execve()
      the child task will pick up all caps not blocked by the bset.  The bset
      however does not block bits higher than CAP_LAST_CAP.  So now the child
      task has bits in eff which are not in the parent.  These are
      'meaningless' undefined bits, but still bits which the parent doesn't
      have.
      
      The problem is now in cred_cap_issubset() (or any operation which does a
      subset test) as the child, while a subset for valid cap bits, is not a
      subset for invalid cap bits!  So now we set durring commit creds that
      the child is not dumpable.  Given it is 'more priv' than its parent.  It
      also means the parent cannot ptrace the child and other stupidity.
      
      The solution here:
      1) stop hiding capability bits in status
      	This makes debugging easier!
      
      2) stop giving any task undefined capability bits.  it's simple, it you
      don't put those invalid bits in CAP_FULL_SET you won't get them in init
      and you won't get them in any other task either.
      	This fixes the cap_issubset() tests and resulting fallout (which
      	made the init task in a docker container untraceable among other
      	things)
      
      3) mask out undefined bits when sys_capset() is called as it might use
      ~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
      	This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.
      
      4) mask out undefined bit when we read a file capability off of disk as
      again likely all bits are set in the xattr for forward/backward
      compatibility.
      	This lets 'setcap all+pe /bin/bash; /bin/bash' run
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Andrew G. Morgan <morgan@kernel.org>
      Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Steve Grubb <sgrubb@redhat.com>
      Cc: Dan Walsh <dwalsh@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJames Morris <james.l.morris@oracle.com>
      7d8b6c63
    • N
      mm/rmap.c: fix pgoff calculation to handle hugepage correctly · a0f7a756
      Naoya Horiguchi 提交于
      I triggered VM_BUG_ON() in vma_address() when I tried to migrate an
      anonymous hugepage with mbind() in the kernel v3.16-rc3.  This is
      because pgoff's calculation in rmap_walk_anon() fails to consider
      compound_order() only to have an incorrect value.
      
      This patch introduces page_to_pgoff(), which gets the page's offset in
      PAGE_CACHE_SIZE.
      
      Kirill pointed out that page cache tree should natively handle
      hugepages, and in order to make hugetlbfs fit it, page->index of
      hugetlbfs page should be in PAGE_CACHE_SIZE.  This is beyond this patch,
      but page_to_pgoff() contains the point to be fixed in a single function.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0f7a756
  10. 23 7月, 2014 3 次提交
    • T
      libata: introduce ata_host->n_tags to avoid oops on SAS controllers · 1a112d10
      Tejun Heo 提交于
      1871ee13 ("libata: support the ata host which implements a queue
      depth less than 32") directly used ata_port->scsi_host->can_queue from
      ata_qc_new() to determine the number of tags supported by the host;
      unfortunately, SAS controllers doing SATA don't initialize ->scsi_host
      leading to the following oops.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
       IP: [<ffffffff814e0618>] ata_qc_new_init+0x188/0x1b0
       PGD 0
       Oops: 0002 [#1] SMP
       Modules linked in: isci libsas scsi_transport_sas mgag200 drm_kms_helper ttm
       CPU: 1 PID: 518 Comm: udevd Not tainted 3.16.0-rc6+ #62
       Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
       task: ffff880c1a00b280 ti: ffff88061a000000 task.ti: ffff88061a000000
       RIP: 0010:[<ffffffff814e0618>]  [<ffffffff814e0618>] ata_qc_new_init+0x188/0x1b0
       RSP: 0018:ffff88061a003ae8  EFLAGS: 00010012
       RAX: 0000000000000001 RBX: ffff88000241ca80 RCX: 00000000000000fa
       RDX: 0000000000000020 RSI: 0000000000000020 RDI: ffff8806194aa298
       RBP: ffff88061a003ae8 R08: ffff8806194a8000 R09: 0000000000000000
       R10: 0000000000000000 R11: ffff88000241ca80 R12: ffff88061ad58200
       R13: ffff8806194aa298 R14: ffffffff814e67a0 R15: ffff8806194a8000
       FS:  00007f3ad7fe3840(0000) GS:ffff880627620000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000058 CR3: 000000061a118000 CR4: 00000000001407e0
       Stack:
        ffff88061a003b20 ffffffff814e96e1 ffff88000241ca80 ffff88061ad58200
        ffff8800b6bf6000 ffff880c1c988000 ffff880619903850 ffff88061a003b68
        ffffffffa0056ce1 ffff88061a003b48 0000000013d6e6f8 ffff88000241ca80
       Call Trace:
        [<ffffffff814e96e1>] ata_sas_queuecmd+0xa1/0x430
        [<ffffffffa0056ce1>] sas_queuecommand+0x191/0x220 [libsas]
        [<ffffffff8149afee>] scsi_dispatch_cmd+0x10e/0x300
        [<ffffffff814a3bc5>] scsi_request_fn+0x2f5/0x550
        [<ffffffff81317613>] __blk_run_queue+0x33/0x40
        [<ffffffff8131781a>] queue_unplugged+0x2a/0x90
        [<ffffffff8131ceb4>] blk_flush_plug_list+0x1b4/0x210
        [<ffffffff8131d274>] blk_finish_plug+0x14/0x50
        [<ffffffff8117eaa8>] __do_page_cache_readahead+0x198/0x1f0
        [<ffffffff8117ee21>] force_page_cache_readahead+0x31/0x50
        [<ffffffff8117ee7e>] page_cache_sync_readahead+0x3e/0x50
        [<ffffffff81172ac6>] generic_file_read_iter+0x496/0x5a0
        [<ffffffff81219897>] blkdev_read_iter+0x37/0x40
        [<ffffffff811e307e>] new_sync_read+0x7e/0xb0
        [<ffffffff811e3734>] vfs_read+0x94/0x170
        [<ffffffff811e43c6>] SyS_read+0x46/0xb0
        [<ffffffff811e33d1>] ? SyS_lseek+0x91/0xb0
        [<ffffffff8171ee29>] system_call_fastpath+0x16/0x1b
       Code: 00 00 00 88 50 29 83 7f 08 01 19 d2 83 e2 f0 83 ea 50 88 50 34 c6 81 1d 02 00 00 40 c6 81 17 02 00 00 00 5d c3 66 0f 1f 44 00 00 <89> 14 25 58 00 00 00
      
      Fix it by introducing ata_host->n_tags which is initialized to
      ATA_MAX_QUEUE - 1 in ata_host_init() for SAS controllers and set to
      scsi_host_template->can_queue in ata_host_register() for !SAS ones.
      As SAS hosts are never registered, this will give them the same
      ATA_MAX_QUEUE - 1 as before.  Note that we can't use
      scsi_host->can_queue directly for SAS hosts anyway as they can go
      higher than the libata maximum.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NMike Qiu <qiudayu@linux.vnet.ibm.com>
      Reported-by: NJesse Brandeburg <jesse.brandeburg@gmail.com>
      Reported-by: NPeter Hurley <peter@hurleysoftware.com>
      Reported-by: NPeter Zijlstra <peterz@infradead.org>
      Tested-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Fixes: 1871ee13 ("libata: support the ata host which implements a queue depth less than 32")
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: stable@vger.kernel.org
      1a112d10
    • D
      KEYS: Allow expiry time to be set when preparsing a key · 7dfa0ca6
      David Howells 提交于
      Allow a key type's preparsing routine to set the expiry time for a key.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NSteve Dickson <steved@redhat.com>
      Acked-by: NJeff Layton <jlayton@primarydata.com>
      Reviewed-by: NSage Weil <sage@redhat.com>
      7dfa0ca6
    • D
      KEYS: struct key_preparsed_payload should have two payload pointers · fc7c70e0
      David Howells 提交于
      struct key_preparsed_payload should have two payload pointers to correspond
      with those in struct key.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NSteve Dickson <steved@redhat.com>
      Acked-by: NJeff Layton <jlayton@primarydata.com>
      Reviewed-by: NSage Weil <sage@redhat.com>
      fc7c70e0
  11. 19 7月, 2014 5 次提交
    • K
      seccomp: implement SECCOMP_FILTER_FLAG_TSYNC · c2e1f2e3
      Kees Cook 提交于
      Applying restrictive seccomp filter programs to large or diverse
      codebases often requires handling threads which may be started early in
      the process lifetime (e.g., by code that is linked in). While it is
      possible to apply permissive programs prior to process start up, it is
      difficult to further restrict the kernel ABI to those threads after that
      point.
      
      This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
      synchronizing thread group seccomp filters at filter installation time.
      
      When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
      filter) an attempt will be made to synchronize all threads in current's
      threadgroup to its new seccomp filter program. This is possible iff all
      threads are using a filter that is an ancestor to the filter current is
      attempting to synchronize to. NULL filters (where the task is running as
      SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
      transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS,
      ...) has been set on the calling thread, no_new_privs will be set for
      all synchronized threads too. On success, 0 is returned. On failure,
      the pid of one of the failing threads will be returned and no filters
      will have been applied.
      
      The race conditions against another thread are:
      - requesting TSYNC (already handled by sighand lock)
      - performing a clone (already handled by sighand lock)
      - changing its filter (already handled by sighand lock)
      - calling exec (handled by cred_guard_mutex)
      The clone case is assisted by the fact that new threads will have their
      seccomp state duplicated from their parent before appearing on the tasklist.
      
      Holding cred_guard_mutex means that seccomp filters cannot be assigned
      while in the middle of another thread's exec (potentially bypassing
      no_new_privs or similar). The call to de_thread() may kill threads waiting
      for the mutex.
      
      Changes across threads to the filter pointer includes a barrier.
      
      Based on patches by Will Drewry.
      Suggested-by: NJulien Tinnes <jln@chromium.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NAndy Lutomirski <luto@amacapital.net>
      c2e1f2e3
    • K
      seccomp: introduce writer locking · dbd95212
      Kees Cook 提交于
      Normally, task_struct.seccomp.filter is only ever read or modified by
      the task that owns it (current). This property aids in fast access
      during system call filtering as read access is lockless.
      
      Updating the pointer from another task, however, opens up race
      conditions. To allow cross-thread filter pointer updates, writes to the
      seccomp fields are now protected by the sighand spinlock (which is shared
      by all threads in the thread group). Read access remains lockless because
      pointer updates themselves are atomic.  However, writes (or cloning)
      often entail additional checking (like maximum instruction counts)
      which require locking to perform safely.
      
      In the case of cloning threads, the child is invisible to the system
      until it enters the task list. To make sure a child can't be cloned from
      a thread and left in a prior state, seccomp duplication is additionally
      moved under the sighand lock. Then parent and child are certain have
      the same seccomp state when they exit the lock.
      
      Based on patches by Will Drewry and David Drysdale.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NAndy Lutomirski <luto@amacapital.net>
      dbd95212
    • K
      sched: move no_new_privs into new atomic flags · 1d4457f9
      Kees Cook 提交于
      Since seccomp transitions between threads requires updates to the
      no_new_privs flag to be atomic, the flag must be part of an atomic flag
      set. This moves the nnp flag into a separate task field, and introduces
      accessors.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NAndy Lutomirski <luto@amacapital.net>
      1d4457f9
    • K
      seccomp: add "seccomp" syscall · 48dc92b9
      Kees Cook 提交于
      This adds the new "seccomp" syscall with both an "operation" and "flags"
      parameter for future expansion. The third argument is a pointer value,
      used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
      be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
      
      In addition to the TSYNC flag later in this patch series, there is a
      non-zero chance that this syscall could be used for configuring a fixed
      argument area for seccomp-tracer-aware processes to pass syscall arguments
      in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
      for this syscall. Additionally, this syscall uses operation, flags,
      and user pointer for arguments because strictly passing arguments via
      a user pointer would mean seccomp itself would be unable to trivially
      filter the seccomp syscall itself.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NAndy Lutomirski <luto@amacapital.net>
      48dc92b9
    • D
      KEYS: Provide a generic instantiation function · 6a09d17b
      David Howells 提交于
      Provide a generic instantiation function for key types that use the preparse
      hook.  This makes it easier to prereserve key quota before keyrings get locked
      to retain the new key.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NSteve Dickson <steved@redhat.com>
      Acked-by: NJeff Layton <jlayton@primarydata.com>
      Reviewed-by: NSage Weil <sage@redhat.com>
      6a09d17b
  12. 18 7月, 2014 2 次提交
    • B
      cpufreq: make table sentinel macros unsigned to match use · 2b1987a9
      Brian W Hart 提交于
      Commit 5eeaf1f1 (cpufreq: Fix build error on some platforms that
      use cpufreq_for_each_*) moved function cpufreq_next_valid() to a public
      header.  Warnings are now generated when objects including that header
      are built with -Wsign-compare (as an out-of-tree module might be):
      
      .../include/linux/cpufreq.h: In function ‘cpufreq_next_valid’:
      .../include/linux/cpufreq.h:519:27: warning: comparison between signed
      and unsigned integer expressions [-Wsign-compare]
        while ((*pos)->frequency != CPUFREQ_TABLE_END)
                                 ^
      .../include/linux/cpufreq.h:520:25: warning: comparison between signed
      and unsigned integer expressions [-Wsign-compare]
         if ((*pos)->frequency != CPUFREQ_ENTRY_INVALID)
                               ^
      
      Constants CPUFREQ_ENTRY_INVALID and CPUFREQ_TABLE_END are signed, but
      are used with unsigned member 'frequency' of cpufreq_frequency_table.
      Update the macro definitions to be explicitly unsigned to match their
      use.
      
      This also corrects potentially wrong behavior of clk_rate_table_iter()
      if unsigned long is wider than usigned int.
      
      Fixes: 5eeaf1f1 (cpufreq: Fix build error on some platforms that use cpufreq_for_each_*)
      Signed-off-by: NBrian W Hart <hartb@linux.vnet.ibm.com>
      Reviewed-by: NSimon Horman <horms+renesas@verge.net.au>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      2b1987a9
    • D
      KEYS: Allow special keys (eg. DNS results) to be invalidated by CAP_SYS_ADMIN · 0c7774ab
      David Howells 提交于
      Special kernel keys, such as those used to hold DNS results for AFS, CIFS and
      NFS and those used to hold idmapper results for NFS, used to be
      'invalidateable' with key_revoke().  However, since the default permissions for
      keys were reduced:
      
      	Commit: 96b5c8fe
      	KEYS: Reduce initial permissions on keys
      
      it has become impossible to do this.
      
      Add a key flag (KEY_FLAG_ROOT_CAN_INVAL) that will permit a key to be
      invalidated by root.  This should not be used for system keyrings as the
      garbage collector will try and remove any invalidate key.  For system keyrings,
      KEY_FLAG_ROOT_CAN_CLEAR can be used instead.
      
      After this, from userspace, keyctl_invalidate() and "keyctl invalidate" can be
      used by any possessor of CAP_SYS_ADMIN (typically root) to invalidate DNS and
      idmapper keys.  Invalidated keys are immediately garbage collected and will be
      immediately rerequested if needed again.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NSteve Dickson <steved@redhat.com>
      0c7774ab
  13. 17 7月, 2014 1 次提交
    • D
      KEYS: validate certificate trust only with builtin keys · 32c4741c
      Dmitry Kasatkin 提交于
      Instead of allowing public keys, with certificates signed by any
      key on the system trusted keyring, to be added to a trusted keyring,
      this patch further restricts the certificates to those signed only by
      builtin keys on the system keyring.
      
      This patch defines a new option 'builtin' for the kernel parameter
      'keys_ownerid' to allow trust validation using builtin keys.
      
      Simplified Mimi's "KEYS: define an owner trusted keyring" patch
      
      Changelog v7:
      - rename builtin_keys to use_builtin_keys
      Signed-off-by: NDmitry Kasatkin <d.kasatkin@samsung.com>
      Signed-off-by: NMimi Zohar <zohar@linux.vnet.ibm.com>
      32c4741c
  14. 16 7月, 2014 6 次提交
    • D
      locking/rwsem: Add CONFIG_RWSEM_SPIN_ON_OWNER · 5db6c6fe
      Davidlohr Bueso 提交于
      Just like with mutexes (CONFIG_MUTEX_SPIN_ON_OWNER),
      encapsulate the dependencies for rwsem optimistic spinning.
      No logical changes here as it continues to depend on both
      SMP and the XADD algorithm variant.
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Acked-by: NJason Low <jason.low2@hp.com>
      [ Also make it depend on ARCH_SUPPORTS_ATOMIC_RMW. ]
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1405112406-13052-2-git-send-email-davidlohr@hp.com
      Cc: aswin@hp.com
      Cc: Chris Mason <clm@fb.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      5db6c6fe
    • J
      locking/rwsem: Reduce the size of struct rw_semaphore · ce069fc9
      Jason Low 提交于
      Recent optimistic spinning additions to rwsem provide significant performance
      benefits on many workloads on large machines. The cost of it was increasing
      the size of the rwsem structure by up to 128 bits.
      
      However, now that the previous patches in this series bring the overhead of
      struct optimistic_spin_queue to 32 bits, this patch reorders some fields in
      struct rw_semaphore such that we can reduce the overhead of the rwsem structure
      by 64 bits (on 64 bit systems).
      
      The extra overhead required for rwsem optimistic spinning would now be up
      to 8 additional bytes instead of up to 16 bytes. Additionally, the size of
      rwsem would now be more in line with mutexes.
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Link: http://lkml.kernel.org/r/1405358872-3732-6-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ce069fc9
    • P
      locking/rwsem: Rename 'activity' to 'count' · 13b9a962
      Peter Zijlstra 提交于
      There are two definitions of struct rw_semaphore, one in linux/rwsem.h
      and one in linux/rwsem-spinlock.h.
      
      For some reason they have different names for the initial field. This
      makes it impossible to use C99 named initialization for
      __RWSEM_INITIALIZER() -- or we have to duplicate that entire thing
      along with the structure definitions.
      
      The simpler patch is renaming the rwsem-spinlock variant to match the
      regular rwsem.
      
      This allows us to switch to C99 named initialization.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-bmrZolsbGmautmzrerog27io@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      13b9a962
    • J
      locking/spinlocks/mcs: Introduce and use init macro and function for osq locks · 4d9d951e
      Jason Low 提交于
      Currently, we initialize the osq lock by directly setting the lock's values. It
      would be preferable if we use an init macro to do the initialization like we do
      with other locks.
      
      This patch introduces and uses a macro and function for initializing the osq lock.
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Link: http://lkml.kernel.org/r/1405358872-3732-4-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      4d9d951e
    • J
      locking/spinlocks/mcs: Convert osq lock to atomic_t to reduce overhead · 90631822
      Jason Low 提交于
      The cancellable MCS spinlock is currently used to queue threads that are
      doing optimistic spinning. It uses per-cpu nodes, where a thread obtaining
      the lock would access and queue the local node corresponding to the CPU that
      it's running on. Currently, the cancellable MCS lock is implemented by using
      pointers to these nodes.
      
      In this patch, instead of operating on pointers to the per-cpu nodes, we
      store the CPU numbers in which the per-cpu nodes correspond to in atomic_t.
      A similar concept is used with the qspinlock.
      
      By operating on the CPU # of the nodes using atomic_t instead of pointers
      to those nodes, this can reduce the overhead of the cancellable MCS spinlock
      by 32 bits (on 64 bit systems).
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Link: http://lkml.kernel.org/r/1405358872-3732-3-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      90631822
    • J
      locking/spinlocks/mcs: Rename optimistic_spin_queue() to optimistic_spin_node() · 046a619d
      Jason Low 提交于
      Currently, the per-cpu nodes structure for the cancellable MCS spinlock is
      named "optimistic_spin_queue". However, in a follow up patch in the series
      we will be introducing a new structure that serves as the new "handle" for
      the lock. It would make more sense if that structure is named
      "optimistic_spin_queue". Additionally, since the current use of the
      "optimistic_spin_queue" structure are  "nodes", it might be better if we
      rename them to "node" anyway.
      
      This preparatory patch renames all current "optimistic_spin_queue"
      to "optimistic_spin_node".
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Link: http://lkml.kernel.org/r/1405358872-3732-2-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      046a619d
  15. 10 7月, 2014 1 次提交
    • P
      selinux: fix the default socket labeling in sock_graft() · 4da6daf4
      Paul Moore 提交于
      The sock_graft() hook has special handling for AF_INET, AF_INET, and
      AF_UNIX sockets as those address families have special hooks which
      label the sock before it is attached its associated socket.
      Unfortunately, the sock_graft() hook was missing a default approach
      to labeling sockets which meant that any other address family which
      made use of connections or the accept() syscall would find the
      returned socket to be in an "unlabeled" state.  This was recently
      demonstrated by the kcrypto/AF_ALG subsystem and the newly released
      cryptsetup package (cryptsetup v1.6.5 and later).
      
      This patch preserves the special handling in selinux_sock_graft(),
      but adds a default behavior - setting the sock's label equal to the
      associated socket - which resolves the problem with AF_ALG and
      presumably any other address family which makes use of accept().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      Tested-by: NMilan Broz <gmazyland@gmail.com>
      4da6daf4
  16. 09 7月, 2014 3 次提交