1. 15 Jul, 2014 1 commit
    • cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes · 5577964e
      Tejun Heo committed
      Currently, cgroup_subsys->base_cftypes is used for both the unified
      default hierarchy and legacy ones and subsystems can mark each file
      with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
      only on one of them.  This is quite hairy and error-prone.  Also, we
      may end up exposing interface files to the default hierarchy without
      thinking it through.
      
      cgroup_subsys will grow two separate cftype arrays and apply each only
      on the hierarchies of the matching type.  This will allow organizing
      cftypes in a lot clearer way and encourage subsystems to scrutinize
      the interface which is being exposed in the new default hierarchy.
      
      In preparation, this patch renames cgroup_subsys->base_cftypes to
      cgroup_subsys->legacy_cftypes.  This patch is pure rename.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
  2. 13 Jun, 2014 7 commits
    • ima: introduce ima_kernel_read() · 0430e49b
      Dmitry Kasatkin committed
      Commit 8aac6270 "move exit_task_namespaces() outside of exit_notify"
      introduced a kernel oops, present since kernel v3.10, which happens when
      AppArmor and IMA-appraisal are enabled at the same time.
      
      ----------------------------------------------------------------------
      [  106.750167] BUG: unable to handle kernel NULL pointer dereference at
      0000000000000018
      [  106.750221] IP: [<ffffffff811ec7da>] our_mnt+0x1a/0x30
      [  106.750241] PGD 0
      [  106.750254] Oops: 0000 [#1] SMP
      [  106.750272] Modules linked in: cuse parport_pc ppdev bnep rfcomm
      bluetooth rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl nfs lockd sunrpc
      fscache dm_crypt intel_rapl x86_pkg_temp_thermal intel_powerclamp
      kvm_intel snd_hda_codec_hdmi kvm crct10dif_pclmul crc32_pclmul
      ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul
      ablk_helper cryptd snd_hda_codec_realtek dcdbas snd_hda_intel
      snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_seq_midi
      snd_seq_midi_event snd_rawmidi psmouse snd_seq microcode serio_raw
      snd_timer snd_seq_device snd soundcore video lpc_ich coretemp mac_hid lp
      parport mei_me mei nbd hid_generic e1000e usbhid ahci ptp hid libahci
      pps_core
      [  106.750658] CPU: 6 PID: 1394 Comm: mysqld Not tainted 3.13.0-rc7-kds+ #15
      [  106.750673] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A08
      09/19/2012
      [  106.750689] task: ffff8800de804920 ti: ffff880400fca000 task.ti:
      ffff880400fca000
      [  106.750704] RIP: 0010:[<ffffffff811ec7da>]  [<ffffffff811ec7da>]
      our_mnt+0x1a/0x30
      [  106.750725] RSP: 0018:ffff880400fcba60  EFLAGS: 00010286
      [  106.750738] RAX: 0000000000000000 RBX: 0000000000000100 RCX:
      ffff8800d51523e7
      [  106.750764] RDX: ffffffffffffffea RSI: ffff880400fcba34 RDI:
      ffff880402d20020
      [  106.750791] RBP: ffff880400fcbae0 R08: 0000000000000000 R09:
      0000000000000001
      [  106.750817] R10: 0000000000000000 R11: 0000000000000001 R12:
      ffff8800d5152300
      [  106.750844] R13: ffff8803eb8df510 R14: ffff880400fcbb28 R15:
      ffff8800d51523e7
      [  106.750871] FS:  0000000000000000(0000) GS:ffff88040d200000(0000)
      knlGS:0000000000000000
      [  106.750910] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  106.750935] CR2: 0000000000000018 CR3: 0000000001c0e000 CR4:
      00000000001407e0
      [  106.750962] Stack:
      [  106.750981]  ffffffff813434eb ffff880400fcbb20 ffff880400fcbb18
      0000000000000000
      [  106.751037]  ffff8800de804920 ffffffff8101b9b9 0001800000000000
      0000000000000100
      [  106.751093]  0000010000000000 0000000000000002 000000000000000e
      ffff8803eb8df500
      [  106.751149] Call Trace:
      [  106.751172]  [<ffffffff813434eb>] ? aa_path_name+0x2ab/0x430
      [  106.751199]  [<ffffffff8101b9b9>] ? sched_clock+0x9/0x10
      [  106.751225]  [<ffffffff8134a68d>] aa_path_perm+0x7d/0x170
      [  106.751250]  [<ffffffff8101b945>] ? native_sched_clock+0x15/0x80
      [  106.751276]  [<ffffffff8134aa73>] aa_file_perm+0x33/0x40
      [  106.751301]  [<ffffffff81348c5e>] common_file_perm+0x8e/0xb0
      [  106.751327]  [<ffffffff81348d78>] apparmor_file_permission+0x18/0x20
      [  106.751355]  [<ffffffff8130c853>] security_file_permission+0x23/0xa0
      [  106.751382]  [<ffffffff811c77a2>] rw_verify_area+0x52/0xe0
      [  106.751407]  [<ffffffff811c789d>] vfs_read+0x6d/0x170
      [  106.751432]  [<ffffffff811cda31>] kernel_read+0x41/0x60
      [  106.751457]  [<ffffffff8134fd45>] ima_calc_file_hash+0x225/0x280
      [  106.751483]  [<ffffffff8134fb52>] ? ima_calc_file_hash+0x32/0x280
      [  106.751509]  [<ffffffff8135022d>] ima_collect_measurement+0x9d/0x160
      [  106.751536]  [<ffffffff810b552d>] ? trace_hardirqs_on+0xd/0x10
      [  106.751562]  [<ffffffff8134f07c>] ? ima_file_free+0x6c/0xd0
      [  106.751587]  [<ffffffff81352824>] ima_update_xattr+0x34/0x60
      [  106.751612]  [<ffffffff8134f0d0>] ima_file_free+0xc0/0xd0
      [  106.751637]  [<ffffffff811c9635>] __fput+0xd5/0x300
      [  106.751662]  [<ffffffff811c98ae>] ____fput+0xe/0x10
      [  106.751687]  [<ffffffff81086774>] task_work_run+0xc4/0xe0
      [  106.751712]  [<ffffffff81066fad>] do_exit+0x2bd/0xa90
      [  106.751738]  [<ffffffff8173c958>] ? retint_swapgs+0x13/0x1b
      [  106.751763]  [<ffffffff8106780c>] do_group_exit+0x4c/0xc0
      [  106.751788]  [<ffffffff81067894>] SyS_exit_group+0x14/0x20
      [  106.751814]  [<ffffffff8174522d>] system_call_fastpath+0x1a/0x1f
      [  106.751839] Code: c3 0f 1f 44 00 00 55 48 89 e5 e8 22 fe ff ff 5d c3
      0f 1f 44 00 00 55 65 48 8b 04 25 c0 c9 00 00 48 8b 80 28 06 00 00 48 89
      e5 5d <48> 8b 40 18 48 39 87 c0 00 00 00 0f 94 c0 c3 0f 1f 80 00 00 00
      [  106.752185] RIP  [<ffffffff811ec7da>] our_mnt+0x1a/0x30
      [  106.752214]  RSP <ffff880400fcba60>
      [  106.752236] CR2: 0000000000000018
      [  106.752258] ---[ end trace 3c520748b4732721 ]---
      ----------------------------------------------------------------------
      
      The reason for the oops is that IMA-appraisal uses kernel_read() when
      a file is closed. kernel_read() honors the LSM security hook, which calls
      the AppArmor handler, which uses current->nsproxy->mnt_ns. The 'guilty'
      commit changed the order of the cleanup code so that nsproxy->mnt_ns was
      no longer available to AppArmor.
      
      Discussion about the issue with Al Viro and Eric W. Biederman suggested
      that kernel_read() is too high-level for IMA. Besides security checking,
      another issue that was identified is mandatory locking. kernel_read()
      honors it as well, and it might prevent IMA from calculating the necessary
      hash. It was suggested to use a simplified version of the function without
      security and locking checks.

      This patch introduces a special version, ima_kernel_read(), which skips the
      security and mandatory locking checks. This prevents the kernel oops.
      Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com>
      Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
    • evm: prohibit userspace writing 'security.evm' HMAC value · 2fb1c9a4
      Mimi Zohar committed
      Calculating the 'security.evm' HMAC value requires access to the
      EVM encrypted key.  Only the kernel should have access to it.  This
      patch prevents userspace tools (e.g. setfattr, cp --preserve=xattr)
      from setting/modifying the 'security.evm' HMAC value directly.
      Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
    • ima: check inode integrity cache in violation check · 14503eb9
      Dmitry Kasatkin committed
      When IMA did not support ima-appraisal, the existence of the S_IMA flag
      clearly indicated that the file was measured. With IMA appraisal, the S_IMA
      flag indicates that the file was measured and/or appraised. Because of
      this, when measurement is not enabled by the policy, violations are
      still reported.
      
      To differentiate between measurement and appraisal policies this
      patch checks the inode integrity cache flags.  The IMA_MEASURED
      flag indicates whether the file was actually measured, while the
      IMA_MEASURE flag indicates whether the file should be measured.
      Unfortunately, the IMA_MEASURED flag is reset to indicate the file
      needs to be re-measured.  Thus, this patch checks the IMA_MEASURE
      flag.
      
      This patch limits the false positive violation reports, but does
      not fix it entirely.  The IMA_MEASURE/IMA_MEASURED flags are
      indications that, at some point in time, the file opened for read
      was in policy, but might not be in policy now (e.g. different uid).
      Other changes would be needed to further limit false positive
      violation reports.
      
      Changelog:
      - expanded patch description based on conversation with Roberto (Mimi)
      Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com>
      Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
    • ima: prevent unnecessary policy checking · b882fae2
      Dmitry Kasatkin committed
      ima_rdwr_violation_check() is called on every file open.
      The function checks the policy even when the violation condition
      is not met, which causes unnecessary policy checking.

      This patch does policy checking only if the violation condition is met.
      
      Changelog:
      - check writecount is greater than zero (Mimi)
      Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com>
      Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
    • evm: provide option to protect additional SMACK xattrs · 3e38df56
      Dmitry Kasatkin committed
      Newer versions of SMACK introduced the following security xattrs:
      SMACK64EXEC, SMACK64TRANSMUTE and SMACK64MMAP.
      
      To protect these xattrs, this patch includes them in the HMAC
      calculation.  However, for backwards compatibility with existing
      labeled filesystems, including these xattrs needs to be
      configurable.
      
      Changelog:
      - Add SMACK dependency on new option (Mimi)
      Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com>
      Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
    • evm: replace HMAC version with attribute mask · d3b33679
      Dmitry Kasatkin committed
      Using an HMAC version limits the possibility to arbitrarily add new
      attributes such as SMACK64EXEC to the hmac calculation.
      
      This patch replaces the HMAC version with an attribute mask.
      Desired attributes can be enabled with a configuration parameter.
      This allows building kernels which work with previously labeled
      filesystems.

      The currently supported attribute is 'fsuuid', which is equivalent to
      the former version 2.
      Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com>
      Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
    • ima: prevent new digsig xattr from being replaced · 060bdebf
      Mimi Zohar committed
      Even though a new xattr will only be appraised on the next access,
      set the DIGSIG flag to prevent a signature from being replaced with
      a hash on file close.
      Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
  3. 04 Jun, 2014 5 commits
    • ima: audit log files opened with O_DIRECT flag · f9b2a735
      Mimi Zohar committed
      Files are measured or appraised based on the IMA policy.  When a
      file, in policy, is opened with the O_DIRECT flag, a deadlock
      occurs.
      
      The first attempt at resolving this lockdep temporarily removed the
      O_DIRECT flag and restored it, after calculating the hash.  The
      second attempt introduced the O_DIRECT_HAVELOCK flag. Based on this
      flag, do_blockdev_direct_IO() would skip taking the i_mutex a second
      time.  The third attempt, by Dmitry Kasatkin, resolves the i_mutex
      locking issue, by re-introducing the IMA mutex, but uncovered
      another problem.  Reading a file with O_DIRECT flag set, writes
      directly to userspace pages.  A second patch allocates a user-space
      like memory.  This works for all IMA hooks, except ima_file_free(),
      which is called on __fput() to recalculate the file hash.
      
      Until this last issue is addressed, do not 'collect' the
      measurement for measuring, appraising, or auditing files opened
      with the O_DIRECT flag set.  Based on policy, permit or deny file
      access.  This patch defines a new IMA policy rule option named
      'permit_directio'.  Policy rules could be defined, based on LSM
      or other criteria, to permit specific applications to open files
      with the O_DIRECT flag set.
      
      Changelog v1:
      - permit or deny file access based on IMA policy rules
      Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Acked-by: Dmitry Kasatkin <d.kasatkin@samsung.com>
      Cc: <stable@vger.kernel.org>
    • selinux: conditionally reschedule in hashtab_insert while loading selinux policy · ed1c9642
      Dave Jones committed
      After silencing the sleeping warning in mls_convert_context() I started
      seeing similar traces from hashtab_insert. Do a cond_resched there too.
      Signed-off-by: Dave Jones <davej@redhat.com>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • selinux: conditionally reschedule in mls_convert_context while loading selinux policy · 9a591f39
      Dave Jones committed
      On a slow machine (with debugging enabled), upgrading selinux policy may take
      a considerable amount of time. Long enough that the softlockup detector
      gets triggered.
      
      The backtrace looks like this..
      
       > BUG: soft lockup - CPU#2 stuck for 23s! [load_policy:19045]
       > Call Trace:
       >  [<ffffffff81221ddf>] symcmp+0xf/0x20
       >  [<ffffffff81221c27>] hashtab_search+0x47/0x80
       >  [<ffffffff8122e96c>] mls_convert_context+0xdc/0x1c0
       >  [<ffffffff812294e8>] convert_context+0x378/0x460
       >  [<ffffffff81229170>] ? security_context_to_sid_core+0x240/0x240
       >  [<ffffffff812221b5>] sidtab_map+0x45/0x80
       >  [<ffffffff8122bb9f>] security_load_policy+0x3ff/0x580
       >  [<ffffffff810788a8>] ? sched_clock_cpu+0xa8/0x100
       >  [<ffffffff810786dd>] ? sched_clock_local+0x1d/0x80
       >  [<ffffffff810788a8>] ? sched_clock_cpu+0xa8/0x100
       >  [<ffffffff8103096a>] ? __change_page_attr_set_clr+0x82a/0xa50
       >  [<ffffffff810786dd>] ? sched_clock_local+0x1d/0x80
       >  [<ffffffff810788a8>] ? sched_clock_cpu+0xa8/0x100
       >  [<ffffffff8103096a>] ? __change_page_attr_set_clr+0x82a/0xa50
       >  [<ffffffff810788a8>] ? sched_clock_cpu+0xa8/0x100
       >  [<ffffffff81534ddc>] ? retint_restore_args+0xe/0xe
       >  [<ffffffff8109c82d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
       >  [<ffffffff81279a2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
       >  [<ffffffff810d28a8>] ? rcu_irq_exit+0x68/0xb0
       >  [<ffffffff81534ddc>] ? retint_restore_args+0xe/0xe
       >  [<ffffffff8121e947>] sel_write_load+0xa7/0x770
       >  [<ffffffff81139633>] ? vfs_write+0x1c3/0x200
       >  [<ffffffff81210e8e>] ? security_file_permission+0x1e/0xa0
       >  [<ffffffff8113952b>] vfs_write+0xbb/0x200
       >  [<ffffffff811581c7>] ? fget_light+0x397/0x4b0
       >  [<ffffffff81139c27>] SyS_write+0x47/0xa0
       >  [<ffffffff8153bde4>] tracesys+0xdd/0xe2
      
      Stephen Smalley suggested:
      
       > Maybe put a cond_resched() within the ebitmap_for_each_positive_bit()
       > loop in mls_convert_context()?
      
      That seems to do the trick. Tested by downgrading and re-upgrading selinux-policy-targeted.
      Signed-off-by: Dave Jones <davej@redhat.com>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • selinux: reject setexeccon() on MNT_NOSUID applications with -EACCES · 5b589d44
      Paul Moore committed
      We presently prevent processes from using setexeccon() to set the
      security label of exec()'d processes when NO_NEW_PRIVS is enabled by
      returning an error; however, we silently ignore setexeccon() when
      exec()'ing from a nosuid mounted filesystem.  This patch makes things
      a bit more consistent by returning an error in the setexeccon()/nosuid
      case.
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
    • selinux: Report permissive mode in avc: denied messages. · ca7786a2
      Stephen Smalley committed
      We cannot presently tell from an avc: denied message whether access was in
      fact denied or was allowed due to global or per-domain permissive mode.
      Add a permissive= field to the avc message to reflect this information.
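      A log consumer could read the new field roughly as sketched below. This
      is only an illustration in userspace C: avc_permissive() is a hypothetical
      helper, and the assumption that the field is rendered as permissive=0
      (really denied) or permissive=1 (allowed by permissive mode) should be
      checked against the kernel that produced the log.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: inspect one "avc: denied" record.
 * Returns 1 if the record is marked permissive=1, 0 if it is marked
 * permissive=0, and -1 if the field is missing or has another value
 * (for example, a record from a kernel without this patch).
 */
static int avc_permissive(const char *msg)
{
	static const char key[] = " permissive=";
	const char *p;

	assert(msg != NULL);

	p = strstr(msg, key);
	if (p == NULL)
		return -1;

	p += sizeof(key) - 1;	/* skip past the key to the value */
	if (*p == '0')
		return 0;
	if (*p == '1')
		return 1;
	return -1;
}
```

      Returning -1 for a missing field lets a caller treat older records as
      "unknown" instead of guessing whether access was actually denied.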
      Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
      Acked-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Paul Moore <pmoore@redhat.com>
  4. 17 May, 2014 3 commits
  5. 14 May, 2014 1 commit
    • cgroup: replace cftype->write_string() with cftype->write() · 451af504
      Tejun Heo committed
      Convert all cftype->write_string() users to the new cftype->write()
      which maps directly to kernfs write operation and has full access to
      kernfs and cgroup contexts.  The conversions are mostly mechanical.
      
      * @css and @cft are accessed using of_css() and of_cft() accessors
        respectively instead of being specified as arguments.
      
      * Should return @nbytes on success instead of 0.
      
      * @buf is not trimmed automatically.  Trim if necessary.  Note that
        blkcg and netprio don't need this as the parsers already handle
        whitespaces.
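      The last two conventions above (return @nbytes on success, trim the
      buffer yourself) can be sketched in plain userspace C. Nothing here is
      kernel code: strip() stands in for the kernel's strstrip(), and
      demo_write() with its out/outsz parameters is a hypothetical handler
      invented for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Userspace stand-in for strstrip(): trims leading and trailing
 * whitespace in place and returns a pointer into buf. */
static char *strip(char *buf)
{
	size_t len;

	while (isspace((unsigned char)*buf))
		buf++;
	len = strlen(buf);
	while (len > 0 && isspace((unsigned char)buf[len - 1]))
		buf[--len] = '\0';
	return buf;
}

/*
 * Hypothetical handler following the ->write() conventions:
 * the input is not trimmed for us, and success is reported by
 * returning nbytes rather than 0.
 */
static ssize_t demo_write(char *buf, size_t nbytes, char *out, size_t outsz)
{
	assert(buf != NULL && out != NULL);

	buf = strip(buf);
	if (strlen(buf) >= outsz)
		return -1;	/* a kernel handler would return a -errno value */

	strcpy(out, buf);
	return (ssize_t)nbytes;	/* the whole write was consumed */
}
```

      Returning the full byte count tells the caller that everything written
      was consumed, including the whitespace the handler chose to ignore.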
      
      cftype->write_string() has no users left after the conversions and is
      removed.
      
      While at it, remove unnecessary local variable @p in
      cgroup_subtree_control_write() and stale comment about
      CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.
      
      This patch doesn't introduce any visible behavior changes.
      
      v2: netprio was missing from conversion.  Converted.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
  6. 07 May, 2014 2 commits
  7. 05 May, 2014 2 commits
    • device_cgroup: check if exception removal is allowed · d2c2b11c
      Aristeu Rozanski committed
      [PATCH v3 1/2] device_cgroup: check if exception removal is allowed
      
      When the device cgroup hierarchy was introduced in
      	bd2953eb - devcg: propagate local changes down the hierarchy
      
      a specific case was overlooked. Consider the hierarchy below:
      
      	A	default policy: ALLOW, exceptions will deny access
      	 \
      	  B	default policy: ALLOW, exceptions will deny access
      
      There's no need to verify when a new exception is added to B because
      in this case exceptions will deny access to further devices, which is
      always fine. Hierarchy in device cgroup only makes sure B won't have
      more access than A.
      
      But when an exception is removed (by writing devices.allow), it isn't
      checked if the user is in fact removing an inherited exception from A,
      thus giving more access to B.
      
      Example:
      
      	# echo 'a' >A/devices.allow
      	# echo 'c 1:3 rw' >A/devices.deny
      	# echo $$ >A/B/tasks
      	# echo >/dev/null
      	-bash: /dev/null: Operation not permitted
      	# echo 'c 1:3 w' >A/B/devices.allow
      	# echo >/dev/null
      	#
      
      This shouldn't be allowed and this patch fixes it by making sure to never allow
      exceptions in this case to be removed if the exception is partially or fully
      present on the parent.
      
      v3: missing '*' in function description
      v2: improved log message and formatting fixes
      
      Cc: cgroups@vger.kernel.org
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • device_cgroup: fix the comment format for recently added functions · f5f3cf6f
      Aristeu Rozanski committed
      Moving more extensive explanations to the end of the comment.
      
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: NAristeu Rozanski <arozansk@redhat.com>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      f5f3cf6f
  8. 01 5月, 2014 1 次提交
  9. 23 4月, 2014 2 次提交
  10. 22 4月, 2014 2 次提交
    • J
      locks: rename file-private locks to "open file description locks" · 0d3f7a2d
      Jeff Layton 提交于
      File-private locks have been merged into Linux for v3.15, and *now*
      people are commenting that the name and macro definitions for the new
      file-private locks suck.
      
      ...and I can't even disagree. The names and command macros do suck.
      
      We're going to have to live with these for a long time, so it's
      important that we be happy with the names before we're stuck with them.
      The consensus on the lists so far is that they should be rechristened as
      "open file description locks".
      
      The name isn't a big deal for the kernel, but the command macros are not
      visually distinct enough from the traditional POSIX lock macros. The
      glibc and documentation folks are recommending that we change them to
      look like F_OFD_{GETLK|SETLK|SETLKW}. That lessens the chance that a
      programmer will typo one of the commands wrong, and also makes it easier
      to spot this difference when reading code.
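      For illustration, a non-blocking whole-file write lock taken with the
      renamed command might look like the sketch below. ofd_try_wrlock() is a
      hypothetical helper, not part of the patch. The fallback value 37 for
      F_OFD_SETLK is taken from the asm-generic uapi fcntl.h and is only used
      if the libc headers do not expose the macro.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* exposes F_OFD_* in glibc's <fcntl.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef F_OFD_SETLK
#define F_OFD_SETLK 37	/* assumption: asm-generic value, see lead-in */
#endif

/*
 * Try to take a write lock on the whole file. The lock is owned by the
 * open file description behind fd, not by the calling process.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int ofd_try_wrlock(int fd)
{
	struct flock fl = {
		.l_type = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 0,	/* 0 means "up to end of file" */
		.l_pid = 0,	/* must be 0 for the OFD commands */
	};

	assert(fd >= 0);
	return fcntl(fd, F_OFD_SETLK, &fl);
}
```

      Opening the same file twice in one process and locking through both
      descriptors should make the second attempt fail, which is exactly the
      behavior that classic POSIX locks lack.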
      
      This patch makes the following changes that I think are necessary before
      v3.15 ships:
      
      1) rename the command macros to their new names. These end up in the uapi
         headers and so are part of the external-facing API. It turns out that
         glibc doesn't actually use the fcntl.h uapi header, but it's hard to
         be sure that something else won't. Changing it now is safest.
      
      2) make the /proc/locks output display these as type "OFDLCK"
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Carlos O'Donell <carlos@redhat.com>
      Cc: Stefan Metzmacher <metze@samba.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Frank Filz <ffilzlnx@mindspring.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
    • device_cgroup: rework device access check and exception checking · 79d71974
      Aristeu Rozanski committed
      Whenever a device file is opened and checked against current device
      cgroup rules, it uses the same function (may_access()) as when a new
      exception rule is added by writing devices.{allow,deny}. And in both
      cases, the algorithm is the same, regardless of the behavior.

      The first problem is that a device access check is treated the same as
      rule checking. Consider the following structure:
      
      	A	(default behavior: allow, exceptions disallow access)
      	 \
      	  B	(default behavior: allow, exceptions disallow access)
      
      A new exception is added to B by writing devices.deny:
      
      	c 12:34 rw
      
      When checking if that exception is allowed in may_access():
      
      	if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) {
      		if (behavior == DEVCG_DEFAULT_ALLOW) {
      			/* the exception will deny access to certain devices */
      			return true;
      
      Which is ok, since B is not getting more privileges than A, it doesn't
      matter and the rule is accepted
      
      Now, consider it's a device file open check and the process belongs to
      cgroup B. The access will be generated as:
      
      	behavior: allow
      	exception: c 12:34 rw
      
      The very same chunk of code will allow it, even if there's an explicit
      exception telling to do otherwise.
      
      A simple test case:
      
      	# mkdir new_group
      	# cd new_group
      	# echo $$ >tasks
      	# echo "c 1:3 w" >devices.deny
      	# echo >/dev/null
      	# echo $?
      	0
      
      This is a serious bug and was introduced on
      
      	c39a2a30 devcg: prepare may_access() for hierarchy support
      
      To solve this problem, the device file open function was split from the
      new exception check.
      
      Second problem is how exceptions are processed by may_access(). The
      first part of the said function tries to match fully with an existing
      exception:
      
      	list_for_each_entry_rcu(ex, &dev_cgroup->exceptions, list) {
      		if ((refex->type & DEV_BLOCK) && !(ex->type & DEV_BLOCK))
      			continue;
      		if ((refex->type & DEV_CHAR) && !(ex->type & DEV_CHAR))
      			continue;
      		if (ex->major != ~0 && ex->major != refex->major)
      			continue;
      		if (ex->minor != ~0 && ex->minor != refex->minor)
      			continue;
      		if (refex->access & (~ex->access))
      			continue;
      		match = true;
      		break;
      	}
      
      That means the new exception should be contained into an existing one to
      be considered a match:
      
      	New exception		Existing	match?	notes
      	b 12:34 rwm		b 12:34 rwm	yes
      	b 12:34 r		b *:34 rw	yes
      	b 12:34 rw		b 12:34 w	no	extra "r"
      	b *:34 rw		b 12:34 rw	no	too broad "*"
      	b *:34 rw		b *:34 rwm	yes
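      The full-match rule in the table can be reproduced in userspace with a
      small sketch. The loop mirrors the checks quoted above, but the struct
      layout, the constant values and the array-based match_full() are
      assumptions made for illustration (the kernel walks an RCU list).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants; DEV_ANY plays the role of the "*" wildcard (~0). */
enum { DEV_BLOCK = 1, DEV_CHAR = 2 };
enum { ACC_MKNOD = 1, ACC_READ = 2, ACC_WRITE = 4 };
#define DEV_ANY UINT32_MAX

struct dev_exception {
	uint32_t major, minor;
	short type;
	short access;
};

/*
 * Returns true if refex is fully contained in one of the n existing
 * exceptions: same device type, major/minor equal or wildcarded on the
 * existing side, and no access bit outside the existing exception.
 */
static bool match_full(const struct dev_exception *exs, size_t n,
		       const struct dev_exception *refex)
{
	assert(refex != NULL);
	assert(exs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct dev_exception *ex = &exs[i];

		if ((refex->type & DEV_BLOCK) && !(ex->type & DEV_BLOCK))
			continue;
		if ((refex->type & DEV_CHAR) && !(ex->type & DEV_CHAR))
			continue;
		if (ex->major != DEV_ANY && ex->major != refex->major)
			continue;
		if (ex->minor != DEV_ANY && ex->minor != refex->minor)
			continue;
		if (refex->access & ~ex->access)
			continue;	/* refex asks for an extra access bit */
		return true;
	}
	return false;
}
```

      Feeding the five rows of the table through match_full() gives the same
      yes/no answers, including the "too broad" case where the new exception
      uses a wildcard that the existing one does not.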
      
      Which is fine in some cases. Consider:
      
      	A	(default behavior: deny, exceptions allow access)
      	 \
      	  B	(default behavior: deny, exceptions allow access)
      
      In this case the full match makes sense, the new exception cannot add
      more access than the parent allows
      
      But this doesn't always work, consider:
      
      	A	(default behavior: allow, exceptions disallow access)
      	 \
      	  B	(default behavior: deny, exceptions allow access)
      
      In this case, a new exception in B shouldn't match any of the exceptions
      in A, after all you can't allow something that was forbidden by A. But
      consider this scenario:
      
      	New exception	Existing in A	match?	outcome
      	b 12:34 rw	b 12:34 r	no	exception is accepted
      
      Because the new exception has "w" as extra, it doesn't match, so it'll
      be added to B's exception list.
      
      The same problem can happen during a file access check. Consider a
      cgroup with allow as default behavior:
      
      	Access		Exception	match?
      	b 12:34 rw	b 12:34 r	no
      
      In this case, the access didn't match any of the exceptions in the
      cgroup, which is required since exceptions will disallow access.
      
      To solve this problem, two new functions were created to match an
      exception either fully or partially. In the example above, a partial
      check will be performed and it'll produce a match since at least
      "b 12:34 r" from "b 12:34 rw" access matches.
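      A partial match can be sketched along the same lines. This is a
      userspace approximation of the idea, not necessarily the code the patch
      added: the struct layout, the constants and match_partial() itself are
      assumptions. The difference from a full match is that a wildcard on
      either side overlaps, and one shared access bit is enough.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants; DEV_ANY plays the role of the "*" wildcard (~0). */
enum { DEV_BLOCK = 1, DEV_CHAR = 2 };
enum { ACC_MKNOD = 1, ACC_READ = 2, ACC_WRITE = 4 };
#define DEV_ANY UINT32_MAX

struct dev_exception {
	uint32_t major, minor;
	short type;
	short access;
};

/*
 * Returns true if req overlaps at least one of the n existing
 * exceptions, even if it is not fully contained in it.
 */
static bool match_partial(const struct dev_exception *exs, size_t n,
			  const struct dev_exception *req)
{
	assert(req != NULL);
	assert(exs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct dev_exception *ex = &exs[i];

		if ((req->type & DEV_BLOCK) && !(ex->type & DEV_BLOCK))
			continue;
		if ((req->type & DEV_CHAR) && !(ex->type & DEV_CHAR))
			continue;
		/* numbers differ only if neither side is a wildcard */
		if (ex->major != DEV_ANY && req->major != DEV_ANY &&
		    ex->major != req->major)
			continue;
		if (ex->minor != DEV_ANY && req->minor != DEV_ANY &&
		    ex->minor != req->minor)
			continue;
		/* no access bit in common means no overlap */
		if (!(req->access & ex->access))
			continue;
		return true;
	}
	return false;
}
```

      With the exception "b 12:34 r" and the access "b 12:34 rw", this check
      reports a match because the "r" bit is shared, so a cgroup that allows
      by default would refuse the access.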
      
      Cc: cgroups@vger.kernel.org
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  11. 15 Apr, 2014 1 commit
  12. 12 Apr, 2014 8 commits
  13. 02 Apr, 2014 1 commit
  14. 01 Apr, 2014 2 commits
  15. 31 Mar, 2014 1 commit
    • locks: add new fcntl cmd values for handling file private locks · 5d50ffd7
      Jeff Layton committed
      Due to some unfortunate history, POSIX locks have very strange and
      unhelpful semantics. The thing that usually catches people by surprise
      is that they are dropped whenever the process closes any file descriptor
      associated with the inode.
      
      This is extremely problematic for people developing file servers that
      need to implement byte-range locks. Developers often need a "lock
      management" facility to ensure that file descriptors are not closed
      until all of the locks associated with the inode are finished.
      
      Additionally, "classic" POSIX locks are owned by the process. Locks
      taken between threads within the same process won't conflict with one
      another, which renders them useless for synchronization between threads.
      
      This patchset adds a new type of lock that attempts to address these
      issues. These locks conflict with classic POSIX read/write locks, but
      have semantics that are more like BSD locks with respect to inheritance
      and behavior on close.
      
      This is implemented primarily by changing how fl_owner field is set for
      these locks. Instead of having them owned by the files_struct of the
      process, they are instead owned by the filp on which they were acquired.
      Thus, they are inherited across fork() and are only released when the
      last reference to a filp is put.
      
      These new semantics prevent them from being merged with classic POSIX
      locks, even if they are acquired by the same process. These locks will
      also conflict with classic POSIX locks even if they are acquired by
      the same process or on the same file descriptor.
      
      The new locks are managed using a new set of cmd values to the fcntl()
      syscall. The initial implementation of this converts these values to
      "classic" cmd values at a fairly high level, and the details are not
      exposed to the underlying filesystem. We may eventually want to push
      this handling out to the lower filesystem code but for now I don't
      see any need for it.
      
      Also, note that with this implementation the new cmd values are only
      available via fcntl64() on 32-bit arches. There's little need to
      add support for legacy apps on a new interface like this.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
  16. 20 Mar, 2014 1 commit