  1. 04 Apr, 2019 1 commit
    • kernfs: fix xattr name handling in LSM helpers · 1537ad15
      Committed by Ondrej Mosnacek
      The implementation of kernfs_security_xattr_*() helpers reuses the
      kernfs_node_xattr_*() functions, which take the suffix of the xattr name
      and extract full xattr name from it using xattr_full_name(). However,
      this function relies on the fact that the suffix passed to xattr
      handlers from VFS is always constructed from the full name by just
      incrementing the pointer. This doesn't necessarily hold for the callers
      of kernfs_security_xattr_*(), so their usage will easily lead to
      out-of-bounds access.
      
      Fix this by moving the xattr name reconstruction to the VFS xattr
      handlers and replacing the kernfs_security_xattr_*() helpers with more
      general kernfs_xattr_*() helpers that take full xattr name and allow
      accessing all kernfs node's xattrs.
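The pointer assumption described above can be illustrated with a small userspace sketch. It is not the kernel code: full_name_from_suffix() and build_full_name() are hypothetical names, and the first one only imitates the "step back over the prefix" trick to show when it is valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Imitates the pointer trick: step back over the prefix to recover the
 * full name. Only valid when `suffix` really points into a buffer that
 * begins with `prefix`; for any other pointer this reads out of bounds.
 */
static const char *full_name_from_suffix(const char *prefix, const char *suffix)
{
	return suffix - strlen(prefix);
}

/*
 * Safe alternative for callers that hold a standalone suffix: build the
 * full name in a caller-supplied buffer. Returns 0 on success, -1 if the
 * result would not fit.
 */
static int build_full_name(char *buf, size_t len,
			   const char *prefix, const char *suffix)
{
	int n;

	assert(buf && prefix && suffix);
	n = snprintf(buf, len, "%s%s", prefix, suffix);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller that received its suffix from a full-name buffer may use the first helper; a caller that starts from a bare string such as "selinux" needs the second.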
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Fixes: b230d5ab ("LSM: add new hook for kernfs node initialization")
      Fixes: ec882da5 ("selinux: implement the kernfs_init_security hook")
      Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
  2. 21 Mar, 2019 1 commit
    • LSM: add new hook for kernfs node initialization · b230d5ab
      Committed by Ondrej Mosnacek
      This patch introduces a new security hook that is intended for
      initializing the security data for newly created kernfs nodes, which
      provide a way of storing a non-default security context, but need to
      operate independently from mounts (and therefore may not have an
      associated inode at the moment of creation).
      
      The main motivation is to allow kernfs nodes to inherit the context of
      the parent under SELinux, similar to the behavior of
      security_inode_init_security(). Other LSMs may implement their own logic
      for handling the creation of new nodes.
      
      This patch also adds helper functions to <linux/kernfs.h> for
      getting/setting security xattrs of a kernfs node so that LSM hooks are
      able to do their job. Other important attributes should be accessible
      directly in the kernfs_node fields (in case there is need for more, then
      new helpers should be added to kernfs.h along with the patch that needs
      them).
      Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
      Acked-by: Casey Schaufler <casey@schaufler-ca.com>
      [PM: more manual merge fixes]
      Signed-off-by: Paul Moore <paul@paul-moore.com>
  3. 06 Mar, 2019 1 commit
    • fs: kernfs: add poll file operation · 147e1a97
      Committed by Johannes Weiner
      Patch series "psi: pressure stall monitors", v3.
      
      Android is adopting psi to detect and remedy memory pressure that
      results in stuttering and decreased responsiveness on mobile devices.
      
      Psi gives us the stall information, but because we're dealing with
      latencies in the millisecond range, periodically reading the pressure
      files to detect stalls in a timely fashion is not feasible.  Psi also
      doesn't aggregate its averages at a high enough frequency right now.
      
      This patch series extends the psi interface such that users can
      configure sensitive latency thresholds and use poll() and friends to be
      notified when these are breached.
      
      As high-frequency aggregation is costly, it implements an aggregation
      method that is optimized for fast, short-interval averaging, and makes
      the aggregation frequency adaptive, such that high-frequency updates
      only happen while monitored stall events are actively occurring.
      
      With these patches applied, Android can monitor for, and ward off,
      mounting memory shortages before they cause problems for the user.  For
      example, using memory stall monitors in the userspace low memory killer
      daemon (lmkd) we can detect mounting pressure and kill less important
      processes before the device becomes visibly sluggish.
      
      In our memory stress testing psi memory monitors produce roughly 10x
      fewer false positives compared to vmpressure signals.  Having the ability
      to specify multiple triggers for the same psi metric allows other parts
      of the Android framework to monitor the memory state of the device and
      act accordingly.
      
      The new interface is straightforward.  The user opens one of the
      pressure files for writing and writes a trigger description into the
      file descriptor that defines the stall state - some or full, and the
      maximum stall time over a given window of time.  E.g.:
      
              /* Signal when stall time exceeds 100ms of a 1s window */
              char trigger[] = "full 100000 1000000";
              struct pollfd pfd;

              pfd.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
              pfd.events = POLLPRI;
              write(pfd.fd, trigger, sizeof(trigger));
              while (poll(&pfd, 1, -1) >= 0) {
                      ...
              }
              close(pfd.fd);
      
      When the monitored stall state is entered, psi adapts its aggregation
      frequency according to what the configured time window requires in order
      to emit event signals in a timely fashion.  Once the stalling subsides,
      aggregation reverts back to normal.
      
      The trigger is associated with the open file descriptor.  To stop
      monitoring, the user only needs to close the file descriptor and the
      trigger is discarded.
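The open/write/poll/close lifecycle described above could be wrapped roughly as follows. This is a minimal sketch, not code from the patch series: the helper names psi_format_trigger(), psi_register_trigger() and psi_wait_event() are hypothetical, and the use of O_RDWR | O_NONBLOCK and POLLPRI is an assumption about how the pressure files are meant to be polled.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Format "<some|full> <stall_us> <window_us>"; returns the length or -1. */
static int psi_format_trigger(char *buf, size_t len, const char *kind,
			      unsigned long stall_us, unsigned long window_us)
{
	int n;

	assert(buf && kind);
	n = snprintf(buf, len, "%s %lu %lu", kind, stall_us, window_us);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Register a trigger on a pressure file and return the fd to poll on,
 * or -1 on failure. Closing the returned fd discards the trigger.
 */
static int psi_register_trigger(const char *path, const char *trigger)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -1;
	/* The trailing NUL is written too, matching sizeof(trigger) above. */
	if (write(fd, trigger, strlen(trigger) + 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Wait up to timeout_ms: 1 if an event fired, 0 on timeout, -1 on error. */
static int psi_wait_event(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0 || (n > 0 && (pfd.revents & POLLERR)))
		return -1;
	return n > 0 && (pfd.revents & POLLPRI) ? 1 : 0;
}
```

A monitor would format one trigger per threshold, register each on its own file descriptor, and close the descriptor when it no longer wants that threshold monitored.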
      
      Patches 1-4 prepare the psi code for polling support.  Patch 5
      implements the adaptive polling logic, the pressure growth detection
      optimized for short intervals, and hooks up write() and poll() on the
      pressure files.
      
      The patches were developed in collaboration with Johannes Weiner.
      
      This patch (of 5):
      
      Kernfs has a standardized poll/notification mechanism for waking all
      pollers on all fds when a filesystem node changes.  To allow polling for
      custom events, add a .poll callback that can override the default.
      
      This is in preparation for pollable cgroup pressure files which have
      per-fd trigger configurations.
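The override described above is a plain optional-callback dispatch. The following userspace sketch shows the shape of it under stated assumptions: every type, field and constant here (node_ops, open_file, EV_READABLE, EV_PRIORITY, trigger_poll) is hypothetical and only stands in for the corresponding kernfs structures.

```c
#include <assert.h>
#include <stddef.h>

#define EV_READABLE 0x1u
#define EV_PRIORITY 0x2u

struct node;

/* Hypothetical per-node operations table; `poll` is optional. */
struct node_ops {
	unsigned int (*poll)(struct node *n, void *per_fd_state);
};

struct node {
	const struct node_ops *ops;
	unsigned int event_count;	/* bumped whenever the node changes */
};

struct open_file {
	struct node *node;
	unsigned int seen_count;	/* event_count at the last read */
	void *per_fd_state;		/* e.g. a per-fd trigger */
};

/* Default behaviour: flag an event if the node changed since the last read. */
static unsigned int generic_poll(struct open_file *of)
{
	unsigned int mask = EV_READABLE;

	if (of->seen_count != of->node->event_count)
		mask |= EV_PRIORITY;
	return mask;
}

/* Dispatch: a node-specific callback, when present, overrides the default. */
static unsigned int file_poll(struct open_file *of)
{
	const struct node_ops *ops;

	assert(of && of->node);
	ops = of->node->ops;
	if (ops && ops->poll)
		return ops->poll(of->node, of->per_fd_state);
	return generic_poll(of);
}

/* Example custom callback: fires only when this fd's own trigger is set. */
static unsigned int trigger_poll(struct node *n, void *per_fd_state)
{
	(void)n;
	return *(int *)per_fd_state ? EV_PRIORITY : 0;
}
```

The point of the design is that the default wakes every poller on a node change, while a custom callback can decide per file descriptor.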
      
      Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.com
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 28 Feb, 2019 1 commit
    • kernfs, sysfs, cgroup, intel_rdt: Support fs_context · 23bf1b6b
      Committed by David Howells
      Make kernfs support superblock creation/mount/remount with fs_context.
      
      This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
      be made to support fs_context also.
      
      Notes:
      
       (1) A kernfs_fs_context struct is created to wrap fs_context and the
           kernfs mount parameters are moved in here (or are in fs_context).
      
       (2) kernfs_mount{,_ns}() are made into kernfs_get_tree().  The extra
           namespace tag parameter is passed in the context if desired.
      
       (3) kernfs_free_fs_context() is provided as a destructor for the
           kernfs_fs_context struct, but for the moment it does nothing except
           get called in the right places.
      
       (4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
           pass, but possibly this should be done anyway in case someone wants to
           add a parameter in future.
      
       (5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
           the cgroup v1 and v2 mount parameters are all moved there.
      
       (6) cgroup1 parameter parsing error messages are now handled by invalf(),
           which allows userspace to collect them directly.
      
       (7) cgroup1 parameter cleanup is now done in the context destructor rather
           than in the mount/get_tree and remount functions.
      
      Weirdies:
      
       (*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
           but then uses the resulting pointer after dropping the locks.  I'm
           told this is okay and needs commenting.
      
       (*) The cgroup refcount web.  This really needs documenting.
      
       (*) cgroup2 only has one root?
      
      Add a suggestion from Thomas Gleixner in which the RDT enablement code is
      placed into its own function.
      
      [folded a leak fix from Andrey Vagin]
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cc: Tejun Heo <tj@kernel.org>
      cc: Li Zefan <lizefan@huawei.com>
      cc: Johannes Weiner <hannes@cmpxchg.org>
      cc: cgroups@vger.kernel.org
      cc: fenghua.yu@intel.com
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 18 Jan, 2019 1 commit
  6. 17 Sep, 2018 1 commit
  7. 21 Jul, 2018 1 commit
  8. 29 Jul, 2017 5 commits
  9. 28 Dec, 2016 2 commits
  10. 10 Aug, 2016 3 commits
  11. 10 May, 2016 1 commit
    • cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces · 4f41fc59
      Committed by Serge E. Hallyn
      Patch summary:
      
      When showing a cgroupfs entry in mountinfo, show the path of the mount
      root dentry relative to the reader's cgroup namespace root.
      
      Short explanation (courtesy of mkerrisk):
      
      If we create a new cgroup namespace, then we want both /proc/self/cgroup
      and /proc/self/mountinfo to show cgroup paths that are correctly
      virtualized with respect to the cgroup mount point.  Previous to this
      patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
      does not.
      
      Long version:
      
      When a uid 0 task in freezer cgroup /a/b unshares a new cgroup
      namespace and then mounts a new instance of the freezer cgroup, the new
      mount will be rooted at /a/b.  The root dentry field of the mountinfo
      entry will show '/a/b'.
      
       cat > /tmp/do1 << EOF
       mount -t cgroup -o freezer freezer /mnt
       grep freezer /proc/self/mountinfo
       EOF
      
       unshare -Gm  bash /tmp/do1
       > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
       > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
      
      The task's freezer cgroup entry in /proc/self/cgroup will simply show
      '/':
      
       grep freezer /proc/self/cgroup
       9:freezer:/
      
      If instead the same task simply bind mounts the /a/b cgroup directory,
      the resulting mountinfo entry will again show /a/b for the dentry root.
      However in this case the task will find its own cgroup at /mnt/a/b,
      not at /mnt:
      
       mount --bind /sys/fs/cgroup/freezer/a/b /mnt
       130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
      
      In other words, there is no way for the task to know, based on what is
      in mountinfo, which cgroup directory is its own.
      
      Example (by mkerrisk):
      
      First, a little script to save some typing and verbiage:
      
      echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
      cat /proc/self/mountinfo | grep freezer |
              awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
      
      Create cgroup, place this shell into the cgroup, and look at the state
      of the /proc files:
      
      2653
      2653                         # Our shell
      14254                        # cat(1)
              /proc/self/cgroup:      10:freezer:/a/b
              mountinfo:              /       /sys/fs/cgroup/freezer
      
      Create a shell in new cgroup and mount namespaces. The act of creating
      a new cgroup namespace causes the process's current cgroups directories
      to become its cgroup root directories. (Here, I'm using my own version
      of the "unshare" utility, which takes the same options as the util-linux
      version):
      
      Look at the state of the /proc files:
      
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /       /sys/fs/cgroup/freezer
      
      The third entry in /proc/self/cgroup (the pathname of the cgroup inside
      the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
      is rooted at /a/b in the outer namespace.
      
      However, the info in /proc/self/mountinfo is not for this cgroup
      namespace, since we are seeing a duplicate of the mount from the
      old mount namespace, and the info there does not correspond to the
      new cgroup namespace. However, trying to create a new mount still
      doesn't show us the right information in mountinfo:
      
                                            # propagating to other mountns
              /proc/self/cgroup:      7:freezer:/
              mountinfo:              /a/b    /mnt/freezer
      
      The act of creating a new cgroup namespace caused the process's
      current freezer directory, "/a/b", to become its cgroup freezer root
      directory. In other words, the pathname of the root directory
      within the newly mounted cgroup filesystem should be "/",
      but mountinfo wrongly shows us "/a/b". The consequence of this is
      that the process in the cgroup namespace cannot correctly construct
      the pathname of its cgroup root directory from the information in
      /proc/PID/mountinfo.
      
      With this patch, the dentry root field in mountinfo is shown relative
      to the reader's cgroup namespace.  So the same steps as above:
      
              /proc/self/cgroup:      10:freezer:/a/b
              mountinfo:              /       /sys/fs/cgroup/freezer
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /../..  /sys/fs/cgroup/freezer
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /       /mnt/freezer
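The three mountinfo values above ("/", "/../.." and "/") follow from expressing the mount root relative to the reader's cgroup namespace root. A string-based sketch of that computation is given below. It is an illustration only: path_from_ns_root() and append() are hypothetical helpers, both inputs are assumed to be normalized absolute paths without a trailing slash, and the kernel works on kernfs nodes rather than on strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append n bytes of s to buf, keeping it NUL-terminated; -1 if it won't fit. */
static int append(char *buf, size_t len, size_t *used, const char *s, size_t n)
{
	if (*used + n + 1 > len)
		return -1;
	memcpy(buf + *used, s, n);
	*used += n;
	buf[*used] = '\0';
	return 0;
}

/* Write the path of `target` as seen from `ns_root` into buf. */
static int path_from_ns_root(const char *ns_root, const char *target,
			     char *buf, size_t len)
{
	const char *r = ns_root, *t = target;
	size_t used = 0;

	assert(ns_root[0] == '/' && target[0] == '/' && len > 0);
	buf[0] = '\0';

	/* Skip the leading components that both paths share. */
	while (r[0] == '/' && t[0] == '/') {
		size_t rl = strcspn(r + 1, "/");
		size_t tl = strcspn(t + 1, "/");

		if (rl == 0 || rl != tl || strncmp(r + 1, t + 1, rl) != 0)
			break;
		r += 1 + rl;
		t += 1 + tl;
	}

	/* One "/.." for every component left in the namespace root. */
	while (r[0] == '/' && r[1] != '\0') {
		if (append(buf, len, &used, "/..", 3))
			return -1;
		r += 1 + strcspn(r + 1, "/");
	}

	/* Then descend along whatever is left of the target. */
	if (t[0] == '/' && t[1] != '\0' &&
	    append(buf, len, &used, t, strlen(t)))
		return -1;

	/* Both paths were identical: the target is the root itself. */
	if (used == 0)
		return append(buf, len, &used, "/", 1);
	return 0;
}
```

With a namespace rooted at /a/b, the original mount (rooted at "/") comes out as "/../..", and a mount rooted at /a/b comes out as "/".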
      
      cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
      cgroup.procs           freezer.self_freezing    notify_on_release
      3164
      2653                   # First shell that placed in this cgroup
      3164                   # Shell started by 'unshare'
      14197                  # cat(1)
      Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
      Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  12. 01 May, 2016 1 commit
    • kernfs: Move faulting copy_user operations outside of the mutex · e4234a1f
      Committed by Chris Wilson
      A fault in a user provided buffer may lead anywhere, and lockdep warns
      that we have a potential deadlock between the mm->mmap_sem and the
      kernfs file mutex:
      
      [   82.811702] ======================================================
      [   82.811705] [ INFO: possible circular locking dependency detected ]
      [   82.811709] 4.5.0-rc4-gfxbench+ #1 Not tainted
      [   82.811711] -------------------------------------------------------
      [   82.811714] kms_setmode/5859 is trying to acquire lock:
      [   82.811717]  (&dev->struct_mutex){+.+.+.}, at: [<ffffffff8150d9c1>] drm_gem_mmap+0x1a1/0x270
      [   82.811731]
      but task is already holding lock:
      [   82.811734]  (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0
      [   82.811745]
      which lock already depends on the new lock.
      
      [   82.811749]
      the existing dependency chain (in reverse order) is:
      [   82.811752]
      -> #3 (&mm->mmap_sem){++++++}:
      [   82.811761]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
      [   82.811766]        [<ffffffff8118bc65>] __might_fault+0x75/0xa0
      [   82.811771]        [<ffffffff8124da4a>] kernfs_fop_write+0x8a/0x180
      [   82.811787]        [<ffffffff811d1023>] __vfs_write+0x23/0xe0
      [   82.811792]        [<ffffffff811d1d74>] vfs_write+0xa4/0x190
      [   82.811797]        [<ffffffff811d2c14>] SyS_write+0x44/0xb0
      [   82.811801]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
      [   82.811807]
      -> #2 (s_active#6){++++.+}:
      [   82.811814]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
      [   82.811819]        [<ffffffff8124c070>] __kernfs_remove+0x210/0x2f0
      [   82.811823]        [<ffffffff8124d040>] kernfs_remove_by_name_ns+0x40/0xa0
      [   82.811828]        [<ffffffff8124e9e0>] sysfs_remove_file_ns+0x10/0x20
      [   82.811832]        [<ffffffff815318d4>] device_del+0x124/0x250
      [   82.811837]        [<ffffffff81531a19>] device_unregister+0x19/0x60
      [   82.811841]        [<ffffffff8153c051>] cpu_cache_sysfs_exit+0x51/0xb0
      [   82.811846]        [<ffffffff8153c628>] cacheinfo_cpu_callback+0x38/0x70
      [   82.811851]        [<ffffffff8109ae89>] notifier_call_chain+0x39/0xa0
      [   82.811856]        [<ffffffff8109aef9>] __raw_notifier_call_chain+0x9/0x10
      [   82.811860]        [<ffffffff810786de>] cpu_notify+0x1e/0x40
      [   82.811865]        [<ffffffff81078779>] cpu_notify_nofail+0x9/0x20
      [   82.811869]        [<ffffffff81078ac3>] _cpu_down+0x233/0x340
      [   82.811874]        [<ffffffff81079019>] disable_nonboot_cpus+0xc9/0x350
      [   82.811878]        [<ffffffff810d2e11>] suspend_devices_and_enter+0x5a1/0xb50
      [   82.811883]        [<ffffffff810d3903>] pm_suspend+0x543/0x8d0
      [   82.811888]        [<ffffffff810d1b77>] state_store+0x77/0xe0
      [   82.811892]        [<ffffffff813fa68f>] kobj_attr_store+0xf/0x20
      [   82.811897]        [<ffffffff8124e740>] sysfs_kf_write+0x40/0x50
      [   82.811902]        [<ffffffff8124dafc>] kernfs_fop_write+0x13c/0x180
      [   82.811906]        [<ffffffff811d1023>] __vfs_write+0x23/0xe0
      [   82.811910]        [<ffffffff811d1d74>] vfs_write+0xa4/0x190
      [   82.811914]        [<ffffffff811d2c14>] SyS_write+0x44/0xb0
      [   82.811918]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
      [   82.811923]
      -> #1 (cpu_hotplug.lock){+.+.+.}:
      [   82.811929]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
      [   82.811933]        [<ffffffff817b6f72>] mutex_lock_nested+0x62/0x3b0
      [   82.811940]        [<ffffffff810784c1>] get_online_cpus+0x61/0x80
      [   82.811944]        [<ffffffff811170eb>] stop_machine+0x1b/0xe0
      [   82.811949]        [<ffffffffa0178edd>] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915]
      [   82.812009]        [<ffffffffa017d3a6>] ggtt_bind_vma+0x46/0x70 [i915]
      [   82.812045]        [<ffffffffa017eb70>] i915_vma_bind+0x140/0x290 [i915]
      [   82.812081]        [<ffffffffa01862b9>] i915_gem_object_do_pin+0x899/0xb00 [i915]
      [   82.812117]        [<ffffffffa0186555>] i915_gem_object_pin+0x35/0x40 [i915]
      [   82.812154]        [<ffffffffa019a23e>] intel_init_pipe_control+0xbe/0x210 [i915]
      [   82.812192]        [<ffffffffa0197312>] intel_logical_rings_init+0xe2/0xde0 [i915]
      [   82.812232]        [<ffffffffa0186fe3>] i915_gem_init+0xf3/0x130 [i915]
      [   82.812278]        [<ffffffffa02097ed>] i915_driver_load+0xf2d/0x1770 [i915]
      [   82.812318]        [<ffffffff81512474>] drm_dev_register+0xa4/0xb0
      [   82.812323]        [<ffffffff8151467e>] drm_get_pci_dev+0xce/0x1e0
      [   82.812328]        [<ffffffffa01472cf>] i915_pci_probe+0x2f/0x50 [i915]
      [   82.812360]        [<ffffffff8143f907>] pci_device_probe+0x87/0xf0
      [   82.812366]        [<ffffffff81535f89>] driver_probe_device+0x229/0x450
      [   82.812371]        [<ffffffff81536233>] __driver_attach+0x83/0x90
      [   82.812375]        [<ffffffff81533c61>] bus_for_each_dev+0x61/0xa0
      [   82.812380]        [<ffffffff81535879>] driver_attach+0x19/0x20
      [   82.812384]        [<ffffffff8153535f>] bus_add_driver+0x1ef/0x290
      [   82.812388]        [<ffffffff81536e9b>] driver_register+0x5b/0xe0
      [   82.812393]        [<ffffffff8143e83b>] __pci_register_driver+0x5b/0x60
      [   82.812398]        [<ffffffff81514866>] drm_pci_init+0xd6/0x100
      [   82.812402]        [<ffffffffa027c094>] 0xffffffffa027c094
      [   82.812406]        [<ffffffff810003de>] do_one_initcall+0xae/0x1d0
      [   82.812412]        [<ffffffff811595a0>] do_init_module+0x5b/0x1cb
      [   82.812417]        [<ffffffff81106160>] load_module+0x1c20/0x2480
      [   82.812422]        [<ffffffff81106bae>] SyS_finit_module+0x7e/0xa0
      [   82.812428]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
      [   82.812433]
      -> #0 (&dev->struct_mutex){+.+.+.}:
      [   82.812439]        [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0
      [   82.812443]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
      [   82.812456]        [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270
      [   82.812460]        [<ffffffff81196a14>] mmap_region+0x334/0x580
      [   82.812466]        [<ffffffff81196fc4>] do_mmap+0x364/0x410
      [   82.812470]        [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0
      [   82.812474]        [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220
      [   82.812479]        [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20
      [   82.812484]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
      [   82.812489]
      other info that might help us debug this:
      
      [   82.812493] Chain exists of:
        &dev->struct_mutex --> s_active#6 --> &mm->mmap_sem
      
      [   82.812502]  Possible unsafe locking scenario:
      
      [   82.812506]        CPU0                    CPU1
      [   82.812508]        ----                    ----
      [   82.812510]   lock(&mm->mmap_sem);
      [   82.812514]                                lock(s_active#6);
      [   82.812519]                                lock(&mm->mmap_sem);
      [   82.812522]   lock(&dev->struct_mutex);
      [   82.812526]
       *** DEADLOCK ***
      
      [   82.812531] 1 lock held by kms_setmode/5859:
      [   82.812533]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0
      [   82.812541]
      stack backtrace:
      [   82.812547] CPU: 0 PID: 5859 Comm: kms_setmode Not tainted 4.5.0-rc4-gfxbench+ #1
      [   82.812550] Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015
      [   82.812553]  0000000000000000 ffff880079407bf0 ffffffff813f8505 ffffffff825fb270
      [   82.812560]  ffffffff825c4190 ffff880079407c30 ffffffff810c84ac ffff880079407c90
      [   82.812566]  ffff8800797ed328 ffff8800797ecb00 0000000000000001 ffff8800797ed350
      [   82.812573] Call Trace:
      [   82.812578]  [<ffffffff813f8505>] dump_stack+0x67/0x92
      [   82.812582]  [<ffffffff810c84ac>] print_circular_bug+0x1fc/0x310
      [   82.812586]  [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0
      [   82.812590]  [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
      [   82.812594]  [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270
      [   82.812599]  [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270
      [   82.812603]  [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270
      [   82.812608]  [<ffffffff81196a14>] mmap_region+0x334/0x580
      [   82.812612]  [<ffffffff81196fc4>] do_mmap+0x364/0x410
      [   82.812616]  [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0
      [   82.812629]  [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220
      [   82.812633]  [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20
      [   82.812637]  [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
      
      Highly unlikely though this scenario is, we can avoid the issue entirely
      by moving the copy operation out from under the kernfs_get_active()
      tracking and assigning the preallocated buffer its own mutex. The
      temporary buffer allocation doesn't require mutex locking as it is
      entirely local.
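The reordering described above can be sketched in userspace with pthread mutexes. This is an analogy under stated assumptions, not the kernfs code: struct attr_file, file_write() and MAX_LEN are hypothetical, memcpy() stands in for the copy from user memory, and the final memcpy() into `value` stands in for the attribute's write callback.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN 63

struct attr_file {
	pthread_mutex_t mutex;		/* serialises the write callback */
	pthread_mutex_t prealloc_mutex;	/* protects prealloc_buf only */
	char *prealloc_buf;		/* optional shared buffer of MAX_LEN + 1 bytes */
	char value[MAX_LEN + 1];	/* stand-in for the attribute's state */
};

static long file_write(struct attr_file *of, const char *src, size_t count)
{
	size_t len = count < MAX_LEN ? count : MAX_LEN;
	char *buf;

	assert(of && src);
	buf = of->prealloc_buf;
	if (buf)
		pthread_mutex_lock(&of->prealloc_mutex);	/* shared buffer */
	else
		buf = malloc(len + 1);				/* private buffer */
	if (!buf)
		return -1;

	/* The copy that may fault happens before of->mutex is taken. */
	memcpy(buf, src, len);
	buf[len] = '\0';

	pthread_mutex_lock(&of->mutex);
	memcpy(of->value, buf, len + 1);	/* stand-in for the write callback */
	pthread_mutex_unlock(&of->mutex);

	if (buf == of->prealloc_buf)
		pthread_mutex_unlock(&of->prealloc_mutex);
	else
		free(buf);
	return (long)len;
}
```

The private buffer needs no lock because only the calling thread can see it; the shared buffer gets its own lock so that the main mutex never covers the copy.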
      
      The locked section was extended by the addition of the preallocated buf
      to speed up md user operations in
      
      commit 2b75869b
      Author: NeilBrown <neilb@suse.de>
      Date:   Mon Oct 13 16:41:28 2014 +1100
      
          sysfs/kernfs: allow attributes to request write buffer be pre-allocated.
      Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: NeilBrown <neilb@suse.de>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  13. 17 Feb, 2016 2 commits
  14. 21 Nov, 2015 1 commit
  15. 19 Aug, 2015 1 commit
  16. 01 Jul, 2015 1 commit
  17. 19 Jun, 2015 1 commit
  18. 14 Feb, 2015 1 commit
    • kernfs: remove KERNFS_STATIC_NAME · dfeb0750
      Committed by Tejun Heo
      When a new kernfs node is created, KERNFS_STATIC_NAME is used to avoid
      making a separate copy of its name.  It's currently only used for sysfs
      attributes whose filenames are required to stay accessible and unchanged.
      There are rare exceptions where these names are allocated and formatted
      dynamically but for the vast majority of cases they're consts in the
      rodata section.
      
      Now that kernfs is converted to use kstrdup_const() and kfree_const(),
      there's little point in keeping KERNFS_STATIC_NAME around.  Remove it.
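The idea behind the const-aware duplicate and free helpers can be shown with a userspace analogy. It is a sketch under assumptions: static_names is a hypothetical stand-in for the read-only data section, and in_static_region(), name_dup_const() and name_free_const() are illustrative names, not the kernel functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the read-only data section. */
static const char static_names[] = "uevent\0power\0subsystem";

static int in_static_region(const char *s)
{
	uintptr_t p = (uintptr_t)s;
	uintptr_t start = (uintptr_t)static_names;

	return p >= start && p < start + sizeof(static_names);
}

/* Copy only when the string could change or go away under the caller. */
static const char *name_dup_const(const char *s)
{
	char *copy;

	assert(s);
	if (in_static_region(s))
		return s;
	copy = malloc(strlen(s) + 1);
	if (copy)
		strcpy(copy, s);
	return copy;
}

/* Free only what name_dup_const() actually allocated. */
static void name_free_const(const char *s)
{
	if (s && !in_static_region(s))
		free((void *)s);
}
```

Because the decision is made from the pointer itself, callers no longer need a separate flag to say whether a name was copied.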
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andrzej Hajda <a.hajda@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 08 Nov, 2014 1 commit
    • sysfs/kernfs: allow attributes to request write buffer be pre-allocated. · 2b75869b
      Committed by NeilBrown
      md/raid allows metadata management to be performed in user-space.
      At various times, particularly on device failure, the metadata needs
      to be updated before further writes can be permitted.
      This means that the user-space program which updates metadata must
      not block on writeout, and so must not allocate memory.
      
      mlockall(MCL_CURRENT|MCL_FUTURE) and pre-allocation can avoid all
      memory allocation issues for user-memory, but that does not help
      kernel memory.
      Several kernel objects can be pre-allocated.  e.g. files opened before
      any writes to the array are permitted.
      However some kernel allocation happens in places that cannot be
      pre-allocated.
      In particular, writes to sysfs files (to tell md that it can now
      allow writes to the array) allocate a buffer using GFP_KERNEL.
      
      This patch allows attributes to be marked as "PREALLOC".  In that case
      the maximal buffer is allocated when the file is opened, and then used
      on each write instead of allocating a new buffer.
      
      As the same buffer is now shared for all writes on the same file
      description, the mutex is extended to cover full use of the buffer
      including the copy_from_user().
      
      The new __ATTR_PREALLOC() 'or's a new flag in to the 'mode', which is
      inspected by sysfs_add_file_mode_ns() to determine if the file should be
      marked as requiring prealloc.
      
      Despite the comment, we *do* use ->seq_show together with ->prealloc
      in this patch.  The next patch fixes that.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  20. 03 Jul, 2014 1 commit
    • kernfs: kernfs_notify() must be useable from non-sleepable contexts · ecca47ce
      Committed by Tejun Heo
      d911d987 ("kernfs: make kernfs_notify() trigger inotify events
      too") added fsnotify triggering to kernfs_notify() which requires a
      sleepable context.  There are already existing users of
      kernfs_notify() which invoke it from an atomic context and in general
      it's silly to require a sleepable context for triggering a
      notification.
      
      The following is an invalid context bug triggered by md invoking
      sysfs_notify() from IO completion path.
      
       BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
       in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
       2 locks held by swapper/1/0:
        #0:  (&(&vblk->vq_lock)->rlock){-.-...}, at: [<ffffffffa0039042>] virtblk_done+0x42/0xe0 [virtio_blk]
        #1:  (&(&bitmap->counts.lock)->rlock){-.....}, at: [<ffffffff81633718>] bitmap_endwrite+0x68/0x240
       irq event stamp: 33518
       hardirqs last  enabled at (33515): [<ffffffff8102544f>] default_idle+0x1f/0x230
       hardirqs last disabled at (33516): [<ffffffff818122ed>] common_interrupt+0x6d/0x72
       softirqs last  enabled at (33518): [<ffffffff810a1272>] _local_bh_enable+0x22/0x50
       softirqs last disabled at (33517): [<ffffffff810a29e0>] irq_enter+0x60/0x80
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.16.0-0.rc2.git2.1.fc21.x86_64 #1
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        0000000000000000 f90db13964f4ee05 ffff88007d403b80 ffffffff81807b4c
        0000000000000000 ffff88007d403ba8 ffffffff810d4f14 0000000000000000
        0000000000441800 ffff880078fa1780 ffff88007d403c38 ffffffff8180caf2
       Call Trace:
        <IRQ>  [<ffffffff81807b4c>] dump_stack+0x4d/0x66
        [<ffffffff810d4f14>] __might_sleep+0x184/0x240
        [<ffffffff8180caf2>] mutex_lock_nested+0x42/0x440
        [<ffffffff812d76a0>] kernfs_notify+0x90/0x150
        [<ffffffff8163377c>] bitmap_endwrite+0xcc/0x240
        [<ffffffffa00de863>] close_write+0x93/0xb0 [raid1]
        [<ffffffffa00df029>] r1_bio_write_done+0x29/0x50 [raid1]
        [<ffffffffa00e0474>] raid1_end_write_request+0xe4/0x260 [raid1]
        [<ffffffff813acb8b>] bio_endio+0x6b/0xa0
        [<ffffffff813b46c4>] blk_update_request+0x94/0x420
        [<ffffffff813bf0ea>] blk_mq_end_io+0x1a/0x70
        [<ffffffffa00392c2>] virtblk_request_done+0x32/0x80 [virtio_blk]
        [<ffffffff813c0648>] __blk_mq_complete_request+0x88/0x120
        [<ffffffff813c070a>] blk_mq_complete_request+0x2a/0x30
        [<ffffffffa0039066>] virtblk_done+0x66/0xe0 [virtio_blk]
        [<ffffffffa002535a>] vring_interrupt+0x3a/0xa0 [virtio_ring]
        [<ffffffff81116177>] handle_irq_event_percpu+0x77/0x340
        [<ffffffff8111647d>] handle_irq_event+0x3d/0x60
        [<ffffffff81119436>] handle_edge_irq+0x66/0x130
        [<ffffffff8101c3e4>] handle_irq+0x84/0x150
        [<ffffffff818146ad>] do_IRQ+0x4d/0xe0
        [<ffffffff818122f2>] common_interrupt+0x72/0x72
        <EOI>  [<ffffffff8105f706>] ? native_safe_halt+0x6/0x10
        [<ffffffff81025454>] default_idle+0x24/0x230
        [<ffffffff81025f9f>] arch_cpu_idle+0xf/0x20
        [<ffffffff810f5adc>] cpu_startup_entry+0x37c/0x7b0
        [<ffffffff8104df1b>] start_secondary+0x25b/0x300
      
      This patch fixes it by punting the notification delivery through a
      work item.  This ends up adding an extra pointer to kernfs_elem_attr
      enlarging kernfs_node by a pointer, which is not ideal but not a very
      big deal either.  If this turns out to be an actual issue, we can move
      kernfs_elem_attr->size to kernfs_node->iattr later.
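The "punt to a work item" approach can be sketched as a pending list that the notifier only links nodes into, plus a worker that drains it later from a context that may sleep. This is a single-threaded illustration under assumptions: struct node, notify() and notify_worker() are hypothetical, and the spinlock and work scheduling a real implementation needs are left out.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	const char *name;
	struct node *notify_next;	/* NULL when not queued */
	unsigned int notified;		/* times delivery ran for this node */
};

/* Sentinel marking the list end, so a queued tail is still non-NULL. */
static struct node list_end;
#define LIST_END (&list_end)
static struct node *pending = LIST_END;

/*
 * Safe to call from a context that must not sleep: it only links the node
 * into the pending list. A real implementation would guard the list with
 * a spinlock and schedule the worker here; both are omitted in this sketch.
 */
static void notify(struct node *n)
{
	assert(n);
	if (n->notify_next)		/* already queued */
		return;
	n->notify_next = pending;
	pending = n;
}

/* Runs later from a sleepable context and does the actual delivery. */
static unsigned int notify_worker(void)
{
	unsigned int delivered = 0;

	while (pending != LIST_END) {
		struct node *n = pending;

		pending = n->notify_next;
		n->notify_next = NULL;
		n->notified++;		/* stand-in for waking pollers */
		delivered++;
	}
	return delivered;
}
```

The extra `notify_next` pointer per node is the cost the commit message mentions: it is what lets a node be queued without allocating memory in atomic context.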
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecca47ce
  21. 30 Jun, 2014 1 commit
  22. 03 Jun, 2014 1 commit
  23. 28 May, 2014 1 commit
  24. 13 May, 2014 1 commit
    • T
      kernfs, sysfs, cgroup: restrict extra perm check on open to sysfs · 555724a8
      Tejun Heo committed
      The kernfs open method - kernfs_fop_open() - inherited extra
      permission checks from sysfs.  While the vfs layer allows ignoring the
      read/write permissions checks if the issuer has CAP_DAC_OVERRIDE,
      sysfs explicitly denied open regardless of the cap if the file doesn't
      have any of the UGO perms of the requested access or doesn't implement
      the requested operation.  It can be debated whether this was a good
      idea or not but the behavior is too subtle and dangerous to change at
      this point.
      
      After cgroup got converted to kernfs, this extra perm check also got
      applied to cgroup, breaking libcgroup, which opens write-only files with
      O_RDWR as root.  This patch gates the extra open permission check with
      a new flag KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK and enables it for sysfs.
      For sysfs, nothing changes.  For cgroup, root now can perform any
      operation regardless of the permissions as it was before kernfs
      conversion.  Note that kernfs still fails unimplemented operations
      with -EINVAL.
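The gating described above can be sketched in plain C as follows. This is a simplified illustration, not the kernel function: `sketch_file`, `sketch_open_check()` and the `SKETCH_` flag are hypothetical names, and returning -EACCES from the extra check is an assumption of this sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK. */
#define SKETCH_ROOT_EXTRA_OPEN_PERM_CHECK (1u << 0)

struct sketch_file {
	unsigned int root_flags; /* flags of the hierarchy the file lives in */
	unsigned int mode;       /* permission bits, e.g. 0200 */
	bool has_read;           /* a read operation is implemented */
	bool has_write;          /* a write operation is implemented */
};

/* Returns 0 if the open may proceed, a negative errno otherwise. */
int sketch_open_check(const struct sketch_file *f, bool want_read,
		      bool want_write)
{
	/* Only hierarchies that opted in (sysfs) get the stricter check. */
	if (f->root_flags & SKETCH_ROOT_EXTRA_OPEN_PERM_CHECK) {
		if (want_write && (!(f->mode & 0222) || !f->has_write))
			return -EACCES; /* assumed error code */
		if (want_read && (!(f->mode & 0444) || !f->has_read))
			return -EACCES; /* assumed error code */
	}
	return 0;
}
```

With the flag set, opening a 0200 file for both reading and writing fails. Without it, the same open succeeds and an unimplemented read would only be rejected later, when the operation is attempted.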
      
      While at it, add comments explaining KERNFS_ROOT flags.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Andrey Wagin <avagin@gmail.com>
      Tested-by: Andrey Wagin <avagin@gmail.com>
      Cc: Li Zefan <lizefan@huawei.com>
      References: http://lkml.kernel.org/g/CANaxB-xUm3rJ-Cbp72q-rQJO5mZe1qK6qXsQM=vh0U8upJ44+A@mail.gmail.com
      Fixes: 2bd59d48 ("cgroup: convert to kernfs")
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      555724a8
  25. 26 Apr, 2014 1 commit
  26. 09 Mar, 2014 1 commit
    • T
      kernfs: cache atomic_write_len in kernfs_open_file · b7ce40cf
      Tejun Heo committed
      While implementing atomic_write_len, 4d3773c4 ("kernfs: implement
      kernfs_ops->atomic_write_len") moved data copy from userland inside
      kernfs_get_active() and kernfs_open_file->mutex so that
      kernfs_ops->atomic_write_len can be accessed before copying buffer
      from userland; unfortunately, this could lead to locking order
      inversion involving mmap_sem if copy_from_user() takes a page fault.
      
        ======================================================
        [ INFO: possible circular locking dependency detected ]
        3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 Tainted: G        W
        -------------------------------------------------------
        trinity-c236/10658 is trying to acquire lock:
         (&of->mutex#2){+.+.+.}, at: [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120
      
        but task is already holding lock:
         (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
       -> #1 (&mm->mmap_sem){++++++}:
      	 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
      	 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
      	 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
      	 [<mm/memory.c:4188>] might_fault+0x7e/0xb0
      	 [<arch/x86/include/asm/uaccess.h:713 fs/kernfs/file.c:291>] kernfs_fop_write+0xd8/0x190
      	 [<fs/read_write.c:473>] vfs_write+0xe3/0x1d0
      	 [<fs/read_write.c:523 fs/read_write.c:515>] SyS_write+0x5d/0xa0
      	 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2
      
       -> #0 (&of->mutex#2){+.+.+.}:
      	 [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560
      	 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
      	 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
      	 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
      	 [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510
      	 [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120
      	 [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0
      	 [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430
      	 [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0
      	 [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210
      	 [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20
      	 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&mm->mmap_sem);
      				 lock(&of->mutex#2);
      				 lock(&mm->mmap_sem);
          lock(&of->mutex#2);
      
         *** DEADLOCK ***
      
        1 lock held by trinity-c236/10658:
         #0:  (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0
      
        stack backtrace:
        CPU: 2 PID: 10658 Comm: trinity-c236 Tainted: G        W 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26
         0000000000000000 ffff88011911fa48 ffffffff8438e945 0000000000000000
         0000000000000000 ffff88011911fa98 ffffffff811a0109 ffff88011911fab8
         ffff88011911fab8 ffff88011911fa98 ffff880119128cc0 ffff880119128cf8
        Call Trace:
         [<lib/dump_stack.c:52>] dump_stack+0x52/0x7f
         [<kernel/locking/lockdep.c:1213>] print_circular_bug+0x129/0x160
         [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560
         [<include/linux/spinlock.h:343 mm/slub.c:1933>] ? deactivate_slab+0x511/0x550
         [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
         [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
         [<mm/mmap.c:1552>] ? mmap_region+0x24a/0x5c0
         [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
         [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
         [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510
         [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
         [<kernel/sched/core.c:2477>] ? get_parent_ip+0x11/0x50
         [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
         [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120
         [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0
         [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430
         [<mm/util.c:397>] ? vm_mmap_pgoff+0x6e/0xe0
         [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0
         [<kernel/rcu/update.c:97>] ? __rcu_read_unlock+0x44/0xb0
         [<fs/file.c:641>] ? dup_fd+0x3c0/0x3c0
         [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210
         [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20
         [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2
      
      Fix it by caching atomic_write_len in kernfs_open_file during open so
      that it can be determined without accessing kernfs_ops in
      kernfs_fop_write().  This restores the structure of kernfs_fop_write()
      before 4d3773c4 with updated @len determination logic.
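The ordering that the fix restores can be sketched in userspace C as below. All names are hypothetical, `memcpy()` stands in for the copy from userland, and the `lock_held` flag stands in for the per-open-file mutex. The point is only that the length limit is read from a value cached at open time, so the copy can finish before any lock is taken.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096 /* assumed page size for the sketch */

struct sketch_open_file;

struct sketch_ops {
	size_t atomic_write_len;
	long (*write)(struct sketch_open_file *of, const char *buf, size_t len);
};

struct sketch_open_file {
	const struct sketch_ops *ops;
	size_t atomic_write_len; /* cached copy, filled in at open time */
	int lock_held;           /* stands in for the per-file mutex */
};

void sketch_open(struct sketch_open_file *of, const struct sketch_ops *ops)
{
	of->ops = ops;
	of->atomic_write_len = ops->atomic_write_len; /* cache once */
	of->lock_held = 0;
}

long sketch_write(struct sketch_open_file *of, const char *ubuf, size_t count)
{
	size_t len = count;
	char *buf;
	long ret;

	/* Length is decided from the cached value; ops is not touched yet. */
	if (of->atomic_write_len) {
		if (len > of->atomic_write_len)
			return -E2BIG;
	} else if (len > SKETCH_PAGE_SIZE) {
		len = SKETCH_PAGE_SIZE;
	}

	buf = malloc(len + 1);
	if (!buf)
		return -ENOMEM;
	memcpy(buf, ubuf, len); /* the "user copy" happens with no lock held */
	buf[len] = '\0';

	of->lock_held = 1;      /* only now take the lock and call the method */
	ret = of->ops->write(of, buf, len);
	of->lock_held = 0;

	free(buf);
	return ret;
}

/* Hypothetical write method: reports how many bytes it consumed. */
long sketch_store(struct sketch_open_file *of, const char *buf, size_t len)
{
	(void)buf;
	return of->lock_held ? (long)len : -EINVAL;
}
```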
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      References: http://lkml.kernel.org/g/53113485.2090407@oracle.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b7ce40cf
  27. 25 Feb, 2014 1 commit
    • L
      sysfs: fix namespace refcnt leak · fed95bab
      Li Zefan committed
      As mount() and kill_sb() are not a one-to-one match, we shouldn't get
      the ns refcnt unconditionally in sysfs_mount(); instead, we should
      get the refcnt only when kernfs_mount() allocated a new superblock.
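A minimal userspace sketch of the refcount rule, assuming one superblock per namespace. The types and functions (`sketch_ns`, `sketch_get_sb()`, `sketch_mount()`, `sketch_kill_sb()`) are hypothetical; only the idea of a "new superblock" indicator comes from the description above.

```c
#include <stdbool.h>

/* Hypothetical namespace with a reference count and at most one superblock. */
struct sketch_ns {
	int refcnt;
	bool has_sb;
};

/* Finds or creates the superblock; returns true only if it was created. */
static bool sketch_get_sb(struct sketch_ns *ns)
{
	if (ns->has_sb)
		return false; /* an existing superblock is reused */
	ns->has_sb = true;
	return true;
}

/* May run many times for the same superblock. */
void sketch_mount(struct sketch_ns *ns)
{
	bool new_sb;

	ns->refcnt++;               /* take a reference up front */
	new_sb = sketch_get_sb(ns);
	if (!new_sb)
		ns->refcnt--;       /* reused sb already holds its reference */
}

/* Runs once per superblock and drops the single reference it owns. */
void sketch_kill_sb(struct sketch_ns *ns)
{
	ns->has_sb = false;
	ns->refcnt--;
}
```

Mounting the same namespace several times leaves the count at one, so the single `sketch_kill_sb()` call brings it back to zero instead of leaking references.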
      
      v2:
      - Changed the name of the new argument, suggested by Tejun.
      - Made the argument optional, suggested by Tejun.
      
      v3:
      - Made the new argument the second-to-last arg, suggested by Tejun.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Tejun Heo <tj@kernel.org>
       ---
       fs/kernfs/mount.c      | 8 +++++++-
       fs/sysfs/mount.c       | 5 +++--
       include/linux/kernfs.h | 9 +++++----
       3 files changed, 15 insertions(+), 7 deletions(-)
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fed95bab
  28. 08 Feb, 2014 5 commits
    • T
      kernfs: add CONFIG_KERNFS · ba341d55
      Tejun Heo committed
      As sysfs was kernfs's only user, kernfs has been piggybacking on
      CONFIG_SYSFS; however, kernfs is scheduled to grow a new user very
      soon.  Introduce a separate config option CONFIG_KERNFS which is to be
      selected by kernfs users.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ba341d55
    • T
      kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends · 3eef34ad
      Tejun Heo committed
      kernfs_node->parent and ->name are currently marked as "published"
      indicating that kernfs users may access them directly; however, those
      fields may get updated by kernfs_rename[_ns]() and unrestricted access
      may lead to erroneous values or oops.
      
      Protect ->parent and ->name updates with an irq-safe spinlock
      kernfs_rename_lock and implement the following accessors for these
      fields.
      
      * kernfs_name()		- format the node's name into the specified buffer
      * kernfs_path()		- format the node's path into the specified buffer
      * pr_cont_kernfs_name()	- pr_cont a node's name (doesn't need buffer)
      * pr_cont_kernfs_path()	- pr_cont a node's path (doesn't need buffer)
      * kernfs_get_parent()	- pin and return a node's parent
      
      All can be called under any context.  The recursive sysfs_pathname()
      in fs/sysfs/dir.c is replaced with kernfs_path() and
      sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
      dereferencing parent directly.
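To make the path accessor concrete, here is a hedged userspace sketch that formats a node's path by walking parent pointers and filling the buffer from its end. `sketch_node` and `sketch_path()` are hypothetical names, and the locking is only indicated by a comment, since the sketch is single-threaded.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical node: the root has an empty name and a NULL parent. */
struct sketch_node {
	const char *name;
	const struct sketch_node *parent;
};

/*
 * Formats the path of @kn into @buf, building it backwards from the end.
 * Returns a pointer into @buf where the path starts, or NULL if the
 * buffer is too small.
 */
char *sketch_path(const struct sketch_node *kn, char *buf, size_t buflen)
{
	char *p = buf + buflen;

	if (buflen == 0)
		return NULL;
	*--p = '\0';

	/* A kernel version would hold the rename lock across this walk. */
	do {
		size_t len = strlen(kn->name);

		if ((size_t)(p - buf) < len + 1)
			return NULL; /* no room for "/" plus the name */
		p -= len;
		memcpy(p, kn->name, len);
		*--p = '/';
		kn = kn->parent;
	} while (kn && kn->parent);

	return p;
}
```

Because the whole walk happens inside one function, a caller never dereferences `parent` or `name` on its own, which is the property the accessors are meant to provide.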
      
      v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
          static inline making it cause a lot of build warnings.  Add it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3eef34ad
    • T
      kernfs: implement kernfs_node_from_dentry(), kernfs_root_from_sb() and kernfs_rename() · 0c23b225
      Tejun Heo committed
      Implement helpers to determine node from dentry and root from
      super_block.  Also add a kernfs_rename_ns() wrapper which assumes NULL
      namespace.  These generally make sense and will be used by cgroup.
      
      v2: Some dummy implementations for !CONFIG_SYSFS were missing.  Fixed.
          Reported by kbuild test robot.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0c23b225
    • T
      kernfs: add kernfs_open_file->priv · 2536390d
      Tejun Heo committed
      Add a private data field to be used by kernfs file operations.  This
      generally makes sense and will be used by cgroup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2536390d
    • T
      kernfs: implement kernfs_ops->atomic_write_len · 4d3773c4
      Tejun Heo committed
      A write to a kernfs_node is buffered through a kernel buffer.  Writes
      <= PAGE_SIZE are performed atomically, while larger ones are executed
      in PAGE_SIZE chunks.  While this is enough for sysfs, cgroup which is
      scheduled to be converted to use kernfs needs a bit more control over
      it.
      
      This patch adds kernfs_ops->atomic_write_len.  If not set (zero), the
      behavior stays the same.  If set, writes up to the size are executed
      atomically and larger writes are rejected with -E2BIG.
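The two behaviors can be summarized with a small C sketch. The helper `sketch_write_len()` and the `SKETCH_PAGE_SIZE` value are assumptions made for illustration; only the rules themselves (chunking at PAGE_SIZE when unset, -E2BIG above the limit when set) come from the description above.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096 /* assumed page size for the sketch */

/*
 * Returns how many bytes a single write call would handle, or -E2BIG.
 * @count is the size requested by the caller and @atomic_write_len is
 * the per-file limit, where zero means "not set".
 */
long sketch_write_len(size_t count, size_t atomic_write_len)
{
	if (atomic_write_len) {
		/* Limit set: all or nothing, never chunked. */
		if (count > atomic_write_len)
			return -E2BIG;
		return (long)count;
	}
	/* Limit not set: at most one page per call, as before. */
	return (long)(count < SKETCH_PAGE_SIZE ? count : SKETCH_PAGE_SIZE);
}
```

For example, a 10000-byte write is cut to 4096 bytes when no limit is set, is accepted whole with a limit of 16384, and is rejected with a limit of 8192.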
      
      A different implementation strategy would be to allow configuring the
      chunking size while making the original write size available to the
      write method; however, such a strategy, while being more complicated,
      doesn't really buy anything.  If the write implementation has to
      handle chunking, the specific chunk size shouldn't matter all that
      much.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d3773c4