  1. 14 Feb, 2018 1 commit
    • inotify: Extend ioctl to allow to request id of new watch descriptor · e1603b6e
      Authored by Kirill Tkhai
      A watch descriptor is the id of a watch created by inotify_add_watch().
      It is allocated in inotify_add_to_idr() and takes numbers starting
      from 1. Every new inotify watch obtains the next available number
      (usually old + 1), as served by idr_alloc_cyclic().
      
      The CRIU (Checkpoint/Restore In Userspace) project supports inotify
      files and restores watch descriptors with the same numbers they had
      before the dump. Since there was no kernel support, we had to use
      a loop to add a watch with a specific descriptor id:
      
      	while (1) {
      		int wd;
      
      		wd = inotify_add_watch(inotify_fd, path, mask);
      		if (wd < 0) {
      			break;
      		} else if (wd == desired_wd_id) {
      			ret = 0;
      			break;
      		}
      
      		inotify_rm_watch(inotify_fd, wd);
      	}
      
      (You may find the actual code at the below link:
       https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)
      
      The loop is suboptimal and very expensive, but since there was no better
      kernel support, it was the only way to restore the descriptor. Luckily, we
      had mostly met descriptors with small ids, and this approach had worked somehow.
      
      Recently, however, containers whose inotify instances have large watch
      descriptors have begun to appear, and this approach stopped working at all.
      When the descriptor id is around 0x34d71d6, the restoring process spins in
      a busy loop for a long time, the restore hangs, and the delay in migration
      from node to node is easily noticeable.
      
      This patch aims to solve this problem. It introduces a new ioctl,
      INOTIFY_IOC_SETNEXTWD, which allows userspace to request the number of the
      next created watch descriptor. It simply calls the idr_set_cursor() primitive
      to populate idr::idr_next, so that the next idr_alloc_cyclic() allocation
      will return this id if it is not occupied. This is the way some other
      resources are restored from userspace. For example,
      /proc/sys/kernel/ns_last_pid works the same way for task pids.
      
      The new code is under CONFIG_CHECKPOINT_RESTORE #define, so small system
      may exclude it.
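
A minimal userspace sketch of how a restorer might use the new ioctl, assuming a kernel built with CONFIG_CHECKPOINT_RESTORE. `restore_watch()` is a hypothetical helper and not CRIU's actual code; the fallback `#define` is only there for userspace headers that predate the ioctl.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/inotify.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Fallback for older userspace headers (assumption: same encoding as
 * the kernel's uapi definition, which uses a 32-bit signed argument). */
#ifndef INOTIFY_IOC_SETNEXTWD
#define INOTIFY_IOC_SETNEXTWD _IOW('I', 0, int32_t)
#endif

/*
 * Hypothetical helper: ask the kernel to hand out `wanted_wd` next,
 * then add the watch. Returns the watch descriptor on success and -1
 * on any failure (ioctl unsupported, id occupied, add_watch error).
 */
static int restore_watch(int inotify_fd, const char *path,
                         uint32_t mask, int wanted_wd)
{
    int wd;

    assert(path != NULL);

    /* The id is passed by value, not through a pointer. */
    if (ioctl(inotify_fd, INOTIFY_IOC_SETNEXTWD, (unsigned long)wanted_wd) < 0)
        return -1;

    wd = inotify_add_watch(inotify_fd, path, mask);
    if (wd < 0)
        return -1;
    if (wd != wanted_wd) {
        /* The requested id was already occupied; undo and report. */
        inotify_rm_watch(inotify_fd, wd);
        errno = EBUSY;
        return -1;
    }
    return wd;
}
```

Compared with the retry loop, this needs one ioctl and one inotify_add_watch() call per descriptor, regardless of how large the id is.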
      
      v2: Use INT_MAX instead of custom definition of max id,
      as IDR subsystem guarantees id is between 0 and INT_MAX.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Matthew Wilcox <willy@infradead.org>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
  2. 12 Feb, 2018 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Authored by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 28 Nov, 2017 1 commit
  4. 01 Nov, 2017 1 commit
  5. 10 Apr, 2017 8 commits
  6. 03 Apr, 2017 1 commit
  7. 02 Mar, 2017 1 commit
  8. 24 Jan, 2017 1 commit
    • inotify: Convert to using per-namespace limits · 1cce1eea
      Authored by Nikolay Borisov
      This patchset converts inotify to using the newly introduced
      per-userns sysctl infrastructure.
      
      Currently the inotify instances/watches are being accounted in the
      user_struct structure. This means that in setups where multiple
      users in unprivileged containers map to the same underlying
      real user (i.e. pointing to the same user_struct) the inotify limits
      are going to be shared as well, allowing one user (or application) to
      exhaust all the others' limits.
      
      Fix this by switching the inotify sysctls to using the
      per-namespace/per-user limits. This will allow the server admin to
      set sensible global limits, which can further be tuned inside every
      individual user namespace. Additionally, in order to preserve the
      sysctl ABI make the existing inotify instances/watches sysctls
      modify the values of the initial user namespace.
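
As a small illustration of the sysctl ABI mentioned above, the following sketch reads one of the long-standing inotify limits from procfs. `read_inotify_limit()` is a hypothetical helper; it assumes the limits are exposed under /proc/sys/fs/inotify/ (for example max_user_watches and max_user_instances) and returns -1 when a file is missing or unreadable.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read one inotify limit from procfs.
 * Returns the value, or -1 if the file is missing or unparsable. */
static long read_inotify_limit(const char *name)
{
    char path[128];
    long value = -1;
    FILE *f;

    assert(name != NULL);
    snprintf(path, sizeof(path), "/proc/sys/fs/inotify/%s", name);

    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Calling `read_inotify_limit("max_user_watches")` from different user namespaces is one way to observe which limit applies where.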
      Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  9. 08 Oct, 2016 1 commit
  10. 06 Nov, 2015 1 commit
    • inotify: actually check for invalid bits in sys_inotify_add_watch() · d30e2c05
      Authored by Dave Hansen
      The comment here says that it is checking for invalid bits.  But, the mask
      is *actually* checking to ensure that _any_ valid bit is set, which is
      quite different.
      
      Without this check, an unexpected bit could get set on an inotify object.
      Since these bits are also interpreted by the fsnotify/dnotify code, there
      is the potential for an object to be mishandled inside the kernel.  For
      instance, can we be sure that setting the dnotify flag FS_DN_RENAME on an
      inotify watch is harmless?
      
      Add the actual check which was intended.  Retain the existing check, which
      ensures that some inotify bits are being added to the watch.  Plus, this is
      existing behavior which would be nice to preserve.
      
      I did a quick sniff test that inotify functions and that my
      'inotify-tools' package passes 'make check'.
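
The difference between the two checks can be sketched in userspace C. `SKETCH_VALID_BITS` and `check_watch_mask()` are hypothetical names; the set of valid bits is assembled here from the userspace constants in <sys/inotify.h> and is an assumption, not a copy of the kernel's own definition.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/inotify.h>

/* Assumed set of bits a caller may legitimately pass. */
#define SKETCH_VALID_BITS ((uint32_t)(IN_ALL_EVENTS | IN_UNMOUNT | \
        IN_Q_OVERFLOW | IN_IGNORED | IN_ONLYDIR | IN_DONT_FOLLOW | \
        IN_EXCL_UNLINK | IN_MASK_ADD | IN_ISDIR | IN_ONESHOT))

static int check_watch_mask(uint32_t mask)
{
    /* Check for invalid bits: reject anything outside the valid set. */
    if (mask & ~SKETCH_VALID_BITS)
        return -EINVAL;

    /* Pre-existing check: at least one valid bit must be set. */
    if (!(mask & SKETCH_VALID_BITS))
        return -EINVAL;

    return 0;
}
```

With only the second check, a mask such as `IN_MODIFY | 0x00100000` (the extra bit is an arbitrary unused one chosen for illustration) would be accepted; with both checks it is rejected.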
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 17 Jun, 2015 1 commit
    • fs/notify: don't use module_init for non-modular inotify_user code · c013d5a4
      Authored by Paul Gortmaker
      The INOTIFY_USER option is bool, and hence this code is either
      present or absent.  It will never be modular, so using
      module_init as an alias for __initcall is rather misleading.
      
      Fix this up now, so that we can relocate module_init from
      init.h into module.h in the future.  If we don't do this, we'd
      have to add module.h to obviously non-modular code, and that
      would be a worse thing.
      
      Note that direct use of __initcall is discouraged, vs. one
      of the priority categorized subgroups.  As __initcall gets
      mapped onto device_initcall, our use of fs_initcall (which
      makes sense for fs code) will thus change this registration
      from level 6-device to level 5-fs (i.e. slightly earlier).
      However no observable impact of that small difference has
      been observed during testing, or is expected.
      
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  12. 14 Dec, 2014 1 commit
  13. 28 Oct, 2014 1 commit
    • sched, inotify: Deal with nested sleeps · e23738a7
      Authored by Peter Zijlstra
      inotify_read is a wait loop with sleeps in. Wait loops rely on
      task_struct::state and sleeps do too, since that's the only means of
      actually sleeping. Therefore the nested sleeps destroy the wait loop
      state and the wait loop breaks the sleep functions that assume
      TASK_RUNNING (mutex_lock).
      
      Fix this by using the new woken_wake_function and wait_woken() stuff,
      which registers wakeups in wait and thereby allows shrinking the
      task_struct::state changes to the actual sleep part.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: tglx@linutronix.de
      Cc: ilya.dryomov@inktank.com
      Cc: umgwanakikbuti@gmail.com
      Cc: Robert Love <rlove@rlove.org>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Link: http://lkml.kernel.org/r/20140924082242.254858080@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 07 Aug, 2014 1 commit
  15. 07 Jun, 2014 1 commit
  16. 25 Feb, 2014 1 commit
    • fsnotify: Allocate overflow events with proper type · ff57cd58
      Authored by Jan Kara
      Commit 7053aee2 "fsnotify: do not share events between notification
      groups" used an overflow event statically allocated in a group with the
      size of the generic notification event. This causes problems because
      some code looks at type-specific parts of the event structure, gets
      confused by the random data it sees there, and crashes.
      
      Fix the problem by allocating overflow event with type corresponding to
      the group type so code cannot get confused.
      Signed-off-by: Jan Kara <jack@suse.cz>
  17. 18 Feb, 2014 1 commit
    • inotify: Fix reporting of cookies for inotify events · 45a22f4c
      Authored by Jan Kara
      My rework of handling of notification events (namely commit 7053aee2
      "fsnotify: do not share events between notification groups") broke
      sending of cookies with inotify events. We didn't propagate the value
      passed to fsnotify() properly and passed 4 uninitialized bytes to
      userspace instead (so it is also an information leak). Sadly I didn't
      notice this during my testing because inotify cookies aren't used very
      much and LTP inotify tests ignore them.
      
      Fix the problem by passing the cookie value properly.
      
      Fixes: 7053aee2
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
  18. 22 Jan, 2014 2 commits
    • fsnotify: do not share events between notification groups · 7053aee2
      Authored by Jan Kara
      Currently fsnotify framework creates one event structure for each
      notification event and links this event into all interested notification
      groups.  This is done so that we save memory when several notification
      groups are interested in the event.  However the need for event
      structure shared between inotify & fanotify bloats the event structure
      so the result is often higher memory consumption.
      
      Another problem is that fsnotify framework keeps path references with
      outstanding events so that fanotify can return open file descriptors
      with its events.  This has the undesirable effect that filesystem cannot
      be unmounted while there are outstanding events - a regression for
      inotify compared to a situation before it was converted to fsnotify
      framework.  For fanotify this problem is hard to avoid and users of
      fanotify should kind of expect this behavior when they ask for file
      descriptors from notified files.
      
      This patch changes fsnotify and its users to create separate event
      structure for each group.  This allows for much simpler code (~400 lines
      removed by this patch) and also smaller event structures.  For example
      on 64-bit system original struct fsnotify_event consumes 120 bytes, plus
      additional space for file name, additional 24 bytes for second and each
      subsequent group linking the event, and additional 32 bytes for each
      inotify group for private data.  After the conversion inotify event
      consumes 48 bytes plus space for file name which is considerably less
      memory unless file names are long and there are several groups
      interested in the events (both of which are uncommon).  Fanotify event
      fits in 56 bytes after the conversion (fanotify doesn't care about file
      names so its events don't have to have it allocated).  A win unless
      there are four or more fanotify groups interested in the event.
      
      The conversion also solves the problem with unmount when only inotify is
      used as we don't have to grab path references for inotify events.
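
The byte counts quoted above can be turned into a small worked calculation. This is only a sketch of one reading of the message: the helper names are hypothetical, file-name storage is left out, and the old layout is modeled as one shared event plus per-group overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Byte counts quoted in the commit message (64-bit system). */
enum {
    OLD_EVENT_BYTES     = 120, /* shared struct fsnotify_event */
    OLD_EXTRA_GROUP     = 24,  /* second and each subsequent group */
    OLD_INOTIFY_PRIVATE = 32,  /* private data per inotify group */
    NEW_INOTIFY_EVENT   = 48,  /* per inotify group after the change */
    NEW_FANOTIFY_EVENT  = 56   /* per fanotify group after the change */
};

/* Cost of one event shared by `groups` groups before the change. */
static size_t old_cost(size_t groups, size_t inotify_groups)
{
    assert(groups >= 1 && inotify_groups <= groups);
    return OLD_EVENT_BYTES + OLD_EXTRA_GROUP * (groups - 1)
         + OLD_INOTIFY_PRIVATE * inotify_groups;
}

/* Cost after the change: one separate event per interested group. */
static size_t new_inotify_cost(size_t groups)
{
    return NEW_INOTIFY_EVENT * groups;
}

static size_t new_fanotify_cost(size_t groups)
{
    return NEW_FANOTIFY_EVENT * groups;
}
```

Under this model a single inotify group goes from 152 bytes to 48. For fanotify, three groups break even at 168 bytes and four groups cost 224 bytes instead of 192, which matches the "four or more" remark.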
      
      [hughd@google.com: fanotify: fix corruption preventing startup]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • inotify: provide function for name length rounding · e9fe6904
      Authored by Jan Kara
      Rounding of name length when passing it to userspace was done in several
      places.  Provide a function to do it and use it in all places.
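
A userspace sketch of what such a rounding helper might look like. `rounded_name_len()` is a hypothetical name, and two details are assumptions: that the length is rounded up to a multiple of sizeof(struct inotify_event) after adding one byte for the terminating NUL, and that an event without a name reports a length of 0.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/inotify.h>

/* Hypothetical helper: padded length of the name field for userspace. */
static size_t rounded_name_len(size_t name_len)
{
    const size_t unit = sizeof(struct inotify_event);

    if (name_len == 0)
        return 0;               /* no name, so no padding either */

    /* +1 for the terminating NUL, then round up to a multiple of unit. */
    return (name_len + 1 + unit - 1) / unit * unit;
}
```

Keeping the rule in a single function means the reader and the size calculation cannot drift apart.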
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 10 Jul, 2013 1 commit
    • inotify: fix race when adding a new watch · e1e5a9f8
      Authored by Lino Sanfilippo
      In inotify_new_watch() the number of watches for a group is compared
      against the max number of allowed watches and increased afterwards.  The
      check and incrementation are not done atomically, so it is possible for
      multiple concurrent threads to pass the check and increment the number
      of marks above the allowed max.
      
      This patch uses an inotify group's mark_lock to ensure that both check
      and incrementation are done atomically.  Furthermore we don't have to worry
      about the race that allows a concurrent thread to add a watch just after
      inotify_update_existing_watch() returned with -ENOENT anymore, since
      this is also synchronized by the groups mark mutex now.
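
The pattern of the fix, checking and incrementing under one lock, can be sketched in userspace with a pthread mutex. `struct watch_group` and `group_add_watch()` are hypothetical stand-ins for the kernel structures, and returning -ENOSPC at the limit is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

struct watch_group {            /* hypothetical stand-in for an inotify group */
    pthread_mutex_t lock;
    unsigned int num_marks;     /* watches currently accounted */
    unsigned int max_marks;     /* allowed maximum */
};

static int group_add_watch(struct watch_group *g)
{
    int ret = 0;

    assert(g != NULL);

    pthread_mutex_lock(&g->lock);
    if (g->num_marks >= g->max_marks)
        ret = -ENOSPC;          /* limit reached, nothing is counted */
    else
        g->num_marks++;         /* done under the same lock as the check */
    pthread_mutex_unlock(&g->lock);

    return ret;
}
```

Because no other thread can run between the comparison and the increment, the counter can no longer exceed the maximum.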
      Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 01 May, 2013 1 commit
  21. 30 Apr, 2013 2 commits
  22. 28 Feb, 2013 1 commit
    • inotify: convert to idr_alloc() · 4542da63
      Authored by Tejun Heo
      Convert to the much saner new idr interface.
      
      Note that the adhoc cyclic id allocation is buggy.  If wraparound
      happens, the previous code with idr_get_new_above() may segfault and
      the converted code will trigger WARN and return -EINVAL.  Even if it's
      fixed to wrap to zero, the code will be prone to unnecessary -ENOSPC
      failures after the first wraparound.  We probably need to implement
      proper cyclic support in idr.
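
To make "proper cyclic support" concrete, here is a toy userspace allocator that searches from a cursor and wraps around to the lowest id instead of failing. It is only an illustration of the idea, not the idr implementation; the names and the tiny id space are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define ID_SPACE 8              /* ids 1..7; tiny so wraparound is visible */

struct cyclic_ids {
    bool used[ID_SPACE];        /* index 0 is never handed out */
    int next;                   /* cursor: where the next search starts */
};

/* Returns a free id >= 1, searching from the cursor and wrapping once.
 * Returns -1 only if every id is taken. */
static int cyclic_alloc(struct cyclic_ids *c)
{
    for (int i = 0; i < ID_SPACE - 1; i++) {
        int id = 1 + (c->next - 1 + i) % (ID_SPACE - 1);

        if (!c->used[id]) {
            c->used[id] = true;
            c->next = (id + 1 > ID_SPACE - 1) ? 1 : id + 1;
            return id;
        }
    }
    return -1;
}

static void cyclic_free(struct cyclic_ids *c, int id)
{
    assert(id >= 1 && id < ID_SPACE);
    c->used[id] = false;
}
```

Starting with the cursor at 1, freeing id 1 after allocating 1 and 2 still yields 3 next; id 1 is reused only after the cursor wraps, and failure is reported only when the whole space is full.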
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 22 Feb, 2013 1 commit
    • inotify: remove broken mask checks causing unmount to be EINVAL · 676a0675
      Authored by Jim Somerville
      Running the command:
      
      	inotifywait -e unmount /mnt/disk
      
      immediately aborts with a -EINVAL return code.  This is however a valid
      parameter.  This abort occurs only if unmount is the sole event
      parameter.  If other event parameters are supplied, then the unmount
      event wait will work.
      
      The problem was introduced by commit 44b350fc ("inotify: Fix mask
      checks").  In that commit, it states:
      
      	The mask checks in inotify_update_existing_watch() and
      	inotify_new_watch() are useless because inotify_arg_to_mask()
      	sets FS_IN_IGNORED and FS_EVENT_ON_CHILD bits anyway.
      
      But instead of removing the useless checks, it did this:
      
      	        mask = inotify_arg_to_mask(arg);
      	-       if (unlikely(!mask))
      	+       if (unlikely(!(mask & IN_ALL_EVENTS)))
      	                return -EINVAL;
      
      The problem is that IN_ALL_EVENTS doesn't include IN_UNMOUNT, and other
      parts of the code keep IN_UNMOUNT separate from IN_ALL_EVENTS.  So the
      check should be:
      
      	if (unlikely(!(mask & (IN_ALL_EVENTS | IN_UNMOUNT))))
      
      But inotify_arg_to_mask(arg) always sets the IN_UNMOUNT bit in the mask
      anyway, so the check is always going to pass and thus should simply be
      removed.  Also note that inotify_arg_to_mask completely controls what
      mask bits get set from arg, there's no way for invalid bits to get
      enabled there.
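
The core observation, that IN_UNMOUNT lies outside IN_ALL_EVENTS, can be checked from userspace with the constants in <sys/inotify.h>. The two helpers below are hypothetical and apply the quoted conditions directly to an event mask, ignoring the extra bits that inotify_arg_to_mask() sets in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/inotify.h>

/* The condition introduced by commit 44b350fc, as quoted above. */
static bool passes_broken_check(uint32_t mask)
{
    return (mask & IN_ALL_EVENTS) != 0;
}

/* The widened condition the message says the check should have been. */
static bool passes_widened_check(uint32_t mask)
{
    return (mask & (IN_ALL_EVENTS | IN_UNMOUNT)) != 0;
}
```

A mask of IN_UNMOUNT alone fails the first condition and passes the second, while IN_UNMOUNT combined with any ordinary event passes both, which matches the behavior reported for inotifywait.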
      
      Let's fix it by simply removing the useless broken checks.
      Signed-off-by: Jim Somerville <Jim.Somerville@windriver.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: <stable@vger.kernel.org>		[2.6.37+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 18 Dec, 2012 1 commit
    • fs, notify: add procfs fdinfo helper · be77196b
      Authored by Cyrill Gorcunov
      This allows us to print out fsnotify details such as watchee inode, device,
      mask and optionally a file handle.
      
      For inotify objects, if the kernel is compiled with exportfs support, the
      output will be
      will be
      
       | pos:	0
       | flags:	02000000
       | inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
       | inotify wd:2 ino:a111 sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:11a1000020542153
       | inotify wd:1 ino:6b149 sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:49b1060023552153
      
      If the kernel is compiled without exportfs support, the file handle
      won't be provided, only the inode and device.
      
       | pos:	0
       | flags:	02000000
       | inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0
       | inotify wd:2 ino:a111 sdev:800013 mask:800afce ignored_mask:0
       | inotify wd:1 ino:6b149 sdev:800013 mask:800afce ignored_mask:0
      
      For fanotify the output is like
      
       | pos:	0
       | flags:	04002
       | fanotify flags:10 event-flags:0
       | fanotify mnt_id:12 mask:3b ignored_mask:0
       | fanotify ino:50205 sdev:800013 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:05020500fb1d47e7
      
      To minimize impact on general fsnotify code the new functionality
      is gathered in fs/notify/fdinfo.c file.
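
A reader of /proc/<pid>/fdinfo/<fd> could pick these fields apart with sscanf(). The sketch below handles the mandatory part of an inotify line; `struct inotify_fdinfo` and `parse_inotify_fdinfo()` are hypothetical names, the leading " | " in the samples above is treated as quoting rather than file content, and reading every numeric field as hexadecimal is an assumption (the sample wd values 1 to 3 read the same either way).

```c
#include <assert.h>
#include <stdio.h>

struct inotify_fdinfo {         /* hypothetical holder for one parsed line */
    unsigned int wd;
    unsigned long ino;
    unsigned int sdev;
    unsigned int mask;
    unsigned int ignored_mask;
};

/* Parses one "inotify wd:..." line; returns 0 on success, -1 otherwise.
 * Any trailing fhandle-* fields are simply ignored. */
static int parse_inotify_fdinfo(const char *line, struct inotify_fdinfo *out)
{
    int n;

    assert(line != NULL && out != NULL);
    n = sscanf(line,
               "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:%x",
               &out->wd, &out->ino, &out->sdev, &out->mask,
               &out->ignored_mask);
    return n == 5 ? 0 : -1;
}
```

Lines that start with anything other than "inotify", such as the fanotify entries, are rejected because the literal prefix does not match.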
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: James Bottomley <jbottomley@parallels.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Helsley <matt.helsley@gmail.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 12 Dec, 2012 6 commits
  26. 27 Sep, 2012 1 commit