1. 07 9月, 2017 1 次提交
  2. 03 8月, 2017 1 次提交
    • J
      ocfs2: don't clear SGID when inheriting ACLs · 19ec8e48
      Jan Kara 提交于
      When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
      set, DIR1 is expected to have SGID bit set (and owning group equal to
      the owning group of 'DIR0').  However when 'DIR0' also has some default
      ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
      'DIR1' to get cleared if user is not member of the owning group.
      
      Fix the problem by moving posix_acl_update_mode() out of ocfs2_set_acl()
      into ocfs2_iop_set_acl().  That way the function will not be called when
      inheriting ACLs which is what we want as it prevents SGID bit clearing
      and the mode has been properly set by posix_acl_create() anyway.  Also
      posix_acl_chmod() that is calling ocfs2_set_acl() takes care of updating
      mode itself.
      
      Fixes: 07393101 ("posix_acl: Clear SGID bit when setting file permissions")
      Link: http://lkml.kernel.org/r/20170801141252.19675-3-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19ec8e48
  3. 07 7月, 2017 4 次提交
  4. 24 6月, 2017 1 次提交
    • E
      ocfs2: fix deadlock caused by recursive locking in xattr · 8818efaa
      Eric Ren 提交于
      Another deadlock path caused by recursive locking is reported.  This
      kind of issue was introduced since commit 743b5f14 ("ocfs2: take
      inode lock in ocfs2_iop_set/get_acl()").  Two deadlock paths have been
      fixed by commit b891fa50 ("ocfs2: fix deadlock issue when taking
      inode lock at vfs entry points").  Yes, we intend to fix this kind of
      case in incremental way, because it's hard to find out all possible
      paths at once.
      
      This one can be reproduced like this.  On node1, cp a large file from
      home directory to ocfs2 mountpoint.  While on node2, run
      setfacl/getfacl.  Both nodes will hang up there.  The backtraces:
      
      On node1:
        __ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
        ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
        ocfs2_write_begin+0x43/0x1a0 [ocfs2]
        generic_perform_write+0xa9/0x180
        __generic_file_write_iter+0x1aa/0x1d0
        ocfs2_file_write_iter+0x4f4/0xb40 [ocfs2]
        __vfs_write+0xc3/0x130
        vfs_write+0xb1/0x1a0
        SyS_write+0x46/0xa0
      
      On node2:
        __ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
        ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
        ocfs2_xattr_set+0x12e/0xe80 [ocfs2]
        ocfs2_set_acl+0x22d/0x260 [ocfs2]
        ocfs2_iop_set_acl+0x65/0xb0 [ocfs2]
        set_posix_acl+0x75/0xb0
        posix_acl_xattr_set+0x49/0xa0
        __vfs_setxattr+0x69/0x80
        __vfs_setxattr_noperm+0x72/0x1a0
        vfs_setxattr+0xa7/0xb0
        setxattr+0x12d/0x190
        path_setxattr+0x9f/0xb0
        SyS_setxattr+0x14/0x20
      
      Fix this one by using ocfs2_inode_{lock|unlock}_tracker, which is
      exported by commit 439a36b8 ("ocfs2/dlmglue: prepare tracking logic
      to avoid recursive cluster lock").
      
      Link: http://lkml.kernel.org/r/20170622014746.5815-1-zren@suse.com
      Fixes: 743b5f14 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
      Signed-off-by: NEric Ren <zren@suse.com>
      Reported-by: NThomas Voegtle <tv@lio96.de>
      Tested-by: NThomas Voegtle <tv@lio96.de>
      Reviewed-by: NJoseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8818efaa
  5. 09 6月, 2017 1 次提交
  6. 05 6月, 2017 1 次提交
  7. 29 5月, 2017 1 次提交
    • K
      ocfs2: Use ERR_CAST() to avoid cross-structure cast · 7585d12f
      Kees Cook 提交于
      When trying to propagate an error result, the error return path attempts
      to retain the error, but does this with an open cast across very different
      types, which the upcoming structure layout randomization plugin flags as
      being potentially dangerous in the face of randomization. This is a false
      positive, but what this code actually wants to do is use ERR_CAST() to
      retain the error value.
      
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      7585d12f
  8. 04 5月, 2017 3 次提交
  9. 06 4月, 2017 1 次提交
  10. 10 3月, 2017 1 次提交
    • D
      net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      David Howells 提交于
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      
      The theory lockdep comes up with is as follows:
      
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      
      	mmap_sem must be taken before sk_lock-AF_RXRPC
      
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
      
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      
      	sk_lock-AF_INET must be taken before mmap_sem
      
      However, lockdep's theory is wrong in this instance because it deals only
      with lock classes and not individual locks.  The AF_INET lock in (2) isn't
      really equivalent to the AF_INET lock in (3) as the former deals with a
      socket entirely internal to the kernel that never sees userspace.  This is
      a limitation in the design of lockdep.
      
      Fix the general case by:
      
       (1) Double up all the locking keys used in sockets so that one set are
           used if the socket is created by userspace and the other set is used
           if the socket is created by the kernel.
      
       (2) Store the kern parameter passed to sk_alloc() in a variable in the
           sock struct (sk_kern_sock).  This informs sock_lock_init(),
           sock_init_data() and sk_clone_lock() as to the lock keys to be used.
      
           Note that the child created by sk_clone_lock() inherits the parent's
           kern setting.
      
       (3) Add a 'kern' parameter to ->accept() that is analogous to the one
           passed in to ->create() that distinguishes whether kernel_accept() or
           sys_accept4() was the caller and can be passed to sk_alloc().
      
           Note that a lot of accept functions merely dequeue an already
           allocated socket.  I haven't touched these as the new socket already
           exists before we get the parameter.
      
           Note also that there are a couple of places where I've made the accepted
           socket unconditionally kernel-based:
      
      	irda_accept()
      	rds_rcp_accept_one()
      	tcp_accept_from_sock()
      
           because they follow a sock_create_kern() and accept off of that.
      
      Whilst creating this, I noticed that lustre and ocfs don't create sockets
      through sock_create_kern() and thus they aren't marked as for-kernel,
      though they appear to be internal.  I wonder if these should do that so
      that they use the new set of lock keys.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cdfbabfb
  11. 03 3月, 2017 1 次提交
    • D
      statx: Add a system call to make enhanced file info available · a528d35e
      David Howells 提交于
      Add a system call to make extended file information available, including
      file creation and some attribute flags where available through the
      underlying filesystem.
      
      The getattr inode operation is altered to take two additional arguments: a
      u32 request_mask and an unsigned int flags that indicate the
      synchronisation mode.  This change is propagated to the vfs_getattr*()
      function.
      
      Functions like vfs_stat() are now inline wrappers around new functions
      vfs_statx() and vfs_statx_fd() to reduce stack usage.
      
      ========
      OVERVIEW
      ========
      
      The idea was initially proposed as a set of xattrs that could be retrieved
      with getxattr(), but the general preference proved to be for a new syscall
      with an extended stat structure.
      
      A number of requests were gathered for features to be included.  The
      following have been included:
      
       (1) Make the fields a consistent size on all arches and make them large.
      
       (2) Spare space, request flags and information flags are provided for
           future expansion.
      
       (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
           __s64).
      
       (4) Creation time: The SMB protocol carries the creation time, which could
           be exported by Samba, which will in turn help CIFS make use of
           FS-Cache as that can be used for coherency data (stx_btime).
      
           This is also specified in NFSv4 as a recommended attribute and could
           be exported by NFSD [Steve French].
      
       (5) Lightweight stat: Ask for just those details of interest, and allow a
           netfs (such as NFS) to approximate anything not of interest, possibly
           without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
           Dilger] (AT_STATX_DONT_SYNC).
      
       (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
           its cached attributes are up to date [Trond Myklebust]
           (AT_STATX_FORCE_SYNC).
      
      And the following have been left out for future extension:
      
       (7) Data version number: Could be used by userspace NFS servers [Aneesh
           Kumar].
      
           Can also be used to modify fill_post_wcc() in NFSD which retrieves
           i_version directly, but has just called vfs_getattr().  It could get
           it from the kstat struct if it used vfs_xgetattr() instead.
      
           (There's disagreement on the exact semantics of a single field, since
           not all filesystems do this the same way).
      
       (8) BSD stat compatibility: Including more fields from the BSD stat such
           as creation time (st_btime) and inode generation number (st_gen)
           [Jeremy Allison, Bernd Schubert].
      
       (9) Inode generation number: Useful for FUSE and userspace NFS servers
           [Bernd Schubert].
      
           (This was asked for but later deemed unnecessary with the
           open-by-handle capability available and caused disagreement as to
           whether it's a security hole or not).
      
      (10) Extra coherency data may be useful in making backups [Andreas Dilger].
      
           (No particular data were offered, but things like last backup
           timestamp, the data version number and the DOS archive bit would come
           into this category).
      
      (11) Allow the filesystem to indicate what it can/cannot provide: A
           filesystem can now say it doesn't support a standard stat feature if
           that isn't available, so if, for instance, inode numbers or UIDs don't
           exist or are fabricated locally...
      
           (This requires a separate system call - I have an fsinfo() call idea
           for this).
      
      (12) Store a 16-byte volume ID in the superblock that can be returned in
           struct xstat [Steve French].
      
           (Deferred to fsinfo).
      
      (13) Include granularity fields in the time data to indicate the
           granularity of each of the times (NFSv4 time_delta) [Steve French].
      
           (Deferred to fsinfo).
      
      (14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
           Note that the Linux IOC flags are a mess and filesystems such as Ext4
           define flags that aren't in linux/fs.h, so translation in the kernel
           may be a necessity (or, possibly, we provide the filesystem type too).
      
           (Some attributes are made available in stx_attributes, but the general
           feeling was that the IOC flags were to ext[234]-specific and shouldn't
           be exposed through statx this way).
      
      (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
           Michael Kerrisk].
      
           (Deferred, probably to fsinfo.  Finding out if there's an ACL or
           seclabal might require extra filesystem operations).
      
      (16) Femtosecond-resolution timestamps [Dave Chinner].
      
           (A __reserved field has been left in the statx_timestamp struct for
           this - if there proves to be a need).
      
      (17) A set multiple attributes syscall to go with this.
      
      ===============
      NEW SYSTEM CALL
      ===============
      
      The new system call is:
      
      	int ret = statx(int dfd,
      			const char *filename,
      			unsigned int flags,
      			unsigned int mask,
      			struct statx *buffer);
      
      The dfd, filename and flags parameters indicate the file to query, in a
      similar way to fstatat().  There is no equivalent of lstat() as that can be
      emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
      also no equivalent of fstat() as that can be emulated by passing a NULL
      filename to statx() with the fd of interest in dfd.
      
      Whether or not statx() synchronises the attributes with the backing store
      can be controlled by OR'ing a value into the flags argument (this typically
      only affects network filesystems):
      
       (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
           respect.
      
       (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
           its attributes with the server - which might require data writeback to
           occur to get the timestamps correct.
      
       (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
           network filesystem.  The resulting values should be considered
           approximate.
      
      mask is a bitmask indicating the fields in struct statx that are of
      interest to the caller.  The user should set this to STATX_BASIC_STATS to
      get the basic set returned by stat().  It should be noted that asking for
      more information may entail extra I/O operations.
      
      buffer points to the destination for the data.  This must be 256 bytes in
      size.
      
      ======================
      MAIN ATTRIBUTES RECORD
      ======================
      
      The following structures are defined in which to return the main attribute
      set:
      
      	struct statx_timestamp {
      		__s64	tv_sec;
      		__s32	tv_nsec;
      		__s32	__reserved;
      	};
      
      	struct statx {
      		__u32	stx_mask;
      		__u32	stx_blksize;
      		__u64	stx_attributes;
      		__u32	stx_nlink;
      		__u32	stx_uid;
      		__u32	stx_gid;
      		__u16	stx_mode;
      		__u16	__spare0[1];
      		__u64	stx_ino;
      		__u64	stx_size;
      		__u64	stx_blocks;
      		__u64	__spare1[1];
      		struct statx_timestamp	stx_atime;
      		struct statx_timestamp	stx_btime;
      		struct statx_timestamp	stx_ctime;
      		struct statx_timestamp	stx_mtime;
      		__u32	stx_rdev_major;
      		__u32	stx_rdev_minor;
      		__u32	stx_dev_major;
      		__u32	stx_dev_minor;
      		__u64	__spare2[14];
      	};
      
      The defined bits in request_mask and stx_mask are:
      
      	STATX_TYPE		Want/got stx_mode & S_IFMT
      	STATX_MODE		Want/got stx_mode & ~S_IFMT
      	STATX_NLINK		Want/got stx_nlink
      	STATX_UID		Want/got stx_uid
      	STATX_GID		Want/got stx_gid
      	STATX_ATIME		Want/got stx_atime{,_ns}
      	STATX_MTIME		Want/got stx_mtime{,_ns}
      	STATX_CTIME		Want/got stx_ctime{,_ns}
      	STATX_INO		Want/got stx_ino
      	STATX_SIZE		Want/got stx_size
      	STATX_BLOCKS		Want/got stx_blocks
      	STATX_BASIC_STATS	[The stuff in the normal stat struct]
      	STATX_BTIME		Want/got stx_btime{,_ns}
      	STATX_ALL		[All currently available stuff]
      
      stx_btime is the file creation time, stx_mask is a bitmask indicating the
      data provided and __spares*[] are where as-yet undefined fields can be
      placed.
      
      Time fields are structures with separate seconds and nanoseconds fields
      plus a reserved field in case we want to add even finer resolution.  Note
      that times will be negative if before 1970; in such a case, the nanosecond
      fields will also be negative if not zero.
      
      The bits defined in the stx_attributes field convey information about a
      file, how it is accessed, where it is and what it does.  The following
      attributes map to FS_*_FL flags and are the same numerical value:
      
      	STATX_ATTR_COMPRESSED		File is compressed by the fs
      	STATX_ATTR_IMMUTABLE		File is marked immutable
      	STATX_ATTR_APPEND		File is append-only
      	STATX_ATTR_NODUMP		File is not to be dumped
      	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs
      
      Within the kernel, the supported flags are listed by:
      
      	KSTAT_ATTR_FS_IOC_FLAGS
      
      [Are any other IOC flags of sufficient general interest to be exposed
      through this interface?]
      
      New flags include:
      
      	STATX_ATTR_AUTOMOUNT		Object is an automount trigger
      
      These are for the use of GUI tools that might want to mark files specially,
      depending on what they are.
      
      Fields in struct statx come in a number of classes:
      
       (0) stx_dev_*, stx_blksize.
      
           These are local system information and are always available.
      
       (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
           stx_size, stx_blocks.
      
           These will be returned whether the caller asks for them or not.  The
           corresponding bits in stx_mask will be set to indicate whether they
           actually have valid values.
      
           If the caller didn't ask for them, then they may be approximated.  For
           example, NFS won't waste any time updating them from the server,
           unless as a byproduct of updating something requested.
      
           If the values don't actually exist for the underlying object (such as
           UID or GID on a DOS file), then the bit won't be set in the stx_mask,
           even if the caller asked for the value.  In such a case, the returned
           value will be a fabrication.
      
           Note that there are instances where the type might not be valid, for
           instance Windows reparse points.
      
       (2) stx_rdev_*.
      
           This will be set only if stx_mode indicates we're looking at a
           blockdev or a chardev, otherwise will be 0.
      
       (3) stx_btime.
      
           Similar to (1), except this will be set to 0 if it doesn't exist.
      
      =======
      TESTING
      =======
      
      The following test program can be used to test the statx system call:
      
      	samples/statx/test-statx.c
      
      Just compile and run, passing it paths to the files you want to examine.
      The file is built automatically if CONFIG_SAMPLES is enabled.
      
      Here's some example output.  Firstly, an NFS directory that crosses to
      another FSID.  Note that the AUTOMOUNT attribute is set because transiting
      this directory will cause d_automount to be invoked by the VFS.
      
      	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:26           Inode: 1703937     Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
      
      Secondly, the result of automounting on that directory.
      
      	[root@andromeda ~]# /tmp/test-statx /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:27           Inode: 2           Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a528d35e
  12. 02 3月, 2017 3 次提交
  13. 28 2月, 2017 2 次提交
  14. 25 2月, 2017 1 次提交
  15. 23 2月, 2017 2 次提交
    • E
      ocfs2: fix deadlock issue when taking inode lock at vfs entry points · b891fa50
      Eric Ren 提交于
      Commit 743b5f14 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
      results in a deadlock, as the author "Tariq Saeed" realized shortly
      after the patch was merged.  The discussion happened here
      
        https://oss.oracle.com/pipermail/ocfs2-devel/2015-September/011085.html
      
      The reason why taking cluster inode lock at vfs entry points opens up a
      self deadlock window, is explained in the previous patch of this series.
      
      So far, we have seen two different code paths that have this issue.
      
      1. do_sys_open
           may_open
            inode_permission
             ocfs2_permission
              ocfs2_inode_lock() <=== take PR
               generic_permission
                get_acl
                 ocfs2_iop_get_acl
                  ocfs2_inode_lock() <=== take PR
      
      2. fchmod|fchmodat
          chmod_common
           notify_change
            ocfs2_setattr <=== take EX
             posix_acl_chmod
              get_acl
               ocfs2_iop_get_acl <=== take PR
              ocfs2_iop_set_acl <=== take EX
      
      Fixes them by adding the tracking logic (in the previous patch) for these
      funcs above, ocfs2_permission(), ocfs2_iop_[set|get]_acl(),
      ocfs2_setattr().
      
      Link: http://lkml.kernel.org/r/20170117100948.11657-3-zren@suse.comSigned-off-by: NEric Ren <zren@suse.com>
      Reviewed-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: NJoseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b891fa50
    • E
      ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock · 439a36b8
      Eric Ren 提交于
      We are in the situation that we have to avoid recursive cluster locking,
      but there is no way to check if a cluster lock has been taken by a precess
      already.
      
      Mostly, we can avoid recursive locking by writing code carefully.
      However, we found that it's very hard to handle the routines that are
      invoked directly by vfs code.  For instance:
      
        const struct inode_operations ocfs2_file_iops = {
            .permission     = ocfs2_permission,
            .get_acl        = ocfs2_iop_get_acl,
            .set_acl        = ocfs2_iop_set_acl,
        };
      
      Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):
      
        do_sys_open
         may_open
          inode_permission
           ocfs2_permission
            ocfs2_inode_lock() <=== first time
             generic_permission
              get_acl
               ocfs2_iop_get_acl
        	ocfs2_inode_lock() <=== recursive one
      
      A deadlock will occur if a remote EX request comes in between two of
      ocfs2_inode_lock().  Briefly describe how the deadlock is formed:
      
      On one hand, OCFS2_LOCK_BLOCKED flag of this lockres is set in
      BAST(ocfs2_generic_handle_bast) when downconvert is started on behalf of
      the remote EX lock request.  Another hand, the recursive cluster lock
      (the second one) will be blocked in in __ocfs2_cluster_lock() because of
      OCFS2_LOCK_BLOCKED.  But, the downconvert never complete, why? because
      there is no chance for the first cluster lock on this node to be
      unlocked - we block ourselves in the code path.
      
      The idea to fix this issue is mostly taken from gfs2 code.
      
      1. introduce a new field: struct ocfs2_lock_res.l_holders, to keep track
         of the processes' pid who has taken the cluster lock of this lock
         resource;
      
      2. introduce a new flag for ocfs2_inode_lock_full:
         OCFS2_META_LOCK_GETBH; it means just getting back disk inode bh for
         us if we've got cluster lock.
      
      3. export a helper: ocfs2_is_locked_by_me() is used to check if we have
         got the cluster lock in the upper code path.
      
      The tracking logic should be used by some of the ocfs2 vfs's callbacks,
      to solve the recursive locking issue cuased by the fact that vfs
      routines can call into each other.
      
      The performance penalty of processing the holder list should only be
      seen at a few cases where the tracking logic is used, such as get/set
      acl.
      
      You may ask what if the first time we got a PR lock, and the second time
      we want a EX lock? fortunately, this case never happens in the real
      world, as far as I can see, including permission check,
      (get|set)_(acl|attr), and the gfs2 code also do so.
      
      [sfr@canb.auug.org.au remove some inlines]
      Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.comSigned-off-by: NEric Ren <zren@suse.com>
      Reviewed-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: NJoseph Qi <jiangqi903@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      439a36b8
  16. 14 1月, 2017 1 次提交
    • P
      locking/atomic, kref: Add kref_read() · 2c935bc5
      Peter Zijlstra 提交于
      Since we need to change the implementation, stop exposing internals.
      
      Provide kref_read() to read the current reference count; typically
      used for debug messages.
      
      Kills two anti-patterns:
      
      	atomic_read(&kref->refcount)
      	kref->refcount.counter
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2c935bc5
  17. 11 1月, 2017 1 次提交
    • E
      ocfs2: fix crash caused by stale lvb with fsdlm plugin · e7ee2c08
      Eric Ren 提交于
      The crash happens rather often when we reset some cluster nodes while
      nodes contend fiercely to do truncate and append.
      
      The crash backtrace is below:
      
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
         ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: Beginning quota recovery on device (253,18) for slot 2
         ocfs2: Finishing quota recovery on device (253,18) for slot 2
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
         ------------[ cut here ]------------
         kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
         invalid opcode: 0000 [#1] SMP
         Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod    iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport      joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix               drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd       usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
         Supported: No, Unsupported modules are loaded
         CPU: 1 PID: 30154 Comm: truncate Tainted: G           OE   N  4.4.21-69-default #1
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
         task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
         RIP: 0010:[<ffffffffa05c8c30>]  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
         RSP: 0018:ffff880074e6bd50  EFLAGS: 00010282
         RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
         RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
         RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
         R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
         R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
         FS:  00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
         CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
         Call Trace:
           ocfs2_setattr+0x698/0xa90 [ocfs2]
           notify_change+0x1ae/0x380
           do_truncate+0x5e/0x90
           do_sys_ftruncate.constprop.11+0x108/0x160
           entry_SYSCALL_64_fastpath+0x12/0x6d
         Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
         RIP  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
      
      It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
      not equal to the disk i_size.  We mistakenly trust the LVB because the
      underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
      DLM_SBF_VALNOTVALID properly for us.  But, why?
      
      The current code tries to downconvert lock without DLM_LKF_VALBLK flag
      to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
      if the lock resource type needs LVB.  This is not the right way for
      fsdlm.
      
      The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
      DLM_LKF_VALBLK to decide if we care about the LVB in the LKB.  If
      DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
      this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
      failure happens.
      
      The following diagram briefly illustrates how this crash happens:
      
      RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
      
      The 1st round:
      
                   Node1                                    Node2
      RSB1: PR
                                                        RSB1(master): NULL->EX
      ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
        ocfs2_dlm_lock(no DLM_LKF_VALBLK)
      
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      
      dlm_lock(no DLM_LKF_VALBLK)
        convert_lock(overwrite lkb->lkb_exflags
                     with no DLM_LKF_VALBLK)
      
      RSB1: NULL                                        RSB1: EX
                                                        reset Node2
      dlm_recover_rsbs()
        recover_lvb()
      
      /* The LVB is not trustable if the node with EX fails and
       * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
       */
      
       if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
                 return;                   * to invalid the LVB here.
                                           */
      
      The 2nd round:
      
               Node 1                                Node2
      RSB1(become master from recovery)
      
      ocfs2_setattr()
        ocfs2_inode_lock(NULL->EX)
          /* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
          ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
        ocfs2_truncate_file()
            mlog_bug_on_msg(disk isize != i_size_read(inode))  /* crash! */
      
      The fix is quite straightforward.  We keep to set DLM_LKF_VALBLK flag
      for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
      is uesed.
      
      Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.comSigned-off-by: NEric Ren <zren@suse.com>
      Reviewed-by: NJoseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7ee2c08
  18. 26 12月, 2016 1 次提交
    • T
      ktime: Get rid of the union · 2456e855
      Thomas Gleixner 提交于
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
      variant for 32bit machines. The Y2038 cleanup removed the timespec variant
      and switched everything to scalar nanoseconds. The union remained, but
      become completely pointless.
      
      Get rid of the union and just keep ktime_t as simple typedef of type s64.
      
      The conversion was done with coccinelle and some manual mopping up.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      2456e855
  19. 25 12月, 2016 1 次提交
  20. 23 12月, 2016 1 次提交
  21. 13 12月, 2016 7 次提交
  22. 11 12月, 2016 4 次提交