1. 03 3月, 2015 1 次提交
    • T
      eCryptfs: don't pass fs-specific ioctl commands through · 6d65261a
      Tyler Hicks 提交于
      eCryptfs can't be aware of what to expect when after passing an
      arbitrary ioctl command through to the lower filesystem. The ioctl
      command may trigger an action in the lower filesystem that is
      incompatible with eCryptfs.
      
      One specific example is when one attempts to use the Btrfs clone
      ioctl command when the source file is in the Btrfs filesystem that
      eCryptfs is mounted on top of and the destination fd is from a new file
      created in the eCryptfs mount. The ioctl syscall incorrectly returns
      success because the command is passed down to Btrfs which thinks that it
      was able to do the clone operation. However, the result is an empty
      eCryptfs file.
      
      This patch allows the trim, {g,s}etflags, and {g,s}etversion ioctl
      commands through and then copies up the inode metadata from the lower
      inode to the eCryptfs inode to catch any changes made to the lower
      inode's metadata. Those five ioctl commands are mostly common across all
      filesystems but the whitelist may need to be further pruned in the
      future.
      
      https://bugzilla.kernel.org/show_bug.cgi?id=93691
      https://launchpad.net/bugs/1305335Signed-off-by: NTyler Hicks <tyhicks@canonical.com>
      Cc: Rocko <rockorequin@hotmail.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: stable@vger.kernel.org # v2.6.36+: c43f7b8f eCryptfs: Handle ioctl calls with unlocked and compat functions
      6d65261a
  2. 01 3月, 2015 1 次提交
    • R
      nilfs2: fix potential memory overrun on inode · 957ed60b
      Ryusuke Konishi 提交于
      Each inode of nilfs2 stores a root node of a b-tree, and it turned out to
      have a memory overrun issue:
      
      Each b-tree node of nilfs2 stores a set of key-value pairs and the number
      of them (in "bn_nchildren" member of nilfs_btree_node struct), as well as
      a few other "bn_*" members.
      
      Since the value of "bn_nchildren" is used for operations on the key-values
      within the b-tree node, it can cause memory access overrun if a large
      number is incorrectly set to "bn_nchildren".
      
      For instance, nilfs_btree_node_lookup() function determines the range of
      binary search with it, and too large "bn_nchildren" leads
      nilfs_btree_node_get_key() in that function to overrun.
      
      As for intermediate b-tree nodes, this is prevented by a sanity check
      performed when each node is read from a drive, however, no sanity check
      has been done for root nodes stored in inodes.
      
      This patch fixes the issue by adding missing sanity check against b-tree
      root nodes so that it's called when on-memory inodes are read from ifile,
      inode metadata file.
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      957ed60b
  3. 27 2月, 2015 1 次提交
  4. 25 2月, 2015 1 次提交
    • C
      eCryptfs: ensure copy to crypt_stat->cipher does not overrun · 2a559a8b
      Colin Ian King 提交于
      The patch 237fead6: "[PATCH] ecryptfs: fs/Makefile and
      fs/Kconfig" from Oct 4, 2006, leads to the following static checker
      warning:
      
        fs/ecryptfs/crypto.c:846 ecryptfs_new_file_context()
        error: off-by-one overflow 'crypt_stat->cipher' size 32.  rl = '0-32'
      
      There is a mismatch between the size of ecryptfs_crypt_stat.cipher
      and ecryptfs_mount_crypt_stat.global_default_cipher_name causing the
      copy of the cipher name to cause a off-by-one string copy error. This
      fix ensures the space reserved for this string is the same size including
      the trailing zero at the end throughout ecryptfs.
      
      This fix avoids increasing the size of ecryptfs_crypt_stat.cipher
      and also ecryptfs_parse_tag_70_packet_silly_stack.cipher_string and instead
      reduces the of ECRYPTFS_MAX_CIPHER_NAME_SIZE to 31 and includes the + 1 for
      the end of string terminator.
      
      NOTE: An overflow is not possible in practice since the value copied
      into global_default_cipher_name is validated by
      ecryptfs_code_for_cipher_string() at mount time. None of the allowed
      cipher strings are long enough to cause the potential buffer overflow
      fixed by this patch.
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      [tyhicks: Added the NOTE about the overflow not being triggerable]
      Signed-off-by: NTyler Hicks <tyhicks@canonical.com>
      2a559a8b
  5. 24 2月, 2015 2 次提交
  6. 23 2月, 2015 11 次提交
    • D
      xfs: ensure truncate forces zeroed blocks to disk · 5885ebda
      Dave Chinner 提交于
      A new fsync vs power fail test in xfstests indicated that XFS can
      have unreliable data consistency when doing extending truncates that
      require block zeroing. The blocks beyond EOF get zeroed in memory,
      but we never force those changes to disk before we run the
      transaction that extends the file size and exposes those blocks to
      userspace. This can result in the blocks not being correctly zeroed
      after a crash.
      
      Because in-memory behaviour is correct, tools like fsx don't pick up
      any coherency problems - it's not until the filesystem is shutdown
      or the system crashes after writing the truncate transaction to the
      journal but before the zeroed data in the page cache is flushed that
      the issue is exposed.
      
      Fix this by also flushing the dirty data in memory region between
      the old size and new size when we've found blocks that need zeroing
      in the truncate process.
      Reported-by: NLiu Bo <bo.li.liu@oracle.com>
      cc: <stable@vger.kernel.org>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      5885ebda
    • J
      xfs: Fix quota type in quota structures when reusing quota file · dfcc70a8
      Jan Kara 提交于
      For filesystems without separate project quota inode field in the
      superblock we just reuse project quota file for group quotas (and vice
      versa) if project quota file is allocated and we need group quota file.
      When we reuse the file, quota structures on disk suddenly have wrong
      type stored in d_flags though. Nobody really cares about this (although
      structure type reported to userspace was wrong as well) except
      that after commit 14bf61ff (quota: Switch ->get_dqblk() and
      ->set_dqblk() to use bytes as space units) assertion in
      xfs_qm_scall_getquota() started to trigger on xfs/106 test (apparently I
      was testing without XFS_DEBUG so I didn't notice when submitting the
      above commit).
      
      Fix the problem by properly resetting ddq->d_flags when running quotacheck
      for a quota file.
      
      CC: stable@vger.kernel.org
      Reported-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      dfcc70a8
    • A
      autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation · 0a280962
      Al Viro 提交于
      X-Coverup: just ask spender
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0a280962
    • A
      procfs: fix race between symlink removals and traversals · 7e0e953b
      Al Viro 提交于
      use_pde()/unuse_pde() in ->follow_link()/->put_link() resp.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7e0e953b
    • A
      debugfs: leave freeing a symlink body until inode eviction · 0db59e59
      Al Viro 提交于
      As it is, we have debugfs_remove() racing with symlink traversals.
      Supply ->evict_inode() and do freeing there - inode will remain
      pinned until we are done with the symlink body.
      
      And rip the idiocy with checking if dentry is positive right after
      we'd verified debugfs_positive(), which is a stronger check...
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0db59e59
    • K
      trylock_super(): replacement for grab_super_passive() · eb6ef3df
      Konstantin Khlebnikov 提交于
      I've noticed significant locking contention in memory reclaimer around
      sb_lock inside grab_super_passive(). Grab_super_passive() is called from
      two places: in icache/dcache shrinkers (function super_cache_scan) and
      from writeback (function __writeback_inodes_wb). Both are required for
      progress in memory allocator.
      
      Grab_super_passive() acquires sb_lock to increment sb->s_count and check
      sb->s_instances. It seems sb->s_umount locked for read is enough here:
      super-block deactivation always runs under sb->s_umount locked for write.
      Protecting super-block itself isn't a problem: in super_cache_scan() sb
      is protected by shrinker_rwsem: it cannot be freed if its slab shrinkers
      are still active. Inside writeback super-block comes from inode from bdi
      writeback list under wb->list_lock.
      
      This patch removes locking sb_lock and checks s_instances under s_umount:
      generic_shutdown_super() unlinks it under sb->s_umount locked for write.
      New variant is called trylock_super() and since it only locks semaphore,
      callers must call up_read(&sb->s_umount) instead of drop_super(sb) when
      they're done.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      eb6ef3df
    • D
      fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions · 54f2a2f4
      David Howells 提交于
      Fanotify probably doesn't want to watch autodirs so make it use d_can_lookup()
      rather than d_is_dir() when checking a dir watch and give an error on fake
      directories.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      54f2a2f4
    • D
      Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions · ce40fa78
      David Howells 提交于
      Fix up the following scripted S_ISDIR/S_ISREG/S_ISLNK conversions (or lack
      thereof) in cachefiles:
      
       (1) Cachefiles mostly wants to use d_can_lookup() rather than d_is_dir() as
           it doesn't want to deal with automounts in its cache.
      
       (2) Coccinelle didn't find S_IS* expressions in ASSERT() statements in
           cachefiles.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ce40fa78
    • D
      VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) · e36cb0b8
      David Howells 提交于
      Convert the following where appropriate:
      
       (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).
      
       (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).
      
       (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
           complicated than it appears as some calls should be converted to
           d_can_lookup() instead.  The difference is whether the directory in
           question is a real dir with a ->lookup op or whether it's a fake dir with
           a ->d_automount op.
      
      In some circumstances, we can subsume checks for dentry->d_inode not being
      NULL into this, provided we the code isn't in a filesystem that expects
      d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
      use d_inode() rather than d_backing_inode() to get the inode pointer).
      
      Note that the dentry type field may be set to something other than
      DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
      manages the fall-through from a negative dentry to a lower layer.  In such a
      case, the dentry type of the negative union dentry is set to the same as the
      type of the lower dentry.
      
      However, if you know d_inode is not NULL at the call site, then you can use
      the d_is_xxx() functions even in a filesystem.
      
      There is one further complication: a 0,0 chardev dentry may be labelled
      DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
      intended for special directory entry types that don't have attached inodes.
      
      The following perl+coccinelle script was used:
      
      use strict;
      
      my @callers;
      open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
          die "Can't grep for S_ISDIR and co. callers";
      @callers = <$fd>;
      close($fd);
      unless (@callers) {
          print "No matches\n";
          exit(0);
      }
      
      my @cocci = (
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISLNK(E->d_inode->i_mode)',
          '+ d_is_symlink(E)',
          '',
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISDIR(E->d_inode->i_mode)',
          '+ d_is_dir(E)',
          '',
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISREG(E->d_inode->i_mode)',
          '+ d_is_reg(E)' );
      
      my $coccifile = "tmp.sp.cocci";
      open($fd, ">$coccifile") || die $coccifile;
      print($fd "$_\n") || die $coccifile foreach (@cocci);
      close($fd);
      
      foreach my $file (@callers) {
          chomp $file;
          print "Processing ", $file, "\n";
          system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
      	die "spatch failed";
      }
      
      [AV: overlayfs parts skipped]
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e36cb0b8
    • D
      VFS: Split DCACHE_FILE_TYPE into regular and special types · 44bdb5e5
      David Howells 提交于
      Split DCACHE_FILE_TYPE into DCACHE_REGULAR_TYPE (dentries representing regular
      files) and DCACHE_SPECIAL_TYPE (representing blockdev, chardev, FIFO and
      socket files).
      
      d_is_reg() and d_is_special() are added to detect these subtypes and
      d_is_file() is left as the union of the two.
      
      This allows a number of places that use S_ISREG(dentry->d_inode->i_mode) to
      use d_is_reg(dentry) instead.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      44bdb5e5
    • D
      VFS: Add a fallthrough flag for marking virtual dentries · df1a085a
      David Howells 提交于
      Add a DCACHE_FALLTHRU flag to indicate that, in a layered filesystem, this is
      a virtual dentry that covers another one in a lower layer that should be used
      instead.  This may be recorded on medium if directory integration is stored
      there.
      
      The flag can be set with d_set_fallthru() and tested with d_is_fallthru().
      
      Original-author: Valerie Aurora <vaurora@redhat.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      df1a085a
  7. 20 2月, 2015 7 次提交
  8. 19 2月, 2015 16 次提交
    • H
      x86, mm/ASLR: Fix stack randomization on 64-bit systems · 4e7c22d4
      Hector Marco-Gisbert 提交于
      The issue is that the stack for processes is not properly randomized on
      64 bit architectures due to an integer overflow.
      
      The affected function is randomize_stack_top() in file
      "fs/binfmt_elf.c":
      
        static unsigned long randomize_stack_top(unsigned long stack_top)
        {
                 unsigned int random_variable = 0;
      
                 if ((current->flags & PF_RANDOMIZE) &&
                         !(current->personality & ADDR_NO_RANDOMIZE)) {
                         random_variable = get_random_int() & STACK_RND_MASK;
                         random_variable <<= PAGE_SHIFT;
                 }
                 return PAGE_ALIGN(stack_top) + random_variable;
                 return PAGE_ALIGN(stack_top) - random_variable;
        }
      
      Note that, it declares the "random_variable" variable as "unsigned int".
      Since the result of the shifting operation between STACK_RND_MASK (which
      is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):
      
      	  random_variable <<= PAGE_SHIFT;
      
      then the two leftmost bits are dropped when storing the result in the
      "random_variable". This variable shall be at least 34 bits long to hold
      the (22+12) result.
      
      These two dropped bits have an impact on the entropy of process stack.
      Concretely, the total stack entropy is reduced by four: from 2^28 to
      2^30 (One fourth of expected entropy).
      
      This patch restores back the entropy by correcting the types involved
      in the operations in the functions randomize_stack_top() and
      stack_maxrandom_size().
      
      The successful fix can be tested with:
      
        $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
        7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0                          [stack]
        7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0                          [stack]
        7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0                          [stack]
        7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0                          [stack]
        ...
      
      Once corrected, the leading bytes should be between 7ffc and 7fff,
      rather than always being 7fff.
      Signed-off-by: NHector Marco-Gisbert <hecmargi@upv.es>
      Signed-off-by: NIsmael Ripoll <iripoll@upv.es>
      [ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: CVE-2015-1593
      Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.netSigned-off-by: NBorislav Petkov <bp@suse.de>
      4e7c22d4
    • Y
      ceph: return error for traceless reply race · 4d41cef2
      Yan, Zheng 提交于
      When we receives traceless reply for request that created new inode,
      we re-send a lookup request to MDS get information of the newly created
      inode. (VFS expects FS' callback return an inode in create case)
      This breaks one request into two requests. Other client may modify or
      move to the new inode in the middle.
      
      When the race happens, ceph_handle_notrace_create() unconditionally
      links the dentry for 'create' operation to the inode returned by lookup.
      This may confuse VFS when the inode is a directory (VFS does not allow
      multiple linkages for directory inode).
      
      This patch makes ceph_handle_notrace_create() when it detect a race.
      This event should be rare and it happens only when we talk to old MDS.
      Recent MDS does not send traceless reply for request that creates new
      inode.
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      4d41cef2
    • Y
      ceph: fix dentry leaks · 5cba372c
      Yan, Zheng 提交于
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      5cba372c
    • Y
      ceph: re-send requests when MDS enters reconnecting stage · 3de22be6
      Yan, Zheng 提交于
      So that MDS can check if any request is already completed and process
      completed requests in clientreplay stage. When completed requests are
      processed in clientreplay stage, MDS can avoid sending traceless
      replies.
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      3de22be6
    • I
    • Y
      ceph: fix atomic_open snapdir · bf91c315
      Yan, Zheng 提交于
      ceph_handle_snapdir() checks ceph_mdsc_do_request()'s return value
      and creates snapdir inode if it's -ENOENT
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      bf91c315
    • Y
      ceph: properly mark empty directory as complete · 2f92b3d0
      Yan, Zheng 提交于
      ceph_add_cap() calls __check_cap_issue(), which clears directory
      inode' complete flag. so we should set the complete flag for empty
      directory should be set after calling ceph_add_cap().
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      2f92b3d0
    • Y
      client: include kernel version in client metadata · a6a5ce4f
      Yan, Zheng 提交于
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      a6a5ce4f
    • Y
      ceph: provide seperate {inode,file}_operations for snapdir · 38c48b5f
      Yan, Zheng 提交于
      remove all unsupported operations from {inode,file}_operations.
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      38c48b5f
    • Y
      ceph: fix request time stamp encoding · 1f041a89
      Yan, Zheng 提交于
      struct timespec uses 'long' to present second and nanosecond. 'long'
      is 64 bits on 64bits machine. ceph MDS expects time stamp to be
      encoded as struct ceph_timespec, which uses 'u32' to present second
      and nanosecond.
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      1f041a89
    • Y
      ceph: fix reading inline data when i_size > PAGE_SIZE · fcc02d2a
      Yan, Zheng 提交于
      when inode has inline data but its size > PAGE_SIZE (it was truncated
      to larger size), previous direct read code return -EIO. This patch adds
      code to return zeros for data whose offset > PAGE_SIZE.
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      fcc02d2a
    • Y
      ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions) · 86d8f67b
      Yan, Zheng 提交于
      use an atomic variable to track number of sessions, this can avoid block
      operation inside wait loops.
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      86d8f67b
    • Y
      ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps) · c4d4a582
      Yan, Zheng 提交于
      we should not do block operation in wait_event_interruptible()'s condition
      check function, but reading inline data can block. so move the read inline
      data code to ceph_get_caps()
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      c4d4a582
    • Y
      ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync) · d3383a8e
      Yan, Zheng 提交于
      check_cap_flush() calls mutex_lock(), which may block. So we can't
      use it as condition check function for wait_event();
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      d3383a8e
    • Y
      ceph: improve reference tracking for snaprealm · 982d6011
      Yan, Zheng 提交于
      When snaprealm is created, its initial reference count is zero.
      But in some rare cases, the newly created snaprealm is not referenced
      by anyone. This causes snaprealm with zero reference count not freed.
      
      The fix is set reference count of newly snaprealm to 1. The reference
      is return the function who requests to create the snaprealm. When the
      function finishes its job, it releases the reference.
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      982d6011
    • Y
      ceph: properly zero data pages for file holes. · 1487a688
      Yan, Zheng 提交于
      A bug is found in striped_read() of fs/ceph/file.c. striped_read() calls
      ceph_zero_pape_vector_range().  The first argument, page_align + read + ret,
      passed to ceph_zero_pape_vector_range() is wrong.
      
      When a file has holes, this wrong parameter may cause memory corruption
      either in kernal space or user space. Kernel space memory may be corrupted in
      the case of non direct IO; user space memory may be corrupted in the case of
      direct IO. In the latter case, the application doing direct IO may crash due
      to memory corruption, as we have experienced.
      
      The correct value should be initial_align + read + ret, where intial_align =
      o_direct ? buf_align : io_align.  Compared with page_align, the current page
      offest, initial_align is the initial page offest, which should be used to
      calculate the page and offset in ceph_zero_pape_vector_range().
      Reported-by: Ncaifeng zhu <zhucaifeng@unissoft-nj.com>
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      1487a688