1. 19 Jan, 2017 1 commit
  2. 18 Jan, 2017 7 commits
  3. 17 Jan, 2017 6 commits
    • ubifs: Fix journal replay wrt. xattr nodes · 1cb51a15
      Committed by Richard Weinberger
      When replaying the journal it can happen that a journal entry points to
      a garbage collected node.
      This is the case when a power-cut occurred between a garbage collect run
      and a commit. In such a case nodes have to be read using the failable
      read functions to detect whether the found node matches what we expect.
      
      One corner case was forgotten: when the journal contains an entry to
      remove an inode, all xattrs have to be removed too. UBIFS models xattrs
      like directory entries, so the TNC code iterates over
      all xattrs of the inode and removes them too. This code re-uses the
      functions for walking directories and calls ubifs_tnc_next_ent().
      ubifs_tnc_next_ent() expects to be used only after journal replay and
      aborts when a node does not match the expected result. This behavior can
      render a UBIFS volume unmountable after a power-cut when xattrs are
      used.
      
      Fix this issue by using failable read functions in ubifs_tnc_next_ent()
      too when replaying the journal.
      Cc: stable@vger.kernel.org
      Fixes: 1e51764a ("UBIFS: add new flash file system")
      Reported-by: Rock Lee <rockdotlee@gmail.com>
      Reviewed-by: David Gstir <david@sigma-star.at>
      Signed-off-by: Richard Weinberger <richard@nod.at>
    • ubifs: remove redundant checks for encryption key · 3d4b2fcb
      Committed by Eric Biggers
      In several places, ubifs checked for an encryption key before creating a
      file in an encrypted directory.  This was redundant with
      fscrypt_setup_filename() or ubifs_new_inode(), and in the case of
      ubifs_link() it broke linking to special files.  So remove the extra
      checks.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Richard Weinberger <richard@nod.at>
    • ubifs: allow encryption ioctls in compat mode · a75467d9
      Committed by Eric Biggers
      The ubifs encryption ioctls did not work when called by a 32-bit program
      on a 64-bit kernel.  Since 'struct fscrypt_policy' is not affected by
      the word size, ubifs just needs to allow these ioctls through, like what
      ext4 and f2fs do.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Richard Weinberger <richard@nod.at>
    • ubifs: add CONFIG_BLOCK dependency for encryption · 404e0b63
      Committed by Arnd Bergmann
      This came up during the v4.10 merge window:
      
      warning: (UBIFS_FS_ENCRYPTION) selects FS_ENCRYPTION which has unmet direct dependencies (BLOCK)
      fs/crypto/crypto.c: In function 'fscrypt_zeroout_range':
      fs/crypto/crypto.c:355:9: error: implicit declaration of function 'bio_alloc';did you mean 'd_alloc'? [-Werror=implicit-function-declaration]
         bio = bio_alloc(GFP_NOWAIT, 1);
      
      The easiest way out is to limit UBIFS_FS_ENCRYPTION to configurations
      that also enable BLOCK.
      
      Fixes: d475a507 ("ubifs: Add skeleton for fscrypto")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Richard Weinberger <richard@nod.at>
    • ubifs: fix unencrypted journal write · 507502ad
      Committed by Peter Rosin
      Without this, I get the following on reboot:
      
      UBIFS error (ubi1:0 pid 703): ubifs_load_znode: bad target node (type 1) length (8240)
      UBIFS error (ubi1:0 pid 703): ubifs_load_znode: have to be in range of 48-4144
      UBIFS error (ubi1:0 pid 703): ubifs_load_znode: bad indexing node at LEB 13:11080, error 5
       magic          0x6101831
       crc            0xb1cb246f
       node_type      9 (indexing node)
       group_type     0 (no node group)
       sqnum          546
       len            128
       child_cnt      5
       level          0
       Branches:
       0: LEB 14:72088 len 161 key (133, inode)
       1: LEB 14:81120 len 160 key (134, inode)
       2: LEB 20:26624 len 8240 key (134, data, 0)
       3: LEB 14:81280 len 160 key (135, inode)
       4: LEB 20:34864 len 8240 key (135, data, 0)
      UBIFS warning (ubi1:0 pid 703): ubifs_ro_mode.part.0: switched to read-only mode, error -22
      CPU: 0 PID: 703 Comm: mount Not tainted 4.9.0-next-20161213+ #1197
      Hardware name: Atmel SAMA5
      [<c010d2ac>] (unwind_backtrace) from [<c010b250>] (show_stack+0x10/0x14)
      [<c010b250>] (show_stack) from [<c024df94>] (ubifs_jnl_update+0x2e8/0x614)
      [<c024df94>] (ubifs_jnl_update) from [<c0254bf8>] (ubifs_mkdir+0x160/0x204)
      [<c0254bf8>] (ubifs_mkdir) from [<c01a6030>] (vfs_mkdir+0xb0/0x104)
      [<c01a6030>] (vfs_mkdir) from [<c0286070>] (ovl_create_real+0x118/0x248)
      [<c0286070>] (ovl_create_real) from [<c0283ed4>] (ovl_fill_super+0x994/0xaf4)
      [<c0283ed4>] (ovl_fill_super) from [<c019c394>] (mount_nodev+0x44/0x9c)
      [<c019c394>] (mount_nodev) from [<c019c4ac>] (mount_fs+0x14/0xa4)
      [<c019c4ac>] (mount_fs) from [<c01b5338>] (vfs_kern_mount+0x4c/0xd4)
      [<c01b5338>] (vfs_kern_mount) from [<c01b6b80>] (do_mount+0x154/0xac8)
      [<c01b6b80>] (do_mount) from [<c01b782c>] (SyS_mount+0x74/0x9c)
      [<c01b782c>] (SyS_mount) from [<c0107f80>] (ret_fast_syscall+0x0/0x3c)
      UBIFS error (ubi1:0 pid 703): ubifs_mkdir: cannot create directory, error -22
      overlayfs: failed to create directory /mnt/ovl/work/work (errno: 22); mounting read-only
      
      Fixes: 7799953b ("ubifs: Implement encrypt/decrypt for all IO")
      Signed-off-by: Peter Rosin <peda@axentia.se>
      Tested-by: Kevin Hilman <khilman@baylibre.com>
      Signed-off-by: Richard Weinberger <richard@nod.at>
    • ubifs: ensure zero err is returned on successful return · e8f19746
      Committed by Colin Ian King
      err is no longer being set on a successful return path, causing
      a garbage value to be returned. Fix this by setting err to zero
      for the successful return path.
      
      Found with static analysis by CoverityScan, CID 1389473
      
      Fixes: 7799953b ("ubifs: Implement encrypt/decrypt for all IO")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Richard Weinberger <richard@nod.at>
  4. 15 Jan, 2017 2 commits
    • coredump: Ensure proper size of sparse core files · 4d22c75d
      Committed by Dave Kleikamp
      If the last section of a core file ends with an unmapped or zero page,
      the size of the file does not correspond with the last dump_skip() call.
      gdb then complains that the file is truncated, which can confuse users.
      
      After all of the vma sections are written, make sure that the file size
      is no smaller than the current file position.
      
      This problem can be demonstrated with gdb's bigcore testcase on the
      sparc architecture.
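
      The size fix-up described above can be illustrated from userspace. This is
      only a sketch under stated assumptions, not the kernel change itself: the
      helper names ensure_size_covers_position() and demo_sparse_tail() are
      hypothetical, and lseek() stands in for the dump_skip() call that moves
      the file position past a hole.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: after holes were skipped with lseek(), grow the
 * file so that its size is no smaller than the current file position.
 * Returns 0 on success, -1 on error.
 */
int ensure_size_covers_position(int fd)
{
    struct stat st;
    off_t pos = lseek(fd, 0, SEEK_CUR);   /* current file position */

    if (pos == (off_t)-1)
        return -1;
    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size < pos)                 /* trailing hole not reflected in size */
        return ftruncate(fd, pos);
    return 0;
}

/*
 * Demo: write 4 bytes, skip a 4096-byte hole at the end, then fix up the
 * size. Returns the resulting file size, or -1 on any failure.
 */
long demo_sparse_tail(void)
{
    FILE *f = tmpfile();
    struct stat st;
    long size = -1;
    int fd;

    if (!f)
        return -1;
    fd = fileno(f);
    if (write(fd, "core", 4) == 4 &&
        lseek(fd, 4096, SEEK_CUR) != (off_t)-1 &&
        ensure_size_covers_position(fd) == 0 &&
        fstat(fd, &st) == 0)
        size = (long)st.st_size;
    fclose(f);
    return size;
}
```

      Without the fix-up step the temporary file would stay 4 bytes long even
      though the position is 4100, which is the mismatch that a reader such as
      gdb would see as truncation.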
      Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • aio: fix lock dep warning · a12f1ae6
      Committed by Shaohua Li
      lockdep reports a warning. file_start_write/file_end_write only
      acquire/release the lock for regular files, so check the file type on
      the aio side too.
      
      [  453.532141] ------------[ cut here ]------------
      [  453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670
      [  453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0)
      [  453.533011] Modules linked in:
      [  453.533011] CPU: 1 PID: 1298 Comm: fio Not tainted 4.9.0+ #964
      [  453.533011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
      [  453.533011]  ffff8803a24b7a70 ffffffff8196cffb ffff8803a24b7ae8 0000000000000000
      [  453.533011]  ffff8803a24b7ab8 ffffffff81091ee1 ffff8803a5dba700 00000dba00000008
      [  453.533011]  ffffed0074496f59 ffff8803a5dbaf54 ffff8803ae0f8488 fffffffffffffdef
      [  453.533011] Call Trace:
      [  453.533011]  [<ffffffff8196cffb>] dump_stack+0x67/0x9c
      [  453.533011]  [<ffffffff81091ee1>] __warn+0x111/0x130
      [  453.533011]  [<ffffffff81091f97>] warn_slowpath_fmt+0x97/0xb0
      [  453.533011]  [<ffffffff81091f00>] ? __warn+0x130/0x130
      [  453.533011]  [<ffffffff8191b789>] ? blk_finish_plug+0x29/0x60
      [  453.533011]  [<ffffffff811205d4>] lock_release+0x434/0x670
      [  453.533011]  [<ffffffff8198af94>] ? import_single_range+0xd4/0x110
      [  453.533011]  [<ffffffff81322195>] ? rw_verify_area+0x65/0x140
      [  453.533011]  [<ffffffff813aa696>] ? aio_write+0x1f6/0x280
      [  453.533011]  [<ffffffff813aa6c9>] aio_write+0x229/0x280
      [  453.533011]  [<ffffffff813aa4a0>] ? aio_complete+0x640/0x640
      [  453.533011]  [<ffffffff8111df20>] ? debug_check_no_locks_freed+0x1a0/0x1a0
      [  453.533011]  [<ffffffff8114793a>] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30
      [  453.533011]  [<ffffffff81147985>] ? debug_lockdep_rcu_enabled+0x35/0x40
      [  453.533011]  [<ffffffff812a92be>] ? __might_fault+0x7e/0xf0
      [  453.533011]  [<ffffffff813ac9bc>] do_io_submit+0x94c/0xb10
      [  453.533011]  [<ffffffff813ac2ae>] ? do_io_submit+0x23e/0xb10
      [  453.533011]  [<ffffffff813ac070>] ? SyS_io_destroy+0x270/0x270
      [  453.533011]  [<ffffffff8111d7b3>] ? mark_held_locks+0x23/0xc0
      [  453.533011]  [<ffffffff8100201a>] ? trace_hardirqs_on_thunk+0x1a/0x1c
      [  453.533011]  [<ffffffff813acb90>] SyS_io_submit+0x10/0x20
      [  453.533011]  [<ffffffff824f96aa>] entry_SYSCALL_64_fastpath+0x18/0xad
      [  453.533011]  [<ffffffff81119190>] ? trace_hardirqs_off_caller+0xc0/0x110
      [  453.533011] ---[ end trace b2fbe664d1cc0082 ]---
      
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 14 Jan, 2017 1 commit
  6. 13 Jan, 2017 8 commits
  7. 12 Jan, 2017 2 commits
    • block: Rename blk_queue_zone_size and bdev_zone_size · f99e8648
      Committed by Damien Le Moal
      All block device data fields and functions returning a number of 512B
      sectors are by convention named xxx_sectors while names in the form
      xxx_size are generally used for a number of bytes. The blk_queue_zone_size
      and bdev_zone_size functions were not following this convention so rename
      them.
      
      No functional change is introduced by this patch.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      
      Collapsed the two patches; they were nonsensically split and broke
      bisection.
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • xfs: Timely free truncated dirty pages · 0a417b8d
      Committed by Jan Kara
      Commit 99579cce "xfs: skip dirty pages in ->releasepage()" started
      to skip dirty pages in xfs_vm_releasepage(), which also has the effect
      that if a dirty page is truncated, it does not get freed by
      block_invalidatepage() and lingers on the LRU list waiting for reclaim.
      So a simple loop like:
      
      while true; do
      	dd if=/dev/zero of=file bs=1M count=100
      	rm file
      done
      
      will keep using more and more memory until we hit low watermarks and
      start pagecache reclaim, which will eventually also reclaim the truncated
      pages. Keeping these truncated (and thus never usable) pages in memory
      is just a waste of memory, unnecessarily stresses page cache
      reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
      prematurely.
      
      So instead of just skipping dirty pages in xfs_vm_releasepage(), return
      to old behavior of skipping them only if they have delalloc or unwritten
      buffers and fix the spurious warnings by warning only if the page is
      clean.
      
      CC: stable@vger.kernel.org
      CC: Brian Foster <bfoster@redhat.com>
      CC: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Petr Tůma <petr.tuma@d3s.mff.cuni.cz>
      Fixes: 99579cce
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  8. 11 Jan, 2017 3 commits
    • ocfs2: fix crash caused by stale lvb with fsdlm plugin · e7ee2c08
      Committed by Eric Ren
      The crash happens rather often when we reset some cluster nodes while
      nodes contend fiercely to do truncate and append.
      
      The crash backtrace is below:
      
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
         ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: Beginning quota recovery on device (253,18) for slot 2
         ocfs2: Finishing quota recovery on device (253,18) for slot 2
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
         ------------[ cut here ]------------
         kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
         invalid opcode: 0000 [#1] SMP
         Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod    iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport      joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix               drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd       usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
         Supported: No, Unsupported modules are loaded
         CPU: 1 PID: 30154 Comm: truncate Tainted: G           OE   N  4.4.21-69-default #1
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
         task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
         RIP: 0010:[<ffffffffa05c8c30>]  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
         RSP: 0018:ffff880074e6bd50  EFLAGS: 00010282
         RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
         RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
         RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
         R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
         R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
         FS:  00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
         CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
         Call Trace:
           ocfs2_setattr+0x698/0xa90 [ocfs2]
           notify_change+0x1ae/0x380
           do_truncate+0x5e/0x90
           do_sys_ftruncate.constprop.11+0x108/0x160
           entry_SYSCALL_64_fastpath+0x12/0x6d
         Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
         RIP  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
      
      It's because ocfs2_inode_lock() gets us a stale LVB in which the i_size is
      not equal to the disk i_size.  We mistakenly trust the LVB because the
      underlying fsdlm dlm_lock() doesn't set lkb_sbflags with
      DLM_SBF_VALNOTVALID properly for us.  But, why?
      
      The current code tries to downconvert the lock without the DLM_LKF_VALBLK
      flag to tell o2cb not to update the RSB's LVB if it's a PR->NULL
      conversion, even if the lock resource type needs LVB.  This is not the
      right way for fsdlm.
      
      The fsdlm plugin behaves differently on DLM_LKF_VALBLK: it depends on
      DLM_LKF_VALBLK to decide if we care about the LVB in the LKB.  If
      DLM_LKF_VALBLK is not set, fsdlm will skip recovering the RSB's LVB from
      this lkb and setting DLM_SBF_VALNOTVALID appropriately when node
      failure happens.
      
      The following diagram briefly illustrates how this crash happens:
      
      RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
      
      The 1st round:
      
                   Node1                                    Node2
      RSB1: PR
                                                        RSB1(master): NULL->EX
      ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
        ocfs2_dlm_lock(no DLM_LKF_VALBLK)
      
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      
      dlm_lock(no DLM_LKF_VALBLK)
        convert_lock(overwrite lkb->lkb_exflags
                     with no DLM_LKF_VALBLK)
      
      RSB1: NULL                                        RSB1: EX
                                                        reset Node2
      dlm_recover_rsbs()
        recover_lvb()
      
      /* The LVB is not trustable if the node with EX fails and
       * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
       */
      
       if(!(lkb->lkb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance
                 return;                         * to invalidate the LVB here.
                                                 */
      
      The 2nd round:
      
               Node 1                                Node2
      RSB1(become master from recovery)
      
      ocfs2_setattr()
        ocfs2_inode_lock(NULL->EX)
          /* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
          ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
        ocfs2_truncate_file()
            mlog_bug_on_msg(disk isize != i_size_read(inode))  /* crash! */
      
      The fix is quite straightforward.  We keep setting the DLM_LKF_VALBLK
      flag for dlm_lock() if the lock resource type needs LVB and the fsdlm
      plugin is used.
      
      Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
      Signed-off-by: Eric Ren <zren@suse.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: wrprotect pmd_t in dax_mapping_entry_mkclean · f729c8c9
      Committed by Ross Zwisler
      Currently dax_mapping_entry_mkclean() fails to clean and write protect
      the pmd_t of a DAX PMD entry during an *sync operation.  This can result
      in data loss in the following sequence:
      
      1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
         pmd_t dirty and writeable
      2) fsync, flushing out PMD data and cleaning the radix tree entry. We
         currently fail to mark the pmd_t as clean and write protected.
      3) more mmap writes to the PMD.  These don't cause any page faults since
         the pmd_t is dirty and writeable.  The radix tree entry remains clean.
      4) fsync, which fails to flush the dirty PMD data because the radix tree
         entry was clean.
      5) crash - dirty data that should have been fsync'd as part of 4) could
         still have been in the processor cache, and is lost.
      
      Fix this by marking the pmd_t clean and write protected in
      dax_mapping_entry_mkclean(), which is called as part of the fsync
      operation 2).  This will cause the writes in step 3) above to generate
      page faults where we'll re-dirty the PMD radix tree entry, resulting in
      flushes in the fsync that happens in step 4).
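
      The sequence above can be traced with a toy state machine. This is a
      sketch for illustration only: struct toy_pmd and the toy_* functions are
      hypothetical names, not kernel interfaces, and the model tracks just the
      three bits of state that the steps talk about.

```c
#include <stdbool.h>

/* Toy model of one DAX PMD mapping; all names here are illustrative. */
struct toy_pmd {
    bool writable;      /* pmd_t allows stores without faulting */
    bool tree_dirty;    /* radix tree entry is tagged dirty */
    bool unflushed;     /* CPU cache holds data not yet flushed */
};

/* A store through the mapping: it faults (and re-dirties the tree entry)
 * only if the pmd is currently write protected. */
static void toy_mmap_write(struct toy_pmd *p)
{
    if (!p->writable) {         /* write fault path */
        p->tree_dirty = true;
        p->writable = true;
    }
    p->unflushed = true;
}

/* fsync: flush only if the tree entry is dirty; write-protect the pmd
 * afterwards only when wrprotect is set (the fixed behavior). */
static void toy_fsync(struct toy_pmd *p, bool wrprotect)
{
    if (!p->tree_dirty)
        return;
    p->unflushed = false;
    p->tree_dirty = false;
    if (wrprotect)
        p->writable = false;
}

/* Runs steps 1-4; returns true if a crash now would lose dirty data. */
bool toy_data_at_risk(bool wrprotect)
{
    struct toy_pmd p = { false, false, false };

    toy_mmap_write(&p);          /* step 1: first mmap write */
    toy_fsync(&p, wrprotect);    /* step 2: first fsync */
    toy_mmap_write(&p);          /* step 3: more mmap writes */
    toy_fsync(&p, wrprotect);    /* step 4: second fsync */
    return p.unflushed;
}
```

      In this model toy_data_at_risk(false) reports data at risk because the
      second write never faults, while toy_data_at_risk(true) does not, since
      the write-protected pmd forces a fault that re-dirties the tree entry.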
      
      Fixes: 4b4bb46d ("dax: clear dirty entry tags on cache flush")
      Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned · dd545b52
      Committed by Chandan Rajendra
      The code currently uses sdio->blkbits to compute the number of blocks to
      be cleaned. However sdio->blkbits is derived from the logical block size
      of the underlying block device (Refer to the definition of
      do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
      fail when executed on an ext4 filesystem with 64k as the block size and
      when using a virtio based disk (having 512 byte as the logical block
      size) inside a kvm guest.
      
      This commit fixes the bug by using inode->i_blkbits to compute the
      number of blocks to be cleaned.
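
      To see why the choice of shift matters, here is a small arithmetic sketch.
      It is not the kernel code: blocks_for_bytes() is a hypothetical helper
      that only shows how the same byte range maps to very different block
      counts for a 64k filesystem block (16 bits) and a 512-byte logical block
      (9 bits).

```c
#include <stdint.h>

/* Number of blocks of size (1 << blkbits) needed to cover `bytes` bytes. */
uint64_t blocks_for_bytes(uint64_t bytes, unsigned int blkbits)
{
    return (bytes + (UINT64_C(1) << blkbits) - 1) >> blkbits;
}
```

      For a 65536-byte range, blocks_for_bytes(65536, 16) gives 1 while
      blocks_for_bytes(65536, 9) gives 128, so a count derived from the device
      block size is off by a factor of 128 when interpreted in filesystem
      blocks.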
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      
      Fixed up by Jeff Moyer to only use/evaluate inode->i_blkbits once,
      to avoid issues with block size changes with IO in flight.
      Signed-off-by: Jens Axboe <axboe@fb.com>
  9. 10 Jan, 2017 9 commits
    • tmpfs: clear S_ISGID when setting posix ACLs · 497de07d
      Committed by Gu Zheng
      The tmpfs modification was missed in the CVE-2016-7097 fix,
      commit 07393101 ("posix_acl: Clear SGID bit when setting
      file permissions").
      It can be tested with xfstest generic/375, which failed to clear
      the setgid bit in the following test case on tmpfs:
      
        touch $testfile
        chown 100:100 $testfile
        chmod 2755 $testfile
        _runas -u 100 -g 101 -- setfacl -m u::rwx,g::rwx,o::rwx $testfile
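
      The rule exercised by this test case can be sketched as a pure function.
      This is a simplified userspace illustration, not the kernel
      implementation: toy_acl_update_mode() is a hypothetical name, the group
      check ignores supplementary groups, and the caller_has_fsetid flag stands
      in for a CAP_FSETID capability check.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Return the mode to store after an ACL change: the setgid bit is dropped
 * when the caller is not in the file's group and lacks the (assumed)
 * CAP_FSETID-style privilege.
 */
mode_t toy_acl_update_mode(mode_t mode, gid_t file_gid, gid_t caller_gid,
                           bool caller_has_fsetid)
{
    if (caller_gid != file_gid && !caller_has_fsetid)
        mode &= ~(mode_t)S_ISGID;
    return mode;
}
```

      With the values from the test case (mode 2755, file group 100, caller
      group 101, no extra privilege) the function returns 0755, which is the
      outcome generic/375 expects.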
      Signed-off-by: Gu Zheng <guzheng1@huawei.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • sysctl: Drop reference added by grab_header in proc_sys_readdir · 93362fa4
      Committed by Zhou Chengming
      Fixes CVE-2016-9191: proc_sys_readdir doesn't drop the reference
      added by grab_header when returning from the !dir_emit_dots path.
      This can cause any path that calls unregister_sysctl_table to
      wait forever.
      
      The calltrace of CVE-2016-9191:
      
      [ 5535.960522] Call Trace:
      [ 5535.963265]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
      [ 5535.968817]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
      [ 5535.975346]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
      [ 5535.982256]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
      [ 5535.988972]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
      [ 5535.994804]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
      [ 5536.001227]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
      [ 5536.007648]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
      [ 5536.014654]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
      [ 5536.021657]  [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40
      [ 5536.029344]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
      [ 5536.036447]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
      [ 5536.043844]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
      [ 5536.051336]  [<ffffffff8116789d>] update_flag+0x11d/0x210
      [ 5536.057373]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
      [ 5536.064186]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
      [ 5536.070899]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
      [ 5536.077420]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
      [ 5536.084234]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
      [ 5536.091049]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
      [ 5536.097571]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
      [ 5536.104207]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
      [ 5536.110736]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
      [ 5536.117461]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
      [ 5536.123697]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
      [ 5536.130426]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
      [ 5536.135991]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
      [ 5536.142041]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
      
      One cgroup maintainer mentioned that "cgroup is trying to offline
      a cpuset css, which takes place under cgroup_mutex.  The offlining
      ends up trying to drain active usages of a sysctl table which apprently
      is not happening."
      The real reason is that proc_sys_readdir doesn't drop the reference added
      by grab_header when returning from the !dir_emit_dots path. So this cpuset
      offline path will wait here forever.
      
      See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13
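
      The shape of the leak can be shown with a toy reference counter. This is
      a sketch only: struct toy_header and the toy_* functions are hypothetical
      and merely mirror the grab/early-return/drop pattern described above,
      not the actual proc_sys_readdir code.

```c
#include <stdbool.h>

/* Toy stand-in for a sysctl table header with a usage count. */
struct toy_header {
    int used;
};

static void toy_grab(struct toy_header *h) { h->used++; }
static void toy_drop(struct toy_header *h) { h->used--; }

/* Buggy shape: the early return skips toy_drop(). */
int toy_readdir_buggy(struct toy_header *h, bool emit_dots_ok)
{
    toy_grab(h);
    if (!emit_dots_ok)
        return 0;               /* reference leaked here */
    /* ... emit directory entries ... */
    toy_drop(h);
    return 0;
}

/* Fixed shape: every path leaves through the label that drops the ref. */
int toy_readdir_fixed(struct toy_header *h, bool emit_dots_ok)
{
    toy_grab(h);
    if (!emit_dots_ok)
        goto out;
    /* ... emit directory entries ... */
out:
    toy_drop(h);
    return 0;
}
```

      After one failed call, the buggy variant leaves used at 1, so anything
      that waits for the count to reach zero never finishes; the fixed variant
      leaves it at 0.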
      
      Fixes: f0c3b509 ("[readdir] convert procfs")
      Cc: stable@vger.kernel.org
      Reported-by: CAI Qian <caiqian@redhat.com>
      Tested-by: Yang Shukui <yangshukui@huawei.com>
      Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount · 75422726
      Committed by Eric W. Biederman
      Add MS_KERNMOUNT to the flags that are passed.
      Use sget_userns and force &init_user_ns instead of calling sget so that
      even if called from a weird context the internal filesystem will be
      considered to be in the initial user namespace.
      
      Luis Ressel reported that the failure to pass MS_KERNMOUNT into
      mount_pseudo broke his in-development graphics driver that uses the
      generic drm infrastructure.  I am not certain the driver was bug
      free in its usage of that infrastructure, but since
      mount_pseudo_xattr can never be triggered by userspace it is clearer,
      less error prone, and less problematic for the code to be explicit.
      Reported-by: Luis Ressel <aranea@aixah.de>
      Tested-by: Luis Ressel <aranea@aixah.de>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • mnt: Protect the mountpoint hashtable with mount_lock · 3895dbf8
      Committed by Eric W. Biederman
      Protecting the mountpoint hashtable with namespace_sem was sufficient
      until a call to umount_mnt was added to mntput_no_expire, at which
      point it became possible for multiple calls of put_mountpoint on
      the same hash chain to happen at the same time.
      
      Krister Johansen <kjlx@templeofstupid.com> reported:
      > This can cause a panic when simultaneous callers of put_mountpoint
      > attempt to free the same mountpoint.  This occurs because some callers
      > hold the mount_hash_lock, while others hold the namespace lock.  Some
      > even hold both.
      >
      > In this submitter's case, the panic manifested itself as a GP fault in
      > put_mountpoint() when it called hlist_del() and attempted to dereference
      > a m_hash.pprev that had been poisoned by another thread.
      
      Al Viro observed that the simple fix is to switch from using the namespace_sem
      to the mount_lock to protect the mountpoint hash table.
      
      I have taken Al's suggested patch, moved put_mountpoint in pivot_root
      (instead of taking mount_lock an additional time), and replaced
      new_mountpoint with get_mountpoint, a function that does the hash table
      lookup and addition under the mount_lock.   The introduction of get_mountpoint
      ensures that only the mount_lock is needed to manipulate the mountpoint
      hashtable.
      
      d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
      already set.  This allows get_mountpoint to use the setting of
      DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
      happens exactly once.
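
      The "set only if not already set" idea can be sketched with C11 atomics.
      This is an illustration under stated assumptions: TOY_MOUNTED and
      toy_set_mounted() are hypothetical names, and an atomic fetch-or is
      swapped in here in place of whatever locking the kernel's d_set_mounted
      actually uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_MOUNTED 0x1u    /* hypothetical stand-in for DCACHE_MOUNTED */

/*
 * Set the flag and report whether this call was the one that set it.
 * Exactly one caller observes `true`, so only that caller would go on to
 * add the mountpoint structure.
 */
bool toy_set_mounted(atomic_uint *flags)
{
    unsigned int old = atomic_fetch_or(flags, TOY_MOUNTED);

    return !(old & TOY_MOUNTED);
}
```

      A second call on the same flags word returns false, which is how a caller
      can tell that the entry was already claimed.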
      
      Cc: stable@vger.kernel.org
      Fixes: ce07d891 ("mnt: Honor MNT_LOCKED when detaching mounts")
      Reported-by: Krister Johansen <kjlx@templeofstupid.com>
      Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • xfs: don't print warnings when xfs_log_force fails · 84a4620c
      Committed by Christoph Hellwig
      There are only two reasons for xfs_log_force / xfs_log_force_lsn to fail:
      one is an I/O error, for which xlog_bdstrat already logs a warning, and
      the second is an already shut down log due to a previous I/O error.  In
      the latter case we'll already have a previous indication for the actual
      error, but the large stream of misleading warnings from xfs_log_force
      will probably scroll it out of the message buffer.
      
      Simply removing the warnings thus makes the XFS log reporting significantly
      better.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: don't rely on ->total in xfs_alloc_space_available · 12ef8301
      Committed by Christoph Hellwig
      ->total is a bit of an odd parameter passed down to the low-level
      allocator all the way from the high-level callers.  It's supposed to
      contain the maximum number of blocks to be allocated for the whole
      transaction [1].
      
      But in xfs_iomap_write_allocate we only convert existing delayed
      allocations and thus only have a minimal block reservation for the
      current transaction, so xfs_alloc_space_available can't use it for
      the allocation decisions.  Use the maximum of args->total and the
      calculated block requirement to make a decision.  We probably should
      get rid of args->total eventually and instead apply ->minleft more
      broadly, but that will require some extensive changes all over.
      
      [1] which creates lots of confusion as most callers don't decrement it
      once doing a first allocation.  But that's for a separate series.
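
      The decision rule described above, taking the larger of the caller's
      total and the calculated requirement, can be sketched as follows. This
      is a toy model only: toy_space_available() and its parameters are
      hypothetical and leave out the reservations and alignment terms that the
      real allocator accounts for.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether an allocation may proceed: the free space must cover
 * whichever is larger, the transaction-wide total or the block
 * requirement calculated for this allocation.
 */
bool toy_space_available(uint64_t free_blocks, uint64_t total,
                         uint64_t calculated_need)
{
    uint64_t required = total > calculated_need ? total : calculated_need;

    return free_blocks >= required;
}
```

      With a minimal total of 1 and a calculated need of 12, 10 free blocks
      are correctly rejected, whereas a check against total alone would have
      let the allocation through.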
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: adjust allocation length in xfs_alloc_space_available · 54fee133
      Committed by Christoph Hellwig
      We must decide in xfs_alloc_fix_freelist whether an allocation
      from a given AG is possible based on the available
      space, and should not fail the allocation past that point on a
      healthy file system.
      
      But currently we have two additional places that second-guess
      xfs_alloc_fix_freelist: xfs_alloc_ag_vextent tries to adjust the
      maxlen parameter to remove the reservation before doing the
      allocation (but ignores the various minimum freespace requirements),
      and xfs_alloc_fix_minleft tries to fix up the allocated length
      after we've found an extent, but ignores the reservations and also
      doesn't take the AGFL into account (and thus fails allocations
      for not matching minlen in some cases).
      
      Remove all these later fixups and just correct the maxlen argument
      inside xfs_alloc_fix_freelist once we have the AGF buffer locked.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: fix bogus minleft manipulations · 255c5162
      Committed by Christoph Hellwig
      We can't just set minleft to 0 when we're low on space - that's exactly
      what we need minleft for: to protect space in the AG for btree block
      allocations when we are low on free space.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: bump up reserved blocks in xfs_alloc_set_aside · 5149fd32
      Committed by Christoph Hellwig
      Setting aside 4 blocks globally for bmbt splits isn't all that useful,
      as different threads can allocate space in parallel.  Bump it to 4
      blocks per AG to allow each thread that is currently doing an
      allocation to dip into it separately.  Without that we may not have
      enough reserved blocks if there are enough parallel transactions
      in an almost out of space file system that all run into bmap btree
      splits.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  10. 09 Jan, 2017 1 commit