1. 06 5月, 2016 23 次提交
    • Z
      btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes · 2f3165ec
      Zygo Blaxell 提交于
      During a mount, we start the cleaner kthread first because the transaction
      kthread wants to wake up the cleaner kthread.  We start the transaction
      kthread next because everything in btrfs wants transactions.  We do reloc
      recovery in the thread that was doing the original mount call once the
      transaction kthread is running.  This means that the cleaner kthread
      could already be running when reloc recovery happens (e.g. if a snapshot
      delete was started before a crash).
      
      Relocation does not play well with the cleaner kthread, so a mutex was
      added in commit 5f316481 "Btrfs: fix
      race between balance recovery and root deletion" to prevent both from
      being active at the same time.
      
      If the cleaner kthread is already holding the mutex by the time we get
      to btrfs_recover_relocation, the mount will be blocked until at least
      one deleted subvolume is cleaned (possibly more if the mount process
      doesn't get the lock right away).  During this time (which could be an
      arbitrarily long time on a large/slow filesystem), the mount process is
      stuck and the filesystem is unnecessarily inaccessible.
      
      Fix this by locking cleaner_mutex before we start cleaner_kthread, and
      unlocking the mutex after mount no longer requires it.  This ensures
      that the mounting process will not be blocked by the cleaner kthread.
      The cleaner kthread is already prepared for mutex contention and will
      just go to sleep until the mutex is available.
      Signed-off-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2f3165ec
    • D
      btrfs: ioctl: reorder exclusive op check in RM_DEV · 58d7bbf8
      David Sterba 提交于
      Move the op exclusivity check before the other code (same as in
      ADD_DEV).
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      58d7bbf8
    • D
      btrfs: add write protection to SET_FEATURES ioctl · 7ab19625
      David Sterba 提交于
      Perform the want_write check if we get far enough to do any writes.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7ab19625
    • A
      btrfs: fix lock dep warning move scratch super outside of chunk_mutex · 48b3b9d4
      Anand Jain 提交于
      Move scratch super outside of the chunk lock to avoid below
      lockdep warning. The better place to scratch super is in
      the function btrfs_rm_dev_replace_free_srcdev() just before
      free_device, which is outside of the chunk lock as well.
      
      To reproduce:
        (fresh boot)
        mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdd /dev/sde
        mount /dev/sdc /btrfs
        dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
        (get devmgt from https://github.com/asj/devmgt.git)
        devmgt detach /dev/sde
        dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
        sync
        btrfs replace start -Brf 3 /dev/sdf /btrfs <--
        devmgt attach host7
      
      ======================================================
      [ INFO: possible circular locking dependency detected ]
      4.6.0-rc2asj+ #1 Not tainted
      ---------------------------------------------------
      
      btrfs/2174 is trying to acquire lock:
      (sb_writers){.+.+.+}, at:
      [<ffffffff812449b4>] __sb_start_write+0xb4/0xf0
      
      but task is already holding lock:
      (&fs_info->chunk_mutex){+.+.+.}, at:
      [<ffffffffa05c5f55>] btrfs_dev_replace_finishing+0x145/0x980 [btrfs]
      
      which lock already depends on the new lock.
      
      Chain exists of:
      sb_writers --> &fs_devs->device_list_mutex --> &fs_info->chunk_mutex
      Possible unsafe locking scenario:
      CPU0				CPU1
      ----				----
      lock(&fs_info->chunk_mutex);
      				lock(&fs_devs->device_list_mutex);
      				lock(&fs_info->chunk_mutex);
      lock(sb_writers);
      
      *** DEADLOCK ***
      
      -> #0 (sb_writers){.+.+.+}:
      [<ffffffff810e6415>] __lock_acquire+0x1bc5/0x1ee0
      [<ffffffff810e707e>] lock_acquire+0xbe/0x210
      [<ffffffff810df49a>] percpu_down_read+0x4a/0xa0
      [<ffffffff812449b4>] __sb_start_write+0xb4/0xf0
      [<ffffffff81265534>] mnt_want_write+0x24/0x50
      [<ffffffff812508a2>] path_openat+0x952/0x1190
      [<ffffffff81252451>] do_filp_open+0x91/0x100
      [<ffffffff8123f5cc>] file_open_name+0xfc/0x140
      [<ffffffff8123f643>] filp_open+0x33/0x60
      [<ffffffffa0572bb6>] update_dev_time+0x16/0x40 [btrfs]
      [<ffffffffa057f60d>] btrfs_scratch_superblocks+0x5d/0xb0 [btrfs]
      [<ffffffffa057f70e>] btrfs_rm_dev_replace_remove_srcdev+0xae/0xd0 [btrfs]
      [<ffffffffa05c62c5>] btrfs_dev_replace_finishing+0x4b5/0x980 [btrfs]
      [<ffffffffa05c6ae8>] btrfs_dev_replace_start+0x358/0x530 [btrfs]
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      48b3b9d4
    • A
      btrfs: Fix BUG_ON condition in scrub_setup_recheck_block() · 24731149
      Ashish Samant 提交于
      pagev array in scrub_block{} is of size SCRUB_MAX_PAGES_PER_BLOCK.
      page_index should be checked with the same to trigger BUG_ON().
      Signed-off-by: NAshish Samant <ashish.samant@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      24731149
    • J
      Btrfs: remove BUG_ON()'s in btrfs_map_block · e042d1ec
      Josef Bacik 提交于
      btrfs_map_block can go horribly wrong in the face of fs corruption, lets agree
      to not be assholes and panic at any possible chance things are all fucked up.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      [ removed type casts ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e042d1ec
    • L
      Btrfs: fix divide error upon chunk's stripe_len · 3d8da678
      Liu Bo 提交于
      The struct 'map_lookup' uses type int for @stripe_len, while
      btrfs_chunk_stripe_len() can return a u64 value, and it may end up with
      @stripe_len being undefined value and it can lead to 'divide error' in
       __btrfs_map_block().
      
      This changes 'map_lookup' to use type u64 for stripe_len, also right now
      we only use BTRFS_STRIPE_LEN for stripe_len, so this adds a valid checker for
      BTRFS_STRIPE_LEN.
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      Reported-by: NQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ folded division fix to scrub_raid56_parity ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3d8da678
    • D
      btrfs: sysfs: protect reading label by lock · ee17fc80
      David Sterba 提交于
      If the label setting ioctl races with sysfs label handler, we could get
      mixed result in the output, part old part new. We should either get the
      old or new label. The chances to hit this race are low.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ee17fc80
    • D
      btrfs: add check to sysfs handler of label · 66ac9fe7
      David Sterba 提交于
      Add a sanity check for the fs_info as we will dereference it, similar to
      what the 'store features' handler does.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      66ac9fe7
    • D
      btrfs: add read-only check to sysfs handler of features · ee611138
      David Sterba 提交于
      We don't want to trigger the change on a read-only filesystem, similar
      to what the label handler does.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      ee611138
    • D
      btrfs: reuse existing variable in scrub_stripe, reduce stack usage · e6c11f9a
      David Sterba 提交于
      The key variable occupies 17 bytes, the key_start is used once, we can
      simply reuse existing 'key' for that purpose. As the key is not a simple
      type, compiler doest not do it on itself.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e6c11f9a
    • D
      btrfs: use dynamic allocation for root item in create_subvol · 49a3c4d9
      David Sterba 提交于
      The size of root item is more than 400 bytes, which is quite a lot of
      stack space. As we do IO from inside the subvolume ioctls, we should
      keep the stack usage low in case the filesystem is on top of other
      layers (NFS, device mapper, iscsi, etc).
      Reviewed-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      49a3c4d9
    • D
      15351955
    • D
      2f91306a
    • D
      c03d01f3
    • D
      btrfs: send: use temporary variable to store allocation size · e55d1153
      David Sterba 提交于
      We're going to use the argument multiple times later.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e55d1153
    • D
      eb5b75fe
    • D
      6ff48ce0
    • A
      btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex · 779bf3fe
      Anand Jain 提交于
      When the replace target fails, the target device will be taken
      out of fs device list, scratch + update_dev_time and freed. However
      we could do the scratch  + update_dev_time and free part after the
      device has been taken out of device list, so that we don't have to
      hold the device_list_mutex and uuid_mutex locks.
      
      Reported issue:
      
      [ 5375.718845] ======================================================
      [ 5375.718846] [ INFO: possible circular locking dependency detected ]
      [ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted
      [ 5375.718849] -------------------------------------------------------
      [ 5375.718851] btrfs-health/4662 is trying to acquire lock:
      [ 5375.718861]  (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
      [ 5375.718862]
      [ 5375.718862] but task is already holding lock:
      [ 5375.718907]  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
      [ 5375.718907]
      [ 5375.718907] which lock already depends on the new lock.
      [ 5375.718907]
      [ 5375.718908]
      [ 5375.718908] the existing dependency chain (in reverse order) is:
      [ 5375.718911]
      [ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}:
      [ 5375.718917]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
      [ 5375.718921]        [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
      [ 5375.718940]        [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs]
      [ 5375.718945]        [<ffffffff81267079>] show_vfsmnt+0x49/0x150
      [ 5375.718948]        [<ffffffff81240b07>] m_show+0x17/0x20
      [ 5375.718951]        [<ffffffff81246868>] seq_read+0x2d8/0x3b0
      [ 5375.718955]        [<ffffffff8121df28>] __vfs_read+0x28/0xd0
      [ 5375.718959]        [<ffffffff8121e806>] vfs_read+0x86/0x130
      [ 5375.718962]        [<ffffffff8121f4c9>] SyS_read+0x49/0xa0
      [ 5375.718966]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
      [ 5375.718968]
      [ 5375.718968] -> #2 (namespace_sem){+++++.}:
      [ 5375.718971]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
      [ 5375.718974]        [<ffffffff81635199>] down_write+0x49/0x80
      [ 5375.718977]        [<ffffffff81243593>] lock_mount+0x43/0x1c0
      [ 5375.718979]        [<ffffffff81243c13>] do_add_mount+0x23/0xd0
      [ 5375.718982]        [<ffffffff81244afb>] do_mount+0x27b/0xe30
      [ 5375.718985]        [<ffffffff812459dc>] SyS_mount+0x8c/0xd0
      [ 5375.718988]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
      [ 5375.718991]
      [ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}:
      [ 5375.718994]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
      [ 5375.718996]        [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
      [ 5375.719001]        [<ffffffff8122d608>] path_openat+0x468/0x1360
      [ 5375.719004]        [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
      [ 5375.719007]        [<ffffffff8121da7b>] do_sys_open+0x12b/0x210
      [ 5375.719010]        [<ffffffff8121db7e>] SyS_open+0x1e/0x20
      [ 5375.719013]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
      [ 5375.719015]
      [ 5375.719015] -> #0 (sb_writers){.+.+.+}:
      [ 5375.719018]        [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
      [ 5375.719021]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
      [ 5375.719026]        [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
      [ 5375.719028]        [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
      [ 5375.719031]        [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
      [ 5375.719035]        [<ffffffff8122ded2>] path_openat+0xd32/0x1360
      [ 5375.719037]        [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
      [ 5375.719040]        [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
      [ 5375.719043]        [<ffffffff8121d923>] filp_open+0x33/0x60
      [ 5375.719073]        [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
      [ 5375.719099]        [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
      [ 5375.719123]        [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
      [ 5375.719150]        [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
      [ 5375.719175]        [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
      [ 5375.719199]        [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
      [ 5375.719222]        [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
      [ 5375.719225]        [<ffffffff810a70df>] kthread+0xef/0x110
      [ 5375.719229]        [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
      [ 5375.719230]
      [ 5375.719230] other info that might help us debug this:
      [ 5375.719230]
      [ 5375.719233] Chain exists of:
      [ 5375.719233]   sb_writers --> namespace_sem --> &fs_devs->device_list_mutex
      [ 5375.719233]
      [ 5375.719234]  Possible unsafe locking scenario:
      [ 5375.719234]
      [ 5375.719234]        CPU0                    CPU1
      [ 5375.719235]        ----                    ----
      [ 5375.719236]   lock(&fs_devs->device_list_mutex);
      [ 5375.719238]                                lock(namespace_sem);
      [ 5375.719239]                                lock(&fs_devs->device_list_mutex);
      [ 5375.719241]   lock(sb_writers);
      [ 5375.719241]
      [ 5375.719241]  *** DEADLOCK ***
      [ 5375.719241]
      [ 5375.719243] 4 locks held by btrfs-health/4662:
      [ 5375.719266]  #0:  (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
      [ 5375.719293]  #1:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
      [ 5375.719319]  #2:  (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
      [ 5375.719343]  #3:  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
      [ 5375.719343]
      [ 5375.719343] stack backtrace:
      [ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
      [ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
      [ 5375.719352]  0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
      [ 5375.719354]  ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
      [ 5375.719357]  ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
      [ 5375.719357] Call Trace:
      [ 5375.719363]  [<ffffffff813529e3>] dump_stack+0x85/0xc2
      [ 5375.719366]  [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
      [ 5375.719369]  [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
      [ 5375.719373]  [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
      [ 5375.719376]  [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
      [ 5375.719378]  [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
      [ 5375.719383]  [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
      [ 5375.719385]  [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
      [ 5375.719387]  [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
      [ 5375.719389]  [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
      [ 5375.719393]  [<ffffffff8122ded2>] path_openat+0xd32/0x1360
      [ 5375.719415]  [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
      [ 5375.719418]  [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
      [ 5375.719420]  [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
      [ 5375.719423]  [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
      [ 5375.719426]  [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
      [ 5375.719430]  [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
      [ 5375.719433]  [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
      [ 5375.719436]  [<ffffffff8121d923>] filp_open+0x33/0x60
      [ 5375.719462]  [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
      [ 5375.719485]  [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
      [ 5375.719506]  [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
      [ 5375.719530]  [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
      [ 5375.719554]  [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
      [ 5375.719576]  [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
      [ 5375.719598]  [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
      [ 5375.719621]  [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
      [ 5375.719641]  [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
      [ 5375.719661]  [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
      [ 5375.719663]  [<ffffffff810a70df>] kthread+0xef/0x110
      [ 5375.719666]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
      [ 5375.719669]  [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
      [ 5375.719672]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
      [ 5375.719697] ------------[ cut here ]------------
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reported-by: NYauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      779bf3fe
    • D
      btrfs: send: silence an integer overflow warning · f5ecec3c
      Dan Carpenter 提交于
      The "sizeof(*arg->clone_sources) * arg->clone_sources_count" expression
      can overflow.  It causes several static checker warnings.  It's all
      under CAP_SYS_ADMIN so it's not that serious but lets silence the
      warnings.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f5ecec3c
    • L
      btrfs: avoid overflowing f_bfree · 41b34acc
      Luis de Bethencourt 提交于
      Since mixed block groups accounting isn't byte-accurate and f_bree is an
      unsigned integer, it could overflow. Avoid this.
      Signed-off-by: NLuis de Bethencourt <luisbg@osg.samsung.com>
      Suggested-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      41b34acc
    • L
      btrfs: fix mixed block count of available space · ae02d1bd
      Luis de Bethencourt 提交于
      Metadata for mixed block is already accounted in total data and should not
      be counted as part of the free metadata space.
      Signed-off-by: NLuis de Bethencourt <luisbg@osg.samsung.com>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=114281Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ae02d1bd
    • A
      btrfs: allow balancing to dup with multi-device · 88be159c
      Austin S. Hemmelgarn 提交于
      Currently, we don't allow the user to try and rebalance to a dup profile
      on a multi-device filesystem.  In most cases, this is a perfectly sensible
      restriction as raid1 uses the same amount of space and provides better
      protection.
      
      However, when reshaping a multi-device filesystem down to a single device
      filesystem, this requires the user to convert metadata and system chunks
      to single profile before deleting devices, and then convert again to dup,
      which leaves a period of time where metadata integrity is reduced.
      
      This patch removes the single-device-only restriction from converting to
      dup profile to remove this potential data integrity reduction.
      Signed-off-by: NAustin S. Hemmelgarn <ahferroin7@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      88be159c
  2. 28 4月, 2016 5 次提交
  3. 19 4月, 2016 1 次提交
    • L
      devpts: clean up interface to pty drivers · 67245ff3
      Linus Torvalds 提交于
      This gets rid of the horrible notion of having that
      
          struct inode *ptmx_inode
      
      be the linchpin of the interface between the pty code and devpts.
      
      By de-emphasizing the ptmx inode, a lot of things actually get cleaner,
      and we will have a much saner way forward.  In particular, this will
      allow us to associate with any particular devpts instance at open-time,
      and not be artificially tied to one particular ptmx inode.
      
      The patch itself is actually fairly straightforward, and apart from some
      locking and return path cleanups it's pretty mechanical:
      
       - the interfaces that devpts exposes all take "struct pts_fs_info *"
         instead of "struct inode *ptmx_inode" now.
      
         NOTE! The "struct pts_fs_info" thing is a completely opaque structure
         as far as the pty driver is concerned: it's still declared entirely
         internally to devpts. So the pty code can't actually access it in any
         way, just pass it as a "cookie" to the devpts code.
      
       - the "look up the pts fs info" is now a single clear operation, that
         also does the reference count increment on the pts superblock.
      
         So "devpts_add/del_ref()" is gone, and replaced by a "lookup and get
         ref" operation (devpts_get_ref(inode)), along with a "put ref" op
         (devpts_put_ref()).
      
       - the pty master "tty->driver_data" field now contains the pts_fs_info,
         not the ptmx inode.
      
       - because we don't care about the ptmx inode any more as some kind of
         base index, the ref counting can now drop the inode games - it just
         gets the ref on the superblock.
      
       - the pts_fs_info now has a back-pointer to the super_block. That's so
         that we can easily look up the information we actually need. Although
         quite often, the pts fs info was actually all we wanted, and not having
         to look it up based on some magical inode makes things more
         straightforward.
      
      In particular, now that "devpts_get_ref(inode)" operation should really
      be the *only* place we need to look up what devpts instance we're
      associated with, and we do it exactly once, at ptmx_open() time.
      
      The other side of this is that one ptmx node could now be associated
      with multiple different devpts instances - you could have a single
      /dev/ptmx node, and then have multiple mount namespaces with their own
      instances of devpts mounted on /dev/pts/.  And that's all perfectly sane
      in a model where we just look up the pts instance at open time.
      
      This will eventually allow us to get rid of our odd single-vs-multiple
      pts instance model, but this patch in itself changes no semantics, only
      an internal binding model.
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Aurelien Jarno <aurelien@aurel32.net>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Jann Horn <jann@thejh.net>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      67245ff3
  4. 15 4月, 2016 1 次提交
    • L
      Make file credentials available to the seqfile interfaces · 34dbbcdb
      Linus Torvalds 提交于
      A lot of seqfile users seem to be using things like %pK that uses the
      credentials of the current process, but that is actually completely
      wrong for filesystem interfaces.
      
      The unix semantics for permission checking files is to check permissions
      at _open_ time, not at read or write time, and that is not just a small
      detail: passing off stdin/stdout/stderr to a suid application and making
      the actual IO happen in privileged context is a classic exploit
      technique.
      
      So if we want to be able to look at permissions at read time, we need to
      use the file open credentials, not the current ones.  Normal file
      accesses can just use "f_cred" (or any of the helper functions that do
      that, like file_ns_capable()), but the seqfile interfaces do not have
      any such options.
      
      It turns out that seq_file _does_ save away the user_ns information of
      the file, though.  Since user_ns is just part of the full credential
      information, replace that special case with saving off the cred pointer
      instead, and suddenly seq_file has all the permission information it
      needs.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34dbbcdb
  5. 13 4月, 2016 5 次提交
  6. 11 4月, 2016 1 次提交
    • L
      Revert "ext4: allow readdir()'s of large empty directories to be interrupted" · 9f2394c9
      Linus Torvalds 提交于
      This reverts commit 1028b55b.
      
      It's broken: it makes ext4 return an error at an invalid point, causing
      the readdir wrappers to write the the position of the last successful
      directory entry into the position field, which means that the next
      readdir will now return that last successful entry _again_.
      
      You can only return fatal errors (that terminate the readdir directory
      walk) from within the filesystem readdir functions, the "normal" errors
      (that happen when the readdir buffer fills up, for example) happen in
      the iterorator where we know the position of the actual failing entry.
      
      I do have a very different patch that does the "signal_pending()"
      handling inside the iterator function where it is allowable, but while
      that one passes all the sanity checks, I screwed up something like four
      times while emailing it out, so I'm not going to commit it today.
      
      So my track record is not good enough, and the stars will have to align
      better before that one gets committed.  And it would be good to get some
      review too, of course, since celestial alignments are always an iffy
      debugging model.
      
      IOW, let's just revert the commit that caused the problem for now.
      Reported-by: NGreg Thelen <gthelen@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f2394c9
  7. 09 4月, 2016 4 次提交