1. 21 3月, 2015 3 次提交
  2. 04 3月, 2015 2 次提交
  3. 25 2月, 2015 1 次提交
  4. 23 2月, 2015 34 次提交
    • G
      Add new disk to clustered array · 1aee41f6
      Goldwyn Rodrigues 提交于
      Algorithm:
      1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
         ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD)
      2. Node 1 sends NEWDISK with uuid and slot number
      3. Other nodes issue kobject_uevent_env with uuid and slot number
      (Steps 4,5 could be a udev rule)
      4. In userspace, the node searches for the disk, perhaps
         using blkid -t SUB_UUID=""
      5. Other nodes issue either of the following depending on whether the disk
         was found:
         ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
      	 disc.number set to slot number)
         ioctl(CLUSTERED_DISK_NACK)
      6. Other nodes drop lock on no-new-devs (CR) if device is found
      7. Node 1 attempts EX lock on no-new-devs
      8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
         as SpareLocal
      9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED
      10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message.
      Signed-off-by: NLidong Zhong <lzhong@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      1aee41f6
    • G
      Read from the first device when an area is resyncing · 7d49ffcf
      Goldwyn Rodrigues 提交于
      set choose_first true for cluster read in read balance when the area
      is resyncing.
      Signed-off-by: NLidong Zhong <lzhong@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      7d49ffcf
    • G
      Suspend writes in RAID1 if within range · 589a1c49
      Goldwyn Rodrigues 提交于
      If there is a resync going on, all nodes must suspend writes to the
      range. This is recorded in the suspend_info/suspend_list.
      
      If there is an I/O within the ranges of any of the suspend_info,
      should_suspend will return 1.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      589a1c49
    • G
      Resync start/Finish actions · e59721cc
      Goldwyn Rodrigues 提交于
      When a RESYNC_START message arrives, the node removes the entry
      with the current slot number and adds the range to the
      suspend_list.
      
      Simlarly, when a RESYNC_FINISHED message is received, node clears
      entry with respect to the bitmap number.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      e59721cc
    • G
      Send RESYNCING while performing resync start/stop · 965400eb
      Goldwyn Rodrigues 提交于
      When a resync is initiated, RESYNCING message is sent to all active
      nodes with the range (lo,hi). When the resync is over, a RESYNCING
      message is sent with (0,0). A high sector value of zero indicates
      that the resync is over.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      965400eb
    • G
      Reload superblock if METADATA_UPDATED is received · 1d7e3e96
      Goldwyn Rodrigues 提交于
      Re-reads the devices by invalidating the cache.
      Since we don't write to faulty devices, this is detected using
      events recorded in the devices. If it is old as compared to the mddev
      mark it is faulty.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      1d7e3e96
    • G
      metadata_update sends message to other nodes · 293467aa
      Goldwyn Rodrigues 提交于
         - request to send a message
         - make changes to superblock
         - send messages telling everyone that the superblock has changed
         - other nodes all read the superblock
         - other nodes all ack the messages
         - updating node release the "I'm sending a message" resource.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      293467aa
    • G
      Communication Framework: Sending functions · 601b515c
      Goldwyn Rodrigues 提交于
      The sending part is split in two functions to make sure
      atomicity of the operations, such as the MD superblock update.
      Signed-off-by: NLidong Zhong <lzhong@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      601b515c
    • G
      Communication Framework: Receiving · 4664680c
      Goldwyn Rodrigues 提交于
      1. receive status
      
         sender                         receiver                   receiver
         ACK:CR                          ACK:CR                     ACK:CR
      
      2. sender get EX of TOKEN
         sender get EX of MESSAGE
         sender                          receiver                   receiver
         TOKEN:EX                         ACK:CR                     ACK:CR
         MESSAGE:EX
         ACK:CR
      
      3. sender write LVB.
         sender down-convert MESSAGE from EX to CR
         sender try to get EX of ACK
         [ wait until all receiver has *processed* the MESSAGE ]
      
                                           [ triggered by bast of ACK ]
                                           receiver get CR of MESSAGE
                                           receiver read LVB
                                           receiver processes the message
      				     [ wait finish ]
                                           receiver release ACK
      
         sender                         receiver                   receiver
         TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR
         MESSAGE:CR
         ACK:EX
      
      4. sender down-convert ACK from EX to CR
         sender release MESSAGE
         sender release TOKEN
      				  receiver upconvert to EX of MESSAGE
                                        receiver get CR of ACK
      				  receiver release MESSAGE
      
         sender                        receiver                   receiver
         ACK:CR                         ACK:CR                     ACK:CR
      Signed-off-by: NLidong Zhong <lzhong@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      4664680c
    • G
      Perform resync for cluster node failure · 4b26a08a
      Goldwyn Rodrigues 提交于
      If bitmap_copy_slot returns hi>0, we need to perform resync.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      4b26a08a
    • G
      Initiate recovery on node failure · e94987db
      Goldwyn Rodrigues 提交于
      The DLM informs us in case of node failure with the DLM slot number.
      cluster_info->recovery_map sets the bit corresponding to the slot number
      and wakes up the recovery thread.
      
      The recovery thread:
      1. Derives the slot number from the recovery_map
      2. Locks the bitmap corresponding to the slot
      3. Copies the set bits to the node-local bitmap
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      e94987db
    • G
      Copy set bits from another slot · 11dd35da
      Goldwyn Rodrigues 提交于
      bitmap_copy_from_slot reads the bitmap from the slot mentioned.
      It then copies the set bits to the node local bitmap.
      
      This is helper function for the resync operation on node failure.
      
      bitmap_set_memory_bits() currently assumes it is only run at startup and that
      they bitmap is currently empty.  So if it finds that a region is already
      marked as dirty, it won't mark it dirty again. Change bitmap_set_memory_bits()
      to always set the NEEDED_MASK bit if 'needed' is set.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      11dd35da
    • G
      bitmap_create returns bitmap pointer · f9209a32
      Goldwyn Rodrigues 提交于
      This is done to have multiple bitmaps open at the same time.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      f9209a32
    • G
      Gather on-going resync information of other nodes · 96ae923a
      Goldwyn Rodrigues 提交于
      When a node joins, it does not know of other nodes performing resync.
      So, each node keeps the resync information in it's LVB. When a new
      node joins, it reads the LVB of each "online" bitmap.
      
      [TODO] The new node attempts to get the PW lock on other bitmap, if
      it is successful, it reads the bitmap and performs the resync (if
      required) on it's behalf.
      
      If the node does not get the PW, it requests CR and reads the LVB
      for the resync information.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      96ae923a
    • G
      Lock bitmap while joining the cluster · 54519c5f
      Goldwyn Rodrigues 提交于
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      54519c5f
    • G
      Use separate bitmaps for each nodes in the cluster · b97e9257
      Goldwyn Rodrigues 提交于
      On-disk format:
      
      0                    4k                     8k                    12k
      -------------------------------------------------------------------
      | idle                | md super            | bm super [0] + bits |
      | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
      | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
      | bm bits [3, contd]  |                     |                     |
      
      Bitmap super has a field nodes, which defines the maximum number
      of nodes the device can use. While reading the bitmap super, if
      the cluster finds out that the number of nodes is > 0:
      1. Requests the md-cluster module.
      2. Calls md_cluster_ops->join(), which sets up clustering such as
         joining DLM lockspace.
      
      Since the first time, the first bitmap is read. After the call
      to the cluster_setup, the bitmap offset is adjusted and the
      superblock is re-read. This also ensures the bitmap is read
      the bitmap lock (when bitmap lock is introduced in later patches)
      
      Questions:
      1. cluster name is repeated in all bitmap supers. Is that okay?
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      b97e9257
    • G
      Add node recovery callbacks · cf921cc1
      Goldwyn Rodrigues 提交于
      DLM offers callbacks when a node fails and the lock remastery
      is performed:
      
      1. recover_prep: called when DLM discovers a node is down
      2. recover_slot: called when DLM identifies the node and recovery
      		can start
      3. recover_done: called when all nodes have completed recover_slot
      
      recover_slot() and recover_done() are also called when the node joins
      initially in order to inform the node with its slot number. These slot
      numbers start from one, so we deduct one to make it start with zero
      which the cluster-md code uses.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      cf921cc1
    • G
      ca8895d9
    • G
      Introduce md_cluster_info · c4ce867f
      Goldwyn Rodrigues 提交于
      md_cluster_info stores the cluster information in the MD device.
      
      The join() is called when mddev detects it is a clustered device.
      The main responsibilities are:
      	1. Setup a DLM lockspace
      	2. Setup all initial locks such as super block locks and bitmap lock (will come later)
      
      The leave() clears up the lockspace and all the locks held.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      c4ce867f
    • G
      Introduce md_cluster_operations to handle cluster functions · edb39c9d
      Goldwyn Rodrigues 提交于
      This allows dynamic registering of cluster hooks.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      edb39c9d
    • G
      DLM lock and unlock functions · 47741b7c
      Goldwyn Rodrigues 提交于
      A dlm_lock_resource is a structure which contains all information
      required for locking using DLM. The init function allocates the
      lock and acquires the lock in NL mode. The unlock function
      converts the lock resource to NL mode. This is done to preserve
      LVB and for faster processing of locks. The lock resource is
      DLM unlocked only in the lockres_free function, which is the end
      of life of the lock resource.
      Signed-off-by: NLidong Zhong <lzhong@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      47741b7c
    • G
      Create a separate module for clustering support · 8e854e9c
      Goldwyn Rodrigues 提交于
      Tagged as EXPERIMENTAL for now.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      8e854e9c
    • G
      183bdf51
    • G
      md-cluster: Design Documentation · b8d83448
      Goldwyn Rodrigues 提交于
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      b8d83448
    • L
      Linux 4.0-rc1 · c517d838
      Linus Torvalds 提交于
      .. after extensive statistical analysis of my G+ polling, I've come to
      the inescapable conclusion that internet polls are bad.
      
      Big surprise.
      
      But "Hurr durr I'ma sheep" trounced "I like online polls" by a 62-to-38%
      margin, in a poll that people weren't even supposed to participate in.
      Who can argue with solid numbers like that? 5,796 votes from people who
      can't even follow the most basic directions?
      
      In contrast, "v4.0" beat out "v3.20" by a slimmer margin of 56-to-44%,
      but with a total of 29,110 votes right now.
      
      Now, arguably, that vote spread is only about 3,200 votes, which is less
      than the almost six thousand votes that the "please ignore" poll got, so
      it could be considered noise.
      
      But hey, I asked, so I'll honor the votes.
      c517d838
    • L
      Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · feaf2229
      Linus Torvalds 提交于
      Pull ext4 fixes from Ted Ts'o:
       "Ext4 bug fixes.
      
        We also reserved code points for encryption and read-only images (for
        which the implementation is mostly just the reserved code point for a
        read-only feature :-)"
      
      * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: fix indirect punch hole corruption
        ext4: ignore journal checksum on remount; don't fail
        ext4: remove duplicate remount check for JOURNAL_CHECKSUM change
        ext4: fix mmap data corruption in nodelalloc mode when blocksize < pagesize
        ext4: support read-only images
        ext4: change to use setup_timer() instead of init_timer()
        ext4: reserve codepoints used by the ext4 encryption feature
        jbd2: complain about descriptor block checksum errors
      feaf2229
    • L
      Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · be5e6616
      Linus Torvalds 提交于
      Pull more vfs updates from Al Viro:
       "Assorted stuff from this cycle.  The big ones here are multilayer
        overlayfs from Miklos and beginning of sorting ->d_inode accesses out
        from David"
      
      * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (51 commits)
        autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation
        procfs: fix race between symlink removals and traversals
        debugfs: leave freeing a symlink body until inode eviction
        Documentation/filesystems/Locking: ->get_sb() is long gone
        trylock_super(): replacement for grab_super_passive()
        fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
        Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
        VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
        SELinux: Use d_is_positive() rather than testing dentry->d_inode
        Smack: Use d_is_positive() rather than testing dentry->d_inode
        TOMOYO: Use d_is_dir() rather than d_inode and S_ISDIR()
        Apparmor: Use d_is_positive/negative() rather than testing dentry->d_inode
        Apparmor: mediated_filesystem() should use dentry->d_sb not inode->i_sb
        VFS: Split DCACHE_FILE_TYPE into regular and special types
        VFS: Add a fallthrough flag for marking virtual dentries
        VFS: Add a whiteout dentry type
        VFS: Introduce inode-getting helpers for layered/unioned fs environments
        Infiniband: Fix potential NULL d_inode dereference
        posix_acl: fix reference leaks in posix_acl_create
        autofs4: Wrong format for printing dentry
        ...
      be5e6616
    • L
      Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm · 90c453ca
      Linus Torvalds 提交于
      Pull ARM fix from Russell King:
       "Just one fix this time around.  __iommu_alloc_buffer() can cause a
        BUG() if dma_alloc_coherent() is called with either __GFP_DMA32 or
        __GFP_HIGHMEM set.  The patch from Alexandre addresses this"
      
      * 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm:
        ARM: 8305/1: DMA: Fix kzalloc flags in __iommu_alloc_buffer()
      90c453ca
    • A
      autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation · 0a280962
      Al Viro 提交于
      X-Coverup: just ask spender
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0a280962
    • A
      procfs: fix race between symlink removals and traversals · 7e0e953b
      Al Viro 提交于
      use_pde()/unuse_pde() in ->follow_link()/->put_link() resp.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7e0e953b
    • A
      debugfs: leave freeing a symlink body until inode eviction · 0db59e59
      Al Viro 提交于
      As it is, we have debugfs_remove() racing with symlink traversals.
      Supply ->evict_inode() and do freeing there - inode will remain
      pinned until we are done with the symlink body.
      
      And rip the idiocy with checking if dentry is positive right after
      we'd verified debugfs_positive(), which is a stronger check...
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0db59e59
    • A
      dca11178
    • K
      trylock_super(): replacement for grab_super_passive() · eb6ef3df
      Konstantin Khlebnikov 提交于
      I've noticed significant locking contention in memory reclaimer around
      sb_lock inside grab_super_passive(). Grab_super_passive() is called from
      two places: in icache/dcache shrinkers (function super_cache_scan) and
      from writeback (function __writeback_inodes_wb). Both are required for
      progress in memory allocator.
      
      Grab_super_passive() acquires sb_lock to increment sb->s_count and check
      sb->s_instances. It seems sb->s_umount locked for read is enough here:
      super-block deactivation always runs under sb->s_umount locked for write.
      Protecting super-block itself isn't a problem: in super_cache_scan() sb
      is protected by shrinker_rwsem: it cannot be freed if its slab shrinkers
      are still active. Inside writeback super-block comes from inode from bdi
      writeback list under wb->list_lock.
      
      This patch removes locking sb_lock and checks s_instances under s_umount:
      generic_shutdown_super() unlinks it under sb->s_umount locked for write.
      New variant is called trylock_super() and since it only locks semaphore,
      callers must call up_read(&sb->s_umount) instead of drop_super(sb) when
      they're done.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      eb6ef3df
    • D
      fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions · 54f2a2f4
      David Howells 提交于
      Fanotify probably doesn't want to watch autodirs so make it use d_can_lookup()
      rather than d_is_dir() when checking a dir watch and give an error on fake
      directories.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      54f2a2f4