1. 25 1月, 2012 11 次提交
    • E
      sysctl: Improve the sysctl sanity checks · 7c60c48f
      Eric W. Biederman 提交于
      - Stop validating subdirectories now that we only register leaf tables
      
      - Cleanup and improve the duplicate filename check.
        * Run the duplicate filename check under the sysctl_lock to guarantee
          we never add duplicate names.
        * Reduce the duplicate filename check to nearly O(M*N) where M is the
          number of entries in tthe table we are registering and N is the
          number of entries in the directory before we got there.
      
      - Move the duplicate filename check into it's own function and call
        it directtly from __register_sysctl_table
      
      - Kill the config option as the sanity checks are now cheap enough
        the config option is unnecessary. The original reason for the config
        option was because we had a huge table used to verify the proc filename
        to binary sysctl mapping.  That table has now evolved into the binary_sysctl
        translation layer and is no longer part of the sysctl_check code.
      
      - Tighten up the permission checks.  Guarnateeing that files only have read
        or write permissions.
      
      - Removed redudant check for parents having a procname as now everything has
        a procname.
      
      - Generalize the backtrace logic so that we print a backtrace from
        any failure of __register_sysctl_table that was not caused by
        a memmory allocation failure.  The backtrace allows us to track
        down who erroneously registered a sysctl table.
      
      Bechmark before (CONFIG_SYSCTL_CHECK=y):
          make-dummies 0 999 -> 12s
          rmmod dummy        -> 0.08s
      
      Bechmark before (CONFIG_SYSCTL_CHECK=n):
          make-dummies 0 999 -> 0.7s
          rmmod dummy        -> 0.06s
          make-dummies 0 99999 -> 1m13s
          rmmod dummy          -> 0.38s
      
      Benchmark after:
          make-dummies 0 999 -> 0.65s
          rmmod dummy        -> 0.055s
          make-dummies 0 9999 -> 1m10s
          rmmod dummy         -> 0.39s
      
      The sysctl sanity checks now impose no measurable cost.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      7c60c48f
    • E
      sysctl: register only tables of sysctl files · f728019b
      Eric W. Biederman 提交于
      Split the registration of a complex ctl_table array which may have
      arbitrary numbers of directories (->child != NULL) and tables of files
      into a series of simpler registrations that only register tables of files.
      
      Graphically:
      
         register('dir', { + file-a
                           + file-b
                           + subdir1
                             + file-c
                           + subdir2
                             + file-d
                             + file-e })
      
      is transformed into:
         wrapper->subheaders[0] = register('dir', {file1-a, file1-b})
         wrapper->subheaders[1] = register('dir/subdir1', {file-c})
         wrapper->subheaders[2] = register('dir/subdir2', {file-d, file-e})
         return wrapper
      
      This guarantees that __register_sysctl_table will only see a simple
      ctl_table array with all entries having (->child == NULL).
      
      Care was taken to pass the original simple ctl_table arrays to
      __register_sysctl_table whenever possible.
      
      This change is derived from a similar patch written
      by Lucrian Grijincu.
      Inspired-by: NLucian Adrian Grijincu <lucian.grijincu@gmail.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      f728019b
    • E
      sysctl: Add ctl_table chains into cstring paths · ec6a5266
      Eric W. Biederman 提交于
      For any component of table passed to __register_sysctl_paths
      that actually serves as a path, add that to the cstring path
      that is passed to __register_sysctl_table.
      
      The result is that for most calls to __register_sysctl_paths
      we only pass a table to __register_sysctl_table that contains
      no child directories.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      ec6a5266
    • E
      sysctl: Add support for register sysctl tables with a normal cstring path. · 6e9d5164
      Eric W. Biederman 提交于
      Make __register_sysctl_table the core sysctl registration operation and
      make it take a char * string as path.
      
      Now that binary paths have been banished into the real of backwards
      compatibility in kernel/binary_sysctl.c where they can be safely
      ignored there is no longer a need to use struct ctl_path to represent
      path names when registering ctl_tables.
      
      Start the transition to using normal char * strings to represent
      pathnames when registering sysctl tables.  Normal strings are easier
      to deal with both in the internal sysctl implementation and for
      programmers registering sysctl tables.
      
      __register_sysctl_paths is turned into a backwards compatibility wrapper
      that converts a ctl_path array into a normal char * string.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      6e9d5164
    • E
      sysctl: Create local copies of directory names used in paths · f05e53a7
      Eric W. Biederman 提交于
      Creating local copies of directory names is a good idea for
      two reasons.
      - The dynamic names used by callers must be copied into new
        strings by the callers today to ensure the strings do not
        change between register and unregister of the sysctl table.
      
      - Sysctl directories have a potentially different lifetime
        than the time between register and unregister of any
        particular sysctl table.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      f05e53a7
    • E
      sysctl: Remove the unnecessary sysctl_set parent concept. · bd295b56
      Eric W. Biederman 提交于
      In sysctl_net register the two networking roots in the proper order.
      
      In register_sysctl walk the sysctl sets in the reverse order of the
      sysctl roots.
      
      Remove parent from ctl_table_set and setup_sysctl_set as it is no
      longer needed.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      bd295b56
    • E
      sysctl: Implement retire_sysctl_set · 97324cd8
      Eric W. Biederman 提交于
      This adds a small helper retire_sysctl_set to remove the intimate knowledge about
      the how a sysctl_set is implemented from net/sysct_net.c
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      97324cd8
    • E
      sysctl: Make the directories have nlink == 1 · a15e2098
      Eric W. Biederman 提交于
      I goofed when I made sysctl directories have nlink == 0.
      nlink == 0 means the directory has been deleted.
      nlink == 1 meands a directory does not count subdirectories.
      
      Use the default nlink == 1 for sysctl directories.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      a15e2098
    • E
      sysctl: Move the implementation into fs/proc/proc_sysctl.c · 1f87f0b5
      Eric W. Biederman 提交于
      Move the core sysctl code from kernel/sysctl.c and kernel/sysctl_check.c
      into fs/proc/proc_sysctl.c.
      
      Currently sysctl maintenance is hampered by the sysctl implementation
      being split across 3 files with artificial layering between them.
      Consolidate the entire sysctl implementation into 1 file so that
      it is easier to see what is going on and hopefully allowing for
      simpler maintenance.
      
      For functions that are now only used in fs/proc/proc_sysctl.c remove
      their declarations from sysctl.h and make them static in fs/proc/proc_sysctl.c
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      1f87f0b5
    • E
      sysctl: Register the base sysctl table like any other sysctl table. · de4e83bd
      Eric W. Biederman 提交于
      Simplify the code by treating the base sysctl table like any other
      sysctl table and register it with register_sysctl_table.
      
      To ensure this table is registered early enough to avoid problems
      call sysctl_init from proc_sys_init.
      
      Rename sysctl_net.c:sysctl_init() to net_sysctl_init() to avoid
      name conflicts now that kernel/sysctl.c:sysctl_init() is no longer
      static.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      de4e83bd
    • L
      sysctl: remove impossible condition check · 36885d7b
      Lucas De Marchi 提交于
      Remove checks for conditions that will never happen. If procname is NULL
      the loop would already had bailed out, so there's no need to check it
      again.
      
      At the same time this also compacts the function find_in_table() by
      refactoring it to be easier to read.
      Signed-off-by: NLucas De Marchi <lucas.demarchi@profusion.mobi>
      Reviewed-by: NJesper Juhl <jj@chaosbits.net>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      36885d7b
  2. 20 1月, 2012 3 次提交
  3. 18 1月, 2012 14 次提交
    • L
      proc: clean up and fix /proc/<pid>/mem handling · e268337d
      Linus Torvalds 提交于
      Jüri Aedla reported that the /proc/<pid>/mem handling really isn't very
      robust, and it also doesn't match the permission checking of any of the
      other related files.
      
      This changes it to do the permission checks at open time, and instead of
      tracking the process, it tracks the VM at the time of the open.  That
      simplifies the code a lot, but does mean that if you hold the file
      descriptor open over an execve(), you'll continue to read from the _old_
      VM.
      
      That is different from our previous behavior, but much simpler.  If
      somebody actually finds a load where this matters, we'll need to revert
      this commit.
      
      I suspect that nobody will ever notice - because the process mapping
      addresses will also have changed as part of the execve.  So you cannot
      actually usefully access the fd across a VM change simply because all
      the offsets for IO would have changed too.
      Reported-by: NJüri Aedla <asd@ut.ee>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e268337d
    • M
      vfs: remove printk from set_nlink() · 424a5334
      Miklos Szeredi 提交于
      Don't log a message for set_nlink(0).
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      424a5334
    • K
      wake up s_wait_unfrozen when ->freeze_fs fails · e1616300
      Kazuya Mio 提交于
      dd slept infinitely when fsfeeze failed because of EIO.
      To fix this problem, if ->freeze_fs fails, freeze_super() wakes up
      the tasks waiting for the filesystem to become unfrozen.
      
      When s_frozen isn't SB_UNFROZEN in __generic_file_aio_write(),
      the function sleeps until FITHAW ioctl wakes up s_wait_unfrozen.
      
      However, if ->freeze_fs fails, s_frozen is set to SB_UNFROZEN and then
      freeze_super() returns an error number. In this case, FITHAW ioctl returns
      EINVAL because s_frozen is already SB_UNFROZEN. There is no way to wake up
      s_wait_unfrozen, so __generic_file_aio_write() sleeps infinitely.
      Signed-off-by: NKazuya Mio <k-mio@sx.jp.nec.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e1616300
    • E
      audit: do not call audit_getname on error · 4043cde8
      Eric Paris 提交于
      Just a code cleanup really.  We don't need to make a function call just for
      it to return on error.  This also makes the VFS function even easier to follow
      and removes a conditional on a hot path.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      4043cde8
    • E
      audit: only allow tasks to set their loginuid if it is -1 · 633b4545
      Eric Paris 提交于
      At the moment we allow tasks to set their loginuid if they have
      CAP_AUDIT_CONTROL.  In reality we want tasks to set the loginuid when they
      log in and it be impossible to ever reset.  We had to make it mutable even
      after it was once set (with the CAP) because on update and admin might have
      to restart sshd.  Now sshd would get his loginuid and the next user which
      logged in using ssh would not be able to set his loginuid.
      
      Systemd has changed how userspace works and allowed us to make the kernel
      work the way it should.  With systemd users (even admins) are not supposed
      to restart services directly.  The system will restart the service for
      them.  Thus since systemd is going to loginuid==-1, sshd would get -1, and
      sshd would be allowed to set a new loginuid without special permissions.
      
      If an admin in this system were to manually start an sshd he is inserting
      himself into the system chain of trust and thus, logically, it's his
      loginuid that should be used!  Since we have old systems I make this a
      Kconfig option.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      633b4545
    • E
      audit: remove task argument to audit_set_loginuid · 0a300be6
      Eric Paris 提交于
      The function always deals with current.  Don't expose an option
      pretending one can use it for something.  You can't.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      0a300be6
    • C
      xfs: cleanup xfs_file_aio_write · d0606464
      Christoph Hellwig 提交于
      With all the size field updates out of the way xfs_file_aio_write can
      be further simplified by pushing all iolock handling into
      xfs_file_dio_aio_write and xfs_file_buffered_aio_write and using
      the generic generic_write_sync helper for synchronous writes.
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      d0606464
    • C
      xfs: always return with the iolock held from xfs_file_aio_write_checks · 5bf1f262
      Christoph Hellwig 提交于
      While xfs_iunlock is fine with 0 lockflags the calling conventions are much
      cleaner if xfs_file_aio_write_checks never returns without the iolock held.
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      5bf1f262
    • C
      xfs: remove the i_new_size field in struct xfs_inode · 2813d682
      Christoph Hellwig 提交于
      Now that we use the VFS i_size field throughout XFS there is no need for the
      i_new_size field any more given that the VFS i_size field gets updated
      in ->write_end before unlocking the page, and thus is always uptodate when
      writeback could see a page.  Removing i_new_size also has the advantage that
      we will never have to trim back di_size during a failed buffered write,
      given that it never gets updated past i_size.
      
      Note that currently the generic direct I/O code only updates i_size after
      calling our end_io handler, which requires a small workaround to make
      sure di_size actually makes it to disk.  I hope to fix this properly in
      the generic code.
      
      A downside is that we lose the support for parallel non-overlapping O_DIRECT
      appending writes that recently was added.  I don't think keeping the complex
      and fragile i_new_size infrastructure for this is a good tradeoff - if we
      really care about parallel appending writers we should investigate turning
      the iolock into a range lock, which would also allow for parallel
      non-overlapping buffered writers.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      2813d682
    • C
      xfs: remove the i_size field in struct xfs_inode · ce7ae151
      Christoph Hellwig 提交于
      There is no fundamental need to keep an in-memory inode size copy in the XFS
      inode.  We already have the on-disk value in the dinode, and the separate
      in-memory copy that we need for regular files only in the XFS inode.
      
      Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the
      VFS inode i_size field for regular files.  Switch code that was directly
      accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where
      we are limited to regular files direct access of the VFS inode i_size field.
      
      This also allows dropping some fairly complicated code in the write path
      which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size
      that is getting updated inside ->write_end.
      
      Note that we do not bother resetting the VFS i_size when truncating a file
      that gets freed to zero as there is no point in doing so because the VFS inode
      is no longer in use at this point.  Just relax the assert in xfs_ifree to
      only check the on-disk size instead.
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      ce7ae151
    • C
      xfs: replace i_pin_wait with a bit waitqueue · f392e631
      Christoph Hellwig 提交于
      Replace i_pin_wait, which is only used during synchronous inode flushing
      with a bit waitqueue.  This trades off a much smaller inode against
      slightly slower wakeup performance, and saves 12 (32-bit) or 20 (64-bit)
      bytes in the XFS inode.
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      f392e631
    • C
      xfs: replace i_flock with a sleeping bitlock · 474fce06
      Christoph Hellwig 提交于
      We almost never block on i_flock, the exception is synchronous inode
      flushing.  Instead of bloating the inode with a 16/24-byte completion
      that we abuse as a semaphore just implement it as a bitlock that uses
      a bit waitqueue for the rare sleeping path.  This primarily is a
      tradeoff between a much smaller inode and a faster non-blocking
      path vs faster wakeups, and we are much better off with the former.
      
      A small downside is that we will lose lockdep checking for i_flock, but
      given that it's always taken inside the ilock that should be acceptable.
      
      Note that for example the inode writeback locking is implemented in a
      very similar way.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      474fce06
    • C
      xfs: make i_flags an unsigned long · 49e4c70e
      Christoph Hellwig 提交于
      To be used for bit wakeup i_flags needs to be an unsigned long or we'll
      run into trouble on big endian systems.  Because of the 1-byte i_update
      field right after it this actually causes a fairly large size increase
      on its own (4 or 8 bytes), but that increase will be more than offset
      by the next two patches.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      49e4c70e
    • C
      xfs: remove the if_ext_max field in struct xfs_ifork · 8096b1eb
      Christoph Hellwig 提交于
      We spent a lot of effort to maintain this field, but it always equals to the
      fork size divided by the constant size of an extent.  The prime use of it is
      to assert that the two stay in sync.  Just divide the fork size by the extent
      size in the few places that we actually use it and remove the overhead
      of maintaining it.  Also introduce a few helpers to consolidate the places
      where we actually care about the value.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      8096b1eb
  4. 17 1月, 2012 12 次提交
    • C
      Btrfs: use larger system chunks · 96bdc7dc
      Chris Mason 提交于
      system chunks by default are very small.  This makes them slightly
      larger and also fixes the conditional checks to make sure we don't
      allocate a billion of them at once.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      96bdc7dc
    • J
      Btrfs: add a delalloc mutex to inodes for delalloc reservations · f248679e
      Josef Bacik 提交于
      I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
      that and theres no real way to get rid of those, so just stop using i_mutex to
      protect delalloc metadata reservations and use a delalloc mutex instead.  This
      shouldn't be contended often at all, only if you are writing and mmap writing to
      the file at the same time.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      f248679e
    • J
      Btrfs: space leak tracepoints · 8c2a3ca2
      Josef Bacik 提交于
      This in addition to a script in my btrfs-tracing tree will help track down space
      leaks when we're getting space left over in block groups on umount.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      8c2a3ca2
    • J
      Btrfs: protect orphan block rsv with spin_lock · 90290e19
      Josef Bacik 提交于
      We've been seeing warnings coming out of the orphan commit stuff forever from
      ceph.  Turns out it's because we're racing with checking if the orphan block
      reserve is set, because we clear it outside of the spin_lock.  So leave the
      normal fastpath checks where they are, but take the spin_lock and _recheck_ to
      make sure we haven't had an orphan block rsv added in the meantime.  Then clear
      the root's orphan block rsv and release the lock.  With this patch a user said
      the warnings went away and they usually showed up pretty soon after he started
      ceph.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      90290e19
    • J
      Btrfs: add allocator tracepoints · 3f7de037
      Josef Bacik 提交于
      I used these tracepoints when figuring out what the cluster stuff was doing, so
      add them to mainline in case we need to profile this stuff again.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      3f7de037
    • J
      Btrfs: don't call btrfs_throttle in file write · 45a8090e
      Josef Bacik 提交于
      Btrfs_throttle will make us wait if there is a currently committing transaction
      until we can open new transactions, which is ridiculous since we don't actually
      start any transactions within the file write path anyway, so all this does is
      introduce big latencies if we have a sync/fsync heavy workload going on while
      somebody else is trying to do work.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      45a8090e
    • J
      Btrfs: release space on error in page_mkwrite · ec39e180
      Josef Bacik 提交于
      If updating the inode gave us an ENOSPC we were just returning in page_mkwrite,
      which is a problem since we make our reservation right before trying to update
      the inode, so fix the out label so that we actually free our reservation.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      ec39e180
    • M
      Btrfs: fix btrfsck error 400 when truncating a compressed · f70a9a6b
      Miao Xie 提交于
      Reproduce steps:
       # mkfs.btrfs /dev/sdb5
       # mount /dev/sdb5 -o compress=lzo /mnt
       # dd if=/dev/zero of=/mnt/tmpfile bs=128K count=1
       # sync
       # truncate -s 64K /mnt/tmpfile
       root 5 inode 257 errors 400
      
      This is because of the wrong if condition, which is used to check if we should
      subtract the bytes of the dropped range from i_blocks/i_bytes of i-node or not.
      When we truncate a compressed extent, btrfs substracts the bytes of the whole
      extent, it's wrong. We should substract the real size that we truncate, no
      matter it is a compressed extent or not. Fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f70a9a6b
    • J
      Btrfs: do not use btrfs_end_transaction_throttle everywhere · 7ad85bb7
      Josef Bacik 提交于
      A user reported a problem where things like open with O_CREAT would take up to
      30 seconds when he had nfs activity on the same mount.  This is because all of
      our quick metadata operations, like create, symlink etc all do
      btrfs_end_transaction_throttle, which if the transaction is blocked will wait
      for the commit to complete before it returns.  This adds a ridiculous amount of
      latency and isn't really needed.  The normal btrfs_end_transaction will mark the
      transaction as blocked and wake the transaction kthread up if it thinks the
      transaction needs to end (this being in the running out of global reserve space
      scenario), and this is all that is really needed since we've already done
      everything we're going to do, we just need to return.  This should help people
      with the latency they were seeing when using synchronous heavy workloads.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      7ad85bb7
    • I
      Btrfs: add balance progress reporting · 19a39dce
      Ilya Dryomov 提交于
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      19a39dce
    • I
      Btrfs: allow for resuming restriper after it was paused · de322263
      Ilya Dryomov 提交于
      Recognize BTRFS_BALANCE_RESUME flag passed from userspace.  We use the
      same heuristics used when recovering balance after a crash to try to
      start where we left off last time.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      de322263
    • I
      Btrfs: allow for canceling restriper · a7e99c69
      Ilya Dryomov 提交于
      Implement an ioctl for canceling restriper.  Currently we wait until
      relocation of the current block group is finished, in future this can be
      done by triggering a commit.  Balance item is deleted and no memory
      about the interrupted balance is kept.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      a7e99c69