1. 17 8月, 2011 2 次提交
    • D
      pstore: change mutex locking to spin_locks · abd4d558
      Don Zickus 提交于
      pstore was using mutex locking to protect read/write access to the
      backend plug-ins.  This causes problems when pstore is executed in
      an NMI context through panic() -> kmsg_dump().
      
      This patch changes the mutex to a spin_lock_irqsave then also checks to
      see if we are in an NMI context.  If we are in an NMI and can't get the
      lock, just print a message stating that and blow by the locking.
      
      All this is probably a hack around the bigger locking problem but it
      solves my current situation of trying to sleep in an NMI context.
      
      Tested by loading the lkdtm module and executing a HARDLOCKUP which
      will cause the machine to panic inside the nmi handler.
      Signed-off-by: NDon Zickus <dzickus@redhat.com>
      Acked-by: NMatthew Garrett <mjg@redhat.com>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      abd4d558
    • L
      pstore: defer inserting OOPS entries into pstore · 6dda9266
      Luck, Tony 提交于
      Life is simple for all the kernel terminating types of kmsg_dump
      call backs - pstore just saves the tail end of the console log. But
      for "oops" the situation is more complex - the kernel may carry on
      running (possibly for ever).  So we'd like to make the logged copy
      of the oops appear in the pstore filesystem - so that the user has
      a handle to clear the entry from the persistent backing store (if
      we don't, the store may fill with "oops" entries (that are also
      safely stashed in /var/log/messages) leaving no space for real
      errors.
      
      Current code calls pstore_mkfile() immediately. But this may
      not be safe. The oops could have happened with arbitrary locks
      held, or in interrupt or NMI context. So allocating memory and
      calling into generic filesystem code seems unwise.
      
      This patch defers making the entry appear. At the time
      of the oops, we merely set a flag "pstore_new_entry" noting that
      a new entry has been added. A periodic timer checks once a minute
      to see if the flag is set - if so, it schedules a work queue to
      rescan the backing store and make all new entries appear in the
      pstore filesystem.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      6dda9266
  2. 13 8月, 2011 2 次提交
  3. 12 8月, 2011 4 次提交
    • B
      pnfs: Automatically select blocks & objects layouts · 8cf1fb21
      Boaz Harrosh 提交于
      Just like files-layout, blocks & objects layouts are part of the
      NFS 4.1 protocol and should be automatically selected if NFS_4_1
      is selected. The small problem is that these depend on other
      Kernel support being present, while files only depends on NFS
      itself.
      
      This patch removes from the user choice the presence of objects
      and blocks layout. But makes sure these are selected only if
      the depended subsystems are present in the Kernel.
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      Acked-by: NPeng Tao <peng_tao@emc.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8cf1fb21
    • E
      ext4: Properly count journal credits for long symlinks · 8c208719
      Eric Sandeen 提交于
      Commit df5e6223 ("ext4: fix deadlock in ext4_symlink() in ENOSPC
      conditions") recalculated the number of credits needed for a long
      symlink, in the process of splitting it into two transactions.  However,
      the first credit calculation under-counted because if selinux is
      enabled, credits are needed to create the selinux xattr as well.
      
      Overrunning the reservation will result in an OOPS in
      jbd2_journal_dirty_metadata() due to this assert:
      
        J_ASSERT_JH(jh, handle->h_buffer_credits > 0);
      
      Fix this by increasing the reservation size.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8c208719
    • E
      ext3: Properly count journal credits for long symlinks · d2db60df
      Eric Sandeen 提交于
      Commit ae54870a ("ext3: Fix lock inversion in ext3_symlink()")
      recalculated the number of credits needed for a long symlink, in the
      process of splitting it into two transactions.  However, the first
      credit calculation under-counted because if selinux is enabled, credits
      are needed to create the selinux xattr as well.
      
      Overrunning the reservation will result in an OOPS in
      journal_dirty_metadata() due to this assert:
      
        J_ASSERT_JH(jh, handle->h_buffer_credits > 0);
      
      Fix this by increasing the reservation size.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2db60df
    • V
      move RLIMIT_NPROC check from set_user() to do_execve_common() · 72fa5997
      Vasiliy Kulikov 提交于
      The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
      check in set_user() to check for NPROC exceeding via setuid() and
      similar functions.
      
      Before the check there was a possibility to greatly exceed the allowed
      number of processes by an unprivileged user if the program relied on
      rlimit only.  But the check created new security threat: many poorly
      written programs simply don't check setuid() return code and believe it
      cannot fail if executed with root privileges.  So, the check is removed
      in this patch because of too often privilege escalations related to
      buggy programs.
      
      The NPROC can still be enforced in the common code flow of daemons
      spawning user processes.  Most of daemons do fork()+setuid()+execve().
      The check introduced in execve() (1) enforces the same limit as in
      setuid() and (2) doesn't create similar security issues.
      
      Neil Brown suggested to track what specific process has exceeded the
      limit by setting PF_NPROC_EXCEEDED process flag.  With the change only
      this process would fail on execve(), and other processes' execve()
      behaviour is not changed.
      
      Solar Designer suggested to re-check whether NPROC limit is still
      exceeded at the moment of execve().  If the process was sleeping for
      days between set*uid() and execve(), and the NPROC counter step down
      under the limit, the defered execve() failure because NPROC limit was
      exceeded days ago would be unexpected.  If the limit is not exceeded
      anymore, we clear the flag on successful calls to execve() and fork().
      
      The flag is also cleared on successful calls to set_user() as the limit
      was exceeded for the previous user, not the current one.
      
      Similar check was introduced in -ow patches (without the process flag).
      
      v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().
      Reviewed-by: NJames Morris <jmorris@namei.org>
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Acked-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72fa5997
  4. 11 8月, 2011 2 次提交
  5. 10 8月, 2011 5 次提交
  6. 09 8月, 2011 2 次提交
    • L
      autofs4: fix debug printk warning uncovered by cleanup · 2f84dd70
      Linus Torvalds 提交于
      The previous comit made the autofs4 debug printouts check types against
      the printout format, and uncovered this bug:
      
        fs/autofs4/waitq.c:106:2: warning: format ‘%08lx’ expects type ‘long unsigned int’, but argument 4 has type ‘autofs_wqt_t’
      
      which is due to the insane type for wait_queue_token.  That thing should
      be some fixed well-defined size (preferably just 'unsigned int' or
      'u32') but for unexplained reasons it is randomly either 'unsigned long'
      or 'unsigned int' depending on the architecture.
      
      For now, cast it to 'unsigned long' for printing, the way we do
      elsewhere.  Somebody else can try to explain the typedef mess.
      
      (There's a reason we don't support excessive use of typedefs in the
      kernel: it's usually just a good way of confusing yourself).
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f84dd70
    • L
      autofs4: clean up uaotfs use of debug/info/warning printouts · c3ad9962
      Linus Torvalds 提交于
      Use 'pr_debug()' for DPRINTK, which will do the proper type checking on
      the arguments (without generating code) even when DEBUG isn't #defined.
      
      Also, use the standard __VA_ARGS__ for the macros, and stop the
      pointless abuse of 'do { xyz } while (0)' when the macro is already a
      perfectly well-formed single statement.
      Reported-by: NDavid Howells <dhowells@redhat.com>
      Suggested-by: NJoe Perches <joe@perches.com>
      Cc: Ian Kent <raven@themaw.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c3ad9962
  7. 08 8月, 2011 2 次提交
  8. 07 8月, 2011 10 次提交
    • A
      Fix POSIX ACL permission check · 206b1d09
      Ari Savolainen 提交于
      After commit 3567866b: "RCUify freeing acls, let check_acl() go ahead in
      RCU mode if acl is cached" posix_acl_permission is being called with an
      unsupported flag and the permission check fails. This patch fixes the issue.
      Signed-off-by: NAri Savolainen <ari.m.savolainen@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      206b1d09
    • L
      vfs: optimize inode cache access patterns · 3ddcd056
      Linus Torvalds 提交于
      The inode structure layout is largely random, and some of the vfs paths
      really do care.  The path lookup in particular is already quite D$
      intensive, and profiles show that accessing the 'inode->i_op->xyz'
      fields is quite costly.
      
      We already optimized the dcache to not unnecessarily load the d_op
      structure for members that are often NULL using the DCACHE_OP_xyz bits
      in dentry->d_flags, and this does something very similar for the inode
      ops that are used during pathname lookup.
      
      It also re-orders the fields so that the fields accessed by 'stat' are
      together at the beginning of the inode structure, and roughly in the
      order accessed.
      
      The effect of this seems to be in the 1-2% range for an empty kernel
      "make -j" run (which is fairly kernel-intensive, mostly in filename
      lookup), so it's visible.  The numbers are fairly noisy, though, and
      likely depend a lot on exact microarchitecture.  So there's more tuning
      to be done.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ddcd056
    • L
      vfs: renumber DCACHE_xyz flags, remove some stale ones · 830c0f0e
      Linus Torvalds 提交于
      Gcc tends to generate better code with small integers, including the
      DCACHE_xyz flag tests - so move the common ones to be first in the list.
      Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
      DCACHE_AUTOFS_PENDING values, their users no longer exists in the source
      tree.
      
      And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
      common case to be a nice straight-line fall-through.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      830c0f0e
    • B
      ore: Make ore its own module · cf283ade
      Boaz Harrosh 提交于
      Export everything from ore need exporting. Change Kbuild and Kconfig
      to build ore.ko as an independent module. Import ore from exofs
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      cf283ade
    • B
      exofs: Rename raid engine from exofs/ios.c => ore · 8ff660ab
      Boaz Harrosh 提交于
      ORE stands for "Objects Raid Engine"
      
      This patch is a mechanical rename of everything that was in ios.c
      and its API declaration to an ore.c and an osd_ore.h header. The ore
      engine will later be used by the pnfs objects layout driver.
      
      * File ios.c => ore.c
      
      * Declaration of types and API are moved from exofs.h to a new
        osd_ore.h
      
      * All used types are prefixed by ore_ from their exofs_ name.
      
      * Shift includes from exofs.h to osd_ore.h so osd_ore.h is
        independent, include it from exofs.h.
      
      Other than a pure rename there are no other changes. Next patch
      will move the ore into it's own module and will export the API
      to be used by exofs and later the layout driver
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      8ff660ab
    • B
      exofs: ios: Move to a per inode components & device-table · 9e9db456
      Boaz Harrosh 提交于
      Exofs raid engine was saving on memory space by having a single layout-info,
      single pid, and a single device-table, global to the filesystem. Then passing
      a credential and object_id info at the io_state level, private for each
      inode. It would also devise this contraption of rotating the device table
      view for each inode->ino to spread out the device usage.
      
      This is not compatible with the pnfs-objects standard, demanding that
      each inode can have it's own layout-info, device-table, and each object
      component it's own pid, oid and creds.
      
      So: Bring exofs raid engine to be usable for generic pnfs-objects use by:
      
      * Define an exofs_comp structure that holds obj_id and credential info.
      
      * Break up exofs_layout struct to an exofs_components structure that holds a
        possible array of exofs_comp and the array of devices + the size of the
        arrays.
      
      * Add a "comps" parameter to get_io_state() that specifies the ids creds
        and device array to use for each IO.
      
        This enables to keep the layout global, but the device-table view, creds
        and IDs at the inode level. It only adds two 64bit to each inode, since
        some of these members already existed in another form.
      
      * ios raid engine now access layout-info and comps-info through the passed
        pointers. Everything is pre-prepared by caller for generic access of
        these structures and arrays.
      
      At the exofs Level:
      
      * Super block holds an exofs_components struct that holds the device
        array, previously in layout. The devices there are in device-table
        order. The device-array is twice bigger and repeats the device-table
        twice so now each inode's device array can point to a random device
        and have a round-robin view of the table, making it compatible to
        previous exofs versions.
      
      * Each inode has an exofs_components struct that is initialized at
        load time, with it's own view of the device table IDs and creds.
        When doing IO this gets passed to the io_state together with the
        layout.
      
      While preforming this change. Bugs where found where credentials with the
      wrong IDs where used to access the different SB objects (super.c). As well
      as some dead code. It was never noticed because the target we use does not
      check the credentials.
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      9e9db456
    • B
      exofs: Move exofs specific osd operations out of ios.c · 85e44df4
      Boaz Harrosh 提交于
      ios.c will be moving to an external library, for use by the
      objects-layout-driver. Remove from it some exofs specific functions.
      
      Also g_attr_logical_length is used both by inode.c and ios.c
      move definition to the later, to keep it independent
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      85e44df4
    • B
      exofs: Add offset/length to exofs_get_io_state · e1042ba0
      Boaz Harrosh 提交于
      In future raid code we will need to know the IO offset/length
      and if it's a read or write to determine some of the array
      sizes we'll need.
      
      So add a new exofs_get_rw_state() API for use when
      writeing/reading. All other simple cases are left using the
      old way.
      
      The major change to this is that now we need to call
      exofs_get_io_state later at inode.c::read_exec and
      inode.c::write_exec when we actually know these things. So this
      patch is kept separate so I can test things apart from other
      changes.
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      e1042ba0
    • L
      vfs: show O_CLOEXE bit properly in /proc/<pid>/fdinfo/<fd> files · 1117f72e
      Linus Torvalds 提交于
      The CLOEXE bit is magical, and for performance (and semantic) reasons we
      don't actually maintain it in the file descriptor itself, but in a
      separate bit array.  Which means that when we show f_flags, the CLOEXE
      status is shown incorrectly: we show the status not as it is now, but as
      it was when the file was opened.
      
      Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit
      array.
      
      Uli needs this in order to re-implement the pfiles program:
      
        "For normal file descriptors (not sockets) this was the last piece of
         information which wasn't available.  This is all part of my 'give
         Solaris users no reason to not switch' effort.  I intend to offer the
         code to the util-linux-ng maintainers."
      Requested-by: NUlrich Drepper <drepper@akkadia.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1117f72e
    • L
      oom_ajd: don't use WARN_ONCE, just use printk_once · c2142704
      Linus Torvalds 提交于
      WARN_ONCE() is very annoying, in that it shows the stack trace that we
      don't care about at all, and also triggers various user-level "kernel
      oopsed" logic that we really don't care about.  And it's not like the
      user can do anything about the applications (sshd) in question, it's a
      distro issue.
      
      Requested-by: Andi Kleen <andi@firstfloor.org> (and many others)
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c2142704
  9. 05 8月, 2011 6 次提交
  10. 04 8月, 2011 5 次提交