1. 14 12月, 2006 6 次提交
    • P
      [PATCH] cpuset: rework cpuset_zone_allowed api · 02a0e53d
      Paul Jackson 提交于
      Elaborate the API for calling cpuset_zone_allowed(), so that users have to
      explicitly choose between the two variants:
      
        cpuset_zone_allowed_hardwall()
        cpuset_zone_allowed_softwall()
      
      Until now, whether or not you got the hardwall flavor depended solely on
      whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
      argument.
      
      If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
      version.
      
      Unfortunately, this meant that users would end up with the softwall version
      without thinking about it.  Since only the softwall version might sleep,
      this led to bugs with possible sleeping in interrupt context on more than
      one occassion.
      
      The hardwall version requires that the current tasks mems_allowed allows
      the node of the specified zone (or that you're in interrupt or that
      __GFP_THISNODE is set or that you're on a one cpuset system.)
      
      The softwall version, depending on the gfp_mask, might allow a node if it
      was allowed in the nearest enclusing cpuset marked mem_exclusive (which
      requires taking the cpuset lock 'callback_mutex' to evaluate.)
      
      This patch removes the cpuset_zone_allowed() call, and forces the caller to
      explicitly choose between the hardwall and the softwall case.
      
      If the caller wants the gfp_mask to determine this choice, they should (1)
      be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
      cpuset_zone_allowed_softwall() routine.
      
      This adds another 100 or 200 bytes to the kernel text space, due to the few
      lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
      routines.  It should save a few instructions executed for the calls that
      turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
      set (before the call) then check (within the call) the __GFP_HARDWALL flag.
      
      For the most critical call, from get_page_from_freelist(), the same
      instructions are executed as before -- the old cpuset_zone_allowed()
      routine it used to call is the same code as the
      cpuset_zone_allowed_softwall() routine that it calls now.
      
      Not a perfect win, but seems worth it, to reduce this chance of hitting a
      sleeping with irq off complaint again.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      02a0e53d
    • C
      [PATCH] More slab.h cleanups · 55935a34
      Christoph Lameter 提交于
      More cleanups for slab.h
      
      1. Remove tabs from weird locations as suggested by Pekka
      
      2. Drop the check for NUMA and SLAB_DEBUG from the fallback section
         as suggested by Pekka.
      
      3. Uses static inline for the fallback defs as also suggested by Pekka.
      
      4. Make kmem_ptr_valid take a const * argument.
      
      5. Separate the NUMA fallback definitions from the kmalloc_track fallback
         definitions.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      55935a34
    • C
      [PATCH] Cleanup slab headers / API to allow easy addition of new slab allocators · 2e892f43
      Christoph Lameter 提交于
      This is a response to an earlier discussion on linux-mm about splitting
      slab.h components per allocator.  Patch is against 2.6.19-git11.  See
      http://marc.theaimsgroup.com/?l=linux-mm&m=116469577431008&w=2
      
      This patch cleans up the slab header definitions.  We define the common
      functions of slob and slab in slab.h and put the extra definitions needed
      for slab's kmalloc implementations in <linux/slab_def.h>.  In order to get
      a greater set of common functions we add several empty functions to slob.c
      and also rename slob's kmalloc to __kmalloc.
      
      Slob does not need any special definitions since we introduce a fallback
      case.  If there is no need for a slab implementation to provide its own
      kmalloc mess^H^H^Hacros then we simply fall back to __kmalloc functions.
      That is sufficient for SLOB.
      
      Sort the function in slab.h according to their functionality.  First the
      functions operating on struct kmem_cache * then the kmalloc related
      functions followed by special debug and fallback definitions.
      
      Also redo a lot of comments.
      
      Signed-off-by: Christoph Lameter <clameter@sgi.com>?
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2e892f43
    • E
      [PATCH] reorder struct pipe_buf_operations · 6a8ba9d1
      Eric Dumazet 提交于
      Fields of struct pipe_buf_operations have not a precise layout (ie not
      optimized to fit cache lines nor reduce cache line ping pongs)
      
      The bufs[] array is *large* and is placed near the beginning of the
      structure, so all following fields have a large offset.  This is
      unfortunate because many archs have smaller instructions when using small
      offsets relative to a base register.  On x86 for example, 7 bits offsets
      have smaller instruction lengths.
      
      Moving bufs[] at the end of pipe_buf_operations permits all fields to have
      small offsets, and reduce text size, and icache pressure.
      
      # size vmlinux.pre vmlinux
          text    data     bss     dec     hex filename
      3268989  664356  492196 4425541  438745 vmlinux.pre
      3268765  664356  492196 4425317  438665 vmlinux
      
      So this patch reduces text size by 224 bytes on my x86_64 machine. Similar
      results on ia32.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6a8ba9d1
    • E
      [PATCH] Revert "[PATCH] identifier to nsproxy" · 5f8442ed
      Eric W. Biederman 提交于
      This reverts commit 373beb35.
      
      No one is using this identifier yet.  The purpose of this identifier is to
      export nsproxy to user space which is wrong.  nsproxy is an internal
      implementation optimization, which should keep our fork times from getting
      slower as we increase the number of global namespaces you don't have to
      share.
      
      Adding a global identifier like this is inappropriate because it makes
      namespaces inherently non-recursive, greatly limiting what we can do with
      them in the future.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5f8442ed
    • E
      [PATCH] constify pipe_buf_operations · d4c3cca9
      Eric Dumazet 提交于
      - pipe/splice should use const pipe_buf_operations and file_operations
      
      - struct pipe_inode_info has an unused field "start" : get rid of it.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d4c3cca9
  2. 13 12月, 2006 4 次提交
  3. 12 12月, 2006 5 次提交
    • B
      [PATCH] remove blk_queue_activity_fn · 2b02a179
      Boaz Harrosh 提交于
      While working on bidi support at struct request level
      I have found that blk_queue_activity_fn is actually never used.
      The only user is in ide-probe.c with this code:
      
      	/* enable led activity for disk drives only */
      	if (drive->media == ide_disk && hwif->led_act)
      		blk_queue_activity_fn(q, hwif->led_act, drive);
      
      And led_act is never initialized anywhere.
      (Looking back at older kernels it was used in the PPC arch, but was removed around 2.6.18)
      Unless it is all for future use off course.
      (this patch is against linux-2.6-block.git as off 2006/12/4)
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      2b02a179
    • A
      [DCCP]: Whitespace cleanups · 8109b02b
      Arnaldo Carvalho de Melo 提交于
      That accumulated over the last months hackaton, shame on me for not
      using git-apply whitespace helping hand, will do that from now on.
      Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
      8109b02b
    • G
      [DCCP] ccid3: Finer-grained resolution of sending rates · 1a21e49a
      Gerrit Renker 提交于
      This patch
       * resolves a bug where packets smaller than 32/64 bytes resulted in sending rates of 0
       * supports all sending rates from 1/64 bytes/second up to 4Gbyte/second
       * simplifies the present overflow problems in calculations
      
      Current sending rate X and the cached value X_recv of the receiver-estimated
      sending rate are both scaled by 64 (2^6) in order to
       * cope with low sending rates (minimally 1 byte/second)
       * allow upgrading to use a packets-per-second implementation of CCID 3
       * avoid calculation errors due to integer arithmetic cut-off
      
      The patch implements a revised strategy from
      http://www.mail-archive.com/dccp@vger.kernel.org/msg01040.html
      
      The only difference with regard to that strategy is that t_ipi is already
      used in the calculation of the nofeedback timeout, which saves one division.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
      1a21e49a
    • L
      Make sure we populate the initroot filesystem late enough · 8d610dd5
      Linus Torvalds 提交于
      We should not initialize rootfs before all the core initializers have
      run.  So do it as a separate stage just before starting the regular
      driver initializers.
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8d610dd5
    • L
      Make SLES9 "get_kernel_version" work on the kernel binary again · 8993780a
      Linus Torvalds 提交于
      As reported by Andy Whitcroft, at least the SLES9 initrd build process
      depends on getting the kernel version from the kernel binary.  It does
      that by simply trawling the binary and looking for the signature of the
      "linux_banner" string (the string "Linux version " to be exact. Which
      is really broken in itself, but whatever..)
      
      That got broken when the string was changed to allow /proc/version to
      change the UTS release information dynamically, and "get_kernel_version"
      thus returned "%s" (see commit a2ee8649:
      "[PATCH] Fix linux banner utsname information").
      
      This just restores "linux_banner" as a static string, which should fix
      the version finding.  And /proc/version simply uses a different string.
      
      To avoid wasting even that miniscule amount of memory, the early boot
      string should really be marked __initdata, but that just causes the same
      bug in SLES9 to re-appear, since it will then find other occurrences of
      "Linux version " first.
      
      Cc: Andy Whitcroft <apw@shadowen.org>
      Acked-by: NHerbert Poetzl <herbert@13thfloor.at>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Andrew Morton <akpm@osdl.org>
      Cc: Steve Fox <drfickle@us.ibm.com>
      Acked-by: NOlaf Hering <olaf@aepfle.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8993780a
  4. 11 12月, 2006 25 次提交
    • K
      [PPC] Fix compile failure do to introduction of PHY_POLL · d10f7348
      Kumar Gala 提交于
      PHY_POLL is defined in <linux/phy.h> include it in <linux/fsl_devices.h> so
      board code will have it defined.
      Signed-off-by: NKumar Gala <galak@kernel.crashing.org>
      d10f7348
    • J
      i2c: Discard the i2c algo del_bus wrappers · 3269711b
      Jean Delvare 提交于
      They are all only calling i2c_del_adapter, so we may as well do
      it directly.
      Signed-off-by: NJean Delvare <khali@linux-fr.org>
      3269711b
    • D
      i2c: Whitespace cleanups · 438d6c2c
      David Brownell 提交于
      Remove extraneous whitespace from various i2c headers and core files,
      like space-before-tab and whitespace at end of line.
      Signed-off-by: NDavid Brownell <dbrownell@users.sourceforge.net>
      Signed-off-by: NJean Delvare <khali@linux-fr.org>
      438d6c2c
    • J
      i2c: Add support for nested i2c bus locking · 6ea23039
      Jiri Kosina 提交于
      This patch adds the 'level' field into the i2c_adapter structure, which is 
      used to represent the 'logical' level of nesting for the purposes of 
      lockdep. This field is then used in the i2c_transfer() function, to 
      acquire the per-adapter bus_lock with correct nesting level.
      Signed-off-by: NJiri Kosina <jikos@jikos.cz>
      Signed-off-by: NJean Delvare <khali@linux-fr.org>
      6ea23039
    • V
      i2c: New Philips PNX bus driver · 41561f28
      Vitaly Wool 提交于
      New I2C bus driver for Philips ARM boards (Philips IP3204 I2C IP
      block). This I2C controller can be found on (at least) PNX010x,
      PNX52xx and PNX4008 Philips boards.
      Signed-off-by: NVitaly Wool <vitalywool@gmail.com>
      Signed-off-by: NJean Delvare <khali@linux-fr.org>
      41561f28
    • J
      i2c: Delete the broken i2c-ite bus driver · 51fd554b
      Jean Delvare 提交于
      The rest of the ITE8172 support was already removed from MIPS tree.
      Signed-off-by: NJean Delvare <khali@linux-fr.org>
      Signed-off-by: NYoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
      Acked-by: NRalf Baechle <ralf@linux-mips.org>
      51fd554b
    • J
      i2c: Update the list of driver IDs · 36cfb5cc
      Jean Delvare 提交于
      * A chip driver ID was assigned to the Radeon, while it is an adapter
        so it needs an i2c adapter ID.
      * The SAA7191 is a video decoder, not encoder.
      * The icspll driver is dead, and will never be ported to Linux 2.6.
      Signed-off-by: NJean Delvare <khali@linux-fr.org>
      36cfb5cc
    • A
      [PATCH] kvm: userspace interface · 6aa8b732
      Avi Kivity 提交于
      web site: http://kvm.sourceforge.net
      
      mailing list: kvm-devel@lists.sourceforge.net
        (http://lists.sourceforge.net/lists/listinfo/kvm-devel)
      
      The following patchset adds a driver for Intel's hardware virtualization
      extensions to the x86 architecture.  The driver adds a character device
      (/dev/kvm) that exposes the virtualization capabilities to userspace.  Using
      this driver, a process can run a virtual machine (a "guest") in a fully
      virtualized PC containing its own virtual hard disks, network adapters, and
      display.
      
      Using this driver, one can start multiple virtual machines on a host.
      
      Each virtual machine is a process on the host; a virtual cpu is a thread in
      that process.  kill(1), nice(1), top(1) work as expected.  In effect, the
      driver adds a third execution mode to the existing two: we now have kernel
      mode, user mode, and guest mode.  Guest mode has its own address space mapping
      guest physical memory (which is accessible to user mode by mmap()ing
      /dev/kvm).  Guest mode has no access to any I/O devices; any such access is
      intercepted and directed to user mode for emulation.
      
      The driver supports i386 and x86_64 hosts and guests.  All combinations are
      allowed except x86_64 guest on i386 host.  For i386 guests and hosts, both pae
      and non-pae paging modes are supported.
      
      SMP hosts and UP guests are supported.  At the moment only Intel
      hardware is supported, but AMD virtualization support is being worked on.
      
      Performance currently is non-stellar due to the naive implementation of the
      mmu virtualization, which throws away most of the shadow page table entries
      every context switch.  We plan to address this in two ways:
      
      - cache shadow page tables across tlb flushes
      - wait until AMD and Intel release processors with nested page tables
      
      Currently a virtual desktop is responsive but consumes a lot of CPU.  Under
      Windows I tried playing pinball and watching a few flash movies; with a recent
      CPU one can hardly feel the virtualization.  Linux/X is slower, probably due
      to X being in a separate process.
      
      In addition to the driver, you need a slightly modified qemu to provide I/O
      device emulation and the BIOS.
      
      Caveats (akpm: might no longer be true):
      
      - The Windows install currently bluescreens due to a problem with the
        virtual APIC.  We are working on a fix.  A temporary workaround is to
        use an existing image or install through qemu
      - Windows 64-bit does not work.  That's also true for qemu, so it's
        probably a problem with the device model.
      
      [bero@arklinux.org: build fix]
      [simon.kagstrom@bth.se: build fix, other fixes]
      [uril@qumranet.com: KVM: Expose interrupt bitmap]
      [akpm@osdl.org: i386 build fix]
      [mingo@elte.hu: i386 fixes]
      [rdreier@cisco.com: add log levels to all printks]
      [randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
      [anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
      Signed-off-by: NYaniv Kamay <yaniv@qumranet.com>
      Signed-off-by: NAvi Kivity <avi@qumranet.com>
      Cc: Simon Kagstrom <simon.kagstrom@bth.se>
      Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
      Signed-off-by: NUri Lublin <uril@qumranet.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAnthony Liguori <anthony@codemonkey.ws>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6aa8b732
    • D
      [PATCH] clocksource: small cleanup · f5f1a24a
      Daniel Walker 提交于
      Mostly changing alignment.  Just some general cleanup.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: NDaniel Walker <dwalker@mvista.com>
      Acked-by: NJohn Stultz <johnstul@us.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f5f1a24a
    • A
      [PATCH] round_jiffies infrastructure · 4c36a5de
      Arjan van de Ven 提交于
      Introduce a round_jiffies() function as well as a round_jiffies_relative()
      function.  These functions round a jiffies value to the next whole second.
      The primary purpose of this rounding is to cause all "we don't care exactly
      when" timers to happen at the same jiffy.
      
      This avoids multiple timers firing within the second for no real reason;
      with dynamic ticks these extra timers cause wakeups from deep sleep CPU
      sleep states and thus waste power.
      
      The exact wakeup moment is skewed by the cpu number, to avoid all cpus from
      waking up at the exact same time (and hitting the same lock/cachelines
      there)
      
      [akpm@osdl.org: fix variable type]
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4c36a5de
    • V
      [PATCH] fdtable: Implement new pagesize-based fdtable allocator · 5466b456
      Vadim Lobanov 提交于
      This patch provides an improved fdtable allocation scheme, useful for
      expanding fdtable file descriptor entries.  The main focus is on the fdarray,
      as its memory usage grows 128 times faster than that of an fdset.
      
      The allocation algorithm sizes the fdarray in such a way that its memory usage
      increases in easy page-sized chunks. The overall algorithm expands the allowed
      size in powers of two, in order to amortize the cost of invoking vmalloc() for
      larger allocation sizes. Namely, the following sizes for the fdarray are
      considered, and the smallest that accommodates the requested fd count is
      chosen:
      
          pagesize / 4
          pagesize / 2
          pagesize      <- memory allocator switch point
          pagesize * 2
          pagesize * 4
          ...etc...
      
      Unlike the current implementation, this allocation scheme does not require a
      loop to compute the optimal fdarray size, and can be done in efficient
      straightline code.
      
      Furthermore, since the fdarray overflows the pagesize boundary long before any
      of the fdsets do, it makes sense to optimize run-time by allocating both
      fdsets in a single swoop.  Even together, they will still be, by far, smaller
      than the fdarray.  The fdtable->open_fds is now used as the anchor for the
      fdset memory allocation.
      Signed-off-by: NVadim Lobanov <vlobanov@speakeasy.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5466b456
    • V
      [PATCH] fdtable: Remove the free_files field · 4fd45812
      Vadim Lobanov 提交于
      An fdtable can either be embedded inside a files_struct or standalone (after
      being expanded).  When an fdtable is being discarded after all RCU references
      to it have expired, we must either free it directly, in the standalone case,
      or free the files_struct it is contained within, in the embedded case.
      
      Currently the free_files field controls this behavior, but we can get rid of
      it entirely, as all the necessary information is already recorded.  We can
      distinguish embedded and standalone fdtables using max_fds, and if it is
      embedded we can divine the relevant files_struct using container_of().
      Signed-off-by: NVadim Lobanov <vlobanov@speakeasy.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4fd45812
    • V
      [PATCH] fdtable: Make fdarray and fdsets equal in size · bbea9f69
      Vadim Lobanov 提交于
      Currently, each fdtable supports three dynamically-sized arrays of data: the
      fdarray and two fdsets.  The code allows the number of fds supported by the
      fdarray (fdtable->max_fds) to differ from the number of fds supported by each
      of the fdsets (fdtable->max_fdset).
      
      In practice, it is wasteful for these two sizes to differ: whenever we hit a
      limit on the smaller-capacity structure, we will reallocate the entire fdtable
      and all the dynamic arrays within it, so any delta in the memory used by the
      larger-capacity structure will never be touched at all.
      
      Rather than hogging this excess, we shouldn't even allocate it in the first
      place, and keep the capacities of the fdarray and the fdsets equal.  This
      patch removes fdtable->max_fdset.  As an added bonus, most of the supporting
      code becomes simpler.
      Signed-off-by: NVadim Lobanov <vlobanov@speakeasy.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      bbea9f69
    • R
      [PATCH] md: allow reads that have bypassed the cache to be retried on failure · 46031f9a
      Raz Ben-Jehuda(caro) 提交于
      If a bypass-the-cache read fails, we simply try again through the cache.  If
      it fails again it will trigger normal recovery precedures.
      
      update 1:
      
      From: NeilBrown <neilb@suse.de>
      
      1/
        chunk_aligned_read and retry_aligned_read assume that
            data_disks == raid_disks - 1
        which is not true for raid6.
        So when an aligned read request bypasses the cache, we can get the wrong data.
      
      2/ The cloned bio is being used-after-free in raid5_align_endio
         (to test BIO_UPTODATE).
      
      3/ We forgot to add rdev->data_offset when submitting
         a bio for aligned-read
      
      4/ clone_bio calls blk_recount_segments and then we change bi_bdev,
         so we need to invalidate the segment counts.
      
      5/ We don't de-reference the rdev when the read completes.
         This means we need to record the rdev to so it is still
         available in the end_io routine.  Fortunately
         bi_next in the original bio is unused at this point so
         we can stuff it in there.
      
      6/ We leak a cloned bio if the target rdev is not usable.
      
      From: NeilBrown <neilb@suse.de>
      
      update 2:
      
      1/ When aligned requests fail (read error) they need to be retried
         via the normal method (stripe cache).  As we cannot be sure that
         we can process a single read in one go (we may not be able to
         allocate all the stripes needed) we store a bio-being-retried
         and a list of bioes-that-still-need-to-be-retried.
         When find a bio that needs to be retried, we should add it to
         the list, not to single-bio...
      
      2/ We were never incrementing 'scnt' when resubmitting failed
         aligned requests.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      46031f9a
    • A
      [PATCH] ide-cd: Handle strange interrupt on the Intel ESB2 · ee2f344b
      Alan Cox 提交于
      The ESB2 appears to emit spurious DMA interrupts when configured for native
      mode and handling ATAPI devices.  Stratus were able to pin this bug down and
      produce a patch.  This is a rework which applies the fixup only to the ESB2
      (for now).  We can apply it to other chips later if the same problem is found.
      
      This code has been tested and confirmed to fix the problem on the tested
      systems.
      Signed-off-by: NAlan Cox <alan@redhat.com>
      (Most of the hard work done by Stratus however)
      Cc: Jens Axboe <axboe@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ee2f344b
    • C
      [PATCH] sched: remove lb_stopbalance counter · 06066714
      Chen, Kenneth W 提交于
      Remove scheduler stats lb_stopbalance counter.  This counter can be
      calculated by: lb_balanced - lb_nobusyg - lb_nobusyq.  There is no need to
      create gazillion counters while we can derive the value.
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      06066714
    • S
      [PATCH] sched: decrease number of load balances · 783609c6
      Siddha, Suresh B 提交于
      Currently at a particular domain, each cpu in the sched group will do a
      load balance at the frequency of balance_interval.  More the cores and
      threads, more the cpus will be in each sched group at SMP and NUMA domain.
      And we endup spending quite a bit of time doing load balancing in those
      domains.
      
      Fix this by making only one cpu(first idle cpu or first cpu in the group if
      all the cpus are busy) in the sched group do the load balance at that
      particular sched domain and this load will slowly percolate down to the
      other cpus with in that group(when they do load balancing at lower
      domains).
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      783609c6
    • C
      [PATCH] sched: add option to serialize load balancing · 08c183f3
      Christoph Lameter 提交于
      Large sched domains can be very expensive to scan.  Add an option SD_SERIALIZE
      to the sched domain flags.  If that flag is set then we make sure that no
      other such domain is being balanced.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      08c183f3
    • C
      [PATCH] sched: use softirq for load balancing · c9819f45
      Christoph Lameter 提交于
      Call rebalance_tick (renamed to run_rebalance_domains) from a newly introduced
      softirq.
      
      We calculate the earliest time for each layer of sched domains to be rescanned
      (this is the rescan time for idle) and use the earliest of those to schedule
      the softirq via a new field "next_balance" added to struct rq.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c9819f45
    • S
      [PATCH] sched domain: increase the SMT busy rebalance interval · ac7d5504
      Siddha, Suresh B 提交于
      With SMT, if the logical processor is busy, load balance happens for every
      8msec(min)-16msec(max).  There is no need to do this often, as this is just
      for fairness(to maintain uniform runqueue lengths) and default time slice
      anyhow is 100msec.
      
      Appended patch increases this interval to 64msec(min)-128msec(max) when the
      logical processor is busy.
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ac7d5504
    • A
      [PATCH] io-accounting: via taskstats · 4a7864ca
      Andrew Morton 提交于
      Deliver IO accounting via taskstats.
      
      Cc: Jay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Cc: David Wright <daw@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4a7864ca
    • A
      [PATCH] cleanup taskstats.h · f2f1f8a3
      Andrew Morton 提交于
      Fix weird whitespace mangling in taskstats.h
      
      Cc: Jay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Cc: David Wright <daw@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f2f1f8a3
    • A
      [PATCH] io-accounting: core statistics · 7c3ab738
      Andrew Morton 提交于
      The present per-task IO accounting isn't very useful.  It simply counts the
      number of bytes passed into read() and write().  So if a process reads 1MB
      from an already-cached file, it is accused of having performed 1MB of I/O,
      which is wrong.
      
      (David Wright had some comments on the applicability of the present logical IO accounting:
      
        For billing purposes it is useless but for workload analysis it is very
        useful
      
        read_bytes/read_calls  average read request size
        write_bytes/write_calls average write request size
      
        read_bytes/read_blocks ie logical/physical can indicate hit rate or thrashing
        write_bytes/write_blocks  ie logical/physical  guess since pdflush writes can
                                                      be missed
      
        I often look for logical larger than physical to see filesystem cache
        problems.  And the bytes/cpusec can help find applications that are
        dominating the cache and causing slow interactive response from page cache
        contention.
      
        I want to find the IO intensive applications and make sure they are doing
        efficient IO.  Thus the acctcms(sysV) or csacms command would give the high
        IO commands).
      
      This patchset adds new accounting which tries to be more accurate.  We account
      for three things:
      
      reads:
      
        attempt to count the number of bytes which this process really did cause
        to be fetched from the storage layer.  Done at the submit_bio() level, so it
        is accurate for block-backed filesystems.  I also attempt to wire up NFS and
        CIFS.
      
      writes:
      
        attempt to count the number of bytes which this process caused to be sent
        to the storage layer.  This is done at page-dirtying time.
      
        The big inaccuracy here is truncate.  If a process writes 1MB to a file
        and then deletes the file, it will in fact perform no writeout.  But it will
        have been accounted as having caused 1MB of write.
      
        So...
      
      cancelled_writes:
      
        account the number of bytes which this process caused to not happen, by
        truncating pagecache.
      
        We _could_ just subtract this from the process's `write' accounting.  But
        that means that some processes would be reported to have done negative
        amounts of write IO, which is silly.
      
        So we just report the raw number and punt this decision up to userspace.
      
      Now, we _could_ account for writes at the physical I/O level.  But
      
      - This would require that we track memory-dirtying tasks at the per-page
        level (would require a new pointer in struct page).
      
      - It would mean that IO statistics for a process are usually only available
        long after that process has exitted.  Which means that we probably cannot
        communicate this info via taskstats.
      
      This patch:
      
      Wire up the kernel-private data structures and the accessor functions to
      manipulate them.
      
      Cc: Jay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Cc: David Wright <daw@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7c3ab738
    • D
      [PATCH] Fix noise in futex.h · 58f64d83
      David Woodhouse 提交于
      There are some kernel-only bits in the middle of <linux/futex.h> which
      should be removed in what we export to userspace.
      Signed-off-by: NDavid Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      58f64d83
    • A
      [PATCH] sysctl: remove unused "context" param · 1f29bcd7
      Alexey Dobriyan 提交于
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1f29bcd7