March 27, 2006 (40 commits)
    • [PATCH] kprobes: fix broken fault handling for ia64 · c04c1c81
      Committed by Prasanna S Panchamukhi
      Provide proper kprobes fault handling for the case where user-specified
      pre/post handlers try to access user address space through
      copy_from_user(), get_user(), etc.
      
      The user-specified fault handler gets called only if the fault occurs
      while executing the user-specified handlers.  In such a case the
      user-specified fault handler is allowed to fix it first; if it does
      not fix it, we try to fix it by calling fixup_exception().
      
      The user-specified fault handler will not be called if the fault
      happens while single-stepping the original instruction; instead we
      reset the current probe and allow the system page fault handler to fix
      it up.
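      As a sketch of the interface this protects (the user pointer and
      handler bodies below are hypothetical, and era-appropriate headers are
      assumed):

      #include <linux/kprobes.h>
      #include <asm/uaccess.h>

      static void __user *my_uaddr;   /* hypothetical user-space pointer */

      /* Pre-handler that touches user address space and may fault. */
      static int my_pre_handler(struct kprobe *p, struct pt_regs *regs)
      {
              char buf[8];

              if (copy_from_user(buf, my_uaddr, sizeof(buf)))
                      printk(KERN_INFO "fault occurred and was fixed up\n");
              return 0;
      }

      /* Called first if a fault occurs inside the handlers above; return 1
       * if the fault was handled here, 0 to fall through to
       * fixup_exception() as described in this patch. */
      static int my_fault_handler(struct kprobe *p, struct pt_regs *regs,
                                  int trapnr)
      {
              return 0;
      }

      static struct kprobe my_kp = {
              .pre_handler   = my_pre_handler,
              .fault_handler = my_fault_handler,
      };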
      Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Acked-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c04c1c81
    • [PATCH] kprobes: fix broken fault handling for powerpc64 · 50e21f2b
      Committed by Prasanna S Panchamukhi
      Provide proper kprobes fault handling for the case where user-specified
      pre/post handlers try to access user address space through
      copy_from_user(), get_user(), etc.
      
      The user-specified fault handler gets called only if the fault occurs
      while executing the user-specified handlers.  In such a case the
      user-specified fault handler is allowed to fix it first; if it does
      not fix it, we try to fix it by calling fixup_exception().
      
      The user-specified fault handler will not be called if the fault
      happens while single-stepping the original instruction; instead we
      reset the current probe and allow the system page fault handler to fix
      it up.
      Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      50e21f2b
    • [PATCH] kprobes: fix broken fault handling for x86_64 · c28f8966
      Committed by Prasanna S Panchamukhi
      Provide proper kprobes fault handling for the case where user-specified
      pre/post handlers try to access user address space through
      copy_from_user(), get_user(), etc.
      
      The user-specified fault handler gets called only if the fault occurs
      while executing the user-specified handlers.  In such a case the
      user-specified fault handler is allowed to fix it first; if it does
      not fix it, we try to fix it by calling fixup_exception().
      
      The user-specified fault handler will not be called if the fault
      happens while single-stepping the original instruction; instead we
      reset the current probe and allow the system page fault handler to fix
      it up.
      Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c28f8966
    • [PATCH] kprobes: fix broken fault handling for i386 · b4026513
      Committed by Prasanna S Panchamukhi
      Provide proper kprobes fault handling for the case where user-specified
      pre/post handlers try to access user address space through
      copy_from_user(), get_user(), etc.
      
      The user-specified fault handler gets called only if the fault occurs
      while executing the user-specified handlers.  In such a case the
      user-specified fault handler is allowed to fix it first; if it does
      not fix it, we try to fix it by calling fixup_exception().
      
      The user-specified fault handler will not be called if the fault
      happens while single-stepping the original instruction; instead we
      reset the current probe and allow the system page fault handler to fix
      it up.
      Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b4026513
    • [PATCH] kprobe handler: discard user space trap · 2326c770
      Committed by bibo,mao
      Currently kprobe handler traps can only happen in kernel space, so the
      function kprobe_exceptions_notify() should skip traps which happen in
      user space.  This patch implements that; it is based on 2.6.16-rc4.
      Signed-off-by: bibo mao <bibo.mao@intel.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
      Cc: <hiramatu@sdl.hitachi.co.jp>
      Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      2326c770
    • [PATCH] kretprobe instance recycled by parent process · c6fd91f0
      Committed by bibo mao
      When kretprobe probes the schedule() function, if the probed process
      exits then schedule() will never return, so some kretprobe instances
      will never be recycled.
      
      With this patch the parent process will recycle the kretprobe
      instances of the probed function, so there is no memory leak of
      kretprobe instances.
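      For reference, a minimal sketch of the kind of probe affected (the
      handler body and maxactive value are hypothetical):

      #include <linux/kprobes.h>
      #include <linux/sched.h>

      static int my_ret_handler(struct kretprobe_instance *ri,
                                struct pt_regs *regs)
      {
              /* Runs when the probed function returns, after which the
               * instance is recycled.  If the probed task exits inside
               * schedule(), this never runs -- the leak fixed here. */
              return 0;
      }

      static struct kretprobe my_rp = {
              .kp.addr   = (kprobe_opcode_t *)schedule,
              .handler   = my_ret_handler,
              .maxactive = 20,        /* size of the instance pool */
      };
      /* register_kretprobe(&my_rp); */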
      Signed-off-by: bibo mao <bibo.mao@intel.com>
      Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c6fd91f0
    • [PATCH] kretprobe: kretprobe-booster · c9becf58
      Committed by Masami Hiramatsu
      In normal operation, kretprobe makes a target function return to
      trampoline code, in which a kprobe (called trampoline_probe) has been
      inserted.  When the kernel hits this kprobe, it calls the kretprobe's
      handler and then returns to the original return address.
      
      Kretprobe-booster removes the trampoline_probe.  It allows the
      trampoline code to call the kretprobe's handler directly instead of
      invoking a kprobe.  The trampoline code then returns to the original
      return address.
      
      (changelog from Chuck Ebbert <76306.1226@compuserve.com> - thanks ;))
      Signed-off-by: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Chuck Ebbert <76306.1226@compuserve.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c9becf58
    • [PATCH] x86: kprobes-booster · 311ac88f
      Committed by Masami Hiramatsu
      Currently kprobes copies the original instruction at the probe point
      and replaces it with a breakpoint instruction (int3).  When the kernel
      hits the probe point, the kprobe handler is invoked, and the copied
      instruction is single-step executed in the copy buffer (not at the
      original address) by kprobes.  After that, kprobes checks the
      registers and modifies them (if needed) as if the instruction had been
      executed at the original address.
      
      My proposal is based on the fact that there are many instructions
      which do NOT require register modification after single-step
      execution.  When the copied instruction is one of them, kprobes can
      just jump back to the next instruction after single-step execution.
      If so, why don't we execute those instructions directly?
      
      With the kprobe-booster patch, kprobes will execute a copied
      instruction directly and (if needed) jump back to the original code.
      This direct execution is used when the kprobe has neither a
      post_handler nor a break_handler, and the copied instruction can be
      executed directly.
      
      I sorted instructions by whether they can be executed directly or not:
      
      - Call instructions are NG (cannot be executed directly).
        We would have to correct the return address pushed onto the
        top of the stack.
      - Indirect instructions, except for absolute indirect jumps,
        are NG.  Those instructions change EIP unpredictably; we
        would have to check EIP and correct it.
      - Instructions that change EIP beyond the range of the
        instruction buffer are NG.
      - Instructions that change EIP into the tail 5 bytes of the
        instruction buffer (the size of a jump instruction) are NG.
        We must write a jump instruction which goes back to the
        original kernel code into the instruction buffer.
      - The breakpoint instruction is NG.  We should not touch EIP
        and should pass control to other handlers.
      - Absolute direct/indirect jumps are OK.
      - Conditional jumps are NG.
      - Halt and software interrupts are NG, because execution would
        stay in the instruction buffer of kprobes.
      - Prefixes are NG.
      - Unknown/reserved opcodes are NG.
      - Other 1-byte instructions are OK, but they need jump-back
        code.
      - 2-byte instructions are mapped sparsely, so this patch does
        not boost them in this release.
      
      From Intel's IA-32 opcode map, described in the IA-32 Intel
      Architecture Software Developer's Manual Vol. 2B, I determined that
      the following opcodes are not boostable:
      
      - 0FH (2-byte escape)
      - 70H-7FH (jump on condition)
      - 9AH (call) and 9CH (pushf)
      - C0H-C1H (Grp2: includes reserved opcodes)
      - C6H-C7H (Grp11: includes reserved opcodes)
      - CCH-CEH (software interrupts)
      - D0H-D3H (Grp2: includes reserved opcodes)
      - D6H (reserved)
      - D8H-DFH (coprocessor)
      - E0H-E3H (loop/conditional jump)
      - E8H (call)
      - F0H-F3H (prefixes and reserved)
      - F4H (halt)
      - F6H-F7H (Grp3: includes reserved opcodes)
      - FEH-FFH (Grp4,5: includes reserved opcodes)
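      Purely as an illustration of that table (this is not the patch's
      actual code), a checker for one-byte opcodes could look like:

      /* Sketch: returns 0 if the opcode is on the NG list above and must
       * not be boosted, 1 otherwise (uses gcc case ranges). */
      static int can_boost_1byte(unsigned char opcode)
      {
              switch (opcode) {
              case 0x0f:              /* 2-byte escape              */
              case 0x70 ... 0x7f:     /* jump on condition          */
              case 0x9a: case 0x9c:   /* call far, pushf            */
              case 0xc0: case 0xc1:   /* Grp2: reserved opcodes     */
              case 0xc6: case 0xc7:   /* Grp11: reserved opcodes    */
              case 0xcc ... 0xce:     /* software interrupts        */
              case 0xd0 ... 0xd3:     /* Grp2: reserved opcodes     */
              case 0xd6:              /* reserved                   */
              case 0xd8 ... 0xdf:     /* coprocessor                */
              case 0xe0 ... 0xe3:     /* loop/conditional jump      */
              case 0xe8:              /* call                       */
              case 0xf0 ... 0xf3:     /* prefixes and reserved      */
              case 0xf4:              /* halt                       */
              case 0xf6: case 0xf7:   /* Grp3: reserved opcodes     */
              case 0xfe: case 0xff:   /* Grp4,5: reserved opcodes   */
                      return 0;
              default:
                      return 1;
              }
      }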
      
      Kprobe-booster checks whether the target instruction can be boosted
      (executed directly) in the arch_copy_kprobe() function.  If the target
      instruction can be boosted, it clears the "boostable" flag; if not, it
      sets the "boostable" flag to -1, the disabled state.  In the
      resume_execution() function, if the "boostable" flag is cleared,
      kprobe-booster measures the size of the target instruction and sets
      the "boostable" flag to 1.
      
      In kprobe_handler(), kprobes checks the "boostable" flag.  If the flag
      is 1, it resets the current kprobe and executes the instruction buffer
      directly instead of single stepping.
      
      When unregistering a boosted kprobe, synchronize_sched() is called
      after the "int3" is removed, which ensures the following:
      - interrupt handlers have finished on all CPUs;
      - the instruction buffer is not being executed on any CPU.
      The boosted kprobe can then be released safely.
      
      Also, on a preemptible kernel, the booster is not enabled where kernel
      preemption is enabled, so there are no preempted threads on the
      instruction buffer.
      
      The description of kretprobe-booster:
      ====================================
      
      In normal operation, kretprobe makes a target function return to
      trampoline code, in which a kprobe (called trampoline_probe) has been
      inserted.  When the kernel hits this kprobe, it calls the kretprobe's
      handler and returns to the original return address.
      
      The kretprobe-booster patch removes the trampoline_probe.  It allows
      the trampoline code to call the kretprobe's handler directly instead
      of invoking a kprobe, and the trampoline code then returns to the
      original return address.
      
      This new trampoline code stores and restores registers, so the
      kretprobe handler is still able to access those registers.
      
      Current kprobes has about 1.3 usec/probe(*) overhead, and the
      kprobe-booster patch reduces it to 0.6 usec/probe(*).  Current
      kretprobe has about 2.0 usec/probe(*) overhead; the kprobe-booster
      patch reduces it to 1.3 usec/probe(*), and the combination of the
      kprobe-booster and kretprobe-booster patches reduces it to 0.9
      usec/probe(*).
      
      I expect the combination of both patches to cut probing overhead
      roughly in half.
      
      Performance numbers strongly depend on the processor model.
      
      Andrew Morton wrote:
      > These preempt tricks look rather nasty.  Can you please describe what the
      > problem is, precisely?  And how this code avoids it?  Perhaps we can find
      > something cleaner.
      
      The problem is how to remove the copied instructions of the
      kprobe *safely* on the preemptable kernel (CONFIG_PREEMPT=y).
      
      Kprobes basically executes the following actions:
      
      (1) int3
      (2) preempt_disable()
      (3) kprobe_prehandler()
      (4) copied instruction (single step)
      (5) kprobe_posthandler()
      (6) preempt_enable()
      (7) return to the original code
      
      During the execution of the copied instruction, preemption is
      disabled (from step (2) to step (6)).
      When unregistering the probes, kprobes waits for an RCU quiescent
      state by using synchronize_sched() after removing the int3
      instruction.  Thus we can ensure the copied instruction is not being
      executed.
      
      On the other hand, kprobe-booster executes the following actions:
      
      (1) int3
      (2) preempt_disable()
      (3) kprobe_prehandler()
      (4) preempt_enable()             <-- this one is added by my patch
      (5) copied instruction (direct execution)
      (6) jmp back to the original code
      
      The problem is that we have no way to prevent preemption at step (5)
      or (6).  We cannot call preempt_disable() after step (6), because
      there is no room to do that.  Thus, some other processes may be
      preempted at step (5) or (6) on a preemptable kernel.
      And I couldn't find an easy way to ensure that other processes'
      stacks do *not* hold the address of the buffer.  (I thought of some
      ways to do that, but they are all very costly.)
      
      So currently, I simply boost the kprobe only when the probe point
      already has preemption disabled.
      
      > Also, the patch adds a preempt_enable() but I don't see a corresponding
      > preempt_disable().  Am I missing something?
      
      It corresponds to the preempt_disable() at the top of
      kprobe_handler().
      I have copied the code of kprobe_handler() here:
      
      static int __kprobes kprobe_handler(struct pt_regs *regs)
      {
              struct kprobe *p;
              int ret = 0;
              kprobe_opcode_t *addr = NULL;
              unsigned long *lp;
              struct kprobe_ctlblk *kcb;
      
              /*
               * We don't want to be preempted for the entire
               * duration of kprobe processing
               */
              preempt_disable();             <-- HERE
              kcb = get_kprobe_ctlblk();
      Signed-off-by: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      311ac88f
    • [PATCH] kprobes: clean up resume_execute() · b50ea74c
      Committed by Masami Hiramatsu
      Clean up kprobe's resume_execute() for the i386 arch.
      Signed-off-by: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b50ea74c
    • [PATCH] hrtimers: remove data field · 05cfb614
      Committed by Roman Zippel
      The nanosleep cleanup allows the removal of the data field of hrtimer.
      The callback function can use container_of() to get its own data.
      Since the hrtimer structure is embedded in other structures anyway,
      this adds no overhead.
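      For example, a sketch with hypothetical names, using the resulting
      callback signature:

      #include <linux/hrtimer.h>

      struct my_dev {
              struct hrtimer  timer;  /* embedded; no ->data needed */
              int             pending;
      };

      static int my_timer_fn(struct hrtimer *timer)
      {
              /* Recover the enclosing structure instead of timer->data. */
              struct my_dev *dev = container_of(timer, struct my_dev, timer);

              dev->pending = 0;
              return HRTIMER_NORESTART;
      }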
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      05cfb614
    • [PATCH] hrtimers: remove nsec_t typedef · df869b63
      Committed by Roman Zippel
      nsec_t predates ktime_t and has mostly been superseded by it.  In the
      few places that are left, it's better to make it explicit that we're
      dealing with 64-bit values here.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: John Stultz <johnstul@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      df869b63
    • [PATCH] hrtimers: remove DEFINE_KTIME and ktime_to_clock_t() · 272705c5
      Committed by Roman Zippel
      Now that it_real_value is gone, the last users of DEFINE_KTIME and
      ktime_to_clock_t() are also gone, so remove them before someone starts
      using them again.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      272705c5
    • [PATCH] hrtimers: remove it_real_value calculation from proc/*/stat · 4dee26b7
      Committed by Roman Zippel
      Remove it_real_value from /proc/*/stat; 1.2.x was the last time it
      returned useful data (when it was directly maintained by the
      scheduler), and now calculating it is only a waste of time.  Return 0
      instead.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4dee26b7
    • [PATCH] hrtimers: remove state field · b75f7a51
      Committed by Roman Zippel
      Remove the state field and encode this information in the rb_node,
      similar to the normal timers.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b75f7a51
    • [PATCH] hrtimers: simplify nanosleep · 432569bb
      Committed by Roman Zippel
      nanosleep is the only user of the expired state, so let it manage this itself,
      which makes the hrtimer code a bit simpler.  The remaining time is also only
      calculated if requested.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      432569bb
    • [PATCH] hrtimers: posix-timer: cleanup common_timer_get() · 3b98a532
      Committed by Roman Zippel
      Clean up common_timer_get() a little.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3b98a532
    • [PATCH] hrtimers: pass current time to hrtimer_forward() · 44f21475
      Committed by Roman Zippel
      Pass the current time to hrtimer_forward().  This allows the softirq
      time in the timer base to be used when the forward function is called
      from the timer callback.  Other places pass the current time with a
      call to timer->base->get_time().
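      A sketch of the convention (the interval and handler are
      hypothetical):

      static ktime_t my_interval;     /* hypothetical timer period */

      static int my_periodic_fn(struct hrtimer *timer)
      {
              /* "now" is an explicit argument; a callback running from the
               * softirq could pass the base's softirq time instead. */
              hrtimer_forward(timer, timer->base->get_time(), my_interval);
              return HRTIMER_RESTART;
      }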
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      44f21475
    • [PATCH] hrtimers: optimize softirq runqueues · 92127c7a
      Committed by Thomas Gleixner
      The hrtimer softirq is called from the timer softirq every tick.
      Retrieve the current time from xtime and wall_to_monotonic instead of
      calling base->get_time() for each timer base.  Store the time in the
      base structure, and provide a hook for when clock source abstractions
      are in place, to keep the code open for new base clocks.
      
      Based on a patch from: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      92127c7a
    • [PATCH] ext3: "nobh" writeback support for filesystems blocksize < pagesize · a0e92852
      Committed by Badari Pulavarty
      There is no valid reason why we can't support the "nobh" option for
      filesystems with blocksize != PAGE_SIZE.
      
      This patch lets them use the "nobh" option in writeback mode for
      blocksize < pagesize.
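      (With this change, such a filesystem can, for example, be mounted with
      the existing "data=writeback,nobh" mount options.)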
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a0e92852
    • [PATCH] ext3: multi-block get_block() · f91a2ad2
      Committed by Badari Pulavarty
      Mingming Cao recently added multiple block allocation support for
      ext3, currently used only by DIO.  I added support to map multiple
      blocks for mpage_readpages().  This patch adds support for
      ext3_get_block() to deal with multi-block mapping; basically it
      renames ext3_direct_io_get_blocks() to ext3_get_block().
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f91a2ad2
    • [PATCH] ext3: cleanups and WARN_ON() · d6859bfc
      Committed by Andrew Morton
      - Clean up a few little layout things and comments.
      
      - Add a WARN_ON to a case which I was wondering about.
      
      - Tune up some inlines.
      
      Cc: Mingming Cao <cmm@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d6859bfc
    • [PATCH] remove ->get_blocks() support · 1d8fa7a2
      Committed by Badari Pulavarty
      Now that get_block() can handle mapping multiple disk blocks, there is
      no need to have ->get_blocks().  This patch removes the fs-specific
      ->get_blocks() added for DIO and makes its users use get_block()
      instead.
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1d8fa7a2
    • [PATCH] map multiple blocks for mpage_readpages() · fa30bd05
      Committed by Badari Pulavarty
      This patch changes mpage_readpages() and get_block() to get the disk
      mapping information for multiple blocks at the same time.
      
      b_size represents the amount of disk mapping that needs to be mapped.
      On a successful get_block(), b_size indicates the amount of disk
      mapping that was actually mapped.  Only the filesystems that care to
      use this information and can provide multiple disk blocks at a time
      choose to do so.
      
      No changes are needed for filesystems that want to ignore this.
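      A sketch of the resulting contract from the filesystem side (the
      contiguous-mapping helper is hypothetical):

      #include <linux/buffer_head.h>

      static int my_get_block(struct inode *inode, sector_t iblock,
                              struct buffer_head *bh_result, int create)
      {
              /* On entry, b_size says how much the caller wants mapped. */
              unsigned long max_blocks = bh_result->b_size >> inode->i_blkbits;
              unsigned long got;
              sector_t phys;

              got = my_map_contiguous(inode, iblock, max_blocks, &phys);
              if (!got)                       /* helper is hypothetical */
                      return -EIO;
              map_bh(bh_result, inode->i_sb, phys);
              /* On exit, b_size says how much was actually mapped. */
              bh_result->b_size = got << inode->i_blkbits;
              return 0;
      }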
      
      [akpm@osdl.org: cleanups]
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fa30bd05
    • [PATCH] pass b_size to ->get_block() · b0cf2321
      Committed by Badari Pulavarty
      Pass the amount of disk that needs to be mapped to get_block().  This
      way one can modify the fs ->get_block() functions to map multiple
      blocks at the same time.
      
      [akpm@osdl.org: performance tweak]
      [akpm@osdl.org: remove unneeded assignments]
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b0cf2321
    • [PATCH] change buffer_head.b_size to size_t · 205f87f6
      Committed by Badari Pulavarty
      Increase the size of the buffer_head b_size field (only) for 64-bit
      platforms.  Update some old and moldy comments in and around the
      structure as well.
      
      The b_size increase allows us to perform larger mappings and
      allocations for large I/O requests from userspace, which ties in with
      other changes allowing the get_block_t() interface to map multiple
      blocks at once.
      Signed-off-by: Nathan Scott <nathans@sgi.com>
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      205f87f6
    • [PATCH] ext3_get_blocks: Adjust reservation window size for mblocks · d48589bf
      Committed by Mingming Cao
      Optimize the block reservation and multiple block allocation: with
      knowledge of the total number of blocks ahead, set or adjust the
      reservation window size properly (based on the number of blocks
      needed) before block allocation happens.  If there isn't any
      reservation yet, make sure the reservation window is equal to or
      greater than the number of blocks needed before creating the
      reservation window; if a reservation window already exists, try to
      extend the window to match the number of blocks to allocate.  This
      increases the possibility of completing a multiple block allocation in
      a single request, as blocks are only allocated within the range of the
      inode's reservation window.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d48589bf
    • [PATCH] ext3_get_blocks: Adjust accounting info in ext3_new_blocks() · faa56976
      Committed by Mingming Cao
      Update accounting information (quota, boundary checks, free block
      counts, etc.) in ext3_new_blocks().
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      faa56976
    • [PATCH] ext3_get_blocks: support multiple blocks allocation in ext3_new_block() · b54e41ec
      Committed by Mingming Cao
      Change ext3_try_to_allocate() (called via ext3_new_blocks()) to try to
      allocate the requested number of blocks on a best-effort basis: after
      allocating the first block, it always attempts to allocate the next
      few adjacent blocks (up to the requested size, and not beyond the
      reservation window) at the same time.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b54e41ec
    • [PATCH] ext3_get_blocks: multiple block allocation · b47b2478
      Committed by Mingming Cao
      Add support for multiple block allocation in ext3_get_blocks().
      
      Look up the disk block mapping and count the total number of blocks to
      allocate, then pass that to ext3_new_block(), where the real block
      allocation is performed.  Once multiple blocks are allocated, prepare
      the branch with the just-allocated blocks' info and finally splice the
      whole branch into the block mapping tree.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b47b2478
    • [PATCH] ext3_get_blocks: Mapping multiple blocks at a once · 89747d36
      Committed by Mingming Cao
      Currently ext3_get_block() only maps or allocates one block at a time.
      This is quite inefficient for sequential I/O workloads.
      
      I have previously posted an early implementation of simple multiple
      block mapping and allocation for current ext3.  The basic idea is to
      allocate the first block in the existing way and attempt to allocate
      the next adjacent blocks on a best-effort basis.  More description of
      the implementation can be found here:
      http://marc.theaimsgroup.com/?l=ext2-devel&m=112162230003522&w=2
      
      The following is the latest version of the patch: it breaks the
      original patch into 5 patches, reworks some logic, and fixes some
      bugs.  The breakdown is:
      
       [patch 1] Add mapping of multiple blocks at a time in ext3_get_blocks()
       [patch 2] Extend ext3_get_blocks() to support multiple block allocation
       [patch 3] Implement multiple block allocation in ext3_try_to_allocate()
       (called via ext3_new_block())
       [patch 4] Proper accounting updates in ext3_new_blocks()
       [patch 5] Adjust the reservation window size properly (by the given
       number of blocks to allocate) before block allocation, to increase
       the possibility of allocating multiple blocks in a single call
      
      Tests done so far include fsx, tiobench and dbench.  The following
      numbers, collected from direct I/O tests (1G file creation/read), show
      that system time has been greatly reduced (more than 50% on my 8-CPU
      system) with the patches.
      
       1G file DIO write:
       	2.6.15		2.6.15+patches
       real    0m31.275s	0m31.161s
       user    0m0.000s	0m0.000s
       sys     0m3.384s	0m0.564s
      
       1G file DIO read:
       	2.6.15		2.6.15+patches
       real    0m30.733s	0m30.624s
       user    0m0.000s	0m0.004s
       sys     0m0.748s	0m0.380s
      
      Some previous tests we did on buffered I/O using multiple block
      allocation and delayed allocation showed noticeable improvements in
      throughput and system time.
      
      This patch:
      
      Add support for mapping multiple blocks in one call.
      
      This is useful for DIO reads and re-writes (where blocks are already
      allocated), and is also in line with Christoph's proposal of using
      getblocks() in mpage_readpage() or mpage_readpages().
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      89747d36
    • [PATCH] 2TB files: change type of kstatfs entries · e2d53f95
      Committed by Takashi Sato
      This fix was proposed by Trond Myklebust.  He says: the type sector_t
      is heavily tied to the block layer interface as an offset/handle to a
      block, and is subject to a supposedly block-specific configuration
      option: CONFIG_LBD.  Despite this, it is used in struct kstatfs to
      save a couple of bytes on the stack whenever we call the filesystems'
      ->statfs().
      
      So kstatfs's block-related entries are invalid in statfs64 for a
      network filesystem which has more than 2^32-1 blocks when CONFIG_LBD
      is disabled.
      
      - struct kstatfs
        Change the type of the following entries from sector_t to u64:
        f_blocks
        f_bfree
        f_bavail
        f_files
        f_ffree
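      The affected part of the structure then looks like this (a sketch;
      the surrounding fields are abbreviated):

      struct kstatfs {
              long    f_type;
              long    f_bsize;
              u64     f_blocks;       /* was sector_t */
              u64     f_bfree;        /* was sector_t */
              u64     f_bavail;       /* was sector_t */
              u64     f_files;        /* was sector_t */
              u64     f_ffree;        /* was sector_t */
              /* ... */
      };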
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e2d53f95
    • [PATCH] 2tb-files-add-blkcnt_t-fixes · 5515eff8
      Committed by Andrew Morton
      Cc: Takashi Sato <sho@tnes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      5515eff8
    • [PATCH] 2TB files: add blkcnt_t · a0f62ac6
      Committed by Takashi Sato
      Add blkcnt_t as the type of inode.i_blocks.  This enables making the
      size of blkcnt_t either 4 bytes or 8 bytes on 32-bit architectures
      with CONFIG_LSF.
      
      - CONFIG_LSF
        Add a new configuration parameter.
      - blkcnt_t
        On h8300, i386, mips, powerpc, s390 and sh, which define sector_t,
        blkcnt_t is defined as u64 if CONFIG_LSF is enabled; otherwise it is
        defined as unsigned long.
        On other architectures, it is defined as unsigned long.
      - inode.i_blocks
        Change the type from sector_t to blkcnt_t.
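      Schematically, on the architectures that define sector_t:

      #if defined(CONFIG_LSF)
      typedef u64 blkcnt_t;           /* 8 bytes on 32-bit with CONFIG_LSF */
      #else
      typedef unsigned long blkcnt_t; /* otherwise 4 bytes on 32-bit */
      #endif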
      Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a0f62ac6
    • [PATCH] 2TB files: st_blocks is invalid when calling stat64 · abcb6c9f
      Committed by Takashi Sato
      This patch series fixes the following problems on 32-bit
      architectures.
      
      o stat64 returns the lower 32 bits of blocks, although userland
        st_blocks has 64 bits, because i_blocks has only 32 bits.  The ioctl
        with FIOQSIZE has the same problem.
      
      o As Dave Kleikamp said, making a >2TB file on JFS results in writing
        an invalid block number to the disk inode.  The cause is the same as
        above.
      
      o In the generic quota code dquot_transfer(), the file usage is
        calculated from i_blocks via inode_get_bytes().  If the file is over
        2TB, the change in usage is less than expected.  The cause is the
        same as above.
      
      o As Trond Myklebust said, statfs64's entries related to blocks are
        invalid for a network filesystem which has more than 2^32-1 blocks
        with CONFIG_LBD disabled.  [PATCH 3/3]
      
      We made patches to fix problems that occur when handling a large
      filesystem and large files.  This was discussed in the mail thread
      titled "stat64 for over 2TB file returned invalid st_blocks".
      Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      abcb6c9f
    • [PATCH] mempool: use mempool_create_slab_pool() · 93d2341c
      Committed by Matthew Dobson
      Modify well over a dozen mempool users to call mempool_create_slab_pool()
      rather than calling mempool_create() with extra arguments, saving about 30
      lines of code and increasing readability.
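      The conversion pattern looks roughly like this (the cache, pool and
      minimum reserve below are hypothetical):

      #include <linux/mempool.h>
      #include <linux/slab.h>

      static struct kmem_cache *my_cache;     /* created elsewhere */
      static mempool_t *my_pool;

      static int my_setup(void)
      {
              /* Before: mempool_create() with the slab callbacks
               * spelled out. */
              my_pool = mempool_create(16, mempool_alloc_slab,
                                       mempool_free_slab, my_cache);
              /* After, equivalently, via the wrapper added by the
               * companion patch below:
               * my_pool = mempool_create_slab_pool(16, my_cache); */
              return my_pool ? 0 : -ENOMEM;
      }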
      Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      93d2341c
    • [PATCH] mempool: add mempool_create_slab_pool() · fec433aa
      Committed by Matthew Dobson
      Create a simple wrapper function for the common case of creating a slab-based
      mempool.
      Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fec433aa
    • [PATCH] mempool: use common mempool kzalloc allocator · 26b6e051
      Committed by Matthew Dobson
      This patch changes a mempool user, which is basically just a wrapper around
      kzalloc(), to use the common mempool_kmalloc/kfree, rather than its own
      wrapper function, removing duplicated code.
      Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      26b6e051
    • [PATCH] mempool: add kzalloc allocator · f183323d
      Committed by Matthew Dobson
      Add another allocator to the common mempool code: a kzalloc/kfree
      allocator.
      
      This will be used by the next patch in the series to replace a mempool-backed
      kzalloc allocator.  It is also very likely that there will be more users in
      the future.
      Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f183323d
    • [PATCH] mempool: use common mempool kmalloc allocator · 0eaae62a
      Committed by Matthew Dobson
      This patch changes several mempool users, all of which are basically just
      wrappers around kmalloc(), to use the common mempool_kmalloc/kfree, rather
      than their own wrapper function, removing a bunch of duplicated code.
      Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0eaae62a
    • [PATCH] mempool: add kmalloc allocator · 53184082
      Committed by Matthew Dobson
      Add another allocator to the common mempool code: a kmalloc/kfree
      allocator.
      
      This will be used by the next patch in the series to replace duplicate
      mempool-backed kmalloc allocators in several places in the kernel.  It is also
      very likely that there will be more users in the future.
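      A usage sketch (the object size and minimum reserve below are
      hypothetical):

      #include <linux/mempool.h>
      #include <linux/errno.h>

      static mempool_t *my_pool;

      static int my_setup(void)
      {
              /* Long form, with the common allocator functions; pool_data
               * carries the allocation size. */
              my_pool = mempool_create(16, mempool_kmalloc, mempool_kfree,
                                       (void *)(unsigned long)256);
              /* Or, equivalently, via the convenience wrapper:
               * my_pool = mempool_create_kmalloc_pool(16, 256); */
              return my_pool ? 0 : -ENOMEM;
      }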
      Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      53184082