1. 18 Jul, 2007 1 commit
    • sys_fallocate() implementation on i386, x86_64 and powerpc · 97ac7350
      Amit Arora authored
      fallocate() is a new system call being proposed here which will allow
      applications to preallocate space to any file(s) in a file system.
      Each file system implementation that wants to use this feature will need
      to support an inode operation called ->fallocate().
      Applications can use this feature to avoid fragmentation to a certain
      level and thus get faster access speed. With preallocation, applications
      also get a guarantee of space for particular file(s) - even if the
      system later becomes full.
      
      Currently, glibc provides an interface called posix_fallocate() which
      can be used for a similar purpose. Although it has the advantage of
      working on all file systems, it is quite slow (since it writes zeroes to
      each block that has to be preallocated). File systems can do this more
      efficiently within the kernel by implementing the proposed fallocate()
      system call. It is expected that posix_fallocate() will be modified to
      call this new system call first and, in case the kernel/filesystem does
      not implement it, fall back to the current implementation of writing
      zeroes to the new blocks.
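The "try the system call first, then fall back" pattern described above can be sketched from user space as follows. This is only an illustration, not code from the patch: it assumes the fallocate() wrapper with a mode argument that later glibc versions expose under _GNU_SOURCE, and the helper names preallocate() and demo_preallocate() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Try the fallocate() system call first; if the kernel or the file system
 * does not implement it, fall back to posix_fallocate().
 * Returns 0 on success or an errno value on failure. */
static int preallocate(int fd, off_t len)
{
    if (fallocate(fd, 0, 0, len) == 0)
        return 0;
    if (errno != ENOSYS && errno != EOPNOTSUPP)
        return errno;
    /* posix_fallocate() returns the error number directly, not -1. */
    return posix_fallocate(fd, 0, len);
}

/* Hypothetical demo helper: preallocate an unlinked temporary file and
 * report the resulting file size, or -1 on failure. */
static off_t demo_preallocate(off_t len)
{
    char path[] = "/tmp/falloc-demo-XXXXXX";
    struct stat st;
    int fd = mkstemp(path);

    assert(fd >= 0);              /* demo-only sanity check */
    unlink(path);                 /* file disappears once fd is closed */

    int err = preallocate(fd, len);
    off_t size = (err == 0 && fstat(fd, &st) == 0) ? st.st_size : -1;

    close(fd);
    return size;
}
```

With mode 0 the file size is extended to cover the preallocated range, so demo_preallocate(1 << 20) is expected to report a 1 MiB file on file systems that support either path.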
      ToDos:
      1. Implementation on other architectures (other than i386, x86_64,
         and ppc). Patches for s390(x) and ia64 are already available from
         previous posts, but it was decided that they should be added later
         once fallocate is in the mainline. Hence not including those patches
         in this take.
      2. Changes to glibc,
         a) to support fallocate() system call
         b) to make posix_fallocate() and posix_fallocate64() call fallocate()
      Signed-off-by: Amit Arora <aarora@in.ibm.com>
  2. 29 Jun, 2007 1 commit
    • Introduce fixed sys_sync_file_range2() syscall, implement on PowerPC and ARM · edd5cd4a
      David Woodhouse authored
      Not all the world is an i386.  Many architectures need 64-bit arguments to be
      aligned in suitable pairs of registers, and the original
      sys_sync_file_range(int, loff_t, loff_t, int) was therefore wasting an
      argument register for padding after the first integer.  Since we don't
      normally have more than 6 arguments for system calls, that left no room for
      the final argument on some architectures.
      
      Fix this by introducing sys_sync_file_range2(int, int, loff_t, loff_t) which
      all fits nicely.  In fact, ARM already had that, but called it
      sys_arm_sync_file_range.  Move it to fs/sync.c and rename it, then implement
      the needed compatibility routine.  And stop the missing syscall check from
      bitching about the absence of sys_sync_file_range() if we've implemented
      sys_sync_file_range2() instead.
      
      Tested on PPC32 and with 32-bit and 64-bit userspace on PPC64.
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
      Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 11 May, 2007 3 commits
    • signal/timer/event: eventfd core · e1ad7468
      Davide Libenzi authored
      This is a very simple and light file descriptor that can be used for event
      wait/dispatch by userspace (both wait and dispatch) and by the kernel
      (dispatch only).  It can be used instead of pipe(2) in all cases where a pipe
      would simply be used to signal events.  Its kernel overhead is much lower
      than that of a pipe, and it does not consume two fds.  When used in the
      kernel, it can offer an fd-bridge to enable, for example, functionalities
      like KAIO or syslets/threadlets to signal to an fd the completion of certain
      operations.  More generally, an eventfd can be used by the kernel to signal
      readiness, in a POSIX poll/select way, of interfaces that would otherwise be
      incompatible with it.  The API is:
      
      int eventfd(unsigned int count);
      
      The eventfd API accepts an initial "count" parameter, and returns an eventfd
      fd.  It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).
      
      The POLLIN flag is raised when the internal counter is greater than zero.
      
      The POLLOUT flag is raised when at least a value of "1" can be written to the
      internal counter.
      
      The POLLERR flag is raised when an overflow in the counter value is detected.
      
      The write(2) operation can never overflow the counter, since it blocks (unless
      O_NONBLOCK is set, in which case -EAGAIN is returned).
      
      But the eventfd_signal() function can do it, since it's supposed to not sleep
      during its operation.
      
      The read(2) function reads the __u64 counter value, and resets the internal
      value to zero.  If the value read is equal to (__u64) -1, an overflow happened
      on the internal counter (due to 2^64 eventfd_signal() posts that have never
      been retired - unlikely, but possible).
      
      The write(2) call writes an __u64 count value, and adds it to the current
      counter.  The eventfd fd supports O_NONBLOCK also.
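The counter semantics described above (write(2) adds to the counter, read(2) returns it and resets it) can be exercised with a short user-space sketch. Note the assumption: it uses the two-argument eventfd(initval, flags) wrapper found in later glibc, not the one-argument prototype shown in this commit message, and demo_eventfd_sum() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Post two values to an eventfd and read the accumulated counter back.
 * Returns the value read, or 0 if any step failed. */
static uint64_t demo_eventfd_sum(uint64_t a, uint64_t b)
{
    uint64_t total = 0;
    int efd = eventfd(0, 0);      /* initial count 0, no flags (assumed API) */

    assert(efd >= 0);             /* demo-only sanity check */

    /* Each write(2) adds its 8-byte value to the internal counter;
     * the single read(2) returns the sum and resets the counter to zero. */
    if (write(efd, &a, sizeof a) != (ssize_t)sizeof a ||
        write(efd, &b, sizeof b) != (ssize_t)sizeof b ||
        read(efd, &total, sizeof total) != (ssize_t)sizeof total)
        total = 0;

    close(efd);
    return total;
}
```

Only one file descriptor is needed here, whereas the equivalent pipe(2)-based signalling would consume two.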
      
      On the kernel side, we have:
      
      struct file *eventfd_fget(int fd);
      int eventfd_signal(struct file *file, unsigned int n);
      
      The eventfd_fget() should be called to get a struct file* from an eventfd fd
      (this is an fget() + check of f_op being an eventfd fops pointer).
      
      The kernel can then call eventfd_signal() every time it wants to post an event
      to userspace.  The eventfd_signal() function can be called from any context.
      An eventfd() simple test and bench is available here:
      
      http://www.xmailserver.org/eventfd-bench.c
      
      This is the eventfd-based version of pipetest-4 (pipe(2) based):
      
      http://www.xmailserver.org/pipetest-4.c
      
      Not that performance matters much in the eventfd case, but eventfd-bench
      shows almost double the performance of pipetest-4.
      
      [akpm@linux-foundation.org: fix i386 build]
      [akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signal/timer/event: timerfd core · b215e283
      Davide Libenzi authored
      This patch introduces a new system call for timer events delivered through
      file descriptors.  This allows timer events to be used with standard POSIX
      poll(2), select(2) and read(2).  As a consequence of supporting the Linux
      f_op->poll subsystem, they can be used with epoll(2) too.
      
      The system call is defined as:
      
      int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);
      
      The "ufd" parameter allows for re-use (re-programming) of an existing timerfd
      without going through the close/open cycle (same as signalfd).  If "ufd" is -1,
      a new file descriptor will be created, otherwise the existing "ufd" will be
      re-programmed.
      
      The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME.  The time
      specified in the "utmr->it_value" parameter is the expiry time for the timer.
      
      If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute time,
      otherwise it's a relative time.
      
      If the time specified in "utmr->it_interval" is not zero (zero meaning
      tv_sec == 0 and tv_nsec == 0), this is the period at which the following
      ticks should be generated.
      
      The "utmr->it_interval" should be set to zero if only one tick is requested.
      Setting the "utmr->it_value" to zero will disable the timer, or will create a
      timerfd without the timer enabled.
      
      The function returns the new (or same, in case "ufd" is a valid timerfd
      descriptor) file descriptor, or -1 in case of error.
      
      As stated before, the timerfd file descriptor supports poll(2), select(2) and
      epoll(2).  When a timer event happens on the timerfd, a POLLIN mask will be
      returned.
      
      The read(2) call can be used, and it will return a u32 variable holding the
      number of "ticks" that happened on the interface since the last call to
      read(2).  The read(2) call supports the O_NONBLOCK flag too, and EAGAIN will
      be returned if no ticks happened.
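A rough user-space sketch of arming a timer and reading the tick count follows. It is written against the timerfd_create()/timerfd_settime() pair that later kernels and glibc provide, which is an assumption here: the interface differs from the single timerfd() call described in this message, and the count read back is 64 bits wide in that later interface. demo_timerfd_wait() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Arm a relative timer that first fires after first_ms milliseconds and then
 * every period_ms milliseconds (0 = one-shot), block in read(2) until it
 * expires, and return the number of ticks reported (0 on failure).
 * first_ms must be non-zero: a zero it_value leaves the timer disarmed. */
static uint64_t demo_timerfd_wait(long first_ms, long period_ms)
{
    struct itimerspec its = {
        .it_value    = { .tv_sec  = first_ms / 1000,
                         .tv_nsec = (first_ms % 1000) * 1000000L },
        .it_interval = { .tv_sec  = period_ms / 1000,
                         .tv_nsec = (period_ms % 1000) * 1000000L },
    };
    uint64_t ticks = 0;
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);

    assert(tfd >= 0);             /* demo-only sanity check */

    /* flags == 0 means it_value is relative; TFD_TIMER_ABSTIME would make
     * it an absolute time instead. */
    if (timerfd_settime(tfd, 0, &its, NULL) != 0 ||
        read(tfd, &ticks, sizeof ticks) != (ssize_t)sizeof ticks)
        ticks = 0;

    close(tfd);
    return ticks;
}
```

For a one-shot timer the blocking read reports a single tick; with a non-zero period the count depends on how long the caller waited before reading.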
      
      A quick test program shows timerfd working correctly on my amd64 box:
      
      http://www.xmailserver.org/timerfd-test.c
      
      [akpm@linux-foundation.org: add sys_timerfd to sys_ni.c]
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signal/timer/event: signalfd core · fba2afaa
      Davide Libenzi authored
      This patch series implements the new signalfd() system call.
      
      I took part of the original Linus code (and you know how badly it can be
      broken :), and I added even more breakage ;) Signals are fetched from the same
      signal queue used by the process, so signalfd will compete with standard
      kernel delivery in dequeue_signal().  If you want to reliably fetch signals on
      the signalfd file, you need to block them with sigprocmask(SIG_BLOCK).  This
      seems to be working fine on my Dual Opteron machine.  I made a quick test
      program for it:
      
      http://www.xmailserver.org/signafd-test.c
      
      The signalfd() system call implements signal delivery into a file descriptor
      receiver.  The signalfd file descriptor is created with the following API:
      
      int signalfd(int ufd, const sigset_t *mask, size_t masksize);
      
      The "ufd" parameter allows changing the sigmask of an existing signalfd without
      going through a close/create cycle (Linus' idea).  Use "ufd" == -1 if you want
      a brand new signalfd file.
      
      The "mask" parameter specifies the set of signals that we are interested
      in.  The "masksize" parameter is the size of "mask".
      
      The signalfd fd supports the poll(2) and read(2) system calls.  The poll(2)
      will return POLLIN when signals are available to be dequeued.  As a direct
      consequence of supporting the Linux poll subsystem, the signalfd fd can be
      used together with epoll(2) too.
      
      The read(2) system call will return a "struct signalfd_siginfo" structure in
      the userspace supplied buffer.  The return value is the number of bytes copied
      in the supplied buffer, or -1 in case of error.  The read(2) call can also
      return 0, in case the sighand structure to which the signalfd was attached,
      has been orphaned.  The O_NONBLOCK flag is also supported, and read(2) will
      return -EAGAIN in case no signal is available.
      
      If the size of the buffer passed to read(2) is lower than sizeof(struct
      signalfd_siginfo), -EINVAL is returned.  A read from the signalfd can also
      return -ERESTARTSYS in case a signal hits the process.  The format of
      struct signalfd_siginfo is shown below; the valid fields depend on the
      (->code & __SI_MASK) value, in the same way as for a struct siginfo:
      
      struct signalfd_siginfo {
      	__u32 signo;	/* si_signo */
      	__s32 err;	/* si_errno */
      	__s32 code;	/* si_code */
      	__u32 pid;	/* si_pid */
      	__u32 uid;	/* si_uid */
      	__s32 fd;	/* si_fd */
      	__u32 tid;	/* si_tid */
      	__u32 band;	/* si_band */
      	__u32 overrun;	/* si_overrun */
      	__u32 trapno;	/* si_trapno */
      	__s32 status;	/* si_status */
      	__s32 svint;	/* si_int */
      	__u64 svptr;	/* si_ptr */
      	__u64 utime;	/* si_utime */
      	__u64 stime;	/* si_stime */
      	__u64 addr;	/* si_addr */
      };
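The block-then-read pattern described in this message can be sketched as below. It is only an illustration under assumptions: it uses the signalfd(fd, mask, flags) wrapper from later glibc (the third argument differs from the masksize parameter above), and in that later interface the structure fields carry an ssi_ prefix (for example ssi_signo). demo_signalfd_read() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Block signo, raise it, and fetch it through a signalfd instead of a
 * handler. Returns the signal number read, or -1 on failure. */
static int demo_signalfd_read(int signo)
{
    sigset_t mask, old;
    struct signalfd_siginfo info;
    int got = -1;

    sigemptyset(&mask);
    sigaddset(&mask, signo);

    /* Block the signal so normal delivery does not compete with signalfd.
     * (Multi-threaded programs would use pthread_sigmask() instead.) */
    if (sigprocmask(SIG_BLOCK, &mask, &old) != 0)
        return -1;

    int sfd = signalfd(-1, &mask, 0);   /* -1: create a new signalfd */
    assert(sfd >= 0);                   /* demo-only sanity check */

    raise(signo);                       /* signal is now pending */
    if (read(sfd, &info, sizeof info) == (ssize_t)sizeof info)
        got = (int)info.ssi_signo;

    close(sfd);
    sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the previous mask */
    return got;
}
```

Because the signal is consumed by read(2), restoring the old mask afterwards does not trigger the default action.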
      
      [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 10 May, 2007 1 commit
  5. 12 Oct, 2006 1 commit
    • [PATCH] epoll_pwait() · b611967d
      Davide Libenzi authored
      Implement the epoll_pwait system call, which extends the event wait mechanism
      with the same logic that ppoll and pselect use.  The definition of epoll_pwait
      is:
      
      int epoll_pwait(int epfd, struct epoll_event *events, int maxevents,
                       int timeout, const sigset_t *sigmask, size_t sigsetsize);
      
      The difference between the vanilla epoll_wait and epoll_pwait is that the
      latter allows the caller to specify a signal mask to be set while waiting
      for events.  Hence epoll_pwait will wait until either a monitored event
      or an unmasked signal occurs.  If sigmask is NULL, the epoll_pwait system
      call will act exactly like epoll_wait.  For the POSIX definition of
      pselect, information is available here:
      
      http://www.opengroup.org/onlinepubs/009695399/functions/select.html
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
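For the epoll_pwait() commit above, a minimal user-space sketch of waiting with a temporary signal mask might look like this. It assumes the five-argument glibc wrapper (which fills in sigsetsize itself) plus epoll_create1() and eventfd() from later kernels; demo_epoll_pwait() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Wait on one already-readable descriptor while SIGINT is masked for the
 * duration of the wait only. Returns the number of ready descriptors,
 * or -1 on failure. */
static int demo_epoll_pwait(void)
{
    struct epoll_event ev = { .events = EPOLLIN }, out;
    sigset_t wait_mask;
    int nready = -1;
    int efd = eventfd(1, 0);      /* counter starts at 1, so it is readable */
    int epfd = epoll_create1(0);

    assert(efd >= 0 && epfd >= 0);    /* demo-only sanity check */
    ev.data.fd = efd;

    /* The mask replaces the caller's signal mask only while waiting; the
     * original mask is restored when epoll_pwait() returns. */
    sigemptyset(&wait_mask);
    sigaddset(&wait_mask, SIGINT);

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == 0)
        nready = epoll_pwait(epfd, &out, 1, 1000, &wait_mask);

    close(epfd);
    close(efd);
    return nready;
}
```

Passing NULL instead of &wait_mask would make the call behave like plain epoll_wait(), as the message states.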
  6. 11 Oct, 2006 1 commit
  7. 02 Oct, 2006 1 commit
    • [PATCH] rename the provided execve functions to kernel_execve · 3db03b4a
      Arnd Bergmann authored
      Some architectures provide an execve function that does not set errno, but
      instead returns the result code directly.  Rename these to kernel_execve to
      get the right semantics there.  Moreover, there is no reason for any of these
      architectures to still provide __KERNEL_SYSCALLS__ or _syscallN macros, so
      remove these right away.
      
      [akpm@osdl.org: build fix]
      [bunk@stusta.de: build fix]
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Andi Kleen <ak@muc.de>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  8. 30 Sep, 2006 1 commit
  9. 26 Sep, 2006 1 commit
    • [PATCH] x86: Add portable getcpu call · 3cfc348b
      Andi Kleen authored
      For NUMA optimization and some other algorithms it is useful to have a fast
      way to get the current CPU and node numbers in user space.
      
      x86-64 added a fast way to do this in a vsyscall. This adds a generic
      syscall for other architectures to make it a generic portable facility.
      
      I expect some of them will also implement it as a faster vsyscall.
      
      The cache is an optimization for the x86-64 vsyscall optimization. Since
      what the syscall returns is an approximation anyway and user space
      often wants very fast results, it can be cached for some time.  The normal
      methods to get this information in user space are relatively slow.
      
      The vsyscall is in a better position to manage the cache because it has direct
      access to a fast time stamp (jiffies). For the generic syscall the cache
      doesn't help much, but the syscall still enforces a valid argument to keep
      programs portable.
      
      I only added an i386 syscall entry for now. Other architectures can follow
      as needed.
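A user-space call to the generic syscall could be sketched as follows. This is an assumption-laden illustration: it invokes the raw system call through syscall(2) with SYS_getcpu, and passes NULL for the third (cache) argument, which later kernels are documented to ignore; the kernel described in this commit may have handled that argument differently. demo_getcpu() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fetch the CPU and NUMA node the caller is currently running on.
 * Returns 0 on success, -1 on failure (errno set by syscall()). */
static int demo_getcpu(unsigned *cpu, unsigned *node)
{
    /* Third argument is the cache pointer discussed in the commit message;
     * NULL is assumed to be acceptable here. */
    return (int)syscall(SYS_getcpu, cpu, node, NULL);
}
```

The result is only a snapshot: the scheduler may migrate the thread to another CPU right after the call returns, which is why the message calls it an approximation.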
      
      AK: Also added some cleanups from Andrew Morton
      Signed-off-by: Andi Kleen <ak@suse.de>
  10. 28 Jun, 2006 1 commit
    • [PATCH] pi-futex: futex code cleanups · e2970f2f
      Ingo Molnar authored
      We are pleased to announce "lightweight userspace priority inheritance" (PI)
      support for futexes.  The following patchset and glibc patch implement it,
      on top of the robust-futexes patchset which is included in 2.6.16-mm1.
      
      We are calling it lightweight for 3 reasons:
      
       - in the user-space fastpath a PI-enabled futex involves no kernel work
         (or any other PI complexity) at all.  No registration, no extra kernel
         calls - just pure fast atomic ops in userspace.
      
       - in the slowpath (in the lock-contention case), the system call and
         scheduling pattern is in fact better than that of normal futexes, due to
         the 'integrated' nature of FUTEX_LOCK_PI.  [more about that further down]
      
       - the in-kernel PI implementation is streamlined around the mutex
         abstraction, with strict rules that keep the implementation relatively
         simple: only a single owner may own a lock (i.e.  no read-write lock
         support), only the owner may unlock a lock, no recursive locking, etc.
      
        Priority Inheritance - why, oh why???
        -------------------------------------
      
      Many of you heard the horror stories about the evil PI code circling Linux for
      years, which makes no real sense at all and is only used by buggy applications
      and which has horrible overhead.  Some of you have dreaded this very moment,
      when someone actually submits working PI code ;-)
      
      So why would we like to see PI support for futexes?
      
      We'd like to see it done purely for technological reasons.  We don't think it's
      a buggy concept, we think it's useful functionality to offer to applications,
      which functionality cannot be achieved in other ways.  We also think it's the
      right thing to do, and we think we've got the right arguments and the right
      numbers to prove that.  We also believe that we can address all the
      counter-arguments as well.  For these reasons (and the reasons outlined below)
      we are submitting this patch-set for upstream kernel inclusion.
      
      What are the benefits of PI?
      
        The short reply:
        ----------------
      
      User-space PI helps achieve/improve determinism for user-space
      applications.  In the best-case, it can help achieve determinism and
      well-bound latencies.  Even in the worst-case, PI will improve the statistical
      distribution of locking related application delays.
      
        The longer reply:
        -----------------
      
      Firstly, sharing locks between multiple tasks is a common programming
      technique that often cannot be replaced with lockless algorithms.  As we can
      see it in the kernel [which is a quite complex program in itself], lockless
      structures are rather the exception than the norm - the current ratio of
      lockless vs.  locky code for shared data structures is somewhere between 1:10
      and 1:100.  Lockless is hard, and the complexity of lockless algorithms often
      endangers the ability to do robust reviews of said code.  I.e.  critical RT
      apps often choose lock structures to protect critical data structures, instead
      of lockless algorithms.  Furthermore, there are cases (like shared hardware,
      or other resource limits) where lockless access is mathematically impossible.
      
      Media players (such as Jack) are an example of reasonable application design
      with multiple tasks (with multiple priority levels) sharing short-held locks:
      for example, a highprio audio playback thread is combined with medium-prio
      construct-audio-data threads and low-prio display-colory-stuff threads.  Add
      video and decoding to the mix and we've got even more priority levels.
      
      So once we accept that synchronization objects (locks) are an unavoidable fact
      of life, and once we accept that multi-task userspace apps have a very fair
      expectation of being able to use locks, we've got to think about how to offer
      the option of a deterministic locking implementation to user-space.
      
      Most of the technical counter-arguments against doing priority inheritance
      only apply to kernel-space locks.  But user-space locks are different, there
      we cannot disable interrupts or make the task non-preemptible in a critical
      section, so the 'use spinlocks' argument does not apply (user-space spinlocks
      have the same priority inversion problems as other user-space locking
      constructs).  Fact is, pretty much the only technique that currently enables
      good determinism for userspace locks (such as futex-based pthread mutexes) is
      priority inheritance:
      
      Currently (without PI), if a high-prio and a low-prio task share a lock [this
      is a quite common scenario for most non-trivial RT applications], even if all
      critical sections are coded carefully to be deterministic (i.e.  all critical
      sections are short in duration and only execute a limited number of
      instructions), the kernel cannot guarantee any deterministic execution of the
      high-prio task: any medium-priority task could preempt the low-prio task while
      it holds the shared lock and executes the critical section, and could delay it
      indefinitely.
      
        Implementation:
        ---------------
      
      As mentioned before, the userspace fastpath of PI-enabled pthread mutexes
      involves no kernel work at all - they behave quite similarly to normal
      futex-based locks: a 0 value means unlocked, and a value==TID means locked.
      (This is the same method as used by list-based robust futexes.) Userspace uses
      atomic ops to lock/unlock these mutexes without entering the kernel.
      
      To handle the slowpath, we have added two new futex ops:
      
        FUTEX_LOCK_PI
        FUTEX_UNLOCK_PI
      
      If the lock-acquire fastpath fails, [i.e.  an atomic transition from 0 to TID
      fails], then FUTEX_LOCK_PI is called.  The kernel does all the remaining work:
      if there is no futex-queue attached to the futex address yet then the code
      looks up the task that owns the futex [it has put its own TID into the futex
      value], and attaches a 'PI state' structure to the futex-queue.  The pi_state
      includes an rt-mutex, which is a PI-aware, kernel-based synchronization
      object.  The 'other' task is made the owner of the rt-mutex, and the
      FUTEX_WAITERS bit is atomically set in the futex value.  Then this task tries
      to lock the rt-mutex, on which it blocks.  Once it returns, it has the mutex
      acquired, and it sets the futex value to its own TID and returns.  Userspace
      has no other work to perform - it now owns the lock, and futex value contains
      FUTEX_WAITERS|TID.
      
      If the unlock side fastpath succeeds, [i.e.  userspace manages to do a TID ->
      0 atomic transition of the futex value], then no kernel work is triggered.
      
      If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), then
      FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the behalf of
      userspace - and it also unlocks the attached pi_state->rt_mutex and thus wakes
      up any potential waiters.
      
      Note that under this approach, contrary to other PI-futex approaches, there is
      no prior 'registration' of a PI-futex.  [which is not quite possible anyway,
      due to existing ABI properties of pthread mutexes.]
      
      Also, under this scheme, 'robustness' and 'PI' are two orthogonal properties
      of futexes, and all four combinations are possible: futex, robust-futex,
      PI-futex, robust+PI-futex.
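The fastpath/slowpath split described in the Implementation section can be sketched in user space roughly as below. This is a simplified illustration, not the glibc implementation: it assumes C11 atomics on a 32-bit futex word, issues the raw futex system call through syscall(2), and ignores details such as the FUTEX_OWNER_DIED bit; pi_lock(), pi_unlock() and demo_pi_fastpath() are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock: try the 0 -> TID transition in user space; only if that fails
 * (the lock is contended) ask the kernel via FUTEX_LOCK_PI. */
static int pi_lock(_Atomic uint32_t *word)
{
    uint32_t unlocked = 0;
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    if (atomic_compare_exchange_strong(word, &unlocked, tid))
        return 0;                                   /* fastpath */
    return (int)syscall(SYS_futex, word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

/* Unlock: try the TID -> 0 transition; it fails when FUTEX_WAITERS is set,
 * in which case the kernel must hand the lock over via FUTEX_UNLOCK_PI. */
static int pi_unlock(_Atomic uint32_t *word)
{
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    if (atomic_compare_exchange_strong(word, &tid, (uint32_t)0))
        return 0;                                   /* fastpath */
    return (int)syscall(SYS_futex, word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}

/* Uncontended round trip: returns 1 if the word held our TID while locked
 * and went back to 0 after unlocking, 0 otherwise. */
static int demo_pi_fastpath(void)
{
    _Atomic uint32_t word = 0;
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    if (pi_lock(&word) != 0)
        return 0;
    int owned = (atomic_load(&word) == tid);
    if (pi_unlock(&word) != 0)
        return 0;
    return owned && atomic_load(&word) == 0;
}
```

In the uncontended round trip neither futex operation is invoked, which matches the claim that the fastpath involves no kernel work.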
      
        glibc support:
        --------------
      
      Ulrich Drepper and Jakub Jelinek have written glibc support for PI-futexes
      (and robust futexes), enabling robust and PI (PTHREAD_PRIO_INHERIT) POSIX
      mutexes.  (PTHREAD_PRIO_PROTECT support will be added later on too, no
      additional kernel changes are needed for that).  [NOTE: The glibc patch is
      obviously unofficial and unsupported without matching upstream kernel
      functionality.]
      
      The patch-queue and the glibc patch can also be downloaded from:
      
        http://redhat.com/~mingo/PI-futex-patches/
      
      Many thanks go to the people who helped us create this kernel feature: Steven
      Rostedt, Esben Nielsen, Benedikt Spranger, Daniel Walker, John Cooper, Arjan
      van de Ven, Oleg Nesterov and others.  Credits for related prior projects go
      to Dirk Grambow, Inaky Perez-Gonzalez, Bill Huey and many others.
      
      Clean up the futex code, before adding more features to it:
      
       - use u32 as the futex field type - that's the ABI
       - use __user and pointers to u32 instead of unsigned long
       - code style / comment style cleanups
       - rename hash-bucket name from 'bh' to 'hb'.
      
      I checked the pre and post futex.o object files to make sure this
      patch has no code effects.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  11. 23 Jun, 2006 3 commits
    • [PATCH] move_pages: fix 32 -> 64 bit compat function · 9216dfad
      Christoph Lameter authored
      The third parameter is defined as a pointer to an array of virtual
      addresses, which gave us some trouble.  The existing code calculated the
      wrong address in the array since I used void to avoid having to specify a
      type.
      
      I now use the correct type "compat_uptr_t __user *" in the definition of
      the function in kernel/compat.c.
      
      However, I used __u32 in syscalls.h.  Would have to include compat.h there
      in order to provide the same definition which would generate an ugly
      include situation.
      
      On both ia64 and x86_64 compat_uptr_t is u32. So this works although
      parameter declarations differ.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] sys_move_pages: 32bit support (i386, x86_64) · 1b2db9fb
      Christoph Lameter authored
      sys_move_pages() support for 32bit (i386 plus x86_64 compat layer)
      
      Add support for move_pages() on i386 and also add the compat functions
      necessary to run 32 bit binaries on x86_64.
      
      Add compat_sys_move_pages to the x86_64 32bit binary layer.  Note that it is
      not up to date so I added the missing pieces.  Not sure if this is done the
      right way.
      
      [akpm@osdl.org: compile fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] page migration: sys_move_pages(): support moving of individual pages · 742755a1
      Christoph Lameter authored
      move_pages() is used to move individual pages of a process. The function can
      be used to determine the location of pages and to move them onto the desired
      node. move_pages() returns status information for each page.
      
      long move_pages(pid, number_of_pages_to_move,
      		addresses_of_pages[],
      		nodes[] or NULL,
      		status[],
      		flags);
      
      The addresses_of_pages argument is an array of void * pointing to the
      pages to be moved.
      
      The nodes array contains the node numbers that the pages should be moved
      to. If a NULL is passed instead of an array then no pages are moved but
      the status array is updated. The status request may be used to determine
      the page state before issuing another move_pages() to move pages.
      
      The status array will contain the state of all individual page migration
      attempts when the function terminates. The status array is only valid if
      move_pages() completed successfully.
      
      Possible page states in status[]:
      
      0..MAX_NUMNODES	The page is now on the indicated node.
      
      -ENOENT		Page is not present
      
      -EACCES		Page is mapped by multiple processes and can only
      		be moved if MPOL_MF_MOVE_ALL is specified.
      
      -EPERM		The page has been mlocked by a process/driver and
      		cannot be moved.
      
      -EBUSY		Page is busy and cannot be moved. Try again later.
      
      -EFAULT		Invalid address (no VMA or zero page).
      
      -ENOMEM		Unable to allocate memory on target node.
      
      -EIO		Unable to write back page. The page must be written
      		back in order to move it since the page is dirty and the
      		filesystem does not provide a migration function that
      		would allow the moving of dirty pages.
      
      -EINVAL		A dirty page cannot be moved. The filesystem does not provide
      		a migration function and has no ability to write back pages.
      
      The flags parameter indicates what types of pages to move:
      
      MPOL_MF_MOVE	Move pages that are only mapped by the process.
      
      MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
      		Requires sufficient capabilities.
      
      Possible return codes from move_pages()
      
      -ENOENT		No pages found that would require moving. All pages
      		are either already on the target node, not present, had an
      		invalid address or could not be moved because they were
      		mapped by multiple processes.
      
      -EINVAL		Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
      		to migrate pages in a kernel thread.
      
      -EPERM		MPOL_MF_MOVE_ALL specified without sufficient privileges,
      		or an attempt to move a process belonging to another user.
      
      -EACCES		One of the target nodes is not allowed by the current cpuset.
      
      -ENODEV		One of the target nodes is not online.
      
      -ESRCH		Process does not exist.
      
      -E2BIG		Too many pages to move.
      
      -ENOMEM		Not enough memory to allocate control array.
      
      -EFAULT		Parameters could not be accessed.
      
      A test program for move_pages() may be found with the patches
      on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3
      
      From: Christoph Lameter <clameter@sgi.com>
      
        Detailed results for sys_move_pages()
      
        Pass a pointer to an integer to get_new_page() that may be used to
        indicate where the completion status of a migration operation should be
        placed.  This allows sys_move_pages() to report back exactly what happened to
        each page.
      
        Wish there would be a better way to do this. Looks a bit hacky.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      742755a1
  12. 24 May, 2006 1 commit
  13. 26 April, 2006 2 commits
  14. 11 April, 2006 2 commits
  15. 10 April, 2006 1 commit
    • I
      [PATCH] splice: add optional input and output offsets · 529565dc
      Ingo Molnar authored
      add optional input and output offsets to sys_splice(), for seekable file
      descriptors:
      
       asmlinkage long sys_splice(int fd_in, loff_t __user *off_in,
                                  int fd_out, loff_t __user *off_out,
                                  size_t len, unsigned int flags);
      
      semantics are straightforward: f_pos will be updated with the offset
      provided by user-space, before the splice transfer is about to begin.
      Providing a NULL offset pointer means the existing f_pos will be used
      (and updated in situ).  Providing an offset for a pipe results in
      -ESPIPE. Providing an invalid offset pointer results in -EFAULT.
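
The offset semantics described above can be sketched from user space as follows. This is only an illustration under stated assumptions: it uses the glibc splice() wrapper rather than the raw syscall, the helper name read_at_via_splice is hypothetical, and it does not depend on whether the kernel also updates f_pos when an explicit offset is passed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes from a seekable fd, starting at
 * `offset`, by splicing them into a temporary pipe and draining the pipe.
 * Intended for small reads that fit into the pipe buffer. */
ssize_t read_at_via_splice(int fd, loff_t offset, char *buf, size_t len)
{
    int p[2];
    ssize_t n, got = -1;

    assert(buf != NULL);
    if (pipe(p) < 0)
        return -1;

    /* off_in points at the requested offset. off_out must be NULL here,
     * because the output side is a pipe and an offset would give ESPIPE. */
    n = splice(fd, &offset, p[1], NULL, len, 0);

    /* Close the write end first so the read below can never block. */
    close(p[1]);
    if (n >= 0)
        got = read(p[0], buf, (size_t)n);
    close(p[0]);
    return got;
}
```

Passing NULL instead of &offset would make the transfer start at the file's current position.
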
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Jens Axboe <axboe@suse.de>
      529565dc
  16. 01 April, 2006 1 commit
    • A
      [PATCH] sys_sync_file_range() · f79e2abb
      Andrew Morton authored
      Remove the recently-added LINUX_FADV_ASYNC_WRITE and LINUX_FADV_WRITE_WAIT
      fadvise() additions, do it in a new sys_sync_file_range() syscall instead.
      Reasons:
      
      - It's more flexible.  Things which would require two or three syscalls with
        fadvise() can be done in a single syscall.
      
      - Using fadvise() in this manner is something not covered by POSIX.
      
      The patch wires up the syscall for x86.
      
      The syscall is implemented in the new fs/sync.c.  The intention is that we can
      move sys_fsync(), sys_fdatasync() and perhaps sys_sync() into there later.
      
      Documentation for the syscall is in fs/sync.c.
      
      A test app (sync_file_range.c) is in
      http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.
      
      The available-to-GPL-modules do_sync_file_range() is for knfsd: "A COMMIT can
      say NFS_DATA_SYNC or NFS_FILE_SYNC.  I can skip the ->fsync call for
      NFS_DATA_SYNC which is hopefully the more common."
      
      Note: the `async' writeout mode SYNC_FILE_RANGE_WRITE will turn synchronous if
      the queue is congested.  This is trivial to fix: add a new flag bit, set
      wbc->nonblocking.  But I'm not sure that we want to expose implementation
      details down to that level.
      
      Note: it's notable that we can sync an fd which wasn't opened for writing.
      (Same with fsync() and fdatasync().)
      
      Note: the code takes some care to handle attempts to sync file contents
      outside the 16TB offset on 32-bit machines.  It makes such attempts appear to
      succeed, for best 32-bit/64-bit compatibility.  Perhaps it should make such
      requests fail...
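
As a rough user-space illustration of the syscall described above, the sketch below uses the glibc sync_file_range() wrapper. The helper names start_writeout and flush_range are hypothetical. Only SYNC_FILE_RANGE_WRITE is named in this message; the two WAIT flags are assumed from the glibc headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: kick off writeout of the dirty pages in
 * [offset, offset + nbytes) without waiting for completion.
 * nbytes == 0 means "through to the end of the file". */
int start_writeout(int fd, off64_t offset, off64_t nbytes)
{
    assert(offset >= 0 && nbytes >= 0);
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/* Hypothetical helper: wait for writeout already in flight, start writeout
 * of the range, then wait for it to finish, all in one syscall. */
int flush_range(int fd, off64_t offset, off64_t nbytes)
{
    assert(offset >= 0 && nbytes >= 0);
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

Combining the flags in a single call is the kind of operation the message says would otherwise need two or three fadvise() calls.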
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Neil Brown <neilb@cse.unsw.edu.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f79e2abb
  17. 31 March, 2006 1 commit
    • J
      [PATCH] Introduce sys_splice() system call · 5274f052
      Jens Axboe authored
      This adds support for the sys_splice system call. Using a pipe as a
      transport, it can connect to files or sockets (latter as output only).
      
      From the splice.c comments:
      
         "splice": joining two ropes together by interweaving their strands.
      
         This is the "extended pipe" functionality, where a pipe is used as
         an arbitrary in-memory buffer. Think of a pipe as a small kernel
         buffer that you can use to transfer data from one end to the other.
      
         The traditional unix read/write is extended with a "splice()" operation
         that transfers data buffers to or from a pipe buffer.
      
         Named by Larry McVoy, original implementation from Linus, extended by
         Jens to support splicing to files and fixing the initial implementation
         bugs.
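
The "pipe as an in-memory buffer" idea can be sketched from user space as a file-to-file copy that goes through a pipe. This is a minimal sketch, assuming the glibc splice() wrapper and a kernel and filesystem that support splicing on both ends; the helper name copy_via_pipe is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: copy up to len bytes from in_fd to out_fd, using a
 * pipe as the intermediate kernel buffer. Returns the number of bytes
 * copied, or -1 on error. */
ssize_t copy_via_pipe(int in_fd, int out_fd, size_t len)
{
    int p[2];
    ssize_t total = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    if (pipe(p) < 0)
        return -1;

    while (len > 0) {
        /* Fill the pipe from the input file (at most one pipe-full). */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len, 0);
        if (n < 0) {
            total = -1;
            break;
        }
        if (n == 0)             /* end of input */
            break;
        len -= (size_t)n;

        /* Drain everything that was just queued into the output file. */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)n, 0);
            if (m <= 0) {
                total = -1;
                goto out;
            }
            n -= m;
            total += m;
        }
    }
out:
    close(p[0]);
    close(p[1]);
    return total;
}
```

The data never passes through a user-space buffer; the pipe holds it between the two splice() calls.
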
      Signed-off-by: Jens Axboe <axboe@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      5274f052
  18. 24 March, 2006 1 commit
  19. 25 February, 2006 1 commit
    • U
      [PATCH] flags parameter for linkat · c04030e1
      Ulrich Drepper authored
      I'm currently at the POSIX meeting and one thing covered was the
      incompatibility of Linux's link() with the POSIX definition.  The name.
      Linux does not follow symlinks, POSIX requires it does.
      
      Even if somebody thinks this is a good default behavior we cannot change this
      because it would break the ABI.  But the fact remains that some application
      might want this behavior.
      
      We have one chance to help implementing this without breaking the behavior.
       For this we could use the new linkat interface which would need a new
      flags parameter.  If the new parameter is AT_SYMLINK_FOLLOW the new
      behavior could be invoked.
      
      I do not want to introduce such a patch now.  But we could add the
      parameter now, just don't use it.  The patch below would do this.  Can we
      get this late patch applied before the release more or less fixes the
      syscall API?
      Signed-off-by: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c04030e1
  20. 12 February, 2006 1 commit
    • U
      [PATCH] fstatat64 support · cff2b760
      Ulrich Drepper authored
      The *at patches introduced fstatat and, due to insufficient research, I
      used the newfstat functions generally as the guideline.  The result is that
      on 32-bit platforms we don't have all the information needed to implement
      fstatat64.
      
      This patch modifies the code to pass up 64-bit information if
      __ARCH_WANT_STAT64 is defined.  I renamed the syscall entry point to make
      this clear.  Other archs will continue to use the existing code.  On x86-64
      the compat code is implemented using a new sys32_ function.  This is what
      is done for the other stat syscalls as well.
      
      This patch might break some other archs (those which define
      __ARCH_WANT_STAT64 and which already wired up the syscall).  Yet others
      might need changes to accommodate the compatibility mode.  I really don't
      want to do that work because all this stat handling is a mess (more so in
      glibc, but the kernel is also affected).  It should be done by the arch
      maintainers.  I'll provide some stand-alone test shortly.  Those who are
      eager could compile glibc and run 'make check' (no installation needed).
      
      The patch below has been tested on x86 and x86-64.
      Signed-off-by: Ulrich Drepper <drepper@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      cff2b760
  21. 02 February, 2006 2 commits
  22. 19 January, 2006 1 commit
  23. 09 January, 2006 2 commits
    • C
      [PATCH] Swap Migration V5: sys_migrate_pages interface · 39743889
      Christoph Lameter authored
      sys_migrate_pages implementation using swap based page migration
      
      This is the original API proposed by Ray Bryant in his posts during the first
      half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.
      
      The intent of sys_migrate_pages is to migrate the memory of a process.  A
      process may have migrated to another node, leaving behind memory that was
      allocated optimally for the prior context.  sys_migrate_pages allows that
      memory to be shifted to the new node.
      
      sys_migrate_pages is also useful for manually moving a process's memory when
      the process's available memory nodes have changed through cpuset operations.
      Paul Jackson is working on an automated mechanism that will allow an automatic
      migration if the cpuset of a process is changed.  However, a user may decide
      to manually control the migration.
      
      This implementation is put into the policy layer since it uses concepts and
      functions that are also needed for mbind and friends.  The patch also provides
      a do_migrate_pages function that may be useful for cpusets to automatically
      move memory.  sys_migrate_pages does not modify policies in contrast to Ray's
      implementation.
      
      The current code here is based on the swap based page migration capability and
      thus is not able to preserve the physical layout relative to its containing
      nodeset (which may be a cpuset).  When direct page migration becomes available
      then the implementation needs to be changed to do an isomorphic move of pages
      between different nodesets.  The current implementation simply evicts all
      pages in the source nodeset that are not in the target nodeset.
      
      Patch supports ia64, i386 and x86_64.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      39743889
    • A
      [PATCH] spufs: The SPU file system, base · 67207b96
      Arnd Bergmann authored
      This is the current version of the spu file system, used
      for driving SPEs on the Cell Broadband Engine.
      
      This release is almost identical to the version for the
      2.6.14 kernel posted earlier, which is available as part
      of the Cell BE Linux distribution from
      http://www.bsc.es/projects/deepcomputing/linuxoncell/.
      
      The first patch provides all the interfaces for running
      SPU applications, but does not have any support for
      debugging SPU tasks or for scheduling. Both of these
      functions are added in the subsequent patches.
      
      See Documentation/filesystems/spufs.txt on how to use
      spufs.
      Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      67207b96
  24. 31 October, 2005 1 commit
  25. 22 September, 2005 1 commit
  26. 08 July, 2005 1 commit
  27. 26 June, 2005 2 commits
  28. 01 May, 2005 1 commit
  29. 17 April, 2005 1 commit
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds authored
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4